The crawler can crawl or discard a URL based on filter rules. There are three kinds of filters:
- Filter URLs
- Filter file type URLs
- Filter Get Parameter URLs
1. Filter URLs
You can add URL patterns that either exclude all matching URLs from the crawl or restrict the crawl to matching URLs only. To add a filter, click the “Add strings” button. Each URL string pattern has two options, “will be ignored” and “will not be ignored”, which tell the crawler to discard or to crawl matching URLs, respectively. For example, suppose your site supports both http and https URLs, or your sitemap contains both, and you want to keep the http URLs from being crawled. You can do this in either of two ways:
- Don’t crawl http:// URLs and include everything else
- Crawl only https:// URLs and exclude everything else
Both options are valid, and the expertrec crawler handles them the same way. The last check, “urls not matching any pattern above”, needs to be the exact opposite of the defined rules. If the defined rules are negative filters (rules that discard URLs), the last check must be set to “will not be ignored”. If the defined rules are positive filters, the last check must be set to “will be ignored”. There is no reason for positive and negative filters to appear in the same filter list. A URL is crawled or discarded on its first match: if the first matching rule says to discard a URL and a later rule says to crawl it, the URL is discarded.
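The first-match behaviour and the last check can be sketched in a few lines of Python. This is only an illustration, not the expertrec implementation: the function name `should_crawl`, the `(pattern, ignore)` rule shape, and plain substring matching are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical rule shape: (pattern, ignore).
# ignore=True models "will be ignored", ignore=False models "will not be ignored".
Rule = Tuple[str, bool]


def should_crawl(url: str, rules: List[Rule], ignore_unmatched: bool) -> bool:
    """Return True if the URL would be crawled under the given rules."""
    for pattern, ignore in rules:
        # Assumption: a pattern matches when it occurs anywhere in the URL.
        if pattern in url:
            # The first matching rule decides; later rules are never consulted.
            return not ignore
    # No rule matched, so the last check decides.
    return not ignore_unmatched


# Option 1: negative filter, last check set to "will not be ignored".
negative_rules = [("http://", True)]
# Option 2: positive filter, last check set to "will be ignored".
positive_rules = [("https://", False)]

for url in ("http://example.com/a", "https://example.com/a"):
    print(url,
          should_crawl(url, negative_rules, ignore_unmatched=False),
          should_crawl(url, positive_rules, ignore_unmatched=True))
```

Under these assumptions both configurations give the same answer for every URL, which is why either option works for the http/https case.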
2. Filter file type URLs
Instead of defining URL-based rules, you can apply a broader rule based on file types, as shown below.
By default, every file type except htmlx is crawled. You can turn individual file types on or off: the crawler crawls the types that are turned on and discards the types that are turned off.
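As a rough sketch of how such per-type switches might behave, the Python below maps file types to on/off values. The names `file_type_of` and `allowed_by_file_type` are hypothetical, and so are the assumptions that the file type is taken from the extension in the URL path, that URLs without an extension count as html, and that any type without an explicit switch is crawled.

```python
import posixpath
from urllib.parse import urlsplit


def file_type_of(url: str) -> str:
    """Guess the file type from the extension in the URL path (assumption)."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1]
    # Assumption: URLs without an extension are treated as ordinary html pages.
    return extension.lstrip(".").lower() or "html"


def allowed_by_file_type(url: str, toggles: dict) -> bool:
    """Return True if the URL's file type is switched on."""
    # Assumption: a type with no explicit switch is crawled.
    return toggles.get(file_type_of(url), True)


# Hypothetical settings: htmlx is off by default, and pdf was switched off by hand.
toggles = {"htmlx": False, "pdf": False}

for url in ("https://example.com/docs/manual.PDF",
            "https://example.com/about",
            "https://example.com/img/logo.png?v=2"):
    print(url, allowed_by_file_type(url, toggles))
```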
3. Filter Get Parameter URLs
URLs often contain GET parameters, which pass data to the next page being loaded. The crawler can crawl or discard such URLs based on a boolean filter, the Get Parameter filter, as shown below.
If this flag is enabled, all URLs containing the pattern “?key=value” (as shown in the image above) are considered for crawling. Does this affect crawling and search? Yes, it does. Search results can contain entries that differ only in their URLs, while everything else (title, content, etc.) is the same. This flag is useful for avoiding that problem: when it is turned off, URLs with GET parameters are discarded.
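The effect of the flag can be illustrated with a short Python sketch. The function name `passes_get_parameter_filter` and the sample URLs are made up for the example, and treating any non-empty query string as "GET parameters" is an assumption about how the check works.

```python
from urllib.parse import urlsplit


def passes_get_parameter_filter(url: str, allow_get_parameters: bool) -> bool:
    """Return True if the URL would be crawled under the Get Parameter flag."""
    # Assumption: a URL "has GET parameters" when its query string is non-empty.
    has_query = bool(urlsplit(url).query)
    return allow_get_parameters or not has_query


# Three URLs that could all serve the same title and content.
urls = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?sort=name",
]

flag_on = [u for u in urls if passes_get_parameter_filter(u, True)]
flag_off = [u for u in urls if passes_get_parameter_filter(u, False)]
print(len(flag_on), len(flag_off))
```

With the flag on, all three URLs are kept and could appear as near-identical search results. With it off, only the URL without parameters is kept.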