Expertrec Crawler is a bot that indexes website data and makes it available for building site search. All it needs is a valid website URL. Crawling starts with one or more given URLs, follows the links it discovers, and ends when all reachable URLs have been crawled.
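As a rough illustration of that process (a sketch only, not Expertrec's actual implementation), a crawl can be thought of as a breadth-first walk over links that stays on the given site. The requests and BeautifulSoup libraries and the example seed URL below are assumptions made for this sketch.

# Minimal sketch of a site crawl: start from seed URLs, follow in-site links,
# stop when there is nothing left to fetch. Illustrative only.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls):
    seen = set(seed_urls)
    queue = deque(seed_urls)
    pages = {}  # url -> raw HTML, standing in for the search index here
    while queue:
        url = queue.popleft()
        response = requests.get(url, timeout=10)
        pages[url] = response.text
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            # Stay on the same site as the page that linked here.
            if urlparse(absolute).netloc == urlparse(url).netloc and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Example (hypothetical seed URL): crawl(["https://www.example.com/"])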

URLs:

To build search on top of your site, the Expertrec crawler must be told what to crawl. This can be your own site URL, or the URL of any other site you want to build search on. Read More

Sitemap:

A sitemap is a model of your website's content and structure that lists all the important pages of your site. A sitemap can be aimed at users or at search engine bots. For users, it is an HTML page listing all the links. For search engine bots, it is typically a rich XML document with all the links and their metadata structured in XML format. Instead of starting the crawl with one URL and letting the crawler go deeper by finding links to other pages of your site, you can point the crawler directly at your sitemap. Crawling from a sitemap helps the crawler stick to the important pages only. Read More
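To make the XML sitemap idea concrete, the sketch below reads the page URLs out of a sitemap in the standard sitemaps.org format (a <urlset> of <url> entries, each with a <loc> and optional metadata such as <lastmod>). The sitemap URL is a placeholder, and this is a generic illustration rather than Expertrec's internal code.

# Sketch: list the page URLs found in a standard XML sitemap.
import xml.etree.ElementTree as ET
import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_url):
    xml = requests.get(sitemap_url, timeout=10).text
    root = ET.fromstring(xml)
    # Each <url> entry carries a <loc> with the page address.
    return [loc.text.strip() for loc in root.findall(".//sm:url/sm:loc", SITEMAP_NS)]

# Example (hypothetical URL): urls_from_sitemap("https://www.example.com/sitemap.xml")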

Filters:

Filters are rules that decide whether a URL is crawled or discarded. The Expertrec crawler supports filtering based on a URL pattern string, a file type, or any other common pattern.

For example, if your website URL is “https://www.example.com/” and you need to crawl only PDF files from the site, you can use the “file type” filter and restrict the file type to “pdf” only. This ensures that only PDF documents are crawled and indexed.

Read More on filters and their precise usage.
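The sketch below shows what such a filter amounts to: a URL is kept only if it matches the allowed file types and URL pattern. The function and rule names are illustrative assumptions, not the exact options exposed in the Expertrec control panel.

# Sketch of a URL filter: keep a URL only if it matches the configured rules.
import re
from urllib.parse import urlparse

def passes_filters(url, allowed_file_types=None, url_pattern=None):
    path = urlparse(url).path.lower()
    if allowed_file_types is not None:
        # File type filter: the URL path must end with one of the allowed extensions.
        if not any(path.endswith("." + ext) for ext in allowed_file_types):
            return False
    if url_pattern is not None and not re.search(url_pattern, url):
        # URL pattern filter: the URL must match the given pattern string.
        return False
    return True

# With allowed_file_types=["pdf"], only PDF documents would be crawled:
# passes_filters("https://www.example.com/docs/guide.pdf", allowed_file_types=["pdf"])   -> True
# passes_filters("https://www.example.com/docs/guide.html", allowed_file_types=["pdf"])  -> False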


Recrawl Frequency:

How frequently your crawled web pages are re-crawled is controlled via the re-crawl frequency. Currently the options are “daily”, “weekly”, “monthly” and “yearly”. You can also go to your Expertrec control panel at any time and start a “recrawl” manually, which refreshes the entire search index. Read More explains this periodic recrawl in detail.
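Read loosely, each frequency option is an interval after which the index is refreshed. The mapping below is an assumption for illustration only and is not Expertrec's actual scheduler.

# Illustrative mapping from a re-crawl frequency option to the next recrawl time.
from datetime import datetime, timedelta

RECRAWL_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),   # approximate month
    "yearly": timedelta(days=365),   # approximate year
}

def next_recrawl(last_crawl, frequency):
    return last_crawl + RECRAWL_INTERVALS[frequency]

# next_recrawl(datetime(2023, 1, 1), "weekly") -> datetime(2023, 1, 8, 0, 0)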

Advanced:

The Expertrec crawler also offers advanced crawling options, for example crawling a site with login credentials, manually extracting part of a web page, removing common page parts (header, footer, etc.), and more. These advanced features are explained here in detail.
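To make the “remove common page parts” idea concrete, the sketch below strips header, footer and navigation elements from a page before its text would be indexed. It is a generic illustration using BeautifulSoup, not the mechanism Expertrec uses internally.

# Sketch: drop common page parts (header, footer, nav) before indexing the text.
from bs4 import BeautifulSoup

def main_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["header", "footer", "nav"]):
        tag.decompose()  # remove the element and everything inside it
    return soup.get_text(separator=" ", strip=True)

# main_text("<header>Menu</header><p>Actual content</p><footer>(c) 2024</footer>")
# -> "Actual content"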