How to build a WebCrawler Search Engine


Building a web crawler search engine is an extremely complex engineering project. Even so, trying to build one yourself can give you a great understanding of how a web crawler search engine works. There are multiple large, complex components involved in building a search engine crawler.


To build a complete search engine, you would need the following components: a web crawler, a search index, a ranking system, and a search user interface.

What is a Web Crawler?


There are around 1.88 billion websites on the internet. A crawler periodically visits each website and collects information about the site and every page in it. It does this job so that it can provide the required information when a user asks for it.

Crawler architecture

Here is a typical crawler architecture.


How does a Web Crawler work?

A web crawler works by first visiting a website's root URL, say https://www.expertrec.com/custom-search-engine/. There it looks for robots.txt and sitemap.xml. robots.txt provides instructions for the crawler, for example telling it not to visit a particular section of the website via Disallow rules. Website creators use this file to tell search engines what kinds of information may be collected or gathered.
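As a minimal sketch of this step, the Python standard library's `urllib.robotparser` can parse a robots.txt file and answer "may I fetch this URL?". The robots.txt content and URLs below are hypothetical examples:

```python
# Sketch: honoring robots.txt before crawling, using only the standard library.
# The robots.txt content and the example.com URLs are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks every candidate URL before fetching it.
print(parser.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/blog/post.html"))     # True
```

In a real crawler you would point the parser at `website.com/robots.txt` (via `set_url` and `read`) instead of parsing an inline string.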

After reading robots.txt, the crawler looks for sitemap.xml. Sitemap.xml lists all the URLs of the website, along with hints to search engines and crawlers about how frequently each page should be crawled. A sitemap is mainly used to make links easy to discover and to prioritize the content on the website that should be crawled.
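A crawler's sitemap step can be sketched with the standard library's XML parser; the sitemap content below is a hypothetical example following the sitemaps.org schema:

```python
# Sketch: extracting URLs and crawl-frequency hints from a sitemap.xml.
# The sitemap content here is a hypothetical example.
import xml.etree.ElementTree as ET

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
for url in root.findall("sm:url", ns):
    loc = url.findtext("sm:loc", namespaces=ns)
    freq = url.findtext("sm:changefreq", namespaces=ns)
    # The crawler would queue `loc` with a re-visit interval based on `freq`.
    print(loc, freq)
```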

What is a Search Index?

Once the crawler has crawled the pages of a website, all the information (keywords on the page, meta information about the page such as the meta description, and so on) is extracted and put into an index. This index is comparable to the index found at the back of a book: it lets you look up the exact page number for a keyword, so the information can be retrieved faster.
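The book-index analogy maps directly onto an inverted index, which maps each keyword to the set of pages containing it. A minimal sketch, with hypothetical page contents:

```python
# Minimal sketch of an inverted index: each keyword maps to the set of
# pages that contain it. The page contents are hypothetical.
from collections import defaultdict

pages = {
    "page1.html": "web crawler visits every website",
    "page2.html": "the search index maps keywords to pages",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Lookup is now a fast dictionary access instead of scanning every page.
print(index["crawler"])  # {'page1.html'}
```

Real indexes also store positions, meta descriptions, and other per-page fields, but the lookup principle is the same.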

What is Search Ranking?

Every time a user searches for information, the search results are produced by looking up the index together with many other signals, such as the location you are querying from. These extra signals help produce better search results.
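As a hedged sketch of how such signals combine (the pages, scoring weights, and location boost below are all hypothetical illustrations, not any particular engine's formula), ranking might add a signal-based boost on top of term frequency from the index:

```python
# Hypothetical ranking sketch: base relevance from term frequency,
# plus an extra signal (user location) as a score boost.
def score(page, query_terms, user_region):
    # Base relevance: how often the query terms occur on the page.
    relevance = sum(page["text"].lower().split().count(t) for t in query_terms)
    # Extra signal: boost pages that match the user's region.
    boost = 2.0 if page.get("region") == user_region else 0.0
    return relevance + boost

pages = [
    {"url": "a.com", "text": "pizza pizza delivery", "region": "US"},
    {"url": "b.com", "text": "pizza recipe", "region": "IN"},
]
ranked = sorted(pages, key=lambda p: score(p, ["pizza"], "IN"), reverse=True)
print([p["url"] for p in ranked])  # ['b.com', 'a.com']: the location boost outranks raw term count
```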

Web Search User interface

Most users search in browsers or mobile apps through the search engine interface. This is usually built using JavaScript.

Here are some pointers to keep in mind while designing a good web crawler for searching the web:

  1. Ability to download a huge number of web pages.
  2. Less time to download each web page.
  3. Consume optimal bandwidth.
  4. Handle HTTP 301 and HTTP 302 redirects – the crawler should be able to follow such redirects to the destination page.
  5. DNS caching – instead of doing a DNS lookup every time, the crawler should cache DNS results. This helps reduce crawl time and the internet bandwidth used.

  6. Multithreading – most crawlers launch several “threads” in order to download web pages in parallel. Instead of a single thread downloading files one by one, this approach fetches multiple pages at once.

  7. Asynchronous crawl – in asynchronous crawling, a single thread sends and receives all the web requests in parallel, which saves RAM and CPU. Using this approach we can crawl more than 3,000,000 web pages while using less than 200 MB of RAM, and achieve a crawl speed of more than 250 pages per second.
  8. Duplicate detection – the crawler should be able to find duplicate URLs and remove them. When a website has more than one version of the same page, the crawler should find the authoritative page among all the versions, even if the website creator does not provide a canonical URL. A canonical URL is the authoritative URL that search engines should use when the website creator publishes multiple versions of the same web page. The website creator marks the canonical URL by adding a link tag to all the duplicate pages, as follows:
   <link rel="canonical" href="https://2buy.com/books/word-power-made-easy" />

  9. Handling robots.txt – the crawler should respect the settings in robots.txt when crawling pages. Some pages (or page patterns) are marked as Disallow, and these pages should not be crawled. robots.txt is found at website.com/robots.txt.

  10. Sitemap.xml – a sitemap is a link map of the website. It lists all the URLs that need to be crawled, which makes the crawling process simpler.

  11. Crawler policies:
    1. a selection policy that states which pages have to be downloaded.
    2. a re-visit policy that states how frequently to look for changes on the website.
    3. a politeness policy that states how fast the website can be crawled (so that the load on the website does not increase).
    4. a parallelization policy that gives instructions for distributed crawlers.
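Among the pointers above, duplicate detection lends itself to a short sketch. One simple approach (a minimal sketch; the page contents and URLs below are hypothetical, and production crawlers often use SimHash instead, which also catches near-duplicates) is to hash the normalized content of each page and treat matching fingerprints as versions of the same document:

```python
# Sketch of duplicate detection by hashing normalized page content.
# Pages whose fingerprints match are treated as versions of one document.
# The URLs and page bodies are hypothetical examples.
import hashlib

def fingerprint(html_text):
    # Normalize whitespace and case so trivial differences don't matter.
    normalized = " ".join(html_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = {}
pages = {
    "https://2buy.com/books/word-power-made-easy": "Word Power Made Easy",
    "https://2buy.com/books/word-power-made-easy?ref=home": "Word  Power   Made Easy",
}
for url, body in pages.items():
    fp = fingerprint(body)
    if fp in seen:
        print(f"{url} duplicates {seen[fp]}")  # second URL is flagged as a duplicate
    else:
        seen[fp] = url
```

The first URL seen for a fingerprint plays the role of the authoritative version when no canonical link tag is provided.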

System Design Primer on building a Web Crawler Search Engine

Here is a system design primer for building a web crawler search engine. Building a search engine from scratch is not easy. To get started, you can take a look at existing open source projects like Solr or Elasticsearch. For just the crawler, you can take a look at Nutch. Solr doesn’t come with a built-in crawler but works well with Nutch.

What are some open source web crawlers you can use?

  1. Nutch
  2. Scrapy
  3. Heritrix – https://github.com/internetarchive/heritrix3
  4. wget
  5. StormCrawler – http://stormcrawler.net/

Expertrec is a search solution that provides a ready-made search engine (crawler + parser + indexer + search UI). You can create your own at https://cse.expertrec.com/?platform=cse

Set up a web crawler search engine for your website in less than 5 minutes.
