How to build a WebCrawler Search Engine


Building a web crawler search engine is an extremely complex engineering project. Trying to build a web crawler yourself can give you a great understanding of how a web crawler search engine works. There are multiple large, complex components involved in building a search engine crawler.


To build a complete search engine, you would need the following components: a web crawler, a search index, a search ranking system, and a search user interface.

What is a Web Crawler?


There are around 1.88 billion websites on the internet. A crawler periodically visits every one of these websites and collects information about each website and each page on it. It does this job so that it can provide the required information when a user asks for it.

Crawler architecture

Here is a typical crawler architecture.


How does a Web Crawler work?

A web crawler works by first visiting a website's root domain or root URL, say https://www.expertrec.com/custom-search-engine/. There it looks for robots.txt and sitemap.xml. robots.txt provides instructions for the crawler, such as not to visit a particular section of the website (using a Disallow rule). Website creators use this file to tell search engines what kind of information can be collected or gathered.

After reading robots.txt, the crawler looks for sitemap.xml. Sitemap.xml contains all the URLs of the website, along with instructions to search engines or crawlers on how frequently each page should be crawled. A sitemap is mainly used to make links easy to find and to prioritize the content on the website that should be crawled.
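
As a rough sketch, here is how a crawler might read robots.txt and walk a simple sitemap using only the Python standard library. The root URL is just the example from above, and the code assumes the sitemap is a plain urlset (not a sitemap index):

    import urllib.request
    import urllib.robotparser
    import xml.etree.ElementTree as ET

    SITE = "https://www.expertrec.com"  # example root domain from above

    # robots.txt tells the crawler which paths are disallowed
    # and may also list the site's sitemaps.
    robots = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    robots.read()
    print(robots.can_fetch("MyCrawler", SITE + "/custom-search-engine/"))

    # site_maps() (Python 3.8+) returns the Sitemap: entries from robots.txt, if any.
    sitemap_urls = robots.site_maps() or [SITE + "/sitemap.xml"]

    # Fetch the first sitemap and list the page URLs it contains.
    with urllib.request.urlopen(sitemap_urls[0]) as resp:
        tree = ET.fromstring(resp.read())
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for loc in tree.findall("sm:url/sm:loc", ns):
        print(loc.text)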

What is a Search Index?

Once the crawler crawls the web pages in a website, all the information, like keywords on the page and meta information about the page (such as the meta description), is extracted and put in an index. This index can be compared to the index found in a book, which lets you look up the exact page number for a keyword and hence retrieve the information faster.
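
As a toy illustration of the idea (far simpler than a production index), here is an inverted index in Python: each keyword maps to the set of pages that contain it, so a lookup is a dictionary access instead of a scan over every page. The pages and their text are made-up examples:

    from collections import defaultdict

    # Made-up crawled pages: URL -> extracted text (title, meta description, keywords).
    pages = {
        "https://example.com/a": "a web crawler downloads pages from the internet",
        "https://example.com/b": "a search index maps keywords to pages",
    }

    # Build the inverted index: keyword -> set of URLs containing it.
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)

    print(index["pages"])    # both URLs
    print(index["crawler"])  # only https://example.com/a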

What is Search Ranking?

Every time a user searches for information, the search results are produced by looking up the index together with many other signals, such as the location you are querying from. These extra signals help produce better search results.
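
As a simplified sketch (not a real ranking algorithm), the lookup-plus-signals idea can be illustrated like this: score each page by how many query terms it matches in the index, then boost pages that match an extra signal such as the user's location. The index and country data are made-up examples:

    # Toy inverted index (keyword -> set of page URLs) and per-page country data.
    index = {
        "web":     {"https://example.com/a", "https://example.com/b"},
        "crawler": {"https://example.com/a"},
    }
    page_country = {"https://example.com/a": "US", "https://example.com/b": "IN"}

    def search(query, user_country=None):
        """Score pages by matched query terms, then boost pages local to the user."""
        scores = {}
        for term in query.lower().split():
            for url in index.get(term, ()):           # look up each term in the index
                scores[url] = scores.get(url, 0) + 1  # one point per matched term
        for url in scores:
            if user_country and page_country.get(url) == user_country:
                scores[url] += 0.5                    # extra signal: location boost
        return sorted(scores, key=scores.get, reverse=True)  # best score first

    print(search("web crawler", user_country="US"))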

Web Search User interface

Most users search in a browser or a mobile app through the search engine's interface, which is usually built using JavaScript.

 

Here are some pointers to keep in mind while designing a good web crawler for searching the web:

  1. Ability to download a huge number of web pages.
  2. Less time to download web pages.
  3. Consume optimal bandwidth.
  4. Handle HTTP 301 and HTTP 302 redirects – the crawler should be able to follow such redirects to the target page.
  5. DNS caching – instead of doing a DNS lookup for every request, the crawler should cache DNS results. This helps reduce crawl time and the internet bandwidth used.

  6. Multithreading – most crawlers launch several “threads” in order to download web pages in parallel. Instead of a single thread downloading the files one by one, this approach fetches multiple pages at the same time.

  7. Asynchronous crawl – in asynchronous crawling, only one thread is used to send and receive all the web requests in parallel, which saves RAM and CPU. Using this approach we can crawl more than 3,000,000 web pages while using less than 200 MB of RAM, and achieve a crawl speed of more than 250 pages per second. A minimal sketch of this approach appears after this list.
  8. Duplicate detection – the crawler should be able to find duplicate URLs and remove them. When a website has more than one version of the same page, the crawler should find the authoritative page among all the versions, even if the website creator does not provide a canonical URL for the web page. A canonical URL is the authoritative URL that search engines should use when the website creator publishes multiple versions of the same web page. The website creator marks the canonical URL by adding a link tag to all the duplicate pages, as follows:
   <link rel="canonical" href="https://2buy.com/books/word-power-made-easy" />
  9. Handling robots.txt – the crawler should read the settings in robots.txt before crawling the pages. Some pages (or page patterns) will be marked as Disallow, and these pages should not be crawled. robots.txt can be found at website.com/robots.txt.

  10. Sitemap.xml – a sitemap is a link map of the website. It lists all the URLs that need to be crawled, which makes the crawling process simpler.

  11. Crawler policies –
    1. a selection policy that states which pages have to be downloaded,
    2. a re-visit policy that states how frequently to look for changes in the website,
    3. a politeness policy that states how fast the website can be crawled (so that the load on the website does not increase), and
    4. a parallelization policy that gives instructions for distributed crawlers.
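
Here is the asynchronous-crawl idea from point 7 as a minimal Python sketch. It assumes the third-party aiohttp library is installed, and the seed URLs are just placeholders; a real crawler would add politeness delays, robots.txt checks, retries, and duplicate detection on top of this.

    import asyncio
    import aiohttp  # third-party: pip install aiohttp

    async def fetch(session, url):
        # One coroutine per URL; all requests share a single thread.
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                body = await resp.text()
                return url, resp.status, len(body)
        except Exception as exc:
            return url, None, str(exc)

    async def crawl(urls, concurrency=50):
        # A semaphore limits how many requests are in flight at once.
        sem = asyncio.Semaphore(concurrency)
        async with aiohttp.ClientSession() as session:
            async def bounded(url):
                async with sem:
                    return await fetch(session, url)
            return await asyncio.gather(*(bounded(u) for u in urls))

    seeds = ["https://example.com/", "https://example.org/"]  # example seed URLs
    for url, status, info in asyncio.run(crawl(seeds)):
        print(url, status, info)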

System Design Primer on building a Web Crawler Search Engine

Here is a system design primer for building a web crawler search engine. Building a search engine from scratch is not easy. To get you started, you can take a look at existing open source projects like Solr or Elasticsearch. For the crawler alone, you can take a look at Nutch. Solr doesn’t come with a built-in crawler but works well with Nutch.


What are some open source web crawlers you can use?

  1. Nutch
  2. Scrapy
  3. Heritrix (https://github.com/internetarchive/heritrix3)
  4. wget
  5. StormCrawler (http://stormcrawler.net/)

Expertrec is a search solution that provides a ready-made search engine (crawler + parser + indexer + search UI). You can create your own at https://cse.expertrec.com/?platform=cse

FAQs

How do search engine crawlers work?

  •  Search engines use their own web crawlers to get information.
  • Search engines begin crawling a website by downloading its robots.txt file, which contains rules about which pages on the website search engines should or should not crawl.
  • The robots.txt file may also point to sitemaps, which list the URLs that the site wants a search engine to crawl.
  • Search engine crawlers use algorithms and rules to determine how frequently a page should be recrawled and how often it should be indexed.

How to build a search engine web crawler?

  • Add one or several seed URLs to the list of URLs to be visited.
  • Pop a link from the URLs to be visited and add it to the list of visited URLs.
  • Fetch the page’s content and scrape the data you’re interested in with a scraping bot or API.
  • Collect all the URLs present on the page, and add them to the URLs to be visited if they match the rules you’ve set and do not match any of the visited URLs.
  • Repeat steps 2-4 until the URLs-to-be-visited list is empty.
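
Taken together, those steps form a simple breadth-first crawl loop. Here is a minimal sketch using only the Python standard library; the seed URL, the stay-on-one-domain rule, and the page limit are illustrative, and politeness, robots.txt handling, and retries are omitted for brevity:

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=20):
        to_visit = [seed]                       # step 1: seed URL(s) to be visited
        visited = set()
        allowed_host = urlparse(seed).netloc    # example rule: stay on the seed's domain
        while to_visit and len(visited) < max_pages:
            url = to_visit.pop(0)               # step 2: pop a link...
            if url in visited:
                continue
            visited.add(url)                    # ...and add it to the visited URLs
            try:                                # step 3: fetch the page's content
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:           # step 4: collect URLs on the page
                link = urljoin(url, href)
                if urlparse(link).netloc == allowed_host and link not in visited:
                    to_visit.append(link)
        return visited                          # step 5: loop until the list is empty

    print(crawl("https://example.com/"))        # example seed URL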

What is the difference between a web crawler and a search engine?

Both the web crawler and the search engine are actually part of the same system.

Search engines like Google, Yahoo, Expertrec, etc. facilitate the provision of information. A search engine (in addition to sending out millions of web crawler requests to collect information) also indexes this information so that it is organized for easy retrieval, and it provides interfaces that allow users or other programs to query the indexed information for specific results.

The retrieval system is used by the search engine to return results when someone asks for them.

The web crawler gathers the data, which is then saved in some kind of retrieval system.

Put together: the crawler crawls different web pages on the WWW, the pages matching the user's search query are found in the index and ranked using ranking algorithms, and the result is provided to the user.

How do you detect and verify search engine crawlers?

To detect a search engine crawler, look at the name the request is signed with, called the User Agent.

To verify a request's source properly, you need to check the IP address from which the request was made. Many search engines provide IP lists or ranges; we can verify the crawler by matching its IP against the provided list and by performing a reverse DNS lookup (a method of connecting an IP address back to a domain name) to confirm that the IP address belongs to the search engine's domain.
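
As an illustration of that reverse-DNS check, here is a small sketch using only the Python standard library. It follows the commonly documented pattern for verifying Googlebot: reverse-resolve the IP to a host name, check that the host name ends in the search engine's domain, then forward-resolve the host name and confirm it maps back to the same IP. The example IP and the allowed domain suffixes are illustrative:

    import socket

    def verify_crawler(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
        """Verify a claimed crawler IP with a reverse DNS lookup plus a forward confirmation."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)          # reverse DNS: IP -> host name
        except socket.herror:
            return False
        if not host.endswith(allowed_suffixes):
            return False                                   # host is not on the engine's domain
        try:
            _, _, forward_ips = socket.gethostbyname_ex(host)  # forward DNS: host -> IPs
        except socket.gaierror:
            return False
        return ip in forward_ips                           # must map back to the original IP

    # Example: a request whose User-Agent claims to be Googlebot.
    print(verify_crawler("66.249.66.1"))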

Add Search to your Website

Are you showing the right products, to the right shoppers, at the right time? Contact us to know more.