Building a web crawler search engine is an extremely complex engineering project. Trying to build one yourself can give you a great understanding of how a search engine works. There are several large, complex components involved in building a search engine crawler.
To build a complete search engine, you would need the following components:
What is a Web Crawler?
There are around 1.88 billion websites on the internet. The crawler periodically visits these websites, collecting information about each site and each page within it. It does this so that it can serve up the required information when a user asks for it.
Here is a typical crawler architecture.
How does a Web Crawler work?
A web crawler works by first visiting a website's root domain or root URL, say https://www.expertrec.com/custom-search-engine/. It then looks for robots.txt and sitemap.xml.
robots.txt provides instructions for the crawler, such as not to visit a particular section of the website (via Disallow rules). Website creators use this file to tell search engines what kinds of information may be collected or gathered.
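Python's standard library can parse these rules directly. A minimal sketch, with a hypothetical robots.txt and hypothetical URLs:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from
# https://<site>/robots.txt (e.g. via RobotFileParser.set_url + read()).
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("mybot", "https://example.com/public/page"))   # True
print(rp.can_fetch("mybot", "https://example.com/private/page"))  # False
```

A polite crawler checks `can_fetch` before every request and skips any URL the site has disallowed.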
After reading robots.txt, the crawler looks for sitemap.xml. Sitemap.xml lists the URLs of the website, along with hints for search engines or crawlers on how frequently each page should be crawled. A sitemap is mainly used to make links easy to discover and to prioritize the content on the website that should be crawled.
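Reading such a sitemap can be sketched with the standard xml.etree module (the sitemap content and URLs here are made up for illustration; a real crawler would download the site's actual sitemap.xml):

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sitemap with two hypothetical pages
sitemap_xml = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <changefreq>monthly</changefreq>
  </url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Extract (URL, suggested crawl frequency) pairs for the crawl queue
entries = [
    (u.findtext("sm:loc", namespaces=ns), u.findtext("sm:changefreq", namespaces=ns))
    for u in root.findall("sm:url", ns)
]
print(entries)
```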
What is a Search Index?
Once the crawler crawls the pages of a website, information such as the keywords on the page and meta information (information about the page, like the meta description) is extracted and put in an index. This index is comparable to the index found in a book: it lets you look up the exact page number for a keyword, so the information can be retrieved faster.
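The idea can be sketched as a toy inverted index in Python (the page URLs and text are invented for the example):

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> extracted text
pages = {
    "https://example.com/a": "fast web crawler design",
    "https://example.com/b": "search engine index design",
}

# Inverted index: keyword -> set of pages containing it
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

print(sorted(index["design"]))   # both pages contain "design"
print(sorted(index["crawler"]))  # only page /a contains "crawler"
```

At query time, the engine looks up each query word in this mapping instead of rescanning every page, which is what makes retrieval fast.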
What is Search Ranking?
Every time a user searches for information, the search results are produced by looking up the index along with many other signals, such as the location you query from. These extra signals help produce better search results.
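As a toy illustration of combining the index lookup with an extra signal, one could score documents by term frequency plus a small location boost. This is purely illustrative; real engines blend hundreds of signals with learned weights:

```python
def score(doc_text, query, same_region=False):
    # Toy ranking: term frequency of the query words in the document,
    # plus a small boost when the document matches the user's region.
    words = doc_text.lower().split()
    tf = sum(words.count(term) for term in query.lower().split())
    return tf + (0.5 if same_region else 0.0)

# Hypothetical indexed documents
docs = {
    "https://example.com/a": "web crawler design for a search engine",
    "https://example.com/b": "a short history of search",
}
ranked = sorted(docs, key=lambda u: score(docs[u], "search engine"), reverse=True)
print(ranked[0])  # page /a matches both query terms, so it ranks first
```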
Web Search User interface
Some pointers to keep in mind while designing a good web crawler for searching the web:
- Ability to download a huge number of web pages
- Less time to download web pages
- Consume optimal bandwidth.
- Handle HTTP 301 and HTTP 302 redirects– the crawler should be able to follow such redirects and fetch the target page.
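A sketch of the redirect-handling step, assuming the crawler has already read the response status and headers from its HTTP client:

```python
from urllib.parse import urljoin

def resolve_redirect(status, headers, current_url):
    """Return the URL to follow for a 301/302 response, else None."""
    if status in (301, 302):
        location = headers.get("Location")
        if location:
            # The Location header may be relative, so resolve it
            # against the URL that was just requested.
            return urljoin(current_url, location)
    return None

print(resolve_redirect(301, {"Location": "/new-path"}, "https://example.com/old"))
# -> https://example.com/new-path
```

A robust crawler also caps the number of redirects it follows per URL, so two pages that redirect to each other cannot trap it in a loop.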
- DNS caching– instead of doing a DNS lookup every time, the crawler should cache DNS results. This helps reduce the crawl time and the internet bandwidth used.
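One simple way to sketch this in Python is to memoize the resolver. A real crawler would also expire cached entries after the DNS record's TTL, which this sketch ignores:

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=4096)
def resolve(hostname):
    # First call per host does a real DNS lookup; repeated calls for
    # the same host are served from the in-process cache.
    return socket.gethostbyname(hostname)

resolve("localhost")
resolve("localhost")  # served from the cache, no second lookup
print(resolve.cache_info())
```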
- Multithreading– most crawlers launch several “threads” in order to download web pages in parallel. Instead of a single thread downloading files one by one, this approach fetches multiple pages at once.
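The multithreaded approach can be sketched with a thread pool; the fetch function below only simulates a download delay, so the example runs without network access:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for an HTTP download; the sleep simulates network latency.
    time.sleep(0.1)
    return url, 200

urls = [f"https://example.com/page{i}" for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.monotonic() - start
print(f"{len(results)} pages in {elapsed:.2f}s")  # ~0.1 s total, not 0.8 s
```

Because the threads spend their time waiting on (simulated) I/O, eight fetches finish in roughly the time of one.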
- Asynchronous crawl– in asynchronous crawling, a single thread sends and receives all the web requests in parallel, which saves RAM and CPU usage. Using this approach, a crawler can cover more than 3,000,000 web pages while using less than 200 MB of RAM, and achieve a crawl speed of more than 250 pages per second.
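A minimal sketch of the single-threaded asynchronous model with asyncio. Fetches are simulated with asyncio.sleep so no network is needed; the throughput figures above are the article's claims, not something this toy demonstrates:

```python
import asyncio

async def fetch(url):
    # Stand-in for an async HTTP request; real code would use an
    # async HTTP client. The sleep simulates network latency.
    await asyncio.sleep(0.1)
    return url, 200

async def crawl(urls):
    # One thread keeps every request in flight on the event loop.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://example.com/page{i}" for i in range(100)]
results = asyncio.run(crawl(urls))
print(len(results))  # 100 fetches complete in roughly one 0.1 s round-trip
```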
- Duplicate detection- the crawler should be able to find duplicate URLs and remove them. When a website has more than one version of the same page, the crawler should identify the authoritative page among all the versions, even if the website creator does not provide a canonical URL for the page. A canonical URL is the authoritative URL that search engines should use when the website creator publishes multiple versions of the same web page. Website creators mark the canonical URL by adding a link tag to all the duplicate pages, as follows:
<link rel="canonical" href="https://2buy.com/books/word-power-made-easy" />
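Pulling that canonical link out of a page can be sketched with the standard html.parser module (the page markup below is a hypothetical wrapper around the tag shown above):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Records the href of <link rel="canonical"> if the page has one."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

html = ('<html><head><link rel="canonical" '
        'href="https://2buy.com/books/word-power-made-easy" /></head></html>')
finder = CanonicalFinder()
finder.feed(html)
print(finder.canonical)
```

When a duplicate page declares a canonical URL, the crawler can index only the canonical version and drop the rest.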
- Handling robots.txt – the crawler should honor the settings in robots.txt when crawling pages. Some pages (or page patterns) are marked as Disallow and should not be crawled. robots.txt is found at website.com/robots.txt.
- Sitemap.xml- a sitemap is a link map of the website. It lists all the URLs that need to be crawled, which makes the crawling process simpler.
- Crawler policies:
  - a selection policy that states which pages to download,
  - a re-visit policy that states how frequently to check pages for changes,
  - a politeness policy that limits how fast a website can be crawled (so the website's load does not increase),
  - a parallelization policy that coordinates distributed crawlers.
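The politeness policy, for instance, can be sketched as a per-host rate limiter (the delay value is an arbitrary choice for illustration):

```python
import time
from urllib.parse import urlparse

class PolitenessPolicy:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay=1.0):
        self.delay = delay   # seconds between hits to one host
        self.last_hit = {}   # host -> time of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        elapsed = now - self.last_hit.get(host, float("-inf"))
        if elapsed < self.delay:
            # Too soon after the previous request: sleep off the rest
            time.sleep(self.delay - elapsed)
        self.last_hit[host] = time.monotonic()

policy = PolitenessPolicy(delay=0.2)
policy.wait("https://example.com/page1")  # first hit: returns immediately
policy.wait("https://example.com/page2")  # same host: sleeps ~0.2 s
```

Requests to different hosts are not delayed against each other, so a parallel crawler can stay fast overall while still being gentle to each individual site.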
System Design Primer on building a Web Crawler Search Engine
Here is a system design primer for building a web crawler search engine. Building a search engine from scratch is not easy. To get you started, you can take a look at existing open source projects like Solr or Elasticsearch. Coming to just the crawler, you can take a look at Nutch. Solr doesn’t come with a built-in crawler but works well with Nutch.
What are some open source web crawlers you can use?
Expertrec is a search solution that provides a ready-made search engine (crawler + parser + indexer + search UI). You can create your own at https://cse.expertrec.com/?platform=cse