What is site search?
Site search is an important component of any website. It helps visitors find content quickly, without having to browse from page to page.
For websites selling products online, the site search feature plays the role of a salesman. When you enter a physical showroom, you ask the salesman questions like “Where is this product?”, “How much is this product?”, “Are there any alternatives?” and more. A good search engine must be able to cater to the same needs of the user.
Why is site search important?
A single-page website might not require a search engine, but a website with thousands of pages surely will. Since space on the home page is limited, you can’t make all content available there, and the need for a search engine becomes more pronounced as the number of pages in a website increases.
Visitors who use site search also have a higher probability of buying your product. Site search analytics also give you an idea of what people are looking for on your website, so you can stock up on those items or plan your future content marketing strategy accordingly.
Site search also reduces the bounce rate on your website, which can have positive effects on your website’s SEO. Google offers a feature known as the sitelinks search box; having an optimized search engine can help you earn a sitelinks search box in Google search results, which can increase your organic traffic.
What goes into building a site search engine?
- Crawler – A crawler can be thought of as the input source of the search engine. A web crawler is designed to follow the links on web pages to discover and download new pages. Although this sounds simple, there are significant challenges in designing a web crawler that can efficiently handle large volumes of web pages while at the same time making sure that pages that may have changed since the last time a crawler visited a site are kept “fresh” for the search engine. A crawler can be restricted to a single site, or to a number of web pages. For enterprise search, the crawler is adapted to discover and update all documents and web pages related to a company’s operation. An enterprise document crawler follows links to discover both external and internal (i.e., restricted to the corporate intranet) pages, but it must also scan both corporate and personal directories to identify email, PDFs, word processing documents, presentations, database records, and other company information.
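The link-following loop at the core of a crawler can be sketched in a few lines. This toy version walks an in-memory map of hypothetical pages (`PAGES`) instead of fetching real URLs, so only the traversal and "already seen" bookkeeping are visible; a real crawler would add HTTP fetching, politeness delays, and freshness checks.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
PAGES = {
    "/": ["/products", "/about"],
    "/products": ["/products/fishing-rod", "/"],
    "/products/fishing-rod": ["/products"],
    "/about": [],
}

def crawl(start):
    """Breadth-first traversal that visits each discovered page once."""
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)                 # "download" the page here
        for link in PAGES.get(page, []):   # follow its outgoing links
            if link not in seen:           # skip already-discovered pages
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("/"))  # every reachable page, in breadth-first order
```

Breadth-first order is a common choice because it tends to reach important, shallow pages early.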
- Conversion – The documents found by a crawler or provided by a feed are rarely in plain text. Instead, they come in a variety of formats, such as HTML, XML, Adobe PDF, Microsoft Word, Microsoft PowerPoint, and so on. Most site search engines require that these documents be converted into a consistent text-plus-metadata format. In this conversion, the control sequences and non-content data associated with a particular format are either removed or recorded as metadata. In the case of HTML and XML, much of this process can be described as part of the text transformation component. For other formats, the conversion process is a basic step that prepares the document for further processing. PDF documents, for example, must be converted to text. Similarly, utilities are available to convert the various Microsoft Office formats into text. Another common conversion problem comes from the way text is encoded in a document. ASCII is a common standard single-byte character encoding scheme used for text. ASCII uses either 7 or 8 bits (extended ASCII) to represent either 128 or 256 possible characters. Some languages, however, such as Chinese, have many more characters than English and use a number of other encoding schemes. Unicode is a standard encoding scheme that uses 16 bits (typically) to represent most of the world’s languages. Any application that deals with documents in different languages has to ensure that they are converted into a consistent encoding scheme before further processing.
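The encoding-normalization step can be illustrated with Python's built-in byte decoding: raw bytes in different encodings are all decoded into one consistent Unicode representation. The byte strings and encodings below are invented examples.

```python
# Documents as they might arrive: raw bytes plus a detected encoding.
raw_documents = [
    (b"plain ASCII text", "ascii"),
    (b"caf\xe9 menu", "latin-1"),           # 8-bit extended encoding
    ("搜索引擎".encode("utf-8"), "utf-8"),   # Chinese, multi-byte encoding
]

# Normalize everything to Unicode strings before further processing.
normalized = [data.decode(encoding) for data, encoding in raw_documents]
print(normalized)
```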
- Document data store – The document data store is a database used to manage large numbers of documents and the structured data that is associated with them. The document contents are typically stored in compressed form for efficiency. The structured data consists of document metadata and other information extracted from the documents, such as links and anchor text (the text associated with a link). A relational database system can be used to store the documents and metadata. Some applications, however, use a simpler, more efficient storage system to provide very fast retrieval times for very large document stores.
- Text Transformation –
- Parsing – The parsing component is responsible for processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings. Tokenizing the text is an important first step in this process. In many cases, tokens are the same as words. Both document and query text must be transformed into tokens in the same manner so that they can be easily compared. There are a number of decisions that potentially affect retrieval that make tokenizing non-trivial. For example, a simple definition for tokens could be strings of alphanumeric characters that are separated by spaces. This does not tell us, however, how to deal with special characters such as capital letters, hyphens, and apostrophes. Should we treat “apple” the same as “Apple”? Is “on-line” two words or one word? Should the apostrophe in “O’Connor” be treated the same as the one in “owner’s”? In some languages, tokenizing gets even more interesting. Chinese, for example, has no obvious word separator like a space in English.
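A minimal tokenizer sketch makes those decisions explicit: lowercase everything (so “Apple” matches “apple”), drop apostrophes (so “owner's” becomes “owners”), and split on non-alphanumeric characters (so “on-line” becomes two tokens). These are illustrative choices, not universal answers.

```python
import re

def tokenize(text):
    """Turn raw text into index terms, applying simple normalization rules."""
    text = text.lower()            # "Apple" and "apple" become the same token
    text = text.replace("'", "")   # "owner's" -> "owners"
    return re.findall(r"[a-z0-9]+", text)  # split on anything non-alphanumeric

print(tokenize("Apple's on-line store"))
```

The same function must be applied to both document text and query text, so that the terms being compared were produced identically.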
- Stop words – Removing stop words is the simple task of removing common words from the stream of tokens that become index terms. The most common words are typically function words that help form sentence structure but contribute little on their own to the description of the topics covered by the text. Examples are “the”, “of”, “to”, and “for”. Because they are so common, removing them can reduce the size of the indexes considerably. Depending on the retrieval model that is used as the basis of the ranking, removing these words usually has no impact on the search engine’s effectiveness, and may even improve it somewhat. Despite these potential advantages, it can be difficult to decide how many words to include on the stopword list. Some stopword lists used in research contain hundreds of words. The problem with using such lists is that it becomes impossible to search with queries like “to be or not to be” or “down under”. To avoid this, search applications may use very small stopword lists (perhaps just containing “the”) when processing document text, but then use longer lists for the default processing of query text.
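Stopword removal is a one-line filter over the token stream. The stoplist below is a tiny illustrative sample, chosen to show how a long list destroys queries like “to be or not to be”.

```python
# A small sample stoplist; research stoplists can contain hundreds of words.
LONG_STOPLIST = {"the", "of", "to", "for", "be", "or", "not"}

def remove_stopwords(tokens, stoplist):
    """Drop common function words from a stream of tokens."""
    return [t for t in tokens if t not in stoplist]

print(remove_stopwords("the history of fishing".split(), LONG_STOPLIST))
print(remove_stopwords("to be or not to be".split(), LONG_STOPLIST))  # nothing survives
```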
- Stemming – Stemming is another word-level transformation. The task of the stemming component (or stemmer) is to group words that are derived from a common stem. Grouping “fish”, “fishes”, and “fishing” is one example. By replacing each member of a group with one designated word (for example, the shortest, which in this case is “fish”), we increase the likelihood that words used in queries and documents will match. Stemming, in fact, generally produces small improvements in ranking effectiveness. Similar to stopping, stemming can be done aggressively, conservatively, or not at all. Aggressive stemming can cause search problems. It may not be appropriate, for example, to retrieve documents about different varieties of fish in response to the query “fishing”. Some search applications use more conservative stemming, such as simply identifying plural forms using the letter “s”, or they may do no stemming when processing document text and focus on adding appropriate word variants to the query. Some languages, such as Arabic, have more complicated morphology than English, and stemming is consequently more important. An effective stemming component in Arabic has a huge impact on search effectiveness. In contrast, there is little word variation in other languages, such as Chinese, and for these languages stemming is not effective.
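The conservative, plural-oriented stemming mentioned above can be sketched with a couple of suffix rules. This is a hypothetical simplification; production stemmers such as Porter's algorithm use many more rules, and a rule set this small will still make mistakes on irregular words.

```python
def stem(word):
    """A conservative plural stemmer: strip a trailing 's' in simple cases."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"                       # "berries" -> "berry"
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]                             # "boats" -> "boat"
    return word                                      # "glass", "fish" unchanged

print([stem(w) for w in ["boats", "berries", "glass", "fish"]])
```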
- Link extraction and analysis – Links and the corresponding anchor text in web pages can readily be identified and extracted during document parsing. Extraction means that this information is recorded in the document data store, and can be indexed separately from the general text content.
- Information extraction – Information extraction is used to identify index terms that are more complex than single words. This may be as simple as words in bold or words in headings, but in general, may require significant additional computation. Extracting syntactic features such as noun phrases, for example, requires some form of syntactic analysis or part-of-speech tagging. Research in this area has focused on techniques for extracting features with specific semantic content, such as named entity recognizers, which can reliably identify information such as person names, company names, dates, and locations.
- Classifier – The classifier component identifies class-related metadata for documents or parts of documents. This covers a range of functions that are often described separately. Classification techniques assign predefined class labels to documents. These labels typically represent topical categories such as “sports”, “politics”, or “business”. Two important examples of other types of classification are identifying documents as spam and identifying the non-content parts of documents, such as advertising. Clustering techniques are used to group related documents without predefined categories. These document groups can be used in a variety of ways during ranking or user interaction.
- Index Creation –
- Document statistics – The task of the document statistics component is simply to gather and record statistical information about words, features, and documents. This information is used by the ranking component to compute scores for documents. The types of data generally required are the counts of index term occurrences (both words and more complex features) in individual documents, the positions in the documents where the index terms occurred, the counts of occurrences over groups of documents (such as all documents labeled “sports” or the entire collection of documents), and the lengths of documents in terms of the number of tokens. The actual data required is determined by the retrieval model and associated ranking algorithm. The document statistics are stored in lookup tables, which are data structures designed for fast retrieval.
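The statistics listed above (per-document term counts, term positions, and document lengths in tokens) can be gathered for a toy two-document collection like this; the documents are invented examples.

```python
from collections import Counter

docs = {
    "d1": "fresh fish fresh bait".split(),
    "d2": "fish market".split(),
}

# Count of each index term in each document.
term_counts = {d: Counter(toks) for d, toks in docs.items()}

# Positions at which each term occurs within each document.
positions = {d: {t: [i for i, tok in enumerate(toks) if tok == t]
                 for t in set(toks)}
             for d, toks in docs.items()}

# Document lengths measured in tokens.
doc_lengths = {d: len(toks) for d, toks in docs.items()}

print(term_counts["d1"]["fresh"], positions["d1"]["fish"], doc_lengths["d2"])
```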
- Weighting – Index term weights reflect the relative importance of words in documents and are used in computing scores for ranking. The specific form of a weight is determined by the retrieval model. The weighting component calculates weights using the document statistics and stores them in lookup tables. Weights could be calculated as part of the query process, and some types of weights require information about the query, but by doing as much calculation as possible during the indexing process, the efficiency of the query process will be improved. One of the most common types used in older retrieval models is known as tf.idf weighting. There are many variations of these weights, but they are all based on a combination of the frequency or count of index term occurrences in a document (the term frequency, or tf) and the frequency of index term occurrence over the entire collection of documents (inverse document frequency, or IDF). The IDF weight is called inverse document frequency because it gives high weights to terms that occur in very few documents. A typical formula for IDF is log(N/n), where N is the total number of documents indexed by the search engine and n is the number of documents that contain a particular term.
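The log(N/n) formula translates directly into code. The collection size, document frequencies, and term counts below are made-up numbers for illustration; the point is that a rare term like "fishing" ends up with a far higher weight than a common one like "the".

```python
import math

N = 1000                                   # total documents in the index
doc_freq = {"the": 990, "fishing": 25}     # n: documents containing each term
tf = {"the": 3, "fishing": 2}              # term counts within one document

# tf.idf weight for each term in this document: tf * log(N / n).
weights = {t: tf[t] * math.log(N / doc_freq[t]) for t in tf}
print(weights)  # "fishing" dominates despite its lower term frequency
```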
- Inversion – The inversion component is the core of the indexing process. Its task is to change the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes. The challenge is to do this efficiently, not only for large numbers of documents when the inverted indexes are initially created but also when the indexes are updated with new documents from feeds or crawls. The format of the inverted indexes is designed for fast query processing and depends to some extent on the ranking algorithm used. The indexes are also compressed to further enhance efficiency.
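Inversion itself is a small loop: for each term, collect the list of documents that contain it. This sketch omits what production systems add on top, such as term positions and postings-list compression.

```python
from collections import defaultdict

# Document -> terms, as produced by the text transformation component.
docs = {
    "d1": ["fresh", "fish"],
    "d2": ["fish", "market"],
}

# Invert to term -> documents (the postings list for each term).
inverted = defaultdict(list)
for doc_id, terms in sorted(docs.items()):
    for term in set(terms):        # each document appears once per term
        inverted[term].append(doc_id)

print(dict(inverted))  # e.g. "fish" maps to both documents
```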
- Index distribution – The index distribution component distributes indexes across multiple computers and potentially across multiple sites on a network. Distribution is essential for efficient performance with web search engines. By distributing the indexes for a subset of the documents (document distribution), both indexing and query processing can be done in parallel. Distributing the indexes for a subset of terms (term distribution) can also support parallel processing of queries. Replication is a form of distribution where copies of indexes or parts of indexes are stored in multiple sites so that query processing can be made more efficient by reducing communication delays. Peer-to-peer search involves a less organized form of distribution where each node in a network maintains its own indexes and collection of documents.
- User Interaction –
- Query input – The query input component provides an interface and a parser for a query language. The simplest query languages, such as those used in most web search interfaces, have only a small number of operators. An operator is a command in the query language that is used to indicate text that should be treated in a special way. In general, operators help to clarify the meaning of the query by constraining how text in the document can match the text in the query. An example of an operator in a simple query language is the use of quotes to indicate that the enclosed words should occur as a phrase in the document, rather than as individual words with no relationship.
- Query transformation – The query transformation component includes a range of techniques that are designed to improve the initial query, both before and after producing a document ranking. The simplest processing involves some of the same text transformation techniques used on the document text. Tokenizing, stopping, and stemming must be done on the query text to produce index terms that are comparable to the document terms. Spell checking and query suggestion are query transformation techniques that produce similar output. In both cases, the user is presented with alternatives to the initial query that are likely to either correct spelling errors or be more specific descriptions of their information needs. Query expansion techniques also suggest or add additional terms to the query, but usually based on an analysis of term occurrences in documents. This analysis may use different sources of information, such as the whole document collection, the retrieved documents, or documents on the user’s computer. Relevance feedback is a technique that expands queries based on term occurrences in documents that are identified as relevant by the user.
- Results output – The results output component is responsible for constructing the display of ranked documents coming from the ranking component. This may include tasks such as generating snippets to summarize the retrieved documents, highlighting important words and passages in documents, clustering the output to identify related groups of documents, and finding appropriate advertising to add to the results display. In applications that involve documents in multiple languages, the results may be translated into a common language.
- Scoring – The scoring component, also called query processing, calculates scores for documents using the ranking algorithm, which is based on a retrieval model. The designers of some search engines explicitly state the retrieval model they use. Many different retrieval models and methods of deriving ranking algorithms have been proposed. The basic form of the document score calculated by many of these models is ∑ qi·di, where the summation is over all of the terms in the vocabulary of the collection, qi is the query term weight of the ith term, and di is the document term weight. The term weights depend on the particular retrieval model being used but are generally similar to tf.idf weights. The document scores must be calculated and compared very rapidly in order to determine the ranked order of the documents that are given to the results output component. This is the task of the performance optimization component.
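The ∑ qi·di score can be computed directly from dictionaries of term weights. The query and document weights below are invented tf.idf-style numbers; terms missing from a document contribute zero.

```python
query_weights = {"fresh": 1.2, "fish": 0.8}
doc_weights = {
    "d1": {"fresh": 2.0, "fish": 1.0, "bait": 0.5},
    "d2": {"fish": 1.5, "market": 0.7},
}

def score(q, d):
    """Sum of query term weight times document term weight over shared terms."""
    return sum(w * d.get(term, 0.0) for term, w in q.items())

# Rank documents by descending score for the results output component.
ranked = sorted(doc_weights,
                key=lambda d: score(query_weights, doc_weights[d]),
                reverse=True)
print(ranked)
```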
- Performance optimization – Performance optimization involves the design of ranking algorithms and the associated indexes to decrease response time and increase query throughput. Given a particular form of document scoring, there are a number of ways to calculate those scores and produce the ranked document output. For example, scores can be computed by accessing the index for a query term, computing the contribution for that term to a document’s score, adding this contribution to a score accumulator, and then accessing the next index. This is referred to as term-at-a-time scoring. Another alternative is to access all the indexes for the query terms simultaneously and compute scores by moving pointers through the indexes to find the terms present in a document. In this document-at-a-time scoring, the final document score is calculated immediately instead of being accumulated one term at a time. In both cases, further optimizations are possible that significantly decrease the time required to compute the top-ranked documents. Safe optimizations guarantee that the scores calculated will be the same as the scores without optimization. Unsafe optimizations, which do not have this property, can in some cases be faster, so it is important to carefully evaluate the impact of the optimization.
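Term-at-a-time scoring maps naturally onto a dictionary of score accumulators: process one term's postings list completely, then move to the next. The postings and weights below are illustrative; a document-at-a-time pass would produce the same totals, just in a different order of work.

```python
from collections import defaultdict

# term -> list of (document, document term weight) postings.
postings = {
    "fresh": [("d1", 2.0)],
    "fish": [("d1", 1.0), ("d2", 1.5)],
}
query = {"fresh": 1.2, "fish": 0.8}   # query term weights

accumulators = defaultdict(float)
for term, q_weight in query.items():          # one postings list at a time
    for doc, d_weight in postings.get(term, []):
        accumulators[doc] += q_weight * d_weight  # add this term's contribution

print(dict(accumulators))
```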
- Distribution – Given some form of index distribution, ranking can also be distributed. A query broker decides how to allocate queries to processors in a network and is responsible for assembling the final ranked list for the query. The operation of the broker depends on the form of index distribution. Caching is another form of distribution where indexes or even ranked document lists from previous queries are left in local memory. If the query or index term is popular, there is a significant chance that this information can be reused with substantial time savings.
- Logging – Logs of the users’ queries and their interactions with the search engine are one of the most valuable sources of information for tuning and improving search effectiveness and efficiency. Query logs can be used for spell checking, query suggestions, query caching, and other tasks, such as helping to match advertising to searches. Documents in a result list that are clicked on and browsed tend to be relevant. This means that logs of user clicks on documents (clickthrough data) and information such as the dwell time (time spent looking at a document) can be used to evaluate and train ranking algorithms.
- Ranking analysis – Given either log data or explicit relevance judgments for a large number of (query, document) pairs, the effectiveness of a ranking algorithm can be measured and compared to alternatives. This is a critical part of improving a search engine and selecting values for parameters that are appropriate for the application. A variety of evaluation measures are commonly used, and these should also be selected to measure outcomes that make sense for the application. Measures that emphasize the quality of the top-ranked documents, rather than the whole list, for example, are appropriate for many types of search queries.
- Performance analysis – The performance analysis component involves monitoring and improving overall system performance, in the same way that the ranking analysis component monitors effectiveness. A variety of performance measures are used, such as response time and throughput, but the measures used also depend on the application. For example, a distributed search application should monitor network usage and efficiency in addition to other measures. For ranking analysis, test collections are often used to provide a controlled experimental environment. The equivalent for performance analysis is simulations, where actual networks, processors, storage devices, and data are replaced with mathematical models that can be adjusted using parameters.
Best practices in site search
- Position of the search bar – Don’t make your site visitors search for the search bar. It should be placed somewhere easy to find, where users can access it quickly.
- Placeholder text – Many search boxes leave the placeholder space empty or fail to use it well. Make sure this space gives an idea of what people can search for on your website.
- Recent searches – Personalizing the search bar helps remind customers what they searched for the last time they visited your website. This can help increase sales and engagement.
- Typeahead – Completing search queries as your website users type the first few letters in the search box is a very common search feature. If you are missing out on this feature, make sure you implement it ASAP.
- Images in search – In the search autocomplete box, make sure there is an image related to the article or product coming up in the search results. This helps users discover relevant content faster.
- Spell correct – You can’t expect all your website users to know the correct spelling of products. Enable spelling correction and see product engagement improve.
- No results page – Have a good design for your no-results page. Make sure you show some alternative content or products instead of an empty page.
- Search results page – Make sure the design of your search results page is aligned with your business needs. An e-commerce website will have a different layout than a news website.
- Mobile UI – Make sure you focus on your mobile search UI. The autocomplete, search icon, suggestions, and search results pages all have to be designed to fit nicely onto a range of mobile screen sizes.
- Enable site search tracking – Site search tracking helps you analyze the queries users type into your website’s search engine. You can also install heat-map tracking software to understand how users interact with your search.