In the real world, the search space is so large that we cannot enumerate everything a search engine has to cover. The digital world is so dynamic and disorganized that it is difficult to find an effective solution to ambiguous queries. Retrieval suffers from the ambiguous queries that average users type into search engines. This is why search engines return too many results, and those results can be manipulated by black hat search engine optimizers.
Problem: How can a user find an appropriate answer that is relevant to his or her query?
Challenge: A user usually has to traverse several search result pages to reach the desired result.
What is web search clustering?
The term clustering means grouping a number of similar things. Clustering engines are usually seen as an alternative to conventional search engines. Although search engine clustering has attracted some commercial interest, it still faces challenges.
What is clustering? It is basically organizing similar documents into clusters, so that the documents in one cluster resemble each other and differ from the documents in other clusters.
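The definition above can be sketched in a few lines of Python. This is only a minimal illustration, assuming word overlap (Jaccard similarity) as the measure of "similar" and a simple single-pass grouping; the function names, the threshold of 0.2, and the sample snippets are all made up for this example and do not come from any particular clustering engine.

```python
def tokens(text):
    """Lowercase a snippet and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.2):
    """Single-pass clustering: a document joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([doc])
    return clusters


# Hypothetical snippets for the ambiguous word "orange".
docs = [
    "orange juice fresh fruit",
    "fruit juice recipes with orange",
    "orange mobile phone network",
    "mobile network phone plans",
]

for group in cluster(docs):
    print(group)
```

With these four snippets the fruit-related texts end up in one group and the phone-related texts in another, even though three of them contain the word "orange".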
What are the clustering tasks?
a) Helping users to find the most relevant answer to their search
b) Presenting the search result as quickly as possible
Examples of clustering search engines
Clustering search engines such as Grouper, Carrot2, Vivisimo, and SnakeT show the search results in the form of clusters. A web clustering engine takes the result list as input and performs clustering and labeling on it.
What is the main use of search result clustering? It helps users get a quick overview of their search results. See the following screen capture of a real-time search.
Video: a real-time search with the Carrot2 clustering engine clearly shows how an ambiguous word like “orange” relates to different groups.
How does search clustering work? It presents a high-level view of the search results for an ambiguous query.
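To make the "cluster and label" step more concrete, here is a small Python sketch. It is not the algorithm used by Carrot2 or any of the engines named above; it simply assumes that a word shared by at least two result titles can serve as a cluster label. The function name, the stop word list, and the sample titles are hypothetical.

```python
from collections import defaultdict

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "the", "of", "and", "in", "for", "to", "is"}


def label_clusters(query, results, min_size=2):
    """Group result titles under every non-query word they share.

    A title can land under several labels, so clusters may overlap.
    """
    groups = defaultdict(list)
    for title in results:
        words = set(title.lower().split()) - STOP_WORDS - {query.lower()}
        for word in sorted(words):
            groups[word].append(title)
    # Keep only labels that actually group several results.
    return {label: titles for label, titles in groups.items()
            if len(titles) >= min_size}


# Hypothetical result titles for the ambiguous query "orange".
results = [
    "Orange fruit nutrition facts",
    "Orange the color of autumn",
    "Orange fruit and juice recipes",
    "Orange color codes for design",
]

clusters = label_clusters("orange", results)
for label, titles in clusters.items():
    print(label, "->", titles)
```

For these titles the sketch produces two labeled groups, "fruit" and "color", which is the kind of high-level overview a clustering engine aims to give for an ambiguous query.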
Do web searchers actually look only at the first 3 search results? In the perfect world of SEO companies and advertisers, end users look at the first 3 search results. However, experience has shown otherwise. Let’s analyze user behavior on a search results page:
Query phrase: “A perfect marketing tool”
A) An average user’s behavior
– Scanning through the first page
– Reformulating the query –> searching again –> reformulating –> searching again
B) A knowledgeable search engine user
– Scanning the search results by holding the cursor over each link on Google without clicking on the link
– Scanning the first 20 pages
– Applying broad match and exact match search queries
– The right and most relevant result is found on the 20th page
Here the user found the result, but only on the 20th page. If that particular page sells the right marketing tool, then it has just landed a sale.
Thus, being on the first page might drive traffic, but it does not necessarily generate sales. A targeted buyer looks for the most relevant item, which in our example is “a perfect marketing tool”.
Architecture of Web Search Engine
A search engine indexes documents ahead of time, translates a user’s query into a form it can match against that index, and shows the most relevant results.
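The indexing and matching steps can be sketched with a toy inverted index in Python. This is a simplified assumption about how such an index might look, not the architecture of any real search engine; the function names and the three sample documents are invented for the example, and ranking is left out entirely.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical mini collection.
documents = {
    1: "perfect marketing tool",
    2: "marketing budget template",
    3: "garden tool shop",
}

index = build_index(documents)
print(search(index, "marketing tool"))
```

The query is answered from the index alone; the documents themselves are only read once, when the index is built.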
The second generation of search engines
Google is a second-generation search engine that applies off-page data in order to give the most relevant results. The off-page factors include link analysis, anchor text, and clickstream data. Google has lately added more factors to its algorithm, which are not the main focus of this article.
The third generation of search engines
The third generation appeared in the year 2000 in order to combine many sources of information when answering queries. Modern Information Retrieval (IR) developed, and search engines built revolutionary functionality into their algorithms. Query processing carries out many tasks, such as removing invalid characters and recognizing meta-keywords or special syntactic operators.
Another task is removing stop words such as “a”, “to”, “which”, and “the”, which are too common to be useful. Let’s test this with a real-time search in Norwegian search engines. You may find a different result on your end.
Another advance in search engine results is normalizing related words to their root forms (e.g. believe, believer, belief), so the search engine looks at the actual content of a site. Further ranking steps include assigning a weight to each word in order to rank it, keyword boosting of the document score, measuring cluster distance (how far apart the groups of matched terms are), and finally counting the query terms matched.
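The processing steps described here (stop word removal, reduction to root forms, and word weights) can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the stop word list is just the four words named above, the root-form table is a hand-written stand-in for a real stemmer or lemmatizer, and the weight of a word is simply how often it occurs in the document.

```python
from collections import Counter

# The four stop words mentioned in the text.
STOP_WORDS = {"a", "to", "which", "the"}

# Hypothetical hand-written table; a real engine would use a stemmer
# or lemmatizer instead of a fixed dictionary.
ROOT_FORMS = {"believe": "belief", "believer": "belief", "believers": "belief"}


def normalize(text):
    """Lowercase, drop stop words, and map words to their root form."""
    words = text.lower().split()
    return [ROOT_FORMS.get(word, word) for word in words
            if word not in STOP_WORDS]


def score(query, document):
    """Sum the document's term frequencies over the distinct query terms."""
    weights = Counter(normalize(document))
    return sum(weights[term] for term in set(normalize(query)))


print(normalize("the believer"))
print(score("to believe", "a believer holds the belief"))
```

Because "believe", "believer", and "belief" collapse to one root form, the query "to believe" matches both occurrences in the sample document even though the exact word "believe" never appears in it.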
“Whenever a Web search request is issued, it is the web index generated by Web robots or spiders, not the web pages themselves that has been used for retrieving information. Therefore, the composition of Web indexes affects the performance of a Web search engine.” [Source: Information retrieval: on-line, by F. W. Lancaster and E. G. Fayen, Los Angeles, Melville Pub. Co., 1973, xiv, 597 p., illus., 23 cm.]
According to Lancaster and Fayen, there are six criteria for assessing the performance of information retrieval systems: 1) Coverage, 2) Recall, 3) Precision, 4) Response time, 5) User effort, and 6) Form of output.
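Two of these criteria, recall and precision, have standard numeric definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The Python sketch below computes both; the function name and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the engine returns four documents,
# and three documents in the collection are actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).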
In conclusion, clustering is the most recent proposed solution to the problem of ambiguous search queries. However, we need to improve the quality of cluster labels and structure. Since web pages change and new pages are added on a daily basis, algorithms should be designed in a way that allows for overlapping clusters.
The contents of a cluster should correspond to its label, and navigating through the cluster sub-hierarchies should lead to more specific results. For this reason, we need to apply an advanced simulation technique in order to find a solution to this model. When we simulate and analyze the model, we will be able to improve the search results and save time and money in development. By simulating the model, we will be able to find the right pattern and write a program that helps us improve search engine results.
Domain clustering: a problem that Google should fix