Search Tool Data Analysis

by Dylan Burkhardt (dylanb in BIT330, Fall 2008)

Questions and queries

Web search engines

Search Question: I am working on starting a new social networking website and I have looked at Drupal as a back end to use. My question is whether Drupal is a good tool for social networking websites?

Queries:

  • Google: social networking
  • Yahoo: social networking
  • Live: social networking

Blog search engines

I run several websites and am very interested in internet advertising and marketing and the new form that it is taking. For example advertising has expanded beyond banners to include theming of websites and video advertising.

Queries:

  • Google Blogs: social networking
  • Bloglines: social networking
  • Technorati: social networking

Data that I collected

Search engine overlap data

Web search    Live   Google   Yahoo Web
Live          40     15       20
Google               75       25
Yahoo Web                     55
All           10

Blog search   Technorati   Google Blog   Bloglines
Technorati    45           0             5
Google Blog                60            0
Bloglines                                50
All           0

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.

GY        Yahoo
Google    5    10   20
5         1    1    1
10        1    3    3
20        2    5    5

This table provides a measure of how much of Yahoo's responses are reproduced by Google.

YG        Google
Yahoo     5    10   20
5         1    1    2
10        1    3    5
20        1    3    5

This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.

BG          Google
Bloglines   5    10   20
5           0    0    0
10          0    0    0
20          0    0    0

This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.

GB          Bloglines
GBlog       5    10   20
5           0    0    0
10          0    0    0
20          0    0    0

Results

Web search

The first group of statistics I compiled covers the precision and overlap data for the search tools: the average, mode, median, standard deviation, maximum, minimum, and range across the entire class's data.

Web Search
          Precision                              Overlap                                All
Name      Live         Google       Yahoo        L/G          L/Y          G/Y          L/G/Y
AVERAGE   42.77777778  54.44444444  51.66666667  18.33333333  20           20.55555556  10
MODE      15           70           70           10           10           25           10
MEDIAN    42.5         57.5         52.5         20           20           20           10
STDDEV    22.76621738  20.06525303  22.42635005  9.548637106  11.37592918  7.838233761  7.475450016
MAX       80           90           85           35           45           35           25
MIN       10           20           10           0            5            5            0
RANGE     70           70           75           35           40           30           25
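
As a rough sketch of how each column of summary rows could be computed, the snippet below uses Python's standard `statistics` module. The scores in `example_scores` are made-up values, not the class data, and the function name `summarize` is my own. I am also assuming the sample standard deviation (dividing by n - 1); the table may have been produced with a different formula.

```python
import statistics

def summarize(scores):
    """Return the summary statistics used in the tables for one column of scores."""
    return {
        "average": statistics.mean(scores),
        "mode": statistics.mode(scores),      # most common value
        "median": statistics.median(scores),
        "stddev": statistics.stdev(scores),   # sample standard deviation (assumption)
        "max": max(scores),
        "min": min(scores),
        "range": max(scores) - min(scores),
    }

# Hypothetical precision scores in percent, NOT the class data.
example_scores = [40, 75, 55, 70, 70, 20]

summary = summarize(example_scores)
for name, value in summary.items():
    print(f"{name:8} {value:.2f}")
```

Running the same function once per column (Live, Google, Yahoo, and each overlap pair) would give one table like the one above.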

Basics

On average, the most precise web search engine was Google, with a mean precision of 54.44 and a median of 57.5. Yahoo was second with a mean of 51.67 and a median of 52.5. Live was far behind with a mean of 42.78 and a median of 42.5. Google's results were also the most consistent between classmates, with the lowest standard deviation (20.07). The mode also suggests that Google and Yahoo are consistently better than Live: both have a mode of 70, while Live's most common score is 15.

Google and Yahoo had the most overlap on average (20.56), but Live and Yahoo were close behind (20). However, the Google/Yahoo standard deviation (7.84) is smaller than the Live/Yahoo standard deviation (11.38), so the amount of Google/Yahoo overlap was more consistent across the class.


The next set is the average, median, mode and standard deviation of Google and Yahoo's overlap by ranking.

GY
Name      o(5,5)       o(10,5)      o(20,5)      o(5,10)      o(10,10)     o(20,10)     o(5,20)      o(10,20)     o(20,20)
AVERAGE   1.058823529  1.352941176  1.647058824  1.294117647  2            2.647058824  1.647058824  2.470588235  3.705882353
MEDIAN    1            1            2            1            2            3            1            3            4
MODE      1            0            0            1            1            4            1            3            5
STD DEV   1.197423705  1.32009358   1.411611511  1.212678125  1.322875656  1.729926894  1.221739358  1.545867356  2.114376559

YG
Name      o(5,5)       o(10,5)      o(20,5)      o(5,10)      o(10,10)     o(20,10)     o(5,20)      o(10,20)     o(20,20)
AVERAGE   1.058823529  1.176470588  1.647058824  1.470588235  1.941176471  2.470588235  1.882352941  2.647058824  3.764705882
MEDIAN    1            1            1            1            2            3            2            3            4
MODE      1            0            1            1            3            3            1            4            5
STD DEV   1.197423705  1.286239389  1.366618842  1.23073388   1.390619836  1.58578242   1.268973647  1.729926894  2.077540967

Basics

In these two tables the overlap between sites is compared by ranking. For example, (5,5) in the GY table is the number of Google's top 5 results that also appear in Yahoo's top 5, and (10,5) in the same table is the number of Google's top 10 results that appear in Yahoo's top 5.
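
To make the (a,b) notation concrete, here is a minimal sketch of how one such count could be computed from two ranked result lists. The function name `rank_overlap` and the result lists are hypothetical and are not taken from the study. I am assuming that results are matched by exact URL and that no URL appears twice in a list.

```python
def rank_overlap(first, second, a, b):
    """Count how many of the top-a results of `first` appear in the top-b of `second`."""
    top_b = set(second[:b])
    return sum(1 for url in first[:a] if url in top_b)

# Hypothetical ranked result lists (best result first), NOT results from the study.
google = ["g1", "shared1", "g2", "shared2", "g3", "g4", "shared3", "g5", "g6", "g7"]
yahoo = ["shared2", "y1", "y2", "y3", "y4", "shared1", "y5", "y6", "shared3", "y7"]

# GY direction: Google's top a compared against Yahoo's top b.
for a in (5, 10):
    for b in (5, 10):
        print(f"GY o({a},{b}) = {rank_overlap(google, yahoo, a, b)}")

# YG direction: swap the lists. The counts are not symmetric in general.
print(f"YG o(5,10) = {rank_overlap(yahoo, google, 5, 10)}")
```

With these made-up lists, the GY count for (5,10) is 2 while the YG count for (5,10) is 1, which is why the GY and YG tables are reported separately.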

Blog search

Below are the results of the blog search tool comparison. These are in an identical format to the Search Tool Data except they use the blog search engines Technorati, Google Blogs, and Bloglines.

Blog search
          Precision                              Overlap                                All
Name      Technorati   GBlog        Bloglines    T/G          T/B          G/B          T/G/B
AVG       33.05555556  52.5         44.44444444  3.611111111  9.166666667  6.944444444  1.388888889
MODE      35           40           50           0            5            5            0
MEDIAN    30           42.5         47.5         0            7.5          5            0
STD DEV   21.15342337  22.17908395  14.33720878  7.030512398  7.717436331  6.448640734  3.34556579
MAX       85           100          75           25           25           20           10
MIN       5            25           20           0            0            0            0
RANGE     80           75           55           25           25           20           10

Basics

The results show that the most precise blog search tool according to the mean was Google Blogs (52.5), but according to the median it was Bloglines (47.5). Technorati ranks last in nearly every category and appears to be less precise than the other blog search tools. The highest average overlap was between Technorati and Bloglines (9.17), followed by Google Blogs and Bloglines (6.94). Bloglines has the smallest standard deviation (14.34), which means that its scores stayed consistently close to its average and did not vary as much as those of Google Blogs and Technorati. This is also reflected in Bloglines' smaller range of only 55.


As before, the next set of data is the average, median, mode, and standard deviation of the Google Blogs and Bloglines overlap by ranking position.

GB
Name      o(5,5)       o(10,5)      o(20,5)      o(5,10)      o(10,10)     o(20,10)     o(5,20)      o(10,20)     o(20,20)
AVG       0.294117647  0.352941176  0.470588235  0.411764706  0.470588235  0.823529412  0.705882353  0.764705882  1.058823529
MEDIAN    0            0            0            0            0            0            0            0            1
MODE      0            0            0            0            0            0            0            0            0
STD DEV   0.469668218  0.606339063  0.624264273  0.618346942  0.717430054  1.014599312  0.919558718  1.091410313  1.197423705

BG
Name      o(5,5)       o(10,5)      o(20,5)      o(5,10)      o(10,10)     o(20,10)     o(5,20)      o(10,20)     o(20,20)
AVG       0.294117647  0.352941176  0.588235294  0.411764706  0.529411765  0.823529412  0.529411765  0.882352941  1.117647059
MEDIAN    0            0            0            0            0            1            0            1            1
MODE      0            0            0            0            0            0            0            0            1
STD DEV   0.469668218  0.606339063  0.870260272  0.618346942  0.717430054  1.074435556  0.624264273  0.992619825  1.166316474

The format is identical to the one above. For example, (5,5) in the GB table is the number of Google Blogs' top 5 results that appear in the top 5 of Bloglines, and (10,5) in the same table is the number of Google Blogs' top 10 results that appear in Bloglines' top 5.

There is very little overlap compared to the web search tools, so little that most of the averages are close to zero. Zero was the mode on all but one of the overlap tests; not surprisingly, the exception was the broadest test, which compares top 20 to top 20.

Discussion

Web search

Discuss the meaning of the two sets of data, especially when considered together.

The first data set provides basic precision data (what percentage of the results are relevant to the query) and overlap data (how similar are the two sets of results). The second dataset is an attempt to analyze which results are overlapping and the differences in how these rank on each search engine. There are statistics that show that most users only click on search engine results "above the fold" so the position of a result is very important. A top 5 result is a lot different than a top 20 result.
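
The two basic measures can be sketched in a few lines. This is only an illustration: I am assuming each student judged the top 20 results from each engine (which would explain why the scores are multiples of 5) and that overlap is the share of one engine's results that also appear in the other's. The function names and the data are hypothetical.

```python
def precision(relevant_flags):
    """Percentage of inspected results judged relevant."""
    return 100 * sum(relevant_flags) / len(relevant_flags)

def overlap(results_a, results_b):
    """Percentage of results that two equally long result lists have in common."""
    shared = set(results_a) & set(results_b)
    return 100 * len(shared) / len(results_a)

# Hypothetical relevance judgments for 20 results: True means relevant.
judged = [True] * 15 + [False] * 5
print(precision(judged))  # 75.0

# Hypothetical result lists of 20 URLs each that share 5 URLs.
engine_a = [f"a{i}" for i in range(15)] + [f"s{i}" for i in range(5)]
engine_b = [f"b{i}" for i in range(15)] + [f"s{i}" for i in range(5)]
print(overlap(engine_a, engine_b))  # 25.0
```

Under the 20-result assumption, each shared result is worth 5 percentage points of overlap.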

By connecting these two sets of data we can analyze which search engine is best for what purposes and how similar they are from the top down. This gives us a means to compare which site is best at getting the best results first, but also which site gets the best overall results.

What recommendation(s) can you make to a person searching for information?

Using the data we collected, we can make an educated guess about the most useful search tool for certain situations. Google came up as the most precise, which is a good thing because it is the most widely used search engine. I think it is also clear that Live is the least effective search tool: it has the lowest precision by average, mode, and median.

The first recommendation I will make from this data is to use Google. On average, 1.06 of Google's top 5 results appear in Yahoo's top 5 and 1.29 appear in Yahoo's top 10. In the other direction, 1.06 of Yahoo's top 5 results appear in Google's top 5 and 1.47 appear in Google's top 10. These numbers suggest that Google does a slightly better job of incorporating Yahoo's top results. Combined with Google's higher precision, this suggests that Google is including the relevant Yahoo entries rather than junk entries.
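
One way to gauge how much weight the 1.29 versus 1.47 comparison can bear is to set the gap against the spread reported in the GY and YG tables. The sketch below is a rough check, not a formal test. The sample size `n = 17` is my inference from the averages (1.0588... equals 18/17) and is not stated anywhere in the data, and treating the two columns as independent samples is a simplifying assumption, since the same students produced both.

```python
import math

n = 17  # ASSUMED number of students, inferred from the averages; not stated in the text

# o(5,10) averages and standard deviations copied from the GY and YG tables.
gy_mean, gy_sd = 1.294117647, 1.212678125
yg_mean, yg_sd = 1.470588235, 1.23073388

difference = yg_mean - gy_mean

# Standard error of the difference, treating the two columns as independent samples.
standard_error = math.sqrt(gy_sd**2 / n + yg_sd**2 / n)

print(f"difference     = {difference:.3f}")
print(f"standard error = {standard_error:.3f}")
print(f"ratio          = {difference / standard_error:.2f}")
```

Under these assumptions the gap of about 0.18 results is less than half of its standard error, so the difference between the two directions is small compared with the variation between students.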

What did you learn from this, either from the data itself, the process of submitting the queries, or whatever?

I learned that the rest of the world might not be just blindly using Google but they have found that Google is time and time again the most precise search engine.

However, after learning more about search queries in class, I feel that these results could be very skewed depending on the types of queries entered by each individual student. The standard deviation of the precision measures was relatively large for all of the search engines, at around 20 percentage points. This means that the data was not very consistent and that the quality of the search query probably played a large part.

I also learned that Yahoo has other great resources. Just from my query alone I found that there were things in Yahoo! that I hadn't seen before and didn't come up in the Google results. I didn't find much of substance in the Live search that I hadn't seen before or found useful though and will probably not plan on using Live any time soon.

Are there any further questions that you might want to investigate? By this, I don't mean what further queries would you like to submit to the search engines. I mean what research questions would make sense as a follow-up? What methodological changes might need to be made to make these results better?

A further question that I have is whether certain search engines are better for specific topics or genres of queries than others. Without knowing everyone's query it may be that Google is better for technical topics than others and Yahoo is better for business, etc. For example I did a technical search and got a very high 75% precision score with Google. This is at the higher end of the distribution but maybe the distribution can be further broken down by topic. To refine the research I would think about making groups of 10 or so searches in different categories and seeing what kind of effect this had on the data. I am willing to bet that there would be some trending.
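
The breakdown by topic could be sketched like this. Everything here is hypothetical: the topic labels, the scores, and the function name `mean_precision_by_topic` are invented for illustration, since the class data does not record a category for each query.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (topic category, engine, precision in percent).
records = [
    ("technical", "Google", 75), ("technical", "Yahoo", 55),
    ("technical", "Google", 65), ("technical", "Yahoo", 60),
    ("business", "Google", 50), ("business", "Yahoo", 70),
    ("business", "Google", 40), ("business", "Yahoo", 60),
]

def mean_precision_by_topic(rows):
    """Average precision for each (topic, engine) pair."""
    groups = defaultdict(list)
    for topic, engine, score in rows:
        groups[(topic, engine)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}

by_topic = mean_precision_by_topic(records)
for (topic, engine), avg in sorted(by_topic.items()):
    print(f"{topic:10} {engine:7} {avg:.1f}")
```

With around 10 queries per category, as suggested above, the per-topic averages would still be noisy, so any trend found this way would need a larger sample to confirm.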

Another area that we did not focus on very much was how search engines generate their rankings and how those rankings are influenced. Rankings can be manipulated by a process called search engine optimization, which is a very large business on the internet. Because Google is the most popular search engine, most search engine optimization is targeted at Google. At first I thought this would have a negative effect on Google's precision by creating more "junk" pages, but it appears that it may actually have a positive effect. Because I run a website, I am familiar with these processes and know that they actually do work and can influence a search. All across the internet, webmasters are trying to guess the keywords that you will be entering.

Blog search

Discuss the meaning of the two sets of data, especially when considered together.

Similar to the web search data, there are two datasets that I analyzed for the blog search tools. The first data set provides basic precision data (what percentage of the results are relevant to the query) and overlap data (how similar are the two sets of results). The second dataset is an attempt to analyze which results are overlapping and the differences in how these rank on each search engine.

Bringing these sets of data together allows us to compare the blog search tools in the same way that we compared the web search tools: first by basics like precision and overlap, and then by overlap by ranking. The overlap numbers are very low with the blog data, but that tells us something in itself. There is so much content in the blogosphere that it's possible for every engine to be indexing different things.

What recommendation(s) can you make to a person searching for information?

The best two blog search tools appear to be Google Blogs (based on mean) and Bloglines (based on median).

One use I can see for a blog search engine is looking for others' opinions on a major story. Typing in a portion of a breaking news headline might work better than searching for general topics. This makes sense because blogs are usually commentaries more than factual pages.

Because of their higher precision numbers, Google Blogs and Bloglines appear to be the best place to start your blog search. However because there are so many different results and little overlap in any of the search tools I would recommend trying your search query in all three blog search tools to encompass the wealth of information in the blogosphere.

What did you learn from this, either from the data itself, the process of submitting the queries, or whatever?

The most noticeable issue with the blog searches is that there is minimal overlap between search engines. The average overlap between Technorati & Google Blogs was 3.61%, Technorati & Bloglines 9.17%, and Google Blogs & Bloglines 6.94%. Compare these numbers to the web search overlap averages of 18.33%, 20%, and 20.56%, and the difference is dramatic. The mode for all three overlap measurements was either 0% or 5%, which equates to zero or one article that is the same. These overlap numbers suggest that the number of blog posts on the internet is almost too much to manage in a blog-only search engine.

Are there any further questions that you might want to investigate? By this, I don't mean what further queries would you like to submit to the search engines. I mean what research questions would make sense as a follow-up? What methodological changes might need to be made to make these results better?

Blogging is still a relatively new means of sharing information on the Internet. The ease of starting a blog is both a blessing and a downfall of blogging. Because of the number of blogs on the Internet, it is very hard to distinguish between a quality blog and a junk blog. Blogs are also often used as marketing tools by companies to push their products. A blog that is basically filled with fake reviews and links to products is not what we are trying to search for. Google Blogs, Bloglines, and Technorati all have their own ways to distinguish between good and bad blogs.

An issue I have with the concept of a blog only search engine is that blogs will still come up in a typical search on Google or Yahoo. The only reason to use a blog search tool is to find only data from a blog, but with the gray areas of blogging ethics and the lack of reliability I find it hard to believe that people have a good reason to just search blogs that often.

I also found Technorati's poor results surprising. As a blogger I have always heard good things about Technorati, and I check my blog's Authority quite often. I definitely wouldn't have expected its results to be dramatically lower than the other blog search engines. I also wonder whether most of the blog engines require submission of any kind to be indexed and whether that has a dramatic effect on the search results.
