A new report confirms what surfers already know: search engines simply can't keep up with the Web's growth.
In an article to be published in the July 8 issue of Nature, NEC Institute research scientists Steve Lawrence and C. Les Giles present the findings of their latest study, which shows search engines are providing inadequate, out-of-date and biased coverage of the ever-expanding Web.
From December 1997 to February 1999, the Web more than doubled in size, from 320 million pages to 800 million. Over the same period, the top-ranking search engine's coverage of those pages dropped from approximately 34 to 16 per cent. (In absolute terms the best single-engine index still grew, from roughly 109 million pages to 128 million; the Web simply grew far faster.)
"Though the [information-access] situation now is better than it was before the Web and search engines, it is limited," says Lawrence. "It's not as good as it could be."
This year's study compared the ability of 11 major search engines to produce results for 1,050 queries. Northern Light (http://www.northernlight.com) was the top-scoring engine, covering an estimated 16 per cent of Web pages. AltaVista (http://www.altavista.com) and Snap (http://snap.com) tied for a close second place with 15.5 per cent. After scoring the top spot in last year's study, HotBot (http://www.hotbot.com) slipped to fourth place this year with 11.3 per cent.
EuroSeek (http://www.euroseek.net) landed the bottom berth on the list, finding only 2.2 per cent of Web pages. Marquee portals Lycos, Excite and Yahoo fared only slightly better, with 2.5 per cent, 5.6 per cent and 7.4 per cent coverage, respectively.
Combined, the 11 search engines found only 335 million pages, or 42 per cent, of the total Web. This means that users of metasearch engines such as MetaCrawler (http://www.metacrawler.com) and Ask Jeeves (http://www.aj.com), which submit a query to several engines at once, have a substantially better chance of finding results on a specific topic than those who rely on a single engine, according to the researchers.
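The metasearch point can be sketched with toy numbers: because each engine indexes a different, partially overlapping slice of the Web, the union of several partial indexes covers more pages than any one of them. The engine names, slice sizes and overlaps below are hypothetical, chosen only to echo the study's single-engine coverage figures:

```python
# Toy illustration (hypothetical data): each engine indexes only part of the Web;
# a metasearch engine that queries all of them effectively covers the union.
web = set(range(100))  # pretend the Web is 100 pages, so counts double as percentages

# Hypothetical per-engine indexes, each covering a different overlapping slice
engine_indexes = {
    "EngineA": set(range(0, 16)),    # ~16% coverage, like the study's top engine
    "EngineB": set(range(10, 26)),   # overlaps EngineA on pages 10-15
    "EngineC": set(range(20, 32)),   # overlaps EngineB on pages 20-25
}

for name, idx in engine_indexes.items():
    print(f"{name}: {len(idx)}% of the Web")

# The union of all three indexes is what a metasearch query can reach
combined = set().union(*engine_indexes.values())
print(f"Combined (metasearch): {len(combined)}% of the Web")  # 32%, double the best single engine
```

Even with heavy overlap between the toy indexes, the combined set beats every individual engine, which is the study's argument for metasearch in miniature.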
"Basically, it appears that there are limits to how much [the search engines] can index," says Lawrence, who says that search engines face diminishing returns in their efforts to do a better job of indexing the Web.
Rather than purchase additional computational resources to index more pages, the search engines may use funds to offer new services, such as calendaring or chat. Lawrence says funding extra applications "may be better in terms of maximising ad revenue, and therefore, stock prices."
Add to this the fact that most people make relatively simple queries that require only a small database of pages. Moreover, churning through 800 million pages might take longer than most surfers are willing to wait. As a result, search engines may have little incentive to improve their coverage.
The scientists found that it can take months for search engines to index new pages. One analysis showed that a search engine took an average of 186 days before it included a new page in its results for a certain query.
Most disconcerting is the fact that search engines appear to be biased in terms of their indexing.
The study found that search engines were more likely to index heavily trafficked pages and pages with many incoming links. Also, commercial (.com) sites were more likely to be indexed than educational (.edu) sites. And except on AltaVista, pages from US sites were more likely to be indexed than non-US pages.
The danger, according to Lawrence, is that surfers will be able to find only the most popular places to get certain types of information -- which could result in biased research.
It might not matter where users find commonly available information, such as the latest stock quotes, but for other kinds of decisions, wide-ranging searches are necessary. For example, medical research efforts can be wasted if scientists aren't aware that a similar study already exists. And finding background on all candidates in a local election can be critical to casting an informed vote.
The search engines' gaps in coverage leave an opportunity for publishers to create thorough directories on niche topics. Search engines could then direct traffic to these experts.
The good news? Lawrence estimates that the exponential growth in the size of the Web will slow -- eventually. "If we wait a few years, the rate of increase in computational resources is faster than the generation of original text or content by humans."