Most information junkies would be hard-pressed to name anything that has transformed their professional lives as much as Internet search engines have. The miraculous devices can take your hot topic of the day, scan millions of Web pages and in seconds bring back product announcements, research papers, the names of experts and more -- things that would be difficult or impossible to find otherwise.
But as powerful as they are, search engines have huge weaknesses. For example, a recent Google search on the word Linux took just 0.4 seconds, but it returned 95 million hits. Too bad if the one you need is No. 10,000 on the list.
Researchers, however, are poised to revolutionize search technology over the next few years. The most common thrust is to personalize search engines so that they know, for example, that if you're an IT professional and you search for mouse, you're more likely to want information about PC devices than about animals.
Adele Howe, a computer science professor at Colorado State University in Fort Collins, and Gabriel Somlo, a CSU graduate student, have built a proof of concept called QueryTracker, a software agent that sits between a user and a conventional search engine and looks for information of recurring interest, such as the latest news about a user's chronic illness. QueryTracker submits a user's query to the search engine once a day and returns results from new Web pages and pages that have changed since the previous search.
The magic in QueryTracker comes from its automatic generation of an additional daily query -- which Howe says is often superior to the user's original query -- based on what it learns about the user's interests and priorities over time. It filters the results of both queries for relevance and sends them to the user.
QueryTracker's ability to generate its own searches can compensate for the poorly formed queries that many users write, Howe says. "Even people knowledgeable about the Web are often either lazy or they are just not informed about how to write good queries," she says. The most common mistake: queries that are too short, like the one-word Linux search.
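Howe and Somlo haven't published QueryTracker's internals in this article, but the general idea -- learning a term profile from documents the user found relevant and using it to expand a too-short query -- can be sketched in a few lines of Python. Everything here (the stop-word list, the sample documents, the two-term expansion) is illustrative, not QueryTracker's actual algorithm:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "for", "is", "in", "to", "on"}

def learn_profile(relevant_docs):
    """Build a term-frequency profile from documents the user found relevant."""
    counts = Counter()
    for doc in relevant_docs:
        counts.update(w for w in doc.lower().split() if w not in STOPWORDS)
    return counts

def expand_query(query, profile, extra_terms=2):
    """Augment a short query with the user's top profile terms not already in it."""
    base = query.lower().split()
    candidates = [t for t, _ in profile.most_common() if t not in base]
    return " ".join(base + candidates[:extra_terms])

# Pages the user previously marked as relevant (toy data).
docs = ["linux kernel scheduler patch released",
        "new linux kernel security update"]
profile = learn_profile(docs)
print(expand_query("linux", profile))  # "linux kernel scheduler"
```

The one-word Linux query becomes a three-word query biased toward the user's demonstrated interest in kernel news -- the kind of automatic sharpening Howe describes.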
Jeannette Jenssen, a mathematics professor at Dalhousie University in Halifax, Nova Scotia, is taking search personalization techniques a step further, to the "crawlers" that index Web content before it can be searched. She says the popular search engines have three drawbacks: They are increasingly charging corporate users for their services, they skew results in favor of advertisers, and they often retrieve huge amounts of irrelevant information. But Jenssen's "focused crawler" indexes only pages related to prespecified topics and then tailors the rankings to the interests of the user.
For example, she says, a medical society might run the crawler nightly to index just pages relating to medicine. And it would rank the resulting hits in a way that made sense to the medical establishment, not to advertisers or average Web surfers. The crawler would get progressively better at building its nightly index by observing the behavior of the searches against it.
Other focused crawlers look for pages containing information that meets specific criteria. But Jenssen's crawler can discern hidden, or indirect, links through a process she likens to the children's search game "warmer-colder."
For example, she says, imagine a Web crawler that focuses on computer science topics. Computer science research papers often are linked to the home pages of the professors who wrote them, and their pages are linked to the professors' universities' home pages. "When the crawler gets to the university page, it searches more intently than it would at a company page," Jenssen says. "It says, 'I'm getting warmer.' It analyzes user behavior and Web paths to automatically learn these trajectories."
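The "warmer-colder" intuition maps naturally onto a best-first crawl: keep a priority queue of frontier pages, and always expand the page that scores warmest against the target topic. The sketch below is a generic illustration of that strategy over a toy link graph, not Jenssen's actual crawler; the graph, page text and overlap-based warmth score are all invented for the example:

```python
import heapq

# Toy link graph and page text standing in for the live Web (illustrative only).
LINKS = {
    "paper": ["prof"],
    "prof": ["university", "hobby"],
    "university": ["cs-dept"],
    "hobby": [],
    "cs-dept": [],
}
TEXT = {
    "paper": "computer science research paper",
    "prof": "professor computer science homepage",
    "university": "university departments research",
    "hobby": "photography travel",
    "cs-dept": "computer science department courses",
}
TOPIC = {"computer", "science", "research"}

def warmth(page):
    """Score a page by word overlap with the topic: higher means 'warmer'."""
    return len(TOPIC & set(TEXT[page].split()))

def focused_crawl(start):
    """Best-first crawl: always expand the warmest frontier page next."""
    frontier = [(-warmth(start), start)]  # max-heap via negated scores
    visited, order = set(), []
    while frontier:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for nxt in LINKS[page]:
            if nxt not in visited:
                heapq.heappush(frontier, (-warmth(nxt), nxt))
    return order

print(focused_crawl("paper"))
```

Following the paper-to-professor-to-university trajectory, the crawler reaches the on-topic computer science department before it bothers with the professor's off-topic hobby page.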
Filippo Menczer, a computer science professor at Indiana University in Bloomington, says conventional search engines determine a document's relevance by considering various things in isolation. They may first select a document because it contains the keywords in the query. Then, to rank the results, they may consider how many links point to the document. Better results could be obtained from considering many such "measures of relevance" -- including user preferences -- in combination, and in considering combinations of pages rather than single pages, Menczer says.
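One simple way to combine measures of relevance, rather than consider them in isolation, is a weighted sum of normalized signals. The sketch below is a minimal illustration of that idea, not Menczer's method; the three signals, the normalization and the weights are all assumptions made for the example (a real system might learn the weights from user behavior):

```python
def combined_score(doc, query_terms, weights=(0.5, 0.3, 0.2)):
    """Combine keyword match, link popularity and user preference into one score.

    All three signals are normalized to [0, 1]; the weights are illustrative.
    """
    w_kw, w_link, w_pref = weights
    keyword = len(query_terms & doc["terms"]) / len(query_terms)
    links = min(doc["inlinks"] / 100.0, 1.0)  # cap popularity at 100 inlinks
    pref = doc["user_preference"]             # assumed already in [0, 1]
    return w_kw * keyword + w_link * links + w_pref * pref

doc = {"terms": {"linux", "kernel"}, "inlinks": 40, "user_preference": 0.9}
print(round(combined_score(doc, {"linux", "kernel"}), 2))  # 0.8
```

A document that matches every query term, has moderate link popularity and fits the user's profile outscores one that wins on keywords alone -- the kind of blended ranking Menczer argues for.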
Such complex and powerful searches will be practical in three to five years when computers are more powerful. "We'll do brute-force, large-scale data mining over the whole Web -- over many terabytes of information," says Menczer.
Brute force is a pretty good description of IBM's WebFountain, a huge Linux cluster that runs 9,000 programs continuously and crawls 50 million new pages every day. But WebFountain doesn't simply index keywords; it applies natural-language analysis concepts to extract meaning from unstructured text.
For example, it determines whether an entity is a person's name, company name, location, product, price and so on, and then it attaches searchable XML metadata tags to it. "We are tagging the entire Web, all of Usenet news, all the wire services and so on," says Dan Gruhl, WebFountain's chief architect at IBM's Almaden Research Center.
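WebFountain's natural-language tagger isn't public, but the effect Gruhl describes -- wrapping recognized entities in searchable XML metadata -- can be mimicked with a toy dictionary lookup. The gazetteer and tag format below are invented for the example; a real system would use statistical entity recognition rather than string matching:

```python
# Toy gazetteer; a real system would use statistical entity recognition.
ENTITIES = {
    "IBM": "company",
    "Dan Gruhl": "person",
    "Almaden": "location",
}

def tag_entities(text):
    """Wrap known entities in searchable XML-style metadata tags."""
    for name, kind in ENTITIES.items():
        text = text.replace(name, f'<entity type="{kind}">{name}</entity>')
    return text

print(tag_entities("Dan Gruhl works at IBM."))
```

The output marks each entity with its type, so a later search can ask specifically for persons or companies instead of bare keywords.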
The software is pretty good at extracting and tagging the semantic meaning of unstructured text, but Gruhl says more research is needed to do reliable "sentiment analysis," which, for example, would let companies automatically monitor the reputations of their products.
Researchers at the Almaden center are experimenting with Sentiment Analyzer, which tries to extract opinions from online text documents. If a customer said at a Web site, "The Ford Explorer is great," that would be easy to classify, Gruhl says, but if the customer said sarcastically, "It's almost as good as the Ford Pinto," sentiment analysis software would be stumped.
Making sense of that kind of statement is one of the goals of IBM's research.
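A toy lexicon-based classifier makes the Pinto problem concrete. Word-level polarity scoring -- the simplest possible approach, sketched here purely for illustration and in no way IBM's method -- sees only the positive word "good" and has no knowledge of the Pinto's reputation, so it misreads the sarcasm:

```python
# Toy polarity lexicon; real sentiment analysis uses far richer models.
LEXICON = {"great": 1, "good": 1, "excellent": 1, "bad": -1, "terrible": -1}

def naive_sentiment(text):
    """Sum word-level polarity scores: positive, negative or neutral."""
    words = text.lower().replace(".", "").split()
    score = sum(LEXICON.get(w, 0) for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(naive_sentiment("The Ford Explorer is great"))             # positive
print(naive_sentiment("It's almost as good as the Ford Pinto"))  # positive (wrong)
```

Both sentences come out "positive," even though the second is an insult -- exactly the gap between word-level scoring and genuine understanding that IBM's research aims to close.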