Google debuts text analysis tools
- 18 December, 2010 03:59
Google has introduced two tools that may help users discover new ways to parse the company's massive collections of public information.
One tool counts how often a chosen phrase shows up across 500 years' worth of digitized books, while another divvies up search results by their levels of reading difficulty.
The first service, called the Books Ngram Viewer, allows people to search for specific phrases within the company's collection of digitized books. In addition to links to the source material, the results provide a timeline showing when the phrase was most often used.
The tool runs searches against a database of 500 billion words found within 5.2 million books Google has digitized. The sampled books were all published between 1500 and 2008, in Chinese, English, German, French, Russian or Spanish.
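As a rough illustration of the kind of counting involved, the Python sketch below tallies how often a phrase appears among all word sequences of the same length in each publication year. It is not Google's implementation: the function names, the toy corpus and the choice to report a relative frequency rather than a raw count are all assumptions made for the example, and punctuation handling is ignored.

```python
from collections import Counter


def ngrams(text, n):
    """Yield every run of n consecutive lowercased words in text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def phrase_timeline(corpus, phrase):
    """Map each year to the share of that year's n-grams equal to phrase.

    corpus is a hypothetical structure: {year: [book text, ...]}.
    """
    target = " ".join(phrase.lower().split())
    n = len(target.split())
    timeline = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        # Guard against years whose books are shorter than the phrase.
        timeline[year] = counts[target] / total if total else 0.0
    return timeline


# Invented sample data, for illustration only.
toy_corpus = {
    1920: ["the great war changed europe", "after the great war ended"],
    1950: ["world war one and world war two",
           "the great war is now called world war one"],
}

for year, share in phrase_timeline(toy_corpus, "the great war").items():
    print(year, round(share, 3))
```

Dividing by the total number of n-grams in each year keeps years with many books from dominating the timeline simply because more text was published.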
With this service, Google hopes to introduce a new form of quantitative analysis to academic fields, one that could provide insights into historical trends or the birth of new ideas by tracking the popularity of associated words and phrases. One group of researchers has coined the term "culturomics" to describe the approach.
Such metrics can show how phrases come into and move out of vogue, oftentimes due to historical events.
For instance, a search for the phrase "World War One" shows the term began to be used just prior to the outbreak of World War II. Not surprisingly, occurrences of the phrase "The Great War," which was what World War I was called before people realized there would be a sequel, dropped by the 1950s.
Google has also added another form of analysis to its regular search: a new advanced search feature that can divide up results by reading level. The search breaks results into basic, intermediate and advanced reading levels.
Although Google does not specify what attributes define each reading level, most readability tests analyze texts by looking at attributes such as the number of words in each sentence or the number of letters and syllables in each word, under the assumption that more complicated sentences would be more difficult to read.
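One published test of this kind is the Automated Readability Index, which combines the average number of characters per word with the average number of words per sentence. The Python sketch below computes it as an example of the general approach. It is not Google's method, which the company has not described, and the sentence and word splitting rules here are simplifying assumptions.

```python
import re


def automated_readability_index(text):
    """Return the Automated Readability Index (ARI) of text.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    Higher scores indicate text that is presumed harder to read.
    """
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters or digits, optionally
    # followed by an apostrophe suffix such as "n't" or "'s".
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # Count only letters and digits, not apostrophes.
    chars = sum(ch.isalnum() for word in words for ch in word)
    return (4.71 * chars / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


simple = "The cat sat. The dog ran."
dense = ("Quantitative lexical analysis occasionally illuminates "
         "historically significant cultural transformations.")

print(automated_readability_index(simple))
print(automated_readability_index(dense))
```

The short sentences of three-letter words score far lower than the single sentence of long words, which is the behavior the assumption about sentence complexity predicts.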
In one sample, 74 per cent of the material on the IDG site InfoWorld is classified as intermediate level, 21 per cent as basic and 3 per cent as advanced.
Google documentation explains that users might find the distinction of reading levels useful in helping to complete searches. A university professor might want only the advanced results, while a junior high school teacher might want to find more basic material for students.