Antispam system behind massive book digitization effort

Computer scientist puts automated Turing test to use in book digitization effort

You know those pesky but necessary CAPTCHA boxes whose squiggly letters and digits you need to retype to make use of certain parts of sites such as Yahoo, Wikipedia and PayPal?

A computer scientist from Carnegie Mellon is looking to replace many of those boxes with antispam boxes of his own for the purpose of helping to digitize and make searchable the text from books and other printed materials. To boot, the system could help companies better secure their Web sites.

The idea is somewhat along the lines of projects like SETI@Home, the famous grid supercomputing project for detecting signs of extraterrestrial life in deep space. Organizers of SETI@Home persuaded computer users all over the world to let their computers' CPU cycles be used to process information for the ET hunt when the systems weren't otherwise in use.

In the case of Luis von Ahn's project, he and his team are persuading organizations to replace the CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) security boxes on their Web sites with what the assistant professor of computer science calls reCAPTCHA boxes. Instead of retyping random numbers and letters, visitors would retype text that optical character recognition (OCR) systems have trouble deciphering when digitizing books and other printed materials. The transcribed text would then go toward the digitization of the printed material on behalf of the Internet Archive project.

"I think it's a brilliant idea -- using the Internet to correct OCR mistakes," said Brewster Kahle, director of the Internet Archive, in a statement. "This is an example of why having open collections in the public domain is important. People are working together to build a good, open system."

Von Ahn says people are estimated to solve 60 million-plus CAPTCHAs a day, amounting to 150,000 or more man-hours of work that can be put to use for the digitization effort. His team is working with Intel to offer a Web-based service enabling Webmasters to adopt reCAPTCHAs to secure their sites.
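As a rough check on those figures, the short Python sketch below works out the average time per CAPTCHA that they imply. The function and constant names are illustrative only, and both inputs are the article's lower-bound estimates, so the result is a ballpark rather than a measurement.

```python
# Figures quoted in the article; both are lower bounds
# ("60 million-plus" CAPTCHAs, "150,000 or more" man-hours).
CAPTCHAS_PER_DAY = 60_000_000
HUMAN_HOURS_PER_DAY = 150_000

SECONDS_PER_HOUR = 3600


def seconds_per_captcha(captchas: int, hours: int) -> float:
    """Average time implied per solved CAPTCHA, in seconds."""
    return hours * SECONDS_PER_HOUR / captchas


if __name__ == "__main__":
    # 150,000 h * 3,600 s/h = 540,000,000 s, spread over 60,000,000 CAPTCHAs
    print(seconds_per_captcha(CAPTCHAS_PER_DAY, HUMAN_HOURS_PER_DAY))
```

On these numbers each CAPTCHA takes about nine seconds of a person's time, which is the effort the project aims to redirect toward transcription.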

An audio version is also in the works. It could be used by blind Web users and would help transcribe radio programs.


Network World staff

Network World