Diffbot organizing Web data for enterprise use

The company claims to have created a structured representation of much of the data on the Web

Google's Knowledge Graph organizes information on the Web so it can be programmatically queried

Diffbot is trying to reorganize all the data on the Web so it can be put to better use.

The service "converts the existing Web into a structured database-like representation that can essentially be used for all sorts of intelligent applications," said Mike Tung, Diffbot CEO.

On Thursday, Diffbot said it had received $500,000 in funding from Bloomberg Beta, the investment arm of the Bloomberg media company. Andy Bechtolsheim, a founder of Sun Microsystems and the first major investor in Google, is also a backer. Diffbot says it already has paying customers for the service, which is being used by Microsoft's Bing, Adobe, Salesforce.com, and eBay.

The service creates an object for each Web page it finds. An object provides structure to a set of related data so that it can be programmatically reused, along with other similar objects, by a query engine or an external application. The software has been copying all the pages it finds on the Web and reorganizing them into objects.
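
As a rough illustration of what such an object might look like, the Python sketch below models a page as a typed record with named fields. The class name `PageObject`, its fields, the helper `find_by_type`, and the example URLs are all assumptions made for illustration; the company has not described its actual schema here.

```python
from dataclasses import dataclass, field


@dataclass
class PageObject:
    """Hypothetical structured record for one Web page (not Diffbot's schema)."""
    url: str
    page_type: str  # e.g. "article" or "product"
    title: str
    fields: dict = field(default_factory=dict)  # type-specific attributes


def find_by_type(objects, page_type):
    """Return every object of the given type so similar objects can be reused together."""
    return [obj for obj in objects if obj.page_type == page_type]


# Invented sample data for the sketch.
objects = [
    PageObject("https://example.com/news/1", "article", "Example headline",
               {"author": "A. Writer"}),
    PageObject("https://example.com/shop/42", "product", "Example shoe",
               {"price": 89.0}),
]

products = find_by_type(objects, "product")
print([p.title for p in products])
```

Because every object carries the same named fields, a query engine or an outside application can filter and combine them without parsing the original page again.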

Perhaps the most well-known example of this object-based approach is Google's Knowledge Graph, a Semantic Web project. If a search is done on a particular keyword, such as the name "Johnny Depp," Google will return, along with a standard list of Web pages, a box containing basic information on the actor, such as birth date and height. That box of information is a rendering of the "Johnny Depp" Knowledge Graph object built by Google.

Diffbot, which is based in Palo Alto, California, and was founded in 2008, claims its own collection of objects is superior to Google's.

The 14-person company says it has created an entirely automated system for accurately creating objects. Google's approach is at least partly manual, requiring individuals to edit objects after they have been created, confirmed a Google spokesman.

Google's Knowledge Graph is larger than Diffbot's, containing roughly a billion objects, while Diffbot's global index of the Web now includes 600 million objects. But Google doesn't yet offer a Knowledge Graph API for third-party commercial use, though it is working on one.

Diffbot is based on the idea that businesses could use such a collection of organized information for their own purposes. Nike, for instance, could deploy the service to build a profile of other shoe companies and their offerings, Tung suggested. Diffbot offers a set of APIs (application programming interfaces) that third-party applications can use to query the massive object set.
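
A competitor profile of the kind Tung describes could be assembled along the following lines. This is a minimal sketch that works on a small in-memory list and does not call Diffbot's APIs; the function `competitor_profile`, the field names, and the brands and prices are invented for the example.

```python
from collections import defaultdict

# Hypothetical product objects; brand names and prices are invented sample data.
PRODUCTS = [
    {"brand": "BrandA", "title": "Trail Runner", "price": 120.0},
    {"brand": "BrandB", "title": "Court Classic", "price": 80.0},
    {"brand": "BrandB", "title": "Road Racer", "price": 100.0},
    {"brand": "BrandC", "title": "Studio Slip-On", "price": 60.0},
]


def competitor_profile(products, own_brand):
    """Summarize each competing brand's offerings: count, average price, titles."""
    by_brand = defaultdict(list)
    for product in products:
        if product["brand"] != own_brand:
            by_brand[product["brand"]].append(product)
    return {
        brand: {
            "count": len(items),
            "avg_price": sum(p["price"] for p in items) / len(items),
            "titles": sorted(p["title"] for p in items),
        }
        for brand, items in by_brand.items()
    }


profile = competitor_profile(PRODUCTS, own_brand="BrandA")
print(profile["BrandB"])
```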

The company has developed a set of AI algorithms, some of which it is in the process of patenting, that can identify the context and subject of Web pages. One novel AI algorithm relies on computer vision, which is not a widely used technique for indexing Web pages, Tung acknowledged. The layout and design of Web pages can provide important clues to help better define objects. "The layout is the signal that helps us determine what kind of page it is," Tung said. An e-commerce site has an entirely different structure than a news site, for instance.
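
The idea that layout signals page type can be shown with a toy example. The sketch below is not Diffbot's algorithm and involves no computer vision; it is a hand-written scoring rule over a few layout features, and the feature names (`has_price_block`, `has_buy_button`, `has_byline`, `text_column_paragraphs`) are hypothetical.

```python
def classify_page(layout):
    """Guess a page type from simple layout features (a toy rule, not a vision model)."""
    score_product = 0
    score_article = 0
    # Price blocks and buy buttons are typical of e-commerce layouts.
    if layout.get("has_price_block"):
        score_product += 2
    if layout.get("has_buy_button"):
        score_product += 2
    # A byline and a long single text column are typical of news layouts.
    if layout.get("has_byline"):
        score_article += 2
    if layout.get("text_column_paragraphs", 0) >= 5:
        score_article += 2
    if score_product == score_article:
        return "unknown"
    return "product" if score_product > score_article else "article"


shop_page = {"has_price_block": True, "has_buy_button": True}
news_page = {"has_byline": True, "text_column_paragraphs": 12}

print(classify_page(shop_page), classify_page(news_page))
```

A production system would presumably learn such signals from rendered pages rather than rely on fixed rules like these.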

Diffbot is one of a number of companies building such "knowledge graphs," through various sets of technologies, said Dave Schubmehl, an IDC research director who covers content analytics, discovery and cognitive systems. Such technology could be of potential value to any business that relies on understanding large amounts of external data, he said via email.

Another company working in this field is IBM, Schubmehl wrote. Last year, IBM purchased two companies to install similar capabilities in its Watson cognitive computing service. One was AlchemyAPI, which builds taxonomies of data assets, and the other was Blekko, which developed software for indexing Web sites.

Some organizations use other technologies to organize and synthesize large sets of otherwise unstructured information, according to Schubmehl. Neo4J and Oracle both offer graph databases, which are well-suited for identifying the connections across large collections of data. Others rely on Semantic Web tools such as the Sesame Java framework, which is used for working with data in the structured RDF (Resource Description Framework) format.
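
RDF describes data as subject-predicate-object statements, which is also how a graph database links one entity to another. The sketch below imitates that shape with plain Python tuples; it uses neither Sesame nor a real RDF serialization, and the predicate names (`basedIn`, `foundedIn`, `ceoOf`) and the `match` helper are made up for the example. The facts themselves come from this article.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = {
    ("Diffbot", "basedIn", "Palo Alto"),
    ("Diffbot", "foundedIn", "2008"),
    ("Mike Tung", "ceoOf", "Diffbot"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


about_diffbot = match(TRIPLES, subject="Diffbot")
linked_to_diffbot = match(TRIPLES, obj="Diffbot")
print(about_diffbot)
print(linked_to_diffbot)
```

Querying by object as well as by subject is what lets such a store follow connections in both directions.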

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
