In-memory data management speeds retrieval
20 February 2006, 13:46
My first real Java application, back in 1997, was a servlet-based group scheduler. It wasn't quite the smash hit that Hanson's "MMMBop" was that summer, but as some of you may recall, it had its charms.
One of the things that fascinated me was the ease with which Java enabled me to manage our data in a memory-resident object and serialize it to disk when users made changes to their calendars. The application was, quite simply and elegantly I thought, little more than a Java Dictionary exposed for transactional use on the Web.
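To picture the pattern, here is a minimal sketch in Python rather than the original Java, which is not shown here. The class name `CalendarStore`, its methods, and the user/slot/note layout are all hypothetical, and `pickle` stands in for Java's native serializer. The whole data set lives in one dictionary, and every change rewrites the file on disk.

```python
import os
import pickle
import tempfile


class CalendarStore:
    """Hypothetical memory-resident store: one dict, rewritten to disk on each change."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Load the previous state, if any, so the dict survives restarts.
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)

    def put(self, user, slot, note):
        # Update the in-memory dict first, then persist the whole thing.
        self.data.setdefault(user, {})[slot] = note
        self._save()

    def get(self, user):
        # Reads never touch the disk.
        return self.data.get(user, {})

    def _save(self):
        # Write to a temporary file in the same directory, then swap it in,
        # so a crash mid-write does not leave a truncated archive behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.data, f)
        os.replace(tmp, self.path)
```

Calling `CalendarStore("calendar.pickle").put("alice", "1997-07-01T10:00", "standup")` would update the dictionary and rewrite the file in one step. A sketch like this assumes a single process and a data set small enough to rewrite in full, which is the situation described above.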
Kent Beck and Ward Cunningham, two leaders of the agile programming movement, would have been proud of me. Although I didn't know it at the time, I had embraced one of their central tenets: Do the Simplest Thing That Could Possibly Work.
I hadn't foreclosed any options. There were ways to scale the application if I needed to, and in fact, I later experimented with swapping out Java's native serializer for an industrial-strength object database. But as often turns out to be the case, there was never any need to fire that big cannon.
My group scheduler was an example of what Clay Shirky calls "situated software" -- an application that's used by, at most, dozens of people, and that needs agility more than it needs scalability. I've since revisited that strategy from time to time, most recently for several of the services I use to search my own blog.
In April 2003 I began accumulating all of my entries in a single XML file. I also run them through a publishing system to create Web pages and RSS feeds, but the XML file is my canonical archive. And although I've written more than 700 items since then, amounting to a third of a million words, the file doesn't yet exceed three megabytes.
It's entirely feasible to keep that corpus in memory, so I do. One instance of it backs my structured search service, which I use to run XPath queries over the collection. That gives me instant access to a variety of microformatted elements: quotes by Ward Cunningham, or code snippets in XSLT or Python.
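The idea can be sketched with Python's standard library. The element names, the `class`, `cite`, and `lang` attributes, and the sample entries below are invented for illustration and are not the actual archive format. `xml.etree.ElementTree` supports only a subset of XPath, which is enough for this kind of lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive format; the real file's markup is not shown in this column.
SAMPLE = """\
<entries>
  <item date="2005-03-01">
    <title>Wiki notes</title>
    <body>
      <p class="quote" cite="Ward Cunningham">Do the Simplest Thing That Could Possibly Work</p>
      <pre class="code" lang="python">print("hi")</pre>
    </body>
  </item>
  <item date="2005-04-12">
    <title>Transforms</title>
    <body>
      <pre class="code" lang="xslt">&lt;xsl:value-of select="."/&gt;</pre>
    </body>
  </item>
</entries>
"""

# Parse once; every later query runs against this in-memory tree.
ROOT = ET.fromstring(SAMPLE)


def quotes_by(person):
    # The XPath predicate selects quote elements; the name is compared in
    # Python so that it never has to be spliced into the path expression.
    return [p.text for p in ROOT.findall(".//p[@class='quote']")
            if p.get("cite") == person]


def snippets_in(lang):
    # Same approach for code snippets tagged with a language.
    return [pre.text for pre in ROOT.findall(".//pre[@class='code']")
            if pre.get("lang") == lang]
```

Here `quotes_by("Ward Cunningham")` and `snippets_in("xslt")` each walk the whole tree on every call. At a few megabytes that is cheap, which is the point of keeping the corpus resident.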
Structured search is handy, but like everyone else, I still regard good old-fashioned full-text search as my bread and butter. Until recently, I'd been relying on InfoWorld's Ultraseek engine. But because it crawls my site, which includes templated elements, the results aren't very precise. I wanted to search just the words I've written.
So now I load up another instance of the file and search that. The index? There isn't one. The service just rips through memory, finding substrings. It's blindingly fast. And charting my productivity alongside Moore's Law suggests this strategy won't run out of gas anytime soon.
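An index-free search of this kind takes only a few lines. The sketch below is an assumption about the general shape, not the actual service: the `Corpus` class and the (title, text) pairs are hypothetical, and matching is a plain case-insensitive substring test.

```python
class Corpus:
    """Hypothetical in-memory corpus searched by brute-force substring scan."""

    def __init__(self, entries):
        # entries: iterable of (title, text) pairs.
        # Lowercase each text once at load time so searches skip that work.
        self.entries = [(title, text.lower()) for title, text in entries]

    def search(self, phrase):
        needle = phrase.lower()
        if not needle:
            # An empty string is a substring of everything; treat it as no query.
            return []
        # No index: test every entry on every query.
        return [title for title, folded in self.entries if needle in folded]
```

For example, `Corpus([("Structured search", "I run XPath queries over the collection")]).search("xpath")` scans the one entry and returns its title. The cost grows linearly with the size of the text, so the approach holds up as long as the corpus grows more slowly than the hardware.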
When we consider the exponential growth of storage, we often forget that our most essential data is textual and numeric. And that stuff tends to grow only linearly. For example, my 2005 e-mail archive tops 100 megabytes, but a big chunk of it is PowerPoint attachments people have sent me. Boiled down to their textual and numeric essence, they'd occupy a fraction of the space.
There's nothing new about in-memory databases. They come in many different flavors, all of which are still fairly exotic, but emerging technologies such as Microsoft's LINQ (Language Integrated Query) promise to pull this approach into the mainstream. For our most vital and most volatile data, it's a strategy whose time has come.