MongoUK 2011Sep 2011 6 minute read
MongoDB creators 10gen held a conference in London earlier this month with a great selection of speakers drawn from their own technical staff and from the wider user community. Thanks to the Twitter, I got a nudge to look at this and thanks to the handy student price of $30 (at time of writing I am studying for my degree at the Open University and have the NUS card to prove it) and a well-placed day off work (ironically with the idea of finishing a piece of work for the OU), I was able to wander along.
Disclaimer: As good as the in-depth tech stuff was, one of the most interesting things for me was who was using it, what for, which problems it was solving and why they had chosen it over and above the many other datastore options that are out there. My write-up is accordingly skewed in this direction.
The National Archives
After the opening keynote had been cleared away, Aleks Drozdov was first up to tell us about how The National Archives were implementing MongoDB as the object store in their Microsoft .NET stack, migrating away from Microsoft SQL Server to do so.
While most of The National Archives’ physical records are stored in underground salt mines to maintain them in the best possible condition, the digitisation process (storing metadata about each physical asset) is ongoing - at last count standing at about 5% of an estimated 11 million records. The relational model version of the datastore contained a terrifying 2,000+ tables and around 15,000 possible attributes. It was quickly realised that this approach was far too slow in day-to-day use and much too complex to look after well. The search was on for an object-based store instead.
XML was looked at and discarded on the grounds that it made it too hard to create and update the data as well as issues surrounding scaling. Pure JSON implementations weren’t appropriate for the surrounding architecture. In the end Mongo was chosen after it proved to be a good fit for the object model and withstood test conditions.
The Mongo version of the solution has simplified the structure down to around 30 collections (loosely speaking the Mongo equivalent to database tables) that contain all the attributes of the physical records as well as presentation information used on the public web. The datastore is currently growing at a rate of 6,000 changes (counting both inserts and updates) per day and is expected to contain 100s of terabytes of data by the year 2020.
Boasting an average of 3 million unique visits per day, The Guardian’s website makes for an interesting case study. Most of the site is still based on Oracle and memcache but MongoDB is starting to drift in here and there and there might even be a case for migrating the main CMS at some future point. They have been using Mongo in development for over a year and have 3 projects that are currently in production.
On their music site, Mongo is being used in place of memcache rather than to replace Oracle. Data dragged in from other sites such as Last.FM are being temporarily stored in Mongo. The trade off here is speed for flexibility and so far Mongo has proved itself to be fast enough. The soulmates dating website and the identity and logging system have also gained Mongo-driven components.
So, why Mongo? Well the current Oracle solution was starting to hit its performance limits with some of the problems being caused by needing to pull data out of around 50 normalised data tables to populate the object model before serving a response to a user request. The object store approach cut out the extra step while offering increased flexibility and a simplified data model. (And the quote that came back from Oracle for a scaling solution melted away the barriers to looking for a new solution.)
Cassandra and CouchDB were auditioned for the role but getting Cassandra up and running turned out to be an early stumbling block. One of the key requirements was ad hoc querying of the data “we don’t have the luxury of knowing what our future requirements might be” which ruled out Couch as it mandates data requests using pre-compiled cacheable queries. Mongo, however, was found to combine the best bits of relational modelling with the advantages of an object store.
Meanwhile away from the main hall, social media marketing company uberVu provided a masterclass on optimisation and handling data on an almost unimaginable scale. Their business rests on their ability to carry out real-time analytics over high-volume sites like Twitter and FaceBook so they need to be quick (warm vs cold RAM anyone?).
Thought Mongo might not be fast enough for you? uberVu’s MongoDB setup handles about 50,000 writes per minute as they process an estimated 200 million tweets per day. They are currently using 30 EC2 virtual machines - with 4 Mongo instances on each - and require a staggering 32TB of storage space if you include their replication sets.
One of the attractive things about Mongo from these guys’ point of view was the lack of a single point of failure. They have used the sharding feature to create multiple data nodes - if one fails the loadbalancer should ensure that another seamlessly takes its place. They also highlighted its ability to scale both horizontally (add more shards) and vertically (upgrade the EC2 instances).
Telefonica UK (aka O2)
Telefonica’s “NoSQL in the Enterprise” presentation followed the development of a new “multi-channel product database” to store their product information in a flexible yet predictable (structurally at least, so that the data can be found again) way, with data being made available to the web, call centres, stores and business intelligence systems. The core problem was that new products are throughly unpredictable and may include features not even imagined when the data store was being created - new phones aside, if you’ve set up a system to describe and compare all the features of mobile phones, how would you store a 3G dongle? Or a netbook computer? Or an iPad? (And that’s before you factor in promotional items like cinema tickets, downloadable content and so on.)
Mix in such internal logic as “this product is only available with certain contracts”, “this product is only available to new customers”, “that product is not available to new customers” and “this offer is only good through August” then, once the decision-makers have been persuaded to let go of their Oracle-branded comfort-blanket, NoSQL document stores start looking like a really good idea.
The main selling point of the document store approach was that it would allow product objects to be stored and new categories created on the fly. Being able to query across disparate types to pull in associated customer/contract/promotion data completed the picture, with query flexibility a key feature (companies ruling out Couch on this basis became quite a theme at this event). During development, geospatial data emerged as a useful extra facet, allowing the introduction of location-based deals (particularly relevant to the cinema ticket promotions idea I suspect).
Apparently on the strength of this one development piece, the company has gone from resisting this weird new non-enterprisey open source technology to pestering the development team to make more stuff. And although Telefonica stressed product functionality was the main driver, I’m sure that as with the Guardian the price difference between MongoDB and Oracle hasn’t hurt its long-term adoption chances either.