
Like every year, we had our local Linux conference. It was a very intense event (this was the first year I was involved in the actual organization) and I can say it's all just a big blur.

I gave two tutorials, one about my Virtual LDAP and another about creating a Google-like (horizontally scalable) cluster out of a library building. In this one, I covered a whole bunch of tools which I ended up using during the last year:

  • Webconverger is the easiest way to deploy Firefox on kiosks for public Internet access
  • PXElator - a full-stack solution for network booting and spawning machines
  • Sack - horizontally scalable (across cores or nodes) in-memory perl hash with remote code execution (close to data)
  • MongoDB, which I use for the audit log in PXElator and feed back into Sack, after finding CouchDB too slow
  • Sysadmin Cookbook as a way to document HOWTOs and SOPs
  • bak-git for tracking configuration changes
  • Gearman and Narada didn't get all the attention they deserved, partly because I wasn't able to make Narada work (I tried the perl and php versions in preparation for the tutorial). But I hope that I managed to transfer part of my fascination with the distributed fork approach.

During the conference I wrote a small project to index git log messages using Sphinx, which might help you get started with it.

It's about system, stupid!

[figure: lib-architecture-v2.png]

When you are working as a system architect or systems librarian, your job is to design systems. My initial idea was to create a small Google out of 12 machines which are dedicated to being web kiosks. I decided to strictly follow the loosely coupled principle, mostly to provide horizontal scaling for my data processing needs. I wanted to be able to add a machine or two if my query was too slow... This easily translates into "how long will I have to wait for my page to generate results"...

I decided to split my system into three logical parts: network booting, data store, and quick reporting. So, let's take a look at each component separately:

  • PXElator
    • supported protocols: bootp, dhcp, tftp, http, amt, wol, syslog
    • boot kiosks using Webconverger (Debian Live based kiosk distribution)
    • provides a web user interface for an overview of the network segment, for auditing
    • configuration is stored as files on disk, suitable for management with git or another source control tool
  • MongoDB
    • NoSQL storage component which supports ad-hoc queries, indexes and other goodies
    • a simple store for perl hashes from PXElator, generated every time we see a network packet from one of the clients using one of the supported protocols (see the sketch right after this list)
  • Sack
    • the fastest possible way to execute a snippet of perl code over multiple machines
    • this involves pushing data out to nodes, executing code on all of them and collecting the results back, all under the 3 second mark!
    • web user interface for cloud overview and graph generation using gnuplot
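
To give a feel for how simple that storage layer is, here is a minimal sketch of such an audit insert using the CPAN MongoDB driver; the collection and field names are illustrative, not PXElator's actual schema:

use strict;
use warnings;
use MongoDB;

# connect to the local mongod and pick a database/collection
# (all names here are illustrative)
my $audit = MongoDB::Connection->new( host => 'localhost', port => 27017 )
    ->get_database('pxelator')->get_collection('audit');

# every parsed network packet is just a perl hash, so storing it
# is a single insert -- no schema to declare up front
$audit->insert({
    ip       => '10.60.0.42',
    protocol => 'dhcp',
    type     => 'request',
    date     => time(),
});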

When I started implementing this system last summer, I decided to use CouchDB for the storage layer. This wasn't really a good choice, since I didn't need transactions, MVCC or replication. Heck, I even implemented forking for documents stored in CouchDB to provide faster responses to clients in PXElator.

Moving to the much faster MongoDB, I got ad-hoc queries which are usable (as in: I can wait for them to finish), and if that's still too slow, I can move the data to Sack and query it directly from memory. As a happy side effect, making shards from MongoDB is much faster than using CouchDB's bulk HTTP API, and it will allow me to feed shards directly from MongoDB to Sack nodes, without first creating shards on disk.
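
That last part is easy to picture: walk a MongoDB cursor and deal the documents out to per-node shards, all in memory. A minimal sketch, with the Sack side reduced to plain perl arrays (its real API is beside the point here) and the node count picked arbitrarily:

use strict;
use warnings;
use MongoDB;

my $audit = MongoDB::Connection->new( host => 'localhost', port => 27017 )
    ->get_database('pxelator')->get_collection('audit');

my $nodes = 4;                       # number of Sack nodes, arbitrary here
my @shard = map { [] } 1 .. $nodes;

# full scan over the audit data, dealing documents out round-robin;
# each $shard[$n] would then be shipped to its Sack node
my $cursor = $audit->find({});
my $i = 0;
while ( my $doc = $cursor->next ) {
    push @{ $shard[ $i++ % $nodes ] }, $doc;
}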

I'm quite happy with how it all turned out. I can configure any host using a small snippet of perl code in PXElator, issue ad-hoc queries on its audit data in MongoDB, or move the data to Sack if I want to do data munging using perl.

As you have noticed by now, I'm using a live distribution for the kiosks, but the machines do have hard drives in them. The idea was to use those disks as storage with something like Sheepdog, which seems like a perfect fit. With it in place, I will have a real distributed, building-sized computer :-).

I had been using CouchDB for some time, mostly as audit storage for PXElator. Audit data stores are most useful for ad-hoc queries (hmm, when did I last see that host?), and CouchDB map/reduce runs took half an hour or more. I wrote a small script, couchdb2mongodb.pl, to migrate my data over to MongoDB (in 26 minutes) and ran the first query I could write after reading the MongoDB documentation about advanced queries. It took only 30 seconds, compared to 30 minutes or more in CouchDB. I was amazed.
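
Conceptually, such a migration is just a loop: pull every document from CouchDB's _all_docs view and insert it into a MongoDB collection. A simplified sketch, assuming default ports and the database names used here (a real migration would page through the view in batches rather than fetch everything at once):

use strict;
use warnings;
use LWP::Simple qw(get);
use JSON        qw(decode_json);
use MongoDB;

# fetch all documents from CouchDB in one request -- fine for a sketch
my $couch = decode_json(
    get('http://localhost:5984/pxelator/_all_docs?include_docs=true')
);

my $mongo = MongoDB::Connection->new( host => 'localhost', port => 27017 )
    ->get_database('pxelator')->get_collection('audit');

foreach my $row ( @{ $couch->{rows} } ) {
    my $doc = $row->{doc};
    delete $doc->{_rev};   # CouchDB revisions have no meaning in MongoDB
    $mongo->insert($doc);
}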

This was a NoSQL database which I could understand and tune. MongoDB has indexes and a profiler, so tuning the query down to three seconds was a simple matter of adding an index. All my RDBMS knowledge was reusable here, so I decided to take a look at why it is so much faster than CouchDB for the same data...
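
Both the index and the profiler are available straight from the perl driver. A small sketch of both, with an illustrative field name:

use strict;
use warnings;
use MongoDB;

my $db = MongoDB::Connection->new( host => 'localhost', port => 27017 )
    ->get_database('pxelator');

# index the field the slow query filters on (field name is illustrative)
$db->get_collection('audit')->ensure_index({ ip => 1 });

# turn on the profiler; slow operations then show up in the
# system.profile collection, ready for ad-hoc inspection
$db->run_command({ profile => 1 });
my @slow = $db->get_collection('system.profile')->find({})->all;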

To be honest, it was MongoDB, High-Performance SQL-Free Database by Dwight Merriman, CEO of 10gen, which won me over to finally try MongoDB. It was technical enough to make me think about MongoDB's architecture and benefits. It's a clearly pragmatic let's-re-think-horizontally-scalable-hash-storage-with-ad-hoc-queries model, but with a funny twist: close coupling with language types, all encoded in the BSON format, which is very similar to Google's protocol buffers.

First, let's have a look at the raw size of the data on disk. At some level, this translates into the number of IO operations involving rotating platters and usage of the buffer cache.

root@opr:~# du -hc /var/lib/couchdb/0.9.0/.pxelator* /var/lib/couchdb/0.9.0/pxelator.couch
655M    /var/lib/couchdb/0.9.0/.pxelator_design
23M     /var/lib/couchdb/0.9.0/.pxelator_temp
7.8G    /var/lib/couchdb/0.9.0/pxelator.couch
8.4G    total

root@opr:~# du -hc /var/lib/mongodb/pxelator.*
65M     /var/lib/mongodb/pxelator.0
129M    /var/lib/mongodb/pxelator.1
257M    /var/lib/mongodb/pxelator.2
513M    /var/lib/mongodb/pxelator.3
513M    /var/lib/mongodb/pxelator.4
513M    /var/lib/mongodb/pxelator.5
17M     /var/lib/mongodb/pxelator.ns
2.0G    total

Here is the first hint about performance: MongoDB's 2G of data (which is used as mmap memory directly, leaving flushes and caching to the OS layer) is almost a perfect fit into the 3G of RAM I have in this machine.

MongoDB has a mongodump utility which dumps BSON for backup, and it's even smaller:

root@opr:~# du -hcs dump/pxelator/*
1.1G    dump/pxelator/audit.bson
4.0K    dump/pxelator/system.indexes.bson
76K     dump/pxelator/system.profile.bson
1.1G    total
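
For reference, a dump like the one above comes from something along the lines of:

root@opr:~# mongodump -d pxelator -o dump/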

So I switched PXElator to use MongoDB as storage. I have never before pushed anything into production after just one day of testing, but the first query speedup from 30 min to 30 sec, and the ability to cut that down to 3 sec by adding an index (which took about 13 sec to create), is just something which provides me with a powerful analytical tool I didn't have before.