Database design with CouchDB

As you might know by now, I have been really struck by simplicity of CouchDB at last year's OSCON. From then, we got couchapp which is great idea of hosting CouchDB views on file-system for easy maintenance.

So good in fact, that I re-wrote couchapp in perl called design-couch.pl. I needed to deviate a bit from original design (one _design document per application) because PXElator data which I store in CouchDB is... data...

I have been introduced to Relational databases back at university and since then I have been using PostgreSQL for all my database needs. But, this time, I have dumps from commands, SOAP interfaces, syslog messages and arbitrary audit events generated all over the code. I didn't want to think about structure up-front, and View Cookbook for SQL Jockeys convinced me I don't have to, but I decided to make few simple rules to get me started:

  • Create URLs using humanly readable timestamps (yyyy-mm-dd.HH:MM:SS.package.ident) which allows easy parsing in JavaScript (if needed), and ensures that all entries are sorted by time-stamp
  • Augment each document with single new key package (perl keyword on top of each module). It will have sub-keys time decimal time-stamp, name of package, caller sub which called CouchDB::audit and line from which it's called
  • Single _design document for output from one package (which is directory on file-system) just because it easy browsable in Futon.
CouchDB_select_view.png

So, are those rules enough to forget about relational algebra and relax on the couch? Let's take a look at _design/amt/ip,SystemPowerState-count. I applied here almost SQL-ish naming convention - column names, separated by commas then dash - and output column(s).

Map

function(doc) {
  if ( doc.package.name == 'amt'
  && doc.package.caller == 'power_state')
  emit([doc.ip,doc.SystemPowerState],1);
}

Reduce

function (k,v) {
  return sum(v)
}

When run, this map/reduce queries produce result like this:

KeyValue
["172.16.10.200", null]4
["172.16.10.16", "5"]21
["172.16.10.16", "0"]9
["172.16.10.16", null]8
["10.60.0.196", "0"]6
["10.60.0.195", "256"]8
["10.60.0.194", "256"]11
["10.60.0.193", "256"]3

So far, so good. But what if I wanted to average all ping round trip times for each ip?

If you where using SQL, answer would be:

select ip,avg(rtt) from ping group by ip
However, evil rereduce roars it's head here:

Map

function(doc) {
  if ( doc.package.name == 'ping' )
   emit(doc.ip, doc.rtt)
}

Reduce

function (k,values,rereduce) {
  if (rereduce) {
    var total_sum = 0;
    var total_length = 0;
    for (var i = 0; i < values.length; i++) {
      total_sum += values[i][0];
      total_length += values[i][1];
    }
    return [total_sum, total_length];
  } else {
    return [sum(values), values.length];
  }
}

Since we are called incrementally, we can't average averages. We need to collect total sum and number of elements and perform final computation on client:

KeyValue
"193.198.212.4"[779.0038585662847, 9]
"193.198.212.228"[902.6305675506585, 10]
"192.168.1.61"[906.698703765869, 11]
"192.168.1.34"[995.9852695465088, 11]
"192.168.1.3"[316.55669212341303, 6]
"192.168.1.20"[506.162643432617, 8]
"192.168.1.2"[473.91605377197277, 11]
"192.168.1.13"[649.2500305175784, 11]
"172.16.10.10"[49.9579906463623, 1]
"172.16.10.1"[250.78511238098127, 15]
"127.0.0.1"[62.57653236389161, 16]
"10.60.0.94"[81.6218852996826, 2]
"10.60.0.93"[186.49005889892578, 6]
"10.60.0.92"[386.7535591125485, 5]
"10.60.0.91"[1070.863485336304, 9]
"10.60.0.90"[428.4689426422117, 10]

If you manage to wrap your head around this, you are ready to dive into CouchDB.