Results tagged “PXElator”

It's about system, stupid!

lib-architecture-v2.png When you are working as system architect or systems librarian, your job is to design systems. My initial idea was to create small Google out of 12 machines which are dedicated to be web kiosks. I decided to strictly follow loosely coupled principle, mostly to provide horizontal scaling for my data processing needs. I wanted to be able to add machine or two if my query is too slow... This easily translates into "now long will I have to wait for my page to generate results"....

I decided to split my system into three logical parts: network booting, data store, and quick reporting. So, let's take a look at each component separately:

  • PXElator
    • supported protocols: bootp, dhcp, tftp, http, amt, wol, syslog
    • boot kiosks using Webconverger (Debian Live based kiosk distribution)
    • provides web user interface for overview of network segment for audit
    • configuration is stored as files on disk, suitable for management with git or other source control management
  • MongoDB
    • NoSQL storage component which support ad-hoc queries, indexes and other goodies
    • simple store for perl hashes from PXElator generated every time we see network packet from one of clients using one of supported protocols
  • Sack
    • fastest possible way to execute snippet of perl code over multiple machines
    • this involves sharing information to nodes, executing code on all of them and collecting results back, all in sub 3 second mark!
    • web user interface for cloud overview and graph generation using gnuplot

When I started implementing this system last summer, I decided to use CouchDB for storage layer. This wasn't really good choice, since I didn't need transactions, MVCC or replication. Hack, I even implemented forking for document stored in CouchDB to provide faster response to clients in PXElator.

Moving to much faster MongoDB I got ad-hoc queries which are usable (as in I can wait for them to finish) and if that's too slow, I can move data to Sack and query it directly from memory. As a happy side effect, making shards from MongoDB is much faster than using CouchDB bulk HTTP API, and it will allow me to feed shards directly from MongoDB to Sack nodes, without first creating shards on disk.

I'm quite happy how it all turned out. I can configure any host using small snippet of perl code in PXElator, issue ad-hoc queries on audit data on it in MongoDB or move data to Sack if I want to do data munging using perl.

As you noticed by now, I'm using live distribution for kiosks, and machines do have hard drivers in them. Idea was to use those disks as storage with something like Sheepdog. seems like perfect fit. With it in place, I will have real distributed, building size computer :-).

I have been using CouchDB for some time now, mostly as audit storage for PXElator. Audit data stores are most useful for ad-hoc queries (hum, when did I saw that host last time?), and CouchDB map/reduces took half an hour or more. I wrote mall script to migrate my data over to MongoDB (in 26 minutes) and run first query I could write after reading MongoDB documentation about advanced queries. It took only 30 seconds, compared to 30 minutes or more in CouchDB. I was amazed.

This was NoSQL database which I can understand and tune. MongoDB has indexes and profiler so tuning query down to three seconds was a simple matter of adding an index. All my RDBMS knowledge was reusable here, so I decided to take a look why is it so much faster than CouchDB for same data...

To be honest, MongoDB, High-Performance SQL-Free Database by Dwight Merriman, CEO of 10gen won me over to finally try MongoDB. It was technical enough to make me think about MongoDB arhitecture and benefits. It's clearly pragmatic, let's re-think horizontally scalable hash storage with ad-hoc queries model, but with funny twist about close coupling with language types all encoded in BSON format, which is very similar to Google's protocol buffers.

First, let's have a look at raw side of data on disk. At some level, it will translate to number of IO operations involving rotating platters and usage of buffer cache.

root@opr:~# du -hc /var/lib/couchdb/0.9.0/.pxelator* /var/lib/couchdb/0.9.0/pxelator.couch
655M    /var/lib/couchdb/0.9.0/.pxelator_design
23M     /var/lib/couchdb/0.9.0/.pxelator_temp
7.8G    /var/lib/couchdb/0.9.0/pxelator.couch
8.4G    total

root@opr:~# du -hc /var/lib/mongodb/pxelator.*
65M     /var/lib/mongodb/pxelator.0
129M    /var/lib/mongodb/pxelator.1
257M    /var/lib/mongodb/pxelator.2
513M    /var/lib/mongodb/pxelator.3
513M    /var/lib/mongodb/pxelator.4
513M    /var/lib/mongodb/pxelator.5
17M     /var/lib/mongodb/pxelator.ns
2.0G    total
Here is a first hint about performance: MongoDB's 2G of data (which are used as mmap memory directly, leaving flushes and caching to OS layer) are almost a perfect fit into 3G of RAM memory I have in this machine.

MongoDB has montodump utility which dumps bson for backup and it's even smaller:

root@opr:~# du -hcs dump/pxelator/*
1.1G    dump/pxelator/audit.bson
4.0K    dump/pxelator/system.indexes.bson
76K     dump/pxelator/system.profile.bson
1.1G    total

So I switched PXElator to use MongoDB as storage. I never pushed anything in production after just one day of testing it, but first query speedup from 30 min to 30 sec, and ability to cut it down to 3 sec if I added index (which took about 13 sec to create) is just something which provides me with powerful analytical tool I didn't have before.

As you might know by now, I have been really struck by simplicity of CouchDB at last year's OSCON. From then, we got couchapp which is great idea of hosting CouchDB views on file-system for easy maintenance.

So good in fact, that I re-wrote couchapp in perl called I needed to deviate a bit from original design (one _design document per application) because PXElator data which I store in CouchDB is... data...

I have been introduced to Relational databases back at university and since then I have been using PostgreSQL for all my database needs. But, this time, I have dumps from commands, SOAP interfaces, syslog messages and arbitrary audit events generated all over the code. I didn't want to think about structure up-front, and View Cookbook for SQL Jockeys convinced me I don't have to, but I decided to make few simple rules to get me started:

  • Create URLs using humanly readable timestamps (yyyy-mm-dd.HH:MM:SS.package.ident) which allows easy parsing in JavaScript (if needed), and ensures that all entries are sorted by time-stamp
  • Augment each document with single new key package (perl keyword on top of each module). It will have sub-keys time decimal time-stamp, name of package, caller sub which called CouchDB::audit and line from which it's called
  • Single _design document for output from one package (which is directory on file-system) just because it easy browsable in Futon.

So, are those rules enough to forget about relational algebra and relax on the couch? Let's take a look at _design/amt/ip,SystemPowerState-count. I applied here almost SQL-ish naming convention - column names, separated by commas then dash - and output column(s).


function(doc) {
  if ( == 'amt'
  && doc.package.caller == 'power_state')


function (k,v) {
  return sum(v)

When run, this map/reduce queries produce result like this:

["", null]4
["", "5"]21
["", "0"]9
["", null]8
["", "0"]6
["", "256"]8
["", "256"]11
["", "256"]3

So far, so good. But what if I wanted to average all ping round trip times for each ip?

If you where using SQL, answer would be:

select ip,avg(rtt) from ping group by ip
However, evil rereduce roars it's head here:


function(doc) {
  if ( == 'ping' )
   emit(doc.ip, doc.rtt)


function (k,values,rereduce) {
  if (rereduce) {
    var total_sum = 0;
    var total_length = 0;
    for (var i = 0; i < values.length; i++) {
      total_sum += values[i][0];
      total_length += values[i][1];
    return [total_sum, total_length];
  } else {
    return [sum(values), values.length];

Since we are called incrementally, we can't average averages. We need to collect total sum and number of elements and perform final computation on client:

""[779.0038585662847, 9]
""[902.6305675506585, 10]
""[906.698703765869, 11]
""[995.9852695465088, 11]
""[316.55669212341303, 6]
""[506.162643432617, 8]
""[473.91605377197277, 11]
""[649.2500305175784, 11]
""[49.9579906463623, 1]
""[250.78511238098127, 15]
""[62.57653236389161, 16]
""[81.6218852996826, 2]
""[186.49005889892578, 6]
""[386.7535591125485, 5]
""[1070.863485336304, 9]
""[428.4689426422117, 10]

If you manage to wrap your head around this, you are ready to dive into CouchDB.

PXElator introduction

This weekend we where in Split on Ništa se neće dogoditi event and I did presetation about first three weeks of PXElator development which can be used as gentle introduction into this project. So, here we go...


PXElator is just a peace of puzzle which aims to replace system administration with nice declarative programs in perl. It's a experiment in replacing my work with reusable perl snippets.

It tries to solve following problems:

  • support deployment of new physical or virtual machines (ip, hostname, common configuration)

  • maintain documentation about changes on systems, good enough to be used for disaster recovery (or deployment of similar system)

  • configure systems in small chunks (virtual or containers) for better management and resource tracking using normal system administration tools (but track those changes)

  • provide overview and monitoring of network segment and services on it with alerting and trending

Deployment of new machines

What is really machine? For PXElator, it's MAC and IP address and some optional parameters (like hostname). It's stored on file-system, under conf/server.ip/machine.ip/hostname and can be tracked using source control if needed.

This is also shared state between all daemons implementing network protocols:

  • DHCP (with PXE support)

  • TFTP (to deliver initial kernel and initrd using pxelinux)

  • HTTP (to provide alternative way to fetch files and user interface)

  • DNS (we already have data)

  • syslog

  • AMT for remote management

Having all that protocols written in same language enables incredible flexibility in automatic configuration. I can issue command using installation which has only ping because I can have special DNS names which issue commands.

But, to get real power, we need to aggregate that data. I'm currently using CouchDB from to store all audit data from all services into single database.

I wanted simple way to write ad-hoc queries without warring about data structure too much. At the end, I opted for audit role of data, and used 1 second granularity as key when storing data. Result of it is that 133 syslog messages from kernel right after boot you will create single document with 133 revisions instead of flooding your database.

It would be logical to plug RRDtool somewhere here to provide nice graphs here, but that is still on TODO list.

End user scenarios:

  • Take a new machine, plug it into network, boot it from network and configure for kiosk style deployment with Webconverger available at Kiosk should automatically turn on every morning at 7:30 and turn off at 20:30.

  • Boot virtual machine (with new ip and hostname) from backup snapshot for easy recovery or testing

  • Boot machine from network into fully configurable (writable) system for quick recovery or dedicated machine. This is implemented using NFS server with aufs read-write overlay on top of debootstrap base machine.

Disaster recovery documentation for me, two years later

I have been trying to write useful documentation snippets for years. My best effort so far is Sysadmin Cookbook at a set of semi-structured shell scripts which can be executed directly on machines.

This part isn't yet integrated into PXElator, but most of the recipe will become some kind of rule which you can enforce on some managed machine.

End user scenario:

  • Install that something also on this other machine

Configure system like you normally would but track changes

This is basically requirement to track configuration changes. Currently, this feature falls out of writable snapshot over base system which is read-only. Overlay data is all custom configuration that I did!

Tracking changes on existing machines will be implemented scp to copy file on server into hostname/path/to/local/file directory structure. This structure will be tracked using source control (probably git as opposed to subversion which PXElator source uses) and cron job will pull those files at some interval (daily, hourly) to create rsync+git equivalent of BackupPC for this setup.

It's interesting to take a look how it's different from Puppet and similar to cfengine3:

  • All data is kept in normal configuration files on system -- you don't need to learn new administration tools or somehow maintain two sources of configuration (in configuration management and on the system)

  • Introspect live system and just tries to apply corrections if needed which is similar to cfengine3 approach.

End user scenario:

  • Turn useful how-to into workable configuration without much effort

Provide overview and monitoring

This falls out from HTTP interface and from collecting of data into CouchDB. For now, PXElator tries to manage development environment for you, opening xterms (with screen inside for logging and easy scrollback) in different colors, and enable you to start Wireshark on active network interfaces for debugging.

First, of all, happy sysadmin day 2009-07-31! So, it seems logical that I'm announcing by project PXElator which aims to replace me with a perl script. It's basically my take on cloud hype. Currently it supports bringing up machines (virtual or physical) from boot onwards. It implements bootp, dhcp, tftp and http server to enable single action boot of new machine.

It all started when I watched Practical Puppet: Systems Building Systems and decided that real power is in expressing system administration as code. I also liked DSL approach which Puppet took in ruby and tried to apply same principle (declarative DSL) in perl. If you take a look at source code it seems work work quite well.

In the spirit of release early, release often this code will be in flux until end of this summer, when I plan to deploy it to create web kiosk environment for browsing our library catalog based on it.