Results matching “PXElator”

Like every year, we had our local Linux conference. It was a very intense event (this is the first year I'm involved in the actual organization) and I can say it's all just a big blur.

I had two tutorials, one about my Virtual LDAP and another about creating a Google-like (horizontally scalable) cluster out of a library building. In the latter, I covered a whole bunch of tools which I ended up using during the last year:

  • Webconverger is the easiest way to deploy Firefox on kiosks for public Internet access
  • PXElator - full stack solution to network booting and spawning machines
  • Sack - horizontally scalable (across cores or nodes) in-memory perl hash with remote code execution (close to data)
  • MongoDB - which I use for the audit log in PXElator and feed back into Sack, after finding CouchDB too slow
  • Sysadmin Cookbook as a way to document HOWTOs or SOPs
  • bak-git for tracking configuration changes
  • Gearman and Narada didn't get all the attention they deserved, partly because I wasn't able to make Narada work (I tried the perl and php versions in preparation for the tutorial). But I hope that I managed to transfer part of my fascination with the distributed fork approach.

During the conference I wrote a small project to index git log messages using Sphinx, which might help you get started with it.

It's about system, stupid!

lib-architecture-v2.png

When you are working as a system architect or systems librarian, your job is to design systems. My initial idea was to create a small Google out of 12 machines which are dedicated to be web kiosks. I decided to strictly follow the loosely coupled principle, mostly to provide horizontal scaling for my data processing needs. I wanted to be able to add a machine or two if my query is too slow... This easily translates into "how long will I have to wait for my page to generate results"...

I decided to split my system into three logical parts: network booting, data store, and quick reporting. So, let's take a look at each component separately:

  • PXElator
    • supported protocols: bootp, dhcp, tftp, http, amt, wol, syslog
    • boot kiosks using Webconverger (Debian Live based kiosk distribution)
    • provides a web user interface for an audit overview of the network segment
    • configuration is stored as files on disk, suitable for management with git or other source control systems
  • MongoDB
    • NoSQL storage component which supports ad-hoc queries, indexes and other goodies
    • a simple store for perl hashes from PXElator, generated every time we see a network packet from one of the clients using one of the supported protocols
  • Sack
    • the fastest possible way to execute a snippet of perl code over multiple machines
    • this involves pushing data out to nodes, executing code on all of them and collecting results back, all under the 3 second mark!
    • web user interface for cloud overview and graph generation using gnuplot

When I started implementing this system last summer, I decided to use CouchDB for the storage layer. This wasn't really a good choice, since I didn't need transactions, MVCC or replication. Heck, I even implemented forking when storing documents in CouchDB to provide faster responses to clients in PXElator.

Moving to the much faster MongoDB, I got ad-hoc queries which are usable (as in: I can wait for them to finish), and if that's still too slow, I can move the data to Sack and query it directly from memory. As a happy side effect, making shards from MongoDB is much faster than using CouchDB's bulk HTTP API, and it will allow me to feed shards directly from MongoDB to Sack nodes, without first creating shards on disk.

I'm quite happy with how it all turned out. I can configure any host using a small snippet of perl code in PXElator, issue ad-hoc queries on its audit data in MongoDB, or move the data to Sack if I want to do data munging in perl.

As you have noticed by now, I'm using a live distribution for the kiosks, and the machines do have hard drives in them. The idea is to use those disks as storage with something like Sheepdog, which seems like a perfect fit. With it in place, I will have a real distributed, building-sized computer :-).

I have been using CouchDB for some time now, mostly as audit storage for PXElator. Audit data stores are most useful for ad-hoc queries (hmm, when did I last see that host?), and CouchDB map/reduce runs took half an hour or more. I wrote a small script, couchdb2mongodb.pl, to migrate my data over to MongoDB (in 26 minutes) and ran the first query I could write after reading the MongoDB documentation about advanced queries. It took only 30 seconds, compared to 30 minutes or more in CouchDB. I was amazed.

This was a NoSQL database which I could understand and tune. MongoDB has indexes and a profiler, so tuning the query down to three seconds was a simple matter of adding an index. All my RDBMS knowledge was reusable here, so I decided to take a look at why it is so much faster than CouchDB for the same data...
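To give a feeling for what that looks like from perl, here is a minimal sketch using the MongoDB perl driver (the collection and field names are assumptions based on the audit data described here, not PXElator's exact code):

use MongoDB;

# connect to the local mongod and pick the audit collection (names assumed)
my $conn  = MongoDB::Connection->new( host => 'localhost' );
my $audit = $conn->get_database('pxelator')->get_collection('audit');

# ad-hoc query: when did I last see this host?
my $last = $audit->find({ ip => '10.60.0.91' })
                 ->sort({ 'package.time' => -1 })
                 ->limit(1)
                 ->next;
print $last->{_id}, "\n" if $last;

# adding an index on the queried field is the kind of change which cut
# the query from 30 seconds down to 3
$audit->ensure_index({ ip => 1 });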

To be honest, it was MongoDB, High-Performance SQL-Free Database by Dwight Merriman, CEO of 10gen, that finally won me over to trying MongoDB. It was technical enough to make me think about MongoDB's architecture and benefits. It's clearly a pragmatic let's-re-think-horizontally-scalable-hash-storage-with-ad-hoc-queries model, but with a funny twist: close coupling with language types, all encoded in the BSON format, which is very similar to Google's protocol buffers.

First, let's have a look at the raw size of the data on disk. At some level, it translates to the number of IO operations involving rotating platters and usage of the buffer cache.

root@opr:~# du -hc /var/lib/couchdb/0.9.0/.pxelator* /var/lib/couchdb/0.9.0/pxelator.couch
655M    /var/lib/couchdb/0.9.0/.pxelator_design
23M     /var/lib/couchdb/0.9.0/.pxelator_temp
7.8G    /var/lib/couchdb/0.9.0/pxelator.couch
8.4G    total

root@opr:~# du -hc /var/lib/mongodb/pxelator.*
65M     /var/lib/mongodb/pxelator.0
129M    /var/lib/mongodb/pxelator.1
257M    /var/lib/mongodb/pxelator.2
513M    /var/lib/mongodb/pxelator.3
513M    /var/lib/mongodb/pxelator.4
513M    /var/lib/mongodb/pxelator.5
17M     /var/lib/mongodb/pxelator.ns
2.0G    total
Here is the first hint about performance: MongoDB's 2G of data (which is mmap-ed into memory directly, leaving flushes and caching to the OS layer) is almost a perfect fit for the 3G of RAM I have in this machine.

MongoDB has a mongodump utility which dumps BSON for backup, and it's even smaller:

root@opr:~# du -hcs dump/pxelator/*
1.1G    dump/pxelator/audit.bson
4.0K    dump/pxelator/system.indexes.bson
76K     dump/pxelator/system.profile.bson
1.1G    total

So I switched PXElator to use MongoDB as storage. I have never before pushed anything into production after just one day of testing, but a first-query speedup from 30 min to 30 sec, and the ability to cut it down to 3 sec by adding an index (which took about 13 sec to create), gives me a powerful analytical tool I didn't have before.

As you know by now, I have been playing with Dell's remote consoles in the hope that I will be able to connect from my Linux machine to Dell's RAC reliably. Currently, I have to run Windows XP with Internet Explorer and Java inside kvm to get access to my servers, and that's clearly not a reliable combination.

DRAC is a PCI card which is presented to the system as a VGA adapter and then transfers screen updates over the network to the client. It also allows virtual media, but in essence it's a mix of http over ssl and a few proprietary protocols:

  • 443 - https interface
  • 3368 - virtual media (proprietary)
  • 5900 - keyboard and mouse (ssl encrypted)
  • 5901 - video redirection (optionally ssl encrypted)
It's very strange that all the documentation calls 5900 the video redirection port and 5901 the keyboard/mouse redirection port, when all traces of traffic between client and server clearly show that the ports are swapped in the implementation.

Did you notice the ssl encrypted keyboard/mouse channel? I decided to tackle this problem with the well-known SSL man-in-the-middle approach, starting with the simplest possible setup, something like:

apt-get install stunnel

openssl req -new -x509 -days 365 -nodes -out cert.pem -keyout cert.pem

# https mitm
stunnel -p cert.pem -d 443 -r 5443
stunnel -c -d 5443 -r 10.60.0.100:443

# 5900 mitm
stunnel -p cert.pem -d 5900 -r 5999
stunnel -c -d 5999 -r 10.60.0.100:5900
and then recording all traffic using wireshark:
sudo tshark -w /tmp/drac.pcap -i any 'port 5999 or port 5901 or port 5443'
This allowed me to capture all unencrypted traffic into a single pcap file, which proved very useful for initial protocol analysis using wireshark. In short, you have to do the following:
  1. make an https connection to https://drac/cgi-bin/webcgi/winvkvm?state=1 and acquire the vKvmSessionId console redirection authentication key
  2. connect to the keyboard/mouse port 5900, forcing SSL_cipher_list to the supported RC4-MD5 cipher, and send the vKvmSessionId
  3. connect to video port 5901
Finding a supported cipher for communication between us and the server was a real problem. They are using openssl-0.9.7f, and I had to downgrade all the way to Debian woody to make stunnel work. The same problem is visible with the latest firmware update for DRAC, where the ActiveX plugin no longer offers old enough ciphers in the SSL handshake and doesn't work any more. The Java plugin, on the other hand, provides many more cipher options, so one of them still works. ssldump was very useful for finding such problems.

Fortunately, kost was much more persistent than me, and he found out that adding 'SSL_cipher_list' => 'RC4-MD5' will force a supported cipher. Armed with that finding, I was able to modify kost's ssl mitm script up to the point where I can see decrypted key-presses, mouse movements and video settings. Heck, I even wrote the drac-vkvm.pl async client which does the steps outlined above.
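The interesting bits of that client boil down to something like the following rough sketch (this is not the actual drac-vkvm.pl; the regexp for extracting vKvmSessionId and the exact payload format are assumptions):

use LWP::UserAgent;
use IO::Socket::SSL;

my $drac = '10.60.0.100';

# step 1: fetch vKvmSessionId over https (login and cookie handling omitted)
my $ua  = LWP::UserAgent->new;
my $res = $ua->get("https://$drac/cgi-bin/webcgi/winvkvm?state=1");
my ($session_id) = $res->content =~ m/vKvmSessionId\W+(\w+)/;   # assumed format

# step 2: keyboard/mouse channel on 5900, forcing the one cipher the card supports
my $kvm = IO::Socket::SSL->new(
    PeerAddr        => $drac,
    PeerPort        => 5900,
    SSL_cipher_list => 'RC4-MD5',
) or die IO::Socket::SSL::errstr();
print $kvm $session_id;   # plus the mysterious two session-specific bytes

# step 3 would be the video channel on port 5901, which needs the same two bytes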

All is not well, unfortunately. When sending the authentication request, we need the vKvmSessionId which we get from the web server, but the packet which is sent also contains two bytes which change with each session. I haven't been able to figure this part out, and since the same two-byte sequence is needed to open the video channel (to see the VGA output), I'm stuck.
The bytes don't look like crc16, and the source code doesn't provide any hints about this secondary 16-bit auth info. It seems that the client calculates it somehow, since both connections close when I try to send different values for it.

I could write a session recorder, but that isn't terribly useful, because it still forces me to use the Windows+Java setup to access my console. I will collect useful snippets about Dell's RAC protocol on the wiki.

As you might know by now, I was really struck by the simplicity of CouchDB at last year's OSCON. Since then, we got couchapp, which is a great idea: hosting CouchDB views on the file-system for easy maintenance.

So good, in fact, that I re-wrote couchapp in perl as design-couch.pl. I needed to deviate a bit from the original design (one _design document per application) because the PXElator data which I store in CouchDB is... data...

I was introduced to relational databases back at university, and since then I have been using PostgreSQL for all my database needs. But this time I have dumps from commands, SOAP interfaces, syslog messages and arbitrary audit events generated all over the code. I didn't want to think about structure up-front, and the View Cookbook for SQL Jockeys convinced me I don't have to, but I decided to make a few simple rules to get me started:

  • Create URLs using human-readable timestamps (yyyy-mm-dd.HH:MM:SS.package.ident), which allows easy parsing in JavaScript (if needed) and ensures that all entries are sorted by time-stamp
  • Augment each document with a single new key, package (the perl keyword at the top of each module). It has sub-keys time (decimal time-stamp), name (of the package), caller (the sub which called CouchDB::audit) and line (from which it was called)
  • A single _design document for the output from one package (which is a directory on the file-system), just because that is easily browsable in Futon. A sample document following these rules is sketched below.
CouchDB_select_view.png
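To make the rules concrete, an audit document created by CouchDB::audit ends up looking roughly like this when written as a perl hash (the values here are made up for illustration):

my $audit_doc = {
    _id     => '2009-10-18.19:04:16.amt.power_state',  # yyyy-mm-dd.HH:MM:SS.package.ident
    package => {
        time   => 1255885456.153,   # decimal time-stamp
        name   => 'amt',            # perl package which generated the event
        caller => 'power_state',    # sub which called CouchDB::audit
        line   => 42,               # line from which it was called
    },
    # ... plus whatever the caller passed in, for example:
    ip               => '172.16.10.16',
    SystemPowerState => '5',
};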

So, are those rules enough to forget about relational algebra and relax on the couch? Let's take a look at _design/amt/ip,SystemPowerState-count. I applied an almost SQL-ish naming convention here: key column names separated by commas, then a dash, then the output column(s).

Map

function(doc) {
  if ( doc.package.name == 'amt'
  && doc.package.caller == 'power_state')
  emit([doc.ip,doc.SystemPowerState],1);
}

Reduce

function (k,v) {
  return sum(v)
}

When run, this map/reduce query produces a result like this:

Key                        Value
["172.16.10.200", null]    4
["172.16.10.16", "5"]      21
["172.16.10.16", "0"]      9
["172.16.10.16", null]     8
["10.60.0.196", "0"]       6
["10.60.0.195", "256"]     8
["10.60.0.194", "256"]     11
["10.60.0.193", "256"]     3

So far, so good. But what if I wanted to average all ping round trip times for each ip?

If you were using SQL, the answer would be:

select ip,avg(rtt) from ping group by ip
However, the evil rereduce rears its head here:

Map

function(doc) {
  if ( doc.package.name == 'ping' )
   emit(doc.ip, doc.rtt)
}

Reduce

function (k,values,rereduce) {
  if (rereduce) {
    var total_sum = 0;
    var total_length = 0;
    for (var i = 0; i < values.length; i++) {
      total_sum += values[i][0];
      total_length += values[i][1];
    }
    return [total_sum, total_length];
  } else {
    return [sum(values), values.length];
  }
}

Since we are called incrementally, we can't average averages. We need to collect the total sum and the number of elements and perform the final computation on the client (see the snippet after the table below):

Key                  Value
"193.198.212.4"      [779.0038585662847, 9]
"193.198.212.228"    [902.6305675506585, 10]
"192.168.1.61"       [906.698703765869, 11]
"192.168.1.34"       [995.9852695465088, 11]
"192.168.1.3"        [316.55669212341303, 6]
"192.168.1.20"       [506.162643432617, 8]
"192.168.1.2"        [473.91605377197277, 11]
"192.168.1.13"       [649.2500305175784, 11]
"172.16.10.10"       [49.9579906463623, 1]
"172.16.10.1"        [250.78511238098127, 15]
"127.0.0.1"          [62.57653236389161, 16]
"10.60.0.94"         [81.6218852996826, 2]
"10.60.0.93"         [186.49005889892578, 6]
"10.60.0.92"         [386.7535591125485, 5]
"10.60.0.91"         [1070.863485336304, 9]
"10.60.0.90"         [428.4689426422117, 10]

If you manage to wrap your head around this, you are ready to dive into CouchDB.

PXElator introduction

This weekend we were in Split at the Ništa se neće dogoditi event, and I did a presentation about the first three weeks of PXElator development which can be used as a gentle introduction to this project. So, here we go...

Introduction

PXElator is just one piece of a puzzle which aims to replace system administration with nice declarative programs in perl. It's an experiment in replacing my work with reusable perl snippets.

It tries to solve the following problems:

  • support deployment of new physical or virtual machines (ip, hostname, common configuration)

  • maintain documentation about changes on systems, good enough to be used for disaster recovery (or deployment of a similar system)

  • configure systems in small chunks (virtual machines or containers) for better management and resource tracking, using normal system administration tools (but track those changes)

  • provide overview and monitoring of network segment and services on it with alerting and trending

Deployment of new machines

What really is a machine? For PXElator, it's a MAC and IP address and some optional parameters (like hostname). It's stored on the file-system, under conf/server.ip/machine.ip/hostname, and can be tracked using source control if needed.
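A machine entry is therefore just a couple of small files; creating one by hand could look roughly like this (a sketch only -- the paths follow the layout above, but the addresses and the code itself are illustrative, not PXElator's):

use File::Path qw(make_path);

my ( $server_ip, $machine_ip ) = ( '10.60.0.94', '10.60.0.200' );   # made-up addresses
my $dir = "conf/$server_ip/$machine_ip";
make_path($dir);

open my $fh, '>', "$dir/hostname" or die $!;
print $fh "kiosk-01\n";
close $fh;

# the whole conf/ tree can now go into git for change tracking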

This is also the shared state between all the daemons implementing network protocols:

  • DHCP (with PXE support)

  • TFTP (to deliver initial kernel and initrd using pxelinux)

  • HTTP (to provide alternative way to fetch files and user interface)

  • DNS (we already have data)

  • syslog

  • AMT for remote management

Having all those protocols written in the same language enables incredible flexibility in automatic configuration. I can issue commands from an installation which has only ping, because I can have special DNS names which trigger commands.

But to get real power, we need to aggregate that data. I'm currently using CouchDB from http://couchdb.apache.org/ to store all the audit data from all services in a single database.

I wanted a simple way to write ad-hoc queries without worrying about data structure too much. In the end, I opted for an audit role for the data, and used 1 second granularity for the key when storing it. The result is that 133 syslog messages from the kernel right after boot will create a single document with 133 revisions instead of flooding your database.
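The trick is just in how the document key is built; roughly something like this (a sketch of the idea, not PXElator's actual audit code):

use POSIX qw(strftime);

# the key has 1 second granularity, so every audit event generated within the
# same second for the same package/ident becomes a new revision of the same
# document instead of a new document
sub audit_key {
    my ( $package, $ident ) = @_;
    return strftime( '%Y-%m-%d.%H:%M:%S', localtime ) . ".$package.$ident";
}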

It would be logical to plug RRDtool http://oss.oetiker.ch/rrdtool/ in somewhere here to provide nice graphs, but that is still on the TODO list.

End user scenarios:

  • Take a new machine, plug it into the network, boot it from the network and configure it for kiosk-style deployment with Webconverger, available at http://webconverger.com/. The kiosk should automatically turn on every morning at 7:30 and turn off at 20:30.

  • Boot a virtual machine (with a new ip and hostname) from a backup snapshot for easy recovery or testing

  • Boot a machine from the network into a fully configurable (writable) system for quick recovery or as a dedicated machine. This is implemented using an NFS server with an aufs read-write overlay on top of a debootstrap base machine.

Disaster recovery documentation for me, two years later

I have been trying to write useful documentation snippets for years. My best effort so far is the Sysadmin Cookbook at https://sysadmin-cookbook.rot13.org/, a set of semi-structured shell scripts which can be executed directly on machines.

This part isn't yet integrated into PXElator, but most of the recipes will become some kind of rules which you can enforce on a managed machine.

End user scenario:

  • Install that something on this other machine as well

Configure system like you normally would but track changes

This is basically the requirement to track configuration changes. Currently, this feature falls out of the writable snapshot over the read-only base system. The overlay data is all the custom configuration that I did!

Tracking changes on existing machines will be implemented using scp to copy files onto the server into a hostname/path/to/local/file directory structure. This structure will be tracked using source control (probably git, as opposed to the subversion which the PXElator source uses), and a cron job will pull those files at some interval (daily, hourly) to create an rsync+git equivalent of BackupPC http://backuppc.sourceforge.net for this setup.
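A first cut of that cron job could be as simple as the following sketch (the host names and file list are placeholders; this is just the idea, not finished PXElator code):

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

my @hosts = qw( kiosk-01 kiosk-02 );                    # placeholder host names
my @files = qw( /etc/network/interfaces /etc/fstab );   # files worth tracking

foreach my $host (@hosts) {
    foreach my $file (@files) {
        my $local = "$host$file";    # e.g. kiosk-01/etc/network/interfaces
        make_path( dirname($local) );
        system 'scp', '-q', "root\@$host:$file", $local;
    }
}

# snapshot whatever changed since the last pull
system 'git', 'add', '-A';
system 'git', 'commit', '-q', '-m', 'configuration snapshot ' . scalar localtime;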

It's interesting to take a look at how it's different from Puppet and similar to cfengine3:

  • All data is kept in normal configuration files on the system -- you don't need to learn new administration tools or somehow maintain two sources of configuration (in the configuration management tool and on the system)

  • It introspects the live system and just tries to apply corrections if needed, which is similar to the cfengine3 approach.

End user scenario:

  • Turn useful how-to into workable configuration without much effort

Provide overview and monitoring

This falls out of the HTTP interface and from collecting data into CouchDB. For now, PXElator tries to manage the development environment for you, opening xterms (with screen inside, for logging and easy scrollback) in different colors, and enables you to start Wireshark on active network interfaces for debugging.

Let's assume that you want to create a virtual network which spans sites (or continents :-). While we are at it, let's assume that you want to have layer 2 connectivity (because you want to run just a single DHCP server, for example).

At first, it seemed logical to use Virtual Distributed Ethernet, for which kvm has support. However, this involves running multiple processes to support nodes on the network, and it's really virtual -- you can't use familiar Linux tools (like brctl or arp) to configure it. And it's connected over ssh anyway, so why add unnecessary complexity to the setup?

Since we will use ssh to transfer traffic anyway (it's the easiest hole to drill through firewalls, and you probably already have it for administration), why do we need another layer of software in between, with new commands to learn, if we already know how to do it using plain old Linux brctl?

So, let's take another look at ssh, especially the option Tunnel=ethernet which provides Ethernet bridging between two tap devices. As I wrote before, ssh can do point-to-point links using a tun device, which is a great solution if you want to connect two networks at the IP level using routing. Tap devices, on the other hand, provide access to the Ethernet layer from user-space (so ssh, kvm, VDE and various other user-land programs can send and receive Ethernet packets). However, information on how to set up ssh to use tap devices is nowhere to be found on the internet, which motivated this blog post.

Let's assume that we have two machines in the following configuration:

  • t61p - a laptop at home behind a DSL link and NAT, which wants to run a kvm virtual machine in the virtual network 172.16.10.0/24
  • t42 - a desktop machine at work which has a network bridge called wire on the 172.16.10.0/24 network and provides network booting services
So, we need ethernet tunneling to the remote client.
# install tunctl
dpavlin@t61p:/virtual/kvm$ sudo apt-get install uml-utilities

dpavlin@t61p:/virtual/kvm$ sudo tunctl -u dpavlin -t kvm0
Set 'kvm0' persistent and owned by uid 1000

dpavlin@t61p:/virtual/kvm$ kvm -net nic,macaddr=52:54:00:00:0a:3d -net tap,ifname=kvm0,script=no -boot n
This doesn't really boot our kvm from the network yet, because we didn't connect things together. Now we need to enable tunnels on t42 and set up the remote tap device:
dpavlin@t42:~$ grep -v PermitTunnel /etc/ssh/sshd_config > /tmp/conf
dpavlin@t42:~$ ( grep -v PermitTunnel /etc/ssh/sshd_config ; echo PermitTunnel yes ) > /tmp/conf
dpavlin@t42:~$ diff -urw /etc/ssh/sshd_config /tmp/conf
--- /etc/ssh/sshd_config        2009-04-20 12:50:27.000000000 +0200
+++ /tmp/conf   2009-08-14 20:42:40.000000000 +0200
@@ -75,3 +75,4 @@
 Subsystem sftp /usr/lib/openssh/sftp-server
 
 UsePAM yes
+PermitTunnel yes

# install and restart ssh
dpavlin@t42:~$ sudo mv /tmp/conf /etc/ssh/sshd_config
dpavlin@t42:~$ sudo /etc/init.d/ssh restart
Restarting OpenBSD Secure Shell server: sshd.
Now we can connect the two machines using an ssh ethernet tunnel:
dpavlin@t61p:/virtual/kvm$ sudo ssh -w 1:1 -o Tunnel=ethernet root@10.60.0.94

t42:~# ifconfig tap1
tap1      Link encap:Ethernet  HWaddr fa:35:cb:9e:87:60  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

t42:~# ip link set tap1 up
t42:~# brctl addif wire tap1
t42:~# brctl show wire
bridge name     bridge id               STP enabled     interfaces
pan0            8000.000000000000       no
wire            8000.006097472681       no              eth2
                                                        eth3
                                                        tap0
                                                        tap1
                                                        tap94

t42:~# dmesg | grep tap1
[284844.064953] wire: port 5(tap1) entering learning state

t42:~# tshark -i wire
This created tap1 devices on both machines, added the one on t42 to the bridge, and left us with a dump from tshark on the wire bridge.

Now we need to set up a virtual bridge on t61p to connect the ssh tunnel and the kvm tap device.

dpavlin@t61p:/virtual/kvm$ sudo ifconfig tap1
tap1      Link encap:Ethernet  HWaddr 52:c5:f8:64:30:d4  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

dpavlin@t61p:/virtual/kvm$ sudo brctl addbr virtual
dpavlin@t61p:/virtual/kvm$ sudo brctl addif virtual kvm0
dpavlin@t61p:/virtual/kvm$ sudo brctl addif virtual tap1

dpavlin@t61p:/virtual/kvm$ sudo brctl show
bridge name     bridge id               STP enabled     interfaces
pan0            8000.000000000000       no
virtual         8000.4e1537af6cdc       no              kvm0
                                                        tap1

dpavlin@t61p:/virtual/kvm$ sudo ip link set kvm0 up
dpavlin@t61p:/virtual/kvm$ sudo ip link set tap1 up
dpavlin@t61p:/virtual/kvm$ sudo ip link set virtual up

dpavlin@t61p:/virtual/kvm$ dmesg | grep virtual
[31141.669760] virtual: port 1(kvm0) entering learning state
[31152.288025] virtual: no IPv6 routers present
[31156.668088] virtual: port 1(kvm0) entering forwarding state
[31211.699928] virtual: port 2(tap1) entering learning state
[31226.696070] virtual: port 2(tap1) entering forwarding state
dpavlin@t61p:/virtual/kvm$ kvm -net nic,macaddr=52:54:00:00:0a:3d -net tap,ifname=kvm0,script=no -boot n
This will boot our kvm over the ethernet bridge from the remote server, using nothing more than brctl and ssh!

If you want an even more lightweight solution to the same problem, you might look into EtherPuppet.

On a related note, if your kvm Windows XP machines stopped working after the upgrade to Debian kernel 2.6.30-1-686, just upgrade to 2.6.30-1-686-bigmem (even if you don't have more memory) and everything will be OK.

First of all, happy sysadmin day 2009-07-31! So, it seems logical that I'm announcing my project PXElator, which aims to replace me with a perl script. It's basically my take on the cloud hype. Currently it supports bringing up machines (virtual or physical) from boot onwards. It implements bootp, dhcp, tftp and http servers to enable single-action boot of a new machine.

It all started when I watched Practical Puppet: Systems Building Systems and decided that the real power is in expressing system administration as code. I also liked the DSL approach which Puppet took in ruby and tried to apply the same principle (a declarative DSL) in perl. If you take a look at the source code, it seems to work quite well.

In the spirit of release early, release often, this code will be in flux until the end of this summer, when I plan to deploy it to create a web kiosk environment for browsing our library catalog.