I have been using ZFS on Linux for some time to provide a backup appliance using zfs-fuse. Since then, we got a native ZFS implementation on Linux, so I decided to move my backup pool from zfs-fuse to in-kernel ZFS.

An additional reason to move the pool over to a new machine was to change the pool's RAID level. In current ZFS implementation(s) you can't change a mirror to RAIDZ1 without re-creating the pool and then transferring data over using zfs send and zfs receive. However, when you have been creating snapshots for years and expiring them using a script, you will have hundreds of snapshots which you need to transfer.

This is where the zfs-pool-replicate.pl script comes in handy. It uses Net::OpenSSH to connect to two machines (source and destination), list all snapshots on the source and transfer them to the destination. If you have a filesystem without snapshots, it will create one @send snapshot which will be transferred. It can also optionally use compression for the transfer of snapshots over the network. I am using LZO, which is a fast compression that nicely transfers 150Mb/s or more over a normal 1Gbit/s network without much CPU overhead (and we all have multi-core machines anyway, right?).
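
To give a rough idea of the approach, here is a minimal sketch written for this post (not the actual zfs-pool-replicate.pl, and with incremental resume logic and error handling trimmed down): run from the management machine, list snapshots on the source and pipe zfs send through lzop into zfs receive on the destination.

#!/usr/bin/perl
# sketch: replicate all snapshots of a source filesystem to a destination
# pool from a third (management) machine, piping zfs send through lzop
use strict;
use warnings;
use Net::OpenSSH;

my ( $src_host, $src_fs, $dst_host, $dst_fs ) = @ARGV;

my $src = Net::OpenSSH->new($src_host);
my $dst = Net::OpenSSH->new($dst_host);

# snapshots on the source, oldest first
my @snapshots =
    $src->capture("zfs list -H -t snapshot -o name -s creation -r $src_fs");
chomp @snapshots;

my $previous;
foreach my $snapshot (@snapshots) {
    my $send = $previous
        ? "zfs send -i $previous $snapshot"   # incremental after the first one
        : "zfs send $snapshot";

    # read compressed stream from the source, write it to the destination
    my ( $out, $out_pid ) = $src->pipe_out("$send | lzop -c");
    my ( $in,  $in_pid  ) = $dst->pipe_in("lzop -dc | zfs receive -F $dst_fs");

    while ( read( $out, my $buf, 1024 * 1024 ) ) {
        print {$in} $buf;
    }
    close $out;
    close $in or die "zfs receive failed for $snapshot";

    $previous = $snapshot;
}

Piping the stream through the management box is slower than a direct source-to-destination ssh, but it keeps key management simple.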

The current implementation is designed to run from a third (management) machine, so I can envision a central storage administration tool which would also allow you to transfer LVM snapshots into ZFS snapshots. For now, I'm using a shell script for that, but rewriting it in perl would improve error recovery and reporting.

By default, the MySQL installation on Debian comes without the innodb_file_per_table option, which spreads tables into individual InnoDB files. Depending on your usage patterns or backup strategies, this might be a better filesystem organization than one big /var/lib/mysql/ibdata1 file. I first heard about it in OurSQL Episode 36: It's Not Our (De)fault!. It's a great podcast, but to be honest, with each new episode I wish I had only PostgreSQL servers to maintain...

To enable this option you will need to create a configuration file and restart the MySQL server:

koha:/etc/mysql/conf.d# cat > file-per-table.cnf 
[mysqld]
innodb_file_per_table
CTRL+D
koha:/etc/mysql/conf.d# /etc/init.d/mysql restart

This won't change anything by itself, because only new tables will be created in separate files. But we can run ALTER TABLE table ENGINE=InnoDB on each table to force InnoDB to rebuild the tables and create separate files:

mysqlshow koha --status | grep InnoDB | cut -d'|' -f2 | sed -e 's/^/alter table /' -e 's/$/ engine=InnoDB;/' | mysql -v koha

If you replace grep InnoDB with grep MyISAM you can use the same snippet to convert MyISAM tables into InnoDB (if you still have any and don't use fulltext search).
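
If you prefer to drive the same conversion from perl instead of a shell pipeline, a sketch along these lines should work (it assumes DBD::mysql and credentials in ~/.my.cnf; the koha database name is just the example from above):

#!/usr/bin/perl
# sketch: rebuild every InnoDB table in a database so each gets its own .ibd file
use strict;
use warnings;
use DBI;

my $db  = shift @ARGV || 'koha';
my $dbh = DBI->connect(
    "dbi:mysql:database=$db;mysql_read_default_file=$ENV{HOME}/.my.cnf",
    undef, undef, { RaiseError => 1 }
);

my $tables = $dbh->selectcol_arrayref(
    'SELECT table_name FROM information_schema.tables
     WHERE table_schema = ? AND engine = ?',
    undef, $db, 'InnoDB'
);

foreach my $table (@$tables) {
    warn "rebuilding $table\n";
    $dbh->do("ALTER TABLE `$table` ENGINE=InnoDB");
}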

I think that system administration is like gardening. I don't know anything about gardening, but it seems to involve a lot of care here and there, seemingly without much pattern. In that sense, it's similar to wiki editing: you start somewhere and you really don't know where it will lead you.

You have to start reading this by singing Lady Gaga with the words: S, s, s, ss... SAML, SAML2! It will help, really.

SAML 2 is the latest in a long line of different SSO implementations which you will have to do sooner or later if you want to be part of the larger web. Google and others seem to be using it, so it must be good, right?

It has two end-points: the identity provider (IdP), which has the user accounts, and the Service Provider (SP), which is usually your application. But of course, it's more complicated than that. For a start, you will need https on your host. I will assume that you already have a domain; you can get free SSL certificates at StartSSL, so hop over there if you need one.

First, install SimpleSAMLphp. It's the simplest possible way to get a working SAML2 implementation of both IdP and SP. You will want to follow simpleSAMLphp Installation and Configuration first and then SimpleSAMLphp Identity Provider QuickStart to configure a simple IdP with static accounts so you can test your application against it. You will need both the IdP and SP under your control to do development. It will also help if your remote IdP (the identity provider which you intend to use) is also simpleSAMLphp (as AAI@EduHr is).

Installation is rather easy:

dpavlin@lib:/srv$ sudo apt-get install memcached php5-memcache

dpavlin@lib:/srv$ wget http://simplesamlphp.googlecode.com/files/simplesamlphp-1.8.0.tar.gz

dpavlin@lib:/srv$ tar xf simplesamlphp-1.8.0.tar.gz
dpavlin@lib:/srv$ cd simplesamlphp-1.8.0/
dpavlin@lib:/srv/simplesamlphp-1.8.0$ cp config-templates/* config/
dpavlin@lib:/srv/simplesamlphp-1.8.0$ vi config/config.php
You will want to edit the following options:
  • auth.adminpassword
  • secretsalt
  • enable.authmemcookie
dpavlin@lib:/srv/simplesamlphp-1.8.0$ php5 -l config/config.php 
No syntax errors detected in config/config.php

The interesting part here is the authmemcookie option. It allows us to use the SP side of simpleSAMLphp, store the resulting authentication in memcache, and send the browser a cookie which we can later use to fetch data about the current user from memcache.

To configure the Apache side, you need Auth MemCookie, but it isn't available as a Debian package, so I opted for Apache::Auth::AuthMemCookie, which also lets me flexibly modify the IdP response before passing it on as environment variables.

dpavlin@lib:~$ cat /etc/apache2/conf.d/authmemcookie.conf 
Alias /simplesaml /srv/simplesamlphp-1.8.0/www
PerlModule Apache::Auth::AuthMemCookie
<Location /cgi-bin>
        # get redirected here when not authorised
        ErrorDocument 401 "/simplesaml/authmemcookie.php"
        PerlAuthenHandler Apache::Auth::AuthMemCookie::authen_handler
        PerlSetVar AuthMemCookie "AuthMemCookie"
        PerlSetVar AuthMemServers "127.0.0.1:11211"
        PerlSetVar AuthMemDebug 1
        PerlSetVar AuthMemAttrsInHeaders 0
        AuthType Cookie
        AuthName "Koha SAML"
        Require valid-user
</Location>
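
Once the handler validates the cookie, scripts under /cgi-bin see the authenticated user the usual Apache way (REMOTE_USER), plus whatever attributes are passed on as environment variables. A trivial test script might look like the sketch below; the attribute names are illustrative, since what you actually get depends on your IdP and on how you map attributes in the handler:

#!/usr/bin/perl
# dump authentication data passed down by the Apache authentication layer
use strict;
use warnings;
use CGI;

my $q = CGI->new;
print $q->header('text/plain');

print "logged in as: ", $ENV{REMOTE_USER} // '(none)', "\n\n";

# show environment variables which look like SAML attributes
foreach my $name ( sort grep { /mail|uid|cn|eduperson/i } keys %ENV ) {
    print "$name = $ENV{$name}\n";
}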

To test it, the easiest method is to create an account at Feide OpenIdP and test against it. After all, it's easiest to start with the same implementation of SAML2 on both sides, just to avoid a scenario like the following:

On the perl side I first tried Net::SAML2 and found out that it can't handle an IdP without HTTP-Artifact support, so I had to add that to the IdP. However, even after that I didn't manage to make it work with the simpleSAMLphp IdP implementation, mostly because of my impatience with its SSL configuration.

On the bright side, for the first test I didn't need to modify Koha (the library management software which I'm configuring SAML2 for) at all, because it already has support for HTTP authorization.

5735665422_286f95a5a4_b.jpg

As some of you already know, almost a month ago we placed an order for our first Kindle (3G with wifi). A week after it arrived, we also ordered the bigger DXG to complement our reading habits. So why do we have two Kindles, and do we have any regrets? Well, no! This post will try to summarize what I have learned since...

For a start, both devices have an ARMv6 CPU, but the K3G has 256 Mb of RAM while the DXG has only 128 Mb. Both have 4Gb of flash, of which about 3.5Gb is available for user content. Both have working 3G connectivity in Croatia. wpa_supplicant on the Kindle won't allow connections to "enterprise" networks (like EduRoam). The smaller K3G is just too small for reading pdf files. The bigger DXG has A5-sized e-paper, so it's much more useful, but comes with the older 2.5.8 firmware which has a terrible browser (Netfront as opposed to WebKit in 3.1) and pdf reading without contrast settings (which are very, very useful).

But I was fortunate enough to find a mobileread forum post on migrating the 3.1 software from the K3G to the DXG by Yifan Lu, to whom I'm eternally grateful for the much needed software upgrade.

However, even if you don't want to upgrade Amazon's own firmware, there is Duokan, an alternative reader software for the Kindle. It has superb support for two-column pdf files, which allows you to just press the next page key and read the column on the left (from top to bottom) and then the one on the right (from top to bottom). It also has various other layouts for reading, and it makes the smaller Kindle almost usable for pdf reading (and reading on the bigger one a joy).

But sooner or later you will come across a pdf which would really benefit from manual cropping. Take a look at briss, a cross-platform application for cropping PDF files.

While we are at software, you should really look at the mobileread forum's Kindle hacks thread, which documents the latest jailbreak (which will give you root permissions on the Kindle), usbnetwork (to connect over a USB cable) and various font and screensaver hacks.

Development for the Kindle can be done using Java2ME (remember that from cell phones?). However, Amazon doesn't actually seem to give out its Kindle Development Kit, but don't despair: Andrew de Quincey figured out a way to develop kindlets without the KDK. With the KDK API from Amazon and the JSR217 specification for Java2ME you can be on your merry Java development way in no time.

I would like to see something like OpenInkpot running on the Kindle. This would allow free software developers to take full advantage of this nice e-paper device. If only I didn't have so many books to read...

We just got our conference booklet, and we needed to publish it on the web. But it was 152 Mb, which seemed excessive, so I googled a bit and found the following:

$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

$ ls -al input.pdf output.pdf 
-r-xr-xr-x 1 dpavlin dpavlin 158511430 May 11 16:23 input.pdf
-rw-r--r-- 1 dpavlin dpavlin   1646309 May 11 16:24 output.pdf

That's a reduction to 1% of the original size.

The -dPDFSETTINGS=configuration option presets the "distiller parameters" to one of several predefined settings (a small batch sketch in perl follows the list):

  • /screen selects low-resolution output similar to the Acrobat Distiller "Screen Optimized" setting.
  • /ebook selects medium-resolution output similar to the Acrobat Distiller "eBook" setting.
  • /printer selects output similar to the Acrobat Distiller "Print Optimized" setting.
  • /prepress selects output similar to Acrobat Distiller "Prepress Optimized" setting.
  • /default selects output intended to be useful across a wide variety of uses, possibly at the expense of a larger output file.
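
Since compressing a whole directory of files one at a time gets old quickly, here is the small batch sketch promised above: it just runs the same gs invocation with the /screen preset over every pdf in the current directory (the -small suffix for output files is my own convention):

#!/usr/bin/perl
# sketch: run ghostscript over every pdf in the current directory
use strict;
use warnings;

foreach my $pdf ( glob '*.pdf' ) {
    next if $pdf =~ /-small\.pdf$/;           # skip files we already produced
    ( my $out = $pdf ) =~ s/\.pdf$/-small.pdf/;
    next if -e $out;

    system 'gs', '-sDEVICE=pdfwrite', '-dCompatibilityLevel=1.4',
        '-dPDFSETTINGS=/screen', '-dNOPAUSE', '-dQUIET', '-dBATCH',
        "-sOutputFile=$out", $pdf;
}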

A while ago, I started writing MojoFacets, a fast web-based faceted browser which keeps data in memory. Since I last wrote a blog post about it, I have added various features, turning it into a powerful spreadsheet-like application within the browser in which you can mangle your data using perl code.

Let me start with a list of new features:

  • run a perl snippet over a filtered subset of data, modifying columns (using $update) or creating an aggregated result (using $out)
  • format on-screen filter html with a hidden ;, so that copy/paste into a spreadsheet produces correct values and counts
  • export a dataset as tab-separated values for easy migration into other applications
  • use the tab-separated export and an optional time format string with gnuplot to produce png graphs from data (this works well for huge datasets)
  • export filtered values from facets in a simple one-value-per-line format
  • run perl snippets over a filter's facet values to easily select ($checked) or calculate something with $value or $count
  • import CSV files (with an optional encoding specified in the filename)
  • import from CouchDB databases or views
  • import SQL query results from an RDBMS using any DBI perl module (tested with PostgreSQL, mysql and SQLite)
  • switch between loaded data-sets easily (filters are already shared between them, allowing a poor man's join)
  • implement lookup into a different data-set with descriptive statistics on values

Adding perl code evaluation over the dataset was a logical extension since I already had a web interface written in perl which had all data in memory. To make it fast, I had to implement indexes (and invalidation). But small things, like automatic generation of meaningful names for code snippets in the form dependent_col1,dep_col2.result_col1,res_col2, turned a read-only web interface into a powerful tool for applying reusable code snippets to data.
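
For example, a snippet applied to the filtered subset might look roughly like this; the date and year column names are made up, and the exact snippet environment may differ from this sketch ($row and $update are the variables the lookup example below uses):

# derive a numeric year column from a free-text date column, row by row
my $year = $row->{date} =~ m/(\d{4})/ ? $1 : '';
push @{ $update->{year} }, $year;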

The latest feature is lookup into other datasets, with the ability to create multiple columns from lookup values. MojoFacets is now sufficiently advanced to replace a relational database for quick join-like problems, this time by writing a little snippet of perl looking like this:

lookup($row->{issn}, 'nabava' => 'issn', sub {
  my $stat = shift;
  # called once for every matching row ($on) in the 'nabava' dataset
  push @{ $update->{listprice} }, $on->{listprice};
  $stat->add_data( $on->{listprice} );
}, sub {
  my $stat = shift;
  # called once at the end to store aggregates for the current row
  $update->{price_min} = $stat->min;
  $update->{price_max} = $stat->max;
});

This will do a lookup using $row->{issn} into the nabava dataset using its issn column. The first sub is executed for each value in the lookup data (available through the $on hash) and the second one is executed once to create aggregates using $stat, which is a Statistics::Descriptive object. The second sub is optional if you don't need aggregates.

If you have found Google Refine and wanted something similar, but in perl (with perl as the query/mangling language), MojoFacets might be a good choice.

akred1-85.60x53.98-var.png

I had an interesting problem with conference name tags this week. I wanted to use a free software stack to produce them, and it turned out that this required patching a bash extension to properly parse CSV files and learning new pdf tricks to print multiple pdf pages on a single sheet of paper, so I will try to document them here so I won't have to discover it all again in two years for the next conference...

For the design, we decided to use Inkscape, a great vector drawing program. Even better, it already includes the Inkscape Generator extension for just that purpose.

We settled on the ISO 7810 ID-1 card size of 85.60 × 53.98 mm and included the template variables required by the extension, as you can see in the picture.

Data for conference participants ended up in Gnumeric and was exported to a CSV file. And that's where we hit the first road-block. The current version of the ink-generator extension doesn't support more than one comma inside a quoted field. However, the Inkscape Generator extension home page included a useful pointer to correct bash code for parsing CSV files by Chris F.A. Johnson, so I decided to import it into the ink-generator git repository and replace the CSV parser. A few patches later I had a working extension which produced 600+ pdf files on disk.

In the process, I learned that you can invoke Inkscape extensions from command line, which is nice for generating previews while you edit code:

./generator.sh --var-type=name --data-file=test.csv --format=pdf --dpi=90 --output='$HOME/generator-output/%VAR_id%.pdf' --extra-vars=" " --preview=true akred1-var.svg

If --preview=true is removed, it will generate all files without GUI interaction, which is nice for automation. To make the output sorted by last name, we created a fake id column padded with zeros.

pdfnup-2x5.png

Now we had to print them. While some printer drivers have an option to print multiple pages per sheet, the one we were using decided to center each name tag, which would have required too much manual cutting on each side of each name tag. This was obviously inefficient. I knew about the psnup utility for PostScript, but I didn't know that there is pdfnup, which is part of PDFjam (and available as a Debian package). Getting the layout just right involved reading the pdfpages package documentation, but this is what I ended up with:

pdfnup --suffix nup --nup '2x5' --paper a4paper --no-landscape --noautoscale true --frame true --outfile nup.pdf -- ~/generator-output/*.pdf

This produces the layout which you can see in the picture, nicely centered in the middle of the page (this is why I included a fake grain background, to show the centering).

In the end, it didn't really work out. Parsing CSV correctly (and supporting quotes inside quoted values) is a hard task in bash, and I had to admit that I don't really know how to fix it. With only a day to the start of the conference and no time to waste, I took my language of choice, perl, and wrote a 60-line script which does the same thing but uses the Text::CSV perl module to parse the data files.
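
The CSV-reading part of that script boils down to something like this (a sketch, not the actual 60 lines; the name and id columns are just the template variables used above):

#!/usr/bin/perl
# sketch: parse the Gnumeric CSV export with Text::CSV, which copes with
# commas and quotes inside quoted fields, and hand each row to the template step
use strict;
use warnings;
use Text::CSV;

my $file = shift @ARGV || 'test.csv';
my $csv  = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag;

open my $fh, '<:encoding(utf-8)', $file or die "$file: $!";

my $header = $csv->getline($fh);              # first row holds column names
while ( my $row = $csv->getline($fh) ) {
    my %var;
    @var{ @$header } = @$row;                 # map column name => value

    # here the real script fills %VAR_...% placeholders in the SVG template
    # and calls inkscape to export one pdf per participant
    printf "%s -> %s.pdf\n", $var{name} // '', $var{id} // '';
}
close $fh;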

There is something to be learned here: for a start, the language and good supporting libraries do matter. Second, sometimes you are better off starting from scratch. But that decision should be made only when you have exhausted other options, since fixing the original extension would have benefited the wider community. There is a tricky balance between scratching my own itch and the common good.

koha-map.png

Browsing through subscribed videos on YouTube this week, I stumbled upon the video Simulating Markers with Tile Layers, which describes how to create custom tiles for Google maps using perl and PostgreSQL. John Coryat did a great job of describing the challenges, but also provided useful source code snippets and documentation on how to create custom tiles. So, this weekend I decided to try it out using the publisher field (260$a) from our Koha to show where our books are coming from.

I had several challenges to overcome, including migrating data from the MySQL Koha database to PostgreSQL (so I could use the great point data-type) and geolocating publisher locations using Yahoo's PlaceFinder (I tried to use Google's v3 geolocation API, but it had a limit of 10 requests, which wasn't really useful). I also added support for different icons (of arbitrary size) depending on the zoom level. In the process I also replaced the cgi-based tile server with a mod_rewrite configuration which does the same thing but inside Apache itself.
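
Much of the magic is just the standard web-mercator tile math: given a publisher's latitude/longitude and a zoom level, you need to know which tile the point falls into. This isn't the code from the repository, just the formula the approach relies on:

#!/usr/bin/perl
# sketch: slippy-map tile numbering for a given point and zoom level
use strict;
use warnings;
use POSIX qw(floor);
use Math::Trig qw(pi tan);

sub latlon_to_tile {
    my ( $lat, $lon, $zoom ) = @_;
    my $n = 2**$zoom;
    my $x = floor( ( $lon + 180 ) / 360 * $n );
    my $y = floor(
        ( 1 - log( tan( $lat * pi / 180 ) + 1 / cos( $lat * pi / 180 ) ) / pi )
        / 2 * $n
    );
    return ( $x, $y );
}

# Zagreb at zoom level 7, for example
my ( $x, $y ) = latlon_to_tile( 45.815, 15.982, 7 );
print "tile $x/$y\n";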

Source code is available at github.com/dpavlin/google-map-tiles. It's really easy to overlay a huge number of data-points over Google maps using custom tiles, so try it out!

If you have talked with me in the last year or so, you have probably heard me mention queues as a new paradigm in application development. If your background is web development, you probably wondered why they are important. This post will try to explain why they are useful and important, and how you can make your app scale, even on the same box.

The problem was rather simple: I needed to build monitoring which would pull data from ~9000 devices using the telnet protocol and store it in PostgreSQL. The normal way to solve this would be to write a module which first checks whether devices are available using something like fping and then telnets to each device to collect data. However, that would involve careful writing of the puller, taking care of child processes and so on. This seemed like a doable job, but it also seemed a bit complicated for the task at hand.

So I opted to implement the system using Gearman as the queue server and leave all scaling to it. I decided to push all functionality into gearman workers. For that, I used Gearman::Driver, which allows me to easily change the number of workers to test different configurations. The requirement was to pull data from each device at 20-minute intervals.
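
A worker for this looks roughly like the Gearman::Driver synopsis. The sketch below is not the production code: the device command, the database schema, and the attribute names (which have changed between Gearman::Driver versions) should all be checked against the module's documentation:

package Monitor::Worker::Telnet;
# sketch: one gearman worker function which polls a single device and
# stores the result; Gearman::Driver forks and scales the worker processes
use Moose;
extends 'Gearman::Driver::Worker';

use Net::Telnet ();
use DBI;

sub collect : Job : MinProcesses(25) : MaxProcesses(100) {
    my ( $self, $job, $host ) = @_;

    my $telnet = Net::Telnet->new( Host => $host, Timeout => 10 );
    my @output = $telnet->cmd('show counters');   # hypothetical device command

    my $dbh = DBI->connect( 'dbi:Pg:dbname=monitoring', undef, undef,
        { RaiseError => 1 } );
    $dbh->do(
        'INSERT INTO readings (host, raw, collected_at) VALUES (?, ?, now())',
        undef, $host, join( '', @output )
    );

    return 'ok';
}

1;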

Converting the existing perl scripts which collect data into gearman workers was a joy. On the first run (with 25 workers) it took 15 minutes to collect all data. Just by increasing the number of workers to 100 we managed to cut this time down to just over 1 minute. And that's on a single-core virtual machine (which makes sense, since most of the time we are waiting on the network).

For the web interface, I decided to use Mojolicious. To make it work with Gearman, I wrote MojoX::Gearman, which allows me to invoke gearman functions directly from Mojolicious. In fact, all functionality of the web interface is implemented as Gearman workers, even querying the database :-)

I had an interesting problem on my hands today: a directory with an Informix dump in UNL format from which I had to extract data for migration into a new non-relational system (into MARC format, not into NoSQL, btw). The idea was simple: import the dump back into a relational database, write SQL queries which produce the data, and use that. However, the SQL standard doesn't really allow us to relax and expect everything to work. In fact...

Step 1: import into PostgreSQL

The first idea was to use my favorite database, PostgreSQL, and import the data into it. The first problem was the schema file, which used DATETIME HOUR TO MINUTE, which I decided to convert into TEXT. There was another column with only a date, so I would have to mangle this using SQL anyway.

But then I hit several roadblocks:
ERROR:  insert or update on table "xxx" violates foreign key constraint "xxx_fkey"
ERROR:  invalid input syntax for integer: ""
ERROR:  invalid input syntax for type date: ""
ERROR:  value too long for type character(xx)
ERROR:  date/time field value out of range: "13.02.1997"
They are all somewhat worrying for a system which maintains your data, but I couldn't really influence the quality of data in the dump files from Informix, so I decided to try something which is more relaxed about errors like this...

Step 2: relax import using MySQL

Well, most of the invalid input syntax gets ignored by MySQL; however:

ERROR 1074 (42000) at line 55: Column length too big for column 'xxx' (max = 255); use BLOB or TEXT instead
was a show stopper. I really didn't want to hand-tune the schema just to create throw-away queries to export data.

Step 3: SQLite - it takes anything!

In the process, I learned that I can't really blindly import data, and that the format uses a backslash at the end of line for multi-line values, so I decided to write a small perl script which imports Informix UNL dumps directly into SQLite.

I'm generating INSERT INTO table VALUES (...) SQL directly, so you could easily modify this to run on some other database or just produce SQL dumps. For import speed, I'm creating the temporary database in /dev/shm. This helps sqlite3 to be CPU-bound as opposed to disk-bound during the import, and the whole database is just 50Mb (the UNL dumps are 44Mb, so it's very reasonable).
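
The core idea fits in a few lines. Here is a hedged sketch (it uses DBI placeholders instead of generating the SQL text directly, glosses over backslash-escaped field separators, and assumes the target table already exists):

#!/usr/bin/perl
# sketch: load one Informix UNL dump (pipe-delimited, backslash at end of
# line continues the value) into a table in an SQLite database in /dev/shm
use strict;
use warnings;
use DBI;

my ( $table, $unl ) = @ARGV;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=/dev/shm/import.db', '', '',
    { RaiseError => 1, AutoCommit => 0 } );

open my $fh, '<', $unl or die "$unl: $!";

my $buffer = '';
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ s/\\$// ) {              # value continues on the next line
        $buffer .= $line . "\n";
        next;
    }
    $buffer .= $line;

    my @values = split /\|/, $buffer, -1;
    pop @values;                           # UNL rows end with a trailing |

    my $placeholders = join ',', ('?') x @values;
    $dbh->do( "INSERT INTO $table VALUES ($placeholders)", undef, @values );

    $buffer = '';
}
close $fh;
$dbh->commit;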

Not bad for less than 100 lines of perl code: a working Informix UNL loader into SQLite!
