Results matching “perl”

I had an interesting problem on my hands today: a directory with an Informix dump in UNL format, from which I had to extract data for migration into a new non-relational system (into MARC format, not into NoSQL, btw). The idea was simple: let's import the dump back into a relational database, write SQL queries which produce the data, and use that. However, the SQL standard doesn't really allow us to relax and expect everything to work. In fact...

Step 1: import into PostgreSQL

My first idea was to use my favorite database, PostgreSQL, and import the data into it. The first problem was the schema file, which used DATETIME HOUR TO MINUTE, which I decided to convert into TEXT. There was another column with only a date, so I will have to munge this using SQL anyway.

But then I hit several roadblocks:
ERROR:  insert or update on table "xxx" violates foreign key constraint "xxx_fkey"
ERROR:  invalid input syntax for integer: ""
ERROR:  invalid input syntax for type date: ""
ERROR:  value too long for type character(xx)
ERROR:  date/time field value out of range: "13.02.1997"
They are all somewhat worrying for a system which maintains your data, but I couldn't really influence the quality of data in the dump files from Informix, so I decided to try something which is more relaxed about errors like this...

Step 2: relax import using MySQL

Well, most of the invalid input syntax should be ignored by MySQL, however:

ERROR 1074 (42000) at line 55: Column length too big for column 'xxx' (max = 255); use BLOB or TEXT instead
was a showstopper. I really don't want to hand-tune the schema just to create throw-away queries to export data.

Step 3: SQLite - it takes anything!

In the process, I learned that I can't really blindly import the data, and that the format uses a backslash at the end of line for multi-line values, so I decided to write a small perl script which imports Informix UNL dumps directly into SQLite.

I'm generating INSERT INTO table VALUES (...) SQL directly, so you could easily modify this to run on some other database or just produce SQL dumps. For speed of import, I'm creating a temporary database in /dev/shm. This helps sqlite3 stay CPU-bound as opposed to disk-bound during the import, and the whole database is just 50 MB (the UNL dumps are 44 MB, so that's very reasonable).
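
The UNL format itself is simple enough that the parsing fits in a handful of lines. Here is a minimal sketch of the idea (not the actual script: it assumes the table already exists, uses placeholders instead of generating literal SQL, and ignores escaped pipes inside values):

#!/usr/bin/perl
# sketch: load a pipe-delimited Informix UNL dump into an SQLite database
# assumptions: fields are separated by | with a trailing |, a backslash at
# the end of a line continues the value on the next line, and the target
# table already exists (created from the converted schema file)
use warnings;
use strict;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=/dev/shm/dump.sqlite', '', '',
	{ RaiseError => 1, AutoCommit => 0 } );

my $table = 'example';
open my $unl, '<', "$table.unl" or die $!;

while ( my $line = <$unl> ) {
	chomp $line;
	# backslash at end of line means the row continues on the next line
	while ( $line =~ s/\\$// ) {
		my $next = <$unl>;
		chomp $next;
		$line .= "\n" . $next;
	}
	my @values = split /\|/, $line, -1;
	pop @values;    # UNL rows end with a trailing |
	my $placeholders = join ',', ('?') x @values;
	$dbh->do( "INSERT INTO $table VALUES ($placeholders)", {}, @values );
}

$dbh->commit;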

Not bad for less than 100 lines of perl code: a working Informix UNL loader for SQLite!

I have been playing with RFID cards for a while, but recently I borrowed a touchatag reader which supports Mifare cards, to take a look at what is stored on the public transit card which I carry in my wallet.

Mifare Classic's weak crypto was reverse-engineered, and recently Harald Welte took a look at one such payment system and had interesting observations about its security.

So, reading your Mifare card isn't hard. If you have a libnfc-supported reader, you can run the Mifare Classic Offline Cracker to dump the data. But then you are presented with dumps which you don't understand.

The Mifare Application Directory (MAD) documents a directory-like structure that describes how sectors on the card are used. It took me a while to figure out that a PDF file called the Mifare address book is the authoritative source of data about card issuers.

But browsing PDFs and watching hex numbers is not fun, so I wrote perl-Mifare-MAD - a simple script which decodes card data in a format similar to the following:

manufacturer block
Serial number: 7ef11d7f
CB: ed
manufacturer data: 880400465976525d103308
ADV (MAD version code): 1
MA (multiapplication): yes
DA (MAD available): yes
Info byte (publisher sector): 0
0000 7ef11d7fed880400465976525d103308 010 R:AB W:-- I:-- DTR:--
0010 b100040015007b887b887b887b886220 100 R:AB W:-B I:-- DTR:--
0020 62200000000000000000000000000000 100 R:AB W:-B I:-- DTR:--
0030 a0a1a2a3a4a5696789c1a7a2676192a6 011 R/W: KEYSECXA:-/B ACCESS COND:AB/B KEYSECXB:-/B
KEY A:a0a1a2a3a4a5 | 696789 GDP: c1 | B:a7a2676192a6

MAD sector 1 @12 0004 [miscellaneous applications]
sector contains card holder information in ASCII format.
surname     1 
given name 18 Vrijednosna karta
any         5 0x09200901
0040 0100525672696a65646e6f736e61206b 110 R:AB W:-B I:-B DTR:AB
0050 6172746100c509200901000000000000 110 R:AB W:-B I:-B DTR:AB
0060 00000000000000000000000000000000 110 R:AB W:-B I:-B DTR:AB
0070 269d97f5a18508778f000f0dabf8179a 011 R/W: KEYSECXA:-/B ACCESS COND:AB/B KEYSECXB:-/B
KEY A:269d97f5a185 | 08778f GDP: 00 | B:0f0dabf8179a

MAD sector 2 @14 0015 [miscellaneous applications]
- card administration MIKROELEKTRONIKA spol.s.v.MIKROELEKTRONIKA spol.s.v.o. worldwide 1 01.02.2007 Card publisher info
Card number: 000040067318
0080 0326898400000460378177277382e134 110 R:AB W:-B I:-B DTR:AB
0090 13ee921ab8f5f6120000000000000036 110 R:AB W:-B I:-B DTR:AB
00a0 0000000000000000000000000000000b 110 R:AB W:-B I:-B DTR:AB
00b0 a0a1a2a3a4a508778f00969185651f84 011 R/W: KEYSECXA:-/B ACCESS COND:AB/B KEYSECXB:-/B
KEY A:a0a1a2a3a4a5 | 08778f GDP: 00 | B:969185651f84
...and so on, for all 16 sectors of a Mifare Classic card. It uses data from publicly available sources (created using copy/paste from PDF files to the end of the script) and decodes all the bits from the documentation that I understand so far.
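
For illustration, decoding the manufacturer block shown at the top of that dump comes down to unpacking the first 16 bytes: a 4-byte serial number, a check byte and 11 bytes of manufacturer data. A minimal sketch, assuming a raw 1K dump file such as the one mfoc produces:

#!/usr/bin/perl
# sketch: decode the manufacturer block (sector 0, block 0) of a dump
# assumption: dump.mfd is a raw 1K Mifare Classic image
use warnings;
use strict;

open my $fh, '<:raw', 'dump.mfd' or die $!;
read $fh, my $block0, 16;

my ( $serial, $check_byte, $manufacturer ) = unpack 'H8 H2 H22', $block0;
printf "Serial number: %s\nCB: %s\nmanufacturer data: %s\n",
	$serial, $check_byte, $manufacturer;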

The only other information about the Mifare Classic format is on the mfcuk wiki, but it doesn't really explain MAD.

Last year, we had a podcast about RFID in Croatian, so if you want to learn more, take a look!

A week ago, I stumbled by accident on <angular/> via Angular: A Radically Different Way of Building AJAX Apps. I was hooked. Making HTML the ultimate template language by annotating it with a few attributes, and having two-way data-binding between input forms and objects which can be stored directly in the cloud, makes a lot of sense to me.

First, a bit of history. Back in 2003 I wrote a small tool called wopi (Web Online Poll Interface) to generate on-line polls easily. It was basically an HTML template with a few magic fields which would be parsed by a perl script to produce PHP code and a database schema for deployment.
I was aware that this approach simply doesn't make sense any more, since browsers can do much more than in 2003, and we have flexible storage systems like CouchDB which don't require a pre-defined data schema.

First I decided to try out the hosted service at getangular.com to see if I could create a working application in an afternoon. And I could.
Then I tried the examples which worked with the API at getangular.com against an angular.js checkout from github, and I was at first surprised that things like ng-entity didn't exist in the checkout. However, since the storage API is part of the hosting offer, that made sense. I will have to implement my own REST storage anyway, and the REST API for the angular hosting service is documented.

Another interesting thing was that the source on github is a newer version than the one on the hosting service, so I had a problem figuring out how to make the $resource call which maps objects to REST work, and the developers on the mailing list were very helpful in figuring it out.

Now, I needed to write some kind of REST storage against which I could test the angular controllers I was writing.

angular-mojolicious is an in-memory data store exposing the documented REST API, based on the lightweight mojolicious perl web framework. At its current stage, it provides the ability to replicate data from shared storage and supports the $resource call in angular.
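
To illustrate the idea (this is not the actual angular-mojolicious code, and the /data route layout is only an assumption about what a $resource would be pointed at), an in-memory JSON store can be sketched in a page of Mojolicious::Lite:

#!/usr/bin/perl
# sketch: minimal in-memory JSON store for testing angular controllers
# (illustration only, not the real angular-mojolicious implementation)
use Mojolicious::Lite;

my %store;    # $store{collection}{id} = hashref

get '/data/:collection' => sub {
	my $c = shift;
	$c->render( json => [ values %{ $store{ $c->param('collection') } || {} } ] );
};

get '/data/:collection/:id' => sub {
	my $c = shift;
	$c->render( json => $store{ $c->param('collection') }{ $c->param('id') } || {} );
};

put '/data/:collection/:id' => sub {
	my $c = shift;
	my $doc = $c->req->json;
	$store{ $c->param('collection') }{ $c->param('id') } = $doc;
	$c->render( json => $doc );
};

app->start;

Run it with morbo and every collection lives in a plain perl hash, which is exactly the kind of throw-away storage that is good enough for testing controllers.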

The application I was writing was a simple conference submission app. Basically, one work per submission, but with the possibility to group a bunch of works into a symposium (a single topic which has additional data).

I decided to duplicate the symposium data inside each work to keep this simple (and store the number of the work within the symposium), and simply copy the symposium data from the first work to the second work within the same symposium.
However, I wanted to display all works within a symposium, and my REST back-end doesn't have any query capability yet (aside from returning all objects). At first, I just added an additional javascript object, symposium, which stores all works within a single symposium in order, mostly so I could display the list of works which are part of the same symposium.
And again, by implementing something visible on screen (the list of works within a symposium), I created something I could simply wrap into another $resource and store on the server, creating the data view which I needed.

All this duplication of data still seems a bit strange to my relational brain, and the sheer size of the POST request to store symposium data with all works included could become prohibitively expensive for hundreds of works. However, all data needed for a symposium is stored within the Work documents, so I could just write server-side code which would update Symposium objects directly when Works are updated (something like stored procedures in relational lingo).

In the process, I also learned to pull upstream angular code often, because the developers are very active, and any bug you might be hitting now might already be fixed in upstream git. I learned this the hard way, trying to figure out why my app wasn't working while the fix was already in angular's github repository. Kudos for the fast release cycle.

Any sufficiently advanced technology is indistinguishable from magic. [Arthur C. Clarke]

In summary, <angular/> is a really good way to build web applications, so take a look. angular-mojolicious can get you started: it includes examples from the angular cookbook and my simple conference submission application.

Update: My message about my first week of development with angular got quoted on the angular wiki testimonials page.

The last few days I spent migrating our EPrints repository to version 3. This seems like a dull and easy job, but it did have its own challenges:

  • the mysql database had latin1 encoding, which didn't play well with utf-8 encoded characters from EPrints 2, in effect producing utf-8 strings which were encoded multiple times (and differently for different data)
  • we also had a table with additional works collected after our EPrints 2 installation died, so it had to be imported somehow

In essence, EPrints is a set of perl scripts which convert an XML archive into a database and a web interface. So, how hard can it be?

For a start, take a look at the utf8-fix.pl script, which tries to convert all combinations of Croatian characters back to utf-8. Creating the mapping was not easy. And if you look at the end, you will see that the script has a verification step which tries to find uncovered utf-8 strings and dump them out. To make it work, I used test-driven methodology (sic!) with fix.sh as a small runner script which does one conversion, shows the diff from the last one (lines with errors disappearing from the log is good) and opens vi to edit the files directly.
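
For the simplest case, where a utf-8 string went through exactly one extra latin1 round-trip, the generic repair looks roughly like this (a sketch only; the mixed, multiply-encoded rows from EPrints 2 are exactly the cases where this is not enough and the hand-built mapping has to take over):

# sketch: undo a single latin1/utf-8 double-encoding round-trip
use Encode qw(decode encode);

sub fix_double_encoded {
	my $bytes = shift;
	# the bytes were utf-8, got misread as latin1 and re-encoded as utf-8;
	# peel off the outer utf-8 layer, turn the characters back into the
	# original byte string, then decode that as utf-8
	return decode( 'UTF-8', encode( 'iso-8859-1', decode( 'UTF-8', $bytes ) ) );
}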

Re-read that last sentence once more. I spent two days before I streamlined this workflow to the point where I could really finish the conversion, so it's useful to keep that in mind if you are writing some kind of data munging software.

In the process, I also stripped Croatian characters from PDF filenames, creating symlinks to unaccented versions and passing the generated xml through unaccent-file.pl from fix.sh.

The second part of the problem was converting a tab-delimited file into EPrints XML to import the new documents. However, it's (again) not as easy as it seems, since the data had only a partial filename which had to be matched to real files on a share somewhere. So, I decided to split this problem in the following way:

  • files.txt is the list of available files, generated by find /mnt/share -print
  • ep-xml.xml is the template for a single document, which uses <!-- "variable" --> to denote places in which I need to insert custom data
  • finally, tsv2xp-xml.pl (which should really be named tsv2ep-xml.pl, but I made a typo) is the script which reads both files (together with the TSV) and emits XML for eprints, roughly as sketched below
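
A rough sketch of how those three pieces fit together (the column layout, TSV filename and XML handling are illustrative; the real tsv2xp-xml.pl does more matching and validation):

#!/usr/bin/perl
# sketch: match partial filenames from a TSV against files.txt and fill
# an XML template which marks insertion points as <!-- "name" -->
use warnings;
use strict;

my %path_for;    # basename => full path from the find output
open my $files, '<', 'files.txt' or die $!;
while (<$files>) {
	chomp;
	my ($basename) = m{([^/]+)$};
	$path_for{$basename} = $_;
}

open my $tpl, '<', 'ep-xml.xml' or die $!;
my $template = do { local $/; <$tpl> };

open my $tsv, '<', 'works.tsv' or die $!;    # hypothetical TSV filename
while (<$tsv>) {
	chomp;
	my ( $title, $creator, $file ) = split /\t/;    # illustrative columns
	my %data = (
		title    => $title,
		creator  => $creator,
		document => $path_for{$file} || die("no file found for $file"),
	);
	my $xml = $template;
	$xml =~ s/<!-- "(\w+)" -->/$data{$1}/g;
	print $xml;
}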

Finally, there is the script eprints-ffzg-rebuild.sh which does the import into EPrints (and creates an admin user :-)

Android for command-line users

Let's admit it: I'm a command-line user. So Android, as a mobile platform with all its GUI, doesn't seem like something I would be positive about. But, as we will see, there are quite a few hints which I managed to collect in my first two weeks of Android usage (on an HTC Dream/G1, in case you are wondering).

First, I got a G1 with Android 1.6 and without root. So I decided to follow the instructions to root my Android, which went quite well until I upgraded to the DREAM RC29 release with the root hole, which installed the 1.22 radio ROM. I was impatient and flashed the 2.22 radio ROM and DangerSPL, but I didn't exit the Amon Ra menu, so the radio ROM never got flashed to the cell phone and I ended up with a brick.

Having bricked my phone, I decided to buy a JTAG jig for the G1, some ExtUSB connectors to make a serial port, and components to make a low-power WIGGLER JTAG. The components still haven't all arrived, so unbricking my Android will take a bit longer, and I decided to get another G1 in the meantime...

I had more luck the second time. I got a G1 which was already rooted, so I didn't have to brick it first. So, as a real geek, I decided to install Android 2.2 Froyo on it. Heck, I even compiled my own CyanogenMod.

One of the most interesting commands that comes with the Android SDK is adb. It allows you to connect to your cell phone locally (over a USB cable) or remotely using:

$ adb shell setprop service.adb.tcp.port 5555

dpavlin@klin:~$ adb connect 192.168.1.40:5555

dpavlin@klin:~$ adb devices
List of devices attached 
192.168.1.40:5555  device
This proved very useful when compiling CyanogenMod on a remote machine, because it requires you to download some proprietary binary cruft from the phone to build the image.

I also tried to connect to the Internet from the Android device using USB tethering (there is also the same question on stackoverflow, until the community decides which forum is the right place to ask it). As you can see from the question, it isn't fully working because Android doesn't think it's connected to the Internet (sic!). I will try to tackle this problem in the future, or install azilink.

I also played with the Scripting Layer for Android - SL4A, and especially its implementation of perl for SL4A, which enables me to use RemoteControl to run perl scripts on my laptop while interacting with the SL4A server on Android, using the sl4a-start.sh helper script.

In the end, I decided to give the Android x86 port a try. Compilation isn't complicated, but I still haven't been able to boot my own compiled images due to alsa segfaults (!). However, there is an Android 1.6 Donut release for the EeePC which works perfectly on my EeePC 701, and I highly recommend it as an alternative OS for this little machine. The best part is that most sites think you are on a mobile device (and you are, in fact) and show you an uncluttered interface (tested with FriendFeed, Facebook and Twitter).

Update: I started collecting useful command-line scripts for Android in the github repo android-command-line

Of course it does, it's just the GET and POST requests which your browser sends to the server. And if you need to modify records in bulk in your web application (Koha in our example), you might wonder about writing a script which does the job for you.

It's not really hard. WWW::Mechanize provides us with a web browser scriptable in perl, and the following script logs in to Koha's intranet and edits items specified in a file:

#!/usr/bin/perl

use warnings;
use strict;

use WWW::Mechanize;
use Data::Dump qw(dump);

# we will use %d and %s to insert values from file
my $url_template = 'https://localhost:8443/cgi-bin/koha/cataloguing/addbiblio.pl?biblionumber=%d';

our ( $user, $passwd );
require 'config.pl'; # edit $user and $passwd in config.pl file

my $login_url = 'https://localhost:8443'; # Koha intranet

my $mech = WWW::Mechanize->new( autocheck => 1 );


warn "# login $login_url\n";
$mech->get( $login_url );

$mech->submit_form(
	fields => {
		userid => $user,
		password => $passwd,
	},
);

sub modify_field; # declare later

while( <> ) {
	chomp;
	my @v = split(/\s+/,$_);
	warn "<< ",dump(@v),$/;

	my $url = sprintf $url_template, @v;
	warn "# url $url\n";
	$mech->get( $url );

	my $form = $mech->form_number( 1 ); # XXX 1st form

	# XXX edit fields

	modify_field $form => 'tag_008_subfield_00' => sub { s/^(.{24}).(.+)/$1d$2/ };

	$mech->submit;
}

exit;

# magic to find field name by partial match from beginning
sub modify_field {
	my ( $form, $field, $coderef ) = @_;

	my ( $name, $value ) = map { $_->name, $_->value } grep { defined $_->name && $_->name =~ /^\Q$field\E/ } $form->inputs;
	die "can't find $field in ", $mech->dump_forms unless $name && $value;

	$_ = $value; $coderef->($value);
	my $new = $_;

	if ( $value eq $new ) {
		warn "WARNING: $name not changed [$value]\n";
		return;
	}

	warn "$name\n\tOLD: $value\n\tNEW: $new\n";

	$mech->field( $name, $new );
}

The interesting part is modify_field, which tries to find a field with the specified prefix, since Koha adds unique numbers to all field names in the edit form.

This script proved to be very useful for us, and hopefully it might be useful for other Koha users as well.

Almost two years ago, I began my experiments with RFID, writing support for the 3M 810 RFID reader. Then I tried to make a web interface for RFID in Koha.

Comet: ETOOMUCH. This worked, but having a separate Comet server was too much complexity for me, so I decided to implement JSONP directly in the RFID driver code. Heck, I already had an HTTP server and a local JavaScript interface, so why not?

The move to JSONP allowed me to use the original Koha web interface and just overlay the RFID information and form submission code as a simple jQuery snippet.
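
The JSONP part itself is tiny. As an illustration only (the callback parameter name and the data layout are assumptions, not the real driver code), the embedded HTTP server just has to wrap its JSON state in the callback name supplied by the browser:

# sketch: emit a JSONP response from an embedded HTTP server
use JSON qw(encode_json);

sub jsonp_response {
	my ( $callback, $tags ) = @_;    # e.g. $tags = { sid => '04a224...' }
	return "HTTP/1.0 200 OK\r\nContent-Type: application/javascript\r\n\r\n"
		. $callback . '(' . encode_json($tags) . ');';
}

On the Koha side, jQuery's $.getJSON with a callback=? placeholder in the URL is enough to consume such a response and fill the form.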

But two years of development and trying out different approaches produced not-quite-production-quality code. So I began a rewrite called Biblio::RFID. It splits RFID reader support from the HTTP and JSONP servers and couples this with documentation and tests. I have a production use for it this summer, involving programming of RFID cards as they are printed out, so expect it to change over the next few weeks. After that I will push it to CPAN, but I would love to get feedback and comments before that.

Update: It's also available on CPAN.

My server-side faceted browser just got a bit better. In fact, it became 10 times better. But let's try to explain this story step by step...

This week I will try to introduce faceted joins. The primary motivation is the great Plants For A Future database, which consists of more than one text file.

The use case is something like the following:
I would like to know all plants which have a medical use, are edible and have a perennial habit (so I don't have to re-plant them every year).
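
Conceptually a faceted join is just a hash lookup across files which share a key; something like the following sketch (made-up column layout, and not how MojoFacets implements it internally):

# sketch: join two tab-delimited PFAF-style files on a shared plant name
use warnings;
use strict;

my %uses;    # plant name => { medicinal => ..., edibility => ... }
open my $u, '<', 'uses.txt' or die $!;
while (<$u>) {
	chomp;
	my ( $name, $medicinal, $edibility ) = split /\t/;
	$uses{$name} = { medicinal => $medicinal, edibility => $edibility };
}

open my $p, '<', 'plants.txt' or die $!;
while (<$p>) {
	chomp;
	my ( $name, $habit ) = split /\t/;
	my $use = $uses{$name} or next;
	print "$name\n"
		if $habit =~ /perennial/i and $use->{medicinal} and $use->{edibility};
}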

And you can watch the video to see how easily this can be done:

But this still doesn't make MojoFacets 10 times better than before. This is a quite small dataset (still about 10 times bigger than Exhibit could handle), but I had a new problem: a 100 MB source file with a bit less than 30000 items. To make it scale further, I implemented pre-calculated filters and sorts. They serve the same purpose as indexes do in relational databases, but they are calculated on demand and stored in memory.
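
The idea behind the pre-calculated filters and sorts is plain memoization; roughly (a sketch of the concept, not of MojoFacets internals):

# sketch: cache sorted and filtered views of an in-memory dataset, the way
# an index would be reused in a relational database (concept only)
my %sort_cache;      # column name => arrayref of item positions, sorted

sub sorted_positions {
	my ( $items, $column ) = @_;
	$sort_cache{$column} ||= [
		sort { $items->[$a]{$column} cmp $items->[$b]{$column} } 0 .. $#$items
	];
	return $sort_cache{$column};
}

my %filter_cache;    # "column=value" => arrayref of matching positions

sub filtered_positions {
	my ( $items, $column, $value ) = @_;
	$filter_cache{"$column=$value"} ||= [
		grep { ( $items->[$_]{$column} // '' ) eq $value } 0 .. $#$items
	];
	return $filter_cache{"$column=$value"};
}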

Let's see how it works in action with ~30000 items:

In this video, we saw:

  • starting memory usage of ~13Mb
  • 100Mb dataset with 29869 items
  • filter by autor with 45644 taking ~10s
  • use regex filter ic,
  • godina_izdavanja is numeric facet
  • jezik filter using cro slo ser regexps and toggle it
  • show popup title on filters
  • turn off filters to show under 4s load time
  • at the end, we consumed ~260Mb of memory
Ok, 4s might not seem blazingly fast, but bear in mind that all this is implemented in pure perl (so deployment is lightweight) using the Mojolicious web framework. But it has its overhead. Besides the 260 MB of RAM for the browser, it will also take 600 MB of RAM on the server side. But if you can live with a 6x file-size factor on the server side, this might be very interesting as a faceted browsing tool for the web.

Like every year, we had our local Linux conference. It was a very intense event (the first year that I was involved in the actual organization) and I can say it's all just a big blur.

I had two tutorials, one about my Virtual LDAP and another one about creating a Google-like (horizontally scalable) cluster from a library building. In this one, I covered a whole bunch of tools which I ended up using during the last year:

  • Webconverger is the easiest way to deploy Firefox on kiosks for public Internet access
  • PXElator - full stack solution to network booting and spawning machines
  • Sack - horizontally scalable (across cores or nodes) in-memory perl hash with remote code execution (close to data)
  • mongoDB, which I use for the audit log in PXElator and feed back into Sack, after finding CouchDB too slow
  • Sysadmin Cookbook as a way to document HOWTO or SOP documents
  • bak-git for tracking configuration changes
  • Gearman and Narada didn't get all the attention they deserved, partly because I wasn't able to make Narada work (I tried the perl and PHP versions in preparation for the tutorial). But I hope that I managed to transfer part of my fascination with the distributed fork approach.

During the conference I wrote a small project to index git log messages using Sphinx, which might help you get started with it.

Can MySQL be slow? Yes it can. It depends on disk speed (even if you have fancy SSD drives), and some tuning can bring query time down dramatically, especially if you are currently using default settings and/or have accumulated several gigabytes of data. Sure, there is documentation about optimizing MySQL, and a bunch of information on the Internet, but where to start?

Tune MySQL parameters

For a start, download mysqltuner.pl or tuning-primer.sh and run them. While they won't give you exactly the same recommendations, they will nudge you in the right direction.

I would especially recommend setting innodb_buffer_pool_size to some reasonable number. This will provide an immediate performance improvement, especially if you are running with default Debian settings and have a couple of gigabytes of data.
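
For example, on a Debian box this can go into a snippet under /etc/mysql/conf.d/ (the value below is only a placeholder; size it to a large share of RAM on a dedicated database server, and restart mysqld afterwards):

[mysqld]
# illustrative value - adjust to the memory you can spare for InnoDB
innodb_buffer_pool_size = 1G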

Find slow queries

But linking to scripts wouldn't be worth a blog post. The next step is to find which queries are slow on your system. To do that, you can use Maatkit, or more specifically mk-query-digest. You can feed it /var/log/mysql/mysql-slow.log or run it on a live system:

$ wget maatkit.org/get/mk-query-digest
$ perl mk-query-digest --processlist localhost --interval 0.01
This will create considerable load on your system (since we are using an interval of 0.01 seconds to capture all queries running longer than that), but after you press Ctrl+C you will have a good idea of which queries need a speedup. You can also take a look at Baron Schwartz's video from OpenSQLCamp 2009 or the presentation about mk-query-digest from PgEast 2010 if you want to know more. Hint: it supports MySQL (slow query log, binlog, genlog or tcpdump), HTTP, PostgreSQL and memcached.

If you can modify the application, dig into the slow queries. In my case, I found out that session handling in Koha takes more than 75% of all MySQL query time. I decided to leave this for a future improvement. But if you can't or don't want to modify the application, simply move on to the next topic.

Tune Linux

Yoshinori Matsunobu gave a great presentation, Linux Performance Tuning and Stabilization Tips, at the recent MySQL Conference & Expo, in which he covered various ways to optimize Linux for MySQL workloads. I will summarize just a few parts which I found useful for my workload, which involves multiple LXC containers, each with its own logical volume using btrfs, sitting on a battery-backed RAID controller.

  • turn off readahead
    hdparm -a 0 /dev/sdb
    
  • use the noop or deadline scheduler for the block device which holds the logical volume with MySQL
    echo deadline > /sys/block/sdb/queue/scheduler
    
  • reduce swappiness
    echo 0 > /proc/sys/vm/swappiness
    
  • disable write barriers (only with a battery-backed controller!)
    mount /mnt/koha -o remount,nobarrier
    

Results

So, did all of this make sense? It's best to have a clear goal when optimizing, and mine was a web response in less than 3 seconds. Without any MySQL tweaking and with a 3 GB database, it was taking 5 seconds or more for each response. To my amazement, without any modification to the application itself, I managed to produce a visible improvement, even if the only real comment I got was: It's not slow today :-)