Results matching “PostgreSQL”

As you all know by now, last week we had another DORS/CLUC conference. This time I had two talks and one workshop.


Sysadmin cookbook

I never properly introduced this project here, but if you want to know more about my convention-based documentation, examine the presentation or hop over to https://sysadmin-cookbook.rot13.org/ and take a look at the generated documentation.

The basic idea is to document changes in easy-to-write files on the file system (preserving symlinks to files on the system, which lets you quickly see whether the cookbook is deployed and diff the template against the deployed configuration). I know that my cookbook is a mix of various things I did over the last three years, but I do find it useful, so hopefully it might be useful to you as well.


Kindle - so much more than ebook reader

This was a longer talk about my one-year experience with the Kindle. I must say that I'm still a very happy Kindle user, but in this talk I tried to cover the Kindle Developer's Corner at the mobileread forum as well as other related projects. So if you are still wondering whether it's worth the effort to install third-party software on the Kindle, the answer is yes, it is!


Web scale monitoring

This was a workshop which doesn't have much to do with the web (it's about monitoring ADSL CPE devices and the provider equipment in between), but it shows (I hope) a nice way to integrate several projects into a scalable monitoring infrastructure. It's composed of:

  • Gearman message queue, together with Gearman::Driver, provides on-demand scaling of workers
  • redis stores all data from external systems (LDAP, CRM) and all results from statistics collection, nicely providing data for the web interface
  • PostgreSQL stores all collected data, using hstore to provide an unstructured key-value store for different data from different devices while still allowing us to use SQL to query it and export it to a data warehouse (see the sketch below)
  • Mojolicious provides the web interface, which uses data from redis and exposes a JSONP REST interface for Angular.js
All in all, it's a nice combination of tools which has served my needs quite well, so I hope it was also useful to the people who attended the workshop.
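
To illustrate the hstore bullet above, here is a minimal sketch of how collected key/value pairs could land in PostgreSQL from perl; the table and column names are hypothetical and the real workshop code is more involved:

use strict;
use warnings;
use DBI;

# hypothetical table: create table device_data (ip inet, collected_at timestamptz, data hstore);
my $dbh = DBI->connect( 'dbi:Pg:dbname=monitoring', '', '', { RaiseError => 1 } );

my %collected = ( firmware => '5.2.1', snr_down => '12.5', uptime => '86400' );

# naive hstore literal: "key"=>"value" pairs separated by commas (fine for simple values)
my $hstore = join ', ', map { qq{"$_"=>"$collected{$_}"} } keys %collected;

$dbh->do(
    q{ INSERT INTO device_data (ip, collected_at, data) VALUES (?, now(), ?::hstore) },
    undef, '10.0.0.1', $hstore
);

# and we can still query it with plain SQL, e.g.
# SELECT ip, data->'firmware' FROM device_data WHERE data->'snr_down' IS NOT NULL;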

By default, a MySQL installation on Debian comes without the innodb_file_per_table option, which spreads tables across individual InnoDB files. Depending on your usage patterns or backup strategies, this might be a better filesystem organization than one big /var/lib/mysql/ibdata1 file. I first heard about it in OurSQL Episode 36: It's Not Our (De)fault!. It's a great podcast, but to be honest, with each new episode I wish I had only PostgreSQL servers to maintain...

To enable this option you will need to create a configuration file and restart the MySQL server:

koha:/etc/mysql/conf.d# cat > file-per-table.cnf 
[mysqld]
innodb_file_per_table
CTRL+D
koha:/etc/mysql/conf.d# /etc/init.d/mysql restart

This alone won't change anything, because only newly created tables will get separate files. But we can run ALTER TABLE table ENGINE=InnoDB on each table to force InnoDB to rebuild it and create a separate file:

mysqlshow koha --status | grep InnoDB | cut -d'|' -f2 | sed -e 's/^/alter table/' -e 's/$/ engine=InnoDB;/' | mysql -v koha

If you replace grep InnoDB with grep MyISAM, you can use the same snippet to convert MyISAM tables to InnoDB (if you still have any and don't use fulltext search).
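
If you prefer perl over a shell pipeline, a rough equivalent using information_schema might look like this (database name and credentials are made up):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:database=koha', 'koha', 'secret', { RaiseError => 1 } );

# find all InnoDB tables (swap 'InnoDB' for 'MyISAM' to convert those instead)
my $tables = $dbh->selectcol_arrayref( q{
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = 'koha' AND engine = 'InnoDB'
} );

# rebuilding each table moves it into its own .ibd file
$dbh->do( "ALTER TABLE $_ ENGINE=InnoDB" ) for @$tables;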

I think that system administration is like gardening. I don't know anything about gardening, but it seems to involve a lot of care here and there, seemingly without much pattern. In that sense, it's similar to wiki editing: you start somewhere and you really don't know where it will lead you.

A while ago, I started writing MojoFacets, a fast web-based faceted browser which keeps data in memory. Since I last wrote a blog post about it, I have added various features, turning it into a powerful spreadsheet-like application within the browser in which you can mangle your data using perl code.

Let me start with a list of new features:

  • run a perl snippet over the filtered subset of data, modifying columns (using $update) or creating an aggregated result (using $out)
  • format the on-screen filter html with a hidden ;, so that copy/paste into a spreadsheet produces correct values and counts
  • export the dataset as tab-separated values for easy migration into other applications
  • use the tab-separated export and an optional time format string with gnuplot to produce png graphs from the data (this works well for huge datasets)
  • export filtered values from facets in a simple one-value-per-line format
  • run perl snippets over a filter's facet values to easily select ($checked) or calculate something with $value or $count
  • import CSV files (with an optional encoding specified in the filename)
  • import from CouchDB databases or views
  • import SQL query results from an RDBMS using any DBI perl module (tested with PostgreSQL, mysql and SQLite)
  • switch between loaded data-sets easily (filters are already shared between them, allowing a poor man's join)
  • implement lookup into a different data-set with descriptive statistics on values

Adding perl code evaluation over the dataset was a logical extension, since I already had a web interface written in perl with all the data in memory. To make it fast, I had to implement indexes (and their invalidation). But small things, like automatic generation of meaningful names for code snippets in the form dependent_col1,dep_col2.result_col1,res_col2, turned a read-only web interface into a powerful tool for applying reusable code snippets to data.
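
As a rough illustration (hypothetical column names, and I'm inferring the snippet environment from the lookup example below, where $row is the current row and $update receives new column values), such a snippet could be as small as:

# derive a new column from an existing one for each filtered row
$update->{price_with_vat} = sprintf '%.2f', ( $row->{listprice} || 0 ) * 1.25;

# or build an aggregated result instead, e.g. a count per publisher
$out->{ $row->{publisher} }++;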

The latest feature is lookup into other datasets, with the ability to create multiple columns from lookup values. MojoFacets is now sufficiently advanced to replace a relational database for quick join-like problems, this time by writing a little snippet of perl looking like this:

lookup($row->{issn}, 'nabava' => 'issn', sub {
  my $stat = shift;
  push @{ $update->{listprice} }, $on->{listprice};
  $stat->add_data( $on->{listprice} );
}, sub {
  my $stat = shift;
  $update->{price_min} = $stat->min;
  $update->{price_max} = $stat->max;
});

This will look up $row->{issn} in the nabava dataset using its issn column. The first sub is executed for each value in the lookup data (available through the $on hash) and the second one is executed once to create aggregates using $stat, which is a Statistics::Descriptive object. The second sub is optional if you don't need aggregates.

If you have found Google Refine and wanted something similar, but in perl (with perl as the query/mangling language), MojoFacets might be a good choice.

[screenshot: koha-map.png]

Browsing through my subscribed videos on YouTube this week, I stumbled upon the video Simulating Markers with Tile Layers, which described how to create custom tiles for Google Maps using perl and PostgreSQL. John Coryat did a great job describing the challenges, and also provided useful source code snippets and documentation on how to create custom tiles. So, this weekend I decided to try it out using the publisher field (260$a) from our Koha to show where our books are coming from.

I had several challenges to overcome, including migrating data from the MySQL Koha database to PostgreSQL (so I could use the great point data type) and geolocating publisher locations using Yahoo's PlaceFinder (I tried to use Google's v3 geocoding API, but it had a limit of 10 requests, which wasn't really useful). I also added support for different icons (of arbitrary size) depending on the zoom level. In the process I replaced the cgi-based tile server with a mod_rewrite configuration which does the same job but inside Apache itself.

Source code is available at github.com/dpavlin/google-map-tiles, and it's really easy to overlay a huge number of data points on Google Maps using custom tiles, so try it out!

If you have talked with me in the last year or so, you probably heard me mention queues as a new paradigm in application development. If your background is web development, you probably wondered why they are important. This post will try to explain why they are useful, and how you can make your app scale, even on the same box.

The problem was rather simple: I needed to build monitoring which would pull data from ~9000 devices using the telnet protocol and store it in PostgreSQL. The normal way to solve this would be to write a module which first checks whether devices are available using something like fping and then telnets to each device to collect data. However, that would involve carefully writing the puller, taking care of child processes and so on. This seemed like a doable job, but it also seemed a bit complicated for the task at hand.

So I opted to implement the system using Gearman as the queue server and leave all scaling to it. I decided to push all functionality into gearman workers. For that, I used Gearman::Driver, which allows me to easily change the number of workers to test different configurations. The requirement was to poll each machine in 20-minute intervals.

Converting the existing perl scripts which collect data into gearman workers was a joy. On the first run (with 25 workers) it took 15 minutes to collect all the data. Just by increasing the number of workers to 100 we managed to cut this time down to just over 1 minute. And that's on a single-core virtual machine (which makes sense, since most of the time we are waiting on the network).
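
To give a flavour of what such a worker looks like, here is a minimal sketch using plain Gearman::Worker; the real setup uses Gearman::Driver to manage worker processes, and the function name and collection code below are hypothetical stand-ins:

use strict;
use warnings;
use Gearman::Worker;

# hypothetical stand-in for the existing telnet collection code
sub collect_via_telnet {
    my $ip = shift;
    # real code would log into the CPE, scrape counters and return them
    return "ip=$ip uptime=42";
}

my $worker = Gearman::Worker->new;
$worker->job_servers('127.0.0.1:4730');          # default gearmand port

$worker->register_function( poll_device => sub {
    my $job = shift;
    return collect_via_telnet( $job->arg );      # job argument is the device IP
} );

$worker->work while 1;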

For the web interface, I decided to use Mojolicious. But to make it work with Gearman, I wrote MojoX::Gearman, which allows me to invoke gearman functions directly from Mojolicious. In fact, all functionality of the web interface is implemented as Gearman workers, even querying the database :-)

I had an interesting problem at hand today: a directory with an Informix dump in UNL format from which I had to extract data for migration into a new non-relational system (into MARC format, not into NoSQL, btw). The idea was simple: import the dump back into a relational database, write SQL queries which produce the data, and use that. However, the SQL standard doesn't really allow us to relax and expect everything to work. In fact...

Step 1: import into PostgreSQL

The first idea was to use my favorite database, PostgreSQL, and import the data into it. The first problem was the schema file, which used DATETIME HOUR TO MINUTE, which I decided to convert into TEXT. There was another column with only a date, so I would have to mangle this using SQL anyway.

But then I hit several roadblocks:
ERROR:  insert or update on table "xxx" violates foreign key constraint "xxx_fkey"
ERROR:  invalid input syntax for integer: ""
ERROR:  invalid input syntax for type date: ""
ERROR:  value too long for type character(xx)
ERROR:  date/time field value out of range: "13.02.1997"

They are all somewhat worrying for a system which maintains your data, but I couldn't really influence the quality of data in the dump files from Informix, so I decided to try something which is more relaxed about errors like this...

Step 2: relax import using MySQL

Well, most of the invalid input syntax should be ignored by MySQL; however:

ERROR 1074 (42000) at line 55: Column length too big for column 'xxx' (max = 255); use BLOB or TEXT instead
was a showstopper. I really didn't want to hand-tune the schema just to create throw-away queries to export data.

Step 3: SQLite - it takes anything!

In the process, I learned that I can't really blindly import the data, and that the format uses a backslash at the end of line for multi-line values, so I decided to write a small perl script which imports Informix UNL dumps directly into SQLite.

I'm generating INSERT INTO table VALUES (...) SQL directly, so you could easily modify this to run on some other database or just produce SQL dumps. For import speed, I'm creating the temporary database in /dev/shm. This keeps sqlite3 CPU-bound rather than disk-bound during the import, and the whole database is just 50Mb (the UNL dumps are 44M, so that's very reasonable).
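
A much-simplified sketch of that approach (hypothetical table and file names, assuming the tables already exist, and ignoring backslash-escaped pipes inside values, which the real script has to handle):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=/dev/shm/informix.sqlite',
    '', '', { RaiseError => 1, AutoCommit => 0 } );

my ( $table, $unl ) = ( 'items', 'items.unl' );
open my $fh, '<', $unl or die "$unl: $!";

my $row = '';
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ s/\\$// ) {       # backslash at end of line continues the value
        $row .= $line . "\n";
        next;
    }
    $row .= $line;
    my @cols = split /\|/, $row, -1;
    pop @cols;                      # UNL rows end with a trailing pipe
    $dbh->do(
        "INSERT INTO $table VALUES (" . join( ',', ('?') x @cols ) . ')',
        undef, @cols
    );
    $row = '';
}
$dbh->commit;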

Not bad for less than 100 lines of perl code: a working Informix UNL loader into SQLite!

Can MySQL be slow? Yes it can. It depends on disk speed (even if you have fancy SSD drives), and some tuning can cut query time dramatically, especially if you are currently using default settings and/or have accumulated several gigabytes of data. Sure, there is documentation about optimizing MySQL, and a bunch of information on the Internet, but where to start?

Tune MySQL parameters

For a start, download mysqltuner.pl or tuning-primer.sh and run them. While they won't give you exactly the same recommendations, they will nudge you in the right direction.

I would especially recommend setting innodb_buffer_pool_size to some reasonable value. This will provide an immediate performance improvement, especially if you are running with the default Debian settings and have a couple of gigs of data.

Find slow queries

But linking to scripts wouldn't be worth a blog post. The next step is to find which queries are slow on your system. To do that, you can use Maatkit, or more specifically mk-query-digest. You can feed it /var/log/mysql/mysql-slow.log or run it on a live system:

$ wget maatkit.org/get/mk-query-digest
$ perl mk-query-digest --processlist localhost --interval 0.01

This will create considerable load on your system (since we are using an interval of 0.01 seconds to capture all queries running longer than that), but after you press Ctrl+C you will have a good idea which queries need a speedup. You can also take a look at Baron Schwartz's video from OpenSQLCamp 2009 or the presentation about mk-query-digest from PgEast 2010 if you want to know more. Hint: it supports MySQL (slow query log, binlog, genlog or tcpdump), HTTP, PostgreSQL and memcached.

If you can modify the application, dig into the slow queries. In my case, I found out that session handling in Koha takes more than 75% of all MySQL query time. I decided to leave this for a future improvement. But if you can't or don't want to modify the application, simply move on to the next topic.

Tune Linux

Yoshinori Matsunobu gave a great presentation, Linux Performance Tuning and Stabilization Tips, at the recent MySQL Conference & Expo, in which he covered various ways to optimize Linux for MySQL workloads. I will summarize here just a few parts which I found useful for my workload, which involves multiple LXC containers, each with its own logical volume using btrfs and sitting on a battery-backed RAID controller.

  • turn off readahead
    hdparm -a 0 /dev/sdb
    
  • use noop or deadline scheduler for block device on which is logical volume with MySQL
    echo deadline > /sys/block/sdb/queue/scheduler
    
  • reduce swappiness
    echo 0 > /proc/sys/vm/swappiness
    
  • disable write barriers (only with battery backed controller!)
    mount /mnt/koha -o remount,nobarrier
    

Results

So did all of this make sense? It's best to have a clear goal when optimizing, and mine was a web response in less than 3 seconds. Without any MySQL tweaking, with a 3Gb database, it was taking 5 seconds or more for each response. To my amazement, without any modification to the application itself, I managed to produce a visible improvement, even if the only real comment I got was: It's not slow today :-)

I had to create 10 sequences in our Koha installation, which uses MySQL. To my horror, MySQL doesn't have CREATE SEQUENCE (insert MySQL vs PostgreSQL rant here). Even worse, the MySQL manual suggests using LAST_INSERT_ID, which would mean that I would have to create 10 tables just to track my sequences.

There is a better solution (based on the StackOverflow question Emulating a transaction-safe SEQUENCE in MySQL), outlined below:

create table ffzg_zs_seq (
        name varchar(2) unique not null,
        current integer unsigned not null
);

insert into ffzg_zs_seq values ('PA',100000);
insert into ffzg_zs_seq values ('PB',100000);
insert into ffzg_zs_seq values ('PC',100000);
insert into ffzg_zs_seq values ('PD',100000);
insert into ffzg_zs_seq values ('PE',100000);
insert into ffzg_zs_seq values ('DD',100000);
insert into ffzg_zs_seq values ('MR',100000);
insert into ffzg_zs_seq values ('DR',100000);
insert into ffzg_zs_seq values ('FO',100000);
insert into ffzg_zs_seq values ('SE',100000);

delimiter |
         
create function ffzg_zs_nextval( seq_name varchar(2) )
returns integer unsigned
begin        
        update ffzg_zs_seq set current = ( @next_val := current + 1 ) where name = seq_name ;
        return @next_val;
end|
    
delimiter ;

This will create a single table to hold all sequences and set all values to 100000 (keep in mind that the first usable number is 100001). Since I already have data in the database, I needed the following snippet of SQL to set the sequences to existing values:
update ffzg_zs_seq
set current=(
        select  
                max(substring_index(itemcallnumber,' ',-1))
        from items
        where substring_index(itemcallnumber,' ',1) = ffzg_zs_seq.name
);

As you might have guessed by now, I'm packing the sequence name and number into a single field, items.itemcallnumber, so I had to use substring_index to extract the existing values.

Usage is simple, almost as good as built-in sequences:

mysql> select ffzg_zs_nextval('DD');
+-----------------------+
| ffzg_zs_nextval('DD') |
+-----------------------+
|                100178 | 
+-----------------------+
1 row in set (0.00 sec)
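
From perl (for example somewhere in Koha), getting the next number is a single query; the connection parameters below are made up:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:database=koha', 'koha', 'secret', { RaiseError => 1 } );
my ($next) = $dbh->selectrow_array( q{ SELECT ffzg_zs_nextval('DD') } );
print "next DD number: $next\n";   # e.g. 100179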

I have been following the MySQL saga with Sun and Oracle for quite some time, mostly because we decided to go mainstream with our Koha installation and use MySQL instead of PostgreSQL, whose support is still experimental.

It seems that MariaDB is the future of MySQL, if you trust Monty. There is also Drizzle, but it's different enough that it's not considered a drop-in replacement for MySQL. So, now that I have some hope that MySQL is here to stay, I needed a quick way to analyze my performance on it.

And there is a great way to get an overview of your slow query log: mk-query-digest. It's part of Maatkit, which is a very interesting and useful project for MySQL DBAs. So, with something like:

dpavlin@mjesec:~$ wget -q http://maatkit.org/get/mk-query-digest
dpavlin@mjesec:~$ perl mk-query-digest /var/log/mysql/mysql-slow.log

you will get a nice report about your queries. Very useful.

As you might know by now, I was really struck by the simplicity of CouchDB at last year's OSCON. Since then, we have gotten couchapp, which is a great idea: hosting CouchDB views on the file system for easy maintenance.

So good, in fact, that I re-wrote couchapp in perl as design-couch.pl. I needed to deviate a bit from the original design (one _design document per application) because the PXElator data which I store in CouchDB is... data...

I was introduced to relational databases back at university, and since then I have been using PostgreSQL for all my database needs. But this time I have dumps from commands, SOAP interfaces, syslog messages and arbitrary audit events generated all over the code. I didn't want to think about structure up-front, and the View Cookbook for SQL Jockeys convinced me I don't have to, but I decided to make a few simple rules to get me started:

  • Create URLs using human-readable timestamps (yyyy-mm-dd.HH:MM:SS.package.ident), which allows easy parsing in JavaScript (if needed) and ensures that all entries are sorted by timestamp
  • Augment each document with a single new key, package (the perl keyword at the top of each module). It will have the sub-keys time (a decimal time-stamp), name (of the package), caller (the sub which called CouchDB::audit) and line (from which it was called); an example document is sketched below
  • Use a single _design document for the output from one package (which is a directory on the file system), just because it is easily browsable in Futon.
[screenshot: CouchDB_select_view.png]
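
Concretely, an audit document following these rules might look roughly like this (shown as a perl structure; all values are made up, and only the keys used by the views below are assumed):

# hypothetical document _id: "2009-08-30.11:22:33.amt.power_state"
my $doc = {
    ip               => '172.16.10.16',
    SystemPowerState => '5',
    package          => {
        time   => 1251624153.123,     # decimal time-stamp
        name   => 'amt',              # perl package name
        caller => 'power_state',      # sub which called CouchDB::audit
        line   => 42,                 # line from which it was called
    },
};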

So, are those rules enough to forget about relational algebra and relax on the couch? Let's take a look at _design/amt/ip,SystemPowerState-count. I applied an almost SQL-ish naming convention here: input column names separated by commas, then a dash, then the output column(s).

Map

function(doc) {
  if ( doc.package.name == 'amt'
    && doc.package.caller == 'power_state' )
    emit([doc.ip, doc.SystemPowerState], 1);
}

Reduce

function (k,v) {
  return sum(v)
}

When run, this map/reduce query produces a result like this:

Key                         Value
["172.16.10.200", null]     4
["172.16.10.16", "5"]       21
["172.16.10.16", "0"]       9
["172.16.10.16", null]      8
["10.60.0.196", "0"]        6
["10.60.0.195", "256"]      8
["10.60.0.194", "256"]      11
["10.60.0.193", "256"]      3

So far, so good. But what if I wanted to average all ping round trip times for each ip?

If you were using SQL, the answer would be:

select ip,avg(rtt) from ping group by ip

However, the evil rereduce rears its head here:

Map

function(doc) {
  if ( doc.package.name == 'ping' )
   emit(doc.ip, doc.rtt)
}

Reduce

function (k,values,rereduce) {
  if (rereduce) {
    var total_sum = 0;
    var total_length = 0;
    for (var i = 0; i < values.length; i++) {
      total_sum += values[i][0];
      total_length += values[i][1];
    }
    return [total_sum, total_length];
  } else {
    return [sum(values), values.length];
  }
}

Since reduce is called incrementally, we can't average averages. We need to collect the total sum and the number of elements and perform the final computation on the client (see the sketch after the table below):

Key                     Value
"193.198.212.4"         [779.0038585662847, 9]
"193.198.212.228"       [902.6305675506585, 10]
"192.168.1.61"          [906.698703765869, 11]
"192.168.1.34"          [995.9852695465088, 11]
"192.168.1.3"           [316.55669212341303, 6]
"192.168.1.20"          [506.162643432617, 8]
"192.168.1.2"           [473.91605377197277, 11]
"192.168.1.13"          [649.2500305175784, 11]
"172.16.10.10"          [49.9579906463623, 1]
"172.16.10.1"           [250.78511238098127, 15]
"127.0.0.1"             [62.57653236389161, 16]
"10.60.0.94"            [81.6218852996826, 2]
"10.60.0.93"            [186.49005889892578, 6]
"10.60.0.92"            [386.7535591125485, 5]
"10.60.0.91"            [1070.863485336304, 9]
"10.60.0.90"            [428.4689426422117, 10]

If you manage to wrap your head around this, you are ready to dive into CouchDB.