Results matching “PostgreSQL”

My mind is just too accustomed to RDBMS engines to accept that I can't have GROUP BY in my shell pipes. So I wrote one: groupby.pl.


Aside from the fact that it looks somewhat like Perl golfing (which I'm somewhat proud of), let's see how it works:


dpavlin@llin:~/private/perl$ ps axv | ./groupby.pl 'sum:($6+$7+$8),10,count:10,min:($6+$7+$8),max:($6+$7+$8)' | sort -k1 -nr | head -10 | align
440947 /usr/lib/iceweasel/firefox-bin 1 440947 440947
390913 /usr/sbin/apache2 11 22207 39875
180943 /usr/bin/X 1 180943 180943
135279 /usr/bin/pinot-dbus-daemon 1 135279 135279
122254 mocp 2 25131 97123
84887 pinot 1 84887 84887
78279 postgres: 5 10723 21971
70030 /usr/bin/perl 6 6959 15615
50213 /bin/bash 7 6351 7343
49266 /usr/lib/postgresql/8.2/bin/postgres 2 24631 24635

This displays the total memory usage per process, its name, the number of such processes, and the range of memory usage. We could then use our old friend sum.pl to produce a console graph, but I have already written about that.
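The script itself is not reproduced here, but the core idea — aggregating whitespace-separated pipe input by a key column, SQL-GROUP-BY style — can be sketched in a few lines of Python. This is a rough sketch, not the actual groupby.pl: it hard-codes one aggregated expression (a sum of columns) and the fixed sum/key/count/min/max output shape just mirrors the example above.

```python
from collections import defaultdict

def group_by(lines, key_col, val_cols):
    """GROUP BY for shell pipes: the key is the 1-based column key_col,
    the aggregated value is the sum of the 1-based columns in val_cols."""
    groups = defaultdict(list)
    for line in lines:
        f = line.split()
        try:
            val = sum(float(f[c - 1]) for c in val_cols)
            key = f[key_col - 1]
        except (IndexError, ValueError):
            continue  # skip lines that don't parse (e.g. headers)
        groups[key].append(val)
    # one output row per group: sum, key, count, min, max
    return [(sum(v), k, len(v), min(v), max(v)) for k, v in groups.items()]
```

Feeding `ps axv` output into this with `key_col=10` and `val_cols=[6, 7, 8]`, then printing the tuples sorted in reverse, would approximate the output above.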


So, let's move on to another example, this time for OpenVZ. Let's see how much memory each virtual machine is using (and get the number of processes for free):



$ vzps -E axv --no-headers | ./groupby.pl 'sum:($7+$8+$9*1024),1,count:1'
2209504127 0 265
611768242 212024 38
162484775 212037 19
170797534 212052 38
104853258 212226 26
712007227 212253 21

But wouldn't it be nice to display hostnames instead of VEID numbers? We can, using the --join and --on options (which are really backticks on steroids):

$ vzps -E axv --no-headers | ./groupby.pl 'sum:($7+$8+$9*1024),1,count:1' --join 'sudo vzlist -H -o veid,hostname' --on 2
2146263206 0 259
675835528 saturn.ffzg.hr 40
162484775 arh.rot13.org 19
170797534 koha-dev.rot13.org 38
104853258 koha.ffzg.hr 26
712011323 zemlja.ffzg.hr 21
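The join logic itself is simple enough to sketch: run the side command, build a key-to-replacement map from the first two columns of its output, and substitute the --on column in every row. A hedged approximation in Python — the function name and details are mine, not groupby.pl's:

```python
import subprocess

def join_on(rows, lookup_cmd, on_col):
    """Replace the 1-based column on_col in each row with the value the
    shell command lookup_cmd maps it to ("backticks on steroids").
    Keys without a match are left untouched."""
    out = subprocess.run(lookup_cmd, shell=True, capture_output=True,
                         text=True).stdout
    mapping = {}
    for line in out.splitlines():
        parts = line.split(None, 1)  # first column is the key, rest is the value
        if len(parts) == 2:
            mapping[parts[0]] = parts[1].strip()
    joined = []
    for row in rows:
        f = row.split()
        f[on_col - 1] = mapping.get(f[on_col - 1], f[on_col - 1])
        joined.append(" ".join(f))
    return joined
```

Note that VEID 0 (the host itself) has no hostname in `vzlist` output, which is why it stays a bare `0` in the joined output above.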

Which brings us to the final result:

$ vzps -E axv --no-headers | ./groupby.pl 'sum:($7+$8+$9*1024),1,count:1' --join 'sudo vzlist -H -o veid,hostname' --on 2 | sort -rn | align | ./sum.pl -h
0 260 2105M OOOOOOOOOOOOOOOOOOO 2105M
zemlja.ffzg.hr 21 679M OOOOOO------------------- 2784M
saturn.ffzg.hr 35 512M OOOO-------------------------- 3296M
koha-dev.rot13.org 38 162M O------------------------------ 3459M
arh.rot13.org 19 154M O-------------------------------- 3614M
koha.ffzg.hr 26 99M ---------------------------------- 3714M

So, there you have it: an SQL-like query language for your shell pipes.

Why does the console refuse to die?
- It's because of pipes!
This post is the result of my long addiction to console applications. Somehow, when I want a quick view of things on my system, I always turn to pipes and do something with them. In the process, I have developed a few useful scripts for use within shell pipes, and I would like to introduce my readership to them.

PostgreSQL database size

When I want to see the size of all databases on my system, or the size of the tables in one database, I turn to pg_size. It's a short and sweet script which does a little shell magic (take a look inside it) and displays the size of all databases on the system (without any options), or the size of all tables in a database along with their row counts (when used as pg_size database_name), like this:
dpavlin@llin:~$ pg_size dellstore2 | grep -v sql_
4890624 customers 20000
3153920 orderlines 60350
2678784 cust_hist 60350
991232 products 10000
966656 orders 12000
450560 inventory 10000
8192 categories 16
0 reorder 0
This is all well and good, but it doesn't really give us the right overview, so let's move along to...

Nice console graphs

First, a caveat: this tool assumes that it will get a number, a space, and an optional description. The output above seems to fit that format, so let's try it:
dpavlin@llin:~$ COLUMNS=80 pg_size dellstore2 | grep -v sql_ | sum.pl -h
customers 20000    4776k OOOOOOOOOOOOOOOO                                 4776k
orderlines 60350   3080k OOOOOOOOOO-----------------                      7856k
cust_hist 60350    2616k OOOOOOOOO---------------------------               10M
products 10000      968k OOO-------------------------------------           11M
orders 12000        944k OOO----------------------------------------        12M
inventory 10000     440k O-------------------------------------------       12M
categories 16      8192b ---------------------------------------------      12M
reorder 0              0 ---------------------------------------------      12M
This gives us nice output: the description (followed by the number of rows from the output above), the size, a bar graph, and a running total in human-readable form (if you don't like that, remove the -h flag and you will get raw numbers).
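For the curious, the bar logic can be approximated in a few lines: the 'O' part of each bar is this row's share of the grand total, and the dashes extend the bar out to the running total's share. A sketch (not the real sum.pl, which also handles the -h human-readable formatting and reading from a pipe):

```python
def console_graph(items, width=45):
    """Render (value, label) pairs sum.pl-style: 'O's for this row's
    share of the grand total, '-'s out to the running total's share,
    followed by the running total itself."""
    total = sum(v for v, _ in items) or 1
    lines, running = [], 0
    for value, label in items:
        running += value
        own = round(width * value / total)    # this row's share of the bar
        cum = round(width * running / total)  # cumulative share of the bar
        bar = "O" * own + "-" * max(cum - own, 0)
        lines.append("%-20s %10d %-*s %10d" % (label, value, width, bar, running))
    return lines
```

Because the bar is cumulative, the last row's dashes always reach (nearly) the full width, which is what makes the output read as a running breakdown of the total.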

Let's take another example (if you are still reading this and not just interested in PostgreSQL database sizes). Let's see how much traffic pppd transferred over a very slow GPRS link during an 8-day vacation:

dpavlin@llin:~$ grep 'pppd.*: Sent' /var/log/messages | awk '{ print $7 + $10 " " $1 " " $2 }' | sum.pl -h
May 5         0                                                               0
May 5       39k                                                             39k
May 5     7512k OO                                                        7551k
May 6     6352b --                                                        7558k
May 6       20k --                                                        7579k
May 6     1183k ---                                                       8762k
May 8     6869k OO---                                                       15M
May 8       70k -----                                                       15M
May 9     3596k O------                                                     18M
May 9     1998k -------                                                     20M
May 10      32M OOOOOOOOOOOO--------                                        53M
May 10      13k --------------------                                        53M
May 11      44M OOOOOOOOOOOOOOOOO--------------------                       98M
May 12      12M OOOO--------------------------------------                 111M
May 13    7120k OO-------------------------------------------              118M
May 13      20M OOOOOOO----------------------------------------------      139M
Much more interesting! A long time ago, I had a bunch of quick one-liners which used sum.pl to produce output from various other system counters, but they somehow got lost.

As I get only a few comments on my blog, leave one if you find this useful. I have a few other examples, like this one which shows the top 5 memory eaters on my system:

dpavlin@llin:~$ ps v | awk '{ print $8 " " $9 " " $10 }' | sort -rn | ~/private/perl/sum.pl | head -5
# RSS %MEM COMMAND
10.6 /usr/lib/iceweasel/firefox-bin 165092 OOOOOOOOOOOOOO                165092
4.3 perl                             67240 OOOOOO---------------         232332
0.5 awesome                           8504 ---------------------         240836
0.4 irssi                             6632 ----------------------        247468
0.3 vi                                5888 ----------------------        253356
But if this is of no interest to my readership, tell me so, and I will stop spamming your already-full RSS reader with console output! :-)

Today at Razmjena vještina I gave a marathon four-hour lecture which I hope was at least somewhat useful. Unfortunately, we didn't get into as much detail as I would have liked, but if nothing else I once again used pgestraier (which somehow indexes too slowly; I will have to look into why) and another handy little project I wrote last year for students in Zadar, pg-getfeed, which is really a small Perl stored procedure that lets you run SQL queries against RSS feeds.

Initially created in 2006, this handy tool is best described by its original commit message:

IRC bot which replace human memory

Here is a quick run-down of the available features:

  • web archive with search
  • irc commands: last, grep/search, stat, poll/count
  • tags// in normal irc messages (tagcloud, filter by tag, export as RSS feed)
  • announce /me messages to Twitter (yes, lame, but that was a year ago)
  • tags are available as html links for embedding (in wikis)
  • RSS feed from messages with tags (also nice for embedding)
  • irssi log import (useful for recovery in case of failure of machine or service :-)
  • announce new messages from RSS feeds (nice for wiki changes, blog entries or commits)

It has grown quite a bit from the initial vision of recalling last messages on the web (and it does go through some hoops to produce a nice web archive). Adding tags allowed easy recall of interesting topics, and in a way it now provides a central hub for different content connected to IRC.

It's written in Perl using POE, and it's probably not the best example of POE usage. It is also somewhat PostgreSQL-specific, but it works well for our small community on the #razmjenavjestina IRC channel. Since I have seen some interest in it, this blog post might serve as an announcement of its existence.

I will probably add some documentation to its wiki page and add real multi-channel support (most of the code is there, but the web archive needs filtering by channel). If you would like to /invite it to your channel, drop me a note.

status line for dwm

I will be working on battery power for most of next week, so I spent some time tweaking my setup. I have been running 2.6.21 because of my efforts to make fglrx play nice with CONFIG_PARAVIRT, so I already had the tickless kernel needed for powertop. To my horror, most of the interrupts on my laptop were created by PostgreSQL (which I will stop when on battery) and ACPI! And the ACPI ones were there because I'm using dwm with a primitive shell grep/sed pipe to produce the status line.

So, polling ACPI every 3 seconds (which is a reasonable refresh time for me, even for the clock) is too much. And then I started dreaming: network traffic would be nice. And disk! Battery status when (dis)charging goes without saying. So, in the end, I also added temperature, and got something like this (when on power):

2007-05-27 22:15:40 | 0.30 0.13 0.06 |   5M D 1k   |  32b > 54b  | 59 C
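The collection side is just greps over /proc and acpi output, but the assembly of the line can be sketched. This is my reading of the fields in the sample — timestamp, load averages, disk read/write, network rx/tx, temperature — and the helper names below are mine, not the actual script's:

```python
def human(n):
    """Abbreviate a byte count the way the status line does: 5M, 1k, 32b."""
    for suffix, factor in (("M", 1 << 20), ("k", 1 << 10)):
        if n >= factor:
            return "%d%s" % (n // factor, suffix)
    return "%db" % n

def status_line(now, load, disk_rw, net_rw, temp_c):
    """Assemble a dwm-style status line from already-collected numbers."""
    return "%s | %s | %s D %s | %s > %s | %d C" % (
        now,
        " ".join("%.2f" % x for x in load),
        human(disk_rw[0]), human(disk_rw[1]),
        human(net_rw[0]), human(net_rw[1]),
        temp_c,
    )
```

In a real dwm setup, the resulting string would typically be fed to `xsetroot -name` so dwm can pick it up as the status text.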

The Perl code is nice and short, but completely broken when estimating charging time (how does the acpi command-line tool calculate that?).

When started as a stand-alone utility, it can serve as a poor man's replacement for dstat :-)

I just finished a set of shell scripts which test failover and replication using Slony-I and PostgreSQL.

It's somewhat Debian-specific (it uses pg_ctlcluster to control PostgreSQL instances), but other than that it should work on any platform. This is essentially a codification of the HA testing requirements I had to satisfy, and I hope this automated demonstration will be valuable to others testing their own specific requirements. Without this throwaway model, I wouldn't be able to test production database failover, because there are just too many steps which have to be done in a specific order.

So, my fascination with testing expanded from development to deployment.


I'm somewhat proud to announce that the current version of pgestraier now includes consistency triggers which keep the Hyper Estraier index up-to-date with the data in your database.

That, coupled with the ability to create full-text indexes easily just by running a helper script on the database, makes pgestraier a powerful solution if you need fast full-text indexing with the ability to off-load searching to another machine (thanks to Hyper Estraier's P2P architecture), or if you need perfect N-gram search results.

This project might also help people who are porting applications that use MySQL full-text search to PostgreSQL (actually, it is going to be used for exactly that).

The real trunk of development is in Subversion; the CVS repository at pgFoundry is just a mirror. Enjoy it while I prepare to leave for the seaside.

Search::Estraier will be a Perl API for Hyper Estraier. It's written using the excellent Inline::C.

The grand plan is to use pgestraier from PostgreSQL to query the Estraier index and Search::Estraier to create the index. This would make it possible to combine structured data in an RDBMS with semi-structured data in a full-text index. Additional normalized tables can be created using materialized views in PostgreSQL, and if all goes well, this will become part of WebPAC version 2, which will be a universal hybrid (structured/full-text) storage.

Update: I was lucky enough to find already-working Perl bindings for Hyper Estraier on MATSUNO Tokuhiro's blog. Thanks a lot! So, work on Search::Estraier is suspended.

My pet project pgswish is on pgFoundry. It allows integration of search results from a swish-e index with structured data from PostgreSQL. It's a PostgreSQL function which returns the swish-e search result as multi-row data.

However, I also posted a message to the swish-e mailing list, and Bill Moseley had some great comments. If you know how to write PostgreSQL C functions, I would be grateful for another pair of eyes on this project; my C is rusty.