October 2009 Archives

Your web site is slow. What do you do? A quick grep over access.log shows a flood of various crawling bots:

koha:~# grep bot /var/log/apache2/other_vhosts_access.log | tail
koha.ffzg.hr:80 71.181.32.73 - - [28/Oct/2009:14:16:04 +0100] "GET /cgi-bin/koha/opac-search.pl?q=su:in%C5%BEenjerstvo HTTP/1.1" 200 6579 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:09 +0100] "GET /cgi-bin/koha/opac-detailprint.pl?biblionumber=127887 HTTP/1.1" 200 849 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:14 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=81565 HTTP/1.1" 200 4240 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:19 +0100] "GET /cgi-bin/koha/opac-showmarc.pl?id=164101 HTTP/1.1" 200 1675 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:24 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=38700 HTTP/1.1" 200 4343 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 71.181.32.73 - - [28/Oct/2009:14:16:28 +0100] "GET /cgi-bin/koha/opac-search.pl?q=su:iracionalno HTTP/1.1" 200 5140 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:30 +0100] "GET /cgi-bin/koha/opac-ISBDdetail.pl?biblionumber=73120 HTTP/1.1" 200 3255 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:35 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=53576 HTTP/1.1" 200 4466 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:40 +0100] "GET /cgi-bin/koha/opac-showmarc.pl?id=85958 HTTP/1.1" 200 1691 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
koha.ffzg.hr:80 66.249.65.156 - - [28/Oct/2009:14:16:45 +0100] "GET /cgi-bin/koha/opac-detail.pl?biblionumber=91684 HTTP/1.1" 200 4477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Every 5 seconds? This impacts our performance, especially during working hours. So, for a start, let's teach bots which URLs are really interesting. Create a robots.txt, which all good crawlers should support:
koha:~# cat /var/www/robots.txt 
User-Agent: *
Crawl-Delay: 300
Disallow: /cgi-bin/koha/opac-search.pl
Disallow: /cgi-bin/koha/opac-showmarc.pl
Disallow: /cgi-bin/koha/opac-detailprint.pl
Disallow: /cgi-bin/koha/opac-ISBDdetail.pl

User-Agent: MJ12bot
Crawl-Delay: 300
...and configure it using a rule like this (if your web root isn't /var/www):
koha:~# cat /etc/apache2/conf.d/robots.conf 
Alias /robots.txt /var/www/robots.txt
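To sanity-check which URLs such rules block, a prefix match against the Disallow lines is enough. The following is a minimal sketch and not a full robots.txt parser: the function name is_disallowed is made up for illustration, and it only looks at the User-Agent: * group.

```shell
#!/bin/sh
# is_disallowed ROBOTS_FILE URL_PATH   (hypothetical helper, for illustration)
# Exit status 0 if URL_PATH starts with any Disallow prefix listed in the
# "User-Agent: *" group of ROBOTS_FILE, 1 otherwise.
is_disallowed() {
    awk -v path="$2" '
        BEGIN { in_star = 0; hit = 0 }
        # a new User-Agent line starts a new group
        tolower($1) == "user-agent:" { in_star = ($2 == "*") }
        # inside the * group, compare each Disallow prefix with the path
        in_star && tolower($1) == "disallow:" && $2 != "" {
            if (index(path, $2) == 1) hit = 1
        }
        END { exit (hit ? 0 : 1) }
    ' "$1"
}
```

For example, is_disallowed /var/www/robots.txt /cgi-bin/koha/opac-search.pl && echo blocked should print blocked with the file above, while an opac-detail.pl URL should not.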
The robots.txt approach would work, but Googlebot doesn't honor Crawl-Delay (sigh!). So we have to hop over to Google's webmaster tools - Site config - Settings and ask Googlebot nicely to slow down: google-crawl-rate.png

However, that's not enough, because it takes a while for Googlebot to update its crawl speed, and our site is heavily loaded right now! Let's limit the bandwidth available to all crawlers (identified by bot in the User-Agent header).

First, install mod-bw:

koha:/# apt-get install libapache2-mod-bw

koha:/# a2enmod bw
Enabling module bw.
Run '/etc/init.d/apache2 restart' to activate new configuration!
Edit your configuration to have something like the following (very draconian) policy:
        BandwidthModule On
        ForceBandWidthModule On

        BandWidth "u:bot" 8192
        MaxConnection "u:bot" 1

        BandWidth "u:wget" 8192
        MaxConnection "u:wget" 1
It seems that you have to set BandWidth to a value larger than BandWidthPacket (which is 8192 by default) for it to have any effect.
This will limit download speed for all crawlers (including wget for testing) to 8 KB/s and allow each of them only one connection at a time; requests over that limit should be answered with 503. So, does this work?
koha:~# cat /var/log/apache2/other_vhosts_access.log | grep ' 503 ' | cut -d\" -f6 | sort | uniq -c
    107 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
     61 Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)
     54 Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
This is from just the first two hours of operation, so it does help.
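To check the "every 5 seconds" observation before and after such a change, the gaps between consecutive requests of one crawler can be pulled out of the same log. This is a rough sketch: request_gaps is a made-up name, it assumes the vhost log format shown above (timestamp in the fifth whitespace-separated field) and it ignores midnight rollover.

```shell
#!/bin/sh
# request_gaps PATTERN   (hypothetical helper, for illustration)
# Reads access log lines on stdin, keeps those matching PATTERN and prints
# the number of seconds between each matching request and the previous one.
request_gaps() {
    grep -- "$1" | awk '
        {
            # $5 looks like [28/Oct/2009:14:16:09
            split($5, t, ":")          # t[2]=hour, t[3]=minute, t[4]=second
            now = t[2] * 3600 + t[3] * 60 + t[4]
            if (NR > 1) print now - prev
            prev = now
        }'
}
```

Something like request_gaps Googlebot < /var/log/apache2/other_vhosts_access.log | sort -n | uniq -c then gives a quick histogram of crawl intervals.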

It seems that the transition to digital television here forced me to refresh my knowledge of channel scanning, and it seems I didn't write it down last time, so here are quick notes for next time.

First, I somehow had to find the new frequencies, so I hopped over to the site of the national television and converted them to the following format for scan, which is part of dvb-apps:

T 754000000 8MHz 2/3 NONE AUTO 8k 1/8 NONE
T 642000000 8MHz 2/3 NONE AUTO 8k 1/8 NONE
T 690000000 8MHz 2/3 NONE AUTO 8k 1/8 NONE
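Generating such lines from a list of frequencies is easy to script. The sketch below assumes the published frequencies are given in MHz and simply reuses the modulation parameters from the three lines above; tuning_line is a made-up helper name.

```shell
#!/bin/sh
# tuning_line FREQ_MHZ   (hypothetical helper, for illustration)
# Prints one initial tuning line for scan, with the frequency in Hz and
# the remaining parameters copied from the lines above.
tuning_line() {
    printf 'T %d000000 8MHz 2/3 NONE AUTO 8k 1/8 NONE\n' "$1"
}

# the three frequencies used above, in MHz
for mhz in 754 642 690; do
    tuning_line "$mhz"
done
```

If your transmitter uses different parameters (bandwidth, FEC, guard interval), those fields have to be changed to match.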
The first frequency (754 MHz) carries an HDTV stream which doesn't really work for me, so I ended up with the following ~/.mplayer/channels.conf for mplayer, as produced by scan:
#HDTV Promo:754000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_2_3:FEC_2_3:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_8:HIERARCHY_NONE:8000:8001:500
HTV1:642000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:101:102:1
HTV2 Zg:642000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:201:202:2
RTL TV:642000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:301:302:3
NOVA TV:642000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:401:402:4
HTV1:690000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:101:102:1
HTV2 Zg:690000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:201:202:2
RTL TV:690000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:301:302:3
NOVA TV:690000000:INVERSION_AUTO:BANDWIDTH_8_MHZ:FEC_3_4:FEC_1_2:QAM_64:TRANSMISSION_MODE_8K:GUARD_INTERVAL_1_4:HIERARCHY_NONE:401:402:4
I commented out the troublesome HDTV stream, and ran mplayer with:
mplayer dvb://
and used h and k keys to change channels.
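Since the same channels show up on two frequencies, a quick listing of what is actually active in channels.conf can help. This is only a sketch: list_channels is a made-up name, and it relies on nothing more than the colon-separated layout visible above (name first, frequency in Hz second, commented lines starting with #).

```shell
#!/bin/sh
# list_channels FILE   (hypothetical helper, for illustration)
# Prints "name frequency MHz" for every line of FILE that is not commented out.
list_channels() {
    grep -v '^#' "$1" | awk -F: '{ printf "%s %d MHz\n", $1, $2 / 1000000 }'
}
```

Run it as list_channels ~/.mplayer/channels.conf to see each channel name next to its frequency.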

As you might know, I despise Flash as a way to deliver video, but you just have to use it if you want to publish videos for the poor M$ excuse for a web browser. I don't like DVD either, mostly because it's encrypted, but let's finish with the rants here.

So, I have a DVD in my hand and need a high quality flv file from it, with readable slides if at all possible. You really don't want to encode the video more than once, since it's already in MPEG format, and even re-encoding twice could make your slides unreadable.

The first step is to make a local copy of the whole disc (which will take 4+ GB), so I can retry encoding without listening to a spinning DVD all the time, and mount it:

$ dd if=/dev/cdrom of=dvd.iso
$ mkdir mnt
$ sudo mount dvd.iso mnt -o loop
$ ls -al mnt/
total 10
dr-xr-xr-x 4 4294967295 4294967295  136 2009-09-30 20:45 .
drwxr-xr-x 5 dpavlin    dpavlin    4096 2009-10-15 22:01 ..
dr-xr-xr-x 2 4294967295 4294967295   40 2009-09-30 17:25 AUDIO_TS
dr-xr-xr-x 2 4294967295 4294967295  560 2009-09-30 18:45 VIDEO_TS
Looks good so far. However, to add insult to injury, the video is split into multiple files (it's a 60 minute lecture):
dpavlin@klin:/rest/iso/Zimbardo$ ls -al mnt/VIDEO_TS/
total 4093324
dr-xr-xr-x 2 4294967295 4294967295        560 2009-09-30 18:45 .
dr-xr-xr-x 4 4294967295 4294967295        136 2009-09-30 20:45 ..
-r--r--r-- 1 4294967295 4294967295      14336 2009-09-30 18:45 VIDEO_TS.BUP
-r--r--r-- 1 4294967295 4294967295      14336 2009-09-30 18:45 VIDEO_TS.IFO
-r--r--r-- 1 4294967295 4294967295     157696 2009-09-30 18:41 VIDEO_TS.VOB
-r--r--r-- 1 4294967295 4294967295      73728 2009-09-30 18:45 VTS_01_0.BUP
-r--r--r-- 1 4294967295 4294967295      73728 2009-09-30 18:45 VTS_01_0.IFO
-r--r--r-- 1 4294967295 4294967295      12288 2009-09-30 18:41 VTS_01_0.VOB
-r--r--r-- 1 4294967295 4294967295 1073565696 2009-09-30 18:42 VTS_01_1.VOB
-r--r--r-- 1 4294967295 4294967295 1073565696 2009-09-30 18:43 VTS_01_2.VOB
-r--r--r-- 1 4294967295 4294967295 1073565696 2009-09-30 18:45 VTS_01_3.VOB
-r--r--r-- 1 4294967295 4294967295  970516480 2009-09-30 18:45 VTS_01_4.VOB
So, I created the following script to help me with it:
#!/bin/sh -x

# all four parts of the lecture, each prefixed with -i
input="`ls mnt/VIDEO_TS/VTS_01_[1234].VOB | sed 's/^/-i /'`"

out_flv="lecture.flv"

# optional extra ffmpeg options (empty by default)
limit=""

format="-ab 48k -ar 44100 -vcodec flv -b 400k -g 160 -cmp 3 -subcmp 3 -mbd 2 -flags aic+cbp+mv0+mv4 -trellis 1 -deinterlace"

# two passes over the same input; stop if either one fails
ffmpeg $input $limit -pass 1 $format -y $out_flv || exit
ffmpeg $input $limit -pass 2 $format -y $out_flv || exit
It does two-pass encoding, preserving most of the audio fidelity while creating a stream which takes about 50 KB/s to stream smoothly. You will notice that I didn't encode VTS_01_0.VOB, which is really just a subtitle stub:
$ ffplay mnt/VIDEO_TS/VTS_01_0.VOB
Input #0, mpeg, from 'mnt/VIDEO_TS/VTS_01_0.VOB':
  Duration: N/A, start: 0.360000, bitrate: N/A
    Stream #0.0[0x1e0]: Video: mpeg2video, yuv420p, 720x576 [PAR 16:15 DAR 4:3], 7000 kb/s, 25 
    Stream #0.1[0x20]: Subtitle: dvdsub
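The streaming rate quoted above follows from the bitrates in the script: 400 kbit/s of video (-b 400k) plus 48 kbit/s of audio (-ab 48k), divided by 8 to get kilobytes. A small sketch of that arithmetic, with a made-up function name and ignoring container overhead:

```shell
#!/bin/sh
# stream_kbytes VIDEO_KBIT AUDIO_KBIT   (hypothetical helper, for illustration)
# Prints the combined stream rate in kilobytes per second (integer division).
stream_kbytes() {
    echo $(( ($1 + $2) / 8 ))
}

stream_kbytes 400 48
```

This prints 56, so "about 50 KB/s" is a slightly optimistic rounding.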

If you are recording and editing video which will later be on the Internet, please don't use fancy transitions. There is no hope for those to look good after re-encoding. This video had a 3D box rotating effect (although a simple fade would be enough) which turned into an ugly blur, but other than that, it's perfectly readable.
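One caveat about feeding several VOB parts to ffmpeg: it generally treats each -i as a separate input rather than joining the files end to end. A common alternative, not used in the script above and not tested against the 2009 build, is ffmpeg's concat: protocol, which takes file names separated by |. The helper below only builds that string; concat_url is a made-up name.

```shell
#!/bin/sh
# concat_url FILE...   (hypothetical helper, for illustration)
# Prints a "concat:file1|file2|..." URL for ffmpeg's concat protocol.
concat_url() {
    url="concat:"
    sep=""
    for f in "$@"; do
        url="${url}${sep}${f}"
        sep="|"
    done
    printf '%s\n' "$url"
}
```

It would be used as ffmpeg -i "$(concat_url mnt/VIDEO_TS/VTS_01_[1234].VOB)" followed by the same encoding options as before, assuming the installed ffmpeg supports the concat protocol.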

I already stated several times that I really love the ZFS file system. Since I deployed it in production for backup and recovery of virtual machines, I have had some intuition about how it works, but a recent conversation on the fuse mailing list turned out to have a hidden gem in it: the keynote speech from Jeff Bonwick and Bill Moore at Kernel Conference Australia 2009. It's a Solaris kernel conference (and the talk has the wrong title in the program: Deduplication in ZFS), so I would never have watched videos from it, but this one is well worth your time if you are interested in ZFS.

If you are wondering about ZFS performance, read VMWare + FreeBSD + ZFS soft-raid with SATA drives - performance. Yes, it's on FreeBSD, which is great because now I can compare performance (in terms of orders of magnitude) on different systems.

I have been using compression on my storage server for a while and saw a dramatic reduction in disk space usage, but didn't do any performance comparisons. On the other hand, we have some idea about zfs-fuse performance on dial SSD, so I will probably have to update to the latest zfs-fuse from Ricardo Correia, who recently moved zfs-fuse development to git.

The main design goal of Sack is to have an interactive environment for querying perl hashes which are bigger than the memory of a single machine.

The implementation uses TCP sockets (over ssh if needed) between perl processes. This allows horizontal scalability both on multi-core machines and across the network to additional machines.

Reading data into a hash is done using any perl module which returns a perl hash and supports offset and limit to select just a subset of the data (this is required to create disjoint shards). Parsing of the source file is done on the master node (called lorry), which then splits it into shards and sends the data to sack nodes.
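How offset and limit carve a dataset into non-overlapping shards can be sketched with a bit of arithmetic. This is shell, not Sack's perl, and the splitting rule (equal-sized shards, with the last one taking the remainder) is an assumption for illustration only; shard_ranges is a made-up name.

```shell
#!/bin/sh
# shard_ranges TOTAL SHARDS   (hypothetical helper, for illustration)
# Prints one "offset limit" pair per shard so that the ranges cover
# records 0..TOTAL-1 exactly once.
shard_ranges() {
    total=$1
    shards=$2
    size=$(( (total + shards - 1) / shards ))   # ceiling division
    offset=0
    while [ "$offset" -lt "$total" ]; do
        limit=$size
        # the last shard may be smaller than the others
        if [ $((offset + limit)) -gt "$total" ]; then
            limit=$((total - offset))
        fi
        echo "$offset $limit"
        offset=$((offset + limit))
    done
}
```

For example, shard_ranges 10 3 prints the pairs 0 4, 4 4 and 8 2. With very small datasets this rule can produce fewer shards than requested.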

Views are small perl snippets which are called for each record on each shard with $rec. Views create data in the $out hash, which is automatically merged on the master node.

You can influence the default shard merge by adding + (plus sign) to the name of your key, to indicate that the key => value pairs below it should have their values summed when combining shards on the master node.
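The summing merge is the familiar "add up counts per key" reduction. As a rough illustration in shell and awk (not Sack's actual perl code), assume each shard emits lines of the form "key value"; merge_sum is a made-up name.

```shell
#!/bin/sh
# merge_sum   (hypothetical helper, for illustration)
# Reads "key value" lines from all shards on stdin and prints one line
# per key with the values summed, sorted by key.
merge_sum() {
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' | sort
}
```

Feeding it the concatenated output of two shards, for example printf 'a 1\nb 2\na 3\n' | merge_sum, yields a 4 and b 2.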

If a view operation generates a huge number of long field names, you might run out of memory on the master node when merging results. The solution is to add # to the name of the key, which will turn key names into integers, which use less memory.

So, how does it look? Below is a small video showing 121887 records spread over 18 cores on 9 machines, running the first few short views and then the largest one on this dataset.

If your browser doesn't support the <video> tag, watch the Sack video on YouTube or using the ttyrec player written in JavaScript.

Source code for Sack is available in my Subversion repository. This is currently the second iteration, which brings a much simpler network protocol (based only on perl objects serialized directly to the socket using Storable) and better support for starting and controlling the cluster (which used to be a shell script).

Update: Sack now has a proper home page at Ohloh and even a playlist on YouTube (which doesn't really like my Theora encoded videos and doesn't have an RSS feed natively).

The following video shows improvements in version 0.11 on a 22 node cloud, hopefully better than the video above.

About this Archive

This page is an archive of entries from October 2009 listed from newest to oldest.
