« PlayStation 3: perfect PowerPC machine for Linux | Main | ZFS does hierarchical management of disk space right »

Sack - sharding memory hash in perl

sack-onion-logo.png Main design goal is to have interactive environment to query perl hashes which are bigger than memory on single machine.

Implementation uses TCP sockets (over ssh if needed) between perl processes. This allows horizontal scalability both on multi-core machines as well as across the network to additional machines.

Reading data into hash is done using any perl module which returns perl hash and supports offset and limit to select just subset of data (this is required to create disjunctive shards). Parsing of source file is done on master node (called lorry) which then splits it to shards and send data to sack nodes.

Views are small perl snippets which are called for each record on each shard with $rec. Views create data in $out hash which is automatically merged on master node.

You can influence default shard merge by adding + (plus sign) in name of your key to indicate that key => value pairs below should have values summed when combining shards on master node.

If view operation generate huge amount of long field names, you might run out of memory on master node when merging results. Solution is to add # to name of key which will turn key names into integers which use less memory.

So, how does it look? Below is small video showing 121887 records spread over 18 cores on 9 machines running first few short views, and than largest one on this dataset.

If your browser doesn't have support for <video> tag, watch Sack video on YouTube or using ttyrec player written in JavaScript.

Source code for Sack is available in my subversion and this is currently second iteration which brings much simpler network protocol (based only on perl objects serialized directly to socket using Storable) and better support for starting and controlling cluster (which used to be shell script).

Update: Sack now has proper home page at Ohloh and even playlist on YouTube (which doesn't really like my Theora encoded videos and doesn't have rss feed natively).

Following video shows improvements in version 0.11 on 22 node cloud hopefully better than video above.

TrackBack

TrackBack URL for this entry:
http://blog.rot13.org/mt/mt-tb.cgi/658

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About

This page contains a single entry from the blog posted on October 6, 2009 2:19 PM.

The previous post in this blog was PlayStation 3: perfect PowerPC machine for Linux.

The next post in this blog is ZFS does hierarchical management of disk space right.

Many more can be found on the main index page or by looking through the archives.

Creative Commons License
This weblog is licensed under a Creative Commons License.