The main design goal is to have an interactive environment for querying Perl hashes which are bigger than the memory of a single machine.
The implementation uses TCP sockets (over ssh if needed) between Perl processes. This allows horizontal scaling both across cores on multi-core machines and across the network to additional machines.
Reading data into the hash is done using any Perl module which returns a Perl hash and supports offset and limit to select just a subset of the data (this is required to create disjoint shards). Parsing of the source file is done on the master node (called lorry), which then splits it into shards and sends the data to sack nodes.
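A minimal sketch of that loader contract (the function name and data are made up for illustration, not Sack's actual API): any code that returns a hash for records in the range [offset, offset + limit) will do, since calls with non-overlapping ranges yield disjoint shards.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory data source standing in for a parsed file.
my @source = map { { id => $_, value => $_ * 10 } } 0 .. 9;

# Loader honouring offset/limit: returns only records in the
# requested range, so non-overlapping ranges give disjoint shards.
sub load_chunk {
    my ( $offset, $limit ) = @_;
    my %shard;
    foreach my $i ( $offset .. $offset + $limit - 1 ) {
        last if $i > $#source;
        $shard{$i} = $source[$i];
    }
    return %shard;
}

# Lorry-style split: 10 records into two shards of 5 for two sack nodes.
my %shard_a = load_chunk( 0, 5 );
my %shard_b = load_chunk( 5, 5 );

print scalar( keys %shard_a ), " + ", scalar( keys %shard_b ), " records\n";
```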
Views are small Perl snippets which are called for each record on each shard, with the current record in $rec. Views create data in the $out hash, which is automatically merged on the master node.
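To make the $rec/$out convention concrete, here is a hypothetical view (the field names and records are invented) together with the per-record loop a shard would run around it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up records standing in for one shard's data.
my @shard = (
    { city => 'Zagreb', pages => 10 },
    { city => 'Rijeka', pages => 5  },
    { city => 'Zagreb', pages => 7  },
);

my $out = {};
foreach my $rec (@shard) {
    # --- the view snippet itself: count records, and per city ---
    $out->{count}++;
    $out->{city}->{ $rec->{city} }++;
}

# Shard-local result; the master would merge this with other shards.
print "total: $out->{count}\n";
print "$_: $out->{city}{$_}\n" for sort keys %{ $out->{city} };
```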
You can influence the default shard merge by adding + (a plus sign) to the name of a key, indicating that the key => value pairs below it should have their values summed when combining shards on the master node.
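A sketch of that merge rule as I understand it (my reimplementation for illustration, not Sack's actual merge code): keys containing + have the values below them summed across shards, while other keys are simply taken as-is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge one shard's $out into the master's accumulated result.
sub merge {
    my ( $merged, $shard ) = @_;
    foreach my $key ( keys %$shard ) {
        if ( $key =~ /\+/ ) {
            # '+' in the key name: sum values under it across shards.
            foreach my $sub ( keys %{ $shard->{$key} } ) {
                $merged->{$key}{$sub} =
                    ( $merged->{$key}{$sub} // 0 ) + $shard->{$key}{$sub};
            }
        } else {
            # default: last shard wins.
            $merged->{$key} = $shard->{$key};
        }
    }
}

my $merged = {};
merge( $merged, { '+pages' => { Zagreb => 10, Rijeka => 5 } } );
merge( $merged, { '+pages' => { Zagreb => 7 } } );

print "Zagreb: $merged->{'+pages'}{Zagreb}\n";
print "Rijeka: $merged->{'+pages'}{Rijeka}\n";
```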
If a view operation generates a huge number of long field names, you might run out of memory on the master node when merging results. The solution is to add # to the name of a key, which turns key names into integers which use less memory.
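The idea behind that trick can be sketched like this (an illustration of the technique, not Sack's actual implementation): long field names are interned into small integers once, and only the integers are stored and merged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Dictionary mapping long field names to small integer ids.
my %name2id;
my @id2name;

sub intern {
    my $name = shift;
    return $name2id{$name} //= do {
        push @id2name, $name;
        $#id2name;    # id of the name just pushed
    };
}

# Keys under '#field' are integers, not long strings.
my $out = {};
$out->{'#field'}{ intern('a_very_long_field_name') }++;
$out->{'#field'}{ intern('another_long_field_name') }++;
$out->{'#field'}{ intern('a_very_long_field_name') }++;

foreach my $id ( sort { $a <=> $b } keys %{ $out->{'#field'} } ) {
    print "$id ($id2name[$id]): $out->{'#field'}{$id}\n";
}
```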
So, how does it look? Below is a small video showing 121887 records spread over 18 cores on 9 machines, first running a few short views and then the largest one on this dataset.
The source code for Sack is available in my subversion repository. This is currently the second iteration, which brings a much simpler network protocol (based only on Perl objects serialized directly to the socket using Storable) and better support for starting and controlling the cluster (which used to be a shell script).
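The serialize-objects-straight-to-the-socket idea can be demonstrated with Storable's store_fd / fd_retrieve. This is a minimal local demo over a socketpair, with made-up message contents; the real Sack nodes talk over TCP (or ssh tunnels):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(store_fd fd_retrieve);
use Socket;
use IO::Handle;

# A connected pair of sockets standing in for a TCP connection.
socketpair( my $node, my $master, AF_UNIX, SOCK_STREAM, PF_UNSPEC )
    or die "socketpair: $!";
$node->autoflush(1);
$master->autoflush(1);

my $pid = fork() // die "fork: $!";
if ( $pid == 0 ) {                       # pretend sack node
    close $master;
    my $msg = fd_retrieve($node);        # read one serialized Perl object
    $msg->{reply} = $msg->{n} * 2;
    store_fd( $msg, $node );             # send the object back
    exit 0;
}

close $node;                             # pretend lorry (master)
store_fd( { n => 21 }, $master );        # no framing code needed
my $reply = fd_retrieve($master);
print "reply: $reply->{reply}\n";
waitpid $pid, 0;
```

Storable handles the framing itself, which is what makes the protocol so simple: each side just reads and writes whole Perl data structures.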
The following video shows the improvements in version 0.11 on a 22 node cloud, hopefully better than the video above.