Tokyo Cabinet: full-text search in a few easy steps

I have been following Tokyo Cabinet for a while, and I was especially keen to try the full-text indexes which were added recently. I'm actually so obsessed with it that I have a Google alert set on the words "Tokyo Cabinet". Apart from the occasional news about Tokyo's political cabinet, it has been useful for finding interesting information: that's how I got a link to this blog post on how to create a simple intranet search by Mikio Hirabayashi. It includes easy-to-follow instructions for building a local intranet search, complete with a web crawler (in Ruby).

Unfortunately, Google Translate isn't really kind to it and produces something which is not really usable. But I managed to condense it into the following script:

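#!/bin/sh -x

url=http://blog.rot13.org/

# Crawl the site into a TSV file (skipped if intra.tsv already exists)
test -f intra.tsv || ruby wgettsv -allow "$url.*html" -deny cgi -max 10000 "$url" > intra.tsv
# Print information about the table database
tctmgr inform tctsearch.tct
# Import the crawled pages as records
tctmgr importtsv tctsearch.tct intra.tsv
# Build qgram (full-text) indexes on the title and body columns
tctmgr setindex -it qgram tctsearch.tct title
tctmgr setindex -it qgram tctsearch.tct body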
...which is really awesome if you ask me. All the good things of Tokyo Cabinet, with a little bit of qgram index on top.
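For readers who haven't met q-grams before: the idea is to index every overlapping substring of a fixed length q, so a search can match any part of a word. Here is a minimal sketch of just the splitting step in plain shell. The function name qgrams and the default q=2 are my own choices for illustration; I'm not claiming this is the gram size or normalization Tokyo Cabinet uses internally, and it is only meant for plain ASCII text.

```shell
#!/bin/sh
# Illustrative helper (not part of Tokyo Cabinet): print every overlapping
# q-character substring of a string, one per line.
# Usage: qgrams TEXT [Q]   (Q defaults to 2, an assumption for this sketch)
qgrams() {
    text=$1
    q=${2:-2}

    # Build a glob pattern of q question marks, e.g. "??" for q=2
    pat=
    i=0
    while [ "$i" -lt "$q" ]; do
        pat="$pat?"
        i=$((i + 1))
    done

    # Slide a window of q characters over the text
    while [ "${#text}" -ge "$q" ]; do
        tail=${text#$pat}       # everything after the first q characters
        gram=${text%"$tail"}    # the first q characters themselves
        printf '%s\n' "$gram"
        text=${text#?}          # advance the window by one character
    done
}

# Example: qgrams tokyo
```

An index built from such grams lets a query like "ky" find "tokyo" without knowing where words begin or end, which is also why this style of index suits text without spaces between words.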

I will try to summarize the blog post here, to give the English-speaking web public an opportunity to find out more even if they can't read Japanese. The concept is really simple:

...and very nicely split into components. If you still haven't given Tokyo Cabinet a try, you should! If you want, you can take a look at my Tokyo Cabinet scripts as a starting point. I really need to make proper Debian packages for recent versions, so watch this space...
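Once the title and body indexes from the script above exist, the database can be queried from the command line as well. This is a rough sketch, not something taken from the original Japanese post: search_body is a hypothetical wrapper name of mine, and it relies on the search subcommand and the FTSPH (full-text phrase) operator as I understand them from the tctmgr manual, so check man tctmgr for your version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: full-text phrase search over the "body" column.
# Usage: search_body PHRASE [DATABASE]
search_body() {
    phrase=$1
    db=${2:-tctsearch.tct}   # database name used by the indexing script

    if [ ! -f "$db" ]; then
        echo "$db not found; run the indexing script first" >&2
        return 1
    fi
    if ! command -v tctmgr >/dev/null 2>&1; then
        echo "tctmgr not found; install Tokyo Cabinet first" >&2
        return 1
    fi

    # Assumption: FTSPH is the phrase-search operator accepted by tctmgr search
    tctmgr search "$db" body FTSPH "$phrase"
}

# Example: search_body "memory usage"
```

The command should print the primary keys of the matching records. Searching the title column instead is a matter of changing body to title.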

About

This page contains a single entry from the blog posted on July 20, 2009 8:54 PM.

Creative Commons License
This weblog is licensed under a Creative Commons License.