Dobrica Pavlinušić's Weblog / Blog

Tokyo Cabinet: full-text search in few easy steps

I have been following Tokyo Cabinet for a while, and I was especially keen to try the full-text indexes which were added recently. I'm actually so obsessed with it that I have a Google alert set on the words "Tokyo Cabinet". Apart from the occasional news about Tokyo's political cabinet, it proved useful when it led me to a blog post by Mikio Hirabayashi on how to create a simple intranet search. The post includes easy-to-follow instructions for building a local intranet search, complete with a web crawler (in Ruby).

Unfortunately, Google Translate isn't really kind to it and produces something which is not really usable. But I managed to condense it into the following script:

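#!/bin/sh -x

url=https://blog.rot13.org/

# Crawl the site into intra.tsv, unless a previous run already produced it
test -f intra.tsv || ruby wgettsv -allow "$url.*html" -deny cgi -max 10000 "$url" > intra.tsv
# Print information about the table database
tctmgr inform tctsearch.tct
# Import the crawled pages into the table database
tctmgr importtsv tctsearch.tct intra.tsv
# Build q-gram (full-text) indexes on the title and body columns
tctmgr setindex -it qgram tctsearch.tct title
tctmgr setindex -it qgram tctsearch.tct body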

...which is really awesome if you ask me. All the good things of Tokyo Cabinet with a little bit of q-gram index on top.
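To give a rough idea of what a q-gram index buys you, here is a minimal Python sketch of the general technique. It is a toy illustration, not Tokyo Cabinet's actual implementation: the function names (`qgrams`, `build_index`, `search`), the choice of q=2 and the sample documents are all assumptions made up for this example. Every overlapping q-character substring is mapped to the documents containing it, a query is answered by intersecting those sets, and the candidates are then verified against the real text.

```python
from collections import defaultdict


def qgrams(text, q=2):
    """Return the set of overlapping q-character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def build_index(docs, q=2):
    """Map every q-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in qgrams(text, q):
            index[gram].add(doc_id)
    return index


def search(index, docs, phrase, q=2):
    """Return sorted ids of documents containing phrase as a substring."""
    grams = qgrams(phrase, q)
    if grams:
        # A match must contain every q-gram of the phrase.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Phrase is shorter than q: the index cannot help, so scan everything.
        candidates = set(docs)
    # Sharing all q-grams is necessary but not sufficient, so verify each candidate.
    phrase = phrase.lower()
    return sorted(d for d in candidates if phrase in docs[d].lower())


if __name__ == "__main__":
    # Hypothetical sample documents, only for demonstration.
    docs = {
        1: "Tokyo Cabinet full-text search",
        2: "Hyper Estraier node API",
        3: "Tokyo political cabinet",
    }
    index = build_index(docs)
    print(search(index, docs, "cabinet"))    # [1, 3]
    print(search(index, docs, "tokyo cab"))  # [1]
```

The second query shows why the verification step matters: document 3 contains every 2-gram of "tokyo cab", but not the phrase itself.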

I will try to summarize the blog post here, to give the English-speaking web public an opportunity to find out more even if they can't read Japanese. The concept is really simple:

...and very nicely split into components. If you still haven't given Tokyo Cabinet a try, you should! If you want, you can take a look at my Tokyo Cabinet scripts as a starting point. I really need to make proper Debian packages for recent versions, so watch this space...

pgestraier - easy and fast full-text search for PostgreSQL using Hyper Estraier

I'm somewhat proud to announce that the current version of pgestraier now includes consistency triggers which will keep the Hyper Estraier index up-to-date with the data in your database.

That, coupled with the ability to create full-text indexes easily, just by running a helper script on the database, makes pgestraier a powerful solution if you need fast full-text indexing with the ability to off-load search to another machine (thanks to Hyper Estraier's P2P architecture) or need perfect N-gram search results.

This project might also help people who are porting applications which use MySQL full-text search to PostgreSQL (actually, it is going to be used for just that).

The real trunk of development is in Subversion, and the CVS repository at pgFoundry is just a mirror copy. Enjoy it, while I prepare to leave for the seaside.

Search::Estraier - pure perl API for Hyper Estraier node API

I needed this for a long time, so I finally wrote it. It's a birthday present to... well... me. Geeks are strange, right?

pgestraier - ready for prime time?

Well, this was fast. In the last post I promised to work some more on pgestraier and well, I did. It now supports the node API, and generally I'm quite happy with it. It even has a proper home page.

More work on HyperEstraier perl bindings

I just added the node API to the HyperEstraier perl bindings. It's my first real project in C++, so be kind to it :-)
All changes are available in the Subversion repository, and if I didn't do something stupid they will be included in the upstream version, I hope. Next step: add the node API to pgestraier...

Begin work on Search::Estraier perl module

Search::Estraier will be a perl API to Hyper Estraier. It's written using the excellent Inline::C.

The grand plan is to use pgestraier from PostgreSQL to query the Estraier index and Search::Estraier to create the index. This would make it possible to combine structured data in the RDBMS with semi-structured data in the full-text index. Additional normalized tables can be created using materialized views in PostgreSQL, and if all goes well, it will be part of WebPAC version 2, which will be a universal hybrid (structured/full-text) storage.

Update: I was lucky enough to find already working perl bindings for Hyper Estraier on MATSUNO Tokuhiro's blog. Thanks a lot! So, work on Search::Estraier is suspended.

Search Hyper Estraier from PostgreSQL

Hyper Estraier, the new version of the previously mentioned Estraier, has a very good API. So, I wrote a pgestraier function for it. Now you can query Estraier indexes directly from PostgreSQL.

It's beta, yada, yada, if it breaks you get to keep both parts. However, it's extremely useful if you want to (left/outer) join index results with data in PostgreSQL.
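To illustrate why joining index results with relational data is useful, here is a minimal sketch of the idea. It deliberately does not use pgestraier or PostgreSQL: it assumes SQLite (from Python's standard library) as a stand-in database, and the `articles` table, the `hits` table, the sample rows and both function names are hypothetical. The search hits, which in the real setup would come from the full-text index, are loaded as `(id, score)` rows and left-joined against the table, so rows without a hit are kept with an empty score.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical articles table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?)",
        [(1, "Tokyo Cabinet"), (2, "pgestraier"), (3, "swish-e")],
    )
    return conn


def join_hits(conn, hits):
    """Left-join articles with external search hits given as (id, score) pairs."""
    # Stand-in for the rows a full-text search function would return.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS hits (id INTEGER PRIMARY KEY, score REAL)"
    )
    conn.execute("DELETE FROM hits")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", hits)
    # LEFT JOIN keeps every article; score is NULL (None) where there was no hit.
    return conn.execute(
        "SELECT a.id, a.title, h.score "
        "FROM articles a LEFT JOIN hits h ON h.id = a.id "
        "ORDER BY a.id"
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    for row in join_hits(conn, [(2, 0.9)]):
        print(row)
```

Swapping the `LEFT JOIN` for a plain `JOIN` would return only the rows that matched the search.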


How do you search your data?

I often get questions like: how should I make my search? The answer to that question is: it depends. This is a quick overview of a few really good solutions for searching your data:

swish-e

Toolkit for building search engines. Supports meta data, properties (data which isn't indexed but just stored alongside the index, gzip-compressed in this case), regular expressions and different input formats (both from perl scripts and from the filesystem). It supports searching (but not indexing) from perl using SWISH::API.

It doesn't support incremental indexing in current stable versions. The code does exist in CVS, but it isn't stable yet (for example, index merge always segfaults for me when using incremental indexing). Indexing from perl will hopefully be available when I finish the perl module for it.

If you want a custom search engine, this is my first choice.

Xapian and Omega

Another great search engine with very good perl support (for both indexing and searching).

The only drawback is that it doesn't support wild-cards (for a valid reason: your stemmer should do that work!), but if you need to update your indexes often, I would recommend it. It's about three times slower at indexing (I originally wrote "an order of magnitude"; see the update at the end of this post) and somewhat slower when searching than swish-e, but speed isn't everything!

If I need a custom engine for dynamically changing data, this is my choice. It comes with the Omega search tools, which can index files on disk and provide a binary (C++) CGI for searching. I prefer to build my own.

Estraier

Very good search engine with clustering of results. Supports wild-cards, but doesn't support any meta data (you can't search just titles).

This is a valid alternative to Google Desktop (http://desktop.google.com/) for Linux users.

SWISH++

Rewrite of swish-e in C++. Uses very clever heuristics to extract just the words from English-language text, reducing index size and producing blazingly fast searches.

However, it's not well suited to indexing foreign languages or non-English words (e.g. dates or numbers). It doesn't have any perl support out of the box, but at one point in time I wrote a module for indexing and searching. However, without the clever heuristics it's not faster than swish-e, so I discontinued development of that module (but it is fully functional as-is).

So, the answer isn't easy. But I hope that this quick overview will help someone.

Updated on 2005-01-06: Olly Betts contacted me about this post and prompted me to benchmark Xapian again. I was totally unfair: the new benchmark doesn't confirm my claim that it's an order of magnitude slower. I will probably write a separate post about this benchmark, because it also compares the incremental and non-incremental versions of swish-e.

Results tagged “search”