I often get questions like: how should I build my search? The answer to that question is: it depends. This is a quick overview of a few really good solutions for searching your data:
Toolkit for building search engines. Supports meta data, properties
(data which isn't indexed but just stored alongside the index, gzipped in
this case), regular expressions and different input formats (both from perl
scripts and from the filesystem). It supports searching (but not indexing)
from perl using SWISH::API.
It doesn't support incremental indexing in
current stable versions. The code does exist in CVS, but it isn't stable
yet (for example, index merge always segfaults for me when using incremental
indexing). Indexing from perl will hopefully be available when I
finish the perl module for it.
If you want a custom search engine, this is my first choice.
- Xapian and Omega
Another great search engine with very good perl support (for both
indexing and searching).
The only drawback is that it doesn't support wild-cards (for a valid reason,
your stemmer should do that work!), but if you need to update your indexes
often, I would recommend it. It's about three times slower in indexing
(I originally claimed an order of magnitude, see the update at the end of
this post) and somewhat slower when searching than swish-e,
but speed isn't everything!
If I need a custom engine for dynamically changing data, this is my choice.
It comes with the Omega search tools, which can index files on disk and provide
a binary (C++) CGI for searching. I prefer to build my own.
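To show what I mean by the stemmer doing the work of wild-cards, here is a minimal sketch in Python. It is not Xapian code: the `toy_stem` function, its suffix list and the sample documents are all made up for illustration, and a real stemmer (Porter or Snowball, for example) is far more careful. The idea is only that both the indexed words and the query are reduced to the same stem, so a search for "index" finds "indexing" and "indexes" without typing "index*".

```python
from collections import defaultdict

# Toy suffix list, an assumption for illustration only.
# Real stemmers use much more elaborate rules.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, term):
    """Stem the query the same way as the indexed words, then look it up."""
    return sorted(index.get(toy_stem(term), set()))


# Hypothetical sample documents.
docs = {
    1: "indexing files on disk",
    2: "update your indexes often",
    3: "searching is fast",
}
index = build_index(docs)

print(search(index, "index"))     # documents containing indexing / indexes
print(search(index, "searches"))  # the query is stemmed too
```

The trade-off is that you can't match arbitrary prefixes this way, only word forms that the stemmer folds together.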
Very good search engine with clustering of results. Supports wild-cards, but doesn't support any meta data (you can't search just titles).
This is a valid alternative to Google Desktop (http://desktop.google.com/) for Linux.
Rewrite of swish-e in C++. Uses very clever heuristics to extract just the words of the English language, reducing index size and producing blazingly fast searches.
However, it's not well suited to indexing foreign languages or non-English
words (e.g. dates or numbers). It doesn't have any support for perl
out of the box, but at one point in time
I wrote a module for indexing and searching. However, without the clever
heuristics it's not faster than swish-e, so I discontinued development of
that module (but it is fully functional as-is).
So, the answer isn't easy. But I hope that this quick overview will help someone.
Updated on 2005-01-06: Olly Betts contacted me about this post and provoked me to benchmark Xapian again. I was totally unfair, because the new benchmark
doesn't confirm my claim that it's an order of magnitude slower. I will
probably write a separate post about this benchmark, because it also compares
the incremental and non-incremental versions of swish-e.