I've finished the implementation, tuning, and testing of Full Text Search (FTS) for Emdros.

The implementation is part of the libharvest library, and is written in C++ like the rest of Emdros.

I implemented the basic idea in Python first, then reimplemented it in C++. Python is malleable enough to be ideal for this sort of prototyping work.

The Full Text Search has a lot of features, including:

  • Index "documents", which must exist as object types.
  • Index documents based on "indexed object types" (e.g., token) and one indexed feature of the indexed object type.
  • Search within "documents".
  • Chainable filters that modify token strings before they are indexed, e.g., to weed out stop-words, or to strip, lower-case, or otherwise alter the token strings.
  • Tokenization of the query-string by splitting on spaces.
  • Optional application of the chainable filters to the query-terms after tokenization, so as to be more likely to match the indexed feature.
  • Google-like "quoted strings" that require the query-terms to be adjacent.
  • More than one "quoted string" allowed in the query-string.
  • Return results as a list of three-tuples (document-first-monad, document-last-monad, first-search-term-first-monad).
  • Return results as customizable snippets of real tokens, with optional highlighting of query terms.
  • Command-line tools for both indexing and searching.
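
To give a feel for how several of these features fit together, here is a minimal Python sketch of chainable filters, query tokenization with "quoted strings", and results as three-tuples. It is an illustration only, not the libharvest code or its API: every function name here is hypothetical, and it makes assumptions the list above does not state, namely that each token occupies exactly one monad, that "adjacent" means consecutive monads, that a document must match every query term, and that the third tuple element is the first match of the first query term within the document.

```python
import re

# Hypothetical filters: each takes a token string and returns a new string,
# or None to drop the token (e.g., a stop-word).
def strip_filter(token):
    return token.strip(".,;:!?\"'()")

def lowercase_filter(token):
    return token.lower()

def make_stopword_filter(stopwords):
    def stopword_filter(token):
        return None if token in stopwords else token
    return stopword_filter

def apply_filters(token, filters):
    """Run a token through the filter chain, stopping if a filter drops it."""
    for f in filters:
        token = f(token)
        if not token:
            return None
    return token

def parse_query(query, filters=()):
    """Split on spaces; each "quoted string" becomes one multi-term phrase."""
    phrases = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        words = [bare] if bare else quoted.split()
        terms = [apply_filters(w, filters) for w in words]
        terms = [t for t in terms if t]
        if terms:
            phrases.append(terms)
    return phrases

def build_index(tokens, filters=()):
    """Map each filtered term to the monads where it occurs.

    tokens is a list of (monad, surface_string) pairs.
    """
    index = {}
    for monad, surface in tokens:
        term = apply_filters(surface, filters)
        if term is not None:
            index.setdefault(term, []).append(monad)
    return index

def phrase_starts(index, phrase):
    """Monads where the phrase's terms occur on consecutive monads (assumption)."""
    starts = set(index.get(phrase[0], ()))
    for offset, term in enumerate(phrase[1:], start=1):
        starts &= {m - offset for m in index.get(term, ())}
    return starts

def search(index, documents, query, filters=()):
    """Return (doc_first_monad, doc_last_monad, first_term_first_monad) tuples.

    documents is a list of (first_monad, last_monad) pairs.
    """
    phrases = parse_query(query, filters)
    results = []
    if not phrases:
        return results
    for first, last in documents:
        hits = []
        for phrase in phrases:
            in_doc = [m for m in phrase_starts(index, phrase)
                      if first <= m and m + len(phrase) - 1 <= last]
            if not in_doc:
                break  # this document lacks one of the query phrases
            hits.append(min(in_doc))
        else:
            results.append((first, last, hits[0]))
    return results

# Toy data: two "documents" covering monads 1-4 and 5-8.
tokens = [(1, "The"), (2, "quick"), (3, "brown"), (4, "fox."),
          (5, "A"), (6, "lazy"), (7, "brown"), (8, "dog")]
documents = [(1, 4), (5, 8)]
filters = [strip_filter, lowercase_filter, make_stopword_filter({"the", "a"})]
index = build_index(tokens, filters)

print(search(index, documents, '"Quick brown" fox', filters))  # [(1, 4, 2)]
```

Applying the same filter chain to the query as to the indexed tokens is what lets "Quick" in the query match "quick" in the index, and lets "fox" match the token "fox." after stripping.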

Full Text Search will appear in the next public release of Emdros.

Interested parties should contact me via email to get the latest sources.

Enjoy!

Ulrik