I'm currently working on a mail client for OS X (I won't mention the name since I don't have a landing page up yet) and had the opportunity to write an embedded search engine for it. I say opportunity because this is the kind of project that is lots of fun and would be a hard sell to a team. That is, even though there isn't a go-to embeddable search engine, one could almost certainly be put together with existing libraries at a much higher level than how I wrote this one. SQLite, for example, which is more or less the go-to DB for apps, has support for full text search. Apple has SearchKit, which is far from complete or easy to use, but nevertheless would likely be much faster than writing something from scratch.
From scratch is a very nebulous phrase in software development since it's exceedingly rare to write something that isn't built on a whole lot of other, preexisting stuff. This project is no different and while it's about as low level as you'd want to get when writing a search engine, it is based on a set of libraries, most notably LevelDB and ICU. LevelDB is an embedded key-value store that is tiny and fast and has a good pedigree. Like SQLite, LevelDB does not run as a separate process and does not have a client-server model. It gets compiled into your app. Hence the embedded part. ICU is the de facto standard for dealing with Unicode -- dealing with things like text tokenization in problematic languages like Chinese as well as Unicode normalization and a host of other things.
So I didn't really start from scratch. I started from a sane set of base libraries. From there however it's still a lot of work to build something that can store and search a set of documents. Before I get into the how I should take a moment to address the why.
Why Build a Search Engine from Scratch?
This is a great question. The high level answer is that I want an end result that is blazingly fast and very accurate. I want something lightweight enough that it can run on OS X as well as iOS or other mobile devices. Now, as a software engineer, I have to be able to answer the question, "Is alternative X or Y lacking in any measurable way that is relevant to the application at hand?" The problem with that is it's not possible to benchmark a piece of software that hasn't been written yet so I can't say if this approach is better than using SQLite or SearchKit. While I could have tested the performance and features of SQLite and SearchKit, the reality is that I decided to write one from scratch for this project for emotional reasons -- because I wanted to. When this is production-ready it certainly will be interesting to benchmark it against something like SQLite.
High Level Design
So what is a full text search engine? The goal is to end up with a black box that contains a bunch of documents that can be queried for a word or a phrase. The simplest approach would be to have a set of documents on disk (or in memory) and go through them one by one, breaking them into terms and checking if our search term matches any of those. In essence that's what every search engine does except that practical designs will do a lot of pre-processing to speed up that matching.
A more usable design will take pieces of the documents and create one or more indexes. These indexes are shortcuts -- a manifestation of the ability to trade off between processing time and disk space. The most basic index for full text search would be a word index (in this context usually called an inverted index) -- a list of words that point to the documents that contain them. Here we take up some more space (size of documents + size of word index) but gain the ability to scan through a smaller list of words (alphabetized, duplicates and unimportant words removed) to find matches. Instead of searching all of the documents for the word "beautiful" we search the index for the word. We can then fetch the set of documents with the word "beautiful" in them and do some other stuff to decide if those documents are the ones we want or not.
If all we want to do is find exact matches, we're actually most of the way there. This very simple index will allow us to find documents with the word "beautiful" in them if the following conditions are met:
- The word beautiful is spelled correctly as a search term.
- The word beautiful is spelled correctly in the document that contains it.
- The word beautiful doesn't have a prefix or suffix, e.g., "beautifully".
- The word beautiful was tokenized properly (related to #3).
- And probably others...
The point is this type of all or nothing term search is exceedingly fragile when dealing with human language and human operators. To improve on this, we'll need some manner of approximate or "fuzzy" matching of terms. When we're talking about matching terms based on similarity we're usually going to look at some type of edit distance, like Levenshtein distance (which is what we will eventually use here), but there are interesting alternatives such as cosine similarity. Regardless of how we determine it, we're calculating how similar our search term is with the terms in each document. For example the words "that" and "than" have a Levenshtein distance of 1 (one substitution) and a cosine similarity of ~0.8164 (a score of 1 meaning identical, 0 meaning no similarity). What we consider "close enough" is an arbitrary threshold.
A naive "fuzzy" approach would be to calculate the edit distance of our search term against each word in our index. Again, this will be slow as we'll be spending most of our time comparing words that have little or no similarity. While not as bad as testing every word in every document, it's still far from ideal. We could try to speed this up by doing things like only testing words that share a common prefix, but there are plenty of cases where that will fail, such as a spelling error in the first few letters of a word. If we want something more robust we're going to have to get more creative.
Precomputed N-Gram Index
Since we want to trade disk space for processing time, we want to do as much of the work required for searching ahead of time as possible. That will lead us to an index that is larger and more complex. One of the more common types of precomputed indexes is an n-gram index. N-grams (unigrams, bigrams, trigrams, ..., 25-grams, ...) in this case are our words broken into groups of n letters. Our word "beautiful" could be broken into the trigrams, "bea", "eau", "aut", "uti", etc. What n-grams allow us to do is search for partial word matches. We'll end up fetching a lot of unrelated words (there's an "aut" in "flautist") but the total number will probably be a reasonable amount to do another round of testing for a match (such as calculating cosine similarity). N-grams are neat. They're almost an ideal, universal solution to our problem and they're conceptually very simple.
Precomputed Levenshtein Index
Since Levenshtein is the go-to measure of string similarity, it kind of makes sense to combine that with the concept of indexing. The question is how. We could pre-compute "Levenshtein variations" of words but that's probably going to be zillions of words, right? Right. The problem here is edit distance encompasses all transpositions, insertions, deletions and replacements of letters in a word. For the English alphabet and a word 9 letters long, we're looking at over 110,000 possible combinations. That's bad enough but if we get into languages with larger alphabets, like Chinese, we're talking absolutely enormous indexes.
This is where I got stuck for a bit. I felt like I was on the right track but I didn't know how to solve this particular issue. Then I found this blog post about a similar approach where they addressed the index size issue. Bingo! It's always humbling when someone beats you to the punch but at the same time it's nice to have a solution present itself.
The basic concept here is instead of pre-calculating all possible Levenshtein variations of word, we only calculate the variations created by deleting letters from a word. The number of variations then is a number much smaller, for our purposes practically so. How does that work? How can we ignore insertions and everything else? The trick is to also calculate the deletion variants of our search term and then query each of those terms against our index. This is actually equivalent to querying a single term against a full Levenshtein index. For further reading, a similar approach is described here.