Writing an Embedded Full Text Search Engine

I'm currently working on a mail client for OS X (I won't mention the name since I don't have a landing page up yet) and had the opportunity to write an embedded search engine for it. I say opportunity because this is the kind of project that is lots of fun and would be a hard sell to a team. That is, even though there isn't a go-to embeddable search engine, one could almost certainly be put together from existing libraries at a much higher level than the one I wrote. SQLite, for example, which is more or less the go-to DB for apps, has support for full text search. Apple has SearchKit, which is far from complete or easy to use, but would nevertheless be much faster to get working than writing something from scratch.

From scratch is a very nebulous phrase in software development since it's exceedingly rare to write something that isn't built on a whole lot of other, preexisting stuff. This project is no different, and while it's about as low level as you'd want to get when writing a search engine, it is based on a set of libraries, most notably LevelDB and ICU. LevelDB is an embedded key-value store that is tiny and fast and has a good pedigree. Like SQLite, LevelDB does not run as a separate process and does not have a client-server model. It gets compiled into your app. Hence the embedded part. ICU is the de facto standard for dealing with Unicode -- things like text tokenization in problematic languages like Chinese, as well as Unicode normalization and a host of other tasks.

So I didn't really start from scratch. I started from a sane set of base libraries. From there however it's still a lot of work to build something that can store and search a set of documents. Before I get into the how I should take a moment to address the why.

Why Build a Search Engine from Scratch?

This is a great question. The high level answer is that I want an end result that is blazingly fast and very accurate. I want something lightweight enough that it can run on OS X as well as iOS or other mobile devices. Now, as a software engineer, I have to be able to answer the question, "Is alternative X or Y lacking in any measurable way that is relevant to the application at hand?" The problem with that is it's not possible to benchmark a piece of software that hasn't been written yet, so I can't say whether this approach is better than using SQLite or SearchKit. While I could have tested the performance and features of SQLite and SearchKit, the reality is that I decided to write one from scratch for this project for emotional reasons -- because I wanted to. When this is production-ready it certainly will be interesting to benchmark it against something like SQLite.

High Level Design

So what is a full text search engine? The goal is to end up with a black box that contains a bunch of documents that can be queried for a word or a phrase. The simplest approach would be to have a set of documents on disk (or in memory) and go through them one by one, breaking them into terms and checking if our search term matches any of those. In essence that's what every search engine does except that practical designs will do a lot of pre-processing to speed up that matching.

A more usable design will take pieces of the documents and create one or more indexes. These indexes are shortcuts -- a manifestation of the ability to trade off between processing time and disk space. The most basic index for full text search would be a word index (in this context usually called an inverted index) -- a list of words that point to the documents that contain them. Here we take up some more space (size of documents + size of word index) but gain the ability to scan through a smaller list of words (alphabetized, duplicates and unimportant words removed) to find matches. Instead of searching all of the documents for the word "beautiful" we search the index for the word. We can then fetch the set of documents with the word "beautiful" in them and do some other stuff to decide if those documents are the ones we want or not.
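
To make the shape of that index concrete, here's a minimal sketch of an inverted index in C++. A plain in-memory std::map and a whitespace tokenizer stand in for LevelDB and ICU here, and the names are mine, not the project's:

    #include <map>
    #include <set>
    #include <sstream>
    #include <string>

    // Maps each term to the set of document IDs that contain it.
    using InvertedIndex = std::map<std::string, std::set<int>>;

    // Naive whitespace tokenizer -- the real engine leans on ICU for
    // tokenization and stores the index in LevelDB rather than in memory.
    void indexDocument(InvertedIndex& index, int docId, const std::string& text) {
        std::istringstream stream(text);
        std::string term;
        while (stream >> term) {
            index[term].insert(docId);
        }
    }

    // Exact-match lookup: which documents contain this term?
    std::set<int> lookup(const InvertedIndex& index, const std::string& term) {
        auto it = index.find(term);
        return it == index.end() ? std::set<int>{} : it->second;
    }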

If all we want to do is find exact matches, we're actually most of the way there. This very simple index will allow us to find documents with the word "beautiful" in them if the following conditions are met:

  1. The word beautiful is spelled correctly as a search term.
  2. The word beautiful is spelled correctly in the document that contains it.
  3. The word beautiful doesn't have a prefix or suffix, e.g., "beautifully".
  4. The word beautiful was tokenized properly (related to #3).
  5. And probably others...

The point is that this type of all-or-nothing term search is exceedingly fragile when dealing with human language and human operators. To improve on this, we'll need some manner of approximate or "fuzzy" matching of terms. When we're talking about matching terms based on similarity we're usually going to look at some type of edit distance, like Levenshtein distance (which is what we will eventually use here), but there are interesting alternatives such as cosine similarity. Regardless of how we determine it, we're calculating how similar our search term is to the terms in each document. For example, the words "that" and "than" have a Levenshtein distance of 1 (one substitution) and a cosine similarity of ~0.8164 (a score of 1 meaning identical, 0 meaning no similarity). What we consider "close enough" is an arbitrary threshold.
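
For reference, the textbook two-row dynamic-programming version of Levenshtein distance looks something like this (it works on raw bytes; the real thing would want to compare ICU code points, and none of these names come from the project):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Classic two-row dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1.
    size_t levenshtein(const std::string& a, const std::string& b) {
        std::vector<size_t> prev(b.size() + 1), curr(b.size() + 1);
        for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
        for (size_t i = 1; i <= a.size(); ++i) {
            curr[0] = i;
            for (size_t j = 1; j <= b.size(); ++j) {
                size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = std::min({prev[j] + 1,          // deletion
                                    curr[j - 1] + 1,      // insertion
                                    prev[j - 1] + cost}); // substitution
            }
            std::swap(prev, curr);
        }
        return prev[b.size()];
    }
    // levenshtein("that", "than") == 1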

A naive "fuzzy" approach would be to calculate the edit distance of our search term against each word in our index. Again, this will be slow as we'll be spending most of our time comparing words that have little or no similarity. While not as bad as testing every word in every document, it's still far from ideal. We could try to speed this up by doing things like only testing words that share a common prefix, but there are plenty of cases where that will fail, such as a spelling error in the first few letters of a word. If we want something more robust we're going to have to get more creative.

Precomputed N-Gram Index

Since we want to trade disk space for processing time, we want to do as much of the work required for searching ahead of time as possible. That will lead us to an index that is larger and more complex. One of the more common types of precomputed indexes is an n-gram index. N-grams (unigrams, bigrams, trigrams, ..., 25-grams, ...) in this case are our words broken into groups of n letters. Our word "beautiful" could be broken into the trigrams "bea", "eau", "aut", "uti", etc. What n-grams allow us to do is search for partial word matches. We'll end up fetching a lot of unrelated words (there's an "aut" in "flautist") but the total number will probably be small enough to do another round of testing for a match (such as calculating cosine similarity). N-grams are neat. They're almost an ideal, universal solution to our problem and they're conceptually very simple.
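
Generating the trigrams for a term is about as simple as it sounds. A rough sketch, ASCII-only and keeping words shorter than three letters whole (the real tokenizer has to care about Unicode, which is where ICU comes in):

    #include <set>
    #include <string>

    // Break a term into its trigrams: "beautiful" -> {"bea", "eau", "aut", ...}.
    std::set<std::string> trigrams(const std::string& term) {
        std::set<std::string> grams;
        if (term.size() < 3) {
            grams.insert(term); // keep short words whole
            return grams;
        }
        for (size_t i = 0; i + 3 <= term.size(); ++i) {
            grams.insert(term.substr(i, 3));
        }
        return grams;
    }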

Where they fall down, however, is in their ability to deal with incorrect spelling. It's fairly likely that two words can be similar (in terms of their edit distance) but have zero trigrams in common. (For reasons beyond the scope of this post, n-grams for partial word matches are almost always trigrams.) An example would be "vodka" misspelled as "votka". The Levenshtein distance is 1 but no trigrams match between the two. Since this is going to fail often (I don't have any numbers for this but I think it's a reasonable assumption that it's significant) we're going to have to keep looking.

Precomputed Levenshtein Index

Since Levenshtein is the go-to measure of string similarity, it kind of makes sense to combine that with the concept of indexing. The question is how. We could pre-compute "Levenshtein variations" of words but that's probably going to be zillions of words, right? Right. The problem here is that edit distance encompasses all of the insertions, deletions and replacements of letters in a word (plus transpositions, in some variants). For the English alphabet and a word 9 letters long, we're looking at over 110,000 possible combinations just within an edit distance of two. That's bad enough, but if we get into languages with larger alphabets, like Chinese, we're talking about absolutely enormous indexes.

This is where I got stuck for a bit. I felt like I was on the right track but I didn't know how to solve this particular issue. Then I found this blog post about a similar approach where they addressed the index size issue. Bingo! It's always humbling when someone beats you to the punch but at the same time it's nice to have a solution present itself.

The basic concept here is that instead of pre-calculating all possible Levenshtein variations of a word, we only calculate the variations created by deleting letters from it. The number of variations is then much smaller -- small enough to be practical for our purposes. How does that work? How can we ignore insertions and everything else? The trick is to also calculate the deletion variants of our search term and then query each of those terms against our index. This is actually equivalent to querying a single term against a full Levenshtein index. For further reading, a similar approach is described here.
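
Here's a sketch of the deletion-variant idea (my own illustration, not the code from the linked post): the same routine is run over every indexed word ahead of time and over the search term at query time, and any shared variant flags a candidate pair that then gets verified with a real Levenshtein calculation.

    #include <set>
    #include <string>

    // All variants of `word` produced by deleting up to `maxEdits` letters
    // (the word itself included). Index these for every stored word, compute
    // them again for the search term, and treat any overlap as a candidate
    // match to be confirmed with an actual Levenshtein distance check.
    std::set<std::string> deletionVariants(const std::string& word, int maxEdits) {
        std::set<std::string> variants{word};
        if (maxEdits == 0 || word.empty()) return variants;
        for (size_t i = 0; i < word.size(); ++i) {
            std::string shorter = word.substr(0, i) + word.substr(i + 1);
            for (const auto& v : deletionVariants(shorter, maxEdits - 1)) {
                variants.insert(v);
            }
        }
        return variants;
    }
    // deletionVariants("vodka", 1) and deletionVariants("votka", 1) both
    // contain "voka", so the misspelling from earlier still finds its match.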

Read more »

First Five (and a Half) Minutes on a Server with a Shell Script

About a year ago I wrote this about hardening a fresh server using Ansible. That post has received about 10x as much traffic as anything else I've written. Oddly, admin is the area I'm probably least knowledgeable about when it comes to software.

Anyway, the Ansible script I wrote about in that post is out of date. I realized this recently after trying to use it on a fresh install. I went about updating it (Ansible has made some breaking changes since then) and came to the realization that it would be faster and easier to just write a shell script. 

That's not to say Ansible (and tools like it) don't have their place. They obviously do. But for one-off installs or occasional use the learning curve is too steep. It's much easier to stay current with shell scripting than it is to stay current with a tool that is constantly being changed (improved) and meant for administering very large installations. 

Read more »

A Simple ToDo App in Swift

I decided the best way to learn Swift would be to whip together a simple Core Data app with a single view controller. That gives us a project with some depth without wasting too much time on breadth, since it minimizes the time spent in IB wiring up UI components. A ToDo app seemed like an obvious choice.

Our project requirements:

  1. Persist todos (Core Data)
  2. Have a completed state for todos, be able to delete them
  3. Have custom swipe controls for marking complete, deleting
  4. Edit todos in-place and touch to edit
  5. Color todos by position in list (urgency)

This project started with the navigation controller template in Xcode. I won't go over all of the changes or inclusions but you can get a complete copy here if you're interested. I also won't go through building the app step by step; instead, I'll just cover some of the more interesting parts with regard to transitioning to Swift.

Read more »

Deploying Yesod to Digital Ocean (Less Painfully)

I recently wrote about deploying Yesod to Ubuntu 14.04 x64 on Digital Ocean here. Since GHC hasn't been updated to 7.8 in the haskell-platform package for Ubuntu, I installed a pre-compiled set of binaries as suggested here. That's not without issues, including symlinking a library (not a great idea) and guessing which packages need to be installed. This is necessary because the binaries were compiled for an earlier version of the OS. In short, it's a horrible way to do it.

The other suggestion on that page is to install the platform from source. After some digging it turns out that this is much, much worse. I gave up after 5 minutes and I doubt anyone familiar with the subject would blame me.

Some more digging revealed that it's vastly more straightforward to build GHC and cabal-install individually. There is no drawback to doing this in practice and building Yesod and Keter works exactly the same. I've put together a shell script for it, lifted mostly from this.

Deploying Yesod to Digital Ocean

[Update] I've outlined a much cleaner way of updating GHC to 7.8 here. This doesn't involve installing the pre-compiled binaries from Ubuntu 12.

I've been working with Haskell about as much as I've been writing on this blog -- very little. So, two birds...

I've got a couple projects coming up that need websites built so I decided that would be a good excuse to learn some more Haskell. After digging around I decided Yesod is likely the most mature web framework so I tried it out. I didn't want to stop at the "install and run locally" phase as that gives a false sense of simplicity when dealing with web frameworks. I think it's more telling to set up a demo site and then deploy it to a (near) production setting. 

I host everything on Digital Ocean these days. I wrote previously about a quick way to harden a virgin install to make it more secure and easier to use. I didn't bother this time since everything was torn down quickly but if you're looking at a production install you'll want to think about making it more secure.

The preferred way of running Yesod in production is Keter, which acts as a reverse proxy, monitors applications, gracefully deploys new versions and does some other fun stuff. The Keter docs cover everything very well. The complicated part is that we're going to need to compile a binary version of Keter, and it's not a good idea to compile that binary on your production machine. The same goes for the Yesod binaries. Both the Keter and Yesod docs talk about why, so I won't repeat it here.

Why is this a problem? Well, unless you're running the same OS locally and on your server, you're going to have to put together another server or a virtual machine that you can compile the binaries on. Given that my SSD rMBP is woefully light in the storage department, I opted for the former. Why not just compile on the same machine? Well, for just trying things out we could do that, but the whole point of testing this is to see what the Haskell ecosystem is like for actually getting the site up and running.

What I ended up doing is firing up two instances (droplets) of Ubuntu 14.04 x64, one a minimal $5/month version that will hold the production site and another, beefier, $40/month droplet (dual core, 4GB ram) to compile our binaries on. Don't worry about the cost as you'll probably be using it for less than an hour and will only be billed for that much.

The main issue you're going to run into here is that the available Haskell platform package's GHC version is 7.6 while the newer version is 7.8 (which is what I installed on my Mac). As far as I can tell, projects created with Yesod for 7.6 are not compatible with 7.8. In order to work around this we can either install a binary bundle of the Haskell platform or compile from source (available on the same page). I thought the latter was more of a gamble so I decided to use the pre-compiled binaries. The page does say that it works on Ubuntu 14. Of course, it then trails off with something, something, extra packages, something, symlink.

Good news, I figured out what all of that extra stuff is so you don't have to. The short version is that the Ubuntu 14 image (at least the one DO is using) is missing a few things, but that's easy enough to sort out.

Read more »

[Ab]using Blocks for Cleaner MVC in Obj-C

As I've started to utilize blocks more in iOS/OS X development I've noticed a pattern emerge and wanted to talk about it. It uses the same building blocks (excuse the pun) as you're likely to find in any Cocoa project, but leveraging blocks to the fullest extent has sped up development time for me and led to both thin controllers (which I think are good) and a very strict separation between the different layers in MVC.

(Note: I wanted to point out that MVC in Cocoa in general, and explicitly in the example I give here, is more accurately called Model-View-Adapter as the model and view layers do not interact directly with one another, as they would be allowed to do in traditional MVC.)

I won't spend much time on blocks themselves since they've been around for a while, but if you're not familiar, in Obj-C a block is an anonymous / first-class function. It grants Obj-C (and C and C++) a handy tool for adopting a more functional programming style. The syntax can be awkward at first and I recommend this site as a handy reference and for an occasional laugh.

So what does a block have to do with MVC? In MVC the controller layer is responsible for mediating between the model and view layers and usually "owns" or manages the lifecycle of both. While in theory a controller will generally be "thin", doing no more than it has to in order to tie model to view, controllers tend to bloat over time.

In very practical terms, controllers usually have a lot of functions defined in them. For every possible action in a view layer the controller will usually have a separate function (or a shared function with additional logic to determine the sender and action required). Working with Xcode and IB, that means defining and implementing a function as well as making a connection for that action in IB. Since we usually need a reference to the sender(s) (think a set of buttons with mutually exclusive state) we also end up defining properties. That's a lot of "stuff" for, say, an "Open File" button.

Read more »

Converting MIDI Pulses Per Quarter (PPQ) to Samples

I recently wrapped up an audio project for OS X based on JUCE (a cross platform C++ library for applications with an emphasis on audio) and using Propellerhead's ReWire protocol to allow the application to feed audio with very low latency to another audio app (e.g., Logic Pro).

The ReWire protocol defines a host, a device and a GUI (or "panel") that controls the device. The ReWire ecosystem is divided up that way so a ReWire device can be run inside a host (in the host's address space) in a similar way to how Audio Units or VSTs run as "plugins" in a host application. In order to control the device, a GUI or panel communicates with it via inter-process communication. (For standard plugins, this GUI would be directly managed by the host application.)

ReWire speaks both MIDI and audio data. It also communicates transport position and control commands (play, stop, etc) in both directions, allowing the transports of two applications to run in sync. For transport position, it uses MIDI PPQ (also called ticks or just PPQ) to give an offset. If you're building an audio data only application, as I was in this case, your ReWire device will probably ignore the MIDI side of ReWire and therefore will have no reference to utilize PPQ with. That means you'll need to convert a PPQ offset to a sample offset.

Even if you're ignoring the MIDI side of ReWire, the host will still announce all of the information you need to determine the current sample offset of the transport.

A few definitions first:

  • PPQ (pulses per quarter note) is expressed as its smallest subdivision, so PPQ 96 means a maximum resolution of 1/96th of a quarter note. ReWire uses PPQ 15360.
  • BPM (beats per minute) as our tempo. In this case we can assume each beat is a quarter note. The actual value ReWire supplies is BPM * 1000.
  • Sample rate is the number of samples per second (per channel) of our audio. In this case we need to use the sample rate as announced by the host, not by any of the audio files we're playing back (though we need to convert the sample rate of each audio file to that of the host for playback, so they'll end up being equivalent).

To convert PPQ to samples, we'll first convert our PPQ offset to milliseconds. The formula for that is milliseconds per tick = milliseconds per minute / (BPM * PPQ), or in this case 60000 / ((120000 / 1000) * 15360), or 0.03255208333 ms per tick for a tempo of 120 BPM. Our offset in milliseconds is then PPQ offset * milliseconds per tick. To convert that value to a sample offset, we multiply by our sample rate (divided by 1000): sample offset = (PPQ offset * milliseconds per tick) * (sample rate / 1000).

Putting it all together, if we have a tempo of 120 BPM, a sample rate of 44100, a PPQ value of 15360 and a PPQ offset of 1 million, we'll end up with ((60000 / ((120000 / 1000) * 15360)) * 1000000) * (44100 / 1000) or approximately 1435546 samples.

Since our PPQ value won't change (it's always 15360 for ReWire) we can simplify this quite a bit. Here is a simplified version in function form (as well as the complementary conversion from samples to PPQ):
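
A minimal sketch, assuming BPM is passed in as plain beats per minute (the ReWire-supplied value divided by 1000), the sample rate is in Hz, and the function names are arbitrary:

    // ReWire always reports transport position at 15360 ticks per quarter note.
    static const double kReWirePPQ = 15360.0;

    // PPQ (tick) offset to sample offset. bpm is plain beats per minute
    // (ReWire actually supplies BPM * 1000), sampleRate is in Hz.
    double ppqToSamples(double ppqOffset, double bpm, double sampleRate) {
        double msPerTick = 60000.0 / (bpm * kReWirePPQ);
        return ppqOffset * msPerTick * (sampleRate / 1000.0);
    }

    // The complementary conversion: sample offset back to a PPQ offset.
    double samplesToPPQ(double sampleOffset, double bpm, double sampleRate) {
        double msPerTick = 60000.0 / (bpm * kReWirePPQ);
        return (sampleOffset / (sampleRate / 1000.0)) / msPerTick;
    }

    // ppqToSamples(1000000, 120.0, 44100.0) comes out to roughly 1435546
    // samples, matching the worked example above.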

First Five (and a Half) Minutes on a Server with Ansible

Note: The Ansible script below is unusable due to breaking changes. I've written about a similar approach here using a simple shell script.

This is a response/addendum to two really good "first five minutes" style posts discussing the setting up and basic hardening of a remote server. Brian Kennedy discusses his first five minutes on Ubuntu here [1]. It's a great tutorial covering the basics of security. Of course, if you've gone through it once you'll want to automate it. There is also a post on automating the process with Ansible [2] (actually using the steps described in Brian's post). The latter was either not tested or only worked on an earlier version of Ubuntu/Ansible. I'll cover an updated version here that works with the most recent version of Ansible and with Ubuntu 13.04 x64, and includes some helpful additions.

So, starting from a virgin install of Ubuntu server we're going to want to perform the following steps:

  1. Update & upgrade the system via apt-get
  2. Install vim & mosh (personal preferences)
  3. Install fail2ban to block ssh brute-force attempts
  4. Reset our root password to something strong
  5. Create a new user so we don't have to use root
  6. Copy over our pub key
  7. Lock down sudo
  8. Lock down ssh to prevent root & password login
  9. Set up the ufw firewall
  10. Configure unattended security upgrades
  11. Configure logwatch to email daily server logs

Even if you can do all of that in five minutes, this is obviously complicated enough that we want an automation tool to handle it. After reviewing popular automation tools like Chef and Puppet, I decided to go with the slightly lesser known and arguably simpler Ansible. Ansible is simpler because it doesn't require any server-side installs to work. All Ansible commands are run via ssh from your computer and only need a password or private key to run. Ansible commands are organized in "playbooks" and Ansible has an extensive set of modules that simplify common tasks.

Read more »

Building a Fast Web App in 2013

I recently decided to put some other projects aside and build a web app/service that's sort of a "scratch my own itch" project. My aspirations for it go beyond a personal project, however, so I wanted to build it in a way that it could handle a lot of users without too much maintenance or too many resources.

So what language/framework to choose? PHP is still the most popular web language (according to a recent analysis [1] of jobs posted on Twitter) but, well, it's PHP. Rails (and to a lesser extent Django) is very popular, though it's a "kitchen sink" framework and doesn't have the best reputation for performance. Node.js is interesting, not because I care that much about using one language for the server and the client (especially if that language is JavaScript), but because it's supposed to be fast.

Since I started playing around with Haskell recently I also took a look at the most popular web frameworks: Yesod, Happstack and Snap. Not only are there three somewhat mature web frameworks (with documentation that ranges from OK to pretty good), but there's also some evidence that Haskell web apps can be fast, maybe even really fast [2]. Despite the promising start, I ended up dropping Haskell.

Read more »

LJSelectionView Now Available on GitHub

I recently pushed LJSelectionView to GitHub. The project makes it easy to manage an NSView with a collection of subviews and their selection -- either by mouse clicks or "drag to select" actions. Selection rectangles, highlighting and selection management are things I've had to write more than once for Cocoa apps, so I decided to write a standalone version to share.

I've set this up as a controller with several NSView subclasses. The controller is responsible for managing selection (and undo/redo) and the views are fairly dumb affairs that mostly just draw themselves (just how it should be). The exception is the main LJSelectionView that also understands the difference between selection, highlighting and your "content" views and has methods for managing them.

The two other views, LJSelectionItemView and LJSelectionRectView, are configurable (up to a point) for line color, line width, fill, etc. to change what the selection and highlighting rectangles (the stuff drawn around selected views) look like. If you want to go crazy with their appearance they're easy enough to modify or subclass.

For the ARC-shy (not that there aren't good reasons to avoid ARC at this point), all files will support ARC or non-ARC projects automatically. It cluttered up the code a bit but I thought it was better than supporting one versus the other.

The selection behavior is the same as Adobe Illustrator: If you hold the shift key while clicking or dragging, you toggle whatever is being selected against the current selection. If you're not holding the shift key, whatever is selected always replaces the current selection. As it is now, views aren't selected until after the selection operation is completed. It shouldn't be that hard to change it to support "live" selection but again I'm following Illustrator here.

To use it, just copy the four main classes and their headers (what's in the root of the GitHub repo) to your project. You'll also want to set up the view hierarchy in IB and make sure the correct outlets are connected. It is possible to set up the view hierarchy without IB; you just have to make sure the right connections are made. You can look at the demo project to see how it's set up. The tests aren't automated in any way but you can run them in Xcode with ⌘-U as normal.

The whole thing is MIT licensed so do with it what you will.