My First Algorithm in Haskell

I recently started learning Haskell. Like many programmers who get interested in the language I've spent as much time studying the language as I have trying to find an excuse to actually use it. It's a somewhat difficult language to learn -- at least if you haven't worked with functional programming before -- and it's not really a go-to language for most situations. Finding a reason to use it can be challenging.

My answer to this has been to work on projects that don't have an immediate application. For reasons I won't bore you with, I decided to (re)implement Porter's stemming algorithm. The point of a stemming algorithm is to reduce words to common roots, which is often used as a pre-processing step when building a search engine or doing other NLP work. For example, the words "stemmer", "stemming", "stemmed" are all based on "stem" and share a common meaning. Reducing these words to their root, "stem", makes processing them further a lot easier.

Porter's English word stemming algorithm is described here and there's a Python implementation here and a Haskell implementation here. I'm so new to Haskell (and functional programming) that I wanted access to both the original paper, a version in a language I'm familiar with as well as something in Haskell. Having said that, after I worked on this algorithm for a while it became apparent that working from the Python version was actually a hindrance compared to just working from the paper. Since our functional approach will be so different from our imperative approach, one doesn't really inform the other.

A Tiny Bit About Haskell

Haskell is a purely functional language. Purely functional means that the functions are pure in a mathematical sense. They're free (with some exceptions) from side effects. If you give a Haskell function a given input, it will always return the same output. Functions are also first class and the main control structure used in the language. It's not a stretch to say that everything is a function in Haskell. Haskell is lazy. An infinite list in Haskell is valid. That's made possible by lazy evaluation, i.e., the nth value of a list isn't "created" until it's needed.

Haskell lacks a lot of the stuff you're familiar with if you're coming from an imperative background. For control flow there are conditionals but no loops. If you need to do something repeatedly you'll generally use a recursive function or operate on a list. Haskell doesn't have mutable variables (well it does, but not by default) -- with some exceptions, it doesn't have state. Again, to do most things in Haskell you do so without any mutable state. No local variables inside functions to hold intermediate values (or examine them). No variables further up in scope to hold values between function calls. It's a mad, mad world.

So why Haskell or functional programming in general? Well Haskell's purity makes it much easier to write bug-free code and write tests for that code. Functions are black boxes that can be tested without elaborate setup or teardown. They can be tested in isolation and will operate the same when working with the rest of the program. Having no side effects has its advantages. It's also a terse language. While it's debatable that writing less code (to do the same thing) makes anyone more productive, it is, at least for me, better to write and read terse languages.

I've only scratched the surface here. For a more eloquent and informative overview of the benefits this is a good place to start. While we're talking about learning materials, I can't recommend Learn You a Haskell for Great Good enough. It's humorous and dare I say beautifully illustrated -- at least if you like the doodles of people who are obviously too smart for their own good.

Let There Be Stems

Enough bs. Let's write some Haskell. The Porter Stemmer algorithm kind of breaks down like this: We make some definitions (such as what's a consonant or a vowel, how many consonant sequences a word has) and then we apply a series of rules to our word to get its stem. I started with our definition of a consonant and vowel. From the paper, we define them as such:

A \consonant\ in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant.
From our Python implementation, this definition is handled like this:

This is pretty straightforward. Our function cons returns True (or 1 in this case) if b (our word as a string in an instance variable) at the index in question is in the set "aeiou" or is both 'y' and the preceding letter is not in the set "aeiou". This function isn't particularly elegant but it's easy to understand and it gets the job done. Vowels then are anything where cons returns False.

There are a couple of problems here for the beginning Haskeller. We don't have our instance variable b for one. The other isn't a problem as much as it is a style or philosophy issue. I think you could argue that accessing lists by index isn't very Haskelly. We have an operator for it, !!, but using it means bounds checking and passing our index along with our string to the function. While this is certainly possible (the Haskell implementation listed here does it exactly that way) I wanted to find a solution that felt more natural in the context.

My only functional background comes from using the functional features present in Python. Specifically I'm comfortable with list comprehension as well as "decorating" values (in a way similar to the decorate-sort-undecorate pattern in Python) in tuples and "zipping" lists. In this case we want to not only figure out if the characters in a word are vowel or consonants, we're going to want to look at the patterns of vowels and consonants that exist in a word. This will be useful later in the algorithm. Given this, it seemed natural to figure out the entire word at once and keep that pattern as a string in a tuple along with our original word.

In Haskell I came up with this:

The first interesting bit here is wordDesc :: String -> (String, String). This is our function's type declaration. These are optional but considered good practice. If we don't explicitly define it, the compiler/interpreter will figure it out for itself. In this case, it says that we're going to take a String and return a tuple (String, String). Our first string will be our original word, unchanged, and our second will be the matching pattern of consonant or vowel definitions. If we called it with "tree", it would give us ("tree", "ccvv"), since our consonants are "tr" and our vowels "ee". 

The meat of this function is the list comprehension [corv (a,b,i) | (a,b,i) <- zip3 str (rotateR str) [0..len]]. This might be a bit confusing if it's new to you. It's of the form [val | val <- list]. This is very similar to [val for val in list] in Python. Our list in this case is actually three lists zipped together, zip3 str (rotateR str) [0..len]: Our original list (our string), that list rotated one letter to the right and the list 0 to the length of our word (less one since we're zero indexed).  

The first list is obvious, but what are the other two lists for? In the case of a 'y' we have to look at two things: the preceding letter (whether or not it's a consonant) and the index of our 'y'. If our 'y' isn't at the start of a word and has a consonant before it, then it's a vowel for our purposes. Since I didn't want to pass an index into this function I decided to look at the current letter and the previous letter simultaneously. Rotating the list one letter to the right "lines up" our current and previous letters and in this case makes them available in the variables a and b. We still ended up needing that index to check if we're looking at our first letter or not. Not wanting to belabor the "no index" edict any further I zipped in a list of index values for that purpose. There is likely another, non-indexed, way to do this but I decided to make a concession and move on. Since we only have index values that match our actual string, we don't have to worry about bounds checking at least.

The pipes in our function definition are called guards. They're one way Haskell allows us to define conditional statements. The guards are evaluated in order and the provided value is assigned for the first condition that evaluates as true. The otherwise statement is mandatory and defines our default value. There where constructions allow us to define parts of our main function separately to improve readability. In this case we're defining len (the length of our string, zero-indexed) and corv, which returns a 'c' or a 'v' based on the criteria that we've defined.

Our rotateR function is simply:

There are some other interesting Haskell bits here. First our type declaration is the variable a instead of a defined type. These are called "type variables" in Haskell. Our declaration here states that we'll take a list of type a and return a list of that same type. For our purposes, we could have used the declaration rotateR :: String -> String. Using type variables makes the function more generic.

The other really interesting thing going on here is called pattern matching. Pattern matching in Haskell allows us to create a set of functions that share a type declaration (and a name) but can return different results based on different attributes of the arguments they're called with. If we call our rotateR with a list of length one (defined as [x] here) we'll just return the list unchanged, since a list of length one, rotated, is itself. If we call it with a list of length > 1, we'll actually rotate it. The variable xs in this case means any list. Since the pattern matching is attempted in descending order, our first function will catch all lists of length one and our second function will catch all other lists. This is powerful stuff and makes it very easy to implement recursive functions and deal with edge cases.

To make this function more robust, we could could add in rotateR [] as a pattern so we could define how the function handles being passed an empty list. This is a common pattern in Haskell.

This is just the beginning of the algorithm. More will be coming soon... The next post is now available here.