My First Algorithm in Haskell - Step 1

This is a continuation of a series of posts starting here. I've been implementing Porter's stemming algorithm in Haskell in order to learn Haskell. I normally wouldn't learn a language this way, but as I stated in the last post I haven't yet been able to find an excuse to build something else in the language.

This algorithm is broken down into 5 steps which process a word in order. I'll describe how I tackled step one in a moment. Before I do, I'll go over some of the foundational parts of the algorithm that we need in place before we go through the steps.

The first two functions here were described in more detail in the last post. To recap: The first rotates a string one character to the right. The second determines which letters in a word are consonants or vowels. How this is would likely be handled in an imperative language would be to keep track of the index of the letter we're currently evaluating so we could subtract 1 from that index and check the previous letter (which we need to do in the case of a 'y'). That is, we need an index to know what character in a string we're evaluating. Haskell does have the operator !! which allows us to access a list by index but as I stated before I thought that was decidedly un-Haskelly. Instead, I opted to turn our string into a tuple where the first would be the word we're stemming and the second would be a string of either "c" or "v" to denote consonant or vowel.

The functions that follow show why I opted to do it this way: The algorithm sometimes wants to evaluate a word based on its constituent letters and sometimes wants to look at the pattern of consonants and vowels in that word. The functions endswith, hasvowel, and endsdblc (double consonant) don't need an explanation. The function measure is a bit more complex. What are we measuring? We're measuring the number of consonant-vowel sequences in a word. 

A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be 
denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, 
or part of a word, therefore has one of the four forms:

    CVCV ... C
    CVCV ... V
    VCVC ... C
    VCVC ... V

These may all be represented by the single form

    [C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents.
Using (VC){m} to denote VC repeated m times, this may again be written as

I'll admit that the "denote arbitrary presence of their contents" part confused me for quite a while. I had to read through the Python implementation and write my Haskell version before it made sense. It just means those bracketed letters (or letter sequences) may or may not be there. In retrospect I can't figure out why I couldn't figure it out -- other than the fact that "arbitrary" isn't the correct word.

So, we need to scan through our word to the first vowel sequence and then count how many times we encounter a consonant sequence. The first part [head a | a <- group ds] reduces our sequence (say "ccvvccvvcc") to its transitions ("cvcvc"). We do that because for our purposes a "cc" is the same as "c". The group function in Haskell groups like elements together in a list, so "aab" becomes ["aa", "b"]. At the front of our list comprehension head takes the first of every group ["a", "b"]

The function dropWhile will return a list starting from the first element that doesn't match the condition. In this case we'll drop the leading consonant, if any, so "cvcv" becomes "vcv".  The function filter operates in a similar way. It will return a list of elements that evaluate to true: "vcv" becomes "c". From there we just count the number of elements in the list to get our measure. Another interesting Haskell operator here: $. The $ operator changes the precedence of the expression that follows. We could rewrite length $ filter (=='c') ds' as length (filter (=='c') ds'). Using $ is often cleaner looking.

It's illustrative to take a look at the Python version:

Which one is easier to read? To be fair this is a problem that Haskell is ideally suited to. (The Python version could be made more succinct as well. The version here was apparently translated line for line from the C version).

Ok, now on to the actual steps:

The two helper functions at the top of this section probably don't need an explanation. The first returns a word minus a given suffix. The second swaps a suffix with a new one. The first part of step 1 step1a begins the actual meat of our algorithm. It simply matches our word against the attached list of tuples in descending order. If our word ends with suffix on the left, we replace it with the suffix on the right.

The remaining steps do similar matching and replacing operations, each using the foundational functions we discussed above. While their operation is somewhat opaque, there's not much to be gained by discussing them in detail.

More will be coming soon... The remaining steps of the algorithm are now available here.