This is a continuation of a series of posts starting here and following this. These three posts were front loaded since most of the work (and interesting explanations) were in the foundational work for the algorithm, with the actual steps (the steps as described in the original paper) being mostly data and application of the functions we wrote in part one.
With that said, Steps 2 - 5:
As the comments suggest, each function here is the application of a series of rules, mostly with the form "(measure>n) SUFFIX -> NEWSUFFIX". What that means in the context of the paper is if the measure of the word's stem (the word minus the first suffix) is over (or equal to) some number, then go ahead and swap the stems. Remember that a word's measure is the number of consonant-vowel sequences in the word. A more thorough explanation is here.
The general form in my implementation is to use
dropWhile against a list of tuples
("SUFFIX", "NEWSUFFIX"). The function
dropWhile will return a list that begins with the first element that doesn't satisfy the given condition. In this case we drop all of the tuples where the first part of the tuple doesn't match our word. To check our suffix, Haskell has a function
isSuffixOf that does exactly what you would expect.
The Python version of this algorithm is case-insensitive. I originally was going to make mine case sensitive (it was easier) and convert all input words to lowercase before processing them. As I was wrapping it up I decided I wanted the output to match the Python version so I could compare the two -- so I went back through and made the whole thing case insensitive. Doing so was pretty straightforward. Haskell has a built in function
toLower that will lowercase a character. I couldn't find a version that worked on strings so I added the
toLowers that does just that.
The module's exported function
stem composes the five main steps (and sub-steps) and returns our stemmed word. Pretty simple.
General Observations about Haskell
With the algorithm done (but not thoroughly tested yet) I thought this would be a good point to review my initial impressions of working with Haskell. As of the date on this post I've been working with the language for the equivalent of a few days (in small pieces, when I had the time), which decidedly isn't very much, but I have started to notice some patterns emerge:
1. Most errors are caught as incompatible type issues.
This is great. We can see one of the major benefits of the language's purity and strong, static typing here. For the most part, if you've screwed up your syntax it will become immediately apparent because a function will be given the wrong number and/or wrong argument types.
2. Most errors are syntax based.
Following on #1, this might just be that I'm really new to the language, but the majority of errors came from the fact that I had written an expression wrong. An common example would be to omit parentheses (or the
. operators). The upside here is those are generally easy to fix.
3. Functions are black boxes that once written and tested, just work.
While that's not completely true, or at least I wouldn't launch a rocket with that untested assumption, the purity of Haskell functions generally means that once you've written a function that produces the expected output, you won't have to revisit that function later. The magic here comes from the lack of side effects and typing.
4. If you're coming from an imperative-only background, there will be some head scratching, yes, but not that much.
The utter alien-ness of the language is strong at first but dissipates pretty quickly. I would imagine this is especially true if you've encountered list comprehensions and recursive functions in other languages. While those aren't the entirety of Haskell, they're the right mindset. My entirely unscientific evaluation is that functional languages do stuff "in place" while imperative languages do stuff "in order". It helps me to think of a Haskell module as a single expression (on a single, very long line). While you wouldn't write it that way, I think it's a good mental model to have.
It's worth pointing out that not only are my impressions based on a very short period of time, they're also based on only a subset of the Haskell language. While this module does use the IO Monad, I skipped over working with Haskell's juicier bits like Monads, Haskell's data structures, the Maybe type, etc. I guess my next exercise will be to find a reason to use more of the language.
I'll be testing this algorithm soon in a follow-up post...
For reference, here is the algorithm in its entirety: