
Peter Norvig’s Spelling Corrector in 21 Lines of CoffeeScript

CoffeeScript is a very nice (and relatively new) language that compiles down to JavaScript, making web programming (as well as Firefox plugins, Node.js apps, and so forth) possibly more joyful. Its object model is the same as JavaScript’s (one of CoffeeScript’s mottos is “Unfancy JavaScript”), and its compiled JS output is quite easy to read and debug. It has many niceties, including array and object comprehensions (heavily influenced by Python’s list comprehensions).

Ruby also has an influence on the language, for example in the optional parentheses on method and function invocation. In fact, the original version of the CoffeeScript compiler was written in Ruby (but nowadays CoffeeScript is a self-hosting language).

CoffeeScript has been used by several projects, including a mobile framework written by 37signals. I’ve been using it for about a year (including some open source work and a port of some Ruby functionality).

Because of all the Ruby and Python influence on the language, and the fact that CoffeeScript can express code beautifully and concisely, I had a hunch that it could rank really well in Peter Norvig’s collection of Spelling Corrector implementations (JavaScript’s smallest version currently has 53 lines, which is quite a bit more than Python’s 21). With some work, I managed to implement it in 21 lines as well:

words = (text) -> (t for t in text.toLowerCase().split(/[^a-z]+/) when t.length > 0)
Array::or = (arrayFunc) -> if @length > 0 then @ else arrayFunc()
Array::flat = -> if @length == 0 then @ else @[0].concat(@[1..].flat())
train = (features) ->
 model = {}
 (model[f] = if model[f] then model[f] + 1 else 2) for f in features
 return model
NWORDS = train(words(require('fs').readFileSync('./lib/big.txt', 'utf8')))
alphabet = 'abcdefghijklmnopqrstuvwxyz'.split ""
edits1 = (word) ->
 s = ([word.substring(0, i), word.substring(i)] for i in [0..word.length])
 deletes = (a.concat b[1..] for [a, b] in s when b.length > 0)
 transposes = (a + b[1] + b[0] + b.substring(2) for [a, b] in s when b.length > 1)
 replaces = (a + c + b.substring(1) for c in alphabet for [a, b] in s when b.length > 0)
 inserts = (a + c + b for c in alphabet for [a, b] in s)
 return deletes.concat transposes.concat replaces.flat().concat inserts.flat()
known_edits2 = (word) -> ((e2 for e2 in edits1(e1) when NWORDS[e2]? for e1 in edits1(word)).flat())
known = (words) -> (w for w in words when NWORDS[w])
correct = (word) ->
 candidates = known([word]).or -> known(edits1(word)).or -> known_edits2(word).or -> [word]
 ({k: w, v: NWORDS[w] or 1} for w in candidates).sort((a, b)-> b.v  - a.v)[0].k

All the code is hosted on GitHub. The code above can be seen in a more readable version here. There is also a more testable version, along with Jasmine tests.

Considerations

JavaScript has no function named findall like Python’s re.findall. Here it is replaced by splitting on the complementary regex and discarding the empty strings that split can leave at the edges of the text (see line 1).
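As a rough illustration of that equivalence, the sketch below compares the split-based approach from line 1 with a global-regex match. The names wordsBySplit, wordsByMatch and sample are made up for this example and are not part of the 21-line listing.

```coffeescript
# Sketch only: wordsBySplit mirrors line 1 of the listing;
# wordsByMatch is a hypothetical alternative shown for comparison.
wordsBySplit = (text) ->
  (t for t in text.toLowerCase().split(/[^a-z]+/) when t.length > 0)

wordsByMatch = (text) ->
  # String::match returns null when nothing matches, so fall back to []
  text.toLowerCase().match(/[a-z]+/g) or []

sample = "Hello, World! It's 2011."

# Without the length filter, split leaves a trailing empty string.
raw = sample.toLowerCase().split(/[^a-z]+/)
console.log raw                   # [ 'hello', 'world', 'it', 's', '' ]

console.log wordsBySplit(sample)  # [ 'hello', 'world', 'it', 's' ]
console.log wordsByMatch(sample)  # [ 'hello', 'world', 'it', 's' ]
```

The `when t.length > 0` guard is what makes the split version behave like a findall: it drops the empty string produced when the text starts or ends with a non-letter.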

Array::or (on line 2) had to be implemented because Python treats a collection as true only when it is not empty, so known([word]) or known(edits1(word)) falls through empty results, whereas in JavaScript an empty array is still truthy. Array::flat (on line 3) had to be implemented because CoffeeScript’s loop comprehensions are a bit different from Python’s: a double loop (example: x + y for y in col1 for x in col2) returns an array of arrays instead of a single flat array.
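A small sketch of both points follows. Array::or is copied from line 2 of the listing; the flatten helper and the variable names are assumptions made for this example (the listing itself uses the recursive Array::flat from line 3 instead).

```coffeescript
# An empty array is truthy in JavaScript, so plain `or` never falls through.
plain = [] or ['fallback']
console.log plain.length            # 0

# Same definition as line 2 of the listing.
Array::or = (arrayFunc) -> if @length > 0 then @ else arrayFunc()

# The fallback functions are only called when the receiver is empty.
chained = [].or -> [].or -> ['fallback']
console.log chained                 # [ 'fallback' ]

# A double comprehension yields one inner array per item of the outer loop.
nested = (x + y for y in ['a', 'b'] for x in ['1', '2'])
console.log nested                  # [ [ '1a', '1b' ], [ '2a', '2b' ] ]

# Hypothetical helper: flatten exactly one level without recursion.
flatten = (arrays) -> [].concat arrays...
console.log flatten(nested)         # [ '1a', '1b', '2a', '2b' ]
```

Passing functions to or, rather than arrays, keeps the evaluation lazy: known_edits2 is only computed when the cheaper candidate sets turn out to be empty.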

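Also note that the order of the loops in a comprehension is inverted. In Python the first for is the outer loop, while in CoffeeScript the last for is the outer loop. That is, x + y for x in xs for y in ys in Python translates to x + y for y in ys for x in xs in CoffeeScript.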
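To make the ordering concrete, here is a minimal sketch that compares a double comprehension with the same iteration written as explicit loops. The arrays xs and ys and the other names are invented for the example.

```coffeescript
xs = ['a', 'b']
ys = ['1', '2']

# Python: [x + y for x in xs for y in ys]  (first `for` is the outer loop)
# CoffeeScript: the last `for` is the outer loop, so the clauses are swapped.
nested = (x + y for y in ys for x in xs)

# The same iteration written as explicit loops, outer loop first.
explicit = []
for x in xs
  for y in ys
    explicit.push x + y

# Flatten one level so the two results can be compared directly.
flattened = [].concat nested...
console.log flattened  # [ 'a1', 'a2', 'b1', 'b2' ]
console.log explicit   # [ 'a1', 'a2', 'b1', 'b2' ]
```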

This version runs really fast on Node.js 0.4.1, and I was quite happy with the way the resulting code looked. I was even happier that I did not have to write by hand the compiled JavaScript file, with its whopping 148 lines of Spelling Corrector (minus the tests).

Finally, please check out Peter Norvig’s original post.

