Thursday, June 26, 2014

Information theory, Philadelphia Mathematicians, and some probability

This Saturday, June 28th, the Philadelphia Math Counts meetup will meet at the greatest coffee shop in Philly, Capriccio Cafe & Espresso Bar, to discuss John R. Pierce's book, An Introduction to Information Theory. As per the usual order of the group, you don't need to read the book to join in the discussion, just come with an interest in the material and two dollars to help make sure we get to continue doing this. Everybody from novices with questions to experts with answers are welcomed, because the regulars fall under both categories. I can't make it this month, but I can spend a little time posting about the book on the blog.

This post is actually inspired by two things, the book An Introduction to Information Theory by John R. Pierce, and this post I found on how to write a simple spelling correcter in python. The article is great and goes into detail about the hard math and how this applies to programming a spellchecker. It also has links to other pages with other programming languages in case you don't do python. The point of my post is to cement my understanding of the statistic presented in both works by explaining to the hypothetical readers.
The chapter that interested me the most was the third chapter, "A mathematical model". This book gets more into how to use math then other books I've read recently, and can still keep these important tools, Mathematical modeling and building theories, light enough to understand without needing a Masters in the subject. This model Pierce presents accentuates this point.
Build random words based on math. That's the name of the game here. This doesn't give you the hard stats and the hard formulas to this task. Instead, it walks the reader through the skeleton of building a model. The scenario Pierce describes is imagine a bag with an equal amount of letters and spaces (for discussion, 25 of each).  Mix it well, and draw from it at random. And the result is called a zero order approximation.
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZAACIZLHJQD.
As you can see, it's gibberish. Why? Well, it obvious that there are more rules to language than "String letters together". Therefore, his example moves on to add more rules to the words (such as how many single letter words are in the English language, how often they are used, and frequency of letters) and each time adds another number to the approximation. A third-order approximation  is more accurate than a first-order.
If we have words, we still do not have English. We are still missing that wonderful structure known as grammar. The thing is, building a sentence this letter by letter process is very complicated. And who does that? I'm writing this damn thing now, and I don't choose my next word by frequency of letter. It's more natural to just have a collections of words in your brain and use them as you need them. The good news is, we can force machines to do this! Pierce uses this idea and adds it to his level of approximation model, but the article I posted to earlier, How to Write a Spelling Corrector, actually gives a practical use for all this mumbo-jumbo. This article is not more complex, it just uses more equations, as well as showing you how to write a basic spell-checker in python. 
The way the code works is given a user entered sentence, the computer checks it against a list it already has. If a word in the sentence matches a word in the file, then it assumes it's correct and moves on to the next one. If the word is, say, 'thew', then it has a choice between two words 'the' and 'thaw'. The choice comes down to probability. The article becomes more technical at this point, so to keep it light, a word that's used more frequently will have a higher probability, so the program will more likely choose a word with higher probability. The article is worth a read for anyone interested in how spelling correctors work, because it not only discusses probability but how the program assigns probability,  what it does if it's given a word that it doesn't know, and how it learns and becomes more accurate.
That's all I got today. Man, it feels good to finally finish this thing. I love math, and I love writing, but writing these math posts are always something I put off as long as possible, and I always feel like it needs to be better. Whatever. Vacation, here I come.

1 comment: