Peter Norvig, in his spelling checker example, explains training language models in plain English:
Next we train a probability model, which is a fancy way of saying we count how many times each word occurs, using the function train.
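To make the "counting" concrete, here is a minimal sketch in Python. It is written in the spirit of Norvig's description, not copied from his code: the helper names words and probability and the toy sentence are my own assumptions, and Counter stands in for whatever container the original uses.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def train(tokens):
    """'Train' the model: count how many times each word occurs."""
    return Counter(tokens)


# Toy training text, made up for illustration only.
counts = train(words("The cat sat on the mat. The end."))
total = sum(counts.values())


def probability(word):
    """Relative frequency of a word in the training text (0 for unseen words)."""
    return counts[word] / total


print(counts["the"], probability("the"))  # 3 0.375
```

Unseen words get a probability of zero here; a real spelling corrector would need some form of smoothing for those.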
From Richard Feynman’s Lectures on Computation, p. 123:
Now the average information in a message is calculated in standard probabilistic fashion; it is just:
which is our previous result. Incidentally, Shannon called this average information the “entropy”, which some think was a big mistake, as it led many to overemphasize the link between information theory and thermodynamics. [1]
[1] Legend has it that Shannon adopted this term on the advice of the mathematician John von Neumann, who declared that it would give him “… a great edge in debates because nobody really knows what entropy is anyway.” [RPF]
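The average information Feynman describes is, in its standard form, Shannon's entropy: H = -sum_i p_i log2(p_i), measured in bits. As a small illustration, here is a sketch in Python that computes it; the function names entropy and message_entropy are mine, and estimating symbol probabilities from character frequencies is an assumption made only for this example.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits: the sum of -p * log2(p) over all outcomes with p > 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def message_entropy(message):
    """Average information per symbol, with probabilities estimated from symbol counts."""
    counts = Counter(message)
    n = len(message)
    return entropy(c / n for c in counts.values())


print(entropy([0.5, 0.5]))      # 1.0 (a fair coin carries one bit)
print(message_entropy("aabb"))  # 1.0 (two equally likely symbols)
```

A message made of a single repeated symbol has zero entropy, which matches the intuition that it tells you nothing new.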
I have always been a PHP fan. PHP has served me well for more than 10 years. It’s great for creating dynamic web sites from simple to complex, pulling data from databases, handling files, and building things (usually) fast.
But when it comes to research, well, the requirements change. Graphs are required, web representation is rarely needed, and a solid math library is a must. That’s one of the reasons why Matlab, R, Mathematica, and other scientific packages exist after all, right? Matlab and R are great; I doubt there’s anyone out there saying the opposite. I’m sure that at least some of you have had to push data from your program to Matlab, do your calculations there, and then feed the results back to another script or program of yours for further processing. That’s definitely OK if you have to do it only once. But what about when you’re in the development phase of your algorithm or approach, playing with different values, and you have to push data around several times? It’s taxing, if not downright frustrating. I hope you agree.
And let there be light! Python comes with exactly what I need to replace Matlab: NumPy, SciPy, and matplotlib offer solid ground for doing math, statistics, and plotting directly from my Python scripts. Best of all, Python is easy to learn; I grasped the basics in a few days, and was able to pull data from a database and program my first graphs within a week.
For computational linguists, Python has a nice little surprise: NLTK, a toolkit for natural language processing, and Whoosh, a full-blown search engine library supporting some of the most-used algorithms for search.
Final touch: You can export your graphs in LaTeX format and embed them directly in your document.
I am proud to present podTeller, a demo for predicting podcast preference. podTeller is a proof of concept system that analyzes a podcast and estimates its level of listener preference. Listener preference reflects the power of the podcast to draw audience interest and is useful for predicting if a podcast has the potential to be popular.
podTeller is trained on 250 feeds from all 16 podcast categories in iTunes. It uses 20 easily extractable features from the podcast feed. What struck me most was that, when I tested the winning podcasts from Podcast Awards 2008, podTeller estimated user preference for the winners at pretty close to 100%!
Use podTeller to see how your own podcast performs, to bet on next year’s winners, or to just have fun! [0]
If you have ideas on how to improve podTeller, or comments/funny stories after your interaction with it, please don’t hesitate to share them in the comments!
[0] podTeller is a proof-of-concept system. Use it at your own risk; we are not liable for damages of any type that may occur from the use of the system.
Manos Tsagkias, Martha Larson, and Maarten de Rijke submitted a paper to the European Conference on Information Retrieval (ECIR) about predicting podcast preference using easily extracted features from podcast feeds. It is based on our previous work on Podcred: A Framework for Analyzing Podcast Preference. The data we used is pulled from Apple iTunes. The paper will be presented at ECIR 2009, held this year in Toulouse, France, between 6 and 9 April 2009. The abstract follows:
Podcasts display an unevenness characteristic of domains dominated by user generated content, resulting in potentially radical variation of the user preference they enjoy. We report on work that uses easily extractable surface features of podcasts in order to achieve solid performance on two podcast preference prediction tasks: classification of preferred vs. non-preferred podcasts and ranking podcasts by level of preference. We identify features with good discriminative potential by carrying out manual data analysis, resulting in a refinement of the indicators of an existent podcast preference framework.
If you are interested, you can download Exploiting Surface Features for the Prediction of Podcast Preference (pdf), or the presentation slides (pdf).
Update: A proof-of-concept system based on this research is available; its name is podTeller.
A while ago, Data Wrangling Blog published a list of Datasets available on the Web. The list has been updated, adding up to 400 datasets, plus Video Lectures, Seminars, and Talks.
Thanks to Jason @ Mendicantbug.com, computational linguists can now follow NLP-related blogs through a non-comprehensive but sound list of blogs that Jason compiled from his wandering on the net.
Enjoy!
I faced the problem of reading large files in PHP (4.x, and also the 5.2.x series). Not only did file() give up with the error message “Value too large”, but, most surprisingly, fopen() also exited with a similar message.
Well, do not despair! The answer is in PHP 5.3.0-alpha. Once you untar the code, you also need to apply a patch from Wez Furlong. You cd into PHP’s source directory and copy Wez’s patch there. Then, you need to ask patch to .. err .. patch PHP’s source. You do that by typing in the command line:
patch -p0 < wez.patch
assuming that you saved Wez’s patch as wez.patch.
It worked for me on Mac OS X 10.5 (Leopard); hope it works for you too!
PS: in order to get this working, you’ll need a 64-bit machine to compile PHP on and run your script from.
Recently, I stumbled upon OpenNLP. It is a Java API that provides a set of classes and methods for common Natural Language Processing tasks, such as sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference.
OpenNLP also maintains a list of useful related links. Interesting additions to their list are LingPipe, written in Java, and NLTK, written in Python.
Predicting the Volume of Comments on Online News Stories
Manos Tsagkias, Wouter Weerkamp, and Maarten de Rijke published work on predicting the volume of comments on online news stories at CIKM 2009, held in Hong Kong between 2 and 6 November. We looked at 7 online Dutch news agents and 1 collaborative news platform (Digg-like). The task was to predict, prior to publication time, whether a news article would attract a low or a high number of comments. Five sets of features were employed in a two-stage classification process: first predict whether an article will attract any comments at all, and second, if it does, how many. Our method shows solid performance for the first classification step, but degrades for the second. The performance varied from source to source, which probably signifies that the features need to be tuned per source so that the individual characteristics of each source are taken into account. The abstract follows:
If you are interested, you can download Predicting the Volume of Comments on Online News Stories (pdf, BibTeX).
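The two-stage setup described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the method from the paper: the function two_stage_predict, the feature names x1 and x2, and the threshold-based toy classifiers are all hypothetical stand-ins for trained models.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def two_stage_predict(
    features: Features,
    will_get_comments: Callable[[Features], bool],
    volume_is_high: Callable[[Features], bool],
) -> str:
    """Stage 1 decides whether there will be any comments; stage 2 decides low vs. high."""
    if not will_get_comments(features):
        return "none"
    return "high" if volume_is_high(features) else "low"


# Hypothetical toy classifiers; real ones would be trained per news source.
def toy_stage_one(features: Features) -> bool:
    return features.get("x1", 0.0) > 0.5


def toy_stage_two(features: Features) -> bool:
    return features.get("x2", 0.0) > 10.0


article = {"x1": 0.9, "x2": 3.0}
print(two_stage_predict(article, toy_stage_one, toy_stage_two))  # low
```

Keeping the two stages as separate callables makes it easy to swap in a different classifier, or a different feature set, for each source.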