Next: Final project Up: Language and Speech Processing Previous: Lecture schedule 2008/9

#### Midterm project

The deadline for the midterm is 7 November 2008.
Instructions
• Submit the midterm report as a PDF, using a computer program to edit and typeset your text.
• We do not accept handwritten submissions (zero without smoothing).
Marking the midterm
The mark for the midterm is based on
• The scientific report: well structured, well written, clear, correct, complete, innovative, with references, etc. (see Blackboard » Assignments)
• The program
• Where needed in certain cases, a clarifying oral discussion with the lecturers.
Project detail
For your convenience, the project is structured in steps; together, the steps make up the total project. Submit the report and program for the total project only (not for the individual steps).
1. Markov language models over word sequences
In this step your goal is to build a language model over word sequences. This will be a 1st-order Markov model. Take care that your program is general enough to take as input a file of sentences, where every sentence is a sequence of words delimited by white space (do not disregard the difference between upper- and lower-case letters). Note that every sentence ends with a dot (.). You will need a START symbol for every sentence (in order to condition every first word on START).
Programming
• Extract a table of all bigrams with their frequencies from the text and store those in an efficient mechanism (storage and retrieval - possible hash function?).
• Build a function that returns the relative frequency estimate (probability) for a given bigram from the counts in the table.
• Build a procedure that takes as input a sentence and assigns to it a probability according to the 1st-order Markov model.
• Build a procedure that takes as input a set of words S:
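The first three bullets above could take the following shape; this is a sketch in Python, and all names (`bigram_counts`, `bigram_prob`, `sentence_prob`, the `<START>` token) are illustrative rather than prescribed by the assignment. A plain dictionary serves as the hash-based storage mechanism mentioned above.

```python
from collections import defaultdict

START = "<START>"

def bigram_counts(sentences):
    """Extract bigram and unigram counts from tokenized sentences.

    A dict keyed on (prev, word) pairs doubles as the hash table for
    efficient storage and retrieval."""
    bigrams = defaultdict(int)
    unigrams = defaultdict(int)
    for words in sentences:
        prev = START
        for w in words:  # case-sensitive; the final '.' is an ordinary token
            bigrams[(prev, w)] += 1
            unigrams[prev] += 1
            prev = w
    return bigrams, unigrams

def bigram_prob(bigrams, unigrams, prev, word):
    """Relative frequency estimate: count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(bigrams, unigrams, words):
    """Probability of a sentence under the 1st-order Markov model."""
    p = 1.0
    prev = START
    for w in words:
        p *= bigram_prob(bigrams, unigrams, prev, w)
        prev = w
    return p
```

For example, training on the two sentences `the dog barks .` and `the cat sleeps .` gives `sentence_prob` of the first sentence as 1.0 × 0.5 × 1.0 × 1.0 = 0.5.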
Further exercises (for the report):
• Formalize the language model (1st-order Markov) and provide a formula for the relative frequency estimates for the bigram probabilities P(w_i | w_{i-1}) and for the sentence probability P(w_1 ... w_n).
• Plot on the same graph the relative frequency estimates for several of the probability mass functions ("distributions") P(· | w) for different conditioning words w.
• Use as training data for your model the data found on Blackboard under Assignments.
2. Markov language models over POS tag sequences and lexical models
The goal is to build the components of a POS tagger: a 1st-order Markov language model over POS tags and a lexical model over word-tag pairs.
Programming
• For the language model over POS tag sequences: extract a 1st-order Markov model (reuse the programs from step 1!!).
• For the lexical model: extract a table of word-POS pairs with relative frequency estimates for P(word | tag).
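A possible sketch of the lexical-model bullet in Python; `lexical_model` and the input format (a list of sentences, each a list of (word, tag) pairs) are illustrative assumptions, not prescribed by the assignment:

```python
from collections import defaultdict

def lexical_model(tagged_sentences):
    """Relative frequency estimates P(word | tag) from a POS-tagged corpus.

    Input: list of sentences, each a list of (word, tag) pairs.
    Output: dict mapping (word, tag) to count(word, tag) / count(tag)."""
    pair_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pair_counts[(word, tag)] += 1
            tag_counts[tag] += 1
    return {(w, t): c / tag_counts[t] for (w, t), c in pair_counts.items()}
```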
Further exercises (for the report):
• Describe formally how the POS tagging model is built starting from the Noisy-Channel model including the details of the lexical and the language models.
• Derive, only theoretically, no programming needed, an alternative POS tagging model by making other kinds of independence assumptions than the ones described in the literature.
• A POS-tagged corpus is available on Blackboard.
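As a reminder of the standard noisy-channel decomposition that the report should formalize (this is the textbook form; the assignment leaves the exact derivation to you):

```latex
\hat{t}_1^{\,n}
  = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)
  = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
  \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The second equality follows from Bayes' rule (the denominator P(w_1^n) is constant over tag sequences); the approximation combines the lexical independence assumption (each word depends only on its own tag) with the 1st-order Markov assumption over tags.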

3. Calculating the most probable POS tag sequence and sentence probability
Viterbi and Forward algorithms for POS tagging.
Programming
• Build a procedure that takes a sentence as input and builds for that sentence the lattice/trellis assigned by the 1st-order Markov model (see lecture slides and book). (You will need this lattice in the next step for the Viterbi algorithm for tagging.)
• Given an input sentence, write a procedure that uses the lattice/trellis as data structure for calculating the most probable POS tag sequence for that sentence. This is the Viterbi algorithm described in the lecture slides (see book also).
• Write a procedure for calculating the sentence probability (Forward probability). Note that you can re-use almost the same procedure as Viterbi, just change the Max with Sum (see lecture slides).
• Write a small program that takes (1) a "gold corpus" G (the correctly tagged corpus) and (2) the output of the tagger for the same corpus of sentences, T. The program outputs the accuracy of T relative to G, i.e., it checks how accurate your tagger is relative to the gold corpus.
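The three procedures above could be sketched as follows. This is an illustrative Python sketch, not the prescribed implementation: `lex_prob` and `trans_prob` stand for lookups into the lexical and language model tables of Step 2, and the lattice is represented as a list of per-position dictionaries.

```python
def viterbi(words, tags, lex_prob, trans_prob, start="<START>"):
    """Most probable POS tag sequence for `words` under a 1st-order HMM.

    lattice[i][t] = (probability of the best path ending in tag t at
    word i, backpointer to the previous tag on that path)."""
    lattice = [{t: (trans_prob(start, t) * lex_prob(words[0], t), None)
                for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            best_prev, best_p = None, 0.0
            for s in tags:
                p = lattice[i - 1][s][0] * trans_prob(s, t)
                if p > best_p:
                    best_prev, best_p = s, p
            column[t] = (best_p * lex_prob(words[i], t), best_prev)
        lattice.append(column)
    # read off the best final tag, then follow the backpointers
    tag = max(tags, key=lambda t: lattice[-1][t][0])
    sequence = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = lattice[i][tag][1]
        sequence.append(tag)
    return list(reversed(sequence))

def forward(words, tags, lex_prob, trans_prob, start="<START>"):
    """Sentence probability: the Viterbi recursion with max replaced by sum."""
    prev = {t: trans_prob(start, t) * lex_prob(words[0], t) for t in tags}
    for w in words[1:]:
        prev = {t: sum(prev[s] * trans_prob(s, t) for s in tags) * lex_prob(w, t)
                for t in tags}
    return sum(prev.values())

def accuracy(gold, tagged):
    """Per-token accuracy of tag sequences `tagged` relative to `gold`."""
    correct = sum(g == t for gs, ts in zip(gold, tagged)
                  for g, t in zip(gs, ts))
    return correct / sum(len(gs) for gs in gold)
```

Note that `viterbi` and `forward` share the same lattice structure, which is why re-use between the two procedures is nearly free.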
Further exercises (for the report):
• Tag (run your tagger on) testset.sentences (Blackboard) and report the accuracy relative to testset.gold (all corpora are available on Blackboard).
• Describe the Viterbi algorithm formally (recursive formula) and discuss its implementation. Similarly, describe formally how the sentence probability is calculated dynamically by the Forward algorithm.
• For the sentence "The shares dropped yesterday.", draw the lattice that your POS-tagging HMM builds by consulting the language model and lexical model tables.
4. Smoothing the lexical model
Smooth the lexical model in order to deal with unseen pairs and also with unknown words. You may assume that all POS tags are available, i.e., no unknown POS tags problem.
Programming
• Use the Good-Turing method for discounting in order to reserve mass for unseen pairs. (See note below on Good-Turing).
• For every POS tag, estimate the probability that it produces unknown words (words not in the training corpus with this POS tag). Then use bucketing according to word prefixes, suffixes and other features in order to approximate P(unknown word | tag). Try various features and report which ones help in obtaining better results in guessing the tags of unknown words.
• Note on Good-Turing: sometimes N_{r+1} (for some frequency r) will be equal to zero, which makes the Good-Turing formula fail. A fix for this is either to apply Good-Turing only to low frequencies r, or alternatively to apply a first smoothing of the counts N_r (before smoothing the probabilities): see Simple Good-Turing by G. Sampson (see the reading page of the course).
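A minimal sketch of Good-Turing discounting along the lines of the first fix in the note above (apply it only to low frequencies); the function names, the cutoff `k`, and the input format are illustrative assumptions:

```python
from collections import Counter

def good_turing_counts(counts, k=5):
    """Discounted counts r* = (r + 1) * N_{r+1} / N_r, applied for low r only.

    `counts` maps events (e.g. word-tag pairs) to their frequencies.
    Events with frequency r >= k keep their raw count, which sidesteps
    the N_{r+1} == 0 problem for high frequencies."""
    n = Counter(counts.values())  # n[r] = N_r, number of events seen r times
    return {event: (r + 1) * n[r + 1] / n[r] if r < k and n[r + 1] > 0
            else float(r)
            for event, r in counts.items()}

def unseen_mass(counts):
    """Total probability mass reserved for unseen events: N_1 / N."""
    n1 = sum(1 for r in counts.values() if r == 1)
    return n1 / sum(counts.values())
```

The mass returned by `unseen_mass` is what gets redistributed over the unseen word-tag pairs, e.g. via the prefix/suffix bucketing described above.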
Further exercises (for the report):
• Describe the details of your smoothing of the lexical model.
• Describe the different features you tried for unknown words.
• Obtain tagging results on the same test set as Step 3 and report accuracy.
• Obtain accuracy for known words alone, for unknown words alone and for all on the test set and report that.

5. Report writing and delivering the total
Submit the total program and write the final report (5 pages) covering all four steps. This is your midterm project!
The report is a scientific article, limited to 5 pages (hard limit). It should describe the essence of the preceding four steps formally, and explain clearly and concisely language models, Hidden Markov Models, POS tagging, prediction and smoothing, and their relation to one another.

Khalil Sima'an 2008-10-02