next up previous
Next: Final project Up: Language and Speech Processing Previous: Lecture schedule 2008/9


Midterm project

The deadline for the midterm is set on 7 November 2008
Instructions
Marking the midterm
The actual mark for the midterm is based on
Project detail
For your convenience, the project is structured in steps. The accumulation of the steps is the total project. Report and program for the total project to be submitted (not for the individual steps).
  1. Markov language models over word sequences
    In this step your goal is to build a language model over word sequences. This will be a 1st-order Markov Model. Take care that your program is general enough to take as input a file of sentences, where every sentence is a sequence of words delimited with white space (do not disregard differences between capital vs small case letters). Note that every sentence ends with a dot (.) . You will need a START symbol for every sentence (in order to condition every first word on START).
    Programming
    • Extract a table of all bigrams with their frequencies from the text and store those in an efficient mechanism (storage and retrieval - possible hash function?).
    • Build a function that return the relative frequency estimate (probability) for $ P(w_x \vert w_y)$ for a given bigram $ \langle w_y, w_x \rangle $ from the counts in the table.
    • Build a that takes as input a sentence and assigns to it a probability according to the 1st-order Markov model.
    • Build a procedure that takes as input a set of words S:
    Further excercises (for the report):
    • Formalize the language model (1st-order Markov) and provide a formula for the relative frequency estimates for $ P(w_x \vert w_y)$ and for $ P(<w_y, w_x>)$ .
    • Plot on the same graph the relative frequency estimate for probability mass functions (``distributions") $ P(X \vert the)$ , $ P(X \vert a)$ , $ P(X \vert IBM)$ , $ P(X \vert Mr.)$
    • Use as training data for your model the data found on Blackboard under Assignments.
  2. Markov language models over POS tag sequences and lexical models
    The goal is to build the components of a POS tagger: a 1st-order Markov language model over POS tags and a lexical model over word-tag pairs.
    Programming
    • For the language model over POS tag sequences: extract a 1st-order Markov model (reuse the programs from step 1!!).
    • For the lexical model: extract a table of word-POS pairs with relative frequency estimates for $ P(POS, word)$
    Further excercises (for the report):
    • Describe formally how the POS tagging model is built starting from the Noisy-Channel model including the details of the lexical and the language models.
    • Derive, only theoretically, no programming needed, an alternative POS tagging model by making other kinds of independence assumptions than the ones described in the literature.
    • POS tagged corpus is available on Blackboard

  3. Calculating the most probable POS tag sequence and sentence probability
    Viterbi and Forward algorithms for POS tagging.
    Programming
    • Build a procedure that takes an input sentence as input and builds for that sentence the lattice/trellis that is assigned by a 1st-order Markov model (see lecture slides and book). (You will need this lattice for the next step3 for the Viterbi algorithm for tagging).
    • Given an input sentence, write a procedure that uses the lattice/trellis as data structure for calculating the most probable POS tag sequence for that sentence. This is the Viterbi algorithm described in the lecture slides (see book also).
    • Write a procedure for calculating the sentence probability (Forward probability). Note that you can re-use almost the same procedure as Viterbi, just change the Max with Sum (see lecture slides).
    • Write a small program that takes (1) a ``gold corpus" G (the correctly tagged corpus) and (2) the output of the tagger for the same corpus of sentences T. The program outputs the Accuracy of T relative to G, i.e. it checks how precise your tagger is relative to the gold corpus.
    Further excercises (for the report):
    • Tag (run your tagger on) the testset.sentences (Blackboard) and report the Accuracy relative to the testset.gold (all corporas are available on Blackboard).
    • Desribe the Viterbi algorithm formally (recursive formula) and discuss its implementation. Similarly, describe formally how the sentence probability is dynamically calculated under the Forward algorithm.
    • For the sentence ``The shares dropped yesterday.", draw the lattice that your POS tagging HMM builds by consulting the language model and the lexical model tables.
  4. Smoothing the lexical model
    Smooth the lexical model in order to deal with unseen $ \langle POS, word \rangle $ pairs and also with unknown words. You may assume that all POS tags are available, i.e., no unknown POS tags problem.
    Programming
    • Use the Good-Turing method for discounting in order to reserve mass for unseen $ \langle POS, word \rangle $ pairs. (See note below on Good-Turing).
    • For every POS tag, estimate the probability that it will produce unknown words (not in training copus with this POS tag). Then use a bucketing according to word prefixes, suffixes and other features in order to approximate $ P (word, unknown  \vert\POS) \approx P(unknown\vert POS) P(features(word)\vert unknown, POS)$ .
      Try and report your selection of various features that help in obtaining better results in guessing unknown words.
    • Note on Good-Turing: Sometimes $ n_r$ (for some $ r$ ) will be equal to zero, which makes the Good-Turing formula fail. A fix for this is to either apply Good-Turing only to $ 0 <= r <=5$ (or alternatively to apply a first smoothing of $ n_r$ (before smoothing $ r$ ): see Simple Good-Turing by G. Sampson - see reading page of the course).
    Further excercises (for the report):
    • Describe the details of your smoothing of the lexical model.
    • Describe the different features you tried for unknown words.
    • Obtain tagging results on the same test set as Step 3 and report accuracy.
    • Obtain accuracy for known words alone, for unknown words alone and for all on the test set and report that.

  5. Report writing and delivering the total
    Submit total program and write final report (5 pages) of all four steps. This is your Midterm project!
    The report is a scietific article, limited to 5 pages (hard limit), should describe the essence of the preceding 4 steps formally, and try to explain clearly and in short Language models, Hidden Markov Models, POS tagging, prediction and smoothing, their relation to one another.

next up previous
Next: Final project Up: Language and Speech Processing Previous: Lecture schedule 2008/9
Khalil Sima'an 2008-10-02