Next: Final project
Up: Language and Speech Processing
Previous: Lecture schedule 2008/9
Midterm project
The deadline for the midterm is set on 7 November 2008
- Instructions
- Read the homework/assignments rules (and report structure) carefully (see Blackboad»Assignments)
- Submit the midterm report in pdf using a computer program to edit and typeset your text.
- We do not tolerate hand written submissions (zero without smoothing).
- Marking the midterm
- The actual mark for the midterm is based on
- The scientific report: well structured, well written, clear, correct, complete, innovative,
references etc (see balckboard»Assignments)
- The program
- Possibly clarification in an oral discussion with lecturers, if needed in certain cases.
- Project detail
- For your convenience, the project is structured in steps. The accumulation of the steps is the total
project. Report and program for the total project to be submitted (not for the individual steps).
- Markov language models over word sequences
In this step your goal is to build a language model over word sequences.
This will be a 1st-order Markov Model. Take care that your program is general enough to take
as input a file of sentences, where every sentence is a sequence of words delimited with white
space (do not disregard differences between capital vs small case letters).
Note that every sentence ends with a dot (.) . You will need a START symbol for every sentence
(in order to condition every first word on START).
- Programming
- Extract a table of all bigrams with their frequencies from the text and store those in
an efficient mechanism (storage and retrieval - possible hash function?).
- Build a function that return the relative frequency estimate (probability) for
for a given bigram
from the counts in the table.
- Build a that takes as input a sentence and assigns to it a probability according
to the 1st-order Markov model.
- Build a procedure that takes as input a set of words S:
- Further excercises (for the report):
- Formalize the language model (1st-order Markov) and
provide a formula for the relative frequency estimates for
and for
.
- Plot on the same graph the relative frequency estimate for probability mass functions
(``distributions")
,
,
,
- Use as training data for your model the data found on Blackboard under Assignments.
- Markov language models over POS tag sequences and lexical models
- The goal is to build the components of a POS tagger:
a 1st-order Markov language model over POS tags and a lexical model over word-tag pairs.
- Programming
- For the language model over POS tag sequences: extract a 1st-order Markov model
(reuse the programs from step 1!!).
- For the lexical model: extract a table of word-POS pairs with relative frequency estimates
for
- Further excercises (for the report):
- Describe formally how the POS tagging model is built starting from
the Noisy-Channel model including the details of the lexical and the language models.
- Derive, only theoretically, no programming needed, an alternative POS tagging model
by making other kinds of independence assumptions than the ones described in the
literature.
- POS tagged corpus is available on Blackboard
- Calculating the most probable POS tag sequence and sentence probability
Viterbi and Forward algorithms for POS tagging.
- Programming
- Build a procedure that takes an input sentence as input and builds
for that sentence the lattice/trellis that is assigned by a 1st-order Markov model
(see lecture slides and book). (You will need this lattice for the next step3 for
the Viterbi algorithm for tagging).
- Given an input sentence, write a procedure that uses the lattice/trellis as data structure
for calculating the most probable POS tag sequence for that sentence. This is the Viterbi
algorithm described in the lecture slides (see book also).
- Write a procedure for calculating the sentence probability (Forward probability).
Note that you can re-use almost the same procedure as Viterbi, just change the Max
with Sum (see lecture slides).
- Write a small program that takes (1) a ``gold corpus" G (the correctly tagged corpus)
and (2) the output of the tagger for the same corpus of sentences T.
The program outputs the Accuracy of T relative to G, i.e. it checks how precise your
tagger is relative to the gold corpus.
- Further excercises (for the report):
- Tag (run your tagger on) the testset.sentences (Blackboard) and report
the Accuracy relative to the testset.gold (all corporas are available on Blackboard).
- Desribe the Viterbi algorithm formally (recursive formula) and discuss its implementation.
Similarly, describe formally how the sentence probability is dynamically calculated
under the Forward algorithm.
- For the sentence ``The shares dropped yesterday.", draw the lattice that your POS tagging
HMM builds by consulting the language model and the lexical model tables.
- Smoothing the lexical model
Smooth the lexical model in order to deal with unseen
pairs
and also with unknown words. You may assume that all POS tags are available, i.e.,
no unknown POS tags problem.
- Programming
- Use the Good-Turing method for discounting in order to reserve mass
for unseen
pairs. (See note below on Good-Turing).
- For every POS tag, estimate the probability that it will produce unknown words
(not in training copus with this POS tag). Then use a bucketing according to word
prefixes, suffixes and other features in order to approximate
.
Try and report your selection of various features that help
in obtaining better results in guessing unknown words.
- Note on Good-Turing: Sometimes
(for some
) will be equal to zero, which makes
the Good-Turing formula fail. A fix for this is to either apply Good-Turing only
to
(or alternatively to apply a first smoothing of
(before smoothing
):
see Simple Good-Turing by G. Sampson - see reading page of the course).
- Further excercises (for the report):
- Describe the details of your smoothing of the lexical model.
- Describe the different features you tried for unknown words.
- Obtain tagging results on the same test set as Step 3 and report accuracy.
- Obtain accuracy for known words alone, for unknown words alone and for all on the
test set and report that.
- Report writing and delivering the total
Submit total program and write final report (5 pages) of all four steps. This is your Midterm project!
The report is a scietific article, limited to 5 pages (hard limit), should describe the essence
of the preceding 4 steps formally, and try to explain clearly and in short Language models,
Hidden Markov Models, POS tagging, prediction and smoothing, their relation to one another.
Next: Final project
Up: Language and Speech Processing
Previous: Lecture schedule 2008/9
Khalil Sima'an
2008-10-02