Language
and Speech Processing
THIS PAGE IS WORK IN
PROGRESS!!! No claims can be made!!
Mid-Term: Mini-Projects
The deadline for delivery of reports and discussion of the projects
will be set soon.
If you wish to suggest your own project, please do so (come by to
discuss it before you start working).
Otherwise, here is a short list of projects to choose from.
Every project consists of a design+programming task followed by a
short report (5-8 pages: see report structure).
TRAINING AND TEST DATA ARE TO BE OBTAINED FROM THE
LECTURER
Project 1
(1 student) Build a 2nd-order Markov language model
over words, i.e. a word-trigram model, and smooth its frequencies
using the Katz smoothing technique: recursive
backoff from the 2nd-order to the 1st-order model and from that again
to the 0th-order model.
Conduct experiments
using this trigram model to estimate the probability of the following
texts. Each text must be treated as a sequence of
sentences that are assumed to be independent and identical outcomes of
the same experiment, i.e. the probability of a sequence of sentences is
the product of the probabilities of the individual sentences
according to the language model:
- The original training-set
- The original test-set that you receive from me.
- A modified test-set: exchange every 5th word in
every sentence with the 4th word, then estimate the probability of this
new sequence of sentences using the same language model.
- A further modification of the preceding test-set: also exchange the
3rd and 4th words of every sentence (after having exchanged the 5th
with the 4th as in the preceding set).
Compare the probabilities of these texts to one another and
explain why some texts are more probable than others.
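The structure of such a backoff model can be sketched as follows. This is a minimal illustration, not a full Katz implementation: it uses a fixed absolute discount D in place of the Good-Turing discounts of proper Katz smoothing, but the recursive backoff chain (trigram to bigram to unigram) and the text-probability computation are the ones described above. All names and the discount value are assumptions.

```python
import math
from collections import Counter, defaultdict

D = 0.5  # absolute discount per seen n-gram (an assumed constant)

class BackoffTrigramLM:
    """Word-trigram model with recursive backoff to bigram and unigram."""

    def __init__(self, sentences):
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.bi_followers = defaultdict(set)   # u -> words seen after u
        self.tri_followers = defaultdict(set)  # (u, v) -> words seen after u v
        for sent in sentences:
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            for w in toks:
                self.uni[w] += 1
            for a, b in zip(toks, toks[1:]):
                self.bi[(a, b)] += 1
                self.bi_followers[a].add(b)
            for a, b, c in zip(toks, toks[1:], toks[2:]):
                self.tri[(a, b, c)] += 1
                self.tri_followers[(a, b)].add(c)
        self.N = sum(self.uni.values())
        self.V = len(self.uni)

    def p_uni(self, w):
        # 0th-order model, add-one smoothed so unseen words keep some mass
        return (self.uni[w] + 1) / (self.N + self.V + 1)

    def p_bi(self, u, w):
        c = self.bi[(u, w)]
        if c > 0:
            return (c - D) / self.uni[u]          # discounted ML estimate
        seen = self.bi_followers[u]
        if not seen:
            return self.p_uni(w)                  # context never observed
        reserved = D * len(seen) / self.uni[u]    # mass freed by discounting
        denom = 1.0 - sum(self.p_uni(x) for x in seen)
        return reserved * self.p_uni(w) / denom   # backoff to 0th-order

    def p_tri(self, u, v, w):
        c = self.tri[(u, v, w)]
        if c > 0:
            return (c - D) / self.bi[(u, v)]
        seen = self.tri_followers[(u, v)]
        if not seen:
            return self.p_bi(v, w)
        reserved = D * len(seen) / self.bi[(u, v)]
        denom = 1.0 - sum(self.p_bi(v, x) for x in seen)
        return reserved * self.p_bi(v, w) / denom  # backoff to 1st-order

    def text_logprob(self, sentences):
        # sentences are assumed independent, so the probability of a text
        # is the product of sentence probabilities: a sum of log-probs
        total = 0.0
        for sent in sentences:
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            for a, b, c in zip(toks, toks[1:], toks[2:]):
                total += math.log(self.p_tri(a, b, c))
        return total
```

Working in log space, as in `text_logprob`, is essential in practice: multiplying many small sentence probabilities directly would underflow.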
Project 2
(1 student) Tagging for
Simple Spelling Correction.
Build a POS tagger that does Viterbi tagging and
computes the probability of an input sentence.
As usual, the
language model over POS tags is a 1st-order Markov model, and the
lexical model is
the one discussed in the lecture (every word is generated dependent
only on its own tag). Use a method for smoothing the components of
this tagger, e.g. the add-one method for the language-model component
and a simple treatment of unknown words for the lexical component
(e.g. assume that any unknown word is
generated from proper nouns only and that no unknown words are
generated from any other POS-tag category).
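The pieces described above fit together roughly as in the sketch below: add-one smoothing on the tag transitions, an unsmoothed lexical model for known words, unknown words forced to the proper-noun tag, and Viterbi decoding in log space. The tag name `NNP` and all identifiers are assumptions; adapt them to the tagset of the supplied data.

```python
import math
from collections import Counter

class HMMTagger:
    UNK_TAG = "NNP"  # assumed proper-noun tag; adjust to your tagset

    def __init__(self, tagged_sents):
        self.trans = Counter()      # (prev_tag, tag) counts
        self.emit = Counter()       # (tag, word) counts
        self.ctx = Counter()        # how often each tag occurs as context
        self.tag_count = Counter()
        self.vocab = set()
        for sent in tagged_sents:
            prev = "<s>"
            for word, tag in sent:
                self.trans[(prev, tag)] += 1
                self.ctx[prev] += 1
                self.emit[(tag, word)] += 1
                self.tag_count[tag] += 1
                self.vocab.add(word)
                prev = tag
            self.trans[(prev, "</s>")] += 1
            self.ctx[prev] += 1
        self.tags = list(self.tag_count)

    def p_trans(self, prev, tag):
        # add-one smoothed 1st-order Markov model over tags
        return (self.trans[(prev, tag)] + 1) / (self.ctx[prev] + len(self.tags) + 1)

    def p_emit(self, tag, word):
        if word not in self.vocab:
            # unknown words are assumed to come from proper nouns only
            return 1.0 if tag == self.UNK_TAG else 0.0
        return self.emit[(tag, word)] / self.tag_count[tag]

    def tag(self, words):
        """Viterbi decoding over log-probabilities."""
        NEG = float("-inf")
        V = [{} for _ in words]      # best log-prob per position and tag
        back = [{} for _ in words]   # backpointers
        for t in self.tags:
            e = self.p_emit(t, words[0])
            V[0][t] = math.log(self.p_trans("<s>", t)) + math.log(e) if e > 0 else NEG
        for i in range(1, len(words)):
            for t in self.tags:
                e = self.p_emit(t, words[i])
                if e == 0.0:
                    V[i][t], back[i][t] = NEG, self.tags[0]
                    continue
                prev = max(self.tags,
                           key=lambda p: V[i - 1][p] + math.log(self.p_trans(p, t)))
                V[i][t] = V[i - 1][prev] + math.log(self.p_trans(prev, t)) + math.log(e)
                back[i][t] = prev
        last = max(self.tags, key=lambda t: V[-1][t] + math.log(self.p_trans(t, "</s>")))
        seq = [last]
        for i in range(len(words) - 1, 0, -1):
            seq.append(back[i][seq[-1]])
        return seq[::-1]
```

The same trellis also yields the probability of the input sentence: summing (rather than maximizing) over predecessor tags at each step gives the forward probability.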
Experiments:
- Tag the test-set and report the tagging precision and
recall. Let the training-set grow from 1/10 to 10/10 of the original
training-set
in steps of 1/10, and plot accuracy vs. size of the training-set
(X-axis is the size of the training set). Discuss and explain the results.
- Now provide a treatment of unknown words based on their
prefixes and suffixes, and run the experiments again.
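One simple way to realize the suffix-based treatment is to collect, from the training set, tag distributions for every word-final character sequence up to some length, then score an unknown word by its longest matching suffix. The sketch below shows the suffix half of this idea (prefixes are symmetric); all names and the maximum suffix length are assumptions.

```python
from collections import Counter, defaultdict

def build_suffix_model(tagged_sents, max_len=3):
    # tag distributions per word-final suffix, up to max_len characters
    suffix_tags = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            for k in range(1, max_len + 1):
                if len(word) >= k:
                    suffix_tags[word[-k:]][tag] += 1
    return suffix_tags

def unknown_word_tag_probs(word, suffix_tags, max_len=3):
    # back off from the longest to the shortest matching suffix
    for k in range(min(max_len, len(word)), 0, -1):
        counts = suffix_tags.get(word[-k:])
        if counts:
            total = sum(counts.values())
            return {t: c / total for t, c in counts.items()}
    return {}  # no matching suffix: caller falls back, e.g. to uniform
```

The returned distribution can replace the "proper nouns only" emission probability for unknown words in the tagger's lexical component.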
Project 3
(2
students) Language Identification using Language Models: Exercise
6.10 of the Manning and Schütze book, page 227. See also
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
For data
in different languages you will need to download texts from the web in
these languages. (I still need to locate corpora that are available for
free online.)
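The basic idea can be sketched with character-trigram language models: train one model per language, then pick the language whose model assigns the input the highest probability. This is a minimal illustration, not the rank-based method used by TextCat; the add-one smoothing, the alphabet-size constant, and all names are assumptions.

```python
import math
from collections import Counter

def train_char_model(text):
    """Character-trigram counts from one training text for one language."""
    tri = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return tri, sum(tri.values())

def score(text, model, alphabet_size=256):
    # add-one smoothed log-probability of the text's character trigrams
    tri, total = model
    return sum(math.log((tri[text[i:i + 3]] + 1) / (total + alphabet_size ** 3))
               for i in range(len(text) - 2))

def identify(text, models):
    # pick the language whose model gives the input the highest score
    return max(models, key=lambda lang: score(text, models[lang]))
```

Character n-grams are preferred over word n-grams here because they need far less training data and handle unseen words gracefully, which is why short inputs can still be identified reliably.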
Project 4
(2
students) NP-Chunking as tagging: the task is to be defined more
accurately soon; the data will come from CoNLL
http://cnts.uia.ac.be/conll99/npb/
Corpora for training and testing on these tasks are available from
the lecturer.
Suggestion for the structure of the report
- Introduction of the problem
- Formalization of the solution
- Implementation details (specific issues that you
encountered).
- Empirical experiments: description of the setting and
tables of results
- Conclusion