Language and Speech Processing

THIS PAGE IS A WORK IN PROGRESS!!! Nothing here is final yet!!

Mid-Term: Mini-Projects



The deadline for delivery of the reports and discussion of the projects will be announced soon.


If you wish to suggest your own project, please do so (come by to discuss it before you start working).
Otherwise, here is a short list of projects to choose from.

Every project consists of a design-and-programming task followed by a short report (5-8 pages; see the suggested report structure below).

TRAINING AND TEST DATA ARE TO BE OBTAINED FROM THE LECTURER
Project 1
(1 student) Build a 2nd-order Markov language model over words, i.e. a word-trigram model, and smooth its frequencies using the Katz smoothing technique: recursive backoff to the 1st-order (bigram) model and from that again to the 0th-order (unigram) model.
Conduct experiments using this trigram model to estimate the probability of the following texts. Each text must be treated as a sequence of sentences that are assumed to be independent and identically distributed outcomes of the same experiment, i.e. the probability of a sequence of sentences is the product of the probabilities of the individual sentences according to the language model:
  1. The original training-set
  2. The original test-set  that you receive from me.
  3. Change the test-set as follows: exchange every 5th word in every sentence with the 4th word (one possible implementation is sketched after this list); now estimate the probability of this new sequence of sentences using the same language model.
  4. Change the preceding test-set by also exchanging the 3rd and 4th words of every sentence (after having exchanged the 5th with the 4th as in the preceding set).
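
For concreteness, here is one way to implement the word swaps of texts 3 and 4. This is only a sketch: it reads "every 5th word" as the word at position 5 of each sentence (matching the phrasing of item 4); if the periodic reading (positions 5, 10, 15, ...) is intended, confirm with the lecturer. The variable test_set is an assumed name for the tokenized test data.

    def swap_words(sentence, i, j):
        # Exchange the i-th and j-th words (1-indexed) of a sentence, if both exist.
        out = list(sentence)
        if len(out) >= max(i, j):
            out[i - 1], out[j - 1] = out[j - 1], out[i - 1]
        return out

    # Text 3: swap the 5th word with the 4th in every sentence.
    # Text 4: additionally swap the 3rd and 4th words of the result.
    text3 = [swap_words(s, 5, 4) for s in test_set]
    text4 = [swap_words(s, 3, 4) for s in text3]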
Compare the probabilities of these texts to one another and explain why some texts are more probable than others.
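
The following sketch illustrates the intended structure of the model. It is not a full Katz implementation: a single absolute discount D stands in for the Good-Turing discounting that Katz prescribes, and the backoff weight is not renormalized over unseen successors the way the true Katz alpha is. All function names and the value of D are illustrative, not a required interface.

    import math
    from collections import Counter, defaultdict

    D = 0.5  # assumed absolute discount; full Katz derives its discounts from Good-Turing counts

    def train(sentences):
        # Collect unigram, bigram and trigram counts from tokenized sentences
        # (lists of words), plus the set of observed followers per context.
        uni, bi, tri = Counter(), Counter(), Counter()
        fol_bi, fol_tri = defaultdict(set), defaultdict(set)
        for sent in sentences:
            toks = ["<s>", "<s>"] + sent + ["</s>"]
            for i, w in enumerate(toks):
                uni[w] += 1
                if i >= 1:
                    bi[toks[i - 1], w] += 1
                    fol_bi[toks[i - 1]].add(w)
                if i >= 2:
                    tri[toks[i - 2], toks[i - 1], w] += 1
                    fol_tri[toks[i - 2], toks[i - 1]].add(w)
        return uni, bi, tri, fol_bi, fol_tri

    def p_uni(model, w):
        # 0th-order model with an add-one floor so unseen words keep nonzero mass.
        uni = model[0]
        return (uni[w] + 1) / (sum(uni.values()) + len(uni) + 1)

    def p_bi(model, v, w):
        uni, bi, _, fol_bi, _ = model
        if bi[v, w] > 0:
            return (bi[v, w] - D) / uni[v]            # discounted seen bigram
        back = D * len(fol_bi[v]) / uni[v] if uni[v] else 1.0
        return back * p_uni(model, w)                 # recursive backoff to 0th order

    def p_tri(model, u, v, w):
        _, bi, tri, _, fol_tri = model
        if tri[u, v, w] > 0:
            return (tri[u, v, w] - D) / bi[u, v]      # discounted seen trigram
        back = D * len(fol_tri[u, v]) / bi[u, v] if bi[u, v] else 1.0
        return back * p_bi(model, v, w)               # recursive backoff to 1st order

    def text_logprob(model, sentences):
        # The probability of a text is the product of independent sentence
        # probabilities; summing logs avoids numerical underflow.
        total = 0.0
        for sent in sentences:
            toks = ["<s>", "<s>"] + sent + ["</s>"]
            for i in range(2, len(toks)):
                total += math.log(p_tri(model, toks[i - 2], toks[i - 1], toks[i]))
        return total

Since the texts are long, compare them by text_logprob rather than by raw probabilities: the product over sentences required by the task becomes a sum of sentence log-probabilities.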
Project 2
(1 student) Tagging for Simple Spelling Correction.
Build a POS tagger that does Viterbi tagging and computes the probability of an input sentence.
As usual, the language model over POS tags is a 1st-order Markov model, and the lexical model is the one discussed in the lecture (every word is generated depending only on its own tag). Use a method for smoothing the components of this tagger, e.g. the add-one method for the language-model component and a simple treatment of unknown words for the lexical component (e.g. assume every unknown word is generated only from the proper-noun tag and that no unknown words are generated from any other POS-tag category). A sketch of such a tagger appears below.
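
A minimal sketch, assuming an NNP-style proper-noun tag name (which depends on the tagset of the data): transitions are add-one smoothed, emissions are maximum-likelihood, and unknown words are forced to the proper-noun tag. Note that the Viterbi score below is the probability of the best tag sequence; the probability of the sentence itself sums over all tag sequences (the forward algorithm, i.e. the same recursion with summation in place of maximization).

    import math
    from collections import Counter

    def train(tagged_sentences):
        # Counts for a 1st-order HMM tagger from sentences of (word, tag) pairs.
        trans, emit, ctx, tagc = Counter(), Counter(), Counter(), Counter()
        vocab = set()
        for sent in tagged_sentences:
            prev = "<s>"
            for w, t in sent:
                trans[prev, t] += 1
                ctx[prev] += 1
                emit[t, w] += 1
                tagc[t] += 1
                vocab.add(w)
                prev = t
            trans[prev, "</s>"] += 1
            ctx[prev] += 1
        return trans, emit, ctx, tagc, vocab

    def viterbi(words, model):
        trans, emit, ctx, tagc, vocab = model
        T = len(tagc) + 1                       # tag types plus the </s> event

        def p_trans(prev, t):                   # add-one smoothed transitions
            return (trans[prev, t] + 1) / (ctx[prev] + T)

        def p_emit(t, w):                       # MLE emissions; unknown words are
            if w in vocab:                      # generated by proper nouns only
                return emit[t, w] / tagc[t]
            return 1.0 if t == "NNP" else 0.0   # "NNP" is an assumed tag name

        best = {"<s>": (0.0, [])}               # per tag: best log-prob and path so far
        for w in words:
            new = {}
            for t in tagc:
                e = p_emit(t, w)
                if e == 0.0:
                    continue                    # this tag cannot generate w
                lp, path = max((lp + math.log(p_trans(prev, t)), path)
                               for prev, (lp, path) in best.items())
                new[t] = (lp + math.log(e), path + [t])
            best = new
        lp, path = max((lp + math.log(p_trans(prev, "</s>")), path)
                       for prev, (lp, path) in best.items())
        return path, lp                         # best tag sequence and its log-probability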

Experiments:
  1. Tag the test-set and report the tagging precision and recall: let the training-set grow from 1/10 to 1/1 of the original training-set in steps of 1/10, then plot accuracy vs. training-set size (X-axis is the size of the training-set). Discuss and explain the results.
  2. Now provide a treatment of unknown words based on their prefixes and suffixes and repeat the experiments (see the sketch below).
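
For the second experiment, one simple suffix-based treatment (a sketch; the suffix length and rarity threshold are assumed parameters) estimates (suffix, tag) counts from rare training words, on the idea that rare words behave most like unseen ones:

    from collections import Counter

    def suffix_tag_counts(tagged_sentences, max_suffix=3, rare=5):
        # Collect counts of (suffix, tag) pairs from rare training words;
        # these replace the proper-noun-only fallback when scoring unknown words.
        word_freq = Counter(w for sent in tagged_sentences for w, _ in sent)
        counts = Counter()
        for sent in tagged_sentences:
            for w, t in sent:
                if word_freq[w] <= rare:
                    counts[w[-max_suffix:], t] += 1
        return counts

An unknown word w can then be scored for each tag t in proportion to counts[w[-3:], t] (with some smoothing so that an unseen suffix does not zero out every tag); a prefix table is built the same way.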
Project 3
(2 students) Language Identification using Language Models:
Exercise 6.10 of the Manning and Schütze book, page 227. See also http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
For data in the different languages you need to download texts from the web in those languages; corpora that are available for free online still need to be located.
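
One classical approach, used by the TextCat system linked above, is Cavnar and Trenkle's comparison of rank-ordered character n-gram profiles; the sketch below assumes that method, with the profile size and n as illustrative parameters. An alternative closer to the language-model theme of the exercise is to train a smoothed character n-gram model per language and pick the language that assigns the input the highest log-probability.

    from collections import Counter

    def profile(text, n=3, top=300):
        # Rank-ordered list of the most frequent character n-grams in the text.
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        return [g for g, _ in grams.most_common(top)]

    def out_of_place(doc_profile, lang_profile):
        # Cavnar-Trenkle "out-of-place" measure: sum of rank differences, with a
        # maximal penalty for n-grams missing from the language profile.
        rank = {g: i for i, g in enumerate(lang_profile)}
        max_penalty = len(lang_profile)
        return sum(abs(i - rank[g]) if g in rank else max_penalty
                   for i, g in enumerate(doc_profile))

    def identify(text, lang_profiles):
        # lang_profiles: dict mapping language name -> profile built from training text.
        doc = profile(text)
        return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))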

Project 4
        
(2 students) NP-Chunking as tagging: the task will be defined more precisely soon; the data will come from CoNLL, http://cnts.uia.ac.be/conll99/npb/
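
Although the task is still to be specified, the usual reduction (as in the CoNLL shared tasks) casts NP-chunking as word-level tagging with IOB labels, so the tagger machinery of Project 2 carries over. A tiny illustration, assuming the standard B-NP/I-NP/O label set:

    # Each word receives B-NP (begins an NP), I-NP (inside an NP) or O (outside).
    words = ["He",   "reckons", "the",  "current", "account", "deficit"]
    tags  = ["B-NP", "O",       "B-NP", "I-NP",    "I-NP",    "I-NP"]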


Corpora for training and testing for these tasks are available from the lecturer.

Suggestion for the structure of the report
  1.  Introduction of the problem
  2.  Formalization of the solution
  3.  Implementation details (specific issues you encountered)
  4.  Empirical experiments: description of the experimental setting and tables of results
  5.  Conclusion