University of Amsterdam
Informatics Institute
Christof Monz

I'm currently involved in the following research projects:

CoSyne: Multi-Lingual Content Synchronization with Wikis
Duration: 2010-2013
Role: Project Coordinator/Leader
Funder: European Commission, FP7 STREP
Partners: Fondazione Bruno Kessler (Italy), Dublin City University (Ireland), Heidelberg Institute for Theoretical Studies (Germany), Netherlands Institute for Sound and Vision (The Netherlands), Deutsche Welle (Germany), Dutch Chapter of the WikiMedia Foundation (The Netherlands)
Website: http://www.cosyne.eu/
Summary: The combination of dynamic user-generated content and multi-lingual aspects is particularly prominent in Wiki sites. Wikis have gained increased popularity over the last few years as a means of collaborative content creation as they allow users to set up and edit web pages directly. A growing number of organizations use Wikis as an efficient means to provide and maintain information across several sites. Currently, multi-lingual Wikis rely on users to manually translate different Wiki pages on the same subject. This is not only a time-consuming procedure but also the source of many inconsistencies, as users update the different language versions separately, and every update would require translators to compare the different language versions and synchronize the updates. The overall aim of the CoSyne project is to automate the dynamic multi-lingual synchronization process of Wikis. CoSyne addresses the following challenges:
  • achieve robust translation of noisier user-generated content between 6 languages (4 core languages and 2 languages with limited resources to demonstrate adaptability of the system),
  • improve machine translation quality by segment-specific adaptive modeling,
  • identify textual content overlap between segments of Wiki pages across languages to avoid redundant machine translation,
  • identify the optimal insertion points for translated content to preserve coherence,
  • analyze user edits to distinguish between factual content changes and corrections of machine translation output, and exploit the latter to improve machine translation performance in a self-learning manner.

MataHari: Machine Translation with Harvested Internet Resources
Duration: 2010-2012
Role: Project Leader
Summary: The main objective of the proposed research is to take the first step in building a machine translation framework that achieves truly global translation capabilities by covering a large number of languages. To this end, this project will investigate a number of languages that have not been covered, or have been covered only to a small extent, by existing research.

The methods investigated in this project fall under the paradigm of statistical machine translation, which uses a parallel corpus, i.e., documents that have been translated by a professional translator, and then automatically learns the translation rules from this set of documents.
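As a rough illustration of what learning translation rules from a parallel corpus can look like, the sketch below estimates word translation probabilities with the classic IBM Model 1 EM procedure. It is not the project's own method: the function name `train_ibm_model1`, the toy German-English corpus, and the iteration count are assumptions chosen for the example.

```python
from collections import defaultdict


def train_ibm_model1(parallel_corpus, iterations=10):
    """Estimate word translation probabilities t(target | source) with EM.

    parallel_corpus is a list of (source_tokens, target_tokens) pairs.
    This is a hypothetical helper for illustration only.
    """
    target_vocab = {w for _, tgt in parallel_corpus for w in tgt}
    # Start from a uniform distribution over the target vocabulary.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source) pairs
        total = defaultdict(float)  # expected counts of each source word

        # E-step: distribute each target word over the source words
        # of the same sentence pair, in proportion to the current t.
        for src, tgt in parallel_corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac

        # M-step: re-estimate t from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


# Toy parallel corpus (assumed data, not from the project).
toy_corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_ibm_model1(toy_corpus)
```

On this toy corpus, the probability mass for "das" moves toward "the" and the mass for "buch" moves toward "book", because those pairs co-occur most consistently across sentence pairs.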

As the proposed project focuses on languages that have not been covered so far to a large extent, it has to address novel challenges and goes beyond existing academic and commercial research in a number of ways. There are hardly any readily available bilingual training data for the languages considered here, unlike for Arabic or Chinese, where sizable parallel corpora are distributed by the Linguistic Data Consortium (LDC). This means that we have to acquire the necessary training data ourselves.

To this end we will utilize internet resources to learn translation models. By exploiting online resources for machine translation this project will address a number of vital research issues:

  • How can multi-lingual resources be automatically identified and harvested?
  • How can translation rules be learned from smaller and only partially translated resources?
  • How do existing search strategies for finding the most likely translation have to be adapted to cope with limited resources?
  • How can one rapidly build evaluation benchmarks for languages with limited resources?

GALATEAS:
Duration: 2010-2013
Role: Work Package Leader
Funder: European Commission, PSP
Partners: Xerox Research (Coordinator, France), CELI SRL (Italy), University of Trento (Italy), Object Direct SAS (France), Gonetwork SRL (Italy), Bridgeman Art Library Ltd (UK), Humboldt University Berlin (Germany)
Website: http://www.galateas.eu/
Summary: With the growth of digital libraries and digital library federations (as well as partially unstructured collections of documents such as web sites), a large number of vendors offer engines for retrieving content and metadata via search requests by the end user (queries). In most cases these queries are just unstructured fragments of text in a specific language.

Firstly, GALATEAS (LangLog) focuses on getting meaning out of these lists of queries and is aimed at library, federation, and site managers. Contrary to mainstream services in this field, GALATEAS services will not consider the standard structured information of web logs (e.g., click rate, visited pages, users' paths inside the document tree) but the information contained in queries from the point of view of language interpretation.

The second challenge addressed by GALATEAS is Cross-Language Information Retrieval (CLIR), i.e., the capability of typing a query in one language and retrieving documents that are available in other languages.


PASCAL-2: Pattern Analysis, Statistical Modeling and Computational Learning
Duration: 2010-2013
Role: Site Manager
Funder: European Commission, European Network of Excellence
Partners: See website
Website: http://pascallin2.ecs.soton.ac.uk/
Summary: The PASCAL Network of Excellence has created a distributed institute pioneering principled methods of pattern analysis, statistical modeling, and computational learning as core enabling technologies for multimodal interfaces that are capable of natural and seamless interaction with and among individual human users. The resulting expertise has been applied to problems relevant to both multi-modal interfaces and cognitive systems. PASCAL2 will enable a refocusing of the Institute towards the emerging challenges created by the ever expanding applications of adaptive systems technology and their central role in the development of large scale cognitive systems. Furthermore, the funding will enable the Institute to engage in technology transfer through an Industrial Club to effect rapid deployment of the developed technologies into a wide variety of applications, while undertaking a brokerage of expertise and public outreach programme to communicate the value and relevance of the achieved results.

CCCT: Center for Creation, Content and Technology
Duration: 2010-2012
Role: Sub-Project Leader
Funder: Platform Betatechniek
Summary: The Center for Creation, Content and Technology is the University of Amsterdam and the Hogeschool of Amsterdam's response to the scientific, innovative and educational challenges that digital content presents us with. It brings together the University of Amsterdam's acknowledged strengths in Computer Science, Media Studies, and Communication Theory, and the Hogeschool of Amsterdam's Medialab in a unique multi-disciplinary setting.

Completed Projects:

Information Retrieval for Data Selection in Machine Translation
Duration: 2006-2008
Funder: Nuffield Organization
Role: Principal Investigator
Summary: Well-performing statistical MT approaches require very large amounts of training data to achieve their quality. The challenge is to select those subsets within the training data that are most likely to be relevant for a given document or sentence that needs to be translated. This project investigates how information retrieval techniques can be used to build more contextually sensitive methods for identifying training data for building language models used for machine translation.
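
One simple way to picture retrieval-based data selection is to treat the sentence to be translated as a query and rank training sentences by TF-IDF cosine similarity. The sketch below does this under stated assumptions: the function names (`tfidf_vectors`, `cosine`, `select_training_data`), the smoothed IDF formula, and the example sentences are all illustrative and are not taken from the project itself.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return (vectors, idf): one sparse TF-IDF dict per tokenized sentence."""
    n = len(token_lists)
    # Document frequency: number of sentences containing each word.
    df = Counter(w for tokens in token_lists for w in set(tokens))
    # Smoothed IDF (an assumed weighting; other variants work too).
    idf = {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}
    vectors = [
        {w: c * idf[w] for w, c in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_training_data(query, training_sentences, k=2):
    """Return the k training sentences most similar to the query."""
    docs = [s.lower().split() for s in training_sentences]
    vectors, idf = tfidf_vectors(docs)
    # Query words unseen in the training data carry no weight.
    q = {w: c * idf[w] for w, c in Counter(query.lower().split()).items() if w in idf}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return [training_sentences[i] for i in ranked[:k]]


# Assumed example data: a tiny monolingual training pool.
training = [
    "the central bank raised interest rates",
    "the football match ended in a draw",
    "interest rates affect bank loans",
    "the weather was sunny all week",
]
selected = select_training_data("bank interest rates rose", training, k=2)
```

Here `selected` holds the two finance-related sentences, which could then be used to build a language model adapted to the input. A real system would retrieve from a much larger pool with an inverted index rather than scoring every sentence.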