Projects
|
SurReal: Surface Realization in Statistical Machine Translation
|
Duration: 2013-2017
Principal Investigator: Christof Monz
Funder: Advanced research fellowship (Vidi scheme) by Dutch Science Foundation (NWO)
Summary:
A major problem of automatically translating from a foreign language,
e.g., Chinese, to a language a user understands, e.g., English, is the
non-well-formedness, or disfluency, of the machine-translated
output. The predominant cause of current state-of-the-art statistical
machine translation (SMT) systems failing to produce fluent
translations lies in their limited expressive power. Current
approaches only use phrases (be they contiguous or discontiguous) that
were observed in the parallel corpus used to build the phrase table.
The proposed research aims to substantially increase the expressive
power of SMT by decoupling concept identification, which captures the
content of the source sentence, from surface realization, and thereby
to improve translation fluency in particular and translation quality
in general.
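As a rough illustration of the decoupling idea, the following Python sketch
separates a toy concept-identification stage from a toy surface-realization
stage; the lexicon, concept names, and features are invented for
illustration and are not part of the SurReal models:

    # Stage 1: concept identification -- map source tokens to
    # language-independent concepts plus morphological features.
    # (Toy lexicon; a real system would induce this from data.)
    CONCEPTS = {"les":     ("DEF",   {}),
                "maisons": ("HOUSE", {"number": "plural"})}

    def identify_concepts(source_tokens):
        return [CONCEPTS[tok] for tok in source_tokens if tok in CONCEPTS]

    # Stage 2: surface realization -- choose target word forms for the
    # concepts, independently of which phrases were seen in training.
    REALIZATIONS = {("DEF", None):         "the",
                    ("HOUSE", "singular"): "house",
                    ("HOUSE", "plural"):   "houses"}

    def realize(concepts):
        forms = []
        for concept, features in concepts:
            number = features.get("number")
            forms.append(REALIZATIONS.get((concept, number), concept.lower()))
        return " ".join(forms)

    print(realize(identify_concepts("les maisons".split())))  # -> the houses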
|
CoBaLT: Constraint-Based Language Translation for Approximate Redundancy Resolution
|
Duration: 2014-2018
Principal Investigator: Christof Monz
Funder: Free Competition Grant by Dutch Science Foundation (NWO)
Summary:
The proposed project aims to improve the translation quality of
Statistical Machine Translation (SMT) systems (e.g., for translating
from Chinese to English). Redundancies (also known as
over-generation) are an important problem for almost all current
state-of-the-art statistical machine translation systems. On the one
hand, they pollute the search space with hypotheses that can lead to
identical translations. On the other hand, these redundancies can be
used to re-estimate the probabilities of translation candidates and
help improve translation quality by reranking the n-best output of a
decoder. Current approaches exploit redundancies after decoding and
therefore cannot influence the generation of the search space.
In this project, we aim to exploit redundancies to improve search
space generation and exploration.
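As a minimal sketch of the reranking use of redundancies, the following
Python fragment merges decoder derivations that yield the same surface
string, sums their probability mass, and reranks the candidates; the input
format and scores are hypothetical:

    import math
    from collections import defaultdict

    def rerank_nbest(nbest):
        """Merge derivations with identical surface strings and rerank.
        nbest: list of (translation, log_probability) pairs, one per
        derivation in the decoder's n-best output (hypothetical format)."""
        merged = defaultdict(list)
        for translation, logprob in nbest:
            merged[translation].append(logprob)
        rescored = []
        for translation, logprobs in merged.items():
            # Re-estimate the candidate's score by summing the probability
            # mass of all derivations that produce it (log-sum-exp).
            m = max(logprobs)
            total = m + math.log(sum(math.exp(lp - m) for lp in logprobs))
            rescored.append((translation, total))
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)

    # Two redundant derivations of the same string together outrank a
    # single, individually better-scored derivation.
    nbest = [("the house is small", -2.1),
             ("the house is small", -2.3),
             ("the home is small",  -2.0)]
    print(rerank_nbest(nbest))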
|
DIGIT: Domain and Genre-Independent Translation
|
Duration: 2013-2017
Principal Investigator: Christof Monz
Summary:
Recent developments have shown that the ability to access and analyze
foreign-language information is of critical importance to global
companies and national governments. At the same time, professional
translators cannot keep up with the vastly escalating mass of
available information, and automated means of translation are
desperately needed. Automating translation is the goal of
the research area of Machine Translation (MT).
The performance of state-of-the-art MT systems varies vastly depending
on the genre and domain of the document that is to be translated.
This gap in translation quality is clearly undesirable
and shows that the state of the art in Machine Translation is far
from achieving robust translation quality.
Genre or domain differences between the training and test data often
cause a drop in translation quality. This is due to the sensitivity of
statistical machine translation systems to lexical information, i.e.,
the exact surface realizations. To overcome this, the DIGIT project
investigates ways to incorporate into our machine translation
approach syntactic information that generalizes beyond the exact
surface forms.
|
REMEDI: Robust and Efficient Machine Translation in a Distributed Infrastructure
|
Duration: 2015-2018
Principal Investigator: Christof Monz
Summary:
The vast majority of machine translation systems developed by research teams focus on models that improve translation quality, and research has made great advances in this regard over the last decade. On the other hand, it is also essential that these systems are efficient and scale up to sizable, real-world data sets to be of practical use.
To address the issues of speed, scalability, and robustness, we propose a complete overhaul and re-development from scratch of our existing SMT infrastructure, which so far has also focused almost entirely on translation quality improvements.
|
CoSyne: Multi-Lingual Content Synchronization with Wikis
|
Duration: 2010-2013
Principal Investigator/Coordinator: Christof Monz
Funder: European Commission, FP7 STREP
Partners: Fondazione Bruno Kessler (Italy), Dublin City
University (Ireland), Heidelberg Institute for Theoretical Studies
(Germany), Netherlands Institute for Sound and Vision (The
Netherlands), Deutsche Welle (Germany), Dutch Chapter of the WikiMedia
Foundation (The Netherlands)
Website: http://www.cosyne.eu/
Summary: The combination of dynamic user-generated content and multi-lingual
aspects is particularly prominent in Wiki sites. Wikis have gained
increased popularity over the last few years as a means of
collaborative content creation as they allow users to set up and edit
web pages directly. A growing number of organizations use Wikis as an
efficient means to provide and maintain information across several
sites. Currently, multi-lingual Wikis rely on users to manually
translate different Wiki pages on the same subject. This is not only a
time-consuming procedure but also the source of many inconsistencies,
as users update the different language versions separately, and every
update would require translators to compare the different language
versions and synchronize the updates. The overall aim of the CoSyne
project is to automate the dynamic multi-lingual synchronization
process of Wikis.
CoSyne addresses the following challenges:
- achieve robust translation of noisier user-generated content
between 6 languages (4 core languages and 2 languages with limited
resources to demonstrate the adaptability of the system),
- improve machine translation quality by segment-specific
adaptive modeling,
- identify textual content overlap between segments of Wiki
pages across languages to avoid redundant machine translation,
- identify the optimal insertion points for translated content
to preserve coherence,
- analyze user edits to distinguish between factual content
changes and corrections of machine translation output, and exploit
the latter to improve machine translation performance in a
self-learning manner.
|
GALATEAS
|
Duration: 2010-2013
Work Package Leader: Christof Monz
Funder: European Commission, PSP
Partners: Xerox Research (Coordinator, France), CELI SRL
(Italy), University of Trento (Italy), Object Direct SAS (France),
Gonetwork SRL (Italy), Bridgeman Art Library Ltd (UK), Humboldt
University Berlin (Germany)
Website: http://www.galateas.eu/
Summary:
With the growth of digital libraries and digital library federations
(as well as partially unstructured collections of documents such as
web sites), a large number of vendors offer engines for retrieving
content and metadata via search requests issued by end users
(queries). In most cases these queries are just unstructured fragments
of text in a specific language.
Firstly, GALATEAS (LangLog) focuses on getting meaning out of these
lists of queries and is addressed to library/federation/site
managers. Contrary to mainstream services in this field, GALATEAS
services will not consider the standard structured information of web
logs (e.g., click rate, visited pages, a user's path inside the
document tree) but rather the information contained in the queries
themselves, from the point of view of language interpretation.
The second challenge addressed by GALATEAS is that of Cross-Language
Information Retrieval (CLIR), i.e., the capability of typing a query
in one specific language and retrieving documents that are available
in different languages.
|
PASCAL-2: Pattern Analysis, Statistical Modeling and Computational Learning
|
Duration: 2010-2013
Funder: European Commission, European Network of Excellence
Partners: See website
Website: http://pascallin2.ecs.soton.ac.uk/
Summary:
The PASCAL Network of Excellence has created a distributed institute
pioneering principled methods of pattern analysis, statistical
modeling, and computational learning as core enabling technologies for
multimodal interfaces that are capable of natural and seamless
interaction with and among individual human users. The resulting
expertise has been applied to problems relevant to both multi-modal
interfaces and cognitive systems. PASCAL2 will enable a refocusing of
the Institute towards the emerging challenges created by the ever
expanding applications of adaptive systems technology and their
central role in the development of large scale cognitive
systems. Furthermore, the funding will enable the Institute to engage
in technology transfer through an Industrial Club to effect rapid
deployment of the developed technologies into a wide variety of
applications, while undertaking a brokerage of expertise and public
outreach programme to communicate the value and relevance of the
achieved results.
|
CCCT: Center for Creation, Content and Technology
|
Duration: 2010-2012
Work Package Leader: Christof Monz
Funder: Platform Betatechniek
Summary: The Center for Creation, Content and Technology is the
response of the University of Amsterdam and the Hogeschool van
Amsterdam to the scientific, innovative, and educational challenges
that digital content presents us with. It brings together the
University of Amsterdam's acknowledged strengths in Computer Science,
Media Studies, and Communication Theory and the Hogeschool van
Amsterdam's Medialab in a unique multi-disciplinary setting.
|
MataHari: Machine Translation with Harvested Internet Resources
|
Duration: 2010-2012
Principal Investigator: Christof Monz
Summary: The main objective of the proposed research is to take
the first step towards building a machine translation framework that
achieves truly global translation capabilities by covering a large
number of languages. To this end, this project will investigate a
number of languages that have so far not been covered by existing
research, or only to a small extent.
The methods investigated in this project fall under the paradigm of
statistical machine translation, which uses a parallel corpus, i.e.,
documents that have been translated by a professional translator, and
then automatically learns the translation rules from this set of
documents.
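As a very rough illustration of learning translation correspondences from
sentence-aligned data, the following Python sketch estimates word
translation probabilities from raw co-occurrence counts; it is a toy
stand-in for the word-alignment models actually used in SMT, and the
corpus shown is invented:

    from collections import defaultdict

    def cooccurrence_lexicon(parallel_corpus):
        """Estimate p(target_word | source_word) from sentence-aligned
        data using raw co-occurrence counts (a crude stand-in for proper
        word-alignment models such as the IBM models)."""
        counts = defaultdict(lambda: defaultdict(int))
        totals = defaultdict(int)
        for source_sentence, target_sentence in parallel_corpus:
            for s in source_sentence.split():
                for t in target_sentence.split():
                    counts[s][t] += 1
                    totals[s] += 1
        return {s: {t: c / totals[s] for t, c in translations.items()}
                for s, translations in counts.items()}

    # Toy sentence-aligned corpus (invented for illustration).
    corpus = [("das haus", "the house"),
              ("das buch", "the book")]
    lexicon = cooccurrence_lexicon(corpus)
    print(lexicon["das"])   # 'the' receives the highest relative frequency
    print(lexicon["haus"])  # {'the': 0.5, 'house': 0.5}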
As the proposed project focuses on languages that have so far been
covered only to a limited extent, it has to address novel challenges
and go beyond existing academic and commercial research in a number
of ways. There are hardly any readily available bilingual training
data for the languages considered here, unlike for Arabic or Chinese,
where sizable parallel corpora are distributed by the Linguistic Data
Consortium (LDC). This means that we have to acquire the necessary
training data ourselves.
To this end we will utilize internet resources to learn translation
models. By exploiting online resources for machine translation this
project will address a number of vital research issues:
- How can multi-lingual resources be automatically identified and
harvested?
- How can translation rules be learned from smaller and only
partially translated resources?
- How do existing search strategies for finding the most likely
translation have to be adapted to cope with limited resources?
- How can one rapidly build evaluation benchmarks for languages with
limited resources?