Projects
|
SurReal: Surface Realization in Statistical Machine Translation
|
Duration: 2013-2017
Principal Investigator: Christof Monz
Funder: Advanced research fellowship (Vidi scheme) by Dutch Science Foundation (NWO)
Summary:
A major problem of automatically translating from a foreign language,
e.g., Chinese, to a language a user understands, e.g., English, is the
non-well-formedness, or disfluency, of the machine-translated
output. The predominant cause of current state-of-the-art statistical
machine translation (SMT) systems failing to produce fluent
translations lies in their limited expressive power. Current
approaches only use phrases (be they contiguous or discontiguous) that
were observed in the parallel corpus used to build the phrase table.
The proposed research aims to substantially increase the expressive
power of SMT by decoupling concept identification, which captures the
content of the source sentence, from surface realization, and thereby
to improve translation fluency in particular and translation quality
in general.
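As a rough illustration of the decoupling idea, the following Python sketch
separates a toy concept-identification stage from a toy surface-realization
stage; the lexicon, concept names, and features are invented for
illustration and are not part of the SurReal models:

    # Stage 1: concept identification -- map source tokens to
    # language-independent concepts plus morphological features.
    # (Toy lexicon; a real system would induce this from data.)
    CONCEPTS = {"les":     ("DEF",   {}),
                "maisons": ("HOUSE", {"number": "plural"})}

    def identify_concepts(source_tokens):
        return [CONCEPTS[tok] for tok in source_tokens if tok in CONCEPTS]

    # Stage 2: surface realization -- choose target word forms for the
    # concepts, independently of which phrases were seen in training.
    REALIZATIONS = {("DEF", None):         "the",
                    ("HOUSE", "singular"): "house",
                    ("HOUSE", "plural"):   "houses"}

    def realize(concepts):
        forms = []
        for concept, features in concepts:
            number = features.get("number")
            forms.append(REALIZATIONS.get((concept, number), concept.lower()))
        return " ".join(forms)

    print(realize(identify_concepts("les maisons".split())))  # -> the houses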
|
CoBaLT: Constraint-Based Language Translation for Approximate Redundancy Resolution
|
Duration: 2014-2018
Principal Investigator: Christof Monz
Funder: Free Competition Grant by Dutch Science Foundation (NWO)
Summary:
The proposed project aims to improve the translation quality of
Statistical Machine Translation (SMT) systems (e.g., for translating
from Chinese to English). Redundancies (also known as
over-generation) are an important problem for almost all current
state-of-the-art statistical machine translation systems. On the one
hand, they pollute the search space with hypotheses that can lead to
identical translations. On the other hand, these redundancies can be
used to re-estimate the probabilities of translation candidates and
help improve translation quality by reranking the n-best output of a
decoder. Current approaches exploit redundancies after decoding and
therefore cannot influence the generation of the search space.
In this project, we aim to exploit redundancies to improve search
space generation and exploration.
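As a minimal sketch of the reranking use of redundancies, the following
Python fragment merges decoder derivations that yield the same surface
string, sums their probability mass, and reranks the candidates; the input
format and scores are hypothetical:

    import math
    from collections import defaultdict

    def rerank_nbest(nbest):
        """Merge derivations with identical surface strings and rerank.
        nbest: list of (translation, log_probability) pairs, one per
        derivation in the decoder's n-best output (hypothetical format)."""
        merged = defaultdict(list)
        for translation, logprob in nbest:
            merged[translation].append(logprob)
        rescored = []
        for translation, logprobs in merged.items():
            # Re-estimate the candidate's score by summing the probability
            # mass of all derivations that produce it (log-sum-exp).
            m = max(logprobs)
            total = m + math.log(sum(math.exp(lp - m) for lp in logprobs))
            rescored.append((translation, total))
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)

    # Two redundant derivations of the same string together outrank a
    # single, individually better-scored derivation.
    nbest = [("the house is small", -2.1),
             ("the house is small", -2.3),
             ("the home is small",  -2.0)]
    print(rerank_nbest(nbest))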
|
DIGIT: Domain and Genre-Independent Translation
|
Duration: 2013-2017
Principal Investigator: Christof Monz
Summary:
Recent developments have shown that the ability to access and analyze
foreign-language information is of critical importance to global
companies and national governments. At the same time, professional
translators cannot keep up with the vastly escalating mass of
available information, and automated means of translation are
desperately needed. Automating translation is the goal of
the research area of Machine Translation (MT).
The performance of state-of-the-art MT systems varies vastly depending
on the genre and domain of the document that is to be translated.
This gap in translation quality is clearly undesirable
and shows that the state of the art in Machine Translation is far
from achieving robust translation quality.
Genre or domain differences between the training and test data often
cause a drop in translation quality. This is due to the sensitivity of
statistical machine translation systems to lexical information, i.e.,
the exact surface realizations. To overcome this, the DIGIT project
investigates ways to incorporate into our machine translation
approach syntactic information that generalizes beyond the exact
surface forms.
|
REMEDI: Robust and Efficient Machine Translation in a Distributed Infrastructure
|
Duration: 2015-2018
Principal Investigator: Christof Monz
Summary:
The vast majority of machine translation systems developed by research teams focus on models that improve translation quality, and research has made great advances in this regard over the last decade. On the other hand, it is also essential that these systems are efficient and scale up to sizable, real-world data sets to be of practical use.
To address the issues of speed, scalability, and robustness, we propose a complete overhaul and re-development from scratch of our existing SMT infrastructure, which so far has also focused almost entirely on translation quality improvements.
|
CoSyne: Multi-Lingual Content Synchronization with Wikis
|
Duration: 2010-2013
Principal Investigator/Coordinator: Christof Monz
Funder: European Commission, FP7 STREP
Partners: Fondazione Bruno Kessler (Italy), Dublin City
University (Ireland), Heidelberg Institute for Theoretical Studies
(Germany), Netherlands Institute for Sound and Vision (The
Netherlands), Deutsche Welle (Germany), Dutch Chapter of the WikiMedia
Foundation (The Netherlands)
Website: http://www.cosyne.eu/
Summary: The combination of dynamic user-generated content and multi-lingual
aspects is particularly prominent in Wiki sites. Wikis have gained
increased popularity over the last few years as a means of
collaborative content creation as they allow users to set up and edit
web pages directly. A growing number of organizations use Wikis as an
efficient means to provide and maintain information across several
sites. Currently, multi-lingual Wikis rely on users to manually
translate different Wiki pages on the same subject. This is not only a
time-consuming procedure but also the source of many inconsistencies,
as users update the different language versions separately, and every
update would require translators to compare the different language
versions and synchronize the updates. The overall aim of the CoSyne
project is to automate the dynamic multi-lingual synchronization
process of Wikis.
CoSyne addresses the following challenges:
- achieve robust translation of noisier user-generated content
between 6 languages (4 core languages and 2 languages with limited
resources to demonstrate the adaptability of the system),
- improve machine translation quality by segment-specific
adaptive modeling,
- identify textual content overlap between segments of Wiki
pages across languages to avoid redundant machine translation,
- identify the optimal insertion points for translated content
to preserve coherence,
- analyze user edits to distinguish between factual content
changes and corrections of machine translation output, and exploit
the latter to improve machine translation performance in a
self-learning manner.
|
GALATEAS
|
Duration: 2010-2013
Work Package Leader: Christof Monz
Funder: European Commission, PSP
Partners: Xerox Research (Coordinator, France), CELI SRL
(Italy), University of Trento (Italy), Object Direct SAS (France),
Gonetwork SRL (Italy), Bridgeman Art Library Ltd (UK), Humboldt
University Berlin (Germany)
Website: http://www.galateas.eu/
Summary:
With the growth of digital libraries and digital library federations
(as well as partially unstructured collections of documents such as
web sites), a large number of vendors offer engines for retrieving
content and metadata via search requests issued by end users
(queries). In most cases these queries are just unstructured fragments
of text in a specific language.
Firstly, GALATEAS (LangLog) focuses on getting meaning out of these
lists of queries and is addressed to library/federation/site
managers. Contrary to mainstream services in this field, GALATEAS
services will not consider the standard structured information of web
logs (e.g., click rate, visited pages, a user's path inside the
document tree) but rather the information contained in the queries
themselves, from the point of view of language interpretation.
The second challenge addressed by GALATEAS is that of Cross-Language
Information Retrieval (CLIR), i.e., the capability of typing a query
in one specific language and retrieving documents that are available
in different languages.
|
PASCAL-2: Pattern Analysis, Statistical Modeling and Computational Learning
|
Duration: 2010-2013
Funder: European Commission, European Network of Excellence
Partners: See website
Website: http://pascallin2.ecs.soton.ac.uk/
Summary:
The PASCAL Network of Excellence has created a distributed institute
pioneering principled methods of pattern analysis, statistical
modeling, and computational learning as core enabling technologies for
multimodal interfaces that are capable of natural and seamless
interaction with and among individual human users. The resulting
expertise has been applied to problems relevant to both multi-modal
interfaces and cognitive systems. PASCAL2 will enable a refocusing of
the Institute towards the emerging challenges created by the ever
expanding applications of adaptive systems technology and their
central role in the development of large scale cognitive
systems. Furthermore, the funding will enable the Institute to engage
in technology transfer through an Industrial Club to effect rapid
deployment of the developed technologies into a wide variety of
applications, while undertaking a brokerage of expertise and public
outreach programme to communicate the value and relevance of the
achieved results.
|
CCCT: Center for Creation, Content and Technology
|
Duration: 2010-2012
Work Package Leader: Christof Monz
Funder: Platform Betatechniek
Summary: The Center for Creation, Content and Technology is the
response of the University of Amsterdam and the Hogeschool van
Amsterdam to the scientific, innovative, and educational challenges
that digital content presents us with. It brings together the
University of Amsterdam's acknowledged strengths in Computer Science,
Media Studies, and Communication Theory and the Hogeschool van
Amsterdam's Medialab in a unique multi-disciplinary setting.
|
MataHari: Machine Translation with Harvested Internet Resources
|
Duration: 2010-2012
Principal Investigator: Christof Monz
Summary: The main objective of the proposed research is to take
the first step towards building a machine translation framework that
achieves truly global translation capabilities by covering a large
number of languages. To this end, this project will investigate a
number of languages that have so far not been covered by existing
research, or only to a small extent.
The methods investigated in this project fall under the paradigm of
statistical machine translation, which uses a parallel corpus, i.e.,
documents that have been translated by a professional translator, and
then automatically learns the translation rules from this set of
documents.
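As a very rough illustration of learning translation correspondences from
sentence-aligned data, the following Python sketch estimates word
translation probabilities from raw co-occurrence counts; it is a toy
stand-in for the word-alignment models actually used in SMT, and the
corpus shown is invented:

    from collections import defaultdict

    def cooccurrence_lexicon(parallel_corpus):
        """Estimate p(target_word | source_word) from sentence-aligned
        data using raw co-occurrence counts (a crude stand-in for proper
        word-alignment models such as the IBM models)."""
        counts = defaultdict(lambda: defaultdict(int))
        totals = defaultdict(int)
        for source_sentence, target_sentence in parallel_corpus:
            for s in source_sentence.split():
                for t in target_sentence.split():
                    counts[s][t] += 1
                    totals[s] += 1
        return {s: {t: c / totals[s] for t, c in translations.items()}
                for s, translations in counts.items()}

    # Toy sentence-aligned corpus (invented for illustration).
    corpus = [("das haus", "the house"),
              ("das buch", "the book")]
    lexicon = cooccurrence_lexicon(corpus)
    print(lexicon["das"])   # 'the' receives the highest relative frequency
    print(lexicon["haus"])  # {'the': 0.5, 'house': 0.5}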
As the proposed project focuses on languages that have so far been
covered only to a limited extent, it has to address novel challenges
and go beyond existing academic and commercial research in a number
of ways. There are hardly any readily available bilingual training
data for the languages considered here, unlike for Arabic or Chinese,
where sizable parallel corpora are distributed by the Linguistic Data
Consortium (LDC). This means that we have to acquire the necessary
training data ourselves.
To this end we will utilize internet resources to learn translation
models. By exploiting online resources for machine translation this
project will address a number of vital research issues:
- How can multi-lingual resources be automatically identified and
harvested?
- How can translation rules be learned from smaller and only
partially translated resources?
- How do existing search strategies for finding the most likely
translation have to be adapted to cope with limited resources?
- How can one rapidly build evaluation benchmarks for languages with
limited resources?