I'm currently involved in the following research projects:
CoSyne: Multi-Lingual Content Synchronization with Wikis
Duration: 2010-2013
Role: Project Coordinator/Leader
Funder: European Commission, FP7 STREP
Partners: Fondazione Bruno Kessler (Italy), Dublin City
University (Ireland), Heidelberg Institute for Theoretical Studies
(Germany), Netherlands Institute for Sound and Vision (The
Netherlands), Deutsche Welle (Germany), Dutch Chapter of the Wikimedia
Foundation (The Netherlands)
Website: http://www.cosyne.eu/
Summary: The combination of dynamic user-generated content and multi-lingual
aspects is particularly prominent in Wiki sites. Wikis have gained
increased popularity over the last few years as a means of
collaborative content creation as they allow users to set up and edit
web pages directly. A growing number of organizations use Wikis as an
efficient means to provide and maintain information across several
sites. Currently, multi-lingual Wikis rely on users to manually
translate different Wiki pages on the same subject. This is not only a
time-consuming procedure but also the source of many inconsistencies,
as users update the different language versions separately, and every
update would require translators to compare the different language
versions and synchronize the updates. The overall aim of the CoSyne
project is to automate the dynamic multi-lingual synchronization
process of Wikis.
CoSyne addresses the following challenges:
- achieve robust translation of noisy user-generated content
between 6 languages (4 core languages and 2 languages with
limited resources, the latter to demonstrate the adaptability
of the system),
- improve machine translation quality by segment-specific
adaptive modeling,
- identify textual content overlap between segments of Wiki
pages across languages to avoid redundant machine translation,
- identify the optimal insertion points for translated content
to preserve coherence,
- analyze user edits to distinguish between factual content
changes and corrections of machine translation output, and exploit
the latter to improve machine translation performance in a
self-learning manner.
MataHari: Machine Translation with Harvested Internet Resources
Duration: 2010-2012
Role: Project Leader
Summary: The main objective of the proposed research is to take
the first step in building a machine translation framework that
achieves truly global translation capabilities by covering a large
number of languages. To this end, this project will investigate a
number of languages that have been covered only to a small extent,
or not at all, by existing research.
The methods investigated in this project fall under the paradigm of
statistical machine translation, which uses a parallel corpus, i.e.,
documents that have been translated by a professional translator, and
then automatically learns the translation rules from this set of
documents.
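As a rough illustration of how translation rules can be learned from a parallel corpus, the sketch below estimates word translation probabilities with IBM Model 1, a classic textbook model trained by expectation-maximization. It is an assumption made for illustration, not necessarily the model used in this project; the three-sentence toy corpus and the function name train_ibm_model1 are hypothetical, and the NULL word of the full model is omitted for brevity.

```python
from collections import defaultdict


def train_ibm_model1(parallel_corpus, iterations=10):
    """Estimate t(f | e) with EM (IBM Model 1 without the NULL word).

    parallel_corpus is a list of (source_tokens, target_tokens) pairs.
    """
    source_vocab = {f for src, _ in parallel_corpus for f in src}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(source_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (f, e) pair
        total = defaultdict(float)  # expected count of each target word e

        # E-step: distribute each source word over the target words
        # of the same sentence pair, proportionally to the current t.
        for src, tgt in parallel_corpus:
            for f in src:
                norm = sum(t[(f, e)] for e in tgt)
                for e in tgt:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac

        # M-step: re-normalize the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t


# Hypothetical toy corpus: German source, English target.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm_model1(corpus)
```

Although no word alignments are given, the co-occurrence statistics alone push the probability mass of "the" towards "das" over the iterations. Real systems train on far larger corpora and use richer models than this sketch.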
Because the proposed project focuses on languages that have so far
received little coverage, it has to address novel challenges
and goes beyond existing academic and commercial research in a number
of ways. There is hardly any readily available bilingual training
data for the languages considered here, unlike for Arabic or Chinese,
where sizable parallel corpora are distributed by the Linguistic Data
Consortium (LDC). This means that we have to acquire the necessary
training data ourselves.
To this end we will utilize internet resources to learn translation
models. By exploiting online resources for machine translation this
project will address a number of vital research issues:
- How can multi-lingual resources be automatically identified and
harvested?
- How can translation rules be learned from smaller and only
partially translated resources?
- How do existing search strategies for finding the most likely
translation have to be adapted to cope with limited resources?
- How can one rapidly build evaluation benchmarks for languages with
limited resources?
GALATEAS:
Duration: 2010-2013
Role: Work Package Leader
Funder: European Commission, PSP
Partners: Xerox Research (Coordinator, France), CELI SRL
(Italy), University of Trento (Italy), Object Direct SAS (France),
Gonetwork SRL (Italy), Bridgeman Art Library Ltd (UK), Humboldt
University Berlin (Germany)
Website: http://www.galateas.eu/
Summary:
With the growth of digital libraries and digital library federations
(as well as partially unstructured collections of documents such as
web sites), a large number of vendors offer engines for retrieving
content and metadata via search requests by the end user
(queries). In most cases these queries are just unstructured fragments
of text in a specific language.
Firstly, GALATEAS (LangLog) focuses on getting meaning out of these
lists of queries and is addressed to library, federation, and site
managers. Contrary to mainstream services in this field, GALATEAS
services will not consider the standard structured information of web
logs (e.g. click rate, visited pages, users' paths inside the document
tree) but the information contained in queries from the point of view
of language interpretation.
The second challenge addressed by GALATEAS is Cross-Language
Information Retrieval (CLIR), i.e. the capability of typing a
query in one specific language and retrieving documents that are
available in different languages.
PASCAL-2: Pattern Analysis, Statistical Modeling and Computational Learning
Duration: 2010-2013
Role: Site Manager
Funder: European Commission, European Network of Excellence
Partners: See website
Website: http://pascallin2.ecs.soton.ac.uk/
Summary:
The PASCAL Network of Excellence has created a distributed institute
pioneering principled methods of pattern analysis, statistical
modeling, and computational learning as core enabling technologies for
multimodal interfaces that are capable of natural and seamless
interaction with and among individual human users. The resulting
expertise has been applied to problems relevant to both multimodal
interfaces and cognitive systems. PASCAL2 will enable a refocusing of
the Institute towards the emerging challenges created by the
ever-expanding applications of adaptive systems technology and their
central role in the development of large-scale cognitive
systems. Furthermore, the funding will enable the Institute to engage
in technology transfer through an Industrial Club to effect rapid
deployment of the developed technologies into a wide variety of
applications, while undertaking a brokerage of expertise and public
outreach programme to communicate the value and relevance of the
achieved results.
CCCT: Center for Creation, Content and Technology
Duration: 2010-2012
Role: Sub-Project Leader
Funder: Platform Betatechniek
Summary: The Center for Creation, Content and Technology is the
University of Amsterdam and the Hogeschool of Amsterdam's response to
the scientific, innovative and educational challenges that digital
content presents us with. It brings together the University of
Amsterdam's acknowledged strengths in Computer Science, Media Studies,
and Communication Theory, and the Hogeschool of Amsterdam's Medialab
in a unique multi-disciplinary setting.
Completed Projects:
Information Retrieval for Data Selection in Machine Translation
Duration: 2006-2008
Funder: Nuffield Organization
Role: Principal Investigator
Summary:
Well-performing statistical MT approaches require very large amounts
of training data to achieve their quality. The challenge is to select
those subsets of the training data that are most likely to be
relevant for a given document or sentence that needs to be translated.
This project investigates how information retrieval techniques can be
used to build more contextually sensitive methods for identifying
training data for building language models used for machine
translation.
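One simple way to picture retrieval-based data selection is to treat the text to be translated as a query and rank the training sentences by TF-IDF cosine similarity, keeping only the top matches for language model training. The sketch below is a minimal illustration under that assumption, not the method developed in the project; the function names, the smoothed IDF formula, the whitespace tokenization, and the example sentences are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF, so that terms occurring in every document keep a small weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    vectors = [{term: tf * idf[term] for term, tf in Counter(doc).items()}
               for doc in tokenized_docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_training_data(training_sentences, query, k=2):
    """Return the k training sentences most similar to the query text."""
    tokenized = [s.lower().split() for s in training_sentences]
    vectors, idf = tfidf_vectors(tokenized)
    # Query terms unseen in the training data receive zero weight.
    query_vec = {term: tf * idf.get(term, 0.0)
                 for term, tf in Counter(query.lower().split()).items()}
    ranked = sorted(range(len(training_sentences)),
                    key=lambda i: cosine(query_vec, vectors[i]),
                    reverse=True)
    return [training_sentences[i] for i in ranked[:k]]


# Hypothetical training pool mixing two domains.
pool = [
    "the patient received a dose of the drug",
    "the parliament adopted the budget resolution",
    "the drug dose was reduced for the patient",
    "the committee voted on the resolution",
]
selected = select_training_data(
    pool, "what dose of the drug should the patient take", k=2)
```

For a medical input, the two medical sentences are ranked above the parliamentary ones, so a language model built from the selected subset would be biased towards the vocabulary of the text being translated.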