Semi-Supervised Priors for Microblog Language Identification

UPDATE: An updated version of our microblog language identification work is now available, and we have also released our trained models. The accuracy of our baseline models can reach 95% for English.

Our paper “Semi-Supervised Priors for Microblog Language Identification” by Simon Carter, me, and Wouter Weerkamp has been accepted as a poster presentation at the DIR 2011 workshop in Amsterdam. The presentation will take place on February 4. The abstract follows:

Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at microblog post level: (i) blogger-based prior, using previous posts by the same blogger, and (ii) link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), and a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.
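
To make the idea of a blogger-based prior a bit more concrete, here is a minimal Python sketch. It is not the implementation from the paper: the function names, the add-one-smoothed character trigram scoring, the linear interpolation controlled by `prior_weight`, and the toy training sentences are all assumptions made for illustration. A link-based prior could presumably be built the same way, by scoring the text of the linked pages instead of the blogger's earlier posts.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Build one n-gram count table per language from {language: [texts]}."""
    return {
        lang: Counter(g for text in texts for g in char_ngrams(text, n))
        for lang, texts in samples.items()
    }


def content_scores(text, profiles, n=3):
    """Score a post per language from its n-grams alone; the scores sum to 1."""
    vocab_size = len(set().union(*profiles.values())) + 1  # +1 for unseen n-grams
    grams = char_ngrams(text, n)
    log_likelihood = {}
    for lang, profile in profiles.items():
        total = sum(profile.values())
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        log_likelihood[lang] = sum(
            math.log((profile[g] + 1) / (total + vocab_size)) for g in grams
        )
    # Normalize with a numerically stable softmax.
    top = max(log_likelihood.values())
    weights = {lang: math.exp(v - top) for lang, v in log_likelihood.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


def blogger_prior(previous_posts, profiles, n=3):
    """Average the content scores of a blogger's earlier posts (uniform if none)."""
    if not previous_posts:
        return {lang: 1 / len(profiles) for lang in profiles}
    per_post = [content_scores(post, profiles, n) for post in previous_posts]
    return {lang: sum(s[lang] for s in per_post) / len(per_post) for lang in profiles}


def identify(text, profiles, prior, prior_weight=0.5, n=3):
    """Pick the language that maximizes a linear mix of content score and prior."""
    scores = content_scores(text, profiles, n)
    combined = {
        lang: (1 - prior_weight) * scores[lang] + prior_weight * prior[lang]
        for lang in profiles
    }
    return max(combined, key=combined.get)


# Toy training data, invented for illustration only.
samples = {
    "en": ["the cat is on the table and the dog is in the house"],
    "nl": ["de kat zit op de tafel en de hond is in het huis"],
}
profiles = train_profiles(samples)

# The prior is estimated from unlabeled earlier posts by the same (hypothetical) blogger.
prior = blogger_prior(["de hond zit op de tafel", "het huis"], profiles)
print(identify("is in", profiles, prior))
```

The point of the sketch is that a very short post carries little evidence on its own, so the blogger's earlier, unlabeled posts can tip the decision. How much weight the prior should get is a tuning question that this toy example does not answer.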

Download the PDF.

BibTeX:
@misc{dir2011-carter,
Title = {Semi-Supervised Priors for Microblog Language Identification},
Author = {Simon Carter and Manos Tsagkias and Wouter Weerkamp},
Year = {2011},
Month = {February},
}
