Comparing Ranking-based and Naive Bayes Approaches to Language Detection on Tweets

Title: Comparing Ranking-based and Naive Bayes Approaches to Language Detection on Tweets
Authors: Pablo Gamallo, Marcos Garcia, Susana Sotelo, José Ramón Pichel
Type: Conference paper
Source: Twitter Language Identification Workshop at SEPLN 2014, Girona (Spain), pp. 12-16, 2014.
ISSN: 1613-0073
Abstract: This article describes two systems that participated in the TweetLID-2014 competition on language detection in tweets. The systems are based on two different strategies: ranked dictionaries and Naive Bayes classifiers. The results show that ranked dictionaries perform better with small training corpora whose language distribution is similar to that of the test dataset, while a Naive Bayes algorithm improves the scores with large models, even if the training data are unbalanced with respect to the test dataset. The experiments also showed that models based on word unigrams outperform those based on character n-grams. In the final evaluation, the Naive Bayes classifier ranked first among the unconstrained systems (those trained with external sources) participating in the competition.
Keywords: Language Identification, Short Text, Naive Bayes Classifier, Dictionary-Based Models
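
To make the word-unigram Naive Bayes idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' system: the function names `train` and `classify`, the whitespace tokenisation, the add-one smoothing, and the three toy training sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Build per-language word-unigram counts from (language, text) pairs."""
    word_counts = defaultdict(Counter)  # language -> word frequencies
    doc_counts = Counter()              # language -> number of training texts
    for lang, text in samples:
        doc_counts[lang] += 1
        word_counts[lang].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the language with the highest Naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_lang, best_score = None, -math.inf
    for lang in doc_counts:
        counts = word_counts[lang]
        # Add-one smoothing (an assumption); the extra +1 reserves
        # probability mass for words never seen in training.
        denom = sum(counts.values()) + len(vocab) + 1
        score = math.log(doc_counts[lang] / total_docs)  # log prior
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Hypothetical toy data, far smaller than any realistic training corpus.
samples = [
    ("es", "el gato come pescado en la casa"),
    ("en", "the cat eats fish in the house"),
    ("pt", "o gato come peixe na casa"),
]
model = train(samples)
print(classify("the house", model))
print(classify("la casa", model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied. A ranked-dictionary system would instead compare the words of a tweet against per-language lists of frequent words ordered by rank.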