An ensemble model for sentiment classification on code-mixed data in Dravidian Languages

EasyChair Preprint 7266

9 pages • Date: December 27, 2021

S R Mithun Kumar, Nihal Reddy, Aruna Malapati and Lov Kumar

Abstract

The Dravidian languages Tamil, Kannada, Malayalam and Telugu are spoken by over 220 million people but are vastly under-resourced for natural language processing tasks. Code-switching and code-mixing have been on the rise, with multilingual speakers choosing to express their opinions in their mother tongue along with English, both in written text and in speech. Sentiment analysis of code-switched Dravidian languages is challenging because corpora are scarce and the languages are interspersed unpredictably. This paper applies an ensemble sentiment classification strategy based on majority voting over 13 different classification models to the Dravidian code-mixed languages dataset provided in FIRE 2021. The code-mixed dataset contains YouTube comments with an average word count of less than 7 per comment. The key conclusion from our experiments is that the ensemble of multiple classifiers outperformed the other models we evaluated for sentiment classification. Our results show that weighted F1-scores of 0.59, 0.65 and 0.60 on Kannada, Malayalam and Tamil code-switched data, respectively, can be achieved with traditional machine learning algorithms through an ensemble of multiple classifiers.
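To illustrate the two ideas the abstract relies on, majority voting and the weighted F1-score, here is a minimal sketch in plain Python. It is not the authors' implementation: the three toy classifiers, their predictions, the label names ("pos", "neg", "neu") and the tie-breaking rule are all assumptions made for illustration, whereas the paper combines 13 models. The weighted F1 here averages per-class F1 weighted by the number of true instances of each class.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list by majority vote.

    `predictions` is a list of equal-length label lists, one per classifier.
    Ties go to the label seen first among the votes (an assumed rule).
    """
    combined = []
    for votes in zip(*predictions):
        # most_common keeps first-encountered order among equal counts
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical toy data: gold labels and three classifiers' predictions.
y_true = ["pos", "neg", "pos", "neu"]
clf_a = ["pos", "neg", "neg", "neu"]
clf_b = ["pos", "pos", "pos", "neu"]
clf_c = ["neg", "neg", "pos", "pos"]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print(ensemble, weighted_f1(y_true, ensemble))
print(weighted_f1(y_true, clf_a))
```

In this toy case each classifier makes a different mistake, so the vote corrects all of them. Real classifiers often make correlated errors, which is why the gain from an ensemble is usually smaller.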

Keyphrases: Dravidian, Kanglish, Manglish, Tanglish, code-mixing, code-switching, sentiment classification

Links:

https://easychair.org/publications/preprint/sKB5

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:7266,
  author    = {S R Mithun Kumar and Nihal Reddy and Aruna Malapati and Lov Kumar},
  title     = {An ensemble model for sentiment classification on code-mixed data in Dravidian Languages},
  howpublished = {EasyChair Preprint 7266},
  year      = {EasyChair, 2021}}
