Low-Resource Offensive Language Detection


In our work, we study robust ensembling techniques for offensive language identification in code-mixed Dravidian languages using multilingual BERT models. Owing to the stochastic nature of neural network training, ensembling over multiple models reduces prediction variance, improving the low-resource performance of these models. In our current approach, we look at three ensembling techniques (illustrative sketches follow the list), namely:

  1. Genetic-algorithm-optimised weighted averaging,
  2. Self-ensembling with different random seeds, and
  3. CNN-BERT embedding fusion, in which we train a classifier on concatenated embeddings from different BERT models together with CNN features learned over performant pretrained word vectors.
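
As a sketch of the first technique, the snippet below runs a tiny genetic algorithm over ensemble weights, scoring each candidate weight vector by the weighted F1 of the averaged predictions on a validation set. The names `val_probs` and `val_labels`, the synthetic data, and all hyperparameters are illustrative stand-ins, not the paper's actual setup:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: per-model class probabilities on the validation
# set. In practice these would come from the trained BERT models.
n_models, n_samples, n_classes = 3, 200, 4
val_probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n_samples))
val_labels = rng.integers(0, n_classes, size=n_samples)

def fitness(w):
    """Weighted F1 of the ensemble that averages model probabilities with weights w."""
    w = np.abs(w) / (np.abs(w).sum() + 1e-12)      # project onto the simplex
    mixed = np.tensordot(w, val_probs, axes=1)     # -> (n_samples, n_classes)
    return f1_score(val_labels, mixed.argmax(-1), average="weighted")

# A tiny genetic algorithm over the ensemble weights.
pop_size, n_gens, mut_sigma = 30, 50, 0.1
pop = rng.random((pop_size, n_models))
for gen in range(n_gens):
    scores = np.array([fitness(w) for w in pop])
    elite = pop[np.argsort(scores)[-pop_size // 2:]]        # keep the best half
    # Crossover: average random pairs of elite parents.
    parents = rng.integers(0, len(elite), size=(pop_size - len(elite), 2))
    children = elite[parents].mean(axis=1)
    # Mutation: small Gaussian noise on the children.
    children += rng.normal(0, mut_sigma, children.shape)
    pop = np.vstack([elite, children])

best = pop[np.argmax([fitness(w) for w in pop])]
print("best ensemble weights:", np.abs(best) / np.abs(best).sum())
```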
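
Self-ensembling re-trains the same architecture under different random seeds and averages the resulting softmax outputs. A minimal sketch, assuming a hypothetical `build_and_train` function that returns a trained Hugging Face-style classification model and a pre-tokenised `eval_inputs` batch:

```python
import numpy as np
import torch

def self_ensemble_probs(build_and_train, eval_inputs, seeds=(13, 42, 2021)):
    """Average softmax outputs of the same model trained with different seeds."""
    all_probs = []
    for seed in seeds:
        torch.manual_seed(seed)            # re-seed weight init and shuffling
        np.random.seed(seed)
        model = build_and_train(seed)      # hypothetical: seed -> trained model
        model.eval()
        with torch.no_grad():
            logits = model(**eval_inputs).logits
        all_probs.append(torch.softmax(logits, dim=-1))
    # Element-wise mean of the per-seed probability distributions.
    return torch.stack(all_probs).mean(dim=0)
```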
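
For the fusion technique, one plausible realisation is a small feed-forward head over the concatenated vectors. The embedding dimensions and layer sizes below are assumptions for illustration, not the configuration used in the paper:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Classifier over concatenated BERT [CLS] embeddings and CNN features
    extracted from pretrained word vectors (dims are assumed, not the paper's)."""
    def __init__(self, bert_dims=(768, 768), cnn_dim=300, n_classes=6):
        super().__init__()
        fused = sum(bert_dims) + cnn_dim
        self.head = nn.Sequential(
            nn.Linear(fused, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, n_classes),
        )

    def forward(self, bert_embs, cnn_feats):
        # bert_embs: list of (batch, dim) [CLS] vectors, one per BERT model
        # cnn_feats: (batch, cnn_dim) pooled CNN features over word vectors
        x = torch.cat([*bert_embs, cnn_feats], dim=-1)
        return self.head(x)
```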

We experiment with models pretrained on code-mixed datasets, and we also look at ways to combat class imbalance in the dataset through collated two-step training and weighted gradients; a class-weighted loss is sketched below. We show that a combination of multiple inter-model ensembling techniques can reduce the variance of predictions and improve performance across all classes in low-resource settings.
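
Weighted gradients can be realised as inverse-frequency class weights in the loss, so that minority classes contribute proportionally larger gradient updates. A minimal PyTorch sketch with made-up class counts (the real counts come from the Dravidian code-mixed dataset):

```python
import torch
import torch.nn as nn

# Hypothetical class counts from a skewed training split.
class_counts = torch.tensor([3000., 400., 250., 120., 80., 50.])

# Inverse-frequency weights: rarer classes get larger loss weight.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, len(class_counts))            # dummy model outputs
targets = torch.randint(0, len(class_counts), (8,))   # dummy labels
loss = criterion(logits, targets)                     # class-weighted loss
```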

Debjoy Saha
B.Tech Student

B.Tech student interested in Multimodal Machine Learning and Speech, Language and Image Processing
