Detecting Toxic Content Online and the Effect of Training Data on Classification Performance

EasyChair Preprint 872

12 pages • Date: April 1, 2019

Zhixue Zhao, Ziqi Zhang and Frank Hopfgartner

Abstract

The spread of toxic content online has attracted a wealth of research into methods of automatic detection and classification in recent years. However, two limitations still exist: 1) the lack of support for multi-label classification; and 2) the lack of understanding of the impact of the typically unbalanced datasets on such tasks. In this work, we build three state-of-the-art methods for the task of multi-label classification of toxic content online, and compare the effect of training data size on their performance. The three methods of choice are based on Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM), respectively. We conduct learning curve analysis and show that CNN is the most robust method, as it outperforms the other two regardless of the size of the dataset, even on very small amounts of data. This challenges the conventional belief that neural networks require significant amounts of data to train accurate models. We also empirically derive indicative thresholds of training data size to help determine a reliable estimate of classifier performance, or to maximise potential classifier performance in such tasks.

Keyphrases: Convolutional Neural Network, Deep Neural Network, NLP, Natural Language Processing, classifier performance, deep learning, detecting hate speech, hate speech, learning curve, machine learning, multi-label classification, neural network, offensive language, text classification, text mining, toxic comment, toxic content, toxic content classification, training data

Links:	https://easychair.org/publications/preprint/XGmR
	https://doi.org/10.29007/z5xk

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:872,
  author    = {Zhixue Zhao and Ziqi Zhang and Frank Hopfgartner},
  title     = {Detecting Toxic Content Online and the Effect of Training Data on Classification Performance},
  doi       = {10.29007/z5xk},
  howpublished = {EasyChair Preprint 872},
  year      = {EasyChair, 2019}}
