Statistical Analysis of Automatic Seed Word Acquisition to Improve Harmful Expression Extraction in Cyberbullying Detection
Keywords:cyberbullying, information extraction, text mining, seed word, SO-PMI-IR
AbstractWe study the social problem of cyberbullying, defined as a new form of bullying that takes place in the Internet space. This paper proposes a method for automatic acquisition of seed words to improve performance of the original method for the cyberbullying detection by Nitta et al. . We conduct an experiment exactly in the same settings to find out that the method based on a Web mining technique, lost over 30% points of its performance since being proposed in 2013. Thus, we hypothesize on the reasons for the decrease in the performance and propose a number of improvements, from which we experimentally choose the best one. Furthermore, we collect several seed word sets using different approaches, evaluate and their precision. We found out that the influential factor in extraction of harmful expressions is not the number of seed words, but the way the seed words were collected and filtered.
T. Nitta, F. Masui, M. Ptaszynski, Y. Kimura, R. Rzepka, and K. Araki,“Detecting cyberbullying entries on informal school websites based on category relevance maximization,” Proc. of the 6th International Joint Conference on Natural Language Processing (IJCNLP‘13), Oct. 2013, pp. 579-586.
A. Kilgarriff, “Googleology is bad science,” Computational Linguistics, vol. 33, no. 1, pp. 147-151, 2007.
T. Ishizaka and K. Yamamoto, “Automatic detection of sentences representing slandering on the Web,” Proc. of the 17th Annual Meeting of the Association for Natural Language Processing, 2011, pp. 131-134. (In Japanese)
P. D. Turney, “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews,” Proc. of
ACL-02, Jul. 2002, pp. 417-424.
Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,” Psychometrika, vol. 12, no. 2, pp. 153-157, 1947.
How to Cite
Submission of a manuscript implies: that the work described has not been published before that it is not under consideration for publication elsewhere; that if and when the manuscript is accepted for publication. Authors can retain copyright in their articles with no restrictions. Also, author can post the final, peer-reviewed manuscript version (postprint) to any repository or website.
Since Jan. 01, 2019, IJETI will publish new articles with Creative Commons Attribution Non-Commercial License, under Creative Commons Attribution Non-Commercial 4.0 International (CC BY-NC 4.0) License.
The Creative Commons Attribution Non-Commercial (CC-BY-NC) License permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.