SWSR: A Chinese dataset and lexicon for online sexism detection

Aiqi Jiang

doi:10.1016/J.OSNEM.2021.100182

Online Social Networks and Media

44 pages

Abstract

Online sexism has become an increasing concern on social media platforms, as it affects the healthy development of the Internet and can have negative effects on society. While research in the sexism detection domain is growing, most of it focuses on English as the language and on Twitter as the platform. Our objective here is to broaden the scope of this research by considering the Chinese language on Sina Weibo. We propose the first Chinese sexism dataset, the Sina Weibo Sexism Review (SWSR) dataset, as well as a large Chinese lexicon, SexHateLex, made of abusive and gender-related terms. We introduce our data collection and annotation process, and provide an exploratory analysis of the dataset characteristics to validate its quality and to show how sexism is manifested in Chinese. The SWSR dataset provides labels at different levels of granularity, including (i) sexism or non-sexism, (ii) sexism category and (iii) target type, which can be exploited, among other uses, for building computational methods to identify and investigate finer-grained gender-related abusive language. We conduct experiments for the three sexism classification tasks using state-of-the-art machine learning models. Our results show competitive performance, providing a benchmark for sexism detection in the Chinese language, as well as an error analysis highlighting open challenges need-

Figures (11)

Table 1: Existing sexism-related datasets in multiple languages.

Figure 2: An example of Sina Weibo on weibo.cn

Figure 3: Overview of the data collection process.

Table 2: Number of weibos collected for each keyword.

Table 3: Examples of sexism categories and target types in the dataset.

Figure 4: Distribution of sexism categories and target types in the dataset.

Table 4: Description of features in the weibo and comment datasets.

Table 5: Statistics of the dataset.

Figure 5: Distribution of user gender across two classes in the dataset.

in Chinese. PCT denotes the percentage of each term.

Table 9: Error analysis for misclassified examples. TL denotes true label and PL denotes predicted label.

