edward ombui

KenTrans: A Parallel Corpora for Swahili and local Kenyan Languages

Harvard Dataverse, 2022

Kencorpus: Kenyan Languages Corpus

Harvard Dataverse, 2022

KenSpeech: Swahili Speech Transcriptions

Harvard Dataverse, 2022

KenPos: Kenyan Languages Part of Speech Tagged dataset

Harvard Dataverse, 2022

Leveraging Intelligent Decision Support System to Promote Inclusive Remote Teaching and Learning in Institutions of Higher Education in East Africa: Prototype Development

Social Science Research Network, 2023

KenSwQuAD – A Question Answering Dataset for Swahili Low Resource Language

ACM Transactions on Asian and Low-Resource Language Information Processing

The need for Question Answering datasets in low resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts in Swahili, a low resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 texts from the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all cor...

Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili

Cornell University - arXiv, Oct 29, 2022

Building automatic speech recognition (ASR) systems is a challenging task, especially for under-resourced languages that need to construct corpora nearly from scratch and lack sufficient training data. It has emerged that several African indigenous languages, including Kiswahili, are technologically under-resourced. ASR systems are crucial, particularly for hearing-impaired persons, who can benefit from having transcripts in their native languages. However, the absence of transcribed speech datasets has complicated efforts to develop ASR models for these indigenous languages. This paper explores the transcription process and the development of a Kiswahili speech corpus, which includes both read-out texts and spontaneous speech data from native Kiswahili speakers. The study also discusses the vowels and consonants in Kiswahili and provides an updated Kiswahili phoneme dictionary for the ASR model, which was created using CMU Sphinx, an open-source speech recognition toolkit. The ASR model was trained using an extended phonetic set and yielded a word error rate (WER) of 18.87% and a sentence error rate (SER) of 49.5%, an improvement over previous similar research for under-resourced languages.
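For readers unfamiliar with the two metrics reported in this abstract, the following is a minimal sketch of how WER and SER are commonly computed. It assumes the usual definitions (WER as word-level edit distance divided by the number of reference words, SER as the fraction of sentences with at least one error); the function names are illustrative and are not taken from the paper or from CMU Sphinx.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(pairs):
    """pairs: list of (reference, hypothesis) transcript strings."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def sentence_error_rate(pairs):
    """Fraction of sentences whose hypothesis differs from the reference."""
    wrong = sum(r.split() != h.split() for r, h in pairs)
    return wrong / len(pairs)


# Hypothetical transcripts, for illustration only.
pairs = [("habari ya asubuhi", "habari za asubuhi"),
         ("asante sana", "asante sana")]
print(word_error_rate(pairs), sentence_error_rate(pairs))
```

Because a single wrong word makes a whole sentence count as an error, SER is typically much higher than WER, as in the figures reported above.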

Best feature performance in codeswitched hate speech texts

How well can the concept of hate speech be abstracted in order to inform automatic classification in codeswitched texts by machine learning classifiers? We explore different representations and empirically evaluate their predictiveness using both conventional and deep learning algorithms in identifying hate speech in a ~48k human-annotated dataset that contains mixed languages, a phenomenon common among multilingual speakers. This paper proposes a novel hierarchical approach that employs Latent Dirichlet Allocation to generate topic models that feed into another high-level feature set that we abbreviate as PDC. PDC groups words of similar meaning into word families during the preprocessing stage for supervised learning models. The high-level PDC features generated are based on the Ombui et al. (2019) hate speech annotation framework, which is informed by the triangular theory of hate (Sternberg, 2003). Results obtained from frequency-based models using the PDC feature on the annotated dataset of ~48k short messages, comprising tweets generated during the 2012 and 2017 Kenyan presidential elections, indicate an improvement in classification accuracy in identifying hate speech as compared to the baseline.

Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Cornell University - arXiv, Aug 25, 2022

Indigenous African languages are categorized as under-served in Artificial Intelligence and suffer poor digital inclusivity and information access. The challenge has been how to use machine learning and deep learning models without the requisite data. Kencorpus is a Kenyan Language corpus that intends to bridge the gap on how to collect and store text and speech data that is good enough to enable data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. Kencorpus is a corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. This corpus intends to fill the gap of developing a dataset that can be used for Natural Language Processing and Machine Learning tasks for low-resource languages, with such languages usually being neglected due to few resources and research efforts. The Kencorpus is therefore a collection of text and speech data in the three languages. In the Kencorpus project, three Luhya dialects, namely Lumarachi, Lulogooli and Lubukusu, were sampled, as Luhya has several dialects. Each of these languages and dialects therefore contributed text and speech data for the language corpus. Data collection was done by researchers who were deployed to the various data collection sources such as communities, schools and collaborating partners such as media and publishers. Kencorpus has a collection of 5,594 items, being 4,442 texts of 5.6 million words and 1,152 speech files worth 177 hours. Based on this data, other datasets were also developed as part of the project. These are Part of Speech tagging sets for Dholuo and the Luhya dialects, with 50,000 and 93,000 words tagged respectively, and Question-Answer pairs created from the Swahili text corpus, for which 1,445 stories were annotated with 7,537 QA pairs. Translations of texts from Dholuo and Luhya into Swahili were done for 12,400 sentences. The datasets are useful for machine learning tasks such as text processing, annotation and translation. The project also undertook proof of concept systems in speech to text and in machine learning for the Question Answering task. These systems achieved a performance of 75% for the former and 60% for the latter. These are initial results that give great promise for the usability of the Kencorpus to the machine learning community. Kencorpus is the first corpus of its kind for these low resource languages and forms a basis of learning and sharing experiences for similar works, especially for low resource languages. Challenges in developing the corpus included deficiencies in the data sources, data cleaning challenges, relatively short project timelines and the COVID-19 pandemic, which restricted movement and hence the ability to get the data in a timely manner.

Psychosocial Features for Identifying Hate Speech in Social Media Text

Journal of Education, Society and Behavioural Science, 2021

This study uses natural language processing to identify hate speech in codeswitched social media text. It trains nine models and tests their predictiveness in recognizing hate speech in a 50k human-annotated dataset. The article proposes a novel hierarchical approach that leverages Latent Dirichlet Allocation to develop topic models that help build a high-level Psychosocial feature set we call PDC. PDC organizes words into word families, which helps capture codeswitching during preprocessing for supervised learning models. Informed by the duplex theory of hate, the PDC features are based on a hate speech annotation framework. Frequency-based models employing the PDC feature on tweets from the 2012 and 2017 Kenyan presidential elections yielded an f-score of 83 percent (precision: 81 percent, recall: 85 percent) in recognizing hate speech. The study is notable because it publicly exposes a rich codeswitched dataset for comparative studies. Second, it describes how to create a novel P...
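The three figures quoted in this abstract are mutually consistent if the f-score is the balanced F1 (the harmonic mean of precision and recall). That is an assumption, since the abstract does not name the variant; the short sketch below simply checks the arithmetic, with `f_score` as an illustrative helper name.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
print(round(f_score(0.81, 0.85), 2))  # 0.83
```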

Psychosocial Features for Hate Speech Detection in Code-switched Texts

International Journal of Information Technology and Computer Science, 2021

This study examines the problem of hate speech identification in codeswitched text from social media using a natural language processing approach. It explores different features in training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study proposes a novel hierarchical approach that employs Latent Dirichlet Allocation to generate topic models that help build a high-level Psychosocial feature set that we abbreviate as PDC. PDC groups words of similar meaning into word families, which is significant in capturing codeswitching during the preprocessing stage for supervised learning models. The high-level PDC features generated are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results obtained from frequency-based models using the PDC feature on the dataset comprising tweets generated during the 2012 and 201...

Hate Speech Detection in Code-switched Text Messages

Not only does it happen in America, but also in Asia, in Africa and all over the world: Hate Speech. The exponential growth of user-generated content on social media bordering on hate speech is increasingly alarming. Several efforts to monitor this phenomenon by social media network companies and the research community are ongoing, with varying degrees of success. One gap in previous studies that this study addresses is the identification of hate speech in codeswitched text messages. The alternation of words in different languages within a message is a common occurrence among multilingual persons or communities. The study explored the performance of different features across various machine learning algorithms and established that character-level Term Frequency-Inverse Document Frequency features performed best on a codeswitched dataset of 25k annotated tweets when used with a support vector machine, as compared to six other conventional and two deep learning algorithms.
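To illustrate why character-level features suit codeswitched text, here is a minimal pure-Python sketch of character n-gram TF-IDF weighting. It is not the study's pipeline: the n-gram range, the smoothed idf formula and the sample messages are all assumptions, and the classifier step (a support vector machine in the study) is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """All character n-grams of the lowercased text, for n in [n_min, n_max]."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n_min=2, n_max=3):
    """Return one sparse {ngram: weight} dict per document."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    doc_freq = Counter(g for c in counts for g in c)  # documents containing each n-gram
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            g: (k / total) * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)  # tf * smoothed idf
            for g, k in c.items()
        })
    return vectors


# Hypothetical mixed Swahili-English messages, for illustration only.
messages = ["sisi ni wamoja, we are one", "tuko pamoja, we stand together"]
vectors = tfidf_vectors(messages)
print(len(vectors[0]), len(vectors[1]))
```

Because the n-grams are drawn from characters rather than whole words, no language identification or per-language tokenizer is needed, which is one plausible reason such features hold up on mixed-language messages.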

Annotation Framework for Hate Speech Identification in Tweets: Case Study of Tweets During Kenyan Elections

2019 IST-Africa Week Conference (IST-Africa)

Considering the colossal amount of user-generated content on social media, it has become increasingly difficult to monitor hateful content being published in public online spaces, especially during electioneering periods, particularly in Kenya. In this regard, it is crucial to automate the identification of hate speech in order to manage the volume, variety, veracity and velocity of this content. In this research, we postulate a supervised machine learning approach whereby annotation of the training data set is critical in determining the performance of the trained classifier. Therefore, we develop an annotation framework based on Sternberg’s (2003) hate theory and test its performance in classifying about 5k tweets using 3 human annotators per tweet. Preliminary results indicate an intercoder reliability score of 0.5027 based on Krippendorff’s alpha.
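As background on the reliability figure quoted above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from the coincidence matrix as one minus the ratio of observed to expected disagreement. It assumes nominal (unordered) categories, which the abstract does not state, and the sample ratings are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of label lists, one list per item (use None for a missing rating)."""
    coincidences = Counter()
    for labels in units:
        labels = [l for l in labels if l is not None]
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two ratings carry no agreement information
        for a, b in permutations(labels, 2):  # ordered pairs of distinct ratings
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one category was ever used
    return 1 - observed / expected


# Hypothetical ratings: three annotators per tweet, labels "hate" / "none".
ratings = [["hate", "hate", "hate"],
           ["none", "none", "hate"],
           ["none", "none", "none"]]
print(krippendorff_alpha_nominal(ratings))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a score near 0.5 indicates moderate agreement among the annotators.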

Leveraging Hierarchical Features for HateSpeech Identification in Short Message Texts

2019 IEEE AFRICON, 2019

This study proposes that quick gains in hate speech identification can be achieved by using a simple hierarchical structure of high-level features that map onto low-level features, e.g. hate lexical terms mapped to term frequency-inverse document frequency features. The study implements this approach and uses supervised machine learning to train a classifier on 48k human-annotated tweets to automatically identify hate speech generated during the 2012 and 2017 presidential elections in Kenya. Preliminary results indicate an accuracy of 0.74, which is higher than the baseline for the same data set labeled by human annotators.

The Design and Development of a Custom Text Annotator

2019 IEEE AFRICON, 2019

Researchers involved in work that entails annotation of text information are usually faced with the challenge of choosing an appropriate tool to use in their work. Such work usually involves establishment of an annotation scheme to guide the annotation exercise and the identification of a tool that matches the established scheme. The researchers may also be required to work with a large team of annotators who may be geographically dispersed, and a large dataset that needs to be annotated within a fairly short period of time. This paper seeks to demystify text annotators and to establish a simple approach for the development of a custom annotation tool based on a project-specific annotation scheme. We also introduce an unconventional annotation approach that speeds up the annotation exercise without a significant loss of reliability.

Wiring Kenyan languages for the global virtual age: An audit of the human language technology resources

Whereas we recognize the advancement of computing and internet technologies over the years and their impact in the areas of health, education, government, etc., there is increasing cognizance that the languages used in these technologies will have a far-reaching impact in terms of accessibility and usability by a wider audience. European languages, and specifically English, are considered the lingua franca of computing and the Internet due to the vast amount of language resources available in these languages. Does this therefore exacerbate the language and technology gap, especially with regard to African languages? This research is motivated by this question and begins to tackle a strand of the overarching language technology issue by auditing the human language technologies for Kenyan languages. The research uses the Basic Language Resource Kit (BLARK) to do the inventory. This method has been successfully used to conduct language resource surveys in other countries.

Building and Annotating a Codeswitched Hate Speech Corpora

International Journal of Information Technology and Computer Science

Presidential campaign periods are a major trigger event for hate speech on social media in almost every country. A systematic review of previous studies indicates inadequate publicly available annotated datasets and hardly any evidence of theoretical underpinning for the annotation schemes used for hate speech identification. This situation stifles the development of empirically useful data for research, especially in supervised machine learning. This paper describes the methodology that was used to develop a multidimensional hate speech framework based on the components of the duplex theory of hate [1], which include distance, passion, commitment to hate, and hate as a story. Subsequently, an annotation scheme based on the framework was used to annotate a random sample of ~51k tweets from ~400k tweets that were collected during the August and October 2017 presidential campaign period in Kenya. This resulted in a gold-standard codeswitched dataset that could be used for comparative and empiri...
