The Emerging Markets Welfare (EMW) project investigates the effects of contentious politics on welfare state programs in countries of the Global South. It hypothesizes that government response to social contention is a significant factor shaping welfare policies. Mapping the dynamics of social contention in a given country is therefore crucial, and it constitutes a fundamental component of the entire project. Investigating the causal relationship between social contention and government policy involves more than a simple correlation, particularly when the focus is on a specific kind of government action, namely welfare policies. A map of social contention adequate for such an understanding should thus go beyond laying out the basic ebb and flow of contention over space and time. It should also provide insight into particulars such as the types of action repertoires, the levels of violence, the characteristics of the actors or social groups that engage in contentious politics, and the characteristics of the demands that they raise.
The purpose of the second work package of the EMW Project is to draw this map of social contention. To this end, we created a database of contentious politics events by extracting information from news reports published in the most prominent online sources of each focus country. The Global Contentious Politics Database (GLOCON) records contentious politics events (referred to as protest events for the sake of brevity) that take place within the borders of our focus countries, together with all the information available in the source about the events’ time and place, actor, type, demands raised, and violence level. At present, the GLOCON database contains protest event data from India, China, South Africa, Argentina, and Brazil. It features data in three languages: English for India, China, and South Africa; Spanish for Argentina; and Portuguese for Brazil. The database was designed to accommodate the addition of other focus countries and/or news sources in the future.
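As an illustration only, the kinds of information listed above could be held in a record like the following minimal Python sketch. The class name `ProtestEvent`, all field names, and the helper `register_country` are hypothetical and are not taken from the GLOCON database itself; only the country-to-language mapping comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Country-to-language mapping as stated in the text above.
SOURCE_LANGUAGE = {
    "India": "English",
    "China": "English",
    "South Africa": "English",
    "Argentina": "Spanish",
    "Brazil": "Portuguese",
}


@dataclass
class ProtestEvent:
    """Hypothetical record layout; field names are illustrative assumptions."""
    country: str
    time: Optional[str] = None            # as reported in the source, if any
    place: Optional[str] = None
    actor: Optional[str] = None
    event_type: Optional[str] = None
    demands: List[str] = field(default_factory=list)
    violence_level: Optional[str] = None

    @property
    def language(self) -> str:
        # Raises KeyError for a country that has not been registered yet.
        return SOURCE_LANGUAGE[self.country]


def register_country(country: str, language: str) -> None:
    """Hypothetical helper: add a new focus country without changing the record type."""
    SOURCE_LANGUAGE[country] = language


event = ProtestEvent(country="Brazil", demands=["higher wages"])
print(event.language)
```

Every field other than the country is optional in this sketch because a news report may not state every piece of information about an event.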
The database was created with automated text processing tools that detect whether a news article contains a protest event, locate protest information within the article, and extract pieces of information about the detected protest events. The basis for training and testing the automated tools is the GLOCON Gold Standard Corpus (GSC), which contains news articles from multiple sources from each focus country. The articles in the GSC were manually coded by skilled annotators in both the classification and the extraction tasks, with the accuracy and consistency that automated tool development demands. To ensure this accuracy and consistency, the annotation manuals in this document lay out the rules according to which annotators code the news articles. Annotators refer to the manuals at all times for all annotation tasks and apply the rules that they contain.
Despite the EMW Project's focus on the countries of the Global South, and the initial choice of a limited number of countries to be featured in the GSC, none of the rules or principles contained in this manual is more or less applicable to certain countries, sources or periods than others. The GLOCON database aims to be inclusive and capable of expanding. Securing consistency, reliability, and validity of data in the face of temporal and spatial expansion requires that annotation principles are generally applicable and that they are applied consistently.
The annotation process is composed of three main levels for each news report document. The document-level annotation determines the news articles that contain information on actual (past or ongoing) protest events. The sentence-level annotation aims to locate sentences that contain protest event-related information. In the final phase, words or phrases that give concrete information about protest events are detected.
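The dependency between the three levels can be sketched as a small consistency check. This is a minimal Python sketch under stated assumptions, not the project's actual tooling: the class names, the span format `(start, end, label)`, and the labels `"participant"` and `"place"` are all hypothetical. It assumes that event sentences occur only in documents labeled as protest articles and that token-level spans occur only in event sentences.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SentenceAnnotation:
    text: str
    is_event_sentence: bool = False
    # (start, end, label) character spans inside `text`; labels are hypothetical.
    spans: List[Tuple[int, int, str]] = field(default_factory=list)


@dataclass
class DocumentAnnotation:
    is_protest_article: bool
    sentences: List[SentenceAnnotation] = field(default_factory=list)


def consistency_errors(doc: DocumentAnnotation) -> List[str]:
    """Report places where a lower level contradicts the level above it."""
    errors = []
    for i, sent in enumerate(doc.sentences):
        if sent.is_event_sentence and not doc.is_protest_article:
            errors.append(f"sentence {i}: event sentence in a non-protest document")
        if sent.spans and not sent.is_event_sentence:
            errors.append(f"sentence {i}: token spans in a non-event sentence")
        for start, end, label in sent.spans:
            if not (0 <= start < end <= len(sent.text)):
                errors.append(f"sentence {i}: span ({start}, {end}) for '{label}' is out of bounds")
    return errors


doc = DocumentAnnotation(
    is_protest_article=True,
    sentences=[
        SentenceAnnotation(
            "Workers marched in Delhi on Monday.",
            is_event_sentence=True,
            spans=[(0, 7, "participant"), (19, 24, "place")],
        ),
        SentenceAnnotation("The weather was mild."),
    ],
)
print(consistency_errors(doc))  # → []
```

A check of this kind only confirms that the levels agree with one another; whether a given label is correct is still decided by the rules in the manuals.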
Corresponding to the document classification, sentence classification, and information extraction tasks, there are three main and two supplemental manuals, which together cover the entire annotation process from the document, through the sentence, to the token level. The first manual is the Document Level Protest Annotation Manual (DOLPAM), which establishes the rules for determining which news articles contain protest events; in other words, for classifying news articles into those that contain protest events and those that do not. It lays out the protest event ontology, that is, the protest event definition that specifies the range of contentious politics events included in the scope of the project. It introduces and exemplifies different types of protest events, and defines the criteria that a news report must meet to be labeled as a protest event article. The second manual, the Sentence Level Protest Annotation Manual (SELPAM), covers the classification of sentences in documents that have already been classified as protest event articles. Like DOLPAM, it defines and exemplifies event sentences and enumerates the rules by which sentences are labeled as event sentences or non-event sentences. The third and final main manual is the Token Level Protest Annotation Manual (TOLPAM), which is the longest and most detailed of the three. It defines the types of event-related information that the project aims to collect from news articles and explains how the expressions within event sentences that contain this information are tagged.
The remaining two manuals are supplemental; they label further information about the events that have already been extracted in the three main levels of annotation. Both define annotation tasks that are performed on the document level. The first is the Violent Protest Events Annotation Manual, which lays out the rules for classifying news reports that contain protest events as violent or non-violent. The second, the Protest Event Demands Annotation Manual, sets the rules for labeling the demands and/or grievances associated with the protest events extracted from the news articles. More detailed information about each manual can be found under its respective heading.
Even though each level of annotation has its own manual, the process must be treated as an integrated whole, since each level is premised on the results of the previous one. Hence, annotators should familiarize themselves with all the manuals before starting annotation on any single level. Knowing in advance what the sentence and token level annotation tasks entail helps an annotator working on the document level considerably. That said, it is neither practical nor advisable to try to learn all annotation procedures by heart. Memory can mislead, and recurrent reference to the manuals is the preferred way of using them. Annotators must therefore read the entire manual before starting annotation, and refer back to it whenever there is the slightest doubt about a rule or a difficult case.
The content of the annotation manual builds on the general principles and standards of linguistic annotation laid out in other prominent annotation manuals such as ACE, CAMEO, and TimeML. These principles, however, have been heavily adapted to accommodate the social scientific concepts and variables employed in the EMW project. The manual was shaped through a long trial-and-error process that accompanied the annotation of the GSC. It owes much of its current shape to the meticulous work and invaluable feedback of highly specialized teams of annotators, whose diligence and expertise greatly increased the quality of the corpus.