Online Information Extraction

15 papers

20 followers

About this topic

Online Information Extraction is the automatic retrieval of structured information from unstructured or semi-structured data sources on the internet. It uses algorithms and natural language processing techniques to identify and extract relevant entities, relationships, and events in real time.

Key research themes

1. How can dependency parsing and hand-crafted rules improve Open Information Extraction across different languages?

This research area focuses on the development and refinement of Open Information Extraction (OIE) methods using dependency parsing combined with hand-crafted linguistic rules. It is significant because dependency structures capture syntactic relations that enable precise extraction of relational triples without relying on domain-specific training data. The theme also extends to exploring language-specific adaptations, particularly for languages like Portuguese, where generic rules may underperform compared to English.
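As a rough illustration of how a hand-crafted rule over a dependency structure can yield a relational triple, the sketch below applies a single rule to a hand-written parse. The `Token` type, the `extract_triples` function, the example sentence, and the Universal Dependencies style labels (`nsubj`, `obj`) are all assumptions made for this example; it is not the rule set of any system listed here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    text: str
    head: int    # index of the syntactic head, 0 for the root
    deprel: str  # dependency label (Universal Dependencies style, assumed)


def extract_triples(tokens):
    """Apply one rule: a head with both an nsubj and an obj child yields a triple."""
    triples = []
    for tok in tokens:
        children = [t for t in tokens if t.head == tok.idx]
        subjects = [t.text for t in children if t.deprel == "nsubj"]
        objects = [t.text for t in children if t.deprel == "obj"]
        for subj in subjects:
            for obj in objects:
                triples.append((subj, tok.text, obj))
    return triples


# Hand-written parse of "Maria comprou livros" ("Maria bought books").
sentence = [
    Token(1, "Maria", 2, "nsubj"),
    Token(2, "comprou", 0, "root"),
    Token(3, "livros", 2, "obj"),
]
triples = extract_triples(sentence)
print(triples)
```

A real system would obtain the parse from a trained parser and would need many more rules (for example for coordination, subordinate clauses, and language-specific constructions).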

DptOIE: a Portuguese Open Information Extraction system based on dependency analysis

by Daniela Claro

2022

Key finding: This paper presents DptOIE, an OIE system tailored for Portuguese that combines dependency parsing with a novel set of hand-crafted rules specific to Portuguese linguistic structures. By training its own POS tagger and... Read more

2. What are effective strategies for scalable, web-scale information extraction using linked open data and automated wrapper induction?

This theme investigates methods for performing Information Extraction (IE) at web scale, addressing challenges like scarce labeled data and heterogeneity of web content. It explores leveraging Linked Open Data (LOD) as large-scale semi-structured annotated resources to bootstrap IE, combined with wrapper induction techniques and iterative learning to automate extraction pattern discovery, enabling adaptable and domain-independent extraction.

Early Steps Towards Web Scale Information Extraction with LODIE

by Ziqi Zhang

2023, AI Magazine

Key finding: The paper introduces the LODIE project which utilizes Linked Open Data as a rich, large-scale knowledge base to seed and guide web-scale IE. By combining wrapper induction with bootstrapping techniques over LOD-annotated web... Read more

3. How can information extraction workflows be effectively applied in digital libraries using nearly unsupervised methods and what are their practical limitations?

This area evaluates the application of nearly unsupervised Open Information Extraction (OpenIE) workflows in digital library settings, focusing on cross-domain adaptability, extraction quality, and operational costs. It critically examines the challenge of non-canonicalized (heterogeneous and noisy) extractions from unsupervised methods, the required domain expertise, and computational overhead, aiming to bridge the gap between state-of-the-art extraction methods and real-world digital library needs.

A detailed library perspective on nearly unsupervised information extraction workflows in digital libraries

by Wolf-Tilo Balke

2024, International Journal on Digital Libraries

Key finding: Through case studies in domains including encyclopedias, pharmacy, and political sciences, this paper demonstrates that nearly unsupervised OpenIE combined with entity linking and canonicalization can produce good precision... Read more

A Library Perspective on Nearly-Unsupervised Information Extraction Workflows in Digital Libraries

by Wolf-Tilo Balke

2024, Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries

Key finding: This complementary study focuses on workflow design for unsupervised IE, analyzing the portability of state-of-the-art extraction toolboxes across domains, affordability in terms of expertise and computation, and identifying... Read more

All papers in Online Information Extraction

The Datafied Society. Studying Culture through Data

by Mirko Tobias Schäfer

2024

A few years back we embarked on an expedition into the rapidly transforming landscape of data research, the narratives of big data and the practices emerging with novel data resources, tools and new directions of social and cultural inquiry. This book represents our own experiences and impressions of this journey. We were out there on unfamiliar and at times even uncharted territory. Often we depended on the help of more learned colleagues who generously shared their knowledge, gave us directions or helped with problems. Many of them became authors for this book and we would like to thank them for trusting us with collecting their contributions and presenting them in a joint volume. The editors wish to thank the Utrecht Data School staff Thomas Boeschoten, Irene Westra, Iris Muis, Daniela van Geenen and Gerwin van Schie for their input and for providing an intellectually stimulating work environment. With the Utrecht Data School we have created a place where ambitious and enthusiastic students can meet to join us in this exploration. We are grateful for having the rare opportunity of conducting research together with students from whom we can learn so much and whose insatiable curiosity is an inspiration as well as a constant reminder of why we became teachers in the first place. Our gratitude extends also to the Institute of Cultural Inquiry and the open access fund at Utrecht University for enabling us to make this book open access. A special thanks to William Uricchio, Fernando van der Vlist and Eef Masson for their helpful comments and advice at various stages of the editing process. Finally, we are particularly grateful to Nicolás López Coombs for helping us with the editing and the formatting of the book, but most importantly for keeping an eye on our timeline. 
Mirko Tobias Schäfer & Karin van Es
Anywhere but in their office, 2016

This book is an important contribution towards meeting the challenges of the platform-driven, data-fuelled world in which we have all come to live.

Eco-friendly sonoluminescent determination of free glycerol in biodiesel samples

by Paulo Henrique Gonçalves Dias Diniz

2022, Talanta

This paper proposes a flow-batch methodology for the determination of free glycerol in biodiesel that is notably eco-friendly, since non-chemical reagents are used. Deionized water (the solvent) was used alone for glycerol (sample)... more

Arabic Location Name Annotations

by Omar Asbayou

2022

This paper shows how location named entity (LNE) extraction and annotation, which forms part of our named entity recognition (NER) systems, is an important task in managing large amounts of data. In this paper, we try to explain our... more

Repurposing digital methods: The research affordances of platforms and engines

by Esther Weltevrede

2021

Digital research is often understood as data-driven. Yet the ways in which data are already informed by specific analytical assumptions and inscriptions of the media in which they originate, circulate, or are being used is often... more

Figure 2: How Twitter tracked the News of the World scandal. Interactive visualization showing the amount of tweets per minute with the #notw hashtag. Source: Richards et al. 2011.

Figure 3: Google Scraper result displaying the number of web pages mentioning ACTA issues on the website accessvector.org. The issues ‘enforcement’ and ‘transparency’ are found the most on pages of accessvector.org. Screenshot taken from the Google Scraper tool on February 2, 2012. Source: Google Scraper 2007.

Figure 5: Co-words related to ‘austerity’ derived from the Twitter streaming API on January 1-31, 2012 concerning over 100.000 tweets. Visualization created by loading the co-word network from DMI-TCAT into Gephi and using a Force Atlas 2 layout algorithm, excluding top 1% most connected terms including austerity and RT, at least 100 co-occurrences are retained, nodes scaled by degree, 2012.

Figure 8: Barcode chart showing the pace of new content. Each line represents a new result. Lines are distributed over the five minute intervals, the denser the lines within an interval, the more new results compared to the previous five minute interval. Visualization created by using custom code and Adobe Illustrator, 2010. Source: Borra et al. 2010.

Figure 9: Pace Online for “Pakistan Floods”. Barcode chart showing the pace of content for the issue Pakistan Floods in Google Web, Google News, Google Blogs, Twitter, YouTube, Flickr, Wikipedia and Facebook in both freshness and relevance mode, 18-19 August 2010. Each line represents a new result. Lines are distributed over the five minute intervals, the denser the lines, the more new results compared to the previous five minute interval. Facebook’s pace is determined solely by freshness, while for example Google News’ pace is based both on freshness and relevance. Visualization created by using custom code and Adobe Illustrator, 2010. Source: Borra et al. 2010.

Querying historical controversies in dominant devices and platforms, the question we ask is what kind of history are we accessing on each device?

Figure 4: Historical Controversies Now. Screenshot from interactive visualization showing results for the query [9/11] from Twitter, Facebook, Google News, Google Blog, YouTube, Flickr, Google Web, Google Scholar and Google Books, 18 August 2010. The y-axis shows from top to bottom: day, week, month, year, decade, century, undefined. The x-axis shows results for particular platforms. The markers are either circles for hot controversies, squares for cold controversies, and triangles for undefined types of controversies. The color of the marker is red if it is a present controversy, green if it is past controversy, and grey if the controversy is undefined. The figure shows that Twitter displays recent content about 9/11, while for example Google Books displays older content. For the full interactive version see Dagdelen et al. 2010.

Figure 10: Barcode chart showing the pace of new content for the issue ‘Pakistan Floods’ on Twitter, 19 August 2010. Each line represents a new result. Lines are distributed over the five minute intervals, the denser the lines, the more new results compared to the previous five minute interval. The orange square indicates idle time on Twitter. Visualization created by using custom code and Adobe Illustrator, 2010. Source: Borra et al. 2010.

Figure 11: Relative distribution of Top Level Domains (TLDs) in the Dutch blogs from 1999 until 2009. The Dutch blogs in the collection retrieved from the Wayback Machine favor the .nl domain over all other domains throughout the years. Moreover, a significant increase in the .nl domain becomes apparent, whereas the .com domain is steadily losing share over time. Visualization created in Google Spreadsheets, 2011. Source: Weltevrede and Helmond 2012a.

Figure 12: Relative distribution of self-hosted blog software & blog platforms in Dutch blogs from 1999 until 2009. The graph shows the rise and popularity of Blogger’s platform, Blogspot, in the beginning of 2000 in the collection of blogs retrieved from the Wayback Machine. The decline of Blogspot coincides with the rise of the Web-Log.nl blogging platform, and other Dutch blog platforms such as BlogNL, Blogo, Blogse, Punt and Blogeiland. The figure clearly shows how from 2004-2005 onwards Dutch bloggers (except for a relatively small number of Blogspot and WordPress.com users) shift to Dutch platforms, which are orange color-coded. Only a few bloggers remain on legacy platforms such as Pitas, which no longer accept new members but are still functional for old members. Visualization created in Google Spreadsheets and Illustrator, 2011. Source: Weltevrede and Helmond 2012a.

Figure 13: The relative amount of Dutch blog platforms over time compared to other blog platforms from 1999 until 2009. In 2009 almost all bloggers on blog platforms make use of Dutch platforms in the collection of blogs retrieved from the Wayback Machine. Visualization created in Google Spreadsheets, 2011. Source: Weltevrede and Helmond 2012a.

Figure 16: The Dutch blogosphere in transition: the rise and evolution of the Dutch blogosphere 1999-2009. Mapping the outlinks of the blogs retrieved from the Wayback Machine from 1999 until 2009 allowed to go back in time and study how and where the Dutch blogosphere originated. The network is made with the Internet Archive Wayback Machine Network per Year tool and networks per year are overlaid in Gephi. Thereafter a co-link analysis was applied and the graph was drawn using the Force Atlas 2 layout algorithm. The figure shows the rise, evolution and first signs of decline of the Dutch blogosphere. Grey depicts the hyperlink network of all years together and red the blogosphere of a particular year. The first Dutch bloggers starting mid 1999 were not interlinked into a ‘sphere’, so the beginning of a Dutch blogosphere can be traced back to 2000. Visualization created with Gephi and Adobe Illustrator, 2011. Source: Weltevrede and Helmond 2012a.

Figure 17: The pre-blogosphere in 1999: early blogs link outward. The network is made with the Internet Archive Wayback Machine Network per Year tool and visualized with Gephi using the Force Atlas 2 layout algorithm; colors were produced with the modularity algorithm. Note that in this figure no co-link analysis was performed. Some of the known Dutch bloggers, as mentioned in Meeuwsen (2010), together with less well-known bloggers, are present but do not form a blogosphere yet. Most notably Alto169, ~wzweers and ~onnoz reach out to other Dutch blogs and may be seen as an effort to establish a community between blogs. Visualization created with Gephi and Adobe Illustrator, 2011. Source: Weltevrede and Helmond 2012a.

Figure 18: Partial screenshot of the reconstructed Dutch blogosphere in 2000. All blogs retrieved from the Wayback Machine have been color-coded based on the type of host. Blue nodes are personal homepages, pink nodes are student pages, and yellow pages are early blog platforms. Bloggers on personal homepage providers (blue) and student pages (pink) dominate. For the full interactive version with the G-Atlas tool see Weltevrede and Helmond 2011.

Figure 19: Partial screenshot of the reconstructed Dutch blogosphere of 2005: a cluster of marketing blogs (in pink). Marketing blogs are marked in pink. The Dutch marketing cluster emerged in 2005 and was still a very dominant cluster in the Dutch blogosphere in 2011. Data retrieved from the Wayback Machine. Visualization created with G-Atlas and Adobe Illustrator, 2011. Source: Weltevrede and Helmond 2011.

Figure 20: Partial screenshot of the reconstructed Dutch blogosphere in 2004: all co-linked Bloggers also link to the Dutch web statistics service Nedstat Basic. Data retrieved from the Wayback Machine. Visualization created with G-Atlas and Adobe Illustrator, 2011. Source: Weltevrede and Helmond 2011.

Figure 21: The Dutch Blogosphere in 2009: a comparison of different actor definitions. Social media platform nodes are highlighted in magenta. Both graphs show the full hyperlink network of the 2009 Blogosphere (i.e. without co-link analysis). Top: nodes represent host names. Bottom: nodes represent platform specific actor definitions such as user profiles. Through a more fine-grained link analysis (bottom) it becomes possible to analyze the role of social media platforms within the Dutch blogosphere in more detail. Data retrieved from the Wayback Machine. Visualization created with Gephi and Adobe Illustrator, 2011. Source: Weltevrede and Helmond 2012a.

Figure 22: Different types of links to social media in the 2009 Dutch blogosphere. The grey circles represent social media platforms in the 2009 blogosphere. The colored circles represent different types of platform links such as user pages, hashtag queries, and status updates. Each circle is scaled proportionally to the number of links received for a particular actor definition. Comparing the various social media platforms, the results suggest that some platforms can be defined as ‘media sharing’ platforms, such as YouTube and Flickr, which mainly consist of embedded content links in blogs. Data retrieved from the Wayback Machine. Visualization created in Adobe Illustrator, 2011. Source: Weltevrede and Helmond 2012a.

Figure 23: Climate Change Sceptics on the Web (Frederick Seitz). Tag cloud displaying the number of web pages mentioning skeptic “Frederick Seitz” in the top hundred unique hosts returned by Google for the query [“Climate Change”]. The order of Google results is retained in the visualization. It can be seen that Seitz is not well recognized by the ‘top of the web’. Visualization created in Adobe Illustrator, 30 July 2007. Source: Digital Methods Initiative 2009.

One of the Digital Methods program’s earliest empirical critical works using the outputs of Google search results studied the effects of PageRank by questioning to what extent the results returned provide mainstream or alternative voices in the top engine results (also see the work of Muddiman 2013; Eklof and Mager 2013). Research with PageRank is for instance suitable for ‘source distance’ research, a technique developed during the first Digital Methods Summer School in 2007, which looks at the sources in which certain terms appear and how they rank for a certain query. The classic example project is ‘climate change skeptics’ which looks at the position of known climate change skeptics in the top results for the query [“climate change”] (see

Figure 24: Issue Animals Hierarchy on The Web (Google). This figure shows how prominent certain animals are for a Google Search query for [“climate change”]. Animals are scaled by the number of results (in text and image) returned by Google Search. It is found that on the web, for a text query, results are distributed across all the animals and thus do not particularly favor one specific issue animal. Visualization created in Adobe Illustrator, 15 July 2007. Source: Digital Methods Initiative 2007.

Figure 25: Issue Animals Hierarchy in the News (Google News). This figure shows how prominent certain animals are for a Google News query for [“climate change”]. Animals are scaled by the number of results (in text and image) returned by Technorati. It is found that in the news, for a text query, the polar bear is the animal most associated with climate change, followed by the cow. Visualization created in Adobe Illustrator, 15 July 2007. Source: Digital Methods Initiative 2007.

Figure 26: Issue Animals Hierarchy in the Blogosphere (Technorati). This figure shows how prominent certain animals are for a Technorati query for [“climate change”]. Animals are scaled by the number of results (in text and image) returned by Technorati. It is found that in the blogosphere, for a text query, the polar bear is the animal most associated with climate change, followed by the cow. Visualization created in Adobe Illustrator, 17 July 2007. Source: Digital Methods Initiative 2007.

Figure 27: Issue Animals Hierarchy on the Web (Google Images). This figure shows how prominent certain animals are for a Google Images query for [“climate change”]. Images are scaled by the number of images returned by Google images. It is found that, except for the polar bear, no animals are particularly ‘favored’. Visualization created in Adobe Illustrator, 17 July 2007. Source: Digital Methods Initiative 2007.

Figure 28: A Website is gone: the apparent removal of 911truth.org from Google results for the query [9/11], September-October 2007. The graph shows how 911truth.org used to be in the top 10 Google results for the query [9/11], but received a dramatic drop starting 16 October, 2007. Source: Govcom.org 2007.

Figure 30: Hierarchies of rights types per country. Top ten distinctive rights types for the query [“rights”] in the local languages of various local Google versions (for example [“oigused”] in Google.ee and [“direitos”] in Google.pt). Results are in the order that Google provided and translated to English. It was found that certain countries have shared concerns, such as human rights, whereas others have unique concerns, such as activist’s rights in Australia. Visualization created in Adobe Illustrator, 2009. Source: Bekema et al. 2010.

Figure 31: Wikipedia Policies and Guidelines. Partial screenshot. Source: Wikipedia contributors 2012b.

Figure 32: The English Wikipedia article on ‘global warming’ on January 12, 2013 with all plain text blanked out to emphasize Wikipedia’s active content objects. Visualization created in Firefox, 2013. Source: Density Design et al. 2015.

Figure 33: Controversial issues in the ‘global warming’ Wikipedia article. Controversialness was calculated on the full edit history until April 16, 2014. The more controversial a wiki link, the redder it is. The images are converted to gray scale. The link to greenhouse gas is among the most controversial. Partial screenshot from the Contropedia Demo tool, 2015. Source: Density Design et al. 2015.

Figure 34: Partial dashboard view of the controversial wiki links in the ‘global warming’ Wikipedia article on April 16, 2014. Each row represents a controversial wiki link. The rows are ordered by how controversial the wiki link is. The redder the square, the more controversial it is overall. Also shown are a timeline of edits and a controversy bar indicating at which time the element was most controversial. Additionally, the type of element and the number of users editing sentences containing that wiki link are shown. The wiki link ‘greenhouse gas’ is the most controversial. Partial screenshot from the Contropedia Demo tool, 2015. Source: Density Design et al. 2015.

Figure 39: Detail of the dashboard view of the ‘global warming’ Wikipedia article from February 1, 2006 to February 1, 2008. The second wiki link was used and controversial in 2006, whereas the first wiki link was present in the article throughout most of 2007. The names of the wiki links are struck through as they do not appear in the article on February 1, 2008. Partial screenshot from the Contropedia Demo tool, 2015. Source: Density Design et al. 2015.

Figure 42: Partial dashboard view of the controversial wiki links in the ‘global warming’ Wikipedia article between April 16, 2010 and April 14, 2014. In this period for example scientific opinion on climate change is more controversial than greenhouse gas. Partial screenshot from the Contropedia Demo tool, 2015. Source: Density Design et al. 2015.

Figure 43: Alexa toolbar installation and registration process, with a field for the user’s postal code. Partial screenshot, 2011. Source: Alexa n.d.

Languages on the Iranian web

Figure 44: The distribution of languages on the Iranian webs. Likekhor, Alexa, Balatarin, Google Ad Planner, Google Web, Donbaleh and Sabzlink were queried to retrieve (ranked) lists of relevant URLs. As platforms such as Alexa only provide a ranked list of hosts, all URLs were chopped to their (sub-)domain part, in order to facilitate comparison. The data were collected on July 7, 2011 and resulted in a list of URLs per web device. To automatically detect the different languages used in the Iranian webs, a custom tool was used that makes use of AlchemyAPI. The graph shows the percentage of URLs in a given language, per web device. The languages are color-coded. Likekhor’s web is mainly in Persian, while the Donbaleh and Sabzlink’s web has most diverse languages. Visualization created in Adobe Illustrator, 2011. Source: Rogers et al. 2011.
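The step of chopping URLs to their (sub-)domain part so that lists from different devices become comparable could look roughly like the sketch below. The `to_host` helper and the sample URLs are hypothetical and only illustrate the idea with Python's standard library; they are not the tooling used in the study.

```python
from urllib.parse import urlsplit


def to_host(url):
    """Reduce a URL to its lowercase host name, or None when no host is present."""
    if "://" not in url:
        # Let urlsplit treat a scheme-less string as a network location.
        url = "//" + url
    return urlsplit(url).hostname


# Hypothetical sample URLs, with and without a scheme.
urls = [
    "http://news.example.ir/politics/item?id=7",
    "https://Example.com/",
    "blog.example.org/post/1",
]
hosts = [to_host(u) for u in urls]
print(hosts)
```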

The health of the Iranian web

Figure 45: The liveliness of the Iranian webs measured by HTTP response codes retrieved from the Netherlands. The graph shows color codes indicating the response codes and is subdivided in pie charts per web device. Data was collected in July and August 2011. The crowd-sourced web had most blocked pages. Visualization created in Adobe Illustrator, 2011. Source: Rogers et al. 2011.

YouTube, Iran Traffic Divided by Worldwide Traffic and Normalized

Subsequently, the Internet Archive’s new Wayback Machine was queried for each blog’s URL and the result selected was dated closest to the middle of each investigated year. From the 2,507 URLs requested 946 blogs could be retrieved from the Internet Archive. This method yielded a collection of archived copies of historical Dutch blogs for each year with a timestamp near the middle of the year. Only blogs with a copy in the Internet Archive were retained for further analysis. Table 1 depicts the number of blogs per year serving as a starting point. Table 1: The number of blogs from the expert list that were available in the Internet Archive Wayback Machine, per year from 1999 until 2009. The URLs were retrieved with the Internet Archive Wayback Machine Network Per Year tool, 2011. Source: Weltevrede and Helmond 2012a.
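Selecting the archived copy dated closest to the middle of a year can be sketched as follows. This is an illustration only: the function name, the choice of 1 July as the midpoint, the restriction to snapshots within the investigated year, and the sample 14-digit timestamps (`YYYYMMDDhhmmss`) are assumptions, and no archive API is called.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit snapshot timestamp, assumed input format


def closest_to_midyear(timestamps, year):
    """Return the timestamp in `year` nearest to 1 July, or None if there is none."""
    midpoint = datetime(year, 7, 1)  # assumed definition of "middle of the year"
    in_year = [
        datetime.strptime(ts, TS_FORMAT)
        for ts in timestamps
        if ts.startswith(str(year))
    ]
    if not in_year:
        return None  # no archived copy for this year, so the blog would be dropped
    best = min(in_year, key=lambda dt: abs(dt - midpoint))
    return best.strftime(TS_FORMAT)


# Hypothetical snapshot timestamps for one blog URL.
snapshots = ["20040115093000", "20040620120000", "20041201080000", "20050702000000"]
print(closest_to_midyear(snapshots, 2004))
```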

Table 2: Example of signals that are used to identify and define the local intent of queries and the locale of websites. Signals derived from: Diligenti et al. 2012; Buron et al. 2010; Heymans et al. 2011.

Figure 36: Detail of the dashboard view of the ‘global warming’ Wikipedia article on April 16, 2014, including a partial edit history. The red color under the ‘edit’ section indicates a deletion and green an insertion of text. The first few rows of the edit table show how models of global warming are put into doubt both through edit activity and via the comments. Partial screenshot from the Contropedia Demo tool, 2015. Source: Density Design et al. 2015.

Figure 35: Detail of the dashboard view ‘intergovernmental panel on climate change’ wiki link in the ‘global warming’ Wikipedia article on April 16, 2014. The wiki link was most edited within the article on global warming in 2005 and 2007 while it was involved in most edit activity in the middle of 2006. Partial screenshot from the Contropedia Demo tool, 2015. Source: Density Design et al. 2015.

I would like to add a fourth type, multiple aggregator site scraping or, put more conceptually, device cultures, to the sampling techniques described above. In any case, Google Ad Planner, Alexa, Google Web Search, Likekhor (Google Reader) as well as the crowd-sourcing platforms (Donbaleh, Sabzlink and Balatarin) either through query results or (dynamically generated) listings yield websites relevant for Iranians and Persian speakers. In the case at hand, with the exception of the searcher’s web (gained through .ir and generic TLD queries in Google’s region search), the percentages of .ir sites among the significant hosts outputted by the devices were relatively low (see Table 3). The crowd-sourced web references yielded the fewest .ir sites at just over 10 percent, whilst both Google Ad Planner’s web as well as Alexa’s web yielded most at about 25 percent. As noted earlier, the .ir sites in the overall collection of URLs were much less likely to be blocked than the .com sites. 80 percent of the websites tested and found blocked from inside Iran were .com, followed by .net with 6 percent and .org with 4 percent. The ccTLD .ir contained 3 percent of all censored hosts.

Table 3: Percentage of .ir sites in top websites collected from Alexa, Google Ad Planner, Likekhor, Donbaleh/Sabzlink and Balatarin, July 2011. Source: Rogers et al. 2011.

Table 4: Metrics commonly used in national web characterization studies, 2011. Source: Baeza-Yates, Castillo, and Efthimiadis 2007.

across a country, considering that both the implementation and user experience of censorship may vary by city, ISP, or even by computer (Wright, De Souza, and Brown 2011, 5). Taking this concern into account, we selected proxies from different cities and ISPs, and subsequently considered the response code returned by the majority. Table 5 details which proxies were used for this research. Table 5: Details of the proxies used to test for censorship in Iran. URLs were queried through the various proxies between August 22, 2011 and September 8, 2011. Proxies retrieved from the Censorship Explorer tool.
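Taking the response code returned by the majority of proxies can be expressed as a simple vote. The sketch below is a minimal illustration: the `majority_code` function, the strict more-than-half threshold, and the proxy names and codes are assumptions, not the procedure or data of the study.

```python
from collections import Counter


def majority_code(codes):
    """Return the response code reported by more than half of the proxies, else None."""
    if not codes:
        return None
    code, count = Counter(codes).most_common(1)[0]
    # Assumed rule: require a strict majority; a tie or plurality gives no verdict.
    return code if count > len(codes) / 2 else None


# Hypothetical HTTP response codes for one URL, keyed by proxy.
responses = {"proxy_a": 403, "proxy_b": 403, "proxy_c": 200}
verdict = majority_code(list(responses.values()))
print(verdict)
```

A plurality rule (the most frequent code wins even without a majority) would be an equally plausible reading; the source does not say how ties were handled.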

Types of Aspect Terms in Aspect-Oriented Sentiment Labeling

by Evgeniy Kotelnikov

2021

The paper studies the diversity of ways to express entity aspects in users’ reviews. Besides explicit aspect terms, it is possible to distinguish implicit aspect terms and sentiment facts. These subtypes of aspect terms were annotated... more

Analisis Interaksi Pengguna Twitter pada Strategi Pengadaan Barang Menggunakan Social Network Analysis

by Deinard Sihombing

2021

Abstrak Meningkatnya interaksi pengguna internet dan media sosial tentu memiliki dampak terhadap peningkatan jumlah data atau konten yang dihasilkan oleh pengguna. Data atau konten yang dihasilkan sering disebut dengan User Generated... more

Experimenting with Co-Occurrence Analysis Using #GDPR on Twitter

by Sameeh Selim

2021

This study seeks to contribute to recent debates concerning computational social science by experimenting with ‘co‐occurrence analysis’ on a Twitter dataset relating to the subject of the recently introduced General Data Protection... more

17. Data Point Critique

by Carolin Gerlitz

2021, The Datafied Society

There is a plethora of publications emerging in the humanities, especially media studies, that use data points from social media platforms in order to investigate social interaction and cultural production. Data points taken from social... more

The Politics of Real-time: A Device Perspective on Social Media Platforms and Search Engines

by E Weltevrede

2021, Theory, Culture & Society

This paper inquires in the politics of real-time in online media. It suggests that real-time cannot be accounted for as a universal temporal frame in which events happen, but explores the making of real-time from a device perspective... more

PERCEPÇÕES SOBRE CIBERESPAÇO E TERRITORIALIDADE DIGITAL: estudo exploratório com foco em aspectos socioculturais presentes na deep web e dark web

by Braz Batista Vas

2021, Revista Observatório

Aquilo que muitos conhecem popularmente como Internet, caracteriza-se, socioculturalmente como ciberespaço e possui territorialidade própria, bem como suas próprias práticas culturais, identificadas como cibercultura. Tendo em vista que,... more

Exploring Trends and Challenges in Sociological Research

by Louise Ryan

2021, Sociology

This is the first e-special issue for the journal Sociology and its chosen focus is the article 'The coming crisis of empirical sociology' by Savage and Burrows (2007). This article challenged sociologists with a variety of questions about the role, relevance and... more

CUSTOMERS' SATISFACTION OF ONLINE SHOPPING MEASURED BY INFORMATION QUALITY AND TRUST FACTORS

by IAEME Publication

2021, IAEME PUBLICATION

There is a need to study whether consumer trust and e-commerce information quality are the answers to the question what drives customers' purchase decision and consequently their satisfaction. This gap in knowledge can be a significant... more

Digital methods in a post-API environment

by Andreas Birkbak

2021, International Journal of Social Research Methodology

Qualitative and mixed methods digital social research often relies on gathering and storing social media data through the use of APIs (Application Programming Interfaces). In past years this has been relatively simple, with academic... more

Some Further Reflections on the Coming Crisis of Empirical Sociology

by Mike Savage

2020, Sociology

We respond to the two comments on our article 'The Coming Crisis of Empirical Sociology' from Rosemary Crompton (2008) and Richard Webber (2009) which have been published in Sociology, as well as issues arising from the wider debate... more

A Model Based Research Material Recommendation System For Individual Users

by Nikhat Akhtar

2020, Transactions on Machine Learning and Artificial Intelligence (TMLAI), Society for Science and Education, United Kingdom (UK), ISSN 2054-7390

As there is an enormous amount of online research material available, finding pertinent information for specific purposes has become a tedious chore. There is therefore a need for a research paper recommendation system to facilitate... more

Critically Engaging with Social Media Research Methods

by Dhiraj Murthy

2019, An End to the Crisis of Empirical Sociology?

As social media technologies such as Twitter, Instagram, and YouTube have become highly ubiquitous, social life itself has become reconfigured. Though early notions of an offline/online binary remain in some quarters of social research,... more

Unpacking the "Digital" and the "Social" in Digital Sociology

by Tomas Ariztia

2019, Science as Culture

Cooking with Controversies: How geographers might use controversy mapping as a research tool

by John-Michael Davis

2019, The Professional Geographer

The purpose of this paper is twofold: First, to suggest that techniques for mapping public disagreements over claims to knowledge, or controversies, can act as assistive devices for researchers in geography to move from research topics to... more

Table 1. Search criterion used to generate a transboundary e-waste controversy corpus.

... determine the search results users experience and their technical details are knowable only to small groups of people with access to them. Even so, the rules are premised on broad parameters known to outsiders. The page rank algorithm at the core of Google searches, for example, includes mathematical measures that mix inlink count (the number of links a page receives from other Web sites), user popularity (i.e., tracked user behavior with links), freshness (e.g., how frequently a page is updated), and longevity together to derive a measure of relevance (and thus, ranking) in the returned search results (Hine 2000; Langville 2006; Feuz, Fuller, and Stalder 2011; Rogers 2013).
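The idea of mixing several signals into one relevance measure can be illustrated with a toy score. This is only a sketch: the actual ranking formula is not public, and the `Page` fields, the 0..1 scaling, and the `WEIGHTS` values below are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    inlinks: int       # links received from other sites
    popularity: float  # tracked user behaviour, assumed scaled to 0..1
    freshness: float   # how frequently the page is updated, assumed 0..1
    longevity: float   # how long the page has existed, assumed 0..1


# Hypothetical weights; the real mix is unknown to outsiders.
WEIGHTS = {"inlinks": 0.4, "popularity": 0.3, "freshness": 0.2, "longevity": 0.1}


def relevance(page: Page, max_inlinks: int) -> float:
    """Weighted sum of the four signals, with inlinks normalised to 0..1."""
    inlink_share = page.inlinks / max_inlinks if max_inlinks else 0.0
    return (
        WEIGHTS["inlinks"] * inlink_share
        + WEIGHTS["popularity"] * page.popularity
        + WEIGHTS["freshness"] * page.freshness
        + WEIGHTS["longevity"] * page.longevity
    )


def rank(pages: list[Page]) -> list[Page]:
    """Order pages from most to least relevant under the toy score."""
    max_inlinks = max((p.inlinks for p in pages), default=0)
    return sorted(pages, key=lambda p: relevance(p, max_inlinks), reverse=True)


pages = [
    Page("b", inlinks=10, popularity=0.9, freshness=0.9, longevity=0.1),
    Page("c", inlinks=0, popularity=0.0, freshness=0.0, longevity=0.0),
    Page("a", inlinks=100, popularity=0.2, freshness=0.1, longevity=0.9),
]
ranked_urls = [p.url for p in rank(pages)]
```

Changing the weights changes the order of results, which is one way to see why the parameters of a ranking device matter for what a researcher's corpus ends up containing.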

Table 2. The concordance of disagreement terms used to search for debates in the corpus.

... mapping literature to design such maps in ways that make things public and, ideally, permit experimentation by users of the maps (Venturini et al. 2015; more broadly, see Weibel and Latour 2005). Voyant facilitates such experimentation. The platform combines text corpus creation from Web sites and other file formats and a variety of data visualization options, but it also enables viewers of those visualizations (who might or might not be academic researchers themselves) to be experimenters with the underlying corpus of text (i.e., the data). Using Voyant permits the controversy mapping analysts to present dynamic public visualizations of text corpora (e.g., on a Web site; see Lepawsky et al. [2017i] for our own instance of such a use case of Voyant) while also providing the entire data set of text to viewers who might then experiment with it, for example, to test the claims made by the controversy analysts or, indeed, to ask entirely different questions of the underlying data. In this sense, Voyant offers a form of dialogic interpretation of text corpora relevant to a given controversy between the controversy researchers and the publics those controversies entrain (Rockwell and Sinclair 2016; discussed further later).

A Reality Check(-list) for Digital Methods

by Liliana Bounegru

2018

Digital Methods can be defined as the repurposing of the inscriptions generated by digital media for the study of collective phenomena. The strength of these methods comes from their capacity to take advantage of the data and... more

Figure 1. Two screenshots of the Contropedia.net interface. Controversial wiki-links are highlighted in red on the original page and the full evolution of the discussion surrounding them is displayed below (original figure in Weltevrede & Borra, 2016).

Other times, the partiality of medium coverage with respect to the phenomenon may be used strategically. Drawing on James Gibson’s theory of visual perception (1986), Anders Koed Madsen (2012 and 2015) introduced the term “web-vision analysis” precisely to point at the way in which researchers can use different media and filtering parameters to compare different angles on the same phenomenon:

Figure 2. A visual representation of the different human rights as appearing in the results of Google Search for different countries and languages (original figure in Bekema et al., 2009).

In an exploratory project, for example, a group of researchers compared the Google Web Search results for the query “rights” in a number of languages, to highlight the specific ways in which cultures conceive the question of human rights (Bekema et al., 2009 - https://www.digitalmethods.net/Dmi/Nationalityoflssues).

2 Definition of the object of study

... online constitute the very object of the study, in the second they are the proxies of other actions (walking, standing, shouting...) taking place outside the medium. Indeed, digital methods takes the explicit stance of using digital traces to study not only online phenomena but culture and society in general (Rogers, 2013, 2017). Repurposing the media means using digital traces as proxies for phenomena that extend beyond them.

Figure 3. Spread and debunk of the fake story according to which the Pope would have endorsed Donald Trump. The nodes represent the web pages in which the story has circulated and the lines the different ways in which they mention each other (original figure in ANONYMIZED).

Figure 4. Comparison of the most mentioned and most connected hashtags connected to climate change debate in Twitter (original figure in Marres & Gerlitz, 2015).

Investigating climate debate on Twitter, Marres and Gerlitz (2015) noted that the platform relies on “frequency of mentions” to identify and promote trending topics. Such focus encourages specific practices among the users (e.g. re-tweeting as a way of having messages picked up by the system) and is transmitted to most Twitter analytic tools. This ends up privileging hashtags referring to events or campaigns (e.g. #cop16, #auspol, #savethearctic) that are subject to hype-like dynamics. In order to detect more substantial issues, the researchers then moved from frequency measures to “associationist measures” (not how many times a hashtag is mentioned, but how many other hashtags co-occur with it), which allowed them to identify tags such as #economics, #flood, #co2, #health, #environment, and #drought.
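The difference between the two kinds of measure can be made concrete with a small sketch. It assumes each tweet has already been reduced to its list of hashtags; the function name `hashtag_measures` and the sample data are invented for illustration and are not taken from Marres and Gerlitz's study.

```python
from collections import Counter, defaultdict


def hashtag_measures(tweets):
    """Return (frequency, association) for an iterable of hashtag lists.

    frequency[tag]   = number of tweets that mention the tag
    association[tag] = number of distinct other tags it co-occurs with
    """
    frequency = Counter()
    neighbours = defaultdict(set)
    for tags in tweets:
        unique = {t.lower() for t in tags}  # count a tag once per tweet
        frequency.update(unique)
        for tag in unique:
            neighbours[tag].update(unique - {tag})
    association = {tag: len(neighbours[tag]) for tag in frequency}
    return frequency, association


# Made-up sample: one campaign tag repeated often, one issue tag well connected.
sample_tweets = [
    ["#cop16"],
    ["#cop16"],
    ["#cop16"],
    ["#cop16", "#climate"],
    ["#flood", "#health", "#climate"],
    ["#flood", "#drought", "#economics"],
]
frequency, association = hashtag_measures(sample_tweets)
```

In this sample the event tag leads on frequency while `#flood` leads on association, which mirrors the shift in emphasis the passage describes.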

APPLICATION OF LRN AND BPNN USING TEMPORAL BACKPROPAGATION LEARNING FOR PREDICTION OF DISPLACEMENT

by Talvinder Singh

2017

Landslides are the most threatening geo-hazard. A landslide is a genetic type of slope and has the same characteristics as a slope. Chaotic time series of landslide displacement and its influential factors could reflect the history of... more

Survey on Aspect-Level Sentiment Analysis

by Kim Schouten

2016, IEEE Transactions on Knowledge and Data Engineering, volume 28, number 3

The field of sentiment analysis, in which sentiment is gathered, analyzed, and aggregated from text, has seen a lot of attention in the last few years. The corresponding growth of the field has resulted in the emergence of various... more

Fig. 1. Taxonomy for aspect-level sentiment analysis approaches using the main characteristic of the proposed algorithm.

Flavius Frasincar is an assistant professor in information systems at Erasmus University Rotterdam, the Netherlands. He has published in numerous conferences and journals in the areas of databases, Web information systems, personalization, and the Semantic Web. He is a member of the editorial board of the International Journal of Web Engineering and Technology, and Decision Support Systems.

... that the meaning of an expression is a function of the meaning of its parts and the syntactic rules by which these are combined. Applying this principle, a two-step process is proposed in which the polarities of the parts are determined first, and then these polarities are combined bottom-up to form the polarity of the expression as a whole. However, instead of using a manually-defined rule set to combine the various parts and their polarities, a learning algorithm is employed to cope with the irregularities and complexities of natural language.

... of their parameters from the data. However, since it is relatively easy to incorporate lexicon information as features into a supervised classifier, many of them employ one or more sentiment lexicons. In [52], the raw score from the sentiment lexicon and some derivative measures (e.g., a measure called purity that reflects the fraction of positive to negative sentiment, thus showing whether sentiment is conflicted or uniform) are used as features for a MaxEnt classifier. When available, the overall star rating of the review is used as an additional signal to find the sentiment of each aspect (cf. [29]).

The Public and Its Algorithms

by Hjalmar Bang Carlsen and

2016

Aspect Extraction from Reviews Using Conditional Random Fields

by Yuliya Rubtsova

2015

This paper describes an information extraction and content analysis system. The proposed system is based on a conditional random field algorithm and is intended to extract aspect terms mentioned in the text. We used a set of morphological... more

The Medium is the Method: Issue Mapping Workshop at CSISP

by David Moats

2015

The Digital Revolution

by Alex Autio

2015

Essay explains how today's education system and societal expectations require "educated" individuals to have strong computer and information analysis skills.

Optimizing Web Extraction Queries for Robustness

by Mantas Kanaporis

2015

The World Wide Web organizes information in semi-structured HTML documents. For a template-based web page that contains a list of items, information schema can be implied and structured data can be extracted with a query, i.e. a (web)... more
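As a rough illustration of extracting structured records from a template-based list with a query, the sketch below runs a small XPath-style query using Python's standard library. The markup, class names, and the `extract_items` helper are hypothetical, and the snippet assumes well-formed markup; it is not the query language or method of the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical template-based page: every list item follows the same layout.
PAGE = """
<html><body>
  <ul id="products">
    <li class="item"><span class="name">Kettle</span><span class="price">19.99</span></li>
    <li class="item"><span class="name">Toaster</span><span class="price">24.50</span></li>
  </ul>
</body></html>
"""


def extract_items(markup: str) -> list[dict]:
    """Apply one query per field to every repeated list item."""
    root = ET.fromstring(markup.strip())
    rows = []
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    for item in root.findall(".//li[@class='item']"):
        rows.append(
            {
                "name": item.findtext("span[@class='name']"),
                "price": float(item.findtext("span[@class='price']")),
            }
        )
    return rows


records = extract_items(PAGE)
```

The query here selects items by attribute rather than by position in the page, so it would keep working if, say, a banner element were added above the list; real pages are rarely well-formed XML and usually need an HTML parser first.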

Supervised Methods for Aspect-Based Sentiment Analysis

by Hussam Hamdan

2014

In this paper, we present our contribution to the SemEval2014 ABSA task: several supervised methods for Aspect-Based Sentiment Analysis of restaurant and laptop reviews are proposed, implemented and evaluated. We focus on determining the aspect... more

Interface Methods: Renegotiating relations between digital research, STS and Sociology

by Carolin Gerlitz

2014, The Sociological Review

This paper introduces a distinctive approach to methods development in digital social research called “interface methods.” We begin by discussing various methodological confluences between digital media, social studies of science and technology (STS) and sociology. Some authors have posited significant overlap between, on the one hand, sociological and STS concepts, and on the other hand, the ontologies of digital media. Others have emphasised the significant differences between prominent methods built into digital media and those of STS and Sociology. This paper advocates a third approach, one that a) highlights the dynamism and relative under-determinacy of digital methods, and b) affirms that multiple methodological traditions intersect in digital devices and research. We argue that these two circumstances enable a distinctive approach to methodology in digital social research - ‘interface methods’ - and the paper contextualizes this approach in two different ways: first, we show how the proliferation of online data tools or ‘digital analytics’ opens up distinctive opportunities for critical and creative engagement with methods development at the intersection of sociology, STS and digital research. Second, we discuss a digital research project in which we investigated a specific ‘interface method’, namely co-occurrence analysis. The second half of the paper presents a digital pilot study in which we implemented this method in a critical and creative way to analyse and visualise ‘issue dynamics’ in the area of climate change on Twitter. We evaluate this project in the light of our principal objective, which was to test the possibilities for the critical and creative adaptation and modification of this method through its experimental implementation. To conclude, we discuss a major obstacle to the development of ‘interface methods’: digital media are marked by particular quantitative dynamics that seem adverse to the methodological commitments of sociology and STS. To address this, we argue in favour of a methodological approach in digital social research that affirms its mal-adjustment to the social methods that are prevalent in the medium.

Holistic Twitter Research

by Kalina Dancheva

2013

This thesis aims to contribute to the discussions on online research methods, by suggesting the concept of a holistic approach to the study of social media. This idea argues that data, online platforms and tools cannot be perceived as... more

Scraping the Social? Issues in real-time social research

by Esther Weltevrede

2012, Journal of Cultural Economy (subm)

What makes scraping methodologically interesting for social and cultural research?

Describing Description (and Keeping Causality): The Case of Academic Articles on Food and Eating

by Emma Uprichard

2012

Recently, Savage and Burrows argued that there is an ‘empirical crisis’ in sociology. They concluded that sociologists should abandon a focus on causality for descriptions that ‘link narrative, numbers, and images’. This article takes up... more

NEMO: Extraction and normalization of organization names from PubMed affiliation strings

by Siddhartha Jonnalagadda

2011, Journal of Biomedical Discovery …

Background: We are witnessing an exponential increase in biomedical research citations in PubMed. However, translating biomedical discoveries into practical treatments is estimated to take around 17 years, according to the 2000 Yearbook of Medical Informatics, and much information is lost during this transition. Pharmaceutical companies spend huge sums to identify opinion leaders and centers of excellence. Conventional methods such as literature search, survey, observation, self-identification, expert opinion, and sociometry not only need much human effort, but are also noncomprehensive. Such huge delays and costs can be reduced by “connecting those who produce the knowledge with those who apply it”. A humble step in this direction is large scale discovery of persons and organizations involved in specific areas of research. This can be achieved by automatically extracting and disambiguating author names and affiliation strings retrieved through Medical Subject Heading (MeSH) terms and other keywords associated with articles in PubMed. In this study, we propose NEMO (Normalization Engine for Matching Organizations), a system for extracting organization names from the affiliation strings provided in PubMed abstracts, building a thesaurus (list of synonyms) of organization names, and subsequently normalizing them to a canonical organization name using the thesaurus.

Results: We used a parsing process that involves multi-layered rule matching with multiple dictionaries. The normalization process involves clustering based on weighted local sequence alignment metrics to address synonymy at word level, and local learning based on finding connected components to address synonymy. The graphical user interface and java client library of NEMO are available at http://lnxnemo.sourceforge.net.

Conclusion: NEMO is developed to associate each biomedical paper and its authors with a unique organization name and the geopolitical location of that organization. This system provides more accurate information about organizations than the raw affiliation strings provided in PubMed abstracts. It can be used for: a) bimodal social network analysis that evaluates the research relationships between individual researchers and their institutions; b) improving author name disambiguation; c) augmenting National Library of Medicine (NLM)’s Medical Articles Record System (MARS) system for correcting errors due to OCR on affiliation strings that are in small fonts; and d) improving PubMed citation indexing strategies (authority control) based on normalized organization name and country.
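The normalization step (grouping name variants by string similarity, then taking connected components as synonym sets) can be sketched as follows. This is not NEMO's implementation: `difflib.SequenceMatcher` is used here as a simple stand-in for the weighted local sequence alignment metric, and the function names, the 0.8 threshold, and the sample names are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a: str, b: str, threshold: float) -> bool:
    """Stand-in similarity test; NEMO uses weighted local alignment instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_names(names: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Link similar names, then return the connected components."""
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if similar(a, b, threshold):
            parent[find(a)] = find(b)  # merge the two components

    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


affiliations = [
    "Arizona State University",
    "Mayo Clinic",
    "Arizona State Univ",
    "Mayo Clinic.",
]
clusters = cluster_names(affiliations)
```

Each resulting cluster would correspond to one thesaurus entry, with one member chosen as the canonical name. Using connected components means two variants can end up together through a chain of intermediate matches even when they are not directly similar, so the threshold needs care.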