inproceedings by Yingjie Hu

Local place names are those used by local residents but not recorded in existing gazetteers. Some of them are colloquial place names, which are frequently referred to in conversations but not formally documented. Others are absent from existing gazetteers for different reasons, such as their insignificance to a general gazetteer that covers a large geographic extent (e.g., the entire world). Yet, these local place names play important roles in many applications, from supporting public participation GIS to disaster response. This extended abstract describes our preliminary work in developing an automatic workflow for harvesting local place names from the geotagged Social Web. Specifically, we make use of geotagged Craigslist posts in the apartments/housing section, where people frequently use local place names. Our workflow consists of two major steps: a natural language processing (NLP) step and a geospatial step. The NLP step focuses on the textual content of the posts, and extracts candidate place names by analyzing the grammatical structure of the texts and applying a named entity recognition model. The geospatial step examines the geographic coordinates associated with the candidate place names, and performs multi-scale clustering to filter out the false positives (non-place names) included in the result of the first step. We ran a preliminary comparison between our initial result and a comprehensive gazetteer, GeoNames. Possible future steps are discussed.
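The intuition behind the geospatial filtering step can be sketched as follows. This is a minimal illustration only, not the authors' multi-scale clustering implementation; the spread threshold, the toy coordinates, and the candidate strings are all invented for the example. Posts mentioning a genuine local place name tend to be geotagged close together, while false positives from the NLP step scatter across the whole region:

```python
import math

def centroid(points):
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def looks_like_place(points, max_spread=0.05):
    """Keep a candidate if the median distance from its mentions to their
    centroid is small, i.e., the mentions are spatially concentrated."""
    cx, cy = centroid(points)
    dists = sorted(math.hypot(x - cx, y - cy) for x, y in points)
    return dists[len(dists) // 2] <= max_spread

# Candidate name -> geotags of the posts that mention it (toy data).
candidates = {
    "funk zone":  [(0.01, 0.02), (0.02, 0.01), (0.015, 0.018)],  # tight cluster
    "great deal": [(0.9, 0.1), (0.1, 0.8), (0.5, 0.5)],          # scattered
}
kept = [name for name, pts in candidates.items() if looks_like_place(pts)]
```

A real workflow would apply such a density test at several spatial scales rather than a single fixed threshold.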
In this work we introduce ADCN, an anisotropic density-based clustering algorithm. It outperforms DBSCAN and OPTICS for the detection of anisotropic spatial point patterns and performs equally well in cases that do not explicitly benefit from an anisotropic perspective. ADCN has the same time complexity as DBSCAN and OPTICS, namely O(n log n) when using a spatial index and O(n^2) otherwise.
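For readers unfamiliar with the baseline, a plain DBSCAN shows where that complexity comes from: the naive range query below scans all points, giving O(n^2) overall, and replacing it with a spatial index (k-d tree, R-tree) yields O(n log n). This is a generic textbook sketch, not the ADCN code, and eps/min_pts values are arbitrary:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; labels: cluster id per point, -1 marks noise."""
    n = len(points)
    labels = [None] * n  # None = unvisited

    def neighbors(i):  # naive O(n) range query -> O(n^2) total
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # tentatively noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] in (None, -1):
                was_noise = labels[j] == -1
                labels[j] = cluster     # border or core point of this cluster
                if not was_noise:
                    js = neighbors(j)
                    if len(js) >= min_pts:  # core point: expand further
                        queue.extend(k for k in js if labels[k] is None)
        cluster += 1
    return labels

pts = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (9, 0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

ADCN's departure from this baseline is the shape of the neighborhood query, which is direction-sensitive rather than circular.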

Place name disambiguation is the task of correctly identifying a place from a set of places sharing a common name. It contributes to tasks such as knowledge extraction, query answering, geographic information retrieval, and automatic tagging. Disambiguation quality relies on the ability to correctly identify and interpret contextual clues, complicating the task for short texts. Here we propose a novel approach to the disambiguation of place names from short texts that integrates two models: entity co-occurrence and topic modeling. The first model uses Linked Data to identify related entities to improve disambiguation quality. The second model uses topic modeling to differentiate places based on the terms used to describe them. We evaluate our approach using a corpus of short texts, determine the suitable weight between models, and demonstrate that a combined model outperforms benchmark systems such as DBpedia Spotlight and Open Calais in terms of F1-score and Mean Reciprocal Rank.
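The model combination and the Mean Reciprocal Rank metric can be sketched as below. The candidate scores, the example name, and the weight alpha are illustrative assumptions, not values from the evaluation:

```python
def combined_score(cooc, topic, alpha=0.6):
    """Linear blend of the co-occurrence and topic-model scores; alpha is
    the between-model weight that would be tuned on held-out data."""
    return alpha * cooc + (1 - alpha) * topic

def mean_reciprocal_rank(rankings, gold):
    """rankings: one ranked candidate list per text (best first);
    gold: the correct place for each text."""
    rr = [1 / (ranked.index(g) + 1) for ranked, g in zip(rankings, gold)]
    return sum(rr) / len(rr)

# Candidates for the name "Washington" in one short text:
# place -> (co-occurrence score, topic-model score), toy numbers.
cands = {"Washington, D.C.": (0.9, 0.2), "Washington (state)": (0.3, 0.8)}
ranked = sorted(cands, key=lambda c: combined_score(*cands[c]), reverse=True)
```

Sweeping alpha from 0 to 1 on a labeled corpus is one simple way to find the weight between the two models.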

As more data from heterogeneous sources become available, interfaces that support the federated exploration of these data are gaining importance to uncover relations between entities across multiple sources. Instead of explicit queries, visual interfaces enable a follow-your-nose style of exploration by which a user can seamlessly navigate between entities from different data sources. This requires an alignment of the ontologies used by said sources as well as the coreference resolution of entities across them. Together with Semantic Web technologies, the Linked Data paradigm provides the technological foundations to address these challenges. Nonetheless, the majority of work studies these components in isolation, focusing either on the alignment, coreference resolution, or visualization. Some interesting aspects, however, only arise when all puzzle pieces are in place. Two of these aspects are the seamless transitions between visualization and interaction paradigms as well as the combination of entity and type queries. In this work, we present a multi-perspective visual interface that enables the seamless exploration of major scientific geo-data sources that contain millions of RDF triples.

Identifying the same places across different gazetteers is a key prerequisite for spatial data conflation and interlinkage. Conventional approaches mostly rely on combining spatial distance with string matching and structural similarity measures, while ignoring relations among places and the semantics of place types. In this work, we propose to use spatial statistics to mine semantic signatures for place types and use these signatures for coreference resolution, i.e., to determine whether records from different gazetteers refer to the same place. We implement 27 statistical features for computing these signatures and apply them at the type and entity levels to determine the corresponding places between two gazetteers, GeoNames and DBpedia. The city of Kobani, Syria, is used as a running example to demonstrate the feasibility of our approach. The experimental results show that the proposed signatures have the potential to improve the performance of coreference resolution.
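A toy version of a type-level signature comparison is shown below. It uses just three statistical features instead of the 27 in the paper, and the feature choice, point sets, and similarity threshold are our assumptions for illustration:

```python
import math

def signature(points):
    """Tiny 3-feature signature for a place type's point pattern:
    point count, mean nearest-neighbour distance, bounding-box area."""
    nn = [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
          for i, p in enumerate(points)]
    xs, ys = zip(*points)
    return [len(points), sum(nn) / len(nn),
            (max(xs) - min(xs)) * (max(ys) - min(ys))]

def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# The same place type recorded in two gazetteers (slightly shifted points).
type_in_gazetteer_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
type_in_gazetteer_b = [(0, 0), (1.1, 0), (0, 1.1), (1.1, 1.1)]
sim = cosine(signature(type_in_gazetteer_a), signature(type_in_gazetteer_b))
```

A high similarity between signatures suggests the two type labels describe the same kind of place, which can then support entity-level coreference decisions.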

In recent years, online volunteers have been actively participating in disaster response, thanks to the advancement of information technologies and the support from humanitarian organizations. One important way in which online volunteers contribute to disaster response is by mapping the affected area based on remote sensing imagery. Such online mapping generates up-to-date geographic information which can provide valuable support for the decision making of emergency responders. Typically, the area affected by a disaster is divided into a number of cells using a grid-based tessellation, and each volunteer can select one cell to start mapping. While this approach coordinates the efforts of many online volunteers, it is unclear in which sequence these grid cells have been mapped. This sequence is important because it determines when the geographic information within a particular cell becomes available to emergency responders, which in turn can directly influence the efficiency of rescue tasks and other relief efforts. In this work, we study three online mapping projects which were deployed and utilized in the 2015 Nepal, 2016 Ecuador, and 2016 Japan earthquakes to gain insights into the mapping sequences performed by online volunteers.
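The grid-based tessellation and the resulting mapping sequence can be illustrated as follows. The bounding box, cell layout, and completion timestamps are invented for the example, not data from the studied projects:

```python
def make_grid(min_x, min_y, max_x, max_y, nx, ny):
    """Split a bounding box into nx * ny cells (row-major (i, j) ids);
    this is the tessellation volunteers pick mapping tasks from."""
    dx, dy = (max_x - min_x) / nx, (max_y - min_y) / ny
    return {(i, j): (min_x + i * dx, min_y + j * dy,
                     min_x + (i + 1) * dx, min_y + (j + 1) * dy)
            for i in range(nx) for j in range(ny)}

def mapping_sequence(completions):
    """completions: cell id -> completion time (hours after the event);
    returns cells in the order their information became available."""
    return sorted(completions, key=completions.get)

# An illustrative 2 x 2 grid over a rough bounding box of the affected area.
grid = make_grid(84.0, 27.0, 86.0, 29.0, 2, 2)
done = {(0, 0): 17, (1, 0): 5, (0, 1): 9, (1, 1): 12}
seq = mapping_sequence(done)
```

Analyzing such sequences across projects reveals, for instance, whether volunteers map outward from the epicenter or follow the grid's reading order.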

Geo-ontologies play an important role in fostering the publication, retrieval, reuse, and integration of geographic data within and across domains. The status quo of geo-ontology engineering often follows a centralized top-down approach, namely a group of domain experts collaboratively formalizing key concepts and their relationships. On the one hand, such an approach makes use of the invaluable knowledge and experience of subject matter experts and captures their perception of the world. On the other hand, however, it can introduce biases and ontological commitments that do not correspond well to the data that will be semantically lifted using these ontologies. In this work, we propose a data-driven method to calculate a Discrepancy Index in order to identify and quantify the potential modeling biases in current geo-ontologies. In other words, instead of trying to measure quality, we determine how much the ontology differs from what would be expected when looking at the data alone.
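One way such a discrepancy measure might be computed is sketched below. The simple frequency-based formula, the property names, and the thresholds are our illustration only, not the paper's definition of the Discrepancy Index:

```python
def discrepancy_index(ontology_props, data_counts, rare=0.01):
    """Toy discrepancy between an ontology and the data it should lift:
    modeled properties the data barely uses, plus frequently used
    data properties the ontology never modeled."""
    total = sum(data_counts.values()) or 1
    unused = [p for p in ontology_props
              if data_counts.get(p, 0) / total < rare]
    unmodeled = [p for p, c in data_counts.items()
                 if p not in ontology_props and c / total >= rare]
    return (len(unused) + len(unmodeled)) / (len(ontology_props) + len(unmodeled))

# Hypothetical geo-ontology properties vs. observed property usage counts.
onto = {"hasGeometry", "hasElevation", "hasPopulation"}
counts = {"hasGeometry": 950, "hasPopulation": 40, "altLabel": 10}
index = discrepancy_index(onto, counts)
```

Here "hasElevation" is modeled but unused and "altLabel" is used but unmodeled, so half of the relevant vocabulary disagrees between ontology and data.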
While the adoption of Linked Data technologies has grown dramatically over the past few years, it has not come without its own set of growing challenges. The triplification of domain data into Linked Data has not only given rise to a leading role of places and positioning information for the dense interlinkage of data about actors, objects, and events, but also led to massive errors in the generation, transformation, and semantic annotation of data. In a global and densely interlinked graph of data, even seemingly minor errors can have far-reaching consequences as different datasets make statements about the same resources. In this work we present the first comprehensive study of systematic errors and their potential causes. We also discuss lessons learned and means to avoid some of the introduced pitfalls in the future.
GeoLink is one of the building block projects within EarthCube, a major effort of the National Science Foundation to establish a next-generation knowledge infrastructure for geosciences. Specifically, GeoLink aims to improve data reuse and integration of seven geoscience data repositories through the use of ontologies. In this paper, we present the approach taken by this project, which combines Linked Data publishing and modular ontology engineering based on ontology design patterns to realize integration while respecting existing heterogeneity within the participating repositories.

The practices and standards of travel behavior data collection have changed significantly over the past decade with the introduction of mobile, location-based technology. The use of GPS devices such as loggers has been a great enhancement to the field. However, with the increased ubiquity of smartphones, which come equipped with a variety of sensors useful to behavioral data collection, the possibilities and methods used in data collection are again shifting. These mobile devices offer several exciting opportunities to either collect more data from respondents with a similar amount of burden, or collect previously burdensome data (such as detailing time use by paper-and-pen survey) with little or no interaction from the respondent. In this paper, an overview of possible sensors is presented, as well as current research efforts in the field. A forthcoming analysis of sensor frequency, battery expenditure, and accuracy of detection will also be discussed in the final version of this paper.

Life Cycle Assessment (LCA) is the study of the environmental impact of products taking into account their entire life-span and production chain. This requires gathering data from a variety of heterogeneous sources into a Life Cycle Inventory (LCI). LCI preparation involves the integration of observations and engineering models with reference data and literature results from around the world, from different domains, and at varying levels of granularity. Existing LCA data formats only address syntactic interoperability, thereby often ignoring semantics. This leads to inefficiencies in information collection and management and thus a variety of challenges, e.g., difficulties in reproducing assessments published in the literature. In this work, we present an ontology pattern that specifies key aspects of LCA/LCI data models, i.e., the notions of flows, activities, agents, and products, as well as their properties.

Information plays an important role in disaster response. In the past, there has been a lack of up-to-date information following major disasters due to the limited means of communication. This situation has changed substantially in recent years. With the ubiquity of mobile devices, people experiencing emergency events may still be able to share information via social media and peer-to-peer networks. Meanwhile, volunteers throughout the world are remotely convened by humanitarian organizations to digitize satellite images for the impacted area. These processes produce rich information, which presents a new challenge for decision makers who have to interpret large amounts of heterogeneous information within limited time. This short paper discusses this problem and outlines a potential solution to prioritizing information in emergency situations. Specifically, we focus on information about road network connectivity, i.e., whether a road segment is still accessible after a disaster. We propose to integrate information value theory with graph theory, and prioritize information items based on their contributions to the success of potential rescue tasks and to a more accurate estimation of road network connectivity. Finally, we point out directions for future work.
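The idea of valuing an information item by its effect on estimated connectivity can be sketched on a toy road graph. The scoring rule below (how often knowing one segment's true state flips the estimated reachability of the destination) is our simplification for illustration, not the paper's proposed integration of information value theory; the graph and edge names are invented:

```python
from itertools import product

def reachable(adj, blocked, src, dst):
    """BFS over the road graph, skipping blocked segments."""
    seen, queue = {src}, [src]
    while queue:
        u = queue.pop()
        for v in adj[u]:
            if frozenset((u, v)) not in blocked and v not in seen:
                seen.add(v)
                queue.append(v)
    return dst in seen

def info_value(adj, uncertain, edge, src, dst):
    """Fraction of states of the other uncertain segments in which
    knowing `edge`'s state changes whether dst seems reachable from src."""
    others = [e for e in uncertain if e != edge]
    changes = 0
    for states in product([True, False], repeat=len(others)):
        blocked = {e for e, is_open in zip(others, states) if not is_open}
        changes += (reachable(adj, blocked, src, dst)
                    != reachable(adj, blocked | {edge}, src, dst))
    return changes / (2 ** len(others))

# Depot D, hospital H, two routes via A and B; two segments are uncertain.
adj = {"D": ["A", "B"], "A": ["D", "H"], "B": ["D", "H"], "H": ["A", "B"]}
e_da, e_bh = frozenset(("D", "A")), frozenset(("B", "H"))
value = info_value(adj, [e_da, e_bh], e_da, "D", "H")
```

Segments whose state most often changes the connectivity estimate would be prioritized for verification by responders or volunteers.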

Place categorization plays an important role in location-based services as well as, more recently, in place-based geographic information systems. Traditionally, such categorization systems are often designed following a top-down approach in which a group of experts or users assign a place type, e.g., Restaurant, to a place instance, e.g., Bob's BBQ Shack. While the output of such a process generally satisfies the requirements of a particular application, it often fails to incorporate the perception of the general public towards places. In today's online landscape, some parts of this perception are captured by location-based social network platforms. Contributions to these platforms, such as check-ins and reviews, enable a bottom-up approach to place categorization based on the actual interaction between humans and places. In this short paper, we outline selected advantages of a hybrid approach, which combines top-down and bottom-up methods to enhance place type hierarchies.

The Biological and Chemical Oceanography Data Management Office (BCO-DMO) and the Rolling Deck to Repository (R2R) program are two key data repositories for oceanographic research, supported by the U.S. National Science Foundation (NSF). R2R curates digital data and documentation generated by environmental sensor systems installed on vessels from the U.S. academic research fleet, with support from the NSF Oceanographic Technical Services and Arctic Research Logistics Programs. BCO-DMO human-curates and maintains data and metadata, including biological, chemical, and physical measurements and results, from projects funded by the NSF Biological Oceanography, Chemical Oceanography, and Antarctic Organisms & Ecosystems Programs. These two repositories have a strong connection and together document several thousand U.S. oceanographic research expeditions since the 1970s. Recently, R2R and BCO-DMO have made their metadata collections available as Linked Data, accessible via public SPARQL endpoints. In this paper, we report on these datasets.

Life Cycle Assessment (LCA) evaluates the environmental impact of a product through its entire life cycle, from material extraction to final disposal or recycling. The environmental impacts of an activity depend on both the activity's direct emissions to the environment and the indirect emissions caused by activities elsewhere in the supply chain. Both the impacts of direct emissions and the provisioning of supply chain inputs to an activity depend on the activity's spatiotemporal scope. When accounting for spatiotemporal dynamics, LCA often faces significant data interoperability challenges. Ontologies and semantic technologies can foster interoperability between diverse datasets from a variety of domains. Thus, this paper presents an ontology for modeling spatiotemporal scopes, i.e., the contexts in which impact estimates are valid. We discuss selected axioms and illustrate the use of the ontology by providing an example from LCA practice. The ontology enables practitioners to address key competency questions regarding the effect of spatiotemporal scopes on environmental impact estimation.
The Semantic Web journal implements an open and transparent review process which creates a unique bibliographic dataset. In addition to traditional publication data such as author names and paper titles, each paper in this dataset is accompanied by a fully timestamped history of its successive decision statuses, assigned editors, solicited and voluntary reviewers, full-text reviews, comments, and in many cases also the authors' response letters. This dataset constitutes a rich and valuable resource for a variety of studies, such as understanding the collaboration networks of scholars as well as exploring the trending topics in the field of Semantic Web. This dataset is now publicly available online as Linked Data. In this short paper, we report the availability, novelty, as well as some design considerations of this dataset.

In this research, we first aim at developing data analytics that can derive insights about how people from different regions communicate and connect via mobile phone calls and physical movements. We uncover the digital divide (geographical segregation of phone communication patterns) and the physical divide (geographical limits of human mobility) in Senegal. The research also demonstrates that the chosen spatial unit and temporal resolution can affect the community detection results of spatial interaction graphs when analyzing human mobility patterns and exploring urban dynamics in the mobile age. We find that daily-level detection generates a more stable partition structure than hourly-level detection, while changes across months also exist. The presented framework can help identify patterns of spatial interaction in both cyberspace and physical space from call detail records in regions where census data acquisition is difficult, especially in African countries.
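Partition stability across temporal resolutions can be quantified with a partition-agreement measure such as the Rand index; the choice of measure and the toy partitions below are our illustration, not the study's actual data or metric:

```python
from itertools import combinations

def rand_index(p1, p2):
    """Agreement between two partitions of the same spatial units: the share
    of unit pairs grouped consistently (together in both, or apart in both)."""
    units = list(p1)
    agree = sum((p1[a] == p1[b]) == (p2[a] == p2[b])
                for a, b in combinations(units, 2))
    return agree / (len(units) * (len(units) - 1) / 2)

# Hypothetical community labels per spatial unit at two temporal resolutions.
daily  = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}   # from daily interaction graphs
hourly = {"u1": 0, "u2": 1, "u3": 1, "u4": 1}   # a noisier hourly partition
stability = rand_index(daily, hourly)
```

Comparing such scores between consecutive days versus consecutive hours is one way to show that daily partitions are the more stable ones.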
GeoLink is one of the building block projects within EarthCube, a major effort of the National Science Foundation to establish a next-generation knowledge infrastructure for geosciences. As part of this effort, GeoLink aims to improve data retrieval, reuse, and integration of seven geoscience data repositories through the use of ontologies. In this paper, we report on the GeoLink modular ontology, which consists of an interlinked collection of ontology design patterns engineered as the result of a collaborative modeling effort. We explain our design choices, present selected modeling details, and discuss how data integration can be achieved using the patterns while respecting the existing heterogeneity within the participating repositories.

ArcGIS Online is an online GIS platform maintained by the Environmental Systems Research Institute (ESRI). It contains a rich collection of Web maps, layers, and services contributed by GIS users. The metadata about these GIS resources reside in data silos that can be accessed via a Web API. While this is sufficient for simple syntax-based searches, it does not support more advanced queries, e.g., finding maps based on the semantics of the search terms, or performing customized queries that are not pre-designed in the API. In metadata, titles and descriptions are commonly available attributes which provide important information about the content of the GIS resources. However, such data cannot be easily used since they are in the form of unstructured natural language. To address these difficulties, we combine data-driven techniques with theory-driven approaches to enable semantic search and knowledge discovery for ArcGIS Online. We develop an ontology for ArcGIS Online data, convert the metadata into Linked Data, and enrich the metadata by extracting thematic concepts and geographic entities from titles and descriptions. Based on a human participant experiment, we calibrate a linear regression model for semantic search, and demonstrate the flexible queries for knowledge discovery that are not possible in the existing Web API. While this research is based on the ArcGIS Online data, the presented methods can also be applied to other GIS cloud services and data infrastructures.
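Calibrating a linear model over similarity features might look like the sketch below. The two features (thematic and geographic similarity), the toy relevance ratings, and the closed-form two-weight fit are our assumptions for illustration, not the study's actual calibration setup:

```python
def fit_weights(features, relevance):
    """Least-squares fit of relevance ~ w_t * thematic + w_g * geographic,
    solving the 2x2 normal equations by Cramer's rule."""
    s11 = sum(x1 * x1 for x1, _ in features)
    s22 = sum(x2 * x2 for _, x2 in features)
    s12 = sum(x1 * x2 for x1, x2 in features)
    b1 = sum(x1 * y for (x1, _), y in zip(features, relevance))
    b2 = sum(x2 * y for (_, x2), y in zip(features, relevance))
    det = s11 * s22 - s12 * s12
    return (b1 * s22 - b2 * s12) / det, (s11 * b2 - s12 * b1) / det

# Toy calibration data: (thematic similarity, geographic similarity) of a
# search result, paired with a human relevance rating.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
y = [0.7, 0.3, 1.0, 0.5]
w_theme, w_geo = fit_weights(X, y)
```

The fitted weights then rank search results by a single blended score, letting the human judgments decide how much thematic versus geographic similarity should count.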