Papers by Alexander Rasin
Evaluating the Interpretation of Natural Language Trace Queries
Lecture Notes in Computer Science, 2016
Investigating the effects of majority voting on CAD systems: a LIDC case study
Medical Imaging 2016: Computer-Aided Diagnosis, 2016
Towards Achieving Diagnostic Consensus in Medical Image Interpretation
2014 IEEE International Conference on Data Mining Workshop, Dec 1, 2014

TiQi: answering unstructured natural language trace queries
Requirements Engineering, 2015
Software traceability is a required element in the development and certification of safety-critical software systems. However, trace links, which are created at significant cost and effort, are often underutilized in practice due primarily to the fact that project stakeholders often lack the skills needed to formulate complex trace queries. To mitigate this problem, we present a solution which transforms spoken or written natural language queries into structured query language (SQL). TiQi includes a general database query mechanism and a domain-specific model populated with trace query concepts, project-specific terminology, token disambiguators, and query transformation rules. We report results from four different experiments exploring user preferences for natural language queries, accuracy of the generated trace queries, efficacy of the underlying disambiguators, and stability of the trace query concepts. Experiments are conducted against two different datasets and show that users have a preference for written NL queries. Queries were transformed at accuracy rates ranging from 47 to 93%.
HadoopDB
Proceedings of the VLDB Endowment, 2009
Towards Achieving Diagnostic Consensus in Medical Image Interpretation
2014 IEEE International Conference on Data Mining Workshop, 2014
TiQi: Towards natural language trace queries
2014 IEEE 22nd International Requirements Engineering Conference (RE), 2014

An automatic physical design tool for clustered column-stores
Proceedings of the 16th International Conference on Extending Database Technology - EDBT '13, 2013
Good database design is typically a very difficult and costly process. As database systems get more complex and as the amount of data under management grows, the stakes increase accordingly. Past research produced a number of design tools capable of automatically selecting secondary indexes and materialized views for a known workload. However, a significant bulk of research on automated database design has been done in the context of row-store DBMSes. While this work has produced effective design tools, new specialized database architectures demand a rethinking of automated design algorithms. In this paper, we present results for an automatic design tool that is aimed at column-oriented DBMSes on OLAP workloads. In particular, we have chosen a commercial column store DBMS that supports data sorting. In this setting, the key problem is selecting proper sort orders and compression schemes for the columns as well as appropriate pre-join views. This paper describes our automatic design algorithms as well as the results of some experiments using it on realistic data sets.

Reducing Classification Cost through Strategic Annotation Assignment
2013 IEEE 13th International Conference on Data Mining Workshops, 2013
The problem of classifying samples for which there is no definite label is a challenging one in which multiple annotators will provide a more certain input for a classifier. Unlike most of active learning scenarios that require identifying which images to be annotated, we explore how many annotations can potentially be used per instance (one annotation per instance is only the initial step) and propose a threshold-based concept of estimated instance difficulty to guide the custom label acquisition strategy. Using a lung nodule image data set, we determined that, by a simple division of cases into easy and hard to classify, the number of annotations can be distributed to significantly lower the cost (number of acquired annotations) for building a reliable classifier. We show the entire range of available tradeoffs, from a small reduction in annotation cost with no perceptible accuracy loss to a large reduction in annotation cost with a minimal sacrifice of classification accuracy.
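The threshold-based allocation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the difficulty scores, the 0.5 threshold, and the budgets of 1 and 4 annotations are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only: names, scores, threshold, and budgets are assumptions.

def plan_annotations(difficulty, threshold=0.5, easy_budget=1, hard_budget=4):
    """Assign an annotation count to each instance from its estimated difficulty.

    Instances at or above the threshold are treated as hard and receive the
    larger budget; all others keep only the initial annotation.
    """
    return {
        instance: hard_budget if score >= threshold else easy_budget
        for instance, score in difficulty.items()
    }


def total_cost(plan):
    """Cost is the total number of acquired annotations."""
    return sum(plan.values())


if __name__ == "__main__":
    # Hypothetical difficulty estimates for four cases.
    scores = {"n1": 0.1, "n2": 0.8, "n3": 0.4, "n4": 0.95}
    plan = plan_annotations(scores)
    uniform = len(scores) * 4  # every case annotated four times
    print(plan, total_cost(plan), uniform)
```

Raising or lowering the threshold moves cases between the easy and hard groups, which is one way to explore the cost versus accuracy tradeoff the abstract mentions.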

Proceedings of the 29th International Conference on Very Large Data Bases Volume 29, Jul 24, 2003
Many stream-based applications have sophisticated data processing requirements and real-time performance expectations that need to be met under high-volume, time-varying data streams. In order to address these challenges, we propose novel operator scheduling approaches that specify (1) which operators to schedule, (2) in which order to schedule the operators, and (3) how many tuples to process at each execution step. We study our approaches in the context of the Aurora data stream manager. We argue that a fine-grained scheduling approach in combination with various scheduling techniques (such as batching of operators and tuples) can significantly improve system efficiency by reducing various system overheads. We also discuss application-aware extensions that make scheduling decisions according to per-application Quality of Service (QoS) specifications. Finally, we present prototype-based experimental results that characterize the efficiency and effectiveness of our approaches under various stream workloads and processing scenarios.
Query workloads and database schemas in OLAP applications are becoming increasingly complex. Moreover, the queries and the schemas have to continually evolve to address business requirements. During such repetitive transitions, the order of index deployment has to be considered while designing the physical schemas such as indexes and MVs. An effective index deployment ordering can produce (1) a prompt query
International Conference on Data Engineering, 2000
Stream-processing systems are designed to support an emerging class of applications that require sophisticated and timely processing of high-volume data streams, often originating in distributed environments. Unlike traditional data-processing applications that require precise recovery for correctness, many stream-processing applications can tolerate and benefit from weaker recovery guarantees. In this paper, we study various recovery guarantees and pertinent
Proceedings of the VLDB Endowment, 2009
In relational query processing, there are generally two choices for access paths when performing a predicate lookup for which no clustered index is available. One option is to use an unclustered index. Another is to perform a complete sequential scan of the table. Many analytical workloads do not benefit from the availability of unclustered indexes; the cost of random disk
Hold the Accusations That Limit Scientific Innovation. Authors' reply
Communications of the ACM, 2010
Recently, significant efforts have focused on developing novel data-processing systems to support a new class of applications that commonly require sophisticated and timely processing of high-volume data streams. Early work in stream processing has primarily focused on stream-oriented languages and resource-constrained, one-pass query-processing. High availability, an increasingly important goal for virtually all data processing systems, is yet

Proceedings of the 15th International Conference on Extending Database Technology - EDBT '12, 2012
Query workloads and database schemas in OLAP applications are becoming increasingly complex. Moreover, the queries and the schemas have to continually evolve to address business requirements. During such repetitive transitions, the order of index deployment has to be considered while designing the physical schemas such as indexes and MVs. An effective index deployment ordering can produce (1) a prompt query runtime improvement and (2) a reduced total deployment time. Both of these are essential qualities of design tools for quickly evolving databases, but optimizing the problem is challenging because of complex index interactions and a factorial number of possible solutions. We formulate the problem in a mathematical model and study several techniques for solving the index ordering problem. We demonstrate that Constraint Programming (CP) is a more flexible and efficient platform to solve the problem than other methods such as mixed integer programming and A* search. In addition to exact search techniques, we also studied local search algorithms to find near-optimal solutions very quickly. Our empirical analysis on the TPC-H dataset shows that our pruning techniques can reduce the size of the search space by tens of orders of magnitude. Using the TPC-DS dataset, we verify that our local search algorithm is a highly scalable and stable method for quickly finding a near-optimal solution.
Coradd
Proceedings of the VLDB Endowment, 2010

Proceedings of the VLDB Endowment, 2009
In relational query processing, there are generally two choices for access paths when performing a predicate lookup for which no clustered index is available. One option is to use an unclustered index. Another is to perform a complete sequential scan of the table. Many analytical workloads do not benefit from the availability of unclustered indexes; the cost of random disk I/O becomes prohibitive for all but the most selective queries. It has been observed that a secondary index on an unclustered attribute can perform well under certain conditions if the unclustered attribute is correlated with a clustered index attribute [4]. The clustered index will co-locate values and the correlation will localize access through the unclustered attribute to a subset of the pages. In this paper, we show that in a real application (SDSS) and widely used benchmark (TPC-H), there exist many cases of attribute correlation that can be exploited to accelerate queries. We also discuss a tool that can automatically suggest useful pairs of correlated attributes. It does so using an analytical cost model that we developed, which is novel in its awareness of the effects of clustering and correlation. Furthermore, we propose a data structure called a Correlation Map (CM) that expresses the mapping between the correlated attributes, acting much like a secondary index. The paper also discusses how bucketing on the domains of both attributes in the correlated attribute pair can dramatically reduce the size of the CM to be potentially orders of magnitude smaller than that of a secondary B+Tree index. This reduction in size allows us to create a large number of CMs that improve performance for a wide range of queries. The small size also reduces maintenance costs as we demonstrate experimentally.
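The idea of a bucketed mapping between correlated attributes can be sketched roughly as follows. This is an illustrative toy, not the paper's data structure: the class name, the fixed-width integer bucketing, and the synthetic order-day and ship-day values are assumptions made for the example.

```python
# Toy sketch of a bucketed correlation map; all names and data are hypothetical.
from collections import defaultdict


class CorrelationMap:
    """Maps buckets of an unclustered attribute to buckets of a clustered one."""

    def __init__(self, unclustered_width, clustered_width):
        self.u_width = unclustered_width
        self.c_width = clustered_width
        self.buckets = defaultdict(set)

    def insert(self, unclustered_value, clustered_value):
        # Record only which clustered bucket co-occurs with which unclustered bucket.
        u_bucket = unclustered_value // self.u_width
        self.buckets[u_bucket].add(clustered_value // self.c_width)

    def lookup(self, unclustered_value):
        """Return clustered-attribute ranges [lo, hi) that may hold matching rows."""
        hits = sorted(self.buckets.get(unclustered_value // self.u_width, ()))
        return [(b * self.c_width, (b + 1) * self.c_width) for b in hits]


if __name__ == "__main__":
    cm = CorrelationMap(unclustered_width=10, clustered_width=30)
    # Synthetic rows: ship day trails order day by 0 to 6 days, so they correlate.
    for order_day in range(300):
        cm.insert(order_day + order_day % 7, order_day)
    # A predicate on ship day narrows the scan to a few clustered ranges.
    print(cm.lookup(105))
    print(len(cm.buckets), "map entries for 300 rows")
```

Because each entry covers a whole bucket, the map stores far fewer entries than one per row, at the price of scanning some rows that fall in the returned ranges but do not match the predicate.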
Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data - SIGMOD '03, 2003
The Aurora system [1] is an experimental data stream management system with a fully functional prototype. It includes both a graphical development environment, and a runtime system. We propose to demonstrate the Aurora system with its development environment and runtime system, with several example monitoring applications developed in consultation with defense, financial, and natural science communities. We will also demonstrate the effect of various system alternatives on various workloads. For example, we will show how different scheduling algorithms affect tuple latency and internal queue lengths. We will use some of our visualization tools to accomplish this.

Assessing diagnostic complexity: An image feature-based strategy to reduce annotation costs
Computers in Biology and Medicine, 2015
Computer-aided diagnosis systems can play an important role in lowering the workload of clinical radiologists and reducing costs by automatically analyzing vast amounts of image data and providing meaningful and timely insights during the decision making process. In this paper, we present strategies on how to better manage the limited time of clinical radiologists in conjunction with predictive model diagnosis. We first introduce a metric for discriminating between the different categories of diagnostic complexity (such as easy versus hard) encountered when interpreting CT scans. Second, we propose to learn the diagnostic complexity using a classification approach based on low-level image features automatically extracted from pixel data. We then show how this classification can be used to decide how to best allocate additional radiologists to interpret a case based on its diagnosis category. Using a lung nodule image dataset, we determined that, by a simple division of cases into hard and easy to diagnose, the number of interpretations can be distributed to significantly lower the cost with limited loss in prediction accuracy. Furthermore, we show that with just a few low-level image features (18% of the original set) we are able to determine the easy from hard cases for a significant subset (66%) of the lung nodule image data.