Papers by Artur Dubrawski

arXiv (Cornell University), Nov 17, 2017
When sexual violence is a product of organized crime or social imaginary, the links between sexual violence episodes can be understood as a latent structure. With this assumption in place, we can use data science to uncover complex patterns. In this paper we focus on the use of data mining techniques to unveil complex anomalous spatiotemporal patterns of sexual violence. We illustrate their use by analyzing all reported rapes in El Salvador over a period of nine years. Through our analysis, we are able to provide evidence of phenomena that, to the best of our knowledge, have not been previously reported in the literature. We devote special attention to a pattern we discover in the East, where underage victims report their boyfriends as perpetrators at anomalously high rates. Finally, we explain how such analyses could be conducted in real time, enabling early detection of emerging patterns to allow law enforcement agencies and policy makers to react accordingly.
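To give a flavor of how an "anomalously high rate" such as the one highlighted above can be flagged, the following minimal sketch compares a region's report rate against a national baseline with a normal approximation to the binomial. The function name and the toy counts are illustrative assumptions, not the paper's actual method or data.

```python
# Hypothetical sketch: score how anomalous a region's rate is relative to a
# national baseline (illustrative only; the paper's method and data differ).
import math

def rate_anomaly_score(k_region, n_region, k_total, n_total):
    """Z-score of a regional rate against the overall rate, using a normal
    approximation to the binomial under the baseline rate."""
    p0 = k_total / n_total                      # baseline rate
    p_hat = k_region / n_region                 # regional rate
    se = math.sqrt(p0 * (1.0 - p0) / n_region)  # standard error under the baseline
    return (p_hat - p0) / se

# Toy numbers (made up): 320 of 900 reports in one region name boyfriends as
# perpetrators, versus 2,100 of 11,000 nationwide.
print(round(rate_anomaly_score(320, 900, 2100, 11000), 2))
```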

arXiv (Cornell University), Dec 3, 2021
In this article, we introduce a novel type of spatio-temporal sequential patterns called Constricted Spatio-Temporal Sequential (CSTS) patterns and thoroughly analyze their properties. We demonstrate that the set of CSTS patterns is a concise representation of all spatio-temporal sequential patterns that can be discovered in a given dataset. To measure the significance of the discovered CSTS patterns we adapt the participation index measure. We also provide CSTS-Miner: an algorithm that discovers all participation-index-strong CSTS patterns in event data. We experimentally evaluate the proposed algorithm using two crime-related datasets: the Pittsburgh Police Incident Blotter Dataset and the Boston Crime Incident Reports Dataset. In the experiments, the CSTS-Miner algorithm is compared with four other state-of-the-art algorithms: STS-Miner, CSTPM, STBFM and CST-SPMiner. As the results of the experiments suggest, the proposed algorithm discovers far fewer patterns than the other selected algorithms. Finally, we provide examples of interesting crime-related patterns discovered by the proposed CSTS-Miner algorithm.
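As a rough illustration of the participation index mentioned above, the sketch below computes it for a co-occurrence pattern: for each event type, take the fraction of its events that participate in some pattern instance, and report the minimum across types. The data layout and names are assumptions for illustration; CSTS patterns are additionally sequential and spatio-temporally constricted, which this toy does not model.

```python
# Simplified participation index for a co-occurrence pattern (not the full
# CSTS machinery): the minimum, over event types, of the fraction of that
# type's events that appear in at least one pattern instance.
from collections import defaultdict

def participation_index(pattern_instances, all_instances):
    """pattern_instances: list of dicts mapping event type -> event id
       all_instances: dict mapping event type -> set of all event ids of that type"""
    participating = defaultdict(set)
    for inst in pattern_instances:
        for etype, eid in inst.items():
            participating[etype].add(eid)
    ratios = [len(participating[t]) / len(all_instances[t]) for t in all_instances]
    return min(ratios)

all_events = {"burglary": {1, 2, 3, 4}, "vandalism": {10, 11}}
instances = [{"burglary": 1, "vandalism": 10}, {"burglary": 2, "vandalism": 10}]
print(participation_index(instances, all_events))  # min(2/4, 1/2) = 0.5
```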


arXiv (Cornell University), Nov 6, 2018
Adaptive moment methods have been remarkably successful in deep learning optimization, particularly in the presence of noisy and/or sparse gradients. We further the advantages of adaptive moment techniques by proposing a family of double adaptive stochastic gradient methods, DASGrad. They leverage the complementary ideas of the adaptive moment algorithms widely used by the deep learning community and recent advances in adaptive probabilistic algorithms. We analyze the theoretical convergence improvements of our approach in a stochastic convex optimization setting, and provide empirical validation of our findings with convex and non-convex objectives. We observe that the benefits of DASGrad increase with model complexity and gradient variability, and we explore the resulting utility in extensions to distribution-matching multitask learning.
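The sketch below illustrates the "double adaptivity" idea on a toy least-squares problem: an Adam-style update (adaptive moments) combined with adaptive per-example sampling probabilities and importance-weighted gradients. The objective, hyperparameters, and the residual-based sampling rule are assumptions for illustration, not the paper's algorithm or analysis.

```python
# Toy sketch: adaptive moments + adaptive sampling distribution over examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(5)
m, v = np.zeros(5), np.zeros(5)
probs = np.full(len(X), 1.0 / len(X))              # sampling distribution over examples
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    i = rng.choice(len(X), p=probs)
    # importance-weighted per-example gradient (unbiased for the full gradient)
    g = (X[i] @ w - y[i]) * X[i] / (len(X) * probs[i])
    m = b1 * m + (1 - b1) * g                      # first moment
    v = b2 * v + (1 - b2) * g * g                  # second moment
    w -= lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    if t % 100 == 0:                               # adapt sampling toward hard examples
        res = np.abs(X @ w - y) + 1e-3
        probs = res / res.sum()

print(np.round(w, 2))                              # should approach the planted coefficients
```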
Empirically derived sequence similarity thresholds to study the genomic epidemiology of plasmids shared among healthcare-associated bacterial pathogens
eBioMedicine
Forecasting imminent atrial fibrillation in long-term ECG recordings
Journal of Electrocardiology
Incorporation of machine learning and signal quality indicators can significantly suppress false respiratory alerts during in-hospital bedside monitoring
Journal of Electrocardiology

arXiv (Cornell University), Feb 24, 2023
Studies involving both randomized experiments as well as observational data typically involve time-to-event outcomes such as time to failure, death, or onset of an adverse condition. Such outcomes are typically subject to censoring due to loss of follow-up, and established statistical practice involves comparing treatment efficacy in terms of hazard ratios between the treated and control groups. In this paper we propose a statistical approach to recovering sparse phenogroups (or subtypes) that demonstrate differential treatment effects as compared to the study population. Our approach involves modelling the data as a mixture while enforcing parameter shrinkage through structured sparsity regularization. We propose a novel inference procedure for the proposed model and demonstrate its efficacy in recovering sparse phenotypes across large landmark real-world clinical studies in cardiovascular health.
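One ingredient behind "parameter shrinkage through structured sparsity" can be illustrated with a proximal soft-thresholding step, which zeroes out a candidate phenogroup's small deviations from the population-level effect so that only strongly differential features remain. This is a heavily simplified toy, not the paper's mixture model or inference procedure.

```python
# Toy sketch of sparsity-inducing shrinkage on a phenogroup's effect deviations.
import numpy as np

def soft_threshold(beta, lam):
    """Elementwise L1 proximal operator: shrink small deviations to exactly zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

# Deviation of a candidate phenogroup's effect from the population effect (made up).
deviation = np.array([0.02, -0.50, 0.01, 0.80, -0.03])
print(soft_threshold(deviation, lam=0.1))   # only the strong deviations survive
```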
Learning to Extract Actionable Evidence from Medical Insurance Claims Data
Actionable Intelligence in Healthcare, 2017

Infection Control & Hospital Epidemiology, 2019
Background: Identifying routes of transmission among hospitalized patients during a healthcare-associated outbreak can be tedious, particularly among patients with complex hospital stays and multiple exposures. Data mining of the electronic health record (EHR) has the potential to rapidly identify common exposures among patients suspected of being part of an outbreak. Methods: We retrospectively analyzed 9 hospital outbreaks that occurred during 2011–2016 and that had previously been characterized both according to transmission route and by molecular characterization of the bacterial isolates. We determined (1) the ability of data mining of the EHR to identify the correct route of transmission, (2) how early the correct route was identified during the timeline of the outbreak, and (3) how many cases in the outbreaks could have been prevented had the system been running in real time. Results: Correct routes were identified for all outbreaks at the second patient, except for one outbreak ...
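The core idea of finding common exposures among suspected outbreak cases can be conveyed with a very small sketch: pool each case's exposures (units, procedures, providers) from the EHR and rank exposures by how many cases share them. The field names and values are illustrative assumptions, not the study's actual data model.

```python
# Toy sketch: rank candidate transmission routes by how many outbreak cases share them.
from collections import Counter

cases = {
    "pt1": {"ICU-A", "bronchoscopy", "OR-3"},
    "pt2": {"ICU-A", "bronchoscopy"},
    "pt3": {"ward-7", "bronchoscopy"},
}

exposure_counts = Counter(e for exposures in cases.values() for e in exposures)
for exposure, n in exposure_counts.most_common():
    print(f"{exposure}: shared by {n} of {len(cases)} cases")
```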

arXiv (Cornell University), Nov 19, 2015
We present an extension of sparse Canonical Correlation Analysis (CCA) designed for finding multiple-to-multiple linear correlations within a single set of variables. Unlike CCA, which finds correlations between two sets of data where the rows are matched exactly but the columns represent separate sets of variables, the method proposed here, Canonical Autocorrelation Analysis (CAA), finds multivariate correlations within just one set of variables. This can be useful when we look for hidden parsimonious structures in data, each involving only a small subset of all features. In addition, the discovered correlations are highly interpretable as they are formed by pairs of sparse linear combinations of the original features. We show how CAA can be of use as a tool for anomaly detection when the expected structure of correlations is not followed by anomalous data. We illustrate the utility of CAA in two application domains where single-class and unsupervised learning of correlation structures are particularly relevant: breast cancer diagnosis and radiation threat detection. When applied to the Wisconsin Breast Cancer data, single-class CAA is competitive with supervised methods used in the literature. On the radiation threat detection task, unsupervised CAA performs significantly better than an unsupervised alternative prevalent in the domain, while providing valuable additional insights for threat analysis.
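To illustrate the flavor of finding a sparse correlation within a single variable set, the sketch below alternately fits two sparse weight vectors over the same feature matrix so that their projections correlate, soft-thresholding to keep each projection sparse and keeping the two supports disjoint (the disjointness rule and the power-iteration-style updates are assumptions of this sketch, not the paper's optimization procedure).

```python
# Toy sketch of a sparse within-set correlation search on synthetic data.
import numpy as np

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(1)
n, p = 300, 8
X = rng.normal(size=(n, p))
X[:, 4] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=n)       # plant a hidden within-set correlation
X = (X - X.mean(0)) / X.std(0)
C = X.T @ X / n                                           # feature correlation matrix

best = (0.0, None, None)
for seed in range(p):                                     # restart from each 1-sparse seed
    v = np.zeros(p); v[seed] = 1.0
    u = np.zeros(p)
    for _ in range(20):
        u = soft(C @ v, 0.3); u[np.abs(v) > 0] = 0.0      # disjoint supports (sketch assumption)
        if not u.any(): break
        u /= np.linalg.norm(u)
        v = soft(C @ u, 0.3); v[np.abs(u) > 0] = 0.0
        if not v.any(): break
        v /= np.linalg.norm(v)
    if u.any() and v.any():
        r = abs(float(np.corrcoef(X @ u, X @ v)[0, 1]))
        if r > best[0]:
            best = (r, u.copy(), v.copy())

r, u, v = best
print(round(r, 2), np.flatnonzero(u), np.flatnonzero(v))  # expect features {4} and {0}
```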
arXiv (Cornell University), Nov 13, 2015

arXiv (Cornell University), Jan 24, 2021
Machine learning (ML) is increasingly being used to support high-stakes decisions, a trend owed in part to its promise of superior predictive power relative to human assessment. However, there is frequently a gap between decision objectives and what is captured in the observed outcomes used as labels to train ML models. As a result, machine learning models may fail to capture important dimensions of decision criteria, hampering their utility for decision support. In this work, we explore the use of historical expert decisions as a rich, yet imperfect, source of information that is commonly available in organizational information systems, and show that it can be leveraged to bridge the gap between decision objectives and algorithm objectives. We consider the problem of estimating expert consistency indirectly when each case in the data is assessed by a single expert, and propose influence-function-based methodology as a solution to this problem. We then incorporate the estimated expert consistency into a predictive model through a training-time label amalgamation approach. This approach allows ML models to learn from experts when there is inferred expert consistency, and from observed labels otherwise. We also propose alternative ways of leveraging inferred consistency via hybrid and deferral models. In our empirical evaluation, focused on the context of child maltreatment hotline screenings, we show that (1) there are high-risk cases whose risk is considered by the experts but not wholly captured in the target labels used to train a deployed model, and (2) the proposed approach significantly improves precision for these cases.
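The label-amalgamation step can be conveyed with a minimal sketch: where inferred expert consistency is high, let the expert decision drive the training label; otherwise fall back on the observed outcome. The consistency scores and the hard threshold below are assumptions for illustration; the paper estimates consistency with influence functions and explores richer amalgamation schemes.

```python
# Toy sketch of training-time label amalgamation driven by expert consistency.
import numpy as np

def amalgamate(observed, expert, consistency, threshold=0.8):
    """observed, expert: 0/1 arrays; consistency: per-case estimate in [0, 1]."""
    use_expert = consistency >= threshold
    return np.where(use_expert, expert, observed)

observed    = np.array([0, 0, 1, 0])
expert      = np.array([1, 0, 1, 1])
consistency = np.array([0.9, 0.3, 0.95, 0.5])
print(amalgamate(observed, expert, consistency))   # -> [1 0 1 0]
```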
IEEE Journal of Biomedical and Health Informatics, Aug 1, 2021
We describe a new approach to estimating relative risks in time-to-event prediction problems with censored data in a fully parametric manner. Our approach does not require making strong assumptions of constant proportional hazard of the underlying survival distribution, as required by the Cox proportional hazards model. By jointly learning deep nonlinear representations of the input covariates, we demonstrate the benefits of our approach when used to estimate survival risks through extensive experimentation on multiple real-world datasets with different levels of censoring. We further demonstrate the advantages of our model in the competing risks scenario. To the best of our knowledge, this is the first work involving fully parametric estimation of survival times with competing risks in the presence of censoring.
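The fully parametric treatment of censoring can be illustrated in miniature: fit a single Weibull distribution by maximizing a likelihood that uses the density for observed events and the survival function for censored cases. The paper learns mixtures of such primitives on top of deep covariate representations; the sketch below, on simulated data, only shows the censored likelihood itself.

```python
# Minimal sketch: Weibull maximum likelihood under right censoring.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
t_true = rng.weibull(1.5, size=500) * 10.0          # true event times, shape 1.5, scale 10
c = rng.uniform(0, 15, size=500)                    # censoring times
t = np.minimum(t_true, c)                           # observed time
event = (t_true <= c).astype(float)                 # 1 = event observed, 0 = censored

def neg_log_lik(params):
    log_k, log_lam = params
    k, lam = np.exp(log_k), np.exp(log_lam)
    z = t / lam
    log_f = np.log(k / lam) + (k - 1) * np.log(z) - z**k   # event density
    log_S = -z**k                                          # survival function
    return -np.sum(event * log_f + (1 - event) * log_S)

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
print(np.round(np.exp(res.x), 2))                   # recovered (shape, scale) near (1.5, 10)
```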

Neural Information Processing Systems, Apr 1, 2017
We study the problem of interactively learning a binary classifier using noisy labeling and pairwise comparison oracles, where the comparison oracle answers which of two given instances is more likely to be positive. Learning from such oracles has multiple applications where obtaining direct labels is harder but pairwise comparisons are easier, and the algorithm can leverage both types of oracles. In this paper, we attempt to characterize how access to an easier comparison oracle helps in improving the label and total query complexity. We show that the comparison oracle reduces the learning problem to that of learning a threshold function. We then present an algorithm that interactively queries the label and comparison oracles, and we characterize its query complexity under Tsybakov and adversarial noise conditions for the comparison and labeling oracles. Our lower bounds show that our label and total query complexity is almost optimal.
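The reduction to a threshold function can be made concrete with a toy sketch: a comparison oracle induces a ranking of the pool, after which the (more expensive) label oracle is only needed for a binary search along that ranking. The simulated oracles below are noiseless for simplicity, unlike the noisy setting analyzed in the paper; names are illustrative.

```python
# Toy sketch: sort with comparisons, then binary-search the threshold with labels.
from functools import cmp_to_key
import random

random.seed(0)
scores = {i: random.random() for i in range(32)}     # hidden "positivity" score
THRESHOLD = 0.6

def compare(a, b):            # comparison oracle: which instance looks more positive?
    return -1 if scores[a] < scores[b] else 1

def label(i):                 # labeling oracle (more expensive in practice)
    return int(scores[i] >= THRESHOLD)

ranked = sorted(scores, key=cmp_to_key(compare))     # O(n log n) comparison queries
lo, hi, label_queries = 0, len(ranked), 0            # binary search: O(log n) label queries
while lo < hi:
    mid = (lo + hi) // 2
    label_queries += 1
    if label(ranked[mid]):
        hi = mid
    else:
        lo = mid + 1

print("first positive at rank", lo, "using", label_queries, "label queries")
```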

arXiv (Cornell University), Apr 28, 2018
In this paper we explore different regression models based on Clusterwise Linear Regression (CLR). CLR aims to find a partition of the data into k clusters such that linear regressions fitted to each of the clusters minimize the overall mean squared error on the whole data. The main obstacle preventing the use of the found regression models for prediction on unseen test points is the absence of a reasonable way to obtain CLR cluster labels when the values of the target variable are unknown. In this paper we propose two novel approaches to solve this problem. The first approach, predictive CLR, builds a separate classification model to predict test CLR labels. The second approach, constrained CLR, utilizes a set of user-specified constraints that enforce certain points to go to the same clusters. Assuming the constraint values are known for the test points, they can be directly used to assign CLR labels. We evaluate these two approaches on three UCI ML datasets as well as on a large corpus of health insurance claims. We show that both of the proposed algorithms significantly improve over the known CLR-based regression methods. Moreover, predictive CLR consistently outperforms linear regression and random forest, and shows comparable performance to support vector regression on the UCI ML datasets. The constrained CLR approach achieves the best performance on the health insurance dataset, while incurring only an approximately 20-fold increase in computational time over linear regression.
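The CLR core can be sketched as an alternating procedure: fit a linear regression per cluster, then reassign each point to the cluster whose regression explains it best (smallest squared residual). Predictive CLR then trains a separate classifier to assign clusters to unseen points, which is omitted here; the toy data and initialization below are assumptions for illustration.

```python
# Rough sketch of clusterwise linear regression on synthetic data with two planted lines.
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, n)])    # intercept + 1 feature
true_cluster = rng.integers(0, 2, n)
coefs_true = np.array([[3.0, 2.0], [-3.0, 2.0]])            # two parallel regression lines
y = np.einsum("ij,ij->i", X, coefs_true[true_cluster]) + 0.1 * rng.normal(size=n)

k = 2
assign = rng.integers(0, k, n)                               # random initial clusters
betas = np.zeros((k, X.shape[1]))
for _ in range(20):
    for c in range(k):
        mask = assign == c
        if mask.any():                                       # keep previous fit if a cluster empties
            betas[c] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    resid = (y[:, None] - X @ betas.T) ** 2                  # residual of each point under each fit
    assign = resid.argmin(axis=1)                            # reassign to the best-fitting cluster

print(np.round(betas, 2))   # should roughly recover the two planted lines (order may swap)
```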

arXiv (Cornell University), Nov 12, 2019
Monitoring physiological responses to hemodynamic stress can help in determining appropriate treatment and ensuring good patient outcomes. Physicians' intuition suggests that the human body has a number of physiological response patterns to hemorrhage which escalate as blood loss continues; however, the exact etiology and phenotypes of such responses are not well known, or are understood only at a coarse level. Although previous research has shown that machine learning models can perform well in hemorrhage detection and survival prediction, it is unclear whether machine learning could help to identify and characterize the underlying physiological responses in raw vital sign data. We approach this problem by first transforming the high-dimensional vital sign time series into a tractable, lower-dimensional latent space using a dilated, causal convolutional encoder model trained purely unsupervised. Second, we identify informative clusters in the embeddings. By analyzing the clusters of latent embeddings and visualizing them over time, we hypothesize that the clusters correspond to physiological response patterns that match physicians' intuition. Furthermore, we attempt to evaluate the latent embeddings using a variety of methods, such as predicting the cluster labels using explainable features.
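A schematic sketch of the pipeline described above: a dilated, causal 1-D convolutional encoder compresses a multichannel vital-sign window into a low-dimensional embedding, which is then clustered. The layer sizes, channel counts, and the toy k-means step are assumptions; the paper's encoder is trained unsupervised, which this sketch omits (the network below has random weights).

```python
# Dilated causal convolutional encoder + clustering of embeddings (untrained sketch).
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past samples (left padding)."""
    def __init__(self, c_in, c_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

encoder = nn.Sequential(
    CausalConv1d(6, 32, kernel_size=3, dilation=1), nn.ReLU(),
    CausalConv1d(32, 32, kernel_size=3, dilation=2), nn.ReLU(),
    CausalConv1d(32, 16, kernel_size=3, dilation=4),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),       # -> (batch, 16) embedding
)

vitals = torch.randn(8, 6, 300)                  # 8 windows, 6 vital-sign channels, 300 steps
with torch.no_grad():
    z = encoder(vitals)
print(z.shape)                                   # torch.Size([8, 16])

# Cluster the embeddings (toy k-means; scikit-learn's KMeans would also work).
k = 2
centers = z[torch.randperm(len(z))[:k]].clone()
for _ in range(10):
    assign = torch.cdist(z, centers).argmin(dim=1)
    centers = torch.stack([z[assign == c].mean(0) if (assign == c).any() else centers[c]
                           for c in range(k)])
print(assign)
```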
Sensors, Jan 28, 2022
Sensors, Feb 12, 2022