Papers by Artur Dubrawski

arXiv (Cornell University), Nov 17, 2017
When sexual violence is a product of organized crime or social imaginary, the links between sexual violence episodes can be understood as a latent structure. With this assumption in place, we can use data science to uncover complex patterns. In this paper we focus on the use of data mining techniques to unveil complex anomalous spatiotemporal patterns of sexual violence. We illustrate their use by analyzing all reported rapes in El Salvador over a period of nine years. Through our analysis, we are able to provide evidence of phenomena that, to the best of our knowledge, have not been previously reported in the literature. We devote special attention to a pattern we discover in the East, where underage victims report their boyfriends as perpetrators at anomalously high rates. Finally, we explain how such analyses could be conducted in real time, enabling early detection of emerging patterns to allow law enforcement agencies and policy makers to react accordingly.
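To give a flavor of how an "anomalously high rate" such as the one highlighted above can be flagged, the following minimal sketch compares a region's report rate against a national baseline with a normal approximation to the binomial. The function name and the toy counts are illustrative assumptions, not the paper's actual method or data.

```python
# Hypothetical sketch: score how anomalous a region's rate is relative to a
# national baseline (illustrative only; the paper's method and data differ).
import math

def rate_anomaly_score(k_region, n_region, k_total, n_total):
    """Z-score of a regional rate against the overall rate, using a normal
    approximation to the binomial under the baseline rate."""
    p0 = k_total / n_total                      # baseline rate
    p_hat = k_region / n_region                 # regional rate
    se = math.sqrt(p0 * (1.0 - p0) / n_region)  # standard error under the baseline
    return (p_hat - p0) / se

# Toy numbers (made up): 320 of 900 reports in one region name boyfriends as
# perpetrators, versus 2,100 of 11,000 nationwide.
print(round(rate_anomaly_score(320, 900, 2100, 11000), 2))
```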

arXiv (Cornell University), Dec 3, 2021
In this article, we introduce a novel type of spatio-temporal sequential patterns called Constricted Spatio-Temporal Sequential (CSTS) patterns and thoroughly analyze their properties. We demonstrate that the set of CSTS patterns is a concise representation of all spatio-temporal sequential patterns that can be discovered in a given dataset. To measure the significance of the discovered CSTS patterns we adapt the participation index measure. We also provide CSTS-Miner: an algorithm that discovers all participation-index-strong CSTS patterns in event data. We experimentally evaluate the proposed algorithm using two crime-related datasets: the Pittsburgh Police Incident Blotter Dataset and the Boston Crime Incident Reports Dataset. In the experiments, the CSTS-Miner algorithm is compared with four other state-of-the-art algorithms: STS-Miner, CSTPM, STBFM and CST-SPMiner. As the results of the experiments suggest, the proposed algorithm discovers far fewer patterns than the other selected algorithms. Finally, we provide examples of interesting crime-related patterns discovered by the proposed CSTS-Miner algorithm.
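As a rough illustration of the participation index mentioned above, the sketch below computes it for a co-occurrence pattern: for each event type, take the fraction of its events that participate in some pattern instance, and report the minimum across types. The data layout and names are assumptions for illustration; CSTS patterns are additionally sequential and spatio-temporally constricted, which this toy does not model.

```python
# Simplified participation index for a co-occurrence pattern (not the full
# CSTS machinery): the minimum, over event types, of the fraction of that
# type's events that appear in at least one pattern instance.
from collections import defaultdict

def participation_index(pattern_instances, all_instances):
    """pattern_instances: list of dicts mapping event type -> event id
       all_instances: dict mapping event type -> set of all event ids of that type"""
    participating = defaultdict(set)
    for inst in pattern_instances:
        for etype, eid in inst.items():
            participating[etype].add(eid)
    ratios = [len(participating[t]) / len(all_instances[t]) for t in all_instances]
    return min(ratios)

all_events = {"burglary": {1, 2, 3, 4}, "vandalism": {10, 11}}
instances = [{"burglary": 1, "vandalism": 10}, {"burglary": 2, "vandalism": 10}]
print(participation_index(instances, all_events))  # min(2/4, 1/2) = 0.5
```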


arXiv (Cornell University), Nov 6, 2018
Adaptive moment methods have been remarkably successful in deep learning optimization, particularly in the presence of noisy and/or sparse gradients. We further the advantages of adaptive moment techniques by proposing a family of double adaptive stochastic gradient methods, DASGrad. They leverage the complementary ideas of the adaptive moment algorithms widely used by the deep learning community and recent advances in adaptive probabilistic algorithms. We analyze the theoretical convergence improvements of our approach in a stochastic convex optimization setting, and provide empirical validation of our findings with convex and non-convex objectives. We observe that the benefits of DASGrad increase with model complexity and gradient variability, and we explore the resulting utility in extensions to distribution-matching multitask learning.
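The sketch below illustrates the "double adaptivity" idea on a toy least-squares problem: an Adam-style update (adaptive moments) combined with adaptive per-example sampling probabilities and importance-weighted gradients. The objective, hyperparameters, and the residual-based sampling rule are assumptions for illustration, not the paper's algorithm or analysis.

```python
# Toy sketch: adaptive moments + adaptive sampling distribution over examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(5)
m, v = np.zeros(5), np.zeros(5)
probs = np.full(len(X), 1.0 / len(X))              # sampling distribution over examples
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    i = rng.choice(len(X), p=probs)
    # importance-weighted per-example gradient (unbiased for the full gradient)
    g = (X[i] @ w - y[i]) * X[i] / (len(X) * probs[i])
    m = b1 * m + (1 - b1) * g                      # first moment
    v = b2 * v + (1 - b2) * g * g                  # second moment
    w -= lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    if t % 100 == 0:                               # adapt sampling toward hard examples
        res = np.abs(X @ w - y) + 1e-3
        probs = res / res.sum()

print(np.round(w, 2))                              # should approach the planted coefficients
```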
Empirically derived sequence similarity thresholds to study the genomic epidemiology of plasmids shared among healthcare-associated bacterial pathogens
eBioMedicine
Forecasting imminent atrial fibrillation in long-term ECG recordings
Journal of Electrocardiology
Incorporation of machine learning and signal quality indicators can significantly suppress false respiratory alerts during in-hospital bedside monitoring
Journal of Electrocardiology

arXiv (Cornell University), Feb 24, 2023
Studies involving both randomized experiments as well as observational data typically involve time-to-event outcomes such as time to failure, death, or onset of an adverse condition. Such outcomes are typically subject to censoring due to loss of follow-up, and established statistical practice involves comparing treatment efficacy in terms of hazard ratios between the treated and control groups. In this paper we propose a statistical approach to recovering sparse phenogroups (or subtypes) that demonstrate differential treatment effects as compared to the study population. Our approach involves modelling the data as a mixture while enforcing parameter shrinkage through structured sparsity regularization. We propose a novel inference procedure for the proposed model and demonstrate its efficacy in recovering sparse phenotypes across large landmark real-world clinical studies in cardiovascular health.
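One ingredient behind "parameter shrinkage through structured sparsity" can be illustrated with a proximal soft-thresholding step, which zeroes out a candidate phenogroup's small deviations from the population-level effect so that only strongly differential features remain. This is a heavily simplified toy, not the paper's mixture model or inference procedure.

```python
# Toy sketch of sparsity-inducing shrinkage on a phenogroup's effect deviations.
import numpy as np

def soft_threshold(beta, lam):
    """Elementwise L1 proximal operator: shrink small deviations to exactly zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

# Deviation of a candidate phenogroup's effect from the population effect (made up).
deviation = np.array([0.02, -0.50, 0.01, 0.80, -0.03])
print(soft_threshold(deviation, lam=0.1))   # only the strong deviations survive
```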
Learning to Extract Actionable Evidence from Medical Insurance Claims Data
Actionable Intelligence in Healthcare, 2017

Infection Control & Hospital Epidemiology, 2019
Background: Identifying routes of transmission among hospitalized patients during a healthcare-associated outbreak can be tedious, particularly among patients with complex hospital stays and multiple exposures. Data mining of the electronic health record (EHR) has the potential to rapidly identify common exposures among patients suspected of being part of an outbreak. Methods: We retrospectively analyzed 9 hospital outbreaks that occurred during 2011–2016 and that had previously been characterized both according to transmission route and by molecular characterization of the bacterial isolates. We determined (1) the ability of data mining of the EHR to identify the correct route of transmission, (2) how early the correct route was identified during the timeline of the outbreak, and (3) how many cases in the outbreaks could have been prevented had the system been running in real time. Results: Correct routes were identified for all outbreaks at the second patient, except for one outbreak ...
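The core idea of finding common exposures among suspected outbreak cases can be conveyed with a very small sketch: pool each case's exposures (units, procedures, providers) from the EHR and rank exposures by how many cases share them. The field names and values are illustrative assumptions, not the study's actual data model.

```python
# Toy sketch: rank candidate transmission routes by how many outbreak cases share them.
from collections import Counter

cases = {
    "pt1": {"ICU-A", "bronchoscopy", "OR-3"},
    "pt2": {"ICU-A", "bronchoscopy"},
    "pt3": {"ward-7", "bronchoscopy"},
}

exposure_counts = Counter(e for exposures in cases.values() for e in exposures)
for exposure, n in exposure_counts.most_common():
    print(f"{exposure}: shared by {n} of {len(cases)} cases")
```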

arXiv (Cornell University), Nov 19, 2015
We present an extension of sparse Canonical Correlation Analysis (CCA) designed for finding multiple-to-multiple linear correlations within a single set of variables. Unlike CCA, which finds correlations between two sets of data where the rows are matched exactly but the columns represent separate sets of variables, the method proposed here, Canonical Autocorrelation Analysis (CAA), finds multivariate correlations within just one set of variables. This can be useful when we look for hidden parsimonious structures in data, each involving only a small subset of all features. In addition, the discovered correlations are highly interpretable as they are formed by pairs of sparse linear combinations of the original features. We show how CAA can be of use as a tool for anomaly detection when the expected structure of correlations is not followed by anomalous data. We illustrate the utility of CAA in two application domains where single-class and unsupervised learning of correlation structures are particularly relevant: breast cancer diagnosis and radiation threat detection. When applied to the Wisconsin Breast Cancer data, single-class CAA is competitive with supervised methods used in the literature. On the radiation threat detection task, unsupervised CAA performs significantly better than an unsupervised alternative prevalent in the domain, while providing valuable additional insights for threat analysis.
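To illustrate the flavor of finding a sparse correlation within a single variable set, the sketch below alternately fits two sparse weight vectors over the same feature matrix so that their projections correlate, soft-thresholding to keep each projection sparse and keeping the two supports disjoint (the disjointness rule and the power-iteration-style updates are assumptions of this sketch, not the paper's optimization procedure).

```python
# Toy sketch of a sparse within-set correlation search on synthetic data.
import numpy as np

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(1)
n, p = 300, 8
X = rng.normal(size=(n, p))
X[:, 4] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=n)       # plant a hidden within-set correlation
X = (X - X.mean(0)) / X.std(0)
C = X.T @ X / n                                           # feature correlation matrix

best = (0.0, None, None)
for seed in range(p):                                     # restart from each 1-sparse seed
    v = np.zeros(p); v[seed] = 1.0
    u = np.zeros(p)
    for _ in range(20):
        u = soft(C @ v, 0.3); u[np.abs(v) > 0] = 0.0      # disjoint supports (sketch assumption)
        if not u.any(): break
        u /= np.linalg.norm(u)
        v = soft(C @ u, 0.3); v[np.abs(u) > 0] = 0.0
        if not v.any(): break
        v /= np.linalg.norm(v)
    if u.any() and v.any():
        r = abs(float(np.corrcoef(X @ u, X @ v)[0, 1]))
        if r > best[0]:
            best = (r, u.copy(), v.copy())

r, u, v = best
print(round(r, 2), np.flatnonzero(u), np.flatnonzero(v))  # expect features {4} and {0}
```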
arXiv (Cornell University), Nov 13, 2015

arXiv (Cornell University), Jan 24, 2021
Machine learning (ML) is increasingly being used to support high-stakes decisions, a trend owed in part to its promise of superior predictive power relative to human assessment. However, there is frequently a gap between decision objectives and what is captured in the observed outcomes used as labels to train ML models. As a result, machine learning models may fail to capture important dimensions of decision criteria, hampering their utility for decision support. In this work, we explore the use of historical expert decisions as a rich, yet imperfect, source of information that is commonly available in organizational information systems, and show that it can be leveraged to bridge the gap between decision objectives and algorithm objectives. We consider the problem of estimating expert consistency indirectly when each case in the data is assessed by a single expert, and propose influence-function-based methodology as a solution to this problem. We then incorporate the estimated expert consistency into a predictive model through a training-time label amalgamation approach. This approach allows ML models to learn from experts when there is inferred expert consistency, and from observed labels otherwise. We also propose alternative ways of leveraging inferred consistency via hybrid and deferral models. In our empirical evaluation, focused on the context of child maltreatment hotline screenings, we show that (1) there are high-risk cases whose risk is considered by the experts but not wholly captured in the target labels used to train a deployed model, and (2) the proposed approach significantly improves precision for these cases.
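The label-amalgamation step can be conveyed with a minimal sketch: where inferred expert consistency is high, let the expert decision drive the training label; otherwise fall back on the observed outcome. The consistency scores and the hard threshold below are assumptions for illustration; the paper estimates consistency with influence functions and explores richer amalgamation schemes.

```python
# Toy sketch of training-time label amalgamation driven by expert consistency.
import numpy as np

def amalgamate(observed, expert, consistency, threshold=0.8):
    """observed, expert: 0/1 arrays; consistency: per-case estimate in [0, 1]."""
    use_expert = consistency >= threshold
    return np.where(use_expert, expert, observed)

observed    = np.array([0, 0, 1, 0])
expert      = np.array([1, 0, 1, 1])
consistency = np.array([0.9, 0.3, 0.95, 0.5])
print(amalgamate(observed, expert, consistency))   # -> [1 0 1 0]
```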
IEEE Journal of Biomedical and Health Informatics, Aug 1, 2021
We describe a new approach to estimating relative risks in time-to-event prediction problems with censored data in a fully parametric manner. Our approach does not require making strong assumptions of constant proportional hazard of the underlying survival distribution, as required by the Cox proportional hazards model. By jointly learning deep nonlinear representations of the input covariates, we demonstrate the benefits of our approach when used to estimate survival risks through extensive experimentation on multiple real-world datasets with different levels of censoring. We further demonstrate the advantages of our model in the competing risks scenario. To the best of our knowledge, this is the first work involving fully parametric estimation of survival times with competing risks in the presence of censoring.
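The fully parametric treatment of censoring can be illustrated in miniature: fit a single Weibull distribution by maximizing a likelihood that uses the density for observed events and the survival function for censored cases. The paper learns mixtures of such primitives on top of deep covariate representations; the sketch below, on simulated data, only shows the censored likelihood itself.

```python
# Minimal sketch: Weibull maximum likelihood under right censoring.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
t_true = rng.weibull(1.5, size=500) * 10.0          # true event times, shape 1.5, scale 10
c = rng.uniform(0, 15, size=500)                    # censoring times
t = np.minimum(t_true, c)                           # observed time
event = (t_true <= c).astype(float)                 # 1 = event observed, 0 = censored

def neg_log_lik(params):
    log_k, log_lam = params
    k, lam = np.exp(log_k), np.exp(log_lam)
    z = t / lam
    log_f = np.log(k / lam) + (k - 1) * np.log(z) - z**k   # event density
    log_S = -z**k                                          # survival function
    return -np.sum(event * log_f + (1 - event) * log_S)

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
print(np.round(np.exp(res.x), 2))                   # recovered (shape, scale) near (1.5, 10)
```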

Neural Information Processing Systems, Apr 1, 2017
We study the problem of interactively learning a binary classifier using noisy labeling and pairwise comparison oracles, where the comparison oracle answers which of two given instances is more likely to be positive. Learning from such oracles has multiple applications where obtaining direct labels is harder but pairwise comparisons are easier, and the algorithm can leverage both types of oracles. In this paper, we attempt to characterize how access to an easier comparison oracle helps in improving the label and total query complexity. We show that the comparison oracle reduces the learning problem to that of learning a threshold function. We then present an algorithm that interactively queries the label and comparison oracles, and we characterize its query complexity under Tsybakov and adversarial noise conditions for the comparison and labeling oracles. Our lower bounds show that our label and total query complexity is almost optimal.
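The reduction to a threshold function can be made concrete with a toy sketch: a comparison oracle induces a ranking of the pool, after which the (more expensive) label oracle is only needed for a binary search along that ranking. The simulated oracles below are noiseless for simplicity, unlike the noisy setting analyzed in the paper; names are illustrative.

```python
# Toy sketch: sort with comparisons, then binary-search the threshold with labels.
from functools import cmp_to_key
import random

random.seed(0)
scores = {i: random.random() for i in range(32)}     # hidden "positivity" score
THRESHOLD = 0.6

def compare(a, b):            # comparison oracle: which instance looks more positive?
    return -1 if scores[a] < scores[b] else 1

def label(i):                 # labeling oracle (more expensive in practice)
    return int(scores[i] >= THRESHOLD)

ranked = sorted(scores, key=cmp_to_key(compare))     # O(n log n) comparison queries
lo, hi, label_queries = 0, len(ranked), 0            # binary search: O(log n) label queries
while lo < hi:
    mid = (lo + hi) // 2
    label_queries += 1
    if label(ranked[mid]):
        hi = mid
    else:
        lo = mid + 1

print("first positive at rank", lo, "using", label_queries, "label queries")
```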

arXiv (Cornell University), Apr 28, 2018
In this paper we explore different regression models based on Clusterwise Linear Regression (CLR). CLR aims to find a partition of the data into k clusters such that linear regressions fitted to each of the clusters minimize the overall mean squared error on the whole data. The main obstacle preventing the use of the found regression models for prediction on unseen test points is the absence of a reasonable way to obtain CLR cluster labels when the values of the target variable are unknown. In this paper we propose two novel approaches to solve this problem. The first approach, predictive CLR, builds a separate classification model to predict test CLR labels. The second approach, constrained CLR, utilizes a set of user-specified constraints that enforce certain points to go to the same clusters. Assuming the constraint values are known for the test points, they can be directly used to assign CLR labels. We evaluate these two approaches on three UCI ML datasets as well as on a large corpus of health insurance claims. We show that both of the proposed algorithms significantly improve over the known CLR-based regression methods. Moreover, predictive CLR consistently outperforms linear regression and random forest, and shows comparable performance to support vector regression on the UCI ML datasets. The constrained CLR approach achieves the best performance on the health insurance dataset, while incurring only an approximately 20-fold increase in computational time over linear regression.
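The CLR core can be sketched as an alternating procedure: fit a linear regression per cluster, then reassign each point to the cluster whose regression explains it best (smallest squared residual). Predictive CLR then trains a separate classifier to assign clusters to unseen points, which is omitted here; the toy data and initialization below are assumptions for illustration.

```python
# Rough sketch of clusterwise linear regression on synthetic data with two planted lines.
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, n)])    # intercept + 1 feature
true_cluster = rng.integers(0, 2, n)
coefs_true = np.array([[3.0, 2.0], [-3.0, 2.0]])            # two parallel regression lines
y = np.einsum("ij,ij->i", X, coefs_true[true_cluster]) + 0.1 * rng.normal(size=n)

k = 2
assign = rng.integers(0, k, n)                               # random initial clusters
betas = np.zeros((k, X.shape[1]))
for _ in range(20):
    for c in range(k):
        mask = assign == c
        if mask.any():                                       # keep previous fit if a cluster empties
            betas[c] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    resid = (y[:, None] - X @ betas.T) ** 2                  # residual of each point under each fit
    assign = resid.argmin(axis=1)                            # reassign to the best-fitting cluster

print(np.round(betas, 2))   # should roughly recover the two planted lines (order may swap)
```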

arXiv (Cornell University), Nov 12, 2019
Monitoring physiological responses to hemodynamic stress can help in determining appropriate treatment and ensuring good patient outcomes. Physicians' intuition suggests that the human body has a number of physiological response patterns to hemorrhage which escalate as blood loss continues; however, the exact etiology and phenotypes of such responses are not well known, or are understood only at a coarse level. Although previous research has shown that machine learning models can perform well in hemorrhage detection and survival prediction, it is unclear whether machine learning could help to identify and characterize the underlying physiological responses in raw vital sign data. We approach this problem by first transforming the high-dimensional vital sign time series into a tractable, lower-dimensional latent space using a dilated, causal convolutional encoder model trained purely unsupervised. Second, we identify informative clusters in the embeddings. By analyzing the clusters of latent embeddings and visualizing them over time, we hypothesize that the clusters correspond to physiological response patterns that match physicians' intuition. Furthermore, we attempt to evaluate the latent embeddings using a variety of methods, such as predicting the cluster labels using explainable features.
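A schematic sketch of the pipeline described above: a dilated, causal 1-D convolutional encoder compresses a multichannel vital-sign window into a low-dimensional embedding, which is then clustered. The layer sizes, channel counts, and the toy k-means step are assumptions; the paper's encoder is trained unsupervised, which this sketch omits (the network below has random weights).

```python
# Dilated causal convolutional encoder + clustering of embeddings (untrained sketch).
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past samples (left padding)."""
    def __init__(self, c_in, c_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

encoder = nn.Sequential(
    CausalConv1d(6, 32, kernel_size=3, dilation=1), nn.ReLU(),
    CausalConv1d(32, 32, kernel_size=3, dilation=2), nn.ReLU(),
    CausalConv1d(32, 16, kernel_size=3, dilation=4),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),       # -> (batch, 16) embedding
)

vitals = torch.randn(8, 6, 300)                  # 8 windows, 6 vital-sign channels, 300 steps
with torch.no_grad():
    z = encoder(vitals)
print(z.shape)                                   # torch.Size([8, 16])

# Cluster the embeddings (toy k-means; scikit-learn's KMeans would also work).
k = 2
centers = z[torch.randperm(len(z))[:k]].clone()
for _ in range(10):
    assign = torch.cdist(z, centers).argmin(dim=1)
    centers = torch.stack([z[assign == c].mean(0) if (assign == c).any() else centers[c]
                           for c in range(k)])
print(assign)
```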
Sensors, Jan 28, 2022
Sensors, Feb 12, 2022