Papers by Panayiotis Petousis

Using Autoencoders for Imputing Missing Data in eGFR Decline Trajectories of Patients with CKD
BACKGROUND Using machine learning (ML) approaches to impute missing data has not been explored in CKD progression. We investigated the utility of a data-driven imputation to improve downstream classifier prediction of rapid eGFR decline in the CURE-CKD registry. METHODS We analyzed CKD patients at UCLA (N=13,206) over a 2-year period. We used: 1) the dataset with missing data; and 2) a censored subset with no missing data. We introduced 33% and 66% missingness by removing values under three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). We included: eGFR, hemoglobin A1c (HbA1c), systolic blood pressure (SBP), number of ambulatory and inpatient visits, age, sex, ethnicity, rurality status, diagnosis of hypertension, diabetes mellitus (DM), pre-DM, and use of renin-angiotensin-aldosterone system inhibitors. We introduced missingness on SBP and HbA1c to mirror the original dataset. We imputed missing values using an autoencoder ML model. To predict a 40% eGFR decline over 2 years, we developed random forest models using the full and resultant imputed datasets. RESULTS On the full subset, the MNAR imputation method achieved a root mean squared error (RMSE) of 0. The MAR method achieved an RMSE of 3.8 at 33% missingness and 5.4 at 66%. MCAR achieved an RMSE of 38.5 at 33% missingness and 56.4 at 66%. Using the random forest model to predict rapid decline on the fully observed subset, without removing and imputing data, achieved a receiver operating characteristic (ROC) area under the curve (AUC) mean of 80.8%±1.1 and a precision-recall (PR) AUC mean of 23.9%±1.5; the same as our methodology on MNAR, which is explained by the RMSE of 0, shown in Table 1. CONCLUSION Our method accurately imputes clinical data values while accounting for uncertainty caused by missing values.
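
The abstract does not publish the network configuration, so the following is only a minimal sketch of autoencoder-based imputation as described: missing entries are initially filled, a small encoder-decoder is trained to reconstruct its input, and only the originally missing cells are replaced with the reconstruction. The layer sizes and the Keras framework are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of autoencoder imputation; architecture sizes are
# illustrative assumptions, not the paper's published configuration.
import numpy as np
import tensorflow as tf

def autoencoder_impute(X, n_epochs=100):
    """X: patients x features matrix with NaN for missing values."""
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)  # crude initial fill

    n_features = X.shape[1]
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(8, activation="relu"),   # bottleneck
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_features),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_filled, X_filled, epochs=n_epochs, verbose=0)

    # Keep observed values; substitute reconstructions only where data
    # were originally missing.
    X_hat = model.predict(X_filled, verbose=0)
    return np.where(mask, X_hat, X)
```

Imputation RMSE against the held-out true values on the censored subset would then mirror the paper's 33%/66% missingness experiments.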

ABSTRACT: PO0528: Predicting Rapid eGFR Decline Using Electronic Health Record (EHR) Data Despite High Missingness in the CURE-CKD Registry
BACKGROUND Patients with rapid eGFR decline tend to progress to kidney failure. Automated tools can identify individuals at risk of severe renal function decline and facilitate disease mitigation. We describe a deep neural network (DNN) for predicting the risk of rapid eGFR decline (>40% decrease in eGFR over 2 years) and identified populations at higher risk of rapid decline using the CURE-CKD Registry. METHODS Variables include: age, sex, race/ethnicity, ACE inhibitor/ARB use, eGFR, systolic blood pressure (SBP), hemoglobin A1C, and the diagnosis of hypertension, type 2 diabetes (DM), pre-DM, or chronic kidney disease (CKD) based on EHR coding from patients with CKD (N=93,567) and at risk for CKD (N=913,289) with eGFR ≥15 mL/min/1.73 m² over 2 years. We trained and validated a 5-layer DNN, a logistic regression (LR) model, and a gradient boosted tree (GBT) model using a 60/20/20 train/test/validation split. We computed the risk distribution of all 25,475 subpopulations, based on all possible expert-defined combinations of the above variables, and compared each subpopulation's risk distribution against the whole population's risk distribution using the Kolmogorov-Smirnov (KS) test. Subgroups with the highest risk of decline were identified by applying the KS test (p<0.05) to our highest-performing model. RESULTS The DNN achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.75 on the test set. The LR and GBT models achieved an AUC-ROC of 0.72 and 0.73, respectively. A total of 17,734 subpopulations had significantly higher average predicted risk across training, validation, and testing. We identified the most frequent predictors of rapid eGFR decline across the highest-risk populations. Among the top 100 significantly higher-risk subpopulations, the most frequent variables were: CKD (100%), SBP > 140 mmHg (72%), age 45-66 years (56%), DM (52%), and A1C > 8 (50%). CONCLUSION We developed a methodology that uses a risk model for rapid eGFR decline built on big data and used its predictions, along with the KS test, to identify subpopulations at significantly high risk for rapid eGFR decline.
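
The subpopulation screen described in METHODS can be reproduced in outline with a two-sample Kolmogorov-Smirnov test from SciPy. The function below is a sketch under assumed inputs (an array of model-predicted risks and a dictionary of boolean masks defining the expert-defined subgroups); the paper's 25,475 subgroups come from combinations of the listed variables.

```python
# Sketch: flag subpopulations whose predicted-risk distribution differs
# significantly from the whole population's (two-sample KS test).
import numpy as np
from scipy.stats import ks_2samp

def high_risk_subgroups(risk, subgroup_masks, alpha=0.05):
    """risk: model-predicted risks for the whole population.
    subgroup_masks: maps subgroup name -> boolean index array (hypothetical)."""
    flagged = []
    for name, mask in subgroup_masks.items():
        sub = risk[mask]
        if sub.size == 0:
            continue
        _, p = ks_2samp(sub, risk)
        # Keep subgroups that differ significantly and skew toward higher risk.
        if p < alpha and sub.mean() > risk.mean():
            flagged.append((name, float(sub.mean()), float(p)))
    return sorted(flagged, key=lambda t: -t[1])
```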

AMIA ... Annual Symposium proceedings. AMIA Symposium, 2018
Risk prediction models are crucial for assessing the pretest probability of cancer and are applied to stratify patient management strategies. These models are frequently based on multivariate regression analysis, requiring that all risk factors be specified, and do not convey the confidence in their predictions. We present a framework for uncertainty analysis that accounts for variability in input values. Uncertain or missing values are replaced with a range of plausible values. These ranges are used to compute individualized risk confidence intervals. We demonstrate our approach using the Gail model to evaluate the impact of uncertainty on management decisions. Up to 13% of cases were uncertain, with a risk interval that spans the decision threshold (e.g., 1.67% 5-year absolute risk). A small number of cases changed from low to high risk when missing values were present. Our analysis underscores the need for better communication of input assumptions that influence the resulting ...
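
One way to realize the described interval construction is to enumerate plausible values for each uncertain input and take the minimum and maximum model outputs. The sketch below assumes a generic risk-model callable standing in for the Gail model (whose actual implementation is not reproduced here); a case is "uncertain" when its interval straddles the guideline cutoff.

```python
# Sketch: propagate input uncertainty through a risk model by enumerating
# plausible values for unknown inputs. `risk_model` is a hypothetical
# callable standing in for the Gail model.
from itertools import product

def risk_interval(known, plausible, risk_model):
    """known: dict of observed risk factors.
    plausible: dict mapping each uncertain factor -> list of candidate values."""
    factors = list(plausible)
    risks = [risk_model({**known, **dict(zip(factors, combo))})
             for combo in product(*plausible.values())]
    return min(risks), max(risks)

# A case is uncertain when its interval straddles the decision cutoff:
#   lo, hi = risk_interval(known, plausible, risk_model)
#   uncertain = lo < 0.0167 <= hi   # 1.67% 5-year absolute risk threshold
```
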
2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2021
A Web-based Platform for Evaluating and Disseminating Predictive Models

Author(s): Petousis, Panayiotis | Advisor(s): Bui, Alex Anh-Tuan; Aberle, Denise | Abstract: Current clinical decision-making relies heavily upon both the experience of a physician and the recommendations of evidence-based practice guidelines, the latter often informed by population-level policies. Yet with the heightened complexity of patient care given newer types of data and longitudinal observations (e.g., from the electronic health record, EHR), as well as the goal of more individually tailored healthcare, medical decision-making is increasingly complicated. This issue is particularly true in cancer with emergent techniques for early detection and personalized treatment. This research establishes an informatics-based framework to inform optimal cancer screening through sequential decision-making methods. This dissertation develops tools to formulate a partially observable Markov decision process (POMDP) model, enabling each component to be learned from a dataset: dynamic Bayesian ...
A Continuous Markov Model Approach Using Individual Patient Data to Estimate Mean Sojourn Time of Lung Cancer

Frontiers in Big Data, 2021
We present a novel approach for imputing missing data that incorporates temporal information into bipartite graphs through an extension of graph representation learning. Missing data are abundant in several domains, particularly when observations are made over time. Most imputation methods make strong assumptions about the distribution of the data. While newer methods may relax some assumptions, they may not consider temporality. Moreover, when such methods are extended to handle time, they may not generalize without retraining. We propose using a joint bipartite graph approach to incorporate temporal sequence information. Specifically, the observation nodes and edges with temporal information are used in message passing to learn node and edge embeddings and to inform the imputation task. Our proposed method, temporal setting imputation using graph neural networks (TSI-GNN), captures sequence information that can then be used within an aggregation function of a graph neural network. ...
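
To make the bipartite formulation concrete, here is a toy NumPy sketch of message passing between observation (row) nodes and feature (column) nodes, with observed values as edge weights. It deliberately omits TSI-GNN's temporal edge features and learned aggregation; it only illustrates how imputation can be posed as edge reconstruction on a bipartite graph.

```python
# Toy bipartite message-passing imputation (untrained, illustrative only;
# TSI-GNN's temporal edge features and learned layers are not reproduced).
import numpy as np

def bipartite_impute(X, d=8, n_layers=3, seed=0):
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(X)                        # observed-edge mask
    W = np.nan_to_num(X) * obs                # edge weights = observed values
    H_row = rng.normal(size=(X.shape[0], d))  # observation-node embeddings
    H_col = rng.normal(size=(X.shape[1], d))  # feature-node embeddings

    for _ in range(n_layers):
        # Each side aggregates value-weighted messages from its neighbors.
        deg_r = obs.sum(axis=1, keepdims=True).clip(min=1)
        deg_c = obs.sum(axis=0, keepdims=True).T.clip(min=1)
        H_row = np.tanh(W @ H_col / deg_r)
        H_col = np.tanh(W.T @ H_row / deg_c)

    X_hat = H_row @ H_col.T                   # edge-level reconstruction
    return np.where(obs, X, X_hat)
```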

Artificial Intelligence in Health, 2019
Cancer screening is a large, population-based intervention that would benefit from tools enabling individually tailored decision making to decrease unintended consequences such as overdiagnosis. The heterogeneity of cancer screening participants underscores the need for more personalized approaches. Partially observable Markov decision processes (POMDPs) can be used to suggest optimal, individualized screening policies. However, determining an appropriate reward function can be challenging. Here, we propose the use of inverse reinforcement learning (IRL) to form reward functions for lung and breast cancer screening POMDP models. Using data from the National Lung Screening Trial and our institution's breast screening registry, we developed two POMDP models with corresponding reward functions. Specifically, the maximum entropy (MaxEnt) IRL algorithm with an adaptive step size was used to learn rewards more efficiently, and was combined with a multiplicative model to learn state-action pair rewards in the POMDP. The lung and breast cancer screening models were evaluated based on their ability to recommend appropriate screening decisions before the diagnosis of cancer. Results are comparable to experts' decisions. The lung POMDP demonstrated improved performance in terms of recall and false positive rate in the second screening and post-screening stages. Precision (0.02–0.05) was comparable to the experts' (0.02–0.06). The breast POMDP has excellent recall (0.97–1.00), matching the physicians', and a satisfactory false positive rate (<0.03). The reward functions learned with the MaxEnt IRL algorithm, when combined with POMDP models in lung and breast cancer screening, demonstrate performance comparable to experts.
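
For orientation, the core MaxEnt IRL update has this shape: the gradient of the log-likelihood with respect to the reward weights is the gap between expert and policy feature expectations. The decaying step size below is a stand-in for the paper's adaptive scheme, and the feature-expectation routine is assumed to be supplied (e.g., via soft value iteration).

```python
# Sketch of the MaxEnt IRL weight update. The decaying step size is a
# stand-in for the paper's adaptive scheme; `policy_fe_fn` is assumed given.
import numpy as np

def maxent_irl(expert_fe, policy_fe_fn, n_features, n_iters=200,
               lr=0.1, decay=0.99):
    """expert_fe: empirical feature expectations of expert trajectories.
    policy_fe_fn: maps reward weights -> feature expectations of the
    soft-optimal policy under those weights."""
    theta = np.zeros(n_features)
    step = lr
    for _ in range(n_iters):
        grad = expert_fe - policy_fe_fn(theta)  # likelihood gradient
        theta += step * grad
        step *= decay
    return theta  # linear reward: r(s) = features(s) @ theta
```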

IEEE Access, 2019
Globally, lung cancer is responsible for nearly one in five cancer deaths. The National Lung Screening Trial (NLST) demonstrated the efficacy of low-dose computed tomography (LDCT) to identify early-stage disease, setting the basis for widespread implementation of lung cancer screening programs. However, the specificity of LDCT lung cancer screening is suboptimal, with a significant false positive rate. Representing this imaging-based screening process as a sequential decision-making problem, we combined multiple machine learning-based methods to learn a partially observable Markov decision process that simultaneously optimizes lung cancer detection while enhancing test specificity. Using NLST data, we trained a dynamic Bayesian network as an observational model and used inverse reinforcement learning to discover a reward function based on experts' decisions. Our resultant predictive model decreased the false positive rate while maintaining a high true positive rate, at a level comparable to human experts. Our model also detected a number of lung cancers earlier.
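
A POMDP maintains a belief over the hidden cancer state rather than the state itself; after each screening action and observed test result, the belief is updated by a standard Bayes filter. The sketch below shows that update with generic transition and observation matrices (the paper's DBN-learned models are not reproduced here).

```python
# Standard POMDP belief update (Bayes filter); T and O are generic
# placeholders for the transition/observation models learned in the paper.
import numpy as np

def belief_update(b, T, O, a, o):
    """b: belief over hidden states (sums to 1).
    T[a][s, s']: transition probability under action a.
    O[a][s', o]: probability of observation o in state s' after action a."""
    b_pred = b @ T[a]              # predict: push belief through transitions
    b_new = O[a][:, o] * b_pred    # correct: weight by observation likelihood
    return b_new / b_new.sum()     # renormalize
```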

Computers in biology and medicine, Feb 22, 2016
A growing number of individuals who are considered at high risk of cancer are now routinely undergoing population screening. However, noted harms such as radiation exposure, overdiagnosis, and overtreatment underscore the need for better temporal models that predict who should be screened and at what frequency. The mean sojourn time (MST), the average duration during which a tumor can be detected by imaging but with no observable clinical symptoms, is a critical variable for formulating screening policy. Estimation of the MST has long been studied using continuous Markov models (CMM) with maximum likelihood estimation (MLE). However, many traditional methods assume no observation error in the imaging data, which is unlikely and can bias the estimation of the MST. In addition, the MLE may not be stably estimated when data are sparse. Addressing these shortcomings, we present a probabilistic modeling approach for periodic cancer screening data. We first model the cancer state transition u...
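
The quantity being estimated rests on a standard CMM identity: if the preclinical (screen-detectable) state is exited at a constant rate, the sojourn time is exponentially distributed and the MST is the inverse of that rate.

```latex
% If the preclinical state is left at constant rate \lambda, the sojourn
% time T is exponential, and the mean sojourn time is its expectation:
\[
  \mathrm{MST} = \mathbb{E}[T]
               = \int_0^\infty t\,\lambda e^{-\lambda t}\,dt
               = \frac{1}{\lambda}.
\]
```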

Artificial Intelligence in Medicine, 2016
Introduction: Identifying high-risk lung cancer individuals at an early disease stage is the most effective way of improving survival. The landmark National Lung Screening Trial (NLST) demonstrated the utility of low-dose computed tomography (LDCT) imaging to reduce mortality (relative to x-ray screening). As a result of the NLST and other studies, imaging-based lung cancer screening programs are now being implemented. However, LDCT interpretation results in a high number of false positives. A set of dynamic Bayesian networks (DBNs) were designed and evaluated to provide insight into how longitudinal data can be used to help inform lung cancer screening decisions. Methods: The LDCT arm of the NLST dataset was used to build and explore five DBNs for high-risk individuals. Three of these DBNs were built using a backward construction process, and two using structure learning methods. All models employ demographic, smoking status, cancer history, family lung cancer history, exposure risk factors, comorbidities related to lung cancer, and LDCT screening outcome information. Given the uncertainty arising from lung cancer screening, a cancer state-space model based on lung cancer staging was utilized to characterize the cancer status of an individual over time. The models were evaluated on balanced training and test sets of cancer and non-cancer cases to deal with data imbalance and overfitting. Results: Results were comparable to expert decisions. The average area under the receiver operating characteristic (ROC) curve (AUC) for the three intervention points of the NLST trial was higher than 0.75 for all models. Evaluation of the models on the complete LDCT arm of the NLST dataset (N = 25,486) demonstrated satisfactory generalization. Consensus of predictions over similar cases is reported in concordance statistics between the models' and the physicians' predictions. The models' predictive ability with respect to missing data was also evaluated with the ...
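
The balanced evaluation described in Methods can be sketched as repeated subsampling of the majority class followed by ROC AUC scoring; the subsample count and the scikit-learn scorer below are assumptions, not the paper's exact protocol.

```python
# Sketch: balanced evaluation (equal cancer / non-cancer cases) with ROC
# AUC, averaged over repeated subsamples. Round count is an assumption.
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_auc(y_true, y_score, n_rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == 0)
    n = min(len(pos), len(neg))
    aucs = []
    for _ in range(n_rounds):
        idx = np.concatenate([rng.choice(pos, n, replace=False),
                              rng.choice(neg, n, replace=False)])
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return float(np.mean(aucs)), float(np.std(aucs))
```
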
As our collective knowledge about COVID-19 continues to grow at an exponential rate, it becomes more difficult to organize and observe emerging trends. In this work, we built an open-source methodology that uses topic modeling and a pretrained BERT model to organize large corpora of COVID-19 publications into topics over time and over location. Additionally, it assesses the association of medical keywords with COVID-19 over time. These analyses are then automatically pushed into an open-source web application that allows a user to obtain actionable insights from across the globe.
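
A pipeline in the spirit of the described system can be sketched as BERT-style sentence embeddings, clustering into topics, and TF-IDF terms as topic labels. The embedding model name, cluster count, and libraries below are illustrative assumptions; the authors' open-source code may differ.

```python
# Sketch: embed abstracts with a pretrained sentence-BERT model, cluster
# into topics, and label each topic by its top TF-IDF terms. Model name
# and cluster count are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def topics(abstracts, n_topics=20):
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(abstracts)
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(emb)

    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(abstracts)
    terms = np.array(vec.get_feature_names_out())
    names = {}
    for k in range(n_topics):
        weights = np.asarray(tfidf[labels == k].mean(axis=0)).ravel()
        names[k] = terms[weights.argsort()[-5:][::-1]].tolist()
    return labels, names  # group `labels` by publication date for trends
```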