Papers by Florian Guitton
Lecture Notes in Computer Science, 2014




ArXiv, 2020
Along with the blooming of AI and Machine Learning-based applications and services, data privacy and security have become a critical challenge. Conventionally, data is collected and aggregated in a data centre on which machine learning models are trained. This centralised approach has induced severe privacy risks of personal data leakage, misuse, and abuse. Furthermore, in the era of the Internet of Things and big data, in which data is essentially distributed, transferring a vast amount of data to a data centre for processing seems to be a cumbersome solution. This is not only because of the difficulties in transferring and sharing data across data sources but also because of the challenges of complying with rigorous data protection regulations and complicated administrative procedures such as the EU General Data Protection Regulation (GDPR). In this respect, Federated learning (FL) emerges as a prospective solution that facilitates distributed collaborative learning without disclosing origina...
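The FL setting described above — training at each data source and sharing only model parameters, never the data — can be illustrated with a minimal federated-averaging loop. This is a toy sketch, not the survey's own code: the clients, model (plain linear regression), and all hyperparameters are invented for the example.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient-descent steps on MSE loss."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server step: average client models, weighted by local sample counts."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients hold disjoint data that never leaves their site.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
Xa, Xb = rng.normal(size=(50, 2)), rng.normal(size=(80, 2))
ya, yb = Xa @ true_w, Xb @ true_w

w_global = np.zeros(2)
for _ in range(20):  # communication rounds: broadcast, train locally, aggregate
    wa = local_update(w_global, Xa, ya)
    wb = local_update(w_global, Xb, yb)
    w_global = federated_average([wa, wb], [len(ya), len(yb)])
```

Only `wa` and `wb` cross the network; the raw `(X, y)` pairs stay on each client, which is the privacy property the survey examines against GDPR requirements.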

eTRIKS analytical environment: A modular high performance framework for medical data analysis
2017 IEEE International Conference on Big Data (Big Data), 2017
Translational research is quickly becoming a science driven by big data. Improving patient care and developing personalized therapies and new drugs depend increasingly on an organization's ability to rapidly and intelligently leverage complex molecular and clinical data from a variety of large-scale partner and public sources. As analysing these large-scale datasets becomes increasingly computationally expensive, traditional analytical engines are struggling to provide timely answers to the questions that biomedical scientists are asking. Designing such a framework means developing for a moving target, as the very nature of biomedical research based on big data requires an environment capable of adapting quickly and efficiently in response to evolving questions. The resulting framework must consequently be scalable in the face of large amounts of data, flexible, efficient and resilient to failure. In this paper we design the eTRIKS Analytical Environment (eAE), a scalable and modular fram...

Adaptive Domain Decomposition for Effective Data Assimilation
We present a parallel Data Assimilation model based on an Adaptive Domain Decomposition (ADD-DA) coupled with the open-source, finite-element, fluid dynamics model Fluidity. The model we present is defined on a partition of the domain into sub-domains without overlapping regions. This choice allows us to avoid communication among the processes during the Data Assimilation phase. However, during the balance phase, the model exploits the domain decomposition implemented in Fluidity, which balances the results among the processes using overlapping regions. The model also exploits the technology provided by mesh adaptivity to generate an optimal mesh we name the supermesh. The supermesh is the one used in the ADD-DA process. We prove that the ADD-DA model provides the same numerical solution as the corresponding sequential DA model. We also show that the ADD approach reduces the execution time even when the implementation is not on a parallel computing environment. Experimental results are...
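For an analysis operator that acts pointwise, the claim that a non-overlapping decomposition reproduces the sequential solution with no inter-process communication can be shown in a few lines. This is a deliberately simplified toy (a pointwise blend of background and observations), not the ADD-DA algorithm itself; the paper's supermesh machinery is what extends the equivalence to the general finite-element setting.

```python
import numpy as np

def assimilate(background, observation, weight=0.5):
    """Toy analysis step: blend the background state with observations pointwise."""
    return background + weight * (observation - background)

rng = np.random.default_rng(0)
xb = rng.normal(size=12)   # background state on the full domain
y = rng.normal(size=12)    # observations on the same grid

# Sequential analysis over the whole domain.
xa_seq = assimilate(xb, y)

# Non-overlapping decomposition: each sub-domain is assimilated independently,
# with no communication, then the pieces are concatenated.
parts = [assimilate(b, o)
         for b, o in zip(np.array_split(xb, 3), np.array_split(y, 3))]
xa_add = np.concatenate(parts)
```

Because the analysis touches each grid point independently, `xa_add` matches `xa_seq` exactly; when the operator couples neighbouring points, preserving this equality is precisely what the ADD-DA construction has to prove.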

A Multi Tenant Computational Platform for Translational Medicine
2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), 2018
Translational biomedical research has become a science driven by big data. Improving patient care by developing personalized therapies and new drugs depends increasingly on an organization's ability to rapidly and intelligently leverage complex molecular and clinical data from a variety of large-scale internal and external, partner and public, data sources. As analysing these large-scale and complex datasets has become increasingly computationally expensive, it is of paramount importance to enable researchers to seamlessly scale up their computation platform while managing the complex yet flexible scenarios that biomedical scientists are asking for. We developed a new platform in answer to these needs of analysing and exploring massive amounts of medical data, with the constraint of enabling the broadest audience, ranging from medical doctors to advanced coders, to easily and intuitively exploit this new resource. The platform consists of three main components: Bo...

ArXiv, 2021
Epidemiology models play a key role in understanding and responding to the COVID-19 pandemic. In order to build those models, scientists need to understand contributing factors and their relative importance. A large strand of literature has identified the importance of airflow in mitigating droplet and far-field aerosol transmission risks. However, the specific factors contributing to higher or lower contamination in various settings have not been clearly defined and quantified. As part of the MOAI project (https://moaiapp.com), we are developing a privacy-preserving test and trace app to enable infection cluster investigators to get in touch with patients without having to know their identity. This approach allows involving users in the fight against the pandemic by contributing additional information in the form of anonymous research questionnaires. We first describe how the questionnaire was designed and how the synthetic data was generated, based on a review we carried out on the lat...

eTRIKS IT platforms for large-scale biomedical research
European Respiratory Journal, 2015
Introduction: Research projects, such as U-BIOPRED (Unbiased BIOmarkers for the PREDiction of respiratory disease outcomes), require technology enabling multimodal data integration and analysis, hypothesis management, collaboration and result reproducibility. Aims and Objectives: Integration of clinical and omics data from human subjects, animal and cell models. Omics data include gene expression, protein abundance, lipid abundance, breath metabolites and genetics. Secure, multi-user access, data storage and analysis capabilities. Saving and sharing results, provenance capturing and publication management. Methods: For clinical and omics data integration, storage and analysis, the tranSMART platform was extended by the European Translational Research and Knowledge Management Services (eTRIKS) project. For collaboration, transparency and reproducibility of analyses, the Knowledge Portal (KP) was developed. Results: The human asthma cohort dataset, stored on the eTRIKS-tranSMART platform, enables real-time analyses and inferences (e.g. Abdominal Girth / FEV1 correlation (rs=-0.14, p=0.0015) in asthma patients). The dataset is constantly growing with newly generated data. The KP has led to the development and review of at least 10 statistical analysis plans linked to source data, and tens of papers are in the publication pipeline. Conclusions: eTRIKS IT platforms are essential in streamlining large-scale biomedical research.
Privacy preservation in federated learning: An insightful survey from the GDPR perspective
Computers & Security

Lecture Notes in Computer Science
In recent years, convolutional neural networks have transformed the field of medical image analysis due to their capacity to learn discriminative image features for a variety of classification and regression tasks. However, successfully learning these features requires a large amount of manually annotated data, which is expensive to acquire and limited by the available resources of expert image analysts. Therefore, unsupervised, weakly-supervised and self-supervised feature learning techniques have received a lot of attention; these aim to utilise the vast amount of available data while avoiding, or substantially reducing, the effort of manual annotation. In this paper, we propose a novel way of training a cardiac MR image segmentation network in which features are learnt in a self-supervised manner by predicting anatomical positions. The anatomical positions serve as a supervisory signal and do not require extra manual annotation. We demonstrate that this seemingly simple task provides a strong signal for feature learning, and with self-supervised learning we achieve a high segmentation accuracy that is better than or comparable to a U-net trained from scratch, especially in a small-data setting. When only five annotated subjects are available, the proposed method improves the mean Dice metric from 0.811 to 0.852 for short-axis image segmentation, compared to the baseline U-net.
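The Dice metric quoted above measures the overlap between a predicted segmentation mask and the reference annotation: twice the intersection over the sum of the two mask sizes, so 1.0 means perfect agreement and 0.0 means no overlap. A reference implementation is short (the function and example arrays are ours, not from the paper):

```python
import numpy as np

def dice(pred, target):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Two 2x3 toy masks sharing two of their three foreground pixels.
a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
score = dice(a, b)
```

In the paper's small-data comparison, this score rises from 0.811 (U-net from scratch) to 0.852 with self-supervised pre-training.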
A blockchain-based trust system for decentralised applications: When trustless needs trust
Future Generation Computer Systems
A population-based phenome-wide association study of cardiac and aortic structure and function
Nature Medicine

JAMA Network Open
IMPORTANCE Identifying brain regions associated with risk factors for dementia could guide mechanistic understanding of risk factors associated with Alzheimer disease (AD). OBJECTIVES To characterize volume changes in brain regions associated with aging and modifiable risk factors for dementia (MRFD) and to test whether volume differences in these regions are associated with cognitive performance. DESIGN, SETTING, AND PARTICIPANTS This cross-sectional study used data from UK Biobank participants who underwent T1-weighted structural brain imaging from August 5, 2014, to October 14, 2016. A voxelwise linear model was applied to test for regional gray matter volume differences associated with aging and MRFD (ie, hypertension, diabetes, obesity, and frequent alcohol use). The potential clinical relevance of these associations was explored by comparing their neuroanatomical distributions with the regional brain atrophy found with AD. Mediation models for risk factors, brain volume differences, and cognitive measures were tested. The primary hypothesis was that common, overlapping regions would be found. Primary analysis was conducted on April 1, 2018. MAIN OUTCOMES AND MEASURES Gray matter regions that showed relative atrophy associated with AD, aging, and greater numbers of MRFD. RESULTS Among 8312 participants (mean [SD] age, 62.4 [7.4] years; 3959 [47.1%] men), aging and 4 major MRFD (ie, hypertension, diabetes, obesity, and frequent alcohol use) had independent negative associations with specific gray matter volumes. These regions overlapped neuroanatomically with those showing lower volumes in participants with AD, including the posterior cingulate cortex, the thalamus, the hippocampus, and the orbitofrontal cortex.
Associations between these MRFD and spatial memory were mediated by differences in posterior cingulate cortex volume (β = 0.0014; SE = 0.0006; P = .02). CONCLUSIONS AND RELEVANCE This cross-sectional study identified differences in localized brain gray matter volume associated with aging and MRFD, suggesting regional vulnerabilities. These differences appeared relevant to cognitive performance even among people considered cognitively healthy.
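Mediation models of the kind reported above decompose a risk factor's total association with an outcome into a direct path and an indirect path through the mediator (here, regional brain volume). For linear models this can be sketched with two ordinary least-squares regressions; all variable names and effect sizes below are synthetic illustrations, not the study's data or its formal mediation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
risk = rng.normal(size=n)                                  # risk-factor score
volume = -0.5 * risk + rng.normal(size=n)                  # mediator
memory = 0.8 * volume + 0.1 * risk + rng.normal(size=n)    # outcome

def ols_slope(y, *xs):
    """Slope of the FIRST regressor in an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

a = ols_slope(volume, risk)           # path risk -> mediator
b = ols_slope(memory, volume, risk)   # path mediator -> outcome, adjusting for risk
indirect = a * b                      # mediated (indirect) effect
total = ols_slope(memory, risk)       # total association
direct = ols_slope(memory, risk, volume)
```

For OLS with a single mediator, `total = direct + indirect` holds exactly (the omitted-variable-bias identity), which is why the decomposition is well defined.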


Scientific Data
Biomedical informatics has traditionally adopted a linear view of the informatics process (collect, store and analyse) in translational medicine (TM) studies, focusing primarily on the challenges in data integration and analysis. However, a data management challenge presents itself with the new lifecycle view of data emphasized by the recent calls for data re-use, long-term data preservation, and data sharing. There is currently a lack of dedicated infrastructure focused on the 'manageability' of the data lifecycle in TM research between data collection and analysis. Current community efforts towards establishing a culture of open science prompt the creation of a data custodianship environment for management of TM data assets to support data reuse and reproducibility of research results. Here we present the development of a lifecycle-based methodology to create a metadata management framework based on community-driven standards for standardisation, consolidation and integration of TM research data. Based on this framework, we also present the development of a new platform (PlatformTM) focused on managing the lifecycle of translational research data assets.
Optimising Correlation Matrix Calculations on Gene Expression Data
BMC bioinformatics, Jan 5, 2014
Background: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of these clustering algorithms cannot meet the requirements of large-scale molecular data, due to the poor performance of the correlation matrix calculation. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of state-of-the-art statistical algorithms to be further impacted unless efforts towards optimisation are carried out. MapReduce is a widely used high-performance parallel framework that can solve this problem. Results: In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studi...
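Correlation matrices map naturally onto MapReduce because Pearson correlation is fully determined by per-partition sums (counts, column sums, and cross-products), and those sufficient statistics merge associatively across partitions. A single-machine sketch of that decomposition follows; NumPy and Python's `map`/`reduce` stand in for the distributed framework, and this is an illustration of the general idea, not the paper's algorithm.

```python
import numpy as np
from functools import reduce

def map_partial(chunk):
    """Map step: per-partition sufficient statistics for Pearson correlation."""
    return len(chunk), chunk.sum(axis=0), chunk.T @ chunk

def reduce_stats(s1, s2):
    """Reduce step: merge partial statistics from two partitions (associative)."""
    return s1[0] + s2[0], s1[1] + s2[1], s1[2] + s2[2]

def correlation_from_stats(n, colsum, crossprod):
    """Assemble the correlation matrix from the merged statistics."""
    cov = crossprod / n - np.outer(colsum, colsum) / n**2
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 4))    # rows: samples, columns: genes
chunks = np.array_split(data, 3)    # three "mapper" partitions
n, s, ss = reduce(reduce_stats, map(map_partial, chunks))
R = correlation_from_stats(n, s, ss)
```

Each mapper only ever sees its own rows, so the expensive all-pairs work reduces to one small matrix merge per partition.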

BMC genomics, Jan 13, 2014
High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries over hundreds of different patients' gene expression records are slow in relational databases due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise as more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited to high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's Big...
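The core idea of a key-value layout for expression data is a composite row key that keeps all values for one query axis physically contiguous, turning a common query into a cheap prefix scan instead of a wide relational join. A dictionary-based sketch of that design follows; the `gene#patient` key scheme and the gene/patient names are hypothetical illustrations, not the paper's schema (in HBase itself, rows are stored sorted by key, which is what makes prefix scans efficient).

```python
# A plain dict stands in for the key-value store; the row-key layout is the point.
table = {}

def put(gene, patient, value):
    """Store one expression value under a composite 'gene#patient' row key."""
    table[f"{gene}#{patient}"] = value

def scan_gene(gene):
    """Fetch a gene's expression across all patients via a key-prefix scan."""
    prefix = f"{gene}#"
    return {key.split("#", 1)[1]: value
            for key, value in sorted(table.items())
            if key.startswith(prefix)}

put("BRCA1", "patient001", 7.2)
put("BRCA1", "patient002", 5.9)
put("TP53", "patient001", 3.4)
```

Flipping the key to `patient#gene` would instead make per-patient profiles contiguous; choosing the key order to match the dominant query is the central design decision in this kind of model.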