Olga Permiakova

Scalable machine learning approaches for chromatographic pattern extraction in large-scale mass spectrometry data

Published on 3 May 2021
Thesis presented May 03, 2021

Proteomic analysis consists of determining which proteins are contained in biological samples and in what quantities. Such analysis is often required in fundamental or clinical research to find proteins that are differentially expressed between several conditions, also known as biomarkers. Modern proteomics largely relies on analytical chemistry techniques, notably mass spectrometry (MS) coupled with high-pressure liquid chromatography (LC). To increase the depth and coverage of proteomic analyses, multiplexed LC-MS acquisitions are increasingly used, despite the data-processing challenges they raise. It has recently been shown that some of these challenges can be addressed using chromatogram libraries, which are collections of elementary chromatographic profiles corresponding to the different protein fragments present in the samples. Current state-of-the-art approaches construct the chromatogram library by means of additional (and costly) mass spectrometry experiments. The aim of this work is to construct it numerically, through direct analysis of the LC-MS data using innovative machine learning approaches. Two approaches have been developed. The first, referred to as CHICKN (Chromatogram Hierarchical Compressive K-means with Nyström approximation), clusters the observed elution profiles (defined as the columns of the matrix containing the LC-MS data) and constructs the library from the consensus chromatograms of these clusters. This clustering method operates on a data sketch, as defined in compressive learning theory. Furthermore, the algorithm is compatible with the kernel trick, which is accelerated by means of the Nyström kernel approximation. Finally, we have derived two new kernel functions based on the Wasserstein-1 distance. We have established on real proteomics data that these kernel functions better capture the specificities of LC-MS data.
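To make the kernel and Nyström ideas concrete, here is a minimal Python sketch. It is not the CHICKN implementation: the function names, the exponential form of the kernel, the `gamma` parameter, and the row-wise storage of profiles are all assumptions made for illustration (the abstract does not state the form of the two kernels derived in the thesis). It only shows how a Wasserstein-1 distance between two elution profiles on a common uniform time grid can be computed from their cumulative sums, and how a low-rank Nyström factorization of the resulting kernel matrix is formed from a few landmark profiles.

```python
import numpy as np

def wasserstein1(p, q, dt=1.0):
    """W1 distance between two non-negative profiles on the same uniform grid.

    In one dimension, W1 equals the L1 distance between the two
    cumulative distribution functions.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize each profile to unit mass
    q = q / q.sum()
    return dt * np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def w1_kernel(a, b, gamma=1.0, dt=1.0):
    """Illustrative kernel exp(-gamma * W1); an assumption, not the thesis's kernel."""
    return np.exp(-gamma * wasserstein1(a, b, dt))

def nystrom(profiles, landmark_idx, gamma=1.0):
    """Nystrom factors (C, W_pinv) such that K is approximated by C @ W_pinv @ C.T.

    profiles: array of shape (n, T), one profile per row (hypothetical layout).
    landmark_idx: indices of the m landmark profiles.
    """
    landmarks = profiles[landmark_idx]
    # Kernel values between every profile and every landmark: shape (n, m).
    C = np.array([[w1_kernel(x, l, gamma) for l in landmarks] for x in profiles])
    # Kernel values among the landmarks themselves: shape (m, m).
    W = C[landmark_idx]
    return C, np.linalg.pinv(W)
```

Only the n × m block `C` is ever evaluated, so the cost grows linearly with the number of profiles rather than quadratically, which is the reason the approximation helps on large LC-MS data sets.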
The second approach developed in this thesis is an online dictionary learning algorithm, referred to as SSDL (Sketched Stochastic Dictionary Learning), whose trained dictionary is used as a chromatogram library. This method also relies on compressive learning theory. In addition, its computational efficiency is strengthened by a stochastic version of Nesterov's accelerated gradient descent method. The performance of both methods has been assessed on real LC-MS data. We demonstrated that both lead to the construction of meaningful chromatogram libraries that satisfy all LC-MS data requirements (notably physical interpretability). Moreover, they have a small computational cost and can efficiently build extremely large chromatogram libraries, as required for complex biological samples.
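Both methods operate on a data sketch. As a rough illustration of what such a sketch can look like, the Python snippet below uses a common choice from the compressive learning literature: the empirical average of random Fourier features. The exact sketching operator used in the thesis is not given in this abstract, so the Gaussian frequency distribution, the `scale` parameter, and all function names here are assumptions.

```python
import numpy as np

def draw_frequencies(dim, m, scale=1.0, seed=0):
    """Draw m random frequency vectors of dimension dim (Gaussian, by assumption)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=scale, size=(dim, m))

def sketch(X, omega):
    """Sketch of a data set: the mean of exp(i * x^T omega) over all samples.

    X: array of shape (n, dim), one sample per row.
    Returns a complex vector of length m, whatever the value of n.
    """
    return np.exp(1j * (X @ omega)).mean(axis=0)

def merge(z1, n1, z2, n2):
    """Combine the sketches of two chunks holding n1 and n2 samples."""
    return (n1 * z1 + n2 * z2) / (n1 + n2)
```

Because the sketch is an average, it can be computed in a single pass over the data, chunk by chunk, and its size does not depend on the number of samples. This is what makes sketch-based learning attractive for very large LC-MS acquisitions.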

Mass spectrometry, Demultiplexing, Machine learning

On-line thesis.