High-dimensional inference with correlated data: statistical modeling of protein sequences beyond structural prediction
Alice Coucke (LPT)

Over the last decades, genomic databases have grown exponentially in size
thanks to the constant progress of modern DNA sequencing. A large variety
of statistical tools have been developed, at the interface between
bioinformatics, machine learning, and statistical physics, to extract
information from these ever-increasing datasets. In the specific context
of protein sequence data, several approaches have recently been introduced
by statistical physicists, such as direct-coupling analysis, a global
statistical inference method based on the maximum-entropy principle, which
has proven to be extremely effective in predicting the three-dimensional
structure of proteins from purely statistical considerations.

In this dissertation, we review the relevant inference methods and,
encouraged by their success, discuss their extension to other challenging
fields, such as sequence folding prediction and homology detection.
Unlike residue-residue contact prediction, which relies on
intrinsically topological information about the network of interactions,
these fields require global energetic considerations and therefore a more
quantitative and detailed model. Through an extensive study of both
artificial and biological data, we provide a better interpretation of the
central inferred parameters, until now poorly understood, especially in
the limited-sampling regime. Finally, we present a new and more precise
procedure for the inference of generative models, which leads to further
improvements on real, finitely sampled data.