
Introduction

SoNNia is a Python (>=3.6) package that extends the functionality of the SONIA package by expanding the choice of selection models that can be inferred. Non-linear single-chain models and (non-)linear paired-chain models are included in the package. The pre-processing pipeline implemented in the corresponding paper is also included as a separate class. Finally, a likelihood-ratio classifier and a linear logistic classifier for functional annotation are included and can be applied directly to T- and B-cell receptor repertoire datasets.

Version

Latest released version: 0.2.3

Installation

SoNNia is a python /3.6 software. It is available on PyPI and can be downloaded and installed through pip:

pip install sonnia

SoNNia is also available on GitHub. The command line entry points can be installed by using the setup.py script:

python setup.py install

Sometimes pip fails to install the dependencies correctly. Thus, if you get any error, first try installing the dependencies separately:

pip install tensorflow
pip install matplotlib
pip install olga

For Mac users on Apple-silicon (Metal) devices, make sure to install the additional Apple-specific dependencies. Currently, the combination tensorflow-macos==2.9 and tensorflow-metal==0.5.0 is known to work.

References

Isacchini G, Walczak AM, Mora T, Nourmohammad A, Deep generative selection models of T and B cell receptor repertoires with soNNia, (2021) PNAS, https://www.pnas.org/content/118/14/e2023141118.short

SoNNia modules in a Python script

In order to incorporate the core algorithm into an analysis pipeline (or to write your own script wrappers), all that is needed is to import the modules. Each module defines a small number of classes, on which only a few methods need to be called.

The modules are:

Module name        Classes
sonia_paired.py    SoniaPaired
sonnia_paired.py   SoNNiaPaired
sonnia.py          SoNNia
sonia.py           Sonia
utils.py           N/A (contains utility functions)
processing.py      Processing
classifiers.py     Linear, SoniaRatio

The classes SoniaPaired, SoNNiaPaired, and SoNNia behave similarly to the classes defined in the SONIA package. As an example, the basic import and initialization of the single-chain SoniaLeftposRightpos model

from sonia.sonia_leftpos_rightpos import SoniaLeftposRightpos
qm=SoniaLeftposRightpos()

translates into the deep version as

from sonnia.sonnia import SoNNia
qm=SoNNia()

translates into the linear paired-chain (i.e. alpha-beta for TCRs) version as

from sonnia.sonia_paired import SoniaPaired
qm=SoniaPaired()

translates into the deep paired (i.e. alpha-beta for TCRs) version as

from sonnia.sonnia_paired import SoNNiaPaired
qm=SoNNiaPaired()

SoNNia retains all the functionality of SONIA. For example, you can infer a linear SONIA model with SoNNia using the following definition of the model:

from sonnia.sonia import Sonia
qm=Sonia()

The examples folder contains a Python notebook (and, alternatively, the example_pipeline script) which demonstrates the main features of the software. The fig2_paper folder contains all scripts and explanations needed to reproduce figure 2 of the soNNia paper (TODO: this needs to be updated to the new model behaviour).

Command line console scripts

There are three command line console scripts (the scripts can still be called as executables if SoNNia is not installed):

  1. sonnia-evaluate
  • evaluates Ppost, Pgen, or selection factors of sequences according to a generative V(D)J model and a selection model.
  2. sonnia-generate
  • generates CDR3 sequences, before (like OLGA) or after selection.
  3. sonnia-infer
  • infers a selection model with respect to a generative V(D)J model.

Execute any of them with the -h or --help flag to see the available options.

We offer a quick demonstration of the console scripts. This will show how to generate and evaluate sequences and infer a selection model using the default generation model for human TCR beta chains that ships with the SONIA software. In order to run the commands below you need to download the examples folder.

  1. $ sonnia-infer --humanTRB -i examples/data_seqs.txt -d ';' -m 10000
  • This reads in the first 10000 sequences of examples/data_seqs.txt, infers a selection model, and saves it to the folder sel_model.
  2. $ sonnia-generate --set_custom_model_VDJ examples/sonnia_model --post -n 100
  • This generates 100 human TRB CDR3 sequences from the post-selection repertoire and prints them to stdout along with the V and J genes used to generate them.
  3. $ sonnia-evaluate --set_custom_model_VDJ examples/sonnia_model -i examples/data_seqs.txt --ppost -m 100 -d ';'
  • This computes Ppost, Pgen, and Q of the first 100 sequences in the data_seqs file.
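
The -d ';' flag above sets the input delimiter. Assuming the input file stores one receptor per line in a CDR3;V;J column layout (this layout is an assumption for illustration, not documentation of the actual file format), such a file can be read with the standard csv module:

```python
import csv
import io

# Hypothetical example input in the assumed CDR3;V;J layout.
raw = "CASSLAPGATNEKLFF;TRBV5-1;TRBJ1-4\nCASSDTSGGAYEQYF;TRBV6-4;TRBJ2-7\n"

seqs = []
# Replace io.StringIO(raw) with open('examples/data_seqs.txt') for a real file.
with io.StringIO(raw) as handle:
    for row in csv.reader(handle, delimiter=';'):
        cdr3, v_gene, j_gene = row
        seqs.append((cdr3, v_gene, j_gene))

print(len(seqs))  # -> 2
```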

Notes about CDR3 sequence definition and Dataset size

This code is quite flexible; however, it demands a very consistent definition of CDR3 sequences.

CHECK THE DEFINITION OF THE CDR3 REGION OF THE SEQUENCES YOU INPUT. This is likely to be the most common problem.

The default models/genomic data are set up to define the CDR3 region from the conserved cysteine C (INCLUSIVE) in the V region to the conserved F or W (INCLUSIVE) in the J. This corresponds to positions X and X according to IMGT.
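
A minimal sanity check of that convention (a hypothetical helper sketch, not part of the package) would verify that each CDR3 amino-acid sequence starts at the conserved cysteine and ends at the conserved F or W:

```python
def looks_like_cdr3(seq: str) -> bool:
    """Check the inclusive C ... F/W convention described above."""
    return len(seq) >= 2 and seq[0] == 'C' and seq[-1] in ('F', 'W')

print(looks_like_cdr3('CASSLAPGATNEKLFF'))  # -> True
print(looks_like_cdr3('ASSLAPGATNEKLF'))    # -> False (missing leading C)
```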

Neural-network models suffer from overfitting in the low-data regime. While appropriate regularization can reduce the risk of overfitting, we recommend using the linear SONIA model for datasets with fewer than 100,000 receptor sequences.
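
That rule of thumb can be encoded directly; the helper below is purely illustrative (the 100,000-sequence threshold is the only guidance given above, and the function is not part of SoNNia):

```python
# Illustrative helper (not part of SoNNia): pick the model family
# suggested by the dataset-size guideline above.
def recommended_model(n_seqs: int, threshold: int = 100_000) -> str:
    """Return 'Sonia' (linear) below the threshold, 'SoNNia' (deep) otherwise."""
    return "Sonia" if n_seqs < threshold else "SoNNia"

print(recommended_model(50_000))   # -> Sonia
print(recommended_model(500_000))  # -> SoNNia
```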

Contact

Any issues or questions should be addressed to us.

License

Free use of soNNia is granted under the terms of the GNU General Public License version 3 (GPLv3).


Issues

Feature request: consolidating chain_type and custom_pgen_model into one argument: pgen_model

It's never quite made complete sense to me why there are two arguments for essentially the same thing. chain_type specifies the default OLGA models that can be used whereas custom_pgen_model specifies a custom OLGA model. In the past, loading a SONIA/SoNNia model didn't guarantee that the OLGA model in the SONIA/SoNNia folder would be loaded and custom_pgen_model had to be specified also as the SONIA folder. That appears to be remedied in this iteration of the software. However, I think it would be cleaner if chain_type and custom_pgen_model were merged into a single keyword: pgen_model.

The class initialization definition would appear as:

from typing import Iterable, List, Optional, Tuple
class Sonia(object):
    def __init__(self,
                 features: List[Iterable[str]] = [],
                 data_seqs: List[Iterable[str]] = [],
                 gen_seqs: List[Iterable[str]] = [],
                 load_dir: Optional[str] = None,
                 pgen_model: Optional[str] = None, # pgen_model keyword parameter here
                 max_depth: int = 25,
                 max_L: int = 30,
                 load_seqs: bool = True,
                 l2_reg: float = 0.,
                 l1_reg: float = 0.,
                 min_energy_clip: int = -5,
                 max_energy_clip: int = 10,
                 seed: Optional[int] = None,
                 vj: bool = False,
                 processes: Optional[int] = None,
                 objective: str = 'BCE',
                 gamma: int = 1,
                 include_joint_genes: bool = False,
                 joint_vjl: bool = False,
                 include_aminoacids: bool =True
                ) -> None:
        # Other initializations unchanged.
        if load_dir is None and pgen_model is None:
            raise ValueError('Both load_dir and pgen_model cannot be None.')
        # Use pgen_model in the SONIA/SoNNia model folder.
        if pgen_model is None:
            self.pgen_model = load_dir
        else:
            self.pgen_model = pgen_model

utils.define_pgen_model is used to load the pgen model and contains the necessary checks:

import os

import olga.load_model as olga_load_model
import olga.generation_probability as pgen
import olga.sequence_generation as seq_gen

DEFAULT_CHAIN_TYPES = {'humanTRA': 'human_T_alpha', 'human_T_alpha': 'human_T_alpha',
                       'humanTRB': 'human_T_beta', 'human_T_beta': 'human_T_beta',
                       'humanIGH': 'human_B_heavy', 'human_B_heavy': 'human_B_heavy',
                       'humanIGK': 'human_B_kappa', 'human_B_kappa': 'human_B_kappa',
                       'humanIGL': 'human_B_lambda', 'human_B_lambda': 'human_B_lambda',
                       'mouseTRB': 'mouse_T_beta', 'mouse_T_beta': 'mouse_T_beta',
                       'mouseTRA': 'mouse_T_alpha','mouse_T_alpha':'mouse_T_alpha'}

def define_pgen_model(pgen_model: str,
                      vj: bool = False,
                      return_files: bool = False
                     ):
    if pgen_model in DEFAULT_CHAIN_TYPES:
        pgen_model = DEFAULT_CHAIN_TYPES[pgen_model]
        filedir = os.path.dirname(os.path.abspath(__file__))
        main_folder = os.path.join(filedir, 'default_models', pgen_model)
    elif os.path.isdir(pgen_model):
        main_folder = pgen_model
    else:
        options = f'{list(DEFAULT_CHAIN_TYPES.keys())[::2]}'[1:-1]
        raise ValueError('pgen_model is neither a directory that exists '
                         f'nor one of the default options ({options}). '
                         'Try using one of the default options or an existing '
                         'directory containing the pgen model.')

    pgen_files = ('model_params.txt', 'model_marginals.txt',
                  'V_gene_CDR3_anchors.csv', 'J_gene_CDR3_anchors.csv')
    files_in_dir = set(os.listdir(main_folder))
    missing_files = set(pgen_files) - files_in_dir

    if len(missing_files) > 0:
        missing_files = f'{missing_files}'[1:-1]
        raise RuntimeError('The pgen model cannot be loaded. The following files '
                           f'are missing: {missing_files}.')

    params_file_name = os.path.join(main_folder, pgen_files[0])
    marginals_file_name = os.path.join(main_folder, pgen_files[1])
    V_anchor_pos_file = os.path.join(main_folder, pgen_files[2])
    J_anchor_pos_file = os.path.join(main_folder, pgen_files[3])

    if vj:
        model_str = 'VJ'
    else:
        model_str = 'VDJ'

    genomic_data = getattr(olga_load_model, f'GenomicData{model_str}')()
    genomic_data.load_igor_genomic_data(params_file_name,
                                        V_anchor_pos_file,
                                        J_anchor_pos_file)
    generative_model = getattr(olga_load_model, f'GenerativeModel{model_str}')()
    generative_model.load_and_process_igor_model(marginals_file_name)
    pgen_model = getattr(pgen, f'GenerationProbability{model_str}')(generative_model, genomic_data)
    seqgen_model = getattr(seq_gen, f'SequenceGeneration{model_str}')(generative_model, genomic_data)

    to_return = (genomic_data, generative_model, pgen_model, seqgen_model)

    if return_files:
        return to_return + (params_file_name, marginals_file_name, V_anchor_pos_file, J_anchor_pos_file)
    return to_return

I show these two examples to give an idea of what this would look like. Corrections elsewhere in the software would be made accordingly if accepted.

I think this would enhance the software and remedy confusion that has, at the very least, happened to me.

nans obtained when training the model with very different pre and post dataset sizes

Description of the issue:

When training the model with n_pre "pre data" sequences and n_post "post data" sequences, if n_pre >> n_post or n_post >> n_pre, the training completes without any error, but the loss and likelihood values (of both the training and validation sets) are all NaNs.

Minimal example of the issue:

In this example I generate the pre and post sequences using soNNia itself, but the same happened with my actual dataset. I only show the case n_pre >> n_post, but the case n_pre << n_post also gives NaNs.

from sonnia.sonnia import SoNNia

# instantiate model
qm = SoNNia()

# prepare some example sequences
n_tot = 100000
qm.add_generated_seqs(num_gen_seqs = n_tot, reset_gen_seqs = True)

# prepare pre and post data
n_pre = 100000
n_post = 1000
seqs_post = qm.gen_seqs[:n_post]
seqs_pre = qm.gen_seqs[-n_pre:]

# register the post data as data_seqs (gen_seqs already hold the pre data)
qm.update_model(add_data_seqs = seqs_post)

# model training (here I get nans)
qm.infer_selection(epochs = 20, batch_size = 1000, validation_split = 0.05, verbose = 1)

Error message:

No specific error message is given; the NaNs simply appear in the printed output of qm.infer_selection (with verbose=1).

how to compare repertoires

Hello,
Is it currently possible to compare repertoires using the JSD function implemented in your pipeline?
I see that you have a method to compute it, however I can't seem to find any explanation of how to use it.
Thanks!
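
For reference, this is what a Jensen-Shannon divergence between two repertoire frequency distributions computes; the sketch below is a generic stdlib implementation, not documentation of soNNia's own JSD method:

```python
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence (base 2, in bits) between two probability
    distributions given as dicts mapping item -> frequency."""
    support = set(p) | set(q)
    total = 0.0
    for x in support:
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = 0.5 * (px + qx)  # mixture distribution
        if px > 0:
            total += 0.5 * px * log2(px / mx)
        if qx > 0:
            total += 0.5 * qx * log2(qx / mx)
    return total

rep1 = {'CASSF': 0.5, 'CASSW': 0.5}
rep2 = {'CASSF': 0.5, 'CASSW': 0.5}
print(jsd(rep1, rep2))  # -> 0.0 (identical repertoires)
```

The JSD is 0 for identical distributions and reaches 1 bit for distributions with disjoint support.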
