neksa / mutagene

Python library and package for mutational analysis

Home Page: https://www.ncbi.nlm.nih.gov/projects/mutagene/

License: Other

Python 3.88% Shell 0.01% Jupyter Notebook 96.12%

mutations cancer-genomics cancer cancer-variants mutagenesis genomics machine-learning ncbi mutational-signatures mutational-significance-analysis

mutagene's Introduction

What is MutaGene?

MutaGene is a set of methods for the analysis of mutations and mutational processes in cancer. The MutaGene Python package consists of command-line tools that provide direct access to some of the functions available on MutaGene's website https://www.ncbi.nlm.nih.gov/research/mutagene/.

The MutaGene software package includes 5 subpackages: fetch, profile, rank, motif, and signature. Each subpackage has a unique functionality.

How can I install and use MutaGene?

Installation: The MutaGene package is available from PyPI, the standard Python package repository. The package requires Python 3.7 or higher.

It is recommended to use venv or conda to create an isolated environment for the mutagene package:

python3 -m venv env_mutagene
source env_mutagene/bin/activate

Then install from PyPI:

pip install mutagene

Usage: The MutaGene package can be run from the command line.

usage: mutagene [-v] [-V] [-h] {fetch, profile, rank, motif, signature} ...

MutaGene version 0.9.1.0 - Analysis of mutational processes and driver mutations

Global optional arguments:
  -v, --verbose         Print additional messages (-v, -vv)
  -V, --version         Show version and exit
  -h, --help            Show this help message and exit

Choose MutaGene subpackage:
          fetch - Load data such as genomes and cancer datasets from remote sources (alias: download)
          profile - Create a mutational profile given a sample with mutations
          rank - Predict driver mutations by ranking observed mutations with respect to their expected mutability
          motif - Test samples for presence of mutational motifs
          signature - Identify activity of existing mutational signatures in samples or derive new signatures (aliases: identify, decompose)

  {fetch, profile, rank, motif, signature}

What are MutaGene's five subpackages?

MutaGene Fetch allows you to download sample data, cohorts, and human genome reference sequences.
MutaGene Profile allows you to analyze any set of mutations from one or several cancer samples to identify cancers of unknown primary tumor site, to detect the most likely mutational process, and to distinguish tumorigenic from normal or benign mutation sets.
MutaGene Rank allows you to rank mutations with respect to their driver status in a given sample or cohort in batch mode, using pre-calculated or user-provided mutational profiles.
MutaGene Motif allows you to analyze the presence of mutational motifs in genomic data.
MutaGene Signature allows you to analyze the presence of mutational signatures in genomic data.

module fetch

usage: mutagene fetch [-h] {examples,cohorts,genome} ...

Download data from remote repositories and API

optional arguments:
  -h, --help            show this help message and exit

subcommands:
  Choose data source

  {examples,cohorts,genome}
                        additional help available for subcommands
    cohorts             cohorts help

module profile

 mutagene profile --help
usage: mutagene profile [-h] [--infile [INFILE ...]] [--outfile [OUTFILE]] [--genome GENOME] [--input-format INPUT_FORMAT]

positional arguments:


optional arguments:
  -h, --help            show this help message and exit
  --infile [INFILE ...], -i [INFILE ...]
                        Input file format
  --outfile [OUTFILE], -o [OUTFILE]
                        Name of output file, will be generated in TSV format
  --genome GENOME, -g GENOME
                        Location of genome assembly file
  --input-format INPUT_FORMAT, -f INPUT_FORMAT
                        Input format: auto, MAF, VCF

module rank

usage: mutagene rank [-h] [--infile INFILE] [--genome GENOME] [--outfile [OUTFILE]] [--cohorts-file [COHORTS_FILE]] [--cohort [COHORT]] [--profile PROFILE] [--nsamples NSAMPLES]
                     [--threshold-driver THRESHOLD_DRIVER] [--threshold-passenger THRESHOLD_PASSENGER]

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  --infile INFILE, -i INFILE
                        Input file in MAF format
  --genome GENOME, -g GENOME
                        Location of genome assembly file in 2bit format

Optional arguments:
  --outfile [OUTFILE], -o [OUTFILE]
  --cohorts-file [COHORTS_FILE]
                        Location of tar.gz container or directory for cohorts
  --cohort [COHORT], -c [COHORT]
                        Name of cohort with observed mutations

Advanced arguments:
  --profile PROFILE, -p PROFILE
                        Override profile to calculate mutability, may also describe cohort size
  --nsamples NSAMPLES, -n NSAMPLES
                        Override cohort size
  --threshold-driver THRESHOLD_DRIVER, -td THRESHOLD_DRIVER
                        BScore threshold between Driver and Potential Driver mutations
  --threshold-passenger THRESHOLD_PASSENGER, -tp THRESHOLD_PASSENGER
                        BScore threshold between Potential Driver and Passenger mutations

module motif

usage: mutagene motif [-h] [--infile INFILE] [--genome GENOME] [--input-format {MAF,VCF}] [--motif MOTIF] [--outfile [OUTFILE]] [--window-size WINDOW_SIZE] [--strand {T,N,A,TNA}]
                      [--threshold THRESHOLD] [--save-motif-matches SAVE_MOTIF_MATCHES] [--test {Fisher,Chi2}]

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  --infile INFILE, -i INFILE
                        Input file in MAF or VCF format with one or multiple samples
  --genome GENOME, -g GENOME
                        Location of genome assembly file in 2bit format

Optional arguments:
  --input-format {MAF,VCF}, -f {MAF,VCF}
                        Input format: MAF, VCF
  --motif MOTIF, -m MOTIF
                        Motif to search for, use the 'R[C>T]GY' syntax for the motif. Use quotes
  --outfile [OUTFILE], -o [OUTFILE]
                        Name of output file, will be generated in TSV format


Advanced arguments:
  --window-size WINDOW_SIZE, -w WINDOW_SIZE
                        Context window size for motif search, default setting is 50
  --strand {T,N,A,TNA}, -s {T,N,A,TNA}
                        Transcribed strand (T), non-transcribed (N), any (A), or all (TNA default)
  --threshold THRESHOLD, -t THRESHOLD
                        Significance threshold for qvalues, default value=0.05
  --save-motif-matches SAVE_MOTIF_MATCHES
                        Save mutations in matching motifs to a BED file
  --test {Fisher,Chi2}  Statistical test to use

Examples:
# search in sample2.vcf for all preidentified motifs in mutagene using hg19
mutagene motif --infile sample2.vcf --input-format VCF --genome hg19

# search for the presence of the C[A>T] motif in sample1.maf using hg19 not checking for strand-specificity
mutagene motif --infile sample1.maf --input-format MAF --genome hg19 --motif 'C[A>T]' --strand A
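
To make the motif syntax concrete, here is a minimal Python sketch of how a string such as 'R[C>T]GY' could be split into its 5' context, reference base, alternate base, and 3' context, with IUPAC nucleotide codes expanded into a regular expression. This is not MutaGene's own parser; parse_motif, context_regex, and MOTIF_RE are hypothetical names used only for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Hypothetical pattern: context letters, then [ref>alt], then context letters.
MOTIF_RE = re.compile(r"^([A-Z]*)\[([ACGT])>([ACGT])\]([A-Z]*)$")


def parse_motif(motif):
    """Split a motif such as 'R[C>T]GY' into (5' context, ref, alt, 3' context)."""
    match = MOTIF_RE.match(motif.upper())
    if match is None:
        raise ValueError(f"Cannot parse motif: {motif!r}")
    five_prime, ref, alt, three_prime = match.groups()
    for code in five_prime + three_prime:
        if code not in IUPAC:
            raise ValueError(f"Unknown nucleotide code {code!r} in {motif!r}")
    return five_prime, ref, alt, three_prime


def context_regex(motif):
    """Build a regex that matches the reference sequence context of the motif."""
    five_prime, ref, _alt, three_prime = parse_motif(motif)
    pattern = (
        "".join(IUPAC[c] for c in five_prime)
        + ref
        + "".join(IUPAC[c] for c in three_prime)
    )
    return re.compile(pattern)
```

With this sketch, parse_motif("C[A>T]") yields a 5' context of "C", reference base "A", alternate base "T", and an empty 3' context.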

module signature

usage: mutagene signature [-h] [--infile INFILE] [--genome GENOME] [--signatures {5,10,30,49,53,MGA,MGB,COSMICv2,COSMICv3,KUCAB}] [--input-format {MAF,VCF,TCGI}] [--outfile [OUTFILE]]
                          [--method [METHOD]] [--no-unexplained-variance] [--mutations-threshold MUTATIONS_THRESHOLD] [--keep-only KEEP_ONLY] [--bootstrap]
                          [--bootstrap-replicates BOOTSTRAP_REPLICATES] [--bootstrap-confidence-level BOOTSTRAP_CONFIDENCE_LEVEL] [--bootstrap-method {t,p}]

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  --infile INFILE, -i INFILE
                        Input file in VCF or MAF format
  --genome GENOME, -g GENOME
                        Location of genome assembly file in 2bit format
  --signatures {5,10,30,49,53,MGA,MGB,COSMICv2,COSMICv3,KUCAB}, -s {5,10,30,49,53,MGA,MGB,COSMICv2,COSMICv3,KUCAB}
                        Collection of signatures to use

Optional arguments:
  --input-format {MAF,VCF,TCGI}, -f {MAF,VCF,TCGI}
                        Input format: MAF, VCF
  --outfile [OUTFILE], -o [OUTFILE]
                        Name of output file, will be generated in TSV format

Advanced arguments:
  --method [METHOD], -m [METHOD]
                        Method defines the function minimized in the optimization procedure
  --no-unexplained-variance, -U
                        Do not account for unexplained variance (non-context dependent mutational processes and unknown signatures)
  --mutations-threshold MUTATIONS_THRESHOLD, -t MUTATIONS_THRESHOLD
                        Only report signatures with mutations above the threshold
  --keep-only KEEP_ONLY, -k KEEP_ONLY
                        Keep only the signatures in the list, separated by commas e.g. 1,3,5

Bootstrap-specific arguments:
  --bootstrap, -b       Use the bootstrap to calculate confidence intervals
  --bootstrap-replicates BOOTSTRAP_REPLICATES, -br BOOTSTRAP_REPLICATES
                        Number of bootstrap replicates
  --bootstrap-confidence-level BOOTSTRAP_CONFIDENCE_LEVEL, -bcl BOOTSTRAP_CONFIDENCE_LEVEL
                        Confidence level
  --bootstrap-method {t,p}, -bm {t,p}
                        Bootstrap method (t: t-distribution, p: percentile)
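
The percentile option (-bm p) can be illustrated with a generic percentile bootstrap. The following is a minimal sketch of the general technique, not MutaGene's implementation; the function names, their parameters, and the toy data are assumptions made for illustration.

```python
import random


def percentile_bootstrap_ci(values, statistic, replicates=1000,
                            confidence_level=0.95, seed=0):
    """Return (lower, upper) percentile-bootstrap bounds for statistic(values)."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(replicates):
        # Resample the observations with replacement, keeping the sample size.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    alpha = 1.0 - confidence_level
    # Read the bounds off the sorted bootstrap estimates.
    lower = estimates[int((alpha / 2) * (replicates - 1))]
    upper = estimates[int((1 - alpha / 2) * (replicates - 1))]
    return lower, upper


def fraction_c_to_t(mutations):
    """Fraction of substitutions in the sample that are C>T."""
    return sum(1 for m in mutations if m == "C>T") / len(mutations)


# Toy sample: 60 C>T and 40 C>A substitutions (made-up data).
toy_mutations = ["C>T"] * 60 + ["C>A"] * 40
ci_lower, ci_upper = percentile_bootstrap_ci(toy_mutations, fraction_c_to_t)
```

The replicates and confidence_level parameters play the same role as the -br and -bcl options above, although the defaults chosen here are arbitrary.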

How can I cite MutaGene?

If you use MutaGene, you should cite:

Goncearenco A, Rager SL, Li M, Sang Q, Rogozin IB, Panchenko AR. Exploring background mutational processes to decipher cancer genetic heterogeneity. Nucleic Acids Res. 2017; 45(W1):W514–W522. https://doi.org/10.1093/nar/gkx367

Additionally, if you use the MutaGene rank subpackage, you should cite the driver prediction method as:

Brown AL, Li M, Goncearenco A, Panchenko AR. Finding driver mutations in cancer: Elucidating the role of background mutational processes. PLOS Computational Biology 2019; 15(4): e1006981. https://doi.org/10.1371/journal.pcbi.1006981


mutagene's Issues

signature identify documentation and help

The [Method] argument is not clearly described: target functions and optimization methods are mixed up. Some "methods" that do or do not account for context-independent processes contradict the "--no-unexplained-variance" option.
There are too many methods that are not necessary.

webserver error

"Identify" returns an error message for a small one sample file:
"Error processing the sample. The file could be too large for web processing - please try again. If error persists we recommend using stand-alone mutagene package. In some cases input format could not be parsed correctly."

genome list

There is no option to list the genomes available to fetch, the way cohorts can be listed.

Redesign rank subpackage

Simplify the rank subpackage so that everything is calculated from one input file.

Genome assembly does not match

"mutagene rank" does not report an error when the specified genome assembly does not match the one in the MAF file.

provides no list of cohorts

$ mutagene fetch cohorts --list
WARNING Looks like the file has been downloaded already: cohorts.tar.gz

motif str requirement in cli

If the motif following -m is not quoted, the program works unless the user uses > to represent the mutation.

If the user uses >, the shell treats it as output redirection, so the motif is truncated before the >.

ex: if Y[C>T]N is entered, the program will search for Y[C
ex: if Y[C.T]N is entered, the program will search for Y[C>T]N as specified
ex: if 'Y[C>T]N' is entered, the program will search for Y[C>T]N as specified

Problem: having to quote the motif manually is confusing to the user, especially because it sometimes works without quotes.

Ways to fix:

automatically convert the argument to a string anyway
fail with an error message whenever the user doesn't use quotes, just to be consistent
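
The second idea could be approximated by validating the motif while arguments are parsed, so that a motif truncated by the shell fails immediately with a hint about quoting. This is only a sketch with Python's argparse: motif_argument, MOTIF_PATTERN, and the "motif-demo" parser are hypothetical and are not part of the mutagene package.

```python
import argparse
import re

# Accepts both the '>' and the '.' separator mentioned in the examples above.
MOTIF_PATTERN = re.compile(r"^[A-Za-z]*\[[ACGTacgt][>.][ACGTacgt]\][A-Za-z]*$")


def motif_argument(value):
    """argparse type function: reject motifs that are incomplete or malformed."""
    if not MOTIF_PATTERN.match(value):
        raise argparse.ArgumentTypeError(
            f"cannot parse motif {value!r}; "
            "quote it on the command line, e.g. 'Y[C>T]N'"
        )
    return value


parser = argparse.ArgumentParser(prog="motif-demo")
parser.add_argument("--motif", "-m", type=motif_argument,
                    help="Motif to search for, e.g. 'Y[C>T]N'. Use quotes")
```

A truncated value such as Y[C then stops the run with an argparse error instead of silently searching for the wrong motif.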

Ranking

In the mutagene rank module, the output should specify what was used as the profile, the cohort size, and the source of the observed mutation counts.

arguments in motif search

Currently the thresholds ("-t") are not included in "advanced arguments". We should also include thresholds on individual p-values.

It is not clear what “TN”, “NA”, etc. mean.
I would call these options “transcribed”, “non-transcribed”, “any”, or “both” (“TN”?).

--save-motif-matches
“save mutations in matching motifs to a BED file”

Ranking, can not override profile

$ mutagene -v rank -g hg38 -i TCGA.COAD.mutect.somatic.maf --profile TCGA.COAD.mutect.somatic.maf
INFO Loaded 230248 mutations
INFO Loaded 190240 protein mutations in 399 samples skipped 74543 protein mutations
INFO Cohort size: 399
INFO Profile overridden
INFO THRESHOLD_DRIVER: 8.030647e-05
INFO THRESHOLD_PASSENGER: 0.003440945
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/__main__.py", line 113, in <module>
MutaGeneApp()
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/__main__.py", line 104, in __init__
parser_class.callback(args)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/cli/rank_menu.py", line 114, in callback
rank(protein_mutations, args.outfile, profile, cohort_aa_mutations, cohort_size, args.threshold_driver, args.threshold_passenger)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/mutability/mutability.py", line 138, in rank
mutation_model = calculate_base_substitution_mutability(profile, cohort_size)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/mutability/mutability.py", line 111, in calculate_base_substitution_mutability
assert len(counts_profile) == 96
AssertionError

Gene transcripts

Add a description of how gene transcripts are handled in Ranking. Choose the single longest transcript per gene.

non-informative error message

Specifying the wrong genome assembly returns an error about the maf format:
(env_mutagene) Annas-MacBook-Pro:mutagene panch$ mutagene -v rank -g hg18 -i TCGA.UCEC.mutect.somatic.maf
WARNING Parsing MAF failed. Check that the input file is in MAF format or specify a different format using option -f
FileNotFoundError(2, 'No such file or directory')

pvalue

MutaGene currently reports only one p-value. Should functionality be added to let the user choose which p-values are reported?

motif search

It is not clear which motif is used by default. If the "--motif" argument is missing, the program hangs for a long time and then returns an error. It should instead print a warning, "motif is not specified", at the start.

mutagene motif -i sample.maf

Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/__main__.py", line 113, in <module>
MutaGeneApp()
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/__main__.py", line 104, in __init__
parser_class.callback(args)
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/cli/motif_menu.py", line 149, in callback
MotifMenu.search(args)
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/cli/motif_menu.py", line 127, in search
dump_matches=args.save_motif_matches) if mutations_with_context is not None else []
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/motifs/__init__.py", line 46, in identify_motifs
motifs = get_known_motifs()
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/io/motifs.py", line 21, in get_known_motifs
with open(fname, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/io/../data/motifs/motifs.json'
(venv) [hpc4498@caclogin02 mutagene]$
(venv) [hpc4498@caclogin02 mutagene]$ mutagene motif -i TCGA.COAD.mutect.somatic.maf -w 250 -t 0.1 -o tmp1 -g hg38
Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/__main__.py", line 113, in <module>
MutaGeneApp()
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/__main__.py", line 104, in __init__
parser_class.callback(args)
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/cli/motif_menu.py", line 149, in callback
MotifMenu.search(args)
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/cli/motif_menu.py", line 127, in search
dump_matches=args.save_motif_matches) if mutations_with_context is not None else []
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/motifs/__init__.py", line 46, in identify_motifs
motifs = get_known_motifs()
File "/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/io/motifs.py", line 21, in get_known_motifs
with open(fname, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/global/home/hpc4498/venv/lib/python3.7/site-packages/mutagene/io/../data/motifs/motifs.json'

Package documentation

auto-generate package CLI and API documentation
Use readthedocs to host it

optional args

Optional arguments need some sort of separation or clarification.

The genome and input file are listed as optional but are required to run.
The motif, output file, etc. aren't required and are also listed as optional.

The user can't see the distinction between these groups.
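
One way to make the distinction visible is to place arguments in titled argparse groups and mark the mandatory ones as required. The sketch below uses only standard argparse features; the "motif-demo" parser and the selection of arguments are illustrative and are not taken from the mutagene source.

```python
import argparse

# add_help=False so that -h can be listed inside a titled group below.
parser = argparse.ArgumentParser(prog="motif-demo", add_help=False)

required = parser.add_argument_group("Required arguments")
required.add_argument("--infile", "-i", required=True,
                      help="Input file in MAF or VCF format")
required.add_argument("--genome", "-g", required=True,
                      help="Location of genome assembly file in 2bit format")

optional = parser.add_argument_group("Optional arguments")
optional.add_argument("--motif", "-m", help="Motif to search for")
optional.add_argument("--outfile", "-o", help="Name of output file")
optional.add_argument("-h", "--help", action="help",
                      help="Show this help message and exit")

# The generated help lists each group under its own title.
help_text = parser.format_help()
```

With required=True, argparse itself reports a missing --infile or --genome, so the help text and the actual behavior stay consistent.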

Issues with motif_doc.rst

There appear to be syntactic issues with motif_doc.rst along with formatting, numbering, and other issues. In addition, the documentation should be reviewed for accuracy and completeness.

ambiguous error message

Specifying the wrong format type (VCF instead of MAF) produces an uninformative message:

(env_mutagene) Annas-MBP:mutagene panch$ mutagene signature identify -i data_mutations_extended.txt -g hg19 -s 49 -U -b -f VCF
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/__main__.py", line 113, in <module>
MutaGeneApp()
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/__main__.py", line 104, in __init__
parser_class.callback(args)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/cli/signature_menu.py", line 117, in callback
self.identify(args)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/cli/signature_menu.py", line 94, in identify
mutations, processing_stats = read_auto_profile(args.infile, fmt=args.input_format, asm=args.genome)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/io/mutations_profile.py", line 115, in read_auto_profile
mutations, processing_stats = read_VCF_profile(mutations_lines, asm)
File "/Users/panch/mutagene/env_mutagene/lib/python3.7/site-packages/mutagene/io/mutations_profile.py", line 234, in read_VCF_profile
pos = int(col_list[1]) # VCF POS
ValueError: invalid literal for int() with base 10: 'Entrez_Gene_Id'
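
A more informative message could come from checking the first lines of the input before parsing. The sketch below assumes that a VCF file starts with "##fileformat=VCF" or has a "#CHROM" header line, and that a MAF header row contains a Hugo_Symbol column; sniff_mutation_format and check_declared_format are hypothetical helpers, not functions from the mutagene package.

```python
def sniff_mutation_format(lines):
    """Guess 'VCF' or 'MAF' from the leading lines, or None if unclear."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("##fileformat=VCF") or line.startswith("#CHROM"):
            return "VCF"
        if line.startswith("#"):
            continue  # other comment lines, e.g. a MAF '#version' line
        # First non-comment line: a MAF header row names its columns.
        if "Hugo_Symbol" in line.split("\t"):
            return "MAF"
        return None
    return None


def check_declared_format(declared, lines):
    """Raise a readable error when the declared format contradicts the content."""
    detected = sniff_mutation_format(lines)
    if detected is not None and detected != declared.upper():
        raise ValueError(
            f"Input looks like {detected} but format {declared} was requested; "
            "check the -f option"
        )
    return declared.upper()
```

In the case reported above, the header row beginning with Hugo_Symbol and Entrez_Gene_Id would be recognized as MAF, and the user would be told to check -f instead of seeing an int() conversion error.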

Do not show VCF MAF format option in rank module

Currently the rank module does not support VCF.

mutagene signature reporting

When running mutagene signature identify, the signatures are reported as 1, 2, 3, etc., instead of Mutagene-1, Mutagene-2, ... or COSMIC-1, COSMIC-2, etc. This could confuse the user.
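
A small lookup from signature set to name prefix would be enough to produce such labels. The sketch below is hypothetical: the mapping covers only the sets 5, 10, and 30, which the existing warning message associates with MUTAGENE and COSMIC, and label_signature is not a function from the package.

```python
# Hypothetical mapping from the -s value to a display prefix.
SET_PREFIXES = {"5": "Mutagene", "10": "Mutagene", "30": "COSMIC"}


def label_signature(signature_set, signature_id):
    """Return a label such as 'COSMIC-3'; fall back to the bare id if unknown."""
    prefix = SET_PREFIXES.get(str(signature_set))
    if prefix is None:
        return str(signature_id)
    return f"{prefix}-{signature_id}"
```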

cohort documentation

The cohort arguments overlap; they need clearer documentation, possibly with visuals or explanations.

KeyError: 'GCTGA'

I performed ranking for about 100 MAF files but had the following Key errors for several of them. Do you have any ideas about this error? Thanks!

mutagene rank -i maf -g hg38 -o ranking.txt -p sample.profile

Traceback (most recent call last):
File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/pengy10/.local/lib/python3.7/site-packages/mutagene/__main__.py", line 113, in <module>
MutaGeneApp()
File "/home/pengy10/.local/lib/python3.7/site-packages/mutagene/__main__.py", line 104, in __init__
parser_class.callback(args)
File "/home/pengy10/.local/lib/python3.7/site-packages/mutagene/cli/rank_menu.py", line 75, in callback
protein_mutations, processing_stats = read_protein_mutations_MAF(args.infile, args.genome)
File "/home/pengy10/.local/lib/python3.7/site-packages/mutagene/io/protein_mutations_MAF.py", line 239, in read_protein_mutations_MAF
flat_mutations[(gene, protein_mutation)]['seq5'][props['seq5']] += 1
KeyError: 'GCTGA'

genome requirement

Currently the user has to specify the genome manually.

Ideas to fix:

config file
extract from MAF/VCF
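
For MAF input, the second idea could start from the NCBI_Build column. The following is a rough sketch, assuming a tab-separated MAF with a header row and optional "#" comment lines; genome_builds_from_maf is a hypothetical helper, and translating build names (for example GRCh38) into the values accepted by -g is left out.

```python
import csv
import io


def genome_builds_from_maf(handle):
    """Return the set of NCBI_Build values found in a MAF file handle."""
    # Skip comment lines such as '#version 2.4' before the header row.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    if reader.fieldnames is None or "NCBI_Build" not in reader.fieldnames:
        return set()
    return {row["NCBI_Build"] for row in reader if row["NCBI_Build"]}


# Made-up two-record MAF used only to demonstrate the helper.
example_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tNCBI_Build\tChromosome\n"
    "TP53\tGRCh38\tchr17\n"
    "KRAS\tGRCh38\tchr12\n"
)
builds = genome_builds_from_maf(example_maf)
```

A result containing more than one build, or a build that differs from the genome given on the command line, could then be reported to the user before any analysis starts.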

Explore CircleCI integration

Explore CircleCI integration with GitHub for use in CI testing.

Create log file with input parameters

A log file must be generated for every program run (opt-in mode).
The log file should contain all the information necessary to reproduce the results, in particular the input parameters.
It should be possible to read the log file back as an input configuration. YAML is the suggested format.
Moved the last point from #27 "Include in the output a file with information on the arguments provided when running the command"
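
The round trip described here can be sketched with the standard library. JSON is used below only because it ships with Python; the issue suggests YAML, which would look analogous with a YAML library. write_run_log and read_run_log are hypothetical names, and the sketch assumes every argument value is a plain string, number, boolean, or list rather than an open file object.

```python
import argparse
import json


def write_run_log(args, path):
    """Save the parsed arguments so that the run can be reproduced later."""
    with open(path, "w") as handle:
        json.dump(vars(args), handle, indent=2, sort_keys=True)


def read_run_log(path):
    """Load a saved log back into an argparse.Namespace."""
    with open(path) as handle:
        return argparse.Namespace(**json.load(handle))
```

Usage would be along the lines of write_run_log(args, "run_log.json") after argument parsing, and args = read_run_log("run_log.json") to repeat the run with the same parameters.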

'Unexpected value of transcript_strand in MAF file'

mutagene -v rank -i test.maf -g hg38 -c gcb_lymphomas -p test.profile
WARNING Parsing MAF failed. Check that the input file is in MAF format or specify a different format using option -f
ValueError('Unexpected value of transcript_strand in MAF file')

mutagene reports an error when transcript_strand is not available in some lines of the input MAF file.

mutagene signature identify

The warning message is not informative if no signature set is specified:

WARNING Set of signatures required. Use 5 and 10 for MUTAGENE-5 and MUTAGENE-10. Use 30 for COSMIC-30

Replace this message with: WARNING Use "-s" option to specify a set of signatures

Standard output

Propose options for standardizing the output of the identify function. Specifically:

The discrepancy in format between the bootstrap and non-bootstrap outputs
Signature set and signature name convention
Include in the output a file with information on the arguments provided when running the command

Related:
#15

VCF processing

mutagene currently can't process VCFs due to updates in the MAF reader.

Steps to reproduce the error:
run: mutagene motif search -g hg19 -i sample2.vcf

Caroline's draft code to fix:

in motif_menu.py:

    # Choose the reader from the file name; everything after that is shared.
    if ".vcf" in args.infile.name:
        read_with_context_window = read_VCF_with_context_window
    else:
        read_with_context_window = read_MAF_with_context_window

    mutations, mutations_with_context, processing_stats = read_with_context_window(args.infile, args.genome, args.window_size)

    # The reader may return None or an empty list when nothing could be loaded.
    if not mutations_with_context:
        logger.warning("No mutations loaded")

    matching_motifs = identify_motifs(mutations_with_context, custom_motif,
                                      args.strand) if mutations_with_context is not None else []

in motifs/__init__.py:

try:
    for sample, mutations in samples_mutations.items():
        if mutations is not None and len(mutations) > 0:
            first_mut_seq_with_coords = mutations[0][-1]
            window_size = (len(first_mut_seq_with_coords) - 1) // 2

            for m in search_motifs:
                for s in strand:
                    # print("IDENTIFYING MOTIF: ", m['name'])
                    result = get_enrichment(mutations, m['motif'], m['position'], m['ref'], m['alt'], window_size, s)

                    debug_data = {'sample': sample, 'motif': m['logo'], 'strand': s}
                    debug_data.update(result)
                    debug_string = pprint.pformat(debug_data, indent=4)
                    logger.debug(debug_string)

                    if result['mutation_load'] == 0:
                        continue


                    motif_dict = {
                        'sample': sample,
                        'name': m['name'],
                        'motif': m['logo'],
                        'strand': s,
                        'enrichment': result['enrichment'],
                        'pvalue': result['pvalue_fisher'],
                        'mutations_low_est': result['mutation_load'],
                        'mutations_high_est': result['bases_mutated_in_motif'],
                        # 'mut_motif': result['bases_mutated_in_motif'],
                        # 'mut_not_in_motif': result['bases_mutated_not_in_motif'],
                        # 'stat_count': result['bases_not_mutated_in_motif'],
                        # 'ref_count': result['bases_not_mutated_not_in_motif']
                    }

                    motif_matches.append(motif_dict)

except AttributeError:
    # print(sample, len(mutations))
    if samples_mutations is not None and len(samples_mutations) > 0:
        first_mut_seq_with_coords = samples_mutations[0][-1]
        window_size = (len(first_mut_seq_with_coords) - 1) // 2

        for m in search_motifs:
            for s in strand:
                # print("IDENTIFYING MOTIF: ", m['name'])
                result = get_enrichment(samples_mutations, m['motif'], m['position'], m['ref'], m['alt'], window_size, s)

                debug_data = {'sample': "VCF", 'motif': m['logo'], 'strand': s}
                debug_data.update(result)
                debug_string = pprint.pformat(debug_data, indent=4)
                logger.debug(debug_string)

                if result['mutation_load'] == 0:
                    continue


                motif_dict = {
                    'sample': "VCF",
                    'name': m['name'],
                    'motif': m['logo'],
                    'strand': s,
                    'enrichment': result['enrichment'],
                    'pvalue': result['pvalue_fisher'],
                    'mutations_low_est': result['mutation_load'],
                    'mutations_high_est': result['bases_mutated_in_motif'],
                    # 'mut_motif': result['bases_mutated_in_motif'],
                    # 'mut_not_in_motif': result['bases_mutated_not_in_motif'],
                    # 'stat_count': result['bases_not_mutated_in_motif'],
                    # 'ref_count': result['bases_not_mutated_not_in_motif']
                }

                motif_matches.append(motif_dict)

return motif_matches

in context_window.py:

def read_VCF_with_context_window(muts, asm, window_size):
    cn = complementary_nucleotide
    mutations = defaultdict(float)
    N_skipped = 0
    # N_skipped_indels = 0

    raw_mutations = []

    for line in muts.read().split("\n"):
        if line.startswith("#"):
            continue
        if len(line) < 10:
            continue

        col_list = line.split()
        # CHROM, POS, ID, REF and ALT are all needed below
        if len(col_list) < 5:
            continue

        # ID = col_list[2]

        # chromosome is expected to be one or two digits or one letter
        chrom = col_list[0]  # VCF CHROM
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]
        # if len(chrom) == 2 and chrom[1] not in "0123456789":
        #     chrom = chrom[0]

        pos = int(col_list[1])  # VCF POS
        x = col_list[3]         # VCF REF
        y = col_list[4]         # VCF ALT

        # if multiple REF or ALT alleles are given, ignore mutation entry (could mean seq error, could mean deletion)
        if len(x) != 1:
            N_skipped += 1
            continue
        if len(y) != 1:
            N_skipped += 1
            continue

        transcript_strand = '+'  # bad assumption about transcript strand

        raw_mutations.append((chrom, pos, transcript_strand, x, y))

    # print("RAW", raw_mutations)
    # print("INDELS", N_skipped)

    mutations_with_context = []
    if len(raw_mutations) > 0:
        contexts = get_context_twobit_window(raw_mutations, asm, window_size)
        if contexts is None or len(contexts) == 0:
            # return three values, matching the final return and the caller's unpacking
            return None, None, None

        for (chrom, pos, transcript_strand, x, y) in raw_mutations:
            (p5, p3), seq_with_coords = contexts.get((chrom, pos), (("N", "N"), []))
            # print("RESULT: {} {}".format(p5, p3))

            if len(set([p5, x, y, p3]) - set(nucleotides)) > 0:
                # print(chrom, pos, p5, p3, x)
                # print("Skipping invalid nucleotides")
                N_skipped += 1
                continue
            if x in "CT":
                mutations[p5 + p3 + x + y] += 1.0
            else:
                # complementary mutation
                mutations[cn[p3] + cn[p5] + cn[x] + cn[y]] += 1.0

            mutations_with_context.append((chrom, pos, transcript_strand, x, y, seq_with_coords))
    N_loaded = int(sum(mutations.values()))
    processing_stats = {'loaded': N_loaded, 'skipped': N_skipped, 'format': 'VCF'}
    return mutations, mutations_with_context, processing_stats

Motif search

The help files do not describe what "pre-identified motifs" means.

--help for "signature identify"

Please copy the new documentation from the "signature_doc*" file.

MAF parsing error

Error parsing MAF file

MutaGen requires Codon_Change or Codons fields in MAF files
Could not find HGVSp_Short or Protein_Change fields in MAF file

Reproduce:

$ mutagene rank -i test.maf -g hg38 -o ranking.txt -c pancancer
WARNING MAF format not recognized
WARNING No mutations to rank

Sample:

Hugo_Symbol     Entrez_Gene_Id   Center   NCBI_Build        Chromosome     Start_Position     End_Position           Strand   Variant_Classification       tumor_VAF        Variant_Type      Reference_Allele  Tumor_Seq_Allele1           Tumor_Seq_Allele2          dbSNP_RS         Tumor_Sample_Barcode   Mutation_Status AAChange           Transcript_Id      TxChange          GeneDetail.refGene
SAMD11           NCC     hg38     chr1     930314  930314  +         Missense_Mutation        0.5       SNP      C           C         T                    NSLA_475_exonic_anno.txt           Somatic p.H78Y  NM_152486       c.C232T 
...

rounding of mutations high estimate

Some low estimates and all high estimates of mutations end in .5; they should be whole numbers.

Add some sort of correction or rounding.

The error can be reproduced by running mutagene motif search on a sample.

web server and package

the webserver and package produce different results:

(env_mutagene) Annas-MBP:mutagene panch$ mutagene signature -i TSVC_variants_IonXpress_019_(BCMDS-116).vcf -g hg38 -f VCF -s 10 -U
sample signature exposure mutations
VCF 2 0.0535 3
VCF 3 0.1164 7
VCF 4 0.1134 7
VCF 5 0.1244 7
VCF 6 0.3912 23
VCF 7 0.1603 9
VCF 8 0.0109 1
VCF 9 0.0274 2
(env_mutagene) Annas-MBP:mutagene panch$ mutagene signature -i TSVC_variants_IonXpress_019_(BCMDS-116).vcf -g hg38 -f VCF -s 10
sample signature exposure mutations
VCF 1 0.0144 1
VCF 3 0.1123 7
VCF 4 0.0514 3
VCF 5 0.1074 6
VCF 6 0.1604 9
VCF 7 0.0341 2
VCF 9 0.0328 2

different output with and w/o bootstrap

mutagene signature identify -i TSVC_variants_IonXpress_019_(BCMDS-116).vcf -g hg38 -s 10 -f VCF

This produces different output with and without the "-b" option. The issue is not observed for other VCF files.

motif search command line

The example in "motif search --help" does not work:

mutagene motif --infile --input-format sample1.maf -f MAF --genome hg19 --motif 'C[A>T]' --strand A

It should be:

mutagene motif --infile sample1.maf -f MAF --genome hg19 --motif 'C[A>T]' --strand A

We should have an example with the significant motif matches.

generating output pdf

Specifically, a PDF is generated but Preview cannot open it because the "file may be damaged or use a file format that Preview doesn’t recognize"; other PDF editors will not open it either.

The issue can be reproduced by running with -o file_name.pdf and trying to open the resulting PDF.

Formats that work: txt, csv, doc, tex, etc. This seems to be a PDF-specific issue.

Idea to fix:

PDFs may need a separate case for how they are generated, if they need to be handled in the first place.