alexandrovlab / sigprofilersimulator

SigProfilerSimulator allows realistic simulations of mutational patterns and mutational signatures in cancer genomes. The tool can be used to simulate signatures of single point mutations, double point mutations, and insertions/deletions. It also makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

License: BSD 2-Clause "Simplified" License

Python 100.00%
bioinformatics cancer-genomics mutational-signatures somatic-variants

sigprofilersimulator's Issues

Reference genome installation and wget

Hi @ebergstr,

Thanks for the great tool! A couple of suggestions from me:

  1. The argument c("96") for the contexts parameter in the example sigSim.SigProfilerSimulator command in the Readme throws an error in my environment: NameError: name 'c' is not defined
    contexts=["96"] works, though, so I suggest changing the example command to that.

  2. If the reference genome has not been downloaded, nothing seems to happen after the message "Chromosome proportion file does not exist. Creating now..." appears. Manual download as per the SigProfilerMatrixGenerator Readme works, so perhaps it's worth mentioning in the SigProfilerSimulator Readme, too.

  3. When installing the reference genome, the error "The ensembl ftp site is not currently responding." appears even when WGET is not installed. Although this tool is mentioned in the Readme as a prerequisite, perhaps it's worth adding a check here that raises an exception if WGET is not available.

Best,
Sergey
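
The availability check suggested in point 3 could look roughly like the sketch below. It is only an illustration: require_tool is a hypothetical helper name, not part of SigProfilerSimulator or SigProfilerMatrixGenerator, and it assumes the installer looks up wget on PATH.

```python
import shutil


def require_tool(name: str = "wget") -> str:
    """Return the full path to an executable, or raise if it is not on PATH.

    Hypothetical helper: not part of any SigProfiler package.
    """
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' was not found on PATH; install it before "
            "downloading a reference genome."
        )
    return path
```

Calling require_tool("wget") before starting the download would turn a missing prerequisite into an explicit message instead of a misleading FTP error.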

Hard-coded chromosomes make it impossible to run code with "chr" prefix for chromosome IDs or "MT" chromosomes

https://github.com/AlexandrovLab/SigProfilerSimulator/blob/ced39a81e38098fc9c42b90255e5f70144ac52aa/SigProfilerSimulator/SigProfilerSimulator.py#L162C1-L163C67

Because the chromosomes are hard-coded without the "chr" prefix, it is impossible to run SigProfilerSimulator on non-Ensembl GRCh38 genomes that have the "chr" prefix before the chromosome number. Additionally, no MT chromosome is listed, so the code also breaks on VCF files containing MT chromosomes.
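
Until the chromosome list is configurable, one possible workaround is to pre-process the input VCF lines. The sketch below is an assumption-laden illustration: SUPPORTED is a guess at the hard-coded human chromosome names (1-22, X, Y), and clean_vcf_lines is a hypothetical helper, not a function of the tool. It strips the "chr" prefix from the CHROM column and drops records on contigs outside that set (such as M/MT). Header lines, including ##contig entries, are passed through unchanged.

```python
# Assumption: the tool expects Ensembl-style names for chromosomes 1-22, X and Y.
SUPPORTED = {str(i) for i in range(1, 23)} | {"X", "Y"}


def normalize_chrom(name: str) -> str:
    """Strip a leading 'chr' (any case) from a chromosome name."""
    return name[3:] if name.lower().startswith("chr") else name


def clean_vcf_lines(lines):
    """Yield VCF lines with normalized CHROM, skipping unsupported contigs."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines are passed through unchanged
            continue
        chrom, sep, rest = line.partition("\t")
        chrom = normalize_chrom(chrom)
        if chrom in SUPPORTED:
            yield chrom + sep + rest
```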

Chromosome based error

Hello,

I am running your tool using:

sigSim.SigProfilerSimulator(name, \
  vcf_dir, \
  "GRCh37", \
  contexts=["96", "ID"], \
  exome=None, \
  simulations=1000, \
  updating=False, \
  bed_file=bed, \
  overlap=False, \
  gender='female', \
  chrom_based=True, \
  seed_file=None, \
  noisePoisson=False, \
  cushion=100, \
  region=None, \
  vcf=True)

But I am getting the following error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/site-packages/SigProfilerSimulator/mutational_simulator.py", line 973, in simulator
    random_sample = random.sample(list(mutation_tracker[context]),1)[0]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/random.py", line 430, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gs/gsfs0/shared-lab/vijg-lab/2023-Ronnie/231009_multiple_ENU_analysis/SigProfilerSimulator/merged/runSigProfilerSimulator.py", line 10, in <module>
    sigSim.SigProfilerSimulator(name, \
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/site-packages/SigProfilerSimulator/SigProfilerSimulator.py", line 479, in SigProfilerSimulator
    r.get()
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
ValueError: Sample larger than population or is negative

When I run your tool with chrom_based=False I am able to get results. So this makes me think it is an error when wanting to have mutations simulated by chromosome. Since some of my samples don't have many mutations, I think this may be due to some chromosomes having 0 mutations. Any help with this?

Thanks,
Ronnie
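
To test the hypothesis above, a per-chromosome count of the input mutations can be compared against the chromosomes being simulated. Python's random.sample raises exactly this ValueError when asked to draw one item from an empty list, which would be consistent with a chromosome that has no mutations for some context. The sketch below is only a diagnostic aid; the function names and the list of expected chromosomes are assumptions, not part of SigProfilerSimulator.

```python
from collections import Counter


def mutations_per_chrom(vcf_lines):
    """Count VCF data lines per chromosome (first tab-separated column)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def empty_chroms(counts, expected):
    """Return the expected chromosomes that have no mutations at all."""
    return [chrom for chrom in expected if counts[chrom] == 0]
```

For a female GRCh37 sample, expected might be the strings "1" through "22" plus "X" (an assumption about which chromosomes are simulated); any name returned by empty_chroms is a candidate cause of the failure with chrom_based=True.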

"bed_file" and "region" parameters

Dear author,

Thanks for making this wonderful tool.
I have some questions about it.

I would like to simulate mutations within genes.
To do this, I supplied the gene regions via "bed_file".
(sigSim.SigProfilerSimulator("SIMUL", "./up_gene", "GRCh38", contexts=["96"], simulations=120, bed_file="./updown_bed_canonincal_hg38"))
However, with this parameter the tool seemed to simulate only a subset of the mutations.
(e.g., the total number of mutations within genes is around 10,000, but each MAF file contains only about 700 mutations.)

Next, I passed "BED_GRCh38_proportions", which seemed to be generated when using the "bed_file" parameter, as the region parameter. (sigSim.SigProfilerSimulator("SIMUL", "./up_gene", "GRCh38", contexts=["96"], simulations=120, region="BED_GRCh38_proportions"))

However, this approach caused the following error.
Could you tell me the best way to simulate mutations within genes?

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/site-packages/SigProfilerSimulator/mutational_simulator.py", line 2349, in simulator
    mutNuc = ''.join([tsb_ref[base][1] for base in sequence[random_number - mut_start:random_number + mut_start+1]])
  File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/site-packages/SigProfilerSimulator/mutational_simulator.py", line 2349, in
    mutNuc = ''.join([tsb_ref[base][1] for base in sequence[random_number - mut_start:random_number + mut_start+1]])
KeyError: 51
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/site-packages/SigProfilerSimulator/SigProfilerSimulator.py", line 479, in SigProfilerSimulator
    r.get()
  File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
KeyError: 51
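
One way to sanity-check the counts reported above is to count how many input mutations actually fall inside the BED intervals. The sketch below is an independent check, not a description of how the tool applies bed_file; load_bed, in_bed and count_inside are hypothetical helper names. It assumes standard BED coordinates (0-based start, exclusive end) and 1-based mutation positions, and that chromosome names match between the two inputs.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    intervals = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue  # skip comments, headers and malformed lines
        intervals[fields[0]].append((int(fields[1]), int(fields[2])))
    return intervals


def in_bed(intervals, chrom, pos):
    """Return True if a 1-based position lies inside any interval on chrom."""
    zero_based = pos - 1
    return any(start <= zero_based < end for start, end in intervals.get(chrom, ()))


def count_inside(intervals, mutations):
    """Count (chrom, pos) pairs that fall inside the BED intervals."""
    return sum(in_bed(intervals, chrom, pos) for chrom, pos in mutations)
```

The linear scan per position keeps the sketch short; for large BED files, an interval index would be a better choice.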

Lower mutation number in simulation than observed

Hello,

Thanks for a great tool! I have been using it to randomize the positions of mutations in samples while controlling for mutational signatures. However, I have noticed that the simulated files do not contain the same number of mutations as the input, which is what I would expect.

I am running the tool like this, including some specific regions to simulate in (based on coverage):
`#!/usr/bin/env python

import sys
from SigProfilerSimulator import SigProfilerSimulator as sigSim

name = sys.argv[1]
vcf_dir = sys.argv[2]
bed = sys.argv[3].strip()

sigSim.SigProfilerSimulator(name,
    vcf_dir,
    "GRCh37",
    contexts=["96", "ID"],
    exome=None,
    simulations=10,
    updating=False,
    bed_file=bed,
    overlap=False,
    gender='female',
    chrom_based=False,
    seed_file=None,
    noisePoisson=False,
    cushion=100,
    region=None,
    vcf=True)`

Here is the output of the log file:
`======================================
SigProfilerSimulator
Checking for all reference files and relevant matrices...
Creating a chromosome proportion file for the given BED file ranges...Completed!
SigProfilerSimulator_cell_output/C1_05_24/output/ID/C1_05_24.ID83.region does not exist. Creating the matrix file now.
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 116.12 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 106.96 seconds.
Matrices generated for 1 samples with 0 errors. Total of 597 SNVs, 1 DINUCs, and 15 INDELs were successfully analyzed.
The context distribution file does not exist. This file needs to be created before simulating. This may take several hours...
Chromosome X done
Chromosome 1 done
Chromosome 2 done
Chromosome 3 done
Chromosome 4 done
Chromosome 5 done
Chromosome 6 done
Chromosome 7 done
Chromosome 8 done
Chromosome 9 done
Chromosome 10 done
Chromosome 11 done
Chromosome 12 done
Chromosome 13 done
Chromosome 14 done
Chromosome 15 done
Chromosome 16 done
Chromosome 17 done
Chromosome 18 done
Chromosome 19 done
Chromosome 20 done
Chromosome 21 done
The context distribution file has been created!
The context distribution file does not exist. This file needs to be created before simulating. This may take several hours...
Chromosome X done
Chromosome 1 done
Chromosome 2 done
Chromosome 3 done
Chromosome 4 done
Chromosome 5 done
Chromosome 6 done
Chromosome 7 done
Chromosome 8 done
Chromosome 9 done
Chromosome 10 done
Chromosome 11 done
Chromosome 12 done
Chromosome 13 done
Chromosome 14 done
Chromosome 15 done
Chromosome 16 done
Chromosome 17 done
Chromosome 18 done
Chromosome 19 done
Chromosome 20 done
Chromosome 21 done
The context distribution file has been created!

Files successfully read and mutations collected. Mutation assignment starting now.
Mutations have been distributed. Starting simulation now...
Chromosome 21 done
Chromosome 22 done
Chromosome 19 done
Chromosome 20 done
Chromosome 18 done
Chromosome 13 done
Chromosome 17 done
Chromosome 15 done
Chromosome 16 done
Chromosome 14 done
Chromosome 9 done
Chromosome 11 done
Chromosome 10 done
Chromosome 12 done
Chromosome X done
Chromosome 8 done
Chromosome 7 done
Chromosome 6 done
Chromosome 4 done
Chromosome 5 done
Chromosome 3 done
Chromosome 1 done
Chromosome 2 done
Simulation completed
Job took 2381.3919444084167 seconds`

In my input file I have 613 mutations, but in the simulated output I only get 313 mutations. This issue is not specific to this file. Any help with this? I have tried to circumvent this by creating many more simulations than I need and sampling them to get the matching number of mutations as in my real sample.

Thanks,
Ronnie
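
The workaround described above (oversimulating and then sampling down to the observed count) can be sketched as follows. This is only an illustration of the sampling step: subsample is a hypothetical helper, the records can be any list of parsed mutations, and a uniform draw like this is not guaranteed to preserve the per-context counts of the original sample.

```python
import random


def subsample(records, n, seed=0):
    """Draw n records without replacement, reproducibly for a given seed."""
    records = list(records)
    if n > len(records):
        raise ValueError(
            f"requested {n} records but only {len(records)} are available"
        )
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(records, n)
```

With pooled simulated mutations in a list, subsample(pooled, 613) would return as many records as the input file in this report contains.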

Update SigProfilerSimulator for Changes in SigProfilerMatrixGenerator Imports

Description:
Recent changes in SigProfilerMatrixGenerator v1.3.18 have implications for SigProfilerSimulator:

  • File SigProfilerMatrixGenerator.py has been renamed to MutationMatrixGenerator.py.
  • The import statement has transitioned:
    • From:
      from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGenerator
    • To:
      from SigProfilerMatrixGenerator.scripts import MutationMatrixGenerator

These modifications introduce import errors in SigProfilerSimulator (line 20, line 24).
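
One way to stay compatible with both module names is to try the new name first and fall back to the old one. The sketch below uses a generic, hypothetical helper (import_first) built on importlib; the two module paths in the comment are taken from the description above, and the alias matGen is an assumption about how the module is referred to in the calling code.

```python
import importlib


def import_first(module_names):
    """Import and return the first module in module_names that is available."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported: " + "; ".join(errors)
    )


# Intended use (requires SigProfilerMatrixGenerator to be installed):
# matGen = import_first([
#     "SigProfilerMatrixGenerator.scripts.MutationMatrixGenerator",
#     "SigProfilerMatrixGenerator.scripts.SigProfilerMatrixGenerator",
# ])
```

Pinning a SigProfilerMatrixGenerator version that is known to work is the simpler alternative if supporting both layouts is not needed.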
