alexandrovlab / sigprofilermatrixgenerator

SigProfilerMatrixGenerator creates mutational matrices for all types of somatic mutations. It allows downsizing the generated mutations to only parts of the genome (e.g., the exome or a custom BED file). The tool seamlessly integrates with other SigProfiler tools.

License: BSD 2-Clause "Simplified" License

Python 99.89% Dockerfile 0.11%
bioinformatics cancer-genomics mutation-analysis mutational-signatures somatic-variants

sigprofilermatrixgenerator's People

Contributors

azhark2, ebergstr, marcos-diazg, mdbarnesucsd, mishugeb

sigprofilermatrixgenerator's Issues

More of a question: what is the source of the transcribed strand bias reference BED file?

Hello, thank you for this nice tool doing a beautiful and comprehensive mutational signature work!
One quick question about the logic for determining strand bias: I notice that it relies on
references/chromosomes/tsb_BED

What is the source of this BED file? Is there a reference paper or link for this TSB-related BED file?

Thanks!
Isaac

Indel types '1:Del:C:0' and '1:Del:T:0' have zero counts

for i in range (pos_rev-type_length, pos_rev, 1):
    actual_seq += tsb_ref[chrom_string[i]][1]
while pos_rev - type_length > 0 and actual_seq == type_sequence:
    sequence = actual_seq + sequence
    pos_rev -= type_length
    actual_seq = ''
    for i in range (pos_rev-type_length, pos_rev, 1):

Since position start-1 on chrom_string is the actual ref base (as indicated on line 1188), using start (or pos_rev) as the right bound on lines 1248 and 1254 means that the ref base (start-1) is included in the range and counted as a repeat. As a result, we never get any single-base deletions with zero repeats. If I understood your code and the concept correctly, I suggest using start-1 (or pos_rev-1) as the right bound on these lines.
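The off-by-one described above can be reproduced on a toy string. This is only a sketch, not the tool's code: `count_upstream_repeats` and the sample sequence are hypothetical, and the only assumption carried over from the report is that the deleted base sits at 0-based index `start - 1`.

```python
def count_upstream_repeats(seq, right_bound, unit):
    """Count consecutive copies of `unit` ending just before `right_bound` (exclusive)."""
    n = len(unit)
    count = 0
    pos = right_bound
    # Walk leftwards one unit at a time while the window still matches.
    while pos - n >= 0 and seq[pos - n:pos] == unit:
        count += 1
        pos -= n
    return count


seq = "AGCTA"  # hypothetical reference; the deleted "C" is at 1-based start = 3
start = 3

# Right bound = start: the window covers the deleted base itself.
with_ref_base = count_upstream_repeats(seq, start, "C")
# Right bound = start - 1: the window stops just before the deleted base.
without_ref_base = count_upstream_repeats(seq, start - 1, "C")

print(with_ref_base, without_ref_base)
```

With the larger bound the deleted base is counted as one repeat of itself, so a category such as 1:Del:C:0 could never be reached in this toy model.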

Thank you!
-Mehdi

ID_TSB plots have no "Nontranscribed" group, unlike SBS_288

Hi,

Versions:

SigProfilerMatrixGenerator==1.1.16
sigProfilerPlotting==1.1.6

Not sure if this is a limitation, but the lack of the 3rd group "Nontranscribed" results in a loss of large amounts of data when comparing ID_83 to ID_TSB plots:

ID_83

[Screenshot: ID_83 plot]

ID_TSB

[Screenshot: ID_TSB plot]

This isn't an issue on SBS_96 vs SBS_288 as all variants are accounted for.

File import error in installation benchmark.

from SigProfilerMatrixGenerator.scripts import convert_input_to_simple_files as convertIn

I get the following error when benchmarking an installation:

Traceback (most recent call last):
  File "/home/garmstro/code/SigProfilerSuite/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGenerator.py", line 18, in <module>                                                                                                                           
    from SigProfilerMatrixGenerator.scripts import convert_input_to_simple_files as convertIn
  File "/home/garmstro/code/SigProfilerSuite/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGenerator.py", line 18, in <module>                                                                                                                           
    from SigProfilerMatrixGenerator.scripts import convert_input_to_simple_files as convertIn
ModuleNotFoundError: No module named 'SigProfilerMatrixGenerator.scripts'; 'SigProfilerMatrixGenerator' is not a package

This is raised because there is also a local file named 'SigProfilerMatrixGenerator', and Python searches the local directory before installed packages when resolving imports.
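One way to check which file Python would actually pick up for a name is to ask the import system directly. The helper below is a hypothetical diagnostic sketch (the name `describe_import` is not part of the tool); it relies only on the standard `importlib.util.find_spec`.

```python
import importlib.util


def describe_import(name):
    """Return (origin, is_package) for a top-level module name, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Packages expose submodule_search_locations; plain .py modules do not.
    is_package = spec.submodule_search_locations is not None
    return spec.origin, is_package


# Using a standard-library package here so the example runs anywhere.
print(describe_import("json"))
```

If the same call on 'SigProfilerMatrixGenerator' reports a path inside the working directory and `is_package` is False, a local script is shadowing the installed package.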

mm39 analysis requires MT.txt, chr20.txt, 21.txt and 22.txt

I think this bug is similar to another issue (#42).
Using the installed 'mm39' genome the analysis stalls here:
Starting matrix generation for INDELs...>>>

The log mentions that various txt files are missing from '/references/chromosomes/tsb/mm39/', including MT.txt, chr20.txt, 21.txt and 22.txt.

I got round this by adding empty files for those missing above.

ValueError: threshold must be numeric and non-NAN, try sys.maxsize for untruncated representation

Just to be aware: with numpy 1.16.0 we get the error below when importing from SigProfilerMatrixGenerator.
We worked around it by arbitrarily changing
#np.set_printoptions(threshold=np.nan)
to
np.set_printoptions(threshold=0.001)

on line 24 of SigProfilerMatrixGenerator.py on the master branch.
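The error message itself suggests `sys.maxsize` when an untruncated representation is wanted, which is presumably what the original `np.nan` setting was meant to achieve. A minimal sketch of that alternative:

```python
import sys

import numpy as np

# Ask NumPy to print arrays in full instead of summarising them.
# sys.maxsize is the value recommended by the ValueError text above.
np.set_printoptions(threshold=sys.maxsize)
```

An arbitrary small value such as 0.001 avoids the exception but also changes how arrays are printed, so it is not equivalent to the original intent.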

from SigProfilerMatrixGenerator import install as genInstall
Traceback (most recent call last):
File "", line 1, in
File "/home/debian/Stephen/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/install.py", line 18, in
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
File "/home/debian/Stephen/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGeneratorFunc.py", line 7, in
from . import SigProfilerMatrixGenerator as matGen
File "/home/debian/Stephen/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGenerator.py", line 24, in
np.set_printoptions(threshold=np.nan)
File "/home/debian/.local/lib/python3.5/site-packages/numpy/core/arrayprint.py", line 246, in set_printoptions
floatmode, legacy)
File "/home/debian/.local/lib/python3.5/site-packages/numpy/core/arrayprint.py", line 93, in _make_options_dict
raise ValueError("threshold must be numeric and non-NAN, try "
ValueError: threshold must be numeric and non-NAN, try sys.maxsize for untruncated representation
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
Traceback (most recent call last):
File "", line 1, in
File "/home/debian/Stephen/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGeneratorFunc.py", line 7, in
from . import SigProfilerMatrixGenerator as matGen
File "/home/debian/Stephen/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGenerator.py", line 24, in
np.set_printoptions(threshold=np.nan)
File "/home/debian/.local/lib/python3.5/site-packages/numpy/core/arrayprint.py", line 246, in set_printoptions
floatmode, legacy)
File "/home/debian/.local/lib/python3.5/site-packages/numpy/core/arrayprint.py", line 93, in _make_options_dict
raise ValueError("threshold must be numeric and non-NAN, try "
ValueError: threshold must be numeric and non-NAN, try sys.maxsize for untruncated representation

Error due to reference file corrupted

"The transcriptional reference data appears to be corrupted. Please reinstall the GRCh38 genome"

I'm getting that error after a list of mismatched md5sums.

Is there any way to solve this?

Is there any way I can download it manually and keep moving?
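To see which reference files are affected, the checksums can be recomputed by hand and compared with the values the installer prints. The sketch below uses only the standard `hashlib`; the helper name `md5sum` is hypothetical, and the expected checksums have to come from the tool's own output, since they are not listed here.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5sum` on each chromosome file under the genome's reference directory gives values that can be compared line by line with the reported mismatches.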

Install.py does not exist

When trying to run SigProfilerMatrixGenerator on a file where the specified reference genome has not been installed, the user receives a message prompting them to install the reference using an install.py script:

 python sigprofilermatrixgenerator/generate_matrix.py -m <input>.maf 
The specified genome: GRCh37 has not been installed
Run the following command to install the genome:
	python sigProfilerMatrixGenerator/install.py -g GRCh37

However, when running the suggested command, the install.py script is not found:

python sigProfilerMatrixGenerator/install.py -g GRCh37
python: can't open file 'sigProfilerMatrixGenerator/install.py': [Errno 2] No such file or directory

This looks like an error in the pip install process, as the install.py script does exist in the repo.

It looks like some of the paths are hard-coded as well:

 python sigprofilermatrixgenerator/SigProfilerMatrixGenerator/install.py -g GRCh37
Traceback (most recent call last):
  File "sigprofilermatrixgenerator/SigProfilerMatrixGenerator/install.py", line 571, in <module>
    main()
  File "sigprofilermatrixgenerator/SigProfilerMatrixGenerator/install.py", line 516, in main
    os.chdir(first_path + "/sigProfilerMatrixGenerator/")
FileNotFoundError: [Errno 2] No such file or directory: '/home/ericco92/indels_sigprof_1.0.3_2020FEB20/sigProfilerMatrixGenerator/'

Certain MAF files produce no 1bp deletions in matrix

Hi,

I have several MAF files that have known 1bp deletions. When running SigProfilerMatrixGenerator (and subsequently SigProfilerExtractor), the output matrix has counts of 0 for 1:Del:C:0 or 1:Del:T:0 features across all samples.

Here's an example of a 1bp deletion from my MAF:

Unknown 0       broad.mit.edu   37      1       168186704       168186704       +       IGR     DEL    C       C       -

Do I need to prepend a reference base to both alleles or otherwise preprocess inputs before running SigProfilerMatrixGenerator? Happy to contribute a script to do so if we can figure out the issue.

192 TSB-SBS extraction

Can I extract 192 transcription strand based mutation type matrix with SigProfilerMatrixGenerator?

Parsing of 3 consecutive mutations

Hello,

When 3 consecutive mutations are encountered in a VCF, SigProfilerMatrixGenerator extracts 3 SBS and 2 DBS mutational patterns from those 3 mutations. Is this expected behaviour? (The 2 extracted DBS patterns come from "overlapping mutations"; is this expected as well?)

Also, the same happens for 4 consecutive mutations (3 DBS and 4 SBS patterns are extracted).
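The counts reported above are what one gets if every substitution is counted as an SBS and every adjacent pair as a DBS. The sketch below only models that reported behaviour; it is not taken from the tool, and `count_sbs_and_dbs` is a hypothetical name.

```python
def count_sbs_and_dbs(positions):
    """positions: sorted positions of single-base substitutions on one chromosome."""
    sbs = len(positions)
    # Every pair of neighbouring positions yields one (possibly overlapping) doublet.
    dbs = sum(1 for a, b in zip(positions, positions[1:]) if b - a == 1)
    return sbs, dbs


print(count_sbs_and_dbs([100, 101, 102]))
print(count_sbs_and_dbs([100, 101, 102, 103]))
```

Under this model, n consecutive substitutions give n SBS and n - 1 DBS, matching the 3/2 and 4/3 figures in the report.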

Best regards,
Dalibor

Feature request: annotate each variant with the context

Hi @ebergstr,

Thanks for the amazing tool. I was wondering whether it is possible (with the current implementation or with some modifications) to annotate each variant in the input files with its context (for SBS, DBS, ID) and output a text file (which may include chr, position, ref, alt, context, sampleID).

IMO, there will be several use cases for integrative analysis with this data e.g. one can calculate overrepresentation of different contexts in certain genomic loci such as enhancers or open chromatin regions.
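For single-base substitutions, the kind of per-variant annotation requested here could look roughly like the sketch below. It is an illustration under stated assumptions, not an existing feature: `sbs_context` is a hypothetical helper, the reference is passed as a plain string, positions are 1-based, and purine references are reverse-complemented so the label follows the usual pyrimidine-centred notation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sbs_context(ref_seq, pos, alt):
    """Return a trinucleotide label such as 'A[C>T]G' for a 1-based position."""
    i = pos - 1
    if i < 1 or i + 1 >= len(ref_seq):
        raise ValueError("position needs one flanking base on each side")
    left, ref, right = ref_seq[i - 1], ref_seq[i], ref_seq[i + 1]
    if ref in "AG":
        # Report the mutation on the strand where the reference base is a pyrimidine.
        left, ref, right, alt = (COMPLEMENT[right], COMPLEMENT[ref],
                                 COMPLEMENT[left], COMPLEMENT[alt])
    return f"{left}[{ref}>{alt}]{right}"


print(sbs_context("TACGT", 3, "T"))
```

Writing one such label per input row, next to chr, position, ref, alt and sample ID, would give the text file described in the request.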

Any help would be appreciated.

Thank you.

Question about micro-homology

Dear developer,

Thanks for creating this wonderful tool.

I am reading the source code to understand how the mutational matrix for INDELs is generated.
I have some questions about the processing steps and their meanings.

  • What's the biological significance of the upstream and downstream micro-homology sequence?
  • Does the micro-homology sequence affect the mutation events?
  • Why do you use both upstream and downstream sequences rather than only the downstream sequence in micro-homology classification?

Could you help me out?

Best wishes,
Ziyu

error when using maf file as input

I tried to give a MAF file containing all my samples and mutations to SigProfilerMatrixGeneratorR. The exact command is:
matrices <- SigProfilerMatrixGeneratorR("WUMDANY_sigprofiler", "GRCh37", "~/Desktop/WUMDANY_sigprofiler/", plot=T, exome=F, bed_file=NULL, chrom_based=F, tsb_stat=F, seqInfo=F, cushion=100)
But I get an error:
Error in py_call_impl(callable, dots$args, dots$keywords) :
IndexError: list index out of range

Phasing information in VCF files

Hello,

When using SigProfilerMatrixGenerator to parse VCF files, the following case is encountered:
#CHROM POS REF ALT sample1
chr1 17000 A C 0|1
chr1 17001 A G 1|0
Here the two mutations are at consecutive positions in the genome but on different chromosomes of the pair. SigProfilerMatrixGenerator still counts this as a DBS pattern (AA>CG). Is this expected behaviour? Should this be considered a DBS pattern?
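Whether two adjacent calls share a haplotype can be read off phased genotypes. The sketch below is a hypothetical check, not something the tool is known to do; `on_same_haplotype` is an invented name, and it assumes both genotypes are phased with `|` and belong to the same phase set.

```python
def on_same_haplotype(gt_a, gt_b):
    """True if two phased genotypes carry a non-reference allele on a shared haplotype."""
    if "|" not in gt_a or "|" not in gt_b:
        raise ValueError("both genotypes must be phased, e.g. '0|1'")
    alleles_a = gt_a.split("|")
    alleles_b = gt_b.split("|")
    # Compare haplotype by haplotype; "0" is the reference allele, "." is missing.
    return any(a not in ("0", ".") and b not in ("0", ".")
               for a, b in zip(alleles_a, alleles_b))


print(on_same_haplotype("0|1", "1|0"))
print(on_same_haplotype("0|1", "0|1"))
```

For the example in this report (0|1 followed by 1|0) the check returns False, which is the case the question is about.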

Best regards,
Dalibor

offline mode?

Hi there,
I want to put this code into a container that will run on a cluster with no internet access. Is it possible to build the matrices using a FASTA file reference, or perhaps to use a stripped-down version of the matrix generation that works only for SNVs and indels?

These lines can prevent a proper install if initial install fails

My initial install of GRCh37 failed because I had to exit the job. The initial job touched files for all the chromosomes (references/chromosomes/tsb/GRCh37/{1.txt,2.txt,...}), but only two of them had any data in them. When I went to reinstall, they could not be installed properly because this line skips over them. I see that you were trying to save computation, but maybe it is better to be able to make sure you have a fresh install in case anything gets corrupted?

if os.path.exists(output_path + chrom + ".txt"):
    continue
else:
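One possible middle ground between always reinstalling and always skipping is to skip only files that exist and are non-empty. This is a sketch of that idea, not a patch to install.py; `needs_install` is a hypothetical helper.

```python
import os


def needs_install(path):
    """Return True when a chromosome file is missing or has no content."""
    return not (os.path.isfile(path) and os.path.getsize(path) > 0)
```

A size check only catches files that were touched but never written; a partially written file would still be skipped, so comparing checksums would be the stricter option.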

Feature request: Throw an error or warning when the input path is incorrect.

Hi,
I accidentally used the tool without using a full path. (I used ~/path/to/directory). However, the tool didn't give any warning or error message that the path was incorrect. It just didn't do anything. It would be nice if the tool threw an error with a descriptive error message, when it can't find the supplied path, so that users know what went wrong.
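A check along these lines could run before any processing starts. It is a sketch of the requested behaviour rather than the tool's code; `resolve_input_dir` is a hypothetical name, and only standard `os.path` calls are used.

```python
import os


def resolve_input_dir(path):
    """Expand '~', make the path absolute, and fail loudly if it is not a directory."""
    full = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(full):
        raise FileNotFoundError(f"Input directory not found: {full}")
    return full
```

Because `os.path.expanduser` handles a leading `~`, a path such as ~/path/to/directory would either resolve correctly or produce a descriptive error instead of silently doing nothing.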

Installation of CanFam3.1 failing

Hello,

I am trying to install the CanFam3.1 reference genome from the list of supported genomes. I was able to successfully install GRCh37 and GRCh38, but when I try to install CanFam3.1, I get:

genInstall.install('CanFam3.1', bash=True)
Beginning installation. This may take up to 40 minutes to complete.
tar (child): /path/lib/python3.7/site-packages/SigProfilerMatrixGenerator/references/chromosomes/tsb/CanFam3.1.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
The ensembl ftp site is not currently responding.

I tried various combinations of setting ftp to "False" and rsync to "True," and get the following error saying that CanFam3.1 is not supported:

genInstall.install('CanFam3.1', ftp=False, rsync=True, bash=True)
Beginning installation. This may take up to 20 minutes to complete.
[DEBUG] Path to SigProfilerMatrixGenerator used for the install: /path/lib/python3.7/site-packages/SigProfilerMatrixGenerator
CanFam3.1 is not supported. The following genomes are supported:
GRCh37, GRCh38, mm10

Can you advise me on how to install the CanFam3.1 reference?

Thank you for your help!

Best,
Kate

Documentation for custom references

Hello,

I was wondering if you guys have provided any documentation for installing a custom reference genome? I see that it's listed as an option in the paper and also listed as a parameter to various functions within the install.py script - but I am not sure what syntax to use and what parameters need to be passed. Is just a fasta file sufficient?

Thanks
Matt

Issues when installing genomes

Hi,
I have tried both ways of installing genomes, but I can't make it work. This is what I get when I try it:

genInstall.install('GRCh37', bash=True)
Beginning installation. This may take up to 40 minutes to complete.
tar (child): /home/david/.local/lib/python3.7/site-packages/SigProfilerMatrixGenerator/references/chromosomes/tsb/GRCh37.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
The ensembl ftp site is not currently responding.

and

genInstall.install('GRCh37', rsync=False)
Beginning installation. This may take up to 40 minutes to complete.
tar (child): /home/david/.local/lib/python3.7/site-packages/SigProfilerMatrixGenerator/references/chromosomes/tsb/GRCh37.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
The ensembl ftp site is not currently responding.

Do you think I am missing something here?

Thank you!

skip all dot ('.') files

Hi Erik, would it be possible to skip all dot files in scripts/convert_input_to_simple_files.py ?
At the moment you only skip for .DS_Store

if file == '.DS_Store':
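A more general filter could test the leading character instead of one specific file name. The sketch below shows the idea in isolation; `visible_files` is a hypothetical helper, not a function in convert_input_to_simple_files.py.

```python
import os


def visible_files(directory):
    """List directory entries, skipping every name that starts with a dot."""
    return sorted(name for name in os.listdir(directory)
                  if not name.startswith("."))
```

This skips .DS_Store as before, along with any other hidden files an editor or file manager may leave behind.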

We have been caught by this a few times.
Thanks,
S.

Bad notion for invalid MAF format

Dear developer,

Thanks for your nice tool along with SigProfiler and the corresponding important work.

I found that the message "The given input files do not appear to be in the correct MAF format." is misleading. The code does not do what the message says: the continue statement here just skips one row and moves on to the next line of data. If a user does provide an invalid MAF file, an error should be raised immediately, which would save the user time and stop the tool from generating wrong output.

except:
    if first_incorrect_file:
        print("The given input files do not appear to be in the correct MAF format. Skipping this file: ", file)
        first_incorrect_file = False
    continue

When I installed this package, I also noticed that pandas is a dependency, so I am curious why this tool does not use read_csv() to handle the input. It might make the tool more compatible: for example, the columns of most MAF files are not ordered as the GDC docs describe, and they should still be handled properly.
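Both points, failing fast on an invalid file and looking columns up by name rather than position, can be sketched together. The example uses the standard `csv.DictReader` as a stand-in for the name-based lookup that `read_csv()` would provide; `read_maf` is a hypothetical helper, and the list of required columns is an assumption based on standard GDC MAF column names, not on what the tool actually reads.

```python
import csv
import io

# Assumed minimal column set; names follow the GDC MAF specification.
REQUIRED = ("Chromosome", "Start_Position", "Reference_Allele",
            "Tumor_Seq_Allele2", "Tumor_Sample_Barcode")


def read_maf(handle):
    """Parse a MAF-like stream by column name; raise if required columns are absent."""
    # Skip leading comment lines such as "#version 2.4".
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("Not a valid MAF file; missing columns: " + ", ".join(missing))
    return [{c: row[c] for c in REQUIRED} for row in reader]


# Columns deliberately shuffled to show that order does not matter.
sample = ("Tumor_Sample_Barcode\tReference_Allele\tChromosome\t"
          "Tumor_Seq_Allele2\tStart_Position\n"
          "S1\tC\t1\tT\t12345\n")
rows = read_maf(io.StringIO(sample))
print(rows[0]["Chromosome"], rows[0]["Start_Position"])
```

A file without the expected header raises a ValueError on the first read instead of being skipped row by row.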

Best,
Shixiang

"I/O operation on closed file" when using Jupyter-Notebook

I am trying to use a Jupyter Notebook to develop a script

Unfortunately, when I execute:
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
matrices = matGen.SigProfilerMatrixGeneratorFunc("name", "GRCh37", ".../running_folder", plot=False, exome=False, bed_file=None, chrom_based=False, tsb_stat=False, seqInfo=False, cushion=100)

the cell keeps running, while in the console, I receive the following:
Traceback (most recent call last):
  File "/home/dcullerne/miniconda3/envs/SigProfiler/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 381, in dispatch_queue
    yield self.process_one()
  File "/home/dcullerne/miniconda3/envs/SigProfiler/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dcullerne/miniconda3/envs/SigProfiler/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
  File "/home/dcullerne/miniconda3/envs/SigProfiler/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 365, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "/home/dcullerne/miniconda3/envs/SigProfiler/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dcullerne/miniconda3/envs/SigProfiler/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/home/dcullerne/miniconda3/envs/SigProfiler/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 278, in dispatch_shell
    sys.stderr.flush()
ValueError: I/O operation on closed file.

Stdout indicates the resulting files are being generated correctly:
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 4.08 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 3.98 seconds.
Matrices generated for 2 samples with 0 errors. Total of 500 SNVs, 3 DINUCs, and 8 INDELs were successfully analyzed.

Kernel information:
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

Notebook server version: 6.1.3

Thank you.

strandBiasTest_384.txt / strandBiasTest_6144.txt files

Hello Erik,
I've been analyzing the output and have a question about two of the produced files, 'strandBiasTest_384.txt' and 'strandBiasTest_6144.txt': the number of lines / mut. sig. (96 and 1536) differs from the numbers in the file names. Are they related, or not at all?
Thanks.

What is the point of this software

Sorry

I am quite confused: can I use SigProfilerExtractor directly to get de novo signatures, or must I use SigProfilerMatrixGenerator as an intermediate step before SigProfilerExtractor?

To run SigProfilerExtractor, I provided my mutations for each sample individually as .vcf files, like below:

2	81901396	LP6005500-DNA_D03	-	T
4	133567287	LP6005500-DNA_D03	CACCATGAATCTTAGACTTTATTATTGCTTGGTGC	-
X	87420176	LP6005500-DNA_D03	-	T
8	116029769	LP6005500-DNA_D03	TGTC	-
9	92713767	LP6005500-DNA_D03	T	-
4	150842718	LP6005500-DNA_D03	A	-
10	76559024	LP6005500-DNA_D03	-	AAAT
14	98590319	LP6005500-DNA_D03	-	TTTG

But the SBS, DBS and ID output folders are empty.

So do you think I must use SigProfilerMatrixGenerator before using SigProfilerExtractor?

What is the point of SigProfilerMatrixGenerator?

wget: Error in server response, closing control connection

The following produced an error with wget:

from SigProfilerMatrixGenerator import install as genInstall
genInstall.install('GRCh37')

I think the connection to ngs.sanger.ac.uk/* for the installation of GRCh37 is down. To check:

wget -r -l1 -c -nc --no-parent -nd -P . ftp://ngs.sanger.ac.uk/scratch/project/mutographs/SigProf/GRCh37.tar.gz 2
--2019-11-22 15:26:09-- ftp://ngs.sanger.ac.uk/scratch/project/mutographs/SigProf/GRCh37.tar.gz
=> `./.listing'
Resolving ngs.sanger.ac.uk... 193.62.203.79
Connecting to ngs.sanger.ac.uk|193.62.203.79|:21... connected.
Logging in as anonymous ...
Error in server response, closing control connection.
Retrying.

As a workaround, I downloaded the genome using a web browser and put it in the proper directory: /home/[usr]/.local/lib/python3.7/site-packages/SigProfilerMatrixGenerator/references/chromosomes/tsb/

I ran genInstall.install('GRCh37') again after placing the genome within /tsb, and it seemed to work, in case anyone else has the same issue with wget.

How exactly is 96 context counted?

Hi, thank you for all your effort. This package is wonderful.

I was using SigProfilerMatrixGeneratorFunc to count 96 context mutations and found that the result is different between 'SNV only VCF' and 'SNV + INDEL VCF'.

For example, my VCF has total of 5482 mutations of which 4510 are SNVs and 972 are INDELs. (I obtained the number of SNVs and INDELs using python package cyvcf2 and GATK SelectVariants, and the results are consistent.)

I made two VCFs, one containing only SNVs (total of 4510 mutations) and the other containing SNVs and INDELs (total of 5482 mutations). When I run these two VCFs in SigProfilerMatrixGeneratorFunc , these are the results I get. The actual codes are as follows:

import numpy as np
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
sigprofiler_snv_count = matGen.SigProfilerMatrixGeneratorFunc("matrix_generation_snv", "GRCh37", "02_vcf/test_snv", tsb_stat= True)
sigprofiler_both_count = matGen.SigProfilerMatrixGeneratorFunc("matrix_generation_both", "GRCh37", "02_vcf/test_both", tsb_stat= True)
print(np.array_equal(sigprofiler_both_count['96'], sigprofiler_snv_count['96']))

and the result is like this:

Matrices generated for 1 samples with 0 errors. Total of 4928 SNVs, 209 DINUCs, and 752 INDELs were successfully analyzed.
Matrices generated for 1 samples with 0 errors. Total of 4510 SNVs, 0 DINUCs, and 0 INDELs were successfully analyzed.
False

I looked at the source code of SigProfilerMatrixGeneratorFunc and it seems like extra counts of SNVs are from VCF rows where REF and ALT both have length 2. When SigProfilerMatrixGeneratorFunc processes these rows it divides them into two separate SNVs. After all, if the number of DINUCs (209) are multiplied by two and subtracted from the number of SNVs (4928), the result is 4510.
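The arithmetic behind that observation, using only the numbers printed in the two log lines above (the variable names are illustrative):

```python
reported_snvs = 4928    # SNVs reported for the SNV + INDEL VCF
reported_dinucs = 209   # DINUCs reported for the same VCF

# If each dinucleotide row was also counted as two separate SNVs,
# removing those leaves the SNV count of the SNV-only VCF.
snvs_without_split_dinucs = reported_snvs - 2 * reported_dinucs
print(snvs_without_split_dinucs)  # 4510
```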

I am a bit confused by this result. Is this how the mutations were counted for extracting the COSMIC signatures (both v2 and v3)? If so, what is the rationale behind it? It seems that mutations from a DINUC could influence both SNV signatures and DINUC signatures. If we are counting all the SNVs within a dinucleotide separately, why not do the same for trinucleotide mutations? (Although the probability of a trinucleotide mutation is very low, I think the code should still take this case into account.)

Chromosome installation is deemed successful if benchmarking is not run

Chromosome installation always finishes with the following message, even if benchmarking fails:

Installation was succesful.
SigProfilerMatrixGenerator took 3.1525115966796875 seconds to complete.
To proceed with matrix_generation, please provide the path to your vcf files and an appropriate output path.
Installation complete.

I believe this is because both the benchmark files and the correct answers are included in the repo, so the script that checks answers finds the correct answers if they are not overwritten.

GRCh37 ensembl path changed

On line 77 of the install.py file, the path to the Ensembl FTP site should be ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/ ; the path in the current code is ftp://ftp.ensembl.org/pub/grch37/**update**/fasta/homo_sapiens/dna/

Fix .gitignore and remove files from version control

Hi,

Your .gitignore is currently called 1.gitignore - you should fix the name because at the moment it's doing nothing.

There are a load of committed .DS_Store and *.pyc files in this repo - they should not be version controlled!

Stale link

It looks like Ensembl is reporting an invalid FTP location for the GRCh37 links. These should be replaced with the most current GRCh37 release at Ensembl (release-75) for all downloads related to GRCh37: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/

os.system("bash -c '" + 'wget -r -l1 -c -nc --no-parent -A "*.dna.chromosome.*" -nd -P ' + chromosome_fasta_path + ' ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/ 2>> install.log' + "'")

New Reference Genome

Is it possible to use another reference genome except from the supported ones indicated in the repository?

Missing CNV folder under SigProfilerMatrixGenerator/references

Hi,

I just installed SigProfilerMatrixGenerator following instructions. But when I tried to run a test, I got the following error:

>>> scna.generateCNVMatrix(file_type, input_file, project, output_path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yc790/.local/lib/python3.8/site-packages/SigProfilerMatrixGenerator/scripts/CNVMatrixGenerator.py", line 18, in generateCNVMatrix
    with open('SigProfilerMatrixGenerator/references/CNV/CNV_features.tsv') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'SigProfilerMatrixGenerator/references/CNV/CNV_features.tsv'
>>> 

I checked references folder and there is no CNV folder.

Additionally when I tried to install references for R sessions I got another error:

> library("SigProfilerMatrixGeneratorR")
> install('GRCh38', rsync=FALSE, bash=TRUE)
Error in py_module_import(module, convert = convert) : 
  ModuleNotFoundError: No module named 'SigProfilerMatrixGenerator'
> traceback()
4: stop(structure(list(message = "ModuleNotFoundError: No module named 'SigProfilerMatrixGenerator'", 
       call = py_module_import(module, convert = convert), cppstack = structure(list(
           file = "", line = -1L, stack = c("/usr/local/lib/R/site-library/reticulate/libs/reticulate.so(Rcpp::exception::exception(char const*, bool)+0x74) [0x7fea898d7294]", 
           "/usr/local/lib/R/site-library/reticulate/libs/reticulate.so(Rcpp::stop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x29) [0x7fea898c7a66]", 
           "/usr/local/lib/R/site-library/reticulate/libs/reticulate.so(py_module_import(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)+0xa5) [0x7fea898e8ea5]", 
           "/usr/local/lib/R/site-library/reticulate/libs/reticulate.so(_reticulate_py_module_import+0xbe) [0x7fea898d346e]", 
           "/usr/lib/R/lib/libR.so(+0xf5b6c) [0x7fea921e0b6c]", 
           "/usr/lib/R/lib/libR.so(+0x13766a) [0x7fea9222266a]", 
           "/usr/lib/R/lib/libR.so(Rf_eval+0x88) [0x7fea92235dd8]", 
           "/usr/lib/R/lib/libR.so(+0x14cc9f) [0x7fea92237c9f]", 
           "/usr/lib/R/lib/libR.so(Rf_applyClosure+0x1a2) [0x7fea92238b92]", 
           "/usr/lib/R/lib/libR.so(+0x13a50e) [0x7fea9222550e]", 
           "/usr/lib/R/lib/libR.so(Rf_eval+0x88) [0x7fea92235dd8]", 
           "/usr/lib/R/lib/libR.so(+0x14cc9f) [0x7fea92237c9f]", 
           "/usr/lib/R/lib/libR.so(Rf_applyClosure+0x1a2) [0x7fea92238b92]", 
           "/usr/lib/R/lib/libR.so(+0x13a50e) [0x7fea9222550e]", 
           "/usr/lib/R/lib/libR.so(Rf_eval+0x88) [0x7fea92235dd8]", 
           "/usr/lib/R/lib/libR.so(+0x14cc9f) [0x7fea92237c9f]", 
           "/usr/lib/R/lib/libR.so(Rf_applyClosure+0x1a2) [0x7fea92238b92]", 
           "/usr/lib/R/lib/libR.so(Rf_eval+0x2af) [0x7fea92235fff]", 
           "/usr/lib/R/lib/libR.so(Rf_ReplIteration+0x202) [0x7fea92269ff2]", 
           "/usr/lib/R/lib/libR.so(+0x17f380) [0x7fea9226a380]", 
           "/usr/lib/R/lib/libR.so(run_Rmainloop+0x50) [0x7fea9226a440]", 
           "/usr/lib/R/bin/exec/R(main+0x1f) [0x55b23511a09f]", 
           "/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fea91f200b3]", 
           "/usr/lib/R/bin/exec/R(_start+0x2e) [0x55b23511a0de]"
           )), class = "Rcpp_stack_trace")), class = c("Rcpp::exception", 
   "C++Error", "error", "condition")))
3: py_module_import(module, convert = convert)
2: reticulate::import("SigProfilerMatrixGenerator.install")
1: install("GRCh38", rsync = FALSE, bash = TRUE)
> 

Any suggestion?

Thanks a lot for the help!

Ying

Install error

I installed mm10 - this worked fine. Later, I tried to install GRCh37, this created the reference files and then died with the following message:

Transcript files created.
Job took: 934.6896922588348 seconds
The transcriptional reference data for GRCh37 has been saved.
All reference files have been created.
Verifying and benchmarking installation now...
Traceback (most recent call last):
File "", line 1, in
File "/home/debian/docker_test/new_test/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/install.py", line 200, in install
benchmark(ref_dir)
File "/home/debian/docker_test/new_test/SigProfilerMatrixGenerator/SigProfilerMatrixGenerator/install.py", line 110, in benchmark
shutil.move("scripts/Benchmark/BRCA_bench/", "references/vcf_files/")
File "/usr/lib/python3.5/shutil.py", line 542, in move
raise Error("Destination path '%s' already exists" % real_dst)
shutil.Error: Destination path 'references/vcf_files/BRCA_bench' already exists
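The traceback shows `shutil.move` refusing to overwrite a directory left behind by the earlier mm10 install. One way to make such a step repeatable is to clear the stale target first. This is a sketch of that approach rather than the project's actual fix; `move_replacing` is a hypothetical helper.

```python
import os
import shutil


def move_replacing(src, dst_dir):
    """Move src into dst_dir, replacing any entry of the same name already there."""
    os.makedirs(dst_dir, exist_ok=True)
    # normpath drops a trailing slash so basename returns the directory name.
    target = os.path.join(dst_dir, os.path.basename(os.path.normpath(src)))
    if os.path.isdir(target):
        shutil.rmtree(target)
    elif os.path.exists(target):
        os.remove(target)
    return shutil.move(src, dst_dir)
```

With the paths from the traceback, the call would be `move_replacing("scripts/Benchmark/BRCA_bench/", "references/vcf_files/")`. Note that this deletes the previous copy, which is only appropriate if the benchmark files are meant to be regenerated.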

The transcriptional reference data appears to be corrupted

Hello,
I finished installing SigProfilerMatrixGenerator and tried to install the 'GRCh37' reference genome with the command genInstall.install('GRCh37', rsync=True, bash=True). I have reinstalled several times, but I always get the error message "The transcriptional reference data appears to be corrupted. Please reinstall the GRCh37 genome". What should I do to resolve this issue?

Operating system is centos 7
Python3.6.2
rsync-3.1.2-6.el7_6.1.x86_64

Best regards,
Bruce

reference files corrupted

Hello,

after SigProfilerMatrixGenerator is installed, when i try to install 'GRCh38' or 'GRCh37' , installation process fails and i get message that 'files may be corrupted' and 'to try to reinstall genomes'. I reinstalled genomes multiple times but result is always the same. What should i do to resolve this issue?

Note: I tried installing the genomes in the way that is explained in the README on the GitHub page.

Best regards,
Dalibor

How to get ICGC-like VCF Format

SigProfilerExtractor will not run with my VCF file.
How can I convert my VCF file to the ICGC-like VCF format mentioned in the manual? I have been searching online but haven't found any hints yet.

INPUT FILE FORMAT
This tool currently supports maf, vcf, simple text file, and ICGC formats. The user must provide variant data adhering to one of these four formats. If the user’s files are in vcf format, each sample must be saved as a separate file.

Below is my error message:


>>> sig.sigProfilerExtractor("vcf","mutect2_withFiltersTerra_vcf",data, minimum_signatures=1, maximum_signatures=4,reference_genome="GRCh38",opportunity_genome="GRCh38", exome= True)

************** Reported Current Memory Use: 0.17 GB *****************
File format not supported
>>> 

Here is a snippet of my VCF (without headers):

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SC173874	SC173909
chr1	35816826	.	G	A	.	PASS	ALIGN_DIFF=63;AS_FilterStatus=SITE;AS_SB_TABLE=57,27|3,1;DP=95;ECNT=1;FUNCOTATION=[AGO4|hg38|chr1|35816826|35816826|INTRON||SNP|G|G|A|g.chr1:35816826G>A|ENST00000373210.3|+|||c.e2-56G>A|||0.42643391521197005|AAAAAAGAAAGAAAGAAAGAA|||||||||||||||||||||||||||||CH471059|NM_017629.3|NP_060099|HGNC:18424|argonaute_%20_4_%2C__%20_RISC_%20_catalytic_%20_component|Approved|gene_%20_with_%20_protein_%20_product|protein-coding_%20_gene|EIF2C4|"eukaryotic_%20_translation_%20_initiation_%20_factor_%20_2C_%2C__%20_4"_%2C__%20_"argonaute_%20_RISC_%20_catalytic_%20_component_%20_4"|hAGO4_%2C__%20_KIAA1567_%2C__%20_FLJ20033|"argonaute_%20_4"|1p34.3|2016-10-05|2013-02-15|2015-11-27|AB046787||192670|ENSG00000134698|12906857|NM_017629|408|Argonaute/PIWI_%20_family|CCDS397|OTTHUMG00000004243|192670|607356|NM_017629|Q9HCK5|ENSG00000134698|uc001bzj.3||||AGO4_HUMAN||A7MD27|Q9HCK5|epidermal_%20_growth_%20_factor_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0007173)_%7C_Fc-epsilon_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0038095)_%7C_fibroblast_%20_growth_%20_factor_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0008543)_%7C_gene_%20_expression_%20_(GO:0010467)_%7C_innate_%20_immune_%20_response_%20_(GO:0045087)_%7C_mRNA_%20_catabolic_%20_process_%20_(GO:0006402)_%7C_negative_%20_regulation_%20_of_%20_translation_%20_involved_%20_in_%20_gene_%20_silencing_%20_by_%20_miRNA_%20_(GO:0035278)_%7C_neurotrophin_%20_TRK_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0048011)_%7C_Notch_%20_signaling_%20_pathway_%20_(GO:0007219)_%7C_phosphatidylinositol-mediated_%20_signaling_%20_(GO:0048015)|cytoplasmic_%20_mRNA_%20_processing_%20_body_%20_(GO:0000932)_%7C_cytosol_%20_(GO:0005829)_%7C_membrane_%20_(GO:0016020)_%7C_micro-ribonucleoprotein_%20_complex_%20_(GO:0035068)_%7C_RISC_%20_complex_%20_(GO:0016442)|miRNA_%20_binding_%20_(GO:0035198)|||||||||||||||||||false|false||false|false||false|false|false||false|false|false|false|false|false|false|false|false|false|false|
false|false|false|false|false|false|false|false|false|||false|false||false||false||false|false|false||false|||false||||||||||||||||||||||||||||||||||||||||||||3.60985e-04|5.05433e-04|5.98086e-04|4.37637e-04|0.00000e+00|0.00000e+00|0.00000e+00|3.81679e-03|0.00000e+00|5.15464e-03|0.00000e+00|0.00000e+00|0.00000e+00|3.28192e-04|4.04531e-04|7.66871e-04|0.00000e+00|3.86747e-04|2.88060e-04|0.00000e+00|1.63666e-04|3.85802e-04|2.48942e-04|5.29101e-04|1.61290e-02|0.00000e+00|0.00000e+00|0.00000e+00|5.05433e-04|1.55971e-03||1|36282427|false|false|rs1296156478|RF|SC173909|SC173874|Unknown|CRC_PR];GERMQ=93;MBQ=30,20;MFRL=233,315;MMQ=60,50;MPOS=27;NALIGNS=53;NALOD=1.21;NLOD=12.72;POPAF=2.71;ROQ=12;TLOD=6.51;UNITIGS=124	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:36,4:0.109:40:9,0:21,4:26,10,3,1	0/0:48,0:0.021:48:23,0:23,0:31,17,0,0
chr1	48471987	.	G	T	.	PASS	AS_FilterStatus=SITE;AS_SB_TABLE=61,82|3,2;DP=155;ECNT=1;FUNCOTATION=[SPATA6|hg38|chr1|48471987|48471987|MISSENSE||SNP|G|G|T|g.chr1:48471987G>T|ENST00000371847.7|-|1|187|c.22C>A|c.(22-24)Cag>Aag|p.Q8K|0.71571072319202|AGGGCGCACTGCAGCGCCTTC|SPATA6_ENST00000371843.7_MISSENSE_p.Q8K/SPATA6_ENST00000396199.7_MISSENSE_p.Q8K||||||||||||||||||||91|biliary_tract(2)_%7C_breast(12)_%7C_central_nervous_system(44)_%7C_large_intestine(11)_%7C_pancreas(22)|||||||HM005491|NM_019073.3|NP_061946|HGNC:18309|spermatogenesis_%20_associated_%20_6|Approved|gene_%20_with_%20_protein_%20_product|protein-coding_%20_gene|||SRF1_%2C__%20_FLJ10007_%2C__%20_SRF-1|"spermatogenesis-related_%20_factor-1"|1p33|2016-10-05|||AK000869||54558|ENSG00000132122||NM_019073|||CCDS551_%2C__%20_CCDS65535_%2C__%20_CCDS72787|OTTHUMG00000007794|54558|613947|NM_001286238|Q9NWH7|ENSG00000132122|uc001crr.4||||SPAT6_HUMAN||Q5T3N7_%7C_Q8WUE6|Q9NWH7|cell_%20_differentiation_%20_(GO:0030154)_%7C_multicellular_%20_organismal_%20_development_%20_(GO:0007275)_%7C_spermatogenesis_%20_(GO:0007283)|extracellular_%20_region_%20_(GO:0005576)||||||||||||||||||||false|false||false|false||false|false|false||false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|||false|false||false||false||false|false|false||false|||false|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false|||SC173909|SC173874|Unknown|CRC_PR];GERMQ=93;MBQ=30,20;MFRL=248,175;MMQ=60,60;MPOS=60;NALIGNS=1;NALOD=1.89;NLOD=22.87;POPAF=6.00;ROQ=24;TLOD=7.51;UNITIGS=205	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:51,5:0.087:56:27,3:24,2:23,28,3,2	0/0:92,0:0.013:92:36,0:56,0:38,54,0,0
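
The manual passage quoted above says each sample must be saved as a separate file, while this snippet carries two sample columns (SC173874 and SC173909). Below is a minimal sketch of splitting a multi-sample VCF into one single-sample VCF per column. The function name split_vcf_by_sample is hypothetical, and whether this alone resolves the "File format not supported" message is an assumption that the thread does not confirm.

```python
def split_vcf_by_sample(vcf_text):
    """Return {sample_name: single-sample VCF text} for a multi-sample VCF.

    Hypothetical helper for illustration; assumes tab-delimited VCF text
    with the nine fixed columns (CHROM .. FORMAT) before the sample columns.
    """
    meta, samples, per_sample = [], [], {}
    for line in vcf_text.splitlines():
        if line.startswith("##"):
            meta.append(line)  # meta-information lines go to every output
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            fixed, samples = header[:9], header[9:]
            per_sample = {s: meta + ["\t".join(fixed + [s])] for s in samples}
        elif line.strip():
            fields = line.split("\t")
            for i, s in enumerate(samples):
                # Keep the fixed columns plus this sample's genotype column.
                per_sample[s].append("\t".join(fields[:9] + [fields[9 + i]]))
    return {s: "\n".join(lines) + "\n" for s, lines in per_sample.items()}
```

Each value in the returned dictionary can then be written to its own file, for example SC173874.vcf and SC173909.vcf, inside the input folder.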

Cannot generate InDel matrix

The SNV and DNV matrices work fine:

matrices = matGen.SigProfilerMatrixGeneratorFunc("denovoWGS", "mm10", "/home/max2/WD/denovomaf/",plot=False, exome=False, bed_file=None, chrom_based=False, tsb_stat=False, seqInfo=False, cushion=100)
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 408.53 seconds.
Starting matrix generation for INDELs...

Then the program just stopped. The log shows the following error:

Traceback (most recent call last):
  File "", line 1, in
  File "/home/max2/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGeneratorFunc.py", line 532, in SigProfilerMatrixGeneratorFunc
    mutation_ID, skipped_mut, total = matGen.catalogue_generator_INDEL_single (mutation_ID, lines, chrom, vcf_path, vcf_path_original, vcf_files, bed_file_path, chrom_path, project, output_matrix, exome, genome, ncbi_chrom, limited_indel, functionFlag, bed, bed_ranges, chrom_based, plot, tsb_ref, transcript_path, seqInfo, gs, log_file)
  File "/home/max2/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGenerator.py", line 748, in catalogue_generator_INDEL_single
    with open (chrom_path + chrom + '.txt', "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/max2/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/SigProfilerMatrixGenerator/references/chromosomes/tsb/mm10/21.txt'

The mouse genome has only 19 autosomes plus X and Y, and I didn't see any chr21 in my input MAF file.
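
One way to double-check which chromosome names the input really contains is to list the distinct values of the MAF chromosome column and compare them with the names expected for mouse. This is a minimal sketch, assuming a tab-delimited MAF with a column named Chromosome. The helper names and the MOUSE_CHROMOSOMES set are assumptions for illustration and are not taken from SigProfilerMatrixGenerator.

```python
import csv
import io

# Assumed set of mouse chromosome names (19 autosomes, X, Y, mitochondrial).
MOUSE_CHROMOSOMES = {str(i) for i in range(1, 20)} | {"X", "Y", "M", "MT"}


def maf_chromosomes(maf_text, column="Chromosome"):
    """Return the sorted distinct values of the chromosome column."""
    lines = [l for l in maf_text.splitlines() if l and not l.startswith("#")]
    reader = csv.DictReader(
        io.StringIO("\n".join(lines)), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return sorted({row[column].strip() for row in reader})


def unexpected_chromosomes(found, expected=MOUSE_CHROMOSOMES):
    """Names present in the input (ignoring a 'chr' prefix) but not expected."""
    normalized = {c[3:] if c.lower().startswith("chr") else c for c in found}
    return sorted(normalized - set(expected))
```

An empty result from unexpected_chromosomes(maf_chromosomes(text)) would suggest the request for 21.txt does not come from the input file itself.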

Problem generating custom genome

Hi,

I am not able to generate a custom genome for the BALB_cJ reference.

I split the FASTA file containing all chromosomes into separate files (header '>chr1' etc.) and gzipped them, resulting in chr1.txt.gz, etc. (for chr1-19, X and M).
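
The splitting step described above can be sketched as follows. The function name split_fasta is hypothetical, the output naming (record name plus .txt.gz, header line kept) follows the description in this post, and whether the installer expects exactly this layout is not confirmed here. The sketch also assumes every header line has a name after '>'.

```python
import gzip
import os


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to out_dir/<name>.txt.gz.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    written, out = [], None
    try:
        with open(fasta_path, "rt") as fasta:
            for line in fasta:
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                    # Record name is the first word after '>', e.g. 'chr1'.
                    name = line[1:].split()[0]
                    path = os.path.join(out_dir, name + ".txt.gz")
                    out = gzip.open(path, "wt")
                    written.append(path)
                if out is not None:
                    out.write(line)  # header and sequence lines are both kept
    finally:
        if out is not None:
            out.close()
    return written
```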

Transcript information was downloaded from biomart. The first line of the file:
MGP_BALBcJ_G0030692 MGP_BALBcJ_T0077445 6 1 45869861 45878905

I do not have exome coordinates for the genome, so added a dummy BED file containing:
chr1 1 2

I ran the following using Python 3.9.2 and SigProfilerMatrixGenerator (ver 1.1.28).
from SigProfilerMatrixGenerator import install as genInstall

genInstall.install("BALB_cJ", custom=True, fastaPath="/mnt/data-sets/bcf/genomeIndexes/BALB_cJ_v1/fasta/separate_chr", transcriptPath="/mnt/data-sets/bcf/genomeIndexes/BALB_cJ_v1/annotation/Transcripts_Ensembl_103_BALB_cJ_v1.txt", exomePath="/mnt/fls01-home01/mqbssid3/scratch/HP_20210412/exome.bed")

Here is the output:

Beginning installation. This may take up to 20 minutes to complete.
[DEBUG] Path to SigProfilerMatrixGenerator used for the install: /mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGenerator
The string file has been created for Chromosome: 1 (1/21)
10 (2/21)
18 (3/21)
14 (4/21)
X (5/21)
4 (6/21)
M (7/21)
11 (8/21)
6 (9/21)
3 (10/21)
7 (11/21)
13 (12/21)
9 (13/21)
2 (14/21)
17 (15/21)
19 (16/21)
15 (17/21)
16 (18/21)
12 (19/21)
5 (20/21)
8 (21/21)
Chromosome string files for BALB_cJ have been created. Continuing with installation.
[DEBUG] Chromosome tsb files found at: /mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGeneratorreferences/chromosomes/tsb/BALB_cJ/
The transcriptional reference data for BALB_cJ has not been saved. Creating these files now
Traceback (most recent call last):
  File "/mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGenerator/scripts/save_tsb_192.py", line 347, in
    main()
  File "/mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGenerator/scripts/save_tsb_192.py", line 344, in main
    save_tsb(chromosome_string_path, transcript_path, output_path)
  File "/mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGenerator/scripts/save_tsb_192.py", line 85, in save_tsb
    out.close()
UnboundLocalError: local variable 'out' referenced before assignment
The transcriptional reference data for BALB_cJ has been saved.
All reference files have been created.
To proceed with matrix_generation, please provide the path to your vcf files and an appropriate output path.
Installation complete.
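
An UnboundLocalError on a line such as out.close() is the usual symptom of a variable that is only assigned inside a loop or branch that never ran. The sketch below is a generic illustration of that pattern and is not the code of save_tsb_192.py. Both function names are hypothetical, and the suggestion that no input record was processed here is an assumption.

```python
import io


def write_all(records):
    """Generic illustration: 'out' is bound only if the loop body runs."""
    for i, record in enumerate(records):
        if i == 0:
            out = io.StringIO()  # first assignment happens inside the loop
        out.write(record + "\n")
    out.close()                  # raises UnboundLocalError if records is empty


def write_all_safe(records):
    """Same logic with 'out' bound up front, so empty input is handled."""
    out = io.StringIO()
    for record in records:
        out.write(record + "\n")
    text = out.getvalue()
    out.close()
    return text
```

If the installer follows a similar pattern, the error would point to an input that yielded nothing to write, for example transcript or chromosome files that were not found where the script looked.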

HOWEVER: '/mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGeneratorreferences/chromosomes/tsb/BALB_cJ/' is empty.
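
Note that this path has no separator between SigProfilerMatrixGenerator and references, unlike the chrom_string path reported by the same run. That is the result plain string concatenation gives when the base directory has no trailing slash. A minimal sketch of the difference, using a shortened, hypothetical stand-in for the package directory (this is not the installer's code):

```python
import os

# Hypothetical, shortened stand-in for the package directory in the log.
package_dir = "/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGenerator"
relative = "references/chromosomes/tsb/BALB_cJ/"

# Plain concatenation inserts no separator, giving '...Generatorreferences/...'.
concatenated = package_dir + relative

# os.path.join inserts the separator, giving '...Generator/references/...'.
joined = os.path.join(package_dir, relative)
```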

The chromosomes appear to be processed in a temporary location '/mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGenerator/references/chromosomes/chrom_string/BALB_cJ/', but that directory is deleted by the program at the end of its run.

Q) Am I doing anything obviously wrong?

When I install 'mm39' I only see chromosome files in '/mnt/fls01-home01/mqbssid3/.conda/envs/FreeBayes135/lib/python3.9/site-packages/SigProfilerMatrixGenerator/references/chromosomes/tsb/mm39/'.

Q) Is it possible to only install BALB_cJ without annotation or exome data?

Thank you

File format not supported

Hi folks,

I am trying to plot the mutational spectrum of a set of variant calls from nanopore data, called with claire.

A snapshot of the file looks like this:

...
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_001416.1,length=48502>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GM24385-422
1       584592  .       C       T       1227    .       .       GT:GQ:DP:AF     0/1:1227:43:0.3256
1       846202  .       G       A       953     .       .       GT:GQ:DP:AF     0/1:953:40:0.225
1       858692  .       G       C       946     .       .       GT:GQ:DP:AF     1/1:946:34:0.1176
1       920362  .       C       A       1020    .       .       GT:GQ:DP:AF     0/1:1020:35:0.3143
...

When I run SigProfilerMatrixGeneratorFunc using a folder with VCFs like the one above as my input, I get a "File format not supported" error. The same code seems to work fine on variants called from Illumina data using Strelka or Mutect. Do you see anything incorrect about the VCF format above?

Changing reference folder path

Hi,

I am building a Docker image based on Python 3.8 to run sigproSS. All software components are installed in /usr/local, but I would like the reference files to be kept outside the Docker image (that is, not under /usr/local but on a shared storage system). Is there an easy way to do this?

I am reading the docs and the code but I am not sure it is possible at this time.

Thanks for your help,
Anthony
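
Paths quoted elsewhere on this page show the reference files living in a references directory inside the installed package (for example .../site-packages/SigProfilerMatrixGenerator/references/chromosomes/...). One possible workaround, sketched here under the assumption that the package follows symbolic links when it resolves that directory, is to replace references with a symlink to shared storage. The helper name link_references and both path arguments are hypothetical.

```python
import os
import shutil


def link_references(package_dir, shared_dir):
    """Keep <package_dir>/references on shared storage via a symlink.

    Hypothetical helper for illustration; assumes shared_dir does not already
    contain entries with the same names as those being moved.
    """
    local = os.path.join(package_dir, "references")
    os.makedirs(shared_dir, exist_ok=True)
    if os.path.islink(local):
        os.remove(local)  # replace a stale link
    elif os.path.isdir(local):
        # Move anything already installed onto the shared location first.
        for name in os.listdir(local):
            shutil.move(os.path.join(local, name), os.path.join(shared_dir, name))
        os.rmdir(local)
    os.symlink(shared_dir, local, target_is_directory=True)
    return local
```

In a Docker setting the link would be created at image build time and point at the path where the shared volume is mounted at run time.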
