quinlan-lab / strling

Detect novel (and reference) STR expansions from short-read data

License: MIT License

Nim 69.17% Python 25.74% Groovy 5.10%

short-tandem-repeats str whole-genome-sequencing nim-lang hacktoberfest

strling's Introduction

STRling (pronounced like “sterling”) is a method to detect large STR expansions from short-read sequencing data. It can detect novel STR expansions, that is, expansions where the reference genome has no STR at that position (or has a different repeat unit). It can also detect STR expansions that are annotated in the reference genome. STRling uses k-mer counting to recover mis-mapped STR reads, then uses soft-clipped reads to precisely locate the STR expansion in the reference genome.

Install and Run STRling

Please see the STRling Documentation for installation and running instructions.

Citation

For more details about the algorithm, check out our paper.

If using STRling, please cite:

Dashnow, H., Pedersen, B.S., Hiatt, L. et al. STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci. Genome Biol 23, 257 (2022). https://doi.org/10.1186/s13059-022-02826-4

strling's Issues

Assertion error for reads without sequence

I get the following assertion error:

strling version: 0.5.0
[strling] using existing file resources/genome.dna.homo_sapiens.GRCh38.100.fasta.str for genome repeats
[strling] got STR repeats from genome into an interval tree
[strling] collecting str-like reads
[strling] extracting chromosome:1
[strling] extracting chromosome:10
[strling] extracting chromosome:11
[strling] extracting chromosome:12
[strling] extracting chromosome:13
[strling] extracting chromosome:14
[strling] extracting chromosome:15
[strling] extracting chromosome:16
[strling] extracting chromosome:17
[strling] extracting chromosome:18
[strling] extracting chromosome:19
[strling] extracting chromosome:2
[strling] extracting chromosome:20
[strling] extracting chromosome:21
[strling] extracting chromosome:22
[strling] extracting chromosome:3
[strling] extracting chromosome:4
[strling] extracting chromosome:5
[strling] extracting chromosome:6
[strling] extracting chromosome:7
[strling] extracting chromosome:8
[strling] extracting chromosome:9
[strling] extracting chromosome:X
[strling] extracting chromosome:Y
/opt/conda/conda-bld/strling_1622157642620/work/src/strling.nim(44) strling
/opt/conda/conda-bld/strling_1622157642620/work/src/strling.nim(41) main
/opt/conda/conda-bld/strling_1622157642620/work/src/strpkg/extract.nim(319) extract_main
/opt/conda/conda-bld/strling_1622157642620/work/src/strpkg/extract.nim(200) add
/opt/conda/conda-bld/strling_1622157642620/work/src/strpkg/extract.nim(67) to_tread
/opt/conda/conda-bld/strling_1622157642620/_build_env/nim/lib/system/assertions.nim(30) failedAssertImpl
/opt/conda/conda-bld/strling_1622157642620/_build_env/nim/lib/system/assertions.nim(23) raiseAssert
/opt/conda/conda-bld/strling_1622157642620/_build_env/nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: /opt/conda/conda-bld/strling_1622157642620/work/src/strpkg/extract.nim(67, 12) `align_length > 0` K00276:107:HHYWGBBXX:8:1125:32309:38451   141     *       0       0       *
       *       0       0       *       *       AS:i:0  XS:i:0  RG:Z:LUEB0077G [AssertionDefect]

This is probably caused by the * in the sequence and quality fields. The SAM specification allows this, for example when the adapter trimming step removes the entire read. Even though such a read is useless on its own (it is also unmapped), keeping it in the alignment file preserves complete read pairs.
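A minimal sketch of the kind of guard that would avoid this assertion, written in Python rather than STRling's Nim and not taken from the STRling source. It assumes tab-separated SAM text records; the function name has_sequence and the two example records are hypothetical.

```python
def has_sequence(sam_line: str) -> bool:
    """Return True if a SAM record carries a sequence (SEQ field is not '*')."""
    fields = sam_line.rstrip("\n").split("\t")
    # A SAM record has 11 mandatory columns; SEQ is column 10 (index 9).
    return len(fields) >= 11 and fields[9] != "*"


# Hypothetical records: read1 has no sequence, read2 does.
records = [
    "read1\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*\tAS:i:0",
    "read2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
]

# Keep only the reads that can be inspected for repeat content.
usable = [r.split("\t")[0] for r in records if has_sequence(r)]
print(usable)
```

Skipping such reads, rather than asserting on them, would let paired-end files that retain fully trimmed mates pass through unchanged.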

assign_reads_locus produces bad indices

Sometimes right index < left index. I haven't figured out when/why yet

For example ri = 0 and li = 1

STR ref subcommand

Currently the file containing the STRs found in the reference genome is generated when running a sample. Add a subcommand to generate this file separately. This will be simpler and faster when running a batch of samples against the same reference genome.

Is it okay to apply STRling on PCR-based WGS dataset?

Hi, I'm using STRling on PCR-based WGS data and am unsure whether this is appropriate.
I couldn't find anything about this in your docs or paper, so I wanted to ask.

Also, if it is okay, can you recommend any filters that are useful for short tandem repeat QC on PCR-based WGS data?

Thanks,
JaeHyun

Multiallelic STRs

Hey, thanks for the great tool. How are you dealing with problems related to multiallelic short tandem repeats? For example, RFC1.

broadinstitute/str-analysis#12 (comment)

STRling warning message and empty binaries

Abbreviated from emailed bug report:

"
I'm using STRling with CRAM files from the Human Genome Diversity Project (HGDP-ceph) downloaded from there:
https://www.internationalgenome.org/data-portal/population/FrenchHGDP.
I indexed the files using samtools index and proceeded to run STRling extract with the GRCh38_full_analysis_set_plus_decoy_hla.fa reference genome (the one the reads were aligned to).
The program runs smoothly until after the Y chromosome, where it starts giving me warnings like:

warning. bad read (this happens with bwa-kit alignments):ERR1395768.33889710 already in table as:(tid: 204, position: 52844, repeat: ['\x00', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,MREVERSE,READ1, split: none, mapping_quality: 0, repeat_count: 0, align_length: 151, qname: "ERR1395768.33889710")

The program keeps going until completion without crashing, but the resulting .bin files are significantly smaller than those I get when using our own files (1.2 MB vs 30 MB). The call step using those binaries yields empty outputs with no STRs found and a few lines in the Unplaced.txt file.

strling extract -f /path/to/decoy_hla/GRCh38_full_analysis_set_plus_decoy_hla.fa -v /path/to/HGDP00511.alt_bwamem_GRCh38DH.20181023.French.cram str-control/511.bin

strling call --output-prefix indiv/511 -f /path/to/decoy_hla/GRCh38_full_analysis_set_plus_decoy_hla.fa /pat/to/controls/HGDP00511.alt_bwamem_GRCh38DH.20181023.French.cram 511.bin
"

Add bounds filtering settings as command line options

e.g. number of soft-clipped reads required

Homopolymer false positives

I think we are incorrectly calling homopolymer expansions in regions that are rich in a particular base, and/or where there is an insertion that is rich in a particular base.

All of these are aligned to hg38.

Illumina reads:
/scratch/ucgd/lustre-work/quinlan/u6018199/chaisson_2019/ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/PUR/HG00733/high_cov_alignment/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage.cram

STRling calls:
/uufs/chpc.utah.edu/common/HIPAA/u6026198/storage/git/STRling/working/chaisson_2019/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage-unplaced.txt

PacBio assemblies:
/scratch/ucgd/lustre-work/quinlan/u6018199/chaisson_2019/pacbio_local_assemblies/HG00733.*.bam

False positive STRling calls:

chr1:875844-876015 C

chr1:1350049-1350049 G

chr1:3167585-3167954 C
(This is the locus in the image)

conda install error message (still installs)

conda reports the following error, but still installs just fine. Not sure if this can/should be fixed? Might make users think install has failed.

$ conda update -n base conda
$ conda create --name test-strling
$ conda activate test-strling
$ conda install strling
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

Error: repeat not found in expected region

Command

strling extract --fasta=ucsc.hg19.fasta \
                --verbose \
                --proportion-repeat 0.6 \
                sample.bam \
                sample.bin

Error

Error: unhandled exception: genome_strs.nim(39, 12) w.start < w.stop repeat CTA not found in expected region for (chrom: "chr2", start: 102619040, stop: 102619340, repeat: "CTA"), TTTATGTCTGCGTCTGTGTCTGTGTCTATGTCTGTATAAATGTCTATGTCTGTGTCTGTGTTGGTGTCTGTGTCTATGTCTATGTCTAAGTCTAGGTCTAGGTCTGTGTCTAAGTCTATGTCTAGGTCTATGTCTATGTCTATGCCTATGCCTATGTCTATGTCTATGTCTATATCTATATCTATATCTATATCTATATCTATATCTATATCTGTATCTATATCTATGTCTATGTCTGTGTCTTTGTCCCTGTCTGTGTCTGTGTCTATGTCTGTGTCTAGGTCTAGGTCTGTGTCTGTG, (chrom: "chr2", start: 102619340, stop: 102619340, repeat: "CTA") [AssertionError]

○ → samtools faidx ${genome} chr1:102619040-102619340
>chr1:102619040-102619340
aaaatattgcaggttgaaaatgcatttaatacacctaatctactgaacgtcatagcttag
cacatcttaccgtaaacatgatcagaatacttactttatacatagcctttagttgggcaa
aaatcatctaacacaaagtgtattttatagtaaagtgttgaatagctcatgtaatttatt
gaatactgtactgaaagtgaaaaacaatttttgtatgggtacgtgaagtgtggtttctac
tgaatgcgtattcctttcacaccattttaaagctgaaaaatcagtaagtcaaacaattaa
g

The sequence above does contain 'CTA' - but in lowercase (indicating it is a repetitive region I believe).

The command works if I increase the --proportion-repeat flag to 0.8.

How are allele sizes estimated?

Hi,
Could you share how the allele sizes are estimated for the samples? The values are reported as decimals rather than whole numbers, which is confusing.

Regards,
Hasna

strling-outliers.py: Incompatibility with newer Pandas version

Hello
It looks like the strling-outliers.py script uses a Pandas indexing functionality that is no longer supported.
I currently have Pandas 2.0.3, which I believe conda installed automatically following the strling instructions.
It looks like you will have to re-write the pandas indexing to keep up to date.
In the meantime, can you tell me which version of pandas you are using so that I can downgrade?
Thanks!
Error log below:

Elapsed time: 5:44:18 Calculating z scores
Traceback (most recent call last):
File "/ENVIRONMENT_PATH/bin/strling-outliers.py", line 459, in <module>
main()
File "/ENVIRONMENT_PATH/bin/strling-outliers.py", line 340, in main
z = z_score(sum_str_log_wide, mu_sd_estimates)
File "/ENVIRONMENT_PATH/bin/strling-outliers.py", line 141, in z_score
return (x - df['mu'][:,np.newaxis])/df['sd'][:,np.newaxis]
File "/ENVIRONMENT_PATH/lib/python3.8/site-packages/pandas/core/series.py", line 1033, in __getitem__
return self._get_with(key)
File "/ENVIRONMENT_PATH/lib/python3.8/site-packages/pandas/core/series.py", line 1048, in _get_with
return self._get_values_tuple(key)
File "/ENVIRONMENT_PATH/lib/python3.8/site-packages/pandas/core/series.py", line 1082, in _get_values_tuple
disallow_ndim_indexing(result)
File "/ENVIRONMENT_PATH/lib/python3.8/site-packages/pandas/core/indexers/utils.py", line 343, in disallow_ndim_indexing
raise ValueError(
ValueError: Multi-dimensional indexing (e.g. obj[:, None]) is no longer supported. Convert to a numpy array before indexing instead.
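The error message itself suggests the workaround: convert each Series to a NumPy array before adding the new axis. The sketch below applies that to the z_score expression shown in the traceback. It is not the maintainers' fix, and the two small tables (locus names, sample names, values) are made up for illustration.

```python
import numpy as np
import pandas as pd


def z_score(x: pd.DataFrame, df: pd.DataFrame) -> pd.DataFrame:
    """Row-wise z score: subtract each locus' mu and divide by its sd."""
    # .to_numpy() first, then add the axis; indexing a Series with
    # [:, np.newaxis] directly is what pandas no longer supports.
    mu = df["mu"].to_numpy()[:, np.newaxis]
    sd = df["sd"].to_numpy()[:, np.newaxis]
    return (x - mu) / sd


# Hypothetical data: rows are loci, columns are samples.
x = pd.DataFrame(
    [[1.0, 3.0], [2.0, 6.0]],
    index=["locusA", "locusB"],
    columns=["s1", "s2"],
)
estimates = pd.DataFrame(
    {"mu": [2.0, 4.0], "sd": [1.0, 2.0]},
    index=["locusA", "locusB"],
)

z = z_score(x, estimates)
print(z)
```

Like the original expression, this relies on the rows of both tables being in the same order; it does not align them by index.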

Hide advanced parameters

Hide these parameters to discourage users from changing them without careful consideration

--proportion-repeat
--min-mapq
--min-support
--min-clip
--min-clip-total

Need a more useful error for when bed file chromosomes don't match reference genome

strling call -f g1k_v37_decoy.fa -l loci.hg19.bed -o hg002.extra hg002.cram hg002.str.bin

If chromosome names in loci.hg19.bed above don't match the reference genome (e.g. chr 1 vs 1) then the following error occurs:

strling version: 0.1.0
[strling] read format version 0 from software version 0.1.0
[strling] proportion_repeat 0.800 and min mapping quality 40
[strling] reading 843211 STR reads from bin file
fatal.nim(39) sysFatal
Error: unhandled exception: index -1 not in 0 .. 86 [IndexError]
Cleaned up file hg002.extra-bounds.txt to .bpipe/trash/hg002.extra-bounds.txt.1
Cleaned up file hg002.extra-unplaced.txt to .bpipe/trash/hg002.extra-unplaced.txt.1
Cleaned up file hg002.extra-genotype.txt to .bpipe/trash/hg002.extra-genotype.txt.1
ERROR: stage str_call_individual failed: Command in stage str_call_individual failed with exit status = 1 :
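Until a clearer error is available, a pre-flight check can catch the mismatch before strling call runs. The sketch below is a standalone Python helper, not part of STRling. It assumes a samtools-style .fai index whose first column is the sequence name; the function name and the example rows (including the .fai offsets) are illustrative.

```python
def unknown_bed_chroms(bed_lines, fai_lines):
    """Return BED chromosome names missing from the FASTA index, in order seen."""
    reference = {line.split("\t")[0] for line in fai_lines if line.strip()}
    missing = []
    for line in bed_lines:
        # Skip blank lines and the usual BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        if chrom not in reference and chrom not in missing:
            missing.append(chrom)
    return missing


# Illustrative inputs: the reference uses "1", the BED file mixes "chr1" and "2".
fai = ["1\t249250621\t52\t60\t61", "2\t243199373\t253404903\t60\t61"]
bed = ["chr1\t100\t200\tCAG", "2\t300\t400\tGAA"]

missing = unknown_bed_chroms(bed, fai)
if missing:
    print(f"BED chromosomes not in reference: {', '.join(missing)}")
```

With real files, pass open("loci.hg19.bed") and open("g1k_v37_decoy.fa.fai") in place of the lists.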

Inconsistent soft-clipped reads may impact length estimates

Inconsistent soft-clipped reads are ignored when calculating bounds, see #90
However, they do still contribute to center mass and to allele length calculations. Keeping this here as a reminder in case this becomes an issue in the future.

Binary build error

@brentp I'm having some trouble compiling the binaries. Following these instructions:
https://strling.readthedocs.io/en/latest/contribute.html

stack trace: (most recent call last)
argparse.nim(881, 23)    tmpmkParser
argparse.nim(724, 31)    mkParser
argparse.nim(587, 36)    genParseProcs
macrohelp.nim(60, 8)     parentOf
/scratch/ucgd/lustre-work/quinlan/u6026198/git/STRling/src/strpkg/genome_strs.nim(172, 20) template/generic instantiation of `newParser` from here
/uufs/chpc.utah.edu/common/HIPAA/u6026198/.nimble/pkgs/argparse-0.7.1/argparse/macrohelp.nim(60, 8) Error: node not found: EXTRA
error compiling code

recode homopolymers

Because we use k = 2-6, homopolymers are currently represented as:
AA, TT, CC, GG
Recode them to A, T, C, G and adjust metrics as appropriate, i.e. double the STR count.
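The proposed recoding can be sketched as follows. This is an illustration in Python, not STRling's Nim code, and the function name is hypothetical: a two-base unit made of the same base becomes a single base, and its count doubles because each dinucleotide copy holds two copies of the one-base unit.

```python
def recode_homopolymer(repeat_unit: str, str_count: int):
    """Collapse a homopolymer dinucleotide (e.g. 'AA') to one base and double its count."""
    if len(repeat_unit) == 2 and repeat_unit[0] == repeat_unit[1]:
        return repeat_unit[0], str_count * 2
    # Every other repeat unit is returned unchanged.
    return repeat_unit, str_count


print(recode_homopolymer("AA", 10))
print(recode_homopolymer("AT", 10))
```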

Support hemizygous calls in males on sex chromosomes

Hi Harriet—another suggestion would be to support hemizygous calls in males on chrX. This is important for Fragile X and Kennedy disease (SBMA) in particular. You’d have to pass a ped file or sample gender to call.

Could alternatively infer sex from sex chromosome coverage?
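One way the coverage-based idea could look, as a rough sketch only. It assumes mean depths for chrX and the autosomes have already been computed elsewhere. The function name and the 0.75 cutoff are assumptions chosen for illustration, not values from STRling.

```python
def infer_sex(chrx_depth: float, autosome_depth: float, cutoff: float = 0.75) -> str:
    """Guess sex from the ratio of chrX depth to mean autosomal depth."""
    if autosome_depth <= 0:
        raise ValueError("autosome_depth must be positive")
    ratio = chrx_depth / autosome_depth
    # One X copy gives a ratio near 0.5; two copies give a ratio near 1.0.
    return "male" if ratio < cutoff else "female"


# Hypothetical depths for two samples sequenced to roughly 30x.
calls = [infer_sex(15.2, 30.1), infer_sex(29.4, 30.1)]
print(calls)
```

A production version would also need to handle sex chromosome aneuploidies and low-coverage samples, which a single cutoff cannot distinguish.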

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Facing error SIGSEGV: Illegal storage access. (Attempt to read from nil?) while running "strling merge".
STRling version 0.5.0

Command used-
nohup strling merge --output-prefix strling/strling_joint/joint -f GRCh38.primary_assembly.genome.fa 1.sorted.bam.bin 2.sorted.bam.bin

Error snapshot-

Comparing STR length from STRling, which value would be appropriate to use?

Hello, thank you for providing a great STR detection tool.

I understand that STRling is developed as a tool to detect outlier expansion. I'm wondering if it's appropriate to use this approach for my purpose. I plan to use STRling to uncover STRs and apply a logistic regression model to assess the association between the length of each STR and phenotype.

If it's appropriate, I'm curious about how to interpret each STR length. Currently, I assume that the length can be represented by the value of 40 * [(sum_str_counts) / local_depth]. Is this assumption correct, or would it be more appropriate to use a different output value? In paper, [ log2( (sum_str_counts + 1) / local_depth) ] is used for outlier detection. I wonder which value is more appropriate in the logistic regression with STR track length.

Best regards,
Chanhee
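For reference, the two quantities mentioned in the question can be computed side by side. The sketch below only restates the formulas as written above; it does not settle which one suits a regression. The function names and input values are made up, and the factor 40 is the one the question assumes.

```python
import math


def linear_estimate(sum_str_counts: float, local_depth: float, scale: float = 40) -> float:
    """The questioner's assumed length estimate: scale * sum_str_counts / local_depth."""
    return scale * sum_str_counts / local_depth


def log2_depth_normalized(sum_str_counts: float, local_depth: float) -> float:
    """The value quoted from the paper: log2((sum_str_counts + 1) / local_depth)."""
    return math.log2((sum_str_counts + 1) / local_depth)


# Hypothetical locus: 59 STR counts at a local depth of 30.
linear = linear_estimate(59, 30)
logged = log2_depth_normalized(59, 30)
print(round(linear, 2), logged)
```

The first value grows linearly with the STR counts, while the second compresses large values, so the two rank loci identically within a fixed depth but weight extreme expansions very differently.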

Set up verbose messages with logging

https://nim-lang.org/docs/logging.html

Call stage has high memory utilization for large numbers of bounds

Example:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                 
   NA  NA          20   0  133.6g  84.4g    676 S 139.8 22.4  31:57.63 strling

Occurs when number of bounds is large due to loci detected in that sample, as well as when a large number of loci are provided with -l as occurs in joint genotyping.

Occurs on release v0.0.1.

align_length > 0 Assertion Error

Hello,
with the current version 0.1.0 I get the following assertion error. The file was aligned with bwa mem, duplicates were removed with samblaster, and sorting was done with sambamba. Did I do something wrong?

[strling] finding STR regions on reference chromosome: 1
[strling] finding STR regions on reference chromosome: 2
[strling] finding STR regions on reference chromosome: 3
[strling] finding STR regions on reference chromosome: 4
[strling] finding STR regions on reference chromosome: 5
[strling] finding STR regions on reference chromosome: 6
[strling] finding STR regions on reference chromosome: 7
[strling] finding STR regions on reference chromosome: 8
[strling] finding STR regions on reference chromosome: 9
[strling] finding STR regions on reference chromosome: 10
[strling] finding STR regions on reference chromosome: 11
[strling] finding STR regions on reference chromosome: 12
[strling] finding STR regions on reference chromosome: 13
[strling] finding STR regions on reference chromosome: 14
[strling] finding STR regions on reference chromosome: 15
[strling] finding STR regions on reference chromosome: 16
[strling] finding STR regions on reference chromosome: 17
[strling] finding STR regions on reference chromosome: 18
[strling] finding STR regions on reference chromosome: 19
[strling] finding STR regions on reference chromosome: 20
[strling] finding STR regions on reference chromosome: 21
[strling] finding STR regions on reference chromosome: 22
[strling] finding STR regions on reference chromosome: X
[strling] finding STR regions on reference chromosome: Y
[strling] found 3327 STR-like regions in the genome
[strling] got STR repeats from genome into an interval tree
[strling] collecting str-like reads
[strling] extracting chromosome:1
fatal.nim(39) sysFatal
Error: unhandled exception: extract.nim(80, 12) align_length > 0 A00786:99:HW2C2DSXX:4:2161:9245:35665 117 1 153708993 0 * = 153708993 0 * * MC:Z:93S35M23S AS:i:0 XS:i:0 RG:Z:EPF-DAN-024-048 [AssertionError]

Unexpectedly high number of anchored reads reported

6 anchored reads reported, but I think there should only be 3. There are 3 soft-clipped reads in this picture that have MAPQ>20.

chr22.hg38-genotype.txt:
chr12 78702630 78702631 CCAAA nan inf 6 0 0 3 0 0 0.0 32

chr22.hg38-reads.txt:

chr12	78702630	CCAAA	left	4	52878202_52878488_1553162_1	205
chr12	78702630	CCAAA	left	3	52878289_52878490_1671274_0	205
chr12	78702630	CCAAA	left	1	52878247_52878503_2966158_0	205
chr12	78702654	CCAAA	none	8	52878202_52878488_1553162_1	205
chr12	78702654	CCAAA	none	8	52878247_52878503_2966158_0	205
chr12	78702654	CCAAA	none	8	52878289_52878490_1671274_0	205

Data here:
/uufs/chpc.utah.edu/common/HIPAA/u6026198/storage/git/STRling/working/ref_sim
Simulated reference reads from chr22, aligned to hg38

gene and variant scores

Do some sensible filtering by default

By default only output loci passing certain filtering thresholds e.g. number of anchored or spanning pairs. Have a flag to output all loci.

STRling merge error "SIGSEGV: Illegal storage access" when -l bed is used

Reported by @seboyden, thanks!

strling merge -v --output-prefix results/joint -l $BED -f $FASTA ../test01/$PBD.bin ../test01/$DAD.bin ../test01/$MOM.bin

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

The bug arises from opts.targets not being set in merge.nim

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

I am testing STRling for the first time, and am getting the following error message at the "call" step. The "extract" step seems to run fine. Here is the stderr output from the call step:

[strling] read 1130403 treads from bin file
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

And in case it helps, here is the same from the extract step:

[strling] finding STR regions on reference chromosome: chr1
[strling] finding STR regions on reference chromosome: chr2
[strling] finding STR regions on reference chromosome: chr3
[strling] finding STR regions on reference chromosome: chr4
[strling] finding STR regions on reference chromosome: chr5
[strling] finding STR regions on reference chromosome: chr6
[strling] finding STR regions on reference chromosome: chr7
[strling] finding STR regions on reference chromosome: chr8
[strling] finding STR regions on reference chromosome: chr9
[strling] finding STR regions on reference chromosome: chr10
[strling] finding STR regions on reference chromosome: chr11
[strling] finding STR regions on reference chromosome: chr12
[strling] finding STR regions on reference chromosome: chr13
[strling] finding STR regions on reference chromosome: chr14
[strling] finding STR regions on reference chromosome: chr15
[strling] finding STR regions on reference chromosome: chr16
[strling] finding STR regions on reference chromosome: chr17
[strling] finding STR regions on reference chromosome: chr18
[strling] finding STR regions on reference chromosome: chr19
[strling] finding STR regions on reference chromosome: chr20
[strling] finding STR regions on reference chromosome: chr21
[strling] finding STR regions on reference chromosome: chr22
[strling] finding STR regions on reference chromosome: chrX
[strling] finding STR regions on reference chromosome: chrY
[strling] finding STR regions on reference chromosome: chr6_GL000250v2_alt
[strling] finding STR regions on reference chromosome: chr16_KI270853v1_alt
[strling] finding STR regions on reference chromosome: chr17_KI270857v1_alt
[strling] finding STR regions on reference chromosome: chr6_GL000251v2_alt
[strling] finding STR regions on reference chromosome: chr15_KI270905v1_alt
[strling] finding STR regions on reference chromosome: chr6_GL000252v2_alt
[strling] finding STR regions on reference chromosome: chr6_GL000253v2_alt
[strling] finding STR regions on reference chromosome: chr6_GL000254v2_alt
[strling] finding STR regions on reference chromosome: chr6_GL000255v2_alt
[strling] finding STR regions on reference chromosome: chr6_GL000256v2_alt
[strling] found 3470 STR-like regions in the genome
[strling] got STR repeats from genome into an interval tree
[strling] collecting str-like reads
[strling] extracting chromosome:chr1
[strling] extracting chromosome:chr2
[strling] extracting chromosome:chr3
[strling] extracting chromosome:chr4
[strling] extracting chromosome:chr5
[strling] extracting chromosome:chr6
[strling] extracting chromosome:chr7
[strling] extracting chromosome:chr8
[strling] extracting chromosome:chr9
[strling] extracting chromosome:chr10
[strling] extracting chromosome:chr11
[strling] extracting chromosome:chr12
[strling] extracting chromosome:chr13
[strling] extracting chromosome:chr14
[strling] extracting chromosome:chr15
[strling] extracting chromosome:chr16
[strling] extracting chromosome:chr17
[strling] extracting chromosome:chr18
[strling] extracting chromosome:chr19
[strling] extracting chromosome:chr20
[strling] extracting chromosome:chr21
[strling] extracting chromosome:chr22
[strling] extracting chromosome:chrX
[strling] extracting chromosome:chrY
[strling] extracting chromosome:chr6_GL000250v2_alt
[strling] extracting chromosome:chr16_KI270853v1_alt
[strling] extracting chromosome:chr17_KI270857v1_alt
[strling] extracting chromosome:chr6_GL000251v2_alt
[strling] extracting chromosome:chr15_KI270905v1_alt
[strling] extracting chromosome:chr6_GL000252v2_alt
[strling] extracting chromosome:chr6_GL000253v2_alt
[strling] extracting chromosome:chr6_GL000254v2_alt
[strling] extracting chromosome:chr6_GL000255v2_alt
[strling] extracting chromosome:chr6_GL000256v2_alt
[strling] extracting unampped reads
[strling] writing binary file:HuRef.bin
[strling] finished extraction

I am running it on a BAM file aligned to hg38.

Exclude NaN rows in STRs.tsv output

These exist because the locus was not called in that sample. Should not be reported by default.

0-length intervals

0-length intervals occur in output and are invalid bed format - should be 1bp intervals
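A small sketch of the requested behavior, in Python for illustration rather than STRling's Nim. The function name is hypothetical, and extending the stop coordinate (rather than moving the start) is an assumption.

```python
def as_valid_bed_interval(start: int, stop: int):
    """Widen a zero-length interval to 1 bp; leave other valid intervals unchanged."""
    if stop < start:
        raise ValueError(f"stop ({stop}) is before start ({start})")
    if stop == start:
        # BED is 0-based and half-open, so start == stop describes no bases.
        return start, start + 1
    return start, stop


fixed = as_valid_bed_interval(1350049, 1350049)
print(fixed)
```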

For cases where we are uncertain about the bounds, look at the reference genome sequence

E.g. fewer than 4 soft-clipped reads on each side

Full diploid genotypes for 2 short alleles

Current behavior is to report "nan" for allele2 if it is short. It would be nice to have an estimated allele size for both allele1 and allele2, even when both alleles are short.

STRling version number is in two places

strling.nimble and src/strpkg/version.nim
This is a workaround to fix this error: https://travis-ci.com/hdashnow/STRling/builds/140315052
But means version needs to be updated in two places each time, so would be nice to fix.

Parsing issue with outliers.py

I am having a problem with the controls file in the outliers.py
Traceback (most recent call last):
File "/projects/b1073/pipelines/genomes/jrg_sandbox/str_callers/outliers.py", line 448, in <module>
main()
File "/projects/b1073/pipelines/genomes/jrg_sandbox/str_callers/outliers.py", line 289, in main
control_estimates = parse_controls(control_file)
File "/projects/b1073/pipelines/genomes/jrg_sandbox/str_callers/outliers.py", line 99, in parse_controls
control_estimates = pd.read_csv(control_file, index_col=0, delim_whitespace = True, header = None)
File "/home/jrg4257/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/jrg4257/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 463, in _read
data = parser.read(nrows)
File "/home/jrg4257/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "/home/jrg4257/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 131145, saw 4
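The ParserError means one row of the controls file has more whitespace-separated fields than the parser expected. A quick way to locate such rows before handing the file to pandas is sketched below. This is a standalone helper, not part of outliers.py, and the example rows are invented.

```python
def lines_with_unexpected_fields(lines, expected=3):
    """Return (line_number, field_count) for rows whose field count differs from expected."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        # Split on any run of whitespace, as delim_whitespace=True does.
        n_fields = len(line.split())
        if n_fields != expected:
            bad.append((number, n_fields))
    return bad


# Invented rows: the second one has a stray fourth field.
sample = [
    "chr1-100-AG 0.5 0.1",
    "chr2-200-CAG 0.7 0.2 extra",
    "chr3-300-A 0.1 0.3",
]
bad_rows = lines_with_unexpected_fields(sample)
print(bad_rows)
```

With a real file, pass the open file handle (for example open(control_file)) instead of the list; the reported line numbers can then be compared with the one in the error (131145 here).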

Add a sample input file

Include a small bam file (simulation?) for a quick test run to let users check their installation is working correctly.
Thanks for the suggestion @breons

Typo in command line help

STRling/src/strpkg/call.nim

Line 57 in c727509

    
           option("-l", "--loci", help="Annoated bed file specifying additional STR loci to genotype. Format is: chr start stop repeatunit [name]")

Output VCF

Use this:
https://github.com/mflevine/hts-nim-sugar

L.start <= R.start

Hey,
I get the following error for one of my samples in one of my projects

strling version: 0.5.1
using first reads in fragment_length_distribution calculation as there were not enough
Read 67332 bounds from results/strling/merge/tremor-ataxia-bounds.txt
fatal.nim(49)            sysFatal
Error: unhandled exception: collect.nim(37, 12) `L.start <= R.start`  [AssertionDefect]

I am happy to help fix this by providing any information you need!

Typo and bug in the version warning?

[strling] read format version 0 from software version 0.0.2
[strling] WARNING: this bin file was generated by a different vertsion of STRling: 0. Current version is: 0.1.0.

This bit looks wrong: vertsion of STRling: 0

Issue with low proportion_repeat

Hello, I'm getting odd errors when using lower --proportion-repeat in strling extract. For example, when running this command:

strling extract -f GRCh38_full_analysis_set_plus_decoy_hla.fa -p 0.4 NA06989.final.cram test.bin

I get this error:
fatal.nim(49) sysFatal
Error: unhandled exception: genome_strs.nim(39, 12) w.start < w.stop repeat ACCTTT not found in expected region for (chrom: "chr2", start: 71388200, stop: 71388500, repeat: "ACCTTT"), TGAAGGCAGCTAAATTCTCTTACCCTGAGGCTAAGGGCAAGTAGTAGGTAACAAAGGAGTGTAAAGGAATTTATCTAGATAAGTTTATTTACTTTTGCCGACCTTTGATCATCCGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCTGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCCGCGTGCAGGACTGCTCCCTACAGGCGGGGGCAACAACTACCCACAGATTGTGTTGGCTCCAGGCCTTTGTCATTAAATCTGTACTAAATAAATACA, (chrom: "chr2", start: 71388500, stop: 71388500, repeat: "ACCTTT") [AssertionDefect]

strling version: 0.5.1
fatal.nim(49) sysFatal
Error: unhandled exception: unpack.nim(59, 12) fs != nil [strling] got nil fileStream in unpack_file. check given file-path [AssertionDefect]

I am using a 1000 Genomes sample downloaded from ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239459/NA06989.final.cram

Thanks!

small test dataset

I was wondering if there is a small test dataset available to test the software and make sure it is installed and working properly. I see previous issues about this that were resolved, but I can't seem to find the data.

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Hey,
I get the following error for one of my samples in one of my projects

strling version: 0.5.1
using first reads in fragment_length_distribution calculation as there were not enough
Read 67332 bounds from results/strling/merge/tremor-ataxia-bounds.txt
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

I am happy to help fix this by providing any information you need!

Checksum mismatch in cram decode

Hi STRling team,
I'm trying to run STRling on some cram files, and I get the following error (this is the last part of the log):

[strling] found 3438 STR-like regions in the genome
[strling] got STR repeats from genome into an interval tree
[strling] collecting str-like reads
[strling] extracting chromosome:chr1
[E::cram_decode_slice] MD5 checksum reference mismatch for ref 0 pos 248714585..248753335
[E::cram_decode_slice] CRAM: 75e67a2b43990fd5c419b4180857f756
[E::cram_decode_slice] Ref : 91fd29daa2e0a9ab4422bfed5a28e7e5
[E::cram_next_slice] Failure to decode slice
bam.nim(439)             extract_main
Error: unhandled exception: hts/bam:error in iteration [ValueError]-

and this is the command that I run:

strling extract  \
    sample.cram \
    outs/sample/out \
    -f $REF_GENOME

I checked that I'm using the correct reference genome (samtools view cram -T ref worked fine).
Please let me know if I've missed something in my run.
Thanks!
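The log shows two digests for the same reference span (one stored in the CRAM, one computed from the supplied FASTA), so one way to investigate is to compute the digest of that region by hand and see which one it matches. The sketch below assumes the same normalization used for the SAM M5 tag (whitespace removed, bases uppercased) and assumes the region's bases have already been extracted, for example with samtools faidx. The function name and example string are illustrative.

```python
import hashlib


def reference_md5(sequence: str) -> str:
    """MD5 hex digest of a reference sequence after removing whitespace and uppercasing."""
    # FASTA output is line-wrapped and may be soft-masked; normalize both.
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# Illustrative only: a wrapped, soft-masked fragment and its flat uppercase form.
wrapped = "acgt\nACGT\n"
digest = reference_md5(wrapped)
print(digest)
```

If the digest of chr1:248714585-248753335 from $REF_GENOME matches the "Ref" value but not the "CRAM" value, the file was probably written against a different build or patch of the reference.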

STRling warning in [Joint call str loci across all samples]

Hello,

I ran the command below for joint calling, which merges several bin files:

cat ../AD_WGS_batch1-7_STRling.txt | xargs -L 2000 strling merge -f ../../resources/chm13v2.0.fa --output-prefix ~/WGS/AD_STR/outputs/joint_bin/ > ~/WGS/AD_STR/outputs/joint_bin/str_joint_log.txt 2>&1

The command finished with the log messages below.

More than 65535 reads in cluster with first read:(tid: 3, position: 169960672, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2,DUP, split: right, mapping_quality: 47, repeat_count: 70, align_length: 70, qname: "20") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 10263, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,MREVERSE,READ2, split: none, mapping_quality: 43, repeat_count: 150, align_length: 150, qname: "428") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 15757, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,MREVERSE,READ2, split: none, mapping_quality: 60, repeat_count: 150, align_length: 150, qname: "113") skipping
More than 65535 reads in cluster with first read:(tid: 1, position: 181396416, repeat: ['A', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none_right, mapping_quality: 0, repeat_count: 124, align_length: 150, qname: "49") skipping
More than 65535 reads in cluster with first read:(tid: 3, position: 169960503, repeat: ['A', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,PROPER_PAIR,MREVERSE,READ1, split: none, mapping_quality: 54, repeat_count: 128, align_length: 150, qname: "1138") skipping
More than 65535 reads in cluster with first read:(tid: 22, position: 149830784, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,PROPER_PAIR,REVERSE,READ1, split: none, mapping_quality: 60, repeat_count: 142, align_length: 150, qname: "754") skipping
More than 65535 reads in cluster with first read:(tid: 6, position: 86499722, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,MREVERSE,READ2, split: none, mapping_quality: 57, repeat_count: 144, align_length: 150, qname: "233") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 15398, repeat: ['C', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none, mapping_quality: 60, repeat_count: 150, align_length: 150, qname: "97") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 15965, repeat: ['C', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none, mapping_quality: 60, repeat_count: 150, align_length: 150, qname: "71") skipping
More than 65535 reads in cluster with first read:(tid: 6, position: 86500001, repeat: ['C', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none, mapping_quality: 60, repeat_count: 122, align_length: 150, qname: "1715") skipping

Is it okay to proceed despite these "skipping" warnings, or should I try other options to resolve them?

I ran STRling on 1,824 samples.

Thank you for providing a great tool.

Best regards,
Chan

[log]

strling version: 0.5.2
[strling] read 815645 STR reads from file: WGS_0001.bin
[strling] read 501777 STR reads from file: WGS_0002.bin
...
[strling] read 102666 STR reads from file: WGS_1836.bin
[strling] read 113928 STR reads from file: WGS_1837.bin
[strling] read 123117 STR reads from file: WGS_1838.bin
More than 65535 reads in cluster with first read:(tid: 3, position: 169960672, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2,DUP, split: right, mapping_quality: 47, repeat_count: 70, align_length: 70, qname: "20") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 10263, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,MREVERSE,READ2, split: none, mapping_quality: 43, repeat_count: 150, align_length: 150, qname: "428") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 15757, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,MREVERSE,READ2, split: none, mapping_quality: 60, repeat_count: 150, align_length: 150, qname: "113") skipping
More than 65535 reads in cluster with first read:(tid: 1, position: 181396416, repeat: ['A', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none_right, mapping_quality: 0, repeat_count: 124, align_length: 150, qname: "49") skipping
More than 65535 reads in cluster with first read:(tid: 3, position: 169960503, repeat: ['A', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,PROPER_PAIR,MREVERSE,READ1, split: none, mapping_quality: 54, repeat_count: 128, align_length: 150, qname: "1138") skipping
More than 65535 reads in cluster with first read:(tid: 22, position: 149830784, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,PROPER_PAIR,REVERSE,READ1, split: none, mapping_quality: 60, repeat_count: 142, align_length: 150, qname: "754") skipping
More than 65535 reads in cluster with first read:(tid: 6, position: 86499722, repeat: ['G', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,MREVERSE,READ2, split: none, mapping_quality: 57, repeat_count: 144, align_length: 150, qname: "233") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 15398, repeat: ['C', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none, mapping_quality: 60, repeat_count: 150, align_length: 150, qname: "97") skipping
More than 65535 reads in cluster with first read:(tid: 24, position: 15965, repeat: ['C', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none, mapping_quality: 60, repeat_count: 150, align_length: 150, qname: "71") skipping
More than 65535 reads in cluster with first read:(tid: 6, position: 86500001, repeat: ['C', '\x00', '\x00', '\x00', '\x00', '\x00'], flag: PAIRED,READ2, split: none, mapping_quality: 60, repeat_count: 122, align_length: 150, qname: "1715") skipping
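
To see which clusters were skipped, the warning lines can be pulled out of the log and summarised. The following is a minimal sketch, not part of STRling: the regular expression is inferred only from the warning lines quoted above, and the names `SKIP_RE` and `parse_skipped_clusters` are hypothetical. It treats `tid` as an opaque integer and strips the `'\x00'` padding from the printed repeat array.

```python
import re

# Pattern inferred from the warning lines quoted in this issue (an assumption,
# not taken from the STRling source code).
SKIP_RE = re.compile(
    r"More than 65535 reads in cluster with first read:\(tid: (?P<tid>\d+), "
    r"position: (?P<position>\d+), repeat: \[(?P<repeat>[^\]]*)\]"
)


def parse_skipped_clusters(lines):
    """Return a list of (tid, position, repeat_unit) for each skip warning."""
    skipped = []
    for line in lines:
        match = SKIP_RE.search(line)
        if match is None:
            continue  # not a skip warning, e.g. a "[strling] read ..." line
        # The repeat is printed as a fixed-size char array padded with '\x00'
        # (the literal four characters backslash, x, 0, 0 in the log text).
        chars = re.findall(r"'([^']*)'", match.group("repeat"))
        unit = "".join(c for c in chars if c != r"\x00")
        skipped.append(
            (int(match.group("tid")), int(match.group("position")), unit)
        )
    return skipped
```

Applied to the log file written by the command above (for example via `parse_skipped_clusters(open(path))`), this gives one tuple per skipped cluster. In the warnings quoted here, every skipped cluster has a single-nucleotide repeat unit (G, A, or C).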

Off-by-one errors when detecting whether a read spans a locus?

See CANVAS genomes. Unexpected spanning reads/pairs.
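
As an illustration of where such an off-by-one error can creep in, here is a minimal sketch of a spanning check. It is not STRling's implementation: the function name `read_spans_locus`, the `min_flank` parameter, and the choice of 0-based half-open coordinates are all assumptions made for the example.

```python
def read_spans_locus(read_start, read_end, locus_start, locus_end, min_flank=1):
    """Return True if the read covers the whole locus plus flanking bases.

    All coordinates are assumed to be 0-based and half-open: [start, end).
    The read must extend at least `min_flank` bases past each side of the locus.
    """
    covers_left = read_start <= locus_start - min_flank
    covers_right = read_end >= locus_end + min_flank
    return covers_left and covers_right


# Locus occupying bases 100..109, written half-open as [100, 110).
assert read_spans_locus(99, 111, 100, 110)       # one flanking base on each side
assert not read_spans_locus(100, 111, 100, 110)  # starts exactly at the locus
assert not read_spans_locus(99, 110, 100, 110)   # ends exactly at the locus
```

If the locus end were instead passed as a 1-based inclusive or 0-based inclusive coordinate without conversion, the last case could be counted as spanning by mistake, which is one way unexpected spanning reads might arise.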

Combine equivalent STR repeat units for unplaced reads?

For example, CAG and CTG are equivalent repeat units (they are reverse complements of each other). Because these reads are unplaced, we don't know their orientation.

==> 98.4-3076603_CAG_0_184_L001_R1-unplaced.txt <==
CAG	34
CTG	2

I'm not sure yet whether this matters, or whether it would be simplest to combine them downstream when using this data.
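
If the combining is done downstream, one possible approach is to map every repeat unit to a canonical form and sum the counts. The sketch below is an assumption about how this could be done, not STRling's behaviour: `canonical_unit` and `combine_counts` are hypothetical names, and the canonical form chosen here (the alphabetically smallest string among all rotations of the unit and of its reverse complement) goes slightly beyond the issue by also merging rotations such as CAG and AGC.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(unit):
    """Reverse complement of a DNA repeat unit, e.g. CAG -> CTG."""
    return unit.translate(_COMPLEMENT)[::-1]


def canonical_unit(unit):
    """Smallest string among all rotations of the unit on either strand."""
    candidates = []
    for seq in (unit, reverse_complement(unit)):
        candidates.extend(seq[i:] + seq[:i] for i in range(len(seq)))
    return min(candidates)


def combine_counts(counts):
    """Sum read counts for repeat units that share a canonical form."""
    combined = Counter()
    for unit, n in counts.items():
        combined[canonical_unit(unit)] += n
    return dict(combined)


# Counts from the unplaced-reads file shown above.
print(combine_counts({"CAG": 34, "CTG": 2}))  # {'AGC': 36}
```

The key `AGC` is only a label for the equivalence class; any fixed representative (for example the unit as STRling reports it) would work equally well as long as it is chosen consistently.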
