brentp / bio-playground

Miscellaneous scripts for bioinformatics/genomics that don't merit their own repo.

License: MIT License

JavaScript 26.41% Python 25.91% C 30.63% Shell 0.13% C++ 12.59% HTML 0.13% Mako 0.13% Lua 0.33% Nim 3.57% Makefile 0.17%

bio-playground's Introduction

Contents

Ks Calc

Miscellaneous scripts for bioinformatics that don't merit their own repo. All under the MIT License unless otherwise specified.

Ks Calc

Abnormal nucleotide frequencies tend to throw off the usual procedures for estimating evolutionary models. A practical case is calculating Ks values for grass genes, a significant portion of which are high-GC genes (see details here). In high-GC genes most substitutions are to either G or C, so the Jukes-Cantor correction under-estimates Ks. The codon models in PAML, on the contrary, tend to over-estimate Ks. The Ks calculator we want to implement here skips model inference (which is difficult anyway, since there are very few sites from which to estimate the model parameters). Instead, given the biased substitutions and the sequence lengths, we run simulations and try to fit an evolutionary model based on the simulations.
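For reference, the standard Jukes-Cantor correction mentioned above can be sketched as follows. It assumes equal base frequencies and equal rates for every substitution type, which is exactly the assumption that high-GC genes violate. The function name is illustrative and is not taken from this repository.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites.

    The formula d = -3/4 * ln(1 - 4p/3) is only defined for p < 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Example: 30% of the compared sites differ.
corrected = jukes_cantor_distance(0.30)
```

The corrected distance is always at least as large as the observed proportion, because the formula accounts for multiple substitutions at the same site.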


bio-playground's Issues

in bio-playground/reads-utils/guess-encoding.py, FASTQ range for Sanger should be (33, 73) instead of (33, 93)

According to the Wikipedia article on the FASTQ format (https://en.wikipedia.org/wiki/FASTQ_format), standard Sanger Phred+33 has a lower bound of 33 and an upper bound of 73. Narrowing the range may also speed up the guess.
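A minimal sketch of range-based guessing, to make the discussion concrete. This is not the code in guess-encoding.py: the function name and the table are hypothetical, and the two ranges shown are simply the bounds proposed in the issues on this page.

```python
# Illustrative ranges only (assumption): bounds as proposed in the issues,
# not necessarily those used by guess-encoding.py.
RANGES = {
    "Sanger": (33, 73),
    "Illumina-1.8": (33, 74),
}


def candidate_encodings(quality_strings, ranges=RANGES):
    """Return (names, lowest, highest) for the encodings whose range covers the data.

    Raises ValueError if no quality characters are supplied.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    lowest, highest = min(codes), max(codes)
    names = [
        name
        for name, (range_min, range_max) in ranges.items()
        if range_min <= lowest and highest <= range_max
    ]
    return names, lowest, highest


names, lowest, highest = candidate_encodings(["II5+", "5555"])
```

With a tighter upper bound, a single character above it is enough to rule an encoding out, which is why the exact bounds matter.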

AttributeError: 'str' object has no attribute 'load'

I am trying to generate IGV screenshots automatically for easier inspection of variants in a vcf file. Upon using the IGV wrapper here, I tried following the commands in the example:

Python 2.7.12 (default, Mar  1 2021, 11:38:31)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import igv
>>> igv=igv.IGV()
>>> igv=igv.genome('hg19')
>>> igv=igv.load('http://www.broadinstitute.org/igvdata/1KG/pilot2Bams/NA12878.SLX.bam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'load'
>>>

Any idea why?
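The traceback is consistent with the name igv being rebound: after igv=igv.genome('hg19') the name no longer refers to the IGV object but to whatever genome() returned, which the error message says is a str. The sketch below reproduces this with a hypothetical stand-in class (FakeIGV is not the real wrapper and opens no socket) and shows the usual way around it, keeping the object in one name and the replies in another.

```python
class FakeIGV:
    """Hypothetical stand-in for igv.IGV: each method returns a status string."""

    def genome(self, name):
        return "OK"

    def load(self, url):
        return "OK"


BAM_URL = "http://www.broadinstitute.org/igvdata/1KG/pilot2Bams/NA12878.SLX.bam"

# Reproduce the problem: the object is overwritten by the string reply.
broken = FakeIGV()
broken = broken.genome("hg19")  # broken is now a str
try:
    broken.load(BAM_URL)
except AttributeError as exc:
    error_message = str(exc)

# Keep the object and store the replies separately.
viewer = FakeIGV()
genome_status = viewer.genome("hg19")
load_status = viewer.load(BAM_URL)
```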

read-utils/fastq

Hi!

If I run:

fastq filter --adjust 33 --unique input.fq > output.fq
it always reports (to stderr) an incorrect total number of records (too high by 1).

Moreover, for the last read, rl.size() is always one too large.
I'm not familiar with C++, but I would appreciate your feedback and help.
I would like to utilize your fastq program to return the list of unique FASTQ records
and to add to the FASTQ header the number of found duplicates.

Example:

input:

FASTQ_header:read01
FASTQ_sequence
+
FASTQ_low_quality
FASTQ_header:read02
FASTQ_sequence
+
FASTQ_high_quality

output

FASTQ_header:read01:2
FASTQ_sequence
+
FASTQ_high_quality

Many thanks in advance,
pepap
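The requested behaviour can be sketched in Python as follows. This is not the C++ fastq program: the function names are hypothetical, records are assumed to be exactly four lines each, and "high quality" is assumed to mean the larger sum of quality characters among reads with the same sequence.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    A trailing partial record (for example a final blank line) is ignored,
    so it is never counted as an extra read.
    """
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(record) < 4:
            return
        yield record[0], record[1], record[3]


def collapse_duplicates(records):
    """Keep one record per sequence and append the duplicate count to its header."""
    best = {}  # sequence -> [first header, best quality, count, best score]
    for header, seq, qual in records:
        score = sum(map(ord, qual))
        if seq not in best:
            best[seq] = [header, qual, 1, score]
            continue
        entry = best[seq]
        entry[2] += 1
        if score > entry[3]:
            entry[1], entry[3] = qual, score
    for seq, (header, qual, count, _score) in best.items():
        yield f"{header}:{count}", seq, qual


example = [
    "@read01\n", "ACGT\n", "+\n", "!!!!\n",
    "@read02\n", "ACGT\n", "+\n", "IIII\n",
    "\n",
]
collapsed = list(collapse_duplicates(read_fastq(example)))
```

As in the example above, the header of the first occurrence is kept and the better quality string replaces the worse one. All unique sequences are held in memory, so this sketch does not address very large inputs.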

guess-encoding.py incorrect instructions

Correct me if I'm wrong, but does the guess-encoding script not process base quality strings? (e.g., bb_eeeeeggggfiiiiiiiiiiiiihhiifhiiiiiihiiiiiiifffc)

The example at the top of the script uses cut -f 5 when the 5th column is the mapping quality, not the base quality string. Shouldn't it be cut -f 11?
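For context, in the SAM format column 5 is MAPQ and column 11 is QUAL, the per-base quality string. A small sketch of pulling out column 11 in Python (the function name is illustrative):

```python
def sam_base_qualities(sam_lines):
    """Yield the QUAL field (column 11) of each SAM alignment line."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        # "*" means the quality string was not stored.
        if len(fields) >= 11 and fields[10] != "*":
            yield fields[10]


sam = [
    "@HD\tVN:1.6\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tbb_e\n",
]
quals = list(sam_base_qualities(sam))
```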

link in "list_overlap_p.py"

In this python file, you cite the paper:

http://www.nslij-genetics.org/wli/pub/ieee-embs06.pdf

But this link goes to a sketchy site with advertisements for things like Viagra. I'm guessing that the domain ownership must have changed?

I'm posting as an issue so 1. you're aware and 2. because I'm curious what the actual paper is.

Thank you!

memory usage

Hi Brent,
I kind of have another issue. I have two lanes of HiSeq, paired end 2_100bp.
I previously removed duplicates lane by lane. I would concatenate the 2_100bp into 1_200bp (using galaxy's joiner tool), then run fastqClean. To remove duplicates, I run "fastq filter --adjust 64 --unique ". Then I split the fake 200bp reads back into 2_100bp reads using galaxy's tool.
I can do this fine on individual lanes. But it would make more sense to remove duplicates overall. The concatenated data from two lanes make a fastq of 84 gigabytes. I launched your tool (the older version that removes all sequences with an N)... now, after 45 minutes, I'm up to 212 gigs of memory usage, and it is still increasing. My machine has only 256 gigs of RAM, so I'll probably have to kill it.
So how about a low-mem version :) (even if it runs a lot slower)
cheers,
y
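One common way to trade a little accuracy for memory is to remember a short fixed-size digest of each sequence instead of the sequence itself. The sketch below illustrates the idea in Python; it is not how the fastq tool works, the function name is hypothetical, and the 8-byte digest size is an arbitrary choice. A hash collision would wrongly drop a distinct read, which is rare but possible.

```python
import hashlib


def unique_records(records, digest_size=8):
    """Yield the first record seen for each sequence, in input order.

    Only a digest_size-byte digest is stored per unique sequence.
    """
    seen = set()
    for header, seq, qual in records:
        key = hashlib.blake2b(seq.encode("ascii"), digest_size=digest_size).digest()
        if key in seen:
            continue
        seen.add(key)
        yield header, seq, qual


reads = [
    ("@r1", "ACGT", "IIII"),
    ("@r2", "TTTT", "IIII"),
    ("@r3", "ACGT", "!!!!"),
]
kept = list(unique_records(reads))
```

Because records are emitted as they are first seen, the output keeps the input order.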

IGV socket response differs from its stderr

Hi Brent,

Just thought you might want to know about an issue with the IGV socket response (related to your igv python class); see here for more info:

https://groups.google.com/forum/?fromgroups=#!topic/igv-help/uS-a5EFOZC4

Briefly, IGV doesn't send its stderr output back over the socket when an error occurs. It responds with 'OK' as long as it was able to connect to the port, but this doesn't reflect whether the internal operation (goto, load, sort, etc.) finished successfully. It is a one-way communication.

Do you think capturing the IGV stderr via "subprocess.Popen" is a good idea?

Many thanks,
Saeed
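For readers unfamiliar with the mechanism, the exchange described above can be sketched as a plain TCP round trip: send one command line, read one reply line. This is a simplified illustration, not the igv.py class; the function name is hypothetical, and the default port of 60151 is assumed to match the local IGV configuration.

```python
import socket


def send_igv_command(command, host="127.0.0.1", port=60151, timeout=10.0):
    """Send one newline-terminated command and return the single-line reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((command + "\n").encode("ascii"))
        reply = sock.makefile("r", encoding="ascii").readline().strip()
    return reply
```

Per the report above, the reply only says that the command was received, so a caller that needs to know whether a load or goto really succeeded has to look elsewhere, for example at IGV's own stderr.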

guess-encoding.py bad output

A recent change to guess-encoding.py makes it output the first two columns in square brackets, like this:

['Sanger', 'Illumina-1.8']    42    71

Before, it would have been

Sanger     Illumina-1.8    42    71

This seems like a bug. Am I wrong?
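Output like this typically appears when a Python list is printed directly instead of being joined into a string. A minimal sketch of the difference, assuming tab-separated columns (the separator actually used by the script is not shown here, and the function name is hypothetical):

```python
def format_guess(encodings, qmin, qmax, sep="\t"):
    """Format the encoding names and the min/max scores as one delimited row."""
    return sep.join(list(encodings) + [str(qmin), str(qmax)])


encodings = ["Sanger", "Illumina-1.8"]
bracketed = sep_row = None
bracketed = "\t".join([str(encodings), "42", "71"])  # str(list) keeps the brackets
sep_row = format_guess(encodings, 42, 71)
```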

illumina, 1.8+

This script does not include Illumina 1.8+, which ranges from 33 to 74.

URL BAM input

Hi @brentp, I would like to ask whether there is a way to load BAM files from the working directory rather than from URLs (as shown in the example)?

if I may...

./fastq filter --adjust 64 --unique /path/to/your.fasta > unique.fasta

reorders things unnecessarily. Don't know if it would slow things down to keep the same order... (but not a real problem)

thanks!
y

Qual range for Illumina 1.5+

It seems the range for Illumina 1.5+ should go up to 105, not 104.

See the Wikipedia FASTQ format article.

Porting igv.py to Ruby

I created ruby-igv based on igv.py. Thanks.
https://github.com/kojix2/ruby-igv

Manhattan Plots - asynchronous calls?

Hi,
have you considered mixing matplotlib and asynchronous calls (the multiprocessing module?) to produce an interactive environment to "explore" the dataset?
I am not interested in biology, btw, just in the coding.
Regards!

bio-playground / reads-utils / select-random-pairs.py

Current solution allows duplication of selected pairs

I believe: rand_records = sorted(random.sample(xrange(records), int(N)))

Would be an elegant solution.
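The suggestion relies on random.sample drawing without replacement, so no index can be picked twice. A Python 3 sketch of the same idea (range replaces Python 2's xrange; the function name and the seed parameter are additions for illustration):

```python
import random


def pick_record_indices(total_records, n, seed=None):
    """Return n distinct record indices in ascending order.

    Raises ValueError if n is larger than total_records.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_records), int(n)))


indices = pick_record_indices(1000, 10, seed=1)
```

Sorting the indices lets the input file be read once from start to end while picking out the chosen pairs.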

Guessing heuristics

It is desirable to narrow down the possible encodings that guess-encoding.py produces, using heuristics beyond checking the min/max.

I have added some to improve the ability to uniquely detect or eliminate Illumina 1.5 format, by considering the unused scores in that scheme and the frequency of the Illumina 1.5 "special" quality score (B).

I would be happy to work to integrate those modifications into this version, if there is interest in my doing so.
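As an illustration of the kind of heuristic described, the sketch below assumes Phred+64 data and uses two properties of Illumina 1.5: scores 0 and 1 (characters '@' and 'A') are unused, and score 2 ('B') is a special marker that tends to be frequent. This is not the author's modification; the function name and the threshold are hypothetical.

```python
def looks_like_illumina_15(quality_strings, min_b_fraction=0.01):
    """Heuristic check for Illumina 1.5 given quality strings assumed to be Phred+64."""
    total = 0
    b_count = 0
    for qual in quality_strings:
        for char in qual:
            if char in "@A":
                return False  # scores 0 and 1 do not occur in Illumina 1.5
            total += 1
            if char == "B":
                b_count += 1
    if total == 0:
        return False
    return b_count / total >= min_b_fraction


verdicts = [
    looks_like_illumina_15(["hhhhBBBB"]),
    looks_like_illumina_15(["@hhh"]),
    looks_like_illumina_15(["hhhh"]),
]
```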