bio-playground's Introduction

Contents

Miscellaneous scripts for bioinformatics that don't merit their own repo. All under the MIT License unless otherwise specified.

Abnormal nucleotide frequencies tend to throw off the usual procedures for estimating evolutionary models. A practical case is calculating Ks values for grass genes, a significant portion of which are high-GC genes (see details here). In high-GC genes, most of the substitutions will be either G or C, so the Jukes-Cantor correction under-estimates the Ks values. The codon models in PAML, on the contrary, tend to over-estimate them. The Ks calculator we want to implement here skips model inference, which is difficult anyway because there are very few sites from which to estimate the model parameters. Instead, we ask: given the biased substitutions and the sequence lengths, can we run simulations and fit an evolutionary model to the simulated results?
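For reference, the Jukes-Cantor correction turns an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). It assumes equal base frequencies, which is the assumption that high-GC genes violate. The sketch below is only an illustration of that formula; the function name is ours and is not taken from the repository.

```python
import math


def jukes_cantor_distance(p):
    """Return the Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        # The correction is undefined once p reaches the saturation value 3/4.
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)
```

Applied to the proportion of differing synonymous sites, this gives a Ks estimate under the Jukes-Cantor assumptions. The corrected distance is always at least as large as p.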

method

bio-playground's People

Contributors

anarkia7115, brentp, brwnj, cviner, jurezmrzlikar, rwanwork, simonohanlon101, tanghaibao

bio-playground's Issues

AttributeError: 'str' object has no attribute 'load'

I am trying to generate IGV screenshots automatically for easier inspection of variants in a vcf file. Upon using the IGV wrapper here, I tried following the commands in the example:

Python 2.7.12 (default, Mar  1 2021, 11:38:31)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import igv
>>> igv=igv.IGV()
>>> igv=igv.genome('hg19')
>>> igv=igv.load('http://www.broadinstitute.org/igvdata/1KG/pilot2Bams/NA12878.SLX.bam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'load'
>>>

Any idea why?
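One possible cause, assuming the wrapper's genome() method returns IGV's text response (for example the string 'OK') rather than the IGV object: the line igv=igv.genome('hg19') rebinds the name igv to that string, so the following igv.load(...) is called on a str. The sketch below reproduces this with a hypothetical stand-in class; FakeIGV is not the real wrapper and does not talk to IGV.

```python
class FakeIGV(object):
    """Hypothetical stand-in: each command returns a response string."""

    def genome(self, name):
        return "OK"

    def load(self, url):
        return "OK"


# Keep the object and the responses in separate variables.
viewer = FakeIGV()
response = viewer.genome("hg19")
response = viewer.load("http://www.broadinstitute.org/igvdata/1KG/pilot2Bams/NA12878.SLX.bam")

# Rebinding the object to the response reproduces the reported error.
broken = FakeIGV()
broken = broken.genome("hg19")  # broken is now the string "OK"
try:
    broken.load("NA12878.SLX.bam")
except AttributeError as err:
    message = str(err)
```

If that assumption holds for the real class, calling igv.genome('hg19') and igv.load(...) without assigning the result back to igv should avoid the error.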

read-utils/fastq

Hi!

If I run:

fastq filter --adjust 33 --unique input.fq > output.fq
it always reports (to stderr) an incorrect total number of records (too high by 1).

Moreover, for the last read, the rl.size() is always increased by 1.
I'm not familiar with C++, but I would appreciate your feedback and help.
I would like to utilize your fastq program to return the list of unique FASTQ records
and to add to the FASTQ header the number of found duplicates.

Example:

input:

FASTQ_header:read01
FASTQ_sequence
+
FASTQ_low_quality
FASTQ_header:read02
FASTQ_sequence
+
FASTQ_high_quality

output

FASTQ_header:read01:2
FASTQ_sequence
+
FASTQ_high_quality

Many thanks in advance,
pepap
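The requested behaviour can be sketched in Python, independently of the C++ fastq program. This is a minimal illustration, assuming 4-line FASTQ records, that duplicates are defined by identical sequence, and that the best quality string is the one with the highest summed quality characters; the function names are ours.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    # Skip blank lines, e.g. a trailing newline at the end of the file.
    stripped = (line.rstrip("\n") for line in lines if line.strip())
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, qual


def unique_with_counts(records):
    """Yield one record per distinct sequence, with the duplicate count appended to the header."""
    best = {}  # sequence -> [first header, best quality string, count]
    for header, seq, qual in records:
        entry = best.get(seq)
        if entry is None:
            best[seq] = [header, qual, 1]
            continue
        entry[2] += 1
        # Same sequence means same length, so comparing sums compares mean quality.
        if sum(map(ord, qual)) > sum(map(ord, entry[1])):
            entry[1] = qual
    for seq, (header, qual, count) in best.items():
        yield "%s:%d" % (header, count), seq, qual
```

This keeps every distinct sequence in memory, so it is only suitable for inputs that fit in RAM.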

guess-encoding.py incorrect instructions

Correct me if I'm wrong, but does the guess-encoding script not process base quality strings? (e.g., bb_eeeeeggggfiiiiiiiiiiiiihhiifhiiiiiihiiiiiiifffc)

The example at the top of the script uses cut -f 5 when the 5th column is the mapping quality, not the base quality string. Shouldn't it be cut -f 11?
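For context, in the SAM format column 5 is MAPQ (mapping quality) and column 11 is QUAL (the per-base quality string). A small Python sketch of the same column selection, using a made-up alignment line and function names of our own:

```python
def mapping_quality(sam_line):
    """Return MAPQ (1-based column 5) of one SAM alignment line."""
    return int(sam_line.rstrip("\n").split("\t")[4])


def base_qualities(sam_line):
    """Return QUAL (1-based column 11) of one SAM alignment line."""
    return sam_line.rstrip("\n").split("\t")[10]


# Made-up alignment line for illustration only.
example = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tbbee\n"
```

Here base_qualities(example) gives the string that an encoding guesser needs, while mapping_quality(example) gives a single integer.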

memory usage

Hi Brent,
I kind of have another issue. I have two lanes of HiSeq, paired-end 2x100bp.
I previously removed duplicates lane by lane. I would concatenate the 2x100bp into 1x200bp (using galaxy's joiner tool), then run fastqClean. To remove duplicates, I run "fastq filter --adjust 64 --unique". Then I split the fake 200bp reads back into 2x100bp reads using galaxy's tool.
I can do this fine on individual lanes. But it would make more sense to remove duplicates overall. The concatenated data from the two lanes make a fastq of 84 gigabytes. I launched your tool (the older version that removes all sequences with an N)... now after 45 minutes I'm up to 212 gigs of memory usage, and increasing. My machine has only 256 gigs of RAM, so I'll probably have to kill it.
So how about a low-mem version :) (even if it runs a lot slower)
cheers,
y
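One way a lower-memory variant could work, sketched in Python rather than in the tool's C++: remember a fixed-size digest of each sequence instead of the sequence itself. This is an assumption about a possible design, not a description of how fastq filter works, and the function name is ours.

```python
import hashlib


def unique_by_digest(records):
    """Yield the first (header, sequence, quality) record seen for each distinct sequence."""
    seen = set()
    for header, seq, qual in records:
        # 16-byte digest stored per distinct sequence instead of the full read.
        digest = hashlib.blake2b(seq.encode("ascii"), digest_size=16).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield header, seq, qual
```

The trade-off is that a digest collision, although extremely unlikely at 128 bits, would drop a read that is not a true duplicate. Because records are emitted as they are first seen, the input order is preserved.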

IGV socket response differs than its stderr

Hi Brent,

Just thought you might want to know about an issue with the IGV socket response (related to your igv python class); see here for more info:

https://groups.google.com/forum/?fromgroups=#!topic/igv-help/uS-a5EFOZC4

Briefly, IGV doesn't send its stderr output in reply to a socket request when an error occurs. It only responds with 'OK' as long as it was able to connect to the port, but this doesn't reflect whether the internal operation (goto, load, sort, etc.) finished successfully. It is a one-way communication.

Do you think capturing the IGV stderr via "subprocess.Popen" is a good idea?

Many thanks,
Saeed
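A rough sketch of what capturing stderr with subprocess.Popen could look like, assuming the wrapper starts IGV itself (the stderr of an already running IGV is not reachable this way). A background thread reads stderr so a long-running process does not block; the child process here is a stand-in Python one-liner, not IGV, and the helper name is ours.

```python
import queue
import subprocess
import sys
import threading


def start_and_watch_stderr(cmd):
    """Start cmd and collect its stderr lines in a queue from a background thread."""
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, universal_newlines=True)
    stderr_lines = queue.Queue()

    def pump():
        for line in proc.stderr:  # ends when the child closes stderr
            stderr_lines.put(line.rstrip("\n"))

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    return proc, stderr_lines, reader


# Stand-in child that writes one line to stderr; replace with the real IGV command.
child = [sys.executable, "-c", "import sys; sys.stderr.write('ERROR: could not load file\\n')"]
proc, stderr_lines, reader = start_and_watch_stderr(child)
proc.wait()
reader.join()
first_error = stderr_lines.get_nowait()
```

Matching a stderr line to the socket command that caused it would still need some convention, for example checking the queue shortly after each command returns 'OK'.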

guess-encoding.py bad output

A recent change to guess-encoding.py makes it output the first two columns in square brackets. Like this:

['Sanger', 'Illumina-1.8']    42    71

Before, it would have been

Sanger     Illumina-1.8    42    71

This seems like a bug. Am I wrong?
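The bracketed form is what Python produces when a list is printed directly, so one guess is that the list of encoding names is no longer joined before printing. A sketch of joining it, assuming tab-separated output; format_row is a hypothetical helper, not a function from the script.

```python
def format_row(encodings, qmin, qmax, sep="\t"):
    """Join the encoding names and the min/max values into one delimited line."""
    return sep.join(list(encodings) + [str(qmin), str(qmax)])


row = format_row(["Sanger", "Illumina-1.8"], 42, 71)
```

By contrast, str(["Sanger", "Illumina-1.8"]) yields the bracketed, quoted text shown in the report.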

URL BAM input

Hi @brentp, I would like to ask whether there is a way to load BAM files from the working directory rather than from URLs (as shown in the example)?

if I may...

./fastq filter --adjust 64 --unique /path/to/your.fasta > unique.fasta

reorders things unnecessarily. Don't know if it would slow things down to keep the same order... (but not a real problem)

thanks!
y

Manhattan Plots - asynchronous calls?

Hi,
have you considered mixing matplotlib and asynchronous calls (the multiprocessing module?) to produce an interactive environment to "explore" the dataset?
I am not interested in biology, btw, just in the coding.
Regards!

bio-playground / reads-utils / select-random-pairs.py

The current solution allows duplication of selected pairs.

I believe rand_records = sorted(random.sample(xrange(records), int(N))) would be an elegant solution.
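The suggestion works because random.sample draws without replacement, so no index can appear twice. A Python 3 sketch of the same idea (range replaces xrange; the function name and the seed parameter are ours):

```python
import random


def pick_record_indices(n_records, n_wanted, seed=None):
    """Return n_wanted distinct record indices in ascending order."""
    rng = random.Random(seed)
    # sample() raises ValueError if n_wanted is larger than n_records.
    return sorted(rng.sample(range(n_records), n_wanted))


indices = pick_record_indices(1000, 50, seed=1)
```

Sorting the indices lets both files of a pair be read in a single forward pass while picking the same record positions.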

Guessing heuristics

It is desirable to narrow down the possible encodings that guess-encoding.py produces, using heuristics beyond checking the min/max.

I have added some to improve the ability to uniquely detect or eliminate Illumina 1.5 format, by considering the unused scores in that scheme and the frequency of the Illumina 1.5 "special" quality score (B).

I would be happy to work to integrate those modifications into this version, if there is interest in my doing so.
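To illustrate the kind of heuristic described, here is a sketch of our own, not the modifications offered above. It relies on Illumina 1.5 using a Phred+64 offset in which scores 0 and 1 (characters '@' and 'A') are unused and score 2 ('B') is the special indicator. The function name and the interpretation of its result are assumptions.

```python
from collections import Counter


def illumina15_evidence(quality_strings):
    """Return (uses_unused_scores, b_fraction) over all quality characters."""
    counts = Counter()
    for qual in quality_strings:
        counts.update(qual)
    total = sum(counts.values())
    # '@' or 'A' cannot occur in Illumina 1.5, so their presence rules it out.
    uses_unused_scores = (counts["@"] + counts["A"]) > 0
    # A high share of 'B' characters is consistent with Illumina 1.5.
    b_fraction = counts["B"] / total if total else 0.0
    return uses_unused_scores, b_fraction
```

A caller could eliminate Illumina 1.5 when the first value is True, and prefer it over Illumina 1.3 when the second value is high; any threshold for "high" would need to be chosen empirically.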
