
tripal_dev_seed's People

Contributors

almasaeed2010, bradfordcondon, jwest60, lupercal2, mestato

Forkers

lupercal2 grlazo

tripal_dev_seed's Issues

Provide a shell script for creating mock dataset

Ideally, README lists pre-requisites (blast installed, blast annotation DB installed...), and it takes an input CDS fasta sequence and an -n parameter. The output is all data that can be generated via shell.

completed

  • minify dataset
    • two steps: minify input mrna fasta and corresponding GFF (1), then minify peptide (2)
  • create biomaterials
  • create expression data
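
A minimal sketch of the first half of the minify step, assuming "minify" means keeping the first n records of the input FASTA. The function name is hypothetical and this is not the repository's script; matching the GFF and peptide files to the kept records is left out because that relationship varies between datasets.

```python
def first_n_fasta_records(lines, n):
    """Return the lines that belong to the first n FASTA records."""
    kept = []
    seen = 0  # number of header lines encountered so far
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        if seen:  # skip anything that precedes the first header
            kept.append(line)
    return kept


records = [">a", "ACGT", ">b", "GGCC", ">c", "TTAA"]
print(first_n_fasta_records(records, 2))  # ['>a', 'ACGT', '>b', 'GGCC']
```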

to do

GFF loader requires landmarks to be loaded prior to execution

Exception: The landmark 'Contig0' cannot be found for this organism (Fraxinus excelsior) Please add the landmark and then retry the import of this GFF3 file

I can make a landmark generator script pretty easily.
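
A rough sketch of what such a generator could look like, assuming the landmarks are the seqids in column 1 of the GFF3 file. Filling each record with N up to the largest end coordinate is an assumption made here so that no record is left empty; real scaffold sequences from the assembly would be preferable. Both function names are hypothetical.

```python
def landmark_lengths(gff_lines):
    """Map each seqid (GFF3 column 1) to the largest end coordinate seen."""
    lengths = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a feature line
        seqid, end = cols[0], int(cols[4])
        lengths[seqid] = max(lengths.get(seqid, 0), end)
    return lengths


def landmark_fasta(gff_lines, width=60):
    """Build FASTA text with one N-filled placeholder record per landmark."""
    out = []
    for seqid, length in landmark_lengths(gff_lines).items():
        out.append(">" + seqid)
        residues = "N" * length  # placeholder residues, an assumption
        out.extend(residues[i:i + width] for i in range(0, length, width))
    return "".join(line + "\n" for line in out)
```

Writing the result of `landmark_fasta(open("genes.gff"))` to a file (the file name is hypothetical) would give a FASTA that can be loaded before the GFF.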

mRNA entities give "negative substring length not allowed" error

What does this mean?!

PDOException: SQLSTATE[22011]: Substring error: 7 ERROR: negative substring length not allowed: SELECT featureloc_id, srcname, srcfeature_id, strand, srctypename, typename, fmin, fmax, upstream, downstream, adjfmin, adjfmax, substring(residues from (cast(adjfmin as int4) + 1) for cast((upstream + (fmax - fmin) + downstream) as int4)) as residues, genus, species FROM ( SELECT FL.featureloc_id, OF.name srcname, FL.srcfeature_id, FL.strand, OCVT.name as srctypename, SCVT.name as typename, FL.fmin, FL.fmax, OO.genus, OO.species, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmin - :upstream <= 0 THEN 0 ELSE FL.fmin - :upstream END WHEN FL.strand < 0 THEN CASE WHEN FL.fmin - :downstream <= 0 THEN 0 ELSE FL.fmin - :downstream END END as adjfmin, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmax + :downstream > OF.seqlen THEN OF.seqlen ELSE FL.fmax + :downstream END WHEN FL.strand < 0 THEN CASE WHEN FL.fmax + :upstream > OF.seqlen THEN OF.seqlen ELSE FL.fmax + :upstream END END as adjfmax, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmin - :upstream <= 0 THEN FL.fmin ELSE :upstream END ELSE CASE WHEN FL.fmax + :upstream > OF.seqlen THEN OF.seqlen - FL.fmax ELSE :upstream END END as upstream, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmax + :downstream > OF.seqlen THEN OF.seqlen - FL.fmax ELSE :downstream END ELSE CASE WHEN FL.fmin - :downstream <= 0 THEN FL.fmin ELSE :downstream END END as downstream, OF.residues FROM chado.featureloc FL INNER JOIN chado.feature SF on FL.feature_id = SF.feature_id INNER JOIN chado.cvterm SCVT on SF.type_id = SCVT.cvterm_id INNER JOIN chado.feature OF on FL.srcfeature_id = OF.feature_id INNER JOIN chado.cvterm OCVT on OF.type_id = OCVT.cvterm_id INNER JOIN chado.organism OO on OF.organism_id = OO.organism_id WHERE SF.feature_id = :feature_id and NOT (OF.residues = '' or OF.residues IS NULL)) as tbl1 ; Array ( [:upstream] => 0 [:downstream] => 0 [:feature_id] => 2316 ) in chado_query() (line 1704 of 
/Users/Almsaeed/Work/DevSites/Tripal/sites/all/modules/tripal/tripal_chado/api/tripal_chado.query.api.inc).
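
One possible reading, not confirmed in this thread: PostgreSQL raises this error when the length argument of substring() is negative, and the length in the query is `upstream + (fmax - fmin) + downstream`. The sketch below mirrors that arithmetic in Python with hypothetical coordinates. It suggests the length goes negative when the source feature's seqlen is smaller than the feature's fmin, which could happen if a landmark was loaded with a wrong or placeholder sequence.

```python
def substring_length(fmin, fmax, seqlen, strand=1, upstream=0, downstream=0):
    """Mirror the length expression passed to substring() in the query above."""
    if strand >= 0:
        up = fmin if fmin - upstream <= 0 else upstream
        down = seqlen - fmax if fmax + downstream > seqlen else downstream
    else:
        up = seqlen - fmax if fmax + upstream > seqlen else upstream
        down = fmin if fmin - downstream <= 0 else downstream
    return up + (fmax - fmin) + down


# Hypothetical feature at 100..400 on a landmark of 1000 residues.
print(substring_length(100, 400, seqlen=1000))  # 300
# Same feature on a landmark that is only 50 residues long.
print(substring_length(100, 400, seqlen=50))    # -50
```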

New datasets

I probably want

  • A second organism to load
  • natural diversity data (GWAS, SNPs)

error in tree shell script

./annotate.sh \
out/sequences/mrna_mini.fasta \
out/sequences/polypeptide_mini.fasta \
/staton/libraries/uniprot/ \
Fexcel


./annotate.sh: line 77: out/tree/clustal.out/sequences/mrna_mini.fasta.clustal: No such file or directory
mv: rename out/sequences/mrna_mini.fasta.tree to out/tree/mrna_mini.fasta.tree: No such file or directory
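
The failing path `out/tree/clustal.out/sequences/mrna_mini.fasta.clustal` looks as if the whole input path, directories included, was spliced into the output file name. That is a guess from the error message alone, since line 77 of annotate.sh is not shown here. The sketch below (Python rather than shell, with a hypothetical naming template) shows the difference between using the full path and using only the file name, which is what `basename` would give in the shell script.

```python
from pathlib import PurePosixPath


def naive_output(out_dir, input_path):
    """Hypothetical template that embeds the full input path."""
    return f"{out_dir}/clustal.{input_path}.clustal"


def basename_output(out_dir, input_path):
    """Same template, using only the final path component."""
    name = PurePosixPath(input_path).name
    return f"{out_dir}/clustal.{name}.clustal"


src = "out/sequences/mrna_mini.fasta"
print(naive_output("out/tree", src))     # out/tree/clustal.out/sequences/mrna_mini.fasta.clustal
print(basename_output("out/tree", src))  # out/tree/clustal.mrna_mini.fasta.clustal
```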

DOI with release

publish release when ready. Zenodo integration already added.

Provide software versions for all tools

  • page 5, line 29: InterProScan 5.4 was used to produce functional annotation. It would have been better to use a recent version of this software (5.29-68.0 at the time of writing) to fit with data produced on recent projects. The versions and settings of the other tools used (BLAST, MAFFT, ...) should also be specified in the same paragraph.

Germplasm and Genotype data

I've been enjoying Tripal DevSeed as a quick way to get a testing environment up but need data to test germplasm and genotype-based functionality... I am willing to contribute to this project to see such data added if there is interest :-)

GFF

I would like a GFF of just these genes.

This would also mean including the SCAFFOLDS which can be loaded.

fasta loader requires regexp

  • Do other loaders allow type specification without regexp? If so, fix FASTA importer
  • If not, modify devseed instructions with working regexp ie (>(.*?))

empty landmark files end up with weird sequences in db

the landmark file:

>Contig10036
>Contig1
>Contig0
>Contig100
>Contig10022
>Contig10023
>Contig10035
>Contig1001
>Contig10012
>Contig1002
>Contig10026
>Contig10018
>Contig1003
>Contig10030
>Contig10
>Contig10011
>Contig10005
>Contig10002
>Contig1000
>Contig10000
>Contig10001

Here's an entry in the db:

 6d9693ada0fcaab94fe1170c9085c065        598     f       f       2019-01-03 13:06:22.549905      2019-01-03 13:06:22.549905
1               1       Contig10036     Contig10036     >Contig1>Contig0>Contig100>Contig10022>Contig10023>Contig10035>Contig1001>Contig10012>Contig1002>Contig10026>Contig10018>Contig1003>Contig10030>Contig10>Contig10011>Contig10005>Contig10002>Contig1000>Contig10000>Contig10001 223     9bd8839b32b0ed413e160b2cff283c79        598     f
       f       2019-01-03 13:06:22.512179      2019-01-03 13:06:22.512179
16              1       Contig10011     Contig10011     >Contig10005>Contig10002>Contig1000>Contig10000>Contig10001     59      97f18e0396bd49965a2a993354920224        598     f       f       2019-01-03 13:06:22.62867       2019-01-03 13:06:22.62867
14              1       Contig10030     Contig10030     >Contig10>Contig10011>Contig10005>Contig10002>Contig1000>Contig10000>Contig10001        80      6c0f5e65211364d10c6b2e9c7644efa0        598     f       f       2019-01-03 13:06:22.616003      2019-01-03 13:06:22.616003
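
In the rows above, the residues column of each landmark holds the header lines that follow it in the file, which suggests the importer read them as sequence because no residue lines separated the records. A small pre-flight check, sketched here with a hypothetical function name, would list header-only records before anything is loaded:

```python
def empty_fasta_records(lines):
    """Return the headers of FASTA records that have no residue lines."""
    empty = []
    header = None
    has_residues = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None and not has_residues:
                empty.append(header)
            header, has_residues = line[1:], False
        elif line and header is not None:
            has_residues = True
    if header is not None and not has_residues:
        empty.append(header)  # the last record has no following header
    return empty


print(empty_fasta_records([">Contig1", ">Contig0", "ACGT", ">Contig100"]))
# ['Contig1', 'Contig100']
```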

provide up to date SQL dump, with and without ontologies, of data in this set

Want to pair with TripalDock to allow fast launching of functional, seeded dev sites.

I don't know how best to handle this as far as a smart build goes. Loading the ontologies is very slow, so perhaps we keep two base images: one with ontologies, one without. Then we load everything in, push it to Docker Hub tagged as :with_data, and create the dump.
The problem is that when/if the CVs change, we'll have to reload all the data, which is a pain.

split trimming and annotating scripts

The trim step is just too variable because it's too hard to predict how the mRNA, protein, and GFF files will relate. I'll include a couple of examples and the scripts, but I know they won't work in all cases.

The annotations and loaders, on the other hand, are general.

Viewing BLAST Results - Step Missing.

Issue

The steps explaining how to view the BLAST results with the mRNA content are missing a step: checking for new fields in "manage fields". The screenshots show the section referenced and the same page once the new fields have been added.
slide 78
slide 78 fixed

Trembl blast fails for structural errors

Loading the Trembl XML fails with the following error.

Running 'Chado BLAST XML results loader' importer
NOTE: Loading of file is performed using a database transaction. 
If it fails or is terminated prematurely then all insertions and 
updates are rolled back and will not be found in the database

XMLReader::read():                                                   [warning]
/Users/Almsaeed/Work/DevSites/Tripal/sites/default/files/tripal/users/1/Fexcel.TREMBL.xml:42637:
parser error : Extra content at the end of the document
BlastImporter.inc:424
XMLReader::read(): </Hit> BlastImporter.inc:424                      [warning]
XMLReader::read():       ^ BlastImporter.inc:424                     [warning]
XMLReader::readInnerXml():                                           [warning]
/Users/Almsaeed/Work/DevSites/Tripal/sites/default/files/tripal/users/1/Fexcel.TREMBL.xml:42637:
parser error : Extra content at the end of the document
BlastImporter.inc:577
XMLReader::readInnerXml(): </Hit> BlastImporter.inc:577              [warning]
XMLReader::readInnerXml():       ^ BlastImporter.inc:577             [warning]
Percent complete: 100.00 %. Memory: 38,355,856 bytes.
Done.


Done.

Remapping Chado Controlled vocabularies to Tripal Terms...
Done.

Swiss-Prot however succeeds!
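
"Extra content at the end of the document" is the parser reporting that something follows the closing root tag, so the file itself is probably not well-formed (one possibility is two reports written into the same file, but that is not confirmed here). A quick check outside Tripal, sketched with Python's standard library and a hypothetical function name, reports where parsing first fails:

```python
import xml.etree.ElementTree as ET


def first_xml_error(source):
    """Return (line, column, message) for the first parse error, or None.

    `source` may be a file path or a binary file object.
    """
    try:
        for _event, elem in ET.iterparse(source):
            elem.clear()  # discard parsed elements to limit memory use
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None
```

Calling `first_xml_error("Fexcel.TREMBL.xml")` on the failing file should point at the same region as the importer warning (line 42637 in the log above) if the problem is in the file rather than in the loader.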

biosamples not recognized by importer

the name comes from:
$biomaterial_names[] = $xml->BioSample[$i]->Ids->Id[1]; but results in null.

Looks like it expects Ids not IDs.

 ["IDs"]=>
  string(8) "continue"
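
XML element names are case-sensitive, so a lookup for `Ids` finds nothing when the file uses `IDs`. The importer itself is PHP; the sketch below only illustrates the idea in Python, with a hypothetical function name and made-up sample XML, by accepting either spelling and returning None when the second Id is absent.

```python
import xml.etree.ElementTree as ET


def biosample_names(xml_text, id_index=1):
    """Return the Id at id_index for each BioSample, or None when absent."""
    root = ET.fromstring(xml_text)
    names = []
    for sample in root.iter("BioSample"):
        container = sample.find("Ids")
        if container is None:
            container = sample.find("IDs")  # tolerate the other spelling
        ids = [] if container is None else container.findall("Id")
        names.append(ids[id_index].text if len(ids) > id_index else None)
    return names
```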

missing image: ![](img/ips/ipsdoc_2.png)

Hi Joe: please note the [IPS documentation](https://github.com/statonlab/tripal_dev_mini_dataset/blob/master/documentation/loading_IPS.md) has a missing link: ![](img/ips/ipsdoc_2.png)

jbrowse tracks

For tripal_apollo I'd like JBrowse tracks to easily load for CI testing.

readthedocs doesn't format tables right?

https://tripal-devseed.readthedocs.io/en/latest/loading_GFF.html

displays as below

| column | ID | explanation | example value | |——–|————|———————————————————————————————————————————————–|——————————————————————————–| | 1 | seqid | Name of the landmark chromosome or scaffold (not the feature itself). | Contig0 | | 2 | source | Program name, data source, etc | FRAEX38873_v2 | | 3 | type | Sequence ontology term for type_id of feature | gene | | 4 | start | start of the feature. | 16315 | | 5 | end | end of the feature. | 44054 | | 6 | score | Float value or . The score, because the feature was computationally predicted. ignore. | . | | 7 | strand | Can be = or -. Refers to the strand of DNA: ignore | + | | 8 | phase | Can be 0, 1, 2, or . Refers to the open reading frame, you can ignore. | . | | 9 | attributes | This includes the actual name for the feature that will be created (in this case FRAEX38873_v2_000000010). It also includes the Parent= tag. | ID=FRAEX38873_v2_000000010;Name=FRAEX38873_v2_000000010;biotype=protein_coding |

publish entities for seeder?

this might go instead in test suite, but we need to explain that entities still need to be published in the seeder.

error with polypeptide regexp

Note that to load the proteins, you must link them with the regexp (FRA.*?).1

At a glance it looks like the .1 is necessary?

Cannot find a unique feature for the parent 'FRAEX38873_v2_000001980' of type 'mRNA' for the feature.
WD tripal_job: Cannot find a unique feature for the parent           [error]
'FRAEX38873_v2_000001980' of type 'mRNA' for the feature.
[site http://default] [TRIPAL ERROR] [TRIPAL_JOB] Cannot find a unique feature for the parent 'FRAEX38873_v2_000001980' of type 'mRNA' for the feature.
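
On whether the .1 matters: in `(FRA.*?).1` the dot is unescaped, so it matches any character, and the lazy `.*?` stops at the first place where some character is followed by a 1. The sketch below shows the effect with Python's re module on a hypothetical polypeptide name. PCRE treats these constructs the same way, but how the FASTA loader applies the pattern is not verified here.

```python
import re

name = "FRAEX38873_v2_000001980.1"  # hypothetical polypeptide name

# Unescaped dot, no anchors: stops at the first "<any char>1".
loose = re.search(r"(FRA.*?).1", name).group(1)

# Escaped dot and anchors: strips only a literal ".1" suffix.
strict = re.search(r"^(FRA.*?)\.1$", name).group(1)

print(loose)   # FRAEX38873_v2_0000
print(strict)  # FRAEX38873_v2_000001980
```

If the loader behaves like this, the looser pattern would truncate any name that contains a 1 before the suffix, so the escaped and anchored form is the safer one to document.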
