statonlab / tripal_dev_seed

A minified bioinformatics dataset for seeding Tripal sites

License: GNU General Public License v3.0

Perl 18.91% Python 70.82% Shell 10.27%

tripal tripal3-compatible dataset bioinformatics tripal-developer-tools chado

tripal_dev_seed's Issues

GFF trimmer creates duplicate genes

The GFF trimmer is very simple, and genes with multiple mRNAs pointing to them are going to get duplicated.

This may or may not be the cause of #66
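
One way to avoid the duplication is to remember which gene IDs have already been written. The sketch below is not the repository's trimmer: `parse_attributes` and `trim_gff` are hypothetical names, and it assumes gene lines come before their mRNA lines and that each mRNA has a single `Parent`.

```python
def parse_attributes(column9):
    """Turn a GFF3 attributes string (column 9) into a dict."""
    fields = (f.split("=", 1) for f in column9.strip().split(";") if "=" in f)
    return {key: value for key, value in fields}


def trim_gff(lines, keep_mrna_ids):
    """Keep the selected mRNAs and their parent genes, writing each gene once.

    Hypothetical helper: assumes genes precede their mRNAs in the file.
    """
    genes = {}             # gene ID -> original gene line
    written_genes = set()  # gene IDs already emitted
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene":
            genes[attrs["ID"]] = line
        elif cols[2] == "mRNA" and attrs["ID"] in keep_mrna_ids:
            parent = attrs.get("Parent")
            # Emit the parent gene only the first time one of its mRNAs is kept.
            if parent in genes and parent not in written_genes:
                kept.append(genes[parent])
                written_genes.add(parent)
            kept.append(line)
    return kept
```

With two kept mRNAs that share one parent, the result contains the gene line once followed by both mRNA lines.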

Update documentation

The documentation here needs to be good enough to publish.

As you load in the full dev dataset, it would be awesome if you could contribute guides to doing it.

In many cases the guides are already written, but they are on the Staton wiki. For example:

https://github.com/mestato/statonlabprivate/wiki/Loading-a-Transcriptome-Geome-Tripal-3

In those cases, you can create a branch for that guide, add it, and make a PR. Screenshots would be great.

tripal_API python automated loading

I think this is a really good idea.

A script that uses tripal python to load in all the data in dev seed.

rerun annotation pipeline for F excelsior with new pipeline

Provide a shell script for creating mock dataset

Ideally, the README lists prerequisites (BLAST installed, BLAST annotation DB installed...), and the script takes an input CDS FASTA file and an -n parameter. The output is all data that can be generated via shell.

completed

- minify dataset
  - two steps: minify input mRNA FASTA and corresponding GFF (1), then minify peptide (2)
- create biomaterials
- create expression data
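
The first minify step can be sketched as "keep the first n FASTA records". This is only an illustration of the idea, written in Python rather than shell; `read_fasta` and `minify_fasta` are hypothetical names and the real script may select records differently.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def minify_fasta(handle, n):
    """Return the first n records; their IDs can then drive the GFF and peptide steps."""
    records = []
    for record in read_fasta(handle):
        if len(records) == n:
            break
        records.append(record)
    return records
```

Because the functions only iterate over lines, an open file object or a plain list of strings both work as `handle`.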

to do

GFF loader requires landmarks to be loaded prior to execution

Exception: The landmark 'Contig0' cannot be found for this organism (Fraxinus excelsior) Please add the landmark and then retry the import of this GFF3 file

I can make a landmark generator script pretty easily.
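
A landmark generator could look roughly like this: collect the distinct names in GFF column 1 and write a FASTA record for each. The function names and the `scaffolds` dict (name to sequence) are assumptions made for the sketch. It refuses to write a header with no sequence.

```python
def gff_seqids(gff_lines):
    """Collect the distinct landmark names (column 1) in order of first appearance."""
    seen = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        seen.setdefault(line.split("\t", 1)[0], None)
    return list(seen)


def landmark_fasta(gff_lines, scaffolds):
    """Build FASTA text for every landmark the GFF refers to.

    `scaffolds` is a hypothetical dict of name -> sequence. A landmark with
    no sequence raises instead of producing a header-only record.
    """
    out = []
    for seqid in gff_seqids(gff_lines):
        if not scaffolds.get(seqid):
            raise KeyError(f"no sequence available for landmark {seqid}")
        out.append(f">{seqid}\n{scaffolds[seqid]}\n")
    return "".join(out)
```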

Two GFF: which to supply (or both?)

https://github.com/statonlab/tripal_dev_seed/tree/master/gff

FexcelsiorCDS.fasta is not in the repo

mrna_mini.fasta is the file to use.

mRNA entities give "negative substring length not allowed" error

What does this mean?!

PDOException: SQLSTATE[22011]: Substring error: 7 ERROR: negative substring length not allowed: SELECT featureloc_id, srcname, srcfeature_id, strand, srctypename, typename, fmin, fmax, upstream, downstream, adjfmin, adjfmax, substring(residues from (cast(adjfmin as int4) + 1) for cast((upstream + (fmax - fmin) + downstream) as int4)) as residues, genus, species FROM ( SELECT FL.featureloc_id, OF.name srcname, FL.srcfeature_id, FL.strand, OCVT.name as srctypename, SCVT.name as typename, FL.fmin, FL.fmax, OO.genus, OO.species, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmin - :upstream <= 0 THEN 0 ELSE FL.fmin - :upstream END WHEN FL.strand < 0 THEN CASE WHEN FL.fmin - :downstream <= 0 THEN 0 ELSE FL.fmin - :downstream END END as adjfmin, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmax + :downstream > OF.seqlen THEN OF.seqlen ELSE FL.fmax + :downstream END WHEN FL.strand < 0 THEN CASE WHEN FL.fmax + :upstream > OF.seqlen THEN OF.seqlen ELSE FL.fmax + :upstream END END as adjfmax, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmin - :upstream <= 0 THEN FL.fmin ELSE :upstream END ELSE CASE WHEN FL.fmax + :upstream > OF.seqlen THEN OF.seqlen - FL.fmax ELSE :upstream END END as upstream, CASE WHEN FL.strand >= 0 THEN CASE WHEN FL.fmax + :downstream > OF.seqlen THEN OF.seqlen - FL.fmax ELSE :downstream END ELSE CASE WHEN FL.fmin - :downstream <= 0 THEN FL.fmin ELSE :downstream END END as downstream, OF.residues FROM chado.featureloc FL INNER JOIN chado.feature SF on FL.feature_id = SF.feature_id INNER JOIN chado.cvterm SCVT on SF.type_id = SCVT.cvterm_id INNER JOIN chado.feature OF on FL.srcfeature_id = OF.feature_id INNER JOIN chado.cvterm OCVT on OF.type_id = OCVT.cvterm_id INNER JOIN chado.organism OO on OF.organism_id = OO.organism_id WHERE SF.feature_id = :feature_id and NOT (OF.residues = '' or OF.residues IS NULL)) as tbl1 ; Array ( [:upstream] => 0 [:downstream] => 0 [:feature_id] => 2316 ) in chado_query() (line 1704 of 
/Users/Almsaeed/Work/DevSites/Tripal/sites/all/modules/tripal/tripal_chado/api/tripal_chado.query.api.inc).
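
Reading the query, the length passed to `substring()` is `upstream + (fmax - fmin) + downstream`, and two of the CASE branches return `seqlen - fmax`. That value is negative whenever a feature ends beyond the recorded length of its landmark. The sketch below re-implements only that arithmetic in Python. The coordinates and lengths are illustrative values, not taken from feature 2316, and whether this is the actual cause of the error is an assumption.

```python
def substring_length(fmin, fmax, strand, seqlen, upstream=0, downstream=0):
    """Mirror the CASE expressions that compute the substring length."""
    if strand >= 0:
        up = fmin if fmin - upstream <= 0 else upstream
        down = seqlen - fmax if fmax + downstream > seqlen else downstream
    else:
        up = seqlen - fmax if fmax + upstream > seqlen else upstream
        down = fmin if fmin - downstream <= 0 else downstream
    return up + (fmax - fmin) + down


# Feature lies inside a landmark of length 50000: length is fmax - fmin.
inside = substring_length(16315, 44054, 1, 50000)

# Same feature on a landmark whose stored sequence is only 59 characters long:
# the "downstream" term becomes 59 - 44054 and the total goes negative.
beyond = substring_length(16315, 44054, 1, 59)
```

If this reading is right, comparing `fmax` with the landmark's `seqlen` for the failing feature would confirm or rule it out.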

BLAST XML annotations are split into 100 files and don't need to be

New datasets

I probably want:

- A second organism to load
- Natural diversity data (GWAS, SNPs)

Where do I get the FASTA files?

The RTD guide does not tell me where to get the FASTA file.

error in tree shell script

./annotate.sh \
out/sequences/mrna_mini.fasta \
out/sequences/polypeptide_mini.fasta \
/staton/libraries/uniprot/ \
Fexcel


./annotate.sh: line 77: out/tree/clustal.out/sequences/mrna_mini.fasta.clustal: No such file or directory
mv: rename out/sequences/mrna_mini.fasta.tree to out/tree/mrna_mini.fasta.tree: No such file or directory
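
The failing path has the shape `out/tree/clustal.` + the whole input path + `.clustal`, which suggests the script joins the input argument as given instead of its file name. The second error would then follow because the tree file was never created. The sketch below shows the difference in Python; the function names and the `clustal.` prefix are read off the error message, not taken from annotate.sh.

```python
import os


def naive_output(input_fasta, out_dir="out/tree"):
    """Join the input path as given; directories leak into the file name."""
    return f"{out_dir}/clustal.{input_fasta}.clustal"


def clustal_output(input_fasta, out_dir="out/tree"):
    """Use only the file name of the input when building the output path."""
    name = os.path.basename(input_fasta)
    return f"{out_dir}/clustal.{name}.clustal"
```

In a shell script the equivalent step would be taking `basename` of the argument before building output paths.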

DOI with release

publish release when ready. Zenodo integration already added.

documentation needed for KEGG data

https://github.com/statonlab/tripal_dev_mini_dataset/tree/master/kegg_annotations

Create annotated set for 1, or 2, more species

README needs to be cleared and link to RTD more explicitly

Provide software versions for all tools

page 5, line 29: InterProScan 5.4 was used to produce functional annotation. It would have been better to use a recent version of this software (5.29-68.0 at the time of writing) to fit with data produced on recent projects. The versions and settings of the other tools used (BLAST, MAFFT, ...) should also be specified in the same paragraph.

Update documentation: create index in README.

FROM #6

Compressed XML files - Unable to open on Windows

Issue

I cannot open the XML files at https://github.com/statonlab/tripal_dev_mini_dataset/tree/master/blast_annotations on Windows; the compressed files cannot be uncompressed there.

Germplasm and Genotype data

I've been enjoying Tripal DevSeed as a quick way to get a testing environment up but need data to test germplasm and genotype-based functionality... I am willing to contribute to this project to see such data added if there is interest :-)

GFF

I would like a GFF of just these genes.

This would also mean including the SCAFFOLDS which can be loaded.

Issue with data source

https://github.com/statonlab/tripal_dev_mini_dataset/blob/master/documentation/loading_FASTA.md

Issue

The opening section on creating an analysis lacks information about the data source, and the screenshot of the analysis with the fields filled in also lacks Data Source.

fasta loader requires regexp

Do other loaders allow type specification without a regexp? If so, fix the FASTA importer.
If not, modify the devseed instructions with a working regexp, i.e. (>(.*?))

update documentation: Proteins no longer need regexp

Proteins no longer need a regexp in the new F. excelsior data, since the protein name equals the parent name.

merge conflict in bin/readme

empty landmark files end up with weird sequences in db

the landmark file:

>Contig10036
>Contig1
>Contig0
>Contig100
>Contig10022
>Contig10023
>Contig10035
>Contig1001
>Contig10012
>Contig1002
>Contig10026
>Contig10018
>Contig1003
>Contig10030
>Contig10
>Contig10011
>Contig10005
>Contig10002
>Contig1000
>Contig10000
>Contig10001

Here's an entry in the db:

 6d9693ada0fcaab94fe1170c9085c065        598     f       f       2019-01-03 13:06:22.549905      2019-01-03 13:06:22.549905
1               1       Contig10036     Contig10036     >Contig1>Contig0>Contig100>Contig10022>Contig10023>Contig10035>Contig1001>Contig10012>Contig1002>Contig10026>Contig10018>Contig1003>Contig10030>Contig10>Contig10011>Contig10005>Contig10002>Contig1000>Contig10000>Contig10001 223     9bd8839b32b0ed413e160b2cff283c79        598     f
       f       2019-01-03 13:06:22.512179      2019-01-03 13:06:22.512179
16              1       Contig10011     Contig10011     >Contig10005>Contig10002>Contig1000>Contig10000>Contig10001     59      97f18e0396bd49965a2a993354920224        598     f       f       2019-01-03 13:06:22.62867       2019-01-03 13:06:22.62867
14              1       Contig10030     Contig10030     >Contig10>Contig10011>Contig10005>Contig10002>Contig1000>Contig10000>Contig10001        80      6c0f5e65211364d10c6b2e9c7644efa0        598     f       f       2019-01-03 13:06:22.616003      2019-01-03 13:06:22.616003
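
In the rows above, the residues column holds the header lines that followed each record. For Contig10011, for example, the five remaining headers add up to the stored length of 59. A small pre-load check for header-only records might look like the sketch below; `empty_records` is a hypothetical helper, not part of the loader.

```python
def empty_records(lines):
    """Return the names of FASTA records that have a header but no sequence."""
    empty, current, has_sequence = [], None, False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current is not None and not has_sequence:
                empty.append(current)
            current, has_sequence = line[1:], False
        elif line and current is not None:
            has_sequence = True
    if current is not None and not has_sequence:
        empty.append(current)
    return empty
```

Running this over the landmark file above would list every contig, since none of the headers is followed by sequence.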

remove HWG specific language

provide up to date SQL dump, with and without ontologies, of data in this set

Want to pair with TripalDock to allow fast launching of functional, seeded dev sites.

I don't really know how to handle this as far as a smart build goes. Loading the ontologies is very slow, so perhaps we keep two base images: one with ontologies, one without. Then we load everything in, push it to Docker Hub tagged as :with_data, and create the dump.
The problem is that when/if the CVs change, we'll have to reload all the data, which is a pain.

split trimming and annotating scripts

The trim step is just too variable because it's too hard to predict how the mRNA, protein, and GFF files will relate. I'll include a couple of examples and the scripts, but I know they won't work in all cases.

The annotations and loaders, on the other hand, are general.

Viewing BLAST Results - Step Missing.

Issue

The steps explaining how to view the BLAST results with the mRNA content are missing a step: checking for new fields in "manage fields". The screenshots include the section referenced and a screenshot taken once the new fields have been added.

Trembl blast fails for structural errors

Loading the Trembl XML fails with the following error.

Running 'Chado BLAST XML results loader' importer
NOTE: Loading of file is performed using a database transaction. 
If it fails or is terminated prematurely then all insertions and 
updates are rolled back and will not be found in the database

XMLReader::read():                                                   [warning]
/Users/Almsaeed/Work/DevSites/Tripal/sites/default/files/tripal/users/1/Fexcel.TREMBL.xml:42637:
parser error : Extra content at the end of the document
BlastImporter.inc:424
XMLReader::read(): </Hit> BlastImporter.inc:424                      [warning]
XMLReader::read():       ^ BlastImporter.inc:424                     [warning]
XMLReader::readInnerXml():                                           [warning]
/Users/Almsaeed/Work/DevSites/Tripal/sites/default/files/tripal/users/1/Fexcel.TREMBL.xml:42637:
parser error : Extra content at the end of the document
BlastImporter.inc:577
XMLReader::readInnerXml(): </Hit> BlastImporter.inc:577              [warning]
XMLReader::readInnerXml():       ^ BlastImporter.inc:577             [warning]
Percent complete: 100.00 %. Memory: 38,355,856 bytes.
Done.


Done.

Remapping Chado Controlled vocabularies to Tripal Terms...
Done.

Swiss-Prot however succeeds!
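
"Extra content at the end of the document" is the parser reporting that something follows the closing root element, here a `</Hit>` around line 42637. How the stray tag got there is not established in this issue. A quick well-formedness check before loading could be sketched as follows; `xml_problem` is a hypothetical helper, and Python's parser words the error differently from PHP's XMLReader.

```python
import xml.etree.ElementTree as ET


def xml_problem(text):
    """Return None if `text` is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


# Well-formed: a single root element that is properly closed.
ok = xml_problem("<BlastOutput><Hit/></BlastOutput>")

# Malformed: a closing tag after the root element has already ended.
bad = xml_problem("<BlastOutput></BlastOutput></Hit>")
```

For a file on disk, `ET.parse(path)` raises the same `ParseError` and reports the line and column of the problem.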

change biosample XML to be randomly generated

so that properties are nice etc.

I think one set of 3 and another set of 3 will be enough.

biosamples not recognized by importer

the name comes from:
$biomaterial_names[] = $xml->BioSample[$i]->Ids->Id[1]; but results in null.

Looks like it expects Ids not IDs.

 ["IDs"]=>
  string(8) "continue"
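
XML element names are case-sensitive, so a lookup for `Ids` finds nothing when the file uses `IDs`. The sketch below shows the same effect in Python. The sample XML and the `second_id` helper are made up for illustration and only mirror the `Ids->Id[1]` access quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioSample fragment using the "IDs" spelling.
SAMPLE = """<BioSampleSet>
  <BioSample><IDs><Id>first</Id><Id>sample_A</Id></IDs></BioSample>
</BioSampleSet>"""


def second_id(biosample, container):
    """Return the text of the second <Id> inside `container`, or None."""
    ids = biosample.find(container)
    if ids is None:
        return None
    items = ids.findall("Id")
    return items[1].text if len(items) > 1 else None


biosample = ET.fromstring(SAMPLE).find("BioSample")
wrong_case = second_id(biosample, "Ids")  # lookup with the spelling the importer uses
right_case = second_id(biosample, "IDs")  # lookup with the spelling in the file
```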

Documentation needs to be improved for publication

Need to have a table that is Data -> Module -> loader with instructions

missing image: ![](img/ips/ipsdoc_2.png)

Hi Joe: please note the [IPS documentation](https://github.com/statonlab/tripal_dev_mini_dataset/blob/master/documentation/loading_IPS.md) has a missing image: ![](img/ips/ipsdoc_2.png)

jbrowse tracks

For tripal_apollo I'd like jbrowse tracks to easily load for CI testing.

readthedocs doesn't format tables right?

https://tripal-devseed.readthedocs.io/en/latest/loading_GFF.html

displays as below

| column | ID | explanation | example value | |——–|————|———————————————————————————————————————————————–|——————————————————————————–| | 1 | seqid | Name of the landmark chromosome or scaffold (not the feature itself). | Contig0 | | 2 | source | Program name, data source, etc | FRAEX38873_v2 | | 3 | type | Sequence ontology term for type_id of feature | gene | | 4 | start | start of the feature. | 16315 | | 5 | end | end of the feature. | 44054 | | 6 | score | Float value or . The score, because the feature was computationally predicted. ignore. | . | | 7 | strand | Can be = or -. Refers to the strand of DNA: ignore | + | | 8 | phase | Can be 0, 1, 2, or . Refers to the open reading frame, you can ignore. | . | | 9 | attributes | This includes the actual name for the feature that will be created (in this case FRAEX38873_v2_000000010). It also includes the Parent= tag. | ID=FRAEX38873_v2_000000010;Name=FRAEX38873_v2_000000010;biotype=protein_coding |

Note that to load the proteins, you must link them with the regexp (FRA.*?).1

At a glance it looks like the .1 is necessary?

Cannot find a unique feature for the parent 'FRAEX38873_v2_000001980' of type 'mRNA' for the feature.
WD tripal_job: Cannot find a unique feature for the parent           [error]
'FRAEX38873_v2_000001980' of type 'mRNA' for the feature.
[site http://default] [TRIPAL ERROR] [TRIPAL_JOB] Cannot find a unique feature for the parent 'FRAEX38873_v2_000001980' of type 'mRNA' for the feature.
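
On the question of the `.1`: in a regular expression an unescaped `.` matches any character, so `.1` means "any character followed by 1", not a literal ".1" suffix. The sketch below shows the effect with Python's `re`. The protein name is a hypothetical example built from the ID in the error, and how Tripal's PHP importer applies the pattern is not shown here.

```python
import re

protein_id = "FRAEX38873_v2_000001980.1"  # hypothetical protein name

# Unescaped dot: the lazy group stops at the first "<any char>1" it can find,
# which falls inside the run of digits rather than at the ".1" suffix.
unescaped = re.search(r"(FRA.*?).1", protein_id).group(1)

# Escaped dot: the group runs up to the literal ".1" suffix.
escaped = re.search(r"(FRA.*?)\.1", protein_id).group(1)
```

Here `unescaped` is `FRAEX38873_v2_0000` while `escaped` is `FRAEX38873_v2_000001980`, so escaping the dot matters for IDs that contain a 1 before the suffix.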

instructions needed for loading GFF and scaffolds

generated in #27
