opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company, Zetta Genomics. Please contact [email protected] for bug reports and feature requests.

License: Apache License 2.0

Shell 0.46% Python 2.11% R 1.88% Java 71.07% JavaScript 2.21% HTML 0.01% Jupyter Notebook 22.16% Dockerfile 0.05% Makefile 0.01% Smarty 0.01% Mustache 0.03%

opencga's People

Contributors

agaor, antonioaltamura, antonior26, bart-jansen, cyenyxe, dapregi, dgomezpere, frasator, halender, imedina, j-coll, javild, jjcollinge, jlfrueda, jmmut, jtarraga, juanfesanahuja, juanrizetta, laulopezreal, lawrencegripper, marrobi, martinpeck, mbleda, melsiddieg, pamag, pawanpal01, pfurio, roalva1, swaathik, wbari

opencga's Issues

tag names in vcf header with dots

DBObjectToVariantSourceConverter complains when a VCF has a header line like:

##source_20120715.1=...

MongoDB doesn't allow dots in field names, so such header lines are not loaded. Adding a character mapping to the converter may be worthwhile.
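A character mapping of this kind could look like the sketch below. It is only an illustration: FieldNameCodec and the placeholder string are hypothetical, not part of DBObjectToVariantSourceConverter, and the sketch assumes the placeholder never occurs in a real header key (otherwise decoding would not be reversible).

```java
public class FieldNameCodec {
    // Hypothetical placeholder; assumed never to appear in an original key.
    public static final String DOT_PLACEHOLDER = "&#46;";

    /** Replaces every dot so the key can be used as a MongoDB field name. */
    public static String encode(String key) {
        return key.replace(".", DOT_PLACEHOLDER);
    }

    /** Restores the original key when reading the document back. */
    public static String decode(String storedKey) {
        return storedKey.replace(DOT_PLACEHOLDER, ".");
    }

    public static void main(String[] args) {
        String stored = encode("source_20120715.1");
        System.out.println(stored);
        System.out.println(decode(stored));
    }
}
```

The same pair of functions would have to be applied symmetrically: encode on write, decode on read, so that callers never see the placeholder.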

Initial work in study configuration file

To keep the storage framework independent of OpenCGA Catalog, so that it can be reused in other existing platforms, a mechanism is needed to pass the study configuration, including its files and samples.

storage-mongodb variant schema change: files to studies

Currently, the MongoDB schema for variants is file oriented. The actual mongo document looks like this:

{
  "_id" : "22_1234_A_T",
  "files" : [ 
    {"sid" : "1000g", "fid" : "f1", "samp": {"def" : "0|0", "0|1" : [254, 623] } , "attrs":{ } ... }, 
    {"sid" : "1000g", "fid" : "f2", "samp": {"def" : "0|0", "0|1" : [54, 78 ] } , "attrs":{ } ... } 
  ]
}

This can be a constraint when the samples of a study are provided in separate files and we want to query all the samples across those files. The files should be merged into a single element in the array, so the document looks like this:

{
  "_id" : "22_1234_A_T",
  "studies" : [ 
    { "sid" : "1000g",  "gt": {"0|1" : [54, 78, 254, 623] }, ... } 
  ]
}

Because the default genotype may differ between files, a single default genotype must be chosen for the whole study. This should not noticeably increase the size of the collection, because the default value tends to be the same in 99% of the documents. The only open question is which default genotype to use. It depends on the ploidy of the species and on whether the variants are phased. A first approach will require this information from the user. In the future, the default genotype should be inferred by inspecting the files.

In this scenario, the "attrs" field, which contains the VCF INFO column, will disappear. This field is rarely used, and its contents tend to be parsed into stats (#61) or into annot.
For the cases where this field needs to be retrieved, Tabix access must be implemented.
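The merge of per-file entries into a single study entry could be sketched as follows. This is an illustration under stated assumptions, not the storage-mongodb implementation: StudyGenotypeMerger and FileEntry are hypothetical names, samples are identified by integer ids, and the study-wide default genotype is supplied by the caller. Samples whose file default differs from the study default are stored explicitly, so no genotype is lost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class StudyGenotypeMerger {

    /** Hypothetical per-file view: all samples, the file default, and the non-default genotypes. */
    public static class FileEntry {
        final List<Integer> sampleIds;
        final String defaultGenotype;
        final Map<String, List<Integer>> genotypes;

        public FileEntry(List<Integer> sampleIds, String defaultGenotype,
                         Map<String, List<Integer>> genotypes) {
            this.sampleIds = sampleIds;
            this.defaultGenotype = defaultGenotype;
            this.genotypes = genotypes;
        }
    }

    /** Builds the study-level genotype map; samples with the study default are left implicit. */
    public static Map<String, List<Integer>> merge(String studyDefault, List<FileEntry> files) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (FileEntry file : files) {
            Set<Integer> remaining = new LinkedHashSet<>(file.sampleIds);
            for (Map.Entry<String, List<Integer>> e : file.genotypes.entrySet()) {
                remaining.removeAll(e.getValue());
                if (!e.getKey().equals(studyDefault)) {
                    merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
                }
            }
            // Samples not listed explicitly carry the file default; keep them
            // explicitly only when that default differs from the study default.
            if (!file.defaultGenotype.equals(studyDefault) && !remaining.isEmpty()) {
                merged.computeIfAbsent(file.defaultGenotype, k -> new ArrayList<>()).addAll(remaining);
            }
        }
        merged.values().forEach(Collections::sort);
        return merged;
    }
}
```

With the two example files above and a study default of "0|0", the sketch yields a single "0|1" entry holding the sample ids from both files.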

Remove files type "index"

Currently, when a file is indexed from the Analysis layer, a new catalog file entry of type "index" is created to represent the indexation of the file in a specific Storage Engine. It works like a virtual index to the data in the storage engine. This entry can be confusing for end users and doesn't provide any useful feature.

This catalog entry suits the current file-oriented indexation, where independent files can be indexed and then queried across. This is going to change. In the future, files will be indexed within the study, and queries will run over the whole study, filtering by sample or cohort rather than by file.

The indexation information must be stored in a new field on File.java. See Index.java

VariantVcfMongoDataWriter

Create a new VariantVcfMongoDataWriter.

The VariantVcfMongoDataWriter inserts all the data into MongoDB.

Accept more aggregated files

Aggregated files are a complicated problem because of the number of different ways in which aggregated data can be represented. A generic way to parse this data has to be implemented in order to understand aggregated files.

More flexibility when adding samples to study

If a PED file is loaded after the associated VCF file, the samples are already stored in catalog and thus rejected with a message like:

[main] ERROR org.opencb.opencga.app.cli.main.OpenCGAMain - Sample { name: 'NA12347'} already exists.

These existing samples should be detected and updated using the information from the PED file.
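The "detect and update" behaviour could be sketched as an upsert. This is a minimal illustration, not Catalog code: SampleUpsert is a hypothetical in-memory stand-in for the sample collection, keyed by sample name, and the attributes are free-form strings standing in for whatever the PED file provides.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SampleUpsert {
    // Hypothetical in-memory stand-in for the catalog's samples, keyed by name.
    private final Map<String, Map<String, String>> samplesByName = new LinkedHashMap<>();

    /**
     * Creates the sample if it is new; otherwise merges the given attributes
     * into the existing entry instead of rejecting it.
     *
     * @return true if the sample was created, false if it was updated
     */
    public boolean upsert(String name, Map<String, String> attributes) {
        Map<String, String> existing = samplesByName.get(name);
        if (existing == null) {
            samplesByName.put(name, new HashMap<>(attributes));
            return true;
        }
        existing.putAll(attributes);
        return false;
    }

    public Map<String, String> get(String name) {
        return samplesByName.get(name);
    }
}
```

Returning whether the sample was created or updated lets the caller report both cases without treating the second one as an error.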

CatalogPermissionManager

A separate class with methods for (at least) all CRUD operations on each resource type.
Also, add a CatalogPermissionException class for permission-specific exceptions.

This class should contain all the permissions logic. It will manage Roles and ACLs.

In future versions, this class will also manage permissions at the file system level.

OpenCGA Catalog needs more JUnit tests

Catalog tests do not cover most of the use cases, and there are some dependencies between the tests. They must be redone to fully cover Catalog functionality.

Catalog JUnit tests need to be redone

The current tests have many dependencies between them, which makes it difficult to test specific aspects. Also, JUnit's facilities for testing expected exceptions should be used.

New postLoad stats pipeline

Stats calculation is done at the transform step. It must be moved to the postLoad step, and work over the loaded data.

Configuring how genotypes are returned

Genotypes are always returned from the database using numeric indexes for the alleles. It would be useful to allow more flexibility and switch the representation to nucleotides on demand.
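The conversion from allele indexes to nucleotides could be sketched as below. GenotypeFormatter is a hypothetical helper, not an existing OpenCGA class; it assumes the allele list is ordered with the reference at index 0 followed by the alternates, as in VCF, and it leaves missing calls (".") untouched.

```java
import java.util.List;

public class GenotypeFormatter {

    /**
     * Rewrites a genotype such as "0|1" using the actual alleles.
     * The separators ('|' phased, '/' unphased) are preserved.
     */
    public static String toNucleotides(String genotype, List<String> alleles) {
        StringBuilder out = new StringBuilder();
        StringBuilder index = new StringBuilder();
        for (char c : genotype.toCharArray()) {
            if (c == '|' || c == '/') {
                out.append(resolve(index.toString(), alleles)).append(c);
                index.setLength(0);
            } else {
                index.append(c);
            }
        }
        return out.append(resolve(index.toString(), alleles)).toString();
    }

    // "." marks a missing allele call and is kept as-is.
    private static String resolve(String index, List<String> alleles) {
        return index.equals(".") ? "." : alleles.get(Integer.parseInt(index));
    }
}
```

For the variant 22_1234_A_T used elsewhere on this page, the allele list would be ["A", "T"], so "0|1" becomes "A|T".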

files/create: checksum matches null

When running the file creation tool, even when it finishes successfully, it displays this message about the checksum:

[main] INFO org.opencb.opencga.catalog.CatalogFileManager - Coping file from file:///home/cyenyxe/Templates/chr22_10K.vcf.gz to file:/home/cyenyxe/.opencga/catalog/users/biouser/projects/1/2/chr22_10K.vcf.gz
[main] INFO org.opencb.opencga.catalog.CatalogFileManager - Checksum matches null

Empty options must be ignored

If an empty (non-null) studies list is provided as an argument, the VariantMongoDBAdaptor code does not ignore it. Instead, it is parsed as a study named "", and no results are returned.
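The guard being asked for could look like this sketch. QueryFilterBuilder and the "studies" key are hypothetical and only stand in for the filter-building code in VariantMongoDBAdaptor; the point is that a null list and an empty list are both treated as "no filter".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryFilterBuilder {

    /** Adds the studies filter only when at least one study was actually given. */
    public static Map<String, Object> build(List<String> studies) {
        Map<String, Object> filter = new LinkedHashMap<>();
        if (studies != null && !studies.isEmpty()) {
            filter.put("studies", studies);
        }
        return filter;
    }
}
```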

Remove variant-lib repository dependency

OpenCGA depends on variant-lib for the VariantRunner. The variant-lib repository is going to be removed, so this dependency must be removed.
The variant-lib code currently in use must be moved either to biodata or to OpenCGA.

Improve variant command-line help

The description of the --aggregated option is not clear enough. The supported values, as well as the default, must be listed when the help is shown.

This may also apply to other arguments such as the variant study type.

Server UploadFile concatenate and finish uploading process.

Concatenation of the _partial files should be done by the daemon (or, in the future, by the workers).

When the lastChunk arrives, file.status should change from "uploading" to "uploaded".
The daemon should then concatenate the chunks, calculate the checksum, move the file to its final destination and, finally, set the status to "ready".
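The daemon-side step could be sketched as follows. This is not the OpenCGA daemon: UploadFinalizer is a hypothetical name, SHA-256 is an assumed checksum algorithm (the issue does not name one), and the status change to "ready" is left to the caller once this method returns.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class UploadFinalizer {

    /**
     * Concatenates the chunks, in the given order, into a temporary sibling
     * file, moves it to the final destination, and returns the hex checksum
     * of the concatenated content.
     */
    public static String concatenate(List<Path> chunks, Path target)
            throws IOException, NoSuchAlgorithmException {
        // Assumption: SHA-256; swap in whatever algorithm the catalog expects.
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        Path tmp = target.resolveSibling(target.getFileName() + ".part");
        try (OutputStream out = new DigestOutputStream(Files.newOutputStream(tmp), digest)) {
            for (Path chunk : chunks) {
                Files.copy(chunk, out); // the digest is updated as bytes pass through
            }
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        return String.format("%064x", new BigInteger(1, digest.digest()));
    }
}
```

Writing to a temporary file first means the final path only ever holds a complete file, which matches the idea of setting "ready" as the very last step.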

Check better when a variant is already in Mongo

When multiple instances of OpenCGA's variant loading tool process the same chromosome from different studies at the same time, inconsistencies may arise. This is probably because the tool checks only once whether a variant has already been saved. This condition should be checked not only at the beginning of the DBObject generation, but again right before the insertion.

Fix tests in org.opencb.opencga.lib.common.IOUtilsTest

The referenced files are in a user's home directory and can't be found on any other machine. They must be loaded using the getResource() method. An example can be found in opencga-storage, in the class VariantMongoDBAdaptorTest.

Also, asserts must be used instead of System.out.println.

Dots not allowed in VCF sample names

Some studies can't be loaded into Mongo because their samples contain dots in their names, but Mongo won't allow them as keys. Dots should be replaced with a rarely used character ($, £, whatever we decide) and shown properly when they are returned as data model objects by the DB Converter.

CatalogDBAdaptor and CatalogManager with QueryOptions

These classes should accept a QueryOptions param with "exclude" or "include", each holding a list of field paths.

These fields can be relative to the returned element or absolute from the user, e.g.:

getStudy( <ID>, QueryOptions{ include: ["alias"]} )

vs

getStudy( <ID>, QueryOptions{ include: ["projects.studies.alias"]} )

To unify field names across all the methods, using absolute field paths is recommended.

Fix tests in org.opencb.opencga.lib.common.IOUtilsTest

The referenced file is in a user's home directory and can't be found on other computers. It should be loaded using the getResource() method. An example can be found in opencga-storage, in the VariantMongoWriterTest class.

Also, asserts should be used instead of System.out.println.

The output of samples/search is not a valid JSON file

The output from the samples/search tool is something like the following: http://pastebin.com/3gYL2X6J

JSON parsers such as Python's complain when trying to parse this file. A valid JSON file with multiple documents must represent them as a comma-separated list surrounded by square brackets.
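Producing such an array from individual documents could be sketched like this. JsonArrayWriter is a hypothetical helper, and the sketch assumes every input string is already a valid JSON document on its own (for example one line of PLAIN_JSON output); it does no JSON validation itself.

```java
import java.util.List;
import java.util.stream.Collectors;

public class JsonArrayWriter {

    /** Joins already-serialized JSON documents into one JSON array. */
    public static String toJsonArray(List<String> documents) {
        return documents.stream().collect(Collectors.joining(",\n", "[\n", "\n]"));
    }
}
```

The result can be parsed in a single call by any standard JSON parser, instead of line by line.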

(There is a workaround which is using PLAIN_JSON instead of PRETTY_JSON as output and then parsing each line separately, but still it is not very convenient)

New AggregatedFactory

Add a new factory for aggregated VCFs.
This factory will transform aggregated VCFs into Variant objects.

Cohort stats

Stats per variant are calculated over the whole set of samples. It should be possible to define different sets of samples (cohorts) and calculate stats for each of them.
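Per-cohort calculation could be sketched with one simple statistic, the alternate allele frequency. CohortStats is a hypothetical class, not the OpenCGA stats pipeline; it assumes genotypes use allele indexes such as "0|1", counts every non-reference allele as alternate, and skips missing calls (".") and samples without a genotype.

```java
import java.util.Map;
import java.util.Set;

public class CohortStats {

    /** Fraction of called alleles in the cohort that are not the reference allele. */
    public static double altAlleleFrequency(Map<String, String> genotypeBySample,
                                            Set<String> cohort) {
        int alt = 0;
        int total = 0;
        for (String sample : cohort) {
            String genotype = genotypeBySample.get(sample);
            if (genotype == null) {
                continue; // sample has no call for this variant
            }
            for (String allele : genotype.split("[|/]")) {
                if (allele.equals(".")) {
                    continue; // missing allele call
                }
                total++;
                if (!allele.equals("0")) {
                    alt++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) alt / total;
    }
}
```

Calling the method once per cohort, with the same genotype map and a different sample set each time, gives the per-cohort values; passing all samples reproduces the whole-set behaviour.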
