opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company, Zetta Genomics. Please contact [email protected] for bug reports and feature requests.

License: Apache License 2.0

Shell 0.46% Python 2.11% R 1.88% Java 71.07% JavaScript 2.21% HTML 0.01% Jupyter Notebook 22.16% Dockerfile 0.05% Makefile 0.01% Smarty 0.01% Mustache 0.03%

opencga's People

Contributors

pabarcgar j-coll cyenyxe jesusrodrc jmmut roalva1 jpdna wangzhenfei kalyanreddyemani pawanpal01 mrg7 antonior26 mattdmem mh11 agaor swaathik danstaines melsiddieg mermegar pamag ernoc javild wbari priesgo ebivariation dapregi bi-kim eiathom saharelmukashfi lemnx nicholsn punchmk pkleanthous-zz gtlangseth ahmed-abdelmoneim jlfrueda alemarcha babelomics renyaoxiang mbrukman genomicsiter-developers greglever marrobi bart-jansen lawrencegripper martinpeck avodovnik jjcollinge my-dna-map mgviz kevinpetersavage rgiffen meeji ktp-forked-repos jmcabandara sconeill vvvk0613 pilarnatividad szbalx fizquierdo wychytu antonioaltamura travissalascox mwhamgenomics amarantolaw andbloch kbri-biology-experiment-computation franco382 fabbondanza ardicon laulopezreal wsp00nr julie-sullivan drfranknlee mbleda phamidko spicysomtam imedina wisienkas venkyvb xahiru ealagorm pelamee cvedetect kkavya0391 imonz gpveronica pelahimovic juanfesanahuja magdalenazz vmarzal happyzjp optionfactory bingli2019 genostack

opencga's Issues

tag names in vcf header with dots

DBObjectToVariantSourceConverter complains when a VCF has a header line like:

##source_20120715.1=...

because MongoDB doesn't allow dots in field names, so such lines are not loaded. It may be worth adding a character mapping to the converter.

Initial work in study configuration file

In order to keep the storage framework independent of OpenCGA Catalog and therefore to be reused in other existing platforms, a mechanism to pass the study configuration with files and samples is needed.

storage-mongodb variant schema change: files to studies

Currently, the MongoDB schema for variants is file oriented. The actual mongo document looks like this:

{
  "_id" : "22_1234_A_T",
  "files" : [ 
    {"sid" : "1000g", "fid" : "f1", "samp": {"def" : "0|0", "0|1" : [254, 623] } , "attrs":{ } ... }, 
    {"sid" : "1000g", "fid" : "f2", "samp": {"def" : "0|0", "0|1" : [54, 78 ] } , "attrs":{ } ... } 
  ]
}

This can be a constraint when the samples of a study are provided in separate files and a query must cover all samples across those files. The files should be merged into a single element in the array, like this:

{
  "_id" : "22_1234_A_T",
  "studies" : [ 
    { "sid" : "1000g",  "gt": {"0|1" : [54, 78, 254, 623] }, ... } 
  ]
}

Because the default genotype may differ between files, a single default genotype must be chosen for the whole study. This won't compromise the size of the collection, because the default value tends to be the same in 99% of the documents. The only decision to make is which default genotype to use. It depends on the ploidy of the species and on whether the variants are phased. The first approach will require this information from the user. In the future, the default genotype should be inferred by examining the files.

In this scenario, the "attrs" field, which contains the VCF INFO column, will disappear. Its fields are rarely used, and they tend to be parsed into stats (#61) or into annot.
For the cases where this field needs to be retrieved, Tabix access must be implemented.
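
The file-to-study merge described in this issue could be sketched roughly as follows. This is only an illustration, not the storage-mongodb implementation: the class `StudyGenotypeMerger` is hypothetical, and it assumes that sample indexes are already study-wide and that all files share the same default genotype.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenCGA: merges per-file genotype maps
// (genotype -> sample indexes) into one study-level map.
final class StudyGenotypeMerger {

    static Map<String, List<Integer>> merge(List<Map<String, List<Integer>>> perFile) {
        Map<String, List<Integer>> study = new LinkedHashMap<>();
        for (Map<String, List<Integer>> fileGenotypes : perFile) {
            for (Map.Entry<String, List<Integer>> e : fileGenotypes.entrySet()) {
                study.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        // Keep sample indexes sorted, as in the merged "studies" document.
        for (List<Integer> samples : study.values()) {
            Collections.sort(samples);
        }
        return study;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> f1 = Collections.singletonMap("0|1", Arrays.asList(254, 623));
        Map<String, List<Integer>> f2 = Collections.singletonMap("0|1", Arrays.asList(54, 78));
        System.out.println(merge(Arrays.asList(f1, f2))); // {0|1=[54, 78, 254, 623]}
    }
}
```

Samples carrying the default genotype are deliberately left out of the map, mirroring the "def" convention of the current schema.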

Remove files type "index"

Now, when a file is indexed from the Analysis layer, a new catalog file entry is created with type "index", to represent the indexation of the file in a specific Storage Engine. It works like a virtual index to the data in the storage engine. This file can be confusing for end users, and doesn't provide any interesting feature.

This catalog entry is a good approach with the current file oriented indexation, where you can index independent files, and then, fetch over different files. This is going to change. In the future, the files are going to be indexed in the study, and the queries are going to be done over the whole study, filtering by sample or cohort, not by file.

The indexation information must be stored in a new field on File.java. See Index.java

VariantVcfMongoDataWriter

Create a new VariantVcfMongoDataWriter.

The VariantVcfMongoDataWriter inserts all the data into MongoDB.

Accept more aggregated files

Aggregated files are a complicated problem because aggregated data can be represented in many different ways. A generic way to parse this data has to be implemented in order to understand aggregated files.

Variant transformation should be multi-threaded

Currently the variant transformation to the data model is executed in a single thread. A multi-threaded implementation could improve performance considerably.
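
One possible shape for a multi-threaded transformation is a fixed-size thread pool that keeps results in input order. The sketch below is an assumption about how this could look: `ParallelTransform` and the stand-in transform function are hypothetical, and the real step would map VCF records to the variant data model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: applies a transform to every input on a thread pool.
final class ParallelTransform {

    static <I, O> List<O> transformAll(List<I> inputs, Function<I, O> transform, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> transform.apply(input);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so the output order matches the input order.
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while transforming", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Transformation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Upper-casing stands in for the real record-to-variant conversion.
        System.out.println(transformAll(Arrays.asList("a", "b", "c"), s -> s.toUpperCase(), 2));
    }
}
```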

IO managers must use filesystem permissions

Filesystem permissions must be implemented for both POSIX and HDFS filesystems. This will provide more security if someone gains access.

More flexibility when adding samples to study

If a PED file is loaded after the associated VCF file, the samples are already stored in catalog and thus rejected with a message like:

[main] ERROR org.opencb.opencga.app.cli.main.OpenCGAMain - Sample { name: 'NA12347'} already exists.

These existing samples should be detected and updated using the information from the PED file.
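
The "detect and update" behaviour could look like an upsert keyed by sample name. The following is a minimal in-memory sketch, assuming a hypothetical `SampleRegistry` class rather than the real Catalog API:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the sample collection in Catalog.
final class SampleRegistry {
    // sample name -> attributes (for example, values read from a PED file)
    private final Map<String, Map<String, String>> samples = new LinkedHashMap<>();

    // Creates the sample if it is missing; otherwise merges the new attributes
    // into the existing entry. Returns true when an existing sample was updated.
    boolean upsert(String name, Map<String, String> attributes) {
        Map<String, String> existing = samples.get(name);
        if (existing == null) {
            samples.put(name, new LinkedHashMap<>(attributes));
            return false;
        }
        existing.putAll(attributes);
        return true;
    }

    Map<String, String> get(String name) {
        return samples.get(name);
    }

    public static void main(String[] args) {
        SampleRegistry registry = new SampleRegistry();
        registry.upsert("NA12347", Collections.<String, String>emptyMap()); // created from the VCF
        registry.upsert("NA12347", Collections.singletonMap("sex", "female")); // updated from the PED
        System.out.println(registry.get("NA12347"));
    }
}
```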

CatalogPermissionManager

Separated class with methods for (at least) all CRUD operations for each resource type.
Also, add a CatalogPermissionException class for specific exceptions.

This class should contain all the permissions logic and will manage Roles and ACLs.

In future versions, this class will also manage permissions at the file system level.

OpenCGA Catalog needs more JUnit tests

Catalog tests do not cover most of the use cases, and there are some dependencies between the tests. They must be redone to fully cover Catalog functionality.

Catalog JUnit tests need to be redone

Current tests have many dependencies between them, which makes it difficult to test some specific aspects. Also, JUnit's exception-testing support should be used.

New postLoad stats pipeline

Stats calculation is done at the transform step. It must be moved to the postLoad step, and work over the loaded data.

CatalogManagerException should be renamed

CatalogManagerException should be renamed to CatalogDBException to properly reflect that the exception comes from the database.

Update third party dependencies

Some third-party dependencies such as Guava, SLF4J, SQLite, ... are quite old.

Using Snappy instead of Gzip for data models serialization

Currently, serializing the data model with gzip is a bottleneck during the variant transform step. Snappy offers much better performance.

OpenCGA Catalog changes and new functionality

Some changes need to be implemented to allow cohorts and sample annotations. The API needs to be completed with many new methods.

Configuring how genotypes are returned

Genotypes are returned from the database always using numeric indexes for the alleles. It would be interesting to allow more flexibility and change the representation to nucleotides on demand.
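
As an illustration of the on-demand conversion, the sketch below maps allele indexes to nucleotides while preserving the phase separator. `GenotypeFormatter` is a hypothetical name, and the allele array is assumed to hold the reference at position 0 followed by the alternates.

```java
// Hypothetical helper: rewrites "0|1" style genotypes using the actual alleles.
final class GenotypeFormatter {

    // alleles[0] is the reference allele; alleles[1..] are the alternates.
    static String toNucleotides(String genotype, String[] alleles) {
        StringBuilder out = new StringBuilder();
        StringBuilder index = new StringBuilder();
        for (char c : genotype.toCharArray()) {
            if (c == '|' || c == '/') {
                // Keep the separator so phasing information is not lost.
                out.append(resolve(index.toString(), alleles)).append(c);
                index.setLength(0);
            } else {
                index.append(c);
            }
        }
        return out.append(resolve(index.toString(), alleles)).toString();
    }

    private static String resolve(String index, String[] alleles) {
        // "." marks a missing allele call and is passed through unchanged.
        return ".".equals(index) ? "." : alleles[Integer.parseInt(index)];
    }

    public static void main(String[] args) {
        System.out.println(toNucleotides("0|1", new String[] {"A", "T"})); // A|T
    }
}
```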

files/create: checksum matches null

When running the file creation tool, even when it finishes successfully it displays this message about the checksum:

[main] INFO org.opencb.opencga.catalog.CatalogFileManager - Coping file from file:///home/cyenyxe/Templates/chr22_10K.vcf.gz to file:/home/cyenyxe/.opencga/catalog/users/biouser/projects/1/2/chr22_10K.vcf.gz
[main] INFO org.opencb.opencga.catalog.CatalogFileManager - Checksum matches null

Rename opencga-lib module to opencga-core

opencga-lib is more of a core library for OpenCGA than a library for other projects. It should be renamed to make this clear.

Compress output of the accessioning tool.

The output of the accessioning tool should be compressed.

Implement POST methods for create objects from JSON

Currently only GET methods exist to create single elements. An HTTP POST method would allow the server to receive a JSON array of Studies with nested fields (e.g. files in studies).

HDFS IO manager implementation

An HDFS manager implementing the IOManager API is needed to offer the same functionality for Hadoop-based installations.

Speed up VariantMongoDBWriter

Use Bulk Operations and threads in order to speed up the load variants operation.
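
Bulk loading usually starts by grouping documents into fixed-size batches, each of which can then be sent to the database in one round trip (and, optionally, from its own thread). The batching step alone could be sketched as below. `Batches` is a hypothetical helper, and the actual bulk write call is intentionally left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a list into consecutive batches of at most batchSize items.
final class Batches {

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(partition(Arrays.asList(1, 2, 3, 4, 5), 2)); // [[1, 2], [3, 4], [5]]
    }
}
```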

mvn build dependency issue due to unresolved cellbase version

The current 'master' branch has a dependency on

<cellbase.version>3.1.0-rc</cellbase.version>

Maven Central shows version 3.1.0 as the latest available build.
The latest cellbase repo shows version 3.1-SNAPSHOT as the current version.

I am having problems building the project's 'master' branch due to this unresolved dependency.

Empty options must be ignored

If an empty (not-null) studies list is provided as argument, it is not ignored in the VariantMongoDBAdaptor code, so it parses it as a study with name "" and nothing is returned as result.
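
A guard along these lines would let the adaptor skip empty filters before building the query. This is only a sketch, and `QueryFilters.isActive` is a hypothetical name, not a method of VariantMongoDBAdaptor.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: decides whether a list-valued option should become a query filter.
final class QueryFilters {

    // A filter is applied only when the list exists and has at least one non-blank value.
    static boolean isActive(List<String> values) {
        if (values == null) {
            return false;
        }
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isActive(Collections.<String>emptyList())); // false
        System.out.println(isActive(Arrays.asList("1000g")));          // true
    }
}
```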

Remove variant-lib repository dependency

OpenCGA depends on variant-lib for the VariantRunner. The variant-lib repository is going to be removed, so this dependency must be removed.
The variant-lib code currently in use must be moved either to biodata or to OpenCGA.

studies/create displays error "name is null" when no study type specified

When the study type is not specified in the study creation tool, the following error is shown:

[main] ERROR org.opencb.opencga.app.cli.main.OpenCGAMain - Name is null

This makes it a bit confusing to trace the problem.
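
One way to make such errors traceable is to include the name of the missing parameter in the message. The sketch below assumes a hypothetical `ParamChecks` helper and a parameter called "type"; neither is taken from the OpenCGA code.

```java
// Hypothetical helper: fails fast with a message that names the missing parameter.
final class ParamChecks {

    static <T> T requireParam(T value, String paramName) {
        if (value == null) {
            throw new IllegalArgumentException("Missing required parameter '" + paramName + "'");
        }
        return value;
    }

    public static void main(String[] args) {
        String studyType = null; // simulates a study created without a type
        try {
            requireParam(studyType, "type");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```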

OpenCGA Storage CLI improvements

Storage CLI lacks many options and needs to be redone and completed. Currently it is hardly usable.

Accept multiple rs/ss IDs in a single variant

Stored variants must accept multiple rs/ss IDs instead of just one, as it currently happens.
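
In VCF, the ID column can hold several identifiers separated by semicolons, with "." meaning no identifier. Parsing it into a list might look like this sketch, where `VariantIds` is a hypothetical helper:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: splits the VCF ID column into individual identifiers.
final class VariantIds {

    static List<String> parse(String idColumn) {
        // "." is the VCF marker for a missing value.
        if (idColumn == null || idColumn.isEmpty() || idColumn.equals(".")) {
            return Collections.emptyList();
        }
        return Arrays.asList(idColumn.split(";"));
    }

    public static void main(String[] args) {
        System.out.println(parse("rs123;ss456")); // [rs123, ss456]
    }
}
```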

Command-line for variant indexing

Both in SQLite and Monbase

Improve variant command-line help

The description of the --aggregated option is not clear enough. The supported values, as well as the default, must be listed when the help shows.

This may also apply to other arguments such as the variant study type.

Server UploadFile concatenate and finish uploading process.

Concatenation of _partial files should be done by the daemon (or, in the future, by the workers).

When the lastChunk arrives, file.status should change from "uploading" to "uploaded".
Then, the daemon should concatenate, calculate checksum, move to the final destination and, then, set status to "ready".
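
The status flow described here is a small linear state machine. A sketch, assuming a hypothetical `UploadStatus` enum that only allows moving one step forward:

```java
// Hypothetical enum mirroring the "uploading" -> "uploaded" -> "ready" flow.
enum UploadStatus {
    UPLOADING, UPLOADED, READY;

    // Only the immediate next state is a valid transition.
    boolean canMoveTo(UploadStatus next) {
        return next.ordinal() == ordinal() + 1;
    }

    public static void main(String[] args) {
        System.out.println(UPLOADING.canMoveTo(UPLOADED)); // true
        System.out.println(UPLOADING.canMoveTo(READY));    // false
    }
}
```

With such a check, the daemon would pick up only files in the "uploaded" state and mark them "ready" after concatenation, checksum and move have all succeeded.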

Check better when a variant is already in Mongo

When multiple instances of OpenCGA's variants loading tool are processing the same chromosome from different studies at the same time, inconsistencies may arise. This is probably due to checking only once whether a variant has been already saved. This condition should be checked not only at the beginning of the DBObject generation, but again right before the insertion.
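
The underlying problem is a check-then-act race: the existence check and the insertion are separate steps. The in-memory sketch below shows the idea of making both one atomic operation. `VariantStore` is a hypothetical stand-in backed by a `ConcurrentHashMap`; with MongoDB the equivalent guarantee would have to come from the database itself, which this sketch does not cover.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for the variants collection.
final class VariantStore {
    private final ConcurrentMap<String, String> variants = new ConcurrentHashMap<>();

    // Checks and inserts in one atomic step.
    // Returns true only for the writer that actually created the entry.
    boolean insertIfAbsent(String variantId, String document) {
        return variants.putIfAbsent(variantId, document) == null;
    }

    public static void main(String[] args) {
        VariantStore store = new VariantStore();
        System.out.println(store.insertIfAbsent("22_1234_A_T", "{}")); // true
        System.out.println(store.insertIfAbsent("22_1234_A_T", "{}")); // false
    }
}
```

A writer that gets `false` back knows the variant already exists and can switch to updating it instead of inserting a duplicate.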

getAllVariantsByRegionAndStudies returns incorrect value

getAllVariantsByRegionAndStudies should return QueryResult<Variant>. Now the function returns QueryResult<DBObject>

https://github.com/opencb/opencga/blob/develop/opencga-storage/src/main/java/org/opencb/opencga/storage/variant/mongodb/VariantMongoDBAdaptor.java#L74-L89

Refactoring in VariantVcfSqliteWriter

Use SqliteCredentials instead of dbName.
Use SqliteManager for connections. Remove the current connection code.

getAllVariantByRegion in VariantSqliteQueryBuilder

Implement a method that retrieves a Variant object, with optional VariantStats and VariantEffect.

Upgrade Java version from 1.7 to 1.8

It's time to migrate Java.

After April 2015, Oracle will no longer post updates of Java SE 7 to its public download sites.

Fix tests in org.opencb.opencga.lib.common.IOUtilsTest

The referenced files are in a user home directory and can't be found on any other machine. They must be loaded using the getResource() method. An example can be found in opencga-storage, in the class VariantMongoDBAdaptorTest.

Also, asserts must be used instead of System.out.println.

Dots not allowed in VCF sample names

Some studies can't be loaded into Mongo because their sample names contain dots, which Mongo won't allow in keys. Dots should be replaced with a rarely used character ($, £, whatever we decide) and restored when the names are returned as data model objects by the DB Converter.
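
A reversible mapping of this kind could be sketched as follows. `MongoKeyCodec` is a hypothetical name and "£" is just a placeholder choice, since the issue leaves the character open. The sketch rejects names that already contain the substitute, because otherwise decoding would not restore the original name.

```java
// Hypothetical helper: swaps dots for a substitute character before storing a key,
// and swaps them back when the key is read.
final class MongoKeyCodec {
    private static final char DOT = '.';
    private static final char SUBSTITUTE = '\u00A3'; // the pound sign, an assumed choice

    static String encode(String name) {
        if (name.indexOf(SUBSTITUTE) >= 0) {
            throw new IllegalArgumentException("Name already contains the substitute character: " + name);
        }
        return name.replace(DOT, SUBSTITUTE);
    }

    static String decode(String key) {
        return key.replace(SUBSTITUTE, DOT);
    }

    public static void main(String[] args) {
        String stored = encode("NA12347.1");
        System.out.println(stored.indexOf('.') < 0);            // true
        System.out.println(decode(stored).equals("NA12347.1")); // true
    }
}
```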

A variation annotation pipeline to annotate loaded variants

A variant annotation pipeline and CLI need to be implemented to allow users to annotate variants using CellBase Variant Annotation.

Remove _serverold_ package from server

Old web services implemented in the serverold package in opencga-server are no longer used and should be removed.

CatalogDBAdaptor and CatalogManager with QueryOptions

These classes should accept a QueryOptions param with "exclude" or "include" holding a List of field paths.

These fields can be relative to the returned element or absolute to the user, e.g.:

getStudy( <ID>, QueryOptions{ include: ["alias"]} )

getStudy( <ID>, QueryOptions{ include: ["projects.studies.alias"]} )

To unify the field names across all the methods, using an absolute field path is recommended.
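
If both forms are accepted, relative paths could be normalized to absolute ones before reaching the database layer. A sketch, assuming a hypothetical `FieldPaths` helper and that each method knows the prefix of the element it returns ("projects.studies" for getStudy in the example above):

```java
// Hypothetical helper: turns a relative field path into an absolute one.
final class FieldPaths {

    // prefix is the location of the returned element inside the user document.
    // Note: a relative field whose own name starts with the prefix would be
    // mistaken for an absolute path; this sketch ignores that corner case.
    static String toAbsolute(String prefix, String field) {
        return field.startsWith(prefix + ".") ? field : prefix + "." + field;
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("projects.studies", "alias"));                  // projects.studies.alias
        System.out.println(toAbsolute("projects.studies", "projects.studies.alias")); // projects.studies.alias
    }
}
```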

Remove old OpenCGA Account module

The old OpenCGA Account module is no longer used by any live project. It has been superseded by OpenCGA Catalog.

Fix tests in org.opencb.opencga.lib.common.IOUtilsTest

The referenced file is in a user home directory and can't be found on other computers. It should be loaded using the getResource() method. An example can be found in opencga-storage, in the VariantMongoWriterTest class.

Also, asserts should be used instead of System.out.println.

The output of samples/search is not a valid JSON file

The output from the samples/search tool is something like the following: http://pastebin.com/3gYL2X6J

JSON parsers such as Python's fail when trying to parse this file. A valid JSON file with multiple documents must represent them as a comma-separated list surrounded by square brackets.

(There is a workaround, which is to use PLAIN_JSON instead of PRETTY_JSON as output and then parse each line separately, but it is still not very convenient.)
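
Producing a valid array only requires joining the already-serialized documents with commas and wrapping them in square brackets. A minimal sketch, with `JsonArrayWriter` as a hypothetical helper:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: wraps serialized JSON documents into one JSON array.
final class JsonArrayWriter {

    // Each element must already be a valid serialized JSON document.
    static String toJsonArray(List<String> documents) {
        return "[" + String.join(",", documents) + "]";
    }

    public static void main(String[] args) {
        System.out.println(toJsonArray(Arrays.asList("{\"id\":1}", "{\"id\":2}"))); // [{"id":1},{"id":2}]
    }
}
```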