oicr-gsi / cbioportal_tools
Tools for import of data and administration of the GSI cBioPortal instance.
License: GNU General Public License v3.0
Currently the pipeline scripts, under lib/analysis_pipelines, each handle the generation of the data files.
Each data file requires a corresponding meta file. Generation of the meta file happens outside of the pipeline script, separately from the data generation.
One exception is the mRNA Z-scores, which are part of the mRNA pipelines; this is because they are a separate data file generated by the pipeline apart from the main data.
For clarity, the metadata generation calls should be indicated in the pipeline script. This will also make it easier to set up other pipelines that generate multiple meta/data file pairs.
Improve logging setup.
Turning off the "verbose" option still gives rather verbose output, eg.
[{'profile_name': 'All samples', 'profile_description': 'All GECCO Samples (12 Samples)'}, [['Patient Identifier', 'Sample Identifier'], ['Patient Identifier', 'Sample Identifier'], ['STRING', 'STRING'], ['1', '1'], ['PATIENT_ID', 'SAMPLE_ID'], ['GECCO_0001', 'GECCO_0001_Ly_R_TS'], ['GECCO_0002', 'GECCO_0002_Ly_R_TS'], ['GECCO_0003', 'GECCO_0003_Ly_R_TS'], ['GECCO_0004', 'GECCO_0004_Ly_R_TS'], ['GECCO_0005', 'GECCO_0005_Ly_R_TS'], ['GECCO_0006', 'GECCO_0006_Ly_R_TS'], ['GECCO_0001', 'GECCO_0001_Li_P_TS'], ['GECCO_0002', 'GECCO_0002_Li_P_TS'], ['GECCO_0003', 'GECCO_0003_Li_P_TS'], ['GECCO_0004', 'GECCO_0004_Li_P_TS'], ['GECCO_0005', 'GECCO_0005_Li_P_TS'], ['GECCO_0006', 'GECCO_0006_Li_P_TS']], 'SAMPLE_ATTRIBUTES', 'CLINICAL']
[{'profile_name': 'All samples', 'profile_description': 'All GECCO Samples (12 Samples)'}, [['Patient Identifier', 'Patient Name'], ['Patient Identifier', 'Patient Name'], ['STRING', 'STRING'], ['1', '1'], ['PATIENT_ID', 'PATIENT_DISPLAY_NAME'], ['GECCO_0001', 'GECCO_0001'], ['GECCO_0002', 'GECCO_0002'], ['GECCO_0003', 'GECCO_0003'], ['GECCO_0004', 'GECCO_0004'], ['GECCO_0005', 'GECCO_0005'], ['GECCO_0006', 'GECCO_0006']], 'PATIENT_ATTRIBUTES', 'CLINICAL']
****************************************************************************************************
****************************************************************************************************
CONGRATULATIONS! Your study should now be imported!
****************************************************************************************************
Output folder: /tmp/13549303.1.all.q/janus_generator_test_b937g87_/GECCO_test
Study Name: Genetics and Epidemiology of Colorectal Cancer Consortium
Study ID: gecco_gsi_mutect_2019
****************************************************************************************************
The two example project configurations:
https://github.com/oicr-gsi/cbioportal_tools/tree/master/study_input/examples/DCIS
https://github.com/oicr-gsi/cbioportal_tools/tree/master/study_input/examples/GECCO
fail on the current version of the janus.py generator.
Designations such as expression-data and mutation-data (eg. on the generator command line) are vague. Make clear which tool(s) correspond to each data type, eg. Mutect, Cufflinks.
The janus.py script should be executable and have a shebang line, eg. #! /usr/bin/env python3
Remove attributes such as __author__ and __version__ from .py files. This information can be tracked better and more reliably using git.
Some config headers have a ref_fasta entry for the reference path. It would be better to get this from modulator, eg. by reading environment variables.
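As a sketch of the environment-variable approach (the variable name HG38_ROOT, the FASTA file name, and the helper function below are hypothetical, not from the codebase or modulator):

```python
import os

def get_ref_fasta(env_var="HG38_ROOT", fallback=None):
    # Prefer the path exported by the loaded module; fall back to the
    # config entry only if the environment variable is unset.
    root = os.environ.get(env_var)
    if root is not None:
        return os.path.join(root, "hg38_random.fa")
    if fallback is not None:
        return fallback
    raise RuntimeError("Reference FASTA not found in environment or config")
```

The config-header ref_fasta entry could then become an optional override rather than a required, site-specific path.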
Eg. a file is opened in mode r+ (read/write) instead of r (read only). This is unnecessary and causes an error if the user does not have write permission to the file.
The above is one example; there may be others.
The program crashes and reports that the file was not found.
This should be checked and handled gracefully.
There are pipeline scripts for these tools, but when I run them I get a message saying that they are not available:
ERROR:: The pipeline (CNV-freec) you have placed in the SEG file is not currently supported. Please use one of these ['CNVkit', 'Sequenza', 'HMMCopy']
This is due to a check on the pipeline, with the acceptable values currently hardcoded. I have a separate ticket open to address this hardcoding.
This module contains some very hacky utility functions. Replace with cleaner and more reliable versions, using standard libraries and avoiding shell calls where possible.
Eg. black and flake8 to check via commit hooks: https://ljvmiranda921.github.io/notebook/2018/06/21/precommits-using-black-and-flake8/
The multi-line header format is somewhat fragile and opaque. It may work better to move the column specification into a YAML file header.
See also: #43
If -k/--key is not specified, it takes on a default value of the current path.
If janus has either a key or is given the --push argument, it will run the import.
If these are missing, it should create the import file but not run the import.
I think it gets the current path because of this code, despite the fact that there is no explicit path default:
generator.py
options.add_argument("-k", "--key",
type=lambda key: os.path.abspath(key),
help="The RSA key to cBioPortal. Should have appropriate read write restrictions",
metavar='FILE',
default='')
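This matches documented argparse behaviour: when the default is a string, it is passed through the type callable, so os.path.abspath('') resolves to the current working directory. A small demonstration, with a sketched fix using a non-string default:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# argparse applies the `type` callable to a *string* default, so
# default='' is converted by os.path.abspath('') into the current
# working directory.
parser.add_argument("-k", "--key", type=os.path.abspath,
                    metavar="FILE", default="")
assert parser.parse_args([]).key == os.getcwd()

# With default=None (not a string), no conversion happens and the
# "no key supplied" case stays detectable.
fixed = argparse.ArgumentParser()
fixed.add_argument("-k", "--key", type=os.path.abspath,
                   metavar="FILE", default=None)
assert fixed.parse_args([]).key is None
```

With default=None, the "create the import folder but do not push" behaviour can be selected whenever args.key is None and --push is absent.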
src/lib/tools/importer.py is in a rough testing state, with many functions commented out and hard-coded values substituted for inputs. Develop it into a functional state, or else remove it from the codebase.
Note that other scripts can be used for the import function, so this mode is not high-priority.
In MUTATION_EXTENDED/support_functions.py, the VEP_PATH and VEP_DATA variables need to be set automatically.
The current organization of pipelines in the dev branch is by data type.
This should instead be organized by genetic_alteration_type, under which will be a set of scripts for each pipeline. The pipeline script will then define which data_type to specify in the meta files.
The naming should reflect how cBioPortal importer looks at it.
I see the following:
genetic_alteration_type: CANCER_TYPE
datatype: CANCER_TYPE
genetic_alteration_type: CLINICAL
datatype: PATIENT_ATTRIBUTES, SAMPLE_ATTRIBUTES, TIMELINE
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: DISCRETE, CONTINUOUS, LOG2-VALUE, SEG
genetic_alteration_type: MRNA_EXPRESSION
datatype: CONTINUOUS, DISCRETE, Z-SCORE
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
genetic_alteration_type: METHYLATION
datatype: CONTINUOUS
genetic_alteration_type: PROTEIN_LEVEL
datatype: LOG2-VALUE, Z-SCORE
genetic_alteration_type: FUSION
datatype: FUSION
genetic_alteration_type: GISTIC_GENES_AMP
datatype: Q-VALUE
genetic_alteration_type: GISTIC_GENES_DEL
datatype: Q-VALUE
genetic_alteration_type: MUTSIG
datatype: Q-VALUE
genetic_alteration_type: GENE_PANEL_MATRIX
datatype: GENE_PANEL_MATRIX
genetic_alteration_type: GENESET_SCORE
datatype: GSVA-SCORE, P-VALUE
Project should have a setup.py script, to enable installation using pip or similar: https://packaging.python.org/tutorials/packaging-projects/
Currently, the pipelines create a temp folder that sits in the location from which the tool is launched. Launching the tool multiple times results in each run competing for the same folder.
This temp folder should either be given a unique name per run or be created under the run's own output folder.
Add a test suite, with mock/stub versions of servers/databases if needed.
A number of try/except clauses print a message, and then omit to re-raise the exception. This causes the program to ignore the exception and continue, which may cause further errors that are very hard to debug. Add a raise statement to re-raise such exceptions.
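For example (the function and logger names here are illustrative, not from the codebase):

```python
import logging

logger = logging.getLogger("janus")

def load_samples(path):
    try:
        with open(path) as handle:
            return handle.readlines()
    except OSError:
        logger.error("Could not read sample file: %s", path)
        raise  # bare raise preserves the original exception and traceback
```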
Examples:
./lib/generate/analysis_pipelines/MRNA_EXPRESSION/get_tcga.r appears to be doing text manipulation which can (and should) be done in Python instead.
Analysis (and possibly clinical) data is attached to objects using keys, and these keys are used for validation and for downstream decision making.
The way these keys are set up is inconsistent between data types:
for Mutation data, the key is simply MAF, which is a data type
for Expression data the key is MRNA_EXPRESSION, which is an alteration type (can have three data types, Continuous, discrete, z-score)
for Copy number data the key is a data type (SEG or continuous).
Data types are not unique across alteration types.
The key would be better set as Alteration_type:data_type.
This will require modifications to the organization of the data, and downstream decision making.
Eg. the string DATATYPE appears 8 times in src/lib/support/Config.py. Identifiers which occur multiple times should be stored as a variable.
src/lib/analysis_pipelines/cancer_type.py and src/lib/remove_data_type/cancer_type.py are extremely similar; there is some scope for consolidation.
The resolve_priority_queue function is being removed from generator.py. See:
This function is a (very opaque) method of choosing an order in which to process pipeline config files. It is unclear when/if it is needed.
If there are dependencies between pipeline config files, represent them explicitly, eg. in the study config header.
The import command uploads an import folder to a cBioPortal instance and loads it into the database. It should check that the folder being imported has the necessary basic file: meta_study.txt
Requiring the parent directory of the janus.py script as a command line argument is confusing and error-prone. If this directory is needed, find it using pathlib or similar.
See:
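A sketch of resolving the directory at runtime instead of taking it as an argument (the helper name and the assumption that the script sits one level below the repo root are illustrative):

```python
from pathlib import Path

def repo_root(script_path):
    # Resolve the directory of the given script (eg. __file__ from
    # janus.py) and take its parent as the repository root; no
    # command-line argument is needed.
    return Path(script_path).resolve().parent.parent
```

In janus.py itself this would be called as repo_root(__file__).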
Obfuscated and fragile endswith condition for manipulating generator config
See:
cbioportal_tools/src/lib/tools/generator.py
Line 239 in c4bbf12
./lib/generate/analysis_pipelines/COPY_NUMBER_ALTERATION/preProcCNA.r
uses the CNTools package from Bioconductor: http://bioconductor.org/packages/release/bioc/html/CNTools.html
The R script is very untidy with much commented-out code. Fix this.
Related: Execution of the script is currently a (wrong) hard-coded path, see: #48
In the current dev branch, pipeline actions are each defined in a separate Python script under lib/data_type/{TYPE}/{PIPELINE}.py
the list of available pipelines is indicated in
lib/constants/constants.py
It would be better if the list could be loaded by getting a listing of PIPELINE.py files, so that when adding a new pipeline this constants.py file does not need to be modified
Takes ~5 minutes to run, could speed up by using a smaller dataset
Alternatively, could split into "fast" and "slow" test suites
Eg. study_input/examples/CAP_expression/expression.txt has hard-coded paths to the cbioportal_tools repository in the header. Fix this to make it more portable.
Document the configuration files for study and analysis pipelines:
Validate the files automatically, eg. with a --validate argument to the main script. Could add a validate method to each config class.
Example:
/.mounts/labs/gsi/modulator/sw/Ubuntu18.04/python-3.7/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject return f(*args, **kwds)
Warning is benign and can be suppressed, see:
Make automated tests on Jenkins (or a similar service).
Nice-to-have for now. May become more important when code is in production and tests are more comprehensive.
The current config file format is defined in an ad hoc way, and parsed using regular expressions. This is fragile and difficult to extend.
A more robust solution would be YAML for the header, and keeping the existing TSV format for the body. See: http://csvy.org/
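A sketch of such a reader, with front matter between '---' lines followed by the TSV body (for illustration the header is parsed as flat key: value pairs; a real implementation would hand the header text to yaml.safe_load):

```python
import csv
import io

def read_csvy(text):
    lines = text.splitlines()
    end = lines.index("---", 1)  # closing delimiter of the front matter
    header = dict(line.split(": ", 1) for line in lines[1:end])
    body = list(csv.DictReader(io.StringIO("\n".join(lines[end + 1:])),
                               delimiter="\t"))
    return header, body
```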
Eg. Mutect data requires vcf2maf. Ensure all relevant modules are loaded as part of the janus module.
Documentation is scattered among 11 different README.md files (plus the Specification folder). Some of the README.md material is out of date, unnecessary, or could be better recorded as docstrings in the source files.
Documentation and logging loosely refers to "generating" a study as "importing" it, eg:
cbioportal_tools/src/lib/tools/generator.py
Line 323 in c4bbf12
But the "generate" mode only creates a directory of files appropriately formatted for addition to a cBioPortal instance; it does not actually add them.
In addition, "import" to cBioPortal is not really what Janus is doing. It is pushing data from a local directory to a cBioPortal instance, and as such the operation should be called "export" or "upload".
The config file specification documents are somewhat vague. See: https://github.com/oicr-gsi/cbioportal_tools/tree/master/study_input/Specification
Replace with a well-defined schema for each config type, including automated validation.
When creating an import folder, the current implementation gives a timed warning (you have 5 seconds, you have 4 seconds, etc.) telling you that it will overwrite an existing folder unless you cancel.
It also does this for each of the temp folders under the import folder.
This should be modified to ask whether the import folder should be overwritten.
Could also include a --force flag (which I think Kunal had, but removed).
Examples:
iain@bastet:~/oicr/git/cbioportal_tools$ grep -rn '/.mounts/' . | grep -v '#'
./src/runner.sh:4:module use /.mounts/labs/gsi/modulator/modulefiles/Ubuntu18.04
./src/runner.sh:6:module use /.mounts/labs/PDE/Modules/modulefiles
./src/runner.sh:16: --output-folder /.mounts/labs/gsiprojects/gsi/cBioGSI/data/project_TEST/cbio_DCIS/ \
./src/runner.sh:19: --path /.mounts/labs/gsiprojects/gsi/cBioGSI/Janus/cbioportal_tools/ \
./src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/support_functions.py:30: command = '/.mounts/labs/gsi/modulator/sw/Ubuntu18.04/rstats-3.6/bin/Rscript'
./src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/support_functions.py:31: path2script = '/.mounts/labs/gsiprojects/gsi/cBioGSI/aliang/cbioportal_tools/src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/preProcCNA.r'
iain@bastet:~/oicr/git/cbioportal_tools$ grep -rn '/u/' . | grep -v '#'
./src/runner.sh:17: --key /u/kchandan/cbioportal.pem \
The above paths cause failures if the current user lacks permissions and/or the path has been removed.
See:
This behaviour was removed in v0.0.2, may wish to restore
ibancarz@ucn207-12:~/git/cbioportal_tools$ ./src/test.py
File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/study.txt
Configuration File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/sample.txt
/u/ibancarz/git/cbioportal_tools/src/lib/support/Config.py:133: ResourceWarning: unclosed file <_io.TextIOWrapper name='/u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/sample.txt' mode='r' encoding='ISO-8859-1'>
verb))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Configuration File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/patient.txt
/u/ibancarz/git/cbioportal_tools/src/lib/support/Config.py:133: ResourceWarning: unclosed file <_io.TextIOWrapper name='/u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/patient.txt' mode='r' encoding='ISO-8859-1'>
verb))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/cancer_type.txt
bin directory for command-line scripts: janus.py (and maybe also generator.py, importer.py, etc.)
The hmmcopy handling code needs to be reworked.
The current implementation runs a set of complex bash commands on the data to transform the file so that it is compliant with cBioPortal import. In particular, the num_mark column has to be calculated. This sits under a function called fix_hmmcopy_max_chrom, which is designed to ensure that the stated chromosome length doesn't exceed the maximum known length. Adding the num_mark column to this function seems an afterthought, and it should be treated properly. It currently does not seem to be working: the generated data loses the seg mean values and replaces them with 1, which represents the num_mark value for a 1000-base region.
The major fix is to rewrite this and NOT use bash commands. While these are fast, they are complex to debug, and it would not run appreciably slower with Python code that addressed the same issue.
The more immediate fix is to correct the output.