oicr-gsi / cbioportal_tools
Tools for import of data and administration of the GSI cBioPortal instance.
License: GNU General Public License v3.0
Currently the pipeline scripts, under lib/analysis_pipelines, each handle the generation of the data files.
Each data file requires a corresponding meta file. Generation of the meta file happens outside of the pipeline script, separately from the data generation.
One exception is the mRNA Z-scores, which are part of the mRNA pipelines; this is because they are a separate data file generated by the pipeline apart from the main data.
For clarity, the metadata generation calls should be indicated in the pipeline script. This will also make it easier to set up other pipelines that generate multiple meta/data file pairs.
Improve logging setup.
Turning off the "verbose" option still gives rather verbose output, eg.
[{'profile_name': 'All samples', 'profile_description': 'All GECCO Samples (12 Samples)'}, [['Patient Identifier', 'Sample Identifier'], ['Patient Identifier', 'Sample Identifier'], ['STRING', 'STRING'], ['1', '1'], ['PATIENT_ID', 'SAMPLE_ID'], ['GECCO_0001', 'GECCO_0001_Ly_R_TS'], ['GECCO_0002', 'GECCO_0002_Ly_R_TS'], ['GECCO_0003', 'GECCO_0003_Ly_R_TS'], ['GECCO_0004', 'GECCO_0004_Ly_R_TS'], ['GECCO_0005', 'GECCO_0005_Ly_R_TS'], ['GECCO_0006', 'GECCO_0006_Ly_R_TS'], ['GECCO_0001', 'GECCO_0001_Li_P_TS'], ['GECCO_0002', 'GECCO_0002_Li_P_TS'], ['GECCO_0003', 'GECCO_0003_Li_P_TS'], ['GECCO_0004', 'GECCO_0004_Li_P_TS'], ['GECCO_0005', 'GECCO_0005_Li_P_TS'], ['GECCO_0006', 'GECCO_0006_Li_P_TS']], 'SAMPLE_ATTRIBUTES', 'CLINICAL']
[{'profile_name': 'All samples', 'profile_description': 'All GECCO Samples (12 Samples)'}, [['Patient Identifier', 'Patient Name'], ['Patient Identifier', 'Patient Name'], ['STRING', 'STRING'], ['1', '1'], ['PATIENT_ID', 'PATIENT_DISPLAY_NAME'], ['GECCO_0001', 'GECCO_0001'], ['GECCO_0002', 'GECCO_0002'], ['GECCO_0003', 'GECCO_0003'], ['GECCO_0004', 'GECCO_0004'], ['GECCO_0005', 'GECCO_0005'], ['GECCO_0006', 'GECCO_0006']], 'PATIENT_ATTRIBUTES', 'CLINICAL']
****************************************************************************************************
****************************************************************************************************
CONGRATULATIONS! Your study should now be imported!
****************************************************************************************************
Output folder: /tmp/13549303.1.all.q/janus_generator_test_b937g87_/GECCO_test
Study Name: Genetics and Epidemiology of Colorectal Cancer Consortium
Study ID: gecco_gsi_mutect_2019
****************************************************************************************************
The two example project configurations:
https://github.com/oicr-gsi/cbioportal_tools/tree/master/study_input/examples/DCIS
https://github.com/oicr-gsi/cbioportal_tools/tree/master/study_input/examples/GECCO
fail on the current version of the janus.py generator.
Designations such as expression-data and mutation-data (eg. on the generator command line) are vague. Make clear which tool(s) correspond to each data type, eg. Mutect, Cufflinks.
The janus.py script should be executable and have a shebang line, eg. #! /usr/bin/env python3
Remove attributes such as __author__ and __version__ from .py files. This information can be tracked better and more reliably using git.
Some config headers have a ref_fasta entry for the reference path. It would be better to get this from modulator, eg. by reading environment variables.
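As a sketch of the environment-variable approach (the variable name HG38_ROOT, the FASTA file name, and the helper function below are hypothetical, not from the codebase or modulator):

```python
import os

def get_ref_fasta(env_var="HG38_ROOT", fallback=None):
    # Prefer the path exported by the loaded module; fall back to the
    # config entry only if the environment variable is unset.
    root = os.environ.get(env_var)
    if root is not None:
        return os.path.join(root, "hg38_random.fa")
    if fallback is not None:
        return fallback
    raise RuntimeError("Reference FASTA not found in environment or config")
```

The config-header ref_fasta entry could then become an optional override rather than a required, site-specific path.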
Eg. a file is opened in mode r+ (read/write) instead of r (read only). This is unnecessary and causes an error if the user does not have write permission to the file.
The above is one example; there may be others.
The program crashes and reports that the file was not found.
This should be checked and handled gracefully.
There are pipeline scripts for these tools, but when I run them I get a message saying that they are not available:
ERROR:: The pipeline (CNV-freec) you have placed in the SEG file is not currently supported. Please use one of these ['CNVkit', 'Sequenza', 'HMMCopy']
This is due to a check on the pipeline, with the acceptable values currently hardcoded. I have a separate ticket open to address this hardcoding.
This module contains some very hacky utility functions. Replace with cleaner and more reliable versions, using standard libraries and avoiding shell calls where possible.
Eg. black and flake8 to check via commit hooks: https://ljvmiranda921.github.io/notebook/2018/06/21/precommits-using-black-and-flake8/
The multi-line header format is somewhat fragile and opaque. It may work better to move the column specification into a YAML file header.
See also: #43
If -k/--key is not specified, it takes on a default value of the current path.
If janus has either a key or is given the --push argument, it will run the import.
If these are missing, it should create the import file but not run the import.
I think it gets the current path because of this code, despite the fact that there is no explicit path default:
generator.py
options.add_argument("-k", "--key",
type=lambda key: os.path.abspath(key),
help="The RSA key to cBioPortal. Should have appropriate read write restrictions",
metavar='FILE',
default='')
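This matches documented argparse behaviour: when the default is a string, it is passed through the type callable, so os.path.abspath('') resolves to the current working directory. A small demonstration, with a sketched fix using a non-string default:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# argparse applies the `type` callable to a *string* default, so
# default='' is converted by os.path.abspath('') into the current
# working directory.
parser.add_argument("-k", "--key", type=os.path.abspath,
                    metavar="FILE", default="")
assert parser.parse_args([]).key == os.getcwd()

# With default=None (not a string), no conversion happens and the
# "no key supplied" case stays detectable.
fixed = argparse.ArgumentParser()
fixed.add_argument("-k", "--key", type=os.path.abspath,
                   metavar="FILE", default=None)
assert fixed.parse_args([]).key is None
```

With default=None, the "create the import folder but do not push" behaviour can be selected whenever args.key is None and --push is absent.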
src/lib/tools/importer.py is in a rough testing state, with many functions commented out and hard-coded values substituted for inputs. Develop it into a functional state, or else remove it from the codebase.
Note that other scripts can be used for the import function, so this mode is not high-priority.
In MUTATION_EXTENDED/support_functions.py, the VEP_PATH and VEP_DATA variables need to be set automatically.
The current organization of pipelines in the dev branch is by data type.
This should instead be organized by genetic_alteration_type, under which will be a set of scripts for each pipeline. The pipeline script will then define which data_type to specify in the meta files.
The naming should reflect how cBioPortal importer looks at it.
I see the following:
genetic_alteration_type: CANCER_TYPE
datatype: CANCER_TYPE
genetic_alteration_type: CLINICAL
datatype: PATIENT_ATTRIBUTES, SAMPLE_ATTRIBUTES, TIMELINE
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: DISCRETE, CONTINUOUS, LOG2-VALUE, SEG
genetic_alteration_type: MRNA_EXPRESSION
datatype: CONTINUOUS, DISCRETE, Z-SCORE
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
genetic_alteration_type: METHYLATION
datatype: CONTINUOUS
genetic_alteration_type: PROTEIN_LEVEL
datatype: LOG2-VALUE, Z-SCORE
genetic_alteration_type: FUSION
datatype: FUSION
genetic_alteration_type: GISTIC_GENES_AMP
datatype: Q-VALUE
genetic_alteration_type: GISTIC_GENES_DEL
datatype: Q-VALUE
genetic_alteration_type: MUTSIG
datatype: Q-VALUE
genetic_alteration_type: GENE_PANEL_MATRIX
datatype: GENE_PANEL_MATRIX
genetic_alteration_type: GENESET_SCORE
datatype: GSVA-SCORE, P-VALUE
Project should have a setup.py script, to enable installation using pip or similar: https://packaging.python.org/tutorials/packaging-projects/
Currently, the pipelines create a temp folder that sits in the location from which the tool is launched. Launching the tool multiple times results in each run competing for the same folder.
This temp folder should either be given a unique name per run or be created under the run's own output folder.
Add a test suite, with mock/stub versions of servers/databases if needed.
A number of try/except clauses print a message, and then omit to re-raise the exception. This causes the program to ignore the exception and continue, which may cause further errors that are very hard to debug. Add a raise statement to re-raise such exceptions.
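For example (the function and logger names here are illustrative, not from the codebase):

```python
import logging

logger = logging.getLogger("janus")

def load_samples(path):
    try:
        with open(path) as handle:
            return handle.readlines()
    except OSError:
        logger.error("Could not read sample file: %s", path)
        raise  # bare raise preserves the original exception and traceback
```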
Examples:
./lib/generate/analysis_pipelines/MRNA_EXPRESSION/get_tcga.r appears to be doing text manipulation which can (and should) be done in Python instead.
Analysis (and possibly clinical) data is attached to objects using keys, and these keys are used for validation and for downstream decision making.
The way these keys are set up is inconsistent between data types:
for Mutation data, the key is simply MAF, which is a data type
for Expression data the key is MRNA_EXPRESSION, which is an alteration type (can have three data types, Continuous, discrete, z-score)
for Copy number data the key is a data type (SEG or continuous).
Data types are not unique across alteration types.
The key would be better set as Alteration_type:data_type.
This will require modifications to the organization of the data, and downstream decision making.
Eg. the string DATATYPE appears 8 times in src/lib/support/Config.py. Identifiers which occur multiple times should be stored as a variable.
src/lib/analysis_pipelines/cancer_type.py and src/lib/remove_data_type/cancer_type.py are extremely similar; there is some scope for consolidation.
The resolve_priority_queue function is being removed from generator.py. See:
This function is a (very opaque) method of choosing an order in which to process pipeline config files. It is unclear when/if it is needed.
If there are dependencies between pipeline config files, represent them explicitly, eg. in the study config header.
The import command uploads an import folder to a cBioPortal instance and loads it into the database. It should check that the folder being imported has the necessary basic file: meta_study.txt
Requiring the parent directory of the janus.py script as a command line argument is confusing and error-prone. If this directory is needed, find it using pathlib or similar.
See:
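A sketch of resolving the directory at runtime instead of taking it as an argument (the helper name and the assumption that the script sits one level below the repo root are illustrative):

```python
from pathlib import Path

def repo_root(script_path):
    # Resolve the directory of the given script (eg. __file__ from
    # janus.py) and take its parent as the repository root; no
    # command-line argument is needed.
    return Path(script_path).resolve().parent.parent
```

In janus.py itself this would be called as repo_root(__file__).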
Obfuscated and fragile endswith condition for manipulating generator config
See:
cbioportal_tools/src/lib/tools/generator.py
Line 239 in c4bbf12
./lib/generate/analysis_pipelines/COPY_NUMBER_ALTERATION/preProcCNA.r
uses the CNTools package from Bioconductor: http://bioconductor.org/packages/release/bioc/html/CNTools.html
The R script is very untidy with much commented-out code. Fix this.
Related: Execution of the script is currently a (wrong) hard-coded path, see: #48
In the current dev branch, pipeline actions are each defined in a separate Python script under lib/data_type/{TYPE}/{PIPELINE}.py
the list of available pipelines is indicated in
lib/constants/constants.py
It would be better if the list could be loaded by getting a listing of PIPELINE.py files, so that when adding a new pipeline this constants.py file does not need to be modified
Takes ~5 minutes to run, could speed up by using a smaller dataset
Alternatively, could split into "fast" and "slow" test suites
Eg. study_input/examples/CAP_expression/expression.txt has hard-coded paths to the cbioportal_tools repository in the header. Fix this to make it more portable.
Document the configuration files for study and analysis pipelines:
Validate the files automatically, eg. with a --validate argument to the main script. Could add a validate method to each config class.
Example:
/.mounts/labs/gsi/modulator/sw/Ubuntu18.04/python-3.7/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject return f(*args, **kwds)
Warning is benign and can be suppressed, see:
Make automated tests on Jenkins (or a similar service).
Nice-to-have for now. May become more important when code is in production and tests are more comprehensive.
The current config file format is defined in an ad hoc way, and parsed using regular expressions. This is fragile and difficult to extend.
A more robust solution would be YAML for the header, and keeping the existing TSV format for the body. See: http://csvy.org/
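A sketch of such a reader, with front matter between '---' lines followed by the TSV body (for illustration the header is parsed as flat key: value pairs; a real implementation would hand the header text to yaml.safe_load):

```python
import csv
import io

def read_csvy(text):
    lines = text.splitlines()
    end = lines.index("---", 1)  # closing delimiter of the front matter
    header = dict(line.split(": ", 1) for line in lines[1:end])
    body = list(csv.DictReader(io.StringIO("\n".join(lines[end + 1:])),
                               delimiter="\t"))
    return header, body
```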
Eg. Mutect data requires vcf2maf. Ensure all relevant modules are loaded as part of the janus module.
Documentation is scattered among 11 different README.md files (plus the Specification folder). Some of the README.md material is out of date, unnecessary, or could be better recorded as docstrings in the source files.
Documentation and logging loosely refers to "generating" a study as "importing" it, eg:
cbioportal_tools/src/lib/tools/generator.py
Line 323 in c4bbf12
But the "generate" mode only creates a directory of files appropriately formatted for addition to a cBioPortal instance; it does not actually add them.
In addition, "import" to cBioPortal is not really what Janus is doing. It is pushing data from a local directory to a cBioPortal instance, and as such the operation should be called "export" or "upload".
The config file specification documents are somewhat vague. See: https://github.com/oicr-gsi/cbioportal_tools/tree/master/study_input/Specification
Replace with a well-defined schema for each config type, including automated validation.
When creating an import folder, the current implementation gives a timed warning (you have 5 seconds, you have 4 seconds, etc.) telling you that it will overwrite an existing folder unless you cancel.
It also does this for each of the temp folders under the import folder.
This should be modified to ask whether the import folder should be overwritten.
Could also include a --force flag (which I think Kunal had, but removed).
Examples:
iain@bastet:~/oicr/git/cbioportal_tools$ grep -rn '/.mounts/' . | grep -v '#'
./src/runner.sh:4:module use /.mounts/labs/gsi/modulator/modulefiles/Ubuntu18.04
./src/runner.sh:6:module use /.mounts/labs/PDE/Modules/modulefiles
./src/runner.sh:16: --output-folder /.mounts/labs/gsiprojects/gsi/cBioGSI/data/project_TEST/cbio_DCIS/ \
./src/runner.sh:19: --path /.mounts/labs/gsiprojects/gsi/cBioGSI/Janus/cbioportal_tools/ \
./src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/support_functions.py:30: command = '/.mounts/labs/gsi/modulator/sw/Ubuntu18.04/rstats-3.6/bin/Rscript'
./src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/support_functions.py:31: path2script = '/.mounts/labs/gsiprojects/gsi/cBioGSI/aliang/cbioportal_tools/src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/preProcCNA.r'
iain@bastet:~/oicr/git/cbioportal_tools$ grep -rn '/u/' . | grep -v '#'
./src/runner.sh:17: --key /u/kchandan/cbioportal.pem \
The above paths cause failures if the current user lacks permissions and/or the path has been removed.
See:
This behaviour was removed in v0.0.2, may wish to restore
ibancarz@ucn207-12:~/git/cbioportal_tools$ ./src/test.py
File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/study.txt
Configuration File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/sample.txt
/u/ibancarz/git/cbioportal_tools/src/lib/support/Config.py:133: ResourceWarning: unclosed file <_io.TextIOWrapper name='/u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/sample.txt' mode='r' encoding='ISO-8859-1'>
verb))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Configuration File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/patient.txt
/u/ibancarz/git/cbioportal_tools/src/lib/support/Config.py:133: ResourceWarning: unclosed file <_io.TextIOWrapper name='/u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/patient.txt' mode='r' encoding='ISO-8859-1'>
verb))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/cancer_type.txt
bin directory for command-line scripts: janus.py (and maybe also generator.py, importer.py, etc.)
The hmmcopy handling code needs to be reworked.
The current implementation runs a set of complex bash commands on the data to transform the file so that it is compliant with cBioPortal import. In particular, the num_mark column has to be calculated. This sits under a function called fix_hmmcopy_max_chrom, which is designed to ensure that the stated chromosome length doesn't exceed the maximum known length. Adding the num_mark column to this function seems an afterthought, and it should be treated properly. It currently does not seem to be working: the generated data loses the seg mean values and replaces them with 1, which represents the num_mark value for a 1000-base region.
The major fix is to rewrite this and NOT use bash commands. While these are fast, they are complex to debug, and it would not run appreciably slower with Python code that addressed the same issue.
The more immediate fix is to correct the output.