cbioportal_tools's People

Contributors

a33liang, iainrb, kunalchandan, lheisler

cbioportal_tools's Issues

Move metadata creation to pipeline scripts

Currently the pipeline scripts, under lib/analysis_pipelines, each handle the generation of the data files.

Each data file requires a corresponding meta file, but generation of that file happens outside the pipeline script, separate from the data generation.

The one exception is the mRNA z-scores, which are part of the mRNA pipelines; this is because they are a separate data file generated by the pipeline apart from the main data.

For clarity, the metadata generation calls should be indicated in the pipeline. This will also make it easier to set up other pipelines that generate multiple meta/data file pairs.

Improve logging/verbosity

Improve logging setup.

Turning off the "verbose" option still gives rather verbose output, eg.

[{'profile_name': 'All samples', 'profile_description': 'All GECCO Samples (12 Samples)'}, [['Patient Identifier', 'Sample Identifier'], ['Patient Identifier', 'Sample Identifier'], ['STRING', 'STRING'], ['1', '1'], ['PATIENT_ID', 'SAMPLE_ID'], ['GECCO_0001', 'GECCO_0001_Ly_R_TS'], ['GECCO_0002', 'GECCO_0002_Ly_R_TS'], ['GECCO_0003', 'GECCO_0003_Ly_R_TS'], ['GECCO_0004', 'GECCO_0004_Ly_R_TS'], ['GECCO_0005', 'GECCO_0005_Ly_R_TS'], ['GECCO_0006', 'GECCO_0006_Ly_R_TS'], ['GECCO_0001', 'GECCO_0001_Li_P_TS'], ['GECCO_0002', 'GECCO_0002_Li_P_TS'], ['GECCO_0003', 'GECCO_0003_Li_P_TS'], ['GECCO_0004', 'GECCO_0004_Li_P_TS'], ['GECCO_0005', 'GECCO_0005_Li_P_TS'], ['GECCO_0006', 'GECCO_0006_Li_P_TS']], 'SAMPLE_ATTRIBUTES', 'CLINICAL']
[{'profile_name': 'All samples', 'profile_description': 'All GECCO Samples (12 Samples)'}, [['Patient Identifier', 'Patient Name'], ['Patient Identifier', 'Patient Name'], ['STRING', 'STRING'], ['1', '1'], ['PATIENT_ID', 'PATIENT_DISPLAY_NAME'], ['GECCO_0001', 'GECCO_0001'], ['GECCO_0002', 'GECCO_0002'], ['GECCO_0003', 'GECCO_0003'], ['GECCO_0004', 'GECCO_0004'], ['GECCO_0005', 'GECCO_0005'], ['GECCO_0006', 'GECCO_0006']], 'PATIENT_ATTRIBUTES', 'CLINICAL']
****************************************************************************************************
****************************************************************************************************
CONGRATULATIONS! Your study should now be imported!
****************************************************************************************************
Output folder: /tmp/13549303.1.all.q/janus_generator_test_b937g87_/GECCO_test
Study Name: Genetics and Epidemiology of Colorectal Cancer Consortium
Study ID: gecco_gsi_mutect_2019
****************************************************************************************************
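
A conventional way to make the verbose flag meaningful is Python's standard logging module: route the large data dumps to DEBUG and keep the console at WARNING unless verbose is on. This is a minimal sketch; the logger name and format are illustrative, not current Janus code.

```python
import logging

def configure_logging(verbose: bool) -> logging.Logger:
    """Configure a logger whose console output level follows the verbose flag.

    With verbose off, only warnings and errors reach the console; the large
    clinical data dumps shown above would move to DEBUG level.
    """
    logger = logging.getLogger("janus")
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler()
    handler.setLevel(logging.DEBUG if verbose else logging.WARNING)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s: %(message)s"))
    logger.handlers = [handler]
    return logger

log = configure_logging(verbose=False)
log.debug("full clinical data dump ...")  # suppressed when verbose is off
log.warning("this still appears")
```

Calls that currently print raw data structures would become log.debug calls, so they vanish when the verbose option is off.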

Terminology: Data types

Designations such as expression-data and mutation-data (eg. on the generator command line) are vague. Make clear which tool(s) correspond to each data type, eg. Mutect, Cufflinks.

Remove .py header attributes

Remove attributes such as __author__, __version__ from .py files. This information can be tracked better and more reliably using git.

CNV-freec and CNV-varscan pipelines not available

There are pipeline scripts for these tools, but when I run them I get a message saying that they are not available:

ERROR:: The pipeline (CNV-freec) you have placed in the SEG file is not currently supported. Please use one of these ['CNVkit', 'Sequenza', 'HMMCopy']

This has to do with a check on the pipeline, with the acceptable values currently hardcoded. I have a separate ticket open to address this hardcoding.

--key takes on a default value

If -k/--key is not specified, it takes on a default value of the current path.

If Janus has either a key or is given the --push argument, it will run the import.
If these are missing, it should create the import file but not run the import.

It gets the current path because of this code: argparse applies the type callable to string defaults, so default='' is passed through os.path.abspath, and os.path.abspath('') resolves to the current working directory.

generator.py

options.add_argument("-k", "--key",
                     type=lambda key: os.path.abspath(key),
                     help="The RSA key to cBioPortal. Should have appropriate read write restrictions",
                     metavar='FILE',
                     default='')
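
A minimal sketch of the fix: use default=None, which argparse does not pass through the type callable, so an unspecified key stays genuinely unset.

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("-k", "--key",
                    # argparse applies the type callable to string defaults,
                    # so default='' would become os.path.abspath('') == cwd.
                    type=lambda key: os.path.abspath(key),
                    help="The RSA key to cBioPortal. Should have appropriate "
                         "read write restrictions",
                    metavar='FILE',
                    default=None)  # None is not passed through type

args = parser.parse_args([])
assert args.key is None  # no silent fallback to the current path
```

Downstream code can then test `args.key is None` to decide whether to run the import.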

Importer.py is non-functional

src/lib/tools/importer.py is in a rough testing state, with many functions commented out and hard-coded values substituted for inputs. Develop into a functional state, or else remove from the codebase.

Note that other scripts can be used for the import function, so this mode is not high-priority.

Organization of pipeline specific files

The current organization of pipelines in the dev branch is by data type.
This should instead be organized by genetic_alteration_type, under which will be a set of scripts for each pipeline. The pipeline script will then define which data_type to specify in the meta files.

The naming should reflect how the cBioPortal importer looks at it.
I see the following:

genetic_alteration_type: CANCER_TYPE
datatype: CANCER_TYPE

genetic_alteration_type: CLINICAL
datatype: PATIENT_ATTRIBUTES, SAMPLE_ATTRIBUTES, TIMELINE

genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: DISCRETE, CONTINUOUS, LOG2-VALUE, SEG

genetic_alteration_type: MRNA_EXPRESSION
datatype: CONTINUOUS, DISCRETE, Z-SCORE

genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF

genetic_alteration_type: METHYLATION
datatype: CONTINUOUS

genetic_alteration_type: PROTEIN_LEVEL
datatype: LOG2-VALUE, Z-SCORE

genetic_alteration_type: FUSION
datatype: FUSION

genetic_alteration_type: GISTIC_GENES_AMP
datatype: Q-VALUE

genetic_alteration_type: GISTIC_GENES_DEL
datatype: Q-VALUE

genetic_alteration_type: MUTSIG
datatype: Q-VALUE

genetic_alteration_type: GENE_PANEL_MATRIX
datatype: GENE_PANEL_MATRIX

genetic_alteration_type: GENESET_SCORE
datatype: GSVA-SCORE, P-VALUE

temp folder should be in a better location

Currently, the pipelines create a temp folder that sits in the location from which the tool is launched. Launching the tool multiple times results in each run competing for the same folder.

this should either be

  1. in the import folder, then deleted
    or
  2. in a temp space, with a uid string
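
Option 2 can be sketched with the standard tempfile module, which already generates a unique directory name per call (the prefix is illustrative):

```python
import tempfile

def make_run_tempdir(prefix: str = "janus_") -> str:
    """Create a uniquely named temp directory in the system temp space,
    so concurrent launches never compete for the same folder."""
    return tempfile.mkdtemp(prefix=prefix)

# Each call yields a distinct directory, even across simultaneous runs.
first = make_run_tempdir()
second = make_run_tempdir()
assert first != second
```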

Unit tests needed

Add a test suite, with mock/stub versions of servers/databases if needed.

Re-raise issues after printing messages

A number of try/except clauses print a message and then fail to re-raise the exception. This causes the program to swallow the exception and continue, which may lead to further errors that are very hard to debug.

Add a raise statement to re-raise such exceptions.

Examples:

  • ./lib/generate/analysis_pipelines/COPY_NUMBER_ALTERATION/support_functions.py
  • ./lib/generate/analysis_pipelines/MUTATION_EXTENDED/support_functions.py
  • ./lib/generate/analysis_pipelines/MRNA_EXPRESSION/support_functions.py had this issue, now fixed
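
The pattern being asked for is a bare raise after the message, which preserves the original traceback. A generic sketch (the function and message are illustrative, not code from the files above):

```python
def parse_segment(line: str) -> list:
    """Log context on failure, then re-raise so callers see the error."""
    try:
        return line.strip().split("\t")
    except AttributeError:
        print("Malformed input line: {!r}".format(line))
        raise  # bare raise preserves the original traceback

assert parse_segment("a\tb") == ["a", "b"]
try:
    parse_segment(None)
except AttributeError:
    pass  # the exception propagated to the caller, as intended
```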

Replace get_tcga.r

./lib/generate/analysis_pipelines/MRNA_EXPRESSION/get_tcga.r appears to be doing text manipulation which can (and should) be done in Python instead.

Fix inconsistencies in the data organization

Analysis (and possibly clinical) data is attached to objects using keys, and these keys are used for validation and for downstream decision making.

The way these are set up is inconsistent between data types:

For mutation data, the key is simply MAF, which is a data type.
For expression data, the key is MRNA_EXPRESSION, which is an alteration type (it can have three data types: CONTINUOUS, DISCRETE, Z-SCORE).
For copy number data, the key is a data type (SEG or CONTINUOUS).

Data types are not unique across alteration types, so the key is better set as ALTERATION_TYPE:DATA_TYPE.

This will require modifications to the organization of the data, and downstream decision making.
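
The proposed composite key can be sketched as follows; the helper and file names are illustrative, not existing Janus code:

```python
def make_key(alteration_type: str, data_type: str) -> str:
    """Build a key that is unique across alteration types."""
    return "{}:{}".format(alteration_type, data_type)

# Hypothetical mapping of data files by composite key.
data = {
    make_key("MUTATION_EXTENDED", "MAF"): "mutations.maf",
    make_key("MRNA_EXPRESSION", "CONTINUOUS"): "expression.txt",
    make_key("COPY_NUMBER_ALTERATION", "CONTINUOUS"): "cna.txt",
}

# CONTINUOUS alone would be ambiguous; the composite key is not.
assert "MRNA_EXPRESSION:CONTINUOUS" in data
assert "COPY_NUMBER_ALTERATION:CONTINUOUS" in data
```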

Redundant "cancer type" code

src/lib/analysis_pipelines/cancer_type.py and src/lib/remove_data_type/cancer_type.py are extremely similar; there is scope for consolidation.

Dependency (if any) between pipeline config files

The resolve_priority_queue function is being removed from generator.py. See:

This function is a (very opaque) method of choosing an order in which to process pipeline config files. It is unclear when/if it is needed.

If there are dependencies between pipeline config files, represent them explicitly, eg. in the study config header.

Dynamic loading of available pipelines

In the current dev branch, pipeline actions are each defined in a separate Python script under lib/data_type/{TYPE}/{PIPELINE}.py.

The list of available pipelines is indicated in lib/constants/constants.py.

It would be better if the list could be loaded from a directory listing of the PIPELINE.py files, so that constants.py does not need to be modified when a new pipeline is added.
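
One way to do this is to scan the pipeline directory at startup. The sketch below mirrors the lib/data_type/{TYPE}/{PIPELINE}.py layout described above, demonstrated against a throwaway directory tree:

```python
import tempfile
from pathlib import Path

def available_pipelines(base: Path, data_type: str) -> list:
    """Discover pipelines by listing lib/data_type/{TYPE}/*.py files."""
    pipeline_dir = base / "lib" / "data_type" / data_type
    return sorted(p.stem for p in pipeline_dir.glob("*.py")
                  if p.stem != "__init__")

# Demonstrate against a throwaway tree with three pipeline scripts.
base = Path(tempfile.mkdtemp())
seg_dir = base / "lib" / "data_type" / "SEG"
seg_dir.mkdir(parents=True)
for name in ("CNVkit.py", "Sequenza.py", "HMMCopy.py", "__init__.py"):
    (seg_dir / name).touch()

assert available_pipelines(base, "SEG") == ["CNVkit", "HMMCopy", "Sequenza"]
```

Adding a new pipeline then only requires dropping a new script into the directory; no constants file needs editing.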

Make CAP expression test faster

Takes ~5 minutes to run, could speed up by using a smaller dataset

Alternatively, could split into "fast" and "slow" test suites

Install and read accessory files

Eg. study_input/examples/CAP_expression/expression.txt has hard-coded paths to the cbioportal_tools repository in the header. Fix this to make it more portable.

Document & validate study config

Document the configuration files for study and analysis pipelines:

  • Required/allowed metadata fields
  • Required/allowed columns in body

Validate the files automatically, eg. with a --validate argument to the main script.

Could add a validate method to each config class.
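
A sketch of such a validate method; the required-field names follow cBioPortal's meta_study conventions but should be treated as examples until the documentation above is written:

```python
class StudyConfig:
    """Hypothetical config class with self-validation."""

    REQUIRED_FIELDS = ("cancer_study_identifier", "name", "type_of_cancer")

    def __init__(self, fields: dict):
        self.fields = fields

    def validate(self) -> list:
        """Return a list of error messages; an empty list means valid."""
        return ["missing required field: " + f
                for f in self.REQUIRED_FIELDS if f not in self.fields]

good = StudyConfig({"cancer_study_identifier": "gecco_gsi_mutect_2019",
                    "name": "GECCO", "type_of_cancer": "coadread"})
assert good.validate() == []

bad = StudyConfig({"name": "GECCO"})
assert len(bad.validate()) == 2  # two required fields missing
```

A --validate argument on the main script would then just call validate() on each config and report the collected messages.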

Automated tests on Jenkins

Make automated tests on Jenkins (or a similar service).

Nice-to-have for now. May become more important when code is in production and tests are more comprehensive.

Improved header format for config files

The current config file format is defined in an ad hoc way, and parsed using regular expressions. This is fragile and difficult to extend.

A more robust solution would be YAML for the header, and keeping the existing TSV format for the body. See: http://csvy.org/
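
A sketch of reading such a file: a commented YAML-style header between #--- markers, followed by a TSV body. The trivial key: value split stands in for a real YAML parser (e.g. PyYAML) to keep the example dependency-free; the field names are taken from the datatype list above.

```python
import csv
import io

SAMPLE = """\
#---
#datatype: SAMPLE_ATTRIBUTES
#genetic_alteration_type: CLINICAL
#---
PATIENT_ID\tSAMPLE_ID
GECCO_0001\tGECCO_0001_Ly_R_TS
"""

def parse_csvy(text: str):
    """Split a csvy-style file into a header dict and TSV body rows.

    Header parsing here is a trivial key: value split; a real
    implementation would hand the header block to a YAML parser.
    """
    header = {}
    lines = text.splitlines()
    i = 0
    if lines and lines[0] == "#---":
        i = 1
        while lines[i] != "#---":
            key, _, value = lines[i].lstrip("#").partition(":")
            header[key.strip()] = value.strip()
            i += 1
        i += 1
    body = list(csv.reader(io.StringIO("\n".join(lines[i:])), delimiter="\t"))
    return header, body

hdr, rows = parse_csvy(SAMPLE)
assert hdr["datatype"] == "SAMPLE_ATTRIBUTES"
assert rows[0] == ["PATIENT_ID", "SAMPLE_ID"]
```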

vcf2maf usage is broken

Eg. mutect data requires vcf2maf

Ensure all relevant modules are loaded as part of the janus module

Consolidate and improve documentation

Documentation is scattered among 11 different README.md files (plus the Specification folder).

Some of the README.md material is out of date, unnecessary, or could be better recorded as docstrings in the source files.

Terminology: Generate, import, export

Documentation and logging loosely refer to "generating" a study as "importing" it, eg:

helper.working_on(True, message='CONGRATULATIONS! Your study should now be imported!')

But the "generate" mode only creates a directory of files appropriately formatted for addition to a cBioPortal instance; it does not actually add them.

In addition, "import" to cBioPortal is not really what Janus is doing. It is pushing data from a local directory to a cBioPortal instance, and as such the operation should be called "export" or "upload".

Overwrite of Import Folder

When creating an import folder, the current implementation gives a timed warning (you have 5 seconds, you have 4 seconds, etc.) that it will overwrite an existing folder unless you cancel.

It also does this for each of the temp folders under the import folder.

This should be modified to ask whether the import folder should be overwritten.

It could also include a --force flag (which I think Kunal had, but removed).
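
A sketch of the requested behaviour, with the --force flag wired through as a boolean (names are illustrative):

```python
import os

def confirm_overwrite(path: str, force: bool = False) -> bool:
    """Ask once whether an existing import folder may be overwritten.

    A --force flag (sketched here as a boolean) skips the prompt entirely,
    as does a path that does not exist yet.
    """
    if force or not os.path.exists(path):
        return True
    answer = input("Overwrite existing folder {}? [y/N] ".format(path))
    return answer.strip().lower() == "y"
```

Calling this once for the import folder, instead of per temp folder, removes the repeated countdown warnings.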

Remove hard-coded paths from code

Examples:

iain@bastet:~/oicr/git/cbioportal_tools$ grep -rn '/.mounts/' . | grep -v '#'
./src/runner.sh:4:module use /.mounts/labs/gsi/modulator/modulefiles/Ubuntu18.04
./src/runner.sh:6:module use /.mounts/labs/PDE/Modules/modulefiles
./src/runner.sh:16: --output-folder /.mounts/labs/gsiprojects/gsi/cBioGSI/data/project_TEST/cbio_DCIS/ \
./src/runner.sh:19: --path /.mounts/labs/gsiprojects/gsi/cBioGSI/Janus/cbioportal_tools/ \
./src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/support_functions.py:30: command = '/.mounts/labs/gsi/modulator/sw/Ubuntu18.04/rstats-3.6/bin/Rscript'
./src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/support_functions.py:31: path2script = '/.mounts/labs/gsiprojects/gsi/cBioGSI/aliang/cbioportal_tools/src/lib/analysis_pipelines/COPY_NUMBER_ALTERATION/preProcCNA.r'

iain@bastet:~/oicr/git/cbioportal_tools$ grep -rn '/u/' . | grep -v '#'
./src/runner.sh:17: --key /u/kchandan/cbioportal.pem \

The above paths cause failures if the current user lacks permissions or the path has been removed.

Unclosed file warning in tests

ibancarz@ucn207-12:~/git/cbioportal_tools$ ./src/test.py 
File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/study.txt

Configuration File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/sample.txt

/u/ibancarz/git/cbioportal_tools/src/lib/support/Config.py:133: ResourceWarning: unclosed file <_io.TextIOWrapper name='/u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/sample.txt' mode='r' encoding='ISO-8859-1'>
  verb))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Configuration File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/patient.txt

/u/ibancarz/git/cbioportal_tools/src/lib/support/Config.py:133: ResourceWarning: unclosed file <_io.TextIOWrapper name='/u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/patient.txt' mode='r' encoding='ISO-8859-1'>
  verb))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
File Name: /u/ibancarz/git/cbioportal_tools/study_input/examples/GECCO/cancer_type.txt
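
The warning means a file handle in Config.py is opened without being closed. Wrapping the open in a with block guarantees closure and silences the ResourceWarning; a minimal sketch:

```python
import os
import tempfile

def read_config_lines(path: str, encoding: str = "ISO-8859-1") -> list:
    """Read a config file via a context manager, so the handle is always
    closed and no ResourceWarning is emitted."""
    with open(path, mode="r", encoding=encoding) as handle:
        return handle.readlines()

# Demonstrate on a throwaway file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("PATIENT_ID\tSAMPLE_ID\n")
assert read_config_lines(path) == ["PATIENT_ID\tSAMPLE_ID\n"]
os.remove(path)
```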

Code reorganisation

  • Current code organisation is somewhat opaque; unclear what the purpose of each folder is
  • Consider having separate packages for each Janus mode: generate, import, query, remove
  • Have additional packages for shared code and utilities
  • bin directory for command-line scripts: janus.py (and maybe also generator.py, importer.py, etc.)

hmmcopy handler fixes

hmmcopy handling code needs to be reworked.

The current implementation runs a set of complex bash commands to transform the file so that it is compliant with cBioPortal import. In particular, the num_mark column has to be calculated. This sits under a function called fix_hmmcopy_max_chrom, which is designed to ensure that the stated chromosome length does not exceed the maximum known length. Adding the num_mark column to this function seems an afterthought and should be treated properly. It currently does not seem to be working: the generated data loses the seg means and replaces them with 1, which represents the num_mark value for a 1000-base region.

The major fix is to rewrite this and NOT use bash commands. While these are fast, they are complex to debug, and Python code addressing the same issue would not run appreciably slower.

More immediate fix is to correct the output.
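
If num_mark is meant to count the 1000-base bins a segment spans (an assumption to confirm against the actual HMMCopy output), the immediate fix can be sketched in plain Python, keeping seg.mean untouched:

```python
BIN_SIZE = 1000  # assumed HMMCopy bin width; confirm against the real run

def add_num_mark(segment: dict) -> dict:
    """Derive num_mark from segment coordinates, leaving seg_mean intact.

    This is a hypothetical sketch: num_mark is taken as the number of
    BIN_SIZE bins spanned, which is what the per-bin value of 1 in the
    current (buggy) output suggests.
    """
    out = dict(segment)
    out["num_mark"] = max(1, (segment["loc_end"] - segment["loc_start"]) // BIN_SIZE)
    return out

seg = {"chrom": "1", "loc_start": 10000, "loc_end": 25000, "seg_mean": -0.42}
fixed = add_num_mark(seg)
assert fixed["num_mark"] == 15
assert fixed["seg_mean"] == -0.42  # seg mean preserved, not overwritten
```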
