CoMorMent-Containers
Home Page: https://www.comorment.uio.no
License: GNU General Public License v3.0
Just to enforce consistency, incorporate a Dockerfile linter GitHub Action, e.g., Hadolint (https://github.com/hadolint/hadolint-action), in this repo.
A shallow git clone of this repo reports that a few files should've been pointers:
% GIT_LFS_SKIP_SMUDGE=1 /opt/homebrew/bin/git clone --depth 1 [email protected]:comorment/containers.git
Cloning into 'containers'...
remote: Enumerating objects: 1130, done.
remote: Counting objects: 100% (1130/1130), done.
remote: Compressing objects: 100% (1032/1032), done.
remote: Total 1130 (delta 28), reused 1083 (delta 20), pack-reused 0
Receiving objects: 100% (1130/1130), 19.23 MiB | 3.21 MiB/s, done.
Resolving deltas: 100% (28/28), done.
Updating files: 100% (1174/1174), done.
Encountered 5 files that should have been pointers, but weren't:
usecases/bolt_out/example_3chr.frq
usecases/bolt_out/example_3chr.log
usecases/bolt_out/myld.l2.ldscore.gz
usecases/bolt_out/myld.log
usecases/saige_out/out_vcf.log
Not a big issue though, as they're all pretty small files:
24K usecases/bolt_out/example_3chr.frq
4.0K usecases/bolt_out/example_3chr.log
4.0K usecases/bolt_out/myld.l2.ldscore.gz
4.0K usecases/bolt_out/myld.log
4.0K usecases/saige_out/out_vcf.log
Edit: Some more info here: https://stackoverflow.com/questions/46704572/git-error-encountered-7-files-that-should-have-been-pointers-but-werent
Jacob and I were wondering if some tool exists that does the opposite of genetic correlations, namely identify loci that are specific to a trait. Say you want to compare summary statistics for bipolar disorder and depressive disorder, and you want to identify loci that are not shared between the two: is there a tool to identify those? You can do something like that with genomic SEM or by eyeballing circular Manhattan plots, but I could not think of a tool that specifically identifies non-shared variants.
If such a tool exists we would love to have it in the container toolbox.
Please build in some flexibility to deal with variations in reading .pheno / .dict files. Of course it's unfeasible (and unnecessary) to be able to deal with all possible variations; we need to keep a balance. Just be clear on the restrictions in the documentation.
LDSC also has a fairly sizable reference, and it's reasonable to keep it in its own container.
Describe the solution you'd like
Set up a small framework (py.test or similar) calling the different containers locally, checking that software installed in the containers returns its version or similar, and does not result in crashes (from missing libs, etc.)
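Such a framework could be sketched as below. The container names and tool commands here are hypothetical examples, not the actual release layout; a clean exit code is already enough to catch missing shared libraries and broken installs.

```python
import os
import shutil
import subprocess

# (container image, command expected to exit 0) -- hypothetical examples;
# adjust to the tools actually shipped in each .sif file.
CASES = [
    ("gwas.sif", ["plink", "--version"]),
    ("gwas.sif", ["regenie", "--version"]),
    ("python3.sif", ["python", "--version"]),
]

def exec_cmd(sif, tool_cmd):
    """Build the `singularity exec` invocation for a tool in a container."""
    return ["singularity", "exec", sif, *tool_cmd]

def test_tools_report_version():
    if shutil.which("singularity") is None:
        return  # not in the container build/test environment
    for sif, cmd in CASES:
        if not os.path.exists(sif):
            continue
        result = subprocess.run(exec_cmd(sif, cmd),
                                capture_output=True, text=True, timeout=60)
        # a clean exit already proves the binary and its shared libs load
        assert result.returncode == 0, (sif, cmd, result.stderr)
```

Running `py.test` on a file with such functions would then exercise each container build locally.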
For some reason the Manhattan plot utility from python_convert only includes about 45,000 of the ~8 million SNPs in my summary statistics, without outputting any warnings or errors. What reference data is used?
It would be great if more information can be added about the usage of the clump utility (sumstats.py). What are the minimum required flags? What are the defaults? What is the reference dataset? To unify these processes across different sites as much as possible we probably should pre-set as many of the parameters as we can.
Hi @ofrei, the gwas Dockerfile at https://github.com/comorment/gwas incorporates binaries of an older version of this software from https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html.
There's a more recent version with sources and MIT license available from https://github.com/odelaneau/shapeit4.
Should we rather package that?
Could you make sure the following packages are available and fully functional:
The gwas.py merge-regenie command detects missing SNP IDs in the .afreq files as duplicates and throws an error. Would it be possible to specify a command for removing (or replacing with CHR:BP) missing SNP IDs?
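A pre-processing step along these lines could be sketched as follows (a sketch, not the gwas.py implementation; the column names follow PLINK conventions and the set of missing-ID codes is an assumption):

```python
import pandas as pd

def fill_missing_snp_ids(df, missing_codes=('.', 'NA', '')):
    """Replace missing SNP IDs with 'CHR:BP' so downstream merges don't
    treat all missing IDs as one duplicated key."""
    df = df.copy()
    missing = df['SNP'].isnull() | df['SNP'].isin(missing_codes)
    df.loc[missing, 'SNP'] = (df.loc[missing, 'CHR'].astype(str) + ':'
                              + df.loc[missing, 'BP'].astype(str))
    return df
```

Applied to both the sumstats and the .afreq table before merging, this would remove the spurious duplicates.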
Miniforge (https://github.com/conda-forge/miniforge) is the community-driven counterpart of Miniconda. We can replace Miniconda with Miniforge in the Dockerfiles as we're mainly using the conda-forge channel anyway. Also, this means we can ignore the Anaconda terms of service (https://legal.anaconda.com/policies/en/?name=terms-of-service), just in case.
Edit: We should rather use the Mambaforge distribution from Miniforge, as this resolves the conda environment much faster than conda.
For the SAIGE analysis, some chromosomes did not finish, yet job 3 still went on and combined all .saige chromosome files, even the partially finished ones.
The gwas.py script should look for an error at the end of the corresponding .out files, to verify that the previous step in job 2 finished, before running job 3.
I have a suggestion for tweaking the gwas.py script so that it can write jobs using geno and geno-fit files that are split out per chromosome.
Snakemake (https://snakemake.readthedocs.io/en/stable/) is a tool that could potentially be used for updating container builds, e.g., in case a container dependency or install script used by the Dockerfile(s) is updated. Would this be useful?
We need one (or both) of these tools in a container:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-288
https://genome.sph.umich.edu/wiki/METAL_Documentation
Copied from comorment/gwas#29:
In the current Dockerfile recipes and bash installer files, versions of different tools are (usually) not pinned. Thus a (re)built container will likely differ from day to day, in particular, if packages are installed from sources like conda-forge and similar where updates are frequent.
Ideally, versions should explicitly be pinned in the recipes, e.g., like
FROM buildpack-deps:focal
RUN apt-get update && \
apt-get install --no-install-recommends -y \
cmake=3.16.3-1ubuntu1 \
python3-dev=3.8.2-0ubuntu2
....
RUN pip install h5py==2.10.0 && \
pip install git+https://github.com/NeuralEnsemble/parameters@b95bac2bd17f03ce600541e435e270a1e1c5a478#egg=parameters \
...
RUN git clone --depth 1 -b v3.1 https://github.com/nest/nest-simulator /usr/src/nest-simulator && \
# compile
...
The above is just taken from another project of mine (complete example: https://github.com/LFPy/LFPykernels/blob/main/Dockerfile).
Version pinning is also a best practice suggested by Dockerfile linting tools like Hadolint (https://hadolint.github.io/hadolint/).
Line 998 in 6434e86
I switched from copying gwas.py to a personal folder to running gwas.py directly from the repository in the TSD environment, which gave rise to the following problem.
If the YAML configuration file is not located in the directory from which gwas.py is executed, it seems that gwas.py won't find it. Maybe it's a good idea to retrieve the path of gwas.py itself to locate the configuration file:
os.path.dirname(os.path.realpath(__file__))
Replace
Line 68 in 6434e86
configFile = os.path.dirname(os.path.realpath(__file__)) + "/config.yaml"
parent_parser.add_argument('--config', type=str, default=configFile, help='file with misc configuration options')
Since this file seems to be required...
Below line 986:
Lines 985 to 986 in 6434e86
if not os.path.exists(args.config):
raise IOError('configuration file "' + os.path.basename(args.config) + '" not found')
Not sure if IOError is the appropriate error though...
Could you please implement:
High-Definition Likelihood for genetic correlation: https://github.com/zhenin/HDL
Make reporting issues/open PRs more streamlined, using GitHub template files, see https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository
Currently we use https://github.com/comorment/gwas to keep all development-related scripts for comorment containers.
The https://github.com/comorment/containers repo is used to release singularity containers (as .sif files), to keep reference data, and for user documentation. This separation is suboptimal, and it makes more sense to include all development-related scripts (Dockerfiles, bash scripts, some dev instructions, etc.) in https://github.com/comorment/containers. However, we should keep those files somewhat hidden from the end user, for example by moving them to a new source folder in the root of this repo. After that, the github.com/comorment/gwas repo can be archived (i.e. kept in case we need the code history, but locked so no further changes can be submitted).
Also, we should change our development model and start using feature & bug-fix branches, using a pull request and code review to integrate changes into the main branch.
I'm getting the following error when running the hello.sif container on the SURFsara login node. However, the error seems to be harmless:
ERROR: ld.so: object '/sara/tools/xalt/xalt/lib64/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
Ubuntu 18.04 (LTS) reaches the end of standard support in April 2023 (https://wiki.ubuntu.com/Releases). It could make sense to base container builds on the more current 22.04 (LTS) or perhaps 20.04 (LTS).
Something goes wrong on my end with gwas.py merge-regenie. Both run_regenie1 and run_regenie2 run as expected but then I get the following error for merge-regenie. Looks like something goes wrong in the join.
jacber@sens2017599-b10:~/nordic_gwas/basic$ $PYTHON ~/gwas.py merge-regenie --maf 0.1 --sumstats out/run_chr@_MDD_broad.regenie --basename out/run_chr@ --out out/run_MDD_broad --chr2use 1,2
Call:
/home/jacber/gwas.py merge-regenie
--maf 0.1
--sumstats out/run_chr@_MDD_broad.regenie
--basename out/run_chr@
--out out/run_MDD_broad
--chr2use 1,2
Beginning analysis at Mon Aug 23 09:19:46 2021 by jacber, host sens2017599-b10.uppmax.uu.se
Traceback (most recent call last):
File "/home/jacber/gwas.py", line 1908, in <module>
args.func(args, log)
File "/home/jacber/gwas.py", line 838, in merge_regenie
df, info_col = apply_filters(args, df)
File "/home/jacber/gwas.py", line 760, in apply_filters
df = pd.merge(df, maf, how='left', on='SNP')
File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 89, in merge
return op.get_result()
File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 684, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 909, in _get_join_info
(left_indexer, right_indexer) = self._get_join_indexers()
File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 887, in _get_join_indexers
return get_join_indexers(
File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1441, in get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/_libs/join.pyx", line 109, in pandas._libs.join.left_outer_join
MemoryError: Unable to allocate 446. GiB for an array with shape (59901284770,) and data type int64
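The 59,901,284,770-row join index suggests duplicated merge keys on both sides (e.g. many missing SNP IDs all coded as '.'): pandas produces the Cartesian product of matching rows per key. A toy demonstration of the blow-up:

```python
import pandas as pd

# three '.' rows on the left, four on the right -> 3 * 4 = 12 output rows;
# scaled to millions of missing/duplicated IDs this reaches billions of rows
left = pd.DataFrame({'SNP': ['.'] * 3, 'P': [0.1, 0.2, 0.3]})
maf = pd.DataFrame({'SNP': ['.'] * 4, 'FRQ': [0.4, 0.3, 0.2, 0.1]})
merged = pd.merge(left, maf, how='left', on='SNP')
assert len(merged) == 12
```

Deduplicating or re-coding the SNP column before the merge (or passing `validate='many_to_one'` to `pd.merge` to fail fast) would surface the problem before memory is exhausted.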
MAGMA software is released as a binary, but it requires fairly large reference data, and for this reason it's best moved into a separate GitHub repository.
The LAVA tool is quite different - it is based on R, and it addresses a different question than MAGMA.
But it needs some of the same reference files as MAGMA. Also, LAVA is developed by the same group as MAGMA. So it's reasonable to include LAVA in the same GitHub repository - but perhaps in a separate .sif file (e.g. magma.sif and lava.sif).
The github repo can be https://github.com/comorment/magma
To reduce overhead from git clone of https://github.com/comorment/containers it's good to move tools such as mixer, ldsc, magma, pleiofdr each into its own repository, same as currently done for the HDL tool.
Current scripts are developed under the assumption that FID and IID are the same, and only IID is used to identify individuals and link them between the .pheno file and the genetic files. It would be good to design this in a way that is more flexible.
Could the gwas.py script implement the option to specify a .sample file to accompany a .bgen file with a different (path)name?
It would be good if the script doesn't throw an error for non-autosomes, but just filters them out (or keeps them if they are sensible codes)
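A tolerant filter could look like the sketch below (the normalization and the keep-list of non-autosomal codes are assumptions, not the gwas.py implementation):

```python
def normalize_chrom(code):
    """Strip an optional 'chr' prefix and upper-case the code."""
    code = str(code).upper()
    return code[3:] if code.startswith('CHR') else code

def filter_chromosomes(chroms, keep_extra=('X', 'Y', 'MT')):
    """Split codes into (kept, dropped): autosomes 1-22 plus keep_extra
    are kept, anything else is dropped instead of raising an error."""
    allowed = {str(i) for i in range(1, 23)} | set(keep_extra)
    kept = [c for c in chroms if normalize_chrom(c) in allowed]
    dropped = [c for c in chroms if normalize_chrom(c) not in allowed]
    return kept, dropped
```

The `dropped` list could then be logged as a warning rather than halting the run.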
Hello,
The new version of SAIGE is running, and I have encountered (and fixed) some errors, but I have now hit a new issue.
First, I have several long flag errors, and they were solved by removing the following flags:
--long flag "numLinesOutput" is invalid
--long flag "IsOutputAFinCaseCtrl" is invalid
--long flag "IsOutputNinCaseCtrl" is invalid
The issue below in the screenshot is more complicated. I do not know how to change the GMMATmodelFile, but it appears empty and causes an error that halts SAIGE. Interestingly, this wasn't an error before the last update.
CUSTOMVARIABLE NCASE
CUSTOMVARIABLE NCONTROL
However, the output from Regenie only has N, not NCASE or NCONTROL.
If we want this implemented, we would need Regenie to have additional output columns.
I don't think we have fully established a container with the MATLAB runtime to package software written in MATLAB.
Bayram has done a lot of work on this:
Note that matlabruntime.sif as well as pre-built pleiofdr and magicsquare binaries can be built with a separate dev box. It's also hosted on NREC, but it's separate from the dev box where we build the rest of the containers. This is because the MATLAB runtime environment was pretty tricky to configure and make compatible with Docker / Singularity.
I suggest moving all of this to a separate GitHub repo, which already exists. This allows the src folder to be a bit more hidden, and we'll need cleaner user docs for those users who want to run MATLAB code via a container. Later we can consider including other tools such as https://github.com/precimed/mostest/ and FEMA (https://github.com/cmig-research-group/cmig_tools) in the MATLAB container. However, I consider this low priority, because MATLAB is somewhat too tricky to squeeze into a container. Perhaps it's best if users just stick to running MATLAB code in their own environment.
So let's move everything MATLAB-related away from the github.com/comorment/containers repo, include it in https://github.com/comorment/matlabruntime, and then put it all on hold and discuss whether or not we want to put more effort into MATLAB-related containers.
We have a phenotype dictionary that looks like this:
We supply ind_F3300 with the --pheno argument and the rest of the variables are supplied with --covar.
When we run gwas.py it incorrectly reports that the variables of type CONTINUOUS have no cases, controls, or missing values, even though all individuals have valid values for these variables:
It seems that the part of the code that reports these sums uses the variable pheno_type, which doesn't change between iterations:
log.log("extracting phenotypes{}...".format(' and covariates' if join_covar_into_pheno else ''))
pheno_and_covar_cols = args.pheno + (args.covar if join_covar_into_pheno else [])
pheno_output = extract_variables(pheno, pheno_and_covar_cols, pheno_dict_map, log)
for var in pheno_and_covar_cols:
if pheno_type=='BINARY':
log.log('variable: {}, cases: {}, controls: {}, missing: {}'.format(var, np.sum(pheno[var]=='1'), np.sum(pheno[var]=='0'), np.sum(pheno[var].isnull())))
else:
log.log('variable: {}, missing: {}'.format(var, np.sum(pheno[var].isnull())))
Source: https://github.com/comorment/containers/blob/main/gwas/gwas.py#L742-L749
If I'm not mistaken we can check the type of each variable using pheno_dict_map, like so:
log.log("extracting phenotypes{}...".format(' and covariates' if join_covar_into_pheno else ''))
pheno_and_covar_cols = args.pheno + (args.covar if join_covar_into_pheno else [])
pheno_output = extract_variables(pheno, pheno_and_covar_cols, pheno_dict_map, log)
for var in pheno_and_covar_cols:
- if pheno_type=='BINARY':
+ if pheno_dict_map[var]=='BINARY':
log.log('variable: {}, cases: {}, controls: {}, missing: {}'.format(var, np.sum(pheno[var]=='1'), np.sum(pheno[var]=='0'), np.sum(pheno[var].isnull())))
else:
log.log('variable: {}, missing: {}'.format(var, np.sum(pheno[var].isnull())))
This Debian package, which does not (yet) support all architectures (e.g., arm64), is installed via https://github.com/comorment/gwas/blob/e3f295087b11866d12d3da9d5ba5c5d929e7272a/scripts/apt_get_essential.sh across different containers. What tools other than KING (https://github.com/comorment/gwas/blob/b6209ffbad73e4f638e9057b1f2612a3d0e0a625/scripts/install_king.sh) require this library?
https://github.com/weizhouUMICH/SAIGE
This update may resolve the issues we are having with saige.
We have released a new version 1.0.0 (on March 15, 2022). It has substantial computation efficiency improvements for both Step 1 and Step 2 for single-variant and set-based tests. We have created a new program GitHub page https://github.com/saigegit/SAIGE with the documentation provided at https://saigegit.github.io/SAIGE-doc/. The program will be maintained by multiple SAIGE developers there. The Docker image has been updated. Please feel free to try version 1.0.0 and report issues if any.
Thanks!
The current implementation makes the --variance-standardize flag potentially dangerous, as it will introduce NAs into a column that originally didn't have any missing values.
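The hazard comes from standardizing a column with zero variance: dividing by a standard deviation of 0 turns every value into NaN. A minimal demonstration:

```python
import pandas as pd

col = pd.Series([1.0, 1.0, 1.0])              # no variation, no missing values
standardized = (col - col.mean()) / col.std()  # std() == 0  ->  0/0
assert standardized.isnull().all()             # every value became NaN
```

A guard that skips (or explicitly errors on) zero-variance columns before standardizing would avoid silently corrupting the phenotype file.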
Before commit c38f807 the file singularity/saige.sif was 702 MB in size; after the commit that file is 295 bytes.
This is the content of that file now:
version https://git-lfs.github.com/spec/v1
<<<<<<< HEAD
oid sha256:8c870154d08604b5eefe2a4635a6ef22c2cf69b4dccb72f12367bda467dffb43
size 736071680
=======
oid sha256:1d8e3762db280395a73eb9bd3a070f6666717f8d0dbc76cb7738e256bf5649da
size 899510272
>>>>>>> 205045b7ae8864036476cf68d358bd0e9ce045c0
It seems like the file has been accidentally committed as a literal Git LFS pointer file (with unresolved merge-conflict markers), instead of the actual Singularity image.
Here's the expected behavior:
Given I have an arguments file named "my_test.args" in the current directory
And my arguments file has the argument "--analysis regenie figures"
And my arguments file has the argument "--out my_test"
And I have a config file named "my_test.yaml" in the current directory
When I run "gwas.py --argsfile my_test.args --config my_test.yaml"
Then the file "my_test.3.job" is created
And the file "my_test.3.job" contains the command "gwas merge-regenie"
And the file "my_test.3.job" contains the argument "--config my_test.yaml"
But currently when going through this scenario (tested with 5d3a5b4) the last step fails; the argument --config is absent from the job file.
You asked before if there were example data for GCTA/B tutorials. Here are the tutorial data for GCTB (I don't think there is a tutorial for SBLUP): https://cnsgenomics.com/software/gctb/#Download
I suggest moving development of ipsychcnv.sif and enigma-cnv.sif into a separate github repository, e.g. http://github.com/comorment/cnv, which should also include docker files, scripts, and other documentation related to these containers. @bayramakdeniz does this sound reasonable?
Small suggestion to add confidence bands to the qq-plot, like this: https://slowkow.com/notes/ggplot2-qqplot/ , so that it can be seen if the p-values fall within the expected range.
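Following the approach in the linked note, pointwise bands come from the fact that the k-th order statistic of n uniform p-values follows Beta(k, n - k + 1). A sketch of the computation (function name and interface are illustrative):

```python
import numpy as np
from scipy import stats

def qq_confidence_band(n, alpha=0.05):
    """Pointwise (1 - alpha) confidence band for n uniform p-values.

    Returns (expected, lower, upper) on the raw p-value scale; apply
    -log10 to all three before drawing the band on the QQ plot.
    """
    k = np.arange(1, n + 1)
    expected = k / (n + 1.0)                           # Beta mean k/(n+1)
    lower = stats.beta.ppf(alpha / 2.0, k, n - k + 1)
    upper = stats.beta.ppf(1.0 - alpha / 2.0, k, n - k + 1)
    return expected, lower, upper
```

Observed -log10(p) values falling outside the shaded band then indicate departure from the null beyond sampling noise.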
There is an error in the loci command. The output from loci is used when plotting.
This is the description of the loci command:
Perform LD-based clumping of summary stats, using a procedure that is similar to FUMA snp2gene functionality
The plotting seems to expect the files iPSYCH2012_ind_F3300.lead.csv and iPSYCH2012_ind_F3300.indep.csv, but those files do not exist. I guess they would be produced by the loci command.
I tried to re-run and remove the flags for those two files to see if that solves the issue. But it didn't. And I know that chr 2 has variants that pass the significance threshold.
Sometimes when I use the scripts/from_docker_image.sh script it fails with an error. Here is how I call the script:
>sudo make gwas.sif
which in turn triggers the following command
docker build -t gwas -f containers/gwas/Dockerfile . && scripts/convert_docker_image_to_singularity.sh gwas
The first part of it succeeds, but the second fails:
...
Successfully built 6038cc3d0a1a
Successfully tagged gwas:latest
registry
Using default tag: latest
The push refers to repository [localhost:5000/gwas]
Get http://localhost:5000/v2/: EOF
make: *** [Makefile:4: gwas.sif] Error 1
Running it one more time always solves the problem.
I haven't investigated what's the problem here...
Tools that could be added to the containers: