h3abionet / h3agwas

GWAS Pipeline for H3Africa

License: Other

Python 33.85% Shell 0.06% Nextflow 39.30% Perl 16.08% R 10.71%

h3agwas's Issues

plink-qc issues

Hello,

Thank you for the great effort putting different aspects of GWAS analysis into handy pipelines. I'm trying out the plink-qc pipeline with the slurmSingularity profile. The core bioinformatics processes work as one would hope, but the pipeline fails at all the steps of generating plots. Below is an error report from such a run that fails on the generateSnpMissingnessPlot process:

Command error:
  Traceback (most recent call last):
    File ".command.sh", line 44, in <module>
      plt.savefig(args.output)
    File "/home/p287664/.local/lib/python3.6/site-packages/matplotlib/pyplot.py", line 716, in savefig
      res = fig.savefig(*args, **kwargs)
    File "/home/p287664/.local/lib/python3.6/site-packages/matplotlib/figure.py", line 2180, in savefig
      self.canvas.print_figure(fname, **kwargs)
    File "/home/p287664/.local/lib/python3.6/site-packages/matplotlib/backend_bases.py", line 2014, in print_figure
      canvas = self._get_output_canvas(format)
    File "/home/p287664/.local/lib/python3.6/site-packages/matplotlib/backend_bases.py", line 1950, in _get_output_canvas
      canvas_class = get_registered_canvas_class(fmt)
    File "/home/p287664/.local/lib/python3.6/site-packages/matplotlib/backend_bases.py", line 125, in get_registered_canvas_class
      backend_class = importlib.import_module(backend_class).FigureCanvas
    File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
      return _bootstrap._gcd_import(name[level:], package, level)
    File "<frozen importlib._bootstrap>", line 994, in _gcd_import
    File "<frozen importlib._bootstrap>", line 971, in _find_and_load
    File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
    File "<frozen importlib._bootstrap_external>", line 678, in exec_module
    File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
    File "/home/p287664/.local/lib/python3.6/site-packages/matplotlib/backends/backend_pdf.py", line 42, in <module>
      from matplotlib import _png
  ImportError: Error relocating /home/p287664/.local/lib/python3.6/site-packages/matplotlib/.libs/libz-a147dcb0.so.1.2.3: __fprintf_chk: symbol not found

For completeness: for now I'm using the sample data in this repo, with nextflow version 19.04.1.

I'm also attaching the complete nextflow.log.

Thank you,
Azza

Error running association pipeline with Gemma

Hi,

Sorry for making another issue 😅

When I try running the association pipeline with Gemma using the following command:

nextflow run assoc.nf --input_dir sample/ --input_pat sampleA --data sample/sample.phe --pheno PHE --covariates SEX --gemma 1 --gemma_num_cores 4 -profile docker

I get the following error:

ERROR ~ Error executing process > 'getGemmaRel (1)'

Caused by:
  Process `getGemmaRel (1)` terminated with an error exit status (127)

Command executed:

  export OPENBLAS_NUM_THREADS=4
  cat sampleA.fam |awk '{print $1"	"$2"	"0.2}' > pheno
  gemma -bfile sampleA  -gk 1 -o sampleA -p pheno -n 3

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 4: gemma: command not found

I think this is because the getGemmaRel process is using the default docker container (quay.io/h3abionet_org/py3plink) rather than the gemmaImage & so the gemma command cannot be found.

However, if I try changing the container to the gemmaImage I get a segmentation fault:

Caused by:
  Process `getGemmaRel (1)` terminated with an error exit status (139)

Command executed:

  export OPENBLAS_NUM_THREADS=4
  cat sampleA.fam |awk '{print $1"	"$2"	"0.2}' > pheno
  gemma -bfile sampleA  -gk 1 -o sampleA -p pheno -n 3

Command exit status:
  139

Command output:
  Reading Files ...

Command error:
  .command.sh: line 4:    34 Segmentation fault      gemma -bfile sampleA -gk 1 -o sampleA -p pheno -n 3

Any ideas what's causing the issue? Any help would be much appreciated.

Many thanks in advance,
Phil

PCA plot not being included in the analysis report

Hi,
I am trying to run the association analysis on the sample files using plink-assoc.nf. The analysis completed successfully. However, the PCA plot (of SNPs) seems to be missing from the generated report.

For example: nextflow run assoc.nf --input_pat raw-GWA-data --chi2 1 --logistic 1 --adjust 1

Could you please let me know how to include the PCA plot in the report?
Although the 'print_pca' option is set to 1, the PCA plot is generated separately and is not included in the final analysis report.

batchReport.py requires additional str formatting

pairA and pairB for curr may require str formatting to work.

h3agwas/templates/batchReport.py

Lines 339 to 340 in 525272b

    curr = curr+TAB.join([row[0],row[1],pairA,\
                          row[2],row[3],pairB,\

Suggested replacement: str(pairA) and str(pairB).
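A minimal sketch of the suggested change, assuming `pairA` and `pairB` can hold non-string values (for example floats or integers) and that `TAB` is a tab character. The helper name `format_pair_row` and the sample values are hypothetical; they only illustrate why `str.join` needs every field to be a string.

```python
TAB = "\t"

def format_pair_row(row, pairA, pairB):
    """Join the fields of one row, converting every field to str first."""
    fields = [row[0], row[1], pairA, row[2], row[3], pairB]
    # str.join raises TypeError on non-string items, so convert explicitly.
    return TAB.join(map(str, fields))

# Hypothetical values: two ID pairs plus numeric pair statistics.
line = format_pair_row(["F1", "I1", "F2", "I2"], 0.25, 3)
```

Mapping `str` over the whole list also protects against numeric IDs in `row`, at the cost of hiding unexpected types.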

error on running code

sudo ./nextflow run h3abionet/h3agwas/assoc/main.nf

Both docker and singularity installed and nextflow also
Error on running code:

N E X T F L O W ~ version 22.10.0
Launching https://github.com/h3abionet/h3agwas [grave_stonebraker] DSL1 - revision: c8a7881 [master]

Unknown parameter : Check parameter <shared-storage-mount=/mnt/shared>

No phenotype given -- set params.pheno

mask didn't work in the call2plink workflow

I tried to use a mask file to exclude some individuals in the call2plink workflow, but the removal did not seem to work, and there were several other issues when using a mask file.

I supplied the following one-column mask file to exclude these individuals, with mask_type=sample-label:
G4968
G4969
G4970
G4971
G4972
sheet2fam.py reported an error at line 130 because the mask file doesn't have a second column:
indivs[(data[0],data[1])]=data[-1]
so I modified the code to:
indivs[(data[0],data[0])]=data[-1]
The plink log file shows that a --remove option was added to the process, but no individuals were removed. I have no idea what causes this bug.
PLINK v1.90b7 64-bit (16 Jan 2023)
Options in effect:
--a2-allele emptyZ0ref.txt
--bed raw.bed
--bim raw.bim
--fam raw.fam
--flip flips.lst
--make-bed
--out plink_out
--remove mask.inds
4819 people (0 males, 0 females, 4819 ambiguous) loaded from .fam.
--remove: 4819 people remaining.
The fixfam process marked the masked individuals with 'MSK', but why are they not removed?
G4969_MSK__ G4969 0 0 1 0
G4970_MSK__ G4970 0 0 1 0
G4971_MSK__ G4971 0 0 2 0
G4972_MSK__ G4972 0 0 2 0

Not possible to copy ami-9ca7b2fa AWS AMI storage snapshot

When using AWS it is possible to launch the ami-9ca7b2fa AMI, but any attempts to copy produce the error message: "You do not have permission to access the storage of this ami".

I suspect that although the AMI itself is public, the underlying snapshot is not public, preventing copying and preventing the ability to copy the snapshot to: 1) a different region and, 2) an encrypted snapshot in order to enable root volume encryption.

Could this be enabled, or is there a reason that the underlying storage should not be publicly accessible?

Error checking code assumes data is local

We have some Groovy code that checks e.g. that the phenotype file is good. This requires that the data is local -- but some users want it to be in S3

Python version issues?

I'm running the pipeline on Ubuntu 14.04 machine, with python 3.4 (locally installed dependencies). The behavior with python seems odd though. Kindly see below:

Line 92 in templates/drawPCA.py produces the error

Command error:
    File ".command.sh", line 91
      draw(*evecs,labels,the_colours)
                 ^
  SyntaxError: only named arguments may follow *expression

A dirty fix is to change that line so that it reads:

draw(*(evecs+(labels,the_colours)))

With that fix, the pipeline continues until a similar problem occurs from the script batchReport.py line 503.

Command error:
    output = TAB.join(map(str, [*i,bfrm.loc[i][args.batch_col],"%5.3f"%row['F_MISS'],pfrm.loc[i]]))+                 TAB+TAB.join(map(xstr,sxAnalysis.loc[i]))+EOL
                                       ^
SyntaxError: can use starred expression only as assignment target
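Both syntax errors above are consistent with unpacking forms that were only added in Python 3.5 (PEP 448). The sketch below shows equivalents that should also parse on Python 3.4. The `draw` stand-in and all sample values are hypothetical, not the actual code in `drawPCA.py` or `batchReport.py`.

```python
def draw(x, y, labels, colours):
    """Stand-in for the plotting function; just returns its arguments."""
    return (x, y, labels, colours)

evecs = ([0.1, 0.2], [0.3, 0.4])   # hypothetical pair of eigenvector columns
labels = ["a", "b"]
the_colours = ["red", "blue"]

# Python >= 3.5 accepts: draw(*evecs, labels, the_colours)
# Python 3.4-compatible equivalent: build the full argument tuple first.
result = draw(*(tuple(evecs) + (labels, the_colours)))

# Likewise, a list display such as [*i, extra] can be written as
# list(i) + [extra] on older interpreters.
i = ("FAM1", "IND1")
fields = list(i) + ["batch1", "%5.3f" % 0.0123]
```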

Reading around, the advice is to solve this problem by using python 3.5, but in the latter case the error reads:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'all'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/users_current/amelbadawi/training_data/h3agwas/work/36/ebc4dbc425b2eb3db495983cda14a5/.command.sh", line 588, in <module>
    text = text + detailedSexAnalysis(pfrm,ifrm,args.sx_pickle,args.pheno_col,bfrm)
  File "/home/users_current/amelbadawi/training_data/h3agwas/work/36/ebc4dbc425b2eb3db495983cda14a5/.command.sh", line 517, in detailedSexAnalysis
    dumpMissingSexTable(sex_fname,ifrm,sxAnalysis,pfrm[pheno_col],bfrm)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'all'

Would you kindly advise?

Thank you,
Alyaa

Cannot clone to Windows due to directory name

The name aux is reserved in Windows: https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions

Therefore, it isn't possible to clone the repository to a Windows system.

Not that anyone is going to use this workflow on Windows, but this also prevents me from making edits on my local machine.
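As an illustration, a small check like the one below could flag repository paths that collide with Windows reserved device names. It is a sketch only: the function name and the example paths are hypothetical, and the reserved-name set follows the Microsoft naming conventions linked above (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9, with or without an extension).

```python
from pathlib import PurePosixPath

RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % n for n in range(1, 10)}
    | {"LPT%d" % n for n in range(1, 10)}
)

def reserved_components(path):
    """Return the components of a repo-relative path that Windows reserves."""
    hits = []
    for part in PurePosixPath(path).parts:
        # A reserved name followed by an extension (e.g. nul.txt) is reserved too.
        stem = part.split(".", 1)[0].rstrip(" ").upper()
        if stem in RESERVED:
            hits.append(part)
    return hits
```

Feeding it the paths tracked in a repository (for example, one path per line from a file listing) would show which directories need renaming before a Windows clone can succeed.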

Fine-mapping

Hi,

I would like to perform a fine-mapping analysis on a Coronary Artery Disease (CAD) dataset (GWAS file available here: https://data.mendeley.com/datasets/gbbsrpx6bs/1). Can I use your pipeline with this data? Do I need to use plink format? Can you give me some clues about the command line to use?

Thanks for your help

Error executing doReport process

Hi,

If I try running the following command:

nextflow run assoc.nf --input_dir sample --input_pat sampleA --data sample/sample.phe --pheno PHE --covariates SEX -profile docker

I get the following error:

ERROR ~ Error executing process > 'doReport'

Caused by:
  Missing output file(s) `out-report.pdf` expected by process `doReport`

Command executed [/h3agwas/templates/make_assoc_report.py]:

..........

Command exit status:
  0

Command output:
  (empty)

Command error:
  sh: pdflatex: not found
  chown: unrecognized option: from
  BusyBox v1.27.2 (2017-12-12 10:41:50 GMT) multi-call binary.

  Usage: chown [-RhLHPcvf]... USER[:[GRP]] FILE...

  Change the owner and/or group of each FILE to USER and/or GRP

  	-R	Recurse
  	-h	Affect symlinks instead of symlink targets
  	-L	Traverse all symlinks to directories
  	-H	Traverse symlinks on command line only
  	-P	Don't traverse symlinks (default)
  	-c	List changed files
  	-v	List all files
  	-f	Hide errors

The following files are in the work dir:

cleaned-pca.pdf  input.1  out-report.tex

So it looks like the doReport process failed because the output pdf could not be made (or rather converted from the latex report). So it looks like the problem is with the pdflatex command in the make_assoc_report.py script, which I guess then calls chown incorrectly?

Perhaps the problem is that pdflatex is not installed in the quay.io/h3abionet_org/py3plink docker container?

I think this is the case because if I try entering the docker container interactively & typing pdflatex I get bash: pdflatex: command not found

Do you think this might be the case? Or could something else be causing the issue?

Thanks again & sorry for opening so many issues
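One way to confirm the diagnosis above is to check, from inside the container, which external tools are actually on the PATH. This is a hedged sketch and not part of the pipeline; `missing_tools` is a hypothetical helper built on the standard-library `shutil.which`.

```python
import shutil

def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [t for t in tools if which(t) is None]

if __name__ == "__main__":
    # pdflatex is the tool the error message reports as missing.
    print(missing_tools(["pdflatex"]))
```

An empty list would mean pdflatex is available and the cause lies elsewhere. The `which` parameter exists only so the lookup can be replaced when testing.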

Provide sample data and a valid configuration file for running call2plink

The documentation on the parameters needed for running call2plink is excellent. What would make it even better is if we had some "fake" data to give users an idea of the structure of the inputs they need to provide.

Report clean-up non-missing data

If there are no non-missing SNPs, the reports are odd because there are extensive graphs etc talking about this. This needs cleaning up.

Illumina cluster file for H3Africa chip

I am not sure if this is the right place to ask, but I was wondering if there is a specific Illumina cluster file for the H3Africa chip.

I found the manifest (.bpm and .csv) files. I also found the human reference suitable for the H3Africa chip from http://www.bioinf.wits.ac.za/data/h3agwas/

However, I cannot seem to find any cluster file suitable for the H3Africa chip.

Thanks in advance for assistance.

Esoh

Trying to run qc.nf and it's failing findRelatedIndiv

Hi there,

I'm trying to run this pipeline but I'm getting some errors:

./nextflow run qc.nf -profile docker

Error executing process > 'findRelatedIndiv (1)'

Caused by:
Process findRelatedIndiv (1) terminated with an error exit status (1)

Command executed [/netapp/raony/gwas/src/h3agwas/templates/removeRelInds.py]:

Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2890, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "pandas/_libs/index.pyx", line 672, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
File "/usr/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2892, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2815, in get_loc_level
return (self._engine.get_loc(key), None)
File "pandas/_libs/index.pyx", line 675, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
KeyError: ('1', '1')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File ".command.sh", line 46, in <module>
rel, deg = getDegrees(remove)
File ".command.sh", line 33, in getDegrees
deg[x]=deg[x]+1
File "/usr/lib/python3.6/site-packages/pandas/core/series.py", line 1106, in __getitem__
return self._get_with(key)
File "/usr/lib/python3.6/site-packages/pandas/core/series.py", line 1120, in _get_with
return self._get_values_tuple(key)
File "/usr/lib/python3.6/site-packages/pandas/core/series.py", line 1168, in _get_values_tuple
indexer, new_index = self.index.get_loc_level(key)
File "/usr/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2817, in get_loc_level
raise KeyError(key) from e
KeyError: ('1', '1')

Any ideas what could be wrong?

Using continuous phenotypic data for an association study

Hi,

Is there any way to use the plink-assoc.nf (or rather assoc.nf) pipeline with continuous phenotypic data or phenotypic data with more factors than just case and control?

As I understand it currently the pipeline accepts a phenotype file eg sample.phe which looks like the following:

FID IID PAT MAT SEX PHE
1 1 0 0 1 2
2 2 0 0 1 1
3 3 0 0 1 2
4 4 0 0 1 2
5 5 0 0 2 2

Here the phenotype PHE is encoded as 1 = control & 2 = case.
Am I able to use a phenotype with more than two values eg 1, 2, 3, 4... etc.
Or can I use a phenotype with continuous values eg 1.23, 12.32, 7.32.. etc.

If it's not currently possible, do you have any idea how either this pipeline could be adapted or any other tools which may allow this to be done?

Many thanks in advance.
Any help would be much appreciated

qc.nf fails during getDuplicateMarkers due to chr prefix / non-autosomal chrs

Hi,

I am trying to run the qc.nf script. It works with the sample data, however, when I try & run it with my own data it fails during the getDuplicateMarkers process (see nextflow.log)

I think the problem comes from this line in dups.py as the error message is (edited for brevity):

Caused by:
  Process `getDuplicateMarkers (1)` terminated with an error exit status (1)

..........

Command error:
  Traceback (most recent call last):
    File ".command.sh", line 71, in <module>
      removeOnBP(sys.argv[1],out)
    File ".command.sh", line 43, in removeOnBP
      elif (chrom, bp) > (old_chrom, old_bp):
  TypeError: '>' not supported between instances of 'str' and 'int'

I have tried printing the value of the chrs that cause a problem (see print_chr_nextflow.log) & it seems to be due to the chromosome being chr1 (it also fails with chr X). The problem occurs because this is not an int and python cannot make the comparison.

Any ideas on the best way to fix this?

(It's also worth noting that I am running my fork of the pipeline which I have made changes to but I think this is also a problem for this pipeline)
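One possible way around the mixed `str`/`int` comparison is to map every chromosome label to a key of a fixed shape before comparing. The sketch below assumes labels such as `1`, `chr1`, `chr X` or `MT`; the function name `chrom_key` is hypothetical and this is not the code in `dups.py`.

```python
def chrom_key(chrom):
    """Map a chromosome label to a tuple that always compares cleanly.

    Numeric labels (with or without a 'chr' prefix) sort numerically;
    everything else (X, Y, MT, ...) sorts after them, alphabetically.
    """
    label = str(chrom).strip()
    if label.lower().startswith("chr"):
        label = label[3:].strip()
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label.upper())

# The comparison from the traceback, rewritten with the key:
later = (chrom_key("chr2"), 500) > (chrom_key(1), 900)
```

Because every key has the shape `(int, int, str)`, Python never has to compare a string with an integer. Whether non-numeric chromosomes should sort last is a design choice that would need to match the ordering of the input .bim file.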

Many thanks in advance,
Phil

AWS Batch issue

If sex information isn't available, the workflow crashes.

The underlying issue is that Batch demands that file channels must contain files. You can't output a value onto a channel and interpret it as a file. We do this in a few places to deal with optional input.

Convert the Wiki to one or more doc files within a dedicated directory in the repo

Having just used the information in the Wiki to get things set up last night, I suggest converting some, if not all, of the information in the Wiki to one or more doc files within a dedicated directory in the repo. IMO, doing this would have several benefits, including:

Converting from a GitHub Wiki to doc files within the repo itself would make it possible to use the normal GitHub workflow and tools. For example, it would allow would-be contributors to use standard pull requests to submit possible improvements, and empower the project administration to use the standard GitHub workflow and tools to manage all of the documentation.
Converting to doc files that reside in the repo, and thus come with the repo automatically when cloned, has the added advantage of making sure that everyone that clones the repo has a documentation set for that specific version already present in the repo when cloned. As it is now, since the Wiki and the rest of the repo are managed separately, not only do we have to refer back to the Wiki for documentation, but there is no guarantee that the doc "versions" match the versions of the actual repo.
Keeping the primary documentation covering important topics of general interest, like installation and setup, within the repo itself is standard practice, with subjects of less general interest, like how to handle extreme edge cases, more often relegated into Wikis.

I would be willing to create a first draft of this over the next week or so if you're interested and barring any unforeseen demands on my time from my employer.

Improve existing installation and setup documentation

Having just used the information in the Wiki last night for installation, I found what I would consider installation information was actually spread out over several different sections of the Wiki. This meant that instead of going to one section and finding what I needed, I read the installation section, didn't find what I needed there, and would have given up if I didn't have such a strong desire to use the project that I read the entire Wiki over several times. I understand projects of any size usually have a hard time getting people to write documentation, since most people prefer to just write code, and I also understand that small projects also have a huge challenge finding the resources to create and maintain good documentation. If you would like, I would be willing to at least see what I could do to improve and consolidate the documentation regarding installation and setup.

Why does plink-qc.nf check for samplesheet, sample.phe, and idatpat?

Hello, I'm trying to run plink-qc.nf on my Linux HPC cluster with:

nextflow run -c my_nf.config /Users/mchiment/.nextflow/assets/h3abionet/h3agwas/plink-qc.nf -profile sgeSingularity --samplesheet "0" --idatpat ""

Where "sgeSingularity" is:
sgeSingularity {

process.executor = "sge"
singularity.autoMounts = true
singularity.enabled = true
process.queue = queue

}

I keep getting errors like this:

N E X T F L O W ~ version 19.01.0
Launching /Users/mchiment/.nextflow/assets/h3abionet/h3agwas/plink-qc.nf [agitated_lavoisier] - revision: 418f49aacb
Check idatpat=true ************** is it a valid parameter -- are you using one rather than two - signs or vice-versa
ERROR ~ Argument of file function cannot be empty

-- Check script 'plink-qc.nf' at line: 237 or see '.nextflow.log' file for more details

Line 237:

sample_sheet_ch = file(params.samplesheet)

Why is it checking for a samplesheet when the directions say we only need a BIM/BED/FAM file as input for this pipeline?

Similarly, if the pipeline also requires a "sample.phe" file, you should put that in the instructions. I had to create this by hand.

Anomalies and extras in topbottom.nf

Need to improve the way in which we handle (a) problems in the samplesheet and (b) duplicates (e.g. for qc)

module numpy has no attribute "float"

While running
nextflow run qc -profile docker
I get the following problem (I am using a Mac terminal):
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'float'

N E X T F L O W ~ version 22.10.6
Launching qc/main.nf [cranky_brahmagupta] DSL1 - revision: 7d401873bd
Downloading plugin [email protected]
The batch file is 0
Sexinfo available command
WARN: The echo directive has been deprecated - use to debug instead
0
executor > local (13)
[3e/4a0c85] process > inMD5 (1) [100%] 1 of 1 ✔
[2d/169242] process > noSampleSheet [100%] 1 of 1 ✔
[41/1da057] process > getDuplicateMarkers (1) [100%] 1 of 1 ✔
[42/26b684] process > removeDuplicateSNPs (1) [100%] 1 of 1 ✔
[29/8db8f9] process > getX (1) [100%] 1 of 1 ✔
[67/c71474] process > analyseX (1) [ 0%] 0 of 1
[1f/38fd77] process > identifyIndivDiscSexinfo (1) [100%] 1 of 1 ✔
[f6/0abe06] process > generateSnpMissingnessPlot (1) [ 0%] 0 of 1
[07/c7f4df] process > generateIndivMissingnessPlot (1) [ 0%] 0 of 1
[4f/31c923] process > getInitMAF (1) [100%] 1 of 1 ✔
[86/bff059] process > showInitMAF (1) [ 0%] 0 of 1
[92/c96228] process > showHWEStats (1) [ 0%] 0 of 1
[e1/337150] process > removeQCPhase1 (1) [ 0%] 0 of 1
[- ] process > compPCA -
[- ] process > drawPCA -
[- ] process > pruneForIBD -
[- ] process > findRelatedIndiv -
[- ] process > calculateSampleHeterozygosity -
[- ] process > generateMissHetPlot -
[- ] process > getBadIndivsMissingHet -
[- ] process > removeQCIndivs -
[- ] process > calculateSnpSkewStatus -
[- ] process > generateDifferentialMissingnessPlot -
[- ] process > findSnpExtremeDifferentialMissingness -
[- ] process > removeSkewSnps -
[- ] process > calculateMaf -
[- ] process > generateMafPlot -
[- ] process > findHWEofSNPs -
[- ] process > generateHwePlot -
[- ] process > outMD5 -
[- ] process > batchProc -
[- ] process > produceReports -

Error executing process > 'generateIndivMissingnessPlot (1)'

Caused by:
executor > local (13)
[3e/4a0c85] process > inMD5 (1) [100%] 1 of 1 ✔
[2d/169242] process > noSampleSheet [100%] 1 of 1 ✔
[41/1da057] process > getDuplicateMarkers (1) [100%] 1 of 1 ✔
[42/26b684] process > removeDuplicateSNPs (1) [100%] 1 of 1 ✔
[29/8db8f9] process > getX (1) [100%] 1 of 1 ✔
[- ] process > analyseX (1) -
[1f/38fd77] process > identifyIndivDiscSexinfo (1) [100%] 1 of 1 ✔
[- ] process > generateSnpMissingnessPlot (1) -
[07/c7f4df] process > generateIndivMissingnessPlot (1) [100%] 1 of 1, failed: 1 ✘
[4f/31c923] process > getInitMAF (1) [100%] 1 of 1 ✔
[- ] process > showInitMAF (1) -
[- ] process > showHWEStats (1) -
[- ] process > removeQCPhase1 (1) -
[- ] process > compPCA -
[- ] process > drawPCA -
[- ] process > pruneForIBD -
[- ] process > findRelatedIndiv -
[- ] process > calculateSampleHeterozygosity -
[- ] process > generateMissHetPlot -
[- ] process > getBadIndivsMissingHet -
[- ] process > removeQCIndivs -
[- ] process > calculateSnpSkewStatus -
[- ] process > generateDifferentialMissingnessPlot -
[- ] process > findSnpExtremeDifferentialMissingness -
[- ] process > removeSkewSnps -
[- ] process > calculateMaf -
[- ] process > generateMafPlot -
[- ] process > findHWEofSNPs -
[- ] process > generateHwePlot -
[- ] process > outMD5 -
[- ] process > batchProc -
[- ] process > produceReports -

Error executing process > 'generateIndivMissingnessPlot (1)'

Caused by:
Process generateIndivMissingnessPlot (1) terminated with an error exit status (1)

Command executed [/Users/devina/h3agwas/qc/templates/missPlot.py]:

#!/usr/bin/env python3
#Load SNP frequency file and generate histogram

import pandas as pd
import numpy as np
import sys
from matplotlib import use
use('Agg')
import argparse
import matplotlib
import matplotlib.pyplot as plt
import sys

def parseArguments():
    if len(sys.argv)<=1:
        sys.argv="snpmissPlot.py sampleA-nd.imiss samples sampleA-nd-indmiss_plot.pdf".split()
    parser=argparse.ArgumentParser()
    parser.add_argument('input', type=str, metavar='input'),
    parser.add_argument('label', type=str, metavar='label'),
    parser.add_argument('output', type=str, metavar='output'),
    args = parser.parse_args()
    return args

args = parseArguments()

data = pd.read_csv(args.input,delim_whitespace=True)

fig = plt.figure(figsize=(17,14))
fig,ax = plt.subplots()
matplotlib.rcParams['ytick.labelsize']=13
matplotlib.rcParams['xtick.labelsize']=13
miss = data["F_MISS"]
big = min(miss.mean()+2*miss.std(),miss.nlargest(4).iloc[3])
interesting = miss[miss<big]
if len(interesting)>0.9 * len(miss):
    miss = interesting
miss = np.sort(miss)
n = np.arange(1,len(miss)+1) / np.float(len(miss))
ax.step(miss,n)
ax.set_xlabel("Missingness",fontsize=14)
ax.set_ylabel("Proportion of %s"%args.label,fontsize=14)
ax.set_title("Cumulative prop. of %s with given missingness"%args.label,fontsize=16)
fig.tight_layout()
plt.savefig(args.output)

Command exit status:
1

Command output:
(empty)

Command error:
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-afe5mj8_ because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Traceback (most recent call last):
File ".command.sh", line 38, in <module>
n = np.arange(1,len(miss)+1) / np.float(len(miss))
File "/usr/local/lib/python3.8/dist-packages/numpy/__init__.py", line 284, in __getattr__
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'float'

Work dir:
/Users/devina/h3agwas/work/07/c7f4dfd568806c00cebe0c50ebf0d7

Tip: you can replicate the issue by changing to the process work dir and entering
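The traceback points at `np.float`, an alias for the built-in `float` that was deprecated in NumPy 1.20 and removed in NumPy 1.24. In the template, the likely one-line change is replacing `np.float(len(miss))` with `float(len(miss))`. The sketch below reproduces the same cumulative-proportion calculation in plain Python so that it runs without NumPy; the function name is hypothetical.

```python
def cumulative_proportions(values):
    """Return (sorted values, cumulative proportion at each value)."""
    ordered = sorted(values)
    # Built-in float stands in for the removed np.float alias.
    total = float(len(ordered))
    props = [k / total for k in range(1, len(ordered) + 1)]
    return ordered, props

# Hypothetical missingness values for four samples.
miss, n = cumulative_proportions([0.3, 0.1, 0.2, 0.4])
```

Pinning an older NumPy in the container would also avoid the error, but changing the call keeps the script working on current releases.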

unlabelled params in 'illumina2lgen' process in topbottom.nf

suggest labelling of the 'time' and 'memory' params in the 'illumina2lgen' process in order to accommodate the processing of studies with larger sample sizes.

Non roman characters in sample sheet

The sheet2fam.py script fails if there are characters with accents in them. Having these is probably a bad idea, but who are we to judge -- should be protected against.

Autosomal SNPs

Detect and report number of autosomal SNPs

Conflict between plink for GWAS and plink the networking tool on Linux

Dear Prof Scott

Thank you for making GWAS so simple with this h3agwas pipeline. I am using the pipeline in parallel with local gwas tools like Plink, QCTOOL, IMPUTE2 etc to compare my results.

I just recently installed the pipeline on my account on the DELGEME bio server since it requires no root privileges and will run faster there.

However, it produced some errors (below) when I tried running the tutorial.

nextflow run plink-qc.nf --input_pat sampleA --input_dir sample

N E X T F L O W ~ version 18.10.1
Launching plink-qc.nf [cranky_perlman] - revision: f986962b17
Sexinfo available command
[warm up] executor > local
[8a/01c9f4] Submitted process > inMD5 (1)
[40/3c2c85] Submitted process > getDuplicateMarkers (1)
[58/52ead8] Submitted process > removeDuplicateSNPs (1)
ERROR ~ Error executing process > 'removeDuplicateSNPs (1)'

Caused by:
Process removeDuplicateSNPs (1) terminated with an error exit status (1)

Command executed:

plink --keep-allele-order --bfile sampleA --must-have-sex --exclude sampleA.dups --missing --make-bed --out sampleA-nd
wc -l sampleA.bim > sampleA.orig
wc -l sampleA.fam >> sampleA.orig

Command exit status:
1

Command output:
(empty)

Command error:
plink: unknown option "--keep-allele-order"
plink: unknown option "--bfile"
plink: unknown option "--must-have-sex"
plink: unknown option "--exclude"

I looked at the plink-qc.nf file and noticed that plink is invoked simply as plink, which is the same name as the connectivity tool:

Plink: command-line connection utility
Release 0.67
Usage: plink [options] [user@]host [command]
       ("host" can also be a PuTTY saved session name)

I am not sure how significant the error is but I just thought it is worth pointing out since some users managing the pipeline with Git might potentially encounter this.

We have plink on the server, but the executable file was renamed to plink1 in order to prevent the conflict. We also have plink1.9 which is actually named as plink1.9

One fix for the conflict that I thought of was to manually change plink to plink1.9 in all the executables.

However, is there a possibility to add program as a parameter in the .config file so that users can specify what version they want to use, and then call it up in the .nf and .sh executables as a variable? This may potentially be useful in making the pipeline support plink2 which you raised as an issue, and other versions.
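Until the executable name is configurable, a wrapper could at least detect which `plink` it has found. The sketch below is a heuristic built only on the two banners quoted in these issues (`PLINK v1.90b7 64-bit ...` for the genetics tool and `Plink: command-line connection utility` for the PuTTY tool). The function name is hypothetical, and how the banner text is obtained from each executable is left as an assumption.

```python
def looks_like_gwas_plink(banner):
    """Guess from a version/usage banner whether a binary is PLINK (genetics)."""
    text = banner.strip()
    first_line = text.splitlines()[0] if text else ""
    # The genetics tool's banner starts with "PLINK v<version>";
    # the PuTTY tool's banner starts with "Plink:".
    return first_line.upper().startswith("PLINK V")

# Banners quoted in the issues on this page.
banners = {
    "plink1.9": "PLINK v1.90b7 64-bit (16 Jan 2023)",
    "plink": "Plink: command-line connection utility\nRelease 0.67",
}
usable = sorted(name for name, b in banners.items() if looks_like_gwas_plink(b))
```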

Please pardon the typos, I am just a novice :)

Thanks,
Esoh

Support plink 2

Plink 2 has different naming conventions for output files. It would be good to have code that supports both plink 1.9 and 2.

Confirm intention in templates/removeRelInds.py

Quick question about the intended behavior of templates/removeRelInds.py.

The line if x in remove or y in remove : pass doesn't actually do anything since pass is a null operation in Python—so it is as if the if didn't happen at all.

Shouldn't the line actually be if x in remove or y in remove : continue if the true intention is to return control to the beginning of the loop?
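The difference can be seen in a small, self-contained sketch. The loop body here is hypothetical (a simple pair-degree count) and is not the actual code in `removeRelInds.py`; only the `pass` versus `continue` behaviour is the point.

```python
def degrees_with_pass(pairs, remove):
    deg = {}
    for x, y in pairs:
        if x in remove or y in remove:
            pass  # no-op: execution falls through to the lines below
        deg[x] = deg.get(x, 0) + 1
        deg[y] = deg.get(y, 0) + 1
    return deg

def degrees_with_continue(pairs, remove):
    deg = {}
    for x, y in pairs:
        if x in remove or y in remove:
            continue  # skip this pair and move to the next iteration
        deg[x] = deg.get(x, 0) + 1
        deg[y] = deg.get(y, 0) + 1
    return deg

pairs = [("A", "B"), ("B", "C"), ("C", "D")]
remove = {"A"}
with_pass = degrees_with_pass(pairs, remove)
with_continue = degrees_with_continue(pairs, remove)
```

With `pass`, the pair containing the removed individual `A` is still counted; with `continue`, it is skipped.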

Support PGEN, BGEN, VCF

Nice to have

Reporting needs improvement

The batchProc.py is probably the worst code I have written in 25 years. Needs to be redone from scratch.

Question: interpreting association results

Hi, I am trying to run the association pipeline with the test data included in the sample directory.

I am familiar with nextflow and docker but less familiar with GWAS/plink etc.

I have run the following command:

nextflow run plink-assoc.nf --input_dir sample --input_pat sampleA --chi2 1 --logistic 1 --adjust 1 --data sample/sample.phe --pheno "SEX,PHE" --covariates "PAT,MAT" -profile docker

It ran to completion producing a very nice report:
out-report.pdf

However, I am not completely sure I understand exactly what the pipeline is doing.

For example, how are the pheno & covariates parameters used?
The documentation says the following:

pheno: a comma-separated list of phenotypes that you want to test. Each phenotype is tested separately.
covariates: a comma-separated list of phenotypes that you want to use

So is each of the pheno parameters tested against the covariates parameters to see if they vary together at particular genetic regions?

Also, what are the observed and expected p-values in the output report for? Is this for each trait/phenotype that is used in the covariates, and how strongly they correlate with particular genomic regions, for example? And why are some of the plots empty?

Great pipeline though and many thanks in advance. Any help would be much appreciated
