adam-ibs's Issues

Calculate the pairwise IBD/IBS metrics - Part I (--genome)

Description

Similar to plink --genome option. See the wiki on IBS-MDS Process and the diagram for the Genome file.

The input files are those created in #2.

The fields required for --cluster and --mds-plot are:

  • FID1: family ID of individual 1
  • IID1: individual ID of individual 1
  • FID2: family ID of individual 2
  • IID2: individual ID of individual 2
  • DST: IBS distance
  • PPC: IBS binomial test

There are more fields, but they will be handled in Part II.
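
As a rough illustration of the DST field listed above (a sketch only, not plink's code): per the plink documentation, DST = (IBS2 + 0.5 * IBS1) / (IBS0 + IBS1 + IBS2), computed over the SNPs that are non-missing in both individuals. PPC is a binomial test on informative SNP pairs and is left to the Analysis step below. A minimal Scala sketch, assuming a hypothetical 0/1/2 minor-allele-count encoding with -1 for missing:

    // Sketch only: counts IBS sharing for one pair of individuals and derives DST.
    def ibsDistance(g1: Array[Int], g2: Array[Int]): Double = {
      var ibs0 = 0; var ibs1 = 0; var ibs2 = 0
      for (i <- g1.indices if g1(i) >= 0 && g2(i) >= 0) {
        math.abs(g1(i) - g2(i)) match {
          case 0 => ibs2 += 1 // both alleles shared identical-by-state
          case 1 => ibs1 += 1 // one allele shared
          case _ => ibs0 += 1 // no allele shared
        }
      }
      val n = ibs0 + ibs1 + ibs2
      if (n == 0) Double.NaN else (ibs2 + 0.5 * ibs1) / n // DST
    }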

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the algorithm and/or mathematical formula to compute the fields:
    • DST
    • PPC
    • any other field required as a dependency

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Important note: the model can stay in memory only for now, but you'll need to integrate it into the ADAM format later on. You'll probably need to create a new record type.

Persistence of Variant/Genotype with ADAM format

Description

This feature allows persistence of the PED/MAP fields imported in issue #2.

The Spark models should integrate with the ADAM format. The ADAM models for Variant and Genotypes should be a good start, but other record types may need to be added to the ADAM format.

The ADAM format can be added as a Maven dependency. The structure of the ADAM format is defined in an Avro file, which can be compiled into Java classes.

For updates to the ADAM format, choose whichever solution is simplest for you, as long as the changes to the Avro file remain easy to diff.
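
The Parquet side of this can be prototyped from Spark SQL before the Avro-generated ADAM classes are wired in. The sketch below (assuming Spark 1.4+ for the DataFrame read/write API, with a hypothetical flat GenotypeRow standing in for the real ADAM records) only shows the persistence mechanics, not the ADAM integration this issue asks for:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical flat record; the real ADAM records are Avro-generated Java classes.
    case class GenotypeRow(sampleId: String, contig: String, position: Long, alleles: String)

    object ParquetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val genotypes = sc.parallelize(Seq(
          GenotypeRow("IID1", "1", 12345L, "A/G"),
          GenotypeRow("IID2", "1", 12345L, "G/G")))

        // Spark SQL reads and writes Parquet directly; the Avro-defined ADAM schema
        // would replace the ad-hoc case class once the missing record types are added.
        genotypes.toDF().write.parquet("/tmp/genotypes.parquet")
        sqlContext.read.parquet("/tmp/genotypes.parquet").show()
      }
    }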

Analysis

Add a comment to this issue with:

  • the missing fields and record types in the current ADAM format

Design

Add a comment to this issue describing how this will be implemented in Spark, and explain how the persistence model will work.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Avro
  • Parquet

scala-Logging

I need help managing the Maven dependencies to include this library in the project.
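
Not an authoritative answer, but as a starting point: scala-logging 3.x is published on Maven Central (the coordinates below are my assumption and should be double-checked against our Scala version), and it needs an SLF4J backend such as logback-classic at runtime. Usage then looks roughly like this:

    // Assumed Maven dependencies (versions/artifact suffix to confirm):
    //   com.typesafe.scala-logging : scala-logging_2.10 : 3.1.0
    //   ch.qos.logback             : logback-classic    : 1.1.3
    import com.typesafe.scalalogging.LazyLogging

    // Hypothetical object, just to show the API.
    object ImportJob extends LazyLogging {
      def run(file: String): Unit = {
        logger.info(s"Importing $file")
        logger.debug("Detailed debug output goes here")
      }
    }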

IBS clustering constraint: missing genotype data (--ibm)

Description

This feature adds the --ibm constraint(s) on the --cluster option described in issue #7. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#options

The input file is the model created in #3.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the hh file generated by plink (not always present).
  • Document the algorithm and/or mathematical formula to compute:
    • --ibm option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Association analysis

Description

This feature adds the --assoc, --model, --fisher, --linear, --logistic, --ci, --counts, --cell, --within, --mh, --mh2, --bd, --homog, --gene-drop, --T2, --qt-means, --gxe, --covar, --reference-allele, --beta, --standard-beta, --genotypic, --hethom, --dominant, --recessive option(s) based on the input file described in #2.

This feature also generates the following file formats: ASSOC, FISHER, MODEL, BEST.PERM, BEST.MPERM, GEN.PERM, TREND.PERM, DOM.PERM, REC.PERM, CMH, CMH2, HOMOG, T2.PERM, T2.MPERM, QASSOC, ASSOC.PERM, ASSOC.MPERM, QASSOC.MEANS, QASSOC.GXE, ASSOC.LINEAR, ASSOC.LOGISTIC. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/anal.shtml

Scope

The scope of this feature may be too large to implement completely during MGL804. However, many of the file formats seem to share a similar structure. In any case, the scope needs to be refined with Beatriz.

From a development point of view, it is worth noting that this feature is quite distinct from the other features, so it could easily be implemented in parallel with them.

Note: I stopped listing the options and file formats at the section Covariates and interactions.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of all file format(s) described above generated by plink.
  • Document the algorithm and/or mathematical formula to compute all the missing fields described above:

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should integrate with the models implemented in Scala by this project and use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Logging processing

In the same way as plink, we need to log the program's execution.

    1. Find an easy way to log the program execution.
    2. Implement it.

I have created a WriteLog Scala class which I call whenever I want to write a log entry. For now, it only does println; logging to a file still needs to be implemented.
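
A minimal sketch of the missing file-writing part (hypothetical; the real WriteLog class may be organized differently, and the scala-Logging issue may replace this with a proper logging library):

    import java.io.{File, FileWriter, PrintWriter}
    import java.text.SimpleDateFormat
    import java.util.Date

    // Sketch: append timestamped lines to a log file instead of only printing them.
    object WriteLog {
      private val logFile = new File("adam-ibs.log")
      private val timestamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

      def log(message: String): Unit = {
        val out = new PrintWriter(new FileWriter(logFile, true)) // true = append mode
        try out.println(s"${timestamp.format(new Date)}  $message")
        finally out.close()
      }
    }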

import text dataset (--make-bed) into Spark

Description

Similar to plink --make-bed option. See the wiki on IBS-MDS Process.

The input files are PED and MAP. However, a relational model similar to the FAM - BED - BIM class diagram is better, and should be used internally.
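
As a rough sketch of the import step (a PED line is FID, IID, father, mother, sex, phenotype, then two alleles per SNP; a MAP line is chromosome, SNP id, genetic distance, position; the case classes here are hypothetical placeholders for the internal model):

    import org.apache.spark.SparkContext

    // Hypothetical in-memory rows; the FAM/BED/BIM-style relational model would replace these.
    case class PedRow(fid: String, iid: String, father: String, mother: String,
                      sex: Int, phenotype: Double, alleles: Array[String])
    case class MapRow(chromosome: String, snpId: String, geneticDist: Double, position: Long)

    def loadPedMap(sc: SparkContext, prefix: String) = {
      val ped = sc.textFile(prefix + ".ped").map { line =>
        val f = line.trim.split("\\s+")
        PedRow(f(0), f(1), f(2), f(3), f(4).toInt, f(5).toDouble, f.drop(6))
      }
      val map = sc.textFile(prefix + ".map").map { line =>
        val f = line.trim.split("\\s+")
        MapRow(f(0), f(1), f(2).toDouble, f(3).toLong)
      }
      (ped, map)
    }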

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should use:

  • Scala
  • Spark RDD

Important note: the model can stay in memory only for now, but you'll need to integrate it into the ADAM format later on. The relevant records from the ADAM model are Variant and Genotypes, but some fields are missing and will need to be added.

MDS analysis formats

Complete the wiki with the MDS analysis data formats (part 5).

Note: please include the link to the official documentation.

IBS clustering formats

Complete the wiki with the IBS clustering data formats (part 4).

Note: please include the link to the official documentation.

Persistence for clustering outliers using ADAM format

Description

This feature allows persistence of the models created in issue #21 (NEIGHBOUR).

The Spark models should integrate with the ADAM format. New record types may need to be added to the ADAM format.

The ADAM format can be added as a Maven dependency. The structure of the ADAM format is defined in an Avro file, which can be compiled into Java classes.

For updates to the ADAM format, choose whichever solution is simplest for you, as long as the changes to the Avro file remain easy to diff.

Analysis

Add a comment to this issue with:

  • the missing fields and record types in the current ADAM format

Design

Add a comment to this issue describing how this will be implemented in Spark, and explain how the persistence model will work.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Avro
  • Parquet

Use a dependency manager to build the project

Description

All the dependencies used in this project (e.g. ADAM's formats, CLI parser, etc.) should be loaded externally via a build tool / dependency manager like Maven or SBT.

Implementation

Update the README file at the root of the repo to describe how to build the project.

Persistence of MIBS and MDIST with ADAM format

Description

This feature allows persistence of the model created in issue #15.

The Spark models should integrate with the ADAM format. New record types may need to be added to the ADAM format.

The ADAM format can be added as a Maven dependency. The structure of the ADAM format is defined in an Avro file, which can be compiled into Java classes.

For updates to the ADAM format, choose whichever solution is simplest for you, as long as the changes to the Avro file remain easy to diff.

Analysis

Add a comment to this issue with:

  • the missing fields and record types in the current ADAM format

Design

Add a comment to this issue describing how this will be implemented in Spark, and explain how the persistence model will work.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Avro
  • Parquet

Object persistence to file

We have Scala objects containing information. In order to provide the same functionality as Plink (writing to text files), those objects need to be saved to file.

Task :

  • Write a Scala object (Persist) that "saves" (writes to file) complex Scala data objects such as ImportRecord, e.g. via a saveToFile method.
  • Scala data classes may need some specific attributes; implement them as needed.

Example of use :

  • myRecord (ImportRecord) object needs to be saved.
  • filename (String) for created file.
  • Persist.saveToFile(myRecord, filename) creates one file per attribute object in myRecord, named filename.<extension>.

PS: all examples relate to our code!
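
A minimal sketch of the idea described in the example of use above (hypothetical: it assumes the record exposes its attributes as a name-to-lines mapping; the real ImportRecord may need an adapter for this):

    import java.io.{FileWriter, PrintWriter}

    // Sketch: each attribute is written to its own file, e.g.
    // Persist.saveToFile(attrs, "out") -> out.fam, out.bim, ...
    object Persist {
      def saveToFile(attributes: Map[String, Seq[String]], filename: String): Unit = {
        for ((extension, lines) <- attributes) {
          val out = new PrintWriter(new FileWriter(s"$filename.$extension"))
          try lines.foreach(out.println)
          finally out.close()
        }
      }
    }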

IBS clustering constraint: cluster by phenotype (--cc)

Description

This feature adds the --cc constraint(s) on the --cluster option described in issue #7. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#options

The input file is the model created in #3.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the hh file generated by plink (not always present).
  • Document the algorithm and/or mathematical formula to compute:
    • --cc option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Command-line (CLI) option parser for Scala

Description

All these plink features use different command-line options, and some of these options can be combined (e.g. --cluster can be called alone or with multiple options). Each option can take 0, 1, or more parameters (e.g. --mcc A B).

I think the best way to think about this is as a tree-like structure. For each invocation of the program there is one main use case (e.g. clustering). Each use case then requires mandatory options and offers optional ones, and the optional options can have default values. Some options are shared by many use cases.

With plink, it is not very clear which options apply to which use case. To improve this, we need to find a flexible, easy-to-use, and light-weight option parser for Scala.

I'm a big fan of the git software approach:

$ git
usage: git [--version] [--help] [-C <path>] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

The most commonly used git commands are:
   add        Add file contents to the index
   bisect     Find by binary search the change that introduced a bug
   branch     List, create, or delete branches
   checkout   Checkout a branch or paths to the working tree
   clone      Clone a repository into a new directory
   commit     Record changes to the repository
   diff       Show changes between commits, commit and working tree, etc
   fetch      Download objects and refs from another repository
   grep       Print lines matching a pattern
   init       Create an empty Git repository or reinitialize an existing one
   log        Show commit logs
   merge      Join two or more development histories together
   mv         Move or rename a file, a directory, or a symlink
   pull       Fetch from and integrate with another repository or a local branch
   push       Update remote refs along with associated objects
   rebase     Forward-port local commits to the updated upstream head
   reset      Reset current HEAD to the specified state
   rm         Remove files from the working tree and from the index
   show       Show various types of objects
   status     Show the working tree status
   tag        Create, list, delete or verify a tag object signed with GPG

'git help -a' and 'git help -g' lists available subcommands and some
concept guides. See 'git help <command>' or 'git help <concept>'
to read about a specific subcommand or concept.

$ git branch --help
(shows help on the git branch command)
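
One candidate worth evaluating in the Analysis below is scopt, which supports this kind of command/sub-option tree out of the box. A rough sketch (scopt 3.x syntax; the command and option names are only examples, not the final CLI design):

    object Cli {
      case class Config(mode: String = "", file: String = "plink", out: String = "")

      // Commands model the "main use case", opt(...) the shared or optional options.
      val parser = new scopt.OptionParser[Config]("adam-ibs") {
        head("adam-ibs", "0.1.0")
        opt[String]("file") action { (x, c) => c.copy(file = x) } text "PED/MAP filename prefix"
        opt[String]("out")  action { (x, c) => c.copy(out = x) }  text "output filename prefix"
        cmd("genome")  action { (_, c) => c.copy(mode = "genome") }  text "pairwise IBS/IBD metrics"
        cmd("cluster") action { (_, c) => c.copy(mode = "cluster") } text "IBS clustering"
      }

      def main(args: Array[String]): Unit =
        parser.parse(args, Config()) match {
          case Some(config) => println(s"mode=${config.mode} file=${config.file} out=${config.out}")
          case None         => () // scopt has already printed the usage text
        }
    }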

Analysis

Add a comment to this issue with:

  • the list of CLI parser libraries available for Java and Scala, and the pros and cons of each

Design

Add a comment to this issue describing the "tree" of available commands for the scope of the features implemented in MGL804.

Implementation

This needs to be implemented in all features produced during MGL804.

IBS clustering analysis without constraints (--cluster)

Description

Similar to plink --cluster option. See the wiki on IBS-MDS Process and the diagram for the Genome file.

More information can be found on the --cluster and --genome-full options in the section on Pairwise IBD estimation of plink manual.

The input file is the model created in #3.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the 4 cluster files generated by plink
  • Document the algorithm and/or mathematical formula to compute these documented fields

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Important note: the model can stay in memory only for now, but you'll need to integrate it into the ADAM format later on. You'll probably need to create a new record type.

MDS constraint: --mds-cluster, --within

Description

This feature adds the --mds-cluster and --within constraint(s) on the --mds-plot option described in issue #17. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/mds.shtml#options

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the algorithm and/or mathematical formula to compute:
    • --mds-cluster option
    • --within option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Persistence of IBS cluster / HH files with ADAM format

Description

This feature allows persistence of the model created in issue #7 - #1 via the --cluster option.

The Spark models should integrate with the ADAM format. New record types may need to be added to the ADAM format.

The ADAM format can be added as a Maven dependency. The structure of the ADAM format is defined in an Avro file, which can be compiled into Java classes.

For updates to the ADAM format, choose whichever solution is simplest for you, as long as the changes to the Avro file remain easy to diff.

Analysis

Add a comment to this issue with:

  • the missing fields and record types in the current ADAM format

Design

Add a comment to this issue describing how this will be implemented in Spark, and explain how the persistence model will work.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Avro
  • Parquet

Failed to execute goal on project adam-ibs-core: Could not resolve dependencies

Compilation of the adam-ibs-data/ folder works, but adam-ibs-core does not:

23% [:~/workspace … /adam-ibs-core] master* ± mvn clean compile
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Adam IBS Core 0.1.0
[INFO] ------------------------------------------------------------------------
Downloading: https://repo.maven.apache.org/maven2/com/ets/mgl804/adam-ibs/0.1.0/adam-ibs-0.1.0.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.666 s
[INFO] Finished at: 2015-10-29T13:56:53-04:00
[INFO] Final Memory: 14M/309M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project adam-ibs-core: Could not resolve dependencies for project com.ets.mgl804:adam-ibs-core:jar:0.1.0: Failed to collect dependencies at com.ets.mgl804:adam-ibs-data:jar:0.1.0: Failed to read artifact descriptor for com.ets.mgl804:adam-ibs-data:jar:0.1.0: Could not find artifact com.ets.mgl804:adam-ibs:pom:0.1.0 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Find a tool that evaluates the quality of the code

Ideally this tool would automatically look at the code, set a quality score, and help us improve the code quality.

It would be nice if this tool could generate a badge in our README.md file as well.

A non-exhaustive list of these tools can be found on http://shields.io/ (for example CodeCov, Scrutinizer, etc.).

Obviously the tool needs to understand the language of the repo (Scala, etc.)

IBS similarity matrix (--matrix, --distance-matrix) using MIBS / MDIST formats

Description

This feature adds the --matrix and --distance-matrix option(s) based on the input file described in #3. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#matrix
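
As a rough sketch of the output step (per the plink documentation, the MIBS matrix holds the pairwise IBS similarity, i.e. the same quantity as DST, and MDIST its complement 1 - similarity; the similarity function below is a hypothetical placeholder for the --genome logic):

    // Builds the square, space-separated rows written to the .mibs and .mdist files.
    def matrixLines(n: Int, similarity: (Int, Int) => Double): (Seq[String], Seq[String]) = {
      val mibs  = for (i <- 0 until n) yield (0 until n).map(j => f"${similarity(i, j)}%.6f").mkString(" ")
      val mdist = for (i <- 0 until n) yield (0 until n).map(j => f"${1.0 - similarity(i, j)}%.6f").mkString(" ")
      (mibs, mdist)
    }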

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the MIBS and MDIST file formats generated by plink.
  • Document the algorithm and/or mathematical formula to compute:
    • --matrix option
    • --distance-matrix option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Important note: the model can stay in memory only for now, but you'll need to integrate it into the ADAM format later on. You'll probably need to create a new record type.

[Ubuntu, Java 7] No configuration setting found for key 'akka.version'

Compilation works on a clean Linux machine (Ubuntu 14.04), but execution does not:

docker run --rm -ti ubuntu:14.04

apt-get update

export JAVA_VERSION="7u85"
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-7-jdk=$JAVA_VERSION\*

apt-get install -y git
git clone https://github.com/GELOG/adam-ibs.git
cd adam-ibs/

apt-get install -y maven

mvn package

root@7d5f110a4f03:/adam-ibs# java -jar adam-ibs-core/target/adam-ibs-core-0.1.0-jar-with-dependencies.jar --help
23:33:40.516 [main] [INFO ] [c.e.m.c.Main$] : Begin with arguments : --help
      --file  <name>   Specify .ped + .map filename prefix (default 'plink')
      --genome         Calculate IBS distances between all individuals [needs
                       --file and --out]
      --make-bed       Create a new binary fileset. Specify .ped and .map files
                       [needs --file and --out]
      --out  <name>    Specify the output filename
      --show-parquet   Show shema and data sample ostored in a parquet file [needs
                       --file]
  -h, --help  <arg>    Show help message

root@7d5f110a4f03:/adam-ibs# java -jar adam-ibs-core/target/adam-ibs-core-0.1.0-jar-with-dependencies.jar --file DATA/test --out output --make-bed
23:22:23.977 [main] [INFO ] [c.e.m.c.Main$] : Begin with arguments : --file DATA/test --out output --make-bed
23:22:25.346 [main] [ERROR] [o.a.s.SparkContext] : Error initializing SparkContext.
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:168) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at akka.actor.ActorSystem$.apply(ActorSystem.scala:141) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at akka.actor.ActorSystem$.apply(ActorSystem.scala:118) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:142) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:245) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:52) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:247) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:424) ~[adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.AppContext$.<init>(AppContext.scala:16) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.AppContext$.<clinit>(AppContext.scala) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.cli.PlinkMethod$.<init>(PlinkMethod.scala:19) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.cli.PlinkMethod$.<clinit>(PlinkMethod.scala) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.Main$$anonfun$main$2.apply(Main.scala:26) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.Main$$anonfun$main$2.apply(Main.scala:23) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.Main$.main(Main.scala:22) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
    at com.ets.mgl804.core.Main.main(Main.scala) [adam-ibs-core-0.1.0-jar-with-dependencies.jar:na]
Exception in thread "main" java.lang.ExceptionInInitializerError
    at com.ets.mgl804.core.cli.PlinkMethod$.<init>(PlinkMethod.scala:19)
    at com.ets.mgl804.core.cli.PlinkMethod$.<clinit>(PlinkMethod.scala)
    at com.ets.mgl804.core.Main$$anonfun$main$2.apply(Main.scala:26)
    at com.ets.mgl804.core.Main$$anonfun$main$2.apply(Main.scala:23)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
    at com.ets.mgl804.core.Main$.main(Main.scala:22)
    at com.ets.mgl804.core.Main.main(Main.scala)
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
    at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
    at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:168)
    at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504)
    at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
    at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
    at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
    at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
    at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
    at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:142)
    at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982)
    at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
    at org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:245)
    at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:52)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:247)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
    at com.ets.mgl804.core.AppContext$.<init>(AppContext.scala:16)
    at com.ets.mgl804.core.AppContext$.<clinit>(AppContext.scala)
    ... 8 more

Multidimensional scaling plots (--mds-plot)

Description

This feature adds the --mds-plot option(s) based on the input file described in #7. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#mds
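
For reference, classical (metric) MDS can be expressed with Spark MLlib's SVD: double-center the squared distance matrix, take the top k singular vectors, and scale them by the square roots of the singular values. Whether plink's --mds-plot does exactly this must be confirmed in the Analysis step; the sketch below is only an illustration (dist is assumed to be the full n x n IBS distance matrix):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def classicalMds(sc: SparkContext, dist: Array[Array[Double]], k: Int): RDD[Array[Double]] = {
      val n = dist.length
      val sq = dist.map(_.map(d => d * d))
      val rowMean = sq.map(_.sum / n)
      val grandMean = rowMean.sum / n
      // Double-centering: b_ij = -1/2 * (d_ij^2 - rowMean_i - rowMean_j + grandMean)
      val b: RDD[Vector] = sc.parallelize((0 until n).map { i =>
        Vectors.dense(Array.tabulate(n)(j => -0.5 * (sq(i)(j) - rowMean(i) - rowMean(j) + grandMean)))
      })
      val svd = new RowMatrix(b).computeSVD(k, computeU = true)
      val s = svd.s.toArray
      // Each individual's coordinates: its row of U scaled by sqrt(singular values).
      svd.U.rows.map(u => Array.tabulate(k)(c => u(c) * math.sqrt(s(c))))
    }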

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the MDS file format generated by plink (not always present).
  • Document the algorithm and/or mathematical formula to compute:
    • --mds-plot option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Important note: the model can stay in memory only for now, but you'll need to integrate it into the ADAM format later on. You'll probably need to create a new record type.

IBS clustering constraint: cluster by phenotype (--match, --match-type)

Description

This feature adds the --match and --match-type constraint(s) on the --cluster option described in issue #7. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#options

The input file is the model created in #3.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the hh file generated by plink (not always present).
  • Document the algorithm and/or mathematical formula to compute:
    • --match option
    • --match-type option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Persistence of Genome-wide IBS/IBD Pairwise Metrics with ADAM format

Description

This feature allows persistence of the model created in issues #3 and #4 via the --genome and --genome-full parameters.

The Spark models should integrate with the ADAM format. New record types may need to be added to the ADAM format.

The ADAM format can be added as a Maven dependency. The structure of the ADAM format is defined in an Avro file, which can be compiled into Java classes.

For updates to the ADAM format, choose whichever solution is simplest for you, as long as the changes to the Avro file remain easy to diff.

Analysis

Add a comment to this issue with:

  • the missing fields and record types in the current ADAM format

Design

Add a comment to this issue describing how this will be implemented in Spark, and explain how the persistence model will work.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Avro
  • Parquet

IBS Clustering outliers format

Complete the wiki with the IBS clustering outliers data format (part 6).

Note: please include the link to the official documentation.

IBS-IBD Estimation

Description

This feature adds the --rel-check, --min, --max, --het, --indep, --indep-pairwise, --homozyg, --homozyg-snp, --homozyg-kb, --homozyg-window-het, --homozyg-window-missing, --homozyg-window-threshold, --homozyg-density, --homozyg-gap, --homozyg-group, --pool-size, --homozyg-match, --consensus-match, --homozyg-verbose, --ibs-test, --segment, --all-pairs, --segment-length, --segment-snp, --segment-group, --segment-verbose, --mperm option(s) based on the input file described in #3.

This feature also generates the following file formats: HET, HOM, HOM.OVERLAP, SEGMENT, SEGMENT.SUMMARY, SEGMENT.SUMMARY.MPERM. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/ibdibs.shtml

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of all file format(s) described above generated by plink.
  • Document the algorithm and/or mathematical formula to compute all the missing fields described above:

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should integrate with the models implemented in Scala by this project and use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Meetings

Task for tracking our meetings: agendas and minutes.

IBS clustering constraint: fixed number of clusters (--K)

Description

This feature adds the --K constraint(s) on the --cluster option described in issue #7. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#options

The input file is the model created in #3.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the hh file generated by plink (not always present).
  • Document the algorithm and/or mathematical formula to compute:
    • --K option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Recipe for compiling plink

The recipe must include:

  • C/C++ editor used (ideally cross-platform)
  • Plink version: 1.07 or 1.90 Beta 3
  • Makefile, etc.

Getting started

  1. Clone the repo
  2. Discuss the skills of each team member
  3. Elect a management leader
  4. Review the planning based on team member skills
  5. Install plink and perform the tutorial on the wiki
  6. Assign tickets of first iteration to each member

IBS clustering: outlier detection

Description

This feature adds the --neighbour option(s) based on the input file described in #3. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#outlier
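
A rough sketch of the core computation (a hedged reading of the plink documentation: for each individual, take the IBS distance to its k-th nearest neighbour and standardize that value against the whole sample; the exact definitions of NN, MIN_DST, Z and PROP_DIFF still have to be confirmed in the Analysis below):

    // Hypothetical input: dst(i)(j) = pairwise IBS distance (e.g. 1 - similarity).
    // Returns, for each individual, (index, distance to k-th nearest neighbour, Z score).
    def neighbourZScores(dst: Array[Array[Double]], k: Int): Seq[(Int, Double, Double)] = {
      val n = dst.length
      val kth = (0 until n).map { i =>
        val others = (0 until n).filter(_ != i).map(j => dst(i)(j)).sorted
        others(k - 1)
      }
      val mean = kth.sum / n
      val sd = math.sqrt(kth.map(d => (d - mean) * (d - mean)).sum / (n - 1))
      (0 until n).map(i => (i, kth(i), (kth(i) - mean) / sd))
    }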

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the NEAREST file format(s) generated by plink.
  • Document the algorithm and/or mathematical formula to compute the missing fields:
    • NN
    • MIN_DST
    • Z
    • PROP_DIFF

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should integrate with the models implemented in Scala by this project and use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Important note: the model can stay in memory only for now, but you'll need to integrate it into the ADAM format later on. You'll probably need to create a new record type.

IBS clustering constraint: population concordance test (--ppc, --ppc-gap)

Description

This feature adds the --ppc and --ppc-gap constraint(s) on the --cluster option described in issue #7. For more info, check: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#options

The input file is the model created in #3.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the hh file generated by plink (not always present).
  • Document the algorithm and/or mathematical formula to compute:
    • --ppc option
    • --ppc-gap option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Persistence for MDS using ADAM format

Description

This feature allows persistence of the models created in issues #17 and #18 (MDS, MIBS, MDIST).

The Spark models should integrate with the ADAM format. New record types may need to be added to the ADAM format.

The ADAM format can be added as a Maven dependency. The structure of the ADAM format is defined in an Avro file, which can be compiled into Java classes.

For updates to the ADAM format, choose whichever solution is simplest for you, as long as the changes to the Avro file remain easy to diff.

Analysis

Add a comment to this issue with:

  • the missing fields and record types in the current ADAM format

Design

Add a comment to this issue describing how this will be implemented in Spark, and explain how the persistence model will work.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Avro
  • Parquet

Coding conventions

We should agree on coding conventions for the new Scala code (to ease future maintenance). I propose the following points:

  • A header for each file with the following information: author name, purpose of the file, other relevant information.
  • A header for each method: description of the method, parameters, return value.
  • Explicit names for variables, methods, classes, and packages.
  • Add comments for the parts of the code that are not obvious to understand (even with explicit variable and method names).

Feel free to add other points.

IBS clustering constraint: maximum cluster size (--mc, --mcc)

Description

Similar to plink --cluster with --mc or --mcc options. See the wiki on IBS-MDS Process and the diagram for the Genome file.

More information can be found on the --cluster and --genome-full options in the section on Pairwise IBD estimation of plink manual.

The input file is the model created in #3.

This feature adds a constraint on the --cluster option described in issue #7.

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the structure of the hh file generated by plink (not always present).
  • Document the algorithm and/or mathematical formula to compute:
    • --mc option
    • --mcc option

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Calculate the pairwise IBD/IBS metrics - Part II (--genome-full)

Description

Similar to plink --genome-full option. See the wiki on IBS-MDS Process and the diagram for the Genome file.

More information can be found on the --genome and --genome-full options in the section on Pairwise IBD estimation of plink manual.

The input files are those created in #2.

This feature completes feature #3 by adding the missing fields.
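
For orientation (to be verified against the plink source in the Analysis step below): IBS0/IBS1/IBS2 count the SNPs at which the pair shares 0, 1 or 2 alleles identical-by-state, Z0/Z1/Z2 are the estimated probabilities of sharing 0, 1 or 2 alleles identical-by-descent, and, per the plink documentation, PI_HAT = P(IBD=2) + 0.5 * P(IBD=1). The last step is trivial once Z1 and Z2 are available:

    // Sketch: proportion of the genome estimated to be shared IBD.
    def piHat(z1: Double, z2: Double): Double = z2 + 0.5 * z1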

Analysis

Add a comment to this issue with:

  • plink version used as reference (1.07 or 1.90 beta 3)
  • relevant C++ function name(s) and the file name(s) where they appear
  • Document the algorithm and/or mathematical formula to compute the fields:
    • RT
    • EZ
    • Z0
    • Z1
    • Z2
    • PI_HAT
    • PHE
    • RATIO
    • IBS0
    • IBS1
    • IBS2
    • HOMHOM
    • HETHET
    • any other field required as a dependency

Design

Add a comment to this issue describing how this will be implemented in Spark, and how it differs from plink.

Also update the class diagram on the wiki page describing PLink formats (when incomplete) and add a class diagram describing the models implemented in Scala for this feature on the wiki page on the MGL804 formats.

Implementation

The implementation should use:

  • Scala
  • Spark RDD
  • Spark MLlib / GraphX (if appropriate)

Important note: the model can stay in memory only for now, but you'll need to integrate it into the ADAM format later on. You'll probably need to create a new record type.
