
adam's Issues

Create ADAM Benchmarking suite

This benchmarking suite would be used for validating the quality of the implementation of the features inside of ADAM. This benchmarking suite would return both performance results, as well as accuracy results.

I have some code already that takes care of performance results; I'll polish this up and start building the framework out. I believe that a lot of the accuracy results will be based on the GenomeBridge tools (at least, at first).

0-vs-1 based coordinates in SAM / BAM

The SAM file format document (http://samtools.sourceforge.net/SAMv1.pdf) claims that SAM files use 1-based coordinate systems, while BAM files use 0-based coordinates.

I'm not sure, but I don't recall seeing anywhere in our code where we treat the two file formats differently in terms of coordinates.

This issue is just a reminder then (for me, or for anyone else who's interested in looking in the meantime) to make sure that we're handling the coordinate systems correctly across both SAM and BAM formats in their conversion to ADAM.

(Additionally, we could probably use a word somewhere, maybe the Github wiki, about which coordinate system the ADAMRecords -- and by extension, every other position-based record -- are using. Having a consistent choice here is almost more important than which convention is chosen.)
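To make the conversion concrete, here is a minimal sketch of the two conventions. The helper names are illustrative, not from the ADAM codebase: SAM's POS is 1-based and inclusive, while 0-based representations are conventionally half-open.

```scala
// Hypothetical helpers illustrating the two conventions (not ADAM's actual API).
// SAM text format stores POS as a 1-based, inclusive coordinate; BAM (and most
// 0-based representations) use half-open intervals.
def samPosToZeroBased(samPos: Int): Int = samPos - 1

def zeroBasedToSamPos(start: Int): Int = start + 1

// A 1-based inclusive interval [start, end] becomes a 0-based half-open [start - 1, end).
def samIntervalToZeroBased(start1: Int, end1: Int): (Int, Int) = (start1 - 1, end1)
```

Whichever convention ADAMRecords settle on, a pair of helpers like these at the SAM/BAM conversion boundary would make the choice explicit and testable.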

Add ability to perform a "range join" to RDDs of ADAMRecords.

Right now, we can use Spark's RDD 'join' method to perform equi-joins on ADAMRecords or other types -- joining records based on the equality of one or more fields.

However, several different genomics applications will require a "range join": joining records based on whether they overlap within a genomic coordinate system. This issue covers extending RDD[ADAMRecord] to support a range join against ADAMRecords or other record types that contain a reference chromosome and start/end coordinates.
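As a sketch of the semantics (not the Spark implementation), the join predicate is just interval overlap on a shared reference. ReferenceRegion here is a hypothetical stand-in for a record's (referenceName, start, end) fields:

```scala
// Minimal range-join sketch. ReferenceRegion is a hypothetical stand-in, not ADAM's type.
case class ReferenceRegion(referenceName: String, start: Long, end: Long) {
  // Half-open intervals overlap iff they share a reference and their ranges intersect.
  def overlaps(that: ReferenceRegion): Boolean =
    referenceName == that.referenceName && start < that.end && that.start < end
}

// Naive O(n*m) join over plain collections; an RDD version would replace this with a
// partitioned or broadcast strategy, but the predicate stays the same.
def rangeJoin[A, B](left: Seq[A], right: Seq[B])
                   (regionA: A => ReferenceRegion, regionB: B => ReferenceRegion): Seq[(A, B)] =
  for (a <- left; b <- right if regionA(a).overlaps(regionB(b))) yield (a, b)
```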

Fix the -list_comparisons argument in CompareAdam

It's suboptimal that there's no way to get a list of comparisons without also specifying inputs, e.g.

$ adam compare -list_comparisons

Prints the usage help with 'Argument "INPUT1" is required'. ...

Can we figure out a way to have an optional -list_comparisons flag, without requiring the INPUT{1,2} arguments as well?
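One possible workaround, sketched below with illustrative names (not the actual CLI code): peek at the raw argument list for the informational flag before handing the arguments to the args4j parser, so the required INPUT arguments are never validated in that case.

```scala
// Hypothetical pre-parse check: if the informational flag is present, handle it and
// skip the normal parser (which would reject the missing INPUT1/INPUT2 arguments).
def handlesListComparisons(args: Array[String], printList: () => Unit): Boolean =
  if (args.contains("-list_comparisons")) {
    printList() // print the available comparisons instead of parsing further
    true
  } else {
    false
  }
```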

Parse out common annotations stored in VCF format

We have a few commonly distributed annotation databases stored (and publicly available) in VCF format. This task will parse out the additional annotation-specific fields for a particular format.

It will mostly be an addition to the VCF parsing, extending the INFO field parsing to pick up annotation-specific fields.

Abstract our "reference identifiers" from the avdl

Seems a few of our objects have "reference identifiers":

union { null, int } referenceId = null;
union { null, string } referenceName = null;
union { null, long } referenceLength = null;
union { null, string } referenceUrl = null;

Should this be cut out?
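One way to cut this out would be to factor the repeated fields into a single record that the other schemas reference. A hypothetical sketch in the same Avro IDL style (record and field names are illustrative):

```
record ReferenceDescription {
  union { null, int } referenceId = null;
  union { null, string } referenceName = null;
  union { null, long } referenceLength = null;
  union { null, string } referenceUrl = null;
}
```

Each object that currently repeats the four fields would then carry a single nullable ReferenceDescription instead.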

Annotate classes with their production worthiness

Currently, this information is stored in scaladoc, or in the commit message, or is just lost, but it would be good to know the level of maturity for our interfaces.

@hammer sent out this link about the ways that Hadoop currently classifies its interfaces; we should use something similar.

We should annotate the current classes that we have, as well as document our process and ensure that future PRs adhere to this standard.

ADAMRecord-->BAM converter

Currently, we have a converter that converts BAM/SAM via SamRecords to ADAMRecords. However, we do not have a converter that goes from ADAMRecords to BAM/SAM. This would perform the opposite packaging, from ADAMRecords to SamRecords, plus the collection of SamHeader data, and then write out through Hadoop-BAM.

Create a "How to Release" guide

In addition to Matt's great work on Contributing.md, it would also be useful to have a guide to how to build a release and push it out to a public repository.

Add Smith-Waterman as consensus generation method for indel realignment

Currently, indel realignment uses a mismatch quality score based approach for generating alternate consensuses. It is desirable to also support Smith-Waterman for generating consensuses. We have a Smith-Waterman implementation, but it needs to be integrated into the realigner itself.
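For reference, the core of the technique is small. Below is a minimal, self-contained Smith-Waterman scoring sketch (linear gap penalty), just to illustrate the local-alignment step; ADAM's own implementation and its integration point in the realigner will differ.

```scala
// Smith-Waterman local alignment score with a linear gap penalty. Scores are clamped
// at zero, which is what makes the alignment local rather than global.
def smithWatermanScore(a: String, b: String,
                       matchScore: Int = 2, mismatch: Int = -1, gap: Int = -1): Int = {
  val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
  var best = 0
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val sub = dp(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) matchScore else mismatch)
    dp(i)(j) = math.max(0, math.max(sub, math.max(dp(i - 1)(j) + gap, dp(i)(j - 1) + gap)))
    best = math.max(best, dp(i)(j))
  }
  best
}
```

A full consensus generator would also keep a traceback matrix to recover the aligned consensus sequence, not just the score.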

Allow list of commands to be injected into adam-cli AdamMain

The list of commands in AdamMain is private, so outside of the PluginExecutor mechanism, adding new commands requires modifying this list and rebuilding; this also introduces a circular dependency between the adam build and an external project.

Adding a new constructor to AdamMain with a reference to the list of commands as a parameter would allow the list to be provided by dependency injection (e.g. via Guice).

https://code.google.com/p/google-guice/
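The shape of the change might look like the sketch below. AdamCommandCompanion is a simplified stand-in for the real companion trait, and the names are illustrative: the point is only that the command list becomes a constructor parameter that an injector or external project can supply.

```scala
// Simplified stand-in for the real command-companion trait.
case class AdamCommandCompanion(commandName: String)

// AdamMain with the command list injected, instead of a private hard-coded list.
class AdamMain(commands: List[AdamCommandCompanion]) {
  // Resolve a command by name, as the CLI dispatcher would.
  def lookup(name: String): Option[AdamCommandCompanion] =
    commands.find(_.commandName == name)
}
```

A default constructor could still supply the built-in list, so the existing `adam` CLI behavior is unchanged while external projects bind their own list via Guice.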

Turning off parquet logging in SparkFunSuite tests

(I'm going to experiment with opening this as an issue, so that it's on our collective minds and maybe someone will remember how to track it down?)

If you're using the SparkFunSuite for Spark-based tests, with sparkTest, sparkBefore, and sparkAfter, there's a silenceSpark option that allows you to turn off Spark logging. However, if you're using adamSave as part of your spark test, even silenceSpark=true won't silence the parquet.hadoop INFO level logging coming out of (I presume) Parquet.

I tried adding "parquet" and "parquet.hadoop" to the hard-coded list of silenced packages in the SparkLogUtils, and that didn't work for me either.

If someone can figure out where the Parquet logging is enabled (my guess: a log4j.xml embedded in the Parquet dependency?) and whether it could be modified at runtime, that would take us a long way towards figuring out how to turn it off during tests. It's making the test outcome output nearly unreadable.

Actually, generalizing that list of hard-coded silenced packages (or allowing the user to add new packages to it "on the fly" from the test as an argument) might be another reasonable change in the future too.
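One runtime approach worth trying, assuming (as guessed above) that Parquet logs through java.util.logging rather than log4j: grab the "parquet" logger by name and raise its level before the test runs. Whether this actually catches the parquet.hadoop INFO output is exactly what needs verifying.

```scala
import java.util.logging.{Level, Logger}

// Raise the level on a java.util.logging logger so INFO/WARNING chatter from that
// package (and its children) is dropped. Untested against Parquet itself.
def silencePackage(name: String): Logger = {
  val logger = Logger.getLogger(name)
  logger.setLevel(Level.SEVERE)
  logger
}
```

If this works, it would also be a natural building block for the generalized, user-extensible silenced-package list suggested above.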

Add Pileup-->Read Converter

Currently, we can generate pileups from read data, and we save sufficient data to generate reads from pileup data. However, we don't have a converter that performs this transformation. This would be done by grouping pileups by readName, and then stringing read bases together by reference position (and rangePosition for indels).
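The regrouping step can be sketched over plain collections; the field names below are stand-ins for whatever ADAM's pileup schema provides, and a real converter would also handle indels via rangePosition, qualities, and flags.

```scala
// Toy pileup base: one read base observed at one reference position.
case class PileupBase(readName: String, position: Long, base: Char)

// Group pileup bases by read name, then string the bases together in reference order.
def pileupsToReadSequences(pileups: Seq[PileupBase]): Map[String, String] =
  pileups.groupBy(_.readName).map { case (name, bases) =>
    name -> bases.sortBy(_.position).map(_.base).mkString
  }
```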

The ADAMContextSuite loadAdamFromPaths test (seems to) occasionally hang.

The test in the ADAMContextSuite, with the description line

"loadAdamFromPaths can load simple RDDs that have just been saved"

seems to occasionally, randomly hang. For me, killing it and re-running "mvn clean test" will usually pass through fine. Carl and I here at GB are both seeing this problem -- is anyone else noticing it?

We'll try and look into it, just wanted to gather a bit more information...

Reference key format

We've been using integer ids for reference keys. We thought that using integer reference/contig ids was better because it prevented naming conflicts when processing multiple BAM or VCF files, since it essentially uniquifies the sequence dictionary keys. However, we can also do this by using RefSeq strings as identifiers, including the reference version information. If we're processing multiple BAM or alignment files and they both specify the same random (i.e. non-RefSeq) contig name that could mean different contigs in each file, that's a problem. However, it is fairly rare, and we can detect it and throw an error (or punt) for now.

Also, the region join work is blocked on this; as Timothy summarized in issue 127:

  • Neal is going to make the change from using refId -> referenceName as the primary key, and will break it out as a separate PR
  • the regionJoin PR will be blocked on the merging of that PR first
  • it will include a way to specify a mapping from "common contig name" to a unique ("full Refseq ID") name at load time, for both BAMs and VCFs.
  • Michael will add liftover files that specify those conversions for at least builds 37 and 38 for human

Error running multiple test suites

We are adding a test suite for the CLI components, and noticed that if we run two different modules that are using the SparkFunSuite, the second will fail like this:

org.jboss.netty.channel.ChannelException: Failed to bind to: wm45a-220/10.1.0.186:55382
  at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
  at akka.remote.netty.NettyRemoteServer.start(Server.scala:54)
  at akka.remote.netty.NettyRemoteTransport.start(NettyRemoteSupport.scala:90)
  at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:94)
  at akka.actor.ActorSystemImpl._start(ActorSystem.scala:588)
  at akka.actor.ActorSystemImpl.start(ActorSystem.scala:595)
  at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
  at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
  at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:64)
  at org.apache.spark.SparkEnv$.createFromSystemProperties(SparkEnv.scala:119)
  ...
  Cause: java.net.BindException: Address already in use
  at sun.nio.ch.Net.bind(Native Method)
  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:124)
  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
  at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.bind(NioServerSocketPipelineSink.java:140)
  at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleServerSocket(NioServerSocketPipelineSink.java:90)
  at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:64)
  at org.jboss.netty.channel.Channels.bind(Channels.java:569)
  at org.jboss.netty.channel.AbstractChannel.bind(AbstractChannel.java:187)
  at org.jboss.netty.bootstrap.ServerBootstrap$Binder.channelOpen(ServerBootstrap.java:343)
  at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)

It seems that the state is not properly cleaned up in the current sparkDestroy method.
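A possible fix, based on a workaround used in Spark test suites of this era (the exact property names are an assumption here, not verified against this failure): after stopping the context, clear the system properties that record the driver's bound port, so the next suite picks a fresh port instead of trying to rebind the old one.

```scala
// Spark 0.8-era contexts publish the driver's actor-system address in system
// properties; if these survive into the next suite, Akka tries to rebind the
// same host:port and fails with "Address already in use".
def clearSparkPortProperties(): Unit = {
  System.clearProperty("spark.driver.port")
  System.clearProperty("spark.hostPort")
}
```

Calling something like this from sparkDestroy, after `sc.stop()`, would be the first thing to try.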

BQSR should support recalibration across multiple ADAM files

Per the discussion here: #48 (comment)

It doesn't appear that BQSR could currently be (correctly) run across multiple ADAM files with overlapping read groups (or possibly multiple ADAM files with different read groups either, I'm not sure).

We should probably (a) make that clearer in the documentation to BQSR, and (b) correct it (by building or maintaining a master read-group map that can be used to correctly define the read group covariate across files).

SAMRecordConverter: ensure internal consistency of flags

The SAMRecordConverter currently sets combinations of flags that can be hard to interpret. For example, SAMRecordConverter currently sets PrimaryAlignment by default even on unmapped reads. We should revisit this so that the flags are set in a way that is internally consistent and coherent.

Upgrade to Java 7

Our pom.xml declares <java.version>1.6</java.version>. I'm curious if that's intentional (e.g. because one of our dependencies requires it or to support Mac users out of the box), or if it's something that we could change.

Simplify import/export command line structure

From the discussion here (a copy follows):

@tdanford:
Are we really going to have a separate command for each file-format-pair conversion operation? For example, we're going to want an Adam2Bam sometime soon, right? Can we roll these into a single "Export" (say) command?

@fnothaft:
I think we should at least break them up by format type (e.g., reads, reference, pileups, variants, annotations). Otherwise, the conversion operations will get really messy. If we had an import/export Swiss Army Knife per format, I'd be OK.

@tdanford:
I'd be okay with that -- anyone else have an opinion? @massie? @carlyeks?

@carlyeks:
What about having subcommands for each of the different formats, but having a top-level import/export command? Each importer/exporter would be distinct, but a new user who's looking for usage information wouldn't be overwhelmed by the number of commands. That way adam import would list all of the file formats that we support. The commands would become

adam import bam [bam] [adam]
adam export bam [adam] [bam]

@fnothaft:
You wind up then having an explosion of options. The problem is that at the end of the day, there are 5 different ADAM formats, each has 1-4 different external formats that they can be converted to. In some cases (e.g. pileup data), it's logical to convert reads to pileups to mpileup style output; therefore, you also need to track and notate which formats can be converted between each other.

I agree with consolidating, but I disagree with consolidating down to a single command.

@massie:
I don't see an issue consolidating down to a single command, assuming it's possible to derive the file type from the extension and that the conversion doesn't require CLI options.

adam import foo.bam foo.adam
adam export foo.adam foo.bam

could be a very clean and simple way to do this. However, if we foresee needing options (e.g. filtering) for each import/export command then the number of options could explode.

Another option might be to change the AdamMain to group commands into categories (e.g. "pre-processors", "utilities", "converters", etc.). We could put all the bam2adam, vcf2adam, etc. commands in a group.

@tdanford:
I think it's possible that there will be some file extensions (e.g. .vcf) which might have multiple corresponding ADAM formats though...
@fnothaft:
Some of the formats don't have extensions (mpileup), and also there aren't extension conventions for adam data (e.g. adam reads vs. adam pileups vs. adam variants etc.)

Additionally, all of the converters themselves have their own options, e.g. SAM validation stringency, or whether to re-ID reference contigs. Consolidating per format makes validating a command line messy; consolidating per all formats makes validating a command line really nasty, and makes documenting the acceptable options cumbersome.

Can we have a one-to-one relationship (or maybe a partial one-to-one relationship) between Export/Import classes and schemas in the adam.avdl file?

@carlyeks:
I don't think we need to specify the whole argument, if we had a tree structure to our commands:

  • import
    • bam
    • vcf
    • reference-fasta
  • export
    • bam
    • vcf
    • reference-fasta

Each one of these would be its own command and parser (a subcommand of its import/export).

At this point, it isn't a big deal. But, as we continue to add more commands, we will need to at least categorize them for easier display to users, and at most make them into a hierarchy. If we think too much as ourselves and not as new users, then using ADAM may become too daunting a task for newcomers.

I think this is a bigger discussion which should probably happen on the mailing list instead of keeping this pull request from being merged.

@massie:
+1

Create a Contributor License Agreement and require it to be signed before accepting a commit from a new contributor

I've worked on previous open source projects where we've wanted to change the project governance in one way or another (for example, grant the project to the Apache Software Foundation). The process was needlessly complicated by having accepted contributions from many software developers who needed to be recontacted to sign over their copyright to the project before it could be relicensed.

I've noticed copyrights from multiple organizations in the ADAM codebase. It would be great while we still can contact everyone who has contributed to craft a CLA, ask all previous contributors to sign, and then require any new contributors to sign it going forward.

Modify AdamCommand to add global Parquet logging option

Currently, many of the commands manually lower the Parquet logging level. Instead of doing this in every single command, we should refactor so that this is in the main AdamCommand or AdamSparkCommand class. Additionally, we should add an option that allows users to specify the level of logging they'd like.

Move adam-format into bdg-formats repo

Non-JVM projects may want to work with data conforming to the schemas in adam-format. For this reason we should put the .avdl files into a separate repository and generate release artifacts for non-JVM languages.

As part of this change, we should probably change the namespace declaration, the protocol name, and the "ADAM" prefix on the record names.

Further, we may want to consider breaking the .advl file into separate files using Avro imports so that downstream projects can use only the schemas that they need.

Add schemas for annotation types, and CLI tools for ingesting annotations

We want to be able to represent and ingest many genome annotations that are out there (e.g., ENCODE, UCSC Genome Browser, etc.). We need a standard way to represent them and ingest them (e.g., from BED/GFF files.)

This is also discussed in the ADAM RFCs 2 and 3:
http://bigdatagenomics.github.io/rfc/2/
http://bigdatagenomics.github.io/rfc/3/

I'm cross-posting here just so people know it's being worked on. Such a CLI tool is in the pipeline for me.

Support Alphabets?

The avro schema currently contains enum Base which uses the IUPAC alphabet. Is there any interest/need in allowing alternative alphabets to be specified, e.g., in the style of BioPython?

Predicate to filter conversion.

We need a way to convert Parquet predicates into Spark filters—this is needed for the adamRead methods for both read and variant data, as we ignore the predicate passed if reading BAM/SAM/VCF data.

The 'attributes' field of ADAMRecord drops the attribute type information.

The 'attributes' field in the ADAMRecord is used to capture the optional fields on a SAMRecord during conversion from SAM/BAM. However, the optional fields in the SAM file format contain type information, along with a tag and a value -- but the encoding into the 'attributes' field only maintains the tag and value, and drops the type information.

This is a problem for two reasons -- first, because the tags aren't canonical or standard, so there may be "new" or unknown tags whose type information we're not already aware of by introspection, and second, because we'd like to be able to convert back to a BAM or SAM from ADAM in the future, and we'll need this type information to do the conversion.

This ticket is for adding that type information back into the attributes field in a reasonable way.
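One reasonable encoding, sketched below with illustrative helper names: store each attribute the way SAM's text form already does, as TAG:TYPE:VALUE (e.g. "XS:i:20"), so a future ADAM-to-SAM/BAM converter can recover the typed value even for non-standard tags.

```scala
// A SAM optional field: a two-character tag, a one-character type code (i, Z, f, ...),
// and the value serialized as text.
case class Attribute(tag: String, samType: Char, value: String) {
  def encode: String = s"$tag:$samType:$value"
}

// Split on the first two colons only, so values containing ':' survive the round trip.
def decode(encoded: String): Attribute = encoded.split(":", 3) match {
  case Array(tag, t, v) if t.length == 1 => Attribute(tag, t.head, v)
  case _ => throw new IllegalArgumentException(s"Malformed attribute: $encoded")
}
```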

Add ability to call 'plugins' from the command-line

Right now, each individual command that we run from the command line is written as a separate AdamCommand and made a part of the adam-cli package.

That's great for common commands that are useful to everyone -- but in the future, users will want to write commands to execute on a Spark cluster with ADAM, without necessarily compiling the command into the adam-cli module itself.

This suggests that we might define an interface and an execution harness for "plugins," which would be dynamically-loaded code, conforming to a particular interface, that can be executed (by fully-qualified class name) from the command line.

Dependency convergence errors in build

Running the build as-is gives warnings about multiple versions of the same groupId:artifactId transitive dependencies:

$ mvn install
...
[INFO] --- scala-maven-plugin:3.1.5:doc-jar (attach-scaladocs) @ adam-cli ---
[WARNING]  Expected all dependencies to require Scala version: 2.9.3
[WARNING]  edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT requires scala version: 2.9.3
[WARNING]  com.twitter:chill_2.9.3:0.3.1 requires scala version: 2.9.3
[WARNING]  org.apache.spark:spark-core_2.9.3:0.8.1-incubating requires scala version: 2.9.3
[WARNING]  org.scala-lang:scalap:2.9.3 requires scala version: 2.9.3
[WARNING]  org.scala-lang:scala-compiler:2.9.3 requires scala version: 2.9.3
[WARNING]  org.apache.spark:spark-core_2.9.3:0.8.1-incubating requires scala version: 2.9.3
[WARNING]  net.liftweb:lift-json_2.9.2:2.5 requires scala version: 2.9.3
[WARNING]  net.liftweb:lift-json_2.9.2:2.5 requires scala version: 2.9.2
[WARNING] Multiple versions of scala libraries detected!
[WARNING]  Expected all dependencies to require Scala version: 2.9.3
[WARNING]  edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT requires scala version: 2.9.3
[WARNING]  com.twitter:chill_2.9.3:0.3.1 requires scala version: 2.9.3
[WARNING]  org.apache.spark:spark-core_2.9.3:0.8.1-incubating requires scala version: 2.9.3
[WARNING]  org.scala-lang:scalap:2.9.3 requires scala version: 2.9.3
[WARNING]  org.scala-lang:scala-compiler:2.9.3 requires scala version: 2.9.3
[WARNING]  org.apache.spark:spark-core_2.9.3:0.8.1-incubating requires scala version: 2.9.3
[WARNING]  net.liftweb:lift-json_2.9.2:2.5 requires scala version: 2.9.3
[WARNING]  net.liftweb:lift-json_2.9.2:2.5 requires scala version: 2.9.2
...
[INFO] Including args4j:args4j:jar:2.0.23 in the shaded jar.
[WARNING] objenesis-1.2.jar, kryo-2.21.jar define 32 overlappping classes: 
[WARNING]   - org.objenesis.Objenesis
[WARNING]   - org.objenesis.strategy.StdInstantiatorStrategy
[WARNING]   - org.objenesis.instantiator.basic.ObjectStreamClassInstantiator
[WARNING]   - org.objenesis.instantiator.sun.SunReflectionFactorySerializationInstantiator
[WARNING]   - org.objenesis.instantiator.perc.PercSerializationInstantiator
[WARNING]   - org.objenesis.instantiator.NullInstantiator
[WARNING]   - org.objenesis.instantiator.jrockit.JRockitLegacyInstantiator
[WARNING]   - org.objenesis.instantiator.gcj.GCJInstantiatorBase
[WARNING]   - org.objenesis.ObjenesisException
[WARNING]   - org.objenesis.instantiator.basic.ObjectInputStreamInstantiator$MockStream
[WARNING]   - 22 more...
[WARNING] asm-tree-4.0.jar, cofoja-1.0.jar define 26 overlappping classes: 
[WARNING]   - org.objectweb.asm.tree.MethodNode$1
[WARNING]   - org.objectweb.asm.tree.LocalVariableNode
[WARNING]   - org.objectweb.asm.tree.FieldNode
[WARNING]   - org.objectweb.asm.tree.InnerClassNode
[WARNING]   - org.objectweb.asm.tree.LabelNode
[WARNING]   - org.objectweb.asm.tree.VarInsnNode
[WARNING]   - org.objectweb.asm.tree.InsnNode
[WARNING]   - org.objectweb.asm.tree.FieldInsnNode
[WARNING]   - org.objectweb.asm.tree.JumpInsnNode
[WARNING]   - org.objectweb.asm.tree.IntInsnNode
[WARNING]   - 16 more...
...

If I enable the <DependencyConvergence/> check in the maven-enforcer-plugin (see http://maven.apache.org/enforcer/enforcer-rules/dependencyConvergence.html), it reports these and fails the build:

[INFO] ------------------------------------------------------------------------
[INFO] Building ADAM: Core 0.6.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-maven) @ adam-core ---
[INFO] 
[INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-java) @ adam-core ---
[WARNING] 
Dependency convergence error for com.thoughtworks.paranamer:paranamer:2.4.1 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-net.liftweb:lift-json_2.9.2:2.5
      +-com.thoughtworks.paranamer:paranamer:2.4.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.avro:avro:1.7.4
    +-com.thoughtworks.paranamer:paranamer:2.3

[WARNING] 
Dependency convergence error for io.netty:netty:3.5.4.Final paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.spark-project:akka-remote:2.0.5-protobuf-2.5-java-1.5
    +-io.netty:netty:3.5.4.Final
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-org.apache.avro:avro-ipc:1.7.4
      +-io.netty:netty:3.4.0.Final
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-com.typesafe.akka:akka-remote:2.0.5
      +-io.netty:netty:3.5.4.Final

[WARNING] 
Dependency convergence error for org.scala-lang:scalap:2.9.3 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-org.scala-lang:scalap:2.9.3
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-net.liftweb:lift-json_2.9.2:2.5
      +-org.scala-lang:scalap:2.9.2

[WARNING] 
Dependency convergence error for org.xerial.snappy:snappy-java:1.0.5 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-org.xerial.snappy:snappy-java:1.0.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.avro:avro:1.7.4
    +-org.xerial.snappy:snappy-java:1.0.4.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-com.twitter:parquet-avro:1.2.5
    +-com.twitter:parquet-hadoop:1.2.5
      +-org.xerial.snappy:snappy-java:1.0.5

[WARNING] 
Dependency convergence error for commons-logging:commons-logging:1.1.1 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.spark-project:akka-remote:2.0.5-protobuf-2.5-java-1.5
    +-net.debasishg:sjson_2.9.1:0.15
      +-net.databinder:dispatch-json_2.9.1:0.8.5
        +-org.apache.httpcomponents:httpclient:4.1
          +-commons-logging:commons-logging:1.1.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-httpclient:commons-httpclient:3.1
        +-commons-logging:commons-logging:1.0.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-logging:commons-logging:1.1.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-configuration:commons-configuration:1.6
        +-commons-logging:commons-logging:1.1.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-configuration:commons-configuration:1.6
        +-commons-digester:commons-digester:1.8
          +-commons-logging:commons-logging:1.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-configuration:commons-configuration:1.6
        +-commons-beanutils:commons-beanutils-core:1.8.0
          +-commons-logging:commons-logging:1.1.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-hdfs:2.2.0
      +-commons-logging:commons-logging:1.1.1
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-net.java.dev.jets3t:jets3t:0.7.1
      +-commons-logging:commons-logging:1.1.1

[WARNING] 
Dependency convergence error for org.slf4j:slf4j-api:1.7.5 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-org.apache.hadoop:hadoop-auth:2.2.0
        +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-app:2.2.0
      +-org.apache.hadoop:hadoop-mapreduce-client-common:2.2.0
        +-org.apache.hadoop:hadoop-yarn-client:2.2.0
          +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-app:2.2.0
      +-org.apache.hadoop:hadoop-mapreduce-client-common:2.2.0
        +-org.apache.hadoop:hadoop-yarn-server-common:2.2.0
          +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-app:2.2.0
      +-org.apache.hadoop:hadoop-mapreduce-client-common:2.2.0
        +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-app:2.2.0
      +-org.apache.hadoop:hadoop-mapreduce-client-shuffle:2.2.0
        +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-app:2.2.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-yarn-api:2.2.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-core:2.2.0
      +-org.apache.hadoop:hadoop-yarn-common:2.2.0
        +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-core:2.2.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-mapreduce-client-jobclient:2.2.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-org.apache.avro:avro-ipc:1.7.4
      +-org.slf4j:slf4j-api:1.6.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-org.slf4j:slf4j-api:1.7.2
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-com.typesafe.akka:akka-slf4j:2.0.5
      +-org.slf4j:slf4j-api:1.6.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-com.codahale.metrics:metrics-core:3.0.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-com.codahale.metrics:metrics-jvm:3.0.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-com.codahale.metrics:metrics-json:3.0.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-com.codahale.metrics:metrics-ganglia:3.0.0
      +-org.slf4j:slf4j-api:1.7.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.slf4j:slf4j-log4j12:1.7.5
    +-org.slf4j:slf4j-api:1.7.5

[WARNING] 
Dependency convergence error for com.google.guava:guava:11.0.2 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-com.google.guava:guava:11.0.2
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-hdfs:2.2.0
      +-com.google.guava:guava:11.0.2
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-com.google.guava:guava:14.0.1

[WARNING] 
Dependency convergence error for commons-lang:commons-lang:2.5 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-lang:commons-lang:2.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-configuration:commons-configuration:1.6
        +-commons-lang:commons-lang:2.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-hdfs:2.2.0
      +-commons-lang:commons-lang:2.5
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-org.apache.avro:avro-ipc:1.7.4
      +-org.apache.velocity:velocity:1.7
        +-commons-lang:commons-lang:2.4

[WARNING] 
Dependency convergence error for commons-codec:commons-codec:1.4 paths to dependency are:
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.spark-project:akka-remote:2.0.5-protobuf-2.5-java-1.5
    +-net.debasishg:sjson_2.9.1:0.15
      +-net.databinder:dispatch-json_2.9.1:0.8.5
        +-org.apache.httpcomponents:httpclient:4.1
          +-commons-codec:commons-codec:1.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-httpclient:commons-httpclient:3.1
        +-commons-codec:commons-codec:1.2
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-commons-codec:commons-codec:1.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-common:2.2.0
      +-org.apache.hadoop:hadoop-auth:2.2.0
        +-commons-codec:commons-codec:1.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.2.0
    +-org.apache.hadoop:hadoop-hdfs:2.2.0
      +-commons-codec:commons-codec:1.4
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-org.apache.spark:spark-core_2.9.3:0.8.1-incubating
    +-net.java.dev.jets3t:jets3t:0.7.1
      +-commons-codec:commons-codec:1.3
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-com.twitter:parquet-avro:1.2.5
    +-com.twitter:parquet-column:1.2.5
      +-com.twitter:parquet-encoding:1.2.5
        +-commons-codec:commons-codec:1.7
and
+-edu.berkeley.cs.amplab.adam:adam-core:0.6.1-SNAPSHOT
  +-com.twitter:parquet-avro:1.2.5
    +-com.twitter:parquet-column:1.2.5
      +-commons-codec:commons-codec:1.7
...

I made some progress in a branch of my fork but after an hour I gave up

https://github.com/heuermh/adam/tree/dep-conv

With all these dependency conflicts in the shaded jar, errors may occur at runtime (e.g. versions 11.x and 14.x of guava are binary-incompatible).

That said, feel free to mark this as WontFix at this time because I'm not sure having a clean build would be worth the effort.
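
For reference, one conventional way to resolve warnings like these is to pin each contested artifact in a `dependencyManagement` section of the parent pom. The versions below are simply the newest ones appearing in the warnings above, chosen as an illustration, not a tested resolution:

```xml
<!-- Sketch: force convergence by pinning contested versions.
     Versions are illustrative, taken from the warnings above. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.7.5</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>14.0.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Pinning only silences the enforcer once every conflicting path resolves to the pinned version; whether the pinned versions are actually runtime-compatible still has to be verified.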

RichADAMRecord: referencePosition methods ignore contig

Only the offset within the contig is compared, so a logic bug could easily slip in: calling code may assume the contigs match when in fact only the offsets are checked.

Methods: isMismatchAtReferencePosition, readOffsetToReferencePosition, possibly others
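
A safer comparison bundles the contig with the offset so the two can never be compared separately. A minimal sketch (hypothetical helper, not the actual RichADAMRecord API):

```scala
// Hypothetical sketch: a reference position that carries its contig,
// so equality checks cannot silently compare offsets across contigs.
case class RefPos(contig: String, offset: Long)

object RefPos {
  // Distinct contigs are never "the same position",
  // even when the offsets happen to match.
  def samePosition(a: RefPos, b: RefPos): Boolean =
    a.contig == b.contig && a.offset == b.offset
}
```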

Add ADAM version to help output

There are Maven plugins that will integrate the git version into the build. This would allow ADAM's help output to print the version (e.g. the git hash) of the code being run, which can be very helpful when debugging issues.
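
As a sketch, the git-commit-id plugin can write a `git.properties` file onto the classpath at build time, and the help command could then read the commit hash from it. The plugin version below is illustrative:

```xml
<!-- Sketch: expose the current git hash as a classpath resource.
     Plugin version is illustrative, not a tested pin. -->
<plugin>
  <groupId>pl.project13.maven</groupId>
  <artifactId>git-commit-id-plugin</artifactId>
  <version>2.1.9</version>
  <executions>
    <execution>
      <goals>
        <goal>revision</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <generateGitPropertiesFile>true</generateGitPropertiesFile>
  </configuration>
</plugin>
```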

Add concordance analysis of ADAMGenotype

Perform comparison of ADAMGenotypes for the same samples to compute several concordance statistics. The simple version of this is a strict genotype-genotype comparison; a future version can attempt to identify identical variants otherwise represented differently.
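
The strict version can be sketched as a site-by-site comparison over the sites called in both sets. The types below are illustrative, not the ADAMGenotype schema:

```scala
// Sketch of strict genotype-genotype concordance between two call sets,
// keyed by (contig, position). Names are illustrative, not the ADAM API.
object Concordance {
  type Site = (String, Long) // (contig, position)

  // Fraction of sites called in both sets whose genotypes agree exactly.
  def concordance(a: Map[Site, String], b: Map[Site, String]): Double = {
    val shared = a.keySet intersect b.keySet
    if (shared.isEmpty) 0.0
    else shared.count(s => a(s) == b(s)).toDouble / shared.size
  }
}
```

The future version described above would relax the `a(s) == b(s)` test to first normalize variants that are represented differently (e.g. left-aligned vs. right-aligned indels).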

Add clipping heuristic to indel realigner

Currently, the indel realigner code unclips all reads when locally realigning. This is safe (we won't wind up with worse alignments), but we may be missing opportunities to realign some reads. I believe a clipping heuristic would improve this: for example, keeping a base clipped when it was previously clipped and has a low quality score.
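
That heuristic can be sketched as a per-base predicate. The quality threshold below is an assumption for illustration, not a value from the ADAM code base:

```scala
// Sketch of the suggested heuristic: only unclip a previously soft-clipped
// base when its quality is high enough to trust during realignment.
// The threshold is an assumed placeholder.
object ClipHeuristic {
  val MinQualityToUnclip = 20

  // Bases that were never clipped are always eligible; previously
  // clipped bases must clear the quality threshold.
  def shouldUnclip(wasClipped: Boolean, baseQuality: Int): Boolean =
    !wasClipped || baseQuality >= MinQualityToUnclip
}
```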

Change ADAMRecord to ADAMAlignedRead (or something similar)

The scope of work that's happening within the ADAM project has grown to encompass the entire pipeline of genomics data management, from unmapped short reads through to annotated variants. Within this new context, the name "ADAMRecord" is too generic. Names like "ADAMPileup" and "ADAMVariant" are more descriptive and useful for end users.

Alternatively, given that Avro supports namespaces, we may want to consider dropping the "ADAM" prefix for all of these types. That's a larger conversation, of course. I'm curious first to get opinions on the specific issue of "ADAMRecord".

AdamContext.adamVariantContextLoad depends on optional 'contig' header fields when loading a VCF.

When the AdamContext.adamVariantContextLoad method is called on a VCF file (rather than a Parquet path), the converter fails if the VCF header lacks the (optional) 'contig' fields. This is because the loader is unable to construct a proper SequenceDictionary for the records in the file first, so the contig lookup by name fails later.

The offending code starts here, in AdamContext.adamVcfLoad, where the SequenceDictionary is created:

    val vcfHeader = VCFHeaderReader.readHeaderFrom(seekable)
    val seqDict = SequenceDictionary.fromVCFHeader(vcfHeader)

later, in the 'convert' method of VariantContextConverter, we get

    val contigName: String = vc.getChr()
    val contigId: Int = sequenceDict(contigName).id
    val referencePosition = ReferencePosition(vc.getStart - 1, contigId)

But the second line errors out if the VCF lacks 'contig' fields in its header, as is the case for the dbSNP-produced 'clinvar' VCF, available here: http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/#clinvar

The VariantContextConverter.convert method should instead optionally create new records in the SequenceDictionary when none exist. This will mean handling the SequenceDictionary through a means other than loading-and-broadcasting.
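
A minimal sketch of that behavior, assuming a simple name-to-id mapping rather than the actual SequenceDictionary API:

```scala
import scala.collection.mutable

// Sketch of the proposed fix: a dictionary that assigns an id to a contig
// on first lookup instead of failing when the VCF header lacks 'contig'
// lines. Mutable state is why it can no longer simply be broadcast.
class LazySequenceDictionary {
  private val ids = mutable.LinkedHashMap[String, Int]()

  // Return the existing id, or register the contig with the next free id.
  def idFor(contigName: String): Int =
    ids.getOrElseUpdate(contigName, ids.size)
}
```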
