bio4j / bio4j-titan

Titan-specific bio4j implementation

Home Page: https://github.com/bio4j/bio4j

Shell 0.52% Scala 1.25% Java 98.23%

bio4j-titan's Introduction

Bio4j bioinformatics graph data platform

Bio4j is a bioinformatics graph data platform that integrates most of the data available in UniProt KB (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), NCBI Taxonomy, and the ExPASy Enzyme DB.

Bio4j provides a new framework for querying and managing protein-related information. A graph-based data model makes it possible to store and query data in a way that mirrors the data's own structure. Relational models, by contrast, must flatten the data into tables and create artificial IDs to connect the tuples, which can lead to domain models that have almost nothing to do with the actual structure of the data.

Project structure and overview

Bio4j can look a bit intimidating at first, with all those similarly named repositories; here is a guided tour:

bio4j/bio4j

In this repository bio4j/bio4j you will find the generic Bio4j model and API. Entities, relationships and their properties are modeled using a typed property graph model. For example, there are vertex types for Protein or GoTerm, and a GoAnnotation edge type going from Protein to GoTerm. This graph schema is separated into different graphs, corresponding to the different data sources (UniProt, Go, UniRef, ...) and connections between them (UniProtGo, UniProtUniRef, ...).

The API, based on bio4j/angulillos, lets you write generic typed traversals over this graph schema:

protein.uniref50Member_outV()
  .map(
    UniRef50Cluster::uniRef50Member_inV
  )
  .map(
    prts -> prts.map(
      Protein::goAnnotation_outV
    )
  );

which can later be executed on a particular backend. Generic data import code is also here, which can be used to load the data using any implementation of angulillos.

bio4j/angulillos

You can think of bio4j/angulillos as a strongly typed version of the property graph model. You can describe graph schemas and write generic traversals over them that are guaranteed to be well-typed, in the sense that, for example,

you cannot retrieve the outgoing edges of an edge
and you can get the tweets that a user tweeted, but not the users that a tweet follows!
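
To make these two guarantees concrete, here is a minimal Java sketch of the idea. It is not the angulillos API: `Vertex`, `Edge`, `User`, `Tweet`, `Tweeted` and `tweetsOf` are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified types; NOT the real angulillos API.
interface Vertex {}

// An edge type records its source and target vertex types as type parameters.
abstract class Edge<S extends Vertex, T extends Vertex> {
    final S source;
    final T target;
    Edge(S source, T target) { this.source = source; this.target = target; }
}

final class User implements Vertex {
    final String name;
    final List<Tweeted> tweeted = new ArrayList<>(); // outgoing Tweeted edges
    User(String name) { this.name = name; }
}

final class Tweet implements Vertex {
    final String text;
    Tweet(String text) { this.text = text; }
}

// A Tweeted edge can only go from a User to a Tweet.
final class Tweeted extends Edge<User, Tweet> {
    Tweeted(User user, Tweet tweet) { super(user, tweet); }
}

final class TypedGraphSketch {
    // Accepts only a User, so the result is always a list of Tweets.
    static List<Tweet> tweetsOf(User user) {
        List<Tweet> out = new ArrayList<>();
        for (Tweeted edge : user.tweeted) out.add(edge.target);
        return out;
    }

    // Helper that builds a user with a single tweet.
    static User userWithTweet(String name, String text) {
        User user = new User(name);
        user.tweeted.add(new Tweeted(user, new Tweet(text)));
        return user;
    }

    public static void main(String[] args) {
        User alice = userWithTweet("alice", "hello graphs");
        System.out.println(tweetsOf(alice).get(0).text);
        // tweetsOf(new Tweet("x"));  // would not compile: a Tweet is not a User
        // Edge exposes no outgoing edges, so "the edges of an edge" cannot be written.
    }
}
```

The ill-typed traversals are rejected by the compiler rather than failing at query time, which is the property the description above refers to.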

bio4j/bio4j-titan

In bio4j/bio4j-titan you will find a Titan-based Bio4j distribution. This is the default distribution, and we also provide the database binaries, with all data already loaded, through AWS S3. Go there if you want to stop reading and use Bio4j now!

bio4j/angulillos-titan

bio4j/angulillos-titan is an implementation of the angulillos API using Titan.

Documentation

General docs: docs/
Code docs: docs/src/
API Docs: v0.12.0

Community and contact

Gitter chat: the easiest and fastest way to contact developers and ask for help
bio4j-user Google group / mailing list
@bio4j Twitter
Bio4j LinkedIn

Licensing

Bio4j is an open source platform released under the AGPLv3 license.

bio4j-titan's Issues

file namings

I got the following error:

[error] E:\reps\bio4j-titan\src\main\java\com\bio4j\titan\model\uniprot\TitanUniprotGraph.java:21: error: class TitanUniProtGraph is public, should be declared in a file named TitanUniProtGraph.java
[error] public final class TitanUniProtGraph

when I was trying to build the code in the master branch of this repository. Is that already fixed somewhere, or should I do all this renaming from "Uniprot" to "UniProt"?

alternative product key is wrong

When initializing the alternative product type, the wrong key is used.

update to the new naming conditions in bio4j/bio4j

Making import faster

When the problem with the GI index (ohnosequences/bio4j-scala#18) was resolved, the import process became extremely slow (it was fast before only because it did nothing).

So I'm trying to improve it using TitanDB docs about Bulk Loading:

configuring graph:

conf.setProperty("autotype", "none");
conf.setProperty("storage.batch-loading", "true");
conf.setProperty("storage.buffer-size", "10000");
conf.setProperty("storage.write-attempts", "10");

trying to use blueprints BatchGraph somewhere

Problem when publishing from sbt

Hi!
I just got this error and I don't quite understand what is happening:

[warn] Merging 'META-INF\NOTICE.txt' with strategy 'rename'
[warn] Merging 'META-INF\NOTICE' with strategy 'rename'
[warn] Merging 'META-INF\LICENSE.txt' with strategy 'rename'
[warn] Merging 'META-INF\README.txt' with strategy 'rename'
[warn] Merging 'META-INF\LICENSE' with strategy 'rename'
[warn] Merging 'LICENSE' with strategy 'rename'
[trace] Stack trace suppressed: run last *:assembly for the full output.
error deduplicate: different file contents found in the following
:
[error] C:\Users\ppareja.ivy2\cache\org.neo4j\neo4j-kernel\jars\neo4j-kernel-1.
9.3.jar:META-INF/CHANGES.txt
[error] C:\Users\ppareja.ivy2\cache\org.neo4j\neo4j-lucene-index\jars\neo4j-luc
ene-index-1.9.3.jar:META-INF/CHANGES.txt
[error] C:\Users\ppareja.ivy2\cache\org.neo4j\neo4j-graph-algo\jars\neo4j-graph
-algo-1.9.3.jar:META-INF/CHANGES.txt
[error] C:\Users\ppareja.ivy2\cache\org.neo4j\neo4j-udc\jars\neo4j-udc-1.9.3.ja
r:META-INF/CHANGES.txt
[error] C:\Users\ppareja.ivy2\cache\org.neo4j\neo4j-cypher\jars\neo4j-cypher-1.
9.3.jar:META-INF/CHANGES.txt
[error] C:\Users\ppareja.ivy2\cache\org.neo4j\neo4j-jmx\jars\neo4j-jmx-1.9.3.ja
r:META-INF/CHANGES.txt
[error] Total time: 317 s, completed Nov 27, 2013 12:07:28 PM

Pablo

update to angulillos-titan 0.2.0

factor out type and index management to an interface

This is done the same way in all modules, and we should have a default (same as for confs, #50).

dependencies, methods and building graphs

Discussing this with @evdokim

at the abstract level, split everything into subprojects, with "linking" modules refining the corresponding types. Methods throw exceptions describing the missing module(s)
for implementations, the node has fields for all the graphs that it might need, and it extends all the possible abstract extensions. It has a constructor with all these fields, and you can build it step by step with builder methods. At the implementation site, you check for nulls (and throw the corresponding exception if needed)
Actually, you only need to put all those graph fields in the graph. The nodes and rels can get them through the corresponding fromXXX methods in their node/rel types.
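
As a rough illustration of the "builder methods plus null checks" idea, here is a minimal Java sketch. All names (`UniProtGraphStub`, `GoGraphStub`, `MissingModuleException`, `withGoGraph`) are hypothetical and are not taken from the bio4j-titan code.

```java
// Hypothetical names throughout; this is not code from bio4j-titan.
final class MissingModuleException extends RuntimeException {
    MissingModuleException(String module) {
        super("Required module not linked: " + module);
    }
}

final class GoGraphStub {
    final String name = "GO";
}

final class UniProtGraphStub {
    private GoGraphStub goGraph; // optional linked module; stays null until linked

    // Builder-style method: link a module and return this for chaining.
    UniProtGraphStub withGoGraph(GoGraphStub graph) {
        this.goGraph = graph;
        return this;
    }

    // Accessor used by nodes and relationships; fails with a descriptive error
    // instead of a bare NullPointerException when the module is missing.
    GoGraphStub goGraph() {
        if (goGraph == null) throw new MissingModuleException("GoGraph");
        return goGraph;
    }
}

final class ModuleLinkingSketch {
    public static void main(String[] args) {
        UniProtGraphStub linked = new UniProtGraphStub().withGoGraph(new GoGraphStub());
        System.out.println(linked.goGraph().name);
        try {
            new UniProtGraphStub().goGraph();
        } catch (MissingModuleException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the linked-graph fields on the graph object alone, as the last point suggests, means nodes and relationships only ever go through one checked accessor.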

So what's up with RefSeq in the end?

I just wanted to confirm that we will no longer be importing RefSeq.
If that's the case, I would still import genome elements together with their relationships with proteins, but storing only those referenced in the UniProt KB files.
What do you think?
@rtobes @eparejatobes

update to Titan 0.5?

@pablopareja this is almost done in bio4j/angulillos-titan#1. Given that Titan 0.5.x is going to be DB binary compatible with Titan 1.x (meaning that we can update it and use the same DB), what do you think about it?

update to angulillos-titan 0.1.0

Problem with indexes and special characters such as underscores

@eparejatobes I just committed a new Test class where you can see how the index can be created and the query executed in order to get hits with values containing underscores.
Here you go: 3e1c094

Please change in typed-graphs the respective part so that we can finally get this working! 😃

Exception when importing UniRef in current version

I'm getting the following exception:

SEVERE: null
java.lang.NullPointerException
        at com.bio4j.angulillos.titan.TitanTypedVertexIndex$Unique.name(TitanTypedVertexIndex.java:125)
        at com.bio4j.angulillos.titan.TitanTypedVertexIndex$DefaultUnique.<init>(TitanTypedVertexIndex.java:188)
        at com.bio4j.titan.model.uniref.TitanUniRefGraph.initIndices(TitanUniRefGraph.java:120)
        at com.bio4j.titan.model.uniref.TitanUniRefGraph.<init>(TitanUniRefGraph.java:85)
        at com.bio4j.titan.model.uniref.programs.ImportUniRefTitan.config(ImportUniRefTitan.java:43)
        at com.bio4j.titan.model.uniref.programs.ImportUniRefTitan.config(ImportUniRefTitan.java:37)
        at com.bio4j.model.uniref.programs.ImportUniRef.importUniRef(ImportUniRef.java:59)
        at com.bio4j.titan.model.uniref.programs.ImportUniRefTitan.execute(ImportUniRefTitan.java:52)
        at com.ohnosequences.util.ExecuteFromFile.main(ExecuteFromFile.java:66)
        at com.bio4j.titan.programs.ImportTitanDB.main(ImportTitanDB.java:8)

The exception is thrown when initializing the indices:

bio4j-titan/src/main/java/com/bio4j/titan/model/uniref/TitanUniRefGraph.java

Line 120 in 896194f

    
           uniRef100ClusterIdIndex =  new TitanTypedVertexIndex.DefaultUnique<>(mgmt,this, UniRef100Cluster().id);

I just can't find any difference from how things are initialized in other modules that have already been successfully imported...
@eparejatobes could you have a look at this in case you see something I'm not seeing?

Modelling external references in new version

Hi!

I have already started to implement the UniProt importer and I need to know how I should store cross references such as:

ArrayExpress
PIR
KEGG
EMBL
ENSEMBL
ENSEMBL plants
UNIGENE

Perhaps we could take advantage of the fact that we're implementing this now and make these references somewhat "first class", as we do with InterPro or Pfam; that is, storing the terms as nodes and linking the proteins to them instead of just adding the ID of the term as an attribute of the protein.
What do you think?
Also, please tell me if there are any important or useful references that weren't stored so far (such as PROSITE) that we could add now.

@eparejatobes @marina-manrique @epareja @rtobes I wait for your reply on this 😉

Where to put test module DBs?

Hi!

I already imported Gene Ontology using the latest version of typed-graphs in AWS. Where should I upload this so that we can all use/test it?

ciao!

🐙

Relationships/properties/nodes definition

Hi!

I just started to implement the definition of relationships in the program InitBio4jTitan, specifying cardinality and uniqueness consistency locks.
One thing I'm not clear about is what to do with property definitions. I'm afraid we may have to define all of them explicitly?? 😰 😱
Since we more or less decided to deactivate automatic type creation, I read that defining them explicitly is recommended.
What do you think?
Here's the link to the commit I'm referencing: d7faca9

Help needed to publish and run sbt from AWS

Hi!

It takes forever to publish the project with my internet connection, so I wanted to do everything from AWS and simply commit and push the code from my local machine.
Could you please help me out with what exactly I would need to configure for that?
@eparejatobes @laughedelic
Thanks!

🐙

make IDE integration possible

Is there any reason for using Titan 0.5.1 instead of 0.5.2 here?

@eparejatobes @laughedelic ❓

Improving performance of the importing process

Perhaps we should put in practice some of the recommendations here:

http://s3.thinkaurelius.com/docs/titan/current/bulk-loading.html

@eparejatobes What do you think?

S3 based data distribution

We need to have this in a minimally organized way.

titandb batch loading - 0.4.1

Found this on the user group, probably we should update to titan 0.4.1

https://groups.google.com/forum/#!searchin/aureliusgraphs/batch|sort:date/aureliusgraphs/dQCtU7cn-Hs/ETvdHDDYppAJ

Upgrading Isoforms to proteins?

I need some insight into how we are going to do this.
@rtobes @eparejatobes

Term or GoTerm?

I think it should rather be called GoTerm; otherwise it would be too unspecific, since there are other so-called terms, such as Reactome terms...

Tests

One stupid, but extremely useful test would be just

create a temp directory
initialize DB there

it will at least check that relationships and properties can be created and don't have name collisions etc.
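
A minimal Java sketch of such a smoke test follows. The `initialize` method is a hypothetical stand-in for the real DB initializer (it only records type names and rejects duplicates); the temp-directory handling uses the standard `java.nio.file.Files` API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SchemaSmokeTestSketch {
    // Hypothetical stand-in for the real initializer: registers every type name,
    // fails on a name collision, and writes a marker file into the DB directory.
    static int initialize(Path dbDir, List<String> typeNames) throws IOException {
        Set<String> seen = new HashSet<>();
        for (String name : typeNames) {
            if (!seen.add(name)) {
                throw new IllegalStateException("Name collision: " + name);
            }
        }
        Files.write(dbDir.resolve("schema.txt"), typeNames);
        return seen.size();
    }

    // Creates a temp directory, initializes the "DB" there, and cleans up afterwards.
    static int runInTempDir(List<String> typeNames) {
        try {
            Path dir = Files.createTempDirectory("bio4j-smoke-test");
            try {
                return initialize(dir, typeNames);
            } finally {
                Files.deleteIfExists(dir.resolve("schema.txt"));
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runInTempDir(Arrays.asList("Protein", "GoTerm", "GoAnnotation")));
    }
}
```

In the real test the body of `initialize` would be replaced by the actual schema initialization; the surrounding temp-directory scaffolding stays the same.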

Is this the right way to retrieve the String which represents the name of a property?

graph.goTermT.id.fullName()

What to do with old tax ids coming from merged.dmp file?

So far we have been indexing the taxonomic units by the old IDs present in the file merged.dmp. That, however, is no longer possible with the new version unless we store these IDs as an extra property.
What should I do?

@eparejatobes @rtobes
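
Whichever option is chosen, the old-to-new mapping has to be read first. Below is a minimal Java sketch, assuming the usual NCBI dump layout in which each merged.dmp line holds an old ID and a new ID separated by `<tab>|<tab>` and ends with `<tab>|`. The class and method names are hypothetical, and the IDs in `main` are purely illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MergedTaxIdsSketch {
    // Parses lines of the form: old_tax_id <tab>|<tab> new_tax_id <tab>|
    static Map<String, String> parseMerged(List<String> lines) {
        Map<String, String> oldToNew = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\\|");
            if (fields.length < 2) continue; // skip blank or malformed lines
            oldToNew.put(fields[0].trim(), fields[1].trim());
        }
        return oldToNew;
    }

    // Resolves an ID that may be obsolete to the current one; unknown IDs pass through.
    static String currentId(Map<String, String> oldToNew, String id) {
        return oldToNew.getOrDefault(id, id);
    }

    public static void main(String[] args) {
        // Illustrative input only, not real taxonomy data.
        Map<String, String> merged = parseMerged(Arrays.asList("111\t|\t222\t|", ""));
        System.out.println(currentId(merged, "111"));
        System.out.println(currentId(merged, "333"));
    }
}
```

With such a map, the importer could either store the old IDs as an extra property on the current taxon, or translate old IDs to current ones before querying the existing index.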

Problem when publishing

I'm trying to publish the project in order to test the importer for UniProt and I just ran into this:

[trace] Stack trace suppressed: run last *:assembly for the full output.
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] C:\Users\ppareja\.ivy2\cache\org.neo4j\neo4j-cypher-compiler-2.0\jars\neo4j-cypher-compiler-2.0-2.0.3.
jar:org/neo4j/cypher/internal/compiler/v2_0/ComponentVersion.class
[error] C:\Users\ppareja\.ivy2\cache\org.neo4j\neo4j-cypher-compiler-2.1\jars\neo4j-cypher-compiler-2.1-2.1.2.
jar:org/neo4j/cypher/internal/compiler/v2_0/ComponentVersion.class
[error] Total time: 81 s, completed Jul 14, 2014 1:12:53 PM

@eparejatobes @laughedelic
Any ideas?

I already deleted those folders in cache but I keep getting the same error...

Adequate config values for version using titan 0.5.2

Hi,

I'm having a look at the specification of the config parameters here, and at first sight I don't see any that could drastically reduce the time the import process takes ... #49

any ideas?? @eparejatobes @laughedelic

Review access modifiers

We have a lot of fields that I think should all be private.

Problem with angulillos-titan at branch update/angulillos/0.2

[info] Resolving com.thinkaurelius.titan#titan-berkeleyje;0.4.4 ...
[error] com.thinkaurelius.titan#titan-core;0.4.4 (needed by [com.thinkaurelius.titan#titan-berkeleyje;0.4.4])
conflicts with com.thinkaurelius.titan#titan-core;0.5.1 (needed by [bio4j#angulillos-titan;0.2.0-SNAPSHOT])
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) org.apache.ivy.plugins.conflict.StrictConflictException: com.thinkaurelius.titan#titan-core;0.4.4 (needed by [com.thinkaurelius.titan#titan-berkeleyje;0.4.4]) conflicts with com.thinkaurelius.titan#titan-core;0.5.1 (needed by [bio4j#angulillos-titan;0.2.0-SNAPSHOT])
[error] Total time: 21 s, completed Nov 12, 2014 4:49:02 PM
>

What version am I supposed to be using ?

0.4 release

For tracking the next release, and general discussion about it.

Problems with index querying

@eparejatobes
Please have a look at the last commit 36767c4
I need some help with getting a vertex after all this refactoring...

🐺

Multi-valued properties with new paradigm

How should they be modeled/implemented ?

first full data import for the new version

This has a really high priority now. @pablopareja could you report on what's the current state of it?

We want this with the updated Titan version, after #41

How to deal with interdependent modules

@eparejatobes have a look at 7679ae8
We need a way for each implemented module to hold a reference to all the modules it depends on, but without getting into infinite recursion... 😄
any ideas?
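
One possible approach, sketched here in Java under the assumption that modules can be constructed without their dependencies, is two-phase wiring: build every module first, then hand out the references. The module names below are hypothetical stubs, not the classes in the referenced commit.

```java
// Hypothetical module stubs; a sketch of two-phase construction, not bio4j-titan code.
final class UniProtModuleStub {
    private UniRefModuleStub uniRef;
    void link(UniRefModuleStub module) { this.uniRef = module; }
    UniRefModuleStub uniRef() { return uniRef; }
}

final class UniRefModuleStub {
    private UniProtModuleStub uniProt;
    void link(UniProtModuleStub module) { this.uniProt = module; }
    UniProtModuleStub uniProt() { return uniProt; }
}

final class ModuleWiringSketch {
    static UniProtModuleStub wire() {
        // Phase 1: construct every module with no dependencies at all.
        UniProtModuleStub uniProt = new UniProtModuleStub();
        UniRefModuleStub uniRef = new UniRefModuleStub();
        // Phase 2: link them; no constructor ever calls another constructor,
        // so the mutual references cannot recurse.
        uniProt.link(uniRef);
        uniRef.link(uniProt);
        return uniProt;
    }

    public static void main(String[] args) {
        UniProtModuleStub uniProt = wire();
        System.out.println(uniProt.uniRef().uniProt() == uniProt);
    }
}
```

The trade-off is that a module is briefly in an unlinked state, so accessors may need a null check with a descriptive exception.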

Possible issue with typed graphs methods for indices

I'm getting this exception when trying to get an element from an index using a value that is not in the index:

Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: Exception retrieving protein Q6GZV8
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: Index: 0, Size: 0
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: java.util.LinkedList.checkElementIndex(LinkedList.java:555)
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: java.util.LinkedList.get(LinkedList.java:476)
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: com.ohnosequences.typedGraphs.ElementIndex$Unique.getElement(ElementIndex.java:29)
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: com.ohnosequences.typedGraphs.NodeIndex$Unique.getNode(NodeIndex.java:33)
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: com.bio4j.titan.programs.ImportUniprotTitan.main(ImportUniprotTitan.java:301)
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: com.bio4j.titan.programs.ImportUniprotTitan.execute(ImportUniprotTitan.java:102)
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: com.ohnosequences.util.ExecuteFromFile.main(ExecuteFromFile.java:66)
Jul 15, 2014 9:52:06 AM com.bio4j.titan.programs.ImportUniprotTitan main
SEVERE: com.bio4j.titan.programs.ImportTitanDB.main(ImportTitanDB.java:11)

@eparejatobes ❓
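
The trace suggests that the unique lookup reads the first element of an empty result list. A minimal Java sketch of an alternative is shown below: the lookup returns an `Optional` so that a missing value is an ordinary result rather than an exception. `UniqueIndexSketch` is a hypothetical class backed by a plain map, not the typed-graphs implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for a unique index; not the typed-graphs API.
final class UniqueIndexSketch<V> {
    private final Map<String, V> byId = new HashMap<>();

    void put(String id, V value) {
        byId.put(id, value);
    }

    // Returns Optional.empty() for unknown ids instead of calling get(0) on an empty list.
    Optional<V> get(String id) {
        return Optional.ofNullable(byId.get(id));
    }

    public static void main(String[] args) {
        UniqueIndexSketch<String> proteins = new UniqueIndexSketch<>();
        proteins.put("P00001", "illustrative protein"); // illustrative accession
        System.out.println(proteins.get("P00001").isPresent());
        System.out.println(proteins.get("Q6GZV8").isPresent());
    }
}
```

Callers such as the importer can then decide explicitly what to do when the protein is absent, for example create it or log and continue.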

Push permissions

Give me some, pleeeease! 🙏

optimize titandb for our read-only scenario

relevant:

https://groups.google.com/forum/#!searchin/aureliusgraphs/batch|sort:date/aureliusgraphs/Se6KvNrgkgs/NyYBTmLwzQUJ

Bug regarding indices initialization

We need to change how indices are initialized in the current version since the following error is thrown when running ImportUniprot:

SEVERE: null
java.lang.IllegalArgumentException: The property key is already indexed with the same index name and incompatible characteristics
        at com.bio4j.angulillos.titan.TitanTypedVertexIndex$DefaultUnique.<init>(TitanTypedVertexIndex.java:205)
        at com.bio4j.titan.model.uniprot.TitanUniprotGraph.initIndices(TitanUniprotGraph.java:1066)
        at com.bio4j.titan.model.uniprot.TitanUniprotGraph.<init>(TitanUniprotGraph.java:463)
        at com.bio4j.titan.model.uniprot.programs.ImportUniprotTitan.config(ImportUniprotTitan.java:49)
        at com.bio4j.titan.model.uniprot.programs.ImportUniprotTitan.config(ImportUniprotTitan.java:37)
        at com.bio4j.model.uniprot.programs.ImportUniprot.importUniprot(ImportUniprot.java:151)
        at com.bio4j.titan.model.uniprot.programs.ImportUniprotTitan.execute(ImportUniprotTitan.java:58)
        at com.ohnosequences.util.ExecuteFromFile.main(ExecuteFromFile.java:66)
        at com.bio4j.titan.programs.ImportTitanDB.main(ImportTitanDB.java:8)

The exception is thrown when ImportUniprot is executed for the second time, in this case for the TrEMBL XML file.
We should do something equivalent to the createOrGet method for properties...
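
A minimal Java sketch of what a createOrGet for indices could look like follows. It uses a plain in-memory map in place of the Titan management API, so `IndexRegistrySketch` and its methods are hypothetical; only the idempotent behaviour is the point.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory registry; it stands in for the Titan management system.
final class IndexRegistrySketch {
    private final Map<String, String> keyByIndexName = new HashMap<>();
    private int creations = 0;

    // Creates the index on first use and returns the existing one afterwards,
    // so running the importer twice (Swiss-Prot, then TrEMBL) does not fail.
    String createOrGet(String indexName, String propertyKey) {
        String existing = keyByIndexName.get(indexName);
        if (existing == null) {
            keyByIndexName.put(indexName, propertyKey);
            creations++;
            return propertyKey;
        }
        if (!existing.equals(propertyKey)) {
            // A genuinely incompatible redefinition is still reported.
            throw new IllegalArgumentException(
                "Index " + indexName + " already exists on key " + existing);
        }
        return existing;
    }

    int creations() {
        return creations;
    }

    public static void main(String[] args) {
        IndexRegistrySketch registry = new IndexRegistrySketch();
        registry.createOrGet("proteinAccessionIndex", "accession"); // first import
        registry.createOrGet("proteinAccessionIndex", "accession"); // second import
        System.out.println(registry.creations());
    }
}
```

In the real code the lookup step would ask the backend whether the index already exists before attempting to build it.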

👾

It seems nice 😃