Comments (23)
Any idea on the estimated time for this to be completed? I can also try and help if I can get caught up to speed.
from ideogram.
I would like for Ideogram.js to have basic support for all eukaryotes that have suitable data before August.
Any help would be appreciated!
Development can be divided into two tasks: data retrieval and rendering. If you want to help, @ProjectProgramAMark, I would recommend trying the data retrieval task. I'll take care of rendering.
Data retrieval
Given an organism's scientific name, get a list of chromosomes in its genome and their length in nucleotide base pairs. Each chromosome's length in base pairs (bp) is proportional to its length in pixels (px) after rendering: chrLength(bp) ~ chrLength(px).
Implement the data retrieval using D3's xhr module such that no server-side code is required by developers using this library feature.
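As a rough illustration of the proportionality chrLength(bp) ~ chrLength(px), the sketch below scales base-pair lengths so that the longest chromosome fills a given pixel height. The names toPixelLengths and maxPx are hypothetical and are not part of Ideogram.js; the two lengths are those of P. falciparum chromosomes 1 and 14 from the sequence report in this comment.

```javascript
// Sketch only: toPixelLengths and maxPx are hypothetical names, not Ideogram.js API.
function toPixelLengths(chromosomes, maxPx) {
  // Longest chromosome in base pairs; it will be drawn at maxPx
  var maxBp = Math.max.apply(null, chromosomes.map(function (chr) {
    return chr.length;
  }));
  // Every other chromosome gets the same bp-to-px ratio
  return chromosomes.map(function (chr) {
    return {name: chr.name, px: chr.length / maxBp * maxPx};
  });
}

var scaled = toPixelLengths(
  [{name: "1", length: 643292}, {name: "14", length: 3291871}],
  400
);
console.log(scaled[1].px); // prints 400: the longest chromosome fills the height
```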
A draft dataflow for Plasmodium falciparum is below. Details are of course likely to change, but I think the gist below will work. If you would like to help on this, I would recommend implementing a function for this in JavaScript and D3 outside Ideogram.js before integrating it into the library.
Get best genome for organism
We want to find the best genome assembly for the input organism. To accomplish this, query NCBI Assembly database via EUtils esearch.
- Request:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=assembly&retmode=json&term=Plasmodium%20falciparum%20AND%20(%22latest%20refseq%22[filter])%20AND%20(%22chromosome%20level%22[filter])
(The term value will likely be refined over time, but this is a decent start.)
- Response:
{
"header": {
"type": "esearch",
"version": "0.3"
},
"esearchresult": {
"count": "1",
"retmax": "1",
"retstart": "0",
"idlist": [
"360518"
],
...
Parse the first element from the idlist key of the esearch JSON response, e.g. 360518.
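The esearch step might be sketched as two small helpers: one builds the request URL from a scientific name, the other pulls the first UID out of idlist. Both function names are hypothetical. The sketch uses encodeURIComponent, which also percent-encodes the square brackets; that should be equivalent to the URL above, though I have not verified it against the live service.

```javascript
// Sketch only: buildEsearchUrl and firstAssemblyUid are hypothetical helper names.
var EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

// Build the esearch request URL for an organism's scientific name
function buildEsearchUrl(organismName) {
  var term = organismName +
    ' AND ("latest refseq"[filter]) AND ("chromosome level"[filter])';
  return EUTILS + "/esearch.fcgi?db=assembly&retmode=json&term=" +
    encodeURIComponent(term);
}

// Return the first UID in idlist, or null if the search found nothing
function firstAssemblyUid(esearchJson) {
  var ids = esearchJson.esearchresult.idlist;
  return ids.length > 0 ? ids[0] : null;
}

// Abbreviated copy of the esearch response shown above
var esearchSample = {esearchresult: {count: "1", idlist: ["360518"]}};
console.log(firstAssemblyUid(esearchSample)); // prints 360518
```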
Resolve that internal identifier to a public identifier -- the assembly's RefSeq accession -- via EUtils esummary as follows.
- Request:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=assembly&retmode=json&id=360518
- Response:
{
"header": {
"type": "esummary",
"version": "0.3"
},
"result": {
"uids": [
"360518"
],
"360518": {
"uid": "360518",
"rsuid": "360518",
"gbuid": "256198",
"assemblyaccession": "GCF_000002765.3",
"lastmajorreleaseaccession": "GCF_000002765.3",
"chainid": "2765",
"assemblyname": "ASM276v1",
Parse the value of assemblyaccession from the esummary JSON response, e.g. GCF_000002765.3 above.
The RefSeq accession represents the "best" genome assembly for the organism, or, more precisely, an assembly which should have sufficient data for the organism's chromosome complement.
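A minimal sketch of reading the esummary response, assuming the record is keyed by the UID listed in uids as in the sample above. The helper name parseAssemblySummary and the shape of its return value are hypothetical.

```javascript
// Sketch only: parseAssemblySummary is a hypothetical helper name.
function parseAssemblySummary(esummaryJson) {
  var result = esummaryJson.result;
  var uid = result.uids[0];  // the UID that was requested
  var record = result[uid];  // its summary record is keyed by that UID
  return {
    accession: record.assemblyaccession,
    name: record.assemblyname,
    rsuid: record.rsuid
  };
}

// Abbreviated copy of the esummary response shown above
var esummarySample = {
  result: {
    uids: ["360518"],
    "360518": {
      uid: "360518",
      rsuid: "360518",
      assemblyaccession: "GCF_000002765.3",
      assemblyname: "ASM276v1"
    }
  }
};
console.log(parseAssemblySummary(esummarySample).accession); // prints GCF_000002765.3
```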
Get chromosomes for genome
Now that we know the organism's best genome assembly, we can get a list of its chromosomes and their length.
Using the assembly RefSeq accession obtained from the previous step, get its full sequence report.
- Request:
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt
- Response:
...
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
1 assembled-molecule 1 Chromosome AL844501.1 = NC_004325.1 Primary Assembly 643292 na
2 assembled-molecule 2 Chromosome AE001362.1 = NC_000910.2 Primary Assembly 947102 na
3 assembled-molecule 3 Chromosome AL844502.1 = NC_000521.3 Primary Assembly 1060087 na
4 assembled-molecule 4 Chromosome AL844503.1 = NC_004318.1 Primary Assembly 1204112 na
5 assembled-molecule 5 Chromosome AL844504.1 = NC_004326.1 Primary Assembly 1343552 na
6 assembled-molecule 6 Chromosome AL844505.1 = NC_004327.2 Primary Assembly 1418244 na
7 assembled-molecule 7 Chromosome AL844506.2 = NC_004328.2 Primary Assembly 1501717 na
8 assembled-molecule 8 Chromosome AL844507.2 = NC_004329.2 Primary Assembly 1419563 na
9 assembled-molecule 9 Chromosome AL844508.1 = NC_004330.1 Primary Assembly 1541723 na
10 assembled-molecule 10 Chromosome AE014185.2 = NC_004314.2 Primary Assembly 1687655 na
11 assembled-molecule 11 Chromosome AE014186.2 = NC_004315.2 Primary Assembly 2038337 na
12 assembled-molecule 12 Chromosome AE014188.3 = NC_004316.3 Primary Assembly 2271478 na
13 assembled-molecule 13 Chromosome AL844509.2 = NC_004331.2 Primary Assembly 2895605 na
14 assembled-molecule 14 Chromosome AE014187.2 = NC_004317.2 Primary Assembly 3291871 na
MT assembled-molecule MT Mitochondrion na <> NC_002375.1 non-nuclear 5967 na
Here Sequence-Name is the chromosome name and Sequence-Length is the chromosome's length in base pairs. Split those rows by tab, parse the name and length values for each chromosome, and put them into an array of objects as shown in the example output in the following section.
Example input and output
- Input:
// Implement getChromosomes() function that takes scientific name as an argument
getChromosomes("Plasmodium falciparum")
- Output:
// Array of objects with basic data on all chromosomes in Plasmodium falciparum
[
{"name": "1", "length": 643292},
{"name": "2", "length": 947102},
...
{"name": "MT", "length": 5967}
]
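The tab-splitting step could look roughly like the sketch below. The function name parseAssemblyReport is hypothetical, and keeping only rows whose Sequence-Role is assembled-molecule is my assumption about how to skip any unplaced scaffolds; the column positions follow the header row shown above.

```javascript
// Sketch only: parseAssemblyReport is a hypothetical helper name.
function parseAssemblyReport(reportText) {
  var chromosomes = [];
  reportText.split(/\r?\n/).forEach(function (line) {
    // Skip blank lines and "#" header/comment lines
    if (line === "" || line.charAt(0) === "#") return;
    var cols = line.split("\t");
    // Assumption: only assembled molecules are wanted
    if (cols[1] !== "assembled-molecule") return;
    // Column 0 is Sequence-Name, column 8 is Sequence-Length
    chromosomes.push({name: cols[0], length: parseInt(cols[8], 10)});
  });
  return chromosomes;
}

// Three rows copied from the report above, with tabs restored
var reportSample = [
  "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tAssigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\tRefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name",
  "1\tassembled-molecule\t1\tChromosome\tAL844501.1\t=\tNC_004325.1\tPrimary Assembly\t643292\tna",
  "MT\tassembled-molecule\tMT\tMitochondrion\tna\t<>\tNC_002375.1\tnon-nuclear\t5967\tna"
].join("\n");

console.log(JSON.stringify(parseAssemblyReport(reportSample)));
// prints [{"name":"1","length":643292},{"name":"MT","length":5967}]
```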
from ideogram.
@eweitz, I'm having a bit of trouble with sending the request to get the full sequence report using the assembly RefSeq accession (for ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt). It's returning the following error:
XMLHttpRequest cannot load https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:3000' is therefore not allowed access.
I'm pretty sure this is a CORS problem, but I'm not sure how to get around that using only d3js. I'm running the test environment off of a NodeJS server, and am receiving the same error when I run it on Apache. I have uploaded a repo of my test environment here.
from ideogram.
@ProjectProgramAMark , I sent you this pull request, with some responses.
from ideogram.
Thanks @Klortho. The only thing I'm unsure about is that @eweitz specified he didn't want any server code being used in this feature, so I wasn't sure whether getting around CORS was only a temporary fix for my problem that didn't serve the bigger picture. I went ahead and merged your pull request though.
from ideogram.
Oh, right. Well, this is purely in the transport layer -- nothing to do with D3.
from ideogram.
@ProjectProgramAMark, can you try the following workaround? It required much sleuthing to determine, but the method described below gets all data from EUtils, and thus should avoid the CORS issue.
I quickly checked via browsing EUtils API results that the following approach using the little-known GenColl database works straightforwardly not only for Plasmodium falciparum, but also for Homo sapiens and Drosophila melanogaster, unlike several other approaches I tried with the better-known databases Assembly and BioProject.
Get chromosomes for genome, CORS workaround
Parse the value of rsuid from the esummary JSON response, e.g. 360518 in the "Get best genome for organism" section of my previous comment.
(Data recap: the rsuid 360518 is the internal RefSeq UID for the RefSeq genome assembly GCF_000002765.3, i.e. ASM276v1, the latest chromosome-level RefSeq assembly for organism Plasmodium falciparum.)
Get a list of chromosome UIDs linked to Nucleotide (nuccore) from GenColl database for genome assembly 360518:
- Request:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?retmode=json&db=nuccore&linkname=gencoll_nuccore_chr&from_uid=360518
- Response:
{
"header": {
"type": "elink",
"version": "0.3"
},
"linksets": [
{
"dbfrom": "pubmed",
"ids": [
360518
],
"linksetdbs": [
{
"dbto": "nuccore",
"linkname": "gencoll_nuccore_chr",
"links": [
296005645,
296005143,
296004920,
258549241,
258549170,
258549151,
258549100,
86176855,
23957709,
23613523,
23613362,
23613028,
23593254,
23509994,
11466244
]
}
]
}
]
}
Parse links, and join the elements of that array into a comma-delimited string (e.g. ids = links.join(",")).
Pass that string of chromosome UIDs into the id parameter of an ESummary call to the Nucleotide database.
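A sketch of those two steps follows. The helper names are hypothetical, and the esummary URL is an assumption modeled on the assembly esummary request from my previous comment with db switched to nuccore; the thread does not show that request explicitly.

```javascript
// Sketch only: both helper names are hypothetical.
var EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

// Find the gencoll_nuccore_chr link set and return its UIDs (empty if absent)
function chromosomeUidsFromElink(elinkJson) {
  var linksetdbs = elinkJson.linksets[0].linksetdbs || [];
  for (var i = 0; i < linksetdbs.length; i++) {
    if (linksetdbs[i].linkname === "gencoll_nuccore_chr") {
      return linksetdbs[i].links;
    }
  }
  return [];
}

// Assumed URL shape: same esummary endpoint as before, with db=nuccore
function buildNuccoreEsummaryUrl(uids) {
  return EUTILS + "/esummary.fcgi?db=nuccore&retmode=json&id=" + uids.join(",");
}

// Abbreviated copy of the elink response shown above (first two links only)
var elinkSample = {
  linksets: [{
    ids: [360518],
    linksetdbs: [{
      dbto: "nuccore",
      linkname: "gencoll_nuccore_chr",
      links: [296005645, 296005143]
    }]
  }]
};
var uids = chromosomeUidsFromElink(elinkSample);
console.log(uids.join(",")); // prints 296005645,296005143
```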
...
"result": {
"uids": [
"296005645",
"296005143",
"296004920",
"258549241",
"258549170",
"258549151",
"258549100",
"86176855",
"23957709",
"23613523",
"23613362",
"23613028",
"23593254",
"23509994",
"11466244"
],
"296005645": {
"uid": "296005645",
"caption": "NC_004331",
"title": "Plasmodium falciparum 3D7 chromosome 13",
"extra": "gi|296005645|ref|NC_004331.2||gnl|NCBI_GENOMES|103",
"gi": 296005645,
"createdate": "2002/10/03",
"updatedate": "2010/07/29",
"flags": 512,
"taxid": 36329,
"slen": 2895605,
"biomol": "genomic",
"moltype": "dna",
"topology": "linear",
"sourcedb": "refseq",
"segsetsize": "",
"projectid": "148",
"genome": "chromosome",
"subtype": "chromosome",
"subname": "13",
"assemblygi": 225631926,
"assemblyacc": "AL844509",
"tech": "",
"completeness": "",
"geneticcode": "1",
"strand": "",
"organism": "Plasmodium falciparum 3D7",
"strain": "",
"statistics": [
{
"type": "Length",
"count": 2895605
Here subname is the chromosome name and slen is the chromosome's length in base pairs. Iterate over each key-value pair in result (skip uids), parse the name and length values for each chromosome, and put them into an array of objects as shown in the "Example input and output" section in my previous comment.
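One way to sketch that iteration is below. Rather than looping over every key of result and skipping uids, it walks the uids array and looks up each record, which should visit the same records while keeping the response order. The function name is hypothetical, and the sample contains only the one record shown above.

```javascript
// Sketch only: parseChromosomeSummaries is a hypothetical helper name.
function parseChromosomeSummaries(esummaryJson) {
  var result = esummaryJson.result;
  // Walking uids visits every record and never touches the "uids" key itself
  return result.uids.map(function (uid) {
    var record = result[uid];
    return {name: record.subname, length: record.slen};
  });
}

// Abbreviated copy of the nuccore esummary response shown above
var nuccoreSample = {
  result: {
    uids: ["296005645"],
    "296005645": {
      uid: "296005645",
      title: "Plasmodium falciparum 3D7 chromosome 13",
      slen: 2895605,
      subname: "13"
    }
  }
};
console.log(JSON.stringify(parseChromosomeSummaries(nuccoreSample)));
// prints [{"name":"13","length":2895605}]
```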
Notes:
- subname may require additional parsing to get the canonical, human-friendly chromosome name. Example: Drosophila chromosomes -- the expected name of the first chromosome result there is "3L", but its subname value has lots of noise.
- Don't worry about chromosome order or MT (mitochondrial DNA). I can take care of chromosome order unless you are especially interested. Drosophila chromosomes are an example again of a complication -- they are ordered X, 2L, 2R, 3L, 3R, 4, Y, MT. C. elegans chromosomes are ordered I, II, III, IV, X, MT.
from ideogram.
Thanks for the pull request, @Klortho, but @ProjectProgramAMark is correct: I would really prefer this feature to require no server-side code or configuration. This feature should work with a primitive, traditional web server stack, e.g. on a static web page served by Apache or Nginx.
I wonder if the lack of an Access-Control-Allow-Origin HTTP header in https://ftp.ncbi.nlm.nih.gov is a matter of security, or if it's something that simply has not been implemented yet. I suspect it's the latter. Given the proliferation of client-side API calls and unique information available via FTP -- e.g. full sequence reports like https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt are the only machine-friendly resources I know that provide expected chromosome ordering, and conveniently include chromosome name and length -- I think eliminating the CORS restriction on NCBI's side would be widely beneficial.
from ideogram.
@ProjectProgramAMark, if you are still interested in this, please let me know. Otherwise I will begin wiring up data retrieval for this in about a week.
from ideogram.
@eweitz Where would be the ideal place/file to place this script in?
from ideogram.
@ProjectProgramAMark, what you have in d3test/public/index.html is a good start. I recommend continuing that path, perhaps in a fork of this repository. Once you can return names and lengths for chromosomes given only a scientific name like Plasmodium falciparum -- i.e. without hardcoded intermediate values like rsuid "360518" -- ping me and we'll proceed from there.
Ideally, your data retrieval function will ultimately be a single method getChromosomes(organismName) in src/js/ideogram.js, e.g.:
/**
 * Returns names and lengths of chromosomes for an organism's best-known genome assembly
 */
Ideogram.prototype.getChromosomes = function(organismName) {
  // Your data retrieval code
};
But let's take this one step at a time. First get a working independent function, then we'll integrate this into src/js/ideogram.js. Integrating will require getting familiar with Ideogram's complex initialization logic, which I can help with when we get there.
Once you have a standalone function getChromosomes(organismName) working without hardcoded values and returning something like the example output from my 6/29 comment, please let me know.
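For orientation, a standalone getChromosomes(organismName) that chains the EUtils calls from the CORS workaround might look roughly like the sketch below. It is not the function that was eventually written for Ideogram. The second parameter fetchJson is a hypothetical injection point (URL in, promise of parsed JSON out) that defaults to the standard fetch API instead of D3's xhr module, and the nuccore esummary URL is an assumption modeled on the assembly esummary request.

```javascript
// Sketch only: fetchJson is a hypothetical injectable function.
var EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

function defaultFetchJson(url) {
  return fetch(url).then(function (response) { return response.json(); });
}

function getChromosomes(organismName, fetchJson) {
  fetchJson = fetchJson || defaultFetchJson;
  var term = organismName +
    ' AND ("latest refseq"[filter]) AND ("chromosome level"[filter])';

  // Step 1: find the best assembly's internal UID
  return fetchJson(EUTILS + "/esearch.fcgi?db=assembly&retmode=json&term=" +
      encodeURIComponent(term))
    .then(function (esearch) {
      var ids = esearch.esearchresult.idlist;
      if (ids.length === 0) {
        throw new Error("No assembly found for " + organismName);
      }
      // Step 2: resolve that UID to the assembly's rsuid
      return fetchJson(EUTILS + "/esummary.fcgi?db=assembly&retmode=json&id=" + ids[0])
        .then(function (summary) { return summary.result[ids[0]].rsuid; });
    })
    .then(function (rsuid) {
      // Step 3: get chromosome UIDs linked from GenColl to nuccore
      return fetchJson(EUTILS + "/elink.fcgi?retmode=json&db=nuccore" +
        "&linkname=gencoll_nuccore_chr&from_uid=" + rsuid);
    })
    .then(function (elink) {
      // Step 4: summarize all chromosome records in one call (assumed URL shape)
      var links = elink.linksets[0].linksetdbs[0].links;
      return fetchJson(EUTILS + "/esummary.fcgi?db=nuccore&retmode=json&id=" +
        links.join(","));
    })
    .then(function (summary) {
      // Step 5: keep only name and length for each chromosome
      return summary.result.uids.map(function (uid) {
        var record = summary.result[uid];
        return {name: record.subname, length: record.slen};
      });
    });
}
```

Passing a stub as fetchJson lets the chain be exercised with canned responses, without touching the network.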
from ideogram.
@eweitz my script should be working now. Pull my repo, cd into the directory, run "npm start" and click on the button.
from ideogram.
@ProjectProgramAMark, your data retrieval code looks good so far! Thanks for those instructions. I verified that your current getChromosomes(organismName) function gets chromosome lengths and names for Plasmodium falciparum.
I think the next step is to begin integrating your function into this library, i.e. src/js/ideogram.js.
Please:
- Fork this repo
- Create a new branch labeled eukaryotes-data-retrieval
- Paste your function into src/js/ideogram.js
- Modify your function's signature to add getChromosomes as an instance method of Ideogram; see outline in my previous comment.
- Open a pull request to merge your forked ideogram repo's eukaryotes-data-retrieval branch into this repo's master branch
If you're feeling ambitious, try calling ideogram.getChromosomes("Plasmodium falciparum") after you install your method on Ideogram's prototype. But don't worry if it doesn't work, or if Ideogram is giving you trouble installing. The main goal is to get a pull request with more or less your current getChromosomes function open, and begin a code review. I can take care of any integration trouble.
In the code review, I'll recommend and make various code updates. Your current function will undergo some transformations, but its essence looks roughly OK at an initial glance.
from ideogram.
@eweitz ok done! Let me know what you want to change.
from ideogram.
@eweitz any time estimation on when this will be finished?
from ideogram.
I hope to have a basic version of this available in a week or two.
from ideogram.
@eweitz any progress?
from ideogram.
@ProjectProgramAMark, yes, slowly but steadily. Integrating the data retrieval code into the larger Ideogram library turned out to require major work that cut across more aspects of Ideogram than expected. See #54 for details.
I'm now beginning on the rendering task of this feature. I will ping you when it's done.
Update: @ProjectProgramAMark, this is done -- see e.g. https://eweitz.github.io/ideogram/eukaryotes.html?org=plasmodium-falciparum. Thanks again for your help!
from ideogram.
@mrouard, @ProjectProgramAMark, rough basic rendering of some arbitrary eukaryotic genomes is in place in the development branch https://github.com/eweitz/ideogram/tree/render-eukaryotic-chromosomes.
That branch can retrieve chromosome data for e.g. Plasmodium falciparum (malaria parasite), Caenorhabditis elegans (worm) and Musa acuminata (banana) given only the organism's scientific name. See examples/worm.html in the render-eukaryotic-chromosomes branch for an example of how this feature will look at the app-developer level.
I'll comment here with a progress update within a week.
from ideogram.
The rendering of eukaryotic chromosomes is significantly better than it was last week. I've also replaced the worm.html example with something more expansive, examples/eukaryotes.html.
My next task is to fix the failing automated test suite. As before, the place to follow progress is https://github.com/eweitz/ideogram/tree/render-eukaryotic-chromosomes.
from ideogram.
This feature is done. Support for eukaryotes can be found at:
https://eweitz.github.io/ideogram/eukaryotes.html
from ideogram.
Looks great @eweitz
Looking at some examples, there seems to be a display bug:
https://eweitz.github.io/ideogram/eukaryotes.html?org=arabidopsis-thaliana
There is a very small additional chromosome. The same happens for maize, rice and grape.
from ideogram.
Thanks for noting that problem, @mrouard. I've opened issue #56 to address it.
from ideogram.