Code Monkey home page Code Monkey logo

genomic-features's Introduction

genomic-features

Tests Documentation codecov PyPI

Genomic annotations using BioConductor resources in Python.

Getting started

Check out the documentation, in particular the basic tutorial and API docs.

Installation

You need to have Python 3.9 or newer installed on your system. If you don't have Python installed, we recommend installing Mambaforge.

There are several alternative options to install genomic-features:

Development install

To install the latest development version:

pip install git+https://github.com/scverse/genomic-features.git@main

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Citation

t.b.a

genomic-features's People

Contributors

emdann avatar gamazeps avatar ivirshup avatar lauradmartens avatar pre-commit-ci[bot] avatar scverse-bot avatar thomas-reimonn avatar vmuckerson avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

genomic-features's Issues

Example where we directly connect to the ensembl servers

Description of feature

It would be nice to have an example where we directly connect to the ensembl mysql servers to run a query. Especially for the biomart version, much more information should be available.

It may also be important to note that users may need to interact with the servers through ibis to be able to make full use of the extra information.

Transcription Factor column

Description of feature

Hi, this is a really useful tool. A simple but useful addition would be to annotate whether a gene/transcript is a Transcription Factor or not.

At the moment I do this using the table (attached) from this paper . This is human specific. There are probably other dbs for other species, maybe ensembl?

Filtering the genes output using the description is not capturing this well. For instance,genes[genes.description.str.contains('transcription', na=False, case=False)] gives me 705 genes, whereas filtering the table I attached by Is TF? == 'Yes' gives me over 1600 genes.

1-s2.0-S0092867418301065-mmc2.xlsx

User guide section: filters

Description of feature

It would be good to have a user guide section with a brief explanation of how filters can be combined.

TSS/ promoter regions

Description of feature

Retrieve transcription start sites for genes for use with ATAC data.

This could be similar to the GenomicFeatures::promoters function (description available in this vignette) which:

The promoters function computes a GRanges object that spans the promoter region around the transcription start site for the transcripts in a TxDb object. The upstream and downstream arguments define the number of bases upstream and downstream from the transcription start site that make up the promoter region.

This could be done using bioframe.expand (though that is currently not strand aware: open2c/bioframe#144). This could instead by done with ibis with: ifelse

Defining the main genomic coordinates for each table

Description of feature

For each table available via EnsemblDb (e.g. genes, promoters) we could rename the primary genomic coordinates to the format required by bioframe (e.g. chrom, start, end). This would (a) mimic the behaviour in bioconductor (where ranges make the IRange column) and (b) allow selecting columns with additional coordinates in the ibis query (e.g. add tx_seq_start column when running genes #9 ).

`list_columns` includes columns from `"metadata"` and `"chromosome"` table

Report

import genomic_features as gf

ensdb = gf.ensembl.annotation(species="Hsapiens", version="108")
ensdb.list_columns()
['seq_name',
 'seq_length',
 'is_circular',
...
 'name',
 'value',
...
]

I think we shouldn't include columns from these tables in the output of list_columns, since we can't join on them in the genes, transcripts, and exons functions. To me, this function tells you what possible values are for the cols argument, which I don't think these satisfy.

What do you think @lauradmartens?

Version information

No response

Providing access to protein annotations and general note

Description of feature

Dear developers! Great work you're doing!

Just to introduce myself: I'm the developer of the ensembldb package and am maintaining and adding new EnsDb databases to Bioconductor's AnnotationHub for each new Ensembl release.

As a general note: the EnsDbs would also provide protein annotations (amino acid sequences, (functional) protein domains and some mapping to Uniprot identifiers). These things are not general GenomicFeatures features, but more specific to the ensembldb package - and it allows also to map directly between positions within the protein to the transcript to the genome (and vice versa).

Also, please don't hesitate to ask if something about the EnsDb database layouts is unclear or if you have feature requests.

add pre-commit CI

Description of feature

Just noticed pre-commit isn't running on PRs, should add that

How tos

Description of feature

  • Find closet peak to each gene in anndata
  • Are these ATAC peaks intronic/ exonic? (intronic = overlaps with gene, but not exon)

annotate_anndata/ annotate_table

Description of feature

Helper function to join an annotation table on a existing table. E.g. add genomic ranges to an anndata var

This should do checks on thinks like uniqueness of the join key in both tables.

Because this is surprisingly hard to get right in pandas.

Discoverability of features

Description of feature

How do we make it clear which:

  • Annotation columns
  • Filters
  • Genomes

Are available to be used. GenomicFeatures/ ensembldb do this with a combination of functions for listing available features and pre-defined classes that make these available.

Examples include:

  • Available columns: accessor functions like transcripts, axons
  • Filters: GeneIdFilter, CoordinateFilter classes
  • Genomes: listEnsDbs

Do we copy this design? The filter classes may be quite a lot of effort for our first draft.


To implement:

  • Filters (gf.filters namespace + doc-site)
  • Annotation columns (list_columns, list_tables) (#22)
  • Genomes (gf.ensembl.list_ensdb_annotations) (#21)

Add functionality to join three tables that are partially related

Description of feature

At the moment it is not possible to join two tables where an intermediate join is necessary for common column. An example would be merging the gene table, the protein table requires joining on the transcript table first. One solution could be to have mapping which table is needed for each join.

Breaking changes to ibis

Report

New ibis release breaks the core gene function (see also in CI)

ensdb = genomic_features.ensembl.annotation(species="Hsapiens", version="108")
genes = ensdb.genes()

throws

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/genomic_features/ensembl/ensembldb.py", line 123, in genes
    query = self.db.table("gene").filter(filter.convert())
  File "/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 2486, in filter
    raise com.IbisInputError("You must pass at least one predicate to filter")
ibis.common.exceptions.IbisInputError: You must pass at least one predicate to filter

Quick fix: downgrading to ibis-framework[sqlite, duckdb]==8.0.0

filter argument

Description of feature

genes, transcripts, etc. should have a filter argument. We should be able to pass filter objects (as defined in #6) here that reduce the entries returned.

We could also allow passing ibis expressions here (at least at first, until all filter classes are defined)

What order should results be in?

Description of feature

Continuing from #59 (comment)

In the mentioned PR, we found out that duckdb is inconsistent with what order it returns results in. This by and large seems fine from a duckdb point of view, but would be frustrating for users of this package.

To solve this @thomas-reimonn added an order_by statement here

Right now the order returned is based on the order of columns in the user input (I believe). Is there a better/ more canonical way to do this? Off the top of my head I would assume we'd generally want to sort by chrom and start.

We should also check what the bioc packages do here.

@nvictus, @emdann, @lauradmartens any thoughts?

Finalizing names

Description of feature

We should finalize naming for things. I'm starting a list here of things to resolve

  • GeneRangesFilter -> GeneRangeFilter (since it only takes a single range)
  • CanonicalFilter -> ???
  • genes(cols=...) -> genes(columns=...) โ€“ Just to be more consistent. Could go the other way and use cols everywhere.
  • What are the canonical strand names? +, - I assume?

Pushing to next release:

  • GeneRangeFilter + TxRangeFilter + ExonFilter or all one RangeFilter as done in bioconductor?
  • gf.ensembl.annotation Do we still want to call this an annotation? I named the function when the package was called genomic-annotations

Conversion of seqlevel styles

Description of feature

Add functionality that allows translation between different chromosome sequence naming conventions (e.g., "chr1" versus "1").

This could be similar to the seqlevelsStyle function in the R package GenomeInfoDb :

seqlevelsStyle(gr_obj) = "UCSC"

Test against duckdb

This is now resolved, and we are discussing other performance topics below.

Description of feature

From past experience, it seems that operations in this library can be sped up quite a bit by using duckdb as the backend. Conveniently, we directly work with sqlite files while using the duckdb backend.

We should run tests with the duckdb backend and document how users can select it.

`return_type`

Description of feature

It would be nice to have a return_type argument that allowed results other than a pandas dataframe. Currently, I think that pyarrow.Table, ibis.Table, and pd.DataFrame (defaulting to pd.DataFrame would be a good set of options.

This would look like:

ensdb.genes(return_type=ibis.expr.types.Table)

(I generally like the idea of passing a type instead of a string literal here, but I see how the ibis case is unwieldy)

Advantages

Allows us to work with a few use cases:

  • pd.DataFrame: useful for people who don't even need to know they are using SQL. Very much the default dataframe type
  • ibis Table: lazy representation for those who explicitly want to work with the SQL
  • arrow In between, or if you want to pass it to some other library! I believe this also has potential to be lazy via chunks.

Disadvantages

Mainly complexity. Also keeping data types consistent.

It would probably be pretty easy to start with only supporting pandas and ibis, where the ibis case is just not running .execute()

Better downloads

Description of feature

Currently we grab data from bioconductor's AnnotationHub. This is very convenient, but we're grabbing uncompressed sqlite files. When bioc distributes these files as packages, they are gzipped. It turns out this data is very compressible and can be shrunk considerably.

I've had previous discussion with the bioc team if they could provide compressed versions of these files on annotation hub. This issue is tracking that.

columns argument (plus joins)

Description of feature

Getting some columns requires doing a join. For example, getting protein info for transcripts from ensembldb:

transcripts(edb, filter = GeneNameFilter("ZBTB16"),
           columns = c("protein_id", "uniprot_id", "tx_biotype"))

We should be able to do this as well. This requires being able to figure out which table a column is in, then joining on that column.

Package name

Description of feature

I've used genomic-annotations for now, since it's descriptive and seems available on pypi. But no need to keep it, and would be nice to have something shorter.

Lift-over between genome versions

Description of feature

Feature description: easily converting genomic coordinates of annotations between genome assemblies (or easy access to download gene model for different assembly)

In Bioconductor this is implemented in the liftOver function in rtracklayer, which works with GRanges objects and chain files from UCSC

library(rtracklayer)
path = system.file(package="liftOver", "extdata", "hg38ToHg19.over.chain")
ch = import.chain(path)
seqlevelsStyle(gr_obj) = "UCSC"  # necessary
cur19 = liftOver(gr_obj, ch)

Possible solutions

  • Specify Ensembl release with genome assembly of interest (to check: is there a specific release for a genome assembly? I don't think so)
  • Interact with Open2C people

Get sequences

Description of feature

Get sequences for a range. Maybe via genomepy, maybe bioframe does this

More filters

(Expand this list)

  • Genomic range filter
  • IsCanonical
  • EntrezFilter (column: entrez)
  • ExonEndFilter (column: exon_end)
  • ExonIdFilter (column: exon_id)
  • ExonRankFilter (column: exon_rank)
  • ExonStartFilter (column: exon_start)
  • GeneBiotypeFilter (column: gene_biotype)
  • GeneEndFilter (column: gene_end)
  • GeneIdFilter (column: gene_id)
  • GeneNameFilter (column: gene_name)
  • GeneStartFilter (column: gene_start)
  • ProtDomIdFilter (column: prot_dom_id)
  • ProteinDomainIdFilter (column: protein_domain_id)
  • ProteinDomainSourceFilter (column: protein_domain_source)
  • ProteinIdFilter (column: protein_id)
  • SeqNameFilter (column: seq_name)
  • SeqStrandFilter (column: seq_strand)
  • SymbolFilter (column: symbol)
  • TxBiotypeFilter (column: tx_biotype)
  • TxEndFilter (column: tx_end)
  • TxIdFilter (column: tx_id)
  • TxNameFilter (column: tx_name)
  • TxStartFilter (column: tx_start)
  • UniprotDbFilter (column: uniprot_db)
  • UniprotFilter (column: uniprot)
  • UniprotMappingTypeFilter (column: uniprot_mapping_type)

Return better dtypes

Description of feature

Right now, we can return some weird dtypes.

Integers with missing values in the ensembl table are cast to float, boolean values are stored as integer, and strings are returned for things better represented as categorical.

We should figure out what dtypes each column should be returned as, and make sure they are returned correctly. It's possible we could pass through pyarrow to do this more efficiently

Public doc builds

Description of feature

We should get a readthedocs site set up for this. I'm thinking we should wait until we've decided on a name (#2) so we don't squat a namespace.

Missing ensembl versions

Report

The default Ensembl version in ensembldb is v86, but this doesn't seem to exist in the sql database.

import pooch

PKG_CACHE_DIR = "genomic-annotations"
url_template = "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v{version}/EnsDb.{species}.v{version}.sqlite"

local_path = pooch.retrieve(
        url = url_template.format(version="86", species="Hsapiens"),
        known_hash=None,
        path=pooch.os_cache(PKG_CACHE_DIR),
        progressbar=True,
    )
Downloading data from 'https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v86/EnsDb.Hsapiens.v86.sqlite' to file '/home/jovyan/.cache/genomic-annotations/1c849fa5b54f90367dbde8fd3f560e9a-EnsDb.Hsapiens.v86.sqlite'.
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Input In [120], in <cell line: 4>()
      1 import pooch
      2 url_template = "[https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v](https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v%3C/span%3E%3Cspan) class="ansi-bold" style="color:rgb(175,95,135)">{version}/EnsDb.{species}.v{version}.sqlite"
----> 4 local_path = pooch.retrieve(
      5         url = url_template.format(version="86", species="Hsapiens"),
      6         known_hash=None,
      7         path=pooch.os_cache(PKG_CACHE_DIR),
      8         progressbar=True,
      9     )

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/pooch/core.py:239, in retrieve(url, known_hash, fname, path, processor, downloader, progressbar)
    236 if downloader is None:
    237     downloader = choose_downloader(url, progressbar=progressbar)
--> 239 stream_download(url, full_path, known_hash, downloader, pooch=None)
    241 if known_hash is None:
    242     get_logger().info(
    243         "SHA256 hash of downloaded file: %s\n"
    244         "Use this value as the 'known_hash' argument of 'pooch.retrieve'"
   (...)
    247         file_hash(str(full_path)),
    248     )

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/pooch/core.py:803, in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
    799 try:
    800     # Stream the file to a temporary so that we can safely check its
    801     # hash before overwriting the original.
    802     with temporary_file(path=str(fname.parent)) as tmp:
--> 803         downloader(url, tmp, pooch)
    804         hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    805         shutil.move(tmp, str(fname))

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/pooch/downloaders.py:207, in HTTPDownloader.__call__(self, url, output_file, pooch, check_only)
    205 try:
    206     response = requests.get(url, **kwargs)
--> 207     response.raise_for_status()
    208     content = response.iter_content(chunk_size=self.chunk_size)
    209     total = int(response.headers.get("content-length", 0))

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: The specified blob does not exist. for url: https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v86/EnsDb.Hsapiens.v86.sqlite

The EnsemblDB class should have a method to check (and return?) available versions

Version information


bioframe 0.4.1
genomic_annotations 0.0.1
ibis 5.1.0
pandas 2.0.1
pooch v1.7.0
session_info 1.0.0

PIL 9.5.0
asttokens NA
backcall 0.2.0
bidict 0.22.1
certifi 2022.12.07
charset_normalizer 3.1.0
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.7
decorator 5.1.1
executing 1.2.0
greenlet 2.0.2
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.14.0
jedi 0.18.2
kiwisolver 1.4.4
matplotlib 3.7.1
mpl_toolkits NA
multipledispatch 0.6.0
numpy 1.24.3
packaging 23.1
parso 0.8.3
parsy 2.1
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.3.0
prompt_toolkit 3.0.38
psutil 5.9.5
ptyprocess 0.7.0
public 3.1.1
pure_eval 0.2.2
pyarrow 11.0.0
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.15.1
pyparsing 3.0.9
pytz 2023.3
regex 2.5.125
requests 2.28.2
rich NA
six 1.16.0
sqlalchemy 2.0.10
sqlglot 11.5.7
stack_data 0.6.2
toolz 0.12.0
tornado 6.3
tqdm 4.65.0
traitlets 5.9.0
typing_extensions NA
urllib3 1.26.15
wcwidth 0.2.6
xxhash NA
zipp NA
zmq 25.0.2
zoneinfo NA

IPython 8.4.0
jupyter_client 8.2.0
jupyter_core 5.3.0

Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03) [GCC 11.3.0]
Linux-4.15.0-112-generic-x86_64-with-glibc2.31

Session information updated at 2023-04-26 16:00

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.