
modos-api

Access and manage Multi-Omics Digital Objects (MODOs).

Context

Goals

Provide a digital object and system to process, store and serve multi-omics data with their metadata such that:

  • Traceability and reproducibility is ensured by rich metadata
  • The different omics layers are processed and distributed together
  • Common operations such as liftover can be automated easily and ensure that omics layers are kept in sync
  • Data can be accessed, sliced and streamed over the network without downloading the dataset.

Architecture

The client library by itself can be used to work with local MODOs, or connect to a server to access objects over s3.

The server configuration and setup instructions can be found in deploy. It consists of a REST API, an s3 server and an htsget server to stream CRAM/BCF over the network. The aim is to provide transparent remote access to MODOs without storing the data locally.

Format

The digital object is composed of a folder with:

  • Genomic data files (CRAM, BCF, ...)
  • A zarr archive for metadata and array-based data

The metadata links to the different files and provides context using the modos-schema.

Installation

The library can be installed with pip:

pip install modos

The development version can be installed directly from github:

pip install git+https://github.com/sdsc-ordes/modos-api.git@main

Usage

The CLI is convenient for quickly managing modos (creation, editing, deletion) and for quick inspections:

$ modos show -s3 https://s3.example.org --zarr ex-bucket/ex-modo
/
 ├── assay
 │   └── assay1
 ├── data
 │   ├── calls1
 │   └── demo1
 ├── reference
 │   └── reference1
 └── sample
     └── sample1

$ modos show --files data/ex
data/ex/reference1.fa.fai
data/ex/demo1.cram
data/ex/reference1.fa
data/ex/calls1.bcf
data/ex/demo1.cram.crai
data/ex/calls1.bcf.csi

The user facing API is in modos.api. It provides full programmatic access to the object's [meta]data:

>>> from modos.api import MODO

>>> ex = MODO('./example-digital-object')
>>> ex.list_samples()
['sample/sample1']
>>> ex.metadata["data/calls1"]
{'@type': 'DataEntity',
 'data_format': 'BCF',
 'data_path': 'calls1.bcf',
 'description': 'variant calls for tests',
 'has_reference': ['reference/reference1'],
 'has_sample': ['sample/sample1'],
 'name': 'Calls 1'}
>>> rec = next(ex.stream_genomics("calls1.bcf", "chr1:103-1321"))
>>> rec.alleles
('A', 'C')

For advanced use cases, the object's metadata can be queried with SPARQL!

>>> # Build a table with all files from male samples
>>> query = """
...   SELECT ?assay ?sample ?file
...   WHERE {
...     [] schema:name ?assay ;
...       modos:has_data [
...         modos:data_path ?file ;
...         modos:has_sample [
...           schema:name ?sample ;
...           modos:sex ?sex
...         ]
...       ] .
...     FILTER(?sex = "Male")
...   }
... """
>>> print(ex.query(query).serialize(format="csv").decode())
assay,sample,file
Assay 1,Sample 1,file://ex/calls1.bcf
Assay 1,Sample 1,file://ex/demo1.cram

Contributing

First, read the Contribution Guidelines.

For technical documentation on setup and development, see the Development Guide.

Acknowledgements and Funding

The development of the Multi-Omics Digital Object System (MODOS) is being funded by the Personalized Health Data Analysis Hub, a joint initiative of the Personalized Health and Related Technologies (PHRT) and the Swiss Data Science Center (SDSC), for a period of three years from 2023 to 2025. The SDSC leads the development of MODOS, bringing expertise in complex data structures associated with multi-omics and imaging data to advance privacy-centric clinical-grade integration. The PHRT contributes its domain expertise of the Swiss Multi-Omics Center (SMOC) in the generation, analysis, and interpretation of multi-omics data for personalized health and precision medicine applications. We gratefully acknowledge the Health 2030 Genome Center for their substantial contributions to the development of MODOS by providing test data sets, deployment infrastructure, and expertise.

Copyright

Copyright © 2023-2024 Swiss Data Science Center (SDSC), www.datascience.ch. All rights reserved. The SDSC is jointly established and legally represented by the École Polytechnique Fédérale de Lausanne (EPFL) and the Eidgenössische Technische Hochschule Zürich (ETH Zürich). This copyright encompasses all materials, software, documentation, and other content created and developed by the SDSC in the context of the Personalized Health Data Analysis Hub.


modos-api's Issues

Enable `extract_metadata()` to access remote cram files

Currently, modo.file_utils.extract_metadata fails if the CRAM file is stored remotely, as it cannot access the file and its header. To enable this, remote CRAM files should automatically be accessed via the htsget server.

Requirements:

  • change extract_metadata to access the CRAM file header conditionally on storage type
  • access the CRAM file header via the htsget server if the file is stored remotely
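The dispatch on storage type could be sketched as below. The helper names (`_read_local_header`, `_fetch_header_via_htsget`) are placeholders for the real pysam/htsget calls, not the actual modos internals:

```python
from urllib.parse import urlparse

def is_remote(path: str) -> bool:
    """True when the path uses an http(s) or s3 scheme."""
    return urlparse(str(path)).scheme in {"http", "https", "s3"}

# Placeholder back-ends; the real implementation would use pysam
# for local files and an htsget client for remote ones.
def _read_local_header(path: str) -> str:
    return f"header from local file {path}"

def _fetch_header_via_htsget(path: str) -> str:
    return f"header streamed via htsget for {path}"

def extract_header(path: str) -> str:
    """Dispatch on storage type, as the issue proposes."""
    if is_remote(path):
        return _fetch_header_via_htsget(path)
    return _read_local_header(path)
```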

Use consistent element ids within modo

Zarr groups are usually referred to with a leading /. In line with this, we sometimes use ids with a leading / and sometimes without, e.g.:

[k for k in modo.metadata.keys()]
['ex', '/sample/sample1', '/reference/reference1', ...]

modo.list_samples()
['/sample/sample1']

At the same time, we refer to element ids within other elements without the leading /:

modo.metadata.values()
dict_values([{'@type': 'MODO', 'has_assay': ['assay/WGS_NA24143'], 'id': 'GIAB', ...}, {'@type': 'Assay', 'has_data': ['data/NGS000000125'], 'has_sample': ['sample/NA24143'], ...}])

Zarr itself seems to be flexible and maps both /element_id and element_id to the correct group, but within the modo api this leads to inconsistencies, e.g., modo.knowledge_graph() reports URLs with a double slash:

[s for s in modo.knowledge_graph().subjects()]
[rdflib.term.URIRef('file://ex//assay/assay1'), rdflib.term.URIRef('file://ex//reference/reference1'),...]
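One way to enforce a single canonical form would be to normalize ids at the API boundary; a minimal sketch with illustrative names (not the actual modos internals):

```python
def normalize_id(element_id: str) -> str:
    """Canonical element id: no leading slash, as used in cross-references."""
    return element_id.lstrip("/")

def to_zarr_group(element_id: str) -> str:
    """Zarr group path: always exactly one leading slash."""
    return "/" + normalize_id(element_id)
```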

ci jobs for slow server tests

Server tests require deploying services and are slower. Currently, CI only runs (fast) client tests, while server tests are marked as "slow" and must be run using pytest --runslow. When do we want to run the slow jobs in CI?

Objective: define CI conditions and job to run server tests.

Requirements:

  • identify conditions (e.g. server files change, commits on main, PR open, ...)
  • implement GA workflow
  • optional: makefile rule for server tests

Allow client to stream cram from htsget server

The compose deployment now includes a functional htsget server. We need to allow the client to stream CRAM data from it as if it were available locally.

Objective: update client to stream from htsget with pysam.

Requirements:

  • can stream (anonymously, no authentication yet)
  • api methods work identically for local and remote data

e.g. something along the lines of:

local = MODO('data/ex')
local.stream_cram('cram/demo1') # stream from disk

remote = MODO('http://example.org/s3', bucket='demo')
remote.stream_cram('cram/demo1') # stream over http

Enable import from yaml file

Interactive object creation on the command line is not practical. We should add a command to load the entire modo (assays, samples, ...) from an input file.
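For illustration, the import could start from the parsed file contents; a sketch assuming the YAML has already been parsed (e.g. with yaml.safe_load) into a list of element dicts. The field names mirror the metadata examples in this README, but the exact file schema is an open design question:

```python
# Hypothetical parsed result of a modo.yaml file.
spec = [
    {"@type": "Sample", "id": "sample1", "name": "Sample 1"},
    {"@type": "Assay", "id": "assay1", "name": "Assay 1",
     "has_sample": ["sample/sample1"]},
]

def group_by_type(entries):
    """Group parsed entries by @type so each kind can be added to the modo in one pass."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry["@type"], []).append(entry)
    return grouped
```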

stream CRAM data via htsget

The modo-server includes an htsget service to send CRAM binary streams over HTTP. This is currently not configured or made accessible.

Objective: Configure server to allow streaming CRAM files from S3(minio) via htsget

Requirements:

  • Setup htsget service to use s3 with minio
  • Configure ports in docker compose so that minio can be accessed from htsget
  • Test htsget endpoint

Allow linking files instead of copying

Problem:
Sometimes, especially for references, the user does not want to copy a file into the object, but only link to it

Proposed solution:
Add an option to link files external to the object
modos add --external --source http://localhost/s3/reference/ref_genome.fa data/ex data

To consider:
Modos won't be self-contained any longer, if we allow external linking. This is a trade-off between storage space and reusability.

feat(cli): add stream command to cli

Goal

Add a new command modos stream to modos cli

Context

Streaming from python is hard due to pysam's limited capabilities for working on byte streams. In the meantime, we can still implement streaming via the cli by directly piping the byte stream from the htsget client to stdout.

allow virtual host-style buckets for s3

The current setup assumes s3 buckets follow a path-style pattern: http://domain/bucket/path.
We should also support virtual host style, which is becoming standard: http://bucket.domain/path.
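The rewrite between the two styles is a pure URL transformation; a sketch using only the standard library (function name is illustrative):

```python
from urllib.parse import urlparse, urlunparse

def to_virtual_host_url(path_style: str) -> str:
    """Convert path-style http://domain/bucket/path to
    virtual host-style http://bucket.domain/path."""
    u = urlparse(path_style)
    # First path segment is the bucket; the rest is the object key.
    bucket, _, key = u.path.lstrip("/").partition("/")
    return urlunparse((u.scheme, f"{bucket}.{u.netloc}", f"/{key}", "", "", ""))
```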

[Feature request]: stream command

Contact Details

No response

Description

We need the ability to stream CRAM/BCF on the command line from modos into other tools.

Importance Level

High

Affected Components

cli

Technical Requirements

new subcommand in cli

Acceptance criteria

modos can stream remote + local CRAM/BCF on cli

make modo add remote-compatible

modo add creates a folder locally. To support remote write operations, we can directly "mount" the s3 folder.

Requirements:

  • rely on zarr's built-in s3 support to update metadata transparently
  • use s3fs (?) to manually upload / remove / delete files on s3

Note: for now we ignore access control and use anonymous buckets.

modos sync to update based on yaml file

Currently there is no simple way to update a modo: each change must be committed independently, and objects have to be deleted and recreated.

Objective: allow grouped updates based on an input file, including renaming/editing objects.

Requirements:

  • new command in CLI: modos update dir [element_id]
  • command adds new elements
  • command deletes old elements
    • add + remove ~= rename
  • command edits existing elements
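The grouped update can be planned as a diff between current and desired element ids; an illustrative sketch (not the actual modos implementation):

```python
def plan_sync(current_ids, desired_ids):
    """Split a sync into add/remove/edit sets; a rename is remove + add."""
    current, desired = set(current_ids), set(desired_ids)
    return {
        "add": desired - current,     # new elements to create
        "remove": current - desired,  # old elements to delete
        "edit": current & desired,    # existing elements to update in place
    }
```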

template s3 endpoint in htsget config file

Currently we set a fixed IP address for the minio service in docker compose, due to DNS resolution problems when htsget tries to reach minio. This fixed IP address is hardcoded in the htsget config file. This is undesirable, as a user should be able to configure the deployment by changing only the env file.

Objective: use templating to set the IP address from an environment variable in the htsget config.

Requirements:

  • environment variable set to fixed ip address by default (using bash parameter substitution)
  • write template for config and mount into service
  • write entrypoint shell script to read template, substitute variables and write actual file into the container
    • envsubst could be an appropriate tool for this substitution

[Feature request]: infer `data_path`

Contact Details

No response

Description

When including data files in a modo, one must specify both source_file and data_path. This is redundant since, most often, data_path is just the filename of source_file.

Importance Level

Low

Affected Components

api, cli

Technical Requirements

infer data_path as filename(source_file) by default

Acceptance criteria

users can leave data_path empty

clearer handling of `data_path` on creation

Currently, we use data-file on the CLI, and the yaml file uses data_path as the source path for files to add into the modo.
This is unclear for the user because:

  1. we don't differentiate (semantically) between source and target during the copy/upload operation
  2. argument name is very similar to the schema property but means something else

Objective: clearly separate the path inside the object from the source path when importing.

Requirements:

  • better cli parameter name: --source-file
  • yaml should clearly isolate schema properties from arguments
    • currently we use some hacks in the backend to replace values when loading yaml

Proposal:
Instead of having a list of elements with unique ids, we have a list of objects with metadata and args keys:

# before
- id: abc
  name: A B C
  data_path: /data/source.xyz
# after
- metadata:
    id: abc
    name: A B C
  args:
    source_file: /data/source.xyz

Automatically add cram index file

modo.stream_cram() expects a CRAM index file to exist in the modo object (for both local and remote streaming). Currently, we don't check for the index file, and especially for remote CRAM files it can be cumbersome for the user to add it.

There are different potential solutions:

  1. Automatically generate the index file when a CRAM file is added to a modo object. (+ easy for the user, - may take time, - do we need to make sure we use the correct reference for this?)
  2. Check for an index file in the same directory as the CRAM file and copy it into the modo together with the CRAM. If the index does not exist, warn/fail?
  3. Check for the index in the modo and warn the user if it does not exist, plus provide a user function to copy/upload an index file into the modo.

@cmdoret, @AssafSternberg do you have any preferences or other ideas?

instantiation of remote MODO

Currently the API allows read-only access to a catalogue of modos. It supports listing them and their metadata.

Objective: We need the ability to remotely instantiate a MODO using the S3 bucket as zarr store.

Requirements:

  • Client side:
    • MODO supports instantiation from S3 path (should already work via the archive parameter of the constructor)
      • provide a helper function to make this easier?
  • Server side:
    • endpoint should give the public address of the bucket to the client

In essence, the client should be able to:

  • get the s3 path of a specific modo from the server
  • instantiate a modo using that path (as if it was local)

[Feature request]: streamline remote interactions

Contact Details

No response

Description

Currently, users have to specify the s3 endpoint and/or htsget endpoint manually. There should be a single "modos" endpoint, and the client should resolve the s3/htsget endpoints behind the scenes.

Importance Level

Medium

Affected Components

api, cli, server setup

Technical Requirements

  • change the CLI and API s3_endpoint / htsget_endpoint arguments
  • modos client auto-requests the individual endpoints from the base endpoint

Acceptance criteria

Users can use all modos functionality knowing only the modos server endpoint.

Autofill `last_update_date` and `creation_date` when using python api

The modo metadata fields last_update_date and creation_date are defined by the user, and the user needs to update them manually if wanted.
Both could be automatically filled.

  • auto generate a time-date value for creation_date and last_update_date when not specified
  • define scope to update last_update_date (add_element, remove_element, update_element, ..?)
  • auto update last_update_date for this scope
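The steps above could look roughly like this; a minimal sketch with illustrative names, using ISO 8601 UTC timestamps as the assumed date format:

```python
from datetime import datetime, timezone

def now_iso() -> str:
    """Current UTC timestamp in ISO 8601, usable as a default date value."""
    return datetime.now(timezone.utc).isoformat()

class Tracked:
    """Sketch: fill creation_date when missing, bump last_update_date on edits."""

    def __init__(self, creation_date=None):
        self.creation_date = creation_date or now_iso()
        self.last_update_date = self.creation_date

    def update_element(self):
        # Any mutating operation in the chosen scope refreshes the timestamp.
        self.last_update_date = now_iso()
```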

remove mutable default values from function arguments

We use mutable dictionaries as default values in modo.api.MODO.__init__ and modo.io.build_modo_from_file, e.g.:

https://github.com/sdsc-ordes/modo-api/blob/65dbc0ee563c133e2ea514bb5e02821323ea1333/modo/api.py#L55

Altering the dictionary in the function body affects the next calls, resulting in undefined behaviour.

Related bug: Instantiating MODO twice with s3_endpoint raises TypeError.

In [1]: from modo.api import MODO
In [2]: m1 = MODO('modo-demo/ex', s3_endpoint='http://localhost/s3')

In [3]: m2 = MODO('modo-demo/ex', s3_endpoint='http://localhost/s3')
--------------------------------------------------------------------------
TypeError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 m2 = MODO('modo-demo/ex', s3_endpoint='http://localhost/s3')

File ~/Repos/github.com/sdsc-ordes/modo-api/modo/api.py:74, in MODO.__init__(self, path, s3_endpoint, s3_kwargs, htsget_endpoint, id, name, description, creation_date, last_update_date, has_assay, source_uri)
     72 self.path = Path(path)
     73 if s3_endpoint:
---> 74     fs = s3fs.S3FileSystem(endpoint_url=s3_endpoint, **s3_kwargs)
     75     if fs.exists(str(self.path / "data.zarr")):
     76         s3_kwargs["endpoint_url"] = s3_endpoint

TypeError: s3fs.core.S3FileSystem() got multiple values for keyword argument 'endpoint_url'
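The standard fix is a None sentinel with a fresh dict built per call; a minimal illustration of both patterns (generic names, not the actual MODO signature):

```python
# Buggy pattern: one shared dict is created at function definition time,
# so mutations leak into every subsequent call.
def connect_bad(endpoint, s3_kwargs={}):
    s3_kwargs["endpoint_url"] = endpoint  # mutates the shared default
    return s3_kwargs

# Fixed pattern: None sentinel, copy per call; repeated calls stay independent.
def connect_good(endpoint, s3_kwargs=None):
    s3_kwargs = dict(s3_kwargs or {})
    s3_kwargs["endpoint_url"] = endpoint
    return s3_kwargs
```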

Add support for BCF format

MODO now supports remote streaming of CRAM files. The test dataset of the genome center is mainly BCF files, therefore we need to implement the same feature for variant files.

Requirements:

  • Make sure we can add BCF files
  • metadata is stored in /data
  • files can be read locally
  • files can be streamed
  • output is accessible as a pysam.VariantRecord

data_path should be auto populated

When adding a data object into an existing MODO, the data_path property indicates the filepath inside the object. This must currently be filled in by the user, but in most use cases it will be the same as the input filename. We should use the input filename as the default value.

Objective: Use a default value, either from the input filename, or by combining id and extension.

Requirements:

  • Check behaviour in case of filename collision
  • Change the API to use the input filename if collisions are accounted for.
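The default could simply come from the source filename; a sketch of the proposed behaviour (function name is illustrative, and collision handling is left to the API):

```python
from pathlib import Path

def infer_data_path(source_file, data_path=None):
    """Default data_path to the filename of source_file when not given."""
    return data_path if data_path is not None else Path(source_file).name
```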

allow deleting entire MODO

The modo remove command deletes an element inside an existing MODO. We don't have a way in the CLI or API to delete the actual MODO folder. This is important because deleting the folder is not straightforward when using a remote object on S3.

Objective: CLI/API deletion functionality

Requirements:

  • Add --force flag to allow deleting root object with modo remove
    Example:
$ modos show --zarr data/ex
/ex
 ├── assay
 │   └── assay1
 ├── data
 │   ├── calls1
 │   └── demo1
 ├── reference
 │   └── reference1
 └── sample
     └── sample1

$ modos remove data/ex ex
Error: cannot delete root object. If you want to delete the entire MODO, use --force.

$ modos remove --force data/ex ex

Enable automatic metadata enrichment from cli

Currently, modo.enrich_metadata() has no equivalent in the cli. We should add that functionality there as well.
Important: should we:

  1. automatically call enrich_metadata() when a modo is created or a reference is added, or
  2. mimic the api and add a specific command like modo enrich?

Tasks:

  • add the functionality to the cli
  • update docs (modo_access.md)

handle copying data files into archive

When creating a modo from a yaml file, we should define a clear behaviour to include external files.

Proposal:

  • External files can be included in the yaml file (under data_path).
  • We check for their existence at import time
  • We copy them by default, or optionally refer to them (or the opposite)
  • If copied, the actual MODO metadata will contain the new path inside the archive.
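The copy-or-link behaviour could be sketched as follows, using only the standard library (function name and signature are illustrative, not the actual modos API):

```python
import shutil
from pathlib import Path

def import_file(source, archive_dir, copy=True):
    """Copy (default) or link an external file into the archive.

    Returns the path to record in the MODO metadata: the relative path
    inside the archive if copied, or the original location if linked.
    """
    source, archive_dir = Path(source), Path(archive_dir)
    if not source.exists():
        raise FileNotFoundError(f"source file not found: {source}")
    if copy:
        target = archive_dir / source.name
        shutil.copy2(source, target)
        return str(target.relative_to(archive_dir))
    return str(source)  # keep a reference to the external location
```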

increase client_max_body_size in nginx

nginx is blocking file uploads to the s3 server due to a low default value for the client_max_body_size option.

We need to:

  • Set this value to a sane default.
  • Verify that uploading large files works

auto-generate identifiers

It is redundant for users to enter both the (unique) id and the (non-unique) human-readable name for each element in the modo. Most of the time these are ~identical. We should reduce the pain of data creation by automatically producing unique identifiers.

Objective: autogenerate unique id for every entity in the MODO

Proposal: <type>/<slugified-name>

Example:

name: "Tom the cat"

id: /sample/tom-the-cat-131fe2
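A slugify-plus-suffix generator along those lines could look like this; a minimal sketch (whether ids carry a leading slash is a separate open question):

```python
import re
import secrets

def make_id(element_type: str, name: str) -> str:
    """Build <type>/<slugified-name>-<suffix> from a human-readable name.

    The random hex suffix keeps ids unique even for identical names.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{element_type}/{slug}-{secrets.token_hex(3)}"
```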

Fail directly on not accepted prompt entries

On the cli, some prompted fields have requirements, e.g. taxon_id needs to be numeric (mostly specified in modo.schema). The error is only returned after all prompts are finished, which can be frustrating for the user. It would be nice to fail directly and allow a retry.

Steps:

  • Direct type check after each prompt
  • Repeat prompt question
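The validate-and-retry loop might look like this; a generic sketch (the real cli uses typer prompts, and the names here are illustrative):

```python
def prompt_int(ask, field="taxon_id", attempts=3):
    """Re-prompt immediately on invalid input instead of failing at the end.

    `ask` is any callable taking a prompt string and returning the user's
    answer (e.g. `input`), which makes the loop easy to test.
    """
    for _ in range(attempts):
        raw = ask(f"{field}: ")
        if raw.strip().isdigit():
            return int(raw)
        print(f"{field} must be numeric, please try again.")
    raise ValueError(f"no valid value given for {field}")
```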

reverse proxy in docker-compose

We want the docker-compose deployment to expose a single port.

Objective: Include nginx reverse proxy as a service in the docker-compose config to redirect specific paths to the correct port internally.

Requirements

  • Add nginx service
  • Define configuration
  • Parametrize endpoints and ports throughout services via env file


Disconnect zarr hierarchy from has_part relationship

The zarr hierarchy currently mirrors the has_part relationship, e.g. modo/assay1/data/sample ($Sample \in Data \in Assay \in MODO$).

This is prone to errors and confusing, because zarr represents a tree model (like filesystems) whereas the metadata relationships form a graph (e.g. the same sample can belong to multiple files).

Proposal:

Keep a fixed hierarchy with a single depth level in the zarr archive.

ex-obj
├── assays
│  └── assay1
├── data
│  ├── align1
│  ├── align2
│  └── align3
├── references
│  └── hg38
└── samples
   ├── sample1
   └── sample2

Add array api function to simplify adding arrays to modo

The current way of adding an array to a modo includes 2 steps:

  1. add the element specifying the array metadata to modo
  2. add the array to modos zarr group

For the user, these steps can be confusing, and they allow inconsistencies, e.g. specifying any array path in the element, unrelated to where the array is added in the hierarchy. Also, the zarr framework can feel cumbersome if one is not used to it.

  • Add modo api function modo.add_array() to wrap both steps
  • Update docs in docs/tutorials/modo_array()

Open question:

  1. What about modo.add_element()? Do we allow adding array metadata (step 1 above)?
  2. Should modo.add_array() also take care of rowname/colname handling?

Simplify MODO creation

Logic to create a modo in modo.cli and modo.io requires a two-step procedure whereby the zarr store is manually initialized, and add_metadata_group is then called to add metadata. This is something that should be handled by MODO.__init__.

Expand modo --help message to show all

modo --help displays only part of the description for some commands (see publish and remove below).
Is this related to some typer setting and can we change this?

(modo-py3.10) (base) stefan@ordesmain:~/modo-api$ modo --help
Usage: modo [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add      Add elements to a digital object.
  create   Create a digital object.
  publish  Creates a semantic artifact allowing to publish a digital...
  remove   Removes the target element from the digital object, along with...
  show     Show the contents of a digital object.

Consistent use of endpoint as attribute/argument name

Description: We use the arguments s3_endpoint and htsget_endpoint in the MODO class to describe the host address of the htsget server/S3 bucket. For the htsget client we use "host" for the htsget server address and "endpoint" to differentiate between the reads and variants endpoints.

Tasks:

  • change htsget_endpoint and s3_endpoint in MODO to *_host
  • adapt api and cli

`modo.update_element()` does not update modo metadata

Description

modo.update_element() does not update modo metadata, but also does not raise an error.

Code to reproduce:

from modos.api import MODO
import modos_schema.datamodel as model

modo = MODO("data/ex")
# Check metadata --> sample1: sex:  Male
modo.metadata

# Generate a new sample element with updates for sample/sample1
sample = model.Sample(id="sample1", collector="Foo university", sex="Female")

modo.update_element("sample/sample1", sample)
# Check metadata --> still sample1: sex:  Male
modo.metadata
