CuratedAtlasQueryR's People

Contributors

jwokaty, multimeric, myushen, stemangiola


CuratedAtlasQueryR's Issues

Prepare users for 36GB download

The download seems fast enough, but people may benefit from knowing that they will receive 36 GB of data. I'd suggest making the download function conditional on a positive response to a prompt asking whether they want to proceed.
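A minimal sketch of what this could look like (the download step called at the end is a hypothetical placeholder, not the package's actual API):

confirm_large_download <- function(size_gb = 36) {
    # Never block non-interactive (scripted / CI) sessions
    if (!interactive()) return(TRUE)
    answer <- readline(
        sprintf("This will download ~%d GB of data. Proceed? [y/N] ", size_gb)
    )
    tolower(trimws(answer)) %in% c("y", "yes")
}

# Hypothetical usage inside the package's download step:
# if (confirm_large_download()) sync_assay_files(files)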

decrease DB size, through a permanent VIEW?

Provided that the keys are realised from the non-redundant tables to the virtual table, so that performance is not affected, I think it would be incredibly beneficial to decrease the DB size by > 10x by separating cell metadata from sample metadata and replacing the metadata table with a virtual one.

This would result in the download and decompression of the xz SQLite taking 10 seconds rather than 3-5 minutes, with a huge benefit to the R and Python interfaces and to quality of life for new users.
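A minimal sketch of the idea using DuckDB from R, assuming the current flat table is available as metadata_full (table and column names are illustrative, not the actual schema):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), "metadata.duckdb")

# Split the redundant flat table: cell-level rows keep only the sample key,
# sample-level columns are stored once per sample.
dbExecute(con, "
    CREATE TABLE cell_metadata AS
    SELECT cell_, file_id_db, sample_, cell_type FROM metadata_full;
")
dbExecute(con, "
    CREATE TABLE sample_metadata AS
    SELECT DISTINCT sample_, dataset_id, tissue, assay, ethnicity FROM metadata_full;
")

# The table users query becomes a VIEW over the join, so the interface is
# unchanged while the on-disk size shrinks.
dbExecute(con, "
    CREATE VIEW metadata AS
    SELECT c.*, s.dataset_id, s.tissue, s.assay, s.ethnicity
    FROM cell_metadata c
    JOIN sample_metadata s USING (sample_);
")

dbDisconnect(con, shutdown = TRUE)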

[EDIT] whole-DB preview slower in RStudio on HPC than in local RStudio

While printing the metadata to screen is immediate with DuckDB, with our new Parquet implementation it takes ~1 second, which is slightly annoying.

(the other, more complicated queries are very fast with both frameworks)

Maybe in the future we could revisit DuckDB vs Parquet and weigh the advantages and disadvantages.

Issues with 1-column SCEs

Reprex:

library(dplyr)
library(HCAquery)
tibble(file_id_db="f1ba4ef0a4822e3860fe045f18b2b2dc", .cell="ACGATACGTAAGAGGA-3") |> get_SingleCellExperiment()
Reading 1 files.
.
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘assayNames<-’ for signature ‘"NULL", "character"’

this query fails

This query fails, probably because only one cell is selected:

get_metadata() |> filter(.cell == "Arunachalam-cov01_AAAGTCCTCGCCTTGT-1") |> head(1) |>  get_SingleCellExperiment()
Reading 1 files.
.
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘assayNames<-’ for signature ‘"NULL", "character"’

AWS Database

  • Upload database to S3
  • Update R package to support loading from S3

Different row numbers

As discussed. Here is an example:

"/vast/projects/cellxgene_curated/splitted_DB2_data_0.2" |> list.files() |> head(20) |> purrr::map_int(\(file){ file |> HDF5Array::loadHDF5SummarizedExperiment() |> nrow() })
 [1] 36422 35662 35662 36422 35676 35676 35676 35761 35343 35761 35492 35790
[13] 35770 35761 35495 35372 35790 35790 35917 35941

integrate cell types and reclassify

  1. Divide cells based on macroclusters (e.g. B cells, CD8 T cells, monocytes). This is not always trivial: we have some high-confidence annotations, but some cells cannot be easily classified as T, B, or monocytes.
  • Select a small set of gene markers for the major immune cells (B, T, monocytes, DC) and do integration on 11M cells. This is procedurally cleaner; however, I don't know if it's possible with so many cells, even if we use just 20 genes.
  • @multimeric: list in a comment below the best candidate algorithms that have been shown to suit atlas-level integration.
  • @ALL: we will pick a couple from that list.
  • @ConnieLWS: select two optimal datasets to test the initial classification (@ConnieLWS) and integration (@multimeric).
  • @multimeric: implement/install those methods, and test them with the small sample selection that @ConnieLWS is using.
  • @multimeric: then select high-confidence NK cells (starting testing with a small random cell selection from the full NK database) and try to produce an integrated PCA and UMAP, colouring by file_id and .sample and omitting the legends to save space in the plot (a minimal sketch follows this list).
  • @ConnieLWS: run PCA and tSNE (built from the first ~5 PCs) with the gene signature we have, colouring by file_id and .sample and omitting the legends to save space in the plot, without integration.
  • Integrate cells in the same space, and do clustering and trajectory analysis.
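A minimal sketch of the PCA/UMAP step above, assuming a scater workflow and an sce object holding the selected cells with a logcounts assay and file_id / sample_ columns in colData (the object and column names are assumptions):

library(scater)

sce <- runPCA(sce, ncomponents = 5)
sce <- runUMAP(sce, dimred = "PCA")

# Colour by file_id and sample, omitting the legends to save plot space
plotReducedDim(sce, dimred = "UMAP", colour_by = "file_id") +
    ggplot2::theme(legend.position = "none")
plotReducedDim(sce, dimred = "UMAP", colour_by = "sample_") +
    ggplot2::theme(legend.position = "none")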

`_cell` must be size 202 or 1, not 157.

I get this error with the new version.

> get_metadata() |> 
+     
+     # Filter and subset
+     filter(cell_type_harmonised=="cdc") |> 
+     select(`_cell`, file_id_db, dataset_id, tissue_harmonised) |> 
+     # Get counts per million for NCAM1 gene 
+     get_SingleCellExperiment(assays = "cpm", features = "HLA-A") 
ℹ Realising metadata.
Synchronising files
Reading files.
Error in `map2()`:
In index: 1.
With name: cpm.
Caused by error in `dplyr::summarise()`:
In argument: `sces = list(...)`.
Caused by error in `mutate()`:
In argument: `_cell = new_cellnames`.
Caused by error:
! `_cell` must be size 202 or 1, not 157.
Run `rlang::last_error()` to see where the error occurred.
Warning messages:
1: Connection is garbage-collected, use dbDisconnect() to avoid this. 
2: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 

minor bugs/improvements

  • 1. In a new installation the cache doesn't yet exist, so normalizePath() issues a warning:
> meta = get_metadata()
ℹ Downloading https://object-store.rc.nectar.org.au/v1/AUTH_06d6e008e3e642da99d806ba3ea629c5/metadata-sqlite/metadata.parquet to /Users/ma38727/Library/Caches/org.R-project.R/R/CuratedAtlasQueryR/metadata.parquet
  |======================================================================| 100%
Warning message:
In normalizePath(R_user_dir(packageName(), "cache")) :
  path[1]="/Users/ma38727/Library/Caches/org.R-project.R/R/CuratedAtlasQueryR": No such file or directory

I don't think the call to normalizePath() is necessary, since R_user_dir() does not return a 'denormalized' path.
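If we keep a call at all, a couple of ways to avoid the warning (a sketch, not the current package code):

# Create the cache directory before normalising the path...
cache_dir <- tools::R_user_dir("CuratedAtlasQueryR", "cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
cache_dir <- normalizePath(cache_dir)

# ...or silence the check entirely
cache_dir <- normalizePath(tools::R_user_dir("CuratedAtlasQueryR", "cache"),
                           mustWork = FALSE)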


  • 2. The 0.3.0 tag release has Version: 0.2.0

    Version: 0.2.0

  • 3. This line is unused

    cells_of_interest <- raw_data |>
        pull(.data$.cell) |>
        unique() |>
        as.character()

  • 4. aside is not used

    #' Used in a pipeline to run one or more expressions with side effects, but
    #' return the input value as the output value unaffected
    aside <- function(x, ...) {
        # Courtesy of Hadley: https://fosstodon.org/@hadleywickham/109558265769090930
        list(...)
        x
    }

  • 5. tar is not needed anymore

    #' @importFrom utils untar

  • @vjcitn mentioned that we should improve the @description fields of the get_* functions to note that only count files (sparse HDF5) are saved in the cache.

Thanks!

Track user access for metadata queries and file downloads

For future support and fund raising it is important that we can show how much our database and API have been used.

This could be measured in terms of downloads of metadata and RNA count files, and/or unique user accesses to the NECTAR machine.

Explanation of the DB versions

  • base (0.1): This is the first version
  • 0.1.5
    • new metadata with the 0.1 cell_type_harmonised because the 0.2 had a bug
  • 0.2:
    • updated tissue classification
    • update gene nomenclature
    • curated some gene distributions
    • update cell predictions, keeping same hierarchy
  • 0.2.1
    • update cell hierarchy
    • dropped all-zero cells and Inf counts

Twitter communication thread of CuratedCellAtlas R and Py

Hello APP team,

here we are defining the content, images and text of the Twitter thread we want to post. I think that, when we make this app public, we should coordinate how we communicate this app, still the first of its kind, to get the community excited.

Some points for thought

  • do we want to mention/tag funding bodies
  • do we want to mention/tag institutions
  • which language
  • which content
  • which figures
  • do we produce a github webpage/post in some blog or not
  • which hashtags (I'm not a twitter expert)

Enable gene selection in get_SingleCellExperiment()

Hello @multimeric, to enable cell integration I think we should allow loading data for just a subset of genes, so that if we need only 30 genes we load very little data.

I think the best place to do this is where we loop over and load the single files: we can subset each file before merging, as in the sketch below.
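Something like this (a sketch; sce_list and features are placeholders for the list of per-file SCEs and the requested gene symbols):

library(SingleCellExperiment)

subset_and_merge <- function(sce_list, features = NULL) {
    if (!is.null(features)) {
        sce_list <- lapply(sce_list, function(sce) {
            sce[intersect(features, rownames(sce)), ]
        })
    }
    do.call(cbind, sce_list)
}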

Document the solution for `namespace ‘dbplyr’ 2.2.1 is being loaded, but >= 2.3.0 is required`

Users who have an older dbplyr might receive this error on installing the package:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
  namespace ‘dbplyr’ 2.2.1 is being loaded, but >= 2.3.0 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package ‘CuratedAtlasQueryR’
* removing ‘/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/CuratedAtlasQueryR’
Warning message:
In i.p(...) :
  installation of package ‘/var/folders/q5/fvpyppm924525yl9m3xpgjm400031g/T//RtmpIedEsU/file86b62ec9a27a/CuratedAtlasQueryR_0.4.2.tar.gz’ had non-zero exit status

Some package seems to pull in dbplyr automatically, so it's not as simple as just unloading it. The easiest solution seems to be to run install.packages("dbplyr") first.
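So the documented fix would be something like this (the GitHub remote is just an example of how the package might be re-installed):

install.packages("dbplyr")   # make sure dbplyr >= 2.3.0 is installed first
# then re-install the package, e.g.:
# remotes::install_github("stemangiola/CuratedAtlasQueryR")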

Database update process

Count Data Update

  1. The data in NECTAR is deleted using swift delete harmonised-human-atlas
  2. The new data is uploaded using:
    swift upload harmonised-human-atlas /vast/projects/cellxgene_curated/splitted_DB2_data_0.2 --object-name original --segment-size 5000000000
    swift upload harmonised-human-atlas /vast/projects/cellxgene_curated/splitted_DB2_data_scaled_0.2 --object-name cpm --segment-size 5000000000
  3. The REMOTE_URL is updated in the R package
  4. Local cache needs to be given appropriate permissions:
    chmod --recursive a+rX /vast/projects/cellxgene_curated/splitted_DB2_data_scaled_0.2 /vast/projects/cellxgene_curated/splitted_DB2_data_0.2

Metadata File Update

  1. The old metadata is deleted using swift delete metadata
  2. The new metadata is uploaded using swift upload metadata /vast/projects/RCP/human_cell_atlas/metadata.0.2.2.parquet --object-name metadata.0.2.2.parquet
  3. The default remote_url for the metadata is updated to this new path

welcome to the repo! Please fork and familiarise yourself with the project page

Hello @multimeric, @nishikaoshadini, @ConnieLWS! The team is formed.

A few points to get started

  1. Please familiarise yourself with the project page https://github.com/users/stemangiola/projects/2/views and with how to track your progress by moving tasks across tabs (todo, in progress, etc.)

  2. In the Github README https://github.com/stemangiola/HCAquery you will find an example of how to interrogate the Human Cell Atlas with our harmonised API.

  3. Let's work using GitHub forks of this repository, so our progress will be tracked and clearly visible to everyone on the GitHub network page https://github.com/stemangiola/HCAquery/network

  4. We have 3 channels of communication

  • Teams for internal, extemporaneous communication
  • Github Project/issues for formal code-related communication, where we can add working figures, code, commit comments, etc.
  • Slack for international communication/coordination with Martin

memory profile of workflow raw data processing, splitting jobs

The job submission of the workflow is very inefficient because we are asking for much more memory than needed. It would be nice to profile the memory consumption of the launched jobs so we can learn how much memory to request for each job based on, for example, input file size.
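A rough sketch of what per-job profiling could look like (the gc()-based measurement is approximate, and process_fun is a placeholder for the actual processing step):

profile_job <- function(path, process_fun) {
    invisible(gc(reset = TRUE))
    process_fun(path)
    g <- gc()
    # Approximate peak memory in MB (~56 bytes per Ncell, 8 bytes per Vcell)
    peak_mb <- (g["Ncells", "max used"] * 56 + g["Vcells", "max used"] * 8) / 1e6
    data.frame(file = path,
               input_mb = file.size(path) / 1e6,
               peak_mb = peak_mb)
}

# Profiling a sample of files and fitting peak_mb ~ input_mb would then give a
# rule of thumb for how much memory to request per job (plus a safety margin).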

plan forward

  • create usage statistics with NECTAR
  • create a function that passes the full dataset metadata as a list of tibbles
  • build an automated DB upload and update system for when a new version of the DB is created
  • publish the Python API
  • ask for more disk space, to allow 3 DB versions
  • convert the CELLxGENE processing pipeline to a more standard framework, so we can update our database in the future
  • use AnnData for the R and Python APIs
  • better automate cell type curation
  • make the DB portable, so institutes can set up a local cache like WEHI has
  • make pseudobulk available

Documentation inconsistent?

The defaults for assays seem to be in conflict with the description of the argument:

     get_SingleCellExperiment(
       data,
       assays = c("counts", "cpm"),
       cache_directory = get_default_cache_dir(),
       repository = REMOTE_URL,
       features = NULL
     )
Arguments:

    data: A data frame containing, at minimum, a `.sample` column,
          which corresponds to a single cell sample ID. This can be
          obtained from the [get_metadata()] function.

  assays: A character vector whose elements must be either "raw" or
          "scaled", representing the corresponding assay you want to
          request.

Curate gene IDs

@stemangiola Could you please tackle the curation of gene symbols?

For example, for the object

    get_metadata() |>
        filter(
            ethnicity == "African" &
            assay %LIKE% "%10x%" &
            tissue == "lung parenchyma" &
            cell_type %LIKE% "%CD4%"
        ) |>
        get_SingleCellExperiment() |>
        rownames() |>
        sort()

the rownames include the listing shown below.

We need to identify pathologies, for example

  • tRNA, or
  • IDs both at the gene and isoform level (this would artificially create heterogeneity across databases)
  • Non-coding genes, with no real use
  • Pseudo genes
  • other strange things we might find among the 60K IDs

   [1] "5_8S_rRNA_ENSG00000273730" "5_8S_rRNA_ENSG00000275757" "5_8S_rRNA_ENSG00000275877" "5_8S_rRNA_ENSG00000276871" "5_8S_rRNA_ENSG00000277739"
   [6] "5_8S_rRNA_ENSG00000283274" "5_8S_rRNA_ENSG00000283291" "5_8S_rRNA_ENSG00000283568" "5S_rRNA_ENSG00000276861"   "5S_rRNA_ENSG00000277411"  
  [11] "5S_rRNA_ENSG00000277488"   "5S_rRNA_ENSG00000285609"   "5S_rRNA_ENSG00000285626"   "5S_rRNA_ENSG00000285674"   "5S_rRNA_ENSG00000285776"  
  [16] "5S_rRNA_ENSG00000285912"   "7SK_ENSG00000202198"       "7SK_ENSG00000271394"       "7SK_ENSG00000273591"       "7SK_ENSG00000274303"      
  [21] "7SK_ENSG00000275933"       "7SK_ENSG00000276626"       "7SK_ENSG00000277313"       "A1BG"                      "A1BG-AS1"                 
  [26] "A1CF"                      "A2M"                       "A2M-AS1"                   "A2ML1"                     "A2ML1-AS1"                
  [31] "A2ML1-AS2"                 "A2MP1"                     "A3GALT2"                   "A4GALT"                    "A4GNT"                                  "ABL2"                      "ABLIM1"                   
 [191] "ABLIM2"                    "ABLIM3"                    "ABO"                       "ABR"                       "ABRA"                     
 [196] "ABRACL"                    "ABRAXAS1"                  "ABRAXAS2"                  "ABT1"                      "ABT1P1"                   
 [201] "ABTB1"                     "ABTB2"                     "AC000032.2"                "AC000035.3"                "AC000036.4"               
 [206] "AC000041.8"                "AC000050.1"                "AC000067.1"                "AC000068.10"               "AC000068.5"               
 [211] "AC000068.9"                "AC000077.2"                "AC000078.5"                "AC000081.2"                "AC000082.4"               
 [216] "AC000085.4"                "AC000089.3"                "AC000095.12"               "AC000095.9"                "AC000111.4"

After we have identified the pathologies, we should design rules for renaming or filtering IDs in our pipeline.

After we have defined those rules, we can meet to add them to our pipeline in the form of code.
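To make the kind of rule concrete, an illustrative sketch (the regexes are examples of possible rules, not the agreed ones):

clean_gene_ids <- function(ids) {
    drop <- grepl("^(5_8S_rRNA|5S_rRNA|7SK)_ENSG", ids) |   # structural RNAs
            grepl("^AC[0-9]+\\.[0-9]+$", ids)               # clone-based IDs
    ids <- ids[!drop]
    # collapse remaining "SYMBOL_ENSGxxxxxxxxxxx" entries to the bare symbol
    sub("_ENSG[0-9]+$", "", ids)
}

# clean_gene_ids(rownames(sce)) |> sort() |> head()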

Fix github actions

We should fix github actions, so we can render our github pages.

For tests, we can just select a small number of cells from a very small file_id (one containing few cells in total).
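For example (a sketch; "SMALL_FILE_ID" is a placeholder for a file_id known to contain only a few cells):

library(testthat)

test_that("get_SingleCellExperiment works on a tiny subset", {
    sce <- get_metadata() |>
        dplyr::filter(file_id == "SMALL_FILE_ID") |>
        head(5) |>
        get_SingleCellExperiment()
    expect_s4_class(sce, "SingleCellExperiment")
    expect_lte(ncol(sce), 5)
})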

unable to open database file

Hello @multimeric,

I get

> hca_metadata = get_metadata()
Error: Could not connect to database:
unable to open database file

is /vast/projects/RCP/human_cell_atlas/metadata.sqlite readable by any WEHI user?

Ship compressed SQLite?

Hello @multimeric ,

downloading the SQLite takes incredibly long (~2-3 h).

6627.679 sec elapsed

I was wondering, since the rds occupies < 1 GB and the sqlite occupies ~30 GB: would it be possible to download a compressed form of the SQLite database?

Funnily enough, I think it would take less time to download the rds file and create the SQLite DB locally than to download the SQLite directly.
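A sketch of what the client side could look like if we shipped an xz-compressed SQLite (the URL is a placeholder):

url     <- "https://example.com/metadata.sqlite.xz"   # placeholder
dest_xz <- file.path(tempdir(), "metadata.sqlite.xz")
dest_db <- file.path(tempdir(), "metadata.sqlite")

download.file(url, dest_xz, mode = "wb")

# Stream-decompress the xz archive back to a plain SQLite file
in_con  <- xzfile(dest_xz, open = "rb")
out_con <- file(dest_db, open = "wb")
while (length(chunk <- readBin(in_con, what = raw(), n = 1e7)) > 0) {
    writeBin(chunk, out_con)
}
close(in_con); close(out_con)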

Storage Formats and Query Times

Just creating an issue for the discussion of the storage format: i.e. the format that we use for providing the final dataset to users, and the method that we use to retrieve data for a user query. I have benchmarked a few formats/methods, and so far the Python AnnData approach is the fastest. I will attach more results later.

However, something we should consider is not just the query time, but also the memory footprint of each method. Currently the in-memory approaches are the fastest, which is not surprising, but also use the most memory. As we increase the size of the data beyond the minimal benchmarking dataset I have used, the memory usage will probably exceed what a reasonable user might have access to. For this reason we should investigate streaming approaches, especially in Python.

However, for the moment I understand this work is on hold, while we work on integration.

Size optimisation

Hello @multimeric,

I believe that if we just save sparse HDF5 matrices, and compose the SingleCellExperiment inside the get_SingleCellExperiment function, we would be able to save ~10x disk space.

Have a look at the size of the R object compared to the count matrix, using the DelayedArray infrastructure:

(cctools-env) slurm-login02 552 % du -h /vast/scratch/users/mangiola.s/human_cell_atlas/splitted_light_data/facb6f08ba946c2a98eeb2cf53db2a9f/*
181M    /vast/scratch/users/mangiola.s/human_cell_atlas/splitted_light_data/facb6f08ba946c2a98eeb2cf53db2a9f/assays.h5
2.3G    /vast/scratch/users/mangiola.s/human_cell_atlas/splitted_light_data/facb6f08ba946c2a98eeb2cf53db2a9f/se.rds

Almost all of the gigabytes are occupied by the se.rds itself, not by the counts.
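A sketch of the proposed approach: write only the sparse counts to HDF5 and rebuild the SingleCellExperiment lazily at load time (file paths, the dataset name and cell_metadata are illustrative):

library(HDF5Array)
library(SingleCellExperiment)

# Save only the counts as a sparse HDF5 dataset, not the serialised SCE
writeHDF5Array(counts(sce), filepath = "assays.h5", name = "counts",
               as.sparse = TRUE)

# At load time, point a DelayedArray at the HDF5 dataset and rebuild the SCE
counts_h5 <- HDF5Array("assays.h5", "counts", as.sparse = TRUE)
sce_rebuilt <- SingleCellExperiment(
    assays  = list(counts = counts_h5),
    colData = cell_metadata   # cell-level metadata stored separately
)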

Create GIF video for the API usage

@multimeric, if you don't terribly dislike the idea: in the past it was useful for me to create a ~10-second GIF video of a package's usage; it is a good product for any communication.

For example, the video in this README: https://github.com/stemangiola/tidygate

The way I did it:

  1. record my screen
  2. import into a video editor and crop
  3. cut dead time and compress slow parts
  4. export to GIF

The code we might showcase could be:

get_metadata()

get_metadata() |>
    dplyr::filter(
        ethnicity == "African" &
        str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        str_like(cell_type, "%CD4%")
    )

get_metadata() |>
    dplyr::filter(
        ethnicity == "African" &
        str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        str_like(cell_type, "%CD4%")
    ) |>
    get_SingleCellExperiment()

Persistent connections

It isn't ideal to establish a new database connection each time we run get_metadata() because:

  • It's slightly slower
  • It burdens the database server
  • We get the annoying message: Warning message: In .Internal(gc(verbose, reset, full)) : call dbDisconnect() when finished working with a connection

The API needs to be changed in some way to persist the database connection between queries.
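One option is to cache the connection in a package-level environment, so repeated get_metadata() calls reuse it (a sketch, not the actual implementation):

.db_cache <- new.env(parent = emptyenv())

get_connection <- function(db_path) {
    if (is.null(.db_cache$con) || !DBI::dbIsValid(.db_cache$con)) {
        .db_cache$con <- DBI::dbConnect(duckdb::duckdb(), db_path)
        # Close the connection when the cache environment is garbage-collected
        reg.finalizer(.db_cache, function(e) {
            try(DBI::dbDisconnect(e$con, shutdown = TRUE), silent = TRUE)
        }, onexit = TRUE)
    }
    .db_cache$con
}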

Pivot from MySQL to SQLite for complex queries

I am not sure how to write proper feature-request documentation, so I will just describe it.

  • get_metadata*: add a class (e.g. "CuratedAtlasQuery") to the tbl instance, on top of the existing classes.
  • Add a count method (not redefining the generic). In that method we check whether the number of columns passed via ... is > 1; if so, we throw an error pointing to get_metadata_local. If the count is on one column, we call the original count method with no changes.
  • Add a collect method (not redefining the generic). Throw an error from this method for the moment.
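A sketch of the behaviour described above (class and helper names follow the issue; everything else is an assumption):

# get_metadata() would tag its return value with the extra class:
# class(tbl) <- c("CuratedAtlasQuery", class(tbl))

count.CuratedAtlasQuery <- function(x, ..., wt = NULL, sort = FALSE, name = NULL) {
    if (...length() > 1) {
        stop("Counting more than one column is not supported on the remote ",
             "database; please use get_metadata_local() instead.", call. = FALSE)
    }
    NextMethod()
}

collect.CuratedAtlasQuery <- function(x, ...) {
    stop("collect() is not supported on the remote database for now.",
         call. = FALSE)
}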

makesql produces a 0-byte file

For example, if I run

Rscript metadata_database.R /stornext/General/scratch/GP_Transfer/stefano_HCA/metadata_annotated.rds temp.sqlite

I get

 Rscript metadata_database.R /stornext/General/scratch/GP_Transfer/stefano_HCA/metadata_annotated.rds temp.sqlite
Error: file is not a database
In addition: Warning message:
Couldn't set synchronous mode: file is not a database
Use `synchronous` = NULL to turn off this warning. 
Execution halted

Minor repo/website feedback

  • Capitalise website and maybe bold it in the readme to make it more prominent
  • Add the pkgdown website to the github repo as its website (under Settings, in the top right)
  • Update the documentation for get_metadata() to handle the new cell_ naming convention rather than .cell
  • data argument for get_SingleCellExperiment needs updated documentation
  • Make it clear that repository is an optional argument
  • Shorten the get_Seurat() title line in the docs, it's very long
  • Hyperlink to cellxgene.cziscience.com wherever we mention it
  • Minor grammar in the get_metadata details: > it's vignette using_cellxgenedp provides an overview of the columns in the metadata. The data for which the column organism_name included "Homo sapiens" was collected collected from cellxgenedp.

I'm happy to work on these tomorrow

Error while installing package

Installing package into ‘/stornext/Home/data/allstaff/i/iskander.j/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
* installing *source* package ‘CuratedAtlasQueryR’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace ‘dbplyr’ 2.2.1 is being loaded, but >= 2.3.0 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package ‘CuratedAtlasQueryR’
* removing ‘/stornext/Home/data/allstaff/i/iskander.j/R/x86_64-pc-linux-gnu-library/4.2/CuratedAtlasQueryR’
Warning message:
In i.p(...) :
  installation of package ‘/tmp/Rtmp3xpRmI/fileb36a69a2c572/CuratedAtlasQueryR_0.2.0.tar.gz’ had non-zero exit status
