CuratedAtlasQueryR's People

Contributors

jwokaty, multimeric, myushen, stemangiola


CuratedAtlasQueryR's Issues

Prepare users for 36GB download

The download seems fast enough, but people may benefit from knowing that they will receive 36 GB of data. I'd suggest making the download function conditional on a positive response to a prompt asking whether they want to proceed.
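A minimal sketch of what this could look like (the download step called at the end is a hypothetical placeholder, not the package's actual API):

confirm_large_download <- function(size_gb = 36) {
    # Never block non-interactive (scripted / CI) sessions
    if (!interactive()) return(TRUE)
    answer <- readline(
        sprintf("This will download ~%d GB of data. Proceed? [y/N] ", size_gb)
    )
    tolower(trimws(answer)) %in% c("y", "yes")
}

# Hypothetical usage inside the package's download step:
# if (confirm_large_download()) sync_assay_files(files)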

decrease DB size, through a permanent VIEW?

Provided that the keys are realised from the non-redundant tables to the virtual table, so that performance is not affected, I think it would be incredibly beneficial to decrease the DB size by > 10x by separating cell metadata from sample metadata and replacing the metadata table with a virtual one.

This would result in the download and decompression of the xz SQLite taking 10 seconds rather than 3-5 minutes, with a huge benefit to the R and Python interfaces and to quality of life for new users.
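A minimal sketch of the idea using DuckDB from R, assuming the current flat table is available as metadata_full (table and column names are illustrative, not the actual schema):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), "metadata.duckdb")

# Split the redundant flat table: cell-level rows keep only the sample key,
# sample-level columns are stored once per sample.
dbExecute(con, "
    CREATE TABLE cell_metadata AS
    SELECT cell_, file_id_db, sample_, cell_type FROM metadata_full;
")
dbExecute(con, "
    CREATE TABLE sample_metadata AS
    SELECT DISTINCT sample_, dataset_id, tissue, assay, ethnicity FROM metadata_full;
")

# The table users query becomes a VIEW over the join, so the interface is
# unchanged while the on-disk size shrinks.
dbExecute(con, "
    CREATE VIEW metadata AS
    SELECT c.*, s.dataset_id, s.tissue, s.assay, s.ethnicity
    FROM cell_metadata c
    JOIN sample_metadata s USING (sample_);
")

dbDisconnect(con, shutdown = TRUE)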

[EDIT] whole-DB preview slower in RStudio on HPC than in local RStudio

While printing the metadata to screen is immediate with DuckDB, with our new Parquet implementation it takes ~1 second, which is slightly annoying.

(the other, more complicated queries are very fast with both frameworks)

Maybe in the future we could revisit DuckDB vs Parquet and weigh the advantages and disadvantages.

Issues with 1-column SCEs

Reprex:

library(dplyr)
library(HCAquery)
tibble(file_id_db="f1ba4ef0a4822e3860fe045f18b2b2dc", .cell="ACGATACGTAAGAGGA-3") |> get_SingleCellExperiment()
Reading 1 files.
.
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘assayNames<-’ for signature ‘"NULL", "character"’

this query fails

This query fails, probably because only one cell is selected:

get_metadata() |> filter(.cell == "Arunachalam-cov01_AAAGTCCTCGCCTTGT-1") |> head(1) |>  get_SingleCellExperiment()
Reading 1 files.
.
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘assayNames<-’ for signature ‘"NULL", "character"’

AWS Database

  • Upload database to S3
  • Update R package to support loading from S3

Different row numbers

As discussed. Here is an example:

"/vast/projects/cellxgene_curated/splitted_DB2_data_0.2" |> list.files() |> head(20) |> purrr::map_int(\(file){ file |> HDF5Array::loadHDF5SummarizedExperiment() |> nrow() })
 [1] 36422 35662 35662 36422 35676 35676 35676 35761 35343 35761 35492 35790
[13] 35770 35761 35495 35372 35790 35790 35917 35941

integrate cell types and reclassify

  1. Divide cells based on macroclusters (e.g. B cells, CD8 T cells, monocytes). This is not always trivial: we have some high-confidence annotations, but some cells cannot be easily classified as T, B, or monocytes.
  • Select a small set of gene markers for the major immune cells (B, T, monocytes, DC) and do integration on 11M cells. This is procedurally cleaner; however, I don't know if it's possible with so many cells, even if we use just 20 genes.
  • @multimeric: list in a comment below the best candidate algorithms that have been shown to suit atlas-level integration.
  • @ALL: we will pick a couple from that list.
  • @ConnieLWS: select two optimal datasets to test the initial classification (@ConnieLWS) and integration (@multimeric).
  • @multimeric: implement/install those methods, and test them with the small sample selection that @ConnieLWS is using.
  • @multimeric: then select high-confidence NK cells (starting testing with a small random cell selection from the full NK database) and try to produce an integrated PCA and UMAP, colouring by file_id and .sample and omitting the legends to save space in the plot (a minimal sketch follows this list).
  • @ConnieLWS: run PCA and tSNE (built from the first ~5 PCs) with the gene signature we have, colouring by file_id and .sample and omitting the legends to save space in the plot, without integration.
  • Integrate cells in the same space, and do clustering and trajectory analysis.
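A minimal sketch of the PCA/UMAP step above, assuming a scater workflow and an sce object holding the selected cells with a logcounts assay and file_id / sample_ columns in colData (the object and column names are assumptions):

library(scater)

sce <- runPCA(sce, ncomponents = 5)
sce <- runUMAP(sce, dimred = "PCA")

# Colour by file_id and sample, omitting the legends to save plot space
plotReducedDim(sce, dimred = "UMAP", colour_by = "file_id") +
    ggplot2::theme(legend.position = "none")
plotReducedDim(sce, dimred = "UMAP", colour_by = "sample_") +
    ggplot2::theme(legend.position = "none")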

`_cell` must be size 202 or 1, not 157.

I get this error with the new version.

> get_metadata() |> 
+     
+     # Filter and subset
+     filter(cell_type_harmonised=="cdc") |> 
+     select(`_cell`, file_id_db, dataset_id, tissue_harmonised) |> 
+     # Get counts per million for NCAM1 gene 
+     get_SingleCellExperiment(assays = "cpm", features = "HLA-A") 
ℹ Realising metadata.
Synchronising files
Reading files.
Error in `map2()`:
In index: 1.
With name: cpm.
Caused by error in `dplyr::summarise()`:
In argument: `sces = list(...)`.
Caused by error in `mutate()`:
In argument: `_cell = new_cellnames`.
Caused by error:
! `_cell` must be size 202 or 1, not 157.
Run `rlang::last_error()` to see where the error occurred.
Warning messages:
1: Connection is garbage-collected, use dbDisconnect() to avoid this. 
2: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 

minor bugs/improvements

  • 1. In a new installation the cache doesn't yet exist, so normalizePath() issues a warning:
> meta = get_metadata()
ℹ Downloading https://object-store.rc.nectar.org.au/v1/AUTH_06d6e008e3e642da99d806ba3ea629c5/metadata-sqlite/metadata.parquet to /Users/ma38727/Library/Caches/org.R-project.R/R/CuratedAtlasQueryR/metadata.parquet
  |======================================================================| 100%
Warning message:
In normalizePath(R_user_dir(packageName(), "cache")) :
  path[1]="/Users/ma38727/Library/Caches/org.R-project.R/R/CuratedAtlasQueryR": No such file or directory

I don't think the call to normalizePath() is necessary, since R_user_dir() does not return a 'denormalized' path.
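If we keep a call at all, a couple of ways to avoid the warning (a sketch, not the current package code):

# Create the cache directory before normalising the path...
cache_dir <- tools::R_user_dir("CuratedAtlasQueryR", "cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
cache_dir <- normalizePath(cache_dir)

# ...or silence the check entirely
cache_dir <- normalizePath(tools::R_user_dir("CuratedAtlasQueryR", "cache"),
                           mustWork = FALSE)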


  • 2. The 0.3.0 tag release has Version: 0.2.0

    Version: 0.2.0

  • 3. This line is unused

    cells_of_interest <- raw_data |>
        pull(.data$.cell) |>
        unique() |>
        as.character()

  • 4. aside is not used

    #' Used in a pipeline to run one or more expressions with side effects, but
    #' return the input value as the output value unaffected
    aside <- function(x, ...) {
        # Courtesy of Hadley: https://fosstodon.org/@hadleywickham/109558265769090930
        list(...)
        x
    }

  • 5. tar is not needed anymore

    #' @importFrom utils untar

  • @vjcitn mentioned that we should improve the @description fields of the get_* functions to note that only count files (sparse HDF5) are saved in the cache.

Thanks!

Track user access for metadata queries and file downloads

For future support and fund raising it is important that we can show how much our database and API have been used.

This could be measured in terms of downloads of metadata and RNA count files, and/or unique user accesses to the NECTAR machine.

Explanation of the DB versions

  • base (0.1): This is the first version
  • 0.1.5
    • new metadata with the 0.1 cell_type_harmonised because the 0.2 had a bug
  • 0.2:
    • updated tissue classification
    • update gene nomenclature
    • curated some gene distributions
    • update cell predictions, keeping same hierarchy
  • 0.2.1
    • update cell hierarchy
    • dropped all-zero cells and Inf counts

Twitter communication thread of CuratedCellAtlas R and Py

Hello APP team,

here we are defining the content, images and text of the Twitter thread we want to post. I think that, when we make this app public, we should coordinate how we communicate this app, still the first of its kind, to get the community excited.

Some points for thought

  • do we want to mention/tag funding bodies
  • do we want to mention/tag institutions
  • which language
  • which content
  • which figures
  • do we produce a github webpage/post in some blog or not
  • which hashtags (I'm not a twitter expert)

Enable gene selection in get_SingleCellExperiment()

Hello @multimeric, to enable cell integration I think we should allow loading data for just a subset of genes, so that if we need only 30 genes we load very little data.

I think the best place to do this is where we loop over and load the single files: we can subset each file before merging, as in the sketch below.
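Something like this (a sketch; sce_list and features are placeholders for the list of per-file SCEs and the requested gene symbols):

library(SingleCellExperiment)

subset_and_merge <- function(sce_list, features = NULL) {
    if (!is.null(features)) {
        sce_list <- lapply(sce_list, function(sce) {
            sce[intersect(features, rownames(sce)), ]
        })
    }
    do.call(cbind, sce_list)
}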

Document the solution for `namespace ‘dbplyr’ 2.2.1 is being loaded, but >= 2.3.0 is required`

Users who have an older dbplyr might receive this error on installing the package:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
  namespace ‘dbplyr’ 2.2.1 is being loaded, but >= 2.3.0 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package ‘CuratedAtlasQueryR’
* removing ‘/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/CuratedAtlasQueryR’
Warning message:
In i.p(...) :
  installation of package ‘/var/folders/q5/fvpyppm924525yl9m3xpgjm400031g/T//RtmpIedEsU/file86b62ec9a27a/CuratedAtlasQueryR_0.4.2.tar.gz’ had non-zero exit status

Some package seems to pull in dbplyr automatically, so it's not as simple as just unloading it. The easiest solution seems to be to run install.packages("dbplyr") first.
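So the documented fix would be something like this (the GitHub remote is just an example of how the package might be re-installed):

install.packages("dbplyr")   # make sure dbplyr >= 2.3.0 is installed first
# then re-install the package, e.g.:
# remotes::install_github("stemangiola/CuratedAtlasQueryR")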

Database update process

Count Data Update

  1. The data in NECTAR is deleted using swift delete harmonised-human-atlas
  2. The new data is uploaded using:
    swift upload harmonised-human-atlas /vast/projects/cellxgene_curated/splitted_DB2_data_0.2 --object-name original --segment-size 5000000000
    swift upload harmonised-human-atlas /vast/projects/cellxgene_curated/splitted_DB2_data_scaled_0.2 --object-name cpm --segment-size 5000000000
  3. The REMOTE_URL is updated in the R package
  4. Local cache needs to be given appropriate permissions:
    chmod --recursive a+rX /vast/projects/cellxgene_curated/splitted_DB2_data_scaled_0.2 /vast/projects/cellxgene_curated/splitted_DB2_data_0.2

Metadata File Update

  1. The old metadata is deleted using swift delete metadata
  2. The new metadata is uploaded using swift upload metadata /vast/projects/RCP/human_cell_atlas/metadata.0.2.2.parquet --object-name metadata.0.2.2.parquet
  3. The default remote_url for the metadata is updated to this new path

welcome to the repo! Please fork and familiarise yourself with the project page

Hello @multimeric, @nishikaoshadini, @ConnieLWS! The team is formed.

A few points to get started

  1. Please familiarise yourself with the project page https://github.com/users/stemangiola/projects/2/views and with how to track your progress by moving tasks across tabs (todo, in progress, etc.)

  2. In the Github README https://github.com/stemangiola/HCAquery you will find an example of how to interrogate the Human Cell Atlas with our harmonised API.

  3. Let's work using GitHub forks of this repository, so our progress will be tracked and clearly visible to everyone on the GitHub network page https://github.com/stemangiola/HCAquery/network

  4. We have 3 channels of communication

  • Teams for internal, extemporaneous communication
  • Github Project/issues for formal code-related communication, where we can add working figures, code, commit comments, etc.
  • Slack for international communication/coordination with Martin

memory profile of workflow raw data processing, splitting jobs

The job submission of the workflow is very inefficient because we are asking for much more memory than needed. It would be nice to profile the memory consumption of the launched jobs so we can learn how much memory to request for each job based on, for example, input file size.
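A rough sketch of what per-job profiling could look like (the gc()-based measurement is approximate, and process_fun is a placeholder for the actual processing step):

profile_job <- function(path, process_fun) {
    invisible(gc(reset = TRUE))
    process_fun(path)
    g <- gc()
    # Approximate peak memory in MB (~56 bytes per Ncell, 8 bytes per Vcell)
    peak_mb <- (g["Ncells", "max used"] * 56 + g["Vcells", "max used"] * 8) / 1e6
    data.frame(file = path,
               input_mb = file.size(path) / 1e6,
               peak_mb = peak_mb)
}

# Profiling a sample of files and fitting peak_mb ~ input_mb would then give a
# rule of thumb for how much memory to request per job (plus a safety margin).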

plan forward

  • create usage statistics with NECTAR
  • create a function that passes the full dataset metadata as a list of tibbles
  • build an automated DB upload and update system for when a new version of the DB is created
  • publish the Python API
  • ask for more disk space, to allow 3 DB versions
  • convert the CELLxGENE processing pipeline to a more standard framework, so we can update our database in the future
  • use AnnData for the R and Python APIs
  • better automate cell type curation
  • make the DB portable, so institutes can set up a local cache like WEHI has
  • make pseudobulk available

Documentation inconsistent?

The defaults for assays seem to be in conflict with the description of the argument:

     get_SingleCellExperiment(
       data,
       assays = c("counts", "cpm"),
       cache_directory = get_default_cache_dir(),
       repository = REMOTE_URL,
       features = NULL
     )
Arguments:

    data: A data frame containing, at minimum, a `.sample` column,
          which corresponds to a single cell sample ID. This can be
          obtained from the [get_metadata()] function.

  assays: A character vector whose elements must be either "raw" or
          "scaled", representing the corresponding assay you want to
          request.

Curate gene IDs

@stemangiola Could you please tackle the curation of gene symbols?

For example, for the object

    get_metadata() |>
        filter(
            ethnicity == "African" &
            assay %LIKE% "%10x%" &
            tissue == "lung parenchyma" &
            cell_type %LIKE% "%CD4%"
        ) |>
        get_SingleCellExperiment() |>
        rownames() |>
        sort()

the rownames include the listing shown below.

We need to identify pathologies, for example

  • tRNA, or
  • IDs both at the gene and isoform level (this would artificially create heterogeneity across databases)
  • Non-coding genes, with no real use
  • Pseudo genes
  • other strange things we might find among the 60K IDs

   [1] "5_8S_rRNA_ENSG00000273730" "5_8S_rRNA_ENSG00000275757" "5_8S_rRNA_ENSG00000275877" "5_8S_rRNA_ENSG00000276871" "5_8S_rRNA_ENSG00000277739"
   [6] "5_8S_rRNA_ENSG00000283274" "5_8S_rRNA_ENSG00000283291" "5_8S_rRNA_ENSG00000283568" "5S_rRNA_ENSG00000276861"   "5S_rRNA_ENSG00000277411"  
  [11] "5S_rRNA_ENSG00000277488"   "5S_rRNA_ENSG00000285609"   "5S_rRNA_ENSG00000285626"   "5S_rRNA_ENSG00000285674"   "5S_rRNA_ENSG00000285776"  
  [16] "5S_rRNA_ENSG00000285912"   "7SK_ENSG00000202198"       "7SK_ENSG00000271394"       "7SK_ENSG00000273591"       "7SK_ENSG00000274303"      
  [21] "7SK_ENSG00000275933"       "7SK_ENSG00000276626"       "7SK_ENSG00000277313"       "A1BG"                      "A1BG-AS1"                 
  [26] "A1CF"                      "A2M"                       "A2M-AS1"                   "A2ML1"                     "A2ML1-AS1"                
  [31] "A2ML1-AS2"                 "A2MP1"                     "A3GALT2"                   "A4GALT"                    "A4GNT"                                  "ABL2"                      "ABLIM1"                   
 [191] "ABLIM2"                    "ABLIM3"                    "ABO"                       "ABR"                       "ABRA"                     
 [196] "ABRACL"                    "ABRAXAS1"                  "ABRAXAS2"                  "ABT1"                      "ABT1P1"                   
 [201] "ABTB1"                     "ABTB2"                     "AC000032.2"                "AC000035.3"                "AC000036.4"               
 [206] "AC000041.8"                "AC000050.1"                "AC000067.1"                "AC000068.10"               "AC000068.5"               
 [211] "AC000068.9"                "AC000077.2"                "AC000078.5"                "AC000081.2"                "AC000082.4"               
 [216] "AC000085.4"                "AC000089.3"                "AC000095.12"               "AC000095.9"                "AC000111.4"

After we have identified the pathologies, we should design rules for renaming or filtering IDs in our pipeline.

After we have defined those rules, we can meet to add them to our pipeline in the form of code.
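To make the kind of rule concrete, an illustrative sketch (the regexes are examples of possible rules, not the agreed ones):

clean_gene_ids <- function(ids) {
    drop <- grepl("^(5_8S_rRNA|5S_rRNA|7SK)_ENSG", ids) |   # structural RNAs
            grepl("^AC[0-9]+\\.[0-9]+$", ids)               # clone-based IDs
    ids <- ids[!drop]
    # collapse remaining "SYMBOL_ENSGxxxxxxxxxxx" entries to the bare symbol
    sub("_ENSG[0-9]+$", "", ids)
}

# clean_gene_ids(rownames(sce)) |> sort() |> head()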

Fix github actions

We should fix github actions, so we can render our github pages.

For tests, we can just select a small number of cells from a very small file_id (one containing few cells in total).
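For example (a sketch; "SMALL_FILE_ID" is a placeholder for a file_id known to contain only a few cells):

library(testthat)

test_that("get_SingleCellExperiment works on a tiny subset", {
    sce <- get_metadata() |>
        dplyr::filter(file_id == "SMALL_FILE_ID") |>
        head(5) |>
        get_SingleCellExperiment()
    expect_s4_class(sce, "SingleCellExperiment")
    expect_lte(ncol(sce), 5)
})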

unable to open database file

Hello @multimeric,

I get

> hca_metadata = get_metadata()
Error: Could not connect to database:
unable to open database file

is /vast/projects/RCP/human_cell_atlas/metadata.sqlite readable by any WEHI user?

Ship compressed SQLite?

Hello @multimeric ,

downloading the SQLite takes incredibly long (~2-3 h).

6627.679 sec elapsed

I was wondering, since the rds occupies < 1 GB and the sqlite occupies ~30 GB: would it be possible to download a compressed form of the SQLite database?

Funnily enough, I think it would take less time to download the rds file and create the SQLite DB locally than to download the SQLite directly.
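A sketch of what the client side could look like if we shipped an xz-compressed SQLite (the URL is a placeholder):

url     <- "https://example.com/metadata.sqlite.xz"   # placeholder
dest_xz <- file.path(tempdir(), "metadata.sqlite.xz")
dest_db <- file.path(tempdir(), "metadata.sqlite")

download.file(url, dest_xz, mode = "wb")

# Stream-decompress the xz archive back to a plain SQLite file
in_con  <- xzfile(dest_xz, open = "rb")
out_con <- file(dest_db, open = "wb")
while (length(chunk <- readBin(in_con, what = raw(), n = 1e7)) > 0) {
    writeBin(chunk, out_con)
}
close(in_con); close(out_con)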

Storage Formats and Query Times

Just creating an issue for the discussion of the storage format: i.e. the format that we use for providing the final dataset to users, and the method that we use to retrieve data for a user query. I have benchmarked a few formats/methods, and so far the Python AnnData approach is the fastest. I will attach more results later.

However, something we should consider is not just the query time, but also the memory footprint of each method. Currently the in-memory approaches are the fastest, which is not surprising, but also use the most memory. As we increase the size of the data beyond the minimal benchmarking dataset I have used, the memory usage will probably exceed what a reasonable user might have access to. For this reason we should investigate streaming approaches, especially in Python.

However, for the moment I understand this work is on hold, while we work on integration.

Size optimisation

Hello @multimeric,

I believe that if we just save sparse HDF5 matrices, and compose the SingleCellExperiment inside the get_SingleCellExperiment function, we would be able to save ~10x disk space.

Have a look at the size of the R object compared to the count matrix, using the DelayedArray infrastructure:

(cctools-env) slurm-login02 552 % du -h /vast/scratch/users/mangiola.s/human_cell_atlas/splitted_light_data/facb6f08ba946c2a98eeb2cf53db2a9f/*
181M    /vast/scratch/users/mangiola.s/human_cell_atlas/splitted_light_data/facb6f08ba946c2a98eeb2cf53db2a9f/assays.h5
2.3G    /vast/scratch/users/mangiola.s/human_cell_atlas/splitted_light_data/facb6f08ba946c2a98eeb2cf53db2a9f/se.rds

Almost all of the gigabytes are occupied by the se.rds itself, not by the counts.
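A sketch of the proposed approach: write only the sparse counts to HDF5 and rebuild the SingleCellExperiment lazily at load time (file paths, the dataset name and cell_metadata are illustrative):

library(HDF5Array)
library(SingleCellExperiment)

# Save only the counts as a sparse HDF5 dataset, not the serialised SCE
writeHDF5Array(counts(sce), filepath = "assays.h5", name = "counts",
               as.sparse = TRUE)

# At load time, point a DelayedArray at the HDF5 dataset and rebuild the SCE
counts_h5 <- HDF5Array("assays.h5", "counts", as.sparse = TRUE)
sce_rebuilt <- SingleCellExperiment(
    assays  = list(counts = counts_h5),
    colData = cell_metadata   # cell-level metadata stored separately
)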

Create GIF video for the API usage

@multimeric, if you don't terribly dislike the idea: in the past it was useful for me to create a ~10-second GIF video of a package's usage; it is a good product for any communication.

For example, the video in this README: https://github.com/stemangiola/tidygate

The way I did it:

  1. record my screen
  2. import into a video editor and crop
  3. cut dead time and compress slow parts
  4. export to GIF

The code we might showcase could be:

get_metadata()

get_metadata() |>
    dplyr::filter(
        ethnicity == "African" &
        str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        str_like(cell_type, "%CD4%")
    )

get_metadata() |>
    dplyr::filter(
        ethnicity == "African" &
        str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        str_like(cell_type, "%CD4%")
    ) |>
    get_SingleCellExperiment()

Persistent connections

It isn't ideal to establish a new database connection each time we run get_metadata() because:

  • It's slightly slower
  • It burdens the database server
  • We get the annoying message: Warning message: In .Internal(gc(verbose, reset, full)) : call dbDisconnect() when finished working with a connection

The API needs to be changed in some way to persist the database connection between queries.
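One option is to cache the connection in a package-level environment, so repeated get_metadata() calls reuse it (a sketch, not the actual implementation):

.db_cache <- new.env(parent = emptyenv())

get_connection <- function(db_path) {
    if (is.null(.db_cache$con) || !DBI::dbIsValid(.db_cache$con)) {
        .db_cache$con <- DBI::dbConnect(duckdb::duckdb(), db_path)
        # Close the connection when the cache environment is garbage-collected
        reg.finalizer(.db_cache, function(e) {
            try(DBI::dbDisconnect(e$con, shutdown = TRUE), silent = TRUE)
        }, onexit = TRUE)
    }
    .db_cache$con
}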

Pivot from MySQL to SQLite for complex queries

I am not sure how to write proper feature-request documentation, so I will just describe it.

  • get_metadata*: add a class (e.g. "CuratedAtlasQuery") to the tbl instance, on top of the existing classes.
  • Add a count method (not redefining the generic). In that method we check whether the number of columns passed via ... is > 1; if so, we throw an error pointing to get_metadata_local. If the count is on one column, we call the original count method with no changes.
  • Add a collect method (not redefining the generic). Throw an error from this method for the moment.
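A sketch of the behaviour described above (class and helper names follow the issue; everything else is an assumption):

# get_metadata() would tag its return value with the extra class:
# class(tbl) <- c("CuratedAtlasQuery", class(tbl))

count.CuratedAtlasQuery <- function(x, ..., wt = NULL, sort = FALSE, name = NULL) {
    if (...length() > 1) {
        stop("Counting more than one column is not supported on the remote ",
             "database; please use get_metadata_local() instead.", call. = FALSE)
    }
    NextMethod()
}

collect.CuratedAtlasQuery <- function(x, ...) {
    stop("collect() is not supported on the remote database for now.",
         call. = FALSE)
}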

makesql produces a 0-byte file

For example, if I run

Rscript metadata_database.R /stornext/General/scratch/GP_Transfer/stefano_HCA/metadata_annotated.rds temp.sqlite

I get

 Rscript metadata_database.R /stornext/General/scratch/GP_Transfer/stefano_HCA/metadata_annotated.rds temp.sqlite
Error: file is not a database
In addition: Warning message:
Couldn't set synchronous mode: file is not a database
Use `synchronous` = NULL to turn off this warning. 
Execution halted

Minor repo/website feedback

  • Capitalise website and maybe bold it in the readme to make it more prominent
  • Add the pkgdown website to the github repo as its website (under Settings, in the top right)
  • Update the documentation for get_metadata() to handle the new cell_ naming convention rather than .cell
  • data argument for get_SingleCellExperiment needs updated documentation
  • Make it clear that repository is an optional argument
  • Shorten the get_Seurat() title line in the docs, it's very long
  • Hyperlink to cellxgene.cziscience.com wherever we mention it
  • Minor grammar in the get_metadata details: > it's vignette using_cellxgenedp provides an overview of the columns in the metadata. The data for which the column organism_name included "Homo sapiens" was collected collected from cellxgenedp.

I'm happy to work on these tomorrow

Error while installing package

Installing package into ‘/stornext/Home/data/allstaff/i/iskander.j/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
* installing *source* package ‘CuratedAtlasQueryR’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace ‘dbplyr’ 2.2.1 is being loaded, but >= 2.3.0 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package ‘CuratedAtlasQueryR’
* removing ‘/stornext/Home/data/allstaff/i/iskander.j/R/x86_64-pc-linux-gnu-library/4.2/CuratedAtlasQueryR’
Warning message:
In i.p(...) :
  installation of package ‘/tmp/Rtmp3xpRmI/fileb36a69a2c572/CuratedAtlasQueryR_0.2.0.tar.gz’ had non-zero exit status
