Code Monkey home page Code Monkey logo

numbat's Introduction

Numbat

<kharchenkolab> CRAN status CRAN downloads

Numbat is a haplotype-aware CNV caller from single-cell and spatial transcriptomics data. It integrates signals from gene expression, allelic ratio, and population-derived haplotype information to accurately infer allele-specific CNVs in single cells and reconstruct their lineage relationship.

Numbat can be used to:

  1. Detect allele-specific copy number variations from scRNA-seq and spatial transcriptomics
  2. Differentiate tumor versus normal cells in the tumor microenvironment
  3. Infer the clonal architecture and evolutionary history of profiled tumors.

image

Numbat does not require paired DNA or genotype data and operates solely on the donor scRNA-seq data (for example, 10x Cell Ranger output). For details of the method, please checkout our paper:

Teng Gao, Ruslan Soldatov, Hirak Sarkar, Adam Kurkiewicz, Evan Biederstedt, Po-Ru Loh, Peter Kharchenko. Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes. Nature Biotechnology (2022).

User Guide

For a complete guide, please see Numbat User Guide.

Questions?

We appreciate your feedback! Please raise a github issue for bugs, questions and new feature requests. For bug reports, please attach full log, error message, input parameters, and ideally a reproducible example (if possible).

numbat's People

Contributors

evanbiederstedt avatar hiraksarkar avatar igordot avatar teng-gao avatar whtns avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

numbat's Issues

Document creation of *rda objects

For reproducibility, users should know how all *rda objects were created.

I'm immediately skeptical when I see these files without some code + comments which explains how these objects were created.

Users should be able to create each *rda object following code which is in this file: https://github.com/kharchenkolab/numbat/tree/main/raw-data

Here's an example from conos: https://github.com/kharchenkolab/conos/blob/main/data-raw/create_rdata_file.R

Files for documentation:

  • count_mat_ATC2.rda
  • gtf_hg19.rda
  • ref_hca_counts.rda
  • df_allele_ATC2.rda
  • gtf_hg38.rda
  • vcf_meta.rda
  • genetic_map_hg19.rda
  • genetic_map_hg38.rda
  • ref_hca.rda

CNVs are not detected

Good day,

I have tried your tool and for many samples works like a charm. However, in some cases are not able to detect CNVs that are present (seen in WES and also other iCNV tools (e.g., copyKAT)). One example, the image below shows the iCNV determined by copyKAT but numbat could not identify them (attached log file).

Is there a way that by changing some parameters, those CNVs can be detected by numbat? What are other reasons why numbat can not detect them?

Thanks in advance!

Sample_1
log.txt

"No CNV remains after filtering" using default settings on 10x lymphoma data

I wanted to use Numbat to investigate the copy number variations in a publicly available 10x lymphoma dataset. My attempts ended with Numbat finding no copy number variations in this dataset, which presumably should have some.

Steps to reproduce:

  1. Run a docker with Numbat: docker run -v /home/jupyter/docker_test:/mnt/mydata -it pkharchenkolab/numbat-rbase:latest /bin/bash

  2. Download and extract the following files from 10x link: save the feature matrix along with barcodes and features from lymph_node_lymphoma_14k_filtered_feature_bc_matrix.tar.gz as lymphoma_matrix.mtx, lymphoma_barcodes.tsv and lymphoma_features.tsv; save the BAM file and its index from lymph_node_lymphoma_14k_gex_possorted_bam.bam as lymphoma.bam and lymphoma.bam.bai.

  3. Inside the docker, run

Rscript /numbat/inst/bin/pileup_and_phase.R --label lymphoma --samples lymphoma --bams /mnt/mydata/lymphoma.bam --barcodes /mnt/mydata/lymphoma_barcodes.tsv --outdir /mnt/mydata/lymphoma --gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf --paneldir /data/1000G_hg38 --ncores 16

to generate an allele count file.

  1. Prepare the matrix and allele count file to work with Numbat
mtx <- as(Matrix::readMM("/mnt/mydata/lymphoma_matrix.mtx"), "dgCMatrix")
df_allele_cnt <- read.csv("/mnt/mydata/lymphoma/lymphoma_allele_counts.tsv", sep = "\t")
features <- read.csv("/mnt/mydata/lymphoma_features.tsv", sep = "\t", header = F)
barcodes <- read.csv("/mnt/mydata/lymphoma_barcodes.tsv", sep = "\t", header = F)
dimnames(mtx) = list(features$V2, barcodes$V1)

duplicate_genes = duplicated(features$V2)
features_new = features[!duplicate_genes, ]
mtx_new = mtx[!duplicate_genes, ]
  1. Run Numbat:
run_numbat(mtx_new, ref_hca, df_allele_cnt, genome = "hg38", t = 1e-5, ncores = 4, plot = TRUE, out_dir = "/mnt/mydata/lymphoma_results")

Expected result:
I expected Numbat to find some copy number variations in this dataset.

Actual result:
Numbat runs for 1 iteration, then terminates with "No CNV after filtering - terminating" message (see attached image).

photo1665699191

Please help me understand the results I am getting: are there settings that should be different? Is the result expected for this dataset?

Error: Must group by variables found in `.data`. * Column `seg` is not found. * Column `sample` is not found. When analyzing two samples bound together

Hi again,
As you suggested, I used pileup_and_phase.R with two samples using ",". Then I used cbind and rbind on count matrices and allele dataframes by substituting before the "-1" suffix on barcodes in the second files for a "-2".
The code was going smoothly up to the fifth 'Retesting CNVs.." where it threw this:

Error: Must group by variables found in .data.
*** Column seg is not found.

  • Column sample is not found.**
    Backtrace:
  1. ├─numbat::run_numbat(...)
  2. │ └─%>%(...)
  3. ├─numbat::run_group_hmms(...)
  4. │ └─%>%(...)
  5. ├─dplyr::ungroup(.)
  6. ├─dplyr::mutate(., seg_start_index = min(snp_index), seg_end_index = max(snp_index))
  7. ├─dplyr::group_by(., seg, sample)
  8. └─dplyr:::group_by.data.frame(., seg, sample)
  9. └─dplyr::group_by_prepare(.data, ..., .add = .add, caller_env = caller_env())

Thanks again!
Jose

Barcodes pileup_and_phase

Hi @teng-gao ,
thanks for developing this nice tool! I'm currently testing it and comparing it with the Copykat and inferCNV and unexpectedly getting bad results and weird errors for some samples. Before posting their description here, I wanted to check if one should use only barcodes for high-quality cells (i.e. after QC and doublet removal) during the pileup_and_phase step? I used all barcodes from the barcodes.tsv files from the filtered_feature_bc_matrix folders .... could this cause the issues?

Clones in pseudobulk vs. tumour phylogeny

Hi Teng,

This is more of a question than an issue, sorry if I've missed something obvious about it in the docs. I'm trying to visually assess each clone's CNVs using

nb$bulk_clones %>% 
  plot_bulks(
    min_LLR = 50, # filtering CNVs by evidence
    legend = TRUE
    )

which produces the following plot:

bulk_clones

But I've noticed the number of clones (9) doesn't match the clones I visualise with

nb$mut_graph %>% plot_mut_history()

tree

which has 7. Do you have any info on why the clones in each plot are different? And is it possible to produce a plot like that above with plot_bulks() for the same 7 clones that are presented in the phylogeny? My goal is just to visually assess the CNVs for each clone to try and get a feel for whether I need to adjust any of the run parameters.

Thanks again for your help.

Error in if (UPGMA_score > NJ_score)

Error caused by providing segs_loh (ER3 in ovarian visium dataset)

Building phylogeny ..
Mem used: 0.517Gb
Using 9 CNVs to construct phylogeny
Aggregate function missing, defaulting to 'length'
Error in if (UPGMA_score > NJ_score) { : 
  missing value where TRUE/FALSE needed
Calls: run_numbat
In addition: Warning messages:
2: In log(1 - P) : NaNs produced
3: In log(1 - P) : NaNs produced
Execution halted

Error while using custom reference expression

I have an error when I run numbat with a custom reference expression rather than the one that comes with the package. Could you provide some guidance?
I am attaching the log file. The following was the error message:
...
number of genes left: 10296
number of genes left: 10324
Running HMMs on 5 cell groups..
Error in mutate():
! Problem while computing seg_index = 1:n().
seg_index must be size 0 or 1, not 2.
Backtrace:

  1. ├─numbat::run_numbat(...)
  2. │ └─bulk_subtrees %>% ...
  3. ├─numbat::run_group_hmms(...)
  4. │ └─numbat::find_common_diploid(...)
  5. │ └─bulks %>% annot_consensus(segs_consensus)
  6. ├─numbat::annot_consensus(., segs_consensus)
  7. │ └─... %>% select(-any_of(c("sample")))
  8. ├─dplyr::select(., -any_of(c("sample")))
  9. ├─dplyr::distinct(., snp_id, .keep_all = TRUE)
  10. ├─dplyr::left_join(...)
  11. ├─dplyr:::left_join.data.frame(...)
  12. │ └─dplyr::auto_copy(x, y, copy = copy)
  13. │ ├─dplyr::same_src(x, y)
  14. │ └─dplyr:::same_src.data.frame(x, y)
  15. │ └─base::is.data.frame(y)
  16. ├─segs_consensus %>% mutate(seg_index = 1:n())
  17. ├─dplyr::mutate(., seg_index = 1:n())
  18. ├─dplyr:::mutate.data.frame(., seg_index = 1:n())
  19. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
  20. │ ├─base::withCallingHandlers(...)
  21. │ └─mask$eval_all_mutate(quo)
  22. ├─dplyr:::dplyr_internal_error(...)
  23. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
  24. │ └─rlang:::signal_abort(cnd, .file)
  25. │ └─base::signalCondition(cnd)
  26. └─dplyr <fn>(<dpl:::__>)
  27. └─rlang::abort(...)
    Execution halted

log.txt

error with genotype function

Hi,
Do you have any suggestions on the following error popping up after running the genotype function in the pileup_and_phase.R script?

Error in as.list.default(X) :
no method for coercing this S4 class to a vector

VCF files from cellsnplite seem to be fine.
Thanks!

pileup_and_phase.R warnings

[W::vcf_parse] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'OTH' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AD' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'OTH' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AD' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'DP' is not defined in the header, assuming Type=String

Run_numbat unable to successfully get through more than one iteration (no clone_post_x.tsv file generated)

Hi ,
I'm getting this error when running "run_numbat"
Error in dcast.data.table(., cell + clone_opt + GT_opt + p_opt ~ clone, : Can not cast an empty data.table

and I think that I'm failing with the "cell_to_clone" function since I don't see a clone_post_x.tsv file in my output directory. I tried changing the max_iter to one when running this function, but that did not help.

Do you have any suggestions?
Thanks!
Anu

Error: C stack usage 7970628 is too close to the limit

Hi Teng,

I'm re-running on a subset of cells for which I had run Numbat successfully previously. However, it exits with error: "C stack usage 7970628 is too close to the limit"
I'm attaching the log file. I have the ulimit as unlimited.

I would appreciate your help. Thanks.
log (4).txt

Problem while computing `cnv_state_post = ...[]

Thank you for this package! However I got an error while running it and not sure where the problem was. Since I do not know which cels are cancer cells, I used copykat results as reference groups and at first it seem to work but after a certain step got this:
...
number of genes left: 11018
Fitted reference proportions:
aneuploid:1,diploid:7.6e-05,not_assigned:7.6e-05
number of genes left: 10736
Finishing..
Finishing..
Finishing..
Finishing..
Finishing..
Finishing..
Finishing..
Finishing..
Error in mutate(., cnv_state_post = c("loh", "amp", "del", "bamp", "bdel")[which.max(c(p_loh, :
Problem while computing cnv_state_post = ...[].
cnv_state_post must be size 1, not 0.
ℹ Did you mean: cnv_state_post = list(...[]) ?
ℹ The error occurred in row 41.

`Backtrace:

  1. ├─numbat::run_numbat(...)
  2. │ └─numbat::retest_bulks(...)
  3. │ └─bulks %>% run_group_hmms(run_hmm = F) %>% ...
  4. ├─dplyr::mutate(., LLR = ifelse(is.na(LLR), 0, LLR))
  5. ├─numbat::run_group_hmms(., run_hmm = F)
  6. │ └─... %>% ungroup()
  7. ├─dplyr::ungroup(.)
  8. ├─dplyr::mutate(., seg_start_index = min(snp_index), seg_end_index = max(snp_index))
  9. ├─dplyr::group_by(., seg, sample)
  10. └─dplyr::bind_rows(.)
  11. ├─tibble::as_tibble(dots)
  12. └─tibble:::as_tibble.list(dots)
  13. └─tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
    
  14.   └─tibble:::recycle_columns(x, .rows, lengths)
    
  15.     └─rlang::cnd_signal(...)
    

`
Thank you!

Error during 1. Iteration

Dear Professor:
I have some questions about the use of numbat, look forward to your reply.
The sample of TNBC1 BAM file downloaded from the https://sra-pub-src-2.s3.amazonaws.com/SRR11546787/BAM_TNBC1.bam.1 and the expression count matrix download form https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4476486 . Use Seurat to process the expression matrix,only process to clustering ( FindClusters() ) and use aggregate_counts() to prepare the expression reference,Use pileup_and_phase.R to generate TNBC1_allele_counts.tsv.gz . At last to run numbat with run_numbat ,but an error occurred. I do not know how to deal with.

Approximating initial clusters using smoothed expression ..
Mem used: 1.36Gb
number of genes left: 10480
running hclust...
Iteration 1
Mem used: 9.47Gb
Error in arrange():
! Problem with the implicit transmute() step.
✖ Problem while computing ..1 = CHROM.
Caused by error in mask$eval_all_mutate():
! object 'CHROM' not found
Backtrace:

  1. ├─global run_one_sample(...)
  2. │ └─numbat::run_numbat(...)
  3. │ └─numbat:::make_group_bulks(...)
  4. │ └─... %>% arrange(sample)
  5. ├─dplyr::arrange(., sample)
  6. ├─dplyr::mutate(., snp_index = as.integer(snp_id))
  7. ├─dplyr::mutate(., snp_id = factor(snp_id, unique(snp_id)))
  8. ├─dplyr::arrange(., CHROM, POS)
  9. ├─dplyr:::arrange.data.frame(., CHROM, POS)
  10. │ └─dplyr:::arrange_rows(.data, dots)
  11. │ ├─base::withCallingHandlers(...)
  12. │ ├─dplyr::transmute(new_data_frame(.data), !!!quosures)
  13. │ └─dplyr:::transmute.data.frame(new_data_frame(.data), !!!quosures)
  14. │ └─dplyr:::mutate_cols(.data, dots, caller_env = caller_env())
  15. │ ├─base::withCallingHandlers(...)
  16. │ └─mask$eval_all_mutate(quo)
  17. ├─base::.handleSimpleError(...)
  18. │ └─dplyr (local) h(simpleError(msg, call))
  19. │ └─rlang::abort(...)
  20. │ └─rlang:::signal_abort(cnd, .file)
  21. │ └─base::signalCondition(cnd)
  22. └─dplyr (local) <fn>(<dply:::_>)
  23. └─rlang::abort(bullets, call = error_call, parent = parent)
    Warning message:
    In mclapply(groups, mc.cores = ncores, function(g) { :
    all scheduled cores encountered errors in user code
    Execution halted

No CNV remains after filtering

Follow-up to #57

I’m running a pilot experiment with the objective to select the best tool for distinguishing between normal and malignant cells based on their CNV profile. I use data from this study 10.1016/j.molcel.2021.10.013.

I run sample SRR13914002 with hca_ref and with a custom ref using normal cells from the dataset. In both cases I got this:
"No CNV remains after filtering - terminating."

(I got same comment for another sample from so far six tested samples)

Here are log files -
log_custom_ref.txt
log_hca_ref.txt

Bulk plots for custom ref:
bulk_clones_1
bulk_subtrees_1

Bulk plots for hca ref
bulk_clones_2
bulk_subtrees_2

This is result from Copykat:

image

where ‘not available’ are malignant cells classified based on particular marker genes.

`Error in filter()` when calling `run_numbat`

Hello,

I am having an issue with the samples that were able to successfully work and produce allele counts with pileup_and_phase.R.

I call the container using singularity like so:

singularity run \
   --bind /mnt/isilon/ \
   --cleanenv -H /mnt/isilon/cscb/xxx/code/04b_NumBat \
   /mnt/isilon/cscb/xxx/code/04b_NumBat/numbat-rbase_latest.sif

and I enter an interactive R session. Following the Getting Started guide, I try to supply run_numbat with the gene x cell counts matrix and the allele counts, but regardless of the sample I use, I keep getting the same error:

counts <- Matrix::readMM("/mnt/isilon/cscb/xxx/code/02-counts/MCD1-redo_wIntrons-force/outs/filtered_feature_bc_matrix/matrix.mtx.gz")

alleles <- read.table("/mnt/isilon/cscb/Projects/10x/xxx/04b-NumBat/MCD1/MCD1_allele_counts.tsv.gz", sep="\t")

out = run_numbat(
    as.matrix(counts), # gene x cell integer UMI count matrix
    ref_hca, # reference expression profile, a gene x cell type normalized expression level matrix
    alleles, # allele dataframe generated by pileup_and_phase script
    genome = "hg38",
    t = 1e-5,
    ncores = 1,
    plot = TRUE,
    out_dir = './test'
)

> Error in `filter()`:
> ! Problem while computing `..1 = GT != ""`.
> Caused by error in `mask$eval_all_filter()`:
> ! object 'GT' not found
> Run `rlang::last_error()` to see where the error occurred.

I wish I had more information to give you but the only other thing I can say is that some of these samples were unable to work with pileup_and_phase.R and produce the allele counts gz.

Here are the bam and barcode files I used for each sample:

/mnt/isilon/cscb/Projects/10x/xxx/MCD1-redo_wIntrons-force/outs/possorted_genome_bam.bam

/mnt/isilon/cscb/xxx/MCD1-redo_wIntrons-force/outs/filtered_feature_bc_matrix/barcodes.tsv.gz

Do you have any advice on how to address this problem? The samples I'm asking about in this issue specifically are scRNA-seq samples processed by 10x CellRanger.

Error during the update tree step

Hi,

I am testing on a sample that has ~15000 cells and I am using the ref_hca that ships with the package as the expression reference.
During the first iteration, the code exited with the following error:
...
Iter 55 -183301.59481998 120s
Iter 56 -183301.59481998 110s
Error in mutate():
! Problem while computing p = exp(Z_clone - logSumExp(Z_clone)).
Caused by error in logSumExp():
! index_max(): object has no elements
Backtrace:

  1. ├─numbat::run_numbat(...)
  2. │ └─numbat::get_clone_post(gtree, exp_post, allele_post)
  3. │ └─... %>% as.data.frame()
  4. ├─base::as.data.frame(.)
  5. ├─data.table::dcast(...)
  6. │ └─data.table::is.data.table(data)
  7. ├─data.table::as.data.table(.)
  8. ├─dplyr::mutate(...)
  9. ├─dplyr::mutate(...)
  10. ├─dplyr:::mutate.data.frame(...)
  11. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
  12. │ ├─base::withCallingHandlers(...)
  13. │ └─mask$eval_all_mutate(quo)
  14. ├─numbat:::logSumExp(Z_clone)
  15. ├─base::stop(<std::lg_>)
  16. └─dplyr <fn>(<std::lg_>)
  17. └─rlang::abort(...)
    Execution halted

The last files that were written were tree_final_1.rds and mut_graph_1.rds. I am attaching my log file here.
I am using the devel version.
packageVersion("numbat")
[1] ‘0.1.1’

Could you please help with what is going wrong and if I need to modify any parameters?
log.txt

Thanks.

Error in (function (cond) : error in evaluating the argument 'x' in selecting a method for function 't': subscript out of bounds

Hi,

I've created my 1-column count matrix with featureCounts. However, after creating my count_mat and df_allele, it still shows an error when I try run_numbat.

The code I use is as follows:

library(numbat)
library(Matrix)
library(dplyr)

## count_mat by featureCounts
CPM00000383_count_mat <- as.matrix((read.table("/Users/samuellu/Desktop/Labs/CHLA_Lab/Numbat/CPM00000383-BM-R_20210510/count_mat.csv", sep = ",")))
colnames(CPM00000383_count_mat) <- c('CPM00000383-BM-R_20210510')
CPM00000383_count_mat_dgC <- as(CPM00000383_count_mat, "dgCMatrix") 

## df_allele
CPM00000383_df_allele <- as.data.frame(read.table("/Users/samuellu/Desktop/Labs/CHLA_Lab/Numbat/CPM00000383-BM-R_20210510/CPM00000383-BM-R_20210510_allele_counts.tsv", sep = "\t", header = T))


bulk <- get_bulk(
  count_mat = CPM00000383_count_mat_dgC,
  lambdas_ref = ref_hca,
  df_allele = CPM00000383_df_allele,
  gtf = gtf_hg38)

segs_loh <- bulk %>% detect_clonal_loh(t = 1e-4)


out <- run_numbat(
  CPM00000383_count_mat_dgC, 
  ref_hca, 
  CPM00000383_df_allele, 
  ncores = 4,
  plot = TRUE,
  out_dir = './test/CPM00000383',
  segs_loh = segs_loh
)

This is the error message I got after run_numbat.

Running under parameters:
t = 1e-05
alpha = 1e-04
gamma = 20
min_cells = 50
init_k = 3
max_cost = 0.3
max_iter = 2
max_nni = 100
min_depth = 0
use_loh = auto
multi_allelic = TRUE
min_LLR = 5
min_overlap = 0.45
max_entropy = 0.5
skip_nj = FALSE
diploid_chroms = 
ncores = 4
ncores_nni = 4
common_diploid = TRUE
tau = 0.3
check_convergence = FALSE
plot = TRUE
genome = hg38
Input metrics:
1 cells
Mem used: 0.414Gb
Approximating initial clusters using smoothed expression ..
Mem used: 0.414Gb
number of genes left: 9313
Error in (function (cond)  : 
  error in evaluating the argument 'x' in selecting a method for function 't': subscript out of bounds

Can you share with me where is the problem that causes this error?

Thanks,
Sam

Informative error handling for users

Functions which accept user data need more extensive error handling. These are errors which should tell the user how to fix the problem (so the user doesn't create a GitHub issues, which takes a lot of your time).

A simple example is the following, whereby the input cannot be NULL or the user has made a mistake:

important_function <- function(input) {
    if (is.null(input)) {
        stop(paste0("The input ", input, " is NULL. Please correct."))
    }
}

Check out how this is done elsewhere. In fact, virtually all functions in conos or pagoda2 have this.

e.g. sccore: https://github.com/kharchenkolab/sccore/blob/main/R/diff_expression.R

appendSpecificityMetricsToDE <- function(de.df, clusters, cluster.id, p2.counts, low.expression.threshold=0, append.auc=FALSE) {

  if (length(de.df) == 0 || nrow(de.df) == 0) {
    return(de.df)
  }

  cluster.mask <- stats::setNames(clusters == cluster.id, names(clusters))

  if (!any(cluster.mask)) {
    stop("Cluster ", cluster.id, " not presented in the data")
  }

Error in `arrange()`:

I am trying to run numbat using my own data with matrix and allele built following instruction I got
Error in arrange():
! Problem with the implicit transmute() step.
✖ Problem while computing ..1 = CHROM.
Caused by error in mask$eval_all_mutate():
! object 'CHROM' not found
Run rlang::last_error() to see where the error occurred.
Warning message:
In mclapply(groups, mc.cores = ncores, function(g) { :
all scheduled cores encountered errors in user code

Error caused by `retest_cnv()` function

Hi,

Thank you for developing this great R package. I have been encountering this error message:

ERROR [2022-03-28 09:22:03] Error in mutate(., cnv_state_post = c("loh", "amp", "del", "bamp", "bdel")[which.max(c(p_loh,  : 
  Problem while computing `cnv_state_post = ...[]`.
✖ `cnv_state_post` must be size 1, not 0.

which seems to be caused by some code within the retest_cnv function:

                                p_loh = (G['20'] * L_x_n * L_y_d)/Z,
                                p_amp = ((G['31'] + G['21']) * L_x_a * L_y_a)/Z,
                                p_del = (G['10'] * L_x_d * L_y_d)/Z,
                                p_bamp = (G['22'] * L_x_a * L_y_n)/Z,
                                p_bdel = (G['00'] * L_x_d * L_y_n)/Z

when Z happens to be zero, it causes p_loh, p_amp, p_del, p_bamp and p_del to be NaN, which subsequently causes an error in this line:

cnv_state_post = c('loh', 'amp', 'del', 'bamp', 'bdel')[which.max(c(p_loh, p_amp, p_del, p_bamp, p_bdel))]

when which.max recieves just NaNs like:

c("a","b","c")[which.max(c(NaN,NaN,NaN))]
character(0)

Thank you for your help!

Error - 'X' is not a dendrogram when running Numbat example dataset

Hi Gao,

When running the run_numbat using example data, an error throws out:

Running under parameters:
t = 0.001
alpha = 1e-04
gamma = 20
min_cells = 20
init_k = 3
sample_size = 1e+05
max_cost = 51.9
max_iter = 2
min_depth = 0
use_loh = auto
multi_allelic = TRUE
min_LLR = 50
max_entropy = 0.6
skip_nj = FALSE
exclude_normal = FALSE
diploid_chroms =
ncores = 20
common_diploid = TRUE
Input metrics:
173 cells
Approximating initial clusters using smoothed expression ..
number of genes left: 10242
Error in dendrapply(dend, get_attr_from_leaf) : 'X' is not a dendrogram
In addition: Warning message:
In dendextend::get_leaves_attr(den, attribute = "label") :
'dend' should be a dendrogram.

Do you have any comments about this error?

Ryan Sun

Christophe's comments

Raised by @GeorgescuC:

  • there are a number of warnings (14) about overriding imports. You may want to use a file to set the imports method by method for dependencies so as to be sure that the correct method is the one loaded by the package in the end. Roxygen2 will be able to automatically update your NAMESPACE file based on those too. For example, in infercnv, we have that in the R/inferCNV_constants.R file.

  • The example data should probably not be automatically loaded with the package, but instead have it load when calling "data()". This will prevent potential conflicting/overriding variable names and improve loading speed.

    • need to remove anyway
  • I get the following warning, which might not be due to Numbat but to a dependency
    Warning messages:
    1: Unknown or uninitialised column: n_states.
    2: Unknown or uninitialised column: n_states.
    3: Unknown or uninitialised column: n_states.
    4: The x argument of as_tibble.matrix() must have unique column names if .name_repair is omitted as of tibble 2.0.0.
    Using compatibility .name_repair.

    • The name_repair warning comes from treeio - probably needs to be fixed externally
    • bbinom NA warning is from optim function of theta laplace approx, L-FGBS-R is trying theta > 0.5
  • In the example, I would not put an output path directly in the home directory, but instead use tempdir()/tempfile(), or use a location relative to the current working directory.

  • In the example, the "res = fetch_results()" step does not have the out_dir value set.

  • I think a dependency seems to be missing from the list, as I was able to install/load Numbat but got the following when first running it (which was resolved by installing the package):

res = fetch_results("~/results/test", i = 2)
Error in fread(f) :
To read gz and bz2 files directly, fread() requires 'R.utils' package which cannot be found. Please install 'R.utils' using 'install.packages('R.utils')'
There could be others that I happened to already have installed. The easiest way to figure out missing requirements would be to test in a virtual env/docker.

  • In the plotting step, "library(ggplot2)" needs to be loaded first to add the title.

  • The example final plot does not have the 1/2/3 groupings (both the color bar and the legend are missing for me) as shown on the wiki.

Use of smart-seq input

I'd like to apply this tool to scRNAseq data generated with smart-seq2 protocol. Is this recommended? Do you have suggestions for formatting input?

Generate single-cell likelihoods for a pre-specified CNV

Hi,

Is there a way to input a specific chosen CNV (for instance deletion or amplification of a chromosome of interest) and direct Numbat to generate single-cell likelihoods for this CNV? My understanding is that Numbat only generates single-cell likelihoods for the CNVs that are detected and pass filtering. But I would be interested in looking at single-cell likelihoods for CNVs that don't make it past these two steps (as these two steps rely on pseudobulk profiles, and I am interested in very rare subclonal events that may be present in only 1 or 2 cells in the data).

Thank you!

Phasing failed with smartseq input.

cat('Creating VCFs\n')
vcfs = lapply(samples, function(sample){vcfR::read.vcfR(glue('{outdir}/pileup/{sample}/cellSNP.base.vcf'), verbose = F)})

numbat:::genotype(label, samples, vcfs, glue('{outdir}/phasing'))

as smartseq cellsnp-lite only have one output, the vcfs would be rendered different.

I am wondering in the case of smartseq how should I do the genotype and phasing step

1 normal cell still generates error

PR3 sample of ovarian visium

INFO [2022-11-24 11:14:59] Found 1 normal cells..
ERROR [2022-11-24 11:15:04] job 1 failed
ERROR [2022-11-24 11:15:04] Error in mutate(., inter_snp_cm = c(NA, cM[2:length(cM)] - cM[1:(length(cM) -  : 
  Problem while computing `inter_snp_cm = c(NA, cM[2:length(cM)] - cM[1:(length(cM) - 1)])`.
Caused by error in `cM[1:(length(cM) - 1)]`:
! only 0's may be mixed with negative subscripts

Account for WGD

Ideally, we should account for WGD (baseline is 4 copies) when analyzing hyperploid tumors (e.g. TNBC5), which should lead to better model fit and prediction stability

Memory usage, tidyverse copy-on-modify and mclapply

Thanks for this package. I have run the development release successfully on a few tumor samples. I have lately been running into persistent memory issues especially in phylogeny steps. I am running on our lab's local server with 64 Gb RAM and 8 cores. I have begun to suspect that the issue is the use of tidyverse in mclapply routines. I have recently converted every mclapply step into a basic for loop assigning to a pre-allocated list--ignoring any parallelization speed benefits. That has helped a little bit. I have considered trying to use dtplyr to ease some memory issues.

Do you think this suspicion is reasonable? Have you already made efforts to reduce the memory footprint?

Ferocious memory usage

During iteration 1, memory usage is pretty standard but during iteration 2, in the step "Evaluating CNV per cell .." memory usage goes through the roof. Is there any sort of workaround for this? Is this behavior expected?

`snp_index` must be size 0 or 1, not 2 [pileup_and_phase.R]

Hi, I'm running the first step (pileup_and_phase.R)...

Rscript pileup_and_phase.R \
    --label x21922_mm8 \
    --samples x21922_mm8 \
    --bams /home/igoroz/jgarces/scCSCMM_sara/data/21922_mm8/outs/per_sample_outs/21922/count/sample_alignments.bam \
    --barcodes /home/igoroz/jgarces/scCSCMM_sara/data/21922_mm8/outs/per_sample_outs/21922/count/sample_filtered_barcodes.csv \
    --outdir x21922_mm8 \
    --gmap Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
    --snpvcf genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
    --paneldir 1000G_hg38 \
    --ncores 10

... and having the following error:

Using genome version: hg38
Running pileup
[I::main] start time: 2022-10-11 16:13:36
[W::check_args] Max depth set to maximum value (2147483647)
[W::check_args] Max pileup set to maximum value (2147483647)
[I::main] global settings after checking:
        num of input files = 1
        out_dir = x21922_mm8/pileup/x21922_mm8
        is_out_zip = 0, is_genotype = 0
        is_target = 0, num_of_pos = 0
        num_of_barcodes = 8122, num_of_samples = 0
        refseq file = (null)
        0 chroms:
        cell-tag = CB, umi-tag = UB
        nthreads = 10, tp_max_open = 51200
        mthreads = 10, tp_errno = 0, tp_ntry = 0
        min_count = 2, min_maf = 0.00, double_gl = 0
        min_len = 30, min_mapq = 20
        rflag_filter = 772, rflag_require = 0
        max_depth = 2147483647, max_pileup = 2147483647, no_orphan = 1
[I::main] loading the VCF file for given SNPs ...
[I::main] fetching 7352497 candidate variants ...
[I::main] mode 1a: fetch given SNPs in 8122 single cells.
[I::csp_fetch_core][Thread-2] 2.00% SNPs processed.
[I::csp_fetch_core][Thread-3] 2.00% SNPs processed.
[I::csp_fetch_core][Thread-5] 2.00% SNPs processed.
[I::csp_fetch_core][Thread-4] 2.00% SNPs processed.
(...)
[I::csp_fetch_core][Thread-7] 92.00% SNPs processed.
[I::csp_fetch_core][Thread-7] 94.00% SNPs processed.
[I::csp_fetch_core][Thread-7] 96.00% SNPs processed.
[I::csp_fetch_core][Thread-7] 98.00% SNPs processed.
[I::main] All Done!
[I::main] Version: 1.2.2 (htslib 1.16)
[I::main] CMD: cellsnp-lite -s /home/igoroz/jgarces/scCSCMM_sara/data/21922_mm8/outs/per_sample_outs/21922/count/sample_alignments.bam -b /home/igoroz/jgarces/scCSCMM_sara/data/21922_mm8/outs/per_sample_outs/21922/count/sample_filtered_barcodes.csv -O x21922_mm8/pileup/x21922_mm8 -R genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf -p 10 --minMAF 0 --minCOUNT 2 --UMItag Auto --cellTAG CB
[I::main] end time: 2022-10-11 17:30:50
[I::main] time spent: 4634 seconds.
character(0)
Creating VCFs
Error in `mutate()`:
! Problem while computing `snp_index = 1:n()`.
✖ `snp_index` must be size 0 or 1, not 2.
Backtrace:
     ▆
  1. ├─numbat:::genotype(label, samples, vcfs, glue("{outdir}/phasing"))
  2. │ ├─... %>% arrange(CHROM, POS)
  3. │ └─base::lapply(...)
  4. │   └─numbat (local) FUN(X[[i]], ...)
  5. │     └─numbat:::get_snps(vcf)
  6. │       └─... %>% filter(!is.na(AR))
  7. ├─dplyr::arrange(., CHROM, POS)
  8. ├─dplyr::mutate(., AR = AD/DP)
  9. ├─dplyr::summarise(...)
 10. ├─dplyr::group_by(., CHROM, POS, REF, ALT, snp_id)
 11. ├─dplyr::bind_rows(.)
 12. │ └─rlang::list2(...)
 13. ├─dplyr::filter(., !is.na(AR))
 14. ├─dplyr::mutate(., AR = AD/DP)
 15. ├─dplyr::mutate(., snp_index = 1:n())
 16. ├─dplyr:::mutate.data.frame(., snp_index = 1:n())
 17. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
 18. │   ├─base::withCallingHandlers(...)
 19. │   └─mask$eval_all_mutate(quo)
 20. ├─dplyr:::dplyr_internal_error(...)
 21. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
 22. │   └─rlang:::signal_abort(cnd, .file)
 23. │     └─base::signalCondition(cnd)
 24. └─dplyr (local) `<fn>`(`<dpl:::__>`)
 25.   └─rlang::abort(...)
Execution halted

What could I been missing, please?

(I'm running the docker container from singularity)

Roxygen2 documentation

Hi @teng-gao

It would be nice to get this on CRAN when the code is ready. The big impediment to this is always the roxygen2 documentation.

--- if a function is not exported, please use the tag #' @keywords internal. It's useful to know which functions are exported and which are not
--- if a function is exported, please document all fields. Please use the format #' @parameterName type Explanation here (default=xxx). It's very useful to know the data type in the rxoygen2 comments.
--- it will be useful for CRAN reviews if all exported functions have examples, but....that doesn't always happen. But let's try our best.

With this, the other steps will be much easier, and we can quickly get this on CRAN.

Running numbat on merged bam [smart-seq]

Hi! I'm trying to run numbat on my smartseq data.
I’ve been using numbat before for my 10x data and it worked fine so the software part is not really the problem.

So the data was provided to be as a set of bam files. My understanding is that numbat treats 1 bam file as 1 sample and therefore if all my cells are coming from one sample, then the bam files should be merged.

So I’ve ran samtools merge with -r option too keep file name as a RG tag:
samtools merge -r -O BAM --threads 20 rna_merged.bam STAR/*/*.bam

Next with the same file names I created barcodes.tsv file

And then pileup:

Rscript numbat/inst/bin/pileup_and_phase.R --label label --samples sample --bams rna_merged.bam --barcodes barcodes.tsv --outdir pileup --ncores 10 —smartseq <other args>

And this step fails:

During startup - Warning message: Setting LC_CTYPE failed, using "C" Using genome version: hg38 Running pileup [I::main] start time: 2022-11-14 15:31:04 [E::check_args] 'BAM'?' does not exist. [E::main] error global settings Error in value[[3L]](cond) : Pileup failed Calls: tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous> Execution halted

Could you please advise what am I doing wrong?

Generate count_mat with Bulk RNA-seq data

Hello,

I'm a beginner in bioinformatics and I have an issue with creating a workable count_mat with my bulk RNA-seq data.

First, I used your ready-to-run Docker container. then, I ran

Rscript /numbat/inst/bin/pileup_and_phase.R \
    --label sample \
    --samples sample \
    --bams /mnt/mydata/sample.bam \
    --outdir /mnt/mydata/sample \
    --gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
    --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
    --paneldir /data/1000G_hg38 \
    --ncores 10 \
    --bulk

it generated allele_counts.tsv.gz within my outdir.

After that, I tried to go on with my next steps which are get_bulk and run_numbat on my Rstudio local machine. However, these steps both require count_mat. you mention that count_mat is a gene x cell raw count matrix. I was wondering if is there any suggested way to generate count_mat with my bulk RNA-seq data. Thank you.

Best,
Sam

Numbat is crashing with custom reference data

Hi!
I was using numbat before with provided reference data and it worked fairly well.
Then I wanted to try with a more relevant reference data and used provided aggregate_counts function for that. It ran without any error and provided gene x cell types matrix without any NANs.

But numbat is crashing when I use that reference:

Running under parameters:
t = 1e-05
alpha = 1e-04
gamma = 20
min_cells = 50
init_k = 3
max_cost = 819
max_iter = 2
max_nni = 100
min_depth = 0
use_loh = auto
multi_allelic = TRUE
min_LLR = 5
min_overlap = 0.45
max_entropy = 0.5
skip_nj = FALSE
diploid_chroms = 
ncores = 20
ncores_nni = 20
common_diploid = TRUE
tau = 0.3
check_convergence = FALSE
plot = TRUE
genome = hg38
Input metrics:
2730 cells
Mem used: 1.93Gb
Approximating initial clusters using smoothed expression ..
Mem used: 1.93Gb
number of genes left: 1853
running hclust...
Registered S3 method overwritten by 'dendextend':
  method     from 
  rev.hclust vegan
Iteration 1
Mem used: 2.57Gb
Error in value[[3L]](cond) : `glue` failed in `formatter_glue` on:

  'try-error' chr "Error in optim(fn = function(w) { : L-BFGS-B needs finite values of 'fn'\n" 

Raw error message:

 Expecting '}' 

Please consider using another `log_formatter` or `skip_formatter` on strings with curly braces.`glue` failed in `formatter_glue` on:

  - attr(*, "condition")=List of 2 

Raw error message:

 Expecting '}' 

Please consider using another `log_formatter` or `skip_formatter` on strings with curly braces.`glue` failed in `formatter_glue` on:

   ..$ message: chr "L-BFGS-B needs finite values of 'fn'" 

Raw error message:

 Expecting '}' 

Please consider using another `log_formatter` or `skip_formatter` on strings with curly braces.`glue` failed in `formatter_glue` on:

   ..$ call   : language optim(fn = function(w) {     w = w/sum(w) ... 

Raw error message:

 Expecting '}' 

Please consider using another `log_formatter` or `skip_formatter` on strings with curly braces.`glue` failed in `formatter_glue` on:
In addition: Warning messages:
1: In log(lambdas_ref * 1e+06 + 1) : NaNs produced
2: In mclapply(groups, mc.cores = ncores, function(g) { :
  all scheduled cores encountered errors in user code

Could you please suggest what might be wrong?

Olga

pileup_and_phase error, no allele dataframe produced

pileup_and_phase error. Phasing was successful but no allele dataframe produced.

Warning message in fread(vcf_file):
“Detected 1 column names but the data has 2 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.”
Warning message in fread(vcf_file):
“Stopped early on line 35. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<##FILTER=<ID=discordanthet,Description="Filtered calls where a passing call is het and a high GQ but filtered call is hom var, since often the het is wrong">>>”
Error in `chr_as_locations()`:
! Can't rename columns that don't exist.
✖ Column `#CHROM` doesn't exist.
Traceback:

1. lapply(1:22, function(chr) {
 .     message(chr)
 .     vcf_file = glue("{outdir}/phasing/{label}_chr{chr}.phased.vcf.gz")
 .     if (file.exists(vcf_file)) {
 .         fread(vcf_file) %>% rename(CHROM = `#CHROM`) %>% mutate(CHROM = str_remove(CHROM, 
 .             "chr"))
 .     }
 .     else {
 .         stop("Phased VCF not found")
 .     }
 . })
2. FUN(X[[i]], ...)
3. fread(vcf_file) %>% rename(CHROM = `#CHROM`) %>% mutate(CHROM = str_remove(CHROM, 
 .     "chr"))   # at line 5-7 of file <text>
4. mutate(., CHROM = str_remove(CHROM, "chr"))
5. rename(., CHROM = `#CHROM`)
6. rename.data.frame(., CHROM = `#CHROM`)
7. tidyselect::eval_rename(expr(c(...)), .data)
8. rename_impl(data, names(data), as_quosure(expr, env), strict = strict, 
 .     name_spec = name_spec, error_call = error_call)
9. eval_select_impl(x, names, {
 .     {
 .         sel
 .     }
 . }, strict = strict, name_spec = name_spec, type = "rename", error_call = error_call)
10. with_subscript_errors(vars_select_eval(vars, expr, strict = strict, 
  .     data = x, name_spec = name_spec, uniquely_named = uniquely_named, 
  .     allow_rename = allow_rename, type = type, error_call = error_call), 
  .     type = type)
11. tryCatch(with_entraced_errors(expr), vctrs_error_subscript = function(cnd) {
  .     cnd$subscript_action <- subscript_action(type)
  .     cnd$subscript_elt <- "column"
  .     cnd_signal(cnd)
  . })
12. tryCatchList(expr, classes, parentenv, handlers)
13. tryCatchOne(expr, names, parentenv, handlers[[1L]])
14. value[[3L]](cond)
15. cnd_signal(cnd)
16. signal_abort(cnd)

long vectors not supported yet

Hi,
Thanks for the package. Spectacular results with single cell RNASeq in tumors. I'd like to publish the identification of CAFs as normal cells in my tumors using it and it works smoothly in single datasets from 10X but I tried to mix three samples and got this error:

..../....
Retesting CNVs..
Retesting CNVs..
Retesting CNVs..
Retesting CNVs..
Retesting CNVs..
Finishing..
Finishing..
Finishing..
Finishing..
Finishing..
Error in vec_slice(x_out, x_slicer) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535

Error: Tibble columns must have compatible sizes.

  • Size 70731: Column 3.
  • Size 84967: Column 1.
  • Size 104417: Column 2.
  • Size 112067: Column 0.
    ℹ Only values of size one are recycled.
    Backtrace:
    1. ├─numbat::run_numbat(...)
    2. │ └─%>%(...)
    3. ├─numbat::run_group_hmms(...)
    4. │ └─%>%(...)
    5. ├─dplyr::ungroup(.)
    6. ├─dplyr::mutate(., seg_start_index = min(snp_index), seg_end_index = max(snp_index))
    7. ├─dplyr::group_by(., seg, sample)
    8. └─dplyr::bind_rows(.)
    9. ├─tibble::as_tibble(dots)
  1. └─tibble:::as_tibble.list(dots)
  2. └─tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
    
  3.   └─tibble:::recycle_columns(x, .rows, lengths)
    

Warning message:
In mclapply(bulks %>% split(.$sample), mc.cores = ncores, function(bulk) { :
scheduled core 2 encountered error in user code, all values of the job will be affected
Execution halted

I used pileup_and_phase.R without problems on a bam merged from the cellranger bams where I substituted the "-1" at the end of the barcodes to avoid collisions after using cellranger aggr to generate barcodes.
The error is thrown by run_numbat run with 64GB and 12 cores.
Thanks for the help
Jose

Error during 2. Iteration [multiplexing]

Hello,

I'm running into an error during the second iteration. Curiously, the error doesn't occur for other samples in the dataset and also not for this one when segs_loh = NULL. So, it appears to be an issue for this combination of sample, reference and segs_loh.

Iteration 2
Mem used: 10.3Gb
Running HMMs on 5 cell groups..
Running HMMs on 4 cell groups..                                                                                                      
less than 5% of genome is in neutral region - including LOH in baseline
Testing for multi-allelic CNVs ..
0 multi-allelic CNVs found: 
Evaluating CNV per cell ..
Mem used: 10.5Gb
Excluding clonal LOH regions .. 
All cells succeeded
Expanding allelic states..
No multi-allelic CNVs, skipping ..
No multi-allelic CNVs, skipping ..
No multi-allelic CNVs, skipping ..
Building phylogeny ..
Mem used: 10.6Gb
Using 6 CNVs to construct phylogeny
Using UPGMA tree as seed..
Mem used: 10.6Gb
Iter 2 -1496.78014009687 0.16s
Iter 3 -1496.78014009687 0.11s
Found 1 normal cells..
Error:
! Tibble columns must have compatible sizes.
• Size 46923: Column `2`.
• Size 94963: Column `3`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
There were 16 warnings (use warnings() to see them)

The subtrees_2.rds and clones_2.rds files can be found in the output directory, but not bulk_clones_final.tsv.gz. Therefore, I assume the error occurs somewhere in between.

I'm running numbat using a custom reference (normalized expression values) from a public dataset.

ref_internal <- aggregate_counts(refCountMat, refAnnoTab)

bulk <- get_bulk(count_mat = patMat,
                   lambdas_ref = ref_internal,
                   df_allele = alleleCounts, gtf = gtf_hg38 
  )

segs_loh <- bulk %>% detect_clonal_loh(t = 1e-4)


run_numbat(
      count_mat = patMat,
      lambdas_ref = ref_internal, 
      df_allele = alleleCounts,
      segs_loh = segs_loh,
      genome = "hg38", gamma = 20, min_cells = 50, multi_allelic = TRUE,
      t = 1e-5, max_iter = 2, max_nni = 100, min_LLR = 5,
      max_entropy = 0.5, tau = 0.3,
      ncores = 10, 
      plot = TRUE
    )

Any help would be appreciated. : )

run_numbat seems to freeze after starting iteration 1

Hello,

I wanted to try out numbat on one of our datasets. I was able to get the pileup and phasing to work.
When I run out_numbat(), it gets to iteration 1 and then seems to freeze without reporting any errors
(R stops running, nothing happens). It does produce output files until this stage.

I don't know how to debug this problem and would be happy about some insight from you.

Thanks,

livius

> out = run_numbat(counts, ref_hca, df_allele, gtf_hg38, genetic_map_hg38, out_dir = './data/Pool96_30/numbat/')
Running under parameters:
t = 1e-05
alpha = 1e-04
gamma = 20
min_cells = 10
init_k = 3
sample_size = 1e+05
max_cost = 1295.1
max_iter = 2
min_depth = 0
use_loh = auto
multi_allelic = TRUE
min_LLR = 50
max_entropy = 0.6
skip_nj = FALSE
exclude_normal = FALSE
diploid_chroms =
ncores = 30
common_diploid = TRUE
Input metrics:
4317 cells
Approximating initial clusters using smoothed expression ..
number of genes left: 9756
Iteration 1

Issue during detect_clonal_loh()

Good evening,

I'm trying to compare a 10x scRNA-seq data set (high tumor purity) to a publicly available reference.

I ran as instructed:

bulk <- get_bulk(count_mat = countMat, lambdas_ref = refMat,
  df_allele = alleleCounts, gtf = gtf_hg38
) 

Where countMat is a raw count matrix (genes x barcodes) and refMat is a matrix of log-norm. pseudobulks. This part works and the head of my bulk looks as following:

However, when I then ran segs_loh <- bulk %>% detect_clonal_loh(t = 1e-4) and got this error:

Error in `mutate()`:
! Problem while computing `boundary = c(...)`.
✖ `boundary` must be size 1, not 3.
ℹ The error occurred in group 1: CHROM = 1.

Any insights would be much appreciated : )

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.