harbourlab / uphyloplot2

Draw phylogenetic trees of tumor evolution

uphyloplot2's Introduction

uphyloplot2 version 2.3

If you encounter any issues or would like to request a feature, please open an issue on this GitHub page and state the version you are running.

Please cite
Kurtenbach, S., Cruz, A.M., Rodriguez, D.A. et al. Uphyloplot2: visualizing phylogenetic trees from single-cell RNA-seq data. BMC Genomics 22, 419 (2021). https://doi.org/10.1186/s12864-021-07739-3

Draw phylogenetic trees of tumor evolution, as seen in our Nature Communications paper (Nature Communications volume 11, Article number: 496 (2020)).

Uphyloplot2 takes input from CaSpER, HoneyBADGER, and InferCNV to generate evolutionary plots. Please follow the guide below to visualize your tree using input from any of the three programs. You can download example data from this GitHub page to test the program.

Download uphyloplot2 and recreate the following directory structure:

../Uphyloplot2/
- uphyloplot2.py
- newick_input.py
- Inputs/
  - infercnv.cell_groupings

You must populate the "Inputs" folder with ".cell_groupings" files from your respective pipeline. Files can have any name as long as the name ends in ".cell_groupings".

Follow the appropriate guide below to pre-process your data:

INFERCNV: To generate the necessary files, inferCNV needs to be run with HMM, which will produce the "HMM_CNV_predictions.HMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.cell_groupings" file used for plotting. cluster_by_groups should be set to FALSE when calling infercnv::run:

> infercnv_obj = infercnv::run(infercnv_obj,cutoff=1,out_dir="output_dir",cluster_by_groups=FALSE,plot_steps=T,scale_data=T,denoise=T,noise_filter=0.12,analysis_mode='subclusters',HMM=TRUE,HMM_type='i6')

The '.cell_groupings' file will be located in your R working directory under the path you specify with the 'out_dir=' parameter. It is important that you remove the reference and/or control cells from the ".cell_groupings" file. For instance, if you followed the inferCNV tutorial on the test data provided, your '.cell_groupings' file contains a 'cell_group_name' and a 'cell' column with rows in the following format:

all_observations.all_observations.1.1.1.1	  MGH264_A01
...
all_references.all_references.1.1.1.1	  GTEX-111FC-3326-SM-5GZYV

On a unix system, you can quickly remove the reference cell data with the following command, substituting your values where appropriate:

sed '/^all_references/d' <  infercnv.cell_groupings > trimmed_infercnv.cell_groupings
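
On systems without sed, the same trimming can be sketched in Python. This is a minimal sketch and not part of uphyloplot2: the function name drop_reference_rows is hypothetical, and it assumes that reference rows start with the literal prefix all_references, as in the example rows above.

```python
def drop_reference_rows(src, dst, prefix="all_references"):
    """Copy src to dst, skipping every row that starts with `prefix`.

    Returns the number of lines written (the header line is included).
    """
    written = 0
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.startswith(prefix):
                continue  # reference/control cell: leave it out
            fout.write(line)
            written += 1
    return written
```

Calling drop_reference_rows("infercnv.cell_groupings", "trimmed_infercnv.cell_groupings") mirrors the sed command; change the prefix if your reference group has a different name.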

CaSpER or HoneyBADGER: In an R session with your corresponding R libraries and objects loaded, install the phylogram package and use its 'as.dendrogram()' function to export your trees as Newick-formatted strings:

> BiocManager::install('phylogram')
> library(phylogram)
> casper_dendrogram <- as.dendrogram(tree) # tree : CaSpER tree object of class 'phylo'
> hc_dendrogram <- as.dendrogram(hc_tree) # hc_tree : HoneyBADGER tree object of class 'hclust'
> vc_dendrogram <- as.dendrogram(vc_tree) # vc_tree: another HoneyBADGER tree object of class 'hclust'
> write.dendrogram(insert_your_dendrogram_name,file='/path/to/uphyloplot2/Inputs')
> q()

After exiting R, navigate to the uphyloplot2 home directory and run the following script:

./newick_input.py

The newick_input.py script parses the dendrogram object produced in the pre-processing steps above. The script allows you to select a desired maximum length for the tree. You can see sample execution and output below:

Please input the path to your newick file (no quotes, absolute or relative to current path)
Path_to_newick_file= dendrograms/casper_dendro
Unrooted tree detected!
PRUNING
###########################################################
###########################################################
#################   USER_INPUT    #########################
###########################################################

Your tree currently has 69 individual leaves
The longest branch in your tree is forked 16 times
How long do you want your tree? (input an integer)
> Length = 4


Name your output file:
> File = casper_out
###########################################################
###########################################################
###########################################################
###########################################################

This configuration will stack the leaves of your tree into 6 clusters
There are 2 clusters that are smaller than 5% of the total cell population, these will not be plotted.
Not Plotted Clusters:  [11, 13]

The script writes a '.cell_groupings' file to the ~/Uphyloplot2/Inputs directory. In the example above, 'casper_out.cell_groupings' is placed in the Uphyloplot2/Inputs directory.
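
To get a feel for a Newick file before choosing a tree length, its leaves and nesting depth can be counted with a few lines of Python. This is an illustrative sketch, not the logic used by newick_input.py: the function name newick_summary is hypothetical, it assumes unquoted, named leaves, and the parenthesis depth is only a rough analogue of the "forked N times" figure the script reports.

```python
import re

def newick_summary(newick):
    """Return (leaf_count, max_nesting_depth) for a Newick string."""
    text = newick.strip().rstrip(";")
    depth = 0
    max_depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    # A leaf label directly follows "(" or ","; a label that follows ")"
    # names an internal node and is therefore not counted.
    leaves = len(re.findall(r"[(,]\s*[^(),:;\s]", text))
    return leaves, max_depth

if __name__ == "__main__":
    example = "((A:1,B:1):0.5,(C:1,D:1):0.5,E:2);"
    print(newick_summary(example))  # prints (5, 2)
```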

Navigate to the uphyloplot2 home directory and run the script with this simple command:

python uphyloplot2.py

Optional: -c Defines the percentage cutoff used to remove smaller subclones. Default is 5 (only subclones that comprise at least 5% of cells will be included for plotting).

Example usage:

python uphyloplot2.py -c 10
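
The effect of the cutoff can be illustrated with a short Python sketch. It is not taken from uphyloplot2.py; the function names are hypothetical, and it simply assumes that a subclone's percentage is its share of all cells in the '.cell_groupings' file and that subclones below the cutoff are dropped.

```python
from collections import Counter

def subclone_percentages(rows):
    """rows: iterable of (cell_group_name, cell) pairs.

    Returns {cell_group_name: percentage of all cells}.
    """
    counts = Counter(group for group, _cell in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: 100.0 * n / total for group, n in counts.items()}

def apply_cutoff(percentages, cutoff=5.0):
    """Keep only subclones that make up at least `cutoff` percent of cells."""
    return {g: p for g, p in percentages.items() if p >= cutoff}

if __name__ == "__main__":
    # 19 cells in one subclone, 1 cell in another (made-up example data)
    rows = [("obs.1.1", f"cell{i}") for i in range(19)] + [("obs.1.2", "cell19")]
    pct = subclone_percentages(rows)
    print(apply_cutoff(pct))             # default 5% cutoff keeps both
    print(apply_cutoff(pct, cutoff=10))  # a 10% cutoff drops obs.1.2
```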

UPhyloplot2 will generate an "output.svg" vector graphics plot. It will also create a new folder called "CNV_files" containing one CNV file per input. Column 1 of each file holds the subclone IDs identified by inferCNV, column 2 the percentage of cells in each subclone, and column 3 the letter marking the subclone in the output.svg file.
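
A CNV file can be read back into Python for downstream bookkeeping. The sketch below is an assumption-laden example rather than part of the tool: the function name read_cnv_rows is hypothetical, and it assumes comma-separated rows in which the third column (the letter) may be missing, as it is for the root row in the sample outputs quoted in the issues further down.

```python
import csv
import io

def read_cnv_rows(handle):
    """Parse rows of 'subclone_id,percentage[,letter]' from a text handle."""
    rows = []
    for record in csv.reader(handle):
        if not record:
            continue  # skip blank lines
        subclone_id = record[0]
        percentage = float(record[1])
        letter = record[2] if len(record) > 2 else None
        rows.append((subclone_id, percentage, letter))
    return rows

if __name__ == "__main__":
    sample = io.StringIO("1,0.0\n1.1,0.0,B\n1.1.1,50.25716385011021,C\n")
    for subclone_id, percentage, letter in read_cnv_rows(sample):
        print(subclone_id, round(percentage, 2), letter)
```

With a real file, pass an open handle instead, for example open(path, newline="", encoding="utf-8").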

UPhyloplot2 will not identify the characteristic CNV changes for each subclone. If desired, these have to be inferred manually for each subclone ID from the "HMM_CNV_predictions.HMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.pred_cnv_regions.dat" file in the inferCNV output.

Please be aware that, depending on the subclones present, branches and subclone circles in the output.svg file might overlap. They can be rotated manually with Adobe Illustrator or any other SVG editor.

For some reason the output SVG files appear empty when previewed in macOS or opened in a browser. Use Adobe Illustrator or a similar program to open them. I am working on why this issue occurs.

uphyloplot2's Issues

inferCNV - when you have more than one type of reference cells

I am new to inferCNV and am trying to understand how the program works. The way I have it in my head right now is that the reference cell's expression distribution (the average expression?) is subtracted from both the normal cell expression and tumor cell expression, leaving the normal cell expression nearly void and the tumor cell expression showing the CNVs more clearly if they exist.
My question is, if there is more than one set of reference cells, how are they averaged? Are each of their distributions compared with the tumor cells separately or are their averages averaged again to be subtracted from the respective reference cell expressions and the tumor cell expressions?

Oh, I also have another problem - an error occurred at Step 18:

STEP 18: Run Bayesian Network Model on HMM predicted CNV's

INFO [2020-09-12 21:08:32] Initializing new MCM InferCNV Object.
INFO [2020-09-12 21:08:32] validating infercnv_obj
Error in (function (cl, name, valueClass) :
assignment of an object of class “logical” is not valid for @‘cnv_regions’ in an object of class “MCMC_inferCNV”; is(value, "factor") is not TRUE

The data is scRNAseq 10X genomics, using cutoff 0.1, cluster_by_groups = FALSE. But the tumor cells in the cell annotation file are left clustered (ex. TN57_01, TN57_02, TN57_03)...if that possibly brings up some issues.

Thank you in advance !

Mapping subclones loss/gains

Hi there,

I was wondering how I can manually curate the losses/gains from the HMM*.pred_cnv_regions.dat file together with the *cell_groupings file. How do I match each branch of the phylogenetic tree?

Thanks

how to choose tree length and can one use a partitioning method other than random tree?

Dear developers,

First, thank you for this very useful tool.
Second, can you explain how one should choose the length of the tree?

Moreover, can you elaborate on why we should use random tree during the infercnv run, and whether leiden could produce reliable results with your tool?
I know that the main difference between those methods is the resolution gamma (leiden) and the threshold p-value (random_tree), but playing with those parameters should lead to similar results, so why random tree?

Best,
Andy

Missing sub clones

I ran the python command and for some reason from the branches and CNV_Files there is no information for some sub clones. For example the unique sub clone names from cell groupings was
[1] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.1.1.1"
[2] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.1.1.2"
[3] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.1.2.1"
[4] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.1.2.2"
[5] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.2.1.1"
[6] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.2.1.2"
[7] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.2.2.1"
[8] "malignant_HPT1Pat1.malignant_HPT1Pat1.1.2.2.2"

but in the CNV_files there is only
1,0.0
1.1,0.0,B
1.1.1,50.25716385011021,C
1.1.2,5.6208670095518,D
1.2,0.0,E
1.2.1,9.184423218221896,F
1.2.2,34.937545922116094,G

There never seem to be cluster names with all four digits.

Compatibility with inferCNV v1.16.0

"cell_groupings" files are no longer generated as outputs when running infercnv::run() (and none of the standard output files match the format required for generating a tree).

Phylograms look the same

Hi there,

thank you for creating this awesome script. Unfortunately, I ran into some problems with creating the trees. I have 10x data from 10 different patients and performed the downstream analyses with Seurat.

When I run inferCNV, I get heatmaps that are very consistent with FISH data and CNVs called from WES. When I use UPhyloplot2, all trees look the same, although there are clear differences between the samples.

Might it be a problem that I downsampled to 100 cells? Raw data from each sample contains more than 10,000 cells.

I ran inferCNV with HMM and analysis_mode = "subclusters" and a logistic noise filter.

Thanks in advance

Max

Can I set the width of the tree branches and distance between each two tree branches?

Hi harbourlab,

Many thanks for providing this useful tool! I have two questions when trying to include it in my analysis. It would be great if you could help me with it.

I wonder if there is any way to set the gaps between the tree branches? Some tree branches appear covered by others (e.g. the K-L branch covers the B-E-G branch below). Or is there a way to let K-L swing a little to the left? The figures in your Nature Communications paper do not seem to have this issue. Did you change anything manually in the uphyloplot2.py file?

Any information would be useful. Many thanks in advance!

Also, I wonder if those branches with 0 percentage are redundant (like branches B, C, I, J, K shown below)? Do I need to get rid of them for plotting?

infercnv::run error - Error in if (run_arguments$HMM) { : argument is of length zero

Hi there,

I'm having an issue with the infercnv::run code. I'm getting this error with the exact code shown:

Plus, for 10x data, shall the cutoff be 0.1 as suggested by inferCNV?

Thanks

I can't find the HMM_CNV_predictions.HMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.cell_groupings output

I can't find the HMM_CNV_predictions.HMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.cell_groupings output among the InferCNV results. I wonder if I used the wrong parameters in the InferCNV step. The code for the InferCNV step is as follows:

infercnv_obj <- infercnv::CreateInfercnvObject(raw_counts_matrix=exprMatrix,
gene_order_file=mm_geneLocate,
annotations_file=cellAnnota,
ref_group_names=c("control"))
infercnv_obj = infercnv::run(infercnv_obj,
cutoff=0.1,
out_dir='inferCNV/positive_1',
cluster_by_groups=TRUE,
denoise=TRUE,
HMM=TRUE,
num_threads=30)

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] infercnv_1.2.1

Can this be used with copykat?

Hi,

Thanks for developing this tool. I was wondering how to use it with input from copykat instead of infercnv?
Could you give me a hand?

Thanks!

Figure 2 of your Nat Commun paper

Hello,

Thank you for providing this very useful tool. I am wondering how/where can I find information about q and p to generate a heatmap similar to figure 2b in your paper, and also add the information about LOH rain the clonality trees?

Thank you!

Infercnv subcluster nomenclature

Hi team,

Thanks for the great package!

I am attempting to generate phylogenetic trees with subclustering info from infercnv, but it looks like my .cell_groupings file has a different subcluster nomenclature. When I run infercnv with the same sample data and code that's described in the Uphyloplot2 tutorial, the subcluster names in the .cell_groupings file are "all_observations.all_observations_s1" and so on. I am not seeing any groups labeled with the "all_observations.all_observations.1.1.1.1" format. Same pattern when I use my own data. Any idea why this might be?

Also, I was thinking of trying to use the dendrogram file instead of the groupings, but I am not seeing the "newick_input.py" file anywhere in the Uphyloplot2 directory. Does this need to be downloaded from elsewhere?

Thank you!

Best,
Kaleab

Using Uphyloplot2 with CaSpER data

Hello!
First of all thank you for making this wonderful and useful tool. I am currently working with CaSpER, and I was wondering how I should process the CaSpER final object (the one obtained after running the runCaSpER function) in order to obtain a file suitable for the Uphyloplot2 Python algorithm.
Thanks in advance.

Error when running code uphyloplot2 version 2.3 using the test data

"...\uphyloplot2-master\uphyloplot2.py", line 166, in main
if len(data_row[0].split(".")) > longest_tree:

IndexError: list index out of range

Error running with cluster_by_groups set to FALSE

Hi,

Thanks for this awesome tool. I encounter the following error when I set cluster_by_groups to FALSE:

Error in if (runif(1) <= padj) { : missing value where TRUE/FALSE needed

This is my infercnv code:
infercnv_obj = infercnv::run(infercnv_obj,
cutoff=0.1,
out_dir='output',
cluster_by_groups=FALSE,
denoise=TRUE,
HMM=TRUE)

Any ideas why this might be the case?

Thank you!

Run uphyloplot2 when multiple samples

When I have multiple samples and want precise manual annotations like those in Fig. 2C of the NC paper, do I have to run infercnv sample by sample? Otherwise, how can I get the precise CNVs of each sample?

Leiden - script fail

Hi, thanks for this tool ;) A quick Q. I already ran InferCNV, HMM = T, on current "best", which is Leiden.

io2 = infercnv::run(io1,
                    cutoff=0.1, 
                    out_dir="cutoff0_1_res0_000375_HMM", 
                    cluster_by_groups=F, 
                    HMM=T, 
                    analysis_mode='subclusters',
                    tumor_subcluster_partition_method='leiden',
                    leiden_resolution=0.000375,
                    denoise=T,
                    sd_amplifier=2,
                    #up_to_step = 15, 
                    resume_mode = TRUE,
                    num_threads=14
)

I used this for some downstream analysis, but I now want to make a phylogenetic tree, ideally without re-running all prior analysis steps.
Your tool looks good.
I ran the test fine. But on my data, it fails.

Here is a snippet from test, and from my data:

Test:

cell_group_name	cell
Retinoblastoma.Retinoblastoma_sRetinoblastoma.1.1.1.1	GATTCAGAGACGCAAC
Retinoblastoma.Retinoblastoma_sRetinoblastoma.1.1.1.1	GGCCGATCAAGTTCTG
Retinoblastoma.Retinoblastoma_sRetinoblastoma.1.1.1.1	AACTCTTAGACGCTTT
Retinoblastoma.Retinoblastoma_sRetinoblastoma.1.1.1.1	CACATTTGTACAGCAG

Mine:

cell_group_name	cell
all_observations.all_observations_s1	AAAGTGAAGTGGAAGA
all_observations.all_observations_s1	AACAACCTCAGTCTTT
all_observations.all_observations_s1	AACCACAGTTTGGGTT
all_observations.all_observations_s1	AACGGGACAAGCGCAA

For mine, I edited out the reference cells, and also prefixes like REL_ and TN_ ahead of cell names. Nonetheless, the result of mine is:

,100.0

Where the test is like:

1,0.0
1.1,0.0,B
1.1.1,0.0,C
1.1.1.1,15.776699029126213,D
1.1.1.2,16.74757281553398,E
1.1.2,0.0,F
1.1.2.1,15.29126213592233,G
1.1.2.2,27.9126213592233,H
1.2,0.0,I
1.2.1,0.0,J
1.2.1.1,9.466019417475728,K
1.2.1.2,6.553398058252427,L

I guess the difference relates to the .1.1.1.1 etc format, where I have .1-15.
Do you know if there is some way I can get round this, or a way I could make my file in the format?

Would be great to know, as would be very useful if it was possible to run this tool with the Leiden approach ..
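
The naming difference described above can be checked before running the tool. The following is a minimal sketch based only on the two example files quoted in this issue, not on the uphyloplot2 source: the helper name tree_path is hypothetical, and it assumes that a usable group name ends in a dotted numeric path such as ".1.1.1.1", whereas the Leiden-style names end in "_s1" and carry no such path.

```python
import re

# Dotted numeric suffix with at least two components, e.g. ".1.1.1.1"
DOTTED_PATH = re.compile(r"\.(\d+(?:\.\d+)+)$")

def tree_path(cell_group_name):
    """Return the dotted path at the end of the name, or None if absent."""
    match = DOTTED_PATH.search(cell_group_name)
    return match.group(1) if match else None

if __name__ == "__main__":
    names = [
        "Retinoblastoma.Retinoblastoma_sRetinoblastoma.1.1.1.1",
        "all_observations.all_observations_s1",
    ]
    for name in names:
        print(name, "->", tree_path(name))
```

A result of None for every row suggests the file holds flat cluster labels rather than a tree, which would be consistent with the single ",100.0" line reported above.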

Uphyloplot2 with different samples

Hello all and thanks for developing this tool,

In the case of having different tumors from the same samples (such as a sample pre-treatment and in relapse), would it be possible to apply uphyloplot2 to the merged samples (diagnostic+relapse integrated) instead of individually? Would it be recommended?

Thanks

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 178: invalid start byte

Hi there,
Thanks for sharing this tool! I am getting this error when trying to apply it to my inferCNV output:

(upuphyloplot2) -bash-4.2$ python uphyloplot2.py -I /icgc/dkfzlsdf/analysis/OE0519_projects/chptumor/marla/CNVproject/infer_CNV/Uphyloplot2/Inputs
UPhyloplot2 version 2.2
Traceback (most recent call last):
  File "/home/m221r/.conda/envs/upuphyloplot2/lib/python3.8/site-packages/uphyloplot2/uphyloplot2.py", line 239, in <module>
    main()
  File "/home/m221r/.conda/envs/upuphyloplot2/lib/python3.8/site-packages/uphyloplot2/uphyloplot2.py", line 35, in main
    for x, line in enumerate(groupings_file):
  File "/home/m221r/.conda/envs/upuphyloplot2/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 178: invalid start byte

I installed uphyloplot2 in a conda virtual environment using:
a) conda install -c amcruz uphyloplot2 --> from:https://anaconda.org/amcruz/uphyloplot2 (version2.2)
b) pip install git+https://github.com/harbourlab/uphyloplot2.git#egg=uphyloplot2 (version 2.3)

After installation of a) I could locate a uphyloplot2.py file in the following directory: /home/user/.conda/envs/upuphyloplot2/lib/python3.8/site-packages/uphyloplot2
But running the python uphyloplot2.py command gave me the above error.

After installation of b) I could not find the uphyloplot2.py file at all.

Do you know what is causing this error and why I cannot find the uphyloplot2.py file in the 2.3 installation?
Any help would be highly appreciated. Thanks

Best
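
A UnicodeDecodeError of this kind means that some file being read is not valid UTF-8 text. One possible cause (an assumption, not confirmed from the uphyloplot2 source) is a binary or differently encoded file sitting in the Inputs folder. The sketch below lists such files; the function name find_non_utf8_files is hypothetical.

```python
from pathlib import Path

def find_non_utf8_files(folder):
    """Return the names of regular files in `folder` that fail UTF-8 decoding."""
    bad = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # ignore sub-directories
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad
```

Running find_non_utf8_files on the directory passed to the tool and moving any reported files elsewhere is a quick way to test this explanation.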

Compatibility with infercnv 1.9.1

Hi. Thanks for your great work!

I have recently been working with infercnv (version 1.9.1). After running inferCNV as described in the manual, the cell_groupings result differed from what you describe.

The main parameters were as follows:
cutoff=0.1, window_length=20, out_dir='./res', cluster_by_groups=FALSE, denoise=T, analysis_mode='subclusters', HMM_type='i6', HMM=TRUE

So, are there any compatibility problems with the new version of infercnv?
