zktuong / dandelion Goto Github PK

View Code? Open in Web Editor NEW

90.0 90.0 23.0 429.37 MB

dandelion - A single cell BCR/TCR V(D)J-seq analysis package for 10X Chromium 5' data

Home Page: https://sc-dandelion.readthedocs.io/

License: GNU Affero General Public License v3.0

R 0.17% Python 20.90% TeX 0.28% Shell 0.04% Jupyter Notebook 77.45% JavaScript 1.17%

dandelion's People

Contributors

Stargazers

Watchers

dandelion's Issues

contigs that fail germline reconstruction

There should still be some use for contigs that don't pass germline reconstruction (maybe novel genes?).

Need to come up with a way to retain these contigs and cluster with some other method.

filter_contigs extremely slow for large objects

Can't work out how to speed this up...

tests failing for macosx python 3.8

Not sure why.

Seems to get stuck because of multiprocessing.Pool.

adjustText dependency missing

Hi @zktuong,

it seems you'll have to add the adjusttext dependency to your setup.py:

conda create -n test_dandelion python=3.8 pip
conda activate test_dandelion
pip install sc-dandelion
python -m dandelion

Traceback (most recent call last):
  File "/home/sturm/anaconda3/envs/test_dandelion/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/sturm/anaconda3/envs/test_dandelion/lib/python3.8/runpy.py", line 144, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/home/sturm/anaconda3/envs/test_dandelion/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/sturm/anaconda3/envs/test_dandelion/lib/python3.8/site-packages/dandelion/__init__.py", line 10, in <module>
    from . import plotting as pl
  File "/home/sturm/anaconda3/envs/test_dandelion/lib/python3.8/site-packages/dandelion/plotting/__init__.py", line 7, in <module>
    from ._plotting import random_palette, clone_network, barplot, stackedbarplot, spectratype, clone_rarefaction, clone_overlap
  File "/home/sturm/anaconda3/envs/test_dandelion/lib/python3.8/site-packages/dandelion/plotting/_plotting.py", line 16, in <module>
    from adjustText import adjust_text
ModuleNotFoundError: No module named 'adjustText'

Problems with manual preprocessing

Hi Kelvin,

I am trying to run the manual preprocessing with the example data, but I keep getting a few errors:

When doing:
ddl.pp.reassign_alleles(samples[:2], combined_folder = 'tutorial_scgp1', germline = wd+"database/germlines/imgt/human/vdj")

I get the following output and no plots:
...
Concatenating objects
Running tigger-genotype with novel allele discovery.
Novel allele discovery execution halted.
Attempting to run tigger-genotype without novel allele discovery.
Insufficient contigs for running tigger-genotype. Defaulting to original heavy chain v_calls.
Reconstructing heavy chain dmask germline sequences with v_call.
...
and at the end:
... Although tigger-genotype was not run successfully, file will still be saved with _genotyped.tsvextension for convenience.

When running:
ddl.pp.assign_isotypes(samples, blastdb=wd+"database/blast/human/human_BCR_C.fasta")

I get:
BLAST Database error: Error: Not a valid version 4 database.

I am using the databases from your github (/container/database) which I downloaded.

It would be great if you could tell me what I am doing wrong.

Best,
Christiane

Network generation

Hi Kelvin,

I have 2 question about the way the network is generated:

1- how connections between different cells is generated? Because for some of the cells we don't have data, but still I see that there is connection between them. Is there any way of replacement in the cells based on the colon id? If not how the connections are made?

2- Also, I wanted to ask is there any way to know which node was the start of the mutation? Is there any way to know about the roots?

Thanks,
Sara

ImportError: Unable to initialise R instance. Please run this separately through R with Shazam's tutorial.

Hi Kelvin,

Thanks for your helps. I am running the dandelion pipeline for BCR Clustering. When running "ddl.pp.calculate_threshold(vdj)" I get the error:

"ImportError: Unable to initialise R instance. Please run this separately through R with Shazam's tutorial."

How I should run this problem in Jupiter?

Thanks,
Sara

update_metadata needs update to work with concat smoothly

update how update_metadata deals with NaN, None etc. perhaps solvable by #62

AIRR-C ratification issues to address

Could you please explain in more detail how automated tests are run? We’re not familiar with GitHub actions.
- Testing a automated build and test on a separate repo
For compliance, the tool would need to provide a summary of the command line parameters in its output, so that the parameters used are clearly recorded.
- Added a print out for the command line parameters that should appear properly.
The Working Group would also like to see a more specific statement of the support that you provide. We appreciate that most people supporting tools do so as volunteers in their own time, so it’s more a matter of setting expectations than making specific commitments. You might like to review the support statements provided on the websites of currently approved tools.
- Volunteer basis

filter_vj_chain = True/False does the same thing

Description of the bug

in filter_contigs, filter_vj_chain option gets overwritten at some point during the logic. this is now being rectified in the v2.4 branch

Minimal reproducible example

No response

The error message produced by the code above

No response

OS information

No response

Version information

No response

Additional context

No response

TODO:

Pre-processing
~~- [ ] Do we run reassign_genotype for IGL/IGK and/or the J genes?~~ No.

Add in a step to rename the locus for multiple IGHV/J chain calls as ‘Multi’-whatever. Maybe after reassign_alleles
shift create_germlines to preprocessing?
retrieve identity score and length of constant region after assign_isotype
- also return the constant region sequence call to allow for manual checks and adjustments?
create a new column where constant genes are collapsed to isotypes

Visualization and analysis

Finding clones

Add in a wrapper to call Changeo’s DefineClones.py as an alternative way to call clones.
- Not sure if its working yet?
- Also port in SCOPer?
Add a function using graph-based community detection to define sub-clones base on the eventual network
- figure out a different way to calculate the BCR network/embedding using distance matrices from heavy, light and potential transcriptomics data

Network stats

measures to describe the overall clonal structure in a single sample, or per clone.
- Calculate gini indices (e.g. size of clones versus number of clones)
- node connectivity (network stability)
- - closeness-centrality ~~and betweeness-centrality~~ (highlight clonally expanded nodes!)
- edge-betweeness (mutation rate/relationship)
Add a random sampling function to calculate the above if needed to control for sample size.

Miscellaneous

Add parsing function to convert the processed BCR changeo/airr tsv format back to a 10x-style format because there’s some tools like immunarch that can parse 10x format for their analyses and some of the tools included look quite cool.
Add parsing function to convert processed TCR data changeo/airr tsv format back to a 10x-style format to integrate with scirpy
Bulk integration

Documentation

Update the documentation for the functions to ensure that they are descriptive enough for general use.
Add expected type of parameter into functions so it's a bit clearer for some of the more complicated ones. - added 5334512
Clean up some spaghetti code in _tools.py and _plotting.py
update notebook 1 with reccomendations to stick to airr format

function to retrieve/add info directly to metadata

for convenience and perhaps useful to then propagate back to data slot.
~~Also something weird happens when trying to retrieve numeric columns. I need to allow for some sort of behavior that doesn't automatically try to split during retrieval~~ fixed in #91

Integration with scirpy

TODO:

~~Allow for initialization with scirpy initialized adata.~~ Done in scverse/scirpy#241
Allow for initialization with scirpy processed adata.
- need to wait for AirrCell class to be made i guess.
- recover connectivities and distances from scirpy processed adata
- - Not sure if necessary actually, since dandelion only uses the network for visualization which scirpy can do already. Calculating network stats probably requires a newly generated network from dandelion to ensure that it can perform the network contractions properly.
- forwards : should already work with current ddl.tl.transfer

[QUESTION]

Hi Kelvin,

I have a question: I am going to generate a tree to show where the mutation in sequences are occurring, in those clones that dandelion finds edges between them. In other words, I want to generate the tree for the network generated by dandelion.

So the questions are

1- what column I should use for generating the tree, to see the mutations of sequences?
2- Should I use "hamming" distance, or other methods when generating the distance matrix?
3- Have you ever generated a tree for visualizing the sequence mutations based on this dandelion network? Do you have any comments?

Thank you,
Sara

Singularity error

Hi Kelvin,

I am running my Singularity again, but I get this error:

Formating fasta(s) : 0it [00:00, ?it/s]
Assigning genes : 0it [00:00, ?it/s]
Processing data file(s) : 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/share/dandelion_preprocess.py", line 177, in
main()
File "/share/dandelion_preprocess.py", line 144, in main
ddl.pp.reassign_alleles(samples,
File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/preprocessing/_preprocessing.py", line 1275, in reassign_alleles
] + [filepathlist_heavy[0]] + ['>'] + [
IndexError: list index out of range

Is there any way that I can remove this error? What can be the problem?

Thanks,
Sara

[QUESTION]

Hi Kelvin,

I have a question about the column "germline_dandelion" in the file "filtered_contig_igblast_db-pass_genotyped.tsv".

How this column is genrated?

Thank you,
Sara

BCR pre-processing

Would you please help me to figure out the problem with BSR Pre_processing. For the step 4 I get this error:

ddl.pp.assign_isotypes(samples, blastdb = "/Users/saramoein/Downloads/dandelion-master/container/database/blast/human/human_BCR_C.fasta")

The error:

FileNotFoundError Traceback (most recent call last)
in
----> 1 ddl.pp.assign_isotypes(samples, blastdb = "/Users/saramoein/Downloads/dandelion-master/container/database/blast/human/human_BCR_C.fasta")
2
3
4

/opt/anaconda3/lib/python3.8/site-packages/dandelion/preprocessing/_preprocessing.py in assign_isotypes(fastas, fileformat, org, correct_c_call, correction_dict, plot, save_plot, show_plot, figsize, blastdb, allele, parallel, ncpu, filename_prefix, verbose)
1004
1005 for i in range(0, len(fastas)):
-> 1006 assign_isotype(fastas[i],
1007 fileformat=fileformat,
1008 org=org,

/opt/anaconda3/lib/python3.8/site-packages/dandelion/preprocessing/_preprocessing.py in assign_isotype(fasta, fileformat, org, correct_c_call, correction_dict, plot, save_plot, show_plot, figsize, blastdb, allele, parallel, ncpu, filename_prefix, verbose)
818 if verbose:
819 print('Running blastn \n')
--> 820 _run_blastn(filePath, blastdb, format_dict[fileformat], org, verbose)
821 # parsing output into a summary.txt file
822 if verbose:

/opt/anaconda3/lib/python3.8/site-packages/dandelion/preprocessing/_preprocessing.py in _run_blastn(fasta, blastdb, fileformat, org, verbose)
416 print('Running command: %s\n' % (' '.join(cmd)))
417 with open(blast_out, 'w') as out:
--> 418 run(cmd, stdout=out, env=env)
419
420 def _parse_BLAST(fasta, fileformat):

/opt/anaconda3/lib/python3.8/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
491 kwargs['stderr'] = PIPE
492
--> 493 with Popen(*popenargs, **kwargs) as process:
494 try:
495 stdout, stderr = process.communicate(input, timeout=timeout)

/opt/anaconda3/lib/python3.8/subprocess.py in init(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
856 encoding=encoding, errors=errors)
857
--> 858 self._execute_child(args, executable, preexec_fn, close_fds,
859 pass_fds, cwd, env,
860 startupinfo, creationflags, shell,

/opt/anaconda3/lib/python3.8/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
1704 if errno_num != 0:
1705 err_msg = os.strerror(errno_num)
-> 1706 raise child_exception_type(errno_num, err_msg, err_filename)
1707 raise child_exception_type(err_msg)
1708

FileNotFoundError: [Errno 2] No such file or directory: 'blastn'

Thanks

Compatibility with Python 3.10 and diversity metrics

Describe the bug
Can't import dandelion on Python 3.10
(Python 3.11 will be released soon, so about time to support 3.10 ;) )

To Reproduce
In Python 3.10 environment

import dandelion as ddl

scirpy/io/_io.py:723: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    #!/usr/bin/env python
    # @Author: kt16
    # @Date:   2020-05-12 18:11:20
    # @Last Modified by:   Kelvin
    # @Last Modified time: 2022-01-27 18:[53](https://github.com/scverse/scirpy/runs/6478125985?check_suite_focus=true#step:9:54):21
    
>   from . import preprocessing as pp

/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dandelion/__init__.py:7: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    #!/usr/bin/env python
    # @Author: kt16
    # @Date:   2020-05-12 18:42:02
    # @Last Modified by:   Kelvin
    # @Last Modified time: 2022-02-17 22:59:16
    
>   from . import external

/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dandelion/preprocessing/__init__.py:7: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    #!/usr/bin/env python
    # @Author: kt16
    # @Date:   2020-05-12 18:42:02
    # @Last Modified by:   Kelvin
    # @Last Modified time: 2020-11-29 00:48:30
    
>   from ._preprocessing import assigngenes_igblast, makedb_igblast, tigger_genotype, parsedb_heavy, parsedb_light, creategermlines, recipe_scanpy_qc

/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dandelion/preprocessing/external/__init__.py:7: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    #!/usr/bin/env python
    # @Author: kt16
    # @Date:   2020-05-12 17:56:02
    # @Last Modified by:   Kelvin
    # @Last Modified time: 2022-03-04 18:25:36
    
    import os
    import pandas as pd
    import numpy as np
    from subprocess import run
    from datetime import timedelta
    from anndata import AnnData
    from time import time
    
    from sklearn import mixture
>   from ...utilities._utilities import *

/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dandelion/preprocessing/external/_preprocessing.py:16: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    #!/usr/bin/env python
    # @Author: kt16
    # @Date:   2020-05-12 18:41:45
    # @Last Modified by:   Kelvin
    # @Last Modified time: 2021-02-11 12:23:07
    
>   from ._utilities import *

/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dandelion/utilities/__init__.py:7: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    #!/usr/bin/env python
    # @Author: kt16
    # @Date:   2020-05-12 14:01:32
    # @Last Modified by:   Kelvin
    # @Last Modified time: 2022-03-11 20:32:[54](https://github.com/scverse/scirpy/runs/6478125985?check_suite_focus=true#step:9:55)
    
    import os
    import re
    import pandas as pd
    import numpy as np
    
>   from collections import defaultdict, Iterable
E   ImportError: cannot import name 'Iterable' from 'collections' (/opt/hostedtoolcache/Python/3.10.4/x[64](https://github.com/scverse/scirpy/runs/6478125985?check_suite_focus=true#step:9:65)/lib/python3.10/collections/__init__.py)

/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dandelion/utilities/_utilities.py:12: ImportError

Additional context
Scirpy CI run:
https://github.com/scverse/scirpy/runs/6478125985?check_suite_focus=true

Known issues: preprocessing hiccups when run via jupyterhub

So there's a couple of issues that may happen when running via a notebook initialized by jupyterhub if following the default tutorial:

python scripts like AssignGenes.py may complain that the file cannot be found and environmental variables set in ~/.bash_profile are not seen.

This can be fixed by doing this:

Open a terminal and use jupyter kernelspec list to find the folder for your kernel.

Available kernels:
  dandelion    /home/jovyan/.local/share/jupyter/kernels/dandelion

Go there and edit kernel.json, it should look something like this:

cd /home/jovyan/.local/share/jupyter/kernels/dandelion
vi kernel.json

{
 "argv": [
  "/home/jovyan/my-conda-envs/dandelion/bin/python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "Python (dandelion)",
 "language": "python"
}

Now edit that file and add an "env" key to that json. and add a dict with a "PATH" key to the binaries, as well as the other environmental variables, in your conda env like this:

{
 "argv": [
  "/home/jovyan/my-conda-envs/dandelion/bin/python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "env": {
     "PATH": "/home/jovyan/my-conda-envs/dandelion/bin:$PATH",
     "GERMLINE": "/home/jovyan/Softwares/database/germlines/",
     "IGDATA": "/home/jovyan/Softwares/database/igblast/",
     "BLASTDB": "/home/jovyan/Softwares/database/blast/"
 } ,
 "display_name": "Python (dandelion)",
 "language": "python"
}

Restart the instance/kernel and it should start working.

create_germlines

Function is not working as intended immediately after reassign_alleles.
It fails a lot of the heavy chain contigs and doesn't reconstruct the germline sequence.
However, the function seems to work later on as in seen in notebook-3.

Is this an issue with CreateGermlines.py or Dandelion initialization? or the dataframe that is generated?
Not sure. Need to spend more time on this.

TODO: ensure compatibility with mouse data

Need to check if it breaks. I think it will.

keep_trailing_hyphen_number log is misleading

dandelion/container/dandelion_preprocess.py

Lines 49 to 53 in 10654cf

    
           parser.add_argument( 
        
               '--keep_trailing_hyphen_number', 
        
               action='store_false', 
        
               help=('If passed, do not strip out the trailing hyphen number, ' + 
        
                     'e.g. "-1", from the end of barcodes.'))

dandelion/container/dandelion_preprocess.py

Line 84 in 10654cf

f' --keep_trailing_hyphen_number = {args.keep_trailing_hyphen_number}\n'

The option is working as intended but the logging may be misleading as keep_trailing_hyphen_number = false is not intuitive . should add an if-else to switch for the purpose of logging.

[QUESTION] Cannot combine mutiple samples via ddl.read_10x_vdj

Dear @zktuong,

I try to combine mutiple samples in ddl.read_10x_vdj by applying with you tutorial as follows;

first we read in the 2 bcr files

samples = ['A', 'B']
bcr_files = []
for sample in samples:
folder_location = sample
bcr_files.append(ddl.read_10x_vdj(folder_location, filename_prefix='filtered', verbose = True))
bcr = bcr_files[0].append(bcr_files[1:])
bcr.reset_index(inplace = True, drop = True)
bcr

The result show that 'Dandelion' object has no attribute 'append'.
Is there a way to combine mutiple samples using ddl.read_10x_vdj ?

[QUESTION]

Hello,

I have question about the definition of the clode _id. originally it had 4 parts, as it is explained in the tutorial. But in my new run of dandelion, each clone_is has 6 parts (clone_is is like: a_b_c_d_e_f)

Would you please explain what are each part?

Thanks
Sara

Combining samples

Hi Kelvin,

I have a question about generating networks based on combining samples. The way I am doing is concatenating the adata files and also concatenating the VDJ files. But the final results looks something is wrong, because the obtained edges are much less than the individual samples.
Is there any example about it? Or any instruction for combining the samples to generating the networks?

Thank you,
Sara

Extracting the cell_id from the network

Hi Kelvin,

Thanks for answering my previous questions.
I have generated the clone network with dandelion. Now, I am going to extract the list of cells in the dense part of the network.
Is there any way that we can extract the cell_id from a part of the network? For example after zooming on the image, we export the list of cells?

Any information and help would be appreciated. Thanks.

Best,
Sara

`productive_only` option in `filter_bcr`

Currently productive_only option is toggled at the beginning of filter_contigs to restrict analysis.

However, during productive_only = False, because there's more contigs to screen against, you might end up with lesser contigs that pass the filter.

Ideally, the function should still prioritise productive contigs over non-productive contigs but right now it just keeps contigs based on UMI when productive_only = False. Perhaps a simple if-else statement might be useful to do this before comparing UMI?

However, I'm not sure what happens if there's more than 1 productive contigs of the same locus - would this impact on subsequent steps?

[BUG] PerformanceWarning: DataFrame is highly fragmented

best recipe to combine png images into movie

I've now sorted out how to create the 360 png images for the 3D visualization, but every recipe I've tried to combine them into a movie using ffmpeg has resulted in large mp4 file which does not display with vlc or other video players.

What's the recommended way of generating the movie? Maybe the fact that they are dpi=720 is affecting the results from older ffmpeg recipes in StackOverflow?

Thanks in advance

[FEATURE] Add ability in preprocess wrapper to skip format headers

Docker/singularity container?

How does one set this up? hmm.

[BUG] scikit-bio and rpy2 dependency updates

need to see what has changed with rpy2>=3.5 and scikit-bio>=5.7 that is causing issues.

[BUG] preprocessing bugs

The re-alignment of contigs by igblastn is reliable if there is a good V gene but then it can become dodgy when V isn't present and will return the next highest possible hit but the expected significance value for the hit may be unreliable.

#132 Will implement a strict flavour to try and catch these offending contigs. For TCR, currently there's also the ability to just align the J gene separately. Can't get this to work for BCR yet because it seems to interfere with create_germlines later on.

mutation counting

Write a simple hamming distance counter to compare germline sequences (ignore Ns where possible) with input sequence.

Could further code this to work out if this is a synonymous mutation or non-synonymous.

[BUG] scirpy conversion - rename productive

related to scverse/scirpy#343

[BUG] allow manual paths for `germline`

Currently an improper if-else statement is prevent manual specification of germline location.

[BUG] Find threshold using density method NameError: name 'ncpu_' is not defined

Describe the bug
When I run the step in the tutorial BCRclustering,

ddl.pp.calculate_threshold(vdj)

get thi error

   2523                                         model=model_,
-> 2524                                         nproc=ncpu_,
   2525                                         **kwargs)

NameError: name 'ncpu_' is not defined

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: linux ubuntu
dandelion==0.2.0 pandas==1.3.5 numpy==1.21.5 matplotlib==3.5.1 networkx==2.6.3 scipy==1.7.3 skbio==0.5.7

Additional context
Add any other context about the problem here.

dandelion on bulk data

Description of the question

Hi Kelvin,

I have a question about how to run the dandelion on bulk data? Cell ranger output are only generated on single cell.
But dandelion needs filtered_contig_annotation.csv and filtered_contig.fasta.

Would you please help me to solve this issue, when I can not run cell ranger on my data?

Thanks,
Sara

Minimal example

No response

Any error message produced by the code above

NA

OS information

Version information

Additional context

[BUG] min_size in generate_network

min_size in generate_network was doing the opposite behaviour i.e. it was working like max size rather than min size.

reproducing network graph

I've seen a network graph (in 3D) of the "Visualization of an antibody expression landscape in COVID-19, shown as a rotating network": https://www2.mrc-lmb.cam.ac.uk/how-immune-responses-differ-between-asymptomatic-cases-and-people-with-severe-covid-19/

I would like to reproduce it with similar input data, although my preprocessing is done differently as I am dealing with a different VDJ repertoire.

What kind of input does the code for this network graph require? I see some bits of code in this repo mentioning "graph", but I am unsure where it is.

filtered_feature_bc_matrix.h5

Hi Kelvin,

After saying thanks, I have a question: where "filtered_feature_bc_matrix.h5" file is generated? Is it from cell ranger? What is the format of this file? What are rows and what are columns?

Thanks,
Sara

[BUG] nxviz dependency update

Should update current code to try and match the new nxviz api.

/nfs/team297/kt16/Softwares/conda/envs/scvi-env/lib/python3.8/site-packages/nxviz/__init__.py:18: UserWarning: 
nxviz has a new API! Version 0.7.3 onwards, the old class-based API is being
deprecated in favour of a new API focused on advancing a grammar of network
graphics. If your plotting code depends on the old API, please consider
pinning nxviz at version 0.7.3, as the new API will break your old code.

To check out the new API, please head over to the docs at
https://ericmjl.github.io/nxviz/ to learn more. We hope you enjoy using it!

(This deprecation message will go away in version 1.0.)

/nfs/team297/kt16/Softwares/conda/envs/scvi-env/lib/python3.8/site-packages/nxviz/api.py:275: UserWarning: As of nxviz 0.7, the object-oriented API is being deprecated in favour of a functional API. Please consider switching your plotting code! The object-oriented API wrappers remains in place to help you transition over. A few changes between the old and new API exist; please consult the nxviz documentation for more information. When the 1.0 release of nxviz happens, the object-oriented API will be dropped entirely.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_35951/2267730071.py in <module>
----> 1 ddl.pl.clone_overlap(bdatax,
      2                      groupby = 'batch',
      3                      colorby = 'group',
      4                      return_graph=True,
      5                      group_label_offset=.5)

/nfs/team297/kt16/Softwares/conda/envs/scvi-env/lib/python3.8/site-packages/dandelion/plotting/_plotting.py in clone_overlap(self, groupby, colorby, min_clone_size, clone_key, color_mapping, node_labels, node_label_layout, group_label_position, group_label_offset, figsize, return_graph, save, **kwargs)
    833             subset=groupby, keep="first").reset_index(drop=True)
    834 
--> 835     c = nxv.CircosPlot(G,
    836                        node_color=colorby,
    837                        node_grouping=colorby,

/nfs/team297/kt16/Softwares/conda/envs/scvi-env/lib/python3.8/site-packages/nxviz/api.py in __init__(self, G, **kwargs)
    335     def __init__(self, G, **kwargs):
    336         super().__init__()
--> 337         func_kwargs = {object_to_functional[k]: v for k, v in kwargs.items()}
    338         self.fig = plt.figure()
    339         self.ax = circos(G, **func_kwargs)

/nfs/team297/kt16/Softwares/conda/envs/scvi-env/lib/python3.8/site-packages/nxviz/api.py in <dictcomp>(.0)
    335     def __init__(self, G, **kwargs):
    336         super().__init__()
--> 337         func_kwargs = {object_to_functional[k]: v for k, v in kwargs.items()}
    338         self.fig = plt.figure()
    339         self.ax = circos(G, **func_kwargs)

KeyError: 'node_labels'

ddl.pp.calculate_threshold issue

Hi Thank you for the package. Everything run fine until this command. Any advise on how to fix this will be great.

ddl.pp.calculate_threshold(vdj, manual_threshold = 0.1)

NameError Traceback (most recent call last)
/var/folders/3x/r5d2xyzx39sgl5nq6cj8fmvw0000gn/T/ipykernel_21105/3386437591.py in
----> 1 ddl.pp.calculate_threshold(vdj, manual_threshold = 0.1)

~/miniconda3/envs/dandelion/lib/python3.7/site-packages/dandelion/preprocessing/_preprocessing.py in calculate_threshold(self, mode, manual_threshold, VJthenLen, onlyHeavy, model, normalize_method, threshold_method, edge, cross, subsample, threshold_model, cutoff, sensitivity, specificity, ncpu, plot, plot_group, figsize, **kwargs)
2594 warnings.filterwarnings("ignore")
2595
-> 2596 sanitize_dtype(dat)
2597
2598 sh = importr('shazam')

NameError: name 'sanitize_dtype' is not defined

updates to singularity definition file

dandelion/container/sc-dandelion.def

Line 11 in 065ae4b

sources.list /etc/apt/sources.list

remove sources.list /etc/apt/sources.list

dandelion/container/sc-dandelion.def

Line 15 in 065ae4b

    
           apt update -y --allow-insecure-repositories && apt install -y --allow-unauthenticated sudo software-properties-common gnupg

add the following

wget http://security.ubuntu.com/ubuntu/pool/main/i/icu/libicu66_66.1-2ubuntu2_amd64.deb
wget http://security.ubuntu.com/ubuntu/pool/main/libj/libjpeg-turbo/libjpeg-turbo8_2.0.3-0ubuntu1_amd64.deb
wget http://security.ubuntu.com/ubuntu/pool/main/libj/libjpeg8-empty/libjpeg8_8c-2ubuntu8_amd64.deb
sudo apt update && sudo dpkg -i libicu66_66.1-2ubuntu2_amd64.deb
sudo apt update && sudo dpkg -i libjpeg-turbo8_2.0.3-0ubuntu1_amd64.deb
sudo apt update && sudo dpkg -i libjpeg8_8c-2ubuntu8_amd64.deb
rm libicu66_66.1-2ubuntu2_amd64.deb libjpeg-turbo8_2.0.3-0ubuntu1_amd64.deb libjpeg8_8c-2ubuntu8_amd64.deb

Alternatively, build the image using rocker/r-base:4.1.2 as base image. but can't seem to get the gcc to link up to python.

Remove skbio dependency / clean up dependencies

Hi @zktuong,

I am having issues testing the to_dandelion on the CI, because installation of scikit-bio fails. This is a known problem and will be fixed in the next skbio release, but as I didn't find any usage of this package in dandelion, it might be easiest to just remove it.

In general, it would be great if you could clean up the requirements.txt to include only packages you really need. The fewer packages, the more likely dandelion is to install without issues.

Cheers,
Gregor

Singularity path

If igblast and blast are installed via conda, it can interfere with singularity's attempt to call the right version of the software as the local conda bin path is appended to the front of the container's $PATH whenever the singularity image is mounted.

The current work around is unfortunately not to install igblast and blast via conda, but managed separatedly and exporting to $PATH locally, if the singularity container is also used.

Tree from BCR clone_id after BCR clustering

Hi Kelvin,

I have a question: is there any function in dandelion that helps to visualize the results of BCR clustering? I have generated all clone_ids , and would like to visualize it. But I can not find any function in dandelion.

Do you have any comment for me? Thanks.

Best,
Sara

compatibility with TCR data

base mode is compatible but need to update the rest of the associated functions.

Need to write tests to make sure they make it all the way through.

running without filtered h5 file

Would it be possible to run the airr_rearrangement.tsv file based clustering and 3D network plotting without supplying a filtered h5 file?

If one only has data for the VDJ library and not the 5'GEX, this filtered h5 file wouldn't exist, yet the airr_rearrangement.tsv file for the VDJ data is available. See below the steps that currently work for both:

#!/usr/bin/env python
airr_file='/data/RUN/RUN021/X204SC20080363-Z01-F00N/L005/IM20p21-PET003-VDJ/IM20p21-PET003-VDJ_mlti/outs/vdj_b/airr_rearrangement.tsv'
filt_file='/data/RUN/RUN021/X204SC20080363-Z01-F00N/L005/IM20p21-PET003-VDJ/IM20p21-PET003-VDJ_mlti/outs/count/filtered_feature_bc_matrix.h5'

outputdir='/data/RUN/RUN021/X204SC20080363-Z01-F00N/L005/IM20p21-PET003-VDJ/IM20p21-PET003-VDJ_mlti/outs/vdj_b//ddli_images'

import os
import pandas as pd
os.chdir(os.path.expanduser('/home/avilella/dandelion/'))
import dandelion as ddl
import scanpy as sc

print("read_10x_airr")
vdj = ddl.read_10x_airr(airr_file)
print("end read_10x_airr")
print("read_10x_h5")
adata = sc.read_10x_h5(filt_file,gex_only=True)
print("end read_10x_h5")
adata.obs['sample_id'] = 'sample'
print("filter bcr")
vdj, adata = ddl.pp.filter_bcr(vdj, adata)
print("end filter bcr")

print("find_clones")
ddl.tl.find_clones(vdj)
print("end find_clones")

print("transfer")
ddl.tl.transfer(adata, vdj)
print("end transfer")

print("generate_network")
ddl.tl.generate_network(vdj, dim=3, key="sequence_alignment")
print("end generate_network")

import networkx as nx
import random
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

color_dict = {'unassigned':'#e7e7e7',
              'No_BCR':'#ffffff',
              'IgM':'#264653',
              'IgD':'#2A9D8F',
              'IgA':'#E9C369',
              'IgE':'#000000',
              'IgG':'#D7431D'
             }

# replace missing values
vdj.metadata['isotype'] = vdj.metadata['isotype'].replace(np.nan, 'No_BCR')
vdj.metadata['isotype'] = vdj.metadata['isotype'].replace('', 'unassigned')

## create a color dict mapping for the nodes
isotype = {x:color_dict[vdj.metadata.at[x, 'isotype']] for x in vdj.metadata.index}

# let's just do it in the expanded plot
layout = vdj.layout[1]
G = vdj.graph[1]

# extract layoyt positons
def extract_position(layout):
    pos = {l: (layout[l][0], layout[l][1], layout[l][2]) for l in layout}
    return(pos)

pos = extract_position(layout)

# add node attributes
for p in pos:
    G.nodes[p]['pos'] = pos[p]
for i in isotype:
    if i in G.nodes():
        G.nodes[i]['isotype'] = isotype[i]

# core function
def network_plot_3D(G, angle, col_dict, outputdir, save=False):
    # Get node positions
    pos = nx.get_node_attributes(G, 'pos')
    
    # Get the maximum number of edges adjacent to a single node
    edge_max = max([G.degree(i) for i in G.nodes()])
    # Define color range proportional to number of edges adjacent to a single node
    colors = col_dict 
    # 3D network plot
    with plt.style.context(('ggplot')):
        
        fig = plt.figure(figsize=(14,14))
        ax = Axes3D(fig)
        
        # Loop on the pos dictionary to extract the x,y,z coordinates of each node
        for key, value in pos.items():
            xi = value[0]
            yi = value[1]
            zi = value[2]
            
            # Scatter plot
            ax.scatter(xi, yi, zi, color=colors[key], s=50, edgecolors='k', alpha=1)
            ax.set_facecolor("white")
        
        # Loop on the list of edges to get the x,y,z, coordinates of the connected nodes
        # Those two points are the extrema of the line to be plotted
        for i,j in enumerate(G.edges()):
            x = np.array((pos[j[0]][0], pos[j[1]][0]))
            y = np.array((pos[j[0]][1], pos[j[1]][1]))
            z = np.array((pos[j[0]][2], pos[j[1]][2]))
        
        # Plot the connecting lines
            ax.plot(x, y, z, c='black', alpha=0.5)
        ax.set_xlim(-1, 1)
        ax.set_ylim(-1, 1)
        ax.set_zlim(-1, 1)
        fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
        bbox = fig.bbox_inches.from_bounds(1, 1, 12, 12)
    
    # Set the initial view
    ax.view_init(30, angle)
    # Hide the axes
    ax.set_axis_off()
    if save == True:
        plt.savefig(outputdir+str(angle).zfill(3)+".png", bbox_inches=bbox, dpi=720)
    else:
        plt.show()
    return

from tqdm import tqdm
from joblib import Parallel, delayed

# plot and save out!
# for k in tqdm(range(0,360,1)):
#     angle = k
#     network_plot_3D(G, angle, col_dict = isotype, save=True)

Parallel(n_jobs=8)(delayed(network_plot_3D)(G, i, col_dict = isotype, outputdir=outputdir, save = True) for i in tqdm(range(0,360,1)))
#Parallel(n_jobs=8)(delayed(network_plot_3D)(G, i, save = True) for i in tqdm(range(0,360,1)))

#sc.set_figure_params(figsize = [4,4])
#ddl.pl.clone_network(adata, 
#                     color = ['sample_id'], 
#                     edges_width = 1,
#                     size = 50) 
#
import pdb; pdb.set_trace()
pass; # args.debug=False

1;

update_metadata too complicated

Need to streamline the options to make the code dryer and easier to trace.

Perhaps enforcing literal terms so that there is only a predefined set of behaviors?
Make each behavior a Class?
Update tutorial on how to use this to maximum benefits

	parser.add_argument(
	'--keep_trailing_hyphen_number',
	action='store_false',
	help=('If passed, do not strip out the trailing hyphen number, ' +
	'e.g. "-1", from the end of barcodes.'))

zktuong / dandelion Goto Github PK

dandelion's People

Contributors

Stargazers

Watchers

Forkers

dandelion's Issues

Description of the bug

Minimal reproducible example

The error message produced by the code above

OS information

Version information

Additional context

first we read in the 2 bcr files

Description of the question

Minimal example

Any error message produced by the code above

OS information

Version information

Additional context

Recommend Projects

Recommend Topics

Recommend Org