sidhomj / deeptcr

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data

Home Page: https://sidhomj.github.io/DeepTCR/

License: MIT License

Python 17.85% Jupyter Notebook 82.14% Shell 0.01%

deep-learning repertoire tcr adaptive-immunity immunology tcr-repertoire

deeptcr's Issues

Recommendations for handling large datasets

Hi, thank you for creating this great tool!

I was wondering if you could offer some guidance on handling large datasets in the unsupervised workflow. In particular, the clustering and KNN classification steps seem to be prohibitively memory-expensive.

I think that downsampling is interfering with the classification accuracy, so I would like to use all the data if possible.

Thanks so much for your help!

Leeana

Question about dropout rate

Hi.
Is the dropout probability in Convolutional_Features() 0.0 when training the unsupervised model? If not, where is this probability defined?

And did I understand correctly that there is no pooling between layers?

matched alpha beta

Question about incomplete data

Hi again,
How does DeepTCR deal with columns that have some missing values? For example, if there are some TCRBs that are missing the D gene.

-

DeepTCR_WF object has no attribute 'Y'

Interpreting Sequence_Inference output

Hello,

I would like to train a supervised model with known antigen specificity, then use that model to classify new TCR sequences as potentially targeting certain antigens. I have followed the tutorials but am still unclear on the best way to do this. I believe the closest is the "8 - VAE Inference.ipynb" tutorial, but using a supervised model rather than the unsupervised one. However, I am unclear on how to interpret the output from Sequence_Inference. I am using the example data Mouse Antigens for the model and Rudqvist for the new dataset. The resulting "features" object is 23856x9, which I believe corresponds to the individual TCR sequences (23856) and 9 different antigens, with the entries being scores for how well each TCR sequence fits that antigen.

Does a higher or lower score mean the TCR sequence fits better with the given antigen?

I tried to assess this myself by looking at the features of the supervised model; however, this object has 224 columns. I was expecting it to have 9, corresponding to the different antigens.

What do the columns of the features object from the supervised model correspond to?
Would you suggest this method of classification, or something more akin to this tutorial "3 - Supervised Sequence Regression.ipynb"?

Thank you for your help!

Information

Hi sidhomj,

Very nice tool. I have a question: can it be used with the SMART-Seq v4 PLUS Kit or the SMARTer Human TCR ab Profiling kit?

Simone

Reproducible clustering

Hi,
Is there a way to make the training and clustering reproducible? Setting graph_seed and split_seed in Train_VAE does not seem to do the trick.

Shared Motif for Clusters after Phenograph clustering

Hello Mr. Sidhom,
thank you for creating DeepTCR! I am using the supervised sequence classifier, including HLA supertypes, for samples from different patients and treatments. My question is: is it possible to extract shared motifs that are common within each cluster?
As far as I understand, both DTCR_SS.Representative_Sequences() and DTCR_SS.Motif_Identification() explicitly extract sample-specific motifs.
Thank you in advance for your help!

Error while running 6 - KNN classification tutorial

Similar to #18, but in a different file:

All the other tutorials work : )

supervised learning example 2 error

I am running through the second supervised learning example using the Rudqvist data and keep getting an error during training.

Input:

from DeepTCR.DeepTCR import DeepTCR_WF

# Instantiate training object
DTCR_WF = DeepTCR_WF('Tutorial')
# Load Data from directories
DTCR_WF.Get_Data(directory='github/DeepTCR/Data/Rudqvist',
                 Load_Prev_Data=False,
                 aggregate_by_aa=True, aa_column_beta=1, count_column=2,
                 v_beta_column=7, d_beta_column=14, j_beta_column=21)
DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
DTCR_WF.Train()

Error:
Training_Statistics:
Epoch: 1/10000 Training loss: 1.39491 Validation loss: 1.38709 Testing loss: 1.36048 Training Accuracy: 0.41667 Validation Accuracy: 0.0 Testing Accuracy: 0.5 Testing AUC: 0.66667
Training_Statistics:
Epoch: 2/10000 Training loss: 1.37405 Validation loss: 1.37631 Testing loss: 1.36329 Training Accuracy: 0.33333 Validation Accuracy: 0.0 Testing Accuracy: 0.5 Testing AUC: 0.66667
Training_Statistics:
Epoch: 3/10000 Training loss: 1.35652 Validation loss: 1.36742 Testing loss: 1.36681 Training Accuracy: 0.41667 Validation Accuracy: 0.25 Testing Accuracy: 0.5 Testing AUC: 0.58333
Training_Statistics:
Epoch: 4/10000 Training loss: 1.34040 Validation loss: 1.35875 Testing loss: 1.37043 Training Accuracy: 0.66667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 5/10000 Training loss: 1.32491 Validation loss: 1.34920 Testing loss: 1.37438 Training Accuracy: 0.75 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 6/10000 Training loss: 1.30922 Validation loss: 1.33924 Testing loss: 1.37817 Training Accuracy: 0.83333 Validation Accuracy: 0.25 Testing Accuracy: 0.5 Testing AUC: 0.58333
Training_Statistics:
Epoch: 7/10000 Training loss: 1.29348 Validation loss: 1.32873 Testing loss: 1.38209 Training Accuracy: 0.83333 Validation Accuracy: 0.25 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 8/10000 Training loss: 1.27746 Validation loss: 1.31783 Testing loss: 1.38608 Training Accuracy: 0.91667 Validation Accuracy: 0.25 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 9/10000 Training loss: 1.26097 Validation loss: 1.30654 Testing loss: 1.39095 Training Accuracy: 0.91667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 10/10000 Training loss: 1.24401 Validation loss: 1.29454 Testing loss: 1.39617 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 11/10000 Training loss: 1.22642 Validation loss: 1.28242 Testing loss: 1.40190 Training Accuracy: 0.91667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 12/10000 Training loss: 1.20822 Validation loss: 1.27018 Testing loss: 1.40825 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 13/10000 Training loss: 1.18927 Validation loss: 1.25744 Testing loss: 1.41549 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 14/10000 Training loss: 1.16937 Validation loss: 1.24402 Testing loss: 1.42367 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 15/10000 Training loss: 1.14860 Validation loss: 1.22993 Testing loss: 1.43312 Training Accuracy: 0.83333 Validation Accuracy: 0.75 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 16/10000 Training loss: 1.12682 Validation loss: 1.21520 Testing loss: 1.44380 Training Accuracy: 0.75 Validation Accuracy: 0.75 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 17/10000 Training loss: 1.10405 Validation loss: 1.20000 Testing loss: 1.45633 Training Accuracy: 0.75 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 18/10000 Training loss: 1.08021 Validation loss: 1.18428 Testing loss: 1.47065 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 19/10000 Training loss: 1.05532 Validation loss: 1.16801 Testing loss: 1.48714 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 20/10000 Training loss: 1.02967 Validation loss: 1.15104 Testing loss: 1.50585 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 21/10000 Training loss: 1.00291 Validation loss: 1.13350 Testing loss: 1.52693 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 22/10000 Training loss: 0.97514 Validation loss: 1.11567 Testing loss: 1.55058 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 23/10000 Training loss: 0.94636 Validation loss: 1.09774 Testing loss: 1.57718 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 24/10000 Training loss: 0.91668 Validation loss: 1.07971 Testing loss: 1.60673 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 25/10000 Training loss: 0.88618 Validation loss: 1.06198 Testing loss: 1.63951 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 26/10000 Training loss: 0.85506 Validation loss: 1.04472 Testing loss: 1.67605 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 27/10000 Training loss: 0.82325 Validation loss: 1.02834 Testing loss: 1.71696 Training Accuracy: 0.91667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333

AttributeError Traceback (most recent call last)
in
1 DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
----> 2 DTCR_WF.Train()

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
3223 GO.saver.save(sess, os.path.join(self.Name, 'model', 'model.ckpt'))
3224
-> 3225 self.HLA_embed = GO.embedding_layer_hla.eval()
3226
3227 with open(os.path.join(self.Name, 'model', 'model_type.pkl'), 'wb') as f:

AttributeError: 'graph_object' object has no attribute 'embedding_layer_hla'

System: MacOS 10.13.6

Error while running tutorial

1 - Training VAE, second cell:

Add Pillow to requirements

Otherwise one receives an error saying TIF not supported when making heatmaps:

VAE AUC violin plot Y-axis value >1

I am running the unsupervised tutorial with the following code:

from DeepTCR.DeepTCR import DeepTCR_U

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

# Load Data from directories
DTCRU.Get_Data(directory='github/DeepTCR/Data/Rudqvist', Load_Prev_Data=False, aggregate_by_aa=True,
               aa_column_beta=1, count_column=2, v_beta_column=7, d_beta_column=14, j_beta_column=21)

# Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, accuracy_min=0.9)

Output:

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 50.87618613243103 seconds
Jaccard graph constructed in 12.015304803848267 seconds
Wrote graph to binary file in 3.909883975982666 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.991586
Louvain completed 21 runs in 9.033319234848022 seconds
PhenoGraph complete in 75.99179887771606 seconds
Clustering Done

Suggestion for UMAP_Plot() function

Hi,

This package has been useful, but I have a suggestion for UMAP_Plot(). It would be nice to be able to place the legend to the right of the plot, or anywhere that isn't on the plot itself, because when the number of labels is large the legend tends to block a substantial portion of the graph.

KNN_Sequence_Classifier

Hi,

Is it possible to have a Load_Previous_Data for the KNN_Sequence_Classifier function? It takes too much time to run.

I am currently using version 1.2.21
Thanks!

Tensor shape mismatch while running "2 - Supervised Repertoire Classification" Tutorial

I am running the tutorial as is. When I am training the model on the data, there seems to be a tensor shape mismatch.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-685fbc9e79fc> in <module>()
----> 1 DTCR_WF.Train()

/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/DeepTCR.py in Train(self, kernel, num_concepts, trainable_embedding, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla, num_fc_layers, units_fc, weight_by_class, class_weights, use_only_seq, use_only_gene, use_only_hla, size_of_net, graph_seed, qualitative_agg, quantitative_agg, num_agg_layers, units_agg, drop_out_rate, multisample_dropout, multisample_dropout_rate, multisample_dropout_num_masks, batch_size, batch_size_update, epochs_min, stop_criterion, stop_criterion_window, accuracy_min, train_loss_min, hinge_loss_t, convergence, learning_rate, suppress_output, loss_criteria, batch_seed)
   5026               accuracy_min,train_loss_min,hinge_loss_t,convergence,learning_rate, suppress_output,
   5027                     loss_criteria)
-> 5028         self._train(write=True,batch_seed=batch_seed,iteration=0)
   5029 
   5030     def Monte_Carlo_CrossVal(self,folds=5,test_size=0.25,LOO=None,combine_train_valid=False,random_perm=False,seeds=None,

/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/DeepTCR.py in _train(self, write, batch_seed, iteration)
   4747                 train_loss, train_accuracy, train_predicted,train_auc = \
   4748                     Run_Graph_WF(self.train,sess,self,GO,batch_size,batch_size_update,random=True,train=True,
-> 4749                                  drop_out_rate=drop_out_rate,multisample_dropout_rate=multisample_dropout_rate)
   4750 
   4751                 train_accuracy_total.append(train_accuracy)

/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, batch_size_update, random, train, drop_out_rate, multisample_dropout_rate)
    719         elif train:
    720             loss_i, accuracy_i, _, predicted_i = sess.run([GO.loss, GO.accuracy, GO.opt, GO.predicted],
--> 721                                                           feed_dict=feed_dict)
    722         else:
    723             loss_i, accuracy_i, predicted_i = sess.run([GO.loss, GO.accuracy, GO.predicted],

/home/ubuntu/anaconda3/envs/deeptcr/lib/python3.6/site-packages/tensorflow_gpu-1.15.2-py3.6-linux-x86_64.egg/tensorflow_core/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    954     try:
    955       result = self._run(None, fetches, feed_dict, options_ptr,
--> 956                          run_metadata_ptr)
    957       if run_metadata:
    958         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/home/ubuntu/anaconda3/envs/deeptcr/lib/python3.6/site-packages/tensorflow_gpu-1.15.2-py3.6-linux-x86_64.egg/tensorflow_core/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1154                 'Cannot feed value of shape %r for Tensor %r, '
   1155                 'which has shape %r' %
-> 1156                 (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
   1157           if not self.graph.is_feedable(subfeed_t):
   1158             raise ValueError('Tensor %s may not be fed.' % subfeed_t)

ValueError: Cannot feed value of shape (16, 1) for Tensor 'Placeholder_2:0', which has shape '(?, 4)'

I am running it on CentOS with an NVIDIA GPU. All the other tutorials seem to be working well.

Can't load own Data using DTCR_SS.Get_Data

TRB.txt
I have TCR-seq data that was annotated by IGB and preprocessed for DeepTCR as indicated in the tutorial.
I have 9 samples with many TCRs; here is an excerpt of the data for one sample:

cdr3_aa	v_call	d_call	j_call	Count
ASSARQDLQQY	TRBV2*01	TRBD1*01	TRBJ2-7*01	39890
ASKDRALLRAV	TRBV21-1*01	TRBD1*01	TRBJ2-7*01	32323
ASSFSATNTGELF	TRBV5-1*01	TRBD2*01	TRBJ2-2*01	26637
ASSPGEQNTGELF	TRBV7-8*01	TRBD2*01	TRBJ2-2*01	26258
ASSGAGTGGYNEQF	TRBV12-3*01	TRBD1*01	TRBJ2-1*01	16692
ASSFSGHTGELF	TRBV7-2*01	TRBD2*01	TRBJ2-2*01	13838
ASSVETGTEKY	TRBV7-9*01	TRBD1*01	TRBJ2-3*01	13831
PPVIWTATSST	TRBV24-1*01	TRBD1*01	TRBJ2-7*01	13819
ASSSGLAGAYEQY	TRBV7-2*02	TRBD2*01	TRBJ2-7*01	13216
ASSFGVSGANVLT	TRBV7-9*03	TRBD2*01	TRBJ2-6*01	11449
ASSGLAGGPGTGELF	TRBV9*01	TRBD2*02	TRBJ2-2*01	11292
ASSPLAGGVAQF	TRBV7-6*01	TRBD2*02	TRBJ2-1*01	11019
ASSSTGQGNSYEQY	TRBV28*01	TRBD1*01	TRBJ2-7*01	10466

If I run the tutorial for supervised sequence classification using the example data from the repository, loading data, clustering etc. work perfectly, except for DTCR_SS.Train(), which throws:

AttributeError: 'DeepTCR_SS' object has no attribute 'test_pred'

DTCR_SS.Monte_Carlo_CrossVal, DTCR_SS.K_Fold_CrossVal etc. work.

If I then replace the folders in Data/Murine_Antigens with my samples, DTCR_SS.Get_Data(), which usually takes just a moment to load the data, gets stuck (I stopped it after 40 min).

Restricting the input to TCRs with >= 1000 reads, which results in tables of 50-80 rows, does not resolve the issue.

import sys
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_SS

# Instantiate training object
DTCR_SS = DeepTCR_SS('Tutorial')

#Load Data from directories
DTCR_SS.Get_Data(directory='../../Data/TRB',Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=4,v_beta_column=1,j_beta_column=3)

Output:

Loading Data ...

Is there anything that could cause this kind of bug?

Attached you will find the data for one sample for TCR-seqs > 1000 (a .tsv saved as a .txt file).

Thank you in advance for your help!

How to export latent feature matrix resulted by VAE training

Hello!
I would like to use the latent feature matrix produced by VAE training for other analyses of my own instead of using the DeepTCR pipeline, but I don't know how to export it (as a .tsv file, for example). Could you give me some advice?

Thank you for your help!
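
A minimal sketch of one way to do the export, assuming the latent matrix is available after Train_VAE as a 2-D array of floats (the attribute name DTCRU.features is an assumption, not confirmed in this thread). The helper write_features_tsv is hypothetical and uses only the standard library:

```python
import csv
import os
import tempfile

def write_features_tsv(path, features, row_ids=None):
    """Write a 2-D matrix (rows = sequences, columns = latent dimensions) as TSV."""
    n_cols = len(features[0]) if len(features) else 0
    header = ["id"] + [f"latent_{j}" for j in range(n_cols)]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        for i, row in enumerate(features):
            # Fall back to the row index when no identifiers are supplied.
            row_id = row_ids[i] if row_ids is not None else i
            writer.writerow([row_id] + [float(x) for x in row])

# Toy stand-in for the latent matrix; with DeepTCR this would be the trained
# model's feature array (assumed here to be exposed as DTCRU.features).
toy_features = [[0.1, -1.2, 0.5], [0.0, 0.3, 2.0]]
out_path = os.path.join(tempfile.gettempdir(), "latent_features.tsv")
write_features_tsv(out_path, toy_features, row_ids=["seq_0", "seq_1"])
```

If the matrix is a NumPy array, numpy.savetxt(path, features, delimiter="\t") is a shorter alternative when no identifier column is needed.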

Performance

Thanks for writing this package. This package is very useful.

Wrong sklearn version

requirements.txt states 0.19.1, but OneHotEncoder in that version doesn't have a parameter named "categories":

https://scikit-learn.org/0.19/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

which is used here:

DeepTCR/DeepTCR/DeepTCR.py

Line 324 in eb58a95

OH = OneHotEncoder(sparse=False,categories='auto')

This wasn't introduced until version 0.20
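
The mismatch can be checked before running the tutorials. A small sketch, assuming plain dotted release versions such as 0.19.1 (any non-numeric suffix is ignored); version_tuple and supports_categories are hypothetical helpers, and the 0.20 threshold is the one stated above:

```python
import re

def version_tuple(version):
    """Turn '0.19.1' into (0, 19, 1), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def supports_categories(sklearn_version):
    """True if OneHotEncoder accepts `categories` (introduced in 0.20)."""
    return version_tuple(sklearn_version) >= (0, 20)

print(supports_categories("0.19.1"))  # False
print(supports_categories("0.20.0"))  # True
```

In a live environment, the installed version string is available as sklearn.__version__.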

Parallelize KNN

Could we add n_jobs to KNN_Sequence_Classifier by passing n_jobs to KNeighborsClassifier from the sklearn module? Currently KNN_Sequence_Classifier is very slow.

Thanks!

Supervised learning train error: need at least one array to concatenate

I am running a test using my own data. After loading the data successfully, I got an error when training:

# Load Data from directories
DTCR_WF.Get_Data(directory='data_test/',
                 Load_Prev_Data=False,
                 aggregate_by_aa=True,
                 aa_column_beta=1, v_beta_column=3, d_beta_column=4, j_beta_column=5,
                 count_column=6, n_jobs=2, sep=",")
DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
DTCR_WF.Train()

Error message start ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in
1 DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
----> 2 DTCR_WF.Train()
3
4 # DTCR_WF.Monte_Carlo_CrossVal(folds=5,test_size=0.3,stop_criterion=0.25,epochs_min=100,
5 # suppress_output = False)

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
3148
3149 valid_loss, valid_accuracy, valid_predicted, valid_auc =
-> 3150 Run_Graph_WF(self.valid, sess, self, GO, batch_size, random=False, train=False)
3151
3152

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, random, train, drop_out_rate)
390 loss = np.mean(loss)
391 accuracy = np.mean(accuracy)
--> 392 predicted_out = np.vstack(predicted_list)
393 try:
394 auc = roc_auc_score(set[-1], predicted_out)

~/anaconda3/envs/dl/lib/python3.7/site-packages/numpy/core/shape_base.py in vstack(tup)
281 """
282 _warn_for_nonsequence(tup)
--> 283 return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
284
285

ValueError: need at least one array to concatenate
End of error message ------------------------------------
The directory structure is as follows:

data_test/
├── A
│   ├── A_1.csv
│   └── A_2.csv
├── B
│   ├── B_1.csv
│   └── B_2.csv
├── C
│   ├── C_1.csv
│   └── C_2.csv
└── D
    ├── D_1.csv
    └── D_2.csv

Each CSV file contains beta chain information:
AAACCTGCAGGCTGAA-1,CASSIRDTETLYF,498,TRBV16,TRBD1,TRBJ2-3,1
AAACGGGAGGGTGTGT-1,CASGEGQTNSDYTF,568,TRBV13-2,TRBD1,TRBJ1-2,5
AAACGGGGTCTTCAAG-1,CASSGQNQDTQYF,503,TRBV15,TRBD1,TRBJ2-5,1
AAACGGGTCTAACTGG-1,CASSLGWHSYEQYF,572,TRBV16,None,TRBJ2-7,3
AAAGATGAGAATTGTG-1,CASGPGQSNTEVFF,527,TRBV13-2,TRBD1,TRBJ1-1,7
AAAGCAATCTGGCGAC-1,CASSDGLGGLEQYF,481,TRBV13-1,TRBD2,TRBJ2-7,7
AAATGCCCAATCCAAC-1,CAWVDWAQNTLYF,544,TRBV31,TRBD2,TRBJ2-4,3
AAATGCCTCGGCTTGG-1,CSAQGAHTEVFF,566,TRBV1,TRBD1,TRBJ1-1,18
AACACGTGTATAATGG-1,CASSSPLAGQDTQYF,519,TRBV3,None,TRBJ2-5,1

Number of records in each input file:
808 data_test/A/A_1.csv
1920 data_test/A/A_2.csv
2163 data_test/B/B_1.csv
1879 data_test/B/B_2.csv
836 data_test/C/C_1.csv
1182 data_test/C/C_2.csv
1705 data_test/D/D_1.csv
2091 data_test/D/D_2.csv

Error when changing max_length

Hi,
In Train_VAE(), a ValueError is raised if a different max_length has been given to DeepTCR_U. The tensor shapes do not match on line 2292 when tf.equal() is called.

ValueError when running "8 - VAE Inference" in unsupervised tutorials

When running the inference, I get an error on the following line:

features,_ = DTCRU.Sequence_Inference(beta_sequences=beta_sequences,v_beta=v_beta,j_beta=j_beta)

The error is the following:

WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py:1005: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.

WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py:1006: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from murine_antigens/models/model_0/model.ckpt
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-da530d2eb961> in <module>()
----> 1 features,_ = DTCRU.Sequence_Inference(beta_sequences=beta_sequences,v_beta=v_beta,j_beta=j_beta)

ValueError: too many values to unpack (expected 2)
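
This message means the call returned more than two values, so the two-name unpacking no longer matches. A sketch of the failure and a tolerant way to unpack, using a hypothetical stand-in function (fake_inference is not DeepTCR's API, and the number of values the installed Sequence_Inference returns is not stated in this report):

```python
def fake_inference():
    """Stand-in that returns three values, like a function whose return signature grew."""
    return [[0.1, 0.2]], "extra_1", "extra_2"

try:
    features, _ = fake_inference()
except ValueError as err:
    # The message begins with "too many values to unpack".
    message = str(err)
    print(message)

# Starred unpacking keeps the first value and collects whatever else is returned.
features, *rest = fake_inference()
print(len(rest))  # 2
```

Inspecting the contents of rest (or the function's docstring in the installed version) shows what the additional return values are before any of them is discarded.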

Heatmap save to file output blank file

DTCRU.HeatMap_Sequences(filename='Heatmap_Sequences.pdf')

The resulting file is blank.

No Sample Labels on Dendrogram plots

Hello Prof. Sidhom,
thank you for creating DeepTCR, it is a very useful and cool tool.

The issue I experience when creating the dendrogram plots is that the sample circles are labeled (by sample/class) in nearly unreadable colors, almost indistinguishable from the background.

Code

DTCRU.Repertoire_Dendrogram(n_jobs=40,distance_metric='correlation',sample_labels=True)
DTCRU.Repertoire_Dendrogram(n_jobs=40,distance_metric='correlation',log_scale=True,Load_Prev_Data=True,sample_labels=True)

Output

UMAP transformation...
PhenoGraph Clustering...
Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 5.6398985385894775 seconds
Jaccard graph constructed in 1.4102559089660645 seconds
Wrote graph to binary file in 0.9278509616851807 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.970462
Louvain completed 21 runs in 2.248636484146118 seconds
PhenoGraph complete in 10.259825944900513 seconds
Clustering Done
/home/patrick/anaconda3/envs/DEEPTCR_ENV/lib/python3.8/site-packages/DeepTCR/functions/utils_u.py:161: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. In future versions, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = copy.copy(mpl.cm.get_cmap("viridis"))
  cmap_viridis.set_under(color='white', alpha=0)
/home/patrick/anaconda3/envs/DEEPTCR_ENV/lib/python3.8/site-packages/DeepTCR/functions/utils_u.py:161: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. In future versions, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = copy.copy(mpl.cm.get_cmap("viridis"))
  cmap_viridis.set_under(color='white', alpha=0)

Is it possible to change the font color of the sample labels?

Run DeepTCR using more than one GPU

Hi,

I have a few questions regarding GPU usage.

I was wondering if I can run DeepTCR using multiple GPUs. I noticed that I am allowed to select which GPU I want to put the graph and train on if I have a multi-GPU environment. But does that mean I can only specify one GPU?
I saw "I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero" in my log. Some people say this is just a warning instead of an error and can be simply ignored. Have you encountered the same issue before? Any insight will be greatly appreciated.

Also, I was wondering if you could provide some insights on training an imbalanced dataset (binary classification) for this algorithm. Would you suggest using a balanced training dataset or including as much data as possible?

Thanks for your time and help!
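
On the imbalance question, one common option besides downsampling is to weight classes inversely to their frequency. The Train signatures quoted in the tracebacks on this page list weight_by_class and class_weights parameters; the format DeepTCR expects for class_weights is not shown here, so the dictionary below is only an assumption. inverse_frequency_weights is a hypothetical helper:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy binary labels with a 9:1 imbalance.
labels = ["negative"] * 90 + ["positive"] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

With this scheme the minority class receives the larger weight, so each class contributes roughly equally to the loss while all the data is kept.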

Getting an import error when running 'from DeepTCR.DeepTCR import DeepTCR_U'

ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

Issue with DTCR_WF and DTCR_SS Get_Train_Valid_Test

Sorry if there's an obvious answer to this; I'm not very experienced with Python and am learning it for my Honours degree.

When running this part of the DTCR_WF script:

DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
DTCR_WF.Train()

I'm getting an error that reads:

" line 4014, in
Get_Train_Valid_Test
raise Exception('Choose different train/valid/test parameters!')

Exception: Choose different train/valid/test parameters! "

I am also having what feels like a related issue with DTCR_SS at the same step of its script. It does not throw an error and starts training, but nothing happens during training: it continues until stopped, with the training loss, validation loss, testing loss, etc. all reading 0.

Again, sorry if there is an obvious answer. I'd appreciate any help. Thank you. I should note that the data I'm using works fine with the unsupervised script.

Sequence not featurized

Dear all,

I am trying to analyze a dataset but, for unknown reasons, some of the sequences are not considered.

My code is as follows:

%%capture
import sys
import pandas as pd
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

a_target="MA0"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
DTCRU.Cluster(clustering_method='phenograph', sample=500)
DFs = DTCRU.Cluster_DFs

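# Collect the per-cluster DataFrames, tagging each with its cluster index
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = []
for i, tdf in enumerate(DFs):
    tdf["cluster_index"] = i
    frames.append(tdf)
r_df = pd.concat(frames)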

fn="result_clustering_MA0_clean.txt" 
r_df.to_csv(fn)

In the directory "data_deep_tcr/MA0" I have two subfolders. Each of these subfolders contains one TSV file with the following format:

aminoAcid	counts	v_beta	j_beta
CASTHLDPPGEQYFG	571795	hTRBV28	hTRBJ02-7
CASSPLGASGEQFFG	317906	hTRBV28	hTRBJ02-1
CASGGGEQFFG	104692	hTRBV12-3	hTRBJ02-1
CANEGASENTEAFFG	86447	hTRBV06-1	hTRBJ01-1
CASSFFPFNEQFFG	74908	hTRBV12-3	hTRBJ02-1

For example, I pass this sequence (with the v_beta and j_beta and the counts)
CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1
but it is not clustered.

Whereas, the same sequence with a different count, v_beta and j_beta is clustered:
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-17

Any idea why this is happening?
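
One thing worth checking, offered as a hypothesis rather than a diagnosis: the Get_Data call above passes aggregate_by_aa=True, and if that option collapses rows sharing the same amino-acid sequence into one entry (an assumption about DeepTCR's behaviour), a CDR3 that appears twice with different V/J genes would surface only once. The sketch below lists such repeated sequences in a table shaped like the excerpt; find_repeated_cdr3 is a hypothetical helper:

```python
import csv
import io
from collections import defaultdict

def find_repeated_cdr3(tsv_text):
    """Return {aminoAcid: [(v_beta, j_beta, counts), ...]} for sequences seen more than once."""
    rows_by_seq = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        rows_by_seq[row["aminoAcid"]].append(
            (row["v_beta"], row["j_beta"], int(row["counts"]))
        )
    return {seq: rows for seq, rows in rows_by_seq.items() if len(rows) > 1}

# Toy input built from rows quoted in this report.
example = (
    "aminoAcid\tcounts\tv_beta\tj_beta\n"
    "CASGGGEQFFG\t104692\thTRBV12-3\thTRBJ02-1\n"
    "CANEGASENTEAFFG\t86447\thTRBV06-1\thTRBJ01-1\n"
    "CANEGASENTEAFFG\t73703\thTRBV06-8\thTRBJ01-1\n"
)
repeated = find_repeated_cdr3(example)
print(repeated)
```

If the missing rows all show up in this listing, rerunning with aggregate_by_aa=False would be a quick way to test the hypothesis.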

Optimization of the threshold parameter in hierarchical clustering

Hello @sidhomj,

I used the unsupervised part of DeepTCR to cluster TCR sequences, but when I let the method determine the optimal threshold parameter with the following command, I got this error:

DTCRU_test.Cluster(clustering_method="hierarchical", linkage_method="ward", criterion="distance", write_to_sheets=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/DeepTCR.py", line 1054, in Cluster
    IDX = hierarchical_optimization(distances, features, method=linkage_method, criterion=criterion)
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/functions/utils_u.py", line 52, in hierarchical_optimization
    sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 118, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 229, in silhouette_samples
    check_number_of_labels(len(le.classes_), n_samples)
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 35, in check_number_of_labels
    % n_labels
ValueError: Number of labels is 2876. Valid values are 2 to n_samples - 1 (inclusive)

To correct this, I tried modifying the hierarchical_optimization function in the utils_u.py script in the DeepTCR/functions folder (l.44):

def hierarchical_optimization(distances,features,method,criterion):
    Z = linkage(squareform(distances), method=method)
    t_list = np.arange(1, 100, 1) #t_list = np.arange(0, 100, 1)
    sil = []
    for t in t_list:
        IDX = fcluster(Z, t, criterion=criterion)
        if len(np.unique(IDX[IDX >= 0])) == 1:
            sil.append(0.0)
            continue
        sel = IDX >= 0
        sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))

    IDX = fcluster(Z, t_list[np.argmax(sil)], criterion=criterion)
    return IDX

and it works!
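
A note on why starting the threshold at 1 likely helps: with criterion="distance", a threshold of 0 can leave every sequence in its own cluster, so the number of labels equals the number of samples. That is outside the range scikit-learn accepts for silhouette_score (2 to n_samples - 1, as the error message says). A minimal sketch of an explicit guard is below. The helper names are hypothetical and not part of DeepTCR, and the scoring function is passed in so the check can be tried without any other library.

```python
def silhouette_is_defined(labels):
    """Return True when the label count is in the range 2 .. n_samples - 1."""
    n_samples = len(labels)
    n_labels = len(set(labels))
    return 2 <= n_labels <= n_samples - 1


def score_or_zero(labels, score_fn):
    """Call score_fn(labels) only for a valid label count, otherwise return 0.0."""
    if not silhouette_is_defined(labels):
        return 0.0
    return score_fn(labels)


# Every point in its own cluster (what a distance threshold of 0 can produce).
singletons = [1, 2, 3, 4]
# A usable partition into two clusters.
two_clusters = [1, 1, 2, 2]
# Everything merged into a single cluster.
one_cluster = [1, 1, 1, 1]

for name, labels in [("singletons", singletons),
                     ("two_clusters", two_clusters),
                     ("one_cluster", one_cluster)]:
    print(name, silhouette_is_defined(labels))
```

A check like this covers both ends of the threshold range, so the loop would not depend on where t_list starts.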

'LabelEncoder' object has no attribute 'classes_' when loading own data

Hey,

I've managed to load my own data, but I'm having trouble training the VAE (which works on the training dataset).

Code so far:

from DeepTCR.DeepTCR import DeepTCR_U
import pandas as pd
import numpy as np
DeepTCR_input = pd.read_csv('/Users/gordonbeattie/Documents/Projects/Maria_2/TCR/DeepTCR/DeepTCR_input.tsv', sep= '\t')

alpha = np.genfromtxt(DeepTCR_input.TRA, dtype='str')
beta = np.genfromtxt(DeepTCR_input.TRB, dtype='str')
sample = np.genfromtxt(DeepTCR_input.sample_ID, dtype='str')

DTCRU.Load_Data(alpha_sequences=alpha,beta_sequences=beta, sample_labels=sample)

DTCRU.Train_VAE(Load_Prev_Data=False)

Which throws the following error:
AttributeError: 'LabelEncoder' object has no attribute 'classes_'

Thanks in advance for any assistance!

Understanding Training Strategy of Supervised TCR repertoire classification on HIV dataset

Hi, sorry to disturb you.

I am trying to understand the training strategy for the HIV dataset and replicate the results from your publication.

It seems that the dataset can be divided into non-cognate groups (the CEF, AY9, and No Peptide conditions) and cognate groups (where there is an epitope). There are 3 * 3 non-cognate samples and 25 * 3 cognate samples. I saw in the paper that DeepTCR can distinguish non-cognate samples from cognate samples, and that two out of every three samples were kept as training data.

My question is, when doing the training, did you:

1. Fit the model using all (3+25) * 2 samples at once, where 3 * 2 are non-cognate and 25 * 2 are cognate? Then you test the model on the remaining 3+25 samples and see whether it correctly predicts whether each sample is cognate or non-cognate.
2. Or use (3+1) * 2 samples, where the 3 * 2 samples are non-cognate and the 1 * 2 samples come from one specific epitope, instead of using all 25 * 2 samples as the cognate group? Then you test the model on the remaining 3+1 samples to see whether it correctly predicts which (one) sample is the cognate one.
3. Then you repeat 2 for each specific epitope (MSPRTLNAW, NTQGYFPDW, etc.).
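
To make the sample counts in the two options concrete, here is a small sketch that enumerates them. It assumes three replicates per condition with one replicate held out for testing, as described above. The epitope names are placeholders rather than the real peptide list, and the split function is only an illustration, not the procedure used in the paper.

```python
NON_COGNATE = ["CEF", "AY9", "No Peptide"]            # conditions named in the question
COGNATE = [f"epitope_{i:02d}" for i in range(1, 26)]  # 25 placeholder epitope names
REPLICATES = [1, 2, 3]


def split(conditions, held_out):
    """Hold out one replicate of every condition for testing."""
    train = [(c, r) for c in conditions for r in REPLICATES if r != held_out]
    test = [(c, r) for c in conditions for r in REPLICATES if r == held_out]
    return train, test


# Option 1: all conditions at once.
train_all, test_all = split(NON_COGNATE + COGNATE, held_out=3)
print(len(train_all), len(test_all))   # 56 28

# Option 2: the three non-cognate conditions plus a single epitope.
train_one, test_one = split(NON_COGNATE + COGNATE[:1], held_out=3)
print(len(train_one), len(test_one))   # 8 4
```

Option 3 would correspond to calling the second split once per epitope.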

Thanks and looking forward to your reply!

Availability of trained models

Is any of the trained VAE models available publicly, ideally together with an evaluation script?

We are working on a related topic and would like to perform an apples-to-apples comparison with your VAE approach.

sequence_inference error

Hi:

I am trying to do sequence_inference based on a trained model, but the following error occurs.

I am not sure how I should change my code to make it work. May I ask for your suggestions? Thanks!

tensorflow/core/common_runtime/colocation_graph.cc:1218] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ResourceApplyAdam: CPU
ReadVariableOp: CPU
AssignVariableOp: CPU
VarIsInitializedOp: CPU
Const: CPU
VarHandleOp: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
dense_1/bias/Initializer/zeros (Const) /device:GPU:0
dense_1/bias (VarHandleOp) /device:GPU:0
dense_1/bias/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/BiasAdd/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/bias/Adam/Initializer/zeros (Const) /device:GPU:0
dense_1/bias/Adam (VarHandleOp) /device:GPU:0
dense_1/bias/Adam/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Adam/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Adam/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/bias/Adam_1/Initializer/zeros (Const) /device:GPU:0
dense_1/bias/Adam_1 (VarHandleOp) /device:GPU:0
dense_1/bias/Adam_1/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Adam_1/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Adam_1/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
Adam/update_dense_1/bias/ResourceApplyAdam (ResourceApplyAdam) /device:GPU:0
save/AssignVariableOp_35 (AssignVariableOp) /device:GPU:0
save/AssignVariableOp_36 (AssignVariableOp) /device:GPU:0
save/AssignVariableOp_37 (AssignVariableOp) /device:GPU:0

Extract accuracy of decoded TCRs from Sequence_inference

I'd like to get the encoding accuracy of TCRs encoded with a VAE model trained on an initial dataset. That is, how could I extract the encoding accuracy as well as the latent encoded variables from DTCRU.Sequence_Inference()?
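
For reference, one common way to define reconstruction accuracy for a sequence autoencoder is the fraction of positions where the decoded residue matches the input. The sketch below assumes that definition, which may differ from what DeepTCR computes internally. The function name and the example sequences are made up for illustration.

```python
def reconstruction_accuracy(original, decoded):
    """Fraction of aligned positions where the decoded residue matches the input."""
    if len(original) != len(decoded):
        raise ValueError("sequences must have the same length")
    if not original:
        return 0.0
    matches = sum(a == b for a, b in zip(original, decoded))
    return matches / len(original)


# Hypothetical (input, decoded) pairs; the second differs at the last position.
pairs = [("CASSLGQ", "CASSLGQ"), ("CASSLGQ", "CASSLGE")]
per_sequence = [reconstruction_accuracy(o, d) for o, d in pairs]
mean_accuracy = sum(per_sequence) / len(per_sequence)

print(per_sequence)
print(mean_accuracy)
```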

Inquiry about Fig2.c Motifs Visualization in DeepTCR

Hello Dr. Sidhom,

I recently came across your “DeepTCR” paper and I found the idea of combining supervised and unsupervised learning and applying them to modeling TCR repertoires intriguing. Also, the performance in your paper is very impressive! I have some questions regarding Figure 2.c (titled representative TCRs and learned TCR motifs) that I hope you can help with.

The length of the learned TCR motifs in DeepTCR is 5 or 4. Could you provide some justification for that?
From the tutorial code, I noticed that around 30 different motifs are learned for each representative TCR. However, the two motifs selected in Fig. 2c do not seem to be the top two learned motifs. What principles did you follow to select those motifs?

Thank you in advance for your help!

The dataset used in the regression model

Hello,

I checked the dataset used in the regression model. It seems that simply dropping duplicate TCRs does not reproduce the dataset used in the regression model. Could you tell me where I can find the preprocessing details needed to obtain that dataset?

Thanks!

Clustering by phenograph error

Hi, I was trying to run the clustering tutorial and got an error after running the commands below:

DTCRU = DeepTCR_U('Tutorial')
#Load Data from directories
DTCRU.Get_Data(directory='github/DeepTCR/Data/Murine_Antigens',Load_Prev_Data=False,aggregate_by_aa=True,aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)
DTCRU.Train_VAE(Load_Prev_Data=False, suppress_output=False)
features = DTCRU.features
DTCRU.Cluster(clustering_method='phenograph')

OUTPUT

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 3.018752098083496 seconds
Jaccard graph constructed in 1.022472858428955 seconds
Wrote graph to binary file in 0.31229281425476074 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.883528
After 2 runs, maximum modularity is Q = 0.88486
Louvain completed 22 runs in 1.1757018566131592 seconds
PhenoGraph complete in 5.551486968994141 seconds

TypeError Traceback (most recent call last)
in
1 # cluster using phenograph
----> 2 DTCRU.Cluster(clustering_method='phenograph')

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in Cluster(self, set, clustering_method, t, criterion, linkage_method, write_to_sheets, sample, n_jobs)
1051 df['D_beta'] = d_beta[sel]
1052 df['J_beta'] = j_beta[sel]
-> 1053 df['HLA'] = list(map(list,hla_data_seq[sel].tolist()))
1054
1055 df_sum = df.groupby(by='Sample', sort=False).agg({'Frequency': 'sum'})

TypeError: 'float' object is not iterable

OUTPUT

I am running on macOS with Python 3.7.
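
For anyone reading the traceback: the failing line maps list over each entry of hla_data_seq, and that raises exactly this TypeError whenever an entry is a float. A plausible cause, which is an assumption and not confirmed here, is that the entries are NaN placeholders when no HLA data was loaded. The sketch below reproduces the error in plain Python and shows one defensive conversion. The helper name is hypothetical and not part of DeepTCR.

```python
import math


def rows_to_lists(rows):
    """Convert each row to a list, treating float placeholders (e.g. NaN) as empty."""
    out = []
    for row in rows:
        if isinstance(row, float):
            out.append([])
        else:
            out.append(list(row))
    return out


no_hla = [math.nan, math.nan]  # assumed shape of the data when no HLA is supplied

try:
    list(map(list, no_hla))    # same pattern as the failing line in the traceback
except TypeError as err:
    message = str(err)
    print(message)

print(rows_to_lists(no_hla))
print(rows_to_lists([["A*02:01"], math.nan]))
```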

Tutorial: Clustering TCR Sequences list index out of Range Error when using DeepTCR in WSL

I tried to run the Jupyter notebook tutorial as is after installing the DeepTCR development version into its own environment like this: pip3 install git+https://github.com/sidhomj/DeepTCR.git.

Running the second cell of the tutorial (Phenograph clustering) on the loaded data, I get a list index out of range error:

Command:

DTCRU.Cluster(clustering_method='phenograph')

Output:

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 3.4997947216033936 seconds
Jaccard graph constructed in 1.022956371307373 seconds
Wrote graph to binary file in 0.5937418937683105 seconds
Running Louvain modularity optimization

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-2-1c6d55ce009b> in <module>
----> 1 DTCRU.Cluster(clustering_method='phenograph')

~/.local/lib/python3.8/site-packages/DeepTCR/DeepTCR.py in Cluster(self, set, clustering_method, t, criterion, linkage_method, write_to_sheets, sample, n_jobs, order_by_linkage)
   1044 
   1045             elif clustering_method == 'phenograph':
-> 1046                 IDX, _, _ = phenograph.cluster(features, k=30, n_jobs=n_jobs)
   1047 
   1048             elif clustering_method == 'kmeans':

~/.local/lib/python3.8/site-packages/DeepTCR/phenograph/cluster.py in cluster(data, k, directed, prune, min_cluster_size, jaccard, primary_metric, n_jobs, q_tol, louvain_time_limit, nn_method)
    114     uid = uuid.uuid1().hex
    115     graph2binary(uid, graph)
--> 116     communities, Q = runlouvain(uid, tol=q_tol, time_limit=louvain_time_limit)
    117     print("PhenoGraph complete in {} seconds".format(time.time() - tic), flush=True)
    118     communities = sort_by_size(communities, min_cluster_size)

~/.local/lib/python3.8/site-packages/DeepTCR/phenograph/core.py in runlouvain(filename, max_runs, time_limit, tol)
    261 
    262         # continue only if we've reached a higher modularity than before
--> 263         if q[-1] - Q > tol:
    264 
    265             Q = q[-1]

IndexError: list index out of range

What am I doing wrong?
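
Reading the traceback, the IndexError comes from q[-1] being evaluated on an empty list, meaning no modularity values were collected before that line ran. One possible reason, stated as a guess, is that the Louvain step produced no parsable output in this environment. The sketch below shows the failure and a guard that would turn it into a clearer message. The function name is hypothetical and this is not the PhenoGraph source.

```python
def latest_modularity(q_values):
    """Return the most recent modularity value, or fail with a readable message."""
    if not q_values:
        raise RuntimeError(
            "no modularity values were collected; "
            "check that the Louvain step actually ran and produced output"
        )
    return q_values[-1]


# Values like those in the log output of a successful run.
print(latest_modularity([0.883528, 0.88486]))

# An empty list is what triggers 'list index out of range' with plain q[-1].
try:
    latest_modularity([])
except RuntimeError as err:
    guard_message = str(err)
    print(guard_message)
```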

KNN_Repertoire_Classifier Error

Input data structure: 6 labels, each with 4 files. I tried folds=4, folds=5, and folds=10; all return the same error.

command is

DTCRU.KNN_Repertoire_Classifier(folds=10,
                                Load_Prev_Data=False,
                                metrics=['AUC', 'F1', 'Recall', 'Precision'],
                                plot_metrics=True, plot_type='box',
                                by_class=False,
                                n_jobs=40)

error msg start -----------------------------

File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/DeepTCR/DeepTCR.py", line 2290, in KNN_Repertoire_Classifier
sns.catplot(data=df_out, x='Metric', y='Value', kind=plot_type)
File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/seaborn/categorical.py", line 3724, in catplot
p.establish_colors(color, palette, 1)
File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/seaborn/categorical.py", line 315, in establish_colors
lum = min(light_vals) * .6
ValueError: min() arg is an empty sequence

error msg end -----------------------------

Clustering error

Getting an error when using our data, after loading with:

DTCRU.Load_Data(beta_sequences=beta,v_beta=v_beta,j_beta=j_beta,class_labels=class_labels,
                sample_labels=sample_labels, counts=counts)

Training appears to have gone OK:

DTCRU.Train_VAE(Load_Prev_Data=False,accuracy_min=0.85)

But clustering appears to fail:

The same error occurs when using the phenograph method, so it isn't specific to the clustering approach. It also happens when randomly sampling:

Is it possible that the clustering methods produce some outliers, causing "sel" to not be an integer? Or perhaps there is some metadata I need to set?

Other functions appear to work okay:

I see #1, which has a similar error, but my data exists as a single CSV file that I'm loading via pandas and slicing the necessary columns out of. As such, loading via directory doesn't appear to be an option.

Any ideas?

load previous-trained model

Hi Dr. John William Sidhom:

Sorry to disturb you, but I have a question about DeepTCR. This is an amazing package, and I would like to use it for sequence-level prediction. Say I train a model named 'model' and store the intermediate files at /user. The program will then generate a subfolder 'model' and store the model checkpoint information there.

My question is how to load this pre-trained model and make predictions on new data. I saw on the website that 'sequence_inference' can 'load previous trained model', but I did not see an example in the tutorials. It would be extremely helpful if you could briefly tell me how to load a model from a previously generated folder and use it for prediction instead of training the model again.

Thanks in advance for your patience and have a great day!

Download with docker

Hi, Professor John-William Sidhom!
I encountered some problems when I tried to download DeepTCR.
I have tried to use miniconda or virtualenv to create a virtual environment, but the system always reports errors.
I can install Python 3.9, but I cannot get the corresponding pip version.
This has troubled me for a long time.
I hope this excellent software will have a Docker image someday.
Best wishes to you!

example input hla files

Would it be possible to provide some example files for the HLA type? Thanks a lot!

Issue loading single cell data

Hi, thank you for the great tool!

I am experiencing some issues loading paired single cell data.
I have CSVs for each of my samples with a barcode column and the CDR3, V, and J genes for each chain (D coverage was low, so I removed that column). There are no NA or empty values that I can see, so I'm not sure why it is throwing an empty array error.

Any help would be appreciated!

DTCR_WF.Get_Data(directory='pln/',Load_Prev_Data=False,aggregate_by_aa=True,
... aa_column_beta=7,v_beta_column=5,j_beta_column=6,
... aa_column_alpha=4, v_alpha_column=2,j_alpha_column=3,count_column=8)
Loading Data...
Traceback (most recent call last):
File "", line 3, in
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/DeepTCR/DeepTCR.py", line 336, in Get_Data
Y = OH.fit_transform(Y.reshape(-1,1))
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 488, in fit_transform
return super().fit_transform(X, y)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/base.py", line 847, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 461, in fit
self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 78, in _fit
X, force_all_finite=force_all_finite
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 44, in _check_X
X_temp = check_array(X, dtype=None, force_all_finite=force_all_finite)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 800, in check_array
% (n_samples, array.shape, ensure_min_samples, context)
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.


ValueError: need at least one array to concatenate
End of error msg ------------------------------------
The directory structure is as following: