sidhomj / deeptcr
Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
Home Page: https://sidhomj.github.io/DeepTCR/
License: MIT License
Hi, thank you for creating this great tool!
I was wondering if you could offer some guidance on handling large datasets in the unsupervised workflow. In particular, the clustering and KNN classification steps seem to be prohibitively memory-expensive.
I think that downsampling is interfering with the classification accuracy so I would like to use all the data if possible.
Thanks so much for your help!
Leeana
Hi.
Is the dropout probability in Convolutional_Features() 0.0 when training the unsupervised model? If not, where is this probability defined?
And did I understand correctly that there is no pooling between layers?
Hi again,
How does DeepTCR deal with columns that have some missing values? For example, if there are some TCRBs that are missing the D gene.
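One way to make missing gene calls explicit before loading is to map empty or placeholder entries to a single sentinel label. The sketch below is a preprocessing step only and does not describe how DeepTCR treats missing values internally; the helper name normalize_gene_calls, the sentinel "unknown", and the set of tokens treated as missing are all assumptions made for illustration.

```python
# Hypothetical preprocessing helper; not part of DeepTCR.
MISSING_TOKENS = {"", "none", "nan", "na", "null"}

def normalize_gene_calls(calls, sentinel="unknown"):
    """Return a list where missing gene calls are replaced by `sentinel`."""
    cleaned = []
    for call in calls:
        # None and whitespace-only entries are treated like an empty string.
        text = "" if call is None else str(call).strip()
        cleaned.append(sentinel if text.lower() in MISSING_TOKENS else text)
    return cleaned

d_beta = ["TRBD1", "None", None, "TRBD2", ""]
print(normalize_gene_calls(d_beta))
```

Using one consistent label keeps the column free of mixed types (strings, None, NaN), which is a frequent source of errors in downstream encoders.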
Hello,
I would like to train a supervised model with known antigen specificity, then use that model to classify new TCR sequences as potentially targeting certain antigens. I have followed along with the tutorials, but am still unclear on the best way to do this. I believe the closest is the "8 - VAE Inference.ipynb" tutorial, but using a supervised model rather than the unsupervised one. However, I am unclear on how to interpret the output from Sequence_Inference. I am using the example data Mouse Antigens for the model and Rudqvist for the new dataset. The resulting "features" object is 23856x9, which I believe corresponds to the individual TCR sequences (23856) and 9 different antigens, with the entries being scores for how well the TCR sequence fits that antigen.
I tried to assess this myself by looking at the features of the supervised model, however this object has 224 columns. I was expecting this to have 9 corresponding with the different antigens.
What do the columns of the features object from the supervised model correspond to?
Would you suggest this method of classification, or something more akin to this tutorial "3 - Supervised Sequence Regression.ipynb"?
Thank you for your help!
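If the 23856x9 matrix does hold one score per antigen class, as the question supposes, each sequence can be assigned to its highest-scoring column. This is a minimal sketch under that assumption; the helper top_class_per_row and the class names are hypothetical, and the column order must be taken from the trained model rather than guessed.

```python
def top_class_per_row(scores, class_names):
    """Return (label, score) for the highest-scoring column of each row."""
    results = []
    for row in scores:
        if len(row) != len(class_names):
            raise ValueError("each row needs one score per class")
        best = max(range(len(row)), key=lambda i: row[i])
        results.append((class_names[best], row[best]))
    return results

# Toy example with 3 hypothetical classes instead of 9.
class_names = ["antigen_A", "antigen_B", "antigen_C"]
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(top_class_per_row(scores, class_names))
```

A highest score is not necessarily a confident call, so keeping the score next to the label makes it easy to apply a threshold afterwards.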
Hi sidhomj,
Very nice tool. I have a question: can it be used with the SMART-Seq v4 PLUS Kit or the SMARTer Human TCR a/b Profiling Kit?
Simone
Hi,
Is there a way to make the training and clustering reproducible? Setting graph_seed and split_seed in Train_VAE does not seem to do the trick.
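Independent of graph_seed and split_seed, other random number generators in the process can also introduce run-to-run variation. The sketch below seeds the generators that are commonly involved; the helper seed_everything is hypothetical, and it is an assumption that these generators matter for this workflow. Some GPU operations can remain nondeterministic even with fixed seeds, so this does not guarantee identical results.

```python
import os
import random

def seed_everything(seed):
    """Seed Python's RNG, and NumPy / TensorFlow if they are installed."""
    # Only affects hash randomization in subprocesses started after this call.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.compat.v1.set_random_seed(seed)
    except (ImportError, AttributeError):
        pass

seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)
```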
Hello Mr. Sidhom,
thank you for creating DeepTCR! I am using the supervised sequence classifier, including HLA supertypes, for samples from different patients and treatments. My question is: is it possible to extract shared motifs that are common to each cluster?
As far as I understand, DTCR_SS.Representative_Sequences() and DTCR_SS.Motif_Identification() explicitly extract the sample-specific motifs. Is that correct?
Thank you in advance for your help!
I am running through the second supervised learning example using the Rudqvist data and keep getting an error during training.
Input:
DTCR_WF = DeepTCR_WF('Tutorial')
#Load Data from directories
DTCR_WF.Get_Data(directory='github/DeepTCR/Data/Rudqvist',
Load_Prev_Data=False,
aggregate_by_aa=True, aa_column_beta=1,count_column=2,v_beta_column=7,d_beta_column=14,j_beta_column=21)
DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
DTCR_WF.Train()
AttributeError Traceback (most recent call last)
in
1 DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
----> 2 DTCR_WF.Train()
~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
3223 GO.saver.save(sess, os.path.join(self.Name, 'model', 'model.ckpt'))
3224
-> 3225 self.HLA_embed = GO.embedding_layer_hla.eval()
3226
3227 with open(os.path.join(self.Name, 'model', 'model_type.pkl'), 'wb') as f:
AttributeError: 'graph_object' object has no attribute 'embedding_layer_hla'
System: MacOS 10.13.6
I am running the unsupervised tutorial by
DTCRU = DeepTCR_U('Tutorial')
#Load Data from directories
DTCRU.Get_Data(directory='github/DeepTCR/Data/Rudqvist',Load_Prev_Data=False,aggregate_by_aa=True,
aa_column_beta=1,count_column=2,v_beta_column=7,d_beta_column=14,j_beta_column=21)
#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False,accuracy_min=0.9)
Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 50.87618613243103 seconds
Jaccard graph constructed in 12.015304803848267 seconds
Wrote graph to binary file in 3.909883975982666 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.991586
Louvain completed 21 runs in 9.033319234848022 seconds
PhenoGraph complete in 75.99179887771606 seconds
Clustering Done
Hi,
This package has been useful, but I have a suggestion for UMAP_Plot(). It would be nice to be able to place the legend to the right of the plot, or really anywhere that isn't on the plot itself: when the number of labels is large, the legend tends to block a substantial portion of the graph.
Hi,
Is it possible to have a Load_Previous_Data for the KNN_Sequence_Classifier function? It takes too much time to run.
I am currently using version 1.2.21
Thanks!
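Until such an option exists, a result can be cached on the user's side. The sketch below wraps any slow call and pickles its return value; cached_call and its load_prev flag are hypothetical names (loosely mirroring Load_Prev_Data), and slow_square merely stands in for the expensive classifier call.

```python
import os
import pickle
import tempfile

def cached_call(cache_path, func, *args, load_prev=True, **kwargs):
    """Run func(*args, **kwargs) once and reuse the pickled result afterwards."""
    if load_prev and os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = func(*args, **kwargs)
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def slow_square(x):
    """Stand-in for an expensive computation; records each real execution."""
    calls.append(x)
    return x * x

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "knn_result.pkl")
    a = cached_call(path, slow_square, 4)   # computed and written to disk
    b = cached_call(path, slow_square, 4)   # loaded from disk
print(a, b, len(calls))
```

Note that the cache is keyed only by the file path, so the file must be deleted (or load_prev set to False) whenever the inputs or parameters change.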
I am running the tutorial as is. When I am training the model on the data, there seems to be a tensor shape mismatch.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-685fbc9e79fc> in <module>()
----> 1 DTCR_WF.Train()
/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/DeepTCR.py in Train(self, kernel, num_concepts, trainable_embedding, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla, num_fc_layers, units_fc, weight_by_class, class_weights, use_only_seq, use_only_gene, use_only_hla, size_of_net, graph_seed, qualitative_agg, quantitative_agg, num_agg_layers, units_agg, drop_out_rate, multisample_dropout, multisample_dropout_rate, multisample_dropout_num_masks, batch_size, batch_size_update, epochs_min, stop_criterion, stop_criterion_window, accuracy_min, train_loss_min, hinge_loss_t, convergence, learning_rate, suppress_output, loss_criteria, batch_seed)
5026 accuracy_min,train_loss_min,hinge_loss_t,convergence,learning_rate, suppress_output,
5027 loss_criteria)
-> 5028 self._train(write=True,batch_seed=batch_seed,iteration=0)
5029
5030 def Monte_Carlo_CrossVal(self,folds=5,test_size=0.25,LOO=None,combine_train_valid=False,random_perm=False,seeds=None,
/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/DeepTCR.py in _train(self, write, batch_seed, iteration)
4747 train_loss, train_accuracy, train_predicted,train_auc = \
4748 Run_Graph_WF(self.train,sess,self,GO,batch_size,batch_size_update,random=True,train=True,
-> 4749 drop_out_rate=drop_out_rate,multisample_dropout_rate=multisample_dropout_rate)
4750
4751 train_accuracy_total.append(train_accuracy)
/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, batch_size_update, random, train, drop_out_rate, multisample_dropout_rate)
719 elif train:
720 loss_i, accuracy_i, _, predicted_i = sess.run([GO.loss, GO.accuracy, GO.opt, GO.predicted],
--> 721 feed_dict=feed_dict)
722 else:
723 loss_i, accuracy_i, predicted_i = sess.run([GO.loss, GO.accuracy, GO.predicted],
/home/ubuntu/anaconda3/envs/deeptcr/lib/python3.6/site-packages/tensorflow_gpu-1.15.2-py3.6-linux-x86_64.egg/tensorflow_core/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
954 try:
955 result = self._run(None, fetches, feed_dict, options_ptr,
--> 956 run_metadata_ptr)
957 if run_metadata:
958 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/home/ubuntu/anaconda3/envs/deeptcr/lib/python3.6/site-packages/tensorflow_gpu-1.15.2-py3.6-linux-x86_64.egg/tensorflow_core/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1154 'Cannot feed value of shape %r for Tensor %r, '
1155 'which has shape %r' %
-> 1156 (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
1157 if not self.graph.is_feedable(subfeed_t):
1158 raise ValueError('Tensor %s may not be fed.' % subfeed_t)
ValueError: Cannot feed value of shape (16, 1) for Tensor 'Placeholder_2:0', which has shape '(?, 4)'
I am running it on centos with NVIDIA GPU. All the other tutorials seem to be working well.
TRB.txt
I have TCRseq Data which was annotated by IGB and preprocessed for DeepTCR as indicated in the tutorial.
I have 9 Samples with many TCRs, here is an excerpt of the Data for one Sample:
cdr3_aa v_call d_call j_call Count
ASSARQDLQQY TRBV2*01 TRBD1*01 TRBJ2-7*01 39890
ASKDRALLRAV TRBV21-1*01 TRBD1*01 TRBJ2-7*01 32323
ASSFSATNTGELF TRBV5-1*01 TRBD2*01 TRBJ2-2*01 26637
ASSPGEQNTGELF TRBV7-8*01 TRBD2*01 TRBJ2-2*01 26258
ASSGAGTGGYNEQF TRBV12-3*01 TRBD1*01 TRBJ2-1*01 16692
ASSFSGHTGELF TRBV7-2*01 TRBD2*01 TRBJ2-2*01 13838
ASSVETGTEKY TRBV7-9*01 TRBD1*01 TRBJ2-3*01 13831
PPVIWTATSST TRBV24-1*01 TRBD1*01 TRBJ2-7*01 13819
ASSSGLAGAYEQY TRBV7-2*02 TRBD2*01 TRBJ2-7*01 13216
ASSFGVSGANVLT TRBV7-9*03 TRBD2*01 TRBJ2-6*01 11449
ASSGLAGGPGTGELF TRBV9*01 TRBD2*02 TRBJ2-2*01 11292
ASSPLAGGVAQF TRBV7-6*01 TRBD2*02 TRBJ2-1*01 11019
ASSSTGQGNSYEQY TRBV28*01 TRBD1*01 TRBJ2-7*01 10466
If I run the tutorial for supervised sequence classification using the example data from the repository, loading data, clustering, etc. work perfectly, except for DTCR_SS.Train(), which throws:
AttributeError: 'DeepTCR_SS' object has no attribute 'test_pred'
DTCR_SS.Monte_Carlo_CrossVal, DTCR_SS.K_Fold_CrossVal, etc. work.
If I then replace the folders in Data/Murine_Antigens with my samples, DTCR_SS.Get_Data(), which usually takes just a moment to load the data, gets stuck (I stopped it after 40 min).
Restricting the input to TCRs with >= 1000 reads, which results in tables of 50-80 rows, does not resolve the issue.
import sys
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_SS
# Instantiate training object
DTCR_SS = DeepTCR_SS('Tutorial')
#Load Data from directories
DTCR_SS.Get_Data(directory='../../Data/TRB',Load_Prev_Data=False,aggregate_by_aa=True,
aa_column_beta=0,count_column=4,v_beta_column=1,j_beta_column=3)
Output:
Loading Data ...
Is there anything that could cause this kind of bug?
Attached you will find the data for one sample, restricted to TCR sequences with > 1000 reads (a .tsv file saved as .txt).
Thank you in advance for your help!
Hello!
I would like to use the latent feature matrix produced by VAE training for my own downstream analysis instead of the DeepTCR pipeline, but I don't know how to export it (as a .tsv file, for example). Could you give me some advice?
Thank you for your help!
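A minimal sketch of such an export, assuming the latent matrix is available as a 2-D array after training (another report on this page reads it as DTCRU.features after Train_VAE). The helper write_feature_matrix, the file name, and the toy values are hypothetical; any iterable of numeric rows works, including a NumPy array.

```python
import csv
import os
import tempfile

def write_feature_matrix(rows, path, row_labels=None):
    """Write a 2-D matrix (any iterable of rows) to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for index, row in enumerate(rows):
            values = [float(v) for v in row]
            if row_labels is not None:
                # Put the label (e.g. the CDR3 sequence) in the first column.
                values = [row_labels[index]] + values
            writer.writerow(values)

# Toy stand-in for the latent matrix and its matching sequences.
features = [[0.5, -1.25], [2.0, 0.0]]
sequences = ["CASSIRDTETLYF", "CASGEGQTNSDYTF"]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "latent_features.tsv")
    write_feature_matrix(features, out_path, row_labels=sequences)
    with open(out_path) as handle:
        lines = handle.read().splitlines()
print(lines)
```

Writing the sequence alongside each row keeps the exported file self-describing, provided the labels are in the same order as the rows of the matrix.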
Thanks for writing this package. This package is very useful.
requirements.txt pins scikit-learn to 0.19.1, but OneHotEncoder in that version has no "categories" parameter, which is used here:
Line 324 in eb58a95
That parameter wasn't introduced until version 0.20.
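The version requirement described above can be expressed as a small check. This is a sketch only; parse_version and supports_categories are hypothetical helpers, and the threshold (0, 20) comes from the report itself.

```python
def parse_version(text):
    """Turn '0.19.1' into (0, 19, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def supports_categories(sklearn_version):
    """True if this scikit-learn version has OneHotEncoder(categories=...)."""
    return parse_version(sklearn_version) >= (0, 20)

print(supports_categories("0.19.1"))
print(supports_categories("0.20.0"))
```

In practice the installed version string can be read from sklearn.__version__ and passed to supports_categories.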
Could we add n_jobs to the KNN_Sequence_Classifier by adding n_jobs to KNeighborsClassifier from sklearn module? Currently the KNN_Sequence_Classifier is very slow.
Thanks!
I am running a test using my own data.
After loading the data successfully, I got an error during training:
#Load Data from directories
DTCR_WF.Get_Data(directory='data_test/',
Load_Prev_Data=False,
aggregate_by_aa=True,
aa_column_beta=1,v_beta_column=3,d_beta_column=4,j_beta_column=5,
count_column=6,n_jobs = 2, sep=",")
DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
DTCR_WF.Train()
error msg start ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in
1 DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
----> 2 DTCR_WF.Train()
3
4 # DTCR_WF.Monte_Carlo_CrossVal(folds=5,test_size=0.3,stop_criterion=0.25,epochs_min=100,
5 # suppress_output = False)
~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
3148
3149 valid_loss, valid_accuracy, valid_predicted, valid_auc =
-> 3150 Run_Graph_WF(self.valid, sess, self, GO, batch_size, random=False, train=False)
3151
3152
~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, random, train, drop_out_rate)
390 loss = np.mean(loss)
391 accuracy = np.mean(accuracy)
--> 392 predicted_out = np.vstack(predicted_list)
393 try:
394 auc = roc_auc_score(set[-1], predicted_out)
~/anaconda3/envs/dl/lib/python3.7/site-packages/numpy/core/shape_base.py in vstack(tup)
281 """
282 _warn_for_nonsequence(tup)
--> 283 return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
284
285
data_test/
├── A
│ ├── A_1.csv
│ └── A_2.csv
├── B
│ ├── B_1.csv
│ └── B_2.csv
├── C
│ ├── C_1.csv
│ └── C_2.csv
└── D
├── D_1.csv
└── D_2.csv
In each csv file, there is beta chain information
AAACCTGCAGGCTGAA-1,CASSIRDTETLYF,498,TRBV16,TRBD1,TRBJ2-3,1
AAACGGGAGGGTGTGT-1,CASGEGQTNSDYTF,568,TRBV13-2,TRBD1,TRBJ1-2,5
AAACGGGGTCTTCAAG-1,CASSGQNQDTQYF,503,TRBV15,TRBD1,TRBJ2-5,1
AAACGGGTCTAACTGG-1,CASSLGWHSYEQYF,572,TRBV16,None,TRBJ2-7,3
AAAGATGAGAATTGTG-1,CASGPGQSNTEVFF,527,TRBV13-2,TRBD1,TRBJ1-1,7
AAAGCAATCTGGCGAC-1,CASSDGLGGLEQYF,481,TRBV13-1,TRBD2,TRBJ2-7,7
AAATGCCCAATCCAAC-1,CAWVDWAQNTLYF,544,TRBV31,TRBD2,TRBJ2-4,3
AAATGCCTCGGCTTGG-1,CSAQGAHTEVFF,566,TRBV1,TRBD1,TRBJ1-1,18
AACACGTGTATAATGG-1,CASSSPLAGQDTQYF,519,TRBV3,None,TRBJ2-5,1
Number of records for each input file :
808 data_test/A/A_1.csv
1920 data_test/A/A_2.csv
2163 data_test/B/B_1.csv
1879 data_test/B/B_2.csv
836 data_test/C/C_1.csv
1182 data_test/C/C_2.csv
1705 data_test/D/D_1.csv
2091 data_test/D/D_2.csv
When running inference, I get an error on the following line:
features,_ = DTCRU.Sequence_Inference(beta_sequences=beta_sequences,v_beta=v_beta,j_beta=j_beta)
The error is the following
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py:1005: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py:1006: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
INFO:tensorflow:Restoring parameters from murine_antigens/models/model_0/model.ckpt
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-da530d2eb961> in <module>()
----> 1 features,_ = DTCRU.Sequence_Inference(beta_sequences=beta_sequences,v_beta=v_beta,j_beta=j_beta)
ValueError: too many values to unpack (expected 2)
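This ValueError is what Python raises when a call returns more items than there are names on the left-hand side. The sketch below reproduces it with a stand-in function; fake_inference is hypothetical, and what Sequence_Inference actually returns depends on the installed DeepTCR version and on which model object is used.

```python
def fake_inference(n_sequences):
    """Stand-in for an inference call that returns ONE matrix (a row per sequence)."""
    return [[0.0, 1.0] for _ in range(n_sequences)]

try:
    # Tries to split 5 rows across 2 names, which cannot work.
    features, _ = fake_inference(5)
    message = "unpacked"
except ValueError as err:
    message = str(err)
print(message)

# Binding the whole result to one name avoids the unpacking step.
features = fake_inference(5)
print(len(features))
```

Inspecting type(result) and, for arrays, result.shape before unpacking shows quickly whether the call returned one object or a tuple.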
DTCRU.HeatMap_Sequences(filename='Heatmap_Sequences.pdf')
The resulting file is blank.
Hello Prof. Sidhom,
thank you for creating DeepTCR; it is a very useful and cool tool.
The issue I experience when creating the dendrogram plots is that the sample circles are labeled (by sample/class) in barely readable colors that are almost indistinguishable from the background.
DTCRU.Repertoire_Dendrogram(n_jobs=40,distance_metric='correlation',sample_labels=True)
DTCRU.Repertoire_Dendrogram(n_jobs=40,distance_metric='correlation',log_scale=True,Load_Prev_Data=True,sample_labels=True)
UMAP transformation...
PhenoGraph Clustering...
Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 5.6398985385894775 seconds
Jaccard graph constructed in 1.4102559089660645 seconds
Wrote graph to binary file in 0.9278509616851807 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.970462
Louvain completed 21 runs in 2.248636484146118 seconds
PhenoGraph complete in 10.259825944900513 seconds
Clustering Done
/home/patrick/anaconda3/envs/DEEPTCR_ENV/lib/python3.8/site-packages/DeepTCR/functions/utils_u.py:161: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. In future versions, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = copy.copy(mpl.cm.get_cmap("viridis"))
cmap_viridis.set_under(color='white', alpha=0)
/home/patrick/anaconda3/envs/DEEPTCR_ENV/lib/python3.8/site-packages/DeepTCR/functions/utils_u.py:161: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. In future versions, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = copy.copy(mpl.cm.get_cmap("viridis"))
cmap_viridis.set_under(color='white', alpha=0)
Is it possible to change the font color of the sample labels?
Hi,
I have a few questions regarding using GPU.
Also, I was wondering if you could provide some insights on training an imbalanced dataset (binary classification) for this algorithm. Would you suggest using a balanced training dataset or including as much data as possible?
Thanks for your time and help!
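One common alternative to discarding data from the majority class is to weight classes inversely to their frequency. The sketch below computes such weights with the widely used formula n_samples / (n_classes * class_count). The Train signatures quoted in the tracebacks on this page list weight_by_class and class_weights arguments, but the format those arguments expect is not shown here, so the dictionary output and the helper name balanced_class_weights are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Toy binary example: 2 samples of one class, 8 of the other.
labels = ["responder"] * 2 + ["non_responder"] * 8
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss, which lets all of the data be used without the majority class dominating training.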
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type
as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
Sorry if there's an obvious answer to this; I'm not very experienced with Python and am learning it for my Honours degree.
When running this part of the DTCR_WF script:
DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
DTCR_WF.Train()
I'm getting an error that reads:
line 4014, in Get_Train_Valid_Test
raise Exception('Choose different train/valid/test parameters!')
Exception: Choose different train/valid/test parameters!
I am also having what feels like a related issue with DTCR_SS at the same point in its script. It does not throw an error and starts training, but nothing seems to happen: training continues until I stop it, with the training loss, validation loss, testing loss, etc. all reading 0.
Again, sorry if there is an obvious answer. I'd appreciate any help. Thank you. I should note that the data I'm using works fine with the unsupervised script.
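A quick diagnostic for the "Choose different train/valid/test parameters!" exception is to count how many samples each class has and see whether the requested test_size leaves at least one held-out sample per class. The rounding rule below is an assumption for illustration and is not taken from DeepTCR's source; classes_too_small_for_split is a hypothetical helper.

```python
import math
from collections import Counter

def classes_too_small_for_split(sample_classes, test_size):
    """List classes whose held-out share would round down to zero samples."""
    counts = Counter(sample_classes)
    return sorted(label for label, n in counts.items()
                  if math.floor(n * test_size) < 1)

# One entry per sample (file), giving the class that sample belongs to.
sample_classes = ["A", "A", "A", "A", "B", "B"]
too_small = classes_too_small_for_split(sample_classes, test_size=0.25)
print(too_small)
```

If a class shows up in the result, adding samples for that class or raising test_size are the two obvious things to try.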
Dear all,
I am trying to analyze a dataset but, for unknown reasons, some of the sequences are not considered.
My code is as follows:
%%capture
import sys
import pandas as pd
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U
# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')
a_target="MA0"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)
#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
DTCRU.Cluster(clustering_method='phenograph', sample=500)
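DFs = DTCRU.Cluster_DFs
# Tag each cluster's DataFrame with its index, then stack them into one table
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = []
for i, tdf in enumerate(DFs):
    tdf["cluster_index"] = i
    frames.append(tdf)
r_df = pd.concat(frames) if frames else pd.DataFrame()
fn = "result_clustering_MA0_clean.txt"
r_df.to_csv(fn)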
In the directory "data_deep_tcr/MA0" I have two subfolders. Each of these subfolders contains one TSV file with the following format:
aminoAcid counts v_beta j_beta
CASTHLDPPGEQYFG 571795 hTRBV28 hTRBJ02-7
CASSPLGASGEQFFG 317906 hTRBV28 hTRBJ02-1
CASGGGEQFFG 104692 hTRBV12-3 hTRBJ02-1
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-1
CASSFFPFNEQFFG 74908 hTRBV12-3 hTRBJ02-1
For example, I pass this sequence (with the v_beta and j_beta and the counts)
CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1
but it is not clustered.
Whereas, the same sequence with a different count, v_beta and j_beta is clustered:
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-17
Any idea why this is happening?
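One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the Get_Data call above passes aggregate_by_aa=True, and if rows are merged by amino-acid sequence, a CDR3 that occurs with two different V/J pairings might end up reported only once. The sketch below lists such sequences in the input so they can be inspected; repeated_cdr3 is a hypothetical helper, and the rows follow the four-column layout shown above.

```python
from collections import defaultdict

def repeated_cdr3(rows):
    """Map each CDR3 seen with more than one (V, J) pairing to those pairings."""
    seen = defaultdict(set)
    for cdr3, _count, v_gene, j_gene in rows:
        seen[cdr3].add((v_gene, j_gene))
    return {cdr3: sorted(pairs) for cdr3, pairs in seen.items() if len(pairs) > 1}

rows = [
    ("CANEGASENTEAFFG", 86447, "hTRBV06-1", "hTRBJ01-1"),
    ("CANEGASENTEAFFG", 73703, "hTRBV06-8", "hTRBJ01-1"),
    ("CASGGGEQFFG", 104692, "hTRBV12-3", "hTRBJ02-1"),
]
duplicates = repeated_cdr3(rows)
print(duplicates)
```

If the missing sequences all appear in this list, rerunning with aggregate_by_aa=False would be a simple way to test the hypothesis.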
Hello @sidhomj,
I used the unsupervised part of DeepTCR to cluster TCR sequences, but when I let the method determine the optimal threshold parameter with the following command, I got this error:
DTCRU_test.Cluster(clustering_method="hierarchical", linkage_method="ward", criterion="distance", write_to_sheets=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/DeepTCR.py", line 1054, in Cluster
IDX = hierarchical_optimization(distances, features, method=linkage_method, criterion=criterion)
File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/functions/utils_u.py", line 52, in hierarchical_optimization
sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))
File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 118, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 229, in silhouette_samples
check_number_of_labels(len(le.classes_), n_samples)
File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 35, in check_number_of_labels
% n_labels
ValueError: Number of labels is 2876. Valid values are 2 to n_samples - 1 (inclusive)
To correct this, I tried modifying the function hierarchical_optimization in the utils_u.py script in the DeepTCR/functions folder (l.44):
def hierarchical_optimization(distances,features,method,criterion):
    Z = linkage(squareform(distances), method=method)
    t_list = np.arange(1, 100, 1) # originally: t_list = np.arange(0, 100, 1)
    sil = []
    for t in t_list:
        IDX = fcluster(Z, t, criterion=criterion)
        if len(np.unique(IDX[IDX >= 0])) == 1:
            sil.append(0.0)
            continue
        sel = IDX >= 0
        sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))
    IDX = fcluster(Z, t_list[np.argmax(sil)], criterion=criterion)
    return IDX
and it works!
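The error message states the condition that the silhouette score needs: the number of distinct labels must lie between 2 and n_samples - 1. A guard that checks both bounds, rather than only the single-cluster case, could look like the sketch below; silhouette_is_defined is a hypothetical helper and is not part of DeepTCR or scikit-learn.

```python
def silhouette_is_defined(labels):
    """True when 2 <= number of distinct labels <= n_samples - 1."""
    n_samples = len(labels)
    n_labels = len(set(labels))
    return 2 <= n_labels <= n_samples - 1

every_point_alone = silhouette_is_defined([1, 2, 3, 4])  # as many labels as samples
single_cluster = silhouette_is_defined([1, 1, 1, 1])     # only one label
two_clusters = silhouette_is_defined([1, 1, 2, 2])       # valid case
print(every_point_alone, single_cluster, two_clusters)
```

Inside the threshold loop, a cut for which this returns False could be scored as 0.0 and skipped, which would cover the case from the traceback where every sequence is its own cluster.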
Hey,
I've managed to load my own data, but I am having trouble training the VAE (which works on the training dataset).
Code so far:
from DeepTCR.DeepTCR import DeepTCR_U
import pandas as pd
import numpy as np
DeepTCR_input = pd.read_csv('/Users/gordonbeattie/Documents/Projects/Maria_2/TCR/DeepTCR/DeepTCR_input.tsv', sep= '\t')
alpha = np.genfromtxt(DeepTCR_input.TRA, dtype='str')
beta = np.genfromtxt(DeepTCR_input.TRB, dtype='str')
sample = np.genfromtxt(DeepTCR_input.sample_ID, dtype='str')
DTCRU.Load_Data(alpha_sequences=alpha,beta_sequences=beta, sample_labels=sample)
DTCRU.Train_VAE(Load_Prev_Data=False)
Which throws the following error:
AttributeError: 'LabelEncoder' object has no attribute 'classes_'
Thanks in advance for any assistance!
Hi, sorry to disturb you.
I am trying to understand the training strategy for the HIV dataset and replicate the results from your publication.
It seems that the dataset can be categorized into non-cognate groups (CEF, AY9, and No Peptide conditions) and cognate groups (where there is an epitope). We have 3 * 3 non-cognate samples and 25 * 3 cognate samples. I saw in the paper that DeepTCR can distinguish non-cognate samples from cognate samples, and that two out of every three samples were kept as training data.
My question is, when doing the training, did you
Thanks and looking forward to your reply!
Hello,
We are interested in running this tool on our TCR-seq data. We have multiple cohorts of cancer patients (responders and non-responders), plus a big cohort of COVID patients and COVID-vaccinated patients. We would very much like to study these samples using your tool. However, while reviewing the documentation, I could not follow all the steps needed to run this tool.
Are you planning to add more instructions? Or would you be able to send us detailed documentation on how to run this tool?
Thank you,
Arnavaz Danesh -Bioinformatician at University Health Network, Toronto.
Are any of the trained VAE models publicly available, ideally together with an evaluation script?
We are working on a related topic and would like to perform an apples-to-apples comparison with your VAE approach.
Hi:
I am trying to run sequence_inference with a trained model, but the following error occurs:
I am not sure how I should change my code to make it work. May I ask for your suggestions? Thanks!
tensorflow/core/common_runtime/colocation_graph.cc:1218] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ResourceApplyAdam: CPU
ReadVariableOp: CPU
AssignVariableOp: CPU
VarIsInitializedOp: CPU
Const: CPU
VarHandleOp: CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
dense_1/bias/Initializer/zeros (Const) /device:GPU:0
dense_1/bias (VarHandleOp) /device:GPU:0
dense_1/bias/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/BiasAdd/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/bias/Adam/Initializer/zeros (Const) /device:GPU:0
dense_1/bias/Adam (VarHandleOp) /device:GPU:0
dense_1/bias/Adam/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Adam/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Adam/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/bias/Adam_1/Initializer/zeros (Const) /device:GPU:0
dense_1/bias/Adam_1 (VarHandleOp) /device:GPU:0
dense_1/bias/Adam_1/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Adam_1/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Adam_1/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
Adam/update_dense_1/bias/ResourceApplyAdam (ResourceApplyAdam) /device:GPU:0
save/AssignVariableOp_35 (AssignVariableOp) /device:GPU:0
save/AssignVariableOp_36 (AssignVariableOp) /device:GPU:0
save/AssignVariableOp_37 (AssignVariableOp) /device:GPU:0
I'd like to get the encoding accuracy of TCRs encoded on a VAE model trained on an initial dataset. i.e. How could I extract the encoding accuracy as well as the latent encoded variables from DTCRU.Sequence_Inference()?
Hello Dr. Sidhom,
I recently came across your “DeepTCR” paper and I found the idea of combining supervised and unsupervised learning and applying them to modeling TCR repertoires intriguing. Also, the performance in your paper is very impressive! I have some questions regarding Figure 2.c (titled representative TCRs and learned TCR motifs) that I hope you can help with.
Thank you in advance for your help!
Hello,
I checked the dataset used in the regression model. It seems that simply dropping duplicate TCRs does not reproduce the dataset used in the regression model. Could you tell me where I can find the preprocessing details needed to obtain that dataset?
Thanks!
Hi, I was trying to run the clustering tutorial and got an error after running the commands below:
DTCRU = DeepTCR_U('Tutorial')
#Load Data from directories
DTCRU.Get_Data(directory='github/DeepTCR/Data/Murine_Antigens',Load_Prev_Data=False,aggregate_by_aa=True,aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)
DTCRU.Train_VAE(Load_Prev_Data=False, suppress_output=False)
features = DTCRU.features
DTCRU.Cluster(clustering_method='phenograph')
TypeError Traceback (most recent call last)
in
1 # cluster using phenograph
----> 2 DTCRU.Cluster(clustering_method='phenograph')
~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in Cluster(self, set, clustering_method, t, criterion, linkage_method, write_to_sheets, sample, n_jobs)
1051 df['D_beta'] = d_beta[sel]
1052 df['J_beta'] = j_beta[sel]
-> 1053 df['HLA'] = list(map(list,hla_data_seq[sel].tolist()))
1054
1055 df_sum = df.groupby(by='Sample', sort=False).agg({'Frequency': 'sum'})
TypeError: 'float' object is not iterable
I am running using macos, python 3.7
I tried to run the Jupyter notebook tutorial as is after installing the DeepTCR development version into its own environment like this: pip3 install git+https://github.com/sidhomj/DeepTCR.git.
Running the second cell of the tutorial (PhenoGraph clustering of the loaded data), I get a "list index out of range" error:
Command:
DTCRU.Cluster(clustering_method='phenograph')
Output:
Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 3.4997947216033936 seconds
Jaccard graph constructed in 1.022956371307373 seconds
Wrote graph to binary file in 0.5937418937683105 seconds
Running Louvain modularity optimization
Error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-2-1c6d55ce009b> in <module>
----> 1 DTCRU.Cluster(clustering_method='phenograph')
~/.local/lib/python3.8/site-packages/DeepTCR/DeepTCR.py in Cluster(self, set, clustering_method, t, criterion, linkage_method, write_to_sheets, sample, n_jobs, order_by_linkage)
1044
1045 elif clustering_method == 'phenograph':
-> 1046 IDX, _, _ = phenograph.cluster(features, k=30, n_jobs=n_jobs)
1047
1048 elif clustering_method == 'kmeans':
~/.local/lib/python3.8/site-packages/DeepTCR/phenograph/cluster.py in cluster(data, k, directed, prune, min_cluster_size, jaccard, primary_metric, n_jobs, q_tol, louvain_time_limit, nn_method)
114 uid = uuid.uuid1().hex
115 graph2binary(uid, graph)
--> 116 communities, Q = runlouvain(uid, tol=q_tol, time_limit=louvain_time_limit)
117 print("PhenoGraph complete in {} seconds".format(time.time() - tic), flush=True)
118 communities = sort_by_size(communities, min_cluster_size)
~/.local/lib/python3.8/site-packages/DeepTCR/phenograph/core.py in runlouvain(filename, max_runs, time_limit, tol)
261
262 # continue only if we've reached a higher modularity than before
--> 263 if q[-1] - Q > tol:
264
265 Q = q[-1]
IndexError: list index out of range
What am I doing wrong?
Input data structure: 6 labels, each with 4 files. I tried folds=4, folds=5, and folds=10; all return the same error.
DTCRU.KNN_Repertoire_Classifier(folds=10,
Load_Prev_Data=False,
metrics=['AUC', 'F1', 'Recall', 'Precision'],
plot_metrics=True, plot_type='box',
by_class=False,
n_jobs=40)
File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/DeepTCR/DeepTCR.py", line 2290, in KNN_Repertoire_Classifier
sns.catplot(data=df_out, x='Metric', y='Value', kind=plot_type)
File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/seaborn/categorical.py", line 3724, in catplot
p.establish_colors(color, palette, 1)
File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/seaborn/categorical.py", line 315, in establish_colors
lum = min(light_vals) * .6
ValueError: min() arg is an empty sequence
Getting an error when using our data, after loading with:
DTCRU.Load_Data(beta_sequences=beta,v_beta=v_beta,j_beta=j_beta,class_labels=class_labels,
sample_labels=sample_labels, counts=counts)
Training appears to have gone OK:
DTCRU.Train_VAE(Load_Prev_Data=False,accuracy_min=0.85)
But clustering appears to fail:
The same error occurs with the phenograph method, so it isn't specific to the clustering approach. It also happens when randomly sampling:
Is it possible that the clustering methods produce some outliers, causing "sel" to not be an integer? Or perhaps there is some metadata I need to set?
Other functions appear to work okay:
I see #1, which has a similar error, but my data exists as a single CSV file that I'm loading via pandas and slicing the necessary columns out of. As such, loading via directory doesn't appear to be an option.
Any ideas?
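For reference, here is a minimal sketch of slicing a single CSV into equal-length arrays of the kind passed to `Load_Data` above. The column names and the inline CSV text are assumptions made up for illustration, and the sketch does not diagnose the clustering error; it only shows one way to check that no array carries a missing value or a mismatched length.

```python
import io

import pandas as pd

# Hypothetical stand-in for the single CSV file; column names are assumptions.
csv_text = """cdr3_beta,v_gene,j_gene,label,sample,count
CASSLGQAYEQYF,TRBV7-9,TRBJ2-7,A,s1,3
CASSPGTGGYEQYF,TRBV5-1,TRBJ2-7,B,s2,1
,TRBV5-1,TRBJ2-1,B,s2,2
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop rows with a missing CDR3 so no array contains NaN.
df = df.dropna(subset=["cdr3_beta"]).reset_index(drop=True)

# One array per argument, all taken from the same filtered frame.
beta = df["cdr3_beta"].to_numpy(dtype=str)
v_beta = df["v_gene"].to_numpy(dtype=str)
j_beta = df["j_gene"].to_numpy(dtype=str)
class_labels = df["label"].to_numpy(dtype=str)
sample_labels = df["sample"].to_numpy(dtype=str)
counts = df["count"].to_numpy(dtype=int)

# Every array should have the same length before it is handed to the loader.
lengths = {len(a) for a in (beta, v_beta, j_beta, class_labels, sample_labels, counts)}
print(lengths)
```

Building every array from the same filtered DataFrame keeps the rows aligned, which is easy to lose when columns are cleaned separately.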
Hi Dr. John-William Sidhom,
Sorry to disturb you, but I have a question about DeepTCR. This is an amazing package, and I would like to use it for sequence-level prediction. Say I train a model named 'model' and store the intermediate files at /user. The program then generates a subfolder 'model' and stores the model checkpoint information there.
My question is how to load this pre-trained model and run prediction on new data. I saw on the website that 'Sequence_Inference' can load a previously trained model, but I did not find an example in the tutorials. It would be extremely helpful if you could briefly explain how to load a model from a pre-generated folder and use it for prediction instead of training the model again.
Thanks in advance for your patience, and have a great day!
Hi, Professor John-William Sidhom!
I encountered some problems when I tried to install DeepTCR.
I have tried using miniconda and virtualenv to create a virtual environment, but the system always reports errors.
I can install Python 3.9, but I cannot get the corresponding pip version.
This has troubled me for a long time.
I hope this excellent software will have a Docker image someday.
Best wishes to you!
Would it be possible to provide some example files for the HLA type? Thanks a lot!
Hi, thank you for the great tool!
I am experiencing some issues loading paired single cell data.
I have CSVs for each of my samples with a barcode column and CDR3, V, and J genes for each chain (D gene coverage was low, so I removed that column). There are no NA or empty values that I can see, so I'm not sure why it is throwing an empty-array error.
Any help would be appreciated!
DTCR_WF.Get_Data(directory='pln/',Load_Prev_Data=False,aggregate_by_aa=True,
... aa_column_beta=7,v_beta_column=5,j_beta_column=6,
... aa_column_alpha=4, v_alpha_column=2,j_alpha_column=3,count_column=8)
Loading Data...
Traceback (most recent call last):
File "", line 3, in
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/DeepTCR/DeepTCR.py", line 336, in Get_Data
Y = OH.fit_transform(Y.reshape(-1,1))
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 488, in fit_transform
return super().fit_transform(X, y)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/base.py", line 847, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 461, in fit
self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 78, in _fit
X, force_all_finite=force_all_finite
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 44, in _check_X
X_temp = check_array(X, dtype=None, force_all_finite=force_all_finite)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 800, in check_array
% (n_samples, array.shape, ensure_min_samples, context)
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.
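One quick sanity check for a call like the one above is to print which column name each integer index actually points at. This is a minimal sketch, assuming the column arguments are 0-based positions as the DeepTCR documentation describes. The header and row below are hypothetical stand-ins for one sample file, and the check does not by itself explain the empty-array error.

```python
import csv
import io

# Hypothetical header and row standing in for one sample's CSV file.
csv_text = (
    "barcode,sample,v_alpha,j_alpha,cdr3_alpha,v_beta,j_beta,cdr3_beta,count\n"
    "AAAC-1,pln1,TRAV12-1,TRAJ33,CAVRDSNYQLIW,TRBV7-9,TRBJ2-7,CASSLGQAYEQYF,2\n"
)
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)

# Map each 0-based position to its column name.
index_to_name = dict(enumerate(header))

# Column indices copied from the Get_Data call above.
requested = {
    "aa_column_beta": 7,
    "v_beta_column": 5,
    "j_beta_column": 6,
    "aa_column_alpha": 4,
    "v_alpha_column": 2,
    "j_alpha_column": 3,
    "count_column": 8,
}

# Resolve each argument to the column it would read (None if out of range).
resolved = {arg: index_to_name.get(idx) for arg, idx in requested.items()}
for arg, name in resolved.items():
    print(f"{arg} -> {name}")
```

If any argument resolves to an unexpected column or to `None`, the indices are shifted relative to the file, for example because of an extra index column written when the CSV was saved.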