hanwenxuthu / biotranslatorproject
License: MIT License
Hi Hanwen,
I've successfully run through your analysis; thanks a lot for your support. I have some questions about the input and output.
As you mentioned in the email previously:
"R3: Our framework supports taking text as input and outputting the biological instances that are closest to the textual descriptions. I suggest: 1. Embed the textual descriptions into vector format, as in BioTranslator_go_embeddings.pkl (see def get_BioTranslator_emb). 2. Prepare the biology instance features. 3. Feed both the textual description embeddings and the biology instances into the model.pth files, which output the probability of each biology instance being classified into each textual description; you can then find the biology instance with the highest probability."
To my understanding, for example in SingleCell, def get_BioTranslator_emb uses the BioTranslator Text Encoder to embed the Gene Ontology terms. It loads the Gene Ontology terms from cl.obo using the load_co_text function. It then uses a pre-trained BERT model named microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext to encode each Gene Ontology term into a 768-dimensional vector. The encoded vectors are then stored in an ordered dictionary (default output file name is BioTranslator_go_embeddings.pkl), with the corresponding Gene Ontology terms from the input file (cl.obo) as keys.
It seems that this function is specifically designed to embed Gene Ontology terms and may not be suitable for embedding natural language text.
I wonder how to embed natural language from a text file rather than from the cl.obo file. For example, given the description "expression of endothelial cell in lung" and an h5ad file that includes multiple single-cell datasets, the output would be the single-cell expression profile from the input that best matches the description. What should I do to achieve this? Do you have code or a script in the BioTranslator package for this purpose?
Best,
Zelin
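The three steps quoted above can be sketched roughly as follows. This is a minimal illustration in plain Python, not code from the BioTranslator package: the vectors are made-up toy values standing in for a text embedding (768-dimensional with the real encoder) and for cell encodings produced by a trained data encoder, and rank_instances is a hypothetical helper name. It mirrors the scoring described in the reply: a dot product between each instance encoding and the text embedding, turned into a probability, followed by picking the highest-scoring instance. Using a sigmoid as the activation is an assumption of this sketch.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_instances(instance_encodings, text_embedding):
    """Score every instance against one text embedding.

    Returns (index, probability) pairs sorted from best to worst match.
    The sigmoid activation is an assumption of this sketch.
    """
    scores = []
    for idx, enc in enumerate(instance_encodings):
        # Dot product between the instance encoding and the text embedding.
        logit = sum(a * b for a, b in zip(enc, text_embedding))
        scores.append((idx, sigmoid(logit)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional encodings (hypothetical values).
text_emb = [1.0, 0.0, -1.0]
cell_encodings = [
    [0.2, 0.1, 0.3],    # cell 0: logit about -0.1
    [2.0, 0.5, -1.0],   # cell 1: logit 3.0
    [-1.0, 0.0, 1.0],   # cell 2: logit -2.0
]

ranking = rank_instances(cell_encodings, text_emb)
best_index, best_prob = ranking[0]
print(best_index, round(best_prob, 3))
```

For an h5ad input, the cell encodings would presumably come from passing each expression profile through the trained data encoder, and the text embedding from the text encoder applied to the free-text description; both steps are outside this sketch.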
Hi,
When trying Section 2.5, New Cell Type Discovery, I found that an ontology file is required (the default path is /data/xuhw/data/Ontology_data/cl.obo). What is cl.obo here, and how can I get it (if necessary)?
Best
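For context, cl.obo is the file name under which the Cell Ontology (an OBO Foundry ontology) is commonly distributed in OBO flat-file format, where each term is a [Term] stanza with id, name and, optionally, def lines. The sketch below is a minimal standard-library parser for that layout, run on an inline sample whose definitions are shortened placeholders rather than exact ontology text. parse_obo_terms is a hypothetical helper and not the package's own loader.

```python
def parse_obo_terms(text):
    """Collect id, name and def for every [Term] stanza in OBO text."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            continue
        if current is None or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        if key == "id":
            terms[value] = current
        elif key == "name":
            current["name"] = value
        elif key == "def":
            # Definitions are quoted and followed by a reference list.
            current["def"] = value.split('"')[1] if '"' in value else value
    return terms


# Inline sample in OBO layout (placeholder definitions).
SAMPLE = """format-version: 1.2

[Term]
id: CL:0000000
name: cell
def: "Placeholder definition of a cell." []

[Term]
id: CL:0000115
name: endothelial cell
def: "Placeholder definition of an endothelial cell." []

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(SAMPLE)
for term_id, fields in terms.items():
    print(term_id, fields["name"])
```

The name and definition text collected this way is the kind of input a text encoder would embed, keyed by term id.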
Hi Hanwen,
Thanks for sharing the code! I was wondering what the process/model for the text generation is, as in Fig. 2 (g). Maybe I missed something, but I couldn't find the description in the paper or the code. To clarify further, does generation mean auto-regressive generation (generating something that did not exist before) or retrieval from the existing training data? Many thanks!
Hi Hanwen,
I have been trying to run inference with the cached models and text encoder:
import torch
from transformers import AutoTokenizer
# BioUtils is the module from the BioTranslator code base;
# model_path and text_encoder_path point to the cached files.
device = torch.device('cuda')
model = torch.load(model_path)
model.to(device)
model.eval()
bert_name = 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'
tokenizer = AutoTokenizer.from_pretrained(bert_name)
encoder = BioUtils.NeuralNetwork('None', 'cls', bert_name)
encoder.load_state_dict(torch.load(text_encoder_path))
encoder = encoder.to(device)
encoder.eval()
# protein sequence
input_seq = None
# protein description
input_desc = None
# pathway vector
input_vec = None
# texts
texts = 'Protein sequence that causes severe breast cancer that make the patient die in less than a year.'
inputs = tokenizer(texts, return_tensors='pt').to(device)
# keep at most 512 tokens
sents_len = min(inputs['input_ids'].size(1), 512)
input_ids = inputs['input_ids'][0, 0: sents_len].view(len(inputs['input_ids']), -1).to(device)
attention_mask = inputs['attention_mask'][0, 0: sents_len].view(len(inputs['attention_mask']), -1).to(device)
token_type_ids = inputs['token_type_ids'][0, 0: sents_len].view(len(inputs['token_type_ids']), -1).to(device)
text_emb = encoder(input_ids, attention_mask, token_type_ids)
# Run inference
with torch.no_grad():
    output = model(input_seq, input_desc, input_vec, text_emb)
My questions are:
BioConfig.data_encoder took all 3 features as input (while the BioDataEncoder class itself treats these features as optional input):
class BioTranslator(nn.Module):
    ...
    def forward(self, input_seq, input_description, input_vector, texts):
        # get textual description encodings
        text_encodings = texts.permute(1, 0)
        # get biology instance encodings
        data_encodings = self.data_encoder(input_seq, input_description, input_vector)
        # compute logits
        logits = torch.matmul(data_encodings, text_encodings)
        return self.activation(logits)
input_seq, input_description and input_vector tensors?
import pickle
from torch.utils.data import DataLoader
with open('train_data_fold_2.pkl', 'rb') as f:
    train2 = pickle.load(f)
train_loader = DataLoader(train2, batch_size=32, shuffle=False)
for batch in train_loader:
    print(batch)
    break
I got error:
KeyError                                  Traceback (most recent call last)
~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_3484746/410246742.py in <module>
----> 1 for batch in train_loader:
      2     print(batch)
      3     break
~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \
~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)
~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]
~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]
~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]
~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 0
train2 is a pandas DataFrame with 4 columns, named 'proteins', 'genes', 'sequences' and 'annotations'.
How did the BioTrainer get batch['prot_seq'], batch['prot_description'] and batch['prot_network'] from it?
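One plausible reading of the KeyError above: with automatic batching, the DataLoader fetches dataset[idx] for integer indices, and indexing a DataFrame with an integer looks up a column label rather than a row, so train2[0] fails. The sketch below reproduces the idea without pandas or torch, using a dict of column lists as a stand-in for the frame. RowDataset and the toy values are hypothetical, and how BioTrainer actually builds batch['prot_seq'] is not shown in this thread. With a real DataFrame, the row lookup would typically go through frame.iloc[idx].

```python
class RowDataset:
    """Map-style dataset: integer index -> one row as a dict."""

    def __init__(self, columns):
        # `columns` maps column name -> list of values (stand-in for a DataFrame).
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.size = lengths.pop() if lengths else 0

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return {name: values[idx] for name, values in self.columns.items()}


# Toy table with the four column names mentioned above (made-up values).
table = {
    "proteins": ["P1", "P2"],
    "genes": ["G1", "G2"],
    "sequences": ["MKT", "MAA"],
    "annotations": [["GO:1"], ["GO:2", "GO:3"]],
}

# Column-keyed lookup with an integer fails, as in the traceback.
try:
    table[0]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

dataset = RowDataset(table)
first_row = dataset[0]
print(column_lookup_failed, len(dataset), first_row["sequences"])
```

Any object with __len__ and __getitem__ like this can be handed to a DataLoader as a map-style dataset; variable-length fields such as the annotations would still need a custom collate function or a fixed-size encoding before batching.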
pip install biotranslator
Collecting biotranslator
Using cached biotranslator-0.1.2.tar.gz (32 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
Traceback (most recent call last):
  File "/opt/py/conda/PyLib_Common/envs/biotranslator/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
    main()
  File "/opt/py/conda/PyLib_Common/envs/biotranslator/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/opt/py/conda/PyLib_Common/envs/biotranslator/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/buildapi.py", line 23, in get_requires_for_build_wheel
    info = read_flit_config(pyproj_toml)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/config.py", line 79, in read_flit_config
    return prep_toml_config(d, path)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/config.py", line 106, in prep_toml_config
    loaded_cfg = read_pep621_metadata(d['project'], path)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/config.py", line 616, in read_pep621_metadata
    raise ConfigError(
flit_core.config.ConfigError: flit only supports dynamic metadata for 'version' & 'description'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
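The final ConfigError states the rule directly: flit accepts only version and description in the dynamic list of the [project] table. As a rough sketch, assuming the package's pyproject.toml declares some other field as dynamic (the actual file is not shown here, so the example table below is invented), a check like the following would flag the offending entries. unsupported_dynamic_fields is a hypothetical helper that operates on an already-parsed [project] table.

```python
# Fields that flit allows to be dynamic, per the error message above.
FLIT_DYNAMIC_ALLOWED = {"version", "description"}


def unsupported_dynamic_fields(project_table):
    """Return the entries of project.dynamic that flit would reject."""
    dynamic = project_table.get("dynamic", [])
    return [field for field in dynamic if field not in FLIT_DYNAMIC_ALLOWED]


# Invented example of a parsed [project] table.
example_project = {
    "name": "example-package",
    "dynamic": ["version", "description", "dependencies"],
}

rejected = unsupported_dynamic_fields(example_project)
print(rejected)
```

If this is indeed the cause, installing from the repository source rather than the published sdist may sidestep the build step, though that is untested here.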