hanwenxuthu / biotranslatorproject

License: MIT License

Python 98.90% Shell 1.10%

biotranslatorproject's Issues

Question about input and output of BioTranslator

Hi Hanwen,
I've successfully run through your analysis; thanks a lot for your support. I have some questions about the input and output:

As you mentioned in the email previously:
"R3: Our framework supports taking the text as input and then output the biological instances which are most closed to this textual descriptions. I suggest 1.Embedding the textual descriptions into the vector format as BioTranslator_go_embeddings.pkl (see def get_BioTranslator_emb) 2. prepare the biology instances features. 3. Feed both the textual description embeddings and biology instances into the model.pth files, which could output the probability for each biology instance being classified into one textual description and then you can find the biology instance with the highest probability. "

To my understanding, for example in SingleCell, def get_BioTranslator_emb uses the BioTranslator Text Encoder to embed the Gene Ontology terms. It loads the Gene Ontology terms from cl.obo using the load_co_text function. It then uses a pre-trained BERT model named microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext to encode each Gene Ontology term into a 768-dimensional vector. The encoded vectors are then stored in an ordered dictionary (default output file name is BioTranslator_go_embeddings.pkl) with their corresponding Gene Ontology terms in the input file (cl.obo) as keys.

It seems that this function is specifically designed to embed Gene Ontology terms and may not be suitable for embedding natural language text.

How can I embed natural language from a text file rather than from the cl.obo file? For example, given the description "expression of endothelial cell in lung" and an h5ad file that includes multiple single-cell datasets, the output would be the single-cell expression profile from the input that best matches the description. What should I do to achieve this? Do you have code or a script in the BioTranslator package for this purpose?

Best,
Zelin
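
For readers following along, step 3 of the quoted reply (score every biology instance against a text embedding, then keep the highest-scoring one) can be sketched without any dependencies. This is only an illustration under assumptions: the function names are hypothetical, the embeddings are toy values, both embeddings are assumed to already live in the same space, and a sigmoid is assumed as the output activation. It is not the repository's own inference code.

```python
import math


def sigmoid(x):
    """Logistic function, used here as an assumed output activation."""
    return 1.0 / (1.0 + math.exp(-x))


def score_instances(text_emb, instance_embs):
    """Return one probability-like score per instance.

    text_emb: list of floats, the embedding of one textual description.
    instance_embs: list of float lists, one embedding per biology instance,
        assumed to be projected into the same space as text_emb.
    """
    scores = []
    for emb in instance_embs:
        if len(emb) != len(text_emb):
            raise ValueError("embedding dimensions do not match")
        dot = sum(a * b for a, b in zip(emb, text_emb))
        scores.append(sigmoid(dot))
    return scores


def best_match(text_emb, instance_embs):
    """Index and score of the instance that scores highest for the text."""
    scores = score_instances(text_emb, instance_embs)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy 3-dimensional embeddings, made up for illustration only.
text_emb = [1.0, 0.0, -1.0]
cell_embs = [
    [0.2, 0.1, 0.3],
    [0.9, 0.0, -0.8],
    [-0.5, 0.4, 0.5],
]
index, score = best_match(text_emb, cell_embs)
print(index, round(score, 3))
```

Because the sigmoid is monotonic, the ranking depends only on the dot products, so the choice of activation does not change which instance comes out on top.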

Ontology file for SingleCell model

Hi,
When I tried Section 2.5, New Cell Type Discovery, I found that an ontology file is required (the default path is /data/xuhw/data/Ontology_data/cl.obo). What is cl.obo here, and how can I get it (if necessary)?
Best
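
A note for readers with the same question: cl.obo is most likely the Cell Ontology release file in OBO flat-file format, which is distributed through the OBO Foundry; that is an inference from the file name, not something confirmed in this thread. To get a feel for what such a file contains, here is a minimal sketch of reading the id, name, and def fields of each [Term] stanza. The function name and the sample stanzas (with made-up EX: identifiers) are hypothetical, and the def handling is deliberately simplistic.

```python
def parse_obo_terms(lines):
    """Collect id, name and def from each [Term] stanza of an OBO file."""
    terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            continue
        if current is None or ": " not in line:
            continue  # header lines, blank lines, or a skipped stanza
        tag, value = line.split(": ", 1)
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "def":
            # def values look like: "text" [references]; keep the quoted text.
            # Escaped quotes inside the text are not handled in this sketch.
            parts = value.split('"')
            current["def"] = parts[1] if len(parts) >= 3 else value
    return terms


# Made-up sample in OBO syntax; identifiers and text are illustrative only.
sample = """format-version: 1.2

[Term]
id: EX:0000001
name: example cell type A
def: "A made-up definition for type A." [EX:ref]

[Typedef]
id: part_of
name: part of

[Term]
id: EX:0000002
name: example cell type B
""".splitlines()

terms = parse_obo_terms(sample)
for term_id, fields in terms.items():
    print(term_id, fields.get("name"), fields.get("def"))
```

For a real file, the same function can be fed an open file handle, for example `with open("cl.obo") as f: terms = parse_obo_terms(f)`.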

Text generation

Hi Hanwen,

Thanks for sharing the code! I was wondering what the process/model for the text generation is, as in Fig. 2(g). Maybe I missed something, but I couldn't find a description in the paper or the code. To clarify further: does "generation" mean auto-regressive generation (producing text that did not exist before) or retrieval from the existing training data? Many thanks!

Question about inference of the model

Hi Hanwen,
I have been trying to run inference with the model, using the cached models and text encoder:

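import torch
from transformers import AutoTokenizer
# BioUtils is assumed to be the module from the BioTranslator repository;
# model_path and text_encoder_path point to the cached checkpoint files.

device = torch.device('cuda')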

model = torch.load(model_path)
model.to(device)
model.eval()

bert_name = 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'
tokenizer = AutoTokenizer.from_pretrained(bert_name)
encoder = BioUtils.NeuralNetwork('None', 'cls', bert_name)
encoder.load_state_dict(torch.load(text_encoder_path))
encoder = encoder.to(device)
encoder.eval()

# protein sequence
input_seq = None

# protein description
input_desc = None

# pathway vector
input_vec = None

# texts
texts = 'Protein sequence that causes severe breast cancer that makes the patient die in less than a year.'
inputs = tokenizer(texts, return_tensors='pt').to(device)
sents_len = min(inputs['input_ids'].size(1), 512)
input_ids = inputs['input_ids'][0, 0: sents_len].view(len(inputs['input_ids']), -1).to(device)
attention_mask = inputs['attention_mask'][0, 0: sents_len].view(len(inputs['attention_mask']), -1).to(device)
token_type_ids = inputs['token_type_ids'][0, 0: sents_len].view(len(inputs['token_type_ids']), -1).to(device)
text_emb = encoder(input_ids, attention_mask, token_type_ids)

# Run inference
with torch.no_grad():
    output = model(input_seq, input_desc, input_vec, text_emb)

My questions are:

It seems that to run inference, BioTranslator requires input_seq, input_desc, and input_vec as input to the data encoder, since the training process (with GOA_Human) used them (class BioDataEncoder(nn.Module)); the features were already set in BioConfig.
Can I run inference with a model that was trained on all 3 features (seqs, network, description) without supplying all 3 features at inference time? If so, how? I ask because in your code the data_encoder takes all 3 features as input (while the BioDataEncoder class itself treats these features as optional inputs):

class BioTranslator(nn.Module):
    ...
    def forward(self, input_seq, input_description, input_vector, texts):
        # get textual description encodings
        text_encodings = texts.permute(1, 0)
        # get biology instance encodings
        data_encodings = self.data_encoder(input_seq, input_description, input_vector)
        # compute logits
        logits = torch.matmul(data_encodings, text_encodings)
        return self.activation(logits)

I do not fully understand how BioTranslator encodes the data. Can you point out how I should prepare the input_seq, input_description, and input_vector tensors?
For example, I tried to step through the training process to check what batch['prot_seq'] is and where its raw data comes from, so that I can prepare the data for inference, but I failed to inspect it:

import pickle
from torch.utils.data import DataLoader

with open('train_data_fold_2.pkl', 'rb') as f:
    train2 = pickle.load(f)
train_loader = DataLoader(train2, batch_size=32, shuffle=False)
for batch in train_loader:
    print(batch)
    break

I got this error:

KeyError                                  Traceback (most recent call last)
~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_3484746/410246742.py in <module>
----> 1 for batch in train_loader:
      2     print(batch)
      3     break

~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

~/BioTranslatorProject/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/BioTranslatorProject/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 0

train2 is a pandas DataFrame with 4 columns, named 'proteins', 'genes', 'sequences', and 'annotations'.

How does the BioTrainer get batch['prot_seq'], batch['prot_description'], and batch['prot_network'] from it?
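
A note on the KeyError above: the traceback shows DataLoader calling self.dataset[idx] with an integer, and on a DataFrame train2[0] is a column lookup, so it fails when no column is labelled 0. DataLoader expects a map-style dataset, that is, an object whose __getitem__ takes a row index and which defines __len__. The sketch below shows such a wrapper in plain Python. RowDataset is a hypothetical name, the values are toy placeholders, and this is not necessarily how the repository builds its batches.

```python
class RowDataset:
    """Map-style dataset: integer index -> one row as a dict."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values, e.g. what
        # df.to_dict("list") returns for a pandas DataFrame.
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.size = lengths.pop() if lengths else 0

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return {name: values[idx] for name, values in self.columns.items()}


# Toy stand-in for the four columns described above (placeholder values).
toy_columns = {
    "proteins": ["P1", "P2"],
    "genes": ["G1", "G2"],
    "sequences": ["MKT", "MAA"],
    "annotations": ["A1", "A2"],
}
dataset = RowDataset(toy_columns)
print(len(dataset), dataset[0])
```

An object like this can be passed to torch.utils.data.DataLoader in place of the raw DataFrame; if a column holds variable-length lists, a custom collate_fn may be needed as well.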

Do you have a diagram of the overall neural network architecture for inference? Basically, the transformations and tensor shapes (or the I/O data formats) of the inputs (seqs, networks, etc.), the output (predictions), and the intermediate data encodings/embeddings.

python package installation error

pip install biotranslator

Collecting biotranslator
Using cached biotranslator-0.1.2.tar.gz (32 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
Traceback (most recent call last):
  File "/opt/py/conda/PyLib_Common/envs/biotranslator/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
    main()
  File "/opt/py/conda/PyLib_Common/envs/biotranslator/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/opt/py/conda/PyLib_Common/envs/biotranslator/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/buildapi.py", line 23, in get_requires_for_build_wheel
    info = read_flit_config(pyproj_toml)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/config.py", line 79, in read_flit_config
    return prep_toml_config(d, path)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/config.py", line 106, in prep_toml_config
    loaded_cfg = read_pep621_metadata(d['project'], path)
  File "/tmp/pip-build-env-_vwnyd73/overlay/lib/python3.9/site-packages/flit_core/config.py", line 616, in read_pep621_metadata
    raise ConfigError(
flit_core.config.ConfigError: flit only supports dynamic metadata for 'version' & 'description'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
