Comments (36)
@stevenberg1 You can download PubMed from https://www.nlm.nih.gov/databases/download/pubmed_medline.html . The PubTator API is at https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/ . You can parse PubMed data using https://github.com/titipata/pubmed_parser . As I mentioned before, for the number of entities, I only chose entities that exist in the training set. Besides, some entities don't have a link to the three types (gene, disease, chemical) of entities we have.
from paperrobot.
@EagleW thanks
Yes, I agree with you on the batch size problem. For the concatenation, the original graph attention paper is https://arxiv.org/pdf/1710.10903.pdf. Their implementation is at https://github.com/Diego999/pyGAT/blob/master/layers.py lines 31-32. I think our implementation is equivalent to theirs because our addition broadcasts automatically.
Top-10 is used in the test phase. For training, I just extract all entities in the paragraph and try to find potential triples (with head and tail among those entities).
Hahah, thanks for your work! PaperRobot seems to handle paper-writing style very well, e.g. sentences like "The results of this study demonstrate ...", "The results of this study ...", "The objective of this study was to ...", and so on.
Two problems still exist:
1: The NER problem. Since the PubTator API is offline, it can't extract entities from a raw input sentence, so NER is limited to searching the entity dict. That is time-consuming, and it can't handle slight changes in an entity term (if edit distance is used, it will take even longer).
2: How to define and get the related entities before writing a sentence? The related entities seem to be the key factor. I can see in your input.py script in the writing stage that, when a raw sentence is input, the related items are entered by the users.
Hi @stevenberg1, thank you very much for your interest in our work! We first split the PubMed papers we crawled into training, validation, and test sets. For the text corresponding to an entity, we use PubTator to annotate all abstracts and titles available in our dataset. We then extract the sentences which contain that entity and use the nltk tokenizer to tokenize them. For the term.pth, we also use PubTator to annotate all abstracts and titles in the whole dataset. We then match those mentions to MeSH IDs. After that, we establish relations between entities by looking up the CTD data. We then split those relations based on whether both the head and the tail exist in the training PubMed paper dataset. If a relation doesn't belong to the training set, we put it in the validation or test set.
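The sentence-extraction step described here can be sketched roughly as follows (an illustrative sketch, not the repo's preprocessing code: `mention_index` and `documents` are assumed inputs, and a naive regex splitter stands in for the nltk tokenizers):

```python
import re

def sentences_for_entity(mesh_id, mention_index, documents):
    """Collect tokenized sentences that mention any surface form of an entity.

    mention_index: MeSH ID -> list of surface mentions (e.g. from PubTator).
    documents: list of title/abstract strings.
    A naive regex splitter stands in for the nltk tokenizers used in the paper.
    """
    mentions = {m.lower() for m in mention_index.get(mesh_id, [])}
    collected = []
    for doc in documents:
        # crude sentence split on ., !, or ? followed by whitespace
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            lowered = sent.lower()
            if any(m in lowered for m in mentions):
                collected.append(lowered.split())
    return collected
```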
@EagleW thanks for your quick reply. What is the strategy for crawling the PubMed papers, and what types of papers do you crawl? Can you share some API? Also, I find that entity_text_title_tokenized.json has 30000+ entities, but entity2id seems to have only 14857?
@EagleW I want to ask some other questions. Thanks for your suggestions: I've learned to use PubTator to extract entities, and I've tried pubmed_parser; it can extract title, abstract, pmid, and so on. But some questions still confuse me.
1: The text corresponding to an entity: how is it defined? In entity_text_title_tokenized.json, an entity such as D000255 has 1000+ tokenized sentences. How were these sentences obtained? I can find PubMed at https://www.nlm.nih.gov/databases/download/pubmed_medline.html, but since the data is quite large, I'm not sure which file to download to get the sentences related to this entity.
2: In the new-paper-writing module, after we train a model such as the pubmed_abstract model, we can run inference with the input.py script. But terms are also needed as input, besides Sources, before getting Output. How do we choose the terms?
3: In the new-paper-writing module, one JSON string contains title (as the source string), abs (as the target string), term (how is it generated??), and words (I guess words is the union of the source and target words?). Confused...
1. As I previously said, if a paper's title or abstract contains that entity, it is considered a relevant sentence. I downloaded all the PubMed data. You just need to run PubTator on those papers to get the entities and link them to the CTD dataset. You can check my paper for more details.
2/3. First, we build a KB from the entities extracted from the training dataset and their corresponding relations from the CTD dataset. We trained a link-prediction model and used it to enhance the original KB. For the dataset used for paper writing, the terms are predicted as follows: given a title, we run PubTator on it to extract the entities contained in the previously enhanced KB. We use those entities to find the top-10 most relevant neighbors. We consider those neighbors plus the extracted entities as the terms.
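A minimal sketch of that term-selection step (illustrative only; `neighbors` and `score` are assumed stand-ins for the enhanced KB and the link-prediction scores, not the repo's actual API):

```python
def predict_terms(title_entities, neighbors, score, k=10):
    """Return the title entities plus each one's top-k most relevant neighbors.

    title_entities: entity IDs PubTator found in the title (and in the KB).
    neighbors: entity ID -> iterable of neighbor IDs in the enhanced KB.
    score: (entity, neighbor) -> relevance, e.g. from a link-prediction model.
    """
    terms = set(title_entities)
    for e in title_entities:
        ranked = sorted(neighbors.get(e, []), key=lambda n: score(e, n),
                        reverse=True)
        terms.update(ranked[:k])  # keep only the top-k neighbors per entity
    return terms
```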
@EagleW I want to ask a trivial question.
entity_text_title_tokenized.json has 30000+ entities, but entity2id seems to have only 14857 entities.
You mentioned before that, for the number of entities, you only chose entities that exist in the training set, and that besides, some entities don't have a link to the three types (gene, disease, chemical) of entities.
But besides train2id.txt, the two entities in a relation in test2id.txt and valid2id.txt also have the new IDs, and those IDs are created in entity2id.txt. You also mentioned that entity2id.txt's head and tail entities all come from the training set. So I'm confused about how train2id.txt, test2id.txt, and valid2id.txt are produced.
In my opinion, entity_text_title_tokenized.json has 30000+ entities, and those entities come from all papers (train, val, and test sets, covering the three categories: chemical, disease, gene); term.pth has almost 80000 entities (all categories); and entity2id.txt has 14000+ (only from the training set?? confused...). And because the edge vertices in train2id.txt, test2id.txt, and valid2id.txt all share the same newly created increasing IDs, the vertex IDs of all three datasets (train, val, test) are in entity2id.txt.
Is anything wrong? Just eagerly asking for help...
Sorry for the confusion. I think what you said is almost correct. Yes, entity_text_title_tokenized and terms do have many entities that are not in those three categories. entity2id does include all entities from train2id.txt, test2id.txt, and valid2id.txt.
But entity_text_title_tokenized has only 30000+, while terms.pth has 80000+ (maybe all categories). I'm not sure whether entity_text_title_tokenized also covers all categories, or just the three categories. Why is it only 30000+ and smaller than terms.pth?
I can understand that train2id.txt's head and tail vertices all come from the training set, to avoid data leakage.
But where does entity_text_title_tokenized.json come from? The whole paper dataset, or just the training set? If from the training set, why is entity2id smaller than entity_text_title_tokenized.json?
term.pth can certainly come from the whole paper dataset.
terms.pth actually could have some entities that are not included in any of the datasets, because it includes almost all entities in the CTD dataset. Besides, I think entity_text_title_tokenized only contains entities of the three types. Sorry, another author wrote this preprocessing code, so I may not have explained the procedure very clearly.
Emm, thanks for your reply. Looking forward to your team releasing the preprocessing code one day...
Hi Qingyun, it seems the paper-reading model has gotten much bigger than two months ago? Two months ago I could train the paper-reading model without changing any parameters, but today, even after changing the batch_size from 200 to 5, I still get RuntimeError: CUDA out of memory. My NVIDIA card is a 1080 with 8GB of memory.
Hi @stevenberg1, sorry, I found a bug in my code last month: when generating corrupted triples, I failed to consider the case of replacing the tail because of a wrong index. I fixed it in the utils. Did you pull the latest version? Is that why the code no longer works?
Yeah, I've pulled the latest code; in Existing paper reading/utils/utils.py it is given as h = tri[0], t = tri[1].
It is still prone to the RuntimeError... but I don't think this changed code causes the problem. Maybe some other place runs out of CUDA memory.
Yes, I think this change increases the number of corrupted tuples. Maybe you can reduce the batch size to 1?
Emm... if batch size == 1, doesn't the training process become very unstable? OK, I will debug it and check whether the number of corrupted tuples increases much compared to before.
Maybe change it to 2? Or you can just keep bern at the previous version so the model won't consider negative examples formed by replacing the tail. I don't think this will change performance much.
Hi Qingyun, I've debugged this: h = tri[0], t = tri[1] (tri[1] represents the tail node; before, it was h = tri[0], t = tri[2], but tri[2] represents the relation ID, so t = tri[1] is correct). Also, the number of generated corrupted triples equals the number of positives; the positive number is 800,000+, just as before. So the number really hasn't changed; only the index values got bigger?
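For readers following along, a generic corrupted-triple sampler of the kind being discussed looks roughly like this (an illustrative sketch, not the repo's utils code; a uniform 50/50 head/tail choice stands in for the bern strategy):

```python
import random

def corrupt(tri, num_entities, rng=random):
    """Corrupt one positive triple into a negative by replacing head or tail.

    Indexing follows the thread: tri[0] = head, tri[1] = tail, tri[2] = relation.
    """
    h, t, r = tri[0], tri[1], tri[2]
    if rng.random() < 0.5:
        h = rng.randrange(num_entities)  # replace the head
    else:
        t = rng.randrange(num_entities)  # replace the tail (the fixed branch)
    return (h, t, r)
```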
Yes, tri[1] represents the tail. I just reviewed the code. Which PyTorch version are you currently using?
torch 1.1.0
torchvision 0.3.0
Another thing is that I changed get_subgraph in utils.py. The previous subgraph implementation had some problems. Anyway, I think you can reduce the batch size to fix the problem.
Thanks, I am debugging the get_subgraph function and the DataLoader with collate_fn, trying to understand them.
Hi Qingyun, I see you use the graph attention network to capture the importance of a neighbor's features to the entity, i.e. self-attention. When you compute the similarity between f_1 and f_2, your paper uses the concatenation formulation, while the program uses these three lines:
f_1 = h @ self.a1
f_2 = h @ self.a2
e = self.leakyrelu(f_1 + f_2.transpose(0, 1)) # node_num * node_num
Is the sum of f_1 (N * 1) and f_2 transposed (1 * N) the concatenation you apply? The similarity of two vectors A and B is usually computed in one of three ways: 1: dot product, 2: multiplicative (Aᵀ W B), 3: concatenation.
The code f_1 + f_2.transpose(0, 1) looks different from all three ways?
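For what it's worth, the broadcasting trick can be checked numerically: if the attention vector a is split into halves a1 and a2, then aᵀ[h_i ‖ h_j] = h_i·a1 + h_j·a2, so f_1 + f_2.T reproduces exactly the concatenation-based scores. A small NumPy check (illustrative, independent of the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 4, 8                        # number of nodes, transformed feature dim
h = rng.standard_normal((N, F))    # h = W x: already-transformed node features
a = rng.standard_normal(2 * F)     # attention vector applied to [h_i || h_j]
a1, a2 = a[:F, None], a[F:, None]  # split into the halves acting on h_i, h_j

f1 = h @ a1                        # (N, 1)
f2 = h @ a2                        # (N, 1)
e_broadcast = f1 + f2.T            # (N, N) score matrix via broadcasting

# explicit concatenation version: e[i, j] = a^T [h_i || h_j]
e_concat = np.array([[np.concatenate([h[i], h[j]]) @ a for j in range(N)]
                     for i in range(N)])

assert np.allclose(e_broadcast, e_concat)
```

(The LeakyReLU is applied elementwise afterwards, so it preserves the equivalence.)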
And I've found why the paper-reading training process gets so slow after the get_subgraph modification. For example, with a batch_size of 100, you feed in 100 triples containing 200 nodes, and those 200 nodes have a lot of neighbors. In debugging I found that the 200 nodes have 8000+ neighbors, while all entities number 14000+, so more than half of the whole entity set participates in the GAT computation in a single batch.
So it consumes both batch time and GPU memory.
So the best way is to fix the batch size to a very small value...
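The blow-up described above is easy to reproduce with a toy one-hop expansion (an illustrative sketch, not the actual get_subgraph): even a few batch nodes with high-degree neighbors can pull in a large fraction of the KB.

```python
def one_hop_nodes(batch_triples, adjacency):
    """Return the batch's head/tail nodes plus all of their one-hop neighbors.

    batch_triples: (head, tail, relation) tuples.
    adjacency: node -> iterable of neighbor nodes.
    """
    nodes = set()
    for head, tail, _ in batch_triples:
        nodes.update((head, tail))
    expanded = set(nodes)
    for n in nodes:
        expanded.update(adjacency.get(n, ()))  # one-hop expansion
    return expanded
```

With a single hub node of degree 9, one triple already pulls in all 10 nodes of the toy graph; in a dense biomedical KB the same effect makes the subgraph grow far faster than the batch size.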
1: curl -L https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi/Chemical/19894120/JSON/ — this API works well.
2: curl -d '{"sourcedb":"PubMed","sourceid":"1000001","text":"A kinetic model identifies phosphorylated estrogen receptor-a (ERa) as a critical regulator of ERa dynamics in breast cancer."}' https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi/tmChem/Submit/
returns session id: 6096-8616-8473-7524
curl https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi/6096-8616-8473-7524/Receive/
[Warning] : The Result is not ready.
It always returns this...
So extracting entities from a single sentence (no pmid, just a raw sentence) with PubTator seems not to work.
Hi @EagleW, am I calling this API wrong?
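In case it helps, the three curl calls above can be wrapped like this (URL/payload builders only, mirroring the endpoints quoted above so they can be driven by any HTTP client; the async Submit/Receive flow means you have to keep polling Receive while it returns the "Result is not ready" warning):

```python
import json

BASE = "https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi"

def annotate_url(trigger, pmid):
    """URL for pre-annotated PubMed articles, e.g. trigger='Chemical'."""
    return f"{BASE}/{trigger}/{pmid}/JSON/"

def submit_payload(text, sourceid="1000001"):
    """JSON body for submitting raw text to the tmChem Submit endpoint."""
    return json.dumps({"sourcedb": "PubMed", "sourceid": sourceid, "text": text})

def receive_url(session_id):
    """URL to poll for results with the session id returned by Submit."""
    return f"{BASE}/{session_id}/Receive/"
```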
@stevenberg1 Actually we use the entire annotation of PubTator instead of the API.
See ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/
Yeah, I downloaded it before, haha. When you define the subgraph, you just input a triple list and replace the head or tail with the head's or tail's adjacent nodes. But when a raw sentence is input, PubTator only extracts the nodes, not triples with two nodes and a relation. So after extracting the nodes, there seem to be no triples; how do you define the top-10 related entities then? Just the extracted entity's adjacent nodes? That seems not ideal.
So the way is: when some entity is extracted from the sentence, get its adjacent node list from the graph, produce triples from the entity and its adjacent nodes, and then call get_subgraph to produce the one-hop related entities?
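If that reading is right, the inference-time flow might be sketched as (hypothetical names, not the repo's API; `adjacency` stands in for the KB's edge lists):

```python
def candidate_triples(extracted, adjacency):
    """Pair each extracted entity with its KB neighbors to form seed triples.

    extracted: entity IDs found in the raw sentence (e.g. via PubTator).
    adjacency: entity -> list of (neighbor, relation) pairs from the KB.
    Entities absent from the KB contribute no triples.
    """
    triples = []
    for e in extracted:
        for neighbor, relation in adjacency.get(e, []):
            triples.append((e, neighbor, relation))
    return triples
```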
Yeah, based on your work, I built a web demo for writing: http://autowrite.natapp1.cc/ . This web server runs on a single PC, so inference performance is poor.
Good to hear it! @stevenberg1