Comments (36)
@stevenberg1 You can download PubMed from https://www.nlm.nih.gov/databases/download/pubmed_medline.html . The PubTator API is at https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/ . You can parse PubMed data using https://github.com/titipata/pubmed_parser . As I mentioned before, for the number of entities, I only chose entities that exist in the training set. Besides, some entities don't have a link to the three types (gene, disease, chemical) of entities we have.
from paperrobot.
@EagleW thanks
Yes, I agree with you on the batch size problem. For the concatenation, the original graph attention paper is https://arxiv.org/pdf/1710.10903.pdf. Their implementation is at https://github.com/Diego999/pyGAT/blob/master/layers.py lines 31-32. I think our implementation is equivalent to theirs because our addition broadcasts automatically.
Top-10 is used in the test phase. For training, I just extract all entities in the paragraph and try to find potential triples (with head and tail among those entities).
Hahah, thanks for your work! PaperRobot seems to handle paper-writing style very well, e.g. sentences like "The results of this study demonstrate ...", "The results of this study ...", "The objective of this study was to ...", and so on.
Two problems still exist:
1: The NER problem. Since the PubTator API is offline, it can't extract entities from a raw input sentence, so NER is limited to searching the entity dict. That is time-consuming, and it can't handle slight changes in an entity term (if edit distance is used, it will take even longer).
2: How to define and get the related entities before writing a sentence? The related entities seem to be the key factor. I can see in your input.py script in the writing stage that, when a raw sentence is input, the related items are entered by the users.
Hi @stevenberg1, thank you very much for your interest in our work! We first split the PubMed papers we crawled into training, validation, and test sets. For the text corresponding to an entity, we use PubTator to annotate all abstracts and titles available in our dataset. We then extract the sentences which contain that entity and use the nltk tokenizer to tokenize them. For the term.pth, we also use PubTator to annotate all abstracts and titles in the whole dataset. We then match those mentions to MeSH IDs. After that, we establish relations between entities by looking up the CTD data. We then split those relations based on whether both the head and the tail exist in the training PubMed paper dataset. If a relation doesn't belong to the training set, we put it in the validation or test set.
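The sentence-extraction step described here can be sketched roughly as follows (an illustrative sketch, not the repo's preprocessing code: `mention_index` and `documents` are assumed inputs, and a naive regex splitter stands in for the nltk tokenizers):

```python
import re

def sentences_for_entity(mesh_id, mention_index, documents):
    """Collect tokenized sentences that mention any surface form of an entity.

    mention_index: MeSH ID -> list of surface mentions (e.g. from PubTator).
    documents: list of title/abstract strings.
    A naive regex splitter stands in for the nltk tokenizers used in the paper.
    """
    mentions = {m.lower() for m in mention_index.get(mesh_id, [])}
    collected = []
    for doc in documents:
        # crude sentence split on ., !, or ? followed by whitespace
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            lowered = sent.lower()
            if any(m in lowered for m in mentions):
                collected.append(lowered.split())
    return collected
```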
@EagleW thanks for your quick reply. What is the strategy for crawling the PubMed papers, and what types of papers do you crawl? Can you share some API? Also, I find that entity_text_title_tokenized.json has 30000+ entities, but entity2id seems to have only 14857?
@EagleW I want to ask some other questions. Thanks for your suggestions: I've learned to use PubTator to extract entities, and I've tried pubmed_parser; it can extract title, abstract, pmid, and so on. But some questions still confuse me.
1: The text corresponding to an entity: how is it defined? In entity_text_title_tokenized.json, an entity such as D000255 has 1000+ tokenized sentences. How were these sentences obtained? I can find PubMed at https://www.nlm.nih.gov/databases/download/pubmed_medline.html, but since the data is quite large, I'm not sure which file to download to get the sentences related to this entity.
2: In the new-paper-writing module, after we train a model such as the pubmed_abstract model, we can run inference with the input.py script. But terms are also needed as input, besides Sources, before getting Output. How do we choose the terms?
3: In the new-paper-writing module, one JSON string contains title (as the source string), abs (as the target string), term (how is it generated??), and words (I guess words is the union of the source and target words?). Confused...
1. As I previously said, if a paper's title or abstract contains that entity, it is considered a relevant sentence. I downloaded all the PubMed data. You just need to run PubTator on those papers to get the entities and link them to the CTD dataset. You can check my paper for more details.
2/3. First, we build a KB from the entities extracted from the training dataset and their corresponding relations from the CTD dataset. We trained a link-prediction model and used it to enhance the original KB. For the dataset used for paper writing, the terms are predicted as follows: given a title, we run PubTator on it to extract the entities contained in the previously enhanced KB. We use those entities to find the top-10 most relevant neighbors. We consider those neighbors plus the extracted entities as the terms.
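A minimal sketch of that term-selection step (illustrative only; `neighbors` and `score` are assumed stand-ins for the enhanced KB and the link-prediction scores, not the repo's actual API):

```python
def predict_terms(title_entities, neighbors, score, k=10):
    """Return the title entities plus each one's top-k most relevant neighbors.

    title_entities: entity IDs PubTator found in the title (and in the KB).
    neighbors: entity ID -> iterable of neighbor IDs in the enhanced KB.
    score: (entity, neighbor) -> relevance, e.g. from a link-prediction model.
    """
    terms = set(title_entities)
    for e in title_entities:
        ranked = sorted(neighbors.get(e, []), key=lambda n: score(e, n),
                        reverse=True)
        terms.update(ranked[:k])  # keep only the top-k neighbors per entity
    return terms
```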
@EagleW I want to ask a trivial question.
entity_text_title_tokenized.json has 30000+ entities, but entity2id seems to have only 14857 entities.
You mentioned before that, for the number of entities, you only chose entities that exist in the training set, and that besides, some entities don't have a link to the three types (gene, disease, chemical) of entities.
But besides train2id.txt, the two entities in a relation in test2id.txt and valid2id.txt also have the new IDs, and those IDs are created in entity2id.txt. You also mentioned that entity2id.txt's head and tail entities all come from the training set. So I'm confused about how train2id.txt, test2id.txt, and valid2id.txt are produced.
In my opinion, entity_text_title_tokenized.json has 30000+ entities, and those entities come from all papers (train, val, and test sets, covering the three categories: chemical, disease, gene); term.pth has almost 80000 entities (all categories); and entity2id.txt has 14000+ (only from the training set?? confused...). And because the edge vertices in train2id.txt, test2id.txt, and valid2id.txt all share the same newly created increasing IDs, the vertex IDs of all three datasets (train, val, test) are in entity2id.txt.
Is anything wrong? Just eagerly asking for help...
Sorry for the confusion. I think what you said is almost correct. Yes, entity_text_title_tokenized and terms do have many entities that are not in those three categories. entity2id does include all entities from train2id.txt, test2id.txt, and valid2id.txt.
But entity_text_title_tokenized has only 30000+, while terms.pth has 80000+ (maybe all categories). I'm not sure whether entity_text_title_tokenized also covers all categories, or just the three categories. Why is it only 30000+ and smaller than terms.pth?
I can understand that train2id.txt's head and tail vertices all come from the training set, to avoid data leakage.
But where does entity_text_title_tokenized.json come from? The whole paper dataset, or just the training set? If from the training set, why is entity2id smaller than entity_text_title_tokenized.json?
term.pth can certainly come from the whole paper dataset.
terms.pth actually could have some entities that are not included in any of the datasets, because it includes almost all entities in the CTD dataset. Besides, I think entity_text_title_tokenized only contains entities of the three types. Sorry, another author wrote this preprocessing code, so I may not have explained the procedure very clearly.
Emm, thanks for your reply. Looking forward to your team releasing the preprocessing code one day...
Hi Qingyun, it seems the paper-reading model has gotten much bigger than two months ago? Two months ago I could train the paper-reading model without changing any parameters, but today, even after changing the batch_size from 200 to 5, I still get RuntimeError: CUDA out of memory. My NVIDIA card is a 1080 with 8GB of memory.
Hi @stevenberg1, sorry, I found a bug in my code last month: when generating corrupted triples, I failed to consider the case of replacing the tail because of a wrong index. I fixed it in the utils. Did you pull the latest version? Is that why the code no longer works?
Yeah, I've pulled the latest code; in Existing paper reading/utils/utils.py it is given as h = tri[0], t = tri[1].
It is still prone to the RuntimeError... but I don't think this changed code causes the problem. Maybe some other place runs out of CUDA memory.
Yes, I think this change increases the number of corrupted tuples. Maybe you can reduce the batch size to 1?
Emm... if batch size == 1, doesn't the training process become very unstable? OK, I will debug it and check whether the number of corrupted tuples increases much compared to before.
Maybe change it to 2? Or you can just keep bern at the previous version so the model won't consider negative examples formed by replacing the tail. I don't think this will change performance much.
Hi Qingyun, I've debugged this: h = tri[0], t = tri[1] (tri[1] represents the tail node; before, it was h = tri[0], t = tri[2], but tri[2] represents the relation ID, so t = tri[1] is correct). Also, the number of generated corrupted triples equals the number of positives; the positive number is 800,000+, just as before. So the number really hasn't changed; only the index values got bigger?
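For readers following along, a generic corrupted-triple sampler of the kind being discussed looks roughly like this (an illustrative sketch, not the repo's utils code; a uniform 50/50 head/tail choice stands in for the bern strategy):

```python
import random

def corrupt(tri, num_entities, rng=random):
    """Corrupt one positive triple into a negative by replacing head or tail.

    Indexing follows the thread: tri[0] = head, tri[1] = tail, tri[2] = relation.
    """
    h, t, r = tri[0], tri[1], tri[2]
    if rng.random() < 0.5:
        h = rng.randrange(num_entities)  # replace the head
    else:
        t = rng.randrange(num_entities)  # replace the tail (the fixed branch)
    return (h, t, r)
```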
Yes, tri[1] represents the tail. I just reviewed the code. Which PyTorch version are you currently using?
torch 1.1.0
torchvision 0.3.0
Another thing is that I changed get_subgraph in utils.py. The previous subgraph implementation had some problems. Anyway, I think you can reduce the batch size to fix the problem.
Thanks, I am debugging the get_subgraph function and the DataLoader with collate_fn, trying to understand them.
Hi Qingyun, I see you use the graph attention network to capture the importance of a neighbor's features to the entity, i.e. self-attention. When you compute the similarity between f_1 and f_2, your paper uses the concatenation formulation, while the program uses these three lines:
f_1 = h @ self.a1
f_2 = h @ self.a2
e = self.leakyrelu(f_1 + f_2.transpose(0, 1)) # node_num * node_num
Is the sum of f_1 (N * 1) and f_2 transposed (1 * N) the concatenation you apply? The similarity of two vectors A and B is usually computed in one of three ways: 1: dot product, 2: multiplicative (Aᵀ W B), 3: concatenation.
The code f_1 + f_2.transpose(0, 1) looks different from all three ways?
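For what it's worth, the broadcasting trick can be checked numerically: if the attention vector a is split into halves a1 and a2, then aᵀ[h_i ‖ h_j] = h_i·a1 + h_j·a2, so f_1 + f_2.T reproduces exactly the concatenation-based scores. A small NumPy check (illustrative, independent of the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 4, 8                        # number of nodes, transformed feature dim
h = rng.standard_normal((N, F))    # h = W x: already-transformed node features
a = rng.standard_normal(2 * F)     # attention vector applied to [h_i || h_j]
a1, a2 = a[:F, None], a[F:, None]  # split into the halves acting on h_i, h_j

f1 = h @ a1                        # (N, 1)
f2 = h @ a2                        # (N, 1)
e_broadcast = f1 + f2.T            # (N, N) score matrix via broadcasting

# explicit concatenation version: e[i, j] = a^T [h_i || h_j]
e_concat = np.array([[np.concatenate([h[i], h[j]]) @ a for j in range(N)]
                     for i in range(N)])

assert np.allclose(e_broadcast, e_concat)
```

(The LeakyReLU is applied elementwise afterwards, so it preserves the equivalence.)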
And I've found why the paper-reading training process gets so slow after the get_subgraph modification. For example, with a batch_size of 100, you feed in 100 triples containing 200 nodes, and those 200 nodes have a lot of neighbors. In debugging I found that the 200 nodes have 8000+ neighbors, while all entities number 14000+, so more than half of the whole entity set participates in the GAT computation in a single batch.
So it consumes both batch time and GPU memory.
So the best way is to fix the batch size to a very small value...
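The blow-up described above is easy to reproduce with a toy one-hop expansion (an illustrative sketch, not the actual get_subgraph): even a few batch nodes with high-degree neighbors can pull in a large fraction of the KB.

```python
def one_hop_nodes(batch_triples, adjacency):
    """Return the batch's head/tail nodes plus all of their one-hop neighbors.

    batch_triples: (head, tail, relation) tuples.
    adjacency: node -> iterable of neighbor nodes.
    """
    nodes = set()
    for head, tail, _ in batch_triples:
        nodes.update((head, tail))
    expanded = set(nodes)
    for n in nodes:
        expanded.update(adjacency.get(n, ()))  # one-hop expansion
    return expanded
```

With a single hub node of degree 9, one triple already pulls in all 10 nodes of the toy graph; in a dense biomedical KB the same effect makes the subgraph grow far faster than the batch size.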
1: curl -L https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi/Chemical/19894120/JSON/ — this API works well.
2: curl -d '{"sourcedb":"PubMed","sourceid":"1000001","text":"A kinetic model identifies phosphorylated estrogen receptor-a (ERa) as a critical regulator of ERa dynamics in breast cancer."}' https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi/tmChem/Submit/
returns session id: 6096-8616-8473-7524
curl https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi/6096-8616-8473-7524/Receive/
[Warning] : The Result is not ready.
It always returns this...
So extracting entities from a single sentence (no pmid, just a raw sentence) with PubTator seems not to work.
Hi @EagleW, am I calling this API wrong?
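In case it helps, the three curl calls above can be wrapped like this (URL/payload builders only, mirroring the endpoints quoted above so they can be driven by any HTTP client; the async Submit/Receive flow means you have to keep polling Receive while it returns the "Result is not ready" warning):

```python
import json

BASE = "https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi"

def annotate_url(trigger, pmid):
    """URL for pre-annotated PubMed articles, e.g. trigger='Chemical'."""
    return f"{BASE}/{trigger}/{pmid}/JSON/"

def submit_payload(text, sourceid="1000001"):
    """JSON body for submitting raw text to the tmChem Submit endpoint."""
    return json.dumps({"sourcedb": "PubMed", "sourceid": sourceid, "text": text})

def receive_url(session_id):
    """URL to poll for results with the session id returned by Submit."""
    return f"{BASE}/{session_id}/Receive/"
```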
@stevenberg1 Actually we use the entire annotation of PubTator instead of the API.
See ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/
Yeah, I downloaded it before, haha. When you define the subgraph, you just input a triple list and replace the head or tail with the head's or tail's adjacent nodes. But when a raw sentence is input, PubTator only extracts the nodes, not triples with two nodes and a relation. So after extracting the nodes, there seem to be no triples; how do you define the top-10 related entities then? Just the extracted entity's adjacent nodes? That seems not ideal.
So the way is: when some entity is extracted from the sentence, get its adjacent node list from the graph, produce triples from the entity and its adjacent nodes, and then call get_subgraph to produce the one-hop related entities?
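If that reading is right, the inference-time flow might be sketched as (hypothetical names, not the repo's API; `adjacency` stands in for the KB's edge lists):

```python
def candidate_triples(extracted, adjacency):
    """Pair each extracted entity with its KB neighbors to form seed triples.

    extracted: entity IDs found in the raw sentence (e.g. via PubTator).
    adjacency: entity -> list of (neighbor, relation) pairs from the KB.
    Entities absent from the KB contribute no triples.
    """
    triples = []
    for e in extracted:
        for neighbor, relation in adjacency.get(e, []):
            triples.append((e, neighbor, relation))
    return triples
```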
Yeah, based on your work, I built a web demo for writing: http://autowrite.natapp1.cc/ . This web server runs on a single PC, so inference performance is poor.
Good to hear it! @stevenberg1