vj1494 / pipelineie

PipelineIE is a project that contains a pipeline for information extraction (currently triples) from free text and domain-specific text (e.g. the biomedical domain). It also supports custom models, which makes it flexible enough to cover other domains. It handles coreference resolution and entity resolution, and lets you test different tools for each.

License: MIT License

Python 52.38% Jupyter Notebook 47.62%
natural-language-processing text-mining coreference-resolution entity information-extraction knowledge-graph pipeline natural-language-understanding corenlp spacy

pipelineie's Introduction

PipelineIE

PipelineIE is an information extraction pipeline, primarily based on spaCy, that extracts information from free text. It can be run as a general-purpose pipeline or as a domain-specific one, for example for the biomedical domain.

Currently the pipeline extracts information in the form of triplets. It consists of the following stages: Coreference Resolution (Stanford CoreNLP / neuralcoref) >> Sentence Simplification, which decomposes complex sentences into simple sentences >> Entity Linking (spaCy / ScispaCy / custom spaCy model) >> Triplet Extraction (currently a Subject - Verb - Object rule using textacy).
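
The stage order above can be sketched with toy stand-ins. This is only an illustration of how the stages feed into each other, not PipelineIE's implementation: every function name here is hypothetical, the coreference step takes the antecedent as an argument instead of inferring it, and entity linking is reduced to a membership check against a given entity list.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]


def resolve_coreferences(text: str, antecedent: str) -> str:
    """Toy coreference step: replace the pronoun 'it' with a given antecedent."""
    return re.sub(r"\b[Ii]t\b", antecedent, text)


def simplify_sentences(text: str) -> List[str]:
    """Toy simplification step: split on full stops and on the conjunction 'and'."""
    parts = re.split(r"\.\s*|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]


def extract_triplet(sentence: str, entities: List[str]) -> List[Triple]:
    """Toy SVO step: first word is the subject, last two words are verb and object.

    Subject and object must both be known entities, otherwise nothing is returned.
    """
    words = sentence.split()
    if len(words) < 3:
        return []
    subject, verb, obj = words[0], words[-2], words[-1]
    if subject in entities and obj in entities:
        return [(subject, verb, obj)]
    return []


def run_pipeline(text: str, antecedent: str, entities: List[str]):
    """Run the stages in order: coreference, simplification, triplet extraction."""
    resolved = resolve_coreferences(text, antecedent)
    simple_sentences = simplify_sentences(resolved)
    return [(s, extract_triplet(s, entities)) for s in simple_sentences]


rows = run_pipeline(
    "Aspirin reduces fever. It relieves pain and it thins blood.",
    antecedent="Aspirin",
    entities=["Aspirin", "fever", "pain", "blood"],
)
for sentence, triplets in rows:
    print(sentence, triplets)
```

Because the pronoun is resolved before the sentence is split, each simplified sentence carries the original entity as its subject, which is the reason for running coreference resolution first.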

How does it help? / What problem does it solve?

  1. Coreferences in the text must be resolved before entities and triplets are extracted, so that the triplets contain the original entities rather than pronouns.
  2. Usually, the subject and object do not represent the complete entity (which can be a sequence of many words) and might only represent a substring of the original entity. The Entity Linker in the pipeline helps to solve this problem while extracting triplets.
  3. Complex sentences make it difficult to extract information from text. The pipeline addresses this by decomposing complex sentences into simple sentences.
  4. Finally, in a few lines, anyone can extract triplets from text using the default pipeline or the biomedical pipeline, which take care of the problems above. A custom pipeline can also be used, making it easy to try different options on the input data.
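
Point 2 can be illustrated with a small sketch. It assumes a list of entity strings is already available (in PipelineIE these would come from the entity linking stage) and expands a subject or object fragment to the longest entity that contains it. The helper names and the entity list are hypothetical and are not taken from the PipelineIE code.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def contains_tokens(entity: str, fragment: str) -> bool:
    """True if the fragment's tokens appear as a contiguous run inside the entity."""
    e = entity.lower().split()
    f = fragment.lower().split()
    if not f:
        return False
    return any(e[i:i + len(f)] == f for i in range(len(e) - len(f) + 1))


def expand_to_entity(fragment: str, entities: List[str]) -> str:
    """Return the longest entity containing the fragment, or the fragment itself."""
    matches = [e for e in entities if contains_tokens(e, fragment)]
    return max(matches, key=len) if matches else fragment


def expand_triplet(triplet: Triple, entities: List[str]) -> Triple:
    """Expand the subject and object of a triplet; the verb is left unchanged."""
    subject, verb, obj = triplet
    return (expand_to_entity(subject, entities), verb, expand_to_entity(obj, entities))


# Hypothetical entity list for the example sentence used in the Usage section.
entities = ["NK cells", "NF-kappaB-dependent promoter activity"]
raw = ("Co-culture", "enhanced", "activity")
print(expand_triplet(raw, entities))
```

A fragment that matches no entity, such as the subject here, is returned unchanged, so the expansion never discards a triplet.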

Installation

Install neuralcoref from source as described below (taken from the neuralcoref GitHub repository):

python -m venv .env
source .env/bin/activate
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .

Optional: Download and unzip CoreNLP 4.2.0 if CoreNLP is to be used for coreference resolution.

Install PipelineIE

git clone https://github.com/vj1494/PipelineIE.git
cd PipelineIE
pip install -r requirements.txt
pip install -e .

Usage

Biomedical Pipeline

from pipeline_ie.pipeline_ie import PipelineIE

text = "Co-culture of NK cells with transfected EC enhanced E-selectin, IL-8, and NF-kappaB-dependent promoter activity."

# Biomedical PipelineIE
# The default biomedical pipeline uses the ScispaCy en_core_sci_lg model.
# The same model is used for neuralcoref, entity linking and triple extraction.
# pipeline="default" uses the spaCy en model.
# Sentence simplification is enabled by default; pass sentence_simplify=False to disable it.
pie = PipelineIE(text, pipeline="biomedical")

# Returns a dataframe
df = pie.pipeline_triplet()

Additional Usage and Example

Please refer to the example for Additional Usage.

License

MIT

Credits

Sentence Simplification - (https://github.com/freyamehta99/Sentence-Simplification)

pipelineie's People

Contributors

vj1494

Forkers

wrobert

pipelineie's Issues

empty triple

                     Sentences Triplet
1  the park does not have beds      []

I am using pipelineie as shown in the example in the README file. However, many sentences come back with an empty triple extraction. How can I fix this?

KeyError: 'parse'

I followed all the steps under Installation. When I try to run PipelineIE/PipelineIE.ipynb, I get this error:

KeyError                                  Traceback (most recent call last)
PipelineIE/PipelineIE.ipynb in <module>
      8 
      9 #Returns a dataframe
---> 10 pie.pipeline_triplet()

~/Downloads/PipelineIE/pipeline_ie/pipeline_ie.py in pipeline_triplet(self)
     93             annotation = 'parse'
     94             sentence_simp = SentenceSimplify(sentences,annotation,memory,timeout)
---> 95             sentences = sentence_simp.sentence_simplify()
     96 
     97         #Entity Linking

~/Downloads/PipelineIE/pipeline_ie/sentence_simplification.py in sentence_simplify(self)
    106                 sentence = re.sub(r"(\.|,|\?|\(|\)|\[|\])", " ", sentence)
    107                 ann = client.annotate(sentence)
--> 108                 clause_list = self.get_clause_list(ann)
    109                 if not clause_list:
    110                     decomposed_sent.append(sentence)

~/Downloads/PipelineIE/pipeline_ie/sentence_simplification.py in get_clause_list(self, ann)
     62     def get_clause_list(self,ann):
     63         print(ann["sentences"][0])
---> 64         sent_tree = nltk.tree.ParentedTree.fromstring(ann["sentences"][0]["parse"])
     65         clause_level_list = ["S", "SBAR", "SBARQ", "SINV", "SQ"]
     66         clause_list = []

KeyError: 'parse'
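
The traceback shows that the first sentence of the annotation has no "parse" entry. One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that the CoreNLP response was produced without the constituency parse annotator. A minimal defensive sketch of the lookup follows; get_parse_string is a hypothetical helper and is not part of PipelineIE.

```python
from typing import Any, Dict, Optional


def get_parse_string(ann: Dict[str, Any]) -> Optional[str]:
    """Return the parse string of the first sentence, or None if it is missing.

    `ann` is assumed to be the JSON-like dict indexed in the traceback above,
    i.e. ann["sentences"][0]["parse"].
    """
    sentences = ann.get("sentences") or []
    if not sentences:
        return None
    return sentences[0].get("parse")


# An annotation without a parse no longer raises KeyError.
print(get_parse_string({"sentences": [{"index": 0}]}))
# An annotation with a parse returns the bracketed tree string.
print(get_parse_string({"sentences": [{"parse": "(ROOT (S (NP) (VP)))"}]}))
```

A caller could treat a None result the same way the existing code treats an empty clause list, keeping the sentence undecomposed instead of failing.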
