
scispacy's Introduction

This repository contains custom pipes and models related to using spaCy for scientific documents.

In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data, and an entity span detection model. Separately, there are also NER models for more specific tasks.

Just looking to test out the models on your data? Check out our demo (Note: this demo is running an older version of scispaCy and may produce different results than the latest version).

Installation

Installing scispacy requires two steps: installing the library and installing the models. To install the library, run:

pip install scispacy

To install a model (see our full selection of available models below), run a command like the following:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz

Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy. Take a look below in the "Setting up a virtual environment" section if you need some help with this. Additionally, scispacy uses modern features of Python and as such is only available for Python 3.6 or greater.

Setting up a virtual environment

Conda can be used to set up a virtual environment with the version of Python required for scispaCy. If you already have a Python environment you want to use, you can skip to the 'installing via pip' section.

  1. Follow the installation instructions for Conda.

  2. Create a Conda environment called "scispacy" with Python 3.9 (any version >= 3.6 should work):

    conda create -n scispacy python=3.9
  3. Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.

    source activate scispacy

Now you can install scispacy and one of the models using the steps above.

Once you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")

Note on upgrading

If you are upgrading scispacy, you will need to download the models again, to get the model versions compatible with the version of scispacy that you have. The link to the model that you download should contain the version number of scispacy that you have.

Available Models

To install a model, click on the link below to download the model, and then run

pip install </path/to/download>

Alternatively, you can install directly from the URL by right-clicking on the link, selecting "Copy Link Address" and running

pip install <paste the copied URL>
Model | Description | Install URL
en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. | Download
en_core_sci_md | A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. | Download
en_core_sci_lg | A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. | Download
en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. You may want to use a GPU with this model. | Download
en_ner_craft_md | A spaCy NER model trained on the CRAFT corpus. | Download
en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus. | Download
en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | Download
en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | Download

Additional Pipeline Components

AbbreviationDetector

The AbbreviationDetector is a spaCy component which implements the abbreviation detection algorithm in "A simple algorithm for identifying abbreviation definitions in biomedical text" (Schwartz & Hearst, 2003).

You can access the list of abbreviations via the doc._.abbreviations attribute, and for a given abbreviation you can access its long form (which is a spacy.tokens.Span) using span._.long_form, which will point to another span in the document.

Example Usage

import spacy

from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
	print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation	 Span	    Definition
>>> SBMA 		 (33, 34)   Spinal and bulbar muscular atrophy
>>> SBMA 	   	 (6, 7)     Spinal and bulbar muscular atrophy
>>> AR   		 (29, 30)   androgen receptor

Note: If you want to be able to serialize your doc objects, load the abbreviation detector with make_serializable=True, e.g. nlp.add_pipe("abbreviation_detector", config={"make_serializable": True}).
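
A minimal sketch of what that enables, assuming a fresh pipeline (with make_serializable=True the abbreviation data is stored in a form that survives Doc.to_bytes / Doc.from_bytes, which otherwise fail on plain Span objects in user_data):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("abbreviation_detector", config={"make_serializable": True})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease.")
doc_bytes = doc.to_bytes()                       # serialize the doc, abbreviations included
restored = Doc(nlp.vocab).from_bytes(doc_bytes)  # round-trip it back
print(restored._.abbreviations)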

EntityLinker

The EntityLinker is a spaCy component which performs linking to a knowledge base. The linker simply performs a string overlap-based search (character 3-grams) on named entities, comparing them with the concepts in a knowledge base using an approximate nearest neighbours search.

Currently (v2.5.0), there are 5 supported linkers:

  • umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.
  • mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.
  • rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
  • go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
  • hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

You may want to play around with some of the parameters below to adapt to your use case (higher precision, higher recall, etc.).

  • resolve_abbreviations : bool, optional (default = False) Whether to resolve abbreviations identified in the Doc before performing linking. This parameter has no effect if there is no AbbreviationDetector in the spacy pipeline.
  • k : int, optional, (default = 30) The number of nearest neighbours to look up from the candidate generator per mention.
  • threshold : float, optional, (default = 0.7) The similarity threshold that a candidate must reach to be attached to a mention in the Doc.
  • no_definition_threshold : float, optional, (default = 0.95) The (higher) threshold that a candidate must reach to be attached to a mention in the Doc if the candidate does not have a definition in the knowledge base.
  • filter_for_definitions: bool, default = True Whether to filter entities that can be returned to only include those with definitions in the knowledge base.
  • max_entities_per_mention : int, optional, default = 5 The maximum number of entities which will be returned for a given mention, regardless of how many nearest neighbours are found.

This class sets the ._.kb_ents attribute on spaCy Spans to a List[Tuple[str, float]] of (KB concept_id, score) pairs, with at most max_entities_per_mention entries per mention.

You can look up more information for a given id using the kb attribute of this class:

print(linker.kb.cui_to_entity[concept_id])

Example Usage

import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Let's look at a random entity!
entity = doc.ents[1]

print("Name: ", entity)
>>> Name: bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
	print(linker.kb.cui_to_entity[umls_ent[0]])


>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
  				gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0541794, Name: Skeletal muscle atrophy
>>> Definition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content,
                fiber diameter, force production and fatigue resistance in response to ...
>>> TUI(s): T046
>>> Aliases: (total: 9):
         Skeletal muscle atrophy, ATROPHY SKELETAL MUSCLE, skeletal muscle atrophy, ....

>>> CUI: C1447749, Name: AR protein, human
>>> Definition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene.
                This protein plays a role in the modulation of steroid-dependent gene transcription.
>>> TUI(s): T116, T192
>>> Aliases (abbreviated, total: 16):
         AR protein, human, Androgen Receptor, Dihydrotestosterone Receptor, AR, DHTR, NR3C4, ...

Hearst Patterns (v0.3.0 and up)

This component implements Automatic Acquisition of Hyponyms from Large Text Corpora (Hearst, 1992) using the spaCy Matcher component.

Passing extended=True to the HyponymDetector will use the extended set of Hearst patterns, which adds higher-recall but lower-precision hyponymy relations (e.g. X compared to Y, X similar to Y, etc.).

This component produces a doc-level attribute on the spaCy doc, doc._.hearst_patterns, which is a list of tuples of extracted hyponym pairs. Each tuple contains:

  • The relation rule used to extract the hyponym (type: str)
  • The more general concept (type: spacy.Span)
  • The more specific concept (type: spacy.Span)

Usage:

import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

print(doc._.hearst_patterns)
>>> [('such_as', Keystone plant species, fig trees)]

Citing

If you use ScispaCy in your research, please cite ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Additionally, please indicate which version and model of ScispaCy you used so that your research can be reproduced.

@inproceedings{neumann-etal-2019-scispacy,
    title = "{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing",
    author = "Neumann, Mark  and
      King, Daniel  and
      Beltagy, Iz  and
      Ammar, Waleed",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5034",
    doi = "10.18653/v1/W19-5034",
    pages = "319--327",
    eprint = {arXiv:1902.07669},
    abstract = "Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.",
}

ScispaCy is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.


scispacy's Issues

Can scispacy models be used in spacy without having installed the additional requirements?

I noticed that the models can be loaded into spacy without the need to install the scispacy package and all its requirements (i.e. just install spacy on its own and load your models). Is there any problem with this? What do the other requirements do and can we use the models without installing these packages?

numpy
spacy>=2.1.3
pandas
awscli
conllu

# Candidate generation and entity linking
joblib
nmslib>=1.7.3.6
scikit-learn>=0.20.3

Override definition filtering for exact match

The entity linker defaults to filtering out entities that don't have definitions in UMLS; we should at least override this filtering when a mention is an exact match for a UMLS entity.
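
In the meantime, a possible workaround sketch using the filter_for_definitions parameter documented above is to turn the definition filtering off entirely when configuring the linker:

nlp.add_pipe("scispacy_linker", config={"linker_name": "umls", "filter_for_definitions": False})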

No module named 'scispacy.custom_sentence_segmenter'; 'scispacy' is not a package

I am getting the following error:
Traceback (most recent call last):
File "scispacy.py", line 2, in <module>
import scispacy
File "/Users/shai26/office/spacy/scispacy/scispacy.py", line 5, in <module>
nlp = spacy.load("en_core_sci_sm")
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/__init__.py", line 21, in load
return util.load_model(name, **overrides)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/util.py", line 114, in load_model
return load_model_from_package(name, **overrides)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/util.py", line 134, in load_model_from_package
cls = importlib.import_module(name)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/en_core_sci_sm/__init__.py", line 7, in <module>
from scispacy.custom_sentence_segmenter import combined_rule_sentence_segmenter
ModuleNotFoundError: No module named 'scispacy.custom_sentence_segmenter'; 'scispacy' is not a package

Issue determining symptoms and affected body parts: e.g. arms

I am trying to determine a way to retrieve symptoms and the affected body parts when a text is analyzed. I am using en_ner_bc5cdr_md and en_ner_bionlp13cg_md to retrieve this information. But I do see some issues. For example:

text ="I have numbness in my arm and leg"

import scispacy
import spacy
symptom_nlp = spacy.load("en_ner_bc5cdr_md") 
organ_nlp = spacy.load("en_ner_bionlp13cg_md")


text ="I have numbness in my arm and leg"

doc_symptoms = symptom_nlp(text)
doc_organs = organ_nlp(text)
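
Printing the entities with the standard spaCy entity attributes:

for ent in doc_symptoms.ents:
    print(ent.text, ent.label_)
for ent in doc_organs.ents:
    print(ent.text, ent.label_)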

I get the following results:

Symptoms:
numbness DISEASE
Organ:
arm CANCER
leg ORGAN

I never thought of any correlation between arm and cancer. Also, when I just use text = "I have numbness in my arm", the word arm is not even detected. Also, if I change to text = "I have numbness in my arm and leg and toes" I get the following:

Symptoms:
numbness DISEASE
Organ:
arm PATHOLOGICAL_FORMATION
leg ORGAN

Any ideas why? Any help would be appreciated. Thank you.

I am using
spacy Version: 2.1.3
scispacy Version: 0.2.0

Sentence segmenter can not be loaded

Hello,

I ran the following commands to train the parser:

bash ./scripts/base_model.sh small base_model
bash ./scripts/parser.sh base_model parser

When I use the following commands to load the trained parser, the sentence segmenter does not work:

import spacy
nlp = spacy.load('parser/best')
x = nlp('Hello. Hello.')
print(len(list(x.sents)))  # Output: 1, Expected: 2

However, when I followed the code and added the segmenter to the pipeline, the sentence segmenter still does not work:

import spacy
from scispacy.custom_sentence_segmenter import combined_rule_sentence_segmenter
nlp = spacy.load('parser/best')
nlp.add_pipe(combined_rule_sentence_segmenter, first=True)
x = nlp('Hello. Hello.')
print(len(list(x.sents)))  # Output: 1, Expected: 2

What should I do to add the sentence segmenter into pipeline?

Thanks!

Abbreviations and UMLS linking

Here's a test sentence:

Human induced pluripotent stem cells (hiPSC) are generated from reprogrammed fibroblasts by overexpression of pluripotency factors (Takahashi et al., 2007; Yu et al., 2007).

The abbreviation detector correctly identifies hiPSC and "Human induced pluripotent stem cells". Also, "Human induced pluripotent stem cells" is in UMLS as CUI C3658289. However, the UMLS linker does not find that code. Instead of the long form of the abbreviation being used (which is associated with the document), the linker is using the entities from the mention detector. In this case it had found the mentions (Human, induced, pluripotent stem cells, hiPSC, fibroblasts, overexpression, pluripotency factors, Takahashi).
The result is that the abbreviation hiPSC gets candidate codes C2717959 and C0872076, which are for 'Induced Pluripotent Stem Cells' and 'Pluripotent Stem Cells', respectively.

It may be good to have an early step in the UMLS linker that looks for document-level abbreviations. If it finds some then exclude those spans from consideration when looking for non-abbreviated mentions. (Or maybe let people ask for nested concepts, in which case the abbreviation spans would not be excluded).

OSError: [E050] Can't find model 'en_core_sci_sm'

Hi,
I followed the steps here https://allenai.github.io/scispacy/ but when I try to run the example, it says

File "example.py", line 4, in
nlp = spacy.load("en_core_sci_sm")
File "/home/darren/anaconda3/lib/python3.7/site-packages/spacy/init.py", line 27, in load
return util.load_model(name, **overrides)
File "/home/darren/anaconda3/lib/python3.7/site-packages/spacy/util.py", line 136, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_sci_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Abbreviation Detector and UMLS Linker for en_core_sci_md don't return anything

I am using en_core_sci_md for AbbreviationDetector and ran a quick test using the same sentence in the README.md but no result is returned for the following code snippet:

for abrv in doc._.abbreviations:
	print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

Similarly, no result is returned for the UMLS entity linker:

for umls_ent in entity._.umls_ents:
	print(linker.umls.cui_to_entity[umls_ent[0]])

I have followed all previous steps mentioned for both code snippets. Is this an issue with en_core_sci_md? I thought that this was just a larger version of _sm.

Update: Tested both of the above with the _sm model, but results are only printed for the AbbreviationDetector and not for the UMLS entity linker.

`Found array with 0 sample(s)`

Lucy's team ran into this bug during the hackathon

>>> nlp("hydroxytryptophan")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-231-7e41c1b0131c> in <module>
----> 1 nlp("hydroxytryptophan")

//anaconda/envs/scispacy/lib/python3.6/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    393             if not hasattr(proc, "__call__"):
    394                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 395             doc = proc(doc, **component_cfg.get(name, {}))
    396             if doc is None:
    397                 raise ValueError(Errors.E005.format(name=name))

//anaconda/envs/scispacy/lib/python3.6/site-packages/scispacy/umls_linking.py in __call__(self, doc)
     85 
     86         mention_strings = [x.text for x in mentions]
---> 87         batch_candidates = self.candidate_generator(mention_strings, self.k)
     88 
     89         for mention, candidates in zip(doc.ents, batch_candidates):

//anaconda/envs/scispacy/lib/python3.6/site-packages/scispacy/candidate_generation.py in __call__(self, mention_texts, k)
    201         if self.verbose:
    202             print(f'Generating candidates for {len(mention_texts)} mentions')
--> 203         tfidfs = self.vectorizer.transform(mention_texts)
    204         start_time = datetime.datetime.now()
    205 

//anaconda/envs/scispacy/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents, copy)
   1679 
   1680         X = super().transform(raw_documents)
-> 1681         return self._tfidf.transform(X, copy=False)
   1682 
   1683     def _more_tags(self):

//anaconda/envs/scispacy/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, X, copy)
   1300         vectors : sparse matrix, [n_samples, n_features]
   1301         """
-> 1302         X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES, copy=copy)
   1303         if not sp.issparse(X):
   1304             X = sp.csr_matrix(X, dtype=np.float64)

//anaconda/envs/scispacy/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    548                              " minimum of %d is required%s."
    549                              % (n_samples, array.shape, ensure_min_samples,
--> 550                                 context))
    551 
    552     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 53479)) while a minimum of 1 is required.

most/least similar tokens (like gensim)?

Is there some way to return the n most/least similar tokens to a given token in the en_core_sci_md vocabulary? In gensim the most_similar method allows you to do this.
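
scispaCy does not ship a gensim-style most_similar, but a brute-force sketch over the model's vector table is straightforward (standard spaCy Vectors attributes, cosine similarity via numpy; note that in pruned vector tables several strings can share a row):

import numpy as np
import spacy

nlp = spacy.load("en_core_sci_md")

def most_similar(word, n=10):
    query = nlp.vocab[word].vector
    data = nlp.vocab.vectors.data                 # (rows, dim) array of all stored vectors
    sims = data @ query / (np.linalg.norm(data, axis=1) * np.linalg.norm(query) + 1e-8)
    row2key = {row: key for key, row in nlp.vocab.vectors.key2row.items()}
    best_rows = np.argsort(-sims)[:n]             # highest cosine similarity first
    return [nlp.vocab.strings[row2key[r]] for r in best_rows if r in row2key]

print(most_similar("heart"))

Recent spaCy versions also offer a batched nlp.vocab.vectors.most_similar for the same purpose.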

More Tutorials for SciSpacy

Hello SciSpacy team,

Thank you for building sci-spacy.
I just wanted to say that it would be very helpful if there were more examples/jupyter notebooks available to learn diverse use cases of sci-spacy.
Please let me know where I can find the complete sci-spacy documentation.

Under/over-splitting in BioNLP09: common cases

Hi everyone, and thank you very much for your great work!

I tried the scispacy en_core_sci_md model on the BioNLP09 corpus and noticed improved sentence segmentation accuracy compared with the default written-text en_core_web_md model. I read your paper and I'm excited that the rule-based segmenter module is not usually needed, thanks to the in-domain dependency parser training.

However, I noticed some recurrent errors that I want to share with you, since they occur on the aforementioned, widely used BioNLP corpus. I collected many examples that I'm reporting here, and that can be summarized as:

  • Oversplitting after "+/-" or at the dot in "p50.c-rel"
  • Undersplitting after a capital letter followed by a dot (e.g., kappa B., kinase A., Cya.)

You can also find attached a list of other less common errors I screened (other_errors.txt), but I think even just identifying a solution for and/or handling these cases would be great, since they represent the majority of errors (~75%) in the BioNLP09 corpus!

What would you recommend for handling these cases? Are they easily manageable by adding tokenization rules, or would you suggest a custom module to work around the problem?

Thank you very much indeed!
Alan


Oversplitting after "+/-"

Example 1

  • PDBu + iono induced equally high IL-2 levels in both groups and, when stimulated with plate-bound anti-CD3 monoclonal antibody (mAb), the IL-2 secretion by neonatal cells was undetectable and adult cells produced low amounts of IL-2 (mean 331 +/-
  • 86 pg/ml).

Example 2

  • The addition of anti-CD28 mAb to anti-CD3-stimulated cells markedly increased IL-2 production in both cell types, but levels of IL-2 in neonatal T cells remained clearly lower than those of adult T cells (respective mean values: 385 +/-
  • 109 pg/ml and 4494 +/-
  • 1199 pg/ml).

Example 3

  • Maximal inhibition of IgE production for B cells was at 10(-8) mol/L for all-trans RA (94% +/-
  • 1.8%) and 96% +/-
  • 3.2% for 13-cis RA.

Example 4

  • Anti-CD40 + IL-4-mediated proliferation of PBMC and B cells was inhibited by RA in a dose-dependent manner, with maximal inhibition of 62% +/-
  • 5% in PBMC and 55% +/-
  • 4.4% in B cells by all-trans RA, and 58% +/-
  • 6.7% and 51% +/-
  • 4.7%, respectively by 13-cis RA.

Example 5

  • By immunocytochemistry, 25 +/-
  • 7% of the human neutrophils were shown to express immunoreactive GH, whereas eosinophils were negative.

Example 6

  • TCP succinate (200 microM, 24 h) reduced TNF-induced VCAM-1 and E-selectin expression from a specific mean fluorescence intensity of 151 +/- 28 to 12 +/-
  • 4 channels and from 225 +/- 38 to 79 +/-
  • 21 channels, respectively.

Example 7

  • In this study, we report that both Exosurf and Survanta suppress TNF mRNA and secretion (85 +/-
  • 4% mean percent inhibition +/-
  • SEM by Exosurf; 71 +/-
  • 6% by Survanta) by endotoxin-stimulated THP-1, a human monocytic cell line.

Oversplitting in between "p65.c-Rel" / "p50-.c-Rel"

Example 1

  • p65 or as p50.
  • c-Rel heterodimers.

Example 2

  • p50.
  • c-rel heterodimers were also detected bound to this sequence at early time points (7-16 h; early), and both remained active at later time points (40 h; late) after activation.

Example 3

  • However, immediately after TcR/CD3 cross-linking (after approximately 1 h; immediate) binding of p50.
  • p65 heterodimers was observed.

Undersplitting after "kappa B." / "kappaB."

  • Protein kinase C inhibitor staurosporine, but not cyclic nucleotide-dependent protein kinase inhibitor HA-1004, also dramatically reduced constitutive levels of nuclear NF kappa B. [SPLIT HERE] Finally, TPA addition to monocytes infected with HIV-1 inhibited HIV-1 replication, as determined by reverse transcriptase assays, in a concentration-dependent manner.
  • The NF-kappa B p65 subunit provides the transactivation activity in this complex and serves as an intracellular receptor for a cytoplasmic inhibitor of NF-kappa B, termed I kappa B. [SPLIT HERE] In contrast, NF-kappa B p50 alone fails to stimulate kappa B-directed transcription, and based on prior in vitro studies, is not directly regulated by I kappa B. [SPLIT HERE] To investigate the molecular basis for the critical regulatory interaction between NF-kappa B and I kappa B/MAD-3, a series of human NF-kappa B p65 mutants was identified that functionally segregated DNA binding, I kappa B-mediated inhibition, and I kappa B-induced nuclear exclusion of this transcription factor.
  • This protein is most similar to the 105-kDa precursor polypeptide of p50-NF-kappa B. [SPLIT HERE] Like the 105-kDa precursor, it contains an amino-terminal Rel-related domain of about 300 amino acids and a carboxy-terminal domain containing six full cell cycle or ankyrin repeats.
  • The kappa B enhancer of the gene encoding the interleukin-2 (IL-2) receptor alpha chain (IL-2R alpha) is functional only in the hybrids expressing nuclear NF-kappa B. [SPLIT HERE] These findings show that nuclear NF-kappa B is necessary to activate the kappa B enhancer, while KBF1 by itself is not sufficient.
  • In this report we describe how signals initiated through the type I IL-1R interact with signals from the antigen receptor to synergistically augment the transactivating properties of NF-kappa B. [SPLIT HERE] The synergistic antigen receptor initiated signals are mediated through protein kinase C because they can be mimicked by the phorbol ester, 12-O-tetradecanoylphorbol-13-acetate, but not with calcium ionophores; and are staurosporine sensitive but cyclosporine resistant.
  • This study demonstrates that human immunodeficiency virus type 1 (HIV-1) Tat protein amplifies the activity of tumor necrosis factor (TNF), a cytokine that stimulates HIV-1 replication through activation of NF-kappa B. [SPLIT HERE] In HeLa cells stably transfected with the HIV-1 tat gene (HeLa-tat cells), expression of the Tat protein enhanced both TNF-induced activation of NF-kappa B and TNF-mediated cytotoxicity.
  • Treatment of human resting T cells with phorbol esters strongly induced the expression of IL-2R alpha and the activation of NF.kappa B. [SPLIT HERE] This activation was due to the translocation of p65 and c-Rel NF.kappa B proteins from cytoplasmic stores to the nucleus, where they bound the kappa B sequence of the IL-2R alpha promoter either as p50.
  • A mutant Tax protein deficient in transactivation of genes by the nuclear factor (NF)-kappaB pathway was unable to induce transcriptional activity of IL-1alpha promoter-CAT constructs, but was rescued by exogenous provision of p65/p50 NF-kappaB. [SPLIT HERE] We found that two IL-1alpha kappaB-like sites (positions -1,065 to -1,056 and +646 to +655) specifically formed a complex with NF-kappaB-containing nuclear extract from MT-2 cells and that NF-kappaB bound with higher affinity to the 3' NF-kappaB binding site than to the 5' NF-kappaB site.
  • Electrophoretic mobility shift assays (EMSAs) demonstrated that unstimulated monocytes predominantly expressed p50 NF-kappa B. [SPLIT HERE] Stimulation with LPS or IFN-gamma resulted in the expression of p50 and p65 subunits, while the combination of IFN-gamma plus LPS caused a further increase in the expression of NF-kappa B. [SPLIT HERE] With Western blotting, it was shown that nuclear extracts from monocytes contained p50 and p65 protein in response to LPS and IFN-gamma stimulation.
  • The effects of IFN-gamma on the transcription factors were specific, since no change was observed in the expression of NF-IL-6 or I kappa B alpha, the inhibitor of NF-kappa B. [SPLIT HERE] We conclude that the effects of IFN-gamma on the expression of the transcription factors AP-1 and NF-kappa B may be important for the modulatory effects of IFN-gamma on the cytokine expression in activated human monocytes.
  • Protein-DNA complexes of constitutive NF-kappa B are similar in mobility to the LPS-induced NF-kappa B and both are recognized by an antibody specific to the p50 subunit of NF-kappa B. [SPLIT HERE] By contrast, treatment of cells with pyrrolidine dithiocarbamate (PDTC) will only block LPS-induced NF-kappa B, but not the constitutive binding protein.
  • Stimulation of T-cells by agonistic anti-CD28 antibodies in conjunction with phorbol 12-myristate 13-acetate (PMA)- or TcR-derived signals induces the enhanced activation of the transcription factor NF-kappa B. [SPLIT HERE] Here we report that CD28 engagement, however, exerts opposite effects on the transcription factor AP-1.
  • In addition, cotransfection of a negative dominant molecule of PKC-zeta (PKC-zeta mut) with NF-kappa B-dependent reporter genes selectively inhibits the HIV- but not phorbol myristate acetate- or lipopolysaccharide-mediated activation of NF-kappa B. [SPLIT HERE] That PKC-zeta is specific in regulating NF-kappa B is concluded from the inability of PKC-zeta(mut) to interfere with the basal or phorbol myristate acetate-inducible CREB- or AP1-dependent transcriptional activity.
  • Inhibition of TNF-alpha secretion by LPS-stimulated THP-1-hGH cells was associated with a decrease in nuclear translocation of nuclear factor-kappaB. [SPLIT HERE] The capacity of GH to inhibit LPS-induced TNF-alpha production by monocytes without altering other pathways leading to TNF-alpha production may be of potential relevance in septic shock, since GH is available for clinical use.
  • In this manuscript we have investigated the molecular mechanisms by which T cell lines stimulated with phorbol 12-myristate 13-acetate (PMA) and phytohemagglutin (PHA) display significantly higher levels of NF-kappa B1 encoding transcripts than cells stimulated with tumor necrosis factor-alpha, despite the fact that both stimuli activate NF-kappa B. [SPLIT HERE] Characterization of the NF-kappa B1 promoter identified an Egr-1 site which was found to be essential for both the PMA/PHA-mediated induction as well as the synergistic activation observed after the expression of the RelA subunit of NF-kappa B and Egr-1.
  • The expression of many genes for which products are involved in inflammation is controlled by the transcriptional regulator nuclear factor (NF)-kappa B. [SPLIT HERE] Because surfactant protein (SP) A is involved in local host defense in the lung and alters immune cell function by modulating the expression of proinflammatory cytokines as well as surface proteins involved in inflammation, we hypothesized that SP-A exerts its action, at least in part, via activation of NF-kappa B. We used gel shift assays to determine whether SP-A activated NF-kappa B in the THP-1 cell line, a human monocytic cell line.
  • Similarly, a kinase-deficient mutant of NIK (NF-kappaB-inducing kinase), which represents an upstream kinase in the TNF-alpha and IL-1 signaling pathways leading to IKKalpha and IKKbeta activation, blocks Tax induction of NF-kappaB. [SPLIT HERE] However, plasma membrane-proximal elements in these proinflammatory cytokine pathways are apparently not involved since dominant negative mutants of the TRAF2 and TRAF6 adaptors, which effectively block signaling through the cytoplasmic tails of the TNF-alpha and IL-1 receptors, respectively, do not inhibit Tax induction of NF-kappaB. [SPLIT HERE] Together, these studies demonstrate that HTLV-1 Tax exploits a distal part of the proinflammatory cytokine signaling cascade leading to induction of NF-kappaB. [SPLIT HERE] The pathological alteration of this cytokine pathway leading to NF-kappaB activation by Tax may play a central role in HTLV-1-mediated transformation of human T cells, clinically manifested as the adult T-cell leukemia.
  • Our analyses of the induction of nuclear factor-kappaB (NFkappaB) in activated memory (CD45RO+) and naive (CD45RA+) T cell subsets from young and elderly donors has demonstrated that, regardless of donor age, memory T cells are not significantly altered in their responsiveness to TNF-alpha-mediated induction of NFkappaB. [SPLIT HERE] Although treatment with TNF-alpha induced nuclear localization of NFkappaB in both memory and naive T cell subsets, irrespective of the age of the donor, the levels of induced NFkappaB were significantly lower in both subsets of T cells obtained from the elderly, when compared to those in young.
  • Examination of IkappaB alpha regulation revealed that TNF-alpha-mediated degradation of IkappaB alpha in both memory and naive T cells from the elderly was severely impaired, thus contributing to the lowered induction of the observed NFkappaB. [SPLIT HERE] In addition, this age-related decrease in induction of nuclear NFkappaB correlated with decrease in intracellular IL-2 receptor expression and anti-CD3-induced proliferation of both memory and naive T cells subsets.

Undersplitting after "kinase C." / "kinase A."

  • In contrast, anti-AIM mAb did not induce any change in the binding activity of NF-kappa B, a transcription factor whose activity is also regulated by protein kinase C. [SPLIT HERE] The increase in AP-1-binding activity was accompanied by the marked stimulation of the transcription of c-fos but not that of c-jun.
  • The stimulatory effect of gp160 on NF-kappa B activation is protein synthesis independent, is dependent upon protein tyrosine phosphorylation, and abrogated by inhibitors of protein kinase C. [SPLIT HERE] The gp160-mediated activation of NF-kappa B in CD4 positive T cells may be involved in biological effects, e.g., enhanced HIV replication, hypergammaglobulinemia, increased cytokine secretion, hypercellularity in bone marrow and apoptosis.
  • The phosphorylation of CREB that results in activation is mediated by protein kinase C rather than by protein kinase A. [SPLIT HERE] Although the CRE site is necessary, optimal induction of bcl-2 expression requires participation of the upstream regulatory element, suggesting that phosphorylation of CREB alters its interaction with the upstream regulatory element.

Undersplitting after "CyA." or "CsA."

  • Induction of the PILOT gene is detectable in human T cells 20 min following activation in the presence of cycloheximide and is fully suppressed by CyA. [SPLIT HERE] The PILOT protein has a calculated M(r) of 42.6 kDa and contains three zinc fingers of the C2H2-type at the carboxyl-terminus which are highly homologous to the zinc finger regions of the transcription factors EGR1, EGR2, and pAT 133.
  • Transactivation by recombinant NFAT1 in Jurkat T cells requires dual stimulation with ionomycin and phorbol 12-myristate 13-acetate; this activity is potentiated by coexpression of constitutively active calcineurin and is inhibited by CsA. [SPLIT HERE]
    Immunocytochemical analysis indicates that recombinant NFAT1 localizes in the cytoplasm of transiently transfected T cells and translocates into the nucleus in a CsA-sensitive manner following ionomycin stimulation.
  • CONCLUSIONS: This study demonstrates that TF activation, occurring in mononuclear cells of cardiac transplant recipients, is inhibited by treatment with CsA. [SPLIT HERE] Inhibition of monocyte TF induction by CsA may contribute to its successful use in cardiac transplant medicine and might be useful in managing further settings of vascular pathology also known to involve TF expression and NF-kappaB activation.

Syntax error (Python 3.5)

When I run the example code available on the scispacy site, I get this error.
Could you help me solve the problem?

(env) vk@vk:~$ python test1.py
Traceback (most recent call last):
File "test1.py", line 5, in <module>
nlp = spacy.load("en_core_sci_sm")
File "/home/vk/env/lib/python3.5/site-packages/spacy/__init__.py", line 21, in load
return util.load_model(name, **overrides)
File "/home/vk/env/lib/python3.5/site-packages/spacy/util.py", line 114, in load_model
return load_model_from_package(name, **overrides)
File "/home/vk/env/lib/python3.5/site-packages/spacy/util.py", line 134, in load_model_from_package
cls = importlib.import_module(name)
File "/home/vk/env/lib/python3.5/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 986, in _gcd_import
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 665, in exec_module
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
File "/home/vk/env/lib/python3.5/site-packages/en_core_sci_sm/__init__.py", line 7, in <module>
from scispacy.custom_sentence_segmenter import combined_rule_sentence_segmenter
File "/home/vk/env/lib/python3.5/site-packages/scispacy/custom_sentence_segmenter.py", line 15
prev_token_1: Token = None
^
SyntaxError: invalid syntax

How to visualize named entities in custom colors

There's an options argument in spaCy which allows us to use custom colors for named entity visualization. I'm trying to use the same options in scispacy for the named entities. I simply created two lists of entities and randomly generated colors and put them in an options dictionary like the following:

options = {"ents": entities, "colors": colors}

Where entities is a list of NEs in scispacy NER models and colors is a list of the same size. But using such an option in either displacy.serve or displacy.render (for jupyter) does not work. I'm using the options like the following:

displacy.serve(doc, style="ent", options=options)

I wonder if the color option only works for the predefined named entities in spaCy, or whether there's something wrong with the way I'm using it?
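
For what it's worth, displacy expects colors to be a dict keyed by entity label rather than a list parallel to ents, which may be the problem here. A minimal sketch using the en_ner_bc5cdr_md labels:

import spacy
from spacy import displacy

nlp = spacy.load("en_ner_bc5cdr_md")
doc = nlp("Treatment with cisplatin caused nephrotoxicity in some patients.")

options = {
    "ents": ["DISEASE", "CHEMICAL"],                          # labels to display
    "colors": {"DISEASE": "#ff6961", "CHEMICAL": "#77dd77"},  # label -> color
}
displacy.render(doc, style="ent", options=options)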

pip install fails

I've created the conda env, and ran pip install scispacy see the result:

(scispacy) lucas-mbp:jats lfoppiano$ pip install scispacy
Collecting scispacy
  Using cached https://files.pythonhosted.org/packages/72/55/30b30a78abafaaf34d0d8368a090cf713964d6c97c5e912fb2016efadab0/scispacy-0.2.2-py3-none-any.whl
Collecting numpy (from scispacy)
  Downloading https://files.pythonhosted.org/packages/0f/c9/3526a357b6c35e5529158fbcfac1bb3adc8827e8809a6d254019d326d1cc/numpy-1.16.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (13.9MB)
     |████████████████████████████████| 13.9MB 3.5MB/s 
Collecting joblib (from scispacy)
  Using cached https://files.pythonhosted.org/packages/cd/c1/50a758e8247561e58cb87305b1e90b171b8c767b15b12a1734001f41d356/joblib-0.13.2-py2.py3-none-any.whl
Collecting spacy>=2.1.3 (from scispacy)
  Downloading https://files.pythonhosted.org/packages/cb/ef/cccdeb1ababb2cb04ae464098183bcd300b8f7e4979ce309669de8a56b9d/spacy-2.1.6-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (34.6MB)
     |████████████████████████████████| 34.6MB 33.6MB/s 
Collecting conllu (from scispacy)
  Downloading https://files.pythonhosted.org/packages/ae/54/b0ae1199f3d01666821b028cd967f7c0ac527ab162af433d3da69242cea2/conllu-1.3.1-py2.py3-none-any.whl
Collecting awscli (from scispacy)
  Using cached https://files.pythonhosted.org/packages/e6/48/8c5ac563a88239d128aa3fb67415211c19bd653fab01c7f11cecf015c343/awscli-1.16.203-py2.py3-none-any.whl
Collecting nmslib>=1.7.3.6 (from scispacy)
  Using cached https://files.pythonhosted.org/packages/b2/4d/4d110e53ff932d7a1ed9c2f23fe8794367087c29026bf9d4b4d1e27eda09/nmslib-1.8.1.tar.gz
    ERROR: Complete output from command python setup.py egg_info:
    ERROR: Download error on https://pypi.org/simple/numpy/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
    No local packages or working download links found for numpy
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-install-l00jm4xn/nmslib/setup.py", line 172, in <module>
        zip_safe=False,
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
        _install_setup_requires(attrs)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/__init__.py", line 139, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/dist.py", line 717, in fetch_build_eggs
        replace_conflicting=True,
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 782, in resolve
        replace_conflicting=replace_conflicting
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1065, in best_match
        return self.obtain(req, installer)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1077, in obtain
        return installer(requirement)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/dist.py", line 784, in fetch_build_egg
        return cmd.easy_install(req)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 673, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('numpy')
    ----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-install-l00jm4xn/nmslib/
(scispacy) lucas-mbp:jats lfoppiano$ 

To solve the issue I had to install numpy and nmslib:

conda install numpy
conda install -c akode nmslib

It seems to work, but maybe it is not the proper way to solve it - perhaps the pip setup should be updated?

Replace NER

Hi,

I'm trying to add one of your pre-trained NER models to one of the main models but unfortunately, I run into an error. Can somebody help?

This is what I did:

nlp = spacy.load("en_core_sci_sm")
ner = spacy.load("en_ner_bionlp13cg_md")

nlp.replace_pipe('ner',ner.pipeline[0][1])

Then run on some text:

doc = nlp(text)

When I ask for entities I get the error:

doc.ents


ValueError Traceback (most recent call last)
in ()
----> 1 doc.ents
doc.pyx in spacy.tokens.doc.Doc.ents.get()
span.pyx in spacy.tokens.span.Span.__cinit__()
ValueError: [E084] Error assigning label ID 7634832301877222523 to span: not in StringStore.

If I try to just add a new NER to the model with:

nlp.add_pipe(ner.pipeline[0][1], 'new_ner')

the kernel crashes...

Thank you!
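
For reference, E084 indicates that the NER component's label strings are missing from the target pipeline's StringStore. A hedged sketch of one workaround, assuming spaCy 2.x as in the report, is to register the labels before replacing the pipe:

import spacy

nlp = spacy.load("en_core_sci_sm")
ner = spacy.load("en_ner_bionlp13cg_md").get_pipe("ner")

for label in ner.labels:          # make the NER labels known to the target vocab
    nlp.vocab.strings.add(label)

nlp.replace_pipe("ner", ner)
doc = nlp("BRCA1 is associated with breast cancer.")
print(doc.ents)

Note that the swapped-in NER was trained with the md model's word vectors, so its accuracy inside the sm pipeline may degrade.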

General ideas

  • explore training with gold preprocessing = True/False
  • explore using a different parser metric or rejoining badly tokenized words in some way so the parser isn't hurt by the poor tokenization
  • play around more with how to mix ontonotes data into training of the parser/tagger

Refactor tests to not use global cache for models

We should move away from a functional test framework because it makes the "god-object-spacy-models" dangerous. Instead we should just load a single model for groups of related tests and re-use them.

`is_oov` in comparison to `in nlp.vocab`

I'm looking to use scispacy's en_core_sci_md model for various purposes, one being using its word vectors as an input to a neural network.
As I was checking the coverage of the existing embeddings, I noticed a weird phenomenon where a given token has token.is_oov == True even though token.text in nlp.vocab == True. When this happens, token.vector.sum() == 0.
I can't figure out how this makes sense: if it is in the vocabulary, how come it is OOV and has an all-zero vector? Also, some basic words are missing, for example:

tokens = gather_all_tokens_from_corpus()

some_token = random.choice([t for t in tokens if t.is_oov])
print(some_token)
>>> smelling

some_token.text in nlp.vocab
>>> True

some_token.vector
>>> array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

How come it is OOV yet returns True when checking in nlp.vocab?
Is it expected that basic words like smelling won't have a vector?
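
A short sketch for inspecting the distinction (standard spaCy attributes; `in nlp.vocab` only checks whether the string is a known lexeme, while the vector table is a separate structure, and the exact semantics of is_oov have shifted across spaCy versions):

import spacy

nlp = spacy.load("en_core_sci_md")
lex = nlp.vocab["smelling"]

print("smelling" in nlp.vocab)  # True: the string is a known lexeme
print(lex.has_vector)           # False if the word has no row in the vector table
print(lex.is_oov)               # tracks vector coverage, not string membership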

Performance of (alpha) UMLS Linker on MedMentions ?

Hi,

Just wondering if you have any results regarding the performance of the alpha UMLS linker on MedMentions, using the mention span detector you trained for en_core_sci_*.

Also, I see an F1 of 69.26 on https://allenai.github.io/scispacy/; this is the span detection performance of the latest en_core_sci_md on the test set for the full data, right (not st21pv)? Is data from st21pv ever used?

The MedMentions paper only reports results for TaggerOne with a very beefy setup (0.9TB RAM!), it'd be really useful to have another baseline using scispacy, even in this alpha stage.

Thanks,
Dan

Respect spacy naming conventions for models.

Format should be [lang] [core/ent/dep] [genre] [size]

'core' if it has all components, 'ent' if it's only NER, 'dep' if it's syntax

Therefore, our released models should be:
en_core_sci_sm
en_core_sci_lg

possibly, if we release a mention detector separately:
en_ent_sci_[sm/lg]

Span.vector usage

Hi,

I am trying to use the en_ner_craft_md and en_ner_jnlpba_md models.

I found spaCy's Span object has Span.vector as a vector representation of its text.
Is it possible to use these representations directly for some gene name normalization task (e.g. "Basic Helix-Loop-Helix Transcription Factor Scleraxis" is similar to the "SCX" gene)?

Thanks!

Shunfu
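
As a rough check, Span.vector (like Doc.vector) is the average of the token vectors, so spans can be compared with the built-in cosine similarity. A sketch, with the caveats that averaging tends to wash out long multi-word names and that short symbols like "SCX" may have no vector at all in the md models:

import spacy

nlp = spacy.load("en_ner_jnlpba_md")
long_form = nlp("Basic Helix-Loop-Helix Transcription Factor Scleraxis")
short_form = nlp("SCX")

print(long_form.similarity(short_form))  # cosine similarity of averaged vectors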

Cannot reproduce tagging and parsing results

Hello,

I'm trying to reproduce the tagging and parsing results using the GENIA corpus. I downloaded the officially released models (en_core_sci_sm-0.2.0 and en_core_sci_md-0.2.0) and the officially released GENIA corpus (train/dev/test.json). I modified the scripts parser.sh and train_parser_and_tagger.py and used them to evaluate the models. However, there seem to be large differences between the results reported in the paper, the results reported in the GitHub repo, and my reproduced results.

  • Paper:
    • en_core_sci_sm: 98.38 89.69 87.67
    • en_core_sci_md: 98.51 90.60 88.79
  • Github Repo docs/index.md:
    • en_core_sci_sm: 98.42 89.47 87.61
    • en_core_sci_md: 98.61 89.94 88.08
  • My reproduced result:
    • en_core_sci_sm: 98.42 89.47 84.04
    • en_core_sci_md: 98.61 89.94 84.37

The numbers are POS, UAS, LAS, respectively.

Could you please check your results? Thanks a lot for your help!

Sincerely,
Yuhui

Model download URLs

Hi, where can I download the trained models? I am unable to find the download URLs for the available models mentioned in this repo.

Moreover, the current pip install fails; we have to clone the repo and build from the setup.py files.

Error when trying to load "en_core_sci_sm"

Not sure what's going on, but I got this parsing error:

>>> import spacy
>>> nlp = spacy.load("en_core_sci_sm")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/__init__.py", line 27, in load
return util.load_model(name, **overrides)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/util.py", line 131, in load_model
return load_model_from_package(name, **overrides)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/util.py", line 152, in load_model_from_package
return cls.load(**overrides)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/en_core_sci_sm/__init__.py", line 14, in load
nlp = load_model_from_init_py(__file__, **overrides)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/util.py", line 190, in load_model_from_init_py
return load_model_from_path(data_path, meta, **overrides)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/util.py", line 173, in load_model_from_path
return nlp.from_disk(model_path)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/language.py", line 786, in from_disk
util.from_disk(path, deserializers, exclude)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/util.py", line 611, in from_disk
reader(path / key)
File "/home/bancherd3/anaconda3/lib/python3.7/site-packages/spacy/language.py", line 776, in <lambda>
deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
File "tokenizer.pyx", line 390, in spacy.tokenizer.Tokenizer.from_disk
File "tokenizer.pyx", line 436, in spacy.tokenizer.Tokenizer.from_bytes
File "/home/bancherd3/anaconda3/lib/python3.7/re.py", line 234, in compile
return _compile(pattern, flags)
File "/home/bancherd3/anaconda3/lib/python3.7/re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/bancherd3/anaconda3/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/home/bancherd3/anaconda3/lib/python3.7/sre_parse.py", line 930, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/home/bancherd3/anaconda3/lib/python3.7/sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "/home/bancherd3/anaconda3/lib/python3.7/sre_parse.py", line 536, in _parse
code1 = _class_escape(source, this)
File "/home/bancherd3/anaconda3/lib/python3.7/sre_parse.py", line 337, in _class_escape
raise source.error('bad escape %s' % escape, len(escape))
re.error: bad escape \p at position 326


Thank you very much.

Do both 'en_core_sci_sm' and 'en_core_sci_md' use the full MedMentions dataset?

Thanks for developing a nice and useful library.
My question is: do both 'en_core_sci_sm' and 'en_core_sci_md' use the full MedMentions dataset?
My task is related to the MedMentions dataset itself, so I can only use the MedMentions train split.
If 'en_core_sci_sm' or 'en_core_sci_md' only used the MedMentions train split, that's OK. But if the dev/test data were also used for training the models, I can't use them, because my task is evaluated on that data.
Or, how can I retrain a spaCy model? If these models use all of MedMentions (including dev/test), I'd like to create a model all over again.

Embedding

Can I ask what embeddings you use to train the models?
Are the embeddings trained on clinical data?
