princetondh's Introduction

Hi there

Welcome Digital Humanists! This is the repo where you can find all the code used for the Princeton University online workshop titled "spaCy: A Python Library for Natural Language Processing", held Tue, Apr 4, 2023 4:30 PM – 6 PM EDT (GMT-4). We hope those of you who participated had a good time, and that everyone else finds this repository useful!

P.S.: We use the radicli library for creating command line interfaces, which we will be using in spaCy soon! The VSCode plugin for spaCy is also coming soon!

Notebooks

The notebooks directory has three Jupyter notebooks:

  1. intro_to_spacy.ipynb is a short introduction to how to work with spaCy and a whirlwind tour of many of the tools spaCy provides.
  2. casestudy_1.ipynb walks through building a pipeline to extract information from restaurant reviews by identifying spans of interest, such as mentions of cuisines or ratings. The pipeline is a blend of rule-based and learning-based techniques, and there is an exercise to build your own rules.
  3. casestudy_2.ipynb focuses only on learned pipelines and the various tools spaCy provides to find spans in texts. It runs some parts of the litbank_pipeline project.
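
To give a flavor of the rule-based side of casestudy_1.ipynb, here is a minimal sketch that uses spaCy's Matcher on a blank English pipeline, so no trained model is needed. The patterns, the CUISINE and RATING labels, the example sentence and the "sc" spans key are assumptions made for this illustration; they are not taken from the notebook.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# A blank pipeline only tokenizes, which is enough for token-based rules.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical rules: a small cuisine word list and "<number> star(s)" ratings.
matcher.add("CUISINE", [[{"LOWER": {"IN": ["thai", "italian", "mexican"]}}]])
matcher.add("RATING", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["star", "stars"]}}]])

doc = nlp("The Thai place down the road deserves 5 stars.")

# Sort matches by start token so the spans come out in reading order.
matches = sorted(matcher(doc), key=lambda m: m[1])

# Store the matches as labeled spans under an (assumed) "sc" key.
doc.spans["sc"] = [
    Span(doc, start, end, label=nlp.vocab.strings[match_id])
    for match_id, start, end in matches
]

found = [(span.text, span.label_) for span in doc.spans["sc"]]
print(found)
```

Storing the matches in doc.spans rather than doc.ents allows overlapping spans, which is convenient when rules and learned components are combined later.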

LitBank pipeline

The LitBank dataset is a collection of 100 works of fiction publicly available from Project Gutenberg, the majority of which were published between 1852 and 1911. Each document consists of approximately the first 2,000 words of a novel, for a total of 210,532 tokens in the entire dataset.

The litbank_pipeline project downloads LitBank and trains models on the Named Entity and Event annotations. To learn about the entity annotations, please check out this paper, and this one for the event annotations.

Most config files in the litbank_pipeline/configs directory were generated with an appropriate init config command.

The preprocessing commands are in litbank_pipeline/scripts/prepare.py. For event trigger detection we wrote a custom scoring function that computes precision, recall and F1 score only for the positive class, i.e. the tokens that have the EVENT label. You can find the scorer in litbank_pipeline/scripts/positive_tagger_scorer.py.
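
The idea behind scoring only the positive class can be sketched in plain Python. This is not the code in positive_tagger_scorer.py: the function name positive_class_prf, its signature and the "O" label for non-event tokens are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def positive_class_prf(
    gold: Sequence[str], pred: Sequence[str], positive: str = "EVENT"
) -> Dict[str, float]:
    """Precision, recall and F1 computed only for tokens labeled `positive`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: one correct EVENT, one spurious EVENT, one missed EVENT.
gold = ["O", "EVENT", "O", "EVENT", "O", "O"]
pred = ["O", "EVENT", "EVENT", "O", "O", "O"]
scores = positive_class_prf(gold, pred)
print(scores)
```

In this toy example, plain token accuracy would be 4 out of 6 because the non-event tokens dominate, while the positive-class scores are all 0.5. That gap is the reason to score event triggers separately.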

For the named entity recognition tasks, there are config files to train ner, spancat or spancat_singlelabel components with either the default Convolutional Network or a Recurrent Network encoder.

The ner component does only a single left-to-right pass over the document to find all entities, while spancat classifies each possible span. This means that ner is much more efficient than spancat, but spancat is more flexible. For a comparison between the two, check out this blog post.
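
To see why classifying every possible span costs more than a single pass, here is a small pure-Python sketch that enumerates the candidate spans up to a maximum length. In spaCy, spancat gets its candidates from a suggester function; the helper candidate_spans below is a hypothetical stand-in for illustration, not spaCy's implementation.

```python
from typing import List, Tuple


def candidate_spans(n_tokens: int, max_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of all spans with 1..max_len tokens."""
    spans = []
    for start in range(n_tokens):
        for length in range(1, max_len + 1):
            end = start + length
            if end > n_tokens:
                break  # longer spans would run past the end of the document
            spans.append((start, end))
    return spans


tokens = ["Mr.", "Darcy", "left", "Pemberley"]
spans = candidate_spans(len(tokens), max_len=3)
print(len(tokens), "tokens ->", len(spans), "candidate spans")
```

Here four tokens already yield nine candidates for the classifier to score, and the candidates can overlap or nest, which is where the extra flexibility of spancat comes from.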

Homework

As an exercise to get more familiar with spaCy, we recommend training the different architectures with the different encoders and seeing how they compare in terms of accuracy, speed and the kinds of mistakes they make.

We also think it would be a useful exercise to train a pipeline that has a single tok2vec component providing representations both to a tagger component for event detection and to a ner, spancat or spancat_singlelabel component for entity recognition. To learn more about shared tok2vec layers, please check out: https://spacy.io/usage/embeddings-transformers#embedding-layers.
