
chirpycardinal's Introduction

StanfordNLP: A Python NLP Library for Many Human Languages


⚠️ Note ⚠️

All development, issues, ongoing maintenance, and support have been moved to our new GitHub repository as the toolkit is being renamed as Stanza since version 1.0.0. Please visit our new website for more information. You can still download stanfordnlp via pip, but newer versions of this package will be made available as stanza. This repository is kept for archival purposes.

The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server. For detailed information please visit our official website.

References

If you use our neural pipeline including the tokenizer, the multi-word token expansion model, the lemmatizer, the POS/morphological features tagger, or the dependency parser in your research, please kindly cite our CoNLL 2018 Shared Task system description paper:

@inproceedings{qi2018universal,
 address = {Brussels, Belgium},
 author = {Qi, Peng  and  Dozat, Timothy  and  Zhang, Yuhao  and  Manning, Christopher D.},
 booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
 month = {October},
 pages = {160--170},
 publisher = {Association for Computational Linguistics},
 title = {Universal Dependency Parsing from Scratch},
 url = {https://nlp.stanford.edu/pubs/qi2018universal.pdf},
 year = {2018}
}

The PyTorch implementation of the neural pipeline in this repository is due to Peng Qi and Yuhao Zhang, with help from Tim Dozat and Jason Bolton.

This release is not the same as Stanford's CoNLL 2018 Shared Task system. The tokenizer, lemmatizer, morphological features, and multi-word token systems are a cleaned-up version of the shared task code, but in the competition we used a TensorFlow version of the tagger and parser by Tim Dozat, which has been approximately reproduced in PyTorch (though with a few deviations from the original) for this release.

If you use the CoreNLP server, please cite the CoreNLP software package and the respective modules as described here ("Citing Stanford CoreNLP in papers"). The CoreNLP client is mostly written by Arun Chaganty, and Jason Bolton spearheaded merging the two projects together.

Issues and Usage Q&A

To ask questions, report issues or request features, please use the GitHub Issue Tracker.

Setup

StanfordNLP supports Python 3.6 or later. We strongly recommend that you install StanfordNLP from PyPI. If you already have pip installed, simply run:

pip install stanfordnlp

This should also resolve all of StanfordNLP's dependencies, for instance PyTorch 1.0.0 or above.

If you currently have a previous version of stanfordnlp installed, use:

pip install stanfordnlp -U

Alternatively, you can install from the source of this git repository, which gives you more flexibility for developing on top of StanfordNLP and training your own models. For this option, run:

git clone https://github.com/stanfordnlp/stanfordnlp.git
cd stanfordnlp
pip install -e .

Running StanfordNLP

Getting Started with the neural pipeline

To run your first StanfordNLP pipeline, simply follow these steps in your Python interactive interpreter:

>>> import stanfordnlp
>>> stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
# IMPORTANT: The above line prompts you before downloading, which doesn't work well in a Jupyter notebook.
# To avoid a prompt when using notebooks, instead use: >>> stanfordnlp.download('en', force=True)
>>> nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

The last command will print the words in the first sentence of the input string (or Document, as it is represented in StanfordNLP), along with the index of the word that governs each of them in the Universal Dependencies parse of that sentence (its "head") and the dependency relation between the two words. The output should look like:

('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')
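The same triples can also be collected programmatically rather than printed. This is a minimal sketch assuming the `text`, `governor`, and `dependency_relation` attributes that the stanfordnlp 0.x `Word` objects expose (check the attribute names against your installed version):

```python
# Sketch: rebuilding the (word, head index, relation) triples shown above.
# Assumes stanfordnlp 0.x Word attributes; any object carrying `text`,
# `governor`, and `dependency_relation` works thanks to duck typing.
def dependency_triples(sentence):
    """Return (text, governor index, relation) for each word in a sentence."""
    return [(w.text, str(w.governor), w.dependency_relation)
            for w in sentence.words]
```

For example, `dependency_triples(doc.sentences[0])` should yield the same triples that `print_dependencies` prints.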

Note: If you are running into issues like OSError: [Errno 22] Invalid argument, you are very likely affected by a known Python issue; we recommend upgrading to Python 3.6.8 or later, or 3.7.2 or later.

We also provide a multilingual demo script that demonstrates how to use StanfordNLP in languages other than English, for example traditional Chinese:

python demo/pipeline_demo.py -l zh

See our getting started guide for more details.

Access to Java Stanford CoreNLP Server

Aside from the neural pipeline, this project also includes an official wrapper for accessing the Java Stanford CoreNLP server with Python code.

There are a few initial setup steps.

  • Download Stanford CoreNLP and models for the language you wish to use
  • Put the model jars in the distribution folder
  • Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05
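With those steps done, client usage looks roughly like the following sketch. The annotator names and `CoreNLPClient` keyword arguments here are assumptions based on the CoreNLP client documentation; verify them against the official docs, and note that actually running this requires Java and a downloaded CoreNLP distribution:

```python
import os

def annotate_with_corenlp(text, corenlp_home="/path/to/stanford-corenlp-full-2018-10-05"):
    """Hypothetical helper: start a CoreNLP server and annotate one string."""
    # The client locates the Java distribution via the CORENLP_HOME variable.
    os.environ.setdefault("CORENLP_HOME", corenlp_home)
    # Imported lazily so this sketch loads even without stanfordnlp installed.
    from stanfordnlp.server import CoreNLPClient
    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma"],
                       timeout=30000, memory="4G") as client:
        return client.annotate(text)
```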

We provide another demo script that shows how one can use the CoreNLP client and extract various annotations from it.

Online Colab Notebooks

To get you started, we also provide interactive Jupyter notebooks in the demo folder. You can also open these notebooks and run them interactively on Google Colab. To view all available notebooks, follow these steps:

  • Go to the Google Colab website
  • Navigate to File -> Open notebook, and choose GitHub in the pop-up menu
  • Note that you do not need to give Colab access permission to your GitHub account
  • Type stanfordnlp/stanfordnlp in the search bar, and press Enter

Trained Models for the Neural Pipeline

We currently provide models for all of the treebanks in the CoNLL 2018 Shared Task. You can find instructions for downloading and using these models here.

Batching To Maximize Pipeline Speed

To maximize speed, it is essential to run the pipeline on batches of documents: running a for loop over one sentence at a time will be very slow. The best approach at this time is to concatenate documents together, with each document separated by a blank line (i.e., two line breaks \n\n). The tokenizer will recognize blank lines as sentence breaks. We are actively working on improving multi-document processing.
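As a concrete sketch of the concatenation approach described above (the document strings and variable names are illustrative):

```python
# Join many documents into one string, separated by blank lines, so the
# tokenizer treats each boundary as a break.
documents = [
    "Barack Obama was born in Hawaii.",
    "He was elected president in 2008.",
]
batched_text = "\n\n".join(documents)

# Then run the pipeline once over the whole batch instead of looping:
# nlp = stanfordnlp.Pipeline()
# doc = nlp(batched_text)
print(batched_text)
```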

Training your own neural pipelines

All neural modules in this library, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer and the dependency parser, can be trained with your own CoNLL-U format data. Currently, we do not support model training via the Pipeline interface. Therefore, to train your own models, you need to clone this git repository and set up from source.
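For reference, CoNLL-U is a tab-separated, ten-column format with blank lines between sentences and `#` comment lines for metadata. A minimal reader might look like the following sketch (field names follow the Universal Dependencies specification; this helper is illustrative, not part of the library):

```python
# Column names from the Universal Dependencies CoNLL-U specification.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                 # a blank line ends the current sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):   # comment lines carry sentence metadata
            continue
        else:
            tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if tokens:                       # flush a trailing sentence
        sentences.append(tokens)
    return sentences

sample = "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
print(parse_conllu(sample)[0][0]["form"])  # -> Hello
```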

For detailed step-by-step guidance on how to train and evaluate your own models, please visit our training documentation.

LICENSE

StanfordNLP is released under the Apache License, Version 2.0. See the LICENSE file for more details.


chirpycardinal's Issues

Treelet State

Here is what I would like to do:

I would like a treelet to remember its state, e.g. when a treelet is supposed to carry out a 2-3 turn conversation. If this treelet sent an opening question, it needs to know whether the response generator it belongs to was selected among all the response generators. If it was selected, it knows that the user utterance responds to its question; otherwise, it does not.

So basically the response generator needs to query which response generator was selected in the last round. How can this be done?

If it cannot be done, can you suggest how it should be implemented?

Updating requirements.txt `ipython-sql`

Not sure if this project is still active.
It might make more sense to migrate from ipython-sql to JupySQL.
Adding more context:

This transition should help you avoid compatibility issues and gain access to newer features.
It should only require switching the package name and version, since backward compatibility is maintained.

docker images for all annotators

I looked at the docker images available at [1]. However, there are a few more annotators that are not available on Docker Hub, namely:

  1. emotionclassifier
  2. coref
  3. qa
  4. sentseg

I see that they are not currently used in the Chirpy Cardinal bot, but I have some use cases for them. I tried building them from the codebase but ran into versioning issues, so it would be great if you could make these four images available on Docker Hub as well.

  1. https://hub.docker.com/u/openchirpy

Possible Extensions

Hello,

This is not an issue but more of a feature request, or a request for advice.

I am looking to use/extend chirpy cardinal for the following use case:

A social bot that can collect reviews/feedback for a particular item. For example, if I am running an online cooking class and someone participated in one of the sessions, I want to collect feedback on that class. Precisely, I want the bot to:

  1. Pick one of the attributes/features to ask for a review on, e.g. the chef's preparedness, availability of ingredients, overall experience, the outcome of the class
  2. Have a somewhat in-depth conversation about any of these attributes. I want to measure the bot on how much detailed, nuanced information it can elicit from the user, so the metric is more qualitative than quantitative.
  3. Use as much context as possible in the conversation. E.g., if I provide other people's reviews about the same attribute, can the bot use those reviews as well? For example: "I am sorry and surprised to hear you did not like the attention to detail by the chef, because other people have rated the chef positively on it."
  4. In short, I am looking for the capability for the bot to have a more detailed (or somewhat meaningful) conversation about any of the attributes.

How would you advise me to build these features on top of the existing implementation?

self pronoun inconsistency - chirpy says "we" for housemates in icebreaker and "lives alone" in living_condition treelet

neural_chat RG refers to itself as living alone in living_situation treelet and as "we" when talking about taking a vacation in icebreaker

Living_situation RG

"... we are overdue for a vacation..."

return "I live by myself, but luckily I got to talk to people all day, so it's not too lonely."

Icebreaker RG

"It's the middle of the year now. I'm thinking we might be overdue for a vacation, to take some time to recharge and relax. Do you have a favorite thing to do during vacation?",

dependency of packages in questionclassifier

While installing the packages, I am running into some conflicts. I am using Python 3.7 on macOS.

When I run the following (in an isolated virtualenv):

pip3 install -r docker/questionclassifier/app/requirements.txt

I get the following errors:
ERROR: pytorch-ignite 0.4.4 has requirement torch<2,>=1.3, but you'll have torch 1.1.0 which is incompatible.
ERROR: transformers 4.5.0 has requirement numpy>=1.17, but you'll have numpy 1.16.4 which is incompatible.
ERROR: pyarrow 3.0.0 has requirement numpy>=1.16.6, but you'll have numpy 1.16.4 which is incompatible.
ERROR: pandas 1.2.3 has requirement numpy>=1.16.5, but you'll have numpy 1.16.4 which is incompatible.
ERROR: datasets 1.5.0 has requirement numpy>=1.17, but you'll have numpy 1.16.4 which is incompatible.
ERROR: datasets 1.5.0 has requirement tqdm<4.50.0,>=4.27, but you'll have tqdm 4.60.0 which is incompatible.
ERROR: scipy 1.6.2 has requirement numpy<1.23.0,>=1.16.5, but you'll have numpy 1.16.4 which is incompatible.
ERROR: jsonschema 3.2.0 has requirement six>=1.11.0, but you'll have six 1.10.0 which is incompatible.
ERROR: wandb 0.10.25 has requirement six>=1.13.0, but you'll have six 1.10.0 which is incompatible.
ERROR: torchvision 0.9.1 has requirement torch==1.8.1, but you'll have torch 1.1.0 which is incompatible.

I believe similar conflicts in python packages are present in other docker images as well.

Can you suggest how to get around these errors?
If I remove all the version pins in requirements.txt, the installation works fine, but the docker image then does not return a response to curl commands, as pointed out in another issue.

No index phone_doc-0520-3 found in Elastic Search

I have been trying to run the code locally. All installation steps, including Spark, completed successfully. When I run shell_chat, the bot replies and I am able to chat with it, but an error gets printed: 'No index phone_doc-0520-3 found in Elastic Search'.
I searched the Spark code I ran, and there is no mention of 'phone_doc-0520-3' being created in 'upload.py'.
What should I do to resolve this? Or do I actually need to resolve it, given that the bot is already up and running?
It is searching for phone_doc-0520-3 in the file chirpy/core/asr/search_phone_to_ent.py.

Separately (instead of raising another issue, I am asking here), the bot's replies are okay but not exactly the same as in the live demo. The bot does not seem to understand some utterances that it does understand in the live demo. Am I missing something here?
All docker images have been pulled and the containers started. The only thing I have not set up is the Twitter opinion database in Postgres (for which an error is shown in the terminal).
Are these two errors (1. no index phone_doc-0520-3 found, 2. no Postgres) responsible for the reduced accuracy of my bot?

Thanks in advance for your reply!

Possibly missing Postgresql db schema

Hello

when I try to import

from agent.agents.local.local_agent import LocalAgent

I get some DB errors. Digging into it:

chirpycardinal/chirpy/response_generators/opinion2/opinion_sql.py

This requires a database called twitter_opinions, which has some tables, e.g. labeled_phrases_cat.

However, I could not find the DB schema anywhere in the code or in the instructions.

If I am missing something, please point me to the right place; if not, could you please specify the schema?

Issues in running preprocess

Hello

I am trying to install the codebase, and when I run preprocess.py like this:

python3 preprocess.py ../../dump_dir/ ../../pgview_dir/ ../../wikidata_dir 24

This raises the following error:

File "preprocess.py", line 214, in stage0
all_seeks_tup = [(args.dump_path, all_seeks[i], all_seeks[i + 1] - all_seeks[i]) for i in range(len(all_seeks) - 1)] + [(args.dump_path, all_seeks[-1], -1)]
IndexError: list index out of range

This is because the list all_seeks is empty ([]).

As per wiki-setup.md, I set up all the variables, so I am not sure what is going wrong.

Questions:

Do I need to download the wiki dump manually and place it somewhere?

Also, is it possible to run it without Spark? I am running this on a single-core machine, so that would make it easier to debug.

And finally, thanks for the nice work and for open-sourcing it.

Running Tests

Hello

I am trying to run the tests and am running into this issue:

integration_base.py: Line 17

from bin.run_utils import setup_lambda, ASKInvocation, setup_logtofile

I cannot find the run_utils.py file in the repository.

Am I missing something here?

Downloading wikidumps data

I am not sure which wikipedia_dumps files to download. Is this the right link for the download: https://dumps.wikimedia.org/wikidatawiki/20221020/ ? If so, how many files will I have to download?
In #31, it was mentioned that a month's worth of pageviews would be required. Similarly, how many wikipedia_dumps files would I need?

Live Demo is down

Hi there,

The live demo is broken; could you please fix it?
I am really interested in trying the chatbot out.

Thanks
