
textdata's Introduction

textdata


The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store datasets on disk and load them when needed.

Installation

You can install the released version of textdata from CRAN with:

install.packages("textdata")

And the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")

Example

The first time you use one of the functions for accessing an included text dataset, such as lexicon_afinn() or dataset_sentence_polarity(), the function will prompt you to agree that you understand the dataset’s license or terms of use and then download the dataset to your computer.

After the first use, each time you use a function like lexicon_afinn(), the function will load the dataset from disk.
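
For example, a minimal sketch of that workflow (assuming only that the package is installed):

library(textdata)

# First call: prompts you to accept the license, then downloads and caches.
afinn <- lexicon_afinn()

# Later calls: the cached copy is read from disk, with no prompt or download.
afinn <- lexicon_afinn()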

Included text datasets

As of today, the datasets included in textdata are:

  • v1.0 sentence polarity dataset: dataset_sentence_polarity()
  • AFINN-111 sentiment lexicon: lexicon_afinn()
  • Hu and Liu’s opinion lexicon: lexicon_bing()
  • NRC word-emotion association lexicon: lexicon_nrc()
  • NRC Emotion Intensity Lexicon: lexicon_nrc_eil()
  • NRC Valence, Arousal, and Dominance Lexicon: lexicon_nrc_vad()
  • Loughran and McDonald’s opinion lexicon for financial documents: lexicon_loughran()
  • AG’s News: dataset_ag_news()
  • DBpedia ontology: dataset_dbpedia()
  • TREC-6 and TREC-50: dataset_trec()
  • IMDb Large Movie Review Dataset: dataset_imdb()
  • Stanford NLP GloVe pre-trained word vectors: embedding_glove6b(), embedding_glove27b(), embedding_glove42b(), embedding_glove840b()

Check out each function’s documentation for detailed information (including citations) for the relevant dataset.
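
For instance, a hedged sketch of loading one of the GloVe embeddings (the dimensions argument is documented for embedding_glove6b(); the value 100 is just an example):

# Downloads on first use, then loads the 100-dimensional vectors from the cache.
glove6b <- embedding_glove6b(dimensions = 100)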

Community Guidelines

Note that this project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support on the project’s GitHub issue tracker. For details on how to add a new dataset to this package, check out the vignette!

textdata's People

Contributors

ellisvalentiner, emilhvitfeldt, jmclawson, jonthegeek, juliasilge, olivroy


textdata's Issues

Release textdata 0.4.3

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

lexicon_nrc() appears broken again

It looks like the structural change in the source file that led to issue #50 has been rolled back, so lexicon_nrc() fails due to a missing file.

I think the path used in process_nrc() should once again be:
"NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt"

Release textdata 0.4.0

Prepare for release:

  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Release textdata 0.4.5

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)

lexicon_nrc_vad() columns aren't named

The text file unzipped for the NRC Valence, Arousal, and Dominance lexicon does not include column names, but the read_tsv() call assumes it does. As a result, the function returns a tibble whose columns are named after the first data row: "aaaaaaah", "0.479", "0.606", and "0.291".

data <- read_tsv(
  path(
    folder_path,
    "NRC-VAD-Lexicon-Aug2018Release/NRC-VAD-Lexicon.txt"
  ),
  col_types = cols(
    Word = col_character(),
    Valence = col_double(),
    Arousal = col_double(),
    Dominance = col_double()
  )
)


A PR has been prepared to address this issue.
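
For reference, a minimal sketch of the kind of fix the PR presumably makes (assuming standard readr semantics: passing col_names explicitly tells read_tsv() the file has no header row):

data <- read_tsv(
  path(
    folder_path,
    "NRC-VAD-Lexicon-Aug2018Release/NRC-VAD-Lexicon.txt"
  ),
  # The file has no header, so name the columns here instead.
  col_names = c("Word", "Valence", "Arousal", "Dominance"),
  col_types = cols(
    Word = col_character(),
    Valence = col_double(),
    Arousal = col_double(),
    Dominance = col_double()
  )
)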

Upkeep for textdata

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_github_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    usethis::rename_files() can be helpful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.5
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions

2022

Allow non-interactive use of load_dataset()

Currently load_dataset() calls printer(), which in turn calls menu(), which throws an error if R is run non-interactively.


An example of wanting to use this function non-interactively is including the datasets in a Docker image.

I think that in a non-interactive session you can always assume that the user wants to download the dataset, so a possible reworking of printer() might be something like the following.

printer <- function(name) {
  info_name <- print_info[[name]]
  if (interactive()) {
    # Interactive session: show the dataset details and ask for confirmation.
    cat(
      "Do you want to download:\n",
      "Name:", info_name[["name"]], "\n",
      "URL:", info_name[["url"]], "\n",
      "License:", info_name[["license"]], "\n",
      "Size:", info_name[["size"]], "\n",
      "Download mechanism:", info_name[["download_mech"]], "\n"
    )
    menu(choices = c("Yes", "No"))
  } else {
    # Non-interactive session: report what is happening and proceed as if
    # the user had answered "Yes" (menu() would error here anyway).
    cat(
      "Downloading:\n",
      "Name:", info_name[["name"]], "\n",
      "URL:", info_name[["url"]], "\n",
      "License:", info_name[["license"]], "\n",
      "Size:", info_name[["size"]], "\n",
      "Download mechanism:", info_name[["download_mech"]], "\n"
    )
    1
  }
}

That is, the message is changed from "Do you want to download" to "Downloading" and menu() is replaced by always returning 1.

Add non-interactive option

Hi! Thanks for your hard work on this repo :) I've been using it for an analysis lately and ran into a problem.

My goal was to deploy a Shiny app that uses the NRC dataset. The app worked locally but not when deployed, because every time get_sentiments("nrc") runs, a menu appears asking whether I want to download the dataset. This is shown in the demo:
https://github.com/EmilHvitfeldt/textdata/blob/master/man/figures/textdata_demo.gif

I would suggest adding a parameter so the answer to that menu is automatically "Yes", and it downloads whatever it needs without asking.

I managed to finish my app by using the dataset locally, so no rush for me, but I think it would be a nice addition. Tell me what you think :)
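
One hedged sketch of how such a parameter could be wired in (the argument name ask is hypothetical, not part of the current API):

# Hypothetical: an ask argument that skips the confirmation menu.
printer <- function(name, ask = TRUE) {
  if (!ask) {
    return(1)  # behave as if the user answered "Yes"
  }
  # ... existing behavior: print the dataset info and call menu() ...
}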

Add Ag_news to readme

Double-check that the vignette specifies that new datasets should be mentioned in the README.

package ‘fs’ does not have a namespace

When installing the textdata package via remotes::install_github() last night, I got the error message below when trying to load the library. Could you kindly advise? Thank you so much.

Error: package or namespace load failed for ‘textdata’:
package ‘fs’ does not have a namespace

Release textdata 0.4.1

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Review pkgdown reference index for, e.g., missing topics

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

how to download the data outside of R?

Hello and thanks for this great package!

Unfortunately my corporate firewall prevents me from downloading the data from within R (I can download the file manually with the browser).

What should I do to make it work?

Thanks!
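
In case it helps: a hedged sketch of one possible workaround, assuming the dir and manual_download arguments that recent textdata versions document:

# Download the file in a browser, place it in a folder of your choice,
# then point textdata at that folder and skip the in-R download.
library(textdata)
afinn <- lexicon_afinn(
  dir = "~/textdata_cache",  # folder containing the manually downloaded file
  manual_download = TRUE     # process the local file instead of downloading
)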

Add Stanford GloVe Embeddings Datasets

I'd like to add the GloVe pre-trained word vectors, for use in tidymodels/textrecipes#20

The datasets are available here: https://nlp.stanford.edu/projects/glove/

There are four downloads that break down like this:

  • glove.6B.zip = 4 datasets
  • glove.42B.300d.zip = 1 dataset
  • glove.840B.300d.zip = 1 dataset
  • glove.twitter.27B.zip = 4 datasets

The first one is all I directly need right now, but it feels worthwhile to work out a standard for all of them while I'm at it.

I don't want to make the functions too complicated to understand, but it feels like it should be one set of textdata functions (download_glove, process_glove, dataset_glove) with arguments for the specifics (something like dataset_glove({normal stuff plus}, token_set, dimensions)).

Let me know what you think and I can knock this out (I'm doing it anyway for personal/work use, so formalizing it won't be a lot of extra work).
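
A hedged sketch of what that single entry point might dispatch on (token_set and dimensions come from the proposal above; the URL mapping is an assumption based on the four zips listed):

# Hypothetical helper, not the package's actual implementation: map the
# requested token set to one of the four GloVe zip files.
glove_url <- function(token_set = c("6B", "42B", "840B", "twitter.27B")) {
  token_set <- match.arg(token_set)
  base <- "https://nlp.stanford.edu/data/"
  switch(token_set,
    "6B"          = paste0(base, "glove.6B.zip"),          # 50/100/200/300d files inside
    "42B"         = paste0(base, "glove.42B.300d.zip"),    # single 300d file
    "840B"        = paste0(base, "glove.840B.300d.zip"),   # single 300d file
    "twitter.27B" = paste0(base, "glove.twitter.27B.zip")  # 25/50/100/200d files inside
  )
}

A dataset_glove() built on this could then take dimensions as a second argument and select the matching file after unzipping.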

tidytext data

  • nma_words
  • parts_of_speech
  • sentiments
    • nrc
    • bing
    • loughran
    • AFINN
  • stop_words
    • onix
    • SMART
    • snowball

indonesian nrc

As far as I know, the NRC EmoLex is available in 40+ languages, including Indonesian. Is the Indonesian version of the NRC lexicon also available in this package?

lexicon_afinn() is forcing https, but the URL is http

I get a 404 error when I try to download the AFINN sentiment lexicon. It looks like https is being forced somewhere in the function, but I can't figure out how to fix it myself. I've tried editing the textdata::catalogue data.frame but that did not work. The only way I was able to fix it was to manually download the zip file.

Release textdata 0.2.0

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

afinn dataset has improperly labelled columns

For some reason, the afinn dataset seems to have improperly named columns on my local Mac installation.

The columns are "word" and "value" instead of "word" and "sentiment" as the documentation would suggest (and a previous version of the tidytext package reflects a third specification, "word" and "score").

For reference:

> afinn=lexicon_afinn()
> names(afinn)
[1] "word"    "value"
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] textdata_0.3.0

loaded via a namespace (and not attached):
 [1] readr_1.3.1     compiler_3.5.2  R6_2.4.0        hms_0.4.2       tools_3.5.2     pillar_1.3.1    fs_1.2.7        rstudioapi_0.10
 [9] rappdirs_0.3.1  tibble_2.1.1    yaml_2.2.0      crayon_1.3.4    Rcpp_1.0.1      pkgconfig_2.0.2 rlang_0.3.4    

Release textdata 0.1.0

Prepare for release:

  • Check that description is informative
  • Check licensing of included files
  • usethis::use_cran_comments()
  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • Polish pkgdown reference index
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news()
  • Update install instructions in README
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Release textdata 0.4.2

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

lexicon_nrc() broken due to structural changes in source ZIP archive

lexicon_nrc() fails to run due to a missing file:

It appears that there has been a change to the structure of the file downloaded from: http://saifmohammad.com/WebDocs/Lexicons/NRC-Emotion-Lexicon.zip
(according to http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm this file was updated in August 2022)

Seems like the path that is currently specified in process_nrc():
"NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt"
should actually be:
"NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt"

Release textdata 0.3.0

Prepare for release:

  • Check current CRAN check results
  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Release textdata 0.4.4

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

Add clean argument

Implement an argument that deletes intermediate files, as sketched below.

Default to FALSE to preserve the current behavior.
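
A hedged sketch of what the cleanup step might look like (the helper and file names are illustrative, not the final API):

# Hypothetical helper: after the tidy .rds has been written, remove every
# other file (e.g. the downloaded zip and raw text) from the dataset folder.
clean_intermediates <- function(folder_path, keep = "afinn_111.rds") {
  files <- list.files(folder_path, full.names = TRUE)
  file.remove(files[basename(files) != keep])
}

With clean = FALSE as the default, nothing changes; clean = TRUE would call something like this after processing.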

Add support for NRC Emotion Intensity Lexicon and Valence, Arousal, Dominance Lexicon

NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon) http://saifmohammad.com/WebPages/AffectIntensity.htm

NRC Valence, Arousal, Dominance Lexicon
http://saifmohammad.com/WebPages/nrc-vad.html

For lexicon_nrc_eil:

http://saifmohammad.com/WebDocs/NRC-AffectIntensity-Lexicon.txt

@inproceedings{LREC18-AIL,
  author    = {Mohammad, Saif M.},
  title     = {Word Affect Intensities},
  booktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018)},
  year      = {2018},
  address   = {Miyazaki, Japan}
}

For lexicon_nrc_vad:

http://saifmohammad.com/WebDocs/VAD/NRC-VAD-Lexicon-Aug2018Release.zip
@inproceedings{vad-acl2018,
  title     = {Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words},
  author    = {Mohammad, Saif M.},
  booktitle = {Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL)},
  year      = {2018},
  address   = {Melbourne, Australia}
}

lexicon_nrc_vad() is currently malformatted

The original data file doesn't seem to contain headers.

textdata::lexicon_nrc_vad()
# A tibble: 19,970 × 4
   aaaaaaah    `0.479` `0.606` `0.291`
   <chr>         <dbl>   <dbl>   <dbl>
 1 aaaah         0.52    0.636   0.282
 2 aardvark      0.427   0.49    0.437
 3 aback         0.385   0.407   0.288
 4 abacus        0.51    0.276   0.485
