sttr

Calculate STTR on tokenized text with metadata using Python

Requirements

Tested using Python 3.7.

  • Pandas
pip install pandas

Or using Pipenv:

pipenv install
pipenv shell

Usage

Run the run_sttr.py script, specifying the datadir and output parameters.

When using Pipenv, you may instead use the pre-defined sttr script: replace every python run_sttr.py invocation with pipenv run sttr.

Example

python run_sttr.py /path/to/corpus/dir

The above command will look under /path/to/corpus/dir for all directories that have a groups.csv or metadata.csv file and try to extract the specified filenames from the Tokenized, Lemmatized, POS, POS_Tri, UniversalPOS, and UniversalPOS_Tri directories (if present). For each corpus and folder (Tokenized/Lemmatized/...) combination, a results_CORPUSNAME_TYPE.tsv will be generated containing calculated measures.
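The discovery step described above can be sketched roughly as follows. This is an illustration only, not the code in run_sttr.py: the function name find_corpora is hypothetical, and it assumes the search is recursive and that the directory name serves as CORPUSNAME.

```python
from pathlib import Path

# Folder and metadata file names are taken from the description above.
TYPE_DIRS = ["Tokenized", "Lemmatized", "POS", "POS_Tri",
             "UniversalPOS", "UniversalPOS_Tri"]
META_FILES = ["groups.csv", "metadata.csv"]


def find_corpora(datadir):
    """Map each corpus directory name to the type folders it contains.

    Hypothetical helper: a directory counts as a corpus if it holds a
    groups.csv or metadata.csv file (assumption: searched recursively).
    """
    found = {}
    root = Path(datadir)
    # Check the root itself and every directory below it.
    candidates = [root] + sorted(p for p in root.rglob("*") if p.is_dir())
    for directory in candidates:
        if any((directory / name).is_file() for name in META_FILES):
            # Keep only the type folders that are actually present.
            present = [t for t in TYPE_DIRS if (directory / t).is_dir()]
            found[directory.name] = present
    return found
```

Each (corpus, folder) pair returned here would correspond to one results_CORPUSNAME_TYPE.tsv file.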

Finally, a merged_results_CORPUSNAME1+CORPUSNAME2+...tsv file will be generated containing the merged results from all corpora.

An example run on the whole project, with extended metadata:

python run_sttr.py --meta 'author,genre,brow,narrative_perspective,year' ~/Dropbox/Complexity/Corpora/*

This will calculate Yule's K, STTR, and associated length measures for every corpus directory under ~/Dropbox/Complexity/Corpora. The author, genre, brow, narrative_perspective, and year metadata will also be extracted from the groups.csv file and merged into the merged_results_....tsv file at the end. Missing metadata is output as NA.
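For readers unfamiliar with the two measures, here is a minimal sketch using their common textbook definitions. It is not necessarily how run_sttr.py computes them: the function names, the chunk size of 1000 tokens, and the handling of a trailing partial chunk (dropped here) are all assumptions.

```python
from collections import Counter


def sttr(tokens, chunk_size=1000):
    """Standardized type-token ratio: mean TTR over consecutive full chunks.

    Assumption: chunk_size=1000 and a trailing partial chunk is ignored.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    # A text shorter than one chunk has no defined STTR in this sketch.
    return sum(ratios) / len(ratios) if ratios else float("nan")


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2.

    N is the token count and V_i the number of types occurring i times.
    """
    n = len(tokens)
    if n == 0:
        return float("nan")
    # Counter of counts: how many types occur exactly i times.
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 1e4 * (s2 - n) / (n * n)
```

Because each chunk has the same length, STTR is less sensitive to text length than a plain type-token ratio, which is why it is reported alongside the length measures.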

Advanced usage

The full usage message:

usage: run_sttr.py [-h] [--check-only] [--meta META_FIELDS] [-t TYPES] [-p]
                   [-f FIELD]
                   datadirs [datadirs ...]

calculates sttr

positional arguments:
  datadirs            directory with data in csv files

optional arguments:
  -h, --help          show this help message and exit
  --check-only        do a pass through all specified corpus directories to
                      make sure they conform to project standards
  --meta META_FIELDS  specify metadata fields in CSV to use as categorical
                      features, optional, (default='Brow'); Format: specify as
                      CSV string
  -t TYPES            specify folders to use (Tokenized or POS etc.),
                      optional, (default='Tokenized,Lemmatized,POS,POS_Tri,Uni
                      versalPOS,UniversalPOS_Tri')
  -p                  remove punctuation, optional, (default='False')
  -f FIELD            use delimited field number to extract chosen unit
                      (token/POS/lemma/...), optional, (default='0' (the first
                      field))
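As an illustration of what the -f and -p options describe, the sketch below extracts one delimited field per line and optionally drops punctuation-only units. It is not taken from run_sttr.py: the function name extract_units, the tab delimiter, the one-unit-per-line layout, and the Unicode-category test for punctuation are all assumptions.

```python
import unicodedata


def extract_units(lines, field=0, delimiter="\t", remove_punctuation=False):
    """Return the chosen field of every non-empty line.

    Assumptions: one unit per line, fields separated by `delimiter`,
    and "punctuation" meaning every character is in a Unicode P* category.
    """
    units = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split(delimiter)
        if field >= len(parts):
            continue  # skip lines that lack the requested field
        unit = parts[field]
        if remove_punctuation and unit and all(
            unicodedata.category(ch).startswith("P") for ch in unit
        ):
            continue  # drop units made up only of punctuation characters
        units.append(unit)
    return units
```

With field=0 (the documented default) this yields the first column; a different field number would select, for example, a lemma or POS column if the files carry several.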

Note that you may specify multiple corpora on the command line like below:

python run_sttr.py /path/to/corpus/dirs/* /path/to/other/corpus

Contributors

borh

sttr's Issues

corpus version support

Currently there is a -t (tokenization) switch, but it would be nice to support multiple versions (lemmatized/etc.) of the same underlying corpus data, and merge this into the final results.

hapax legomena removal option

Issues:

  • Does switching this on run the analysis twice (once with and once without hapax legomena) and merge both sets of results into the final output?
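One possible shape for this option, offered only as a sketch of the "run twice" reading of the question above and not as a decided design. The names remove_hapax and with_and_without_hapax are hypothetical, and measure stands for any function that takes a token list.

```python
from collections import Counter


def remove_hapax(tokens):
    """Drop every token whose type occurs exactly once in the text."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] > 1]


def with_and_without_hapax(tokens, measure):
    """Apply `measure` to the full text and to the text without hapaxes.

    Returning both values side by side lets them be merged into the
    final results as two separate columns.
    """
    return {
        "with_hapax": measure(tokens),
        "without_hapax": measure(remove_hapax(tokens)),
    }
```

For example, with_and_without_hapax(tokens, len) reports the token count before and after removal.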
