Differential Language Analysis ToolKit

DLATK is an end-to-end human text analysis package for Python 3. It is especially suited to social media, psychology, and health research, and was originally developed for projects at the University of Pennsylvania, Stony Brook University, and Stanford University. It has been used in over 100 peer-reviewed publications (many from before there was an article to reference).

DLATK is designed to handle the multi-level nature of language (words belong to documents, written by people, within communities), which makes it particularly useful for psychological and social science research.
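To make the multi-level idea concrete, here is a minimal sketch (not DLATK's implementation) of rolling message-level words up to a grouping level such as users. The `messages` rows, the `one_gram_features` helper, and the whitespace tokenizer are all hypothetical and only illustrate the shape of the data; each word ends up with a raw count and a relative frequency within its group.

```python
from collections import Counter, defaultdict

# Hypothetical (user_id, message) rows; a real run would read these from a database table.
messages = [
    (1, "happy happy day"),
    (1, "good day"),
    (2, "bad day"),
]

def one_gram_features(rows):
    """Aggregate message-level tokens into per-group counts and relative frequencies."""
    counts = defaultdict(Counter)
    for group_id, text in rows:
        # Naive whitespace tokenizer, used here only for illustration.
        counts[group_id].update(text.lower().split())
    features = {}
    for group_id, counter in counts.items():
        total = sum(counter.values())
        # Store (raw count, relative frequency within the group) per word.
        features[group_id] = {word: (n, n / total) for word, n in counter.items()}
    return features

features = one_gram_features(messages)
print(features[1]["happy"])  # (2, 0.4)
```

Changing the grouping key (for example, from a user id to a community id) would aggregate the same messages at a different level.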

Some examples of what DLATK can do:

  • linguistic feature extraction (i.e. turning text into features or variables for analyses)
  • differential language analysis (i.e. finding the language that is most associated with psychological or health variables)
  • wordcloud visualization
  • statistical- and machine learning-based supervised prediction (regression and classification)
  • statistical- and machine learning-based dimensionality reduction and clustering
  • mediation analysis
  • contextual embeddings: using deep learning transformers to produce message, user, or group embeddings
  • part-of-speech tagging
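As a rough illustration of the differential language analysis item above, the sketch below correlates each word's per-user relative frequency with an outcome and ranks words by that correlation. It is a simplified stand-in that assumes a plain Pearson correlation, not DLATK's own procedure; the `rel_freq` and `outcome` values and the `pearson_r` helper are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: relative frequency of each word for four users,
# and one outcome score per user (same user order).
rel_freq = {
    "happy": [0.4, 0.1, 0.3, 0.0],
    "tired": [0.0, 0.3, 0.1, 0.4],
}
outcome = [8.0, 3.0, 6.0, 1.0]

# Rank words from most positively to most negatively associated with the outcome.
ranked = sorted(
    ((word, pearson_r(freqs, outcome)) for word, freqs in rel_freq.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, r in ranked:
    print(f"{word}: r = {r:.3f}")
```

With real data the number of words is large, so a real analysis would also need to account for multiple comparisons; that step is omitted here.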

Installation

DLATK is available via any of four popular installation platforms: pip (recommended), conda, GitHub, or Docker.

New to installing Python packages?

It is recommended that you see the full installation instructions.

STEP 1: Make sure you have python3-mysqldb (if using MySQL):

sudo apt-get install python3-mysqldb
sudo apt install libmysqlclient-dev  #OR for MariaDB: sudo apt-get install libmariadbclient-dev
sudo apt-get install python3-pip
pip3 install mysqlclient

STEP 2: Install from one of these options:

A. GitHub

git clone https://github.com/dlatk/dlatk.git
cd dlatk
python3 setup.py install

B. pip

pip3 install dlatk

C. conda

conda install -c wwbp dlatk

D. Docker (from 2018; may not work well for newer versions)

Detailed Docker install instructions here.

docker run --name mysql_v5  --env MYSQL_ROOT_PASSWORD=my-secret-pw --detach mysql:5.5
docker run -it --rm --name dlatk_docker --link mysql_v5:mysql dlatk/dlatk bash

Still didn't work? If using Linux, try this:

sudo apt install python3-pip
pip3 install numpy scipy scikit-learn statsmodels jsonrpclib simplejson nltk
sudo apt-get install python3-mysqldb
sudo apt install libmysqlclient-dev
pip3 install mysqlclient
git clone https://github.com/dlatk/dlatk.git
cd dlatk
python3 setup.py install

Dependencies

See the full installation instructions for recommended and optional dependencies.

Quick Start

To check if it will run:

python3 dlatkInterface.py -h

To add the packaged data to MySQL and test with it:

mysql -e 'CREATE DATABASE dla_tutorial'; cat dlatk/data/dla_tutorial.sql | mysql dla_tutorial
mysql -e 'CREATE DATABASE dlatk_lexica'; cat dlatk/data/dlatk_lexica.sql | mysql dlatk_lexica

python3 dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1 --add_lex -l dd_intAff --weighted_lex

Expected output:

-----
DLATK Interface Initiated: XXXX-XX-XX XX:XX:XX
-----
SQL QUERY: DROP TABLE IF EXISTS feat$1gram$msgs$user_id$16to16
SQL QUERY: CREATE TABLE feat$1gram$msgs$user_id$16to16 ( id BIGINT(16) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, group_id int(10) unsigned, feat VARCHAR(36) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, value INTEGER, group_norm DOUBLE, KEY `correl_field` (`group_id
SQL QUERY: DROP TABLE IF EXISTS feat$meta_1gram$msgs$user_id$16to16
SQL QUERY: CREATE TABLE feat$meta_1gram$msgs$user_id$16to16 ( id BIGINT(16) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, group_id int(10) unsigned, feat VARCHAR(16) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, value INTEGER, group_norm DOUBLE, KEY `correl_field` (`gro
finding messages for 1000 'user_id's
SQL QUERY: ALTER TABLE feat$1gram$msgs$user_id$16to16 DISABLE KEYS
Messages Read: 5k
...
Messages Read: 30k
Done Reading / Inserting.
Adding Keys (if goes to keycache, then decrease MAX_TO_DISABLE_KEYS or run myisamchk -n).
SQL QUERY: ALTER TABLE feat$1gram$msgs$user_id$16to16 ENABLE KEYS
Done

Intercept detected 5.037105 [category: AFFECT_AVG]
Intercept detected 2.399763 [category: INTENSITY_AVG]
SQL QUERY: DROP TABLE IF EXISTS feat$cat_dd_intAff_w$msgs$user_id$1gra
SQL QUERY: CREATE TABLE feat$cat_dd_intAff_w$msgs$user_id$1gra ( id BIGINT(16) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, group_id int(10) unsigned, feat VARCHAR(13) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, value INTEGER, group_norm DOUBLE, KEY `correl_field` (`
WORD TABLE feat$1gram$msgs$user_id$16to16
SQL QUERY: ALTER TABLE feat$cat_dd_intAff_w$msgs$user_id$1gra DISABLE KEYS
10 out of 1000 group Id's processed; 0.01 complete
20 out of 1000 group Id's processed; 0.02 complete
...
1000 out of 1000 group Id's processed; 1.00 complete
SQL QUERY: ALTER TABLE feat$cat_dd_intAff_w$msgs$user_id$1gra ENABLE KEYS
--
Interface Runtime: 167.67 seconds
DLATK exits with success! A good day indeed  ¯\_(ツ)_/¯.
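The "Intercept detected" lines in the output above indicate that the weighted lexicon carries an intercept for each category. As a minimal sketch of the general idea behind weighted lexicon scoring, the code below assumes a category score is the intercept plus the weighted sum of a group's relative word frequencies. This is an assumption made for illustration, not a description of DLATK's internals; the `weighted_lex_score` helper, the words, and the weights are hypothetical and are not taken from dd_intAff.

```python
def weighted_lex_score(rel_freqs, weights, intercept=0.0):
    """Score one group: intercept plus the weighted sum of its word frequencies.

    Words that are missing from the lexicon contribute nothing to the score.
    """
    return intercept + sum(
        weights[word] * freq for word, freq in rel_freqs.items() if word in weights
    )

# Hypothetical relative frequencies for one user and a hypothetical lexicon category.
user_rel_freqs = {"happy": 0.5, "day": 0.25, "bad": 0.25}
affect_weights = {"happy": 2.0, "bad": -2.0}

score = weighted_lex_score(user_rel_freqs, affect_weights, intercept=5.0)
print(score)  # 5.5
```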

Documentation

The documentation for the latest release is at dlatk.wwbp.org.

Citation

If you use DLATK in your work please cite the following paper:

H. Andrew Schwartz, Salvatore Giorgi, Maarten Sap, Patrick Crutchley, Lyle Ungar, and Johannes Eichstaedt. 2017. DLATK: Differential Language Analysis ToolKit. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 55–60, Copenhagen, Denmark. Association for Computational Linguistics.

bibtex

@InProceedings{DLATKemnlp2017,
  author =  "Schwartz, H. Andrew and Giorgi, Salvatore and Sap, Maarten and Crutchley, Patrick and Eichstaedt, Johannes and Ungar, Lyle",
  title =   "DLATK: Differential Language Analysis ToolKit",
  booktitle =   "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
  year =  "2017",
  publisher =   "Association for Computational Linguistics",
  pages =   "55--60",
  location =  "Copenhagen, Denmark",
  url =   "http://aclweb.org/anthology/D17-2010"
}

License

Licensed under the GNU General Public License v3 (GPLv3).

Background

Developed by the World Well-Being Project based out of the University of Pennsylvania and Stony Brook University.
