QTA2023

This is the repository for the 2023/24 quantitative text analysis module of the Methods course in the research master's programme in History at Utrecht University (see: https://www.uu.nl/en/masters/history). The course manual is in the folder of the same name.

The Notebooks folder contains a series of Jupyter Notebooks that introduce quantitative text analysis (text mining); they are structured as listed below. The Sample data folder contains some example text files (see below). Most notebooks take .txt files as input, but can easily be adapted to import .csv files. Text files are ideally chronological and named for the year they represent (for example '1981.txt', '1982.txt', etc.).
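
The file-naming convention above can be sketched as follows. This is a minimal illustration, not the notebooks' own loading code; the function name `load_yearly_texts` is hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def load_yearly_texts(folder):
    """Return {year: text} for files named like '1981.txt', in chronological order."""
    texts = {}
    for path in Path(folder).glob("*.txt"):
        # Skip files whose name is not a plain year, e.g. 'notes.txt'.
        if not path.stem.isdigit():
            continue
        # Assumption: the files are UTF-8 encoded.
        texts[int(path.stem)] = path.read_text(encoding="utf-8")
    # Sort numerically on the year so the result is chronological.
    return dict(sorted(texts.items()))
```

Keying the result on the integer year keeps the files in chronological order for any later per-file plot.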

Most of the code is my own; where I copied code from other projects, the notebooks link to the source. Doing things with text 1 is based on code that Berit Jansen (Research Software Lab, Utrecht University) wrote. Brecht Nijman contributed to Doing things with text 4. Thanks to both!

Notebooks

Doing things with text 1 - Preprocessing of a single text file (.txt):

  • remove html, punctuation, numbers, short words, stopwords; lowercase
  • save cleaned text to file
  • basic statistics of text
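
The preprocessing steps listed above could look roughly like this in plain Python. It is a sketch under stated assumptions, not the notebook's actual code: the names `preprocess` and `basic_stats`, the tiny stopword list, and the minimum word length of 3 are all illustrative choices.

```python
import re

# Tiny illustrative stopword list (an assumption); a real list would be much longer.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "is", "it", "that", "this", "with"}


def preprocess(raw, min_length=3, stopwords=STOPWORDS):
    """Strip html tags, lowercase, drop punctuation/numbers, short words and stopwords."""
    text = re.sub(r"<[^>]+>", " ", raw)    # remove html tags
    text = text.lower()
    # Keep only a-z and whitespace. Note: this also drops accented letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if len(t) >= min_length and t not in stopwords]


def basic_stats(tokens):
    """Return token count, type (unique word) count and type-token ratio."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ratio = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "type_token_ratio": ratio}
```

Saving the cleaned text then amounts to writing `" ".join(tokens)` to a new .txt file.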

Doing things with text 2 - Word counts on a single, preprocessed text file (.txt):

  • most common words as bar chart
  • most common words as word cloud
  • most common words by word length
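
The counting behind these charts can be sketched with the standard library alone. The function names are hypothetical, and grouping the counts per word length is only one possible reading of "by word length"; plotting is left out.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(tokens).most_common(n)


def top_words_by_length(tokens, n=3):
    """Group words by their length and return the n most frequent per length."""
    by_length = {}
    for token in tokens:
        by_length.setdefault(len(token), Counter())[token] += 1
    return {
        length: counter.most_common(n)
        for length, counter in sorted(by_length.items())
    }
```

The (word, count) pairs returned here are the kind of input a bar chart or word cloud library would typically take.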

Doing things with text 3a - Preprocessing and word counts on multiple text files (.txt, raw or preprocessed):

  • same as Doing things with text 1 and 2 but for multiple text files

Doing things with text 3b - Preprocessing and word counts on multiple .csv files (raw text):

  • same as Doing things with text 1 and 2 but for one or more csv files
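
Reading raw text from a .csv file instead of a .txt file might look like the sketch below. The column name `text` and the function name `read_text_column` are assumptions; the actual column names depend on the .csv file used.

```python
import csv

# Whole screenplays can exceed the csv module's default field size limit,
# so raise it to a generous fixed value.
csv.field_size_limit(10_000_000)


def read_text_column(path, column="text"):
    """Return the non-empty values of one column as a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [row[column] for row in reader if row.get(column)]
```

Each returned string can then go through the same preprocessing and counting steps as the contents of a .txt file.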

Doing things with text 4 - Text analysis (for multiple .txt files, preprocessed):

  • plot word / n-gram frequency per file in a scatter plot
  • print and save collocations (log likelihood, PMI, raw frequency) of one or more keywords per file
  • print and save top n-grams per file
  • print and save top n-grams per file starting or ending with a given keyword
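
The n-gram and collocation steps above can be illustrated with a small standard-library sketch. It is not the notebook's implementation: the function names are hypothetical, and only PMI over adjacent word pairs is shown (log likelihood is left out).

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return all n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(tokens, n=2, k=10, keyword=None, position="start"):
    """Return the k most frequent n-grams, optionally starting or ending with keyword."""
    grams = ngrams(tokens, n)
    if keyword is not None:
        index = 0 if position == "start" else -1
        grams = [g for g in grams if g[index] == keyword]
    return Counter(grams).most_common(k)


def pmi_collocations(tokens, keyword, min_count=1):
    """Rank bigrams containing keyword by pointwise mutual information (log base 2)."""
    unigram = Counter(tokens)
    bigram = Counter(ngrams(tokens, 2))
    total_uni = len(tokens)
    total_bi = sum(bigram.values())
    scores = {}
    for (w1, w2), count in bigram.items():
        if keyword not in (w1, w2) or count < min_count:
            continue
        p_pair = count / total_bi
        p_w1 = unigram[w1] / total_uni
        p_w2 = unigram[w2] / total_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

PMI favours rare pairs, so on real data a `min_count` above 1 usually gives more meaningful collocations.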

Doing things with text 5 - tf-idf with gensim (for multiple .txt files, preprocessed):

  • plot top distinct words (tf-idf) per file in a bar chart
  • create heatmap for cosine similarity
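
To show what tf-idf and cosine similarity compute, here is a plain-Python sketch. The notebook uses gensim; this example deliberately does not, and its weighting (relative term frequency times the natural log of N/df) is one common variant that may differ from gensim's defaults. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: {name: list of tokens}. Returns {name: {word: tf-idf weight}}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(w * vec_b.get(word, 0.0) for word, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A word that occurs in every file gets weight 0, which is why tf-idf surfaces the words that are distinctive for a single file. Computing `cosine` for every pair of files gives the matrix that a heatmap would display.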

Doing things with text 6 - Part-of-speech tagging with spaCy (for multiple .txt files, preprocessed):

  • print most common words of a particular type (adjective, verb, (proper) noun) per file
  • print most common named entities per file

Doing things with text 7 - word embeddings with gensim's word2vec (for multiple .txt files, raw or preprocessed):

  • train word2vec model on dataset
  • search most similar terms for one or more keywords
  • plot most similar terms as clusters in a t-SNE plot

Sample data

  • screenplays for Star Wars I - VII as .txt
  • screenplays for a series of movies about science/scientists as .csv

Contributors

  • pimhuijnen
