This project is an exploration of topic modeling of literature with Latent Dirichlet Allocation (LDA).
A bulk package of Project Gutenberg books was obtained. Each book was an obscurely named text file, so the first step was to write `book_cleaner.py` to parse the books. Each file has inconsistent headers and footers containing attributions and transcription notes that need to be stripped out. Because of the overwhelming inconsistency, some books are skipped; on a successful parse, the user is dropped into vim to edit out any remaining text they don't want included in the topic modeling. Saved books are placed in the `./Books` subdirectory.
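The stripping logic itself isn't shown here, but a minimal sketch of the idea, assuming the standard `*** START OF ... ***` / `*** END OF ... ***` delimiter lines that most Gutenberg files carry (the function name is hypothetical; files without the markers would fall into the skipped/manual-edit path):

```python
import re

# Standard Project Gutenberg delimiter lines. Assumption: not every file
# uses them consistently, which is why some books end up skipped or
# handed off to vim for manual cleanup.
START_RE = re.compile(r"\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", re.I)
END_RE = re.compile(r"\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", re.I)

def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END markers, or None if the
    markers can't be found (caller should skip or hand-edit the book)."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if not start or not end or end.start() <= start.end():
        return None
    return raw[start.end():end.start()].strip()
```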
This file parses the books saved to `./Books`. It tokenizes the text in one of 6 ways depending on how the `METHOD` constant is set. Each method builds on the previous one:

- `basic`: Basic NLTK word tokenization.
- `trimmed`: Trims the book text to the nearest sentence boundary after 1000 words. This is inspired by Jockers and Mimno (2013), who tested this on 19th-century literature.
- `stopwords`: Removes all NLTK stopwords, plus some others identified during the project. Also removes any roman numerals.
- `onlynouns`: Removes all tokens that aren't nouns. Proper nouns are also removed, leaving only common nouns.
- `lemmatize`: Lemmatizes tokens using `WordNetLemmatizer` from `nltk.stem.wordnet`.
- `bigrams`: Uses only bigrams. Warning: this needs a lot more work to be useful.
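To make the cumulative levels concrete, here is an illustrative sketch of a few of them. The real script relies on NLTK (`word_tokenize`, `pos_tag`, `WordNetLemmatizer`); plain regexes and a tiny stopword set stand in below so the example is self-contained, and all function names are hypothetical:

```python
import re

ROMAN_RE = re.compile(r"^[ivxlcdm]+$")                   # lowercase roman numerals
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # tiny stand-in list

def tokenize_basic(text):
    # "basic": word tokenization (stand-in for nltk.word_tokenize)
    return re.findall(r"[a-z]+", text.lower())

def trim_to_sentence(text, limit=1000):
    # "trimmed": cut at the first sentence boundary after `limit` words
    words = text.split()
    for i, w in enumerate(words):
        if i + 1 >= limit and w.endswith((".", "!", "?")):
            return " ".join(words[: i + 1])
    return text

def remove_stopwords(tokens):
    # "stopwords": drop stopwords and roman numerals
    return [t for t in tokens if t not in STOPWORDS and not ROMAN_RE.match(t)]

def bigrams(tokens):
    # "bigrams": adjacent token pairs only
    return list(zip(tokens, tokens[1:]))
```

For example, `remove_stopwords(tokenize_basic("Chapter IV. The whale and the sea."))` drops the roman numeral `iv` along with the stopwords, leaving `['chapter', 'whale', 'sea']`.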
After running, the script outputs a PyLDAvis visualization named after the chosen method, for example `books-lemmatize-32.html`, where 32 is the selected number of topics to find. This can be changed via the `NUM_TOPICS` constant. If you want to run a coherence plot, set `FIND_COHERENCE` to `True`.
Example LDAvis: