Instructions and supporting tools for training word embeddings for historical German (ca. 1600-1900) based on the texts of the Deutsches Textarchiv (DTA) using fastText.

A pre-trained model can be obtained here [1.43 GB].
On Linux:

- Follow the instructions for building fastText.
- Download the DTA normalized XML files:

  ```shell
  wget -i dta_normalized.txt -P dta_normalized
  ```
- Transform the DTA normalized XML files into plain text (`xsltproc` processes one input file per invocation, so loop over the directory):

  ```shell
  for f in dta_normalized/*.xml; do xsltproc tei2txt.xsl "$f" > "${f%.xml}.txt"; done
  ```
- Concatenate all plain text files into a single text file:

  ```shell
  cat dta_normalized/*.txt > dta_normalized/dta_normalized_all.txt
  ```
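The TEI-to-text conversion itself is done by the XSLT stylesheet `tei2txt.xsl`. As a rough illustration of what that step amounts to (not the stylesheet's exact behavior; restricting extraction to the TEI `<body>` element is an assumption here), the same idea can be sketched in Python:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def tei_to_text(xml_string):
    """Collect all text content below the TEI <body> and normalize whitespace."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI_NS}body")
    return " ".join(" ".join(body.itertext()).split())

# Tiny stand-in document; real DTA files are full TEI editions.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Von der Freyheit de\u00df Menschen.</p>"
    "</body></text></TEI>"
)
print(tei_to_text(sample))  # prints: Von der Freyheit deß Menschen.
```

In the actual pipeline this role is played by `xsltproc` plus the stylesheet, which also handles TEI-specific details (notes, page breaks, hyphenation) that this sketch ignores.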
- Compute word embeddings using fastText:

  ```shell
  fasttext skipgram -input dta_normalized/dta_normalized_all.txt -output dta_emb
  ```
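Training writes two files, `dta_emb.bin` and `dta_emb.vec`. The `.vec` file is plain text: a header line `count dim`, then one `word v1 ... vdim` line per word. A minimal sketch of parsing that format and comparing two words by cosine similarity (the toy vectors and example spellings below are made up for illustration; plain stdlib Python):

```python
import math

def parse_vec(lines):
    """Parse fastText .vec content: a 'count dim' header, then 'word v1 ... vdim' lines."""
    it = iter(lines)
    next(it)  # skip the header line
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if parts[0]:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy two-dimensional stand-in for the real dta_emb.vec:
toy = ["2 2", "Freyheit 1.0 0.0", "Freiheit 0.8 0.6"]
vecs = parse_vec(toy)
print(round(cosine(vecs["Freyheit"], vecs["Freiheit"]), 2))  # prints 0.8
```

For the real model, pass the open file instead of the toy list: `parse_vec(open("dta_emb.vec", encoding="utf-8"))`.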
On Windows:

- Download `wget.exe` from https://eternallybored.org/misc/wget/
- Download `msxsl.exe` from https://www.microsoft.com/en-us/download/details.aspx?id=21714
- Download `fasttext.exe` from https://github.com/xiamx/fastText/releases
- Download the DTA normalized XML files:

  ```shell
  wget.exe -i dta_normalized.txt -P dta_normalized
  ```
- Run the batch script `dta2txt.bat` to convert the XML files into a combined plain text file.
- Compute word embeddings using fastText:

  ```shell
  fasttext.exe skipgram -input dta_normalized\dta_normalized_all.txt -output dta_emb
  ```
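The contents of `dta2txt.bat` are not reproduced here; it presumably loops `msxsl.exe` with `tei2txt.xsl` over the XML files and concatenates the results, mirroring the Linux steps. As a hedged, cross-platform stand-in for that convert-and-concatenate step (the crude regex tag-stripping below replaces the real XSLT transformation; directory and file names are assumptions matching the steps above):

```python
import pathlib
import re
import tempfile

def strip_tags(xml_text):
    """Crude tag removal; the real pipeline uses the tei2txt.xsl stylesheet instead."""
    return " ".join(re.sub(r"<[^>]+>", " ", xml_text).split())

def convert_and_concatenate(src_dir, out_file):
    """Convert every .xml file under src_dir to text and append it to out_file."""
    with open(out_file, "w", encoding="utf-8") as out:
        for xml_path in sorted(pathlib.Path(src_dir).glob("*.xml")):
            out.write(strip_tags(xml_path.read_text(encoding="utf-8")) + "\n")

# Demo on a throwaway directory instead of the real dta_normalized/:
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "a.xml").write_text("<p>Von der Freyheit</p>", encoding="utf-8")
    convert_and_concatenate(d, pathlib.Path(d) / "all.txt")
    print((pathlib.Path(d) / "all.txt").read_text(encoding="utf-8"))
```

This is only meant to make the batch script's role concrete; for the actual corpus, use the XSLT-based conversion so TEI structure is handled properly.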