Instructions and supporting tools for training word embeddings for historical German (ca. 1600-1900) based on the texts of the Deutsches Textarchiv (DTA) using fastText.

A pre-trained model can be obtained here [1.43 GB].
On Linux:

- Follow the instructions for building fastText.
- Download the DTA normalized XML files:

  ```shell
  wget -i dta_normalized.txt -P dta_normalized
  ```
- Transform the DTA normalized XML files into plain text (`xsltproc` processes one input file per invocation, so loop over the directory):

  ```shell
  for f in dta_normalized/*.xml; do xsltproc tei2txt.xsl "$f" > "${f%.xml}.txt"; done
  ```
- Concatenate all plain text files into a single text file:

  ```shell
  cat dta_normalized/*.txt > dta_normalized/dta_normalized_all.txt
  ```
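The TEI-to-text conversion itself is done by the XSLT stylesheet `tei2txt.xsl`. As a rough illustration of what that step amounts to (not the stylesheet's exact behavior; restricting extraction to the TEI `<body>` element is an assumption here), the same idea can be sketched in Python:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def tei_to_text(xml_string):
    """Collect all text content below the TEI <body> and normalize whitespace."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI_NS}body")
    return " ".join(" ".join(body.itertext()).split())

# Tiny stand-in document; real DTA files are full TEI editions.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Von der Freyheit de\u00df Menschen.</p>"
    "</body></text></TEI>"
)
print(tei_to_text(sample))  # prints: Von der Freyheit deß Menschen.
```

In the actual pipeline this role is played by `xsltproc` plus the stylesheet, which also handles TEI-specific details (notes, page breaks, hyphenation) that this sketch ignores.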
- Compute word embeddings using fastText:

  ```shell
  fasttext skipgram -input dta_normalized/dta_normalized_all.txt -output dta_emb
  ```
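Training writes two files, `dta_emb.bin` and `dta_emb.vec`. The `.vec` file is plain text: a header line `count dim`, then one `word v1 ... vdim` line per word. A minimal sketch of parsing that format and comparing two words by cosine similarity (the toy vectors and example spellings below are made up for illustration; plain stdlib Python):

```python
import math

def parse_vec(lines):
    """Parse fastText .vec content: a 'count dim' header, then 'word v1 ... vdim' lines."""
    it = iter(lines)
    next(it)  # skip the header line
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if parts[0]:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy two-dimensional stand-in for the real dta_emb.vec:
toy = ["2 2", "Freyheit 1.0 0.0", "Freiheit 0.8 0.6"]
vecs = parse_vec(toy)
print(round(cosine(vecs["Freyheit"], vecs["Freiheit"]), 2))  # prints 0.8
```

For the real model, pass the open file instead of the toy list: `parse_vec(open("dta_emb.vec", encoding="utf-8"))`.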
On Windows:

- Download `wget.exe` from https://eternallybored.org/misc/wget/
- Download `msxsl.exe` from https://www.microsoft.com/en-us/download/details.aspx?id=21714
- Download `fasttext.exe` from https://github.com/xiamx/fastText/releases
- Download the DTA normalized XML files:

  ```shell
  wget.exe -i dta_normalized.txt -P dta_normalized
  ```
- Run the batch script `dta2txt.bat` to convert the XML files into a combined plain text file.
- Compute word embeddings using fastText:

  ```shell
  fasttext.exe skipgram -input dta_normalized\dta_normalized_all.txt -output dta_emb
  ```
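The contents of `dta2txt.bat` are not reproduced here; it presumably loops `msxsl.exe` with `tei2txt.xsl` over the XML files and concatenates the results, mirroring the Linux steps. As a hedged, cross-platform stand-in for that convert-and-concatenate step (the crude regex tag-stripping below replaces the real XSLT transformation; directory and file names are assumptions matching the steps above):

```python
import pathlib
import re
import tempfile

def strip_tags(xml_text):
    """Crude tag removal; the real pipeline uses the tei2txt.xsl stylesheet instead."""
    return " ".join(re.sub(r"<[^>]+>", " ", xml_text).split())

def convert_and_concatenate(src_dir, out_file):
    """Convert every .xml file under src_dir to text and append it to out_file."""
    with open(out_file, "w", encoding="utf-8") as out:
        for xml_path in sorted(pathlib.Path(src_dir).glob("*.xml")):
            out.write(strip_tags(xml_path.read_text(encoding="utf-8")) + "\n")

# Demo on a throwaway directory instead of the real dta_normalized/:
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "a.xml").write_text("<p>Von der Freyheit</p>", encoding="utf-8")
    convert_and_concatenate(d, pathlib.Path(d) / "all.txt")
    print((pathlib.Path(d) / "all.txt").read_text(encoding="utf-8"))
```

This is only meant to make the batch script's role concrete; for the actual corpus, use the XSLT-based conversion so TEI structure is handled properly.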