hmBench: A Benchmark for Historical Language Models on NER Datasets

This repository presents a benchmark for Historical Language Models, with a main focus on NER datasets such as HIPE-2022.

Models

The following Historical Language Models are currently used in benchmarks:

| Model | Hugging Face Model Hub Org |
|---|---|
| hmBERT | Historical Multilingual Language Models for Named Entity Recognition |
| hmTEAMS | Historical Multilingual TEAMS Models |
| hmByT5 | Historical Multilingual and Monolingual ByT5 Models |

Datasets

We benchmark pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana. The following table gives an overview of the datasets used:

| Language | Datasets |
|---|---|
| English | AjMC, TopRes19th |
| German | AjMC, NewsEye, HIPE-2020 |
| French | AjMC, ICDAR-Europeana, LeTemps, NewsEye, HIPE-2020 |
| Finnish | NewsEye |
| Swedish | NewsEye |
| Dutch | ICDAR-Europeana |
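
For scripting over the benchmark, the table above can be written down as a plain mapping. This is only a sketch: the constant and helper names are hypothetical and not part of this repository, and the entries simply mirror the table.

```python
# Language -> datasets, copied from the overview table (names are illustrative).
DATASETS_BY_LANGUAGE = {
    "English": ["AjMC", "TopRes19th"],
    "German": ["AjMC", "NewsEye", "HIPE-2020"],
    "French": ["AjMC", "ICDAR-Europeana", "LeTemps", "NewsEye", "HIPE-2020"],
    "Finnish": ["NewsEye"],
    "Swedish": ["NewsEye"],
    "Dutch": ["ICDAR-Europeana"],
}


def languages_for(dataset: str) -> list[str]:
    """Return all languages (sorted) for which the given dataset is listed."""
    return sorted(
        language
        for language, datasets in DATASETS_BY_LANGUAGE.items()
        if dataset in datasets
    )


# Every (language, dataset) combination listed in the table.
PAIRS = [
    (language, dataset)
    for language, datasets in DATASETS_BY_LANGUAGE.items()
    for dataset in datasets
]

print(len(PAIRS))
print(languages_for("NewsEye"))
```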

Results

The hmLeaderboard space on the Hugging Face Model Hub shows all results and can be accessed here.

Best Models

A collection of the best-performing models, grouped by the backbone LM used, can be found here.

Fine-Tuning

We use Flair to fine-tune NER models on the datasets from the HIPE-2022 Shared Task. Additionally, the ICDAR-Europeana dataset is used for benchmarks on Dutch and French.

We use a tagged version of Flair to improve reproducibility. The following command installs all necessary dependencies:

$ pip3 install -r requirements.txt

To use the hmTEAMS models, you need to authenticate with your Hugging Face Model Hub account. This can be done via the CLI:

# Use access token from https://huggingface.co/settings/tokens
$ huggingface-cli login
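
In non-interactive environments, the same authentication could be done from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and the token is stored in an `HF_TOKEN` environment variable; both helper names are hypothetical and not part of this repository.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in the given environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a token from https://huggingface.co/settings/tokens"
        )
    return token


def login_to_hub(env_var: str = "HF_TOKEN") -> None:
    """Authenticate against the Hugging Face Model Hub (hypothetical helper)."""
    # Imported lazily so that reading the token does not require the package.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))
```

Calling `login_to_hub()` once before fine-tuning should have the same effect as the CLI login, provided the token has access to the hmTEAMS models.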

We use a config-driven hyper-parameter search. The script flair-fine-tuner.py can be used to fine-tune NER models from our Model Zoo.
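
To illustrate what a config-driven search can look like, here is a minimal sketch that expands a JSON config into one run per hyper-parameter combination. The config keys and the convention that list-valued entries are search dimensions are assumptions made for illustration; the actual schema read by flair-fine-tuner.py may differ.

```python
import itertools
import json


def expand_grid(config: dict) -> list[dict]:
    """Return one dict per hyper-parameter combination in the config."""
    # Assumed convention: list-valued entries are search dimensions,
    # everything else is a fixed setting shared by all runs.
    search = {k: v for k, v in config.items() if isinstance(v, list)}
    fixed = {k: v for k, v in config.items() if not isinstance(v, list)}
    keys = list(search)
    runs = []
    for values in itertools.product(*(search[k] for k in keys)):
        runs.append({**fixed, **dict(zip(keys, values))})
    return runs


# Hypothetical config; the real files under ./configs may use other keys.
EXAMPLE_CONFIG = json.loads("""
{
  "model": "hypothetical-model-name",
  "batch_size": [4, 8],
  "learning_rate": [3e-5, 5e-5],
  "seed": [1, 2, 3]
}
""")

for run in expand_grid(EXAMPLE_CONFIG):
    print(run)
```

Each resulting dict describes one fine-tuning run, so the number of runs is the product of the lengths of all list-valued entries.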

Additionally, we provide a script that uses Hugging Face AutoTrain Advanced (Space Runner) to fine-tune models. The following snippet shows an example:

$ pip3 install git+https://github.com/huggingface/autotrain-advanced.git
$ export HF_TOKEN="" # Get token from: https://huggingface.co/settings/tokens
$ autotrain spacerunner --project-name "flair-hmbench-hmbyt5-ajmc-de" \
  --script-path $(pwd) \
  --username stefan-it \
  --token $HF_TOKEN \
  --backend spaces-t4s \
  --env "CONFIG=configs/ajmc/de/hmbyt5.json;HF_TOKEN=$HF_TOKEN;HUB_ORG_NAME=stefan-it"

The concrete implementation can be found in script.py.

Note: the AutoTrain implementation is currently under development!
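
The `--env` value in the command above is a `;`-separated list of `KEY=VALUE` pairs. When launching several runs, that string could be built programmatically. This is a sketch with a hypothetical helper; it only formats the string and does not call AutoTrain.

```python
def build_env_string(variables: dict[str, str]) -> str:
    """Join variables as KEY=VALUE pairs separated by ';'."""
    for key, value in variables.items():
        # ';' and '=' act as separators, so they cannot appear where they
        # would make the resulting string ambiguous.
        if "=" in key or ";" in key or ";" in value:
            raise ValueError(f"Unsupported character in {key!r}={value!r}")
    return ";".join(f"{key}={value}" for key, value in variables.items())


print(
    build_env_string(
        {
            "CONFIG": "configs/ajmc/de/hmbyt5.json",
            "HUB_ORG_NAME": "stefan-it",
        }
    )
)
```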

All configurations for fine-tuning are located in the ./configs folder with the following naming convention: ./configs/<dataset-name>/<language>/<model-name>.json.
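
The naming convention can be expressed as a pair of small helpers. This is a sketch only: the function names are hypothetical, and the example values are taken from the AutoTrain command above (`configs/ajmc/de/hmbyt5.json`).

```python
from pathlib import Path


def config_path(dataset: str, language: str, model: str, root: str = "configs") -> Path:
    """Build <root>/<dataset-name>/<language>/<model-name>.json."""
    return Path(root) / dataset / language / f"{model}.json"


def parse_config_path(path: str) -> dict:
    """Split a config path back into dataset, language and model name."""
    dataset, language, filename = Path(path).parts[-3:]
    return {"dataset": dataset, "language": language, "model": Path(filename).stem}


print(config_path("ajmc", "de", "hmbyt5").as_posix())
print(parse_config_path("./configs/ajmc/de/hmbyt5.json"))
```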

Changelog

  • 17.10.2023: Over 1,200 models from the hyper-parameter search are now available on the Model Hub.
  • 05.10.2023: Initial version of this repository.

Acknowledgements

We thank Luisa März, Katharina Schmid and Erion Çano for their fruitful discussions about Historical Language Models.

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️
