BSNLP-SlavNER-shared-task-2021-with-Spacy-v3

This repository contains the code for processing the data and training a NER model for Russian for the BSNLP SlavNER 2021 shared task using spaCy v3.

Install dependencies

  • pip install -r requirements.txt

External resources / Prerequisites

  • See this article if you have trouble using a GPU to fine-tune BERT
  • Add pre-trained vectors
  • Add pretraining corpus

Instructions

  1. Clone this repository
  2. Download the data from BSNLP Shared Task page and put it into data/bsnlp2021_train_r1
  3. Run python save_data.py $folder1 $folder2 $folder3 $split_train $split_dev $folder4 $train_file $dev_file $test_file, where folder1, folder2, and folder3 are the names of the folders used for the training and development sets; split_train and split_dev are the percentages of the data assigned to the training and development sets; folder4 is the name of the folder used for the test set; and train_file, dev_file, and test_file are the file names of the resulting spaCy binary datasets. This produces the training data in spaCy v3 format.
  4. (Optional) Run save_pretraining.py $folder_names, where folder_names are the names of the folders you want to use for the pretraining corpus. See here for how pretraining may help you obtain better results.
  5. Run python -m spacy train config_ner_ruVec_pretrain.cfg --output ./tok2vec_output to train the model. Before training, specify the paths to the training and development sets in the config file. You can also use a pretraining corpus and choose different pretrained vectors, which may improve results. By default, the vectors come from the ru_core_news_lg Russian model. More information on the training command and config editing is available here.
  6. Run python -m spacy train config_spacy_trans.cfg --output ./multilingual_output to train the transformer-based model. Before training, specify the paths to the training and development sets in the config file. You can also change the hyperparameters there and choose a different model from https://huggingface.co/models. By default, the model is "bert-base-multilingual-uncased", which is the one that was used for training.
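
To make the split_train and split_dev arguments from step 3 more concrete, here is a minimal sketch of a percentage-based split. It is not the code in save_data.py: the function name split_documents, the seed parameter, and the reading of the percentages as values between 0 and 100 are all assumptions made for illustration. The test set is not part of this split, since step 3 takes it from a separate folder.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_documents(
    docs: Sequence[T], split_train: float, split_dev: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle docs and return (train, dev) subsets sized by percentage.

    Hypothetical helper: percentages are assumed to be on a 0-100 scale.
    """
    if split_train < 0 or split_dev < 0 or split_train + split_dev > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    shuffled = list(docs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * split_train / 100)
    n_dev = int(len(shuffled) * split_dev / 100)
    # The dev slice starts where the train slice ends, so the two never overlap.
    return shuffled[:n_train], shuffled[n_train:n_train + n_dev]


if __name__ == "__main__":
    # Placeholder file names, not files from the shared task data.
    files = [f"doc_{i}.txt" for i in range(10)]
    train, dev = split_documents(files, split_train=80, split_dev=20)
    print(len(train), len(dev))
```

Splitting at the document level rather than the sentence level keeps all sentences of one document in the same set, which avoids leaking near-identical context from training into development data.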
