PSHAT

(Pronounced "P'Shot") Part of Speech Handling for Aramaic Talmud

This is the official repo for Noah's Master's thesis.

This project aims to fill the gaping hole in ancient Aramaic POS tagging. Astonishingly, research in this field is scant. My work begins to show that modern machine learning techniques can learn syntactic patterns in Talmud, despite two major issues:

  1. Talmud has no punctuation. Because of this, it can be very difficult to break up sentences and ideas, even if one is familiar with the Aramaic and the structure of the text.

  2. Talmud is actually a mix of two languages, Mishnaic Hebrew and Talmudic Aramaic. While in some places the distinction between these languages is clearly marked, the majority of Talmud is a mixture of the two.

Despite these issues, LSTMs were able to achieve above 90% POS tagging accuracy on a validation set.

I gratefully thank CAL and especially Steve Kaufman for working with me on this project. The use of his dataset was crucial and his help working with the dataset was just as important.

Thesis PDF

The thesis is located here

Requirements

  1. This project uses the Sefaria library. Certain scripts require you to have Sefaria set up on your computer. Follow the instructions on their repo to set it up.

  2. You need to install DyNet to run the LSTMs.
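
A quick way to confirm both requirements are importable before running any script is sketched below. The import names `sefaria` and `dynet` are assumptions about how the two libraries are exposed in Python, and `missing_modules` is a hypothetical helper, not part of this repo.

```python
import importlib.util

# Assumed import names: "sefaria" for the Sefaria library, "dynet" for DyNet.
REQUIRED_MODULES = ["sefaria", "dynet"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(REQUIRED_MODULES)` should return an empty list once both libraries are set up. Any name it returns still needs to be installed.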

Pipeline

  1. DatasetMatcher.py: takes input from data/1_cal_input and outputs to data/2_matched_sefaria.
  2. make_lang_training.py: generates the language training dataset from the Sefaria library and CAL files. Aramaic training data comes from data/1_cal_input/caldbfull.txt and Mishnaic training data comes from Sefaria's Mishnah. Outputs the training data as a JSON file to data/3_lang_tagged/model/lstm_training.json.
  3. LangTagger.py: takes input from data/3_lang_tagged/model/lstm_training.json and trains an LSTM to differentiate between Hebrew and Aramaic (only on individual words). Outputs to data/3_lang_tagged.
  4. dilate_lang.py: dilates the language-tagged output. Outputs to 4_lang_tagged_dilated.
  5. POSTagger2MLP-beam.py: takes input from 4_lang_tagged_dilated and 2_sefaria_matched and outputs to 5_pos_tagged. Trains an LSTM to learn the POS tags of Aramaic words in Talmud.
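
The order of the steps above can be written down as a small driver. This is only a sketch: it assumes each script is run from the repo root as `python <script>` with no arguments, which the list does not state. `Stage`, `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical names. Paths are copied from the list as written.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    script: str              # script name as given in the pipeline list
    inputs: Tuple[str, ...]  # input paths as given in the list (may be empty)
    output: str              # output path as given in the list


PIPELINE: List[Stage] = [
    Stage("DatasetMatcher.py", ("data/1_cal_input",), "data/2_matched_sefaria"),
    Stage("make_lang_training.py", ("data/1_cal_input/caldbfull.txt",),
          "data/3_lang_tagged/model/lstm_training.json"),
    Stage("LangTagger.py", ("data/3_lang_tagged/model/lstm_training.json",),
          "data/3_lang_tagged"),
    # The list does not state the input of this step, so it is left empty.
    Stage("dilate_lang.py", (), "4_lang_tagged_dilated"),
    Stage("POSTagger2MLP-beam.py", ("4_lang_tagged_dilated", "2_sefaria_matched"),
          "5_pos_tagged"),
]


def build_commands(stages: List[Stage], python: str = "python") -> List[List[str]]:
    """Build one command per stage. Assumes the scripts take no arguments."""
    return [[python, stage.script] for stage in stages]


def run_pipeline(stages: List[Stage], dry_run: bool = True) -> None:
    """Print each command in order; execute it only when dry_run is False."""
    for cmd in build_commands(stages):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(cmd, check=True)
```

`run_pipeline(PIPELINE)` only prints the commands in order. Passing `dry_run=False` would execute them, stopping at the first script that exits with an error.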

Contributors

nsantacruz, dimidd
