wp3-information-extraction-system-v2

Step 1 - Convert Books

The script in the books-converter folder converts plain texts to the format used for the multitask classification. Run the script books_converter.py on the folder containing the documents from which you want to extract frame elements; it converts them into a format readable by the classifier.

IMPORTANT: The converter script filters the books, keeping only the portions of text (parameter --window) around the seed words. The SeedLists folder contains the seed lists from the beginning of the Odeuropa project. Before running the script, revise the seed list for your language so that you neither miss relevant text nor introduce excessive noise into the data. You can use the English or French files as examples of the format required by the script. The script doesn't lemmatize, so you need to add all the inflected forms of the seeds to the list.

--folder: The input folder containing the books/documents (plain .txt files, no metadata or tags)

--output: The output folder for the converted documents

--seeds: The file containing the seeds list. E.g. 'SeedLists/seed-en-pos.txt'

--books: The number of books to merge into a single file. Setting the value to 1 creates a file for each book. Default value is 100.

--window: The number of sentences to keep around each smell word. 3 means 3 before and 3 after. Default value is 3.

--label: A short label used to assign an ID to the documents (so that later they can be matched with the metadata)

The script creates a -mapping file outside the output folder that maps each document ID to its original book.

Usage example:

python3 books_converter-filter.py --folder books_folder --output output_folder --seeds SeedLists/seed-en-pos.txt --label abc --books 1000
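The window filtering described above can be illustrated with a minimal sketch. This is not the converter's actual code: the function names split_sentences and filter_by_seeds, the naive sentence splitter, and the sample text are all assumptions made for illustration. It shows why inflected forms must be listed explicitly: seeds are matched as exact word forms, so "smells" in the list does not match "smell" in the text.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_by_seeds(sentences, seeds, window=3):
    """Keep every sentence containing a seed, plus `window` sentences before and after it."""
    seed_set = {s.lower() for s in seeds}
    keep = set()
    for i, sentence in enumerate(sentences):
        tokens = re.findall(r"\w+", sentence.lower())
        # Exact form matching only: no lemmatization is attempted.
        if any(tok in seed_set for tok in tokens):
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            keep.update(range(start, end))
    return [sentences[i] for i in sorted(keep)]


# Invented sample text, for illustration only.
text = (
    "The market opened early. A strong smell of fish filled the air. "
    "People walked by. Nobody spoke. The bells rang. It rained all day."
)
sentences = split_sentences(text)

kept = filter_by_seeds(sentences, ["smell", "smells", "smelled"], window=1)
missed = filter_by_seeds(sentences, ["smells"], window=1)

print(kept)    # the seed sentence with one neighbour on each side
print(missed)  # empty: "smells" does not match the form "smell"
```

With a larger window more context is kept around each match, at the cost of more unrelated text reaching the classifier.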

Step 2 - Smells Prediction

The folder run-predictions contains the classifier (predict.py) to extract the smell sources from the books converted in the previous step.

Before running the script, download the model for your language from https://zenodo.org/records/10598306 and move it to the run-predictions/models folder.

The code has been tested with Python 3.8. To install the required packages, run the following in the run-predictions folder:

pip install -r requirements.txt

The script takes three positional arguments, in this order: the model, the file to predict, and the output file (which will contain the predictions).

Optional: --device selects the device to run on. Use 0 for CUDA-based GPUs, 1 for MPS (Apple M1/M2 chips) or -1 for CPU.
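As a rough illustration of the --device values, the sketch below maps each one to the corresponding PyTorch device string. The function name resolve_device is hypothetical, and how predict.py resolves the flag internally is not shown in this README, so treat the mapping as an assumption based on the description above.

```python
def resolve_device(flag):
    """Map a --device value to a PyTorch-style device string (assumed mapping)."""
    mapping = {
        0: "cuda:0",  # CUDA-based GPU
        1: "mps",     # Apple M1/M2 chips
        -1: "cpu",    # no GPU
    }
    if flag not in mapping:
        raise ValueError(f"unsupported --device value: {flag}")
    return mapping[flag]


for value in (0, 1, -1):
    print(value, "->", resolve_device(value))
```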

The models for each language are in the folder models.

The folder test-files contains a sample file for each language to test if the classifier works.

Usage examples:

python3 predict.py models/en.pt test-files/test-en.tsv predictions/predictions-test-en.tsv --device 0

The file predictions/sample-predictions-test-en.tsv shows the correct output to check your system output against.
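One way to perform this check is a line-by-line comparison of the two files. The sketch below is only an illustration: the function name diff_lines is hypothetical, and the demo writes two tiny invented files to a temporary folder rather than reading real predictions. It assumes an exact textual match is the right criterion; if the output contains floating-point scores, small differences between machines may be harmless.

```python
import tempfile
from pathlib import Path


def diff_lines(path_a, path_b):
    """Return (line_number, line_a, line_b) for every line that differs."""
    lines_a = Path(path_a).read_text(encoding="utf-8").splitlines()
    lines_b = Path(path_b).read_text(encoding="utf-8").splitlines()
    diffs = []
    for i in range(max(len(lines_a), len(lines_b))):
        a = lines_a[i] if i < len(lines_a) else None  # None marks a missing line
        b = lines_b[i] if i < len(lines_b) else None
        if a != b:
            diffs.append((i + 1, a, b))
    return diffs


# Demo on invented toy content (not the real prediction format).
with tempfile.TemporaryDirectory() as tmp:
    mine = Path(tmp) / "mine.tsv"
    reference = Path(tmp) / "reference.tsv"
    mine.write_text("a\tX\nb\tY\n", encoding="utf-8")
    reference.write_text("a\tX\nb\tZ\nc\tW\n", encoding="utf-8")
    demo_diffs = diff_lines(mine, reference)

for number, ours, theirs in demo_diffs:
    print(f"line {number}: {ours!r} != {theirs!r}")
```

To check a real run, call diff_lines("predictions/predictions-test-en.tsv", "predictions/sample-predictions-test-en.tsv") from the run-predictions folder; an empty list means the files are identical.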

Disclaimer: The multitask classifier uses the MaChAmp framework. The code in run-predictions contains selected parts of the framework, modified for the purposes of the Odeuropa project. If you need to use the framework itself, get the official version at https://github.com/machamp-nlp/machamp

Step 3 - Frames Extraction

The extract-annotations.py script in the frames-extraction folder extracts the predictions from the output of the previous step and produces a .tsv file with all the frames and sentences, which can then be uploaded to the knowledge graph.

--folder: the folder with the predictions from the classifier

--output: the output .tsv file

The code has been tested with Python 3.8. To install the required packages, run the following in the frames-extraction folder:

pip install -r requirements.txt

Usage example:

python3 extract-annotations.py --folder ../run-predictions/predictions/ --output test-frames.tsv
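The three steps can be chained from a small driver script. The sketch below only builds and prints the command lines, using the script names and flags from the usage examples above; the helper names (convert_cmd, predict_cmd, extract_cmd) and the file name converted/abc.tsv are hypothetical, since the names of the files produced by the converter are not documented here.

```python
import shlex


def convert_cmd(folder, output, seeds, label, books=100, window=3):
    """Step 1: command line for the book converter."""
    return [
        "python3", "books_converter-filter.py",
        "--folder", folder, "--output", output,
        "--seeds", seeds, "--label", label,
        "--books", str(books), "--window", str(window),
    ]


def predict_cmd(model, input_file, output_file, device=-1):
    """Step 2: positional arguments are model, input file, output file."""
    return ["python3", "predict.py", model, input_file, output_file,
            "--device", str(device)]


def extract_cmd(predictions_folder, output_tsv):
    """Step 3: command line for the frame extraction."""
    return ["python3", "extract-annotations.py",
            "--folder", predictions_folder, "--output", output_tsv]


commands = [
    convert_cmd("books_folder", "converted", "SeedLists/seed-en-pos.txt", "abc"),
    # "converted/abc.tsv" is a placeholder for a file produced by step 1.
    predict_cmd("models/en.pt", "converted/abc.tsv", "predictions/abc.tsv"),
    extract_cmd("../run-predictions/predictions/", "frames.tsv"),
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each command is meant to be run from the folder of its own script, for example with subprocess.run(cmd, cwd=..., check=True), so the relative paths need adjusting to your layout.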

Publication

If you use this resource, please cite:

Menini, Stefano. Semantic Frame Extraction in Multilingual Olfactory Events. In Proceedings of LREC-Coling 2024

Funding acknowledgement

This work has been realised in the context of Odeuropa, a research project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101004469.
