
nicolay-r / arekit-ss

A low-resource sampler of contexts with relations, for fact-checking and for fine-tuning your LLMs, powered by AREkit

Home Page: https://github.com/nicolay-r/AREkit/wiki/Binded-Sources

License: MIT License

Python 66.59% Jupyter Notebook 33.41%
googletrans googletranslate python ml nlp relations-extraction dataset datasets datasets-preparation factchecking

arekit-ss's People

Contributors: nicolay-r

arekit-ss's Issues

No frames in output

This is related to #465 task.
All the writers fetch columns from get_columns_list_with_types, which is currently backed by SampleColumnsProvider, the default provider. The latter does not include frames.

For the following scenario:

python3 -m arekit_ss.sample --writer csv --source ruattitudes --sampler nn --src_lang ru --dest_lang ru --docs_limit 10 --text_parser nn --output_dir ./_out/a/ --no-vec

We get an output without annotated frames:

sample-train-0.csv
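The gap can be illustrated with a minimal sketch of the writer/provider interaction described above (the class and column names here are hypothetical stand-ins, not the actual AREkit API):

```python
class SampleColumnsProvider:
    """Default provider: note the absence of a 'frames' column."""
    def get_columns_list_with_types(self):
        return [("id", str), ("text", str), ("s_ind", int), ("t_ind", int)]


class SampleWithFramesColumnsProvider(SampleColumnsProvider):
    """A provider that additionally exposes frame annotations."""
    def get_columns_list_with_types(self):
        return super().get_columns_list_with_types() + [("frames", str)]


def writer_columns(provider):
    # Writers take their column layout from whichever provider they are given,
    # so the fix is to hand them a frames-aware provider.
    return [name for name, _ in provider.get_columns_list_with_types()]
```

With the default provider the `frames` column never reaches the writer, which matches the observed output.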

Label scaler implementation could be simplified [NEREL registration backlog]

class NerelAnyLabelScaler(BaseLabelScaler):

    def __init__(self):
        self.__uint_to_label_dict = OrderedDict([
            (labels.OpinionBelongsTo(), 0),
            (labels.OpinionRelatesTo(), 1),
            (labels.NegEffectFrom(), 2),
            (labels.PosEffectFrom(), 3),
            (labels.NegStateFrom(), 4),
            (labels.PosStateFrom(), 5),
            (labels.NegativeTo(), 6),
            (labels.PositiveTo(), 7),
            (labels.StateBelongsTo(), 8),
            (labels.PosAuthorFrom(), 9),
            (labels.NegAuthorFrom(), 10),
            (labels.AlternativeName(), 11),
            (labels.OriginsFrom(), 12),
        ])
        super(NerelAnyLabelScaler, self).__init__(
            uint_dict=self.__uint_to_label_dict,
            int_dict=self.__uint_to_label_dict)
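The simplification could rely on enumerate, so the integer codes never have to be written by hand. A standalone sketch (the label classes are reduced to empty stand-ins and BaseLabelScaler is omitted, since both live in AREkit):

```python
from collections import OrderedDict

# Empty stand-ins for the classes of the labels module.
class OpinionBelongsTo: pass
class OpinionRelatesTo: pass
class NegEffectFrom: pass
class PosEffectFrom: pass
class NegStateFrom: pass
class PosStateFrom: pass
class NegativeTo: pass
class PositiveTo: pass
class StateBelongsTo: pass
class PosAuthorFrom: pass
class NegAuthorFrom: pass
class AlternativeName: pass
class OriginsFrom: pass

# Ordered list of the supported label types; the uint code of each label
# follows from its position, so no manual numbering is needed.
SUPPORTED_LABELS = [
    OpinionBelongsTo, OpinionRelatesTo, NegEffectFrom, PosEffectFrom,
    NegStateFrom, PosStateFrom, NegativeTo, PositiveTo, StateBelongsTo,
    PosAuthorFrom, NegAuthorFrom, AlternativeName, OriginsFrom,
]

def make_uint_to_label_dict():
    return OrderedDict((label_type(), i)
                       for i, label_type in enumerate(SUPPORTED_LABELS))
```

Registering a new source then only requires appending to SUPPORTED_LABELS.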

Sampling modes

  • subject -> object, as in the classic attitude extraction task
  • target, as in targeted sentiment analysis
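The two modes can be contrasted with a toy formatter (a sketch, not the actual sampler code; the rendering format is illustrative):

```python
def format_sample(text, subj, obj, mode):
    """Render a context either as a (subject, object) pair or as a single target."""
    if mode == "pair":
        # subject -> object: classic attitude extraction between two entities.
        return f"{subj} -> {obj}: {text}"
    elif mode == "target":
        # target: targeted sentiment analysis towards a single entity.
        return f"[{subj}] {text}"
    raise ValueError(f"unknown mode: {mode}")
```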

What's new in the 0.23.1 release

The main goal of this release: #34
At this stage we have text parser presets that can be split into smaller parts and reused in the future.

  • 🔧 remove the update_arekit.sh script
  • 📓 provide logo.png for the project
  • 📓 rename the reference section to Powered by AREkit
  • mention prompting techniques not in the reference section, but in a prompt-related comment
  • #37
  • 🔧 #43
  • 📓 explain how to read the project name ("arekit double s")
  • 📓 mention the image of the results in README.md (Google Colab 🪄 results)
  • move the examples from the notebook into the test folder
  • clarify in the filename that nn is only for sentiment:
    def create_nn_rows_provider(labels_scaler):
  • quick fix: indentation of
    LemmasBasedFrameVariantsParser(

Post updates

Documentation -- Use case for checking resources with an LLM

Treat arekit-ss as a framework for polishing datasets.

The application flow is as follows:

  1. Sampling with prompting.
  2. Application of an LLM.
  3. Gathering the results and manual analysis.
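The three-step loop above can be sketched compactly (the llm callable is a placeholder to be replaced with a real model client; the column names are assumptions):

```python
import csv
import io

def check_samples(rows, prompt_template, llm):
    """1) fill the prompt per sampled row, 2) apply the LLM,
    3) gather prompt/answer pairs for manual analysis."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["prompt", "answer"])
    for row in rows:
        prompt = prompt_template.format(**row)  # 1. sampling with prompting
        answer = llm(prompt)                    # 2. application of the LLM
        writer.writerow([prompt, answer])       # 3. gathering the results
    return out.getvalue()
```

For example, `check_samples(rows, template, lambda p: "yes")` produces a CSV table ready for manual inspection.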

Use NEREL for experiments

We use the following instructions:

python3 -m arekit_ss.sample --writer csv --source nerel --sampler prompt \
--prompt "For text: '{text}', is the relation of type {label_val} from '{s_val}' towards '{t_val}'? Answer yes or no, and explain why if no." \
--src_lang ru --dest_lang en --text_parser lm --output_dir ./_out/nerel-prompting-fact-checking/ --splits train:test
python3 -m arekit_ss.sample --writer csv --source nerel-bio --sampler prompt \
--prompt "For the text part of the PubMed abstract: '{text}', is the relation of type {label_val} from '{s_val}' towards '{t_val}'? Answer yes or no, and explain why if no." \
--dest_lang en --text_parser lm --output_dir ./_out/nerel-bio-prompting-fact-checking/ --splits train:test
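The {text}, {label_val}, {s_val} and {t_val} placeholders in the --prompt string are filled per sampled context, roughly as follows (a sketch of the templating step only):

```python
PROMPT = ("For text: '{text}', is the relation of type {label_val} "
          "from '{s_val}' towards '{t_val}'? Answer yes or no, and explain why if no.")

def render_prompt(text, label_val, s_val, t_val):
    # str.format resolves each placeholder from the sampled context.
    return PROMPT.format(text=text, label_val=label_val, s_val=s_val, t_val=t_val)
```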

User experience and feedback

Structurization

#38 related fix

  • Collect every source in a separate folder (future movements and ParlAI experience)
  • Make the source list a single JSON with the rest of the setups within it

Frames cannot be parsed in other languages [known limitation]

The implementation of the text-processing pipeline for nn, which supports frame annotation, is as follows:

TextAndEntitiesGoogleTranslator(src=cfg.src_lang, dest=cfg.dest_lang) if cfg.dest_lang != cfg.src_lang else None,
LemmasBasedFrameVariantsParser(frame_variants=frame_variant_collection, stemmer=stemmer)])

For now, we deliberately avoid adding this support, because the GoogleTranslate parser item was designed only for entities, as separate objects mentioned in the text:

elif isinstance(part, Entity):
    # Register first the prior parts that were merged.
    __optionally_register(parts_to_join)
    # Register entities information for further restoration.
    origin_entity_ind.append(len(content))
    origin_entities.append(part)
    content.append(part.Value)

To add the related support, it is better to first generalize the object representation in the framework, with the Value parameter as the common attribute.
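One possible shape of that generalization (names are illustrative, not the AREkit API): a common base class exposing Value, so the translator can restore any in-text object, not just entities:

```python
class TextObject:
    """Base for any object mentioned in text; Value is the common attribute."""
    def __init__(self, value):
        self.__value = value

    @property
    def Value(self):
        return self.__value


class Entity(TextObject):
    pass


class FrameVariant(TextObject):
    """Frame variants could then pass through the translator the same way."""
    pass


def collect_values(parts):
    # The translator only needs Value, regardless of the concrete type.
    return [p.Value for p in parts if isinstance(p, TextObject)]
```

The translator branch shown above would then dispatch on TextObject instead of Entity.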

add installer

Perform the following updates:

  • __init__.py
  • setup.py
  • update the README in the project
  • rewrite the installation section
  • move the sampler into the root folder

What's new in 0.24.0

The main feature is support for custom documents.

Backend Updates and Extended Schemata

  • Switch to the AREkit==0.24.0 (be14ffc)

Quality of the sampled data

According to the #52 experiments:

Dynamic prompting support

When analysing the NEREL-bio outputs, it was found that some labels and relations are better described more precisely with dedicated prompts.
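Dynamic prompting could map each label type to its own template, falling back to a generic one (a sketch; the label strings and templates here are illustrative):

```python
# Per-label templates; anything not listed falls back to DEFAULT_PROMPT.
LABEL_PROMPTS = {
    "ALTERNATIVE_NAME": "In '{text}', is '{s_val}' another name for '{t_val}'?",
    "ORIGINS_FROM": "In '{text}', does '{s_val}' originate from '{t_val}'?",
}

DEFAULT_PROMPT = ("For text: '{text}', is the relation of type {label_val} "
                  "from '{s_val}' towards '{t_val}'?")

def pick_prompt(label_val):
    # Choose the more precise template when one exists for this label.
    return LABEL_PROMPTS.get(label_val, DEFAULT_PROMPT)
```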

Simplify new sources registration

There was some feedback on the inconvenient labels API, and that the whole project might crash because of an incompletely registered source.

Other / Minor updates

  • #68
  • #69
  • #42
  • doc_ids argument support, which allows selecting the specific doc_ids for processing (90488c2)
  • [output_dir] customization removed (234ae21)
  • #72
  • dest_lang is now an optional parameter (by default it equals src_lang) (234ae21)
  • #74
  • ๐Ÿ”ง #75
  • #76

Fold type selection

#61 related.

Reason: some parts are designed for particular cases of application in ML. For example:

  • training data is expected to have labels;
  • test data comes without labels.

The goal is to provide the specific part of the dataset that is supposed to be taken for sampling; by default we may consider no folding, which means keeping all the documents.
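The default no-folding behaviour could look like this (a sketch; the fold/split parameter names are assumptions):

```python
def select_docs(doc_ids_by_split, fold=None):
    """Return the doc ids to sample.

    fold=None means no folding: keep all the documents;
    otherwise keep only the requested split (e.g. 'train' or 'test').
    """
    if fold is None:
        return sorted(i for ids in doc_ids_by_split.values() for i in ids)
    return sorted(doc_ids_by_split[fold])
```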

External source for the synonyms collection [0.24.0]

In 0.23.1 this is not applicable, because we have a fixed set of sources, where grouping is based on the predefined entity parsers.

We can follow the way it works in ARElight:
https://github.com/nicolay-r/ARElight/blob/c3d388cc7bcc5cea5be1ed1f0c19419c6157d309/examples/serialize_bert.py#L68-L69

Then it is followed by the grouping method for entities:
https://github.com/nicolay-r/ARElight/blob/c3d388cc7bcc5cea5be1ed1f0c19419c6157d309/examples/serialize_bert.py#L76-L82
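Following the ARElight approach, the external synonyms collection could be a plain text file with one comma-separated group of synonymous entity values per line, and the entity grouping function resolves a value to its group index (a sketch; the file format is an assumption):

```python
def read_synonyms(lines):
    """Map each entity value to its group index; one synonym group per line."""
    value_to_group = {}
    for group_ind, line in enumerate(lines):
        for value in line.split(","):
            value_to_group[value.strip().lower()] = group_ind
    return value_to_group

def create_entity_group_func(value_to_group):
    # Grouping method for entities: unknown values get their own marker (-1).
    def group(value):
        return value_to_group.get(value.lower(), -1)
    return group
```

This keeps grouping decoupled from the fixed per-source entity parsers of 0.23.1.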
