
nicolay-r / arekit-ss

A low-resource sampler of contexts with relations, for fact-checking and for fine-tuning your LLMs, powered by AREkit

Home Page: https://github.com/nicolay-r/AREkit/wiki/Binded-Sources

License: MIT License

Python 66.59% Jupyter Notebook 33.41%
googletrans googletranslate python ml nlp relations-extraction dataset datasets datasets-preparation factchecking

arekit-ss's People

Contributors: nicolay-r

arekit-ss's Issues

No frames in output

This is related to #465 task.
All the writers fetch columns from get_columns_list_with_types, which is currently backed by SampleColumnsProvider, the default provider. The latter does not include frames.

For the following scenario:

python3 -m arekit_ss.sample --writer csv --source ruattitudes --sampler nn --src_lang ru --dest_lang ru --docs_limit 10 --text_parser nn --output_dir ./_out/a/ --no-vec

We get an output without annotated frames:

sample-train-0.csv
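The gap can be illustrated with a minimal sketch of the writer/provider interaction described above (the class and column names here are hypothetical stand-ins, not the actual AREkit API):

```python
class SampleColumnsProvider:
    """Default provider: note the absence of a 'frames' column."""
    def get_columns_list_with_types(self):
        return [("id", str), ("text", str), ("s_ind", int), ("t_ind", int)]


class SampleWithFramesColumnsProvider(SampleColumnsProvider):
    """A provider that additionally exposes frame annotations."""
    def get_columns_list_with_types(self):
        return super().get_columns_list_with_types() + [("frames", str)]


def writer_columns(provider):
    # Writers take their column layout from whichever provider they are given,
    # so the fix is to hand them a frames-aware provider.
    return [name for name, _ in provider.get_columns_list_with_types()]
```

With the default provider the `frames` column never reaches the writer, which matches the observed output.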

Label scaler implementation could be simplified [NEREL registration backlog]

class NerelAnyLabelScaler(BaseLabelScaler):

    def __init__(self):
        self.__uint_to_label_dict = OrderedDict([
            (labels.OpinionBelongsTo(), 0),
            (labels.OpinionRelatesTo(), 1),
            (labels.NegEffectFrom(), 2),
            (labels.PosEffectFrom(), 3),
            (labels.NegStateFrom(), 4),
            (labels.PosStateFrom(), 5),
            (labels.NegativeTo(), 6),
            (labels.PositiveTo(), 7),
            (labels.StateBelongsTo(), 8),
            (labels.PosAuthorFrom(), 9),
            (labels.NegAuthorFrom(), 10),
            (labels.AlternativeName(), 11),
            (labels.OriginsFrom(), 12),
        ])
        super(NerelAnyLabelScaler, self).__init__(
            uint_dict=self.__uint_to_label_dict,
            int_dict=self.__uint_to_label_dict)
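The simplification could rely on enumerate, so the integer codes never have to be written by hand. A standalone sketch (the label classes are reduced to empty stand-ins and BaseLabelScaler is omitted, since both live in AREkit):

```python
from collections import OrderedDict

# Empty stand-ins for the classes of the labels module.
class OpinionBelongsTo: pass
class OpinionRelatesTo: pass
class NegEffectFrom: pass
class PosEffectFrom: pass
class NegStateFrom: pass
class PosStateFrom: pass
class NegativeTo: pass
class PositiveTo: pass
class StateBelongsTo: pass
class PosAuthorFrom: pass
class NegAuthorFrom: pass
class AlternativeName: pass
class OriginsFrom: pass

# Ordered list of the supported label types; the uint code of each label
# follows from its position, so no manual numbering is needed.
SUPPORTED_LABELS = [
    OpinionBelongsTo, OpinionRelatesTo, NegEffectFrom, PosEffectFrom,
    NegStateFrom, PosStateFrom, NegativeTo, PositiveTo, StateBelongsTo,
    PosAuthorFrom, NegAuthorFrom, AlternativeName, OriginsFrom,
]

def make_uint_to_label_dict():
    return OrderedDict((label_type(), i)
                       for i, label_type in enumerate(SUPPORTED_LABELS))
```

Registering a new source then only requires appending to SUPPORTED_LABELS.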

Sampling modes

  • subject -> object, as in the classic attitude extraction task
  • target, as in targeted sentiment analysis
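The two modes can be contrasted with a toy formatter (a sketch, not the actual sampler code; the rendering format is illustrative):

```python
def format_sample(text, subj, obj, mode):
    """Render a context either as a (subject, object) pair or as a single target."""
    if mode == "pair":
        # subject -> object: classic attitude extraction between two entities.
        return f"{subj} -> {obj}: {text}"
    elif mode == "target":
        # target: targeted sentiment analysis towards a single entity.
        return f"[{subj}] {text}"
    raise ValueError(f"unknown mode: {mode}")
```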

What's new in the 0.23.1 release

The main goal of this release: #34
At this stage we have text parser presets that can be split into smaller parts and reused in the future.

  • 🔧 remove the update_arekit.sh script
  • 📓 provide logo.png for the project
  • 📓 rename the reference section to Powered by AREkit
  • mention prompting techniques not in the reference section, but in a prompt-related comment
  • #37
  • 🔧 #43
  • 📓 explain how to read the project name ("arekit double s")
  • 📓 mention the image of the results in README.md (Google Colab 🪄 results)
  • move the examples from the notebook into the test folder
  • clarify in the filename that nn is only for sentiment:
    def create_nn_rows_provider(labels_scaler):
  • quick fix: indentation of
    LemmasBasedFrameVariantsParser(

Post updates

Documentation -- Use case for checking resources with an LLM

Treat arekit-ss as a framework for polishing datasets.

The application flow is as follows:

  1. Sampling with prompting.
  2. Application of an LLM.
  3. Gathering the results and manual analysis.
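The three-step loop above can be sketched compactly (the llm callable is a placeholder to be replaced with a real model client; the column names are assumptions):

```python
import csv
import io

def check_samples(rows, prompt_template, llm):
    """1) fill the prompt per sampled row, 2) apply the LLM,
    3) gather prompt/answer pairs for manual analysis."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["prompt", "answer"])
    for row in rows:
        prompt = prompt_template.format(**row)  # 1. sampling with prompting
        answer = llm(prompt)                    # 2. application of the LLM
        writer.writerow([prompt, answer])       # 3. gathering the results
    return out.getvalue()
```

For example, `check_samples(rows, template, lambda p: "yes")` produces a CSV table ready for manual inspection.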

Use NEREL for experiments

We use the following instructions:

python3 -m arekit_ss.sample --writer csv --source nerel --sampler prompt \
--prompt "For text: '{text}', is the relation of type {label_val} from '{s_val}' towards '{t_val}'? Answer yes or no, and explain why if no." \
--src_lang ru --dest_lang en --text_parser lm --output_dir ./_out/nerel-prompting-fact-checking/ --splits train:test
python3 -m arekit_ss.sample --writer csv --source nerel-bio --sampler prompt \
--prompt "For the text part of the PubMed abstract: '{text}', is the relation of type {label_val} from '{s_val}' towards '{t_val}'? Answer yes or no, and explain why if no." \
--dest_lang en --text_parser lm --output_dir ./_out/nerel-bio-prompting-fact-checking/ --splits train:test
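The {text}, {label_val}, {s_val} and {t_val} placeholders in the --prompt string are filled per sampled context, roughly as follows (a sketch of the templating step only):

```python
PROMPT = ("For text: '{text}', is the relation of type {label_val} "
          "from '{s_val}' towards '{t_val}'? Answer yes or no, and explain why if no.")

def render_prompt(text, label_val, s_val, t_val):
    # str.format resolves each placeholder from the sampled context.
    return PROMPT.format(text=text, label_val=label_val, s_val=s_val, t_val=t_val)
```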

User experience and feedback

Structurization

#38 related fix

  • Collect every source in a separate folder (future movements and ParlAI experience)
  • Make the source list a single JSON with the rest of the setups within it

Frames cannot be parsed in other languages [known limitation]

The implementation of the text-processing pipeline for nn, which supports frame annotation, is as follows:

TextAndEntitiesGoogleTranslator(src=cfg.src_lang, dest=cfg.dest_lang) if cfg.dest_lang != cfg.src_lang else None,
LemmasBasedFrameVariantsParser(frame_variants=frame_variant_collection, stemmer=stemmer)])

For now, we deliberately avoid adding this support, because the GoogleTranslate parser item was designed only for entities, as separate objects mentioned in the text:

elif isinstance(part, Entity):
    # Register first the prior parts that were merged.
    __optionally_register(parts_to_join)
    # Register entities information for further restoration.
    origin_entity_ind.append(len(content))
    origin_entities.append(part)
    content.append(part.Value)

To add the related support, it is better to first generalize the object representation in the framework, with the Value parameter as the common attribute.
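One possible shape of that generalization (names are illustrative, not the AREkit API): a common base class exposing Value, so the translator can restore any in-text object, not just entities:

```python
class TextObject:
    """Base for any object mentioned in text; Value is the common attribute."""
    def __init__(self, value):
        self.__value = value

    @property
    def Value(self):
        return self.__value


class Entity(TextObject):
    pass


class FrameVariant(TextObject):
    """Frame variants could then pass through the translator the same way."""
    pass


def collect_values(parts):
    # The translator only needs Value, regardless of the concrete type.
    return [p.Value for p in parts if isinstance(p, TextObject)]
```

The translator branch shown above would then dispatch on TextObject instead of Entity.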

add installer

Perform the following updates:

  • __init__.py
  • setup.py
  • update the README in the project
  • rewrite the installation section
  • move the sampler into the root folder

What's new in 0.24.0

The main feature is support for custom documents.

Backend Updates and Extended Schemata

  • Switch to the AREkit==0.24.0 (be14ffc)

Quality of the sampled data

According to the #52 experiments:

Dynamic prompting support

When analysing the NEREL-bio outputs, it was found that some labels and relations are better described more precisely with dedicated prompts.
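Dynamic prompting could map each label type to its own template, falling back to a generic one (a sketch; the label strings and templates here are illustrative):

```python
# Per-label templates; anything not listed falls back to DEFAULT_PROMPT.
LABEL_PROMPTS = {
    "ALTERNATIVE_NAME": "In '{text}', is '{s_val}' another name for '{t_val}'?",
    "ORIGINS_FROM": "In '{text}', does '{s_val}' originate from '{t_val}'?",
}

DEFAULT_PROMPT = ("For text: '{text}', is the relation of type {label_val} "
                  "from '{s_val}' towards '{t_val}'?")

def pick_prompt(label_val):
    # Choose the more precise template when one exists for this label.
    return LABEL_PROMPTS.get(label_val, DEFAULT_PROMPT)
```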

Simplify new sources registration

There was some feedback on the inconvenient labels API, and that the whole project might crash because of an incompletely registered source.

Other / Minor updates

  • #68
  • #69
  • #42
  • doc_ids argument support, which allows selecting the specific doc_ids for processing (90488c2)
  • [output_dir] customization removed (234ae21)
  • #72
  • dest_lang is now an optional parameter (by default it equals src_lang) (234ae21)
  • #74
  • ๐Ÿ”ง #75
  • #76

Fold type selection

#61 related.

Reason: some parts are designed for particular cases of application in ML. For example:

  • training data is expected to have labels;
  • test data comes without labels.

The goal is to provide the specific part of the dataset that is supposed to be taken for sampling; by default we may consider no folding, which means keeping all the documents.
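The default no-folding behaviour could look like this (a sketch; the fold/split parameter names are assumptions):

```python
def select_docs(doc_ids_by_split, fold=None):
    """Return the doc ids to sample.

    fold=None means no folding: keep all the documents;
    otherwise keep only the requested split (e.g. 'train' or 'test').
    """
    if fold is None:
        return sorted(i for ids in doc_ids_by_split.values() for i in ids)
    return sorted(doc_ids_by_split[fold])
```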

External source for the synonyms collection [0.24.0]

In 0.23.1 this is not applicable, because we have a fixed set of sources, where grouping is based on the predefined entity parsers.

We can follow the way it works in ARElight:
https://github.com/nicolay-r/ARElight/blob/c3d388cc7bcc5cea5be1ed1f0c19419c6157d309/examples/serialize_bert.py#L68-L69

Then it is followed by the grouping method for entities:
https://github.com/nicolay-r/ARElight/blob/c3d388cc7bcc5cea5be1ed1f0c19419c6157d309/examples/serialize_bert.py#L76-L82
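Following the ARElight approach, the external synonyms collection could be a plain text file with one comma-separated group of synonymous entity values per line, and the entity grouping function resolves a value to its group index (a sketch; the file format is an assumption):

```python
def read_synonyms(lines):
    """Map each entity value to its group index; one synonym group per line."""
    value_to_group = {}
    for group_ind, line in enumerate(lines):
        for value in line.split(","):
            value_to_group[value.strip().lower()] = group_ind
    return value_to_group

def create_entity_group_func(value_to_group):
    # Grouping method for entities: unknown values get their own marker (-1).
    def group(value):
        return value_to_group.get(value.lower(), -1)
    return group
```

This keeps grouping decoupled from the fixed per-source entity parsers of 0.23.1.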
