nicolay-r / arenets

TensorFlow-based framework that provides attention-based implementations of conventional neural network models (CNN, RNN-based) for Relation Extraction classification tasks, as well as an API for custom model implementation

License: MIT License

Python 84.97% Jupyter Notebook 14.97% Shell 0.07%

attention bilstm cnn-model lstm pcnn rnn tensorflow neural-network classification ml

arenets's Introduction

Hi I'm Nicolay! 👋

My personal website at github for more information about me
Combine it with track-and-field 🏃‍♂️, ⛷️ and 🌊🏄‍♂️

The most recent

02/07/2024: Our CombinedLoss-based and Role-play + Contrasting Reasoning studies on Empathy/Emotion prediction were accepted @ WASSA-2024 hosted by ACL-2024 🇹🇭🥳
21/06/2024: Our CoT THOR-ECAC and CoT-NumHG-Mistral-7B systems were presented @ SemEval-2024 🇲🇽 🥳
08/06/2024: Paper on characters' personalities extraction 📚 has been accepted for LOD-2024 @ Toscana, Italy 🇮🇹 🥳
31/05/2024: Presenting 📊 LLM application findings in SA @ DataFest-2024 online/youtube
09/05/2024: Taking part in the i3-simulations @ Luten / UK on 9-10th May 2024 for MMI-NLP 🇬🇧

07/05/2024: Joining the reviewer PC @ CIKM-2024 ✍️
06/05/2024: Joining the reviewer PC @ LOD-2024 ✍️
19/04/2024: Our findings on LLMs reasoning prospects in Sentiment Analysis pre-printed @ ArXiv 🥳
05/04/2024: Our findings on LLMs reasoning prospects in Sentiment Analysis were accepted @ LJoM 🥳
25/03/2024: Presenting our ARElight demo @ ECIR-2024 🥳
19/03/2024: Our CoT LLM systems #1 and #2 accepted @ SemEval-2024 🥳
01/03/2024: Research Fellow in Multimodal NLP (🖼️+📰) @ BU in the UK 💼
25/02/2024: Joining the reviewer PC @ BigCom2024 ✍️
22/02/2024: Giving a seminar @ Glasgow IR 🎤
13/02/2024: Joining the reviewer PC @ TextGraphs-17 as a part of ACL-2024 ✍️
23/01/2024: Joining the reviewer PC @ AINL-2024 ✍️
19/01/2024: Joining the distinguished reviewers list @ ACM TiiS 🥳✍️
17/10/2023: Joining the reviewer PC @ CHIIR-2024 ✍️
19/03/2023: Our systems #1 and #2 accepted @ SemEval-2023 🥳
02/04/2023: Joining the reviewer PC @ CIKM-2023 ✍️
11/03/2023: Giving a seminar @ Newcastle University in the UK 🎤
10/02/2023: Giving a seminar @ Oxford Wolfson College in the UK 🎤
04/12/2022: Research Fellow in NLP / IR @ Newcastle University in the UK 💼

arenets's Issues

`VocabRepositoryUtils` -- `numpy` API considers `#` by default in vocabulary on load

It is crucial to set `comments=None` instead of the default `#`!

vocab = np.loadtxt(source, dtype=str, comments=None)
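The effect can be reproduced with a minimal sketch. It assumes a whitespace-separated, two-column vocabulary (word, index); the sample words and the `load_vocab` helper are made up for illustration and are not part of the framework.

```python
import io
import numpy as np

# Hypothetical vocabulary content: one "word index" pair per line.
VOCAB_TEXT = "hello 0\n#tag 1\nworld 2\n"


def load_vocab(text, comments):
    """Read a two-column vocabulary with the given `comments` setting."""
    return np.loadtxt(io.StringIO(text), dtype=str, comments=comments)


# With the default marker, the row starting with '#' is treated as a comment
# and silently dropped from the vocabulary.
with_default = load_vocab(VOCAB_TEXT, comments="#")

# With comments=None every row is kept, including the '#tag' term.
with_none = load_vocab(VOCAB_TEXT, comments=None)

print(with_default.shape, with_none.shape)
```

Terms that merely contain `#` are affected too, since everything after the marker is cut from the line.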

PZhou and ZYang attentions -- clarify that `output` is actually a `rnn_output` which reduces the time series

Reason: these functions are only applicable for RNN networks:

AREnets/arenets/attention/architectures/self_p_zhou.py

Lines 31 to 37 in d44c099

    
    # Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)

    # Final output with tanh
    output = tf.tanh(output)

    return output, alphas

and

AREnets/arenets/attention/architectures/self_z_yang.py

Lines 82 to 87 in d44c099

    
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)

    if not return_alphas:
        return output
    else:
        return output, alphas
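The shape change behind both snippets can be checked with a small sketch. NumPy is used here as a stand-in for TensorFlow, and the name `attention_reduce` and the sizes are illustrative assumptions: the RNN output of shape (B, T, D) is weighted by attention weights of shape (B, T) and summed over the time axis.

```python
import numpy as np


def attention_reduce(rnn_output, alphas):
    """Weighted sum over the time axis: (B, T, D) and (B, T) -> (B, D)."""
    return np.sum(rnn_output * np.expand_dims(alphas, -1), axis=1)


B, T, D = 2, 5, 3                    # batch size, time steps, hidden size
rnn_output = np.ones((B, T, D))      # stand-in for (Bi-)RNN outputs
alphas = np.full((B, T), 1.0 / T)    # uniform attention weights per time step

reduced = attention_reduce(rnn_output, alphas)
print(reduced.shape)
```

The time dimension disappears from the result, which is why the functions only make sense for inputs that carry a time series.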

Predict -- Hidden states dir should be inside of the related model

#18 related

LabelsScaler -- uint dict and dict might have different sizes

Use the `modify_config` function from the `train.py` declaration.

AREnets/tutorials/predict.py

Lines 7 to 12 in 4b6fda9

    
    def modify_config(config):
        assert(isinstance(config, DefaultNetworkConfig))
        config.modify_terms_per_context(50)

    predict(input_data_dir="_data", output_dir="_out",

Make automatic converter from embedding in `model.txt` format (quickstart)

rewrite code into function
calling that function

RCNN attention -- Incorrect parameter mentioning for backward LSTM

Here, the backward (`bw`) output is obviously expected:

AREnets/arenets/context/architectures/att_self_rcnn.py

Line 31 in d44c099

output_bw_w = output_fw * tf.expand_dims(self.__att_alphas, -1)

Networks -- SampleFeatures Refactoring

NetworkSampleRowProvider needs refactoring so that the required output features can be set up manually.

Better user experience [Personal Experience backlog]

infer -- incorrect path for loading saved `tf` model

`predict` -- provide hidden states, including input related [feature request]

Reason: necessary for further visualization and analysis, including attention-related data especially.
Continues #18

`SingleInstanceNeuralNetwork` -- provide `scaled_logits` extraction [ensemble-related feature required]

CNN with self attention

This parameter might be utilized to embed self-attention on top of it

AREnets/arenets/context/architectures/cnn.py

Line 53 in d44c099

bwgc_conv = tf.reshape(bwc_conv, [self.Config.BatchSize,

P_Zhou attention
Z_Yang attention

Embedding -- considering `-1` for missed value is not a solution in general

AREnets/arenets/sample.py

Line 38 in 22ccb29

TERM_VALUE_MISSING = -1

It may result in the following exception:

  File "/home/nicolay/proj/AREnets/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[23,12] = -1 is not in [0, 302866)
	 [[node embedding_lookup (defined at /arenets/context/architectures/base/base.py:352) ]]
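The lookup fails because every index has to fall inside `[0, vocab_size)`. One possible workaround, shown as a sketch rather than as the project's fix, is to reserve a dedicated embedding row for missing terms and remap `-1` to it before the lookup. NumPy stands in for TensorFlow here; `remap_missing` and the sizes are hypothetical.

```python
import numpy as np

TERM_VALUE_MISSING = -1  # value declared in arenets/sample.py


def remap_missing(indices, missing_index):
    """Replace TERM_VALUE_MISSING with a valid row index of the embedding."""
    indices = np.asarray(indices)
    return np.where(indices == TERM_VALUE_MISSING, missing_index, indices)


vocab_size, dim = 4, 3
# Hypothetical embedding matrix with one extra zero row for missing terms.
embedding = np.vstack([np.random.rand(vocab_size, dim), np.zeros((1, dim))])

term_ids = np.array([[0, 2, -1], [3, -1, 1]])
safe_ids = remap_missing(term_ids, missing_index=vocab_size)

# All indices are now within [0, vocab_size + 1), so the lookup is valid.
vectors = embedding[safe_ids]
print(safe_ids.tolist(), vectors.shape)
```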

Config -- class weights are 3 by default, while the number of labels may vary

'sent-ind' and 'doc-id' are not required

ENTITY_VALUES, ENTITY_TYPES -- might be removed

readers -- provide `target_extension` as a base reader function.

reason -- simplify SamplesIO API

`test_samples_iter` -- test failed

Returning `None` instead of `[]` in some cases leads to incorrect behavior during sampling. This is caused by the subj and obj inds, which are expected to be declared anyway, with non-empty synonym lists.
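A defensive way to handle this on the consumer side is to normalize the value before iterating. The sketch below uses hypothetical helper names (`as_list`, `count_synonyms`) that do not exist in the framework.

```python
def as_list(value):
    """Return an empty list for None so that callers can always iterate."""
    return [] if value is None else list(value)


def count_synonyms(subj_synonyms, obj_synonyms):
    """Count synonym indices, tolerating None for either argument."""
    return len(as_list(subj_synonyms)) + len(as_list(obj_synonyms))


total = count_synonyms(None, [3, 7])
print(total)
```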

Logger -- save log in model dir [Personal experience feedback]

By default, keep all logging information in file
Keep model details in log as well

from logging.handlers import TimedRotatingFileHandler
h = TimedRotatingFileHandler('/home/user/Desktop/myLogFile.log')
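A possible shape for this, as a sketch only: the handler is attached to a logger named `arenets` (assumed from the `INFO:arenets...` prefixes in the framework's log output), and the log file is placed inside the model directory. The function name, file name and rotation settings are illustrative assumptions.

```python
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler


def setup_file_logging(model_dir, filename="train.log"):
    """Attach a rotating file handler that writes into the model directory."""
    os.makedirs(model_dir, exist_ok=True)
    log_path = os.path.join(model_dir, filename)

    # Rotate at midnight and keep one week of history.
    handler = TimedRotatingFileHandler(log_path, when="midnight", backupCount=7)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s:%(name)s:%(message)s"))

    logger = logging.getLogger("arenets")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger, log_path


# Usage: a temporary directory stands in for the real model directory.
model_dir = os.path.join(tempfile.mkdtemp(), "model")
logger, log_path = setup_file_logging(model_dir)
logger.info("classes_count=3")  # model details end up in the same file
```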

Reference Section -- provide reference onto github project in `bibtex` format

Inputs and tests -- expected to work with samples.

provide `modify_config` func

Advanced section -- provide data annotation tutorial , based on AREkit

Explicitly provide labels in examples

Since this is a crucial parameter of every particular task.

#15

Optional parameters such as pos, frames, entities

Networks -- features has complicated implementation in terms of shift operation

The implementation should be simplified in terms of the shift operation. The pipeline is as follows:

Calculate the resulting regions of terms
Perform everything else using the precalculated regions.

(https://github.com/nicolay-r/AREkit/blob/master/contrib/networks/sample.py, from_tsv_row function)

`IsPredefinedStateProvided` -- remove for ppl contexts because it confuses, since used only for skipping missed tokens at embedding

Embedding -- provide formatter for w2v text files

http://vectors.nlpl.eu/repository/
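A minimal parsing sketch, assuming the word2vec text format: a `count dim` header line followed by `word v1 ... vD` rows. The function name `read_w2v_text` is hypothetical, and how the result is then stored for the framework is left open.

```python
import io
import numpy as np


def read_w2v_text(stream):
    """Parse word2vec text format into a word list and a (N, D) matrix."""
    _count, dim = map(int, stream.readline().split())
    words, vectors = [], []
    for line in stream:
        # Split from the right so that the last `dim` fields are the vector.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or empty rows
        words.append(parts[0])
        vectors.append([float(v) for v in parts[1:]])
    return words, np.array(vectors, dtype=np.float32)


# Usage with an in-memory sample instead of a real model.txt file.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
words, matrix = read_w2v_text(sample)
print(words, matrix.shape)
```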

`frames` -- has limitation in terms of the application for sentiment analysis only

`LabelsScaler` -- too much usage for train/infer

Since we only need the labels count, to use it in the form of a list of labels [0, c), where c is the number of labels

remove labels declaration from tutorial

PZhou and ZYang configs -- using different `BasicLSTM` cells by default.

replace with original and complex LSTM by default

`BaseIDProvider` -- this class should not provide the connection of `opinions` and `samples` [Refactoring ids]

Copied from AREkit

nicolay-r/AREkit#376

AREnets/arenets/arekit/common/data/row_ids/base.py

Lines 1 to 20 in e41b058

    
    class BaseIDProvider(object):
        """
        Opinion in text is a sequence of opinions in context
        o1, o2, o3, ..., on

        o1 -- first_text_opinion
        i -- index in lined (for example: i=3 => 03)

        # TODO. This should be definitely refactored. This implementation
          TODO. combines opinion-based and sample-based data sources, which allows
          TODO. us to bypass such connection via external foreign keys.

          Since we are head to remove opinions, there is a need to refactor so in a
          way of an additional column that provides such information for further connection
          between rows of different storages.
        """

        SEPARATOR = '_'
        OPINION = "o{}" + SEPARATOR
        INDEX = "i{}" + SEPARATOR

remove sample_id
remove opinion_id parsing; use a separated column instead.

English embedding adaptation

Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/sample-train-0.jsonl
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/vocab.txt
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/term_embedding.npz
INFO:arenets.arekit.contrib.utils.np_utils.embedding:Embedding read [size=(302866, 300)]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/term_embedding.npz
INFO:arenets.arekit.contrib.utils.np_utils.vocab:Loading vocabulary [size=302866]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/vocab.txt
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/local/lib/python3.6/dist-packages/arenets/quickstart/train.py", line 96, in train
    "data_type": DataType.Train})
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/common/pipeline/base.py", line 18, in run
    input_data = item.apply(input_data=input_data, pipeline_ctx=pipeline_ctx)
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/common/pipeline/items/base.py", line 11, in apply
    output_data = self.apply_core(input_data=input_data, pipeline_ctx=pipeline_ctx)
  File "/usr/local/lib/python3.6/dist-packages/arenets/pipelines/items/training.py", line 171, in apply_core
    data_type=data_type if data_type is not None else DataType.Train)
  File "/usr/local/lib/python3.6/dist-packages/arenets/pipelines/items/training.py", line 115, in __handle_iteration
    terms_vocab=self.__emb_io.load_vocab(),
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/contrib/utils/io_utils/embedding.py", line 30, in load_vocab
    return dict(VocabRepositoryUtils.load(source))
ValueError: dictionary update sequence element #0 has length 4; 2 is required

Remove non-utilized functions from AREkit kernel.

Remove DataFolding
f639f0d
Remove BaseDataFolding; use only pre-separated samples. (#7 related)
Remove duplicated Label instance for test; the latter causes extra errors.
Storages: rows modification and blanking.
get_row might be refactored onto usage of __iter__ for BaseRowsStorage (d5e438c)
remove model name suffix model (train/infer)

Google colab -- running large training processes should be from files with `!python code.py`

Solution:
Use as IDE as follows:

Networks -- config initialization might be simplified [Backlog]

Problem: in most cases the following parameters need to be initialized only after you encounter an exception.
Reason: some of the parameters below are None by default, which means it is necessary to initialize them.

This might be wrapped for a better usage:

config = network_config_func()
config.modify_classes_count(value=labels_count)
config.modify_learning_rate(0.01)
config.modify_use_class_weights(True)
config.modify_dropout_keep_prob(0.9)
config.modify_bag_size(1)
config.modify_bags_per_minibatch(1)
config.modify_embedding_dropout_keep_prob(1.0)
config.modify_terms_per_context(50)
config.modify_use_entity_types_in_embedding(False)
config.set_pos_count(PartOfSpeechTypesService.get_mystem_pos_count())
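One way such a wrapper could look is sketched below. `init_config` and its keyword defaults are hypothetical; the `modify_*` and `set_pos_count` calls are the ones listed above. `RecordingConfig` is only a stand-in that records calls so the sketch runs without the framework installed.

```python
def init_config(config, labels_count, pos_count, terms_per_context=50,
                learning_rate=0.01, dropout_keep_prob=0.9):
    """Apply the commonly required settings in one call (hypothetical helper)."""
    config.modify_classes_count(value=labels_count)
    config.modify_learning_rate(learning_rate)
    config.modify_use_class_weights(True)
    config.modify_dropout_keep_prob(dropout_keep_prob)
    config.modify_bag_size(1)
    config.modify_bags_per_minibatch(1)
    config.modify_embedding_dropout_keep_prob(1.0)
    config.modify_terms_per_context(terms_per_context)
    config.modify_use_entity_types_in_embedding(False)
    config.set_pos_count(pos_count)
    return config


class RecordingConfig:
    """Stand-in for a network config; records every method call it receives."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that are not defined, i.e. config methods.
        def method(*args, **kwargs):
            self.calls.append((name, args, kwargs))
        return method


config = init_config(RecordingConfig(), labels_count=3, pos_count=10)
print(len(config.calls))
```

With a real config, the `pos_count` argument would come from `PartOfSpeechTypesService.get_mystem_pos_count()` as in the snippet above.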

`model_log_dir` -- is actually about hidden states rather than logging; supposed to be updated

`Colab` -- neural networks could not be trained with `GPU` instances due to the old API of Tensorflow (lack of CUDNN and CUDA drivers)

need to update to tensorflow==2.0+ API

`NetworksTrainingPipelineItem` -- `load` parameter is not used

#9 related

Tensorflow warnings

Reasons: TensorFlow 1.14.0 is pretty outdated. The main contribution of this framework is data preparation rather than model training.

move dependency into network contrib module
make it a particular kernel. (Move all the tf dependencies into the related subfolder)
xw_plus_b
Adadelta optimizer
Dropout wrapper

Missing values of the embedding and vocabulary

For now, we consider the incomplete embedding only in the case when the model is fine-tuned from scratch

`data` -- provide for RUS/ENG

download links
