
nicolay-r / arenets


TensorFlow-based framework that provides attentive implementations of conventional neural network models (CNN- and RNN-based) applicable to Relation Extraction classification tasks, as well as an API for custom model implementation

License: MIT License

Python 84.97% Jupyter Notebook 14.97% Shell 0.07%
attention bilstm cnn-model lstm pcnn rnn tensorflow neural-network classification ml

arenets's Introduction

Hi, I'm Nicolay! 👋

  • My personal website at github for more information about me
  • Combine it with track-and-field 🏃‍♂️, ⛷️ and 🌊🏄‍♂️



arenets's Issues

`BaseIDProvider` -- this class should not provide the connection of `opinions` and `samples` [Refactoring ids]

Copied from AREkit

nicolay-r/AREkit#376

class BaseIDProvider(object):
    """
    Opinion in text is a sequence of opinions in context
    o1, o2, o3, ..., on
    o1 -- first_text_opinion
    i -- index in lined (for example: i=3 => 03)

    TODO. This should be definitely refactored. This implementation
    TODO. combines opinion-based and sample-based data sources, which allows
    TODO. us to bypass such connection via external foreign keys.
    Since we are head to remove opinions, there is a need to refactor so in a
    way of an additional column that provides such information for further connection
    between rows of different storages.
    """
    SEPARATOR = '_'
    OPINION = "o{}" + SEPARATOR
    INDEX = "i{}" + SEPARATOR

  • remove sample_id
  • remove opinion_id parsing; use a separate column instead.
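
The difference between the two approaches can be sketched as follows. This is only an illustration under assumptions: the constants are copied from the class above, while make_sample_id, parse_opinion_id, make_row and the row layout are hypothetical names that do not come from AREkit or AREnets.

```python
SEPARATOR = "_"
OPINION = "o{}" + SEPARATOR
INDEX = "i{}" + SEPARATOR


def make_sample_id(opinion_id, index):
    """Current style: both values are packed into one string identifier."""
    return (OPINION + INDEX).format(opinion_id, index)


def parse_opinion_id(sample_id):
    """The kind of parsing the issue proposes to drop (hypothetical helper)."""
    opinion_part = sample_id.split(SEPARATOR)[0]  # e.g. "o3"
    return int(opinion_part[1:])                  # strip the leading "o"


def make_row(row_id, opinion_id, text):
    """Proposed style (assumed layout): the link lives in its own column."""
    return {"id": row_id, "opinion_id": opinion_id, "text": text}


row = make_row(row_id=0, opinion_id=3, text="...")
# No string parsing is needed to join rows of different storages:
linked_opinion = row["opinion_id"]
```

With a dedicated column the identifier format can change freely, since nothing depends on its internal structure any more.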

Tensorflow warnings

Reason: TensorFlow 1.14.0 is outdated. The main contribution of this framework is data preparation rather than model training.

  • move dependency into network contrib module
  • make it a particular kernel. (Move all the tf dependencies into the related subfolder)
  • xw_plus_b
  • Adadelta optimizer
  • Dropout wrapper
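
One possible way to keep the TensorFlow dependency confined to a single module is to import it lazily. The sketch below is an assumption about how that could look, not the layout used by AREnets; try_import and require_backend are hypothetical helper names.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require_backend(module_name="tensorflow"):
    """Import the backend only when a network is actually built."""
    module = try_import(module_name)
    if module is None:
        raise RuntimeError(
            "Backend '{}' is required for model training".format(module_name))
    return module
```

Data preparation code can then be used without importing TensorFlow at all, and only the network-related code calls require_backend().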

`test_samples_iter` -- test failed

  • Returning None instead of [] in some cases leads to incorrect behavior during sampling. This is caused by the subj and obj inds, which are expected to be declared anyway, with non-empty synonym lists.
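
A minimal sketch of the kind of guard that avoids this class of failure is shown below. The helper name as_list is hypothetical and the actual fix in the repository may differ.

```python
def as_list(value):
    """Normalize an optional collection so callers can always iterate it."""
    return [] if value is None else list(value)


# Iterating over the result is safe even when nothing was found.
for ind in as_list(None):
    pass
```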

Remove non-utilized functions from AREkit kernel.

  • Remove DataFolding
  • f639f0d
  • Remove BaseDataFolding; use only pre-separated samples. (#7 related)
  • Remove the duplicated Label instance for test; the latter causes extra errors.
  • Storages: rows modification and blanking.
  • get_row might be refactored onto usage of __iter__ for BaseRowsStorage (d5e438c)
  • remove model name suffix model (train/infer)

English embedding adaptation

Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/sample-train-0.jsonl
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/vocab.txt
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/term_embedding.npz
INFO:arenets.arekit.contrib.utils.np_utils.embedding:Embedding read [size=(302866, 300)]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/term_embedding.npz
INFO:arenets.arekit.contrib.utils.np_utils.vocab:Loading vocabulary [size=302866]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/vocab.txt
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/local/lib/python3.6/dist-packages/arenets/quickstart/train.py", line 96, in train
    "data_type": DataType.Train})
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/common/pipeline/base.py", line 18, in run
    input_data = item.apply(input_data=input_data, pipeline_ctx=pipeline_ctx)
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/common/pipeline/items/base.py", line 11, in apply
    output_data = self.apply_core(input_data=input_data, pipeline_ctx=pipeline_ctx)
  File "/usr/local/lib/python3.6/dist-packages/arenets/pipelines/items/training.py", line 171, in apply_core
    data_type=data_type if data_type is not None else DataType.Train)
  File "/usr/local/lib/python3.6/dist-packages/arenets/pipelines/items/training.py", line 115, in __handle_iteration
    terms_vocab=self.__emb_io.load_vocab(),
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/contrib/utils/io_utils/embedding.py", line 30, in load_vocab
    return dict(VocabRepositoryUtils.load(source))
ValueError: dictionary update sequence element #0 has length 4; 2 is required
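
The final ValueError is the one dict() raises when an element of its input sequence is not a 2-item pair; here the first element has length 4. What VocabRepositoryUtils.load actually yields is not shown in the log, so the following is only a sketch under the assumption that the vocabulary is expected as (word, index) rows. pairs_to_vocab is a hypothetical helper that reports the offending row instead of failing inside dict().

```python
def pairs_to_vocab(rows):
    """Build a vocabulary dict, checking that every row is a 2-item pair."""
    vocab = {}
    for row_index, row in enumerate(rows):
        if len(row) != 2:
            raise ValueError(
                "vocab row {} has {} fields, expected 2: {!r}".format(
                    row_index, len(row), row))
        word, index = row
        vocab[word] = index
    return vocab


# dict([("a", "b", "c", "d")]) fails with the same message as in the log:
# "dictionary update sequence element #0 has length 4; 2 is required"
```

Note that a bare 4-character string as the first element triggers the same message, since dict() treats it as a sequence of length 4.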

Networks -- config initialization might be simplified [Backlog]

Problem: in most cases you learn that the following parameters must be initialized only after running into an exception.
Reason: some of the parameters below are None by default, which means they must be initialized explicitly.

This might be wrapped for easier usage:

config = network_config_func()
config.modify_classes_count(value=labels_count)
config.modify_learning_rate(0.01)
config.modify_use_class_weights(True)
config.modify_dropout_keep_prob(0.9)
config.modify_bag_size(1)
config.modify_bags_per_minibatch(1)
config.modify_embedding_dropout_keep_prob(1.0)
config.modify_terms_per_context(50)
config.modify_use_entity_types_in_embedding(False)
config.set_pos_count(PartOfSpeechTypesService.get_mystem_pos_count())
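
One way such a wrapper could look is sketched below. The modify_* and set_pos_count method names and the default values are taken from the snippet above; init_config, DEFAULTS and RecordingConfig are hypothetical names introduced for illustration, and RecordingConfig is only a stand-in for a real network config.

```python
# Defaults copied from the snippet above.
DEFAULTS = {
    "learning_rate": 0.01,
    "use_class_weights": True,
    "dropout_keep_prob": 0.9,
    "bag_size": 1,
    "bags_per_minibatch": 1,
    "embedding_dropout_keep_prob": 1.0,
    "terms_per_context": 50,
    "use_entity_types_in_embedding": False,
}


def init_config(config, labels_count, pos_count, **overrides):
    """Apply all mandatory settings in one call (hypothetical wrapper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError("Unknown parameters: {}".format(sorted(unknown)))
    config.modify_classes_count(value=labels_count)
    config.set_pos_count(pos_count)
    params = dict(DEFAULTS, **overrides)
    for name, value in params.items():
        getattr(config, "modify_" + name)(value)
    return config


class RecordingConfig(object):
    """Stand-in config that records every setter call (not part of AREnets)."""

    def __init__(self):
        self.calls = {}

    def __getattr__(self, name):
        # Only reached for attributes that are not set on the instance.
        def setter(value):
            self.calls[name] = value
        return setter


cfg = init_config(RecordingConfig(), labels_count=3, pos_count=10,
                  learning_rate=0.1)
```

With a real config, pos_count would come from PartOfSpeechTypesService.get_mystem_pos_count() as in the snippet above.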

Better user experience [Personal Experience backlog]

  • Provide TermsPerContext as a custom config callback modifier.
  • Batch size -- required so as not to exceed the memory limit;
  • Learning Rate -- to modify the speed of training;
  • model_input_type -- (single, ctx)
  • Disable hidden states logging on training by default
  • Move the train acc limit outside
  • Support input as the list of terms (#23)
  • text_a -> text
  • dropout
  • remove -0 (index) suffix for input filenames
  • switch to English examples instead of Russian (due to internationalization)
  • rename rnn to lstm
  • rename att-bilstm to att-bilstm-pzhou
  • data description tutorial!
  • avg_acc and avg_loss

PZhou and ZYang attentions -- clarify that `output` is actually a `rnn_output` which reduces the time series

Reason: these functions are only applicable to RNN networks:

# Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)
# Final output with tanh
output = tf.tanh(output)
return output, alphas

and

output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)
if not return_alphas:
    return output
else:
    return output, alphas
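
The shape reduction performed by both fragments can be illustrated without TensorFlow. The sketch below is a dependency-free re-statement of tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1) on nested lists, assuming inputs of shape (B, T, D) and alphas of shape (B, T); attention_reduce is a hypothetical name.

```python
def attention_reduce(inputs, alphas):
    """Weighted sum over the time axis: (B, T, D) and (B, T) -> (B, D)."""
    output = []
    for sequence, weights in zip(inputs, alphas):
        depth = len(sequence[0])
        reduced = [
            sum(w * step[d] for step, w in zip(sequence, weights))
            for d in range(depth)
        ]
        output.append(reduced)
    return output


# One sequence (B=1) of two time steps (T=2) with two features (D=2).
rnn_output = [[[1.0, 2.0], [3.0, 4.0]]]
alphas = [[0.5, 0.5]]
reduced = attention_reduce(rnn_output, alphas)
```

The time axis disappears in the result, which is why naming the argument rnn_output rather than output makes the expected input clearer.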

Embedding -- considering `-1` for missed value is not a solution in general

TERM_VALUE_MISSING = -1

It may result in the following exception:

  File "/home/nicolay/proj/AREnets/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[23,12] = -1 is not in [0, 302866)
	 [[node embedding_lookup (defined at /arenets/context/architectures/base/base.py:352) ]]
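
A common alternative to a negative sentinel is to reserve an in-range index for unknown terms, so that every index passed to the embedding lookup stays within [0, vocab_size). The sketch below assumes the vocabulary contains such a reserved entry; terms_to_indices and the "<UNKNOWN>" token are hypothetical and not taken from AREnets.

```python
def terms_to_indices(terms, vocab, unknown_word="<UNKNOWN>"):
    """Map terms to embedding rows; missing terms share one reserved row."""
    if unknown_word not in vocab:
        raise KeyError("Vocabulary has no entry for {!r}".format(unknown_word))
    unknown_index = vocab[unknown_word]
    return [vocab.get(term, unknown_index) for term in terms]


vocab = {"<UNKNOWN>": 0, "cat": 1, "sat": 2}
indices = terms_to_indices(["cat", "dog", "sat"], vocab)
# Every index is a valid row of the embedding matrix.
all_in_range = all(0 <= i < len(vocab) for i in indices)
```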
