nicolay-r / arenets

A TensorFlow-based framework that provides attention-based implementations of conventional neural network models (CNN- and RNN-based) for relation extraction classification tasks, along with an API for implementing custom models.

License: MIT License

Python 84.97% Jupyter Notebook 14.97% Shell 0.07%
attention bilstm cnn-model lstm pcnn rnn tensorflow neural-network classification ml

arenets's Introduction

Hi, I'm Nicolay! 👋

  • My personal website at github for more information about me
  • Combine it with track-and-field 🏃‍♂️, ⛷️ and 🌊🏄‍♂️

arenets's People

Contributors

nicolay-r

arenets's Issues

PZhou and ZYang attentions -- clarify that `output` is actually a `rnn_output` which reduces the time series

Reason: these functions are only applicable for RNN networks:

# Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)
# Final output with tanh
output = tf.tanh(output)
return output, alphas

and

output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)
if not return_alphas:
    return output
else:
    return output, alphas
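
The reduction both excerpts share can be illustrated without TensorFlow. The following is a minimal sketch, assuming nested Python lists in place of tensors; the function name `attention_reduce` and the `apply_tanh` flag are hypothetical, and the parameter name `rnn_output` follows the renaming this issue proposes:

```python
import math


def attention_reduce(rnn_output, alphas, apply_tanh=False):
    """Reduce a (B, T, D) nested list to (B, D) using (B, T) attention weights.

    Hypothetical helper: mirrors tf.reduce_sum(inputs * expand_dims(alphas, -1), 1).
    """
    outputs = []
    for sequence, weights in zip(rnn_output, alphas):
        dim = len(sequence[0])
        # Weighted sum over the time axis: sum_t alpha_t * h_t
        reduced = [
            sum(w * step[d] for w, step in zip(weights, sequence))
            for d in range(dim)
        ]
        if apply_tanh:
            # Optional final non-linearity, as in the first excerpt
            reduced = [math.tanh(v) for v in reduced]
        outputs.append(reduced)
    return outputs


# One sequence (B=1) with T=2 time steps and D=3 features.
rnn_output = [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]]
alphas = [[0.25, 0.75]]
reduced = attention_reduce(rnn_output, alphas)
```

The time axis disappears in the result, which is why an input without a time dimension (for example, a pooled CNN output) would not fit these functions.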

Better user experience [Personal Experience backlog]

  • TermsPerContext -- provide as a custom config callback modifier.
  • Batch size -- needed to avoid exceeding the memory limit;
  • Learning Rate -- to adjust the speed of training;
  • model_input_type -- (single, ctx)
  • Disable logging of hidden states during training by default
  • Move the train acc limit outside
  • Support input as the list of terms (#23)
  • text_a -> text
  • dropout
  • remove -0 (index) suffix for input filenames
  • switch to English examples instead of Russian (due to internationalization)
  • rename rnn to lstm
  • rename att-bilstm to att-bilstm-pzhou
  • data description tutorial!
  • avg_acc and avg_loss

Embedding -- using `-1` for a missing value is not a general solution

TERM_VALUE_MISSING = -1

It may result in the following exception:

  File "/home/nicolay/proj/AREnets/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[23,12] = -1 is not in [0, 302866)
	 [[node embedding_lookup (defined at /arenets/context/architectures/base/base.py:352) ]]
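
One possible way around the out-of-range lookup is to remap the sentinel to a dedicated in-range index before the indices reach the embedding layer. This is a minimal sketch, not the project's fix: `remap_missing` and `unk_index` are hypothetical names, and it assumes one extra row is appended to the embedding matrix for unknown terms.

```python
TERM_VALUE_MISSING = -1  # sentinel value quoted in the issue


def remap_missing(indices, unk_index, missing_value=TERM_VALUE_MISSING):
    """Replace the missing-value sentinel with a valid embedding row index.

    Hypothetical helper; `indices` is a batch given as a list of lists.
    """
    return [
        [unk_index if i == missing_value else i for i in row]
        for row in indices
    ]


vocab_size = 302866          # size reported in the error message
unk_index = vocab_size       # assumption: an extra row appended for unknown terms
batch = [[5, -1, 17], [-1, 0, 302865]]
safe_batch = remap_missing(batch, unk_index)
```

With this remapping the embedding matrix would need `vocab_size + 1` rows; reserving index 0 for unknown terms instead is an equally plausible design.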

`test_samples_iter` -- test failed

  • Returning None instead of [] in some cases leads to incorrect behavior during sampling. This is caused by the subj and obj inds, which are expected to be declared in any case, with non-empty synonym lists.
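
A defensive normalization along these lines might avoid the None case. This is only a sketch; `as_list` and `has_subject_and_object` are hypothetical helpers, not functions from the code base:

```python
def as_list(value):
    """Return `value` as a list, mapping None to an empty list."""
    return [] if value is None else list(value)


def has_subject_and_object(subj_inds, obj_inds):
    """Check that both index lists are declared and non-empty."""
    return len(as_list(subj_inds)) > 0 and len(as_list(obj_inds)) > 0
```

Callers can then iterate over the result or take its length without a separate None check.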

`BaseIDProvider` -- this class should not provide the connection of `opinions` and `samples` [Refactoring ids]

Copied from AREkit

nicolay-r/AREkit#376

class BaseIDProvider(object):
    """
    Opinion in text is a sequence of opinions in context
    o1, o2, o3, ..., on
    o1 -- first_text_opinion
    i -- index in lined (for example: i=3 => 03)
    # TODO. This should be definitely refactored. This implementation
    TODO. combines opinion-based and sample-based data sources, which allows
    TODO. us to bypass such connection via external foreign keys.
    Since we are head to remove opinions, there is a need to refactor so in a
    way of an additional column that provides such information for further connection
    between rows of different storages.
    """
    SEPARATOR = '_'
    OPINION = "o{}" + SEPARATOR
    INDEX = "i{}" + SEPARATOR

  • remove sample_id
  • remove opinion_id parsing; use a separated column instead.
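
The difference between parsing an ID string and keeping a separate column can be sketched as follows. The way the two templates are concatenated into one ID, and the names `legacy_sample_id`, `parse_opinion_ind` and `make_row`, are assumptions made for illustration; only the three constants come from the class above.

```python
SEPARATOR = "_"
OPINION = "o{}" + SEPARATOR
INDEX = "i{}" + SEPARATOR


def legacy_sample_id(opinion_ind, index):
    """Assumed legacy scheme: both values are packed into one string ID."""
    return OPINION.format(opinion_ind) + INDEX.format(index)


def parse_opinion_ind(sample_id):
    """Recover the opinion index by parsing; breaks if the template changes."""
    return int(sample_id.split(SEPARATOR)[0][1:])


def make_row(row_ind, opinion_ind, index):
    """Proposed alternative: a plain row ID plus explicit columns."""
    return {"id": row_ind, "opinion_id": opinion_ind, "index": index}


row = make_row(row_ind=0, opinion_ind=3, index=0)
```

With explicit columns, joining rows from different storages becomes a plain column lookup and no longer depends on the string format.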

English embedding adaptation

Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/sample-train-0.jsonl
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/vocab.txt
INFO:arenets.arekit.contrib.utils.io_utils.utils:Check existence [True]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/term_embedding.npz
INFO:arenets.arekit.contrib.utils.np_utils.embedding:Embedding read [size=(302866, 300)]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/term_embedding.npz
INFO:arenets.arekit.contrib.utils.np_utils.vocab:Loading vocabulary [size=302866]: /content/drive/MyDrive/Share/SemEval2023-task6-c/_unsorted/vocab.txt
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/local/lib/python3.6/dist-packages/arenets/quickstart/train.py", line 96, in train
    "data_type": DataType.Train})
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/common/pipeline/base.py", line 18, in run
    input_data = item.apply(input_data=input_data, pipeline_ctx=pipeline_ctx)
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/common/pipeline/items/base.py", line 11, in apply
    output_data = self.apply_core(input_data=input_data, pipeline_ctx=pipeline_ctx)
  File "/usr/local/lib/python3.6/dist-packages/arenets/pipelines/items/training.py", line 171, in apply_core
    data_type=data_type if data_type is not None else DataType.Train)
  File "/usr/local/lib/python3.6/dist-packages/arenets/pipelines/items/training.py", line 115, in __handle_iteration
    terms_vocab=self.__emb_io.load_vocab(),
  File "/usr/local/lib/python3.6/dist-packages/arenets/arekit/contrib/utils/io_utils/embedding.py", line 30, in load_vocab
    return dict(VocabRepositoryUtils.load(source))
ValueError: dictionary update sequence element #0 has length 4; 2 is required
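
The final `ValueError` is what `dict()` raises when an element of the input sequence is not a pair. The format of `vocab.txt` is not shown here, so the following is only a sketch of one plausible cause (a term that contains whitespace and therefore splits into more than two fields) and a tolerant loader; `load_vocab_pairs` and the "term index" line format are assumptions.

```python
def load_vocab_pairs(lines):
    """Yield (term, index) pairs from lines of the assumed form 'term index'.

    Splitting from the right keeps terms with inner whitespace intact.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        term, index = line.rsplit(None, 1)
        yield term, int(index)


# Reproducing the error: a 4-element row cannot become a key/value pair.
try:
    dict([("new", "york", "city", "7")])
except ValueError as error:
    message = str(error)

vocab = dict(load_vocab_pairs(["cat 0\n", "new york city 1\n"]))
```

Whether the English vocabulary actually contains multi-word terms would need to be checked against the file itself.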

Remove non-utilized functions from the AREkit kernel.

  • Remove DataFolding
  • f639f0d
  • Remove BaseDataFolding; use only pre-separated samples. (#7 related)
  • Remove the duplicated Label instance for test; the latter causes extra errors.
  • Storages: rows modification and blanking.
  • get_row might be refactored to use __iter__ for BaseRowsStorage (d5e438c)
  • remove model name suffix model (train/infer)

Networks -- config initialization might be simplified [Backlog]

Problem: in most cases the following parameters have to be initialized explicitly, which usually becomes apparent only once an exception is raised.
Reason: some of the parameters below are None by default, so they must be initialized before use.

This might be wrapped for a better usage:

config = network_config_func()
config.modify_classes_count(value=labels_count)
config.modify_learning_rate(0.01)
config.modify_use_class_weights(True)
config.modify_dropout_keep_prob(0.9)
config.modify_bag_size(1)
config.modify_bags_per_minibatch(1)
config.modify_embedding_dropout_keep_prob(1.0)
config.modify_terms_per_context(50)
config.modify_use_entity_types_in_embedding(False)
config.set_pos_count(PartOfSpeechTypesService.get_mystem_pos_count())
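
The calls above could be grouped behind a single helper. This is a minimal sketch, assuming the defaults shown in the snippet; `init_config` is a hypothetical name, `pos_count` is passed in directly so the example does not depend on `PartOfSpeechTypesService`, and `RecordingConfig` is a stand-in that only records calls so the example runs on its own.

```python
class RecordingConfig:
    """Hypothetical stand-in for a network config; records every modifier call."""

    def __init__(self):
        self.calls = {}

    def __getattr__(self, name):
        # Invoked only for attributes that do not exist, i.e. the modifiers.
        def record(*args, **kwargs):
            self.calls[name] = args[0] if args else kwargs.get("value")
        return record


def init_config(config, labels_count, pos_count,
                learning_rate=0.01, terms_per_context=50):
    """Apply the initialization sequence from the issue in one place."""
    config.modify_classes_count(value=labels_count)
    config.modify_learning_rate(learning_rate)
    config.modify_use_class_weights(True)
    config.modify_dropout_keep_prob(0.9)
    config.modify_bag_size(1)
    config.modify_bags_per_minibatch(1)
    config.modify_embedding_dropout_keep_prob(1.0)
    config.modify_terms_per_context(terms_per_context)
    config.modify_use_entity_types_in_embedding(False)
    config.set_pos_count(pos_count)
    return config


config = init_config(RecordingConfig(), labels_count=3, pos_count=10)
```

In real usage the first argument would be the object returned by `network_config_func()`, and `pos_count` would come from `PartOfSpeechTypesService.get_mystem_pos_count()`.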

Tensorflow warnings

Reason: TensorFlow 1.14.0 is considerably outdated. The main contribution of this framework is data preparation rather than model training.

  • move dependency into network contrib module
  • make it a particular kernel. (Move all the tf dependencies into the related subfolder)
  • xw_plus_b
  • Adadelta optimizer
  • Dropout wrapper
