
isanlp_rst's Introduction

Python 3.8

IsaNLP RST Parser

This Python 3 library provides an RST parser for Russian based on neural network models trained on the RuRSTreebank Russian discourse corpus. The parser is meant to be used together with the IsaNLP library and can be considered one of its modules.

Installation

  1. Install the IsaNLP library:
pip install git+https://github.com/IINemo/isanlp.git
  2. Deploy Docker containers for syntax and discourse parsing:
docker run --rm -d -p 3334:3333 --name spacy_ru tchewik/isanlp_spacy:ru
docker run --rm -d -p 3335:3333 --name rst_ru tchewik/isanlp_rst:2.1-rstreebank
  3. Connect from Python using PipelineCommon:
from isanlp import PipelineCommon
from isanlp.processor_remote import ProcessorRemote
from isanlp.processor_razdel import ProcessorRazdel

# put the address here ->
address_syntax = ('', 3334)
address_rst = ('', 3335)

ppl_ru = PipelineCommon([
    (ProcessorRazdel(), ['text'],
     {'tokens': 'tokens',
      'sentences': 'sentences'}),
    (ProcessorRemote(address_syntax[0], address_syntax[1], '0'),
     ['tokens', 'sentences'],
     {'lemma': 'lemma',
      'morph': 'morph',
      'syntax_dep_tree': 'syntax_dep_tree',
      'postag': 'postag'}),
    (ProcessorRemote(address_rst[0], address_rst[1], 'default'),
     ['text', 'tokens', 'sentences', 'postag', 'morph', 'lemma', 'syntax_dep_tree'],
     {'rst': 'rst'})
])

text = ("Парацетамол является широко распространённым центральным ненаркотическим анальгетиком, обладает довольно "
        "слабыми противовоспалительными свойствами. Вместе с тем при приёме больших доз может вызывать нарушения "
        "работы печени, кровеносной системы и почек. Риски нарушений работы данных органов и систем "
        "увеличивается при одновременном принятии спиртного, поэтому лицам, употребляющим алкоголь, рекомендуют "
        "употреблять пониженную дозу парацетамола.")

res = ppl_ru(text)
  4. The res variable should contain all annotations, including the RST annotations stored in res['rst']; each tree annotation in the list represents one or more paragraphs of the given text.
{'text': 'Парацетамол является широко распространённым ...',
 'tokens': [<isanlp.annotation.Token at 0x7f833dee0910>, ...],
 'sentences': [<isanlp.annotation.Sentence at 0x7f833dee07d0>, ...],
 'lemma': [['парацетамол', 'являться', ...], ...],
 'morph': [[{'Animacy': 'Inan', 'Case': 'Nom', ...}, ...], ...],
 'syntax_dep_tree': [[<isanlp.annotation.WordSynt at 0x7f833deddc10>, ...], ...],
 'postag': [['NOUN', ...], ...],
 'rst': [<isanlp.annotation_rst.DiscourseUnit at 0x7f833defa5d0>]}
  5. The variable res['rst'] can be visualized as:

  6. To convert a list of DiscourseUnit objects to an *.rs3 file with visualization, run:

from isanlp.annotation_rst import ForestExporter

exporter = ForestExporter(encoding='utf8')
exporter(res['rst'], 'filename.rs3')

Package overview

  1. The discourse parser, implemented in the ProcessorRST class. Path: src/isanlp_rst/processor_rst.py.
  2. Trained neural network models for RST parser: models for segmentation, structure prediction, and label prediction. Path: models.
  3. Docker container tchewik/isanlp_rst with preinstalled libraries and models. Use the command: docker run --rm -p 3335:3333 tchewik/isanlp_rst

Usage

The usage example is available in examples/usage.ipynb.

RST data structures

The results of the RST parser are stored as a list of isanlp.annotation_rst.DiscourseUnit objects. Each object represents a tree for one or more paragraphs of the text. DiscourseUnit objects have the following members:

  • id (int): id of a discourse unit.
  • start (int): starting position (in characters) of the current discourse unit's span in the original text.
  • end (int): ending position (in characters) of the current discourse unit's span in the original text.
  • relation (string): 'elementary' if the current unit is a discourse tree leaf, otherwise the RST relation.
  • nuclearity (string): nuclearity orientation of the current unit: _ for elementary discourse units, or one of NS, SN, NN for non-elementary units.
  • left (DiscourseUnit or None): left child node of a non-elementary unit.
  • right (DiscourseUnit or None): right child node of a non-elementary unit.
  • proba (float): probability of the node's presence, obtained from the structure classifier.

DiscourseUnit objects can be treated as binary trees. For example, relation pairs can be extracted from a tree like this:

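def extr_pairs(tree, text):
    """Collect [left text, right text, relation, nuclearity] for every non-elementary unit."""
    pp = []
    if tree.left:  # elementary units (leaves) have no children
        pp.append([text[tree.left.start:tree.left.end],
                   text[tree.right.start:tree.right.end],
                   tree.relation, tree.nuclearity])
        # Recurse into both children and concatenate their pairs
        pp += extr_pairs(tree.left, text)
        pp += extr_pairs(tree.right, text)
    return pp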

print(extr_pairs(res['rst'][0], res['text']))
# [['Президент Филиппин заявил,', 'что поедет на дачу, если будут беспорядки.', 'attribution', 'SN'], 
# ['что поедет на дачу,', 'если будут беспорядки.', 'condition', 'NS']]
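
The same recursive pattern can be used to list the elementary discourse units of a tree. The sketch below is only an illustration: Unit is a hypothetical stand-in that mirrors the members listed above so the example runs without the parser, and collect_edus is not part of the library. With real parser output, res['rst'][0] and res['text'] would take the place of root and text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    """Hypothetical stand-in mirroring the documented DiscourseUnit members."""
    id: int
    start: int
    end: int
    relation: str = 'elementary'
    nuclearity: str = '_'
    left: Optional['Unit'] = None
    right: Optional['Unit'] = None
    proba: float = 1.0


def collect_edus(tree, text: str) -> List[str]:
    """Return the text of every leaf (elementary unit), left to right."""
    if tree.left is None and tree.right is None:
        return [text[tree.start:tree.end]]
    edus = []
    for child in (tree.left, tree.right):
        if child is not None:
            edus += collect_edus(child, text)
    return edus


# A hand-built toy tree; offsets are character positions in `text`.
text = 'She said that she would stay if it rained.'
leaf1 = Unit(1, 0, 8)
leaf2 = Unit(2, 9, 28)
leaf3 = Unit(3, 29, 42)
inner = Unit(4, 9, 42, 'condition', 'NS', leaf2, leaf3)
root = Unit(5, 0, 42, 'attribution', 'SN', leaf1, inner)

print(collect_edus(root, text))
# ['She said', 'that she would stay', 'if it rained.']
```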

Cite

https://link.springer.com/chapter/10.1007/978-3-030-72610-2_8

  • Gost: Chistova E., Shelmanov A., Pisarevskaya D., Kobozeva M. and Isakov V., Panchenko A., Toldova S. and Smirnov I. RST Discourse Parser for Russian: An Experimental Study of Deep Learning Models // Proceedings of Analysis of Images, Social Networks and Texts (AIST). — 2020. — P. 105-119.

  • BibTeX:

@inproceedings{chistova2020rst,
  title={{RST} Discourse Parser for {R}ussian: An Experimental Study of Deep Learning Models},
  author={Chistova, Elena and Shelmanov, Artem and Pisarevskaya, Dina and Kobozeva, Maria and Isakov, Vadim and Panchenko, Alexander and Toldova, Svetlana and Smirnov, Ivan},
  booktitle={Proceedings of Analysis of Images, Social Networks and Texts (AIST)},
  pages={105--119},
  year={2020}
}
  • Springer: Chistova E. et al. (2021) RST Discourse Parser for Russian: An Experimental Study of Deep Learning Models. In: van der Aalst W.M.P. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2020. Lecture Notes in Computer Science, vol 12602. Springer, Cham. https://doi.org/10.1007/978-3-030-72610-2_8


isanlp_rst's Issues

Error when accessing the isanlp_rst container

Hello!

I ran into an error when running the model. Most likely I overlooked something, but I followed the instructions. The cause seems to be that the container shuts down on its own 1-2 minutes after being started with docker run.

---------------------------------------------------------------------------
_InactiveRpcError                         Traceback (most recent call last)
Cell In [6], line 30
      9 ppl_ru = PipelineCommon([
     10     (ProcessorRazdel(), ['text'],
     11      {'tokens': 'tokens',
   (...)
     21      {'rst': 'rst'})
     22 ])
     24 text = ("Парацетамол является широко распространённым центральным ненаркотическим анальгетиком, обладает довольно "
     25         "слабыми противовоспалительными свойствами. Вместе с тем при приёме больших доз может вызывать нарушения "
     26         "работы печени, кровеносной системы и почек. Риски нарушений работы данных органов и систем "
     27         "увеличивается при одновременном принятии спиртного, поэтому лицам, употребляющим алкоголь, рекомендуют "
     28         "употреблять пониженную дозу парацетамола.")
---> 30 res = ppl_ru(text)

File ~/Desktop/Универ/rst_parsing/.venv/lib/python3.10/site-packages/isanlp/pipeline_common.py:74, in PipelineCommon.__call__(self, *input_data)
     71 result = {e : inp for (e, inp) in zip(list(self._processors.values())[0][1], input_data)}
     73 for proc, proc_input, proc_output in list(self._processors.values()):
---> 74     results = proc(*[result[e] for e in proc_input])
     75     if type(results) is tuple:
     76         results = {i : results[i] for i in range(len(results))}

File ~/Desktop/Универ/rst_parsing/.venv/lib/python3.10/site-packages/isanlp/processor_remote.py:42, in ProcessorRemote.__call__(self, *input_data)
     38 pb_ann.Pack(annotation_to_protobuf.convert_annotation(input_data))
     39 request = annotation_pb2.ProcessRequest(pipeline_name = self._pipeline_name, 
     40                                         input_annotations = pb_ann)
---> 42 response = self._stub.process(request)
     43 return annotation_from_protobuf.convert_annotation(response.output_annotations)

File ~/Desktop/Универ/rst_parsing/.venv/lib/python3.10/site-packages/grpc/_channel.py:946, in _UnaryUnaryMultiCallable.__call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    937 def __call__(self,
    938              request,
    939              timeout=None,
   (...)
    942              wait_for_ready=None,
    943              compression=None):
    944     state, call, = self._blocking(request, timeout, metadata, credentials,
    945                                   wait_for_ready, compression)
--> 946     return _end_unary_response_blocking(state, call, False, None)

File ~/Desktop/Универ/rst_parsing/.venv/lib/python3.10/site-packages/grpc/_channel.py:849, in _end_unary_response_blocking(state, call, with_call, deadline)
    847         return state.response
    848 else:
--> 849     raise _InactiveRpcError(state)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:3335: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:3335: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2022-11-05T18:43:29.799901+03:00"}"
>

I pulled the images and started them:

docker run --rm -d -p 3334:3333 --name spacy_ru tchewik/isanlp_spacy:ru
docker run --rm -d -p 3335:3333 --name rst_ru tchewik/isanlp_rst:2.1-rstreebank

I call the parser as in the example (with localhost substituted):

from isanlp import PipelineCommon
from isanlp.processor_remote import ProcessorRemote
from isanlp.processor_razdel import ProcessorRazdel

# put the address here ->
address_syntax = ('localhost', 3334)
address_rst = ('localhost', 3335)

ppl_ru = PipelineCommon([
    (ProcessorRazdel(), ['text'],
     {'tokens': 'tokens',
      'sentences': 'sentences'}),
    (ProcessorRemote(address_syntax[0], address_syntax[1], '0'),
     ['tokens', 'sentences'],
     {'lemma': 'lemma',
      'morph': 'morph',
      'syntax_dep_tree': 'syntax_dep_tree',
      'postag': 'postag'}),
    (ProcessorRemote(address_rst[0], address_rst[1], 'default'),
     ['text', 'tokens', 'sentences', 'postag', 'morph', 'lemma', 'syntax_dep_tree'],
     {'rst': 'rst'})
])

text = ("Парацетамол является широко распространённым центральным ненаркотическим анальгетиком, обладает довольно "
        "слабыми противовоспалительными свойствами. Вместе с тем при приёме больших доз может вызывать нарушения "
        "работы печени, кровеносной системы и почек. Риски нарушений работы данных органов и систем "
        "увеличивается при одновременном принятии спиртного, поэтому лицам, употребляющим алкоголь, рекомендуют "
        "употреблять пониженную дозу парацетамола.")

res = ppl_ru(text)

If I remove the second ProcessorRemote, everything works, though without RST parsing, of course. So the problem is in its container.

[Python 3.10.4]
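
A quick way to check whether the containers are actually accepting connections before building the pipeline is to probe the published ports. This is only a diagnostic sketch using the standard library: port_is_open is a hypothetical helper, and the host and ports are the ones from the docker run commands above. A successful TCP connection does not prove the gRPC service inside is healthy, only that something is listening.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts
        return False


# Ports as published by the docker run commands in this report
for name, port in [('spacy_ru', 3334), ('rst_ru', 3335)]:
    state = 'reachable' if port_is_open('localhost', port) else 'not reachable'
    print(f'{name} on port {port}: {state}')
```

If the rst_ru port turns out to be unreachable, docker ps shows whether the container is still running. Because it was started with --rm, a container that has exited is removed together with its logs, so starting it once without -d may be the easiest way to see why it stops.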

label_encoder bug

file_label_encoder = os.path.join(self.model_dir_path, 'label_encoder.pkl')
self._label_encoder = pickle.load(open(file_label_encoder, 'rb')) if os.path.isfile(
file_one_hot_encoder) else None

It looks like file_one_hot_encoder on line 29 should be replaced with file_label_encoder. Shall I open a PR?
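
One way to avoid this kind of mismatch is to have the existence check and the load use the same path variable. The sketch below is an assumption about how the fix could look, not the project's actual code: load_pickle_if_exists is a hypothetical helper, and the commented usage line only reuses the names from the snippet above.

```python
import os
import pickle


def load_pickle_if_exists(path):
    """Load a pickled object from path, or return None if the file is missing."""
    if not os.path.isfile(path):
        return None
    # The context manager closes the file even if unpickling fails
    with open(path, 'rb') as f:
        return pickle.load(f)


# Hypothetical usage inside the class from the snippet above:
# self._label_encoder = load_pickle_if_exists(
#     os.path.join(self.model_dir_path, 'label_encoder.pkl'))
```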

Main stream

Current:

  1. Update segmentation model for Russian to the one used in the Shared Task.
  2. Prepare a document with an example of inconsistency in the tree annotation.

Future:

  1. Create a pipeline for RST parsing of Russian.
  2. Create a Docker container that can be used as a stand-alone service (text in => RST annotation out) together with other linguistic services from isanlp.
  3. Fix the annotation of trees in the RST corpus.
  4. Train a new model for constructing RST trees.
  5. Create evaluation scripts for the whole pipeline.
