parstdex's Introduction

HengamTagger or Parstdex (Persian time date extractor)

Description

Parstdex (known as HengamTagger in our AACL paper) is a rule-based Persian temporal extractor built on regular expressions. It defines pattern units and combines them into patterns that match temporal expressions.
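To make the idea of pattern units and patterns concrete, here is a minimal sketch using only Python's re module. The UNITS table, the compile_pattern helper, and the pattern string are hypothetical illustrations and are not taken from Parstdex's own pattern files.

```python
import re

# Hypothetical pattern units: each name maps to a small regex fragment.
# These are illustrative only, not Parstdex's actual definitions.
UNITS = {
    "DAY": r"[۰-۹]{1,2}",
    "MONTH": r"(?:فروردین|اردیبهشت|خرداد|تیر|مرداد|شهریور|مهر|آبان|آذر|دی|بهمن|اسفند)",
    "YEAR": r"[۰-۹]{4}",
}


def compile_pattern(pattern):
    """Expand unit names in a space-separated pattern into one compiled regex."""
    parts = []
    for token in pattern.split():
        # Known unit names are replaced by their fragment; anything else is a literal word.
        parts.append(UNITS.get(token, re.escape(token)))
    return re.compile(r"\s+".join(parts))


date_regex = compile_pattern("DAY MONTH سال YEAR")
match = date_regex.search("در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش.")
if match:
    print(match.group(), match.span())
```

A new pattern in this sketch is just another string passed to compile_pattern, which is the kind of extensibility a unit-based design aims for.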

How to Install parstdex

pip install parstdex

How to use

from parstdex import Parstdex

model = Parstdex()

sentence = """ماریا شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به نادیا زنگ زد اما تا سه روز بعد در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش. خبری از نادیا نشد"""

Extract spans

model.extract_span(sentence)

output :

{"datetime": [[6, 47], [68, 78], [82, 111]], "date": [[6, 10], [68, 78], [82, 111]], "time": [[11, 47]]}
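The spans shown above appear to be [start, end) character offsets into the sentence. Assuming that convention holds, the matched text can be recovered by slicing. The helper name spans_to_text is an assumption for illustration, and the spans dictionary is copied from the output above rather than produced by calling the library.

```python
def spans_to_text(sentence, spans):
    """Map each category to the substrings covered by its [start, end) spans."""
    return {
        category: [sentence[start:end] for start, end in pairs]
        for category, pairs in spans.items()
    }


sentence = "ماریا شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به نادیا زنگ زد اما تا سه روز بعد در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش. خبری از نادیا نشد"

# Copied from the extract_span output shown above.
spans = {
    "datetime": [[6, 47], [68, 78], [82, 111]],
    "date": [[6, 10], [68, 78], [82, 111]],
    "time": [[11, 47]],
}

texts = spans_to_text(sentence, spans)
for category, items in texts.items():
    print(category, items)
```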

Extract markers

model.extract_marker(sentence)

output :

{
   "datetime":{
      "[6, 47]":"شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به",
      "[68, 78]":"سه روز بعد",
      "[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
   },
   "date":{
      "[6, 10]":"شنبه",
      "[68, 78]":"سه روز بعد",
      "[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
   },
   "time":{
      "[11, 47]":"عصر راس ساعت ۱۷ و بیست و سه دقیقه به"
   }
}
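In the output above, each span is a string key such as "[6, 47]". Assuming the result really is a dictionary keyed this way, a small helper can turn those keys back into integer tuples for further processing. The name parse_marker_keys is hypothetical, and the sample data is a fragment of the output shown above.

```python
import json


def parse_marker_keys(markers):
    """Convert string keys like "[6, 10]" into (start, end) integer tuples."""
    parsed = {}
    for category, items in markers.items():
        parsed[category] = [
            (tuple(json.loads(key)), text) for key, text in items.items()
        ]
    return parsed


# Fragment of the extract_marker output shown above.
markers = {
    "date": {"[6, 10]": "شنبه", "[68, 78]": "سه روز بعد"},
}

parsed = parse_marker_keys(markers)
print(parsed["date"])
```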

Extract TimeML scheme

model.extract_time_ml(sentence)

output :

ماریا 
<TIMEX3 type='DATE'>
شنبه
</TIMEX3>
<TIMEX3 type='TIME'>
عصر راس ساعت ۱۷ و بیست و سه دقیقه به
</TIMEX3>
 نادیا زنگ زد اما 
<TIMEX3 type='DURATION'>
تا سه روز بعد
</TIMEX3>
 در 
<TIMEX3 type='DATE'>
تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش.
</TIMEX3>
خبری از نادیا نشد
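If the TimeML output needs to be consumed programmatically, the TIMEX3 tags can be pulled out with a regular expression. This is a sketch that assumes the tag format is exactly as printed above (a single type attribute in single quotes). The helper parse_timex3 is hypothetical, and the sample string is a shortened version of the output above.

```python
import re

# Matches <TIMEX3 type='...'>text</TIMEX3>, allowing newlines inside the tag body.
TIMEX3 = re.compile(r"<TIMEX3 type='([A-Z]+)'>(.*?)</TIMEX3>", re.DOTALL)


def parse_timex3(tagged):
    """Return (type, text) pairs for every TIMEX3 element in the string."""
    return [(kind, text.strip()) for kind, text in TIMEX3.findall(tagged)]


# Shortened sample in the format shown above.
tagged = (
    "ماریا <TIMEX3 type='DATE'>شنبه</TIMEX3> نادیا زنگ زد اما "
    "<TIMEX3 type='DURATION'>\nتا سه روز بعد\n</TIMEX3> در"
)

expressions = parse_timex3(tagged)
print(expressions)
```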

Extract markers' NER tags

DATTIM mode (Default):

model.extract_ner(sentence, mode="dattim")

output :

[
    ("ماریا", "O"),
    ("شنبه", "B-DAT"),
    ("عصر", "B-TIM"),
    ("راس", "I-TIM"),
    ("ساعت", "I-TIM"),
    ("۱۷", "I-TIM"),
    ("و", "I-TIM"),
    ("بیست", "I-TIM"),
    ("و", "I-TIM"),
    ("سه", "I-TIM"),
    ("دقیقه", "I-TIM"),
    ("به", "I-TIM"),
    ("نادیا", "O"),
    ("زنگ", "O"),
    ("زد", "O"),
    ("اما", "O"),
    ("تا", "B-DAT"),
    ("سه", "I-DAT"),
    ("روز", "I-DAT"),
    ("بعد", "I-DAT"),
    ("در", "I-DAT"),
    ("تاریخ", "I-DAT"),
    ("۱۸", "I-DAT"),
    ("شهریور", "I-DAT"),
    ("سال", "I-DAT"),
    ("۱۳۷۸", "I-DAT"),
    ("ه", "I-DAT"),
    (".", "I-DAT"),
    ("ش", "I-DAT"),
    (".", "I-DAT"),
    ("خبری", "O"),
    ("از", "O"),
    ("نادیا", "O"),
    ("نشد", "O"),
]
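The token-level tags follow the usual B-/I-/O convention, so they can be grouped back into whole entities. The following is a minimal sketch of that grouping. The function bio_to_entities is an illustrative name, not part of Parstdex, and the sample is a shortened version of the output above.

```python
def bio_to_entities(tagged):
    """Group (token, tag) pairs in BIO format into (label, text) entities."""
    entities = []
    label, tokens = None, []

    def flush():
        # Emit the entity collected so far, if any.
        if tokens:
            entities.append((label, " ".join(tokens)))

    for token, tag in tagged:
        if tag == "O":
            flush()
            label, tokens = None, []
        elif tag.startswith("B-") or tag[2:] != label:
            # A B- tag, or an I- tag whose label differs, starts a new entity.
            flush()
            label, tokens = tag[2:], [token]
        else:
            tokens.append(token)
    flush()
    return entities


sample = [
    ("ماریا", "O"),
    ("شنبه", "B-DAT"),
    ("عصر", "B-TIM"),
    ("راس", "I-TIM"),
    ("نادیا", "O"),
]
entities = bio_to_entities(sample)
print(entities)
```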

TMP mode:

model.extract_ner(sentence, mode="tmp")

output :

[
    ("ماریا", "O"),
    ("شنبه", "B-TMP"),
    ("عصر", "I-TMP"),
    ("راس", "I-TMP"),
    ("ساعت", "I-TMP"),
    ("۱۷", "I-TMP"),
    ("و", "I-TMP"),
    ("بیست", "I-TMP"),
    ("و", "I-TMP"),
    ("سه", "I-TMP"),
    ("دقیقه", "I-TMP"),
    ("به", "I-TMP"),
    ("نادیا", "O"),
    ("زنگ", "O"),
    ("زد", "O"),
    ("اما", "O"),
    ("تا", "B-TMP"),
    ("سه", "I-TMP"),
    ("روز", "I-TMP"),
    ("بعد", "I-TMP"),
    ("در", "I-TMP"),
    ("تاریخ", "I-TMP"),
    ("۱۸", "I-TMP"),
    ("شهریور", "I-TMP"),
    ("سال", "I-TMP"),
    ("۱۳۷۸", "I-TMP"),
    ("ه", "I-TMP"),
    (".", "I-TMP"),
    ("ش", "I-TMP"),
    (".", "I-TMP"),
    ("خبری", "O"),
    ("از", "O"),
    ("نادیا", "O"),
    ("نشد", "O"),
]
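Comparing the two outputs above, the TMP tags look like the DATTIM tags with the DAT/TIM distinction dropped and adjacent entities merged into one. The sketch below reproduces that relationship for the example shown. It is an observation about these two outputs only, not a description of how Parstdex computes TMP mode internally, and dattim_to_tmp is a hypothetical helper.

```python
def dattim_to_tmp(tagged):
    """Collapse DAT/TIM BIO tags into TMP tags, merging adjacent entities."""
    result = []
    inside = False  # True while the previous token was part of a temporal span
    for token, tag in tagged:
        if tag == "O":
            result.append((token, "O"))
            inside = False
        else:
            # The first temporal token after an "O" opens a span; the rest continue it.
            result.append((token, "I-TMP" if inside else "B-TMP"))
            inside = True
    return result


sample = [
    ("ماریا", "O"),
    ("شنبه", "B-DAT"),
    ("عصر", "B-TIM"),
    ("راس", "I-TIM"),
    ("نادیا", "O"),
    ("تا", "B-DAT"),
]
converted = dattim_to_tmp(sample)
print(converted)
```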


File Structure:

Parstdex's architecture is flexible and scalable, which makes it easy to add support for new patterns that are not covered yet.

├── parstdex
│   ├── utils
│   │   ├── annotation
│   │   │   └── ...
│   │   ├── pattern
│   │   │   └── ...
│   │   ├── special_words
│   │   │   └── words.txt
│   │   ├── const.py
│   │   ├── normalizer.py
│   │   ├── pattern_to_regex.py
│   │   ├── deprecation.py
│   │   ├── regex_tool.py
│   │   ├── spans.py
│   │   └── tokenizer.py
│   ├── marker_extractor.py
│   └── settings.py
├── Test
│   ├── data.json
│   └── test_parstdex.py
├── examples.py
├── performance_test.ipynb
├── requirement.txt
└── setup.py

Performance Test

The executable code and the performance test results are available on Google Colab.

The average time required to obtain temporal expressions is 6 ms. This test was conducted using 264 sentences with an average length of 50 characters that covered all of the patterns.
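A measurement like the one above could be reproduced roughly as follows. This is a sketch under assumptions and not the benchmark from the Colab notebook: average_latency_ms is a hypothetical helper, and the demo uses str.split as a stand-in so the snippet runs without Parstdex installed. To time the real extractor, pass model.extract_span and your own list of sentences.

```python
import time


def average_latency_ms(extract, sentences):
    """Return the mean wall-clock time per sentence, in milliseconds."""
    if not sentences:
        raise ValueError("sentences must not be empty")
    start = time.perf_counter()
    for sentence in sentences:
        extract(sentence)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(sentences)


# Stand-in callable and data; replace with model.extract_span and real sentences.
demo_sentences = ["ماریا شنبه عصر به نادیا زنگ زد"] * 100
latency = average_latency_ms(str.split, demo_sentences)
print(f"{latency:.4f} ms per sentence")
```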

How to contribute

Please feel free to provide us with any feedback or suggestions. You can find more information on how to contribute to Parstdex by reading the contribution document.

Citation

If you use any part of this repository in your research, please cite it using the following BibTeX entry.

@inproceedings{mirzababaei-etal-2022-hengam,
	title        = {Hengam: An Adversarially Trained Transformer for {P}ersian Temporal Tagging},
	author       = {Mirzababaei, Sajad  and Kargaran, Amir Hossein  and Sch{\"u}tze, Hinrich  and Asgari, Ehsaneddin},
	year         = 2022,
	booktitle    = {Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing},
	publisher    = {Association for Computational Linguistics},
	address      = {Online only},
	pages        = {1013--1024},
	url          = {https://aclanthology.org/2022.aacl-main.74}
}

parstdex's People

Contributors

hamidjahad, kargaranamir, optimopium

parstdex's Issues

Problem in installation

Hi,
When I run the command py -m pip install parstdex, I get the following output, which indicates that something is wrong with the package:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting parstdex
  Downloading parstdex-1.3.1-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.9/44.9 kB 170.8 kB/s eta 0:00:00
Collecting pytextspan~=0.5.0
  Downloading pytextspan-0.5.4.tar.gz (9.2 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]

      Cargo, the Rust package manager, is not installed or is not on PATH.
      This package requires Rust and Cargo to compile extensions. Install it through
      the system's package manager or via https://rustup.rs/

      Checking for Rust toolchain....
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

P.S. 1- I use Python 3.10.
2- I actually wanted to use QuestionGenerator from here, but I got the error ModuleNotFoundError: No module named 'parstdex', and then ran into the problem described above.

Any idea how I should solve it?
Thanks in advance.
