
chatette's Introduction


Chatette logo
Chatette

A data generator for Rasa NLU


Installation · Uninstallation · How to use Chatette? · Chatette vs Chatito? · Development · Credits

Chatette is a Python program that generates training datasets for Rasa NLU given template files. If you want to make large datasets of example data for Natural Language Understanding tasks without too much of a headache, Chatette is a project for you.

Preview of Chatette's capabilities

Specifically, Chatette implements a Domain Specific Language (DSL) that allows you to define templates to generate a large number of sentences, which are then saved in the input format(s) of Rasa NLU.

The DSL used is a near-superset of the excellent project Chatito created by Rodrigo Pimentel. (Note: the DSL is actually a superset of Chatito v2.1.x for Rasa NLU, not for all possible adapters.)

An interactive mode is available as well:

Interactive mode

Installation

To run Chatette, you will need to have Python installed. Chatette works with both Python 2.7 and 3.x (>= 3.4).

Chatette is available on PyPI, and can thus be installed using pip:

pip install chatette

Alternatively, you can clone the GitHub repository and install the requirements:

pip install -r requirements/common.txt

You can then install the project (as an editable package) using pip, by executing the following command from the directory Chatette/chatette/:

pip install -e .

You can then run the module by using the commands below in the cloned directory.

Uninstallation

You can just use pip to uninstall Chatette:

pip uninstall chatette

How to use Chatette?

Input and output data

The data that Chatette uses and generates is loaded from and saved to files. You will thus have:

  • One or several input file(s) containing the templates. There is no need for a specific file extension. The syntax of the DSL to make those templates is described on the wiki.

  • One or several output file(s), which will be generated by Chatette and will contain the generated examples. Those files can be formatted in JSON (by default) or in Markdown and can be directly fed to Rasa NLU. It is also possible to use a JSONL format.

Running Chatette

Once Chatette is installed and you have created the template files, run the following command:

python -m chatette <path_to_template>

where python is your Python interpreter (some operating systems use python3 as the alias to the Python 3.x interpreter).

You can specify the name of the output file as follows:

python -m chatette <path_to_template> -o <output_directory_path>

<output_directory_path> is specified relative to the directory from which the script is being executed. The output file(s) will be saved as numbered .json files in <output_directory_path>/train and <output_directory_path>/test. If you don't specify an output directory, the default one is output.
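For instance, the numbered training files can be gathered back into a single list of examples with a few lines of Python. This is a minimal sketch, assuming the default Rasa JSON layout with a top-level `rasa_nlu_data.common_examples` list; the `load_examples` helper name is ours:

```python
import glob
import json
import os

def load_examples(output_dir):
    """Gather Rasa-format examples from Chatette's numbered .json output files."""
    examples = []
    for path in sorted(glob.glob(os.path.join(output_dir, "train", "*.json"))):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Each file holds a Rasa NLU payload with a list of common examples.
        examples.extend(data["rasa_nlu_data"]["common_examples"])
    return examples
```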

Other program arguments are described in the wiki.

Chatette vs Chatito?

TL;DR: the main selling point is that large projects are easier to deal with in Chatette, and most Chatito projects can be used as Chatette input without any modification.

A perfectly legitimate question is:

Why does Chatette exist when Chatito already fulfills the same purposes?

The two projects actually have different goals:

Chatito aims to be a generic but powerful DSL that stays very legible. While it is perfectly fine for small projects, the simplicity of its DSL may become a burden as projects grow larger: your template file becomes overwhelmingly large, to the point that you get lost inside it.

Chatette defines a more complex DSL to be able to manage larger projects and tries to stay as interoperable with Chatito as possible. Here is a non-exhaustive list of features Chatette has and that Chatito does not have:

  • Ability to break down templates into multiple files
  • Possibility to specify the probability of generating some parts of the sentences
  • Conditional generation of some parts of the sentences, given which other parts were generated
  • Choice syntax to prevent copy-pasting rules with only a few changes and to easily modify the generation behavior of parts of sentences
  • Ability to define the value of each slot (entity) regardless of the generated example
  • Syntax for generating words with different case for the leading letter
  • Argument support so that some templates may be filled by different strings in different situations
  • Permissive indentation that must only be somewhat consistent
  • Support for synonyms
  • Interactive command interpreter
  • Output for Rasa in JSON or in Markdown formats

Since Chatette's DSL is a superset of Chatito's, input files written for Chatito are usually directly usable with Chatette (though not the other way around). It is therefore easy to start using Chatette if you have used Chatito before.

As an example, this Chatito data:

// This template defines different ways to ask for the location of toilets (Chatito version)
%[ask_toilet]('training': '3')
    ~[sorry?] ~[tell me] where the @[toilet#singular] is ~[please?]?
    ~[sorry?] ~[tell me] where the @[toilet#plural] are ~[please?]?

~[sorry]
    sorry
    Sorry
    excuse me
    Excuse me

~[tell me]
    ~[can you?] tell me
    ~[can you?] show me
~[can you]
    can you
    could you
    would you

~[please]
    please

@[toilet#singular]
    toilet
    loo
@[toilet#plural]
    toilets

could be directly given as input to Chatette, but this Chatette template would produce the same results:

// This template defines different ways to ask for the location of toilets (Chatette version)
%[&ask_toilet](3)
    ~[sorry?] ~[tell me] where the @[toilet#singular] is [please?]?
    ~[sorry?] ~[tell me] where the @[toilet#plural] are [please?]?

~[sorry]
    sorry
    excuse me

~[tell me]
    ~[can you?] [tell|show] me
~[can you]
    [can|could|would] you

@[toilet#singular]
    toilet
    loo
@[toilet#plural]
    toilets

The Chatito version is arguably easier to read, but the Chatette version is shorter, which may be very useful when dealing with lots of templates and potential repetition.

Beware that, as always with machine learning, having too much data may cause your models to perform worse because of overfitting. While this script can be used to generate thousands upon thousands of examples, doing so is not advisable for machine learning tasks.

Chatette is named after Chatito: -ette in French could be translated to -ita or -ito in Spanish. Note that the last e in Chatette is not pronounced (as is the case in "note").

Development

For developers: clone the repository and install the development requirements with pip install -r requirements/develop.txt. Then, install the module as editable: pip install -e <path-to-chatette-module>

Credits

Author and maintainer

Disclaimer: This is a side-project I'm not paid for, don't expect me to work 24/7 on it.

Contributors

Many thanks to them!

chatette's People

Contributors

kizivat, simgus, valentincalomme, vsfedorenko, ziligy


chatette's Issues

Bug with file paths when including files

Hi,

First of all, thank you for the excellent tool, I love it!

I noticed the following issue starting from 1.4.0. If you include a file that includes other files, those other files are assumed to be in the same folder as the initial main file.

Assume the following file structure:
/path1/main.chatette - includes /path2/shared.chatette
/path2/shared.chatette - includes /path2/aliases.chatette
/path2/aliases.chatette

If we try to generate examples using main.chatette, in this scenario, Chatette will look for the aliases.chatette file in /path1/ and not in /path2/, which is incorrect and throws an error.

Thanks!
Martin

entity output in markdown file not in proper format

If we have chatito file as below

%[duplicate]
    duplicate my @[date]

@[date]
    ~[last week]
    ~[week]

~[week]
    current week
    week

~[last week]
    last week
    prior week
    previous week

The markdown file output is as follows

intent:duplicate

  • duplicate my current week [] (date)
  • duplicate my last week [] (date)
  • duplicate my previous week [] (date)

However, I think the output should be as follows

intent:duplicate

  • duplicate my [current week](date)
  • duplicate my [last week](date)
  • duplicate my [previous week](date)

@SimGus , please check on this and let me know the correct behavior. Thanks!

Syntax check

Hi, first of all many thanks for this awesome project; we are using it intensively in our chatbot. We have a very large number of examples, which makes generation slow, but that isn't the issue. The problem is that sometimes there are syntax errors that raise an exception during generation, and I would like to know if there is a way to run a command that simply checks all the templates; this could be useful during testing in our CI pipeline.
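Until such a check command exists, one workaround (a sketch, not an official Chatette feature; the paths and the `templates_parse_ok` helper are placeholders) is to run the generation in CI and fail the pipeline on a non-zero exit code, since a template syntax error makes the process exit with an error:

```python
import subprocess
import sys

def templates_parse_ok(cmd):
    """Run a command and report whether it exited cleanly, plus its stderr.

    A syntax error in the templates raises an exception inside Chatette,
    which surfaces as a non-zero exit code that CI can fail on.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr

# Hypothetical CI usage (paths are placeholders):
# ok, err = templates_parse_ok(
#     [sys.executable, "-m", "chatette", "templates/master.chatette", "-o", "/tmp/chatette-ci"]
# )
```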

"end-index" support for JSONL adapter

Currently, generating examples using the JSONL adapter provides the following fields for each slot: start-index, text, and slot-name. For instance:

{
    "slot-name": "PER",
    "value": "Albert Einstein",
    "start-index": 0
}

At this point, if I want to infer the span of a slot (e.g., to pass it along to spaCy's NER), I can find the end index by summing the length of the value and the start index.
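That summation can be sketched in a line of Python (the `end_index` helper name is ours; field names follow the JSONL fields listed above):

```python
def end_index(slot):
    """Infer the end index of a slot span from its start index and surface text.

    Only works when the slot's surface text is available; when just a
    substituted value (e.g. a knowledge-base ID) is present, the span
    length cannot be recovered this way.
    """
    return slot["start-index"] + len(slot["text"])

print(end_index({"slot-name": "PER", "text": "Albert Einstein", "start-index": 0}))  # 15
```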

While this is fine when no specific value is provided, when a value is provided for the slot it becomes impossible to know for sure what the span of the slot is. For instance, if "Albert Einstein" is mapped to some concept ID in a knowledge base:

{
    "slot-name": "PER",
    "value": "KB4245435",
    "start-index": 0
}

It could potentially be doable to go through the generated synonyms, but this seems way harder than it needs to be. What I propose is the addition of an end-index field, as follows:

{
    "slot-name": "PER",
    "value": "KB4245435",
    "start-index": 0,
    "end-index":  15
}

Theory question: using word vectors for similarity generation

Hey,

I'm a big fan of Rasa and these NLU-set generation platforms, but in my experience (as noted) they can quickly lead to overfitting as it can be hard to generate the true range of data you might expect from real labeled data (perhaps an unrealistic expectation).

I think, in part, the reason for this is the inability of rule based substitutions/synonyms to really capture this variety.

Thus, I wonder if it might be useful to explore substitutions based on some unsupervised embeddings. For example, rather than specifying synonyms we use a word2vec model to choose words based on similarity. One could even go further and use something like BERT to utilise context.
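As a toy illustration of the idea (the three-dimensional vectors below are made up for demonstration; a real setup would load pretrained word2vec or GloVe embeddings), substitution candidates can be ranked by cosine similarity:

```python
import math

# Made-up toy embeddings; not real word2vec output.
VECTORS = {
    "deliver": [0.9, 0.1, 0.0],
    "ship":    [0.85, 0.2, 0.05],
    "send":    [0.8, 0.15, 0.1],
    "eat":     [0.0, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(word, topn=2):
    """Rank other vocabulary words by similarity: candidate substitutions."""
    scores = [(other, cosine(VECTORS[word], vec))
              for other, vec in VECTORS.items() if other != word]
    return [w for w, _ in sorted(scores, key=lambda p: -p[1])[:topn]]
```

With vectors like these, "deliver" would be substituted by "ship" or "send" rather than "eat", which is the behaviour rule-based synonym lists have to spell out by hand.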

This might seem a bit circular but in my mind is akin to semi-supervised learning. The assumption would be of course that the word vectors are appropriate to the domain or application. That being said, the w2v process is unsupervised and so people who do have domain specific data, even if unlabelled, could benefit from it.

A couple of motivating examples. I have been looking at building an NLU system to extract intent and entities for a chatbot that gives quotes for a freight company. Part of the issue here is that some of the entities needed are address components (cities, suburbs, postcodes), which can be a bit tricky even with a gazetteer. The paragraphs we would like to process are also sometimes quite long. I have found that intent classification was quite straightforward, but the NER was harder (it also needs to extract dimensions like length, width, height and weight). The variety in the observed data is significant. For example:

- Just wanted to check if you pick up a fridge from [sydney](suburb) and deliver to the [Hunter valley](suburb)
- Customer is missing [1,000](quantity) of item XXXXX from order. Weight is approx [500 lbs](weight)
- I have made a booking for a package to be delivered from [Ballarat](suburb) to [Port Macquarie](suburb) starting Monday.
- Hi I have a client in [Moranbah](suburb) [QLD](state) [4744](postcode) which wishes to pick up from the freight depo. were is the freight depo at [Moranbah](suburb) and the address please?

These are all quite similar but I think, maybe naively and if so please do prove me wrong, quite hard to extract good DSL rules to generate things like this.

Similarly, it would be very interesting if one could actually 'train' the DSL rules based on an input dataset, again using word vectors.

Apologies for the long post and if this is the wrong forum for this. I think these tools are crucial for NLU and I'm just looking for ways to extend their applicability.

very long generation time

Thanks for putting time and effort in such an amazing project, Great work!

Currently I'm using Chatette in a project (generation modifiers are awesome). One problem though: the generation takes too much time!

I'm generating about ~35K sentences.

Statistics:

Parsed files: 29
Declared units: 80 (80 variations)
	Declared intents: 8 (8 variations)
	Declared slots: 17 (17 variations)
	Declared aliases: 55 (55 variations)
Parsed rules: 11031

Generation takes about ~1 hour, while a clone in Chatito takes about ~10 min.

My question is:
What is the complexity of Chatette? In other words, which aspect of the statistics above is the generation time directly proportional to?
(I noticed that Chatette runs on a single core, so I tried splitting the master file into 4 files and used Ray to generate each master file on a separate worker.
With this I managed to get the generation time down to ~25 min, but that is still quite an overhead.)

If anyone has a thought or advice, I'd be grateful!

Does this tool support Chinese?

I'm just wondering whether this tool supports a Chinese corpus.

For example, am I supposed to use Jieba or another Chinese tokenizer? And is there an interface reserved for a Chinese tokenizer...

Thanks a lot.

Duplicate strings in output

Observation

It's possible to generate multiple exact duplicate phrases using the DSL.

For example, this input file...

%[greet]
    ~[&greet] ~[&bot?]

~[greet]
    {hi/hello/howdy/greetings/good morning/good day/good evening}

~[bot]
    hal
    bot

Gives this output (truncated to first two entries)...

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "entities": [],
        "intent": "greet",
        "text": "hi"
      },
      {
        "entities": [],
        "intent": "greet",
        "text": "hi"
      },
      {
         ...

Suggestion

As it isn't always going to be immediately obvious where DSLs like this are going to generate a duplicate phrase, I'd suggest duplicates be stripped before generating the JSON output.
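Until duplicates are stripped at generation time, a post-processing pass over the generated common_examples can remove them; a minimal sketch (the `dedupe_examples` helper name is ours; field names follow the Rasa JSON output above):

```python
def dedupe_examples(examples):
    """Drop exact duplicates (same intent and text), keeping first occurrences in order."""
    seen = set()
    unique = []
    for ex in examples:
        key = (ex["intent"], ex["text"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```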

Many thanks for your consideration.

random generation of a content of my training data

Hi all,

I'm on Linux with multiple processors and I have configured my chatette.py as follows:

from chatette.facade import Facade
import sys
facade = Facade(sys.argv[1], sys.argv[2], "rasamd", seed='ljjeek', force_overwriting=True)
facade.run()

My problem is that Chatette runs in parallel in a random manner. If I launch Chatette twice:

The first time I obtain:

intent: intent 1

intent: intent 2

intent: intent 3

The second time I obtain:

intent: intent 3

intent: intent 1

intent: intent 2

So the order of the intents is not the same. Is there a solution to fix this issue?

Thanks

Support for mutually-exclusive selection in multi-occuring slot

For example:

%[&request_agg](100)
    [~[i need?]|what is] [the?] @[aggregation][ and @[aggregation]?][\??]

@[aggregation]
    average [value?]
    median [value?]
    minimum [value?]
    maximum [value?]

~[i need]
    [&i]['d| would?] [need|want]
    [&i]['d| would] like
    give [me?]

aggregation occurs twice (since the list slot type is possible in Rasa). I want it not to use the same slot value for both occurrences in the same sentence. Example:

what is minimum and maximum?

Is correct, but:

what is maximum and maximum?

is not.
How do I do this using the DSL rules?
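I don't know of a DSL-level exclusivity constraint; one workaround (a sketch, not a Chatette feature; the `distinct_slot_values` helper is ours and field names follow the Rasa JSON format shown elsewhere on this page) is to filter the generated examples afterwards:

```python
def distinct_slot_values(example):
    """True if no entity label appears twice with the same value in this example."""
    seen = set()
    for ent in example.get("entities", []):
        key = (ent["entity"], ent["value"])
        if key in seen:
            return False
        seen.add(key)
    return True

examples = [
    {"text": "what is minimum and maximum?",
     "entities": [{"entity": "aggregation", "value": "minimum"},
                  {"entity": "aggregation", "value": "maximum"}]},
    {"text": "what is maximum and maximum?",
     "entities": [{"entity": "aggregation", "value": "maximum"},
                  {"entity": "aggregation", "value": "maximum"}]},
]
# Keep only examples whose repeated slot carries distinct values.
kept = [ex for ex in examples if distinct_slot_values(ex)]
```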

Nested slots

I haven't found a solution for nesting slots with synonyms, like:

@[paymentPeriod:hour]
	hour
	h

@[paymentPeriod:day]
	day
	d

@[paymentPeriod]
	@[paymentPeriod:hour]
        @[paymentPeriod:day]

I want to use the nested paymentPeriod slot like this

No,[ per|per?] @[paymentPeriod]

But right now I get

- No, [[](paymentPeriod)d](paymentPeriod:day)
- No, [[da](paymentPeriod)y](paymentPeriod:day)
- No, [[](paymentPeriod)h](paymentPeriod:hour)
- No, [[hou](paymentPeriod)r](paymentPeriod:hour)
...

UnicodeDecodeError if use Russian language

Hi,

I tried to create intents with Russian examples but there is an issue:
Traceback (most recent call last):

  File "...\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "...\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "...\lib\site-packages\chatette\__main__.py", line 111, in <module>
    main()
  File "...\lib\site-packages\chatette\__main__.py", line 22, in main
    facade.run()
  File "...\lib\site-packages\chatette\facade.py", line 90, in run
    self.run_parsing()
  File "...\lib\site-packages\chatette\facade.py", line 95, in run_parsing
    self.parser.parse_file(self.master_file_path)
  File "...\lib\site-packages\chatette\parsing\parser.py", line 92, in parse_file
    line = self.input_file_manager.read_line()
  File "...\lib\site-packages\chatette\parsing\input_file_manager.py", line 163, in read_line
    line = self._current_file.readline()
  File "...\lib\site-packages\chatette\parsing\line_count_file_wrapper.py", line 28, in readline
    return self.f.readline()
  File "...\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 25: character maps to <undefined>

Intent with Russian phrase:

%[intent_no]
  у меня нет никаких вопросов

Versions:

Python 3.7.4
chatette==1.6.2
Windows 10
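The traceback shows the template being read with Windows' default cp1252 codec, while the file bytes are UTF-8; the snippet below only demonstrates that codec mismatch with the same Cyrillic text, it is not a patch for Chatette itself:

```python
# UTF-8 bytes of the Russian phrase from the intent above.
utf8_bytes = "у меня нет никаких вопросов".encode("utf-8")

# Decoding those bytes as cp1252 hits an undefined byte (0x8f), exactly
# as in the reported traceback.
try:
    utf8_bytes.decode("cp1252")
except UnicodeDecodeError as err:
    print("cp1252 fails:", err)

print(utf8_bytes.decode("utf-8"))  # decodes correctly
```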

Memory issues

I know it's been said before about the very long generation time, and I too am experiencing that. However, a bigger concern I have is the memory usage. I'm running Chatette on Hyperconverge on a VM that my Rasa bot is on. This VM is kUbuntu. Here's the uname -a output:
Linux Huginn 6.2.0-34-generic #34~22.04-1-Ubuntu SMP PREEMPT DYNAMIC Thu Sep 7 13:12:03 UTC 2 x86 64 x86 64 GNU/Linux"

and here's the cat /etc/issue output:
Ubuntu 22.04.3 LTS \n \l
_

This VM has 8 cores with HT (so 16 logical cores) and had 30GB of RAM assigned to it. Initially, the script kept exiting with the last line saying "Killed". I came to realize it must be because the Linux kernel is invoking the OOM killer. I browsed through the GitHub issues and came across some advice that said to break up the file and also to minimize the length of alias chains. I've done all this and it still keeps getting killed. I've even tried adding extra code to the Chatette package so I could see the recursion depth (using editor.mergely.com; here are the generated diff file contents):

24c24,27
<     def generate_train(self):
---
>     def generate_train(self, NestedLevel=0):
>         strIndentation = "";
>         for i in range(0, NestedLevel):
>              strIndentation = strIndentation +"\t"
29c32,34
<             examples = intent.generate_train()
---
>             print(strIndentation + "<" + str(NestedLevel) + ">");
>             examples = intent.generate_train();
>             print(strIndentation + "</" + str(NestedLevel) + ">");
60a66
> 

Unfortunately, it only showed:
[...]
<0>
Killed

So I thought something else must be going on. I then started to realize that the code is written in an extremely meta/dynamic way (which makes sense given the context of what the program's purpose is), but the AST code looked crazy nasty, and without a full understanding of what is called when, where, and why, I decided to power the VM down, bump it up to 64GB of RAM, and leave it generating overnight. I came in this morning and it still got killed. I'm not sure how much more variable-expansion I can do without starting to really rack up some serious line counts in my chatette file. So, as I mentioned earlier, I've split my master file across multiple files in an attempt to generate one set at a time (since I think it was mentioned in some issue that Chatette caches all of the combinations in memory), to avoid overusing memory and to better determine where the problem lies.

Here is the directory layout of files:

Chatette_Workspace/imperative_compound_1a.chatette
Chatette_Workspace/Imperative_Compound/Aliases/aliases.chatette
Chatette_Workspace/Imperative_Compound/Slots/slots.chatette

Here are the file contents

Chatette_Workspace/imperative_compound_1a.chatette:

%[imperativeform_compound]
	// Example:
	// <Show me> a <list> of <devices> <connected to> <Access_Point-109> that also <connected to> <Access_Point-204>
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [also] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [also is] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [also has] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [also are] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [also have] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [is also] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [has also] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [are also] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [have also] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [is] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [has] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [are] @[Request_B]
	~[imperativeform_openingverb] [a] @[Request_Output] @[Request_A] [that] [have] @[Request_B]

|Imperative_Compound/Aliases/aliases.chatette
|Imperative_Compound/Slots/slots.chatette

Chatette_Workspace/Imperative_Compound/Aliases/aliases.chatette:

//================== Alias definitions =======================
~[imperativeform_opening]
	~[&imperativeform_openingverb] ~[&determiner]

~[imperativeform_openingverb]
	[give me|provide me|provide for me|show me|tell me]

// Determiners
// 	https://dictionary.cambridge.org/grammar/british-grammar/a-an-and-the
// 	https://www.thoughtco.com/a-an-and-1692639
// 	https://www.grammar-monster.com/glossary/articles.htm
// 	https://english.stackexchange.com/a/328994
// 	https://en.wikipedia.org/wiki/Syncategorematic_term
~[determiner]
	[a|the]

~[Connecting_Words]
	[that]

~[Helper_Words1]
	[also|is|has|are|have]

~[also_Helper_Words2]
	[is|has|are|have]

Chatette_Workspace/Imperative_Compound/Slots/slots.chatette:

@[Request_Output]
	@[&Requested_Format] [of?] @[&Requested_Data]

@[Requested_Format]
	[list|graph|plot]

@[Requested_Data]
	[devices|users|locations|access points|ids|serial numbers|memory|disk space used|disk space in use|free space|memory]



@[Request_A]
	@[&Verb_A] @[&Data_A]

@[Verb_A]
	[connect to|connects to|connected to|connected at]

@[Data_A]
	[Access_Point-109|Access_Point-204]



@[Filter_Verb]
	[connect to|connects to|connected to|connected at]




@[Request_B]
	@[&Verb_B] @[&Data_B]

@[Verb_B]
	[connect to|connects to|connected to|connected at]

@[Data_B]
	[Access_Point-109|Access_Point-204]




@[Logical_Operators]
	[doesn't|does not|hasn't|has not|haven't|have not|aren't|are not|wasn't|was not|isn't|is not|not|is|had|has|are|have|was]

I'm going to keep trying to minimize and variable-expand but I don't hold out a lot of hope that my efforts alone are going to solve this. It'd also be nice to have some more debugging functionality / tools to determine where it's spinning out of control (over-recursing because of some logic error in our template files)
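On the debugging-tools wish: Python's standard-library tracemalloc can attribute memory usage to source lines without patching Chatette at all. A sketch with a stand-in workload where the real generation call (e.g. the facade.run() shown in another issue on this page) would go:

```python
import tracemalloc

tracemalloc.start()

# Stand-in workload; in practice this would be the Chatette generation call.
workload = [list(range(1000)) for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
# The top entries show which source lines currently hold the most memory.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
tracemalloc.stop()
```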

Named random generation modifiers not working properly

Hello,

I am encountering an issue with Chatette v1.5.0 PyPI (installed via pip).

Rule: I don't like [world?WWII] war[ II?WWII].

Expected examples:

  • I don't like war.
  • I don't like world war II.

Generated examples:

  • I don't like war.
  • I don't like war II.
  • I don't like world war.
  • I don't like world war II.

I tested it with a .chatette file and the interactive console. Both methods generate the same sentences.

Bug in chatette.parsing.utils

First of all really nice library! I did however notice a small bug in the chatette.parsing.utils file

On line 49-56 it states the following:
escapable_chars = [
ESCAPEMENT_SYM,
COMMENT_SYM, OLD_COMMENT_SYM,
UNIT_START_SYM, UNIT_START_SYM,
RAND_GEN_SYM, RAND_GEN_PERCENT_SYM,
VARIATION_SYM,
ARG_SYM,
CASE_GEN_SYM,
]

However, line 52 repeats UNIT_START_SYM twice. Instead, the second instance should be replaced with UNIT_END_SYM.

Thus it should be
escapable_chars = [
ESCAPEMENT_SYM,
COMMENT_SYM, OLD_COMMENT_SYM,
UNIT_START_SYM, UNIT_END_SYM,
RAND_GEN_SYM, RAND_GEN_PERCENT_SYM,
VARIATION_SYM,
ARG_SYM,
CASE_GEN_SYM,
]

Handling several slots in the same sentence

Hi all,

Thanks to this great library :-)

I would like to use several slots within a same sentence but the produced json file does not pick up the right start and end of the slots.

Below a simple example:

******** txt_file ***********
%&ask_toilet
where the @[toilet#singular] is @[please]?

@[toilet#singular]
toilet
loo
@[please]
please
plz

***** json result ***********

{
"rasa_nlu_data": {
"common_examples": [
{
"entities": [
{
"end": 13,
"entity": "toilet",
"start": 10,
"value": "loo"
},
{
"end": 42,
"entity": "please",
"start": 36,
"value": "please"
}
],
"intent": "ask_toilet",
"text": "Where the loo is please?"
},
{
"entities": [
{
"end": 13,
"entity": "toilet",
"start": 10,
"value": "loo"
},
{
"end": 39,
"entity": "please",
"start": 36,
"value": "plz"
}
],
"intent": "ask_toilet",
"text": "where the loo is plz?"
},
{
"entities": [
{
"end": 16,
"entity": "toilet",
"start": 10,
"value": "toilet"
},
{
"end": 39,
"entity": "please",
"start": 36,
"value": "plz"
}
],
"intent": "ask_toilet",
"text": "where the toilet is plz?"
}
],
"entity_synonyms": [],
"lookup_tables": [],
"regex_features": []
}
}

Any idea, please?

Randomly pick a few variations from each intent example instead of all variations.

After writing the template files and generating a huge output file, I realised that I just don't need all possible combinations.
Is there any way we can randomly sample variations from each intent example?

For example, if I have an intent defined like this:

%[greet user]
    ~[greeting] user! What is your @[operating system]?
    ~[greeting] What is the name of your @[operating system]?

Instead of generating all possible variations for these intents, I would like to control the number of variations taken from each sentence under every intent, e.g. outputting only 25% of the variations from each sentence by random sampling.

This would make my output files smaller and at the same time cover most of the variations.

I just wanted to know if I can achieve something like this using the existing features. This feature would be really helpful.
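Failing a built-in option, one workaround is to generate everything and down-sample the output afterwards; a minimal sketch (the `sample_fraction` helper is ours; 25% would be fraction=0.25):

```python
import random

def sample_fraction(examples, fraction, seed=0):
    """Randomly keep roughly `fraction` of the examples, reproducibly via `seed`."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)
```

This trades coverage for size uniformly at random; sampling per intent would just mean calling it once per intent's example list.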

Error on second run within same script

There seems to be an issue when running chatette more than once in a single script. Even when we reset either the instance or the system, it throws a syntax error like:
SyntaxError: Tried to declare intent 'E7WzqixfmcHF1zAvsE9LIbEL450pPHOe' twice.
It always works the first time, but even though we give it a random intent name every time, it thinks it's already been declared.

I've tried deleting the facade object as well as reloading the chatette library altogether, but I still can't avoid the error.

chatetteTest.txt is the python script I'm using to test it and common.txt is a required grammar file to run it.
Any help you could provide on this would be greatly appreciated.

chatetteTest.txt

common.txt

Generate inline entity values for synonyms

Hi all,

is there a way to generate inline entity values for synonyms within intents? Like:

## intent:my_intent
- Can you give me the available for [DataManagement]{"entity": "code", "value": "LA014"}

Thanks

Support for Snips NLU data structure

Managed to work on Snips NLU adapter

{
"entities": {
"device": {
"automatically_extensible": true,
"data": [
{
"synonyms": [
"airconditioner"
],
"value": "airconditioner"
},
{
"synonyms": [
"fan"
],
"value": "fan"
},
{
"synonyms": [
"bulb",
"lights"
],
"value": "light"
}
],
"matching_strictness": 1.0,
"use_synonyms": true
},
"room": {
"automatically_extensible": true,
"data": [
{
"synonyms": [
"bedroom",
"livingroom"
],
"value": "switch"
}
],
"matching_strictness": 1.0,
"use_synonyms": true
},
"snips/datetime": {},
"state": {
"automatically_extensible": true,
"data": [
{
"synonyms": [
"on"
],
"value": "on"
},
{
"synonyms": [
"off"
],
"value": "off"
}
],
"matching_strictness": 1.0,
"use_synonyms": true
}
},
"intents": {
"skill-devices.switchDevice": {
"utterances": [
{
"data": [
{
"text": "turn "
},
{
"entity": "state",
"slot_name": "state",
"text": "on"
},
{
"text": " the "
},
{
"entity": "room",
"slot_name": "room",
"text": "bedroom"
},
{
"text": " "
},
{
"entity": "device",
"slot_name": "device",
"text": "airconditioner"
},
{
"entity": "snips/datetime",
"slot_name": "snips/datetime",
"text": "at the end of the day"
},
{
"text": " please"
}
]
},
{
"data": [
{
"text": "turn "
},
{
"entity": "state",
"slot_name": "state",
"text": "on"
},
{
"text": " the "
},
{
"entity": "room",
"slot_name": "room",
"text": "bedroom"
},
{
"text": " "
},
{
"entity": "device",
"slot_name": "device",
"text": "airconditioner"
},
{
"entity": "snips/datetime",
"slot_name": "snips/datetime",
"text": "tomorrow"
},
{
"text": " please"
}
]
},
{
"data": [
{
"text": "turn "
},
{
"entity": "state",
"slot_name": "state",
"text": "on"
},
{
"text": " the "
},
{
"entity": "room",
"slot_name": "room",
"text": "livingroom"
},
{
"text": " "
},
{
"entity": "device",
"slot_name": "device",
"text": "bulb"
},
{
"entity": "snips/datetime",
"slot_name": "snips/datetime",
"text": "today"
},
{
"text": " please"
}
]
}
]
}
},
"language": "en"
}

Great Work!

Hi @SimGus,

Happy to have inspired you with Chatito to build Chatette. It seems we and @YuukanOO are working on similar stuff with slight differences; maybe we can have a conversation to build a unified spec that is also generic enough? The main principle behind Chatito is not minimalism, but to be generic enough. Maybe we can do some unified work; I have some ideas that may extend the DSL to also include dialogue generation. Cheers

How to get random strings rather than all permutations

This is a very useful tool, thanks for it!

Maybe I'm just doing something wrong, but when I read the wiki, it seems like I should get "random" strings. But it really looks more like all permutations.

Let's say I have this configuration file:

~[and-amp]
   and
   &

~[name]
   Clifford Woods
   Lorenzo Malavasi
   Ryan Wade

I run the interactive mode and I issue this command:

rule "~[name] ~[and-amp] ~[name] went to town."

The generated text output is an exhaustive enumeration of all possibilities, in sequence.

Generated examples:
Text: 'Clifford Woods & Clifford Woods went to town.'
        Entities: []
Text: 'Clifford Woods & Lorenzo Malavasi went to town.'
        Entities: []
Text: 'Clifford Woods & Ryan Wade went to town.'
        Entities: []
Text: 'Clifford Woods and Clifford Woods went to town.'
        Entities: []
Text: 'Clifford Woods and Lorenzo Malavasi went to town.'
        Entities: []
Text: 'Clifford Woods and Ryan Wade went to town.'
        Entities: []
Text: 'Lorenzo Malavasi & Clifford Woods went to town.'
        Entities: []
Text: 'Lorenzo Malavasi & Lorenzo Malavasi went to town.'
        Entities: []
Text: 'Lorenzo Malavasi & Ryan Wade went to town.'
        Entities: []
Text: 'Lorenzo Malavasi and Clifford Woods went to town.'
        Entities: []
Text: 'Lorenzo Malavasi and Lorenzo Malavasi went to town.'
        Entities: []
Text: 'Lorenzo Malavasi and Ryan Wade went to town.'
        Entities: []
Text: 'Ryan Wade & Clifford Woods went to town.'
        Entities: []
Text: 'Ryan Wade & Lorenzo Malavasi went to town.'
        Entities: []
Text: 'Ryan Wade & Ryan Wade went to town.'
        Entities: []
Text: 'Ryan Wade and Clifford Woods went to town.'
        Entities: []
Text: 'Ryan Wade and Lorenzo Malavasi went to town.'
        Entities: []
Text: 'Ryan Wade and Ryan Wade went to town.'
        Entities: []

Also, the animated GIF on the homepage of the project shows a number after the rule command (for the number of generations), but I get an error if I put a number there, and your output is different. The GIF was probably made with an older version, prior to 1.6.1.

Anyway, I was wondering if it is possible to get random samples rather than all permutations. For example, I would want just one pair of names, sometimes joined by "and" and other times by "&", rather than every name pair with "and" followed by every name pair with "&".
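To illustrate what I mean, here is a rough Python sketch of the sampling behavior I'd like (this is not how Chatette currently works, and the alias handling is simplified):

```python
import random

# The alias values from the template file above
names = ["Clifford Woods", "Lorenzo Malavasi", "Ryan Wade"]
and_amp = ["and", "&"]

def random_example():
    # Draw each alias value independently instead of enumerating
    # every permutation of the template.
    return "{} {} {} went to town.".format(
        random.choice(names), random.choice(and_amp), random.choice(names)
    )

print(random_example())
```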

Support for passing configuration file from command arguments

It would be great if Chatette could support passing a configuration JSON file with Rasa options via command-line arguments.

Here is an example of such a configuration JSON file.

{
    "rasa_nlu_data": {
        "regex_features": [
            {
                "name": "tier_pattern",
                "pattern": "tier[_\\-\\s]*\\d+\\.?\\d*"
            }
        ]
    }
}

Slot with choices

I am using the currently latest PyPI version (1.4.1).

The following input

@[entity]
	{ one / two } = one_two
	{ three / four } three_four

%[intent]
	@[entity]

produces the following output

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "entities": [
          {
            "end": 23,
            "entity": "entity",
            "start": 20,
            "value": "one_two"
          }
        ],
        "intent": "intent",
        "text": "<<CHATETTE_ENTITY>> "
      },
      {
        "entities": [
          {
            "end": 23,
            "entity": "entity",
            "start": 20,
            "value": "one_two"
          }
        ],
        "intent": "intent",
        "text": "<<CHATETTE_ENTITY>> "
      },
      {
        "entities": [
          {
            "end": 37,
            "entity": "entity",
            "start": 20,
            "value": " three  three_four"
          }
        ],
        "intent": "intent",
        "text": "<<CHATETTE_ENTITY>> "
      },
      {
        "entities": [
          {
            "end": 36,
            "entity": "entity",
            "start": 20,
            "value": " four  three_four"
          }
        ],
        "intent": "intent",
        "text": "<<CHATETTE_ENTITY>> "
      }
    ],
    "entity_synonyms": [
      {
        "synonyms": [
          " one ",
          " two "
        ],
        "value": "one_two"
      }
    ],
    "regex_features": []
  }
}

I noticed, however, that using {one/two} = one_two instead of { one / two } = one_two works just fine. I couldn't figure out whether the spaced variant is supported; I find it much more readable, though.
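For what it's worth, a whitespace-tolerant parse of such a line seems straightforward; here is a sketch (a hypothetical helper, not Chatette's actual parser):

```python
import re

def parse_slot_choice(line):
    # Parse a slot line like "{ one / two } = one_two", stripping the
    # surrounding whitespace that trips up version 1.4.1.
    m = re.match(r"\{(.*?)\}\s*(?:=\s*(\S+))?\s*$", line.strip())
    if m is None:
        return None
    choices = [c.strip() for c in m.group(1).split("/")]
    value = m.group(2)  # synonym value, or None if no "=" was given
    return choices, value

print(parse_slot_choice("{ one / two } = one_two"))  # (['one', 'two'], 'one_two')
```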

Slow when alias has large number of candidates

Hello,

Thanks for open-sourcing this great proj!

I have found that Chatette is very slow when an alias has a large number of candidates, even when the number of training examples requested is very small.
e.g.,
%intent_name
@[generic-name]

@[generic-name]
~[name] ~[name?] ~[family-name]

~[name]
<2000 names>

~[family-name]
<100 popular family names>

When I run Chatette with the above, it takes ages even if I only request 1 random example.
It looks like Chatette generates all combinations, which is about (2000^2)*100/2, and then randomly samples one example from them.

Do I understand that right? Could sampling be done dynamically instead, e.g. if I request 1 example, only 1 example is randomly generated?
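For comparison, sampling one combination directly is cheap; here is a rough Python sketch of what I mean by dynamic sampling (with made-up data standing in for the real name lists):

```python
import random

# Stand-ins for the real alias contents
names = ["name{}".format(i) for i in range(2000)]
family_names = ["family{}".format(i) for i in range(100)]

def sample_generic_name():
    # O(1) per example: draw each part independently instead of first
    # materializing all ~2000 * 2000 * 100 combinations.
    parts = [random.choice(names)]
    if random.random() < 0.5:  # the optional ~[name?]
        parts.append(random.choice(names))
    parts.append(random.choice(family_names))
    return " ".join(parts)

print(sample_generic_name())
```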

Thanks!

escape character multiplication

I think I found a small bug regarding the escape character. I created this simple master.chatette file:

%[&a\/b]
    hello

and used the command:

python -m chatette -f "master.chatette"

The generated output should be:

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "entities": [],
        "intent": "a/b",
        "text": "hello"
      },
      {
        "entities": [],
        "intent": "a/b",
        "text": "Hello"
      }
    ],
    "entity_synonyms": [],
    "lookup_tables": [],
    "regex_features": []
  }
}

but it is:

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "entities": [],
        "intent": "a\\/b",
        "text": "hello"
      },
      {
        "entities": [],
        "intent": "a\\/b",
        "text": "Hello"
      }
    ],
    "entity_synonyms": [],
    "lookup_tables": [],
    "regex_features": []
  }
}

I use the command:

sed -i 's/\\\\\//\//g' "output/train/output.json"

as a workaround.
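The same cleanup can be done in Python when sed isn't available (again only a workaround on the output file, not a fix in Chatette itself):

```python
import json

def unescape_slashes(text):
    # Collapse the doubled escape "\\/" (backslash-backslash-slash on
    # disk) back to a plain "/", like the sed one-liner above.
    return text.replace("\\\\/", "/")

raw = '{"intent": "a\\\\/b", "text": "hello"}'  # file content as stored
print(json.loads(unescape_slashes(raw))["intent"])  # a/b
```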

Thanks for creating this tool. It's awesome!

Define certain probability for categorical choice.

Hello everyone,

If I define:

~[greet]
      ooh [hi|hello?/70] chatette community !

It will either generate:

"ooh chatette community" with 50% chance
"ooh hello chatette community" with 50% * 70% chance
"ooh hi chatette community" with 50% * 30% chance

My question is the following:
Can I unbalance the probability of a categorical choice without having to use the "?" ?

I would want to specify something like this:

~[greet]
      ooh [hi|hello/70] chatette community !
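For what it's worth, the behavior I'm asking for corresponds to a weighted pick like this in Python (an illustration of the desired semantics, not existing Chatette syntax):

```python
import random

def greet():
    # "hello" 70% of the time, "hi" the remaining 30%
    word = random.choices(["hi", "hello"], weights=[30, 70])[0]
    return "ooh {} chatette community !".format(word)

print(greet())
```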

Error if alias is not defined

First of all, I want to say that this is great work. I apologize for my poor English.

If an alias doesn't exist, the system should take the text between the square brackets.
Instead, I currently get an error.

Example:
busca**~[me]** un buen restaurante

...
nb_possible_ex = self.parser.get_definition(self.name, Unit.alias)
  File "/site-packages/chatette/parsing.py", line 454, in get_definition
    def_name + "'")

ValueError: Couldn't find a definition for alias 'me'

Best regards.

Support for Rasa Entities Roles and Groups

Hello, thanks for your work. I've been using Chatette a lot lately.

I would like to know whether you plan to support Rasa's new entity roles and groups, since they are quite important with recent Rasa versions.
If you don't plan to do it in the short term, I might be able to look into it, but the parser section of the code is quite difficult for me to understand compared to the adapter part.

I've also created a new yml adapter in preparation for Rasa 2.0 on my forked repo.

Thanks!

ranking list, frequency counter for synonyms

Synonym lists can be rather large, and we do not want to generate all possible synonym variants with the same probability. We would therefore like more control over which synonym variants are generated.

Would it be possible to add a ranking list or frequency counter for synonym lists?

Incorrect entity position in rasa adapter

{
  "entities": [
    {
      "end": 112,
      "entity": "bot_job",
      "start": 93,
      "value": "sứ mệnh như thế nào"
    }
  ],
  "intent": "ask_for_bot_job",
  "text": "xếp sứ mệnh như thế nào"
},

With version 1.6.0, I am getting incorrect entity positions in the generated sentences; the above is an example. Please check! Thank you.
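For reference, the expected offsets for the example above can be computed directly; Python indexes Unicode code points, which gives start 4 and end 23 here, not 93 and 112:

```python
text = "xếp sứ mệnh như thế nào"
value = "sứ mệnh như thế nào"

start = text.index(value)   # position in code points, not bytes
end = start + len(value)    # here the entity runs to the end of the text
print(start, end)
```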

Incorrect entity insertion

Given this file...

%[ask_phone_number]
    {what is/tell/display} your @[number_type] number

@[number_type]
    direct
    tel
    telephone

I get JSON with one of the entities marked at the wrong place. Here's one example...

      {
        "entities": [
          {
            "end": 3,
            "entity": "number_type",
            "start": 0,
            "value": "tel"
          }
        ],
        "intent": "ask_phone_number",
        "text": "tell your tel num"
      },

To ease diagnosing the issue, I've passed the generated JSON through rasa_nlu.convert to get markdown (which makes it more obvious)...

what is your [direct](number_type) number
what is your [tel](number_type) number
what is your [telephone](number_type) number
tell your [direct](number_type) number
[tel](number_type)l your tel number              <<< Erroneous output
tell your [telephone](number_type) number
display your [direct](number_type) number
display your [tel](number_type) number
display your [telephone](number_type) number

This seems to happen when one of the example entity strings under a slot definition occurs as a substring earlier in the generated text than the position that actually needs to be targeted.

I'm guessing that a \b might need adding to a regex somewhere, to ensure that entity value substitutions happen only on word boundaries.
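To illustrate the suspected bug: a plain substring search finds "tel" inside "tell", while a word-boundary-anchored regex finds the standalone token (a sketch of the suggestion, not Chatette's actual code):

```python
import re

text = "tell your tel number"
value = "tel"

naive_start = text.find(value)  # 0 -- matches inside "tell", wrong position
match = re.search(r"\b" + re.escape(value) + r"\b", text)
bounded_start = match.start()   # 10 -- the standalone "tel", the actual entity
print(naive_start, bounded_start)
```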

Many thanks for your consideration.

error in markdown file generation

I encountered an error while generating a markdown file.

The master chatette file is as follows:

%[test_intent]
~[for] @[entity_name_1] ~[for]

|alias_and_entity.chatette

The alias_and_entity chatette file is as follows:

@[entity_name_1]
~[names]

~[names]
test names containing 4 words
test name having 5 unique words
cashflows and investment work

~[for]
for
on
to

Command used for output generation:
python -m chatette new_chatette.chatette -a rasa-md

The output is as follows:

intent:test_intent

  • for cash [ flows and investment work for ] (entity_name_1)
  • for cash [ flows and investment work on ] (entity_name_1)
  • for cash [ flows and investment work to ] (entity_name_1)
  • for test [ name having 5 unique words for ] (entity_name_1)
  • for test [ name having 5 unique words on ] (entity_name_1)
  • for test [ name having 5 unique words to ] (entity_name_1)
  • for test [ names containing 4 words for ] (entity_name_1)
  • for test [ names containing 4 words on ] (entity_name_1)
  • for test [ names containing 4 words to ] (entity_name_1)
  • on cashf [ lows and investment work for ] (entity_name_1)
  • on cashf [ lows and investment work on ] (entity_name_1)
  • on cashf [ lows and investment work to ] (entity_name_1)
  • on test [ name having 5 unique words for ] (entity_name_1)
  • on test [ name having 5 unique words on ] (entity_name_1)
  • on test [ name having 5 unique words to ] (entity_name_1)
  • on test [ names containing 4 words for ] (entity_name_1)
  • on test [ names containing 4 words on ] (entity_name_1)
  • on test [ names containing 4 words to ] (entity_name_1)
  • to cashf [ lows and investment work for ] (entity_name_1)
  • to cashf [ lows and investment work on ] (entity_name_1)
  • to cashf [ lows and investment work to ] (entity_name_1)
  • to test [ name having 5 unique words for ] (entity_name_1)
  • to test [ name having 5 unique words on ] (entity_name_1)
  • to test [ name having 5 unique words to ] (entity_name_1)
  • to test [ names containing 4 words for ] (entity_name_1)
  • to test [ names containing 4 words on ] (entity_name_1)
  • to test [ names containing 4 words to ] (entity_name_1)

Note that parts of the entities end up outside the square brackets (the brackets split words such as "cashflows"). @SimGus, please advise.
