Description
When creating or maintaining an NLU dataset, there is very frequently overlap between entity types. These overlaps confuse entity extraction models. Many entity extractors do use the context of the utterance in some form to label the entity words and types, but nevertheless this is often an issue.
To solve this, one should collect all the entities and their types, look for these duplicates, and try to combine them as much as possible.
This issue relates to Flow for entity refinement #4 and must be solved for entity refinement.
User stories
As a user refining entity data, I want to:
- find the overlapping entity types so that I can review them,
- take the reviewed entries and refine them into my original data set.
Problems
Data types
One problem here is the data types. In the notebook for entity refinement, `domain_df` (and other dfs!) use a column called `entities`, where the entities were extracted using `EntityExtractor.extract_entities`. This creates entries in the column like so: `[{'type': 'time', 'words': ['five', 'am']}, ...]`.
One shouldn't simply place lists of dictionaries into pandas data frame columns. But instead of trying to figure out a smart way to do this, I focused on getting it done, because the resulting clean data set is a higher priority than avoiding some crappy technical debt.
Question
However, now we need to search through this column to find duplicates. So the issue here is: how do we do this well, given the current data structure?
Partial or full matches
Would it be enough to match only exact entity words, or would it be beneficial to also search individual words for partial matches? We could even remove stop words to avoid some uninteresting matches. Perhaps this is better suited for the next version.
Question
Should we focus on just full matches as an MVP, or do we also need to go for partial matches and remove stop words?
Solutions
These are just the solutions I will go with for now; perhaps they can be refactored in the future. As stated, getting the cleaned-up data out is more important than nice code, so we go quick and dirty here.
Data types
I propose we make a quick and dirty version where we create a new data frame that contains:
- `index`: just a normal unique index for every entry, as pandas usually provides.
- `id`: the original index from `domains_df`; several entries can share the same `id`, as there can be multiple entities and entity types in one utterance!
- `entity_type`: the `type` from the original data frame's `entities` column.
- `entity_words`: the words from the `words` list of the original data frame's `entities` column, joined with spaces.
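A minimal sketch of flattening the `entities` column into this long format. The toy `domain_df` below and the helper name `flatten_entities` are assumptions for illustration, following the entry shape shown earlier:

```python
import pandas as pd

# Hypothetical stand-in for domain_df with the entities column described above.
domain_df = pd.DataFrame({
    "text": ["wake me at five am", "five am is my alarm"],
    "entities": [
        [{"type": "time", "words": ["five", "am"]}],
        [{"type": "alarm_time", "words": ["five", "am"]}],
    ],
})

def flatten_entities(df: pd.DataFrame) -> pd.DataFrame:
    """Explode the entities column into one row per entity."""
    rows = []
    for original_id, entities in df["entities"].items():
        for entity in entities:
            rows.append({
                "id": original_id,  # index of the source row, repeated per entity
                "entity_type": entity["type"],
                "entity_words": " ".join(entity["words"]),
            })
    return pd.DataFrame(rows)

entity_df = flatten_entities(domain_df)
```

Keeping the plain loop (rather than something clever with `DataFrame.explode`) matches the quick-and-dirty spirit and keeps the `id` bookkeeping obvious.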
We use this data frame to query for the matches with pandas and return them, including the original data frame's index, so we can bring them together for review and refinement.
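Given that long format, finding full-match overlaps could be as simple as grouping on the joined words and keeping groups that span more than one entity type. A sketch under the column names proposed above, not the actual implementation:

```python
import pandas as pd

# Toy long-format frame, as proposed above (values are made up).
entity_df = pd.DataFrame({
    "id": [0, 1, 2],
    "entity_type": ["time", "alarm_time", "time"],
    "entity_words": ["five am", "five am", "six pm"],
})

def find_overlaps(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose entity_words appear under more than one entity_type."""
    # For each row, count distinct types seen for the same joined words.
    type_counts = df.groupby("entity_words")["entity_type"].transform("nunique")
    return df[type_counts > 1]

overlaps = find_overlaps(entity_df)
# "five am" is labelled both time and alarm_time, so both of those rows come back,
# each still carrying its original id for the review step.
```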
If this works well, this can be integrated as a method in the `EntityExtraction` class, and the method `extract_entities` can return this new data frame instead of just appending all of that junk into a column.
Partial or full matches
We will just go with full exact matches for now; we can always add partial matching later. Let's keep the MVP lean! I will make an issue for this; perhaps someone else would like to give it a try? It shouldn't be too hard to do and would make a good first issue.
DoD (Definition of Done)
- Flow to refine entities by overlap
- The results can be saved to CSVs
- The refinements can be added to the original data frame
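For the last DoD item, one way to sketch writing reviewed rows back into the original frame is to use the `id` column to locate the source row. The `refined` frame and its column names are assumptions following the proposal above:

```python
import pandas as pd

# Original frame with the entities column (toy data).
domain_df = pd.DataFrame({
    "text": ["wake me at five am", "five am is my alarm"],
    "entities": [
        [{"type": "time", "words": ["five", "am"]}],
        [{"type": "alarm_time", "words": ["five", "am"]}],
    ],
})

# Hypothetical reviewed overlaps, e.g. loaded back from a review CSV:
# the reviewer decided "five am" should always be typed "time".
refined = pd.DataFrame({
    "id": [1],
    "entity_words": ["five am"],
    "entity_type": ["time"],
})

for _, row in refined.iterrows():
    for entity in domain_df.at[row["id"], "entities"]:
        if " ".join(entity["words"]) == row["entity_words"]:
            entity["type"] = row["entity_type"]
```

Since the review round-trips through CSVs, `refined.to_csv(...)` / `pd.read_csv(...)` would sit on either side of the review step.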