Description
When creating or maintaining an NLU dataset, there is very frequently overlap between entity types. These overlaps confuse entity extraction models. Many entity extractors do use the context of the utterance in some form to label the entity words and types, but nevertheless this is often an issue.
To solve this, one should collect all the entities and their types, look for these duplicates, and try to combine them as much as possible.
This issue relates to Flow for entity refinement #4 and must be solved for entity refinement.
User stories
As a user refining entity data, I want to:
- find the overlapping entity types so that I can review them,
- take the reviewed entries and refine them into my original data set.
Problems
Data types
One problem here is the data types. In the notebook for entity refinement, `domain_df` (and other dfs!) use a column called `entities`, where the entities were extracted using `EntityExtractor.extract_entities`. This creates entries in the column like so: `[{'type': 'time', 'words': ['five', 'am']}, ...]`.
One shouldn't simply place lists of dictionaries into pandas data frame columns. But instead of trying to figure out a smart way to do this, I focused on getting it done, because the resulting clean data set is a higher priority than avoiding some crappy technical debt.
Question
However, now we need to search through this column to find duplicates. So the issue here is: how do we do this well, given the current data structure?
Partial or full matches
Would it be enough to match only exact entity words, or would it be beneficial to also search individual words for partial matches? We could even remove stop words to avoid some uninteresting matches. Perhaps this is better suited for the next version.
Question
Should we focus on just full matches as an MVP, or do we also need to go for partial matches and remove stop words?
Solutions
These are just the solutions I will go with for now; perhaps they can be refactored in the future. As stated, getting the cleaned-up data out is more important than nice code, so we go quick and dirty here.
Data types
I propose we make a quick and dirty version where we create a new data frame that contains:
- `index`: just a normal unique index for every entry, as pandas usually provides.
- `id`: the original index from `domains_df`; several entries can share the same `id`, as there can be multiple entities and entity types in one utterance!
- `entity_type`: the `type` from the original data frame's `entities` column.
- `entity_words`: the words from the `words` list of the original data frame's `entities` column, joined with spaces.
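A minimal sketch of flattening the `entities` column into this long format. The toy `domain_df` below and the helper name `flatten_entities` are assumptions for illustration, following the entry shape shown earlier:

```python
import pandas as pd

# Hypothetical stand-in for domain_df with the entities column described above.
domain_df = pd.DataFrame({
    "text": ["wake me at five am", "five am is my alarm"],
    "entities": [
        [{"type": "time", "words": ["five", "am"]}],
        [{"type": "alarm_time", "words": ["five", "am"]}],
    ],
})

def flatten_entities(df: pd.DataFrame) -> pd.DataFrame:
    """Explode the entities column into one row per entity."""
    rows = []
    for original_id, entities in df["entities"].items():
        for entity in entities:
            rows.append({
                "id": original_id,  # index of the source row, repeated per entity
                "entity_type": entity["type"],
                "entity_words": " ".join(entity["words"]),
            })
    return pd.DataFrame(rows)

entity_df = flatten_entities(domain_df)
```

Keeping the plain loop (rather than something clever with `DataFrame.explode`) matches the quick-and-dirty spirit and keeps the `id` bookkeeping obvious.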
We use this data frame to query for the matches with pandas and return them, including the original data frame's index, so we can bring them together for review and refinement.
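Given that long format, finding full-match overlaps could be as simple as grouping on the joined words and keeping groups that span more than one entity type. A sketch under the column names proposed above, not the actual implementation:

```python
import pandas as pd

# Toy long-format frame, as proposed above (values are made up).
entity_df = pd.DataFrame({
    "id": [0, 1, 2],
    "entity_type": ["time", "alarm_time", "time"],
    "entity_words": ["five am", "five am", "six pm"],
})

def find_overlaps(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose entity_words appear under more than one entity_type."""
    # For each row, count distinct types seen for the same joined words.
    type_counts = df.groupby("entity_words")["entity_type"].transform("nunique")
    return df[type_counts > 1]

overlaps = find_overlaps(entity_df)
# "five am" is labelled both time and alarm_time, so both of those rows come back,
# each still carrying its original id for the review step.
```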
If this works well, this can be integrated as a method in the `EntityExtraction` class, and the method `extract_entities` can return this new data frame instead of just appending all of that junk into a column.
Partial or full matches
We will just go with full exact matches for now; we can always add partial matching later. Let's keep the MVP lean! I will make an issue for this; perhaps someone else would like to give it a try? It shouldn't be too hard to do and would make a good first issue.
DoD (Definition of Done)
- Flow to refine entities by overlap
- The results can be saved to CSVs
- The refinements can be added to the original data frame
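For the last DoD item, one way to sketch writing reviewed rows back into the original frame is to use the `id` column to locate the source row. The `refined` frame and its column names are assumptions following the proposal above:

```python
import pandas as pd

# Original frame with the entities column (toy data).
domain_df = pd.DataFrame({
    "text": ["wake me at five am", "five am is my alarm"],
    "entities": [
        [{"type": "time", "words": ["five", "am"]}],
        [{"type": "alarm_time", "words": ["five", "am"]}],
    ],
})

# Hypothetical reviewed overlaps, e.g. loaded back from a review CSV:
# the reviewer decided "five am" should always be typed "time".
refined = pd.DataFrame({
    "id": [1],
    "entity_words": ["five am"],
    "entity_type": ["time"],
})

for _, row in refined.iterrows():
    for entity in domain_df.at[row["id"], "entities"]:
        if " ".join(entity["words"]) == row["entity_words"]:
            entity["type"] = row["entity_type"]
```

Since the review round-trips through CSVs, `refined.to_csv(...)` / `pd.read_csv(...)` would sit on either side of the review step.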