secretsauceai / nlu-engine-prototype-benchmarks

This project is forked from amateuracademic/interview-code-examples.


Demo and benchmarks for building an NLU engine similar to those in voice assistants. Several intent classifiers are implemented and benchmarked. Conditional Random Fields (CRFs) are used for entity extraction.

License: Apache License 2.0

Languages: Jupyter Notebook 94.31%, Python 5.69%
Topics: conditional-random-fields entity-extraction intent-classification logistic-regression machine-learning naive-bayes named-entity-recognition natural-language-processing nlp nlu nlu-engine random-forest scikit-learn sklearn svm tutorial-code
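
The description mentions CRF-based entity extraction. For orientation, that pattern typically looks like the minimal sketch below, using sklearn-crfsuite (an assumption based on the repo's topic tags; the actual implementation may differ):

```python
import sklearn_crfsuite  # assumed library, suggested by the crf/sklearn topic tags

def token_features(tokens, i):
    """Simple per-token features; real feature sets are usually richer."""
    return {
        "lower": tokens[i].lower(),
        "is_digit": tokens[i].isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One toy training utterance: "wake me at five am", with a 'time' entity.
tokens = ["wake", "me", "at", "five", "am"]
X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
y_train = [["O", "O", "O", "time", "time"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # [['O', 'O', 'O', 'time', 'time']]
```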

nlu-engine-prototype-benchmarks's People

Contributors

amateuracademic, secretsauceai


Forkers

brunotech

nlu-engine-prototype-benchmarks's Issues

Flow for entity refinement

Description

We want users (in this case, most likely myself and any contributing developers) to be able to benchmark an NLU data set for entity extraction and refine those entities to improve the data set.

Making sure the “human in the loop” flow works comes down to refining the entities; however, improvements will need to be made along the way wherever something blocks the refinement process. This is a dummy ticket to collect such minor code fixes.

User stories

As a user, I want to

  • see the analytics of the entities visualized, so that I know what needs to be improved
  • review all the incorrect entities, so that I can fix them

Sounds easy, right? Especially since we have already built the intent refinement workflow. But it is a bit more complex than that.

With intents, we could visualize all domains, see where the intents perform worst, pick those domains, and then review all the incorrectly classified intents in each domain for refinement. With entities, it is a bit trickier.

We'll do the same with entities. However, we need to group the entities within a domain, and there will be overlap: some utterances have more than one entity type, so we have to keep track of that. Furthermore, do we ask the user to refine all entities in an utterance, or to ignore some of them? It would be super annoying to have to go back over the same utterances two or more times! This is why users should work on multiple entities at the same time. That is harder for the user, who must know whether each entity is correct and, if not, what it should be.

Ergo, it is better for a user to review incorrect entries in batches. They should have an overview of example entries for that domain where the entities are correct, then go through correcting no more than 100 entries at a time.

This means, however, that we will have to adapt the flow we used for intent refinement. With intent refinement, we recorded results into CSVs by domain and intent. Here we will record by domain in batches, then merge those batches into one CSV for the whole domain. If the user is lucky, they will only have to do one batch per domain.
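
A minimal sketch of that save-and-merge step (helper names and file layout are illustrative, not from the repo):

```python
import pandas as pd
from pathlib import Path

def save_batch(batch_df: pd.DataFrame, domain: str, batch_number: int,
               out_dir: str = "refined") -> None:
    """Save one reviewed batch of entity corrections for a domain."""
    Path(out_dir).mkdir(exist_ok=True)
    batch_df.to_csv(f"{out_dir}/{domain}_entities_batch_{batch_number}.csv", index=False)

def merge_domain_batches(domain: str, out_dir: str = "refined") -> pd.DataFrame:
    """Merge all reviewed batches for a domain into one CSV for that domain."""
    batch_files = sorted(Path(out_dir).glob(f"{domain}_entities_batch_*.csv"))
    merged = pd.concat((pd.read_csv(f) for f in batch_files), ignore_index=True)
    merged.to_csv(f"{out_dir}/{domain}_entities_refined.csv", index=False)
    return merged
```

Merging the per-domain CSVs into one for the whole data set would follow the same concat pattern.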

DoD (Definition of Done)

  • benchmark entities over whole data set
  • graph analysis of entities for the whole data set
  • benchmark entities per domain
  • graphs of entities per domain
  • add incorrect_entities_report to macro_entities_refinement.py
  • ipysheet refinement of a batch in the domain
  • save to a CSV of batches
  • merge with CSV for the whole domain
  • merge with the CSV for the whole data set
  • benchmark again

`ipysheet` is deprecated and should be replaced

Description

ipysheet is deprecated and seems to no longer work correctly, which completely blocks the refinement flow for producing the cleaned data. A replacement is needed; ipydatagrid was recommended in the previous project.
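
If ipydatagrid does cover the functionality, the core of the replacement might be as simple as this sketch (unverified; it assumes DataGrid's data property reflects in-place cell edits):

```python
import pandas as pd
from ipydatagrid import DataGrid

incorrect_entities_df = pd.read_csv("incorrect_entities.csv")  # placeholder input

# An editable grid in place of ipysheet's sheet/cell API.
grid = DataGrid(incorrect_entities_df, editable=True)
grid  # display the widget in the notebook; the user edits cells in place

# After editing, pull the corrected rows back out (assumes .data reflects the edits).
refined_df = grid.data
```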

Question

  • Can ipydatagrid completely replace ipysheet in functionality?

DoD

  • The above question is answered
  • Notebook prototype of refinement flow with ipydatagrid
  • Above notebook is refactored into a class
  • The refinement flow is replaced within the entity refinement notebook to continue entity refinement

Find a good separation criteria for entity refinement

The refactored notebook for entity refinement (work in progress) needs a good way to separate the entries to be refined.

Currently, if we separate the errors by domain (scenario/skill), the top domain (calendar) has 681 errors. That is far too many for a user to be expected to review in one go using ipysheet. I am uncertain whether they should instead be separated by intent, as is the case with intent refinement.

This needs to be solved to facilitate entity refinement and deliver the cleaned data set.
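
One simple answer, sketched below, would be to ignore intent boundaries and just chunk each domain's errors into fixed-size batches (the helper is hypothetical, not existing code):

```python
import pandas as pd

def split_into_batches(domain_errors_df: pd.DataFrame, batch_size: int = 100) -> list:
    """Split a domain's incorrect entries into review batches of at most batch_size rows."""
    return [domain_errors_df.iloc[i:i + batch_size]
            for i in range(0, len(domain_errors_df), batch_size)]

# e.g. the 681 calendar errors become 7 batches: six of 100 rows and one of 81
```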

Refine entities

This is the biggest ticket item. It requires mind-numbingly going through each incorrectly predicted entity and correcting it.

Add review of entity types with examples in `Macro_NLU_Entity_Refinement.ipynb`

Description

After the incorrect_predicted_entities_report, there should be a review of the entity types:

  • Remove entity types
  • Remove entries that have out of scope entity types
  • Merge two or more entity types

This issue is related to #4.
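
A rough sketch of what these three review operations could look like on a long-format entity frame (column names, helpers, and example data are illustrative, not the repo's API):

```python
import pandas as pd

# Hypothetical long format: one row per (utterance id, entity) pair.
entities_df = pd.DataFrame({
    "id": [0, 0, 1, 2],
    "entity_type": ["time", "date", "timeofday", "podcast_name"],
    "entity_words": ["five am", "tomorrow", "five am", "the daily"],
})

def remove_entity_type(df: pd.DataFrame, entity_type: str) -> pd.DataFrame:
    """Drop every row labelled with an unwanted entity type."""
    return df[df["entity_type"] != entity_type].reset_index(drop=True)

def remove_out_of_scope_entries(df: pd.DataFrame, out_of_scope: set) -> pd.DataFrame:
    """Drop whole utterances containing any out-of-scope entity type."""
    bad_ids = df.loc[df["entity_type"].isin(out_of_scope), "id"].unique()
    return df[~df["id"].isin(bad_ids)].reset_index(drop=True)

def merge_entity_types(df: pd.DataFrame, old_types: set, new_type: str) -> pd.DataFrame:
    """Relabel two or more redundant entity types as a single type."""
    df = df.copy()
    df.loc[df["entity_type"].isin(old_types), "entity_type"] = new_type
    return df

entities_df = merge_entity_types(entities_df, {"timeofday"}, "time")  # fold 'timeofday' into 'time'
```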

User stories

As an entity refiner, I want to:

  • Remove entity types from entries where they make no sense
  • Remove entries that have out of scope entity types
  • Merge two or more entity types where only one type is needed

DoD

  • User can easily review entity types with utterance examples
  • User can remove entity types from the data set
  • User can remove entries that have out of scope entity types
  • User can merge two or more entity types
  • This flow passes tests

Refactor `Macro_NLU_Entity_Refinement.ipynb` to update the dataframes for the removed and complete data sets

Description

The current flow updates only the dataframe from which removed entries are dropped. While that dataframe is the important one for training and testing both the intent and entity models, we also want a complete data set tracking all entries, similar to Macro_NLU_Intent_Refinement.ipynb. This issue is related to #4.
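
One way to realize this, sketched under assumed file and column names (not the notebook's actual schema), is to keep a single complete frame and flag removals instead of hard-deleting them:

```python
import pandas as pd

# Complete data set: every entry ever seen, with a flag instead of hard deletes.
complete_df = pd.read_csv("macro_entity_refinement_complete.csv")  # assumed file name
if "removed" not in complete_df.columns:
    complete_df["removed"] = False

removed_ids = {12, 87}  # example: ids flagged for removal during entity refinement
complete_df.loc[complete_df["id"].isin(removed_ids), "removed"] = True

# The frame used for training/testing both models is then just a view.
training_df = complete_df[~complete_df["removed"]]

complete_df.to_csv("macro_entity_refinement_complete.csv", index=False)
```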

DoD

  • Create a CSV combining the updated entity refinement entries with the refined intent entries
  • Update the code in Macro_NLU_Entity_Refinement.ipynb to load and update the above data set

Define quality of delivered code

Description

Although this phase of the project is a prototype proof of concept, it would still be nice to deliver high-quality, readable code that potential developers (including my future self! LOL) could easily use and learn from.

Question

  • What standard should be met for this kind of project?
  • Who would be willing to review the code shortly before I release it more publicly?

Find overlapping entities in a domain by type

Description

When creating or maintaining an NLU data set, there are very frequently overlaps between entity types. These overlaps confuse entity extraction models. Naturally, many entity extractors use the context of the utterance in some form to label the entity words and types; nevertheless, overlap is often an issue.

To solve this, one should collect all the entities and their types and look for duplicates, trying to combine them as much as possible.

This issue relates to Flow for entity refinement #4 and must be solved before entity refinement can proceed.

User stories

As a user refining entity data, I want to:

  • find the overlapping entity types so that I can review them,
  • take the reviewed entries and refine them into my original data set.

Problems

Data types

One problem here is the data types. In the notebook for entity refinement, domain_df (and other dfs!) uses a column called entities, where the entities were extracted using EntityExtractor.extract_entities. This creates entries in the column like so:
[{'type': 'time', 'words': ['five', 'am']}, ...]

One shouldn't simply place lists of dictionaries into pandas data frame columns. But instead of trying to figure out a smarter way to do this, I focused on getting it done, because the resulting clean data set is a higher priority than avoiding some crappy technical debt.

Question

However, we now need to search through this structure to find duplicates. So the issue here is: how do we do this well, given the current data structure?

Partial or full matches

Should we match only exact entity words, or would it be beneficial to also search individual words for partial matches? We could even remove stop words to avoid some boring matches. Perhaps that is better suited for the next version.
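
To make the distinction concrete, a partial match might look something like this sketch (the stop word list and tokenization are placeholders, not a settled design):

```python
STOP_WORDS = {"the", "a", "an", "of", "to"}  # placeholder list

def content_tokens(entity_words: str) -> set:
    """Tokenize entity words and drop stop words."""
    return {w for w in entity_words.lower().split() if w not in STOP_WORDS}

def partial_match(words_a: str, words_b: str) -> bool:
    """Two entity strings partially match if they share any non-stop-word token."""
    return bool(content_tokens(words_a) & content_tokens(words_b))

print(partial_match("the five am", "five o'clock"))  # True: both contain 'five'
```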

Question

Should we focus on just full matches as an MVP, or do we also need partial matches with stop words removed?

Solutions

These are just the solutions I will go with for now; perhaps they can be refactored in the future. As stated, getting the cleaned-up data out is more important than nice code! We are going quick and dirty here.

Data types

I propose we make a quick and dirty version where we create a new data frame that contains:

  • index: just a normal unique index for every entry, as pandas usually provides.
  • id: the original index from domain_df; several entries can share the same id, since one utterance can contain multiple entities and entity types!
  • entity_type: the type from the original data frame's entities column.
  • entity_words: the words from the words list of the original entities column, joined with spaces.

We use this data frame to pull out the matches we want with ordinary pandas operations and return them, together with the original data frame's index, so we can bring everything together for review and refinement.
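
A sketch of that flattening, plus the duplicate lookup it enables (the helper is illustrative, with the dict keys taken from the example entry above and a tiny stand-in for domain_df):

```python
import pandas as pd

# Tiny stand-in for domain_df as described above: an 'entities' column of dicts.
domain_df = pd.DataFrame({
    "utterance": ["wake me at five am", "set alarm for five am"],
    "entities": [
        [{"type": "time", "words": ["five", "am"]}],
        [{"type": "alarm_time", "words": ["five", "am"]}],
    ],
})

def flatten_entities(df: pd.DataFrame) -> pd.DataFrame:
    """Explode the list-of-dicts entities column into one row per entity."""
    rows = []
    for idx, entities in df["entities"].items():
        for entity in entities:
            rows.append({
                "id": idx,  # original index; repeats when an utterance has several entities
                "entity_type": entity["type"],
                "entity_words": " ".join(entity["words"]),
            })
    return pd.DataFrame(rows)

flat_df = flatten_entities(domain_df)

# Full exact matches (the MVP): identical words labelled with more than one type.
type_counts = flat_df.groupby("entity_words")["entity_type"].nunique()
matches = flat_df[flat_df["entity_words"].isin(type_counts[type_counts > 1].index)]
print(matches)  # 'five am' appears as both 'time' and 'alarm_time'
```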

If this works well, it can be integrated as a method in the EntityExtraction class, and extract_entities can spit out this new data frame instead of just appending all of that junk into a column.

Partial or full matches

We will just go with full exact matches for now; we can always add partial matching later. Let's keep the MVP lean! I will make an issue for this; perhaps someone else would like to give it a try? It shouldn't be too hard to do, and it would make a good first issue.

DoD

  • Flow to refine entities by overlap
  • The results can be saved to CSVs
  • The refinements can be added to the original data frame
