The imixs-ml from imixs

MLTrainingScheduler - typo in environment variables

MLTrainingScheduler has a typo in config param

rename:
ml.trainng.scheduler.enabled  =>   ml.training.scheduler.enabled

MLTrainingService - do not create training events if scheduler is disabled

A training entity should not be created if the training scheduler is disabled.

Change Dependencies for Imixs-Archive

Change the module dependency from Imixs-Archive-Documents to Imixs-Archive-OCR.

See: imixs/imixs-archive#101

XMLTrainingData - build - wrong quality if no items found

During the build method the quality is set to TRAININGDATA_QUALITY_LEVEL_PARTIAL even if no entities are found.

If no entities are found, the quality needs to be set to TRAININGDATA_QUALITY_LEVEL_BAD.
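
The intended rule can be sketched as follows. This is only an illustration: the constant names are taken from the issue text, while their numeric values, the class name and the method signature are assumptions and not the actual XMLTrainingData code.

```java
import java.util.List;

final class TrainingQualitySketch {
    // Names follow the issue text; the numeric values are assumed for this sketch.
    static final int TRAININGDATA_QUALITY_LEVEL_BAD = 0;
    static final int TRAININGDATA_QUALITY_LEVEL_PARTIAL = 1;
    static final int TRAININGDATA_QUALITY_LEVEL_FULL = 2;

    /**
     * expected: entity names configured for training
     * found:    entity names whose values were located in the text
     */
    static int computeQuality(List<String> expected, List<String> found) {
        if (found.isEmpty()) {
            // nothing was found at all: never report PARTIAL here
            return TRAININGDATA_QUALITY_LEVEL_BAD;
        }
        if (found.containsAll(expected)) {
            return TRAININGDATA_QUALITY_LEVEL_FULL;
        }
        return TRAININGDATA_QUALITY_LEVEL_PARTIAL;
    }
}
```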

Implement JAX-RS client

Use a JAX-RS client implementation to send the Imixs-ML training and analyse objects.

Provide additional XMLRoot classes

Training Data - US currency and German date not recognized correctly

663.52 is not detected with float 663,52

invoice np

Also, an invoice date in German long format is not detected during analysis:

26. Januar 2017

invoice kraxi
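
The German long date format mentioned above can be produced and parsed with java.time. The following is a minimal sketch, assuming the month names of Locale.GERMAN; the class and method names are hypothetical and not part of Imixs-ML.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class GermanDateSketch {
    // "d. MMMM uuuu" renders e.g. day 26, month January, year 2017 in long German form
    private static final DateTimeFormatter GERMAN_LONG =
            DateTimeFormatter.ofPattern("d. MMMM uuuu", Locale.GERMAN);

    /** Text variant of a date value as it may appear in a German document. */
    static String toGermanLongFormat(LocalDate date) {
        return GERMAN_LONG.format(date);
    }

    /** Reverse direction: read a date found in the document text. */
    static LocalDate parseGermanLongFormat(String text) {
        return LocalDate.parse(text, GERMAN_LONG);
    }
}
```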

Introduce AdapterConcept for analyzing entities

Entities stored in a workitem can be represented in different formats in a document. For example the float value 1042.0 can be

Another case is the representation of a date or an IBAN.

To support different representations of a value we need an adapter concept.
Implementation: CDI observer pattern.
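
One way to picture such an adapter is a function that turns a stored value into the text variants that may occur in a document. The sketch below does this for a float value and a list of locales using plain Java; it leaves out the CDI observer wiring, and the class name, method name and the fixed two fraction digits are assumptions.

```java
import java.text.NumberFormat;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NumberVariantSketch {
    /** Returns the locale specific text representations of a float value. */
    static Set<String> textVariants(double value, List<Locale> locales) {
        Set<String> variants = new LinkedHashSet<>();
        for (Locale locale : locales) {
            NumberFormat format = NumberFormat.getNumberInstance(locale);
            // assumption: currency-like values are written with two fraction digits
            format.setMinimumFractionDigits(2);
            format.setMaximumFractionDigits(2);
            variants.add(format.format(value));
        }
        return variants;
    }
}
```

For the value 663.52 and the locales GERMANY and US this yields the variants 663,52 and 663.52, so both spellings can be searched in the document text.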

TrainingService - test mode

Implement a test resource to test an existing model against a data set of workitems.

Add Text Classification feature

Add the new feature for "Text Classification"

MLTrainingScheduler does not remove events after processing

The MLTrainingScheduler does not remove events after processing

Jakarta 8 support

add Jakarta 8 support
refactoring module setup (imixs-ml-core has wrong location!)

Add Java JUnit Tests

Add a sub project based on maven for testing via junit

Improve training resource in ml module

We need to update the training method in the python object datatrain.

In our default use case we did not provide more than one training set at a time.

imixs-ml-spacy - provide health endpoint

Provide a health endpoint for imixs-ml-spacy to configure a liveness probe in Kubernetes.

The endpoint just needs to verify that models are available. If not, we can assume that something went wrong.

Refactoring project structure

imixs-ml-spacy - wrapper code for spacy
imixs-ml-core - core java classes (e.g. training data objects)
imixs-ml-workflow - workflow integration (analyzing)
imixs-ml-training - training api

Add Docker Hub Support

Add configuration for Docker Hub Images

Update pom.xml

AnalyseText should return the categories including the score

AnalyseText must return the categories including the score.
It's up to the application to interpret the score of a category.

We need to extend the datamodel returned by analyseText method.
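
How an application might interpret such a result can be sketched as follows. The data model here (a map from category name to score) and the threshold parameter are assumptions for illustration only and do not describe the actual return type of analyseText.

```java
import java.util.Map;

final class CategoryScoreSketch {
    /**
     * Returns the category with the highest score that reaches the threshold,
     * or null if no category is good enough.
     */
    static String bestCategory(Map<String, Double> scores, double threshold) {
        String best = null;
        double bestScore = threshold;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() >= bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }
}
```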

create modules for data management and training

separate the code for data objects and training methods in separate modules

TrainingService - refine training mode

In the current implementation training data is only generated for a workitem with a 100% match of all ML entities.
The reason for this is that bad data should not be sent to the ML service, as this data can degrade the model quality.

For example, if the OCR extraction did not extract the IBAN correctly but the workitem has the correct IBAN, then this workitem should NEVER be used for training as the text is wrong.

The Problem

We recognized that if we train invoice data there might be a relevant amount of invoice data not including cdtr.iban and cdtr.bic. This is true for a lot of invoices. Currently we ignore those workitems because of the missing items cdtr.iban and cdtr.bic.

The Solution

If iban / bic are not included in the work item, we can assume that this type of entity data is not relevant at all and that the text data is probably not of bad quality.

But this might not be true for all kinds of entities. For example a cdtr.name, an invoice.date and an invoice.total are essential for training an invoice workitem.

So we can solve this by marking items as 'optional' to indicate that a workitem can still be relevant for training if these items are empty in the workitem.

An Amazon invoice can serve as an example.
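
The rule described above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name, the method signature and the way matches are passed in are hypothetical and not taken from the TrainingService.

```java
import java.util.Map;
import java.util.Set;

final class OptionalItemSketch {
    /**
     * workitemValues: item name -> value stored in the workitem (null or blank if not set)
     * matchedInText:  names of items whose value was found in the document text
     * optionalItems:  names of items that may be empty without disqualifying the workitem
     */
    static boolean isUsableForTraining(Map<String, String> workitemValues,
            Set<String> matchedInText, Set<String> optionalItems) {
        for (Map.Entry<String, String> entry : workitemValues.entrySet()) {
            String name = entry.getKey();
            String value = entry.getValue();
            boolean empty = value == null || value.trim().isEmpty();
            if (empty) {
                if (!optionalItems.contains(name)) {
                    return false; // a required item is missing in the workitem
                }
                continue; // empty optional item: assume it is not relevant for this document
            }
            if (!matchedInText.contains(name)) {
                return false; // value exists but was not found: the text is probably bad
            }
        }
        return true;
    }
}
```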

Replace XMLConfig with ItemCollection

Replace XMLConfig with ItemCollection to get more flexibility in providing additional config data.

XMLTrainingData - cleanTextdata - replace newline with pilcrow sign

The XMLTrainingData method cleanTextdata should replace newlines with the pilcrow sign.
Also, we should not strip spaces, as they can be a hint for the ML framework to recognize entities with better quality.
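
A minimal sketch of this behaviour, assuming the method receives and returns a plain String (the class name is hypothetical, and the real cleanTextdata may do more than this):

```java
final class CleanTextSketch {
    static final String PILCROW = "\u00B6";

    /** Replaces every line break with a pilcrow sign and leaves all spaces untouched. */
    static String cleanTextdata(String text) {
        // \R matches any line break sequence, including \r\n as a single unit
        return text.replaceAll("\\R", PILCROW);
    }
}
```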

Refactoring - imixs-ml-api => imixs-ml-training

rename api module into imixs-ml-training

Implement a JSF Front-End Integration

Implement a JSF front-end integration to display entity values suggested by the ML analysis
and provide an Ajax search method to search text phrases within the current document content.

provide JSF components / subforms
provide JavaScript library

Example for training data

To get started, we first need some kind of simple example code to train an empty model for entity recognition.

Change API Endpoints for a multi-model support

For a multi model support each api endpoint should consume the model name

In this scenario it is necessary that a workitem holds a list of ml.definition objects containing the details of an ML service endpoint, e.g. the locales or the ml.status.

Add TrainingDataBuilder

Add a TrainingDataBuilder implementing a builder pattern to restructure the code.

SpaCy - initialize model with categories

Because categories cannot be added dynamically like NER entities, we need a separate method in the spaCy wrapper to initialize a blank model with a set of categories. This allows us to use a simplified API for incremental training of new workitems.

See also discussion here: explosion/spaCy#6905 (comment)

Upgrade spaCy version 3.0

Upgrade to spaCy version 3.0

upgrade dependencies
refactoring

Add optional Tika Options into the config

Add optional Tika options to the config to get more configuration options concerning the Tika server.

MLService - endpoint must not include the /analyse/ resource

The MLService endpoint must not include the /analyse/ resource. The resource is added by the corresponding service method only.
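
In other words, the configured endpoint is only the base address and the service method appends the resource itself. A small sketch of that idea; the class and method names are hypothetical, and the host name in the usage note is only an example:

```java
final class EndpointSketch {
    /** Builds the analyse URI from a configured base endpoint. */
    static String analyseUri(String serviceEndpoint) {
        // tolerate a base endpoint with or without a trailing slash
        String base = serviceEndpoint.endsWith("/") ? serviceEndpoint : serviceEndpoint + "/";
        return base + "analyse/";
    }
}
```

With a base endpoint such as http://imixs-ml:8000/ the method returns http://imixs-ml:8000/analyse/.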

Adapter - Textlength

The Adapter classes should support an optional text length. In some cases the returned text is too long.

Add maven release management

Add maven release management for java modules

MLService - Eclipse MicroProfile 3.x platform specification compatibility

Empty Strings in a defaultValue @ConfigProperty are not allowed with microprofile 3.3.

Change MLService

add licence

add licence files

Change default locale to GERMANY, UK

The current default locale is set to "GERMAN", "UK". In that case only the language is defined for Germany, not the country.
The correct default should be:

GERMANY, UK
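
The difference between the two constants can be checked directly with java.util.Locale. The helper below is only a sketch for illustration; the class and method names are not part of Imixs-ML.

```java
import java.util.Locale;

final class LocaleSketch {
    /** True if the locale defines a country and not only a language. */
    static boolean hasCountry(Locale locale) {
        return !locale.getCountry().isEmpty();
    }
}
```

Locale.GERMAN carries only the language "de", while Locale.GERMANY carries the language "de" and the country "DE". Locale.UK already defines both language and country.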

Extend CDI events - EntityObjectEvent, EntityTextEvent

Extend the CDI events and provide separate events for text and object adaption

EntityObjectEvent
EntityTextEvent

Setup a Docker Container for spaCy

Provide a Docker image to run a basic microservice exposing spaCy functionality.

See discussion here: https://stackoverflow.com/questions/60964785/how-to-expose-spacy-as-an-rest-api

Move spacy wrapper into a module Imixs-ML-SpaCy

Move spacy wrapper into a separate module Imixs-ML-SpaCy

MLController - improve findMaches

improve the MLController method 'findMaches()'

increase suggest size from 32 length to 64
add more text variants including spaces

MLService - overwriting the ML Status flag

It should be possible to reset the ML Status flag by a BPMN event. Example:

<ml-config name="status">suggest</ml-config>

should reset the status to 'suggest'

MLAdapter - interrupt processing life cycle on a ml api error

In case of an ML API error (e.g. from the spaCy wrapper service) the MLAdapter should interrupt the processing life cycle with a ProcessingException.

UI-Integration - improve suggest box

Improve suggest box with a keyUp/keyDown feature

MLAdapter - ml items not defined in the workflow model must be ignored

The MLAdapter must ignore ml items not defined in the workflow model. Otherwise the ml adapter would create irrelevant content for a workitem.
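
The filtering step could look roughly like this. It is a sketch under assumptions: the ML result is represented as a simple name/value map and the set of items defined in the workflow model is passed in; neither reflects the actual MLAdapter data structures.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class ItemFilterSketch {
    /** Keeps only those ML result items that are defined in the workflow model. */
    static Map<String, String> keepDefinedItems(Map<String, String> mlResult,
            Set<String> definedItems) {
        Map<String, String> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : mlResult.entrySet()) {
            if (definedItems.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
            // anything else is dropped so no irrelevant content reaches the workitem
        }
        return filtered;
    }
}
```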

training-service

Add a training-service module providing a microservice to train and maintain models based on training data provided by an Imixs-Workflow instance.

MLService - typo ITEM_ML_ITEMES

MLService - support trainingdata quality level

Currently the trainingdata quality level is ignored.
But in case a quality level FULL is required and the workitem does not match that level, the workitem should not be used for training!

Implement MLAdapter class

Implement a Signal Adapter class to be added into a BPMN model

 MLAdapter

This adapter class is used for ml analysis based on the Imixs-ML project.

The Adapter is configured through the model by defining a workflow result item named 'ml'.

Example:

<item name="ml_config">
    <endpoint>https://localhost:8111/api/resource/</endpoint>
    <locales>DE,UK</locales>
</item>

MLAdapter - aggregate text content

In case more than one attachment exists, the MLAdapter should aggregate the text and call the ML API endpoint only once with the complete text.

Therefore an optional filtering by file name should be supported. In this way an event can analyse only a specific file type using a regular expression.
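
A minimal sketch of the aggregation with an optional file name filter, assuming the attachments are available as a map from file name to extracted text. The class name, the method signature and the newline separator are assumptions.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class TextAggregationSketch {
    /**
     * Joins the text of all attachments whose file name matches the pattern.
     * A null pattern means that no filter is applied.
     */
    static String aggregate(Map<String, String> attachments, String filenamePattern) {
        Pattern filter = filenamePattern == null ? null : Pattern.compile(filenamePattern);
        StringBuilder text = new StringBuilder();
        for (Map.Entry<String, String> attachment : attachments.entrySet()) {
            if (filter != null && !filter.matcher(attachment.getKey()).matches()) {
                continue; // file type not requested by this event
            }
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(attachment.getValue());
        }
        return text.toString();
    }
}
```

The aggregated text is then sent to the ML API endpoint in a single request instead of one request per attachment.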

XMLTrainingData - build - wrong quality if no items found

See Issue #44
