
fms-guardrails-orchestrator's Introduction

Foundation Model Stack

Foundation Model Stack is a collection of components for development, inference, training, and tuning of foundation models leveraging PyTorch native components. For inference optimizations we aim to support PyTorch compile, accelerated transformers, and tensor parallelism. At training time we aim to support FSDP, accelerated transformers, and PyTorch compile. To enable these optimizations, we will provide reimplementations of several popular model architectures starting with Llama and GPT-BigCode.

Models Supported

Model family    Inference    Tuning and Training
LLaMA           ✔️           ✔️
GPT-BigCode     ✔️
RoBERTa         ✔️

Installation

We recommend running this on Python 3.11 and CUDA 12.1 for best performance, as the CPU overheads of the models are reduced significantly.

PyPI

pip install ibm-fms

Local

Requires PyTorch >= 2.1.

pip install -e .

or

python setup.py install

Inference

Approach

Our approach for inference optimization is to use PyTorch compile, accelerated transformers, and tensor parallelism. PyTorch compile compiles the code into optimized kernels, accelerated transformers leverages scaled_dot_product_attention (SDPA) for accelerating attention computation while saving memory, and tensor parallelism is necessary for larger models.

To enable the Llama models to compile, we had to reimplement RoPE encodings without complex numbers. With this change, Llama model inference is able to leverage model compilation for latency reduction.

Inference latency

We measured inference latencies with a 1024-token prompt and 256 generated tokens on AWS P4de instance nodes with 8 80GB A100 GPUs, and report the median latency in the table below.

Model    # GPUs    Median latency (ms)
7B       1         14
13B      1         22
70B      8         30

If you would like to reproduce the latencies, you can run scripts/benchmark_inference.py; the details are described in the inference docs.

For more information on reproducing the benchmarks and running some examples, see here.
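
As a rough illustration of the approach described above, a compiled native-FMS forward pass looks something like the sketch below. This is a minimal sketch, not the benchmark script itself: the variant, checkpoint path, vocabulary size, and dummy prompt are assumptions, and the exact forward signature may differ; see scripts/benchmark_inference.py for the real invocation.

import torch
from fms.models import get_model

# Load a registered model; this follows the
# get_model('architecture', 'variant', '/path/to/data') form described below.
llama = get_model("llama", "7b", "/path/to/7B/")
llama = llama.to("cuda").eval()

# Compile the model; the first call is slow while kernels are compiled,
# subsequent calls benefit from the optimized kernels.
llama = torch.compile(llama)

with torch.no_grad():
    # Dummy 1024-token prompt (the vocab size of 32000 is an assumption).
    prompt = torch.randint(0, 32000, (1, 1024), device="cuda")
    logits = llama(prompt)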

HF Model Support

Support for HF models is provided by our HF model adapter. Using it, one can obtain latencies similar to those tabulated above with HF models:

from fms.models import get_model
from fms.models.hf import to_hf_api
import torch
from transformers import AutoTokenizer, pipeline

# fms model
llama = get_model("llama", "13b")

# huggingface model backed by fms internals
llama_hf = to_hf_api(llama)

# a compatible HF tokenizer is needed for the pipeline below;
# the checkpoint name here is only an example
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")

# compile the model -- in HF, the decoder only
llama_hf.decoder = torch.compile(llama_hf.decoder)

# generate some text -- the first time will be slow since the model needs to be compiled, but subsequent generations should be faster.
llama_generator = pipeline(task="text-generation", model=llama_hf, tokenizer=tokenizer)
llama_generator("""q: how are you? a: I am good. How about you? q: What is the weather like today? a:""")

A detailed example is provided here.

Tuning

To fine-tune LLaMA, use the scripts/train_causal.py training script. Here's an example of that command.

torchrun --nproc_per_node=2 \
        scripts/train_causal.py \
        --architecture=llama \
        --variant=7b \
        --tokenizer=~/models/tokenizer.model \
        --model_path=~/models/7B/ \
        --report_steps=10 \
        --checkpoint_format=meta \
        --distributed=fsdp

See options in the script for other ways to train and tune.

Structure and contents of this Repository

  • fms/models/ - Pure PyTorch implementations of popular model architectures, without requiring any specific common interface beyond nn.Module. Each model configuration is registered with fms.models.register_model() so that instances can be obtained through fms.models.get_model('architecture', 'variant', '/path/to/data'). Each model can also register sources/formats/versions of data to load (e.g. checkpoints provided by meta, HF, or trained from this repo). Users of the repo (e.g. fms-extras) can register their own model architectures as well (a combined usage sketch follows this list).
  • fms/models/hf/ - Adapters that compose our native PyTorch FMS model architecture implementations in HF-compatible wrapper interfaces. Each FMS model implements an adapter, and adapted instances are obtained via fms.models.hf.to_hf_api(model).
  • fms/datasets/ - Code for loading data for pre-training and fine-tuning. Individual datasets are retrieved by fms.datasets.get_dataset('name', tokenizer, 'optional path or other data reference'). The expected tokenizer conforms to an fms.utils.tokenizers.BaseTokenizer interface.
  • fms/modules/ - Components extending nn.Module used in our model architecture implementations. Each Module has a corresponding TPModule so that modules can be sharded using a tensor-parallel distribution strategy. FMS modules should all support torch.compile without graph breaks.
  • fms/training/ - Pre-training and fine-tuning code.
  • fms/utils/ - Other operators useful in working with LLMs. These include a generate() function, Tensor subclasses, code for dealing with LLM checkpoints that might be saved/sharded in a variety of formats, tokenization code, and various other useful helper functions.
  • scripts/ - Various scripts for inference, benchmarking, and evaluation, as well as an entry-point for tuning/training.
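
Taken together, a hedged sketch of how these entry points compose, per the descriptions above (the architecture, variant, and paths are placeholders, and the dataset lines are commented out because the exact tokenizer helper is not shown here):

import fms.models
import fms.models.hf
import fms.datasets

# Obtain a registered model instance: get_model('architecture', 'variant', '/path/to/data').
model = fms.models.get_model("llama", "7b", "/path/to/checkpoint")

# Wrap the native model in an HF-compatible interface.
hf_model = fms.models.hf.to_hf_api(model)

# Datasets are retrieved by name with a tokenizer conforming to the
# fms.utils.tokenizers.BaseTokenizer interface; name and reference are placeholders.
# tokenizer = ...  # any fms.utils.tokenizers.BaseTokenizer implementation
# dataset = fms.datasets.get_dataset("dataset_name", tokenizer, "/optional/path/or/reference")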

Extensions and Use Cases

This library is used by three dependent projects at IBM.

  • fms-fsdp - This repo shares training code that has been used to pretrain an fms implementation of LLaMA on IBM internal data.
  • fms-extras - This repo shares code for additional fms-based models trained by IBM. This repo will also be a home for other extensions, and may also include research or in-development work intended for eventual upstreaming to fms.
  • TGIS - This inference server includes support for serving fms models.

Open Issues

  • pytorch/pytorch#107824 prevents training/finetuning from working with torch.compile.
  • In addition, there are several open issues we are tracking to improve the stability and memory footprint of inference.

fms-guardrails-orchestrator's People

Contributors

alanbraz, declark1, evaline-ju, gkumbhat, mdevino


fms-guardrails-orchestrator's Issues

Use separate chunkers client

Using the current NLP client for chunkers assumes that the chunker servers will expose an NlpService, which is not generic towards "chunkers", even if the initial API will involve "tokenization". Assuming/using a more generalized chunkers client will be more flexible towards chunker servers.

Add configurable request timeouts for all clients

Description

The orchestrator is responsible for talking to several different clients. Each client can have a potentially different request/response timeout requirement. This story is to add a timeout configuration, either common to all clients (with a default) and/or allowing model/detector/chunker-specific timeout configuration.

Acceptance Criteria

  • Default timeout documented and added in config/config.yaml
  • Decision for per client timeout configuration mentioned in this issue
  • PR for adding this timeout support in the code merged to main

v2.0 not returning tokens

Describe the bug

When running functional tests, calling the /api/v1/task/classification-with-text-generation endpoint through:

response = requests.post(url="https://gr2-fmaas-tuning.apps.fmaas-devstage-backend.fmaas.res.ibm.com/api/v1/task/classification-with-text-generation", json=PAYLOAD, headers=HEADERS, verify=SSL)
json_response = json.loads(response.content.decode(response.encoding))

In the case of unary output with HAP, the response no longer includes the tokens field.

Platform

Please provide details about the environment you are using, including the following:

  • fms-orchestrator image version: us.icr.io/cil15-shared-registry/fms-orchestr8:latest

Expected behavior

The previous response at v1.0 was:

{
  "generated_text": "I hate cats they are stupid. I love dogs [email protected].",
  "token_classification_results": {"input": "", "output": []},
  "finish_reason": "MAX_TOKENS",
  "generated_token_count": 99,
  "seed": 42,
  "input_token_count": 21,
  "warnings": null,
  "tokens": [{"text": "Ġnot", "logprob": 0.0, "rank": 1},
             {"text": "Ġnot", "logprob": 0.0, "rank": 1}, {...}],
  "input_tokens": []
}

Observed behavior

The actual response is:

{
  "generated_text": "I hate cats they are stupid. I love dogs [email protected]. [...]",
  "token_classification_results": {
    "output": [
      {
        "start": 0,
        "end": 28,
        "word": "I hate cats they are stupid.",
        "entity": "has_HAP",
        "entity_group": "hap",
        "score": 0.982990860939026
      }
    ]
  },
  "finish_reason": "EOS_TOKEN",
  "generated_token_count": 59,
  "seed": 0,
  "input_token_count": 21
}

Additional context

The functional test run to get this error is:

    def test_unary_output_hap():
        payload = copy.deepcopy(PAYLOAD)
        payload["guardrail_config"]["output"]["models"] = HAP
        response = requests.post(url=UNARY_GUARDRAILS_URL, json=payload, headers=HEADERS, verify=SSL)
        assert 200 == response.status_code
        json_response = json.loads(response.content.decode(response.encoding))
>       assert json_response["tokens"] is not None
E       KeyError: 'tokens'

Should the tokens information simply be added back? Maybe this information is already available from the API.

NLP client not working on tokenize at least

Describe the bug

When testing with the NLP client for the generation provider as opposed to the TGIS client, tokenization with the LLM unexpectedly fails. It is not clear at this time whether or not generation with the NLP client works.

Sample Code

As config use the nlp provider with a caikit-nlp pod

e.g. [or localhost if port-forwarded]

        provider: nlp # tgis or nlp
        service:
            hostname: caikit-inference
            port: 8085

instead of the tgis client e.g.

        provider: tgis # tgis or nlp
        service:
            hostname: model-inference-server
            port: 8033

Expected behavior

Same success as with the TGIS client

Observed behavior

404 seen on tokenization for the LLM model

Add a check to verify if the len of input and output match for detectors

Description

Follow-up to #96

We send out a list of strings to detectors as input and expect a list of outputs in return. These input and output lists need to be of the same length for us to properly process the response. We want to implement a check to verify that detectors are responding as expected, and otherwise throw a 500.

Add CI

Description

As an orchestrator developer I want tests and ideally formatting/linting checks to run on each build automatically so developers can ensure that any changes pass existing tests/checks and reviewers have more confidence in changes.

Acceptance Criteria

  • Github workflow to run tests and formatting

Add orchestrator config validation logic

Description

OrchestratorConfig.validate() is a placeholder method to implement logic that should be checked during OrchestratorConfig::load(). This should include anything that won't be checked during deserialization, such as ensuring that tls refs match the name of a defined TlsConfig, etc.

e.g. the following config should fail validation as caikit isn't defined under tls

# ...

chunkers:
  en_regex:
    type: sentence
    service:
      hostname: localhost
      port: 8085
      tls: caikit
tls: {}

Acceptance Criteria

  • All needed config validation rules have been identified
  • Validation rules have been implemented in OrchestratorConfig.validate() and return descriptive panic messages in the event of failure
  • self.validate() is called from OrchestratorConfig::load()

Implement streaming result aggregation policy

Description

Depends on #56

As an orchestrator developer, I want to implement the aggregation policy for streamed detector results, so that I can return results to an end-user of orchestrator streaming endpoints.

Acceptance Criteria

  • Unit tests cover new/changed code
  • Examples build against new/changed code
  • READMEs are updated
  • Type of semantic version change is identified

Additional validation on threshold parameter

Description

As an orchestrator developer, I want to validate the threshold field, so that I can prevent users from putting in nonsensical values for threshold, like strings. This includes potentially using a specific struct for detector parameters.

Discussion

Currently the detector parameters are very flexible serde JSON values:

// TODO: When detector API is updated, consider if fields
// like 'threshold' can be named options instead of the
// use a generic HashMap with Values here
// ref. https://github.com/foundation-model-stack/fms-guardrails-orchestrator/issues/37
pub type DetectorParams = HashMap<String, serde_json::Value>;

Acceptance Criteria

  • Unit tests cover new/changed code
  • Examples build against new/changed code
  • READMEs are updated
  • Type of semantic version change is identified

Failed to deserialize the JSON body into the target type: missing field `models`

Describe the bug

When calling /api/v1/task/classification-with-text-generation with a missing models object inside guardrail_config.input or guardrail_config.output objects in the payload, a 422 Failed to deserialize the JSON body response occurs.

Platform

Please provide details about the environment you are using, including the following:

fms-orchestrator image version: fms-guardrails-orchestr8:e5b72c1

Observed behavior

Request body

{
	"inputs": "My email is",
	"model_id": "<model-id>",
	"guardrail_config": {
		"input": {
			"models": {}
		},
		"output": {
		}
	}
}

Response

Failed to deserialize the JSON body into the target type: guardrail_config.output: missing field `models` at line 9 column 3

Additional context

It happens for both guardrail_config.output and guardrail_config.input.

Improve error handling

Description

Improve error handling:

  • Determine the appropriate level of error granularity and add variants to capture context, or use anyhow::Error (with its Context trait)
    • e.g. rather than returning Error::ReqwestError, we probably want something like Error::Detector("request timed out")
  • Ensure errors flow up cleanly from orchestrator tasks; we should not be using Result::unwrap()
  • Properly convert Error variants into suitable http::StatusCode (with a message for context) to return to the caller from handler methods

Implement whole document "chunker"

Description

As a user that may want to use various detectors, I may want to use detectors that do not require "chunking." Particularly, some detectors like regexes may just operate on entire documents.

Discussion

For consistency with detectors that will invoke chunkers, we will implement a "default" whole document chunker that just returns a span with the start and end corresponding to an entire input document. Any detectors that do not specify a chunker will use this default one.

For code organization purposes, we likely want to implement this not as branching logic in the orchestrator itself but as a separate chunker that can get called via some name (still implemented within the orchestrator repo, but from a deployment/user configuration POV, this name can still be specified in the config).
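
Conceptually, the whole document "chunker" just returns a single span covering the entire input. A tiny illustration (the orchestrator itself is not implemented in Python, and the field names here are only illustrative):

def whole_document_chunk(text: str) -> list[dict]:
    # One chunk spanning the full document; detectors that do not specify a
    # chunker would receive this single span.
    return [{"start": 0, "end": len(text), "text": text}]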

Acceptance Criteria

  • Unit tests cover new/changed code
  • Examples build against new/changed code
  • READMEs are updated
  • Type of semantic version change is identified

Warnings and seed fields, not always present in json_response

Describe the bug

When running functional tests, calling the /api/v1/task/classification-with-text-generation endpoint through:

 response = requests.post(url='https://gr2-fmaas-tuning.apps.fmaas-devstage-backend.fmaas.res.ibm.com/api/v1/task/classification-with-text-generation', json=PAYLOAD, headers=HEADERS, verify=SSL)
 json_response = json.loads(response.content.decode(response.encoding))

Platform

Please provide details about the environment you are using, including the following:

  • fms-orchestrator image version: us.icr.io/cil15-shared-registry/fms-orchestr8:latest

Expected behavior

The previous response at v1.0 was:

{
  "generated_text": "I love dogs [email protected]",
  "token_classification_results": {
    "input": null,
    "output": []
  },
  "finish_reason": "MAX_TOKENS",
  "generated_token_count": 99,
  "seed": 42,
  "input_token_count": 21,
  "warnings": null,
  "tokens": [
    {
      "text": "\u0120not",
      "logprob": 0.0,
      "rank": 1
    }],
  "input_tokens": []
  }

It contained the seed and warnings fields, even when not populated. The seed needs to be populated when the request goes through the guardrails, and it is expected to have the same value as the seed sent in text_gen_parameters in the request payload.

Observed behavior

The actual response is:

{
  "generated_text": "I hate cats they are stupid. I love dogs [email protected]",
  "token_classification_results": {},
  "finish_reason": "EOS_TOKEN",
  "generated_token_count": 59,
  "seed": 0,
  "input_token_count": 21
}

Additional context

When no HAP or PII is found, the warnings will be empty, but in v1.0 the field was returned anyway, so assertions about the fields contained in the response will break. Consider whether these fields should be mandatory in the response.

Update HTTP client creation for mTLS

Description

As an orchestrator developer, I want to update the HTTP client creation, so that I can use mTLS with detectors, which are expected to be REST servers [as opposed to GRPC servers].

Discussion

It appears that while the created GRPC client uses client_ca_cert_path etc., the HTTP client does not yet do so. We are generally expecting the same "format" of cert information regardless of GRPC or HTTP servers.

Ref: https://github.com/foundation-model-stack/fms-guardrails-orchestrator/blob/main/src/clients.rs

Acceptance Criteria

  • Unit tests cover new/changed code
  • Examples build against new/changed code
  • READMEs are updated
  • Type of semantic version change is identified

Add TLS support for orchestrator [server-side]

Description

We need to add support for the orchestrator server to allow serving the endpoints over mutual TLS.

There are a couple of TODOs left in the code, such as unused parameters for running the server. With the client CA cert path info provided, mTLS will be enabled.

Acceptance criteria

  • Orchestrator server works with TLS configurations and accepts requests only over HTTPS
  • TLS configuration tested
  • How to configure the server with a TLS config is documented in the README

Add tests for orchestrator response with text generation edge cases

Description

As an orchestrator developer, I want to confirm that the orchestrator responds as expected if any text generation servers respond with no text, all-whitespace text, or potentially other edge cases.

Discussion

Add unit tests with mock results from text generation clients returning no text (say just a stop token), all whitespace text, etc.

Acceptance Criteria

  • Unit tests cover new/changed code
  • Examples build against new/changed code
  • READMEs are updated
  • Type of semantic version change is identified

Change `get_test_context()` to have a default GenerationClient

Description

As a developer, I want to have a default generation client on get_test_context, so that I can write unit tests without having to mock the generation client if the test doesn't require it.

Discussion

This is what the method currently looks like:

    async fn get_test_context(
        gen_client: GenerationClient,
        chunker_client: Option<ChunkerClient>,
        detector_client: Option<DetectorClient>,
    ) -> Context {
        let chunker_client = chunker_client.unwrap_or_default();
        let detector_client = detector_client.unwrap_or_default();

        Context {
            generation_client: gen_client,
            chunker_client,
            detector_client,
            config: OrchestratorConfig::default(),
        }
    }

I'm just not quite sure what would be a good default, since adding derive(Default) to GenerationClient doesn't work.

Acceptance Criteria

  • Unit tests that don't interface with a generation client, such as test_handle_detection_task, could be rewritten to work without mocking the GenerationClient.

Improve tls name to config mapping

Description

The way tls config names (provided in ServiceConfigs) are mapped to TlsConfigs in Orchestrator::load() could use some cleanup. Also, the tls field in OrchestratorConfig should be an Option.

Improve generation parameter passing in orchestrator server

Description

Currently we are manually re-defining the parameters required by the text generation API here. This can be fragile and limiting if the TGIS API changes these parameters or adds new ones.

It would be good to use the generation proto definition directly to define that part of the API, or take it in as a JSON object and pass it along as-is to the underlying generation server. This would keep us out of the middle of deciding which generation parameters we can accept.

Update detector client for use of `/api/v1/text/contents`

Description

As a developer, I want to call any detector's API, so that I can use it correctly in my workflow.

The detector API is being revisited, and we are adding more endpoints to nicely group detectors into logical bins. This will result in changes in the orchestrator.

Discussion

The API has been updated; the orchestrator code will need to be refactored to accommodate the changes.

Acceptance Criteria

Note: this story only covers handling of classification-type APIs. For other endpoints that we add to the detector API, we will need to add respective APIs in the orchestrator as well, which will be covered in a different story.

Add ADR for orchestrator API

Description

Similar to #71, as an orchestrator developer I want to document the design decisions, rationale, and definitions behind the current orchestrator API.

Acceptance Criteria

  • ADR merged for orchestrator API

Span offsets not accounted for in results

Describe the bug

The span outputs of results with detection models are incorrect.

Sample Code

curl -v -H "Content-Type: application/json" --request POST \
--data '{"model_id": "model_name",
    "inputs": "There was once a really dumb cat. The chicken crossed the road. The social security number is 123-45-6789. His phone number is (408) 123-4567. ",
    "guardrail_config": {
        "input": {
            "models": {"detector_name": {"threshold": 0.5}
            }
        },
        "output": {"models":{}}
    },
    "text_gen_parameters": {
        "min_new_tokens": 40,
        "max_new_tokens": 200
    }
    }' \
    http://localhost:8081/api/v1/task/classification-with-text-generation

Expected behavior

Span starts and ends should correctly indicate text positions

Observed behavior

Spans all start with 0 in results

{"token_classification_results":{"input":[{"start":0,"end":42,"word":"The social security number is 123-45-6789.","entity":"has_HAP","entity_group":"hap","score":0.00021640431077685207},{"start":0,"end":33,"word":"There was once a really dumb cat.","entity":"has_HAP","entity_group":"thing","score":0.9265003204345704},{"start":0,"end":35,"word":"His phone number is (408) 123-4567.","entity":"has_thing","entity_group":"hap","score":0.0003851974615827203},{"start":0,"end":29,"word":"The chicken crossed the road.","entity":"has_thing","entity_group":"thing","score":0.0013679045950993896}]},"input_token_count":35,"warnings":[{"id":"UNSUITABLE_INPUT","message":"Unsuitable input detected. Please check the detected entities on your input and try again with the unsuitable input removed."}]}

Detector result ordering on unary endpoint

Description

As an orchestrator user, I want detector results to be in span order on the unary endpoint and predictable (same ordering on the same exact call), so that I can process detector results in order.

Discussion

Currently, ordering can be unpredictable (different on each call) and not ordered by span, e.g.

{"token_classification_results":{"input":[{"start":0,"end":33,"word":"There was once a really dumb cat.","entity":"has_thing","entity_group":"thing","score":0.9265003204345704},{"start":34,"end":63,"word":"The chicken crossed the road.","entity":"has_thing","entity_group":"thing","score":0.0013679045950993896},{"start":107,"end":142,"word":"His phone number is (408) 123-4567.","entity":"has_thing","entity_group":"thing","score":0.0003851974615827203},{"start":64,"end":106,"word":"The social security number is 123-45-6789.","entity":"has_thing","entity_group":"thing","score":0.00021640431077685207}]},"input_token_count":35,"warnings":[{"id":"UNSUITABLE_INPUT","message":"Unsuitable input detected. Please check the detected entities on your input and try again with the unsuitable input removed."}]}

Acceptance Criteria

  • Predictable result ordering

Revisit default ports for clients decision

Description

Currently we are using a default port for all types of clients. While it is nice to have things work, it can sometimes create confusion if things seem to work magically. Additionally, people can always be explicit about the port, which would make the configuration clearer! This story is to revisit the default port decision and remove it if deemed confusing.

Detected PII word's "start" and "end" are returning the wrong positions

Describe the bug

The start and end fields returned are different from what is expected. For example, the e-mail is not at the position indicated by the span returned in the response for the detected PII.

Platform

Please provide details about the environment you are using, including the following:

GR Version 2.0 NLP Client, TLS

Sample Code

POST call to /api/v1/task/classification-with-text-generation with the payload:

{
    "inputs": "I hate cats they are stupid. I love dogs [email protected]. Rabbits are pretty but",
    "model_id": "bloom-560m",
    "guardrail_config": {
        "input": {
            "models": {
                "en_syntax_rbr_pii": {
                    "threshold": 0.8
                }
            }
        },
        "output": {
            "models": {}
        }
    },
    "text_gen_parameters": {
        "preserve_input_text": true,
        "max_new_tokens": 99,
        "min_new_tokens": 1,
        "truncate_input_tokens": 500,
        "decoding_method": "SAMPLING",
        "top_k": 2,
        "top_p": 0.9,
        "typical_p": 0.5,
        "temperature": 0.8,
        "seed": 42,
        "repetition_penalty": 2,
        "max_time": 0,
        "stop_sequences": [
            "42"
        ]
    }
}

Expected behavior

{
        "start": 41,
        "end": 54,
        "word": "[email protected]",
        "entity": "EmailAddress",
        "entity_group": "",
        "score": 0.8
  }

Observed behavior

{
    "token_classification_results": {
        "input": [
            {
                "start": 12,
                "end": 25,
                "word": "they are stup",
                "entity": "EmailAddress",
                "entity_group": "pii",
                "score": 0.8
            }
        ]
    },
    "input_token_count": 21,
    "warnings": [
        {
            "id": "UNSUITABLE_INPUT",
            "message": "Unsuitable input detected. Please check the detected entities on your input and try again with the unsuitable input removed."
        }
    ]
}

Implement server side streaming for text generation

Description

As a user, I want to handle server side streaming for text generation, instead of unary request, so that I can utilize orchestration in more interactive use-cases.

This story is to wire up text generation streaming with the guardrails orchestrator streaming endpoint. Connecting it with the chunker and detection will be handled in a separate issue.

Implement detection on text generation streaming response

Blocked by: #44

Description

We want to implement detection on the text generation streaming response in such a way that (high-level notes, up for discussion; see the sketch after this list):

  1. We pass the stream from the text generation response directly to the chunker.
  2. As soon as the chunker responds with a "chunk" (we wait for the chunker to respond), we take the chunker's response, tag it with a sequence identifier, and call out to the detector asynchronously.
  3. Without waiting for the detector's response, if we receive another chunk from the chunker, we send it out (with an incremented sequence id) to the detector. The point is to avoid calling the detector sequentially with each chunk and instead let the speed of generation + chunking be the main driving factor.
  4. As we start receiving responses from the detector, we put them in a list ordered by their sequence id.
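
A conceptual sketch of this flow in Python/asyncio (the orchestrator is not implemented in Python; the chunk source, detector call, and all names below are stand-ins used only to illustrate the sequencing described above):

import asyncio

async def aenumerate(aiterable, start=0):
    # Tag each streamed item with an increasing sequence id.
    idx = start
    async for item in aiterable:
        yield idx, item
        idx += 1

async def fake_chunker():
    # Stand-in for chunks streamed back by the chunker.
    for chunk in ["First sentence.", "Second sentence.", "Third sentence."]:
        await asyncio.sleep(0.01)
        yield chunk

async def call_detector(seq_id, chunk):
    # Stand-in for an async detector call; returns (sequence id, result).
    await asyncio.sleep(0.05)
    return seq_id, {"chunk": chunk, "detections": []}

async def detect_on_stream():
    tasks = []
    # Steps 2-3: as each chunk arrives, tag it and fire off the detector call
    # without waiting for earlier detector responses.
    async for seq_id, chunk in aenumerate(fake_chunker()):
        tasks.append(asyncio.create_task(call_detector(seq_id, chunk)))
    # Step 4: gather detector responses and order them by sequence id.
    return sorted(await asyncio.gather(*tasks), key=lambda r: r[0])

print(asyncio.run(detect_on_stream()))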

cc: @evaline-ju

Implement real liveness and readiness endpoints

Description

Currently we have a health check endpoint which is a stubbed-out version. We need to implement a real health check endpoint that is able to measure the health of the server.

Tasks

  • Figure out how to measure liveness of orchestr8 server
  • Implement liveness endpoint
  • Figure out how to measure readiness of orchestr8 server
  • Implement readiness endpoint

Acceptance Criteria

  • Liveness and readiness probes implemented in orchestr8 server and merged to main branch

Host API in swagger pages in the repo

Description

To enable easy readability of the API, we want to host the swagger UI in GitHub pages for the orchestrator API. We should also see if there is a way to host multiple APIs, so that we can host the detectors API in this repo as well. But the first priority is to host the orchestrator API.

Document orchestration logic in an ADR

Description

As we are starting to add core logic for orchestration of input/output detectors and how text gets processed in various scenarios, this issue is to document this in an ADR. This is the heart of the orchestrator, and an ADR on its design would allow us and future contributors to understand what's going on there more clearly.

Replace prompt in detector context analysis API with content

Description

prompt in the context analysis API seems to be confusing, since this API is really targeting the context of the input, which can be a prompt or text generated by an LLM, for example. So in this issue, we want to rename prompt to content to fix this confusion.

Update the detector API with text

Description

As a detector API user and orchestrator developer, I want to have detectors return text corresponding to the spans [start, end] of a found detection, so that I can more easily debug and verify detector behavior and pass this information to the user instead of re-indexing codepoints to get the text for the final API response.

Once the API is updated in https://github.com/foundation-model-stack/fms-guardrails-orchestrator/blob/main/docs/api/openapi_detector_api.yaml, changes will be needed in the orchestrator to accommodate this here

Acceptance Criteria

  • Unit tests cover new/changed code
  • Examples build against new/changed code
  • READMEs are updated
  • Type of semantic version change is identified

Add user request validation

Description

As an orchestrator user, I want to receive errors on invalid requests, so that I can know how to potentially update my request.

Discussion

Scenarios include:

  • Unsupported LLMs (model IDs) (this may be handled by #13 / PR #31)
  • Unsupported detectors (this may be handled by #13 / PR #31)
  • Incorrect text generation parameters (this may be handled by #13 / PR #31) [500s should be 400s]
  • Missing parameters e.g. users can put in "input": { "model-name": {} } without the models parameter and receive no error. The model-name would then just be ignored on input, which seems unexpected.
  • Extra parameters e.g. users can put in random additional fields to the overall request or under guardrail_config. These extra fields are currently ignored but could indicate unintentional misformatting.
  • Should masks without models on input detection even be valid?

Acceptance Criteria

  • Handling of above cases
  • Unit tests

API response field different from v1.0: generated_text (v1.0) --> text_generated (v2.0)

Describe the bug

The new v2.0 API JSON response returns a different field name than the v1.0 one.

Expected behavior

{
  "generated_text": "I love dogs [email protected]",
  "token_classification_results": {
    "input": null,
    "output": []
  },
  "finish_reason": "MAX_TOKENS",
  "generated_token_count": 99,
  "seed": 42,
  "input_token_count": 21,
  "warnings": null,
  "tokens": [
    {
      "text": "\u0120not",
      "logprob": 0.0,
      "rank": 1
    }],
  "input_tokens": []
  }

Observed behavior

{
  "text_generated": "I hate cats. [...] I love dogs [email protected]!",
  "token_classification_results": {},
  "finish_reason": "EOS_TOKEN",
  "generated_token_count": 59,
  "seed": 0,
  "input_token_count": 21
}

Add unit test to verify call to text generation is working as expected or not

Description

In order to better verify whether the text generation request and response are being processed correctly, this story is to implement unit tests for some of the client functions of the text generation clients.

Tasks

  1. Implement mocking mechanism to mock TGIS outgoing call
  2. Implement unit tests for client generate functions.

Empty masks array leads to different behavior/results on unary endpoint

Describe the bug

The presence of an empty masks array should lead to the same behavior as when the masks array is left off completely. Both indicate that there are no spans that should be specifically split out of the input text.

Sample Code

Replace detector_name and model_name as appropriate for the inputs, and port as appropriate for the local server

curl -v -H "Content-Type: application/json" --request POST \
--data '{"model_id": "model_name",
    "inputs": "There was once a really dumb cat. The chicken crossed the road. The social security number is 123-45-6789.",
    "guardrail_config": {
        "input": {
            "models": {
                "detector_name": {}
            }, "masks": []
        },
        "output": {"models":{}}
    },
    "text_gen_parameters": {
        "min_new_tokens": 40,
        "max_new_tokens": 200
    }
    }' \
    http://localhost:<port>/api/v1/task/classification-with-text-generation

vs.

curl -v -H "Content-Type: application/json" --request POST \
--data '{"model_id": "model_name",
    "inputs": "There was once a really dumb cat. The chicken crossed the road. The social security number is 123-45-6789.",
    "guardrail_config": {
        "input": {
            "models": {
                "detector_name": {}
            }
        },
        "output": {"models":{}}
    },
    "text_gen_parameters": {
        "min_new_tokens": 40,
        "max_new_tokens": 200
    }
    }' \
    http://localhost:<port>/api/v1/task/classification-with-text-generation

Expected behavior

Same results

Observed behavior

Different results [potentially a some vs. none case]

Example output with empty masks array in input

{"generated_text":" The cat was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a very smart cat. He was a","token_classification_results":{},"finish_reason":"MAX_TOKENS","generated_token_count":200,"seed":0,"input_token_count":24}%

Example output without masks in input

{"token_classification_results":{"input":[{"start":30,"end":41,"word":"123-45-6789","entity":"NationalNumber.SocialSecurityNumber.US","entity_group":"detection","score":0.8}]},"input_token_count":24,"warnings":[{"id":"UNSUITABLE_INPUT","message":"Unsuitable input detected. Please check the detected entities on your input and try again with the unsuitable input removed."}]}

v2.0 finish_reason responds with EOS_TOKEN instead of MAX_TOKENS.

Describe the bug

response = requests.post(url="https://gr2-fmaas-tuning.apps.fmaas-devstage-backend.fmaas.res.ibm.com/api/v1/task/classification-with-text-generation", json=PAYLOAD, headers=HEADERS, verify=SSL)

response = requests.post(url="https://gr2-fmaas-tuning.apps.fmaas-devstage-backend.fmaas.res.ibm.com/api/v1/task/classification-with-text-generation", json=PAYLOAD, headers=HEADERS, verify=SSL)
json_response = json.loads(response.content.decode(response.encoding))

Platform

Please provide details about the environment you are using, including the following:

  • fms-orchestrator image version: us.icr.io/cil15-shared-registry/fms-orchestr8:latest

Expected behavior

The previous response at v1.0 was:

{
  "generated_text": "I love dogs [email protected]",
  "token_classification_results": {
    "input": null,
    "output": []
  },
  "finish_reason": "MAX_TOKENS",
  "generated_token_count": 99,
  "seed": 42,
  "input_token_count": 21,
  "warnings": null,
  "tokens": [
    {
      "text": "\u0120not",
      "logprob": 0.0,
      "rank": 1
    }],
  "input_tokens": []
  }

finish_reason value is expected to be "MAX_TOKENS".

Observed behavior

The actual response is:

{
  "generated_text": "I hate cats they are stupid. I love dogs [email protected]",
  "token_classification_results": {},
  "finish_reason": "EOS_TOKEN",
  "generated_token_count": 59,
  "seed": 0,
  "input_token_count": 21
}

finish_reason value is "EOS_TOKEN".

Additional context

The functional test run to get this error is:

    def test_unary_no_guardrails():
        response = requests.post(url=UNARY_GUARDRAILS_URL, json=PAYLOAD, headers=HEADERS, verify=SSL)
        assert 200 == response.status_code
        json_response = json.loads(response.content.decode(response.encoding))
>       assert json_response["finish_reason"] == "MAX_TOKENS"
E       AssertionError: assert 'EOS_TOKEN' == 'MAX_TOKENS'
E         - MAX_TOKENS
E         + EOS_TOKEN
