openshift-psap / llm-load-test Goto Github PK

View Code? Open in Web Editor NEW

8.0 8.0 12.0 11.93 MB

License: MIT License

Python 99.74% Dockerfile 0.26%

llm-load-test's People

Contributors

Stargazers

Watchers

Forkers

dagrayvid fcami kpouget npalaska ccamacho poornima-sivanand maljazaery drewrip huntergerlach chengyuzhu6 kelchen123

llm-load-test's Issues

Add dataset config to set min input tokens

It would be useful to optionally test only with long inputs. It would also be helpful to avoid errors due to too-long sequence lengths. We can solve both of these issues by adding two new config.yaml options to filter the dataset based on:

minimum input token length
max sequence length

Capture the model's output token length

Be able to capture the model's output token length in ghz.
This depends on ghz being able to do it (investigation / upstream PR etc).

Add initial CI

Add initial CI so that the linter runs on all PRs

Make the system prompt and prompt format configurable

Add a field to the config.yaml file for specifying a string template that is used to define the prompt / system prompt format for the inputs coming from the dataset. This would also require regenerating the dataset to remove the system prompts which are currently hard-coded for llama-type models.

Interrupt ongoing requests at end of test

Currently at the end of the test duration, the main process waits for the user processes to finish all active requests. This behavior can produce strange results when load test concurrency goes above the maximum batch size that the runtime can handle for a given model. In cases like these, the server side throughput looks lower because of the time spent finishing up the last few pending requests, not fully utilizing the server side resources.

Some potential solutions:

In streaming case, user processes can check if the test is over between each token
Main process can communicate expected end time to the user processes, and user processes can add a timeout to the http requests based on the end time of the test.
Keep existing test behavior and filter out the results for requests that ended after the test end time in the results processing code.

Add evaluation metrics on the test dataset

It will be nice for the runtime performance benchamrking tool to also track some evaluation metrics on the test dataset. This can help shed light on cases like if the improved performance is coming at the cost of degradation in evaluation metrics. Reporting evaluation metrics along with runtime performance metrics (throughput/latency) will provide a more comprehensive picture.

add the ability to warm-up Model Mesh so that all pods are able to serve the model equally

Parallel execution

launch multiple (ghz, etc) instances in parallel with different parameters (inference query, nb users, rps, etc)