
py-grpc-prometheus's Introduction

py-grpc-prometheus

Instrumentation library that provides Prometheus metrics for gRPC, similar to the Java and Go gRPC Prometheus libraries.

Status

Currently, the library has metric parity with the Java and Go libraries.

Server side:

  • grpc_server_started_total
  • grpc_server_handled_total
  • grpc_server_msg_received_total
  • grpc_server_msg_sent_total
  • grpc_server_handling_seconds

Client side:

  • grpc_client_started_total
  • grpc_client_handled_total
  • grpc_client_msg_received_total
  • grpc_client_msg_sent_total
  • grpc_client_handling_seconds
  • grpc_client_msg_recv_handling_seconds
  • grpc_client_msg_send_handling_seconds

How to use

pip install py-grpc-prometheus

Client side:

Client metrics monitoring is done by intercepting the gRPC channel.

import grpc
from prometheus_client import start_http_server
from py_grpc_prometheus.prometheus_client_interceptor import PromClientInterceptor

channel = grpc.intercept_channel(grpc.insecure_channel('server:6565'),
                                 PromClientInterceptor())
# Start an endpoint to expose metrics.
start_http_server(metrics_port)

Server side:

Server metrics are exposed by adding the interceptor when the gRPC server is started. Take a look at tests/integration/hello_world/hello_world_server.py for the complete example.

import grpc
from concurrent import futures
from py_grpc_prometheus.prometheus_server_interceptor import PromServerInterceptor
from prometheus_client import start_http_server

Start the gRPC server with the interceptor:

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10),
                         interceptors=(PromServerInterceptor(),))
# Start an endpoint to expose metrics.
start_http_server(metrics_port)

Histograms

Prometheus histograms are a great way to measure latency distributions of your RPCs. However, since it is bad practice to have metrics of high cardinality, latency monitoring metrics are disabled by default. To enable them, pass the following flag when initializing the interceptor:

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10),
                     interceptors=(PromServerInterceptor(enable_handling_time_histogram=True),))

After the call completes, its handling time will be recorded in a Prometheus histogram variable grpc_server_handling_seconds. The histogram variable contains three sub-metrics:

  • grpc_server_handling_seconds_count - the count of all completed RPCs by status and method
  • grpc_server_handling_seconds_sum - cumulative time of RPCs by status and method, useful for calculating average handling times
  • grpc_server_handling_seconds_bucket - contains the counts of RPCs by status and method in respective handling-time buckets. These buckets can be used by Prometheus to estimate SLAs (see here)
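Assuming Prometheus is scraping these metrics, the sub-metrics above can be queried with standard PromQL (the label names below follow the go-grpc-prometheus conventions):

```promql
# 99th-percentile server handling latency per method over the last 5 minutes.
histogram_quantile(0.99,
  sum(rate(grpc_server_handling_seconds_bucket[5m])) by (grpc_service, grpc_method, le))

# Average handling time per method (sum / count).
sum(rate(grpc_server_handling_seconds_sum[5m])) by (grpc_method)
  /
sum(rate(grpc_server_handling_seconds_count[5m])) by (grpc_method)
```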

Server Side:

  • enable_handling_time_histogram: Enables 'grpc_server_handling_seconds'

Client Side:

  • enable_client_handling_time_histogram: Enables 'grpc_client_handling_seconds'
  • enable_client_stream_receive_time_histogram: Enables 'grpc_client_msg_recv_handling_seconds'
  • enable_client_stream_send_time_histogram: Enables 'grpc_client_msg_send_handling_seconds'

Legacy metrics:

Metric names have been updated to be in line with those from https://github.com/grpc-ecosystem/go-grpc-prometheus.

The legacy metrics are:

server side:

  • grpc_server_started_total
  • grpc_server_handled_total
  • grpc_server_handled_latency_seconds
  • grpc_server_msg_received_total
  • grpc_server_msg_sent_total

client side:

  • grpc_client_started_total
  • grpc_client_completed
  • grpc_client_completed_latency_seconds
  • grpc_client_msg_sent_total
  • grpc_client_msg_received_total

To use these legacy metrics for backwards compatibility, set the legacy flag to True when initializing the server or client interceptor.

For example, to enable the server side legacy metrics:

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10),
                     interceptors=(PromServerInterceptor(legacy=True),))

How to run and test

make initialize-development
make test

TODO:

Reference

py-grpc-prometheus's People

Contributors

awong-evoiq, lchenn, popart, ryansiu1995, sdanzan, tiernan-stapleton


py-grpc-prometheus's Issues

Multiprocess mode does not work

MAX_MESSAGE_LENGTH = 41943040
# prometheus related, referenced from: https://github.com/lchenn/py-grpc-prometheus#server-side-1
server = grpc.server(thread_pool=futures.ThreadPoolExecutor(max_workers=_THREAD_CONCURRENCY),
                     options=[('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),
                              ('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH),
                              ('grpc.so_reuseport', 1)],
                     interceptors=(PromServerInterceptor(),))
port = 50051
bind_address = '0.0.0.0:{}'.format(port)
_LOGGER.info("Binding to '%s'", bind_address)
sys.stdout.flush()
workers = []
for _ in range(_PROCESS_COUNT):
    # NOTE: It is imperative that the worker subprocesses be forked before
    # any gRPC servers start up. See
    # https://github.com/grpc/grpc/issues/16001 for more details.
    worker = multiprocessing.Process(
        target=_run_server, args=(bind_address,))
    worker.start()
    workers.append(worker)
for worker in workers:
    worker.join()
root@4c633109246e:~# curl http://localhost:50061
# HELP grpc_server_msg_received_total Histogram of response latency (seconds) of gRPC that had been application-level handled by the server.
# TYPE grpc_server_msg_received_total counter
# HELP grpc_server_started_total Total number of RPCs started on the server.
# TYPE grpc_server_started_total counter
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2423529472.0
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 144318464.0
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1565093071.54
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 4.57
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1048576.0
# HELP grpc_server_msg_sent_total Total number of stream messages sent by the server.
# TYPE grpc_server_msg_sent_total counter
# HELP grpc_server_handled_latency_seconds Histogram of response latency (seconds) of gRPC that had been application-level handled by the server
# TYPE grpc_server_handled_latency_seconds histogram
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="2",minor="7",patchlevel="16",version="2.7.16"} 1.0

The gRPC metrics appear without values!

Asyncio support

Currently the library doesn't support any of the grpc.aio stuff. Would you be interested in that as a contribution? If so, how long until it shows up in a new release?

Time histogram buckets

Is it possible to change the histogram buckets when using enable_handling_time_histogram?

Repeated Status Code Name Extraction

When an exception is raised, it returns this error:

Traceback (most recent call last):
  File "/home/appuser/.local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/home/appuser/.local/lib/python3.8/site-packages/py_grpc_prometheus/prometheus_server_interceptor.py", line 95, in new_behavior
    self._compute_error_code(e).name)
AttributeError: "str" object has no attribute "name"

This is because _compute_error_code is returning a string instead of StatusCode enum.

grpc_server_handling_seconds Histogram

No grpc_server_handling_seconds metrics found when enable_handling_time_histogram has been set to True.

CODE:

interceptor = PromServerInterceptor(enable_handling_time_histogram=True)

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=max_workers),
    interceptors=(interceptor,))
start_http_server(metrics_port)

OUTPUT:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 348.0
python_gc_objects_collected_total{generation="1"} 449.0
python_gc_objects_collected_total{generation="2"} 10.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 141.0
python_gc_collections_total{generation="1"} 12.0
python_gc_collections_total{generation="2"} 1.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="6",patchlevel="9",version="3.6.9"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 9.11884288e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 8.0941056e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.66564437377e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.56
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 15.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
# HELP grpc_server_started_total Total number of RPCs started on the server.
# TYPE grpc_server_started_total counter
# HELP grpc_server_msg_received_total Total number of RPC stream messages received on the server.
# TYPE grpc_server_msg_received_total counter
# HELP grpc_server_msg_sent_total Total number of gRPC stream messages sent by the server.
# TYPE grpc_server_msg_sent_total counter
# HELP grpc_server_handling_seconds Histogram of response latency (seconds) of gRPC that had been application-level handled by the server.
# TYPE grpc_server_handling_seconds histogram

Monitor gRPC server queue size

Hi!

First, thanks for the contribution so far.

It is well known that the balance between max_workers, maximum_concurrent_rpcs, and the number of pods/servers can be tricky. One important way to measure how this balance is doing is to monitor the work queue size in the thread pool. It would be a great addition to this lib.

I have hacked my way around it for my use case, and I'm open to contributing here. What do you think? A rough sketch of the code follows:

class ThreadPoolQueueMetricServerInterceptor(PromServerInterceptor):
    """This interceptor measures the size of the work queue in the
        gRPC server thread pool, by the end of an execution.
    """
    ...
    def _measure_queue_size(self):
        # From the current thread, find its thread pool, and fetch its current queue size.
        thread = current_thread()
        # A thread has target arguments as follows. We get the work queue.
        # executor_reference, work_queue, initializer, initargs
        _, work_queue, _, _ = thread._args
        self.queue_size_gauge.set(work_queue.qsize())

    def intercept_service(self, continuation, handler_call_details):

        def wrapper(behavior_fn, *args):
            def new_behavior(request, context):
                # Call original behaviour
                response_or_iterator = behavior_fn(request, context)

                self._measure_queue_size()

                return response_or_iterator

            return new_behavior

        return self._wrap_rpc_behavior(continuation(handler_call_details), wrapper)
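The queue introspection the sketch relies on can be demonstrated outside gRPC with just the standard library; note that `_work_queue` (like `thread._args` above) is a private CPython implementation detail and may change between versions:

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)
for _ in range(3):
    # One task runs immediately; the rest wait in the executor's work queue.
    pool.submit(time.sleep, 0.2)

# _work_queue is private; this is the same kind of introspection the
# interceptor sketch above performs from inside a worker thread.
queue_size = pool._work_queue.qsize()
print(queue_size)
pool.shutdown(wait=True)
```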

incrementing grpc_server_msg_received_total and grpc_server_msg_sent_total in unary methods

👋 hello @lchenn

firstly thank you for writing this library 🙇

when comparing the py-grpc-prometheus library to the go-grpc-prometheus library, you can see that the go version increments the grpc_server_msg_received_total and grpc_server_msg_sent_total metrics for unary and streaming methods. in the py-grpc-prometheus library, it seems that we only increment these metrics if the method itself is a streaming type of method.

Compare the go version:
https://github.com/grpc-ecosystem/go-grpc-prometheus/blob/82c243799c991a7d5859215fba44a81834a52a71/server_metrics.go#L103-L116

To the python version (received):
https://github.com/lchenn/py-grpc-prometheus/blob/master/py_grpc_prometheus/prometheus_server_interceptor.py#L53-L59

And here (sent):
https://github.com/lchenn/py-grpc-prometheus/blob/master/py_grpc_prometheus/prometheus_server_interceptor.py#L69-L77

could we update this so that the msg received/sent metrics are also updated for unary methods? my understanding is that grpc_server_msg_received_total should always be incremented before going into the handler. if the handler function finishes without error, we increment the handler metric. then, finally, just before our "final" return, we increment grpc_server_msg_sent_total. this understanding might be incorrect, but that is what i thought from reading the go version of the prometheus interceptor.

best,
xander
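The ordering described in this issue can be sketched in pure Python. Plain tallies stand in for the Prometheus counters here; this illustrates the go-grpc-prometheus semantics being requested, not the library's actual code:

```python
from collections import Counter

counts = Counter()

def instrument_unary(method_name, handler):
    """Wrap a unary handler so the message counters fire for it too."""
    def new_behavior(request, context):
        counts[('received', method_name)] += 1   # before the handler runs
        response = handler(request, context)
        counts[('sent', method_name)] += 1       # just before the final return
        return response
    return new_behavior

# Hypothetical unary handler standing in for a generated servicer method.
say_hello = instrument_unary('SayHello', lambda req, ctx: 'Hello, %s!' % req)
reply = say_hello('world', None)
print(reply, dict(counts))
```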

Server non-RPC exceptions are not logged as StatusCode.INTERNAL grpc_code

Whenever the server's RPC implementation fails to process a query (say, due to a KeyError), no metric is generated.

It would seem natural to me to generate such metric: grpc_server_handled_total{grpc_code="StatusCode.INTERNAL",grpc_method="METHOD",grpc_service="SERVICE",grpc_type="TYPE"} <N>

Integration with OpenTelemetry Autotracing

When I tried to integrate OpenTelemetry with this interceptor, it threw the following exception.

Traceback (most recent call last):
  File "/home/appuser/.local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/home/appuser/.local/lib/python3.8/site-packages/opentelemetry/instrumentation/grpc/_server.py", line 273, in telemetry_interceptor
    raise error
  File "/home/appuser/.local/lib/python3.8/site-packages/opentelemetry/instrumentation/grpc/_server.py", line 264, in telemetry_interceptor
    return behavior(request_or_iterator, context)
  File "/home/appuser/.local/lib/python3.8/site-packages/py_grpc_prometheus/prometheus_server_interceptor.py", line 88, in new_behavior
    self._compute_status_code(
  File "/home/appuser/.local/lib/python3.8/site-packages/py_grpc_prometheus/prometheus_server_interceptor.py", line 122, in _compute_status_code
    if servicer_context._state.client == "cancelled":
AttributeError: '_OpenTelemetryServicerContext' object has no attribute '_state'

I propose adding an environment variable or a configuration option on the interceptor to ignore all exceptions raised inside the interceptor. Instead, those exceptions could either be streamed to the logger or muted. All in all, the interceptor should not fail the request when it has an issue with it.
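A minimal sketch of the proposed fail-open behavior, using a hypothetical helper around the metric-recording calls (not part of the library):

```python
import logging

logger = logging.getLogger('py_grpc_prometheus')

def record_safely(record_metric):
    """Run a metric-recording callable; log and swallow any failure so
    instrumentation can never fail the RPC itself."""
    try:
        record_metric()
    except Exception:
        logger.exception('metrics recording failed; continuing with the RPC')

def new_behavior(request, context, handler):
    response = handler(request, context)   # the real RPC work
    record_safely(lambda: 1 / 0)           # a failing metric update, swallowed
    return response

result = new_behavior('ping', None, lambda req, ctx: req.upper())
print(result)
```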

"ValueError: Duplicated timeseries in CollectorRegistry" in latest version (v0.6.0)

The latest version, v0.6.0, raises the following error from prometheus_client/registry.py when using PromClientInterceptor():

"ValueError: Duplicated timeseries in CollectorRegistry: {'grpc_client_started_created', 'grpc_client_started', 'grpc_client_started_total'}"

I think I've narrowed down the issue to being a side effect of changes from #21,
specifically https://github.com/lchenn/py-grpc-prometheus/pull/21/files#diff-04b7eb91d0278b63925e45233c6beffcf6d3f4acb166c847d08982268c8cf111L4 where the metrics are changed from being defined at the file level to being defined per interceptor instance via init_metrics(). This is causing duplication of metrics registration when the PromClientInterceptor is defined multiple times to intercept different GRPC calls.

I've included a simple test which reproduces the error:

import grpc
from py_grpc_prometheus.prometheus_client_interceptor import PromClientInterceptor

class TestMetric:
    def test__broken_metrics(self):
        a = grpc.intercept_channel(PromClientInterceptor())
        b = grpc.intercept_channel(PromClientInterceptor())

when run

$ pytest test_file.py 
================================================ test session starts ================================================
platform linux -- Python 3.8.6, pytest-6.2.2, py-1.9.0, pluggy-0.13.1
rootdir: /code
plugins: requests-mock-1.8.0, mock-3.5.1, grpc-0.8.0, cov-2.11.1, optional-tests-0.1.1, flake8-1.0.7
collected 1 item                                                                                                    

test_file.py F                                                                                                [100%]

===================================================== FAILURES ======================================================
__________________________________________ TestMetric.test__broken_metrics __________________________________________

self = <test_file.TestMetric object at 0x7f0cdfdf5040>

    def test__broken_metrics(self):
        a = grpc.intercept_channel(PromClientInterceptor())
>       b = grpc.intercept_channel(PromClientInterceptor())

test_file.py:7: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/code/venv/lib/python3.8/site-packages/py_grpc_prometheus/prometheus_client_interceptor.py:31: in __init__
    self._metrics = init_metrics(registry)
/code/venv/lib/python3.8/site-packages/py_grpc_prometheus/client_metrics.py:6: in init_metrics
    "grpc_client_started_counter": Counter(
/code/venv/lib/python3.8/site-packages/prometheus_client/metrics.py:121: in __init__
    registry.register(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <prometheus_client.registry.CollectorRegistry object at 0x7f0cdfe2ea30>
collector = prometheus_client.metrics.Counter(grpc_client_started)

    def register(self, collector):
        """Add a collector to the registry."""
        with self._lock:
            names = self._get_names(collector)
            duplicates = set(self._names_to_collectors).intersection(names)
            if duplicates:
>               raise ValueError(
                    'Duplicated timeseries in CollectorRegistry: {0}'.format(
                        duplicates))
E               ValueError: Duplicated timeseries in CollectorRegistry: {'grpc_client_started_total', 'grpc_client_started', 'grpc_client_started_created'}

/code/venv/lib/python3.8/site-packages/prometheus_client/registry.py:29: ValueError
============================================== short test summary info ==============================================
FAILED test_file.py::TestMetric::test__broken_metrics - ValueError: Duplicated timeseries in CollectorRegistry: {'...
================================================= 1 failed in 0.12s =================================================
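The collision can be reproduced without the library: prometheus_client registers three series names per Counter (the bare name plus `_total` and `_created`, matching the error message above) in a shared default registry, so a second registration of the same metric raises. A pure-stdlib mock of that duplicate check (an illustration, not prometheus_client's actual code):

```python
class MiniRegistry:
    """Stand-in for prometheus_client's CollectorRegistry duplicate check."""
    def __init__(self):
        self._names_to_collectors = set()

    def register(self, names):
        duplicates = self._names_to_collectors & set(names)
        if duplicates:
            raise ValueError(
                'Duplicated timeseries in CollectorRegistry: {0}'.format(duplicates))
        self._names_to_collectors |= set(names)

SERIES = ['grpc_client_started', 'grpc_client_started_total', 'grpc_client_started_created']

registry = MiniRegistry()
registry.register(SERIES)            # first interceptor instance: fine
try:
    registry.register(SERIES)        # second instance, same registry: collides
    collided = False
except ValueError:
    collided = True
print(collided)
```

Passing a fresh registry per interceptor instance (or registering the metrics once at module level, as before #21) avoids the collision.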
