Comments (21)

beniwohli commented on May 12, 2024

Awesome! I'll try to have a first implementation of this in the Python agent, and a view that goes over the limit in opbeans-python, so the Kibana team has a test case.

roncohen commented on May 12, 2024

@watson the idea is that we can add the total recorded span count later. See @simitt's initial example.

roncohen commented on May 12, 2024

For Python on Opbeat, we grouped traces together based on their name to avoid this problem. That introduced a different problem: it's very hard to group traces in a good way, and you sometimes want to see the full picture. And as you mentioned, if we go back to grouping, querying will become more complex.

Here's my proposal for an initial fix:

We introduce a trace count limit in the agents. A good default could be 1000 traces. When the agent hits 1000 traces for a single transaction, it sets a property on the transaction to indicate that it hit the limit, and starting more traces in that transaction becomes a noop. The full duration of the transaction is still recorded. Using the "limit hit" property on the transaction, we can show clearly in the UI that the transaction went over the limit. The idea is that if you have that many traces, most of them are essentially the same and they take up the bulk of the time, so you'd want to fix that before diving into other performance problems for that transaction. You might have maybe 200 regular traces while the last 800 are just the same trace over and over, and that repetition is what you'd want to fix. This way we still highlight the problem without complicating the ES queries or the intake API, and without the not-quite-right grouping in the agents.

The UI that shows the timeline view will have an empty space between hitting the limit and the end of the transaction. To make the UI better, we could consider starting a big trace when we hit the limit. That trace would run to the end of the transaction and be named something like "trace limit hit" to indicate that more went on but we didn't record it.
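
A minimal sketch of how an agent could implement this, assuming hypothetical Transaction and Trace classes (these names are invented for illustration, not taken from an actual agent API):

class Trace:
    def __init__(self, name):
        self.name = name

class Transaction:
    def __init__(self, name, max_traces=1000):
        self.name = name
        self.max_traces = max_traces  # the proposed (configurable) limit
        self.traces = []
        self.trace_limit_hit = False  # serialized, so the UI can flag it

    def begin_trace(self, name):
        if len(self.traces) >= self.max_traces:
            # Over the limit: starting more traces is a noop, but the
            # transaction itself keeps recording its full duration.
            self.trace_limit_hit = True
            return None
        trace = Trace(name)
        self.traces.append(trace)
        return trace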

beniwohli commented on May 12, 2024

That seems like a nice compromise! I like!

roncohen commented on May 12, 2024

@mikker @watson @jahtalab WDYT?

mikker commented on May 12, 2024

I think your suggestion sounds very reasonable 👍

The only thing it doesn't take into account is whether the high trace count is on purpose. And I'm fine with just not supporting that for now. 1000 seems like it's already way over any regular use (?)

watson commented on May 12, 2024

Yeah, I think a configurable maxTracesPerTransaction option is a good idea. Whether the default should be 1000 or maybe 500, I'm not sure; I guess it depends on the language?

gingerwizard commented on May 12, 2024

Is there any possibility we could include rollup metrics for the traces on the source transaction? E.g. number of traces ignored? Average trace time? Percentile metrics, etc.? I can't immediately think of uses for the latter two, but the former would be good to know. Even being able to plot the number of traces per transaction is potentially interesting - in Django, it might point to some poor use of the framework, for example.

roncohen commented on May 12, 2024

@gingerwizard I like that idea. We could have a "dropped_traces_count" on the transaction or similar.

Taking it further, it would possibly be nice to know which types of traces were dropped:

dropped_traces_counts: {
  db: 23,
  cache: 940,
}

I would file that under nice-to-have though.

gingerwizard commented on May 12, 2024

Yes, dropped_traces_counts and total_traces_counts would be sufficient initially.

hmdhk commented on May 12, 2024

On the frontend we try to group traces that are "continuously similar and insignificant" and show them as one trace (that contains the end time of the last trace and the count for traces grouped together). We consider the trace's type, signature, how small it is in the context of the whole transaction and how far away it is (timewise) from the last trace we grouped.
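
Roughly, the grouping heuristic described here could look like the following sketch (the dataclass, function, and thresholds are invented for illustration; the real frontend logic differs in detail):

from dataclasses import dataclass

@dataclass
class Trace:
    type: str
    signature: str
    start: float  # ms, relative to the transaction start
    end: float

def should_group(prev, trace, transaction_duration,
                 size_threshold=0.05, gap_threshold=25.0):
    # Decide whether `trace` joins the group ended by `prev`.
    if prev is None:
        return False
    # Similar: same type and signature as the previous trace.
    if (trace.type, trace.signature) != (prev.type, prev.signature):
        return False
    # Insignificant: small in the context of the whole transaction.
    if (trace.end - trace.start) / transaction_duration > size_threshold:
        return False
    # Continuous: close (time-wise) to the last trace we grouped.
    return trace.start - prev.end <= gap_threshold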

@roncohen, I like the compromise you suggested and I think it also makes sense to have a limit on the frontend even though we have a grouping algorithm already.

roncohen commented on May 12, 2024

@simitt will you write up some high-level notes on how you plan to implement this?

simitt commented on May 12, 2024

Following the discussions and looking into the current structure, what do you think about the following?

Intake API Transaction

{
    "service": {},
    "system": {},
    "transactions": [
        {
            "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
            "name": "GET /api/types",
            "context": { ... },
            "spans": {
                "sampled": [
                    {
                        "id": 0,
                        "group": "database",
                        "parent": null,
                        "name": "SELECT FROM product_types",
                        ...
                    },
                    ...
                ],
                "count": {
                    "recorded": {
                        "overall": 291,
                        "database": 45,
                        "cache": 231,
                        "misc": 15
                    },
                    "sampled":{
                        "overall": 223,
                        "database": 22,
                        "cache": 186,
                        "misc": 15
                    }
                }
            }
        }
    ]
}

Elasticsearch Transaction Document

{
    "context": { ... },
    "processor": {
        "event": "transaction",
        "name": "transaction"
    },
    "transaction": {
        "duration": {
            "us": 32592
        },
        "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
        "name": "GET /api/types",
        "result": "success",
        "type": "request",
        "spans": {
            "count": {
                "recorded": {
                    "overall": 291,
                    "database": 45,
                    "cache": 231,
                    "misc": 15
                },
                "sampled":{
                    "overall": 223,
                    "database": 22,
                    "cache": 186,
                    "misc": 15
                },
                "dropped":{
                    "overall": 68,
                    "database": 23,
                    "cache": 45
                }
            }
        }
        ...
    }
}

I added a group attribute to every span, which is used for splitting the counts, e.g. database, cache, etc. This reflects the count splitting that @roncohen mentioned as a nice-to-have; it could be added by the agents in the future, and for now the agents could send only the overall counts.

I am aware that this would mean quite a few changes in the agents, server, and UI, but I think adding the count information as an attribute of a spans object would be the cleanest thing to do.
In case sampling and dropping are also added to stacktraces at some point, we could reuse the same structure and terminology.

roncohen commented on May 12, 2024

@simitt good draft.

  • I think adding sampled here (in both places) could get confusing when we also have it on transactions. Also, I didn't think of this as sampling spans, more as a sort of protection against edge cases with exceptionally many spans, where we'll need to drop some. However, I'm having a really hard time coming up with better names. Not super keen on items or list either. Any ideas?
  • For counts, I suggest we start with just spans.count.dropped.total, both in the intake API and in the Elasticsearch doc.

beniwohli commented on May 12, 2024

I feel like this is a fairly invasive change, less than 10 work days before GA feature freeze (and also BC1, IIRC). And it's not like we don't have anything else to do.

What speaks against @roncohen's suggestion in #280 (comment)? The only change would be to add an optional property on the transaction.

roncohen commented on May 12, 2024

Yeah, I've also had second thoughts overnight.
@simitt, how do you feel about only adding something like dropped_spans.total to the transaction?

simitt commented on May 12, 2024

Based on an offline discussion, we will only add a simple count for now:

Suggestion:

Intake API Transaction

{
    "service": {},
    "system": {},
    "transactions": [
        {
            "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
            "name": "GET /api/types",
            "span_count": {
                "dropped": {
                    "total": 2
                }
            },
            "context": { ... },
            "spans": [...]
        }
    ]
}

Elasticsearch Transaction Document

{
    "context": { ... },
    "processor": {
        "event": "transaction",
        "name": "transaction"
    },
    "transaction": {
        "duration": {
            "us": 32592
        },
        "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
        "name": "GET /api/types",
        "result": "success",
        "type": "request",
        "span_count": {
             "dropped": {
                 "total": 2
             }
        },
        ...
    }
}
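
On the agent side, filling in this field could be as simple as counting the spans that become noops once the limit is reached. A hedged sketch (class and method names invented, not actual agent code):

class Transaction:
    def __init__(self, max_spans=500):
        self.spans = []
        self.max_spans = max_spans
        self.dropped_spans = 0

    def start_span(self, name):
        if len(self.spans) >= self.max_spans:
            self.dropped_spans += 1  # the span becomes a noop, but is counted
            return None
        span = {"name": name}
        self.spans.append(span)
        return span

    def to_dict(self):
        # Matches the span_count.dropped.total shape proposed above.
        return {
            "span_count": {"dropped": {"total": self.dropped_spans}},
            "spans": self.spans,
        }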

roncohen commented on May 12, 2024

Thanks @simitt!

watson commented on May 12, 2024

I'm not sure what to think of the property name span_count. I get that the nested dropped.total object is there so we can add more properties later without making it a breaking change, but I'm not sure what purpose span_count itself serves.

How about simply:

"dropped_spans": {
    "total": 42
}

beniwohli commented on May 12, 2024

These deeply nested structures are only slightly awkward in Python 😁

result['span_count'] = {'dropped': {'total': self.dropped_spans}}

beniwohli commented on May 12, 2024

First stab at implementing this: elastic/apm-agent-python#127

I'm also OK with @watson's suggestion; no strong feelings either way.
