Comments (21)
Awesome! I'll try to have a first implementation of this in the Python agent, plus a view that goes over the limit in opbeans-python, so the Kibana team has a test case
from apm-server.
@watson the idea is that we can add the total recorded span count later. See @simitt's initial example.
For Python on Opbeat we grouped traces together based on their name in order to avoid this problem. That introduced another problem: it's very hard to group traces in a good way, and you sometimes want to see the full picture. And as you mentioned, if we go back to grouping, querying will become more complex.
Here's my proposal for an initial fix:
We introduce a trace count limit in the agents; a good default could be 1000 traces. When the agent hits 1000 traces for a single transaction, it sets a property on the transaction to indicate that it hit the limit, and starting more traces in that transaction becomes a no-op. The full duration of the transaction is still recorded.

Using the "limit hit" property on the transaction, we can show clearly in the UI that the transaction went over the limit. The idea is that if you have many traces, many of them will be essentially the same, they will take up the bulk of the time, and you'd want to fix that before diving into other performance problems for that transaction. So you'll have maybe 200 regular traces while the last 800 are just the same trace over and over, and that's something you'd want to fix. This way we still highlight the problem without complicating ES queries or the intake API, and we avoid having not-quite-right grouping in the agents.
The UI that shows the timeline view will have an empty space between hitting the limit and ending the transaction. To make the UI better, we could consider starting a big trace when we hit the limit. That trace would run to the end of the transaction and be named something like `trace limit hit` to indicate that more stuff went on but we didn't record it.
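A minimal sketch of the proposed mechanism might look like the following. All names here (`MAX_TRACES_PER_TRANSACTION`, `Transaction`, `begin_trace`) are illustrative assumptions, not the real agent API:

```python
# Hypothetical sketch of the proposed agent-side limit; names are
# illustrative, not the actual elastic-apm agent API.
MAX_TRACES_PER_TRANSACTION = 1000  # proposed default, possibly configurable


class Trace:
    def __init__(self, name):
        self.name = name


class Transaction:
    def __init__(self):
        self.traces = []
        self.trace_limit_hit = False  # property surfaced to the UI

    def begin_trace(self, name):
        if len(self.traces) >= MAX_TRACES_PER_TRANSACTION:
            # mark the transaction and make further traces a no-op;
            # the transaction's own duration is still recorded elsewhere
            self.trace_limit_hit = True
            return None
        trace = Trace(name)
        self.traces.append(trace)
        return trace
```

The key design point is that hitting the limit degrades gracefully: instrumentation calls keep working, they just stop recording.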
That seems like a nice compromise! I like!
@mikker @watson @jahtalab WDYT?
I think your suggestion sounds very reasonable 👍
The only thing it doesn't take into account is whether the high trace count is on purpose. And I'm fine with just not supporting that for now; 1000 seems like it's already way over any regular use (?)
Yeah, I think a configurable `maxTracesPerTransaction` option is a good idea. Whether the default should be 1000 or maybe 500, I'm not sure. I guess it depends on the language?
Is there any possibility we could include rollup metrics for the traces on the source transaction? E.g. number of traces ignored, average trace time, percentile metrics, etc. I can't immediately think of uses for the latter two, but the former would be good to know. Even being able to plot the number of traces per transaction is potentially interesting; in Django, for example, it may point to poor use of the framework.
@gingerwizard I like that idea. We could have a `dropped_traces_count` on the transaction or similar.
Taking it further, it would possibly be nice to know which types of traces were dropped:

```
dropped_traces_counts: {
  db: 23,
  cache: 940,
}
```

I would file that under nice-to-have though.
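On the agent side, tallying drops by type could be as simple as the following sketch. The class and method names are hypothetical, not part of any real agent:

```python
from collections import Counter


# Illustrative sketch only: once the limit is hit, tally dropped traces
# by type so the agent can report e.g. {"db": 23, "cache": 940}.
class DroppedTraceCounter:
    def __init__(self):
        self._counts = Counter()
        self.total = 0

    def record_drop(self, trace_type):
        self._counts[trace_type] += 1
        self.total += 1

    def as_payload(self):
        # shape mirrors the dropped_traces_counts example above
        return dict(self._counts)
```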
Yes, `dropped_traces_counts` and `total_traces_counts` would be sufficient initially.
On the frontend we try to group traces that are "continuously similar and insignificant" and show them as one trace (that contains the end time of the last trace and the count of traces grouped together). We consider the trace's `type`, `signature`, how small it is in the context of the whole transaction, and how far away it is (time-wise) from the last trace we grouped.
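The grouping heuristic described above could be sketched roughly as follows. The thresholds (`max_relative_size`, `max_gap_us`) and the field names are made-up assumptions for illustration, not the real frontend implementation:

```python
# Rough sketch of grouping "continuously similar and insignificant" spans.
# Input spans are assumed sorted by start time; thresholds are invented.
def group_spans(spans, transaction_duration_us,
                max_relative_size=0.02, max_gap_us=5_000):
    groups = []
    for span in spans:
        # "insignificant": small relative to the whole transaction
        insignificant = (
            span["duration_us"] / transaction_duration_us <= max_relative_size
        )
        prev = groups[-1] if groups else None
        # "continuously similar": same type and signature, close in time,
        # and the previous group is itself made of insignificant spans
        similar = (
            prev is not None
            and prev["insignificant"]
            and prev["type"] == span["type"]
            and prev["signature"] == span["signature"]
            and span["start_us"] - prev["end_us"] <= max_gap_us
        )
        if insignificant and similar:
            prev["count"] += 1
            prev["end_us"] = span["start_us"] + span["duration_us"]
        else:
            groups.append({
                "type": span["type"],
                "signature": span["signature"],
                "start_us": span["start_us"],
                "end_us": span["start_us"] + span["duration_us"],
                "count": 1,
                "insignificant": insignificant,
            })
    return groups
```

Each emitted group keeps the end time of its last span and the number of spans folded into it, matching the behavior described above.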
@roncohen, I like the compromise you suggested and I think it also makes sense to have a limit on the frontend even though we have a grouping algorithm already.
@simitt will you write up some high-level notes on how you plan to implement this?
Following the discussions and looking into the current structure, what do you think about the following?
Intake API Transaction

```json
{
  "service": {},
  "system": {},
  "transactions": [
    {
      "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
      "name": "GET /api/types",
      "context": { ... },
      "spans": {
        "sampled": [
          {
            "id": 0,
            "group": "database",
            "parent": null,
            "name": "SELECT FROM product_types",
            ...
          },
          ...
        ],
        "count": {
          "recorded": {
            "overall": 291,
            "database": 45,
            "cache": 231,
            "misc": 15
          },
          "sampled": {
            "overall": 223,
            "database": 22,
            "cache": 186,
            "misc": 15
          }
        }
      }
    }
  ]
}
```
Elasticsearch Transaction Document

```json
{
  "context": { ... },
  "processor": {
    "event": "transaction",
    "name": "transaction"
  },
  "transaction": {
    "duration": {
      "us": 32592
    },
    "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
    "name": "GET /api/types",
    "result": "success",
    "type": "request",
    "spans": {
      "count": {
        "recorded": {
          "overall": 291,
          "database": 45,
          "cache": 231,
          "misc": 15
        },
        "sampled": {
          "overall": 223,
          "database": 22,
          "cache": 186,
          "misc": 15
        },
        "dropped": {
          "overall": 68,
          "database": 23,
          "cache": 45
        }
      }
    },
    ...
  }
}
```
I added a `group` attribute to every span that is used for splitting the counts, e.g. `database`, `cache`, etc. This reflects the split counts mentioned as a nice-to-have by @roncohen, which could be added by the agents in the future; for now the agents could send only the `overall` counts.

I am aware that this would mean quite some changes in the agents, server and UI, but I think adding the `count` information as an attribute of a `spans` object would be the cleanest thing to do. In case sampling and dropping is also added to `stacktraces` at some point, we could reuse the same structure and terminology.
@simitt good draft.
- I think adding `sampled` here (in both places) could get confusing when we also have it on transactions. Also, I didn't think of this as sampling spans, more as a sort of protection against edge cases with exceptionally many spans where we'll need to drop some. However, I'm having a real hard time coming up with better names. Not super keen on `items` or `list` either. Any ideas?
- For counts, I suggest we start with just `spans.count.dropped.total`, both in the intake API and in the Elasticsearch doc.
I feel like this is a fairly invasive change, less than 10 working days before the GA feature freeze (and also BC1, IIRC). And it's not like we don't have anything else to do.
What speaks against @roncohen's suggestion in #280 (comment)? The only change would be to add an optional property on the transaction.
Yeah, I've also had second thoughts overnight.
@simitt how do you feel about only adding something like `dropped_spans.total` to the transaction?
Based on an offline discussion we will only add a simple count for now:
Suggestion:
Intake API Transaction

```json
{
  "service": {},
  "system": {},
  "transactions": [
    {
      "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
      "name": "GET /api/types",
      "span_count": {
        "dropped": {
          "total": 2
        }
      },
      "context": { ... },
      "spans": [...]
    }
  ]
}
```
Elasticsearch Transaction Document

```json
{
  "context": { ... },
  "processor": {
    "event": "transaction",
    "name": "transaction"
  },
  "transaction": {
    "duration": {
      "us": 32592
    },
    "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
    "name": "GET /api/types",
    "result": "success",
    "type": "request",
    "span_count": {
      "dropped": {
        "total": 2
      }
    },
    ...
  }
}
```
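An agent emitting this simplified shape might look like the sketch below. The function name and parameters are hypothetical, introduced only to illustrate that `span_count` is an optional property:

```python
# Minimal sketch (assumed names, not real agent serialization code) of
# building a transaction payload with the optional span_count property.
def transaction_payload(transaction_id, name, spans, dropped_total=0):
    payload = {
        "id": transaction_id,
        "name": name,
        "spans": spans,
    }
    # span_count is optional: only included when spans were actually dropped
    if dropped_total:
        payload["span_count"] = {"dropped": {"total": dropped_total}}
    return payload
```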
thanks @simitt !
I'm not sure what to think of the property name `span_count`. I get that the nested object `dropped.total` is there so that we can add more properties later without making it a breaking change, but I'm not sure what purpose `span_count` serves. How about simply:

```json
"dropped_spans": {
  "total": 42
}
```
These deeply nested structures are only slightly awkward in Python 😁

```python
result['span_count'] = {'dropped': {'total': self.dropped_spans}}
```
First stab at implementing this: elastic/apm-agent-python#127
I'm also OK with @watson's suggestion, no strong feelings either way