newrelic / infra-integrations-sdk
New Relic Infrastructure Integrations SDK
License: Apache License 2.0
It would be very helpful to be able to mock out or override the global logger in the log package. Right now it is not trivial to test error paths (or other cases) that only log rather than return a certain value. Having a function in the log package that sets the global logger to any struct implementing the Logger interface would be helpful.
Something similar to this:
func SetGlobalLogger(logger Logger) {
globalLogger = logger
}
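To illustrate how that would help testing, here is a minimal, self-contained sketch; the Logger interface shape and method names here are assumptions for illustration, not the SDK's actual definitions:

```go
package main

import "fmt"

// Logger mirrors the shape of the interface in the SDK's log package
// (the method set here is an assumption for illustration).
type Logger interface {
	Errorf(format string, args ...interface{})
}

var globalLogger Logger

// SetGlobalLogger swaps the package-level logger, e.g. from a test.
func SetGlobalLogger(l Logger) { globalLogger = l }

// mockLogger records messages instead of printing them, so a test can
// assert on what was logged.
type mockLogger struct{ entries []string }

func (m *mockLogger) Errorf(format string, args ...interface{}) {
	m.entries = append(m.entries, fmt.Sprintf(format, args...))
}

func main() {
	mock := &mockLogger{}
	SetGlobalLogger(mock)
	globalLogger.Errorf("query failed: %s", "timeout")
	fmt.Println(len(mock.entries)) // prints 1: the error was captured, not printed
}
```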
Based on the help docs (https://docs.newrelic.com/docs/infrastructure/integrations-sdk/file-specifications/integration-executable-file-specifications#event-data), integration binaries should be able to emit "Event Set" data to allow tracking of "one-off messages for key activities on a system". However, I do not see any code in this SDK that facilitates or exemplifies what the format of that JSON might be.
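For what it's worth, the protocol v2 payloads quoted later in this thread do carry an events array; based on those payloads (not on an official schema), an Event Set entry appears to look like:

```json
{
  "events": [
    {
      "summary": "restart",
      "category": "status"
    }
  ]
}
```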
I spent a couple hours on this but due to the package variables (global sadness) and the fine grained locking, I realized I would likely have ended up completely rewriting the package. So instead I'll simply highlight the problem and allow someone to make an incremental change.
The JMX command is run here with cmd.Wait:
https://github.com/newrelic/infra-integrations-sdk/blob/master/jmx/jmx.go#L99
If that fails, it sends on the err channel from inside the goroutine, so Open will not return a non-nil error, since the Wait happens in a separate goroutine.
Then a doQuery func runs and writes a string to the os.Pipe; since it is a valid buffered pipe, the write will succeed if the command is in the process of failing (still listening on the pipe) but has not yet exited. So I think there might be a race condition.
https://github.com/newrelic/infra-integrations-sdk/blob/master/jmx/jmx.go#L135
So then we have the receiveResult func, which has a select. The select is random, so the first case statement can evaluate first and you'll get the "got empty result for query" error message, while the data sitting in the stderr pipe is silently dropped.
https://github.com/newrelic/infra-integrations-sdk/blob/master/jmx/jmx.go#L179
v2.1
go 1.10
There may be cases where the rate calculation is not done because the time between the arrival of metric values is more than the default (and fixed) TTL of 60 seconds.
Make the TTL configurable, with the default set to the current value of 60 seconds.
Please help us better understand this feature request by choosing a priority from the following options:
[Nice to Have]
Note: this points to the same codebase target as the Register/Entity work. Parallelizing this work could lead to merge conflicts and slowdowns in development.
Make DM pipeline production ready (error handling, rate limiting handling, perf testing)
Check if the Telemetry SDK can be configured to send data to the agent instead of the DM endpoint.
Proper reporting of errors for integrations (this was discussed during the Flex GA MMF development).
This includes public documentation for customers who are developing their own integrations (or for the FIT team), and enablement of the support team.
Epic description: https://docs.google.com/document/d/1MBvijlZBJx96ICbtj7_y9dZSegz1oaIqrKRm4DFIyj4/edit
Action items identified from Blue Medora's HAProxy PoC are included in the above MMF definition: DMI - BM's feedback about SDK v4.
There is no mechanism that removes metrics (after a TTL expired) from the storer file used to save this metric into the disk to calculate the deltas on each execution.
On initialization, the integration checks the TTL of the entire file and removes it if it has expired. But there are cases where metric identifiers are based on ephemeral entities, so new entries are constantly being created; and if the integration keeps executing normally, the current clean-up mechanism is never triggered, letting the file grow without control while the integration loads it on each execution.
Metrics on the storer should be garbage collected if they are not being updated after a TTL.
A similar thing is done in the Kubernetes integration. One example happens in nri-varnish, where backend entities can be ephemeral in some environments; there is more context in this issue.
When the storer is created it uses the name of the integration as the file name.
So if you have multiple discoveries, the same integration is spawned many times, and they all try to read and write the same storer JSON file.
Impact: all OHIs that use RATEs and DELTAs with container discovery.
Internal discussion: https://newrelic.slack.com/archives/C5A2QGLKT/p1613449983171900
Each executed integration should have its own store file.
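One possible sketch: derive the store file name from the integration name plus a per-instance discriminator (for example the discovered container ID), so concurrently discovered instances stop sharing one file. The naming scheme is an assumption, not a proposed final design:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// storeFileName builds a per-instance store file name by hashing a
// discriminator (container ID, args hash, ...) into the file name, so
// each discovered instance reads and writes its own JSON file.
func storeFileName(integrationName, instanceID string) string {
	h := fnv.New32a()
	h.Write([]byte(instanceID))
	return fmt.Sprintf("%s-%08x.json", integrationName, h.Sum32())
}

func main() {
	a := storeFileName("nri-haproxy", "container-aaa")
	b := storeFileName("nri-haproxy", "container-bbb")
	fmt.Println(a != b) // prints true: each instance gets its own file
}
```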
Set up multiple integration manifests from the same on-host integration.
Run the containerized agent.
Observe integration metrics in NR with incorrect values.
On a containerized environment with multiple ymls from the same on-host integration (e.g. Flex, HAProxy).
For SDK < v4.
Currently, if the upper bound of a bucket is inf, the bucket is discarded. I believe we should handle such ranges better to avoid losing values. For example, with a metric like:
# HELP powerdns_recursor_response_time_seconds Histogram of PowerDNS recursor response times in seconds.
# TYPE powerdns_recursor_response_time_seconds histogram
powerdns_recursor_response_time_seconds_bucket{le="0.001"} 0
powerdns_recursor_response_time_seconds_bucket{le="0.01"} 0
powerdns_recursor_response_time_seconds_bucket{le="0.1"} 0
powerdns_recursor_response_time_seconds_bucket{le="1"} 0
powerdns_recursor_response_time_seconds_bucket{le="+Inf"} 0
we would lose everything in the range 1 < x < inf.
infra-integrations-sdk/data/metric/metrics.go, line 245 (commit 786cf26)
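Since Prometheus histogram buckets are cumulative, the +Inf bucket carries the total observation count, and the tail above the last finite bound is recoverable as bucket[+Inf] - bucket[lastFinite]. A tiny sketch (the counts are made up for illustration; in the powerdns sample above they are all zero):

```go
package main

import "fmt"

// tailCount returns the number of observations strictly above the last
// finite bucket bound. Prometheus buckets are cumulative, so this is
// simply the difference between the +Inf bucket and the last finite one.
// Discarding the +Inf bucket loses exactly this information.
func tailCount(cumulative map[string]uint64) uint64 {
	return cumulative["+Inf"] - cumulative["1"]
}

func main() {
	buckets := map[string]uint64{
		"0.001": 0, "0.01": 0, "0.1": 0,
		"1":    3, // 3 observations took <= 1s
		"+Inf": 7, // 7 observations total
	}
	fmt.Println(tailCount(buckets)) // prints 4: observations above 1s that would be lost
}
```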
I am getting the following error from the SDK for the custom integration I am writing, which uses the metric.RATE option:
WARN[0000] Error setting value: Samples for queue.jobsPerSecond are too close in time, skipping sampling
WARN[0000] Error setting value: Samples for queue.jobsPerSecond are too close in time, skipping sampling
WARN[0000] Error setting value: Samples for queue.jobsPerSecond are too close in time, skipping sampling
The above example came from trying to add a metric named queue.jobsPerSecond to 4 different MetricSets. The problem seems to stem from the fact that names are global in the cache rather than unique per MetricSet, which means that MetricSets using the same metric name will collide.
Adding a unique prefix per MetricSet to the key would get rid of the problem.
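A sketch of what namespacing the cache key could look like; the key layout is illustrative, not the SDK's actual cache format:

```go
package main

import "fmt"

// cacheKey namespaces a metric name with identifying attributes of its
// MetricSet, so equal metric names in different sets no longer share a
// cache entry (which is what triggers the "too close in time" warning).
func cacheKey(entityName, eventType, metricName string) string {
	return fmt.Sprintf("%s/%s/%s", entityName, eventType, metricName)
}

func main() {
	a := cacheKey("queue-1", "QueueSample", "queue.jobsPerSecond")
	b := cacheKey("queue-2", "QueueSample", "queue.jobsPerSecond")
	fmt.Println(a != b) // prints true: same metric name, distinct cache entries
}
```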
For FedRAMP customers there is a special gov-infra-api.newrelic.com domain.
In order to move Infra to dimensional metrics in a FedRAMP approved way, we need to make sure the dimensional metrics capability is following the same approach: going straight to CHI and avoid CloudFlare and cells.
Once #283 is implemented and the Design Change: Entity synthesis for prometheus-based OHIs
has been tested and working, create a document with all the v4 fields, usage and consequences (data generated).
Feel free to add more items into the list.
The document should be uploaded in the repository and Confluence.
SDK v4 has been released, but we had issues importing it:
$go get github.com/newrelic/infra-integrations-sdk/[email protected]
go get github.com/newrelic/infra-integrations-sdk/[email protected]: github.com/newrelic/[email protected]: invalid version: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v4
$go get github.com/newrelic/[email protected]
go get github.com/newrelic/[email protected]: github.com/newrelic/[email protected]: invalid version: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v4
If I simply run: go get github.com/newrelic/infra-integrations-sdk
I get the version:
github.com/newrelic/infra-integrations-sdk v3.6.5+incompatible // indirect
Should we add the major version of the package to the go.mod?
something like this
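For reference, Go's semantic import versioning requires major versions v2 and above to appear in the module path, so the go.mod would presumably need something like this (the go directive value here is a guess):

```
module github.com/newrelic/infra-integrations-sdk/v4

go 1.13
```

Consumers would then import the package as github.com/newrelic/infra-integrations-sdk/v4/... and `go get github.com/newrelic/infra-integrations-sdk/[email protected]` should resolve.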
PR:
newrelic/newrelic-telemetry-sdk-go#13
It has been reviewed by them. We need to tackle the feedback and get the changes merged.
After this is done, we'll need to replace the forked version with the official one.
Add a workflow that runs unit tests on the v3 branch. This should be valuable for reviewers of v3 fix branches.
Currently dimensional metrics are going through CloudFlare and are routed to a Cell. Both of these mechanisms are not FedRAMP approved. For FedRAMP customers there is a special gov-infra-api.newrelic.com domain that doesn't use CloudFlare or cells.
In order to move Infra to dimensional metrics in a FedRAMP approved way, we need to make sure the dimensional metrics capability is following the same approach: going straight to CHI and avoid CloudFlare and cells.
Instead of using the current domain: metric-api.newrelic.com, we will be using infra-api.newrelic.com. For POMI we need to update the default url.
This bug was fixed for v4 but needs to be ported to v3 so that nri-kafka, nri-cassandra and nri-jmx can use it.
Keep a v3 branch that includes the hotfixes. At this moment the last version of v3 is v3.6.5; I would expect a tag v3.6.6 with this fix.
Make sure that SDK v4 branch is ready and all the documentation is correct. Once happy then we can publish this and tag it as the new version of the SDK.
Over the course of the last 3 weeks, there have been a couple of high-severity issues with both decoding and marshalling JSON responses. Specific commits/issues like the following demonstrate the need for a more standardized method of managing JSON responses.
Issue #66
The issues listed above have created critical situations where data was not being reported from the servers/nodes. The larger issue: unless an actual person is monitoring the data streams, there is often no way to know that these issues are occurring.
Although some marshalling is managed by the SDK, a method that gives all OHIs one standardized encoding/decoding and marshalling/unmarshalling solution could reduce the bug-tail currently being seen in production sites. Some of this occurs because the expected interfaces, as defined by the software, can randomly return JSON-valid but inappropriate types to convert.
After a timeout on a single query, it seems that all the connections get closed:
func TestJmxNoTimeoutQuery(t *testing.T) {
defer Close()
if err := openWait("", "", "", "", openAttempts); err != nil {
t.Error(err)
}
if _, err := Query(cmdTimeout, 0); err != nil {
t.Error(err)
}
if _, err := Query(cmdBigPayload, timeoutMillis); err != nil {
t.Error(err)
}
}
This produces:
jmx_test.go:146: timeout waiting for query: timeout
jmx_test.go:149: EOF
While the first error is expected, the second one is not, since the timeout applies to the single query and there is no reason for the second one to fail. We had users complaining about this behaviour.
However, this could be expected behaviour if we close the connection and prefer to fail, since the library is not able to discard the result of the timed-out query, which is no longer interesting. Further investigation is needed.
Create a go module pipeline step that generates code from the thrift file. We need a target that generates the thrift code (ideally in Docker, so we can run it locally without installing thrift dependencies). In the pipeline, call this target and make sure there are no differences.
make generate
A target that spawns the container with the repo mounted as a volume. When I run this target, the thrift code is generated.

When running under Windows, the SDK does not provide a default path, making it harder to configure integrations that use JMX (Cassandra, Kafka, etc). Kafka provides its own flag for it; Cassandra does not, so the path needs to be provided through a "hidden" env var, which is not ideal.
Add a default value for the nrjmx path when running under Windows, similar to what is done for Linux.
Please help us better understand this feature request by choosing a priority from the following options:
[Really Want]
This form is for integrations-sdk bug reports and feature requests only.
This is NOT a help site. Do not ask help questions here.
If you need help, please use New Relic support.
Describe the bug or feature request in detail.
I am unable to decipher what string formatting is unacceptable. This is the output from my program (it is just mock output for now):
{"integration_version":"0.1.0","protocol_version":"2","data":[{"metrics":[{"some-data":4000,"event_type":"CustomSample"}],"inventory":{"instance":{"version":"3.0.1"}},"events":[{"category":"status","summary":"restart"}]}],"name":"com.myorganization.svctest"}
Amazon Linux2
May 31 15:20:17 ip-10-0-0-224.ec2.internal newrelic-infra-service[3961]: time="2023-05-31T15:20:17Z" level=warning msg="Cannot emit integration payload" component=integrations.runner.Runner error="invalid character '\'' looking for beginning of object key string" integration_name=svctest payload="{'integration_version': '0.1.0', 'protocol_version': '2', 'data': [{'metrics': [{'some-data': 4000, 'event_type': 'CustomSample'}], 'inventory': {'instance': {'version': '3.0.1'}}, 'events': [{'category': 'status', 'summary': 'restart'}]}], 'name': 'com.myorganization.svctest'}" runner_uid=89e32cfcbd
Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):
According to documentation a dimension/attribute might be a string, number or boolean.
A map of key value pairs associated with this specific metric. Values can be strings, JSON numbers, or booleans. Keys are case-sensitive and must be less than 255 characters.
But we only support string values: https://github.com/newrelic/infra-integrations-sdk/blob/master/data/metric/metrics.go#L19
Change the metric.Dimensions type.
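A sketch of the change: widen Dimensions from map[string]string to map[string]interface{} and validate value types against the documented string/number/boolean rule. The validation helper is hypothetical:

```go
package main

import "fmt"

// Dimensions widened to accept strings, numbers and booleans, per the
// Metric API documentation quoted above.
type Dimensions map[string]interface{}

// validDimensionValue checks a value against the documented set of
// allowed dimension types (helper name is illustrative).
func validDimensionValue(v interface{}) bool {
	switch v.(type) {
	case string, bool, int, int64, float64:
		return true
	default:
		return false
	}
}

func main() {
	d := Dimensions{"host": "db-1", "port": 6379, "primary": true}
	for k, v := range d {
		fmt.Println(k, validDimensionValue(v))
	}
}
```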
The agent has a binary behaviour regarding the metrics from an integration: either attach them all to the host entity or none (FlagDMRegisterEnable = "dm_register_enabled"). Most integrations should be fine without attaching to the host entity, as the backend performs the entity synthesis, but if in the near future we want to decouple the core integrations (cpu, mem, etc), the agent should be able to differentiate them.
Add a field to each entity's data to let the agent know whether to attach the host entity or not, for example ignore_host_entity, defaulting to false:
{
"protocol_version":"4", # protocol version number
"integration":{ # this data will be added to all metrics and events as attributes,
# and also sent as inventory
"name":"integration name",
"version":"integration version"
},
"data":[ # List of objects containing entities, metrics, events and inventory
{
"ignore_host_entity": true, # don't attach metrics to the host entity
"metrics":[ # list of metrics using the dimensional metric format
{
"name":"redis.metric1",
"type":"count", # gauge, count, summary, cumulative-count, rate or cumulative-rate
"value":93,
"attributes":{} # set of key-value pairs that define the dimensions of the metric
}
],
"common":{...} # Map of dimensions common to every entity metric. Only string supported.
"inventory":{...}, # Inventory remains the same
"events":[...] # Events remain the same
}
]
}
The HTTP client currently does not have many configuration options as far as tolerance of invalid certificates. A few customers have requested the ability to accept certificates that don't match the hostname of the server they are connecting to (newrelic/nri-elasticsearch#45).
The documentation here (https://docs.newrelic.com/docs/create-integrations/infrastructure-integrations-sdk) needs to be updated to reflect the changes that have been made. We should keep the old documentation (labelled legacy) so that we don't lose the knowledge.
V4 is out but we are missing documentation on the new types.
SDK v4 is not generating a payload that matches the one the agent expects.
In SDK v4, each entity can provide a map of common dimensions. PowerDNS example:
"data": [
{
"common": {
"scrapedTargetKind": "user_provided",
"scrapedTargetName": "localhost: 9122",
"scrapedTargetURL": "http: //localhost:9122/metrics",
"targetName": "localhost:9122"
},
...
The problem is that the agent defines this common field as the following data structure:
type Common struct {
Timestamp *int64 `json:"timestamp"`
Interval *int64 `json:"interval.ms"`
Attributes map[string]interface{} `json:"attributes"`
}
Why change the SDK and not the agent mapping?
SDK v4 payload data is transformed by the agent into a valid structure for the NR Telemetry API. The Telemetry API defines the common structure similarly to the agent's:
type metricCommonBlock struct {
timestamp time.Time
interval time.Duration
forceIntervalValid bool
attributes MapEntry
}
Telemetry expected payload example.
If we want to align the integrations SDK with the telemetry SDK, the commonDimension field should be aligned with the telemetry CommonBlock. The entity common data should be:
"data": [
{
"common": {
"attributes": {
"scrapedTargetKind": "user_provided",
"scrapedTargetName": "localhost: 9122",
"scrapedTargetURL": "http: //localhost:9122/metrics",
"targetName": "localhost:9122"
},
},
...
Common timestamp and interval will be added but are not covered in this issue, as they are currently populated by the agent.
These changes should not break current integrations, as common dimensions should be added with AddCommonDimension. Nonetheless, the CommonDimension field is public/exported, so we should sync with the core integrations to ensure CommonDimension is not used directly.
The Travis CI build is broken because we do not pin the version of Testify. The project needs to move to go mod so that we can pin the Testify version. This also involves updating the build so that it runs on GitHub Actions.
Expected: the build passes.
The JMX server endpoint may return "Java comment" lines like: # An error report file with more information is saved as:
In this case JMX will return an error like: error: invalid character '#' looking for beginning of value
It'd be great to handle this case properly and log it as a warning.
Ideally an integration execution instance should be able to keep fetching the rest of the requested data/queries defined.
A JMX server error may report several lines, but the current jmx package is limited to reading just one line. It'd be great to get the whole error message logged as a single entry.
Circuit breaker: whenever several queries fail in a row (here we are handling "Java comment" lines, but this could be extrapolated to other errors), the jmx package client (let's call it that, although the API is a set of awful global functions sharing global state) should prevent further queries from being submitted and log an error instead. This will avoid worsening the JMX endpoint's situation, as it has already returned N errors in a row.
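A minimal sketch of the warning-instead-of-error part: filter "Java comment" lines out of the output before JSON decoding and surface them as warnings, so one noisy line doesn't fail the whole query. The names are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// splitOutput separates "Java comment" lines (e.g. "# An error report
// file with more information is saved as:") from the JSON payload, so
// the comments can be logged as warnings instead of aborting decoding.
func splitOutput(lines []string) (payload []string, warnings []string) {
	for _, l := range lines {
		if strings.HasPrefix(strings.TrimSpace(l), "#") {
			warnings = append(warnings, l)
			continue
		}
		payload = append(payload, l)
	}
	return payload, warnings
}

func main() {
	out, warns := splitOutput([]string{
		"# An error report file with more information is saved as:",
		"# /tmp/hs_err_pid1234.log",
		`{"bean":"java.lang:type=Memory"}`,
	})
	fmt.Println(len(out), len(warns)) // prints 1 2: one payload line, two warnings
}
```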
There is a common field in the payload that could be used for attributes shared by entities and metrics. This is not documented, and there are also no methods for adding these values.
It would be good to document this attribute's behavior with some examples, and also to add setters for it.
Entity is actually the Data field of the v4 protocol. The metadata.Metadata field inside the Entity actually contains the definition of the Entity. Integration doesn't have any func to add metrics that are not related to any Entity; a workaround for this is being added in #291.

When calling integration.NewEntity in an integration, the entity is not added to the integration's list of entities. This results in the integration data being missing.
The entity should be added to the list of entities and the data serialized.
go run redis.go -metrics -hostname localhost
{"protocol_version":"4","integration":{"name":"com.myorg.redis","version":"0.1.0"},"data":[]}
MacOS 10.15.7
infra-integrations-sdk v4.1.0
The issue is fixed by creating the entity and adding:
i.Entities = append(i.Entities, entity)
go run redis.go -metrics -hostname localhost
{"protocol_version":"4","integration":{"name":"com.myorg.redis","version":"0.1.0"},"data":[{"common":{},"entity":{"name":"redis_01","displayName":"RedisServer","type":"my-redis","metadata":{}},"metrics":[{"timestamp":1631239424,"name":"query.instantaneousOpsPerSecond","type":"gauge","attributes":{},"value":0}],"inventory":{},"events":[]}]}
NOTE: the tutorial at https://github.com/newrelic/infra-integrations-sdk/blob/master/docs/tutorial.md#building-a-redis-integration-using-the-integration-golang-sdk-v30 targets SDK v3; however, https://github.com/newrelic/infra-integrations-sdk/blob/master/docs/tutorial-code/multiple-entities/redis-multi.go specifies v4.

When trying to build a custom integration (for requirements Flex is not able to meet), I'm running into the following log message:
time=“2021-11-04T20:14:13Z” level=debug msg=“Missing event_type field for metric.” action=EmitDataSet component=PluginRunner integration= metric=“map[attributes:map[] label.env:production label.role:cache name:query.instantaneousOpsPerSecond timestamp:1.636056853e+09 type:gauge value:2112]”
$ ./myorg-redis-multi --pretty --metrics
{
"protocol_version": "4",
"integration": {
"name": "com.myorganization.redis-multi",
"version": "0.1.0"
},
"data": [
{
"common": {},
"entity": {
"name": "instance-1",
"displayName": "redis",
"type": "instance-1",
"metadata": {}
},
"metrics": [
{
"timestamp": 1636057006,
"name": "query.instantaneousOpsPerSecond",
"type": "gauge",
"attributes": {},
"value": 2112
}
],
"inventory": {},
"events": []
},
{
"common": {},
"entity": {
"name": "instance-2",
"displayName": "redis",
"type": "my-instance",
"metadata": {}
},
"metrics": [
{
"timestamp": 1636057006,
"name": "query.instantaneousOpsPerSecond",
"type": "gauge",
"attributes": {},
"value": 2112
}
],
"inventory": {},
"events": []
}
]
}
This output does not include event_type; the tutorial, however, shows event_type. The sample code is able to submit metrics without errors.
Install newrelic-infra 1.20.5 and in /etc/newrelic-infra.yml set:
log_file: /tmp/newrelic.log
verbose: 1
grep event_type /tmp/newrelic.log
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
$ dpkg --list |grep newrelic
ii newrelic-infra 1.20.5 amd64 New Relic Infrastructure provides flexible, dynamic server monitoring. With real-time data collection and a UI that scales from a handful of hosts to thousands, Infrastructure is designed for modern Operations teams with fast-changing systems.
Example bug:
error: ./main.go:14:79: cannot use 5 * time.Second (type time.Duration) as type int in argument to jmx.Query