In the <a href="https://github.com/microsoft/service-fabric-observer/blob/main/Documen

Is per service disk space monitoring possible? about service-fabric-observer HOT 12 OPEN

markwragg commented on August 16, 2024

Is per service disk space monitoring possible?

from service-fabric-observer.

Comments (12)

GitTorre commented on August 16, 2024

Re Work Item: I don't mean an Issue (where there is some problem), but a Feature Request. When you create an Issue, you can choose the type. Feature Request means you want something added to the technology. The template explains what you need to provide.

GitTorre commented on August 16, 2024

Hi Mark,

This feature isn't actually ready for prime time. There is an active work item tracking this (internal) and I will let you know when it comes to fruition. I think the documentation is ahead of reality here. Even in my local tests, I am not getting the results I expect.

This means that there is nothing FO can provide in the near term. The data you are getting back is not correct (it doesn't seem like 79 bytes is realistic for your stateful service replicas...). Sorry for confusing you. Let's hold off on this for now until I get back to you. Feel free to leave this Issue open in the interim.

GitTorre commented on August 16, 2024

Thanks for the report. That is indeed a doc bug.

There is currently no disk-related monitoring done by AppObserver.

I will correct the documentation.

GitTorre commented on August 16, 2024

Fixed the documentation to reflect reality. Thanks again for catching that.

In terms of adding disk monitoring capability to AppObserver, feel free to create a Work Item and it will be looked into. Which disk IO metrics do you want to measure and apply thresholds to?

markwragg commented on August 16, 2024

Hey, not quite sure what you mean by creating a work item. Do you want a separate issue? I'm mostly interested in disk consumption on a per app / service basis, so that we can have a way to track how much disk space each service consumes and how it changes over time. I don't know how practical that would be to do though.

GitTorre commented on August 16, 2024

Hi,

So, that would be something like tracking (on Windows) WriteTransferCount, which is the number of bytes written to disk by a process. It's what you see in Task Manager, Details view, for a process when you add the "I/O write bytes" column. Implementation-wise, that is easy to add to AppObserver, but users would need to supply a Warning threshold to enable it and it is unclear to me if users know what constitutes misbehavior. So, maybe the service writes data to logs or some other file(s) and this could amount to GBs of data. What constitutes too much? That would be left to the user to decide, but observers only monitor resources that have thresholds specified, so you could just use a really large value to limit Warning noise or if you know that your service is supposed to manage the disk space it consumes, you could warn when it eats 10GB or something, which could signal that your disk cleanup code is failing. Again, this would be up to the user.
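The threshold behaviour described above (a metric is only monitored when a Warning threshold is supplied) can be sketched as follows. This is a minimal illustration in Python, not FabricObserver's actual implementation; the function name, the return values, and the choice of 1 GB = 1024^3 bytes are all assumptions made for the example.

```python
from typing import Optional

GIB = 1024 ** 3  # assumption: "GB" is treated as 2^30 bytes in this sketch


def evaluate_write_bytes(write_bytes: int,
                         warning_threshold_bytes: Optional[int]) -> Optional[str]:
    """Hypothetical check of a process's cumulative I/O write bytes.

    Returns None when no threshold is configured (the metric is not
    monitored at all), otherwise "Warning" or "Ok".
    """
    if warning_threshold_bytes is None or warning_threshold_bytes <= 0:
        return None  # no threshold supplied, so nothing is monitored
    if write_bytes >= warning_threshold_bytes:
        return "Warning"
    return "Ok"
```

With a 10 GB threshold, `evaluate_write_bytes(12 * GIB, 10 * GIB)` would yield a warning, while omitting the threshold disables the check entirely.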

markwragg commented on August 16, 2024

That makes sense, thanks. I'm not sure I/O Write Bytes would be useful, as it looks to me like that value never goes down, so it's not a representation of the current disk space utilisation of a process, but of how much it has written to disk in its lifetime (which for very long-lived processes is going to end up being huge).

The scenario we have is a lot of stateful services that co-locate their state on disk with the code, and it's this "state" disk consumption that would be interesting to track, but I'm not sure if a metric easily exists to do so.
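One blunt way to approximate current on-disk consumption, as opposed to lifetime write bytes, is to sum the file sizes under a service's directory. The sketch below is only an illustration of that idea in Python; it assumes you already know which directory holds a given service's state, which this thread does not establish, and the function name is hypothetical.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # File was removed or is inaccessible while walking; skip it.
                continue
    return total
```

Sampling this value periodically per service directory would show how consumption changes over time, at the cost of a full directory walk on each sample.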

GitTorre commented on August 16, 2024

Yeah, you are right. That won't really help.

I am not sure what performance counter would help you here. Are you trying to measure how much replicated state exists on disk?

markwragg commented on August 16, 2024

Yes, if possible, with ideally a breakdown per app or service.

GitTorre commented on August 16, 2024

This information is actually available via TStore SF perfcounters.

I have not had a chance to experiment with this yet, however. You can open Performance Monitor, go to SF counters, look under TStore. Disk Size and Item Count are the droids you're looking for, particularly Disk Size.

markwragg commented on August 16, 2024

Hi Charles. Thanks for this, it looks interesting. I found this page that describes the counters: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-diagnostics

Item Count = The number of items in the store.
Disk Size = The total disk size, in bytes, of checkpoint files for the store.

I used Perfmon to look at them on some of my clusters. They obviously aren't present on nodes that run stateless services, but I did find them on my nodes that have stateful services. However, it seemed a bit strange that the value was the same for every instance (79 in this case); I was expecting it to vary between services. It was also hard to work out which instance belonged to which service, which I assume is determinable from the ID it returns.

If you are interested in exploring this I'd be keen to see how it could be implemented in AppObserver so that it more easily allows you to see the values per service/app.

Oh, I also found they weren't present on my clusters that are still on SF 9.0, but were on the ones on SF 9.1. Do you know if these counters are new in SF 9.1?
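On working out which counter instance belongs to which service: a small parser can split the instance name into its IDs. The Python sketch below assumes the instance names follow the pattern PartitionId:ReplicaId:StateProviderId_Differentiator_StateProviderName, which is my reading of the linked diagnostics page and should be checked against what Perfmon actually shows. The type and function names are hypothetical, and the sample name in the usage note is made up.

```python
from typing import NamedTuple


class TStoreInstance(NamedTuple):
    partition_id: str
    replica_id: str
    state_provider_id: str
    differentiator: str
    state_provider_name: str


def parse_tstore_instance(name: str) -> TStoreInstance:
    """Split an instance name using the ASSUMED format described above.

    Raises ValueError if the name does not have the expected parts.
    """
    # maxsplit keeps any ':' or '_' inside the state provider name intact.
    partition_id, replica_id, rest = name.split(":", 2)
    state_provider_id, differentiator, provider_name = rest.split("_", 2)
    return TStoreInstance(partition_id, replica_id, state_provider_id,
                          differentiator, provider_name)
```

For example, parsing `"00d0126d-3e36-4d68-98da-cc4f7195d85e:131652217797162571:142652217797162571_1337_urn:MyReliableDictionary/dataStore"` would give a partition ID that could then be matched against the partitions of each service in the cluster (that lookup is not shown here).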

GitTorre commented on August 16, 2024

I think the issue here is unrelated to the counter implementation, which is fine. It is more of an understanding problem regarding how the counter actually works. I verified that the results are accurate. However, there is something to keep in mind:

A checkpoint will be initiated when the specified threshold (CheckPointThresholdInMB) is reached, that is, when log usage exceeds this threshold. At that point, the counter will return a non-zero value (so greater than CheckPointThresholdInMB expressed in bytes). The default value for this setting is 50MB. You can do a local experiment and change the value to be lower for your stateful service (only for testing, mind you; do not use a small value in production).
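To make "CheckPointThresholdInMB expressed in bytes" concrete, here is the conversion as a short Python sketch. It assumes 1 MB = 1024 * 1024 bytes, which is not confirmed in this thread; the constant and function names are hypothetical.

```python
DEFAULT_CHECKPOINT_THRESHOLD_MB = 50  # default stated in the comment above


def checkpoint_threshold_bytes(
        threshold_mb: int = DEFAULT_CHECKPOINT_THRESHOLD_MB) -> int:
    """Convert the threshold to bytes, ASSUMING 1 MB = 1024 * 1024 bytes."""
    return threshold_mb * 1024 * 1024
```

Under that assumption the default works out to 52,428,800 bytes, so a Disk Size reading taken after a checkpoint would be expected to be at least in that range.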

So, the counter is not a problem. It was just understanding what is going on that took some time (plus I frankly haven't had much time to revisit this and when I did I talked to a dev on the SF data team to clear this up).

Note that there is still a work item in progress to work on the overall feature, including performance improvements, better documentation of what the data means, etc. Also, querying the counters from C# code (via PerformanceCounter class) does not work. So, that needs to be sorted out before FO can do any monitoring/reporting for this.
