Welcome! From the write-up, it looks like leaves focuses on the prediction part. I'll look into it though. Thanks!
from lightgbm.
@jameslamb we dug deep into this and have concluded that this issue is the result of an interaction between the Go garbage collector and the threads created by the C API.
Here are our findings:
-- First, we wrote a pure C++ program that repeatedly loaded and released the model via the C API. The RSS was constant throughout the run. We noticed that the process creates as many threads as are available on the machine.
-- Second, since the LightGBM C API uses the OpenMP interface to manage threads, we fixed the number of threads to 8 by setting the env var OMP_NUM_THREADS. The RSS was still stable, and the process created the specified number of threads and did not spawn more.
We believe the C API keeps some buffer memory with each thread in order to reuse it when required, which is the correct thing to do as an optimization.
These two experiments ruled out the possibility that the C API is mismanaging memory.
-- Third, we loaded the same model via a Go program. We force-triggered GC after every load/release cycle, and again fixed the number of threads for the process to 8 using OMP_NUM_THREADS=8. This is where things got interesting.
When the GC ran, it transiently zombied the live threads, and the C API had to request a new set of threads from the OS in order to load the model. This led to an increase in the total number of threads in the process. Also worth noting: whenever the number of threads remained unchanged after a cycle, the RSS was stable as well.
The zombied threads still held on to their buffer memory, hence the gradual increase in RSS. From the following image, it's clear that only 8 threads are in use; the rest are just holding memory.
[image omitted: per-thread memory usage, showing 8 threads in use]
-- Fourth, we stopped force-triggering GC and switched off automatic GC by calling debug.SetGCPercent(-1) in the Go code. The result was a stable RSS with the Go code!
Conclusion: Go's garbage collection is somehow interfering with the C API's thread management. The C code loses control over its threads, which hold buffer memory; when those threads are zombied, that memory is stranded and contributes to RSS.
Let me know if this issue should still be part of this repo. I am anyway heading over to the Go repo to highlight this and find a way to resolve it.
@jameslamb hi. We did try the threadless version and saw an improvement initially, but over time the memory swells up just like in the threaded version.
That said, we seem to have solved the issue. We schedule our model refreshes in a goroutine, so all calls to the C API are made inside this goroutine. We think the C API allocates some buffers in the goroutine's underlying native OS thread; during GC and the sleeps we add between model refreshes, that native thread is yielded, and so are the buffers (hence the zombied memory).
We used runtime.LockOSThread() from Go's runtime package to permanently pin the model-loader goroutine to a particular native thread.
We are seeing improvements and a much flatter RSS graph now.
Could you just confirm whether our assumption is correct that the library makes some allocations thread-locally?
Thanks for using LightGBM.
Is it possible to upgrade to a newer version of LightGBM? v3.1.1 is 3.5 years old at this point (link), so you're missing out on 3.5 years of fixes and changes, including some related to memory efficiency and fixed memory leaks.
Thank you for the prompt reply @jameslamb. I would really like to upgrade the LightGBM version, but since we work in a production environment and our training backend uses the same LightGBM version, the switch is not easy for us; we'd need to do a whole lot of testing. Moreover, the same version works fine in a Java environment, so I would really like to figure out why Go is causing an issue.
However, I'll try using the latest version and report back the results.
Thanks! Excellent report by the way, this was really interesting to read.
I have one other theory. I know very little about Go, but looking at the code you provided, it seems like you are loading the model into a string and then passing that string into LightGBM... but never freeing the string:

modelString = getModelAsString()

Your ReleaseMemory() method leaves that modelString unmodified.
func ReleaseMemory() {
	res := int(C.LGBM_BoosterFree(predictor))
	if res == -1 {
		panic("LightGBM C_API : Failed to release the memory of the LightGBM model")
	}
}
Could you try freeing the string?
Alternatively, if the model file is available in the container's local filesystem, you can skip the step of reading it into a string and just load it directly from a file with LGBM_BoosterCreateFromModelFile(), which was available in v3.1.1.
https://github.com/microsoft/LightGBM/blob/v3.1.1/include/LightGBM/c_api.h#L421-L423
@jameslamb Yes, I am loading the model string as a global variable, but I do that once at initialization, and my baseline (initial) RSS is measured after that point. So in the output, Initial: 1524 refers to the RSS value after loading the model string once. I knew about the LGBM_BoosterCreateFromModelFile() method, but loaded from a string to emulate exactly what we are doing in production.
Also, I just tried with the latest version and I am seeing the same behavior. So it could be how things are handled on the Go side. Is there any documentation on how to properly write a C wrapper for the LightGBM C API?
Is there any documentation on how to properly write a C wrapper for the LightGBM C API?
There is not really, sorry. That feature request is tracked in #6261.
There are some suggestions in this comment: #6261 (comment).
Also, I just tried with the latest version and I am seeing the same behavior. So it could be how things are handled on the Go side.
That's helpful, thank you!
Every release of LightGBM is tested with valgrind and leak sanitizer, and I'm fairly sure that those tests cover code paths that use LGBM_BoosterLoadModelFromString(). I'd expect those tests to catch any significant memory leaks.
So yes, based on what you've shared so far, I strongly suspect this is about some interaction between Go's memory management and how your Go code is calling into LightGBM, not a leak from LightGBM itself. But I don't know that for sure.
We can leave this open for a bit, maybe someone else will be able to provide other theories.
Okay, sure. It would be really helpful if someone could shed light on the Go implementation details.
We don't have any Go code in this repo or anyone doing active Go development (as far as I know).
Maybe https://github.com/dmitryikh/leaves can help.
If you do find a solution, please do come back and post it here. And thanks again for the EXCELLENT write-up and reproducible example!
I see you've posted a related question over on Stack Overflow: https://stackoverflow.com/questions/78347190/is-this-the-correct-way-to-use-cgo-to-call-c-code-via-go
Totally fine, since it's not exactly the same as the one you posted here.
I'm linking it here so anyone finding this GitHub issue from search will also find that related Stack Overflow question. I recommend you link back to this issue from your Stack Overflow post, for the same reason.
Oh actually it looks like you posted two separate Stack Overflow posts about this same topic?
Please don't do this. Spreading the same conversation across many places, especially without linking between them, means that different people might spend time answering your question in different places, not knowing that others are discussing it elsewhere. That makes it harder for others facing the same problem to benefit from the conversation via search engines, and risks wasting the time of some of the people involved.
Since it seems like https://stackoverflow.com/questions/78343407/how-to-use-cgo-properly-to-avoid-memory-leaks-working-with-lightgbm-c-api-from is getting answers, I think you should close https://stackoverflow.com/questions/78347190/is-this-the-correct-way-to-use-cgo-to-call-c-code-via-go and link to that.
Hey. Yes, I raised the issue on Stack Overflow as well. Since very few people using LightGBM are working in Go, I wanted to reach a broader audience for a better chance at a resolution. I did go overboard with creating two posts, though; I've removed the second one.
Wow, interesting!!!
Thank you for this excellent write-up. It'll take me some time to think through it carefully, but I do have one piece of advice to share that might be useful.
You can build LightGBM with OpenMP support completely turned off like this:
cmake -B build -S . -DUSE_OPENMP=OFF
cmake --build build --target _lightgbm
When using LightGBM to generate predictions, OpenMP is helpful for parallelizing over rows in the input data. It's also helpful in parallelizing the work of initially loading the model from a text file.
But I don't believe OpenMP actually speeds up anything in LightGBM for single-row predictions. So if your service is only generating predictions on a single row at a time, and you're finding that some interaction with LightGBM's use of OpenMP is leading to a memory leak, I recommend using a version of LightGBM built without OpenMP support.
That'll prevent that initial pool of threads from being allocated, and it'll make the behavior of the program invariant to settings like the OMP_NUM_THREADS environment variable.
Ah interesting, thanks for sharing that! I personally am not familiar with Go or how its garbage collection works so I can't offer any feedback on that approach. Glad to hear you've found changes to stabilize your applications though!
Could you just confirm whether our assumption is correct that the library makes some allocations thread-locally?
I'm not certain, but I think the answer is "yes". I believe the std::vectors in this block (which runs during some prediction code paths) will be thread-local allocations:
LightGBM/src/boosting/gbdt.cpp, lines 674 to 686 (at a70e832)
Those are never explicitly freed in LightGBM's code... LightGBM just declares them on the stack and trusts they'll be cleaned up automatically.