pythonspeed / filprofiler

A Python memory profiler for data processing and scientific computing applications

Home Page: https://pythonspeed.com/products/filmemoryprofiler/

License: Apache License 2.0

Emacs Lisp 0.02% Makefile 1.31% Python 36.02% C 10.72% Rust 49.73% Shell 0.48% C++ 0.16% Fortran 0.06% Jupyter Notebook 0.62% Cython 0.88%

filprofiler's Introduction

The Fil memory profiler for Python

Your Python code reads some data, processes it, and uses too much memory; maybe it even dies due to an out-of-memory error. In order to reduce memory usage, you first need to figure out:

  1. Where peak memory usage is, also known as the high-water mark.
  2. What code was responsible for allocating the memory that was present at that peak moment.

That's exactly what Fil will help you find. Fil is an open-source memory profiler designed for data processing applications written in Python, and it includes native support for Jupyter. Fil runs on Linux and macOS, and supports CPython 3.7 and later.

What users are saying

"Within minutes of using your tool, I was able to identify a major memory bottleneck that I never would have thought existed. The ability to track memory allocated via the Python interface and also C allocation is awesome, especially for my NumPy / Pandas programs."

—Derrick Kondo

"Fil has just pointed straight at the cause of a memory issue that's been costing my team tons of time and compute power. Thanks again for such an excellent tool!"

—Peter Sobot

License

Copyright 2021 Hyphenated Enterprises LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

filprofiler's People

Contributors

bast, dependabot[bot], itamarst, plnech, pythonspeed, whalesalad


filprofiler's Issues

Non-file-backed mmap() tracking

These are effectively just allocations, though presumably larger ones.

Unlike free(), munmap() can deallocate just part of a mapped region, which the tracking needs to handle.
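The interval bookkeeping that partial munmap() requires can be sketched like this (Python for clarity; Fil's actual tracking is implemented in Rust, and the class and method names here are invented):

```python
class MmapTracker:
    """Hypothetical sketch: track anonymous mmap() regions, allowing
    munmap() to release an arbitrary sub-range of a tracked region."""

    def __init__(self):
        self.regions = {}  # start address -> length in bytes

    def add(self, start, length):
        self.regions[start] = length

    def remove(self, start, length):
        # munmap() may cover only part of a region: keep whatever is
        # left before and/or after the unmapped range.
        end = start + length
        for rstart in list(self.regions):
            rlen = self.regions[rstart]
            rend = rstart + rlen
            if end <= rstart or start >= rend:
                continue  # no overlap with this region
            del self.regions[rstart]
            if rstart < start:
                self.regions[rstart] = start - rstart  # leading remainder
            if end < rend:
                self.regions[end] = rend - end  # trailing remainder

    def total(self):
        return sum(self.regions.values())
```

Punching a hole in the middle of a region splits it into two tracked remainders, which is the case plain free() bookkeeping never has to deal with.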

Fil missing massive chunks of memory usage in a Python program that creates mostly just Python objects

Hoping for reproducer, but may have to make my own:

  1. Python 3.6 (running on Debian) loads a 200MB CSV. The result uses 1GB of RAM, and this shows up fine in Fil.
  2. The CSV rows are then loaded into a whole pile of objects. Per htop this uses another 3GB of RAM, but that 3GB doesn't show up in Fil.

What could cause this?

  1. Fil isn't being told about all allocations.
    1. Somehow some allocations from Python aren't reported.
      1. The PYTHONMALLOC env variable doesn't work completely.
      2. The PYTHONMALLOC env variable only works for some Python APIs.
    2. It's one of the unsupported APIs, e.g. posix_memalign.
  2. Fil is correctly tracking allocations, but memory is leaking.
    1. Due to #35.
    2. Memory fragmentation in the allocator due to many small allocations.
    3. free() isn't being called, or being called wrongly somehow.
  3. Fil is not tracking allocations correctly.
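Since the issue hopes for a reproducer, a hypothetical one targeting hypothesis 1 might look like this (all names invented; the CSV-loading step is replaced by generated rows, since the point is the many small pure-Python object allocations):

```python
# Hypothetical reproducer sketch: create a large number of small,
# pure-Python objects and compare Fil's reported peak against what
# the OS reports (e.g. via htop).

class Row:
    __slots__ = ("a", "b", "c")

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

def make_rows(n):
    # Many small allocations, all going through CPython's object
    # allocator rather than direct malloc() calls from native code.
    return [Row(i, str(i), float(i)) for i in range(n)]

rows = make_rows(100_000)
```

If Fil's reported peak for `make_rows` is far below what htop shows, that would point at allocations bypassing the hooks Fil installs.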

PyPy Support

Is PyPy support planned?
Is it currently not feasible? If so, how can we help make it feasible?

macOS support

In theory this shouldn't be hard: just use DYLD_INSERT_LIBRARIES instead of LD_PRELOAD.

Jupyter magic for memory profiling

This would require having the profiler off by default, enabled only on demand. We may also want a custom kernel to make this more usable.

  • A custom kernel is available that runs Python with Fil inserted but inactive
  • Documentation for installing custom kernel
  • Test that Fil isn't tracking when started in this mode
  • Fil can trace some code temporarily
    • Test that code is profiled
    • Test that code raising an exception is profiled
    • Test that it stops tracking when done, both memory and Python tracing
  • Jupyter magic to trace some code, and then display it
    • Test for Jupyter magic
  • Documentation for Jupyter magic
  • Test for error message when wrong kernel is used

Windows support

What needs doing:

  • Figure out how to hook APIs, whatever the equivalent of LD_PRELOAD is. This may involve compiling a new python.exe or something terrible like that.
  • Figure out which APIs to override and their semantics. There's malloc() etc., but presumably Windows has its own APIs too?

https://microsoft.github.io/mimalloc/overrides.html might be useful for understanding what to do.

Replace FunctionTemplate with a pointer to PyCodeObject

Additionally, replace line number with bytecode index, deferring line number calculation to report-generation time.

Together these changes should speed up the tracing part of the profiler.

This is in progress in a branch.
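The deferred line-number lookup can be sketched with CPython's `dis` module; this is an illustration of the idea (store the bytecode offset during tracing, resolve it to a line only when generating the report), not Fil's actual code:

```python
import dis

def offset_to_lineno(code, lasti):
    """Resolve a stored bytecode offset to a source line number at
    report-generation time, instead of during tracing."""
    lineno = code.co_firstlineno
    for offset, line in dis.findlinestarts(code):
        if offset > lasti:
            break
        if line is not None:  # newer CPythons can yield None lines
            lineno = line
    return lineno
```

Since `dis.findlinestarts` walks the code object's line table, the cost is paid once per reported frame rather than on every traced call.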

Switch away from having same data structure for current and peak memory usage

Right now peak usage is kept as a copy of the same data structure as current allocations. But it doesn't need to be!

Instead, we could keep just the summary info we actually care about: a mapping of callstack ID → total allocated bytes. On each new allocation we'd increment the entry; on each free we'd decrement it. This mapping can just be a vector, because we have a small, somewhat restricted number of callstack IDs, and the IDs are 0, 1, 2, 3, etc. It could be an immutable Vector, maybe, but the main point is that it'd be much smaller, so much less copying.

So we'd have three data structures:

  1. Current allocations, mapping address to (size, callstack_id). Can be normal HashMap, OrdMap, or what have you: whatever is fastest and has least memory overhead.
  2. Total current memory usage, mapping callstack_id to total size.
  3. Peak memory usage, mapping callstack_id to total size. This is a read-only snapshot of #2.

Keeping two data structures is in theory more expensive, but in practice this will probably be faster, and will reduce memory usage.
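The proposed split can be sketched in Python (the real implementation would be Rust, and for clarity this sketch recomputes totals with `sum()` on every allocation, where real code would keep running totals):

```python
class AllocationTracker:
    """Sketch of the proposed split: per-address bookkeeping plus a
    compact callstack_id -> bytes vector, snapshotted at new peaks."""

    def __init__(self, num_callstacks):
        self.current = {}  # address -> (size, callstack_id)
        self.usage = [0] * num_callstacks  # callstack_id -> total bytes
        self.peak = list(self.usage)  # read-only snapshot of `usage`

    def allocate(self, address, size, callstack_id):
        self.current[address] = (size, callstack_id)
        self.usage[callstack_id] += size
        if sum(self.usage) > sum(self.peak):
            self.peak = list(self.usage)  # cheap: it's just a vector

    def free(self, address):
        size, callstack_id = self.current.pop(address)
        self.usage[callstack_id] -= size
```

The key point is that the peak snapshot copies only the small per-callstack vector, never the big per-address map.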

More robust fallback malloc() and friends on Linux

Right now, any memory allocated during the bootstrap period, before the Fil module is loaded, will cause a crash when free()d. While unlikely, it is possible.

One solution: keep track of addresses allocated during this period and don't allow them to be free()d. mmap()ed memory can still be munmap()ed, so we need to make sure this doesn't break.

Probably a better solution: since this is a problem only on Linux, not macOS, we can use the glibc-specific _malloc or __malloc or whatever it's called, which we used in early iterations of Fil.
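The logic of the first approach can be sketched in Python for clarity (the real code would be C/Rust inside the allocator hooks; all names here are invented):

```python
# Sketch of the first approach: remember which addresses the fallback
# allocator handed out during bootstrap, and make free() a no-op for
# them instead of crashing.

bootstrap_addresses = set()

def record_bootstrap_allocation(address):
    # Called for allocations made before Fil's module is loaded.
    bootstrap_addresses.add(address)

def tracked_free(address, real_free):
    if address in bootstrap_addresses:
        return  # deliberately leak: safer than crashing in free()
    real_free(address)
```

Deliberately leaking a handful of bootstrap allocations is cheap; the set stays tiny because the bootstrap window is short.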

Spyder integration

Once #12 is done, consider supporting Spyder as another scientific computing environment.

Possible to attach to running process in Linux via BPF or related technologies?

Runtime instrumentation would mean Fil wouldn't have to be attached from the very start, and would make it safe to use with production servers.

In particular, for long-running servers this would make tracking down memory leaks much easier. One would want to report not peak allocations, as Fil does by default, but rather current allocations. This is somewhat different from Fil's current use case, but it's a real problem people have.

Thanks to @jvns for the inspiration for this idea.

Try to reduce memory usage by using nested data structures

The theory: memory tracking overhead mostly matters if you have lots of small allocations. If you have lots of small allocations, they will end up in similar parts of the address space. Thus you could have a nested data structure, HashMap<MostSignificant48bitOfAddressSpace,HashMap<LeastSignificant16bitOfAddress,Allocation>> which would reduce memory overhead since it would effectively compress the addresses.

Potential problems: im's HashMap has ~600 bytes of overhead, so if you have a bunch of larger allocations, that will increase their overhead dramatically.

So maybe it should be 32bit/32bit, rather than 48bit/16bit, and then the amortized saving will be ~32bits per tracked allocation... unless you're doing pretty massive allocations, in which case im's per-map overhead doesn't matter.

There is also a potential CPU cost, since this will involve two hashmap lookups instead of one. If we switch to BTrees, perhaps this overhead will matter less?
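The 32bit/32bit variant can be sketched in Python, with plain dicts standing in for im's HashMap (names and the exact split are illustrative, per the discussion above):

```python
NestedMap = dict  # stand-in for im's HashMap in the Rust implementation

class CompressedAddressMap:
    """Sketch: split each 64-bit address into a 32-bit prefix and a
    32-bit suffix, so nearby small allocations share one outer entry
    and only store their 32-bit suffix individually."""

    def __init__(self):
        self.outer = NestedMap()  # prefix -> {suffix -> allocation}

    def insert(self, address, allocation):
        prefix, suffix = address >> 32, address & 0xFFFFFFFF
        self.outer.setdefault(prefix, NestedMap())[suffix] = allocation

    def remove(self, address):
        prefix, suffix = address >> 32, address & 0xFFFFFFFF
        inner = self.outer[prefix]
        allocation = inner.pop(suffix)
        if not inner:
            del self.outer[prefix]  # don't keep empty inner maps around
        return allocation
```

The compression only pays off when many allocations share a prefix, which matches the "lots of small allocations in similar parts of the address space" assumption.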

Sampling mode for memory leak detection

When detecting memory leaks, you don't need to track every allocation: the whole point is that some allocation is going to happen over and over. Sampling in this case is fine, and would reduce performance overhead.
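A minimal sketch of counter-based sampling (illustrative only; a real implementation would live in the Rust tracking code and might randomize the sampling to avoid aliasing with allocation patterns):

```python
class SamplingTracker:
    """Sketch: record only every Nth allocation, scaling recorded
    sizes by N so estimated totals stay approximately right. Good
    enough for leak detection, where the leaking callstack allocates
    over and over."""

    def __init__(self, every_nth=100):
        self.every_nth = every_nth
        self._count = 0
        self.estimated = {}  # callstack_id -> estimated bytes

    def allocate(self, size, callstack_id):
        self._count += 1
        if self._count % self.every_nth != 0:
            return  # skipped: this is where the overhead savings come from
        est = self.estimated.get(callstack_id, 0)
        self.estimated[callstack_id] = est + size * self.every_nth
```

Rare callstacks get noisy or missing estimates, but a leak by definition allocates frequently enough to be sampled.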

Play around with key/value databases

Goals:

  1. Persistence could enable better UX, e.g. for crashes.
  2. Reduce memory overhead from tracking allocations.

Things to look for:

  1. Ability to create snapshots, for peak allocations
  2. Performance
  3. Memory overhead

Things to look at

  • Sanakirja: It might allow for filesystem-backed allocation tracking, and has cheap clones. Probably too slow, but worth trying at least.
  • LMDB: Long-lived transaction would be same as clone. No malloc, mmap() only.
  • RocksDB: Has snapshots.
  • Faster KV: Might not have capabilities we need.

Support a --no-browser argument

This program has a lot of potential and I'm very excited about it.

The documentation states that it "automatically starts up a browser" to display the results when it's finished, but it would be helpful to handle the use case where there is no browser running on the same system that is running fil-profile.

I'm not an expert in this area, but perhaps start a web server and update the fil-results directory. We could then port-forward a browser to that server to see the results. I tinkered with a few ideas, but since this is my first day using the program, I want to keep this as non-specific as possible. To get a sense of what's possible, I ran an nginx container on the same system, with a shared mount to the fil-results directory.

In my "copious" free time, which the virus has provided, I may try to get a mock-up of what I'm talking about.
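A hypothetical --no-browser mode could serve the results directory with Python's standard library instead of calling the browser; the function names, directory, and port below are assumptions for illustration, not Fil's actual API:

```python
# Sketch: instead of opening a local browser on the report, serve the
# results directory over HTTP so a remote (or port-forwarded) browser
# can be pointed at it.
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

def make_results_server(directory, port=0):
    # port=0 asks the OS for any free port.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("0.0.0.0", port), handler)

def serve_results(directory, port=8080):
    server = make_results_server(directory, port)
    print(f"Results at http://localhost:{server.server_address[1]}/")
    server.serve_forever()  # Ctrl-C to stop
```

This avoids any nginx setup: `SimpleHTTPRequestHandler`'s `directory` parameter (Python 3.7+) serves static files directly.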

Figure out if `ld --wrap` can remove need for initialization code on Linux

--wrap + --defsym might do the trick:

  1. --wrap provides __real_malloc (and redirects explicit calls to malloc in the shared library to __wrap_malloc, although in practice we don't care about this part).
  2. --defsym allows mapping malloc to e.g. __wrap_malloc.
  3. We can then define __wrap_malloc.

Or it might fail.

Reduce tracking memory usage with compressed 32bit length

Once #45 is done, it seems that switching size in Allocation struct from 64-bit to 32-bit will reduce memory overhead meaningfully.

Suggested compression scheme:

  1. If the high bit is not set, the remaining bits indicate the exact number of bytes.
  2. If the high bit is set, the remaining bits indicate how many 64KB blocks are used.

The high bit should only be set for allocations > 16MB, say, where the loss of accuracy isn't a big deal. The result allows recording allocations up to 64TB, which seems sufficient.

(Numbers can be adjusted in various ways.)
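One way to sketch this scheme (taking the high bit as the marker for the lossy 64KB-unit encoding, so that accuracy is lost only on large allocations; the 16MB threshold is the one suggested above and all constants are adjustable):

```python
HIGH_BIT = 1 << 31
UNIT = 64 * 1024  # 64 KiB
THRESHOLD = 16 * 1024 * 1024  # switch to lossy encoding above ~16MB

def encode_size(nbytes):
    """Pack an allocation size into 32 bits: exact bytes below the
    threshold, rounded up to 64 KiB units (high bit set) above it."""
    if nbytes <= THRESHOLD:
        return nbytes
    units = (nbytes + UNIT - 1) // UNIT  # round up, never undercount
    return HIGH_BIT | units

def decode_size(encoded):
    if encoded & HIGH_BIT:
        return (encoded ^ HIGH_BIT) * UNIT
    return encoded
```

Rounding up means the decoded size can overshoot by at most one 64KB unit, under 0.4% for anything over 16MB.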

Extend UX to support tracking differences

For profiling, the real usage pattern is:

  1. Run with current code.
  2. Try to fix code.
  3. Run again, figure out difference, go to step 2 if not fixed.

So the UX should support that.

E.g.

$ fil-profile run yourscript.py
... you look at visualization ...
... time passes, you try to fix something ...
$ fil-profile again
... Re-runs last command, pops up visualization of differences in memory usage from original run.
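The comparison step of a hypothetical `again` command could be sketched as follows, assuming each run's report can be reduced to a {callstack: peak_bytes} mapping (an invented representation, for illustration):

```python
def diff_runs(before, after):
    """Compare two runs' {callstack: peak_bytes} mappings, returning
    per-callstack deltas sorted with the biggest regressions first."""
    deltas = {}
    for stack in before.keys() | after.keys():
        delta = after.get(stack, 0) - before.get(stack, 0)
        if delta:
            deltas[stack] = delta
    return sorted(deltas.items(), key=lambda kv: -kv[1])
```

The visualization would then color callstacks by sign of the delta, so a fix attempt that merely moved the memory elsewhere is immediately visible.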
