Comments (8)
4.8.0 has been released, which includes the new functionality discussed here!
from swcompression.
Hi @philmitchell,
Thank you for the PR, and happy new year to you too!
I had a look and, wow, this `autoreleasepool` thing does wonders: repeating the same tests as I did previously in this thread results in 0.19 GB of "maximum resident set size" when using `TarReader` + `autoreleasepool` vs. 2.93 GB for `TarContainer.open`.
As I understand now, the new `TarReader`/`TarWriter` types won't help with memory consumption issues by themselves: users will still have to use `autoreleasepool` to achieve the desired effect. So I wonder if it's possible to move the `autoreleasepool` inside `TarReader`, because otherwise these new types seem useless.
One way to achieve this would be to add a new function, something like `TarReader.processNext<Result>(_ f: (TarEntry) -> Result) throws -> Result`, which obtains a new `TarEntry` (similar to `next()`) and calls the user-supplied function on it inside the `autoreleasepool`.
Anyway, this is a question for me to think about later in the week (alongside merging your PR, etc.), so thank you once again!
I have some news to share.
First of all, as I discovered today, `autoreleasepool` is a Darwin-only thing, so it is not available on Linux and Windows. Apart from having to provide a compatibility implementation for it on those platforms, I wonder what the memory usage situation is there, i.e. whether this new `TarReader`/`TarWriter` stuff will be helpful at all on Linux/Windows. I've also found a relevant discussion on the Swift forums about `autoreleasepool` availability.
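One possible shape for such a compatibility implementation, assuming the platform ships no `autoreleasepool` of its own, is a pass-through that keeps call sites compiling unchanged. With no Objective-C runtime there is nothing to drain, so whether memory actually behaves comparably there is exactly the open question:

```swift
import Foundation

#if !canImport(ObjectiveC)
// Non-Darwin platforms have no Objective-C runtime, so there is nothing to
// drain; a pass-through lets the same call sites compile everywhere.
func autoreleasepool<Result>(invoking body: () throws -> Result) rethrows -> Result {
    return try body()
}
#endif

// On Darwin the system function is used; elsewhere the shim above is.
let value = autoreleasepool { 21 * 2 }
```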
Secondly, I merged the `tar-rw` branch into `develop` and made some additional changes:

- `TarReader.next()` has been renamed to `TarReader.read()`.
- `TarReader.process(_:)` has been added (this is the function that automatically uses `autoreleasepool`).
- Removed the `--use-rw-api` option from `swcomp tar`: it will now automatically use `TarReader`/`TarWriter` if possible (which means when no additional compression options are supplied).
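The behaviour described for `read()` and `process(_:)` can be sketched with stand-in types (an assumption of the shape, not the real SWCompression signatures):

```swift
import Foundation

// Compact stand-in mirroring the described API shape: process(_:) hands the
// closure an optional entry and returns whatever the closure produces.
struct TarEntry { let name: String }

struct TarReader {
    private var remaining: [TarEntry]
    init(entries: [TarEntry]) { remaining = entries }

    // Formerly next(): returns nil once the archive is exhausted.
    mutating func read() throws -> TarEntry? {
        remaining.isEmpty ? nil : remaining.removeFirst()
    }

    // Same read, but performed inside an autoreleasepool so per-entry
    // temporaries are released before the closure returns.
    mutating func process<T>(_ transform: (TarEntry?) throws -> T) throws -> T {
        #if canImport(ObjectiveC)
        return try autoreleasepool { try transform(try read()) }
        #else
        return try transform(try read())   // no pool off-Darwin
        #endif
    }
}

// Consumption pattern: an exhausted reader maps to nil, ending the loop.
var reader = TarReader(entries: [TarEntry(name: "x"), TarEntry(name: "y")])
var seen = [String]()
do {
    while let entry = try reader.process({ $0 }) {
        seen.append(entry.name)
    }
} catch { fatalError("\(error)") }
```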
Note that there are no changes to `TarWriter`. As much as I would like to make it use `autoreleasepool` internally, so users wouldn't have to think about these subtleties, I don't think it is possible. So the only solution there would be to mention in the documentation something along the lines of "you should wrap `autoreleasepool` around the `TarWriter` stuff".
To answer your main question: no, there has been no progress on this issue.
However, I see now that there is a lot of demand for this kind of functionality. As such, I will try to do something about it, starting with your specific case (creating a TAR). I will report back with some sort of implementation soon (hopefully by the end of this week, but no guarantees).
Thanks a lot for the quick response - I deeply appreciate it.
This functionality would help immensely.
For my specific case, I'm trying to create a tar file and write it to disk efficiently without holding so much data in memory. So a function that accepts tar entries (with a URL pointing to the entry's file rather than its Data) and a URL to write the archive to in chunks would be very helpful as well.
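A sketch of the chunked disk I/O half of that request using plain `FileHandle` (chunk size and file names here are arbitrary, not anything SWCompression prescribes):

```swift
import Foundation

// Copy a (potentially large) file in fixed-size chunks so only one chunk
// is ever resident, instead of loading the whole Data up front.
func copyInChunks(from source: URL, to destination: URL, chunkSize: Int = 1 << 20) throws {
    FileManager.default.createFile(atPath: destination.path, contents: nil)
    let input = try FileHandle(forReadingFrom: source)
    let output = try FileHandle(forWritingTo: destination)
    defer {
        input.closeFile()
        output.closeFile()
    }
    while true {
        let chunk = input.readData(ofLength: chunkSize)
        if chunk.isEmpty { break }   // EOF
        output.write(chunk)
    }
}

// Tiny demonstration with a temporary file.
var copiedCount = -1
do {
    let dir = FileManager.default.temporaryDirectory
    let src = dir.appendingPathComponent("chunk-src.bin")
    let dst = dir.appendingPathComponent("chunk-dst.bin")
    try Data(repeating: 7, count: 3_000_000).write(to: src)
    try copyInChunks(from: src, to: dst)
    copiedCount = try Data(contentsOf: dst).count
} catch { fatalError("\(error)") }
```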
Thanks again!
So I did implement something. I added a set of relatively simple APIs that use Foundation's `FileHandle`:

```swift
public struct TarReader {
    public init(fileHandle: FileHandle)
    public mutating func next() throws -> TarEntry?
}

public struct TarWriter {
    public init(fileHandle: FileHandle, format: TarContainer.Format = .pax)
    public mutating func append(_ entry: TarEntry) throws
    public func finalize() throws
}
```
These should help avoid loading everything into memory at once and only do it as necessary.
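To make the streaming idea concrete, here is a toy length-prefixed record stream (deliberately not the tar format) built on the same `FileHandle` reader/writer split, so that neither side ever holds the whole archive in memory:

```swift
import Foundation

// Toy record container on FileHandle, mirroring the API shape above:
// the writer appends one record at a time, the reader pulls one at a time.
struct RecordWriter {
    let handle: FileHandle
    func append(_ payload: Data) {
        var length = UInt32(payload.count).littleEndian
        handle.write(Data(bytes: &length, count: 4))   // 4-byte LE length prefix
        handle.write(payload)
    }
    func finalize() {
        var zero = UInt32(0).littleEndian              // empty record = end marker
        handle.write(Data(bytes: &zero, count: 4))
    }
}

struct RecordReader {
    let handle: FileHandle
    func next() -> Data? {
        let header = handle.readData(ofLength: 4)
        guard header.count == 4 else { return nil }
        let b = [UInt8](header)
        let length = UInt32(b[0]) | UInt32(b[1]) << 8 | UInt32(b[2]) << 16 | UInt32(b[3]) << 24
        guard length > 0 else { return nil }           // hit the end marker
        return handle.readData(ofLength: Int(length))
    }
}

// Round trip through a temporary file.
let url = FileManager.default.temporaryDirectory.appendingPathComponent("records.bin")
FileManager.default.createFile(atPath: url.path, contents: nil)
let writer = RecordWriter(handle: try! FileHandle(forWritingTo: url))
writer.append(Data("hello".utf8))
writer.append(Data("world".utf8))
writer.finalize()
writer.handle.closeFile()

let reader = RecordReader(handle: try! FileHandle(forReadingFrom: url))
var strings = [String]()
while let payload = reader.next() {
    strings.append(String(decoding: payload, as: UTF8.self))
}
```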
I've tried to check whether they indeed help with memory usage. I used the `/usr/bin/time -lp` command to measure "maximum resident set size" and tarred "Microsoft Word.app" (2.26 GB) as a test input. The results are a bit confusing. For the `TarContainer.open` function I got 3.265 GB vs. 1.746 GB for the `TarReader` version. For `TarContainer.create` and its `TarWriter` counterpart I got 3.515 GB vs. 2.279 GB, respectively. The full terminal output is available in the attached file (though it may be a bit cryptic): term_output.txt
So, what do you think about all of this? I should note that I have no idea whether or not this metric of "maximum resident set size" is a good one to evaluate memory usage. As I assume you did some memory-related measurements before, any advice on how to better estimate memory usage would be appreciated!
P.S. If you are interested in testing these new functions yourself, they are available in the `tar-rw` branch.
I just submitted a PR to address this, but only for the case of untarring. If you want, I'll look at tarring, too ... but I happen to need only untarring :)
I used Xcode Instruments to confirm that, with the addition of `autoreleasepool`, the new `rw-api` branch can untar quite large files without any increase in transient or persistent memory usage.
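Schematically, the change amounts to wrapping each `next()` call site in a pool (stand-in reader for illustration; on Linux/Windows the pool is skipped):

```swift
import Foundation

// Stand-in for the FileHandle-based reader described earlier in the thread.
struct TarEntry { let name: String }

struct TarReader {
    private var remaining: [TarEntry]
    init(entries: [TarEntry]) { remaining = entries }
    mutating func next() throws -> TarEntry? {
        remaining.isEmpty ? nil : remaining.removeFirst()
    }
}

// Untar loop: one pool drain per entry keeps transient memory flat even
// for large archives.
var reader = TarReader(entries: (1...3).map { TarEntry(name: "file\($0)") })
var extracted = [String]()
var finished = false
do {
    while !finished {
        #if canImport(ObjectiveC)
        try autoreleasepool {
            if let entry = try reader.next() {
                extracted.append(entry.name)   // write entry to disk here
            } else {
                finished = true
            }
        }
        #else
        if let entry = try reader.next() {
            extracted.append(entry.name)
        } else {
            finished = true
        }
        #endif
    }
} catch { fatalError("\(error)") }
```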
Thanks Timofey ... and happy new year!
Ha, that's super interesting! It was surprising to discover that we had to manually release memory during the file read, but now I'm really curious how it works in Swift on Linux/Windows. I'm swamped at the moment, but I will look into this when I can.
Regarding the removal of `--use-rw-api`: good call, I agree it should be the default. I actually ran some (very crude) tests to see if there was a performance hit when reading the tar file in chunks, and if anything it performs better. If that's correct, there's no reason to do it any other way.