Comments (7)

nalimilan avatar nalimilan commented on August 17, 2024

I think that's because WeakRefString objects can only be used with NullableArray at the moment, so lots of small strings need to be allocated when nullable=true. Does the dataset contain string columns? If you have a small number of categories, using a CategoricalArray would save a lot of memory (not sure it's supported yet.)
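To illustrate why a categorical representation could help, here is a minimal sketch of string pooling written by hand. It assumes Julia 1.x syntax, and `pool_strings` is a hypothetical helper for illustration, not the CategoricalArrays implementation: each distinct string is stored once, and every row keeps only a small integer code.

```julia
# Hypothetical helper: store each distinct string once and keep a
# 4-byte integer code per row instead of a separate string object.
function pool_strings(values::Vector{String})
    levels = String[]                      # distinct strings, in order of first appearance
    index = Dict{String,UInt32}()          # string => code lookup
    codes = Vector{UInt32}(undef, length(values))
    for (i, v) in enumerate(values)
        codes[i] = get!(index, v) do
            push!(levels, v)               # first time we see v: register a new level
            UInt32(length(levels))
        end
    end
    return levels, codes
end

levels, codes = pool_strings(["a", "b", "a", "c", "b"])
println(levels)
println(codes)
```

With only a few categories, the per-row cost is the code width (4 bytes here) regardless of string length; the saving disappears if nearly every value is distinct.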

from csv.jl.

pearcemc avatar pearcemc commented on August 17, 2024

@nalimilan Thanks, yes, that's correct.

I'm still not entirely sure why this would explain the disparity, though, since:

  • maximum length of any of the small strings is 14 UInt8s (so 14 bytes?)
  • minimum size of a WeakRefString appears to be 24 bytes (3 Int64s?)

I would have thought that the file should be small enough to fit into RAM using actual strings as:

nrow = 9999999
strlen = 14
colsize_mb = (nrow*strlen*sizeof(UInt8))/1e6
#139.999986

There are 6 cols and I have 4GB RAM, so I'd expect a String version to take up roughly 840MB or less.
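As a rough cross-check of that estimate, the same arithmetic can be written out for all six columns, alongside the 24-byte-per-element WeakRefString figure mentioned above. This is only a back-of-envelope sketch in Julia 1.x syntax: it assumes every column is a 14-byte string column (some are actually Int64) and ignores any per-object overhead.

```julia
# Back-of-envelope estimate using only the figures quoted in this thread.
nrow = 9_999_999       # rows in the file
ncol = 6               # columns, treated here as if all held 14-byte strings
strlen = 14            # longest string, in bytes
weakref_bytes = 24     # per-element WeakRefString size reported above

raw_mb = ncol * nrow * strlen / 1e6             # raw string payload only
weakref_mb = ncol * nrow * weakref_bytes / 1e6  # WeakRefString handles only

println("raw string data:       ", round(raw_mb, digits=1), " MB")
println("WeakRefString handles: ", round(weakref_mb, digits=1), " MB")
```

Both figures sit well under 4GB, which is why the raw sizes alone do not seem to account for the swapping.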

I'd expect a WeakRefString version of the CSV to be similar. When I end my Julia process after reading the file in with WeakRefString, about 1GB of RAM is released (some columns are Int64), which is roughly consistent with the numbers above.

nalimilan avatar nalimilan commented on August 17, 2024

In Julia 0.5, String objects have a significant overhead due to their Array field (this will be much better in 0.6). I don't remember what the exact value is, but for short strings like yours it's a lot.

pearcemc avatar pearcemc commented on August 17, 2024

Thanks for following up. Investigating this, the container size of a String seems to be 8 bytes.

julia> N = 10000;

julia> v = @timed ["hello" for i in 1:N]
(String["hello","hello","hello","hello","hello","hello","hello","hello","hello","hello"  …  "hello","hello","hello","hello","hello","hello","hello","hello","hello","hello"],0.045251508,1375438,0.0,Base.GC_Diff(1375438,1,0,12641,0,0,0,0,0))

julia> v[3] #mem allocated
1375438

julia> tots = sizeof(v[1]) + length(v[1])*sizeof(v[1][1]) #size of the containers + size of the data contained?
130000

julia> tots/N #cost per string
13.0

So if my strings are each 14 bytes + an 8-byte container = 22 bytes, this should still be smaller than a 24-byte WeakRefString, unless I'm misunderstanding the info given back by @timed.

So I'm still not sure the String overhead is sufficient to explain the nullable frame fitting into 1GB of memory while the typed version takes up 4GB + 4GB swap + ???.
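One way to read the @timed output is to divide the reported bytes by N and compare that with the 13-byte estimate. The sketch below (Julia 1.x syntax) only redoes the arithmetic on the numbers shown above. The allocation total may include work unrelated to the strings themselves (for example compilation on a first call), so the per-string figure is at best an upper bound, not a measurement of String overhead.

```julia
N = 10_000
allocated = 1_375_438      # bytes reported by @timed (v[3] above)
estimated = 8N + 5N        # 8-byte slot per element + 5 bytes for "hello"

per_string_allocated = allocated / N   # bytes actually allocated per element
per_string_estimated = estimated / N   # bytes predicted by the container + data estimate

println(per_string_allocated)
println(per_string_estimated)
println(allocated / estimated)         # ratio of allocated to estimated bytes
```

On these numbers the allocation comes to roughly 137.5 bytes per element against an estimate of 13, about a tenfold gap that the 8-byte container figure does not capture.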

quinnj avatar quinnj commented on August 17, 2024

hey @pearcemc, thanks for opening an issue! Is there any chance you can share the file you're using? Could you also share your system info? I know there have been a few platform-specific issues before.

pearcemc avatar pearcemc commented on August 17, 2024

Sure, it's the page_views_sample.csv from here.

My laptop /proc/meminfo looks like:

ubuntu@ubuntu-UX21E:/db/outbrain$ cat /proc/meminfo | head
MemTotal:        3946968 kB
MemFree:          413704 kB
Buffers:           61688 kB
Cached:          1299912 kB
SwapCached:       130404 kB
Active:          2008912 kB
Inactive:        1269984 kB
Active(anon):    1748120 kB
Inactive(anon):   993240 kB
Active(file):     260792 kB

I'm on Julia 0.5.0.

quinnj avatar quinnj commented on August 17, 2024

This should be fixed on master since we're now relying on plain Vector{Union{T, Null}} for non-String columns, and WeakRefStringArray for String arrays, which will be memory efficient.
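For readers on current Julia, a small sketch of the plain-vector approach follows. It assumes Julia 1.x, where `Missing` in Base plays the role that `Null` played at the time of this comment; the sizes printed depend on the Julia version, so none are quoted here.

```julia
n = 1_000
plain = collect(1:n)   # Vector{Int64}

# Same values, but every tenth entry is missing.
with_missing = Union{Int64, Missing}[i % 10 == 0 ? missing : i for i in 1:n]

println(eltype(with_missing))
println(Base.summarysize(plain))          # bytes used by the plain vector
println(Base.summarysize(with_missing))   # bytes used by the Union-typed vector
println(count(ismissing, with_missing))   # number of missing entries
```

The point of the comparison is that the Union-typed vector stays close to the plain one in size, rather than allocating a separate object per element.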
