
Comments (11)

JackKelly commented on August 29, 2024

The latest code (as of the evening of Fri 2021-05-14) can get ~50 it/s with only 24 GB of RAM usage, without NWP loading (with 4 workers).

from predict_pv_yield.

JackKelly commented on August 29, 2024

NWPDataInMem.get_sample takes about 70 ms per sample, so with 8 samples per batch it takes over half a second per batch. That's probably the issue.

The interpolation (even linear) takes a while. Replacing the linear interpolation with ffill decreases the runtime of NWPDataInMem.get_sample from 70 ms to 15 ms, and increases the training speed from about 5 it/s to 15 it/s.
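The swap from linear interpolation to forward-fill can be sketched in plain pandas (illustrative data and names, not the actual predict_pv_yield code):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly NWP-like series; the real data is a large spatial field.
hourly = pd.Series(
    np.arange(24.0),
    index=pd.date_range("2021-05-14", periods=24, freq="h"),
)
five_min = pd.date_range(hourly.index[0], hourly.index[-1], freq="5min")

# Linear interpolation: smooth in time, but costly on big spatial fields.
linear = hourly.reindex(five_min).interpolate(method="time")

# Forward-fill: just repeat the most recent hourly value; far cheaper.
ffilled = hourly.reindex(five_min, method="ffill")

print(linear.iloc[6], ffilled.iloc[6])  # at 00:30: 0.5 (linear) vs 0.0 (ffill)
```

The trade-off is accuracy: ffill produces a step function rather than a smooth ramp between hourly values.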


JackKelly commented on August 29, 2024

hmmm, maybe the issue is that get_nwp_example resamples the entire NWP field (a big image!). Some options to speed it up:

  1. Resample to 5-minutely in NWPDataLoader.load_single_chunk().
  2. Only resample the data we need. E.g. NWPDataInMem.get_sample() could return hourly data, from start.floor('h') to end.ceil('h'), and it'd then be up to the Transform to resample, after selecting what we need. I like this idea.
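Option 2 can be sketched roughly like this, with hypothetical function names standing in for get_sample() and the Transform (assuming the NWP field is an hourly xarray.DataArray):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical in-memory hourly NWP field (dims and names are illustrative).
nwp = xr.DataArray(
    np.random.rand(48, 16, 16),
    dims=("time", "y", "x"),
    coords={"time": pd.date_range("2021-05-14", periods=48, freq="h")},
)

def get_sample(start: pd.Timestamp, end: pd.Timestamp) -> xr.DataArray:
    # Return hourly data spanning the sample window; no resampling here.
    return nwp.sel(time=slice(start.floor("h"), end.ceil("h")))

def transform(sample: xr.DataArray) -> xr.DataArray:
    # Resample only the (small) selected sample to 5-minutely, via ffill.
    five_min = pd.date_range(
        sample.time.values[0], sample.time.values[-1], freq="5min"
    )
    return sample.sel(time=five_min, method="ffill")

sample = get_sample(pd.Timestamp("2021-05-14 03:10"), pd.Timestamp("2021-05-14 06:40"))
out = transform(sample)
```

The point is that the expensive resample now touches only the few hours a sample needs, not the whole field.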


JackKelly commented on August 29, 2024

I've implemented option 2, and it's helped a lot! NWPDataInMem.get_sample() now takes only 7.26 ms, and the system trains at 25 it/s, with GPU usage hovering around 15%.


JackKelly commented on August 29, 2024

More things to try:

  • Limit the spatial extent of the NWP imagery. DONE: reduces the size of nwp_in_mem to 14 MB (from 37 MB), and reduces the runtime of get_sample to 5.78 ms (from 7.26 ms). Doesn't seem to speed up training much, or reduce memory usage during training much (with 5 workers, uses 53 GB RAM and does about 20 it/s).
  • Run get_sample() from the 3 AsyncDataLoaders in parallel. Try both threads and processes. Thoughts: we can't spawn child processes from daemonic worker processes, and multiple threads may not help because get_sample() is CPU-bound.
  • Use a VM with more RAM, then add more workers. (10 workers, 8-bit NWP, 32-bit PV uses 78 GB and gets about 30 it/s, with GPU usage maxing at 22%. 12 workers = 33 it/s, 99.5 GB RAM.)
  • Use a minimal data type for NWP (uint8 for temperature). DONE: reduces the size of nwp_in_mem to 3.6 MB (from 14 MB) and reduces the runtime of nwp_in_mem.get_sample() to 4.6 ms (from 5.78 ms). Uses about 44 GB RAM during training with 5 workers.
  • Try again without NWP data, to measure memory usage, training speed (it/s), and GPU usage. DONE: without NWP, and with 12 workers, uses 71 GB RAM, achieves 71 it/s, and a max GPU utilisation of 40%.
  • Is the PV data using lots of memory? If so, use a minimal data type for PV? Share data between processes?!
  • Try loading a complete batch at once.
  • Try using different processes for each data source: we can't spawn child processes from daemonic worker processes!
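The uint8 packing for a bounded field like temperature can be sketched as follows; the scale and offset here are illustrative assumptions, not the values used in the real code:

```python
import numpy as np

# Assumed plausible temperature range in degrees C (illustrative only).
T_MIN, T_MAX = -50.0, 60.0

def encode_uint8(temperature: np.ndarray) -> np.ndarray:
    # Map [T_MIN, T_MAX] onto [0, 255] and store as one byte per value.
    scaled = (temperature - T_MIN) / (T_MAX - T_MIN)
    return np.clip(np.round(scaled * 255), 0, 255).astype(np.uint8)

def decode_uint8(packed: np.ndarray) -> np.ndarray:
    # Invert the mapping; quantisation error is at most half a step (~0.22 C).
    return packed.astype(np.float32) / 255 * (T_MAX - T_MIN) + T_MIN

temps = np.array([-10.0, 0.0, 15.5, 30.0])
packed = encode_uint8(temps)     # 4 bytes instead of 32
restored = decode_uint8(packed)  # within ~0.22 C of the original
```

This cuts memory 8x versus float64 at the cost of a fixed quantisation error, which is usually negligible relative to NWP forecast error.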


JackKelly commented on August 29, 2024

So, we know that including NWP data slows training down by a factor of more than 2x.

get_sample takes 4.3 ms for NWP and 1.13 ms for sat data. So maybe we need to speed up get_sample for NWP?


JackKelly commented on August 29, 2024

Profiling each line in get_nwp_example:

0.179 ms: date_range
1.686 ms: nwp.sel(init_time=target_times_hourly, method='ffill')
0.157 ms: init_time_future
0.043 ms: init_times[target_times_hourly > t0_hourly]
0.216 ms: steps = target_times_hourly - init_times
0.360 ms: init_time_indexer
0.103 ms: step_indexer
1.526 ms: nwp.sel(init_time=init_time_indexer, step=step_indexer)
CPU times: user 7.57 ms, sys: 0 ns, total: 7.57 ms
Wall time: 6.46 ms
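For context, here is a rough reconstruction of the selection logic being profiled, on made-up data and shapes. The two .sel() calls are the ones that dominate above; the index arithmetic in between is cheap:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical NWP array indexed by forecast init_time and lead-time step.
init_times = pd.date_range("2021-05-14", periods=8, freq="6h")
steps = pd.timedelta_range("0h", "12h", freq="1h")
nwp = xr.DataArray(
    np.random.rand(len(init_times), len(steps), 8, 8),
    dims=("init_time", "step", "y", "x"),
    coords={"init_time": init_times, "step": steps},
)

target_times = pd.date_range("2021-05-14 03:00", periods=4, freq="1h")

# First .sel(): most recent init_time at or before each target time.
selected_inits = nwp.sel(init_time=target_times, method="ffill").init_time.values
wanted_steps = target_times - pd.DatetimeIndex(selected_inits)

# Second .sel(): vectorised pointwise selection, one (init_time, step)
# pair per target time.
init_indexer = xr.DataArray(selected_inits, dims="target_time")
step_indexer = xr.DataArray(wanted_steps, dims="target_time")
sample = nwp.sel(init_time=init_indexer, step=step_indexer)
```

Sharing the "target_time" dimension between the two indexer DataArrays is what makes xarray select pointwise pairs rather than the full outer product.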


JackKelly commented on August 29, 2024

oooh... looks like it's possible to significantly speed up the selection based on 'step' by first transposing so that 'step' is the first dimension. This gets the runtime down to 1.73 ms if we always use the first init_time. Need to see if this speed-up holds when using multiple init times based on t0.


JackKelly commented on August 29, 2024

Nope, it doesn't look like transposing gives us the same performance increase when selecting multiple init times.

But, better news: I noticed that, when using NWPs, the code is almost constantly loading from disk when min_n_samples_per_disk_load = 1000 and max_n_samples_per_disk_load = 2000. Increasing these to 4,000 and 8,000, respectively, gets us up to 50 it/s after 30,000 iterations (yay!) with NWPs and 12 workers.

To really speed things up, I think we perhaps need to re-create the NWP Zarr, so the data is stored more efficiently on disk (#26).


JackKelly commented on August 29, 2024

Swapping back to the 'old', more thorough way of getting NWPs gives us 47.8 it/s.


JackKelly commented on August 29, 2024

We can't launch sub-processes from the worker processes: daemonic child processes aren't allowed to have children :)

