Comments (18)
In the worst case, the machine becomes so slow that I don't have time to close apps when I notice that memory is about to run out. I do not use swap, so at that point it basically crashes my machine. That's with 16GB of RAM, and it happened twice in the last week, so it's not a theoretical possibility. If I use fewer cores, I have a bit more time to see it coming: the mouse cursor is still visible, AND clicking the close icon still works :).
from hledger-flow.
I'm guessing this happens only when you do a full import that re-generates all files, right?
I'm interested to know if --new-files-only and/or --start-year is useful to you, and if it prevents this issue?
from hledger-flow.
Ouch. @lestephane feel free to give more numbers describing what's happening with just one of those hledger invocations.
from hledger-flow.
@lestephane could you please run your import again with the code in branch async-batches?
I didn't have time to fully investigate the cause and possible solutions, but the code in that branch tries to import the files in smaller batches.
The batch size isn't configurable for now (it uses one less than the number of cores detected), but if this is the way to proceed I can make it a command-line option.
This is my output:
hledger-flow import --show-options
RuntimeOptions {baseDir = "/thebase/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 darwin x86_64 ghc 8.10 b8952e8c8b8f12ae417c6508acb3dbcd6db07d17", hledgerInfo = HledgerInfo {hlPath = "/Users/andreas/.local/bin/hledger", hlVersion = "hledger 1.20.99"}, sysInfo = SystemInfo {os = "darwin", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}}, verbose = False, showOptions = True, sequential = False}
Collecting input files from /Users/andreas/hledger/import/
Found 204 input files in 0.019031133s. Proceeding with import...
DEBUG: detected 12 cores - processing files in batches of 11
Wrote include files for 204 journals in 0.071981582s
Imported 204/204 journals in 8.915368045s
from hledger-flow.
> Ouch. @lestephane feel free to give more numbers describing what's happening with just one of those hledger invocations.
No need to worry on the hledger end. This is not a case of a runaway memory leak in there. I just have too many apps open on my desktop, and memory usage goes over the top. By running with fewer cores, I could at least reduce memory usage enough to keep working while it imports.
from hledger-flow.
> @lestephane could you please run your import again with the code in branch async-batches?
I will use that branch and report.
from hledger-flow.
> I'm guessing this happens only when you do a full import that re-generates all files, right?
It happens when I run an import on the top directory. If that's what you mean by regenerating all files, then I guess the answer is yes.
> I'm interested to know if --new-files-only and/or --start-year is useful to you, and if it prevents this issue?
I have yet to try those...
from hledger-flow.
> @lestephane could you please run your import again with the code in branch async-batches?
The import is not 1/8th slower but more like 1/3 slower (43.7s versus 32.7s, about 34% more wall-clock time):
~/.local/bin/hledger-flow-async-batches import
DEBUG: detected 8 cores - processing files in batches of 7
...
Wrote include files for 1328 journals in 1.216674784s
Imported 921/921 journals in 43.677151979s
~/.local/bin/hledger-flow-v0.14.3.0 import
...
Wrote include files for 1328 journals in 0.726477778s
Imported 921/921 journals in 32.666655998s
But keep in mind that most of the work is done in AWK and shell scripts that hledger-flow has no control over, so I can only speculate from this point onwards.
I think you meant to use N - 1, where N is the count of physical cores. In my case you'd use 3 physical cores by default, or 6 virtual cores if you're able to ensure each pair of virtual cores is pinned to the same physical core. Otherwise the last (7th) core is fighting for the same fast CPU caches as its virtual counterpart on the same physical CPU, which is used by the other apps (and those do a lot of unrelated things).
I wonder how much control you have over CPU affinity from Haskell...
from hledger-flow.
It looks like the info I have available to work with is listed here, e.g. getNumCapabilities and getNumProcessors:
https://hackage.haskell.org/package/base-4.14.1.0/docs/GHC-Conc.html#g:1
getNumCapabilities seems to be the number of cores that are available to use.
getNumProcessors seems to be just the total number of cores that the Haskell RTS knows about.
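For reference, here is a minimal sketch of how these two numbers can be read from GHC.Conc. The mapping to the cores and availableCores labels is an assumption based on the --show-options output in this thread; it is not copied from the hledger-flow source.

```haskell
import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
  -- Number of CPUs the machine reports to the runtime.
  processors <- getNumProcessors
  -- Number of capabilities, i.e. Haskell threads that can run truly
  -- simultaneously; this is what +RTS -N<x> changes in a -threaded build.
  capabilities <- getNumCapabilities
  -- The labels below are assumed, mirroring the field names seen in the thread.
  putStrLn ("cores = " ++ show processors)
  putStrLn ("availableCores = " ++ show capabilities)
```

When compiled with -threaded and -rtsopts, passing +RTS -N4 -RTS should make the second number 4. Without any -N setting (and without a default baked in via -with-rtsopts) the capability count is 1.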
I have added these 2 numbers to the output of --show-options.
Then there is also a new --batch-size option to play with.
It is possible to give parameters directly to the Haskell runtime system by specifying flags sandwiched between +RTS and -RTS.
E.g. to limit the available cores to 6 you can say this:
hledger-flow +RTS -N6 -RTS import --show-options --batch-size 50
However, this didn't seem to have much of a limiting effect when I tried it.
I do however use this number for the default batch size - now the available cores minus 2.
Please test and let me know.
Here is my output, note the cores:
hledger-flow +RTS -N4 -qa -RTS import --show-options
RuntimeOptions {baseDir = "/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 darwin x86_64 ghc 8.10 9dbe075820f6dadce82352e97f3aa2c9dcb4ba37", hledgerInfo = HledgerInfo {hlPath = "/Users/andreas/.local/bin/hledger", hlVersion = "hledger 1.20.99"}, sysInfo = SystemInfo {os = "darwin", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 12, availableCores = 4}, verbose = False, showOptions = True, sequential = False, batchSize = 2}
Collecting input files from /Users/andreas/hledger/import/
Found 204 input files in 0.018718693s. Proceeding with import...
Wrote include files for 204 journals in 0.059583955s
Imported 204/204 journals in 19.332482559s
from hledger-flow.
I tried with 2 in the sandwich parameter, and it fails with exit code 137, and there is no output despite using --verbose.
I see it has batchSize = 0. So maybe an off-by-one mistake.
$ NCORES=2 hlimport . --verbose
/home/lestephane/.local/bin/hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N2 -RTS --show-options import . --verbose
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Finance/hledger", hlVersion = "hledger 1.21.99"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 2}, verbose = True, showOptions = True, sequential = False, batchSize = 0}
2021-04-14 12:41:39.262296864 EEST hledger-flow Starting import
Collecting input files from /home/lestephane/Vault/Finance/import/
Found 922 input files in 0.787816128s. Proceeding with import...
$ echo $?
137
If I use 3 it works, and there is output
$ NCORES=3 hlimport . --verbose
/home/lestephane/.local/bin/hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N3 -RTS --show-options import . --verbose
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Finance/hledger", hlVersion = "hledger 1.21.99"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 3}, verbose = True, showOptions = True, sequential = False, batchSize = 1}
2021-04-14 12:46:13.99739153 EEST hledger-flow Starting import
Collecting input files from /home/lestephane/Vault/Finance/import/
Found 922 input files in 0.096201918s. Proceeding with import...
...
journal","import/_manual_/2021/post-import.journal","import/../prices/2021/prices.journal"]. Found 1: ["import/2021-closing.journal"]
Wrote include files for 1329 journals in 0.533982496s
Imported 922/922 journals in 117.976823151s
$ echo $?
0
I'm not sure of the relationship between batchSize and cores now. It used to be one-to-one, but I see you used batchSize 100 in your last example. If I use 100 for batchSize, have 100 scripts, use 2 cores, and the scripts all take a long time, are there potentially 100 script 'threads' being executed simultaneously with preemptive concurrency on these two cores? If so, the batch size becomes a nice slider to control maximum memory usage at any one time, while the core count keeps other apps from being starved. This would cover both memory and CPU starvation concerns.
from hledger-flow.
> I tried with 2 in the sandwich parameter, and it fails with exit code 137, and there is no output despite using --verbose.
> I see it has batchSize = 0. So maybe an off-by-one mistake.
That's because the default batch size is currently the number of available cores minus 2. So if you specify 2 it will try to process files in batches of 0 😄. I can see how that's not very helpful; I should build some kind of minimum in there.
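A minimal sketch of what such a minimum could look like. The function name defaultBatchSize is hypothetical and the "cores minus 2" rule is taken from the description above; the actual code in the branch may be structured differently.

```haskell
-- Hypothetical helper: derive the default batch size from the number of
-- available cores, but never go below 1, so that +RTS -N2 (or -N1)
-- cannot produce a batch size of 0.
defaultBatchSize :: Int -> Int
defaultBatchSize availableCores = max 1 (availableCores - 2)

main :: IO ()
main = mapM_ report [1, 2, 3, 4, 8, 12]
  where
    report n =
      putStrLn (show n ++ " cores -> batch size " ++ show (defaultBatchSize n))
```

With this clamp, 2 available cores give a batch size of 1 instead of 0, while the values for 3, 4 and 8 cores (1, 2 and 6) stay the same as in the --show-options output posted in this thread.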
> If I use 3 it works, and there is output
And then the batch size is 1, which means it is basically sequential processing.
> I'm not sure of the relationship between batchSize and cores now
There isn't a direct relationship between cores and batch size.
The way I understand it: there is a pool of available cores, and some number of actions (the batch size). The Haskell runtime takes care of scheduling the actions in parallel among all the cores. A smaller batch size will give the system a bit more breathing room, but processing will also be slower.
We'll need to experiment to find a reasonable default that is fast enough and doesn't crash anyone's computer.
The releases in master didn't do any batching, so all the import actions (about 1000 in your case) are given to the Haskell runtime at once to process in parallel. To get the same behaviour in this branch, use a batch size that is greater than the number of input files, e.g. 2000.
The +RTS parameters are just something I'm experimenting with to see how they affect behaviour. In my case I couldn't really see them limiting the number of available cores when specified.
The only thing they affect right now is the default batch size, which is something I'm setting based on the number of available cores. The real default should definitely change to something better based on what we learn.
I don't think we should use +RTS as a documented way of tweaking performance; I think batch size (or some other idea?) should be the way to go.
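One candidate for "some other idea", offered purely as an illustrative sketch and not something the async-batches branch does: cap the number of actions that are in flight at any moment with a semaphore, instead of waiting for a whole batch to finish. The name runBounded is hypothetical, and the sketch uses only the base library (forkIO, MVar, QSem).

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (SomeException, bracket_, try)

-- Hypothetical helper: run all actions concurrently, but let at most
-- `limit` of them execute at the same time. Results keep the input order.
runBounded :: Int -> [IO a] -> IO [Either SomeException a]
runBounded limit actions = do
  sem <- newQSem (max 1 limit)          -- clamp so a limit of 0 cannot deadlock
  slots <- mapM (start sem) actions     -- fork one lightweight thread per action
  mapM takeMVar slots                   -- wait for every result
  where
    start sem action = do
      slot <- newEmptyMVar
      -- Each thread takes a semaphore unit before running and always
      -- returns it afterwards, even if the action throws.
      _ <- forkIO (try (bracket_ (waitQSem sem) (signalQSem sem) action) >>= putMVar slot)
      pure slot

main :: IO ()
main = do
  results <- runBounded 4 [pure (n + 1) | n <- [1 .. 20 :: Int]]
  putStrLn ("finished " ++ show (length results) ++ " actions")
```

Compared with fixed batches, a slow action here only occupies one of the slots rather than holding up the start of the next batch. Whether that matters for real imports is something only measurements can tell.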
from hledger-flow.
Two command-line arguments are fine with me, one for cores, and one for batch size.
I have another problem I need to report on that branch. I'll submit it as soon as I'm done bookkeeping.
from hledger-flow.
> two command-line arguments are fine with me, one for cores, and one for batch size.
OK, but the argument that is supposed to limit the cores doesn't actually limit the cores on my machine, in which case it doesn't make sense to specify it.
I used htop to check which cores are being used while I set the batch size high and the available cores low.
The processing time and htop usage looked the same to me for these 3 commands:
hledger-flow +RTS -N12 -RTS import --show-options --batch-size 100
hledger-flow +RTS -N4 -RTS import --show-options --batch-size 100
hledger-flow +RTS -N1 -RTS import --show-options --batch-size 100
Can you check how it behaves on your machine?
Also, can you confirm if there is an optimum batch size that solves the problem of your machine being unresponsive and crashing?
Because I think you're the only one that can verify that, I've never had that problem.
from hledger-flow.
I've got more measurements, but somehow I don't think it will make things clearer :)
I'm a bit out of my depth with the underlying mechanism used by Haskell to execute batches.
In my case with 900 journals, a batch size of 100 already brings a significant speedup (for 6 cores).
What I can convey is which ones got the fan spinning, I need a better tool for that.
Benchmarking is hard.
--------------------------------------------------------------
for n in 6 4 1; do NCORES=$n hlimport . --batch-size 100; done
--------------------------------------------------------------
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N6 -RTS --show-options import . --batch-size 100
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 6}, verbose = False, showOptions = True, sequential = False, batchSize = 100}
Imported 923/923 journals in 33.857099658s
real 0m34.197s
user 2m13.156s
sys 0m29.435s
--
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N4 -RTS --show-options import . --batch-size 100
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 4}, verbose = False, showOptions = True, sequential = False, batchSize = 100}
Imported 923/923 journals in 32.037818092s
real 0m32.357s
user 2m9.838s
sys 0m29.022s
--
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N1 -RTS --show-options import . --batch-size 100
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 1}, verbose = False, showOptions = True, sequential = False, batchSize = 100}
Imported 923/923 journals in 34.272530103s
real 0m34.585s
user 2m13.743s
sys 0m35.522s
--------------------------------------------------------------
for n in 10 100 1000; do time hlimport . --batch-size $n; done
--------------------------------------------------------------
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N6 -RTS --show-options import . --batch-size 10
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 6}, verbose = False, showOptions = True, sequential = False, batchSize = 10}
Imported 923/923 journals in 39.66164206s
real 0m40.012s
user 1m58.946s
sys 0m28.599s
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N6 -RTS --show-options import . --batch-size 100
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 6}, verbose = False, showOptions = True, sequential = False, batchSize = 100}
Imported 923/923 journals in 32.491148357s
real 0m32.839s
user 2m10.826s
sys 0m29.489s
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N6 -RTS --show-options import . --batch-size 1000
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 6}, verbose = False, showOptions = True, sequential = False, batchSize = 1000}
Imported 923/923 journals in 31.399761606s
real 0m31.761s
user 2m27.096s
sys 0m34.187s
--------------------------------------------------------------
for n in 8 6 4; do time NCORES=$n hlimport .; done
--------------------------------------------------------------
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N8 -RTS --show-options import .
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 8}, verbose = False, showOptions = True, sequential = False, batchSize = 6}
Imported 923/923 journals in 47.928125652s
real 0m48.746s
user 2m1.999s
sys 0m28.954s
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N6 -RTS --show-options import .
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 6}, verbose = False, showOptions = True, sequential = False, batchSize = 4}
Imported 923/923 journals in 50.670000196s
real 0m50.994s
user 1m52.795s
sys 0m29.061s
hledger-flow-async-batches-793f882bb22ac7b89a98077ee95b3464bbc5c0e0 +RTS -N4 -RTS --show-options import .
RuntimeOptions {baseDir = "/home/lestephane/Vault/Finance/", importRunDir = "./", importStartYear = Nothing, onlyNewFiles = False, hfVersion = "hledger-flow 0.14.3.99 linux x86_64 ghc 8.10 793f882bb22ac7b89a98077ee95b3464bbc5c0e0", hledgerInfo = HledgerInfo {hlPath = "/home/lestephane/Vault/Finance/hledger", hlVersion = "hledger 1.21"}, sysInfo = SystemInfo {os = "linux", arch = "x86_64", compilerName = "ghc", compilerVersion = Version {versionBranch = [8,10], versionTags = []}, cores = 8, availableCores = 4}, verbose = False, showOptions = True, sequential = False, batchSize = 2}
Imported 923/923 journals in 71.914747204s
real 1m12.234s
user 1m52.751s
sys 0m31.280s
from hledger-flow.
Hmm, I think I know what you mean now. I tried 6 cores with 250 as batchSize, and it looks like all cores are still used. The mystery remains.
from hledger-flow.
Thanks for sharing the benchmarks.
I think the RTS params are effectively ignored (apart from setting the default batch size); the batch size is the only variable that makes a difference.
So what I do with the batches (assuming a batch size of 100 to have some concrete number) is take 100 IO actions (where an "action" is an import file to journal file conversion in this case), then I tell Haskell to execute the 100 actions in parallel, and I wait until they're all done, before I do the same for the next 100 actions.
Previously I just gave all 1000 input files to Haskell at once.
I don't have a way (that I know of) to control it more than that. The smaller batches should use less memory and give the CPUs time to catch some breath before doing the next bit of work. Maybe you can confirm what your memory usage looks like with different batch sizes.
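The batching described above can be sketched roughly as follows. This is an illustration under assumptions, not the hledger-flow implementation: the names chunksOf, runBatch and runInBatches are made up for the example, and it uses plain forkIO and MVar from base rather than whatever concurrency library the branch actually uses.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)

-- Split a list into chunks of at most n elements (n is clamped to >= 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs =
  let (batch, rest) = splitAt (max 1 n) xs
  in batch : chunksOf n rest

-- Start every action of one batch in its own thread, then wait for all.
runBatch :: [IO a] -> IO [Either SomeException a]
runBatch actions = do
  slots <- mapM start actions
  mapM takeMVar slots
  where
    start action = do
      slot <- newEmptyMVar
      _ <- forkIO (try action >>= putMVar slot)
      pure slot

-- Run the batches one after another; the next batch only starts once
-- every action of the previous batch has finished.
runInBatches :: Int -> [IO a] -> IO [Either SomeException a]
runInBatches batchSize actions =
  concat <$> mapM runBatch (chunksOf batchSize actions)

main :: IO ()
main = do
  -- 250 trivial stand-in actions, processed as batches of 100, 100 and 50.
  results <- runInBatches 100 [pure (n * 2) | n <- [1 .. 250 :: Int]]
  putStrLn ("finished " ++ show (length results) ++ " actions")
```

In a sketch like this the batch size is the upper bound on how many actions are in flight at once. If each action mostly waits on an external process, as with the AWK and shell scripts mentioned earlier in the thread, that bound may matter more for memory use than the number of cores given to the Haskell runtime.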
from hledger-flow.
I'll try to find a way to include memory and load, not just timing, in my benchmark.
from hledger-flow.
Released batching support in version 0.14.4.
from hledger-flow.