Comments (26)
I also did a quick test with my original slim script using slim from the last commit from @petrelharp (MesserLab/SLiM@dc0e413). slim
command only used 200-300Mb RAM after 500 restarts. This almost fixed MesserLab/SLiM#404. I will report back if I see the failure due to out of memory again.
Thank you @petrelharp @bhaller
from tskit.
I think?
subpopMap
was added in SLiM 4.0, I think, and we're now on 4.0.1.
Thanks, I had a fresh clone, but somehow master was out-of-date. No idea why!
This should run to completion under valgrind in perhaps a minute or so
Yep, have that running now
I'll leave this for the tskit folks to close when they're satisfied
Ok, I'm going to do a little bit of digging, but glad this is fixed now.
from tskit.
I have reproduced the leak with valgrind. Investigating.
from tskit.
Could you post a version that does the same kind thing but runs much faster? Valgrind makes a process run much slower.
from tskit.
For a quick one using scripts from pyslim:
git clone [email protected]:tskit-dev/pyslim
cd pyslim/tests/test_recipes
slim recipe_WF_early_late.slim # to make out.trees
valgrind --leak-check=full --track-origins=yes slim -d "TREES_IN='out.trees'" -d "STAGE='early'" restart_WF.slim
and the output is (lengthy and) attached.
valgrind.log
from tskit.
Hm, well I've just realized we haven't updated the tskit code in a bit (since C_1.0, and now we're at C_1.1.2). I've done a quick update to the code: MesserLab/SLiM#406 (but haven't yet checked if it works...).
from tskit.
Well, running my test above with the tskit-updated SLiM results in fewer bytes "definitely lost" and a lot fewer entries in the valgrind log. Here's the result (attached).
valgrind2.log
from tskit.
Looks to me like this has resolved the vast majority of the problem. We still get a 400-byte leak with every reload; I'm not sure where it's coming from exactly; one tool indicates the line variant = (tsk_variant_t *)malloc(sizeof(tsk_variant_t))
, but we do call ret = tsk_variant_free(variant)
so I'm not sure. But anyway, the leak of the entire tables is fixed, so that's good. @benjeffery, do you remember fixing this problem now? Is it understood what's going on here – can we forget about it, do you think?
from tskit.
Oh, tsk_variant_free(variant)
just frees the stuff inside the variant object, doesn't it – not the malloc itself. So that bit is SLiM's leak, I think?
from tskit.
Thanks for the update @petrelharp!
from tskit.
I've just pushed a fix for the remaining 400-byte per reload leak, which was due to not calling free() on a malloced tsk_variant_t
struct. There was no reason to malloc it in the first place; it's now stack-local. The original test model now runs cleanly with no leaks. It would still be good to be sure that we understand why updating to the new tskit fixed the problem, just so we're sure that a regression didn't get "fixed" by another regression or something, but perhaps the search for that is tilting at windmills, since I guess there have been quite a few diffs between C_1.0 and C_1.1.2; that's up to you, @benjeffery; in any case, thank you for your help. Thanks for the very good idea, @petrelharp! I'll leave this for the tskit folks to close when they're satisfied, but I have closed MesserLab/SLiM#404 over on my side.
from tskit.
Hmm, having been through the diff I don't see anything. All the kastore memory ownership changes were already in 1.0 and are pretty heavily tested for leaks etc. I don't think it's worth digging any further here unless we get any other leak reports. Thanks for raising the issue.
from tskit.
The tskit ones are defined here:
https://github.com/tskit-dev/tskit/blob/main/.circleci/config.yml#L64
from tskit.
Phew! Glad that worked to fix things.
from tskit.
So valgrind here is giving:
==2852465== Syscall param write(buf) points to uninitialised byte(s)
==2852465== at 0x4CD00A7: write (write.c:26)
==2852465== by 0x4C50EBC: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1181)
==2852465== by 0x4C517D7: new_do_write (fileops.c:449)
==2852465== by 0x4C517D7: _IO_new_file_xsputn (fileops.c:1255)
==2852465== by 0x4C517D7: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==2852465== by 0x4C453F0: fwrite (iofwrite.c:39)
==2852465== by 0x5C948B: kastore_close (in /home/benj/projects/slim-leak/build/slim)
Which indicates we're trying to write uninit'd memory. Which is odd. Could it be that you're dumping a free'd ts? I haven't dug into slim code yet. Time to sleep!
from tskit.
Hmm, odd. There's no sign of a problem like that; all we have is the memory leak, there's never a crash or any other bad behavior. The models seems to work fine, and it can reload the tree sequence over and over without problems. If it were relying upon uninited memory, I wouldn't expect that to be the case. SLiM passes all its unit tests, and lots of people are using it without problems; it just leaks. Very strange bug, so far.
from tskit.
Probably yes; give me a couple minutes. I apologize that this is disturbing your sleep; me too. :-O
from tskit.
OK, I think this runs much faster and still exhibits the leak:
function (NULL)set_default(string k, lifs v)
{
if (!exists(k))
defineConstant(k, v);
catn(c("Parameter", k, v), sep='\t');
return NULL;
}
initialize()
{
set_default("L", 750000); // chromosome length in bp
set_default("nchrs", 14); // chromosome length in bp
set_default("selpos", asInteger(L / 3)); // selection position in bp
set_default("num_origins", 2); //how many genomes contains the selected mutation when selection starts.
set_default("N", 10000); // ancient effective population size
set_default("h", 0.5); // dominant coefficient
set_default("s", 0.3); // selection coefficient
set_default("g_sel_start", 80); // time of selected mutation being introduced (generations ago --BACKWARD)
set_default("r", 6.67e-7); // recombinantion rate
set_default("outid", 1); // idx
set_default("max_restart", 10000); // max number of restart
set_default("N0", 1000); // the effective population size at sampling time
set_default("g_ne_change_start", 200); // Ne change time (generations ago -- BACKWARD)
set_default("slim_total_generations", // time of simulation ended -- forward
201);
initializeSLiMOptions(keepPedigrees=T);
initializeTreeSeq(simplificationInterval=1000);
initializeMutationRate(0.0);
initializeMutationType("m1", 0.5, "f", 0.0); // neutral
initializeMutationType("m2", h, "f", s); // selected
initializeGenomicElementType("g1", m1, 1);
// whole genome length
initializeGenomicElement(g1, 0, L * nchrs - 1);
// rate maps
ends = rep(0, 2 * nchrs - 1);
rates = rep(r, 2 * nchrs - 1);
for (i in 1:nchrs)
{
ends[i * 2 - 2] = L * i - 1; // Note: Eidos indexing is from 0 not from zeros
}
for (i in 1:(nchrs - 1))
{
rates[i * 2 - 1] = 0.5;
ends[i * 2 - 1] = L * i;
}
initializeRecombinationRate(rates, ends);
// define global
defineGlobal("restart_counter", 1);
}
1 early()
{
sim.addSubpop("p1", N);
}
// control populatio size
2: early()
{
t = slim_total_generations - sim.cycle; // generation ago
Nt = (N / N0)^(t / g_ne_change_start) * N0; // calculate Nt
p1.setSubpopulationSize(asInteger(Nt)); // set new population size
}
// condition on selection establishment (not lost)
10:20 late()
{
if (sim.cycle == 10)
{
catn("WRITING...");
sim.treeSeqOutput(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
for (i in 1:nchrs){
sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
}
}
else if (sim.cycle >= 12)
{
catn("RESTARTING...");
sim.readFromPopulationFile(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
setSeed(rdunif(1, 0, asInteger(2^62 - 1)));
for (i in 1:nchrs){
sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
}
restart_counter = restart_counter + 1;
}
}
from tskit.
For a quick one using scripts from pyslim:
git clone [email protected]:tskit-dev/pyslim cd pyslim/tests/test_recipes slim recipe_WF_early_late.slim # to make out.trees valgrind --leak-check=full --track-origins=yes slim -d "TREES_IN='out.trees'" -d "STAGE='early'" restart_WF.slim
and the output is (lengthy and) attached. valgrind.log
So does this mean that it is definitively a regression, then? Or has this test not been run under valgrind before? (It would be nice if there were a way to check with leaks with CI; I've not looked into that.)
from tskit.
Yes, worthwhile for sure. I have a vague feeling that we had a patch living in SLiM; did you do a diff with the old sources before you copied in the new? Anyway, I can do a test tomorrow to see if this helps the problem, but if @benjeffery doesn't remember fixing this bug, my guess is that it won't...
from tskit.
If this doesn't fix it, perhaps the thing to do is to extract out the tsk_X calls into a simple C program and verify there?
from tskit.
I don't know why this issue got closed, WTF. Oh, @petrelharp – you need to watch your wording, you marked that PR as fixing this issue! :-> Reopening. For the record: there was indeed a patch in SLiM's copy of tskit, but it got resolved by the update so all is well.
from tskit.
I don't remember fixing anything like this, but the tskit C API has pretty good leak checking as part of CI, so I'd be surprised if we have an issue there. I'll try again with the quicker code and the updated code.
from tskit.
The quicker code is still taking over 30min to run under valgrind, and the pyslim "restart_WF" script is giving me:
ERROR (EidosInterpreter::_ProcessArgumentList): unrecognized named argument subpopMap to function readFromPopulationFile.
Error on script line 19, character 8:
sim.readFromPopulationFile(TREES_IN, subpopMap=SUBPOP_MAP);
So I can't really dig into this without a working minimal script.
from tskit.
@benjeffery Hmm, "unrecognized named argument subpopMap" seems to indicate that you're running an old version of SLiM, I think? subpopMap
was added in SLiM 4.0, I think, and we're now on 4.0.1.
As for the script still taking more than 30min to run, yes – sorry, I'm an idiot. I assumed you were working in the debugger in the same way as me, observing leaks across one reload operation at a time, but of course you're not, you're running the script to completion under valgrind – and the script I supplied runs forever. It just keeps doing reloads over and over, ad infinitum, because that's how I was testing. (It's "faster" in the sense that the time to the first reload and the time between reloads is much shorter, which for some reason is what I thought you meant.) So, mea culpa, I apologize for wasting 30 minutes of your life. :-> Here is a script that does two reloads and then ends:
function (NULL)set_default(string k, lifs v)
{
if (!exists(k))
defineConstant(k, v);
catn(c("Parameter", k, v), sep='\t');
return NULL;
}
initialize()
{
set_default("L", 750000); // chromosome length in bp
set_default("nchrs", 14); // chromosome length in bp
set_default("selpos", asInteger(L / 3)); // selection position in bp
set_default("num_origins", 2); //how many genomes contains the selected mutation when selection starts.
set_default("N", 10000); // ancient effective population size
set_default("h", 0.5); // dominant coefficient
set_default("s", 0.3); // selection coefficient
set_default("g_sel_start", 80); // time of selected mutation being introduced (generations ago --BACKWARD)
set_default("r", 6.67e-7); // recombinantion rate
set_default("outid", 1); // idx
set_default("max_restart", 10000); // max number of restart
set_default("N0", 1000); // the effective population size at sampling time
set_default("g_ne_change_start", 200); // Ne change time (generations ago -- BACKWARD)
set_default("slim_total_generations", // time of simulation ended -- forward
201);
initializeSLiMOptions(keepPedigrees=T);
initializeTreeSeq(simplificationInterval=1000);
initializeMutationRate(0.0);
initializeMutationType("m1", 0.5, "f", 0.0); // neutral
initializeMutationType("m2", h, "f", s); // selected
initializeGenomicElementType("g1", m1, 1);
// whole genome length
initializeGenomicElement(g1, 0, L * nchrs - 1);
// rate maps
ends = rep(0, 2 * nchrs - 1);
rates = rep(r, 2 * nchrs - 1);
for (i in 1:nchrs)
{
ends[i * 2 - 2] = L * i - 1; // Note: Eidos indexing is from 0 not from zeros
}
for (i in 1:(nchrs - 1))
{
rates[i * 2 - 1] = 0.5;
ends[i * 2 - 1] = L * i;
}
initializeRecombinationRate(rates, ends);
// define global
defineGlobal("restart_counter", 1);
}
1 early()
{
sim.addSubpop("p1", N);
}
// control populatio size
2: early()
{
t = slim_total_generations - sim.cycle; // generation ago
Nt = (N / N0)^(t / g_ne_change_start) * N0; // calculate Nt
p1.setSubpopulationSize(asInteger(Nt)); // set new population size
}
// condition on selection establishment (not lost)
10:15 late()
{
if (sim.cycle == 10)
{
catn("WRITING...");
sim.treeSeqOutput(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
for (i in 1:nchrs){
sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
}
}
else if ((sim.cycle >= 12) & (restart_counter < 3))
{
catn("RESTARTING...");
sim.readFromPopulationFile(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
setSeed(rdunif(1, 0, asInteger(2^62 - 1)));
for (i in 1:nchrs){
sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
}
restart_counter = restart_counter + 1;
}
}
This should run to completion under valgrind in perhaps a minute or so; without valgrind it takes only ten seconds or so. Again, my apologies.
from tskit.
Interesting. Thanks for looking into it. Must be something about SLiM's particular usage pattern, I guess. It would be nice to have leak-testing as part of SLiM's CI; right now the leak-testing I do is just ad hoc. Something to ponder.
from tskit.
Related Issues (20)
- edges lost after merging two trees HOT 20
- Codecov upload issues HOT 1
- keep_intervals() giving _tskit.LibraryError: Can't squash, flush, simplify or link ancestors... HOT 7
- Update GitHub upload/download artefacts
- Add XTable.drop_metadata HOT 1
- trees.c Compilation Error HOT 2
- Split large numbers in html/cli print out.
- Post release for 0.5.7 HOT 1
- Fixup tests for lshmm 0.0.6 HOT 7
- Fix tests for numpy 2.0 HOT 1
- Drop "benchmark" CI job?
- Pass a numpy array of booleans to python simplify? HOT 7
- Cannot pickle '_tskit.Tree' object HOT 3
- Trivial wording change HOT 1
- clarify that genetic_relatedness includes self comparisons HOT 1
- set default for genetic_relatedness to polarised=True and remove a factor of 1/2
- "mutation" mode for statistics HOT 1
- inconsistent handling of non-ancestral material in non-strict, unpolarised branch stats HOT 2
- wrong output for genetic_relatedness(... indexes=None, proportion=True)
- Clarification on pairwise coalescent rates HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from tskit.