Hi folks (ping <a class="user-mention notranslate" data-hovercard-type="user" data-hov

I also did a quick test with my original slim using slim from the last commit f

For a quick one using s from <a href="https://github.com/tskit-dev/pyslim/tree/m

Thanks for the update <a class="user-mention notranslate" data-hovercard-type="user" d

leaks when loading tree sequence in SLiM about tskit HOT 26 CLOSED

bhaller commented on September 16, 2024 1

leaks when loading tree sequence in SLiM

from tskit.

Comments (26)

bguo068 commented on September 16, 2024 2

I also did a quick test with my original slim script using slim from the last commit from @petrelharp (MesserLab/SLiM@dc0e413). slim command only used 200-300Mb RAM after 500 restarts. This almost fixed MesserLab/SLiM#404. I will report back if I see the failure due to out of memory again.

Thank you @petrelharp @bhaller

from tskit.

benjeffery commented on September 16, 2024 2

I think? subpopMap was added in SLiM 4.0, I think, and we're now on 4.0.1.

Thanks, I had a fresh clone, but somehow master was out-of-date. No idea why!

This should run to completion under valgrind in perhaps a minute or so
Yep, have that running now

I'll leave this for the tskit folks to close when they're satisfied

Ok, I'm going to do a little bit of digging, but glad this is fixed now.

from tskit.

benjeffery commented on September 16, 2024 1

I have reproduced the leak with valgrind. Investigating.

from tskit.

benjeffery commented on September 16, 2024 1

Could you post a version that does the same kind thing but runs much faster? Valgrind makes a process run much slower.

from tskit.

petrelharp commented on September 16, 2024 1

For a quick one using scripts from pyslim:

git clone [email protected]:tskit-dev/pyslim
cd pyslim/tests/test_recipes
slim recipe_WF_early_late.slim    # to make out.trees
valgrind --leak-check=full --track-origins=yes slim -d "TREES_IN='out.trees'" -d "STAGE='early'" restart_WF.slim

and the output is (lengthy and) attached.
valgrind.log

from tskit.

petrelharp commented on September 16, 2024 1

Hm, well I've just realized we haven't updated the tskit code in a bit (since C_1.0, and now we're at C_1.1.2). I've done a quick update to the code: MesserLab/SLiM#406 (but haven't yet checked if it works...).

from tskit.

petrelharp commented on September 16, 2024 1

Well, running my test above with the tskit-updated SLiM results in fewer bytes "definitely lost" and a lot fewer entries in the valgrind log. Here's the result (attached).
valgrind2.log

from tskit.

bhaller commented on September 16, 2024 1

Looks to me like this has resolved the vast majority of the problem. We still get a 400-byte leak with every reload; I'm not sure where it's coming from exactly; one tool indicates the line variant = (tsk_variant_t *)malloc(sizeof(tsk_variant_t)), but we do call ret = tsk_variant_free(variant) so I'm not sure. But anyway, the leak of the entire tables is fixed, so that's good. @benjeffery, do you remember fixing this problem now? Is it understood what's going on here – can we forget about it, do you think?

from tskit.

bhaller commented on September 16, 2024 1

Oh, tsk_variant_free(variant) just frees the stuff inside the variant object, doesn't it – not the malloc itself. So that bit is SLiM's leak, I think?

from tskit.

bhaller commented on September 16, 2024 1

Thanks for the update @petrelharp!

from tskit.

bhaller commented on September 16, 2024 1

I've just pushed a fix for the remaining 400-byte per reload leak, which was due to not calling free() on a malloced tsk_variant_t struct. There was no reason to malloc it in the first place; it's now stack-local. The original test model now runs cleanly with no leaks. It would still be good to be sure that we understand why updating to the new tskit fixed the problem, just so we're sure that a regression didn't get "fixed" by another regression or something, but perhaps the search for that is tilting at windmills, since I guess there have been quite a few diffs between C_1.0 and C_1.1.2; that's up to you, @benjeffery; in any case, thank you for your help. Thanks for the very good idea, @petrelharp! I'll leave this for the tskit folks to close when they're satisfied, but I have closed MesserLab/SLiM#404 over on my side.

from tskit.

benjeffery commented on September 16, 2024 1

Hmm, having been through the diff I don't see anything. All the kastore memory ownership changes were already in 1.0 and are pretty heavily tested for leaks etc. I don't think it's worth digging any further here unless we get any other leak reports. Thanks for raising the issue.

from tskit.

benjeffery commented on September 16, 2024 1

The tskit ones are defined here:
https://github.com/tskit-dev/tskit/blob/main/.circleci/config.yml#L64

from tskit.

petrelharp commented on September 16, 2024 1

Phew! Glad that worked to fix things.

from tskit.

benjeffery commented on September 16, 2024

So valgrind here is giving:

==2852465== Syscall param write(buf) points to uninitialised byte(s)
==2852465==    at 0x4CD00A7: write (write.c:26)
==2852465==    by 0x4C50EBC: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1181)
==2852465==    by 0x4C517D7: new_do_write (fileops.c:449)
==2852465==    by 0x4C517D7: _IO_new_file_xsputn (fileops.c:1255)
==2852465==    by 0x4C517D7: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==2852465==    by 0x4C453F0: fwrite (iofwrite.c:39)
==2852465==    by 0x5C948B: kastore_close (in /home/benj/projects/slim-leak/build/slim)

Which indicates we're trying to write uninit'd memory. Which is odd. Could it be that you're dumping a free'd ts? I haven't dug into slim code yet. Time to sleep!

from tskit.

bhaller commented on September 16, 2024

Hmm, odd. There's no sign of a problem like that; all we have is the memory leak, there's never a crash or any other bad behavior. The models seems to work fine, and it can reload the tree sequence over and over without problems. If it were relying upon uninited memory, I wouldn't expect that to be the case. SLiM passes all its unit tests, and lots of people are using it without problems; it just leaks. Very strange bug, so far.

from tskit.

bhaller commented on September 16, 2024

Probably yes; give me a couple minutes. I apologize that this is disturbing your sleep; me too. :-O

from tskit.

bhaller commented on September 16, 2024

OK, I think this runs much faster and still exhibits the leak:

function (NULL)set_default(string k, lifs v)
{
	if (!exists(k))
		defineConstant(k, v);
	catn(c("Parameter", k, v), sep='\t');
	return NULL;
}

initialize()
{
	set_default("L", 750000); // chromosome length in bp
	set_default("nchrs", 14); // chromosome length in bp	
	set_default("selpos", asInteger(L / 3)); // selection position in bp
	set_default("num_origins", 2); //how many genomes contains the selected mutation when selection starts.
	set_default("N", 10000); // ancient effective population size
	set_default("h", 0.5); // dominant coefficient
	set_default("s", 0.3); // selection coefficient
	set_default("g_sel_start", 80); // time of selected mutation being introduced (generations ago --BACKWARD)
	set_default("r", 6.67e-7); // recombinantion rate
	set_default("outid", 1); // idx
	set_default("max_restart", 10000); // max number of restart
	set_default("N0", 1000); // the effective population size at sampling time
	set_default("g_ne_change_start", 200); // Ne change time (generations ago -- BACKWARD)
	set_default("slim_total_generations", // time of simulation ended -- forward
		201);
	initializeSLiMOptions(keepPedigrees=T);
	initializeTreeSeq(simplificationInterval=1000);
	initializeMutationRate(0.0);
	initializeMutationType("m1", 0.5, "f", 0.0); // neutral
	initializeMutationType("m2", h, "f", s); // selected
	initializeGenomicElementType("g1", m1, 1);
	
	// whole genome length
	initializeGenomicElement(g1, 0, L * nchrs - 1);
	
	// rate maps
	ends = rep(0, 2 * nchrs - 1);
	rates = rep(r, 2 * nchrs - 1);
	for (i in 1:nchrs)
	{
		ends[i * 2 - 2] = L * i - 1; // Note: Eidos indexing is from 0 not from zeros
	}
	for (i in 1:(nchrs - 1))
	{
		rates[i * 2 - 1] = 0.5;
		ends[i * 2 - 1] = L * i;
	}
	initializeRecombinationRate(rates, ends);
	
	// define global
	defineGlobal("restart_counter", 1);
}

1 early()
{
	sim.addSubpop("p1", N);
}

// control populatio size
2: early()
{
	t = slim_total_generations - sim.cycle; // generation ago
	Nt = (N / N0)^(t / g_ne_change_start) * N0; // calculate Nt
	p1.setSubpopulationSize(asInteger(Nt)); // set new population size
}

// condition on selection establishment (not lost)
10:20 late()
{
	if (sim.cycle == 10)
	{
		catn("WRITING...");
		sim.treeSeqOutput(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
		for (i in 1:nchrs){
			sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
		}
	}
	else if (sim.cycle >= 12)
	{
		catn("RESTARTING...");
			sim.readFromPopulationFile(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
			setSeed(rdunif(1, 0, asInteger(2^62 - 1)));
			for (i in 1:nchrs){
				sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
			}
			restart_counter = restart_counter + 1;
	}
}

from tskit.

bhaller commented on September 16, 2024

For a quick one using scripts from pyslim:

git clone [email protected]:tskit-dev/pyslim
cd pyslim/tests/test_recipes
slim recipe_WF_early_late.slim    # to make out.trees
valgrind --leak-check=full --track-origins=yes slim -d "TREES_IN='out.trees'" -d "STAGE='early'" restart_WF.slim

and the output is (lengthy and) attached. valgrind.log

So does this mean that it is definitively a regression, then? Or has this test not been run under valgrind before? (It would be nice if there were a way to check with leaks with CI; I've not looked into that.)

from tskit.

bhaller commented on September 16, 2024

Yes, worthwhile for sure. I have a vague feeling that we had a patch living in SLiM; did you do a diff with the old sources before you copied in the new? Anyway, I can do a test tomorrow to see if this helps the problem, but if @benjeffery doesn't remember fixing this bug, my guess is that it won't...

from tskit.

petrelharp commented on September 16, 2024

If this doesn't fix it, perhaps the thing to do is to extract out the tsk_X calls into a simple C program and verify there?

from tskit.

bhaller commented on September 16, 2024

I don't know why this issue got closed, WTF. Oh, @petrelharp – you need to watch your wording, you marked that PR as fixing this issue! :-> Reopening. For the record: there was indeed a patch in SLiM's copy of tskit, but it got resolved by the update so all is well.

from tskit.

benjeffery commented on September 16, 2024

I don't remember fixing anything like this, but the tskit C API has pretty good leak checking as part of CI, so I'd be surprised if we have an issue there. I'll try again with the quicker code and the updated code.

from tskit.

benjeffery commented on September 16, 2024

The quicker code is still taking over 30min to run under valgrind, and the pyslim "restart_WF" script is giving me:

ERROR (EidosInterpreter::_ProcessArgumentList): unrecognized named argument subpopMap to function readFromPopulationFile.

Error on script line 19, character 8:

    sim.readFromPopulationFile(TREES_IN, subpopMap=SUBPOP_MAP);

So I can't really dig into this without a working minimal script.

from tskit.

bhaller commented on September 16, 2024

@benjeffery Hmm, "unrecognized named argument subpopMap" seems to indicate that you're running an old version of SLiM, I think? subpopMap was added in SLiM 4.0, I think, and we're now on 4.0.1.

As for the script still taking more than 30min to run, yes – sorry, I'm an idiot. I assumed you were working in the debugger in the same way as me, observing leaks across one reload operation at a time, but of course you're not, you're running the script to completion under valgrind – and the script I supplied runs forever. It just keeps doing reloads over and over, ad infinitum, because that's how I was testing. (It's "faster" in the sense that the time to the first reload and the time between reloads is much shorter, which for some reason is what I thought you meant.) So, mea culpa, I apologize for wasting 30 minutes of your life. :-> Here is a script that does two reloads and then ends:

function (NULL)set_default(string k, lifs v)
{
	if (!exists(k))
		defineConstant(k, v);
	catn(c("Parameter", k, v), sep='\t');
	return NULL;
}

initialize()
{
	set_default("L", 750000); // chromosome length in bp
	set_default("nchrs", 14); // chromosome length in bp	
	set_default("selpos", asInteger(L / 3)); // selection position in bp
	set_default("num_origins", 2); //how many genomes contains the selected mutation when selection starts.
	set_default("N", 10000); // ancient effective population size
	set_default("h", 0.5); // dominant coefficient
	set_default("s", 0.3); // selection coefficient
	set_default("g_sel_start", 80); // time of selected mutation being introduced (generations ago --BACKWARD)
	set_default("r", 6.67e-7); // recombinantion rate
	set_default("outid", 1); // idx
	set_default("max_restart", 10000); // max number of restart
	set_default("N0", 1000); // the effective population size at sampling time
	set_default("g_ne_change_start", 200); // Ne change time (generations ago -- BACKWARD)
	set_default("slim_total_generations", // time of simulation ended -- forward
		201);
	initializeSLiMOptions(keepPedigrees=T);
	initializeTreeSeq(simplificationInterval=1000);
	initializeMutationRate(0.0);
	initializeMutationType("m1", 0.5, "f", 0.0); // neutral
	initializeMutationType("m2", h, "f", s); // selected
	initializeGenomicElementType("g1", m1, 1);
	
	// whole genome length
	initializeGenomicElement(g1, 0, L * nchrs - 1);
	
	// rate maps
	ends = rep(0, 2 * nchrs - 1);
	rates = rep(r, 2 * nchrs - 1);
	for (i in 1:nchrs)
	{
		ends[i * 2 - 2] = L * i - 1; // Note: Eidos indexing is from 0 not from zeros
	}
	for (i in 1:(nchrs - 1))
	{
		rates[i * 2 - 1] = 0.5;
		ends[i * 2 - 1] = L * i;
	}
	initializeRecombinationRate(rates, ends);
	
	// define global
	defineGlobal("restart_counter", 1);
}

1 early()
{
	sim.addSubpop("p1", N);
}

// control populatio size
2: early()
{
	t = slim_total_generations - sim.cycle; // generation ago
	Nt = (N / N0)^(t / g_ne_change_start) * N0; // calculate Nt
	p1.setSubpopulationSize(asInteger(Nt)); // set new population size
}

// condition on selection establishment (not lost)
10:15 late()
{
	if (sim.cycle == 10)
	{
		catn("WRITING...");
		sim.treeSeqOutput(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
		for (i in 1:nchrs){
			sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
		}
	}
	else if ((sim.cycle >= 12) & (restart_counter < 3))
	{
		catn("RESTARTING...");
			sim.readFromPopulationFile(paste("tmp_slim_state_single_pop_", outid, ".trees", sep=''));
			setSeed(rdunif(1, 0, asInteger(2^62 - 1)));
			for (i in 1:nchrs){
				sample(p1.genomes, num_origins).addNewDrawnMutation(m2, selpos + (i-1) * L);
			}
			restart_counter = restart_counter + 1;
	}
}

This should run to completion under valgrind in perhaps a minute or so; without valgrind it takes only ten seconds or so. Again, my apologies.

from tskit.

bhaller commented on September 16, 2024

Interesting. Thanks for looking into it. Must be something about SLiM's particular usage pattern, I guess. It would be nice to have leak-testing as part of SLiM's CI; right now the leak-testing I do is just ad hoc. Something to ponder.

from tskit.

leaks when loading tree sequence in SLiM about tskit HOT 26 CLOSED

Comments (26)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent