Comments (12)

hermanzdosilovic avatar hermanzdosilovic commented on July 23, 2024

Hey @tncks0121, why do you care about the run time of isolate itself? Isn't the run time of the user's program the only thing that matters? 😕

from isolate.

tncks0121 avatar tncks0121 commented on July 23, 2024

@hermanzdosilovic, yes, that is correct, but suppose you are running a contest and each submission takes at least 0.4*100 seconds purely because of the sandbox, even when the solution is trivial. In that situation the queue grows long very quickly unless you have many workers.
Also, I was planning to use isolate not only for CMS but for judging in general (where users can see how far judging has progressed), so speed was important.

hermanzdosilovic avatar hermanzdosilovic commented on July 23, 2024

bblackham avatar bblackham commented on July 23, 2024

Hi! I can't reproduce this on any of my machines, VMs or physical.

Can you try the following:

  • strace the isolate --run command by prefixing it with sudo strace -u $USER -ttTfo /tmp/strace-$(echo $i % 100 | bc) to generate one strace file per invocation, and see whether any particular syscalls in each invocation take the majority of the time.
  • Assuming what is suggested by this comment is true, that wait() or exit() is the culprit, the strace will hopefully show that. The next step to debug would be to use oprofile on maybe 100 runs (with kernel symbols enabled) and see if there are any obvious hotspots in the kernel.

Or if you are able to provide a shell to a machine that can reproduce this, I'd be happy to investigate. (It probably requires root to diagnose though).

tncks0121 avatar tncks0121 commented on July 23, 2024

@bblackham, I did what you said, and it seems this is the bottleneck. Unfortunately I don't know how to read these files, so I'm not sure what the problem is.

16020 06:55:34.505601 clone( <unfinished ...>
16037 06:55:34.864662 getpid( <unfinished ...>
16020 06:55:34.864715 <... clone resumed> child_stack=0x7ffe16ee62b0, flags=CLONE_NEWNS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = 16037 <0.359109>
16037 06:55:34.864733 <... getpid resumed> ) = 1 <0.000027>

strace-1.txt
strace-2.txt
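
For reading traces like the one above, a minimal Python sketch can rank calls by the elapsed time that strace's -T option prints in angle brackets at the end of each completed call. The function name slowest_syscalls and the regular expressions are illustrative assumptions, not part of strace or isolate, and the parser only handles the line shapes shown in this thread.

```python
import re
import sys

# One completed call from `strace -ttTf -o FILE`: PID, timestamp, body, <seconds>.
LINE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<body>.*?)\s*<(?P<dur>\d+\.\d+)>\s*$"
)
RESUMED = re.compile(r"^<\.\.\. (?P<name>\w+) resumed>")  # second half of a split call
CALL = re.compile(r"^(?P<name>\w+)\(")                    # call printed on one line


def slowest_syscalls(lines, top=5):
    """Return up to `top` (seconds, pid, syscall) tuples, slowest first."""
    results = []
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # "<unfinished ...>" halves and exit lines carry no duration
        body = match.group("body")
        name = RESUMED.match(body) or CALL.match(body)
        if name:
            results.append((float(match.group("dur")), match.group("pid"), name.group("name")))
    results.sort(reverse=True)
    return results[:top]


if __name__ == "__main__":
    # Usage (file names are whatever strace -o produced): python slowest.py /tmp/strace-1
    for path in sys.argv[1:]:
        with open(path) as handle:
            for seconds, pid, name in slowest_syscalls(handle):
                print(f"{path}: pid {pid} {name} took {seconds:.6f}s")
```

Run against the four lines pasted above, this would put the resumed clone (0.359109 s) first and getpid (0.000027 s) second.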

bblackham avatar bblackham commented on July 23, 2024

@tncks0121, that clone() definitely looks to be the culprit. No user code is executing between the start of the clone() and its completion. This points at the Linux kernel and something about cloning a task into a new namespace. Are you able to run operf on the affected machines? I'm not certain that operf will work in certain types of VM (I think it requires direct hardware access to the MSRs).

That sort of latency would have to be either some kind of network traffic (maybe there is a small amount of buffering which is why the first one is okay?), a lot of memory zeroing (like, gigabytes, which would be strange), or evicting something to swap, or dropping some caches. Perhaps it is triggering some call out to a really slow userspace helper. If operf doesn't provide any information, I don't know how to diagnose further without being able to reproduce it locally. Can you help me reproduce it locally, or provide a shell to somewhere that it is reproducible?

tncks0121 avatar tncks0121 commented on July 23, 2024

@bblackham, I tried, but it doesn't seem to give any useful information. Maybe I did it the wrong way, as I'm not familiar with operf.

root@ubuntu:~# operf  ./isolate --run -- /bin/echo 1>/dev/null 2>/dev/null
root@ubuntu:~# opreport
Using /root/oprofile_data/samples/ for samples directory.


WARNING! Some of the events were throttled. Throttling occurs when
the initial sample rate is too high, causing an excessive number of
interrupts.  Decrease the sampling frequency. Check the directory
/root/oprofile_data/samples/current/stats/throttled
for the throttled event names.


WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
CPU: Intel Broadwell microarchitecture, speed 2299.99 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
        7 53.8462 echo
	CPU_CLK_UNHALT...|
	  samples|      %|
	------------------
	        7 100.000 kallsyms
        6 46.1538 isolate
	CPU_CLK_UNHALT...|
	  samples|      %|
	------------------
	        6 100.000 kallsyms

Anyway, I'll try to find a way to reproduce it on a new machine (both tested machines are already in use).

SemaiCZE avatar SemaiCZE commented on July 23, 2024

I think the execution speed depends on the machine, but in both cases only the first one or two iterations are significantly faster (6 times or so). This can be seen on both "good" and "bad" machines: the good one takes 0.01 for the first iteration and then about 0.06, while the bad one takes 0.06 and then about 0.48, which is almost the same slowdown for both.

My measurement on CentOS VPS:

iteration 1 : 0.00
iteration 2 : 0.00
iteration 3 : 0.06
iteration 4 : 0.06
iteration 5 : 0.06
iteration 6 : 0.05
iteration 7 : 0.07
iteration 8 : 0.06
iteration 9 : 0.06
iteration 10 : 0.06
total time = .48
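
A measurement like this can be repeated with a short Python sketch that times each invocation separately. The helper name time_runs is hypothetical, and the example command assumes isolate is on PATH and that the sandbox has already been initialised (for example with isolate --init); adjust it to match your setup.

```python
import shutil
import subprocess
import time


def time_runs(cmd, iterations=10):
    """Run `cmd` repeatedly and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        # Output is discarded so that only process start-up and teardown are measured.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Assumption: isolate is installed and a box already exists; skipped otherwise.
    if shutil.which("isolate"):
        runs = time_runs(["isolate", "--run", "--", "/bin/echo"])
        for number, seconds in enumerate(runs, start=1):
            print(f"iteration {number} : {seconds:.2f}")
        print(f"total time = {sum(runs):.2f}")
```

Timing each run individually, rather than only the total, is what makes the "fast first iterations, slow afterwards" pattern visible.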

bblackham avatar bblackham commented on July 23, 2024

Right, I can reproduce it under docker here. The killer is creating a separate networking namespace. If you pass --share-net to isolate --run, you should see that it runs significantly faster (but you won't have any network isolation). I suspect it is the same kernel issue reported here and here. The netns cleanup operations are batched and very expensive. The patch in the latter link made it into Linux 4.12, so you might expect some speed up with a newer kernel version.

stefano-maggiolo avatar stefano-maggiolo commented on July 23, 2024

Confirming that --share-net makes the timing difference completely disappear on my laptop.

bblackham avatar bblackham commented on July 23, 2024

Thanks for confirming @stefano-maggiolo. Some extra data points:

I never saw it in my earlier VM tests because I was running a single-CPU VM (where the RCU slowdown issues never occur). On a dual-CPU VM, without any iptables modules loaded, there is no issue. But as soon as I run iptables -t nat -L, the various NAT connection tracking modules are loaded and these cause a major slowdown. If I rmmod nf_conntrack (along with all the modules that depend on it), the slowdown disappears again.

For @tncks0121 and anyone else affected, try blacklisting nf_conntrack (add blacklist nf_conntrack to some file in /etc/modprobe.d/ and reboot). That should be all there is to it, assuming your iptables firewall doesn't require the connection tracking modules. If it's still a problem, please paste the output of cat /proc/modules and uname -r.
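
Before and after blacklisting, it may help to check whether nf_conntrack is actually loaded and which modules are holding it. The following is a minimal Python sketch, assuming the usual /proc/modules layout in which the fourth field lists the modules that use the one named in the first field; the function name conntrack_users is hypothetical.

```python
from pathlib import Path


def conntrack_users(modules_text, name="nf_conntrack"):
    """Return None if `name` is not loaded, else the list of modules using it."""
    for line in modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == name:
            users = fields[3]  # comma-separated with a trailing comma, or "-" if unused
            return [] if users == "-" else [u for u in users.split(",") if u]
    return None


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        users = conntrack_users(proc_modules.read_text())
        if users is None:
            print("nf_conntrack is not loaded")
        else:
            print("nf_conntrack is loaded; used by:", ", ".join(users) or "(nothing)")
```

If the module is reported as loaded, the listed users are the ones that would need to be removed first before rmmod nf_conntrack can succeed.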

bblackham avatar bblackham commented on July 23, 2024

Closing this issue as I believe it is definitely a Linux kernel bug and there's nothing isolate (or any sandbox that uses Linux network namespaces for network isolation) can do about it. A potential workaround is given in my previous comment (blacklisting nf_conntrack).

If this workaround solves the issue for you, please confirm here for posterity. Thanks!
