Comments (12)
Hey @tncks0121, why do you care about run time of isolate itself? Isn't just run time of user's program important? 😕
from isolate.
@hermanzdosilovic, yes, that is correct, but suppose you are running a contest and each submission needs at least 0.4*100 seconds just because of the sandbox, even when the solution is very simple. In that situation the queue will quickly become very long if there are not many workers.
Also, I was planning to use isolate not only for cms but for judging in general (where users can see how much of the judging has been done), so speed was important.
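To make the 0.4*100 arithmetic concrete, here is a rough sketch of the overhead estimate. The function name and the queue figures (50 submissions, 4 workers) are made up for illustration, and it assumes the work is spread evenly across workers.

```shell
# sandbox_overhead PER_RUN_SECONDS TESTS SUBMISSIONS WORKERS
# Hypothetical helper: estimates time spent purely on sandbox overhead.
sandbox_overhead() {
    awk -v t="$1" -v n="$2" -v s="$3" -v w="$4" 'BEGIN {
        per = t * n                      # overhead per submission
        printf "per submission: %.1f s\n", per
        printf "queue of %d on %d workers: %.1f s\n", s, w, per * s / w
    }'
}

# 0.4 s of overhead per run, 100 test cases, 50 queued submissions, 4 workers
sandbox_overhead 0.4 100 50 4
```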
Hi! I can't reproduce this on any of my machines, VMs or physical.
Can you try the following:
- strace the isolate --run command by prefixing it with
sudo strace -u $USER -ttTfo /tmp/strace-$(echo $i % 100 | bc)
to generate one strace per invocation, and see whether any particular syscalls in each invocation are taking the majority of the time.
- Assuming what is suggested by this comment is true, that wait() or exit() is the culprit, the strace will hopefully show that. The next step to debug would be to use oprofile on maybe 100 runs (with kernel symbols enabled) and see if there are any obvious hotspots in the kernel.
Or if you are able to provide a shell to a machine that can reproduce this, I'd be happy to investigate. (It probably requires root to diagnose though).
@bblackham, I did what you said, and it seems this is the bottleneck. Unfortunately I don't know how to read these files, so I'm not sure what the problem is.
16020 06:55:34.505601 clone( <unfinished ...>
16037 06:55:34.864662 getpid( <unfinished ...>
16020 06:55:34.864715 <... clone resumed> child_stack=0x7ffe16ee62b0, flags=CLONE_NEWNS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = 16037 <0.359109>
16037 06:55:34.864733 <... getpid resumed> ) = 1 <0.000027>
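For anyone else unsure how to read these files: with strace's -T option, the number in angle brackets at the end of a completed call is the time spent in that syscall, in seconds. A minimal sketch for ranking the slowest calls in one trace file follows; slowest_syscalls is a made-up helper name, not part of strace or isolate.

```shell
# slowest_syscalls FILE [N]
# Print the N lines (default 5) with the largest -T duration, slowest first.
# Lines ending in "<unfinished ...>" carry no duration and are skipped.
slowest_syscalls() {
    awk 'match($0, /<[0-9]+\.[0-9]+>$/) {
             # strip the angle brackets and prefix the line with the duration
             print substr($0, RSTART + 1, RLENGTH - 2) "\t" $0
         }' "$1" | sort -rn | head -n "${2:-5}"
}

# Example, assuming a trace was written to /tmp/strace-0:
# slowest_syscalls /tmp/strace-0 3
```

On the excerpt above, the clone() line with <0.359109> comes out on top.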
@tncks0121, that clone() definitely looks to be the culprit. No user code is executing between the start of the clone() and its completion. This points at the Linux kernel and something about cloning a task into a new namespace. Are you able to run operf on the affected machines? I'm not certain that operf will work in certain types of VM (I think it requires direct hardware access to the MSRs).
That sort of latency would have to be either some kind of network traffic (maybe there is a small amount of buffering which is why the first one is okay?), a lot of memory zeroing (like, gigabytes, which would be strange), or evicting something to swap, or dropping some caches. Perhaps it is triggering some call out to a really slow userspace helper. If operf doesn't provide any information, I don't know how to diagnose further without being able to reproduce it locally. Can you help me reproduce it locally, or provide a shell to somewhere that it is reproducible?
@bblackham, I tried, but it doesn't seem to give any useful information. Maybe I did it the wrong way, as I'm not familiar with operf.
root@ubuntu:~# operf ./isolate --run -- /bin/echo 1>/dev/null 2>/dev/null
root@ubuntu:~# opreport
Using /root/oprofile_data/samples/ for samples directory.
WARNING! Some of the events were throttled. Throttling occurs when
the initial sample rate is too high, causing an excessive number of
interrupts. Decrease the sampling frequency. Check the directory
/root/oprofile_data/samples/current/stats/throttled
for the throttled event names.
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
CPU: Intel Broadwell microarchitecture, speed 2299.99 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
samples| %|
------------------
7 53.8462 echo
CPU_CLK_UNHALT...|
samples| %|
------------------
7 100.000 kallsyms
6 46.1538 isolate
CPU_CLK_UNHALT...|
samples| %|
------------------
6 100.000 kallsyms
Anyway, I'll try to find a way to reproduce this on a new machine. (Both machines I tested are already in use.)
I think that the execution speed depends on the machine, but in both cases only the first one or two iterations are significantly faster (6 times or so). This can be seen on both "good" and "bad" machines: the good one has a first iteration of 0.01 and then about 0.06, and the bad one 0.06 and then about 0.48, which is almost the same slowdown for both of them.
My measurement on CentOS VPS:
iteration 1 : 0.00
iteration 2 : 0.00
iteration 3 : 0.06
iteration 4 : 0.06
iteration 5 : 0.06
iteration 6 : 0.05
iteration 7 : 0.07
iteration 8 : 0.06
iteration 9 : 0.06
iteration 10 : 0.06
total time = .48
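The script behind these numbers is not shown in the thread. A loop along the following lines would produce output in the same shape; time_runs is a hypothetical helper, and it assumes GNU date (for %N) and a box that has already been initialised.

```shell
# time_runs N COMMAND [ARGS...]
# Run COMMAND N times and print the wall-clock time of each run.
# The body is a subshell so the loop variables stay local.
time_runs() (
    n=$1; shift
    i=1
    total_ns=0
    while [ "$i" -le "$n" ]; do
        start=$(date +%s%N)              # nanoseconds since the epoch (GNU date)
        "$@" >/dev/null 2>&1
        end=$(date +%s%N)
        total_ns=$((total_ns + end - start))
        awk -v i="$i" -v ns="$((end - start))" \
            'BEGIN { printf "iteration %d : %.2f\n", i, ns / 1e9 }'
        i=$((i + 1))
    done
    awk -v ns="$total_ns" 'BEGIN { printf "total time = %.2f\n", ns / 1e9 }'
)

# Example (usually needs root and a prior isolate --init):
# time_runs 10 isolate --run -- /bin/echo
```

Note that each awk call adds a little overhead of its own, but it falls outside the timed region.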
Right, I can reproduce it under Docker here. The killer is creating a separate networking namespace. If you pass --share-net to isolate --run, you should see that it runs significantly faster (but you won't have any network isolation). I suspect it is the same kernel issue reported here and here. The netns cleanup operations are batched and very expensive. The patch in the latter link made it into Linux 4.12, so you might expect some speed up with a newer kernel version.
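To check whether a machine is already on 4.12 or later, something like the sketch below should do. kernel_at_least is a made-up helper, and it relies on sort -V (GNU coreutils or busybox). A newer kernel only means the patch is present, not that the slowdown is necessarily gone.

```shell
# kernel_at_least WANTED [CURRENT]
# Succeeds if CURRENT (default: output of uname -r) is WANTED or newer.
kernel_at_least() (
    want=$1
    have=${2:-$(uname -r)}
    have=${have%%-*}                     # drop suffixes such as "-generic"
    # With version sort, the older of the two versions ends up first.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
)

if kernel_at_least 4.12; then
    echo "kernel is 4.12 or newer"
else
    echo "kernel is older than 4.12"
fi
```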
I can confirm that --share-net makes the timing difference completely disappear on my laptop.
Thanks for confirming @stefano-maggiolo. Some extra data points:
I never saw it in my earlier VM tests because I was running a single-CPU VM (where the RCU slowdown issues never occur). On a dual-CPU VM, without any iptables loaded, there is no issue. But as soon as I run iptables -t nat -L, the various nat connection tracking modules are loaded and these cause a major slowdown. If I rmmod nf_conntrack (along with all the modules that depend on it), the slowdown disappears again.
For @tncks0121 and anyone else affected, try blacklisting nf_conntrack (add blacklist nf_conntrack to some file in /etc/modprobe.d/ and reboot). That should be all there is to it, assuming your iptables firewall doesn't require the connection tracking modules. If it's still a problem, please paste the output of cat /proc/modules and uname -r.
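A quick way to see whether the module is actually loaded, before and after the reboot, is to look for it in /proc/modules. The helper below is only a sketch: module_loaded is a made-up name, and the blacklist file name in the comment is an arbitrary choice (modprobe only reads files ending in .conf from that directory).

```shell
# module_loaded NAME [FILE]
# Succeeds if NAME is listed as a loaded module in FILE (default /proc/modules).
# The module name is the first whitespace-separated field of each line.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if module_loaded nf_conntrack; then
    echo "nf_conntrack is loaded"
else
    echo "nf_conntrack is not loaded"
fi

# To blacklist it (hypothetical file name, needs root, then reboot):
# echo "blacklist nf_conntrack" | sudo tee /etc/modprobe.d/blacklist-nf_conntrack.conf
```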
Closing this issue as I believe it is definitely a Linux kernel bug and there's nothing isolate (or any sandbox that uses Linux network namespaces for network isolation) can do about it. A potential workaround is given in my previous comment (blacklisting nf_conntrack).
If this workaround solves the issue for you, please confirm here for posterity. Thanks!