
Comments (12)

Araq commented on May 24, 2024

This should be shipped to windows users by default, not only documented.

Agreed, how do we do that?

from nim.

KhazAkar commented on May 24, 2024

This should be shipped to windows users by default, not only documented.

Agreed, how do we do that?

Let me think... When I find more time, I might try building Nim on Windows in a VM, inside the MSYS2 clang environment, in an automated fashion, and then provide some sort of PR. Is that acceptable?


guzba commented on May 24, 2024

As a quick update, I have tried the same test on my M1 Mac and found interesting results.

First, the difference between --threads:on and --threads:off is much smaller when using arm64 on M1:

nim c --gc:arc --cpu:arm64 --passL:"-arch arm64" --passC:"-arch arm64" --debugger:native -rf -d:release --threads:off tests/bench.nim

Took 1.35s

nim c --gc:arc --cpu:arm64 --passL:"-arch arm64" --passC:"-arch arm64" --debugger:native -rf -d:release --threads:on tests/bench.nim

Took 1.42s

This seems much more like what one might expect.

If I use x64 instead on M1 I get:

nim c --mm:arc -d:release --threads:on -r tests/bench.nim

2.58s

nim c --mm:arc -d:release --threads:off -r tests/bench.nim

1.68s

The huge difference comes back. So this is something related to amd64, and it now reproduces on Mac as well.

Nim Compiler Version 2.0.2 [MacOSX: amd64]
Compiled at 2024-01-14
Copyright (c) 2006-2023 by Andreas Rumpf

active boot switches: -d:release


guzba commented on May 24, 2024

I have done a little more experimenting and found significant evidence that something related to thread locals is the cause here.

To check this, I:

  • Cloned the Nim repo on devel
  • Confirmed no performance difference between devel and stable without modification
  • Commented out this line
  • Confirmed in the generated C code that nimInErrorMode was no longer NIM_THREADVAR.

Obviously this is broken when more than one thread is involved, however the little benchmark is single-threaded so it is good enough for a with/without comparison.

Nim stable compiler generating NIM_THREADVAR: ~1.6ms
Modified to not include NIM_THREADVAR: 0.86ms

Same PC (amd64) used for this experiment. Nearly double the time for some reason.
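
For anyone who wants to try the same kind of with/without comparison outside of Nim, here is a minimal C sketch. It is not the code Nim generates: the names `err_tls`, `err_plain`, `add_checked`, `sum_tls`, `sum_plain` and `time_run` are all made up for illustration, and an optimizing compiler may narrow or remove the difference.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for the generated error flag (names are made up). */
_Thread_local bool err_tls = false; /* thread-local, as with NIM_THREADVAR */
bool err_plain = false;             /* plain global, as in the modified build */

/* Add v to acc; on unsigned overflow, set *err and leave acc unchanged. */
static unsigned add_checked(unsigned acc, unsigned v, bool *err) {
    if (v > UINT_MAX - acc) {
        *err = true;
        return acc;
    }
    return acc + v;
}

/* Sum 0..n-1, checking the thread-local flag after every step. */
unsigned sum_tls(size_t n) {
    unsigned acc = 0u;
    for (size_t i = 0; i < n; i++) {
        acc = add_checked(acc, (unsigned)i, &err_tls);
        if (err_tls) break;
    }
    return acc;
}

/* Same loop, but the flag is an ordinary global. */
unsigned sum_plain(size_t n) {
    unsigned acc = 0u;
    for (size_t i = 0; i < n; i++) {
        acc = add_checked(acc, (unsigned)i, &err_plain);
        if (err_plain) break;
    }
    return acc;
}

/* CPU time of a single run, in seconds, as reported by clock(). */
double time_run(unsigned (*run)(size_t), size_t n) {
    clock_t start = clock();
    volatile unsigned sink = run(n); /* keep the result alive */
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_run(sum_tls, n)` and `time_run(sum_plain, n)` from a `main` with a large `n`, and building the same file with different toolchains, should give a rough idea of how much the thread-local access costs on a given setup.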


guzba commented on May 24, 2024

Looked at the --asm output and found mov rcx, QWORD PTR .refptr.__emutls_v.nimInErrorMode__system_u3501[rip] etc.

__emutls appears to be GCC TLS emulation. I am not sure if this is expected to be used?

I am compiling on Windows 10 using a fresh choosenim install (choosenim stable), where gcc -v reports gcc version 11.1.0 (MinGW-W64 x86_64-posix-seh, built by Brecht Sanders).

As another test I compiled with --cc:vcc and saw much less performance difference. --cc:vcc produces cmp BYTE PTR nimInErrorMode__system_u3423, 0 etc.


guzba commented on May 24, 2024

Did some more investigating. It appears mingw uses emulated TLS on Windows:

http://mails.dpdk.org/archives/dev/2020-February/157446.html

The second aspect is performance. Per [2], Win32 API TLS functions are ~10%
slower than non-emulated access on Linux, and MinGW emulation layer slows
access by another 20% (30% total). Clang emulation code is similar to
MinGW's [3], although I wasn't able to find any benchmarks. As a DPDK user, I
know that rte_lcore_id() is heavily used on the data-path, so this is severe.

https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/[email protected]/

Surprisingly it's getting almost 30% slower on windows in the 
cpu-intensive part (single threaded, no syscalls, memory intensive 
integer arithmetic mostly).
...
Ah, found the big one:
Was testing single threaded but code uses __thread thread-local storage 
which slows things down a lot on mingw.

This could be wrong, outdated, incomplete, who knows what. I am not a GCC or mingw expert. Just putting it here as part of what I'm finding.


RSDuck commented on May 24, 2024

Try compiling inside MSYS clang environment, it should be a lot faster, because clang actually supports native TLS on Windows. In my case it gave a giant speed up for --threads:on

EDIT: also see #21810


guzba commented on May 24, 2024

@RSDuck Thanks for the suggestion and link to that previous issue, I had not seen it. It does appear this is more of a consequence of using mingw/gcc on Windows as opposed to a Nim issue.

Given that this is the default compilation path for Nim on Windows, it seems worth seeing if relatively small steps could mitigate a bunch of the perf hit.

Earlier in this issue I suggested some potential low-hanging opportunities to skip some frequently generated checks of the thread local error bool. I imagine this could reduce the noticeable perf hit by a lot. I don't know if it is actually safe though within the error model of Nim's allocator and value types etc.
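
To make the idea concrete, here is a rough C sketch of one possible reading of "skipping checks": test the thread-local flag once after a loop instead of after every call. This is an assumption about what such a change could look like, not what the Nim code generator does, and `err_flag`, `add_checked`, `sum_check_each` and `sum_check_once` are hypothetical names. Whether deferring the check is safe depends on whether it is harmless to keep running after an error was flagged, which is exactly the open question.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thread-local error flag; the name is made up for this sketch. */
_Thread_local bool err_flag = false;

/* Add v to acc; on unsigned overflow, set the flag and leave acc unchanged. */
static unsigned add_checked(unsigned acc, unsigned v) {
    if (v > UINT_MAX - acc) {
        err_flag = true;
        return acc;
    }
    return acc + v;
}

/* Reads the flag after every call and stops at the first error. */
bool sum_check_each(size_t n, unsigned *out) {
    unsigned acc = 0u;
    for (size_t i = 0; i < n; i++) {
        acc = add_checked(acc, (unsigned)i);
        if (err_flag) break;
    }
    *out = acc;
    return !err_flag;
}

/* Runs the whole loop, then reads the flag a single time.
   Only equivalent if continuing after a flagged error is harmless. */
bool sum_check_once(size_t n, unsigned *out) {
    unsigned acc = 0u;
    for (size_t i = 0; i < n; i++) {
        acc = add_checked(acc, (unsigned)i);
    }
    *out = acc;
    return !err_flag;
}
```

Both functions return `true` on success and write the sum through `out`; they only differ in how often the thread-local is read on the hot path.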

Sorry to bother you @Araq, but could some error checks be safely skip-able as I theorize or are there problems here I'm not aware of?


Araq commented on May 24, 2024

Sorry to bother you @Araq, but could some error checks be safely skip-able as I theorize or are there problems here I'm not aware of?

You're fine with --fieldChecks:on --boundChecks:on and disabling all the rest IME.


juancarlospaco commented on May 24, 2024

Try compiling inside MSYS clang environment, it should be a lot faster, because clang actually supports native TLS on Windows. In my case it gave a giant speed up for --threads:on

Maybe this should be documented more explicitly.


KhazAkar commented on May 24, 2024

Try compiling inside MSYS clang environment, it should be a lot faster, because clang actually supports native TLS on Windows. In my case it gave a giant speed up for --threads:on

Maybe this should be documented more explicitly.

This should be shipped to windows users by default, not only documented.


KhazAkar commented on May 24, 2024

As a first step, I can propose trying this toolchain. I will definitely fire up a Windows 10 VM in the coming days and build Nim with it.
https://github.com/mstorsjo/llvm-mingw

PS. I'm sorry if my previous comment sounded rude in any way, shape or form.
