Comments (84)
Indeed… currently generating a trace with Nsight Systems to have a close look.
from pbrt-v4.
Step by step! I've pushed that fix to both branches.
(BTW I have merged the windows-gpu-rework branch to master, with some improvements to work around the performance hit it previously introduced on Linux.) Let me know if I messed up anything on the Windows front..
I tested the latest master on Windows, and it is still working fine, and I am even seeing a 10–15% improvement over the windows-gpu-rework branch: I rendered the scene 3 times and ended up with results between 14.4 and 14.8 seconds, compared to the 16–17 seconds I reported yesterday on the windows-gpu-rework branch. 🎉
Here's the corresponding log on Linux on an RTX 2080:
GPU Kernel Profile:
Generate Camera rays 1024 launches 703.02 ms / 5.3% (avg 0.687, min 0.657, max 0.946)
Generate ray samples - HaltonSampler 6144 launches 626.28 ms / 4.7% (avg 0.102, min 0.019, max 0.560)
Reset queues before tracing rays 6144 launches 33.01 ms / 0.2% (avg 0.005, min 0.004, max 0.008)
Tracing closest hit rays 6144 launches 4854.99 ms / 36.3% (avg 0.790, min 0.062, max 12.976)
Handle emitters hit by indirect rays 6144 launches 148.44 ms / 1.1% (avg 0.024, min 0.016, max 0.047)
ConductorMaterial + BxDF Eval (Basic tex) 5120 launches 1172.95 ms / 8.8% (avg 0.229, min 0.041, max 1.098)
DiffuseMaterial + BxDF Eval (Basic tex) 5120 launches 4633.22 ms / 34.6% (avg 0.905, min 0.035, max 2.640)
Tracing shadow rays 5120 launches 483.30 ms / 3.6% (avg 0.094, min 0.038, max 0.380)
Incorporate shadow ray contribution 5120 launches 194.88 ms / 1.5% (avg 0.038, min 0.016, max 0.071)
Reset shadowRayQueue 5120 launches 27.63 ms / 0.2% (avg 0.005, min 0.004, max 0.007)
Handle medium transitions 5120 launches 84.99 ms / 0.6% (avg 0.017, min 0.015, max 0.023)
Update indirect ray stats 5120 launches 28.26 ms / 0.2% (avg 0.006, min 0.003, max 0.008)
Update Film 1024 launches 369.82 ms / 2.8% (avg 0.361, min 0.357, max 0.373)
Other 2048 launches 10.82 ms / 0.1% (avg 0.005)
Total GPU time: 13371.61 ms
GPU Statistics:
Camera rays 718626816
Indirect rays, depth 1 718039415
Indirect rays, depth 2 411877638
Indirect rays, depth 3 24306403
Indirect rays, depth 4 7197480
Indirect rays, depth 5 2514180
Shadow rays, depth 0 359206403
Shadow rays, depth 1 195455999
Shadow rays, depth 2 107390027
Shadow rays, depth 3 6470333
Shadow rays, depth 4 2033811
[Other stats are all exactly the same.]
As a percentage of runtime, the kernels are all in the same range, and the number of rays traced is about the same.
It is interesting that the reported kernel times are so much higher. I.e., if the issue were that the CPU isn't submitting enough work to keep the GPU busy and the GPU were idling, then presumably those times (which are measured by CUDA events before and after each launch) would still be low, but the overall run time would be high.
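For reference when comparing the two profiles, the percentages appear to be simply per-kernel time divided by total GPU time; a quick check against the numbers in the Linux log above (a sketch using only values copied from that log):

```python
# Sanity check: the per-kernel percentages in the profile above should be
# just (kernel time) / (total GPU time). Numbers copied from the log.
total_ms = 13371.61

kernels = {
    "Tracing closest hit rays": (4854.99, 36.3),
    "DiffuseMaterial + BxDF Eval (Basic tex)": (4633.22, 34.6),
    "ConductorMaterial + BxDF Eval (Basic tex)": (1172.95, 8.8),
}

for name, (ms, reported_pct) in kernels.items():
    pct = 100.0 * ms / total_ms
    assert round(pct, 1) == reported_pct  # matches the logged percentage
    print(f"{name}: {pct:.1f}%")
```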
...and the issue doesn't seem to be that Windows is somehow running debug kernels. Here's Linux with a debug build. (Just 8 spp, so just compare the "avg" time for each kernel)--Windows is still 5-10x slower.
GPU Kernel Profile:
Reset ray queue 16 launches 12.07 ms / 0.2% (avg 0.754, min 0.014, max 11.860)
Generate Camera rays 16 launches 674.56 ms / 8.9% (avg 42.160, min 41.314, max 51.347)
Generate ray samples - HaltonSampler 96 launches 804.32 ms / 10.6% (avg 8.378, min 1.038, max 19.611)
Reset queues before tracing rays 96 launches 14.85 ms / 0.2% (avg 0.155, min 0.153, max 0.221)
Tracing closest hit rays 96 launches 128.72 ms / 1.7% (avg 1.341, min 0.173, max 3.507)
Handle emitters hit by indirect rays 96 launches 232.05 ms / 3.0% (avg 2.417, min 0.558, max 3.655)
ConductorMaterial + BxDF Eval (Basic tex) 80 launches 1351.06 ms / 17.7% (avg 16.888, min 4.613, max 85.159)
DiffuseMaterial + BxDF Eval (Basic tex) 80 launches 3985.23 ms / 52.3% (avg 49.815, min 4.581, max 106.935)
Tracing shadow rays 80 launches 12.33 ms / 0.2% (avg 0.154, min 0.071, max 0.228)
Incorporate shadow ray contribution 80 launches 161.64 ms / 2.1% (avg 2.021, min 0.696, max 4.259)
Handle medium transitions 80 launches 56.19 ms / 0.7% (avg 0.702, min 0.693, max 0.712)
Update Film 16 launches 183.48 ms / 2.4% (avg 11.467, min 11.417, max 11.580)
Other 176 launches 3.51 ms / 0.0% (avg 0.020)
Here is a zipped report from Nsight Systems (I hope others can open it just fine); I let it run for about 20 seconds, which seems to have been enough to do about 2.5 frames (judging by the repeating patterns in the trace).
Some of the long-lasting kernels have the following in common: several hundred MB of local memory in total (though reportedly none on a per-thread basis), for example:
Local Memory Per Thread: 0 bytes
Local Memory Total: 441,188,352 bytes
I'll check with Nsight Compute, but it looks weird that those kernels have relatively few threads in total; for example, the one cited above for the local memory has a grid size of (914, 1, 1) and a block size of (768, 1, 1), with 80 registers per thread.
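For what it's worth, dividing the reported total by the launch's thread count gives a plausible nonzero per-thread figure, so the "0 bytes per thread" line may just be a reporting quirk. This is only a back-of-the-envelope check; the driver may actually size local memory for the maximum number of resident threads on the device rather than for this particular launch:

```python
# Back-of-the-envelope check on the local memory numbers above.
grid, block = 914, 768        # launch dimensions from the trace
threads = grid * block        # threads actually launched
local_total = 441_188_352     # "Local Memory Total" in bytes

per_thread = local_total / threads
print(f"{threads} threads, ~{per_thread:.0f} bytes of local memory each")
# ~629 bytes/thread: nonzero, despite "Local Memory Per Thread: 0 bytes"
assert 600 < per_thread < 660
```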
Success opening it!
One thing that's not present in linux traces is the progress reporter stuff. What happens if you comment out the ProgressReporter declaration around line 350 of pathintegrator.cpp and then the calls to progress.Update() and progress.Done()?
I will post a Linux trace for comparison in a sec (my first try was too big).
(my first try was too big.)
My first tries were way too big and never finished. I learned (after 5+ tries) that I should start the profiling manually after several seconds, not automatically at the beginning; otherwise it just ended up with too many samples. Okay, I'll generate a new trace without the ProgressReporter.
Linux trace of 2s of a run.
report.zip
(It looks like you still have cudaDeviceSynchronize calls in there, but I'm not sure if it matters.)
I was still on the master branch
So I think this sums up the symptoms, though the cause isn't yet clear to me...
The top grey bar is the total time charged to an "Incorporate shadow ray contribution" launch. It's long. In the lower part, we can see that the GPU is basically idle, except for a short time when it is actually doing work.
There are two weird things: first, the _ZN4pbrt... thing, in blue, is taking a long time before it launches the kernel. Then, that cudaDeviceSynchronize() stalls for a long time. If you zoom in further to the point where it clears, there's this:
Which suggests it's stalled on TaskRunner(?!)
I seem to get a 2x speedup with windows-gpu-rework, as it now estimates a total time of about 3,600 seconds. Now let's comment out the ProgressReporter on top of that.
Nice! Fingers crossed.
(And the ProgressReporter thing may well not make a difference--that thread is expected to sleep, periodically wait on a CUDA event to see if it's cleared, and then print some stuff. So it's expected to be blocked a lot. But if for some reason it's interfering with the main command stream...)
Looking a lot tighter! I will make one with the ProgressReporter enabled again but still on the new branch.
The ProgressReporter does not seem to make any difference regarding timings.
Did you guys see my question about Windows emulating shared memory by copying large blocks of memory back and forth? Can you tell from the trace if that is happening?
I saw the question, and no idea, though I would expect those copies to pop up in the trace.
We seem to be spending an inordinate amount of time in GenerateCameraRays. Is that expected?
Catching up on a few things..
@pierremoreau those block and grid sizes are fine, I believe: 914 blocks * 768 threads works out to about half the image resolution, which makes sense, since it renders in two chunks of scanlines.
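A quick cross-check against the ray statistics earlier in the thread seems to support that: 718,626,816 camera rays over 1,024 launches is exactly 701,784 rays per launch, and a grid of 914 blocks of 768 threads is the smallest launch that covers that many threads (a sketch using only numbers quoted in the logs above):

```python
import math

camera_rays = 718_626_816  # "Camera rays" from the GPU statistics
launches = 1_024           # "Generate Camera rays  1024 launches"
block = 768                # block size (768, 1, 1) from the trace

assert camera_rays % launches == 0
rays_per_launch = camera_rays // launches
print(rays_per_launch)     # 701784 rays per launch

# Smallest grid of 768-thread blocks that covers 701784 threads:
grid = math.ceil(rays_per_launch / block)
assert grid == 914         # matches the grid size (914, 1, 1) in the trace
```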
@richardmgoodin I don't think it's memory copies--those show up separately in the traces and seem to be <0.1% of the total, which is to be expected (basically just enough to copy parameters over to OptiX when it is kicked off). (Under CUDA / stream 7 and CUDA / stream 15 in Pierre's latest trace; one of those is OptiX and one is all the rest of pbrt's kernels.) I was a little surprised by that--I had guessed that was going to be the problem..
It is definitely a ton of time in GenerateCameraRays, but all of the other kernels are seeing a similar proportional increase in time spent in them, so I don't think it's just that.
Will keep poking at Pierre's latest trace...
A lot of time seems to be spent waiting in GetProfilerEvents() (but that, too, may be expected..) The basic idea is that there is a small pool of 1k cudaEvent_t structs to put before/after each kernel launch for profiling. When it runs out of them, it waits for some to free up as earlier kernels finish. Normally, this shouldn't slow down the GPU, since there should be plenty of work buffered up, but it's unclear what's happening here.
Could someone try changing the 1024 passed to the resize() call at line 64 of GetProfilerEvents() in launch.cpp to 1000000 and see if anything changes? That should be enough that it will never need to wait for them when rendering killeroos-gold.
(I don't have a lot of confidence in this theory, since the traces seem to show that the kernels are filling the machine, just running really slow, but it's worth a shot..)
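To illustrate the mechanism being described (a toy host-side model, not pbrt's actual code; the function name and launch counts here are made up for illustration): with a bounded pool of profiling events, each launch must acquire a slot, and once the pool is exhausted the CPU has to wait for the oldest in-flight kernel to retire before it can enqueue more work.

```python
from collections import deque

def count_stalls(num_launches: int, pool_size: int) -> int:
    """Toy model of a bounded profiling-event pool: each launch takes a
    slot; when none are free, the CPU stalls until the oldest in-flight
    kernel finishes and returns its slot."""
    in_flight = deque()
    stalls = 0
    for launch in range(num_launches):
        if len(in_flight) == pool_size:
            in_flight.popleft()  # wait for the oldest kernel to finish
            stalls += 1
        in_flight.append(launch)
    return stalls

# With a 1024-slot pool, every launch past the first 1024 may have to wait:
print(count_stalls(num_launches=38_912, pool_size=1024))       # 37888
# With a huge pool (the suggested resize to 1000000), it never waits:
print(count_stalls(num_launches=38_912, pool_size=1_000_000))  # 0
```

Whether such waits actually stall the GPU depends on how much work is already buffered ahead of them, which is exactly the open question here.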
Drat. 12.7s on linux.
I keep asking myself what is different. The kernels are compiled identically with nvcc. We have two different pieces of hardware that are seeing about the same slowdown. We are not seeing slowdowns in memory traffic. What can the driver configure differently? Is a cache turned off somewhere? I am seeing a lot more compile warnings on Windows than I saw today when building for Linux, which was by and large clean. I can't imagine any compiler differences would account for things, since that is on the CPU side, and it appears that our kernels are just taking quite a bit longer to run.
Tested again on the windows-gpu-rework branch with eventPool.resize(1000000) and the ProgressReporter commented out, and still taking about 3,400–3,500 seconds.
Tested the windows-gpu-rework branch on the Linux boot, and I get an almost 10x speed-up (it took 417 seconds to render) compared to on Windows:
GPU Kernel Profile:
Generate Camera rays 1024 launches 34539.33 ms / 8.3% (avg 33.730, min 33.382, max 37.200)
Generate ray samples - HaltonSampler 6144 launches 58965.61 ms / 14.1% (avg 9.597, min 3.464, max 18.651)
Reset queues before tracing rays 6144 launches 1317.25 ms / 0.3% (avg 0.214, min 0.206, max 0.370)
Tracing closest hit rays 6144 launches 5049.73 ms / 1.2% (avg 0.822, min 0.068, max 15.452)
Handle emitters hit by indirect rays 6144 launches 21629.62 ms / 5.2% (avg 3.520, min 3.262, max 4.305)
ConductorMaterial + BxDF Eval (Basic tex) 5120 launches 65866.45 ms / 15.8% (avg 12.865, min 4.630, max 60.426)
DiffuseMaterial + BxDF Eval (Basic tex) 5120 launches 173395.52 ms / 41.5% (avg 33.866, min 4.435, max 74.719)
Tracing shadow rays 5120 launches 488.70 ms / 0.1% (avg 0.095, min 0.042, max 0.163)
Incorporate shadow ray contribution 5120 launches 24207.30 ms / 5.8% (avg 4.728, min 3.393, max 7.132)
Reset shadowRayQueue 5120 launches 430.44 ms / 0.1% (avg 0.084, min 0.080, max 0.115)
Handle medium transitions 5120 launches 17618.89 ms / 4.2% (avg 3.441, min 3.401, max 3.728)
Update Film 1024 launches 13703.16 ms / 3.3% (avg 13.382, min 13.259, max 13.697)
Other 7168 launches 564.33 ms / 0.1% (avg 0.079)
Total GPU time: 417776.31 ms
GPU Statistics:
Camera rays 718626816
Indirect rays, depth 1 718039417
Indirect rays, depth 2 411877673
Indirect rays, depth 3 24306355
Indirect rays, depth 4 7197276
Indirect rays, depth 5 2514007
Shadow rays, depth 0 359206412
Shadow rays, depth 1 195456010
Shadow rays, depth 2 107390074
Shadow rays, depth 3 6470485
Shadow rays, depth 4 2033699
And on the master branch:
GPU Kernel Profile:
Generate Camera rays 1024 launches 29981.08 ms / 8.9% (avg 29.278, min 29.043, max 33.308)
Generate ray samples - HaltonSampler 6144 launches 42534.34 ms / 12.6% (avg 6.923, min 0.940, max 15.806)
Reset queues before tracing rays 6144 launches 937.46 ms / 0.3% (avg 0.153, min 0.147, max 0.221)
Tracing closest hit rays 6144 launches 5157.17 ms / 1.5% (avg 0.839, min 0.070, max 19.147)
Handle emitters hit by indirect rays 6144 launches 10640.78 ms / 3.1% (avg 1.732, min 0.567, max 3.028)
ConductorMaterial + BxDF Eval (Basic tex) 5120 launches 58128.32 ms / 17.2% (avg 11.353, min 3.093, max 60.935)
DiffuseMaterial + BxDF Eval (Basic tex) 5120 launches 165481.17 ms / 48.9% (avg 32.321, min 2.957, max 72.519)
Tracing shadow rays 5120 launches 487.04 ms / 0.1% (avg 0.095, min 0.043, max 0.168)
Incorporate shadow ray contribution 5120 launches 10385.46 ms / 3.1% (avg 2.028, min 0.699, max 4.444)
Handle medium transitions 5120 launches 3667.10 ms / 1.1% (avg 0.716, min 0.700, max 0.968)
Update Film 1024 launches 10805.38 ms / 3.2% (avg 10.552, min 10.447, max 10.819)
Other 12288 launches 233.18 ms / 0.1% (avg 0.019)
Total GPU time: 338438.50 ms
GPU Statistics:
Camera rays 718626816
Indirect rays, depth 1 718039417
Indirect rays, depth 2 411877673
Indirect rays, depth 3 24306355
Indirect rays, depth 4 7197276
Indirect rays, depth 5 2514007
Shadow rays, depth 0 359206412
Shadow rays, depth 1 195456010
Shadow rays, depth 2 107390074
Shadow rays, depth 3 6470485
Shadow rays, depth 4 2033699
Still a 20x difference compared to your results. I am very confused how your RTX 2080 is pulling a 20x on my RTX 2080 Ti. I wonder if it's a CPU or storage difference.
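Incidentally, a sanity check on the windows-gpu-rework numbers above (using only the two figures reported): the summed GPU kernel time essentially equals the 417-second wall-clock render, which suggests the GPU timeline is packed with slow kernels rather than sitting idle.

```python
total_gpu_ms = 417_776.31  # "Total GPU time" from the profile above
wall_clock_s = 417         # reported wall-clock render time

gpu_s = total_gpu_ms / 1000.0
print(f"GPU busy for {gpu_s:.1f}s of a {wall_clock_s}s render")
# The two agree to within a fraction of a percent:
assert abs(gpu_s - wall_clock_s) / wall_clock_s < 0.01
```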
I was just looking at the build log under Windows. Should nvcc be generating code for -gencode=arch=compute_52,code="sm_52,compute_52"?
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code="sm_52,compute_52" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -v -g -D_MBCS -Xcompiler "/EHsc /W0 /nologo /Od /FdDebug\vc142.pdb /FS /Zi /RTC1 /MDd " -o Debug\CMakeCUDACompilerId.cu.obj "C:\cygwin64\home\goodin\pbrt-v4\build\CMakeFiles\3.18.2\CompilerIdCUDA\CMakeCUDACompilerId.cu"
Weird, in the initial log you had in #20 it did generate (correctly, you have a GV100 right?) for SM 7.0.
Yes, that's correct. I'm digging into the linux side now to see if there is a difference (which I would expect)
Here's what I'm seeing on Linux:
/usr/local/cuda-11.0/bin/nvcc $(CUDA_DEFINES) $(CUDA_INCLUDES) $(CUDA_FLAGS) -x cu -dc /home/goodin/pbrt-v4/src/pbrt/lights.cpp -o CMakeFiles/pbrt_lib.dir/src/pbrt/lights.cpp.o
CUDA_FLAGS = -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 --use_fast_math -lineinfo --maxrregcount 128 --gpu-architecture=sm_70 -O3 -DNDEBUG -D NDEBUG
CUDA_DEFINES = -DNVTX -DPBRT_BUILD_GPU_RENDERER -DPBRT_HAVE_MMAP -DPBRT_HAVE_POSIX_MEMALIGN -DPBRT_IS_LINUX -DPBRT_NOINLINE="attribute((noinline))" -DPTEX_STATIC
CUDA_INCLUDES = -I/home/goodin/NVIDIA-OptiX-SDK-7.1.0-linux64-x86_64/include -I/home/goodin/pbrt-v4/src -I/home/goodin/pbrt-v4/build -I/home/goodin/pbrt-v4/src/ext/openvdb/nanovdb
I don't know enough about CMake to understand why it is generating sm_52.
Are you seeing the same thing under Windows, or is my CUDA install hosed somehow?
Look at: https://github.com/mmp/pbrt-v4/blob/master/CMakeLists.txt#L136-L161
Do you happen to also have a Maxwell card in your box?
No, just a single GV100
I will reboot to Windows in a bit, but last I checked it properly detected the RTX 2080 Ti.
Try modifying https://github.com/mmp/pbrt-v4/blob/master/cmake/checkcuda.cu or stepping through it to understand why it emits the wrong SM capability.
Alternatively, you could force setting ARCH in the CMake to sm_70. But it would be nice to understand where the sm_52 is coming from.
If you modify the main CMake file for example (just to add a print or something) and re-run CMake, what does it say after "CUDA Architecture:"? Here is what I get on Linux:
-- Found CUDA:
-- CUDA Architecture: sm_75
I checked on Windows and I am getting sm_75 as expected.
What was happening is that early on I was hand-editing CMakeLists.txt to include OptiX. When I do this, it generates sm_70 as expected. When I started running cmake-gui, I would run the configure pass without OptiX set, edit it in the window, and then run generate. I'm rebuilding now with sm_70, but since you have always been running with sm_75, I don't think that was the issue.
Did you take a look at the nvcc compile flag differences other than the SM version, and is there anything jumping out that could cause the differences in kernel execution? It looks like there are quite a few differences. -maxrregcount jumps out to me.
Is "-G" debug in nvcc?
Is "-G" debug in nvcc?
Correct, debug info AFAIR.
I'll have a look at the flags, but they looked reasonable. Also, they are the same on Linux and Windows AFAIR.
Regarding the sm_52 thing in #24 (comment), that looks like the "checkcuda" thing around line 128 of CMakeLists.txt, where it's just compiling a small program to run to check the GPU's capabilities, not the main build.
I did just notice this:
if (CMAKE_BUILD_TYPE MATCHES Release)
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo --maxrregcount 128")
else()
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -G -g")
endif ()
If you're doing a RelWithDebInfo build, I'm wondering if that's not working as expected...
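If that is what's happening, the mechanics would be: MATCHES is an unanchored regex test, and "RelWithDebInfo" does not contain "Release" as a substring, so a RelWithDebInfo build would fall into the else() branch and pick up the -G device-debug flags. A quick sketch of that matching logic (in Python's regex, which behaves the same way for this pattern):

```python
import re

# CMake's `if (CMAKE_BUILD_TYPE MATCHES Release)` is an unanchored regex
# match against the build type string. Only "Release" itself matches:
for build_type in ["Release", "RelWithDebInfo", "MinSizeRel", "Debug"]:
    matches = re.search("Release", build_type) is not None
    branch = "fast-math + -lineinfo" if matches else "-G -g (debug kernels!)"
    print(f"{build_type}: {branch}")

assert re.search("Release", "RelWithDebInfo") is None
```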
I saw that part and changed it locally, and then thought "Oh, I think this is just adding debug symbols, so it should be fine to leave as-is", so I then reverted it.
So I'm re-compiling with it swapped and testing against "Debug". nvcc is warning about maxrregcount being redefined. I'm not sure where the -maxrregcount=0 is coming from.
Here is what I'm seeing from my Release build. Note that the maxrregcount seems to contradict the CMake code above; also, the -G is there. This was copied from my build output directly from VS2019.
22>C:\cygwin64\home\goodin\pbrt-v4\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.1.0\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src" -I"C:\cygwin64\home\goodin\pbrt-v4\build" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openvdb\nanovdb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\stb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\IlmBase\config" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\OpenEXR\config" -I"C:\Program Files (x86)\PBRT-V4\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\filesystem" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\ptex\src\ptex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\double-conversion" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -Xcompiler="/EHsc -Ob2" -g -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX 
-D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\Release\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\Release\cameras.obj "C:\cygwin64\home\goodin\pbrt-v4\src\pbrt\cameras.cpp"
Okay, so doing
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 22c8abc..797db35 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -143,10 +143,10 @@ if (CMAKE_CUDA_COMPILER)
OUTPUT_VARIABLE ARCH)
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --std=c++17")
- if (CMAKE_BUILD_TYPE MATCHES Release)
- set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo --maxrregcount 128")
- else()
+ if (CMAKE_BUILD_TYPE MATCHES Debug)
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -G -g")
+ else()
+ set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo --maxrregcount 128")
endif ()
and it's now rendering in about 1200 seconds, it looks like. So progress!
@mmp If you change --maxrregcount into -maxrregcount, CMake will detect it and avoid emitting the -maxrregcount 0. It's a similar issue to https://gitlab.kitware.com/cmake/cmake/-/issues/20164.
What's weird is I was running a full release build, so the change shouldn't affect me, right?
Probably not, which would explain why you were still getting much better timings than I did, even though it was still slower than what Matt gets on Linux.
I just ran a rebuild of the ToT. No "-G", but -maxrregcount=0 is still there.
Oh, I see now. There are both -maxrregcount=0 and --maxrregcount 128
@richardmgoodin Did you apply the change I suggested here?
No, trying it now
@mmp \o/ With the CMake tweaks, I now render the scene in 14.3 seconds on Linux! So now we only need to figure out where the two remaining orders of magnitude are coming from on Windows.
is this in the windows-gpu-rework branch?
The 14.3 seconds was on the master branch; I haven’t tried the CMake tweaks + the windows-gpu-rework branch on Linux
I take it back: testing the windows-gpu-rework + CMake tweak on Windows with only the RTX 2080 Ti in the box, and it rendered in 16 seconds.
I don't see anything unexpected anymore but just for the record, here's what I'm seeing for a full release build.
C:\cygwin64\home\goodin\pbrt-v4\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.1.0\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src" -I"C:\cygwin64\home\goodin\pbrt-v4\build" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openvdb\nanovdb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\stb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\IlmBase\config" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\OpenEXR\config" -I"C:\Program Files (x86)\PBRT-V4\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\filesystem" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\ptex\src\ptex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\double-conversion" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\Release -maxrregcount=128 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -lineinfo -Xcompiler="/EHsc -Ob2" -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX 
-D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\Release\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\Release\subsurface.obj "C:\cygwin64\home\goodin\pbrt-v4\src\pbrt\gpu\subsurface.cpp"
killeroos-gold running full release, no windows-gpu-rework: 121.4 seconds.
I just pushed the -maxrregcount fix (to both branches).
So just to be sure I'm caught up: current status is that Pierre is seeing expected good perf on both Windows and Linux, but Richard is still 10x off on both?
(Edit: re-reading, it looks like Pierre was just reporting Linux above.)
There is the difference of the RT cores, but IIRC Richard's slowdown was across all kernels.
(Edit: re-reading, it looks like Pierre was just reporting linux above.)
I reported on both: the first comment was about Linux indeed, but the later one does say Windows in it. 😛
Probably it was the “I take it back” part at the beginning that was confusing; it was referring to this earlier comment about Windows results.
So to sum up, the scene is now rendering in ~14 seconds on Linux, and about 16–17 on Windows.
but Richard is still 10x off on both?
The numbers Richard reported were from the master branch (IIRC), so without all the sync'ing improvements you made on the windows-gpu-rework branch. I got a 2x from switching to the windows-gpu-rework branch, and that was in debug mode, so I wouldn't be surprised if the effect is even more pronounced in release mode; I think Richard should be getting close to our results now.
I'm building the windows-gpu-rework code now. I'm still getting 123.5. I find it really suspicious that I'm getting the same. I need to step away for another commitment for a couple of hours, but when I get back I'll be a little more rigorous about verifying what is going on. Just to confirm: to check out the branch, I just add "--branch windows-gpu-rework" and the code pulled is only from the branch?
Just to confirm to check out the branch I just add "--branch windows-gpu-rework" and the code pulled is only from the branch?
The checked out code should be coming from that branch, correct. You can easily switch between branches by using git checkout $branchName or git switch $branchName, so you do not need to clone every time you want to test a different branch.
I just pulled top of tree and built under Linux. Killeroos-gold is running at 17.9s. So it doesn't seem to be a GV100 issue.
Windows top of tree, RelWithDebInfo. Killeroos-gold: 121.3s. I also got a build failure when building with a new pull that affected imgtool and cyhair2pbrt. Cleaning the build and rebuilding resolved the failure. Here's the nvcc line:
23>C:\cygwin64\home\goodin\pbrt-v4\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.1.0\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src" -I"C:\cygwin64\home\goodin\pbrt-v4\build" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openvdb\nanovdb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\stb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\IlmBase\config" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\OpenEXR\config" -I"C:\Program Files (x86)\PBRT-V4\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\filesystem" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\ptex\src\ptex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\double-conversion" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\RelWithDebInfo -maxrregcount=128 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -lineinfo -Xcompiler="/EHsc -Zi -Ob1" -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS 
-DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\RelWithDebInfo\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\RelWithDebInfo\cameras.obj "C:\cygwin64\home\goodin\pbrt-v4\src\pbrt\cameras.cpp"
Very strange: I'm not seeing an -O3 being passed to nvcc there.. (Or --gpu-architecture=..)
I've never seen "--gpu-architecture" in the Windows build. Windows looks like it is passing /O3, but does nvcc understand the "/" prefix args?
I'm looking at CMakeLists. It used to have "-O3" in a very old version, but it isn't there now.
Look around line 149 where we made the change earlier. I don't see -O3 there. I do see "-std=c++17". At line 189 I see both "c++17" and "-O3". I don't know enough about CMake to know which it is using or why things are defined twice.
(BTW I have merged the windows-gpu-rework branch to master, with some improvements to work around the performance hit it previously introduced on Linux.) Let me know if I messed up anything on the Windows front..
from pbrt-v4.
I added -O3 to the Release build; no improvement. The image looks correct with your changes. Here's my environment:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:48_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28540450_0
"C:/ProgramData/NVIDIA Corporation/OptiX SDK 7.1.0"
driver version 451.98
from pbrt-v4.
Here's what CMake is finding
configuration:
Selecting Windows SDK version 10.0.18362.0 to target Windows 10.0.19041.
The CXX compiler identification is MSVC 19.27.29111.0
The C compiler identification is MSVC 19.27.29111.0
Detecting CXX compiler ABI info
Detecting CXX compiler ABI info - done
Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Professional/VC/Tools/MSVC/14.27.29110/bin/Hostx64/x64/cl.exe - skipped
Detecting CXX compile features
Detecting CXX compile features - done
Detecting C compiler ABI info
Detecting C compiler ABI info - done
Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Professional/VC/Tools/MSVC/14.27.29110/bin/Hostx64/x64/cl.exe - skipped
Detecting C compile features
Detecting C compile features - done
Found Git: C:/cygwin64/bin/git.exe (found version "2.28.0")
Looking for pthread.h
Looking for pthread.h - not found
Found Threads: TRUE
Found ZLIB: C:/Program Files (x86)/PBRT-V4/lib/zlibstatic.lib (found version "1.2.8")
Configure ILMBASE Version: 2.5.3 Lib API: 25.0.2
CMake Warning (dev) at src/ext/openexr/IlmBase/config/IlmBaseSetup.cmake:56 (option):
Policy CMP0077 is not set: option() honors normal variables. Run "cmake
--help-policy CMP0077" for policy details. Use the cmake_policy command to
set the policy and suppress this warning.
For compatibility with older versions of CMake, option is clearing the
normal variable 'BUILD_SHARED_LIBS'.
Call Stack (most recent call first):
src/ext/openexr/IlmBase/CMakeLists.txt:35 (include)
This warning is for project developers. Use -Wno-dev to suppress it.
Looking for include file ucontext.h
Looking for include file ucontext.h - not found
-- WARNING pkg-config generation disabled
Configure OpenEXR Version: 2.5.3 Lib API: 25.0.2
Performing Test OPENEXR_IMF_HAVE_SYSCONF_NPROCESSORS_ONLN
Performing Test OPENEXR_IMF_HAVE_SYSCONF_NPROCESSORS_ONLN - Failed
Performing Test OPENEXR_IMF_HAVE_GCC_INLINE_ASM_AVX
Performing Test OPENEXR_IMF_HAVE_GCC_INLINE_ASM_AVX - Failed
clang-format not found.
Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
Looking for a CUDA compiler
Looking for a CUDA compiler - C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/bin/nvcc.exe
Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0 (found version "11.0")
Found CUDA: 11.0
The CUDA compiler identification is NVIDIA 11.0.194
Detecting CUDA compiler ABI info
Detecting CUDA compiler ABI info - done
Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/bin/nvcc.exe - skipped
Detecting CUDA compile features
Detecting CUDA compile features - done
checkcuda.cu
Creating library C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.lib and object C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.exp
CUDA Architecture: sm_70
Performing Test COMPILER_SUPPORTS_MARCH_NATIVE
Performing Test COMPILER_SUPPORTS_MARCH_NATIVE - Failed
Performing Test HAVE_MMAP
Performing Test HAVE_MMAP - Failed
Performing Test HAS_INTRIN_H
Performing Test HAS_INTRIN_H - Success
Unable to find -lprofiler
Performing Test HAVE_DECLSPEC_NOINLINE
Performing Test HAVE_DECLSPEC_NOINLINE - Success
Performing Test HAVE_ATTRIBUTE_NOINLINE
Performing Test HAVE_ATTRIBUTE_NOINLINE - Failed
Performing Test HAVE__ALIGNED_MALLOC
Performing Test HAVE__ALIGNED_MALLOC - Success
Performing Test HAVE_POSIX_MEMALIGN
Performing Test HAVE_POSIX_MEMALIGN - Failed
Performing Test INT64_IS_OWN_TYPE
Performing Test INT64_IS_OWN_TYPE - Failed
Configuring done
generation:
Selecting Windows SDK version 10.0.18362.0 to target Windows 10.0.19041.
Configure ILMBASE Version: 2.5.3 Lib API: 25.0.2
-- WARNING pkg-config generation disabled
Configure OpenEXR Version: 2.5.3 Lib API: 25.0.2
clang-format not found.
Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
Found CUDA: 11.0
checkcuda.cu
Creating library C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.lib and object C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.exp
CUDA Architecture: sm_70
Unable to find -lprofiler
Configuring done
from pbrt-v4.
Maybe I'm going back to the same well too much, but I compared the nvcc lines between the Linux and Windows builds. I'm seeing two differences:
Windows:
-maxrregcount=128
Linux:
-maxrregcount 128
Windows:
-gencode=arch=compute_70,code="sm_70,compute_70"
Linux:
--gpu-architecture=sm_70
Everything else is the same. I don't know enough CMake to change things over in the Windows build to go further. I only see line 153 in CMakeLists.txt, which contains:
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --std=c++17")
if (CMAKE_BUILD_TYPE MATCHES Debug)
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -G -g")
else()
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo -maxrregcount 128 -O3")
endif ()
Which I think is correct, but it's not what I'm seeing in the build log. I'm assuming "-gencode=arch=compute_70,code="sm_70,compute_70"" and "--gpu-architecture=sm_70" are equivalent. I see the "--gpu-architecture" line at line 160 of CMakeLists.txt, but I don't see it in the build log either.
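One possible explanation (an assumption on my part, not verified against pbrt-v4's actual CMakeLists.txt): the Visual Studio generator is multi-config, so CMAKE_BUILD_TYPE is typically empty at configure time, and the if/else above would not behave the way it does in a single-config Makefile build on Linux. A minimal sketch of expressing the same flags per configuration instead, using the standard _<CONFIG> variables:

```cmake
# Sketch only -- the variables are standard CMake, but whether this fits
# pbrt-v4's CMakeLists layout is an assumption.
set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --std=c++17 --use_fast_math")

# Per-configuration flags apply with multi-config generators (Visual Studio)
# as well, where CMAKE_BUILD_TYPE is empty at configure time.
set (CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -G -g")
set (CMAKE_CUDA_FLAGS_RELEASE
     "${CMAKE_CUDA_FLAGS_RELEASE} -O3 -lineinfo -maxrregcount 128")
set (CMAKE_CUDA_FLAGS_RELWITHDEBINFO
     "${CMAKE_CUDA_FLAGS_RELWITHDEBINFO} -O3 -lineinfo -maxrregcount 128")
```

With the per-configuration variables, the flags attach to whichever configuration Visual Studio actually builds, rather than to a configure-time guess.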
from pbrt-v4.
Pierre, what version of CMake are you running? I'm running:
$ cmake --version
cmake version 3.18.2
CMake suite maintained and supported by Kitware (kitware.com/cmake).
from pbrt-v4.
Pierre, what version of CMake are you running?
I am also running CMake 3.18.2 on Windows.
Look around line 149 where we made the change earlier. I don't see -O3 there. I do see "-std=c++17". At line 189 I see both "c++17" and "-O3". I don't know enough about CMake to know which it is using or why things are defined twice.
The double definition is probably coming from one set being directed to nvcc while the other is forwarded to the host compiler for compiling the host code; this is something I want to look at in the CMake configuration to see if it can be improved.
From one of the command-line examples you posted, the argument to -Xcompiler is forwarded to cl.exe, so nvcc won't be seeing the /O2, since it is meant for the host compiler: -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\RelWithDebInfo\pbrt_lib.pdb /FS /Zi /MD /GR"
From nvcc --help, for the --gpu-architecture entry:
For example, 'nvcc --gpu-architecture=sm_50' is equivalent to 'nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,compute_50'.
and from the --generate-code (-gencode) entry:
In fact, '--gpu-architecture=<arch> --gpu-code=<code>,...' is equivalent to '--generate-code arch=<arch>,code=<code>,...'.
So these differences in the command lines you were seeing between Linux and Windows should not matter:
Windows:
-gencode=arch=compute_70,code="sm_70,compute_70"
Linux:
--gpu-architecture=sm_70
Very strange: I'm not seeing an -O3 being passed to nvcc there.. (Or --gpu-architecture=..)
and
I've never seen "--gpu-architecture" in the windows build.
It is passed in; it's actually the very first argument passed to nvcc, which is why we all missed it, because we were looking towards the end 😉 Have another look at the command line.
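Since you're both on CMake 3.18, a hedged aside (not necessarily how pbrt-v4 does it): the modern way to control these architecture flags is CMAKE_CUDA_ARCHITECTURES, or the CUDA_ARCHITECTURES target property, which CMake expands into exactly the -gencode form showing up first on the Windows command lines:

```cmake
# Sketch under CMake >= 3.18; the target name "pbrt_lib" is taken from the
# build log and may not match the actual CMakeLists.
cmake_minimum_required(VERSION 3.18)

# A bare "70" asks for both the real (sm_70) and virtual (compute_70)
# architectures; CMake turns this into
# -gencode=arch=compute_70,code=[compute_70,sm_70], which nvcc documents
# as equivalent to --gpu-architecture=sm_70.
set(CMAKE_CUDA_ARCHITECTURES 70)

# Or per target:
# set_property(TARGET pbrt_lib PROPERTY CUDA_ARCHITECTURES 70)
```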
from pbrt-v4.
So Pierre, on your command line you are also seeing "-maxrregcount=128"? I ask because in all the nvcc docs I can find, it supposedly should be "-maxrregcount 128" or "--maxrregcount=128". Can you tell me how to get CMake to generate "-maxrregcount 128" instead of "-maxrregcount=128"?
from pbrt-v4.
Here is how one of the command lines look like:
D:\Builds\pbrt-v4>"D:\Program Files\NVIDIA Corporation\CUDA\v11.0\Toolkit\bin\nvcc.exe" -gencode=arch=compute_75,code="sm_75,compute_75" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"D:\Program Files\NVIDIA Corporation\CUDA\v11.0\Toolkit\include" -I"D:\Program Files\NVIDIA Corporation\OptiX\v7.1.0\include" -I"D:\Softwares\pbrt-v4\src" -I"D:\Builds\pbrt-v4" -I"D:\Softwares\pbrt-v4\src\ext\openvdb\nanovdb" -I"D:\Softwares\pbrt-v4\src\ext" -I"D:\Softwares\pbrt-v4\src\ext\stb" -I"D:\Softwares\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"D:\Softwares\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"D:\Softwares\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"D:\Softwares\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"D:\Builds\pbrt-v4\src\ext\openexr\IlmBase\config" -I"D:\Builds\pbrt-v4\src\ext\openexr\OpenEXR\config" -I"D:\Softwares\pbrt-v4\src\ext\zlib" -I"D:\Builds\pbrt-v4\src\ext\zlib" -I"D:\Softwares\pbrt-v4\src\ext\filesystem" -I"D:\Softwares\pbrt-v4\src\ext\ptex\src\ptex" -I"D:\Softwares\pbrt-v4\src\ext\double-conversion" -I"D:\Program Files\NVIDIA Corporation\CUDA\v11.0\Toolkit\include" --keep-dir x64\RelWithDebInfo -maxrregcount=128 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -lineinfo -Xcompiler="/EHsc -Zi -Ob1" -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DNVTX -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DNVTX 
-DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\RelWithDebInfo\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\RelWithDebInfo\samples.obj "D:\Softwares\pbrt-v4\src\pbrt\gpu\samples.cpp"
Can you tell me how to get CMake to generate "-maxrregcount 128" instead of "-maxrregcount=128"?
I think you would need to patch CMake: if CMake was not adding that default -maxrregcount=0, you could just pass in -maxrregcount 128 as you would other arguments. I don't think it matters in the end, because nvcc does not seem to complain about an unknown argument, and when I looked at the different generated kernels, none were using more than 128 registers.
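To avoid patching CMake, one route (just a sketch, untested against this project) would be to scope the flag to CUDA sources with a generator expression; whether that cleanly overrides the -maxrregcount=0 the generator injects is an assumption on my part:

```cmake
# "pbrt_lib" is the target name from the build log; hypothetical here.
# The $<COMPILE_LANGUAGE:CUDA> guard keeps the flag away from cl.exe,
# which would not understand it.
target_compile_options(pbrt_lib PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--maxrregcount=128>)
```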
from pbrt-v4.
So, I just deleted and reinstalled my entire NVIDIA software stack. RelWithDebInfo, killeroos-gold: 17.4s. I think we are done with this issue.
from pbrt-v4.
Glad to hear you got it running well, but it's unfortunate you needed to reinstall everything to achieve that!
from pbrt-v4.
Yaay!
from pbrt-v4.