division optimization
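Running the tests here, the performance of the fused multiply-add (fmadd) operation is clearly too low.
The theoretical single-precision peak of the MIC is 1 Tflops; Prof. Chen previously measured 700+ Gflops.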
**************** Test for FMADD ****************
fmadd_intrin flops = 62.070177 Gflops
fmadd_intrin flops = 153.658040 Gflops
fmadd_autovec flops = 3.188641 Gflops
fmadd_autovec flops = 3.175300 Gflops
**************** Test for DIV ****************
division_cpu flops = 2.377036 Gflops
division_cpu flops = 2.408519 Gflops
division_intrin flops = 10.930253 Gflops
division_intrin flops = 10.779101 Gflops
division_autovec flops = 0.874387 Gflops
division_autovec flops = 0.806637 Gflops
newdiv_autovec flops = 34.460859 Gflops
newdiv_autovec flops = 34.163176 Gflops
newdiv_intrin flops = 35.127830 Gflops
newdiv_intrin flops = 31.881515 Gflops
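For context on how a Gflops figure like the ones above can be derived, here is a minimal C++ sketch. It is not the repository's benchmark: `to_gflops` and `measure_fma_gflops` are hypothetical names, the accumulator count of 8 is an arbitrary choice, and it assumes each fused multiply-add is counted as 2 floating-point operations.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>

// Convert a floating-point operation count and elapsed seconds into Gflops.
double to_gflops(double flop_count, double seconds) {
    return flop_count / seconds / 1e9;
}

// Hypothetical micro-benchmark: runs `iters` rounds of 8 independent
// fused multiply-adds and returns the measured rate in Gflops.
// The final accumulator sum is written to `sink` so the compiler
// cannot discard the loop.
double measure_fma_gflops(std::size_t iters, float* sink) {
    assert(sink != nullptr);
    constexpr int kAccumulators = 8;  // assumption: arbitrary choice
    float acc[kAccumulators];
    for (int j = 0; j < kAccumulators; ++j) acc[j] = 1.0f + 0.001f * j;
    const float a = 0.999f, b = 0.001f;

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i)
        for (int j = 0; j < kAccumulators; ++j)
            acc[j] = std::fma(acc[j], a, b);  // 1 multiply + 1 add
    auto stop = std::chrono::steady_clock::now();

    float total = 0.0f;
    for (int j = 0; j < kAccumulators; ++j) total += acc[j];
    *sink = total;

    double seconds = std::chrono::duration<double>(stop - start).count();
    // Each fused multiply-add is counted as 2 flops.
    return to_gflops(2.0 * kAccumulators * static_cast<double>(iters), seconds);
}
```

A scalar loop like this will land far below a vector unit's peak; reaching the peak generally requires full-width vector FMA instructions and enough independent work to hide instruction latency.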
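The version without a lookup table reaches 3x the throughput of the lookup-table version.
The code is under branch yesx.
For precision, see the program https://github.com/konjac/division/blob/yesx/yesx/test.cpp
It uses the fast inverse square root to obtain a close initial estimate, then refines it with Newton's method.
I tested a number of inputs myself:
Number of Newton iterations needed to converge: 2 for float, 3 for double.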
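The idea described above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the code in the yesx branch: `fast_rsqrt` and `newton_div` are hypothetical names, the seed uses the widely known `0x5f3759df` constant with one refinement step, the reciprocal estimate is obtained by squaring the inverse square root, and the Newton update for the reciprocal is x ← x(2 − bx).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Classic fast inverse square root: a bit-level initial guess
// followed by one Newton step. Expects a positive, finite x.
float fast_rsqrt(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // reinterpret float as integer
    bits = 0x5f3759dfu - (bits >> 1);      // well-known magic constant
    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);  // one Newton step on 1/sqrt(x)
}

// Hypothetical sketch: approximate a / b without a division instruction.
// Assumes b is nonzero and finite.
float newton_div(float a, float b, int iterations) {
    assert(b != 0.0f);
    const float mag = std::fabs(b);
    const float r = fast_rsqrt(mag);
    float x = r * r;                       // (1/sqrt|b|)^2 ~ 1/|b|
    for (int i = 0; i < iterations; ++i)
        x = x * (2.0f - mag * x);          // Newton step for the reciprocal
    const float q = a * x;
    return std::signbit(b) ? -q : q;       // restore the sign of b
}
```

For example, `newton_div(0.840188f, 0.394383f, 2)` approximates the first quotient in the precision comparison. Each Newton step roughly squares the relative error of the reciprocal estimate, which is why a small fixed number of iterations can suffice once the seed is reasonably close.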
For speed, see the program https://github.com/konjac/division/blob/yesx/gputest-yesx/divisionflops.cu
time = 544.192322 ms
The program it is compared against is https://github.com/konjac/division/blob/yesx/gputest/divisionflops.cu
time = 748.940674 ms
ARCH = sm_21
Experimental environment
Device 0: "GeForce GT 630"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 2) Multiprocessors x ( 48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1620 MHz (1.62 GHz)
Memory Clock rate: 667 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GT 630
Division precision comparison:
(In order: CPU division, MIC library division, Prof. Chen's method with 5 iterations, and Ye Shuxiong's method)
0.840188 / 0.394383 =
CPU Standard: 2.130386
MIC Standard: 2.130386
MIC Iterative: 2.130386
MIC FISR: 2.089152
0.783099 / 0.798440 =
CPU Standard: 0.980787
MIC Standard: 0.980787
MIC Iterative: 0.980787
MIC FISR: 0.980659
0.911647 / 0.197551 =
CPU Standard: 4.614736
MIC Standard: 4.614736
MIC Iterative: 4.614732
MIC FISR: 4.174708
0.335223 / 0.768230 =
CPU Standard: 0.436358
MIC Standard: 0.436358
MIC Iterative: 0.436358
MIC FISR: 0.436254
0.277775 / 0.553970 =
CPU Standard: 0.501426
MIC Standard: 0.501426
MIC Iterative: 0.501426
MIC FISR: 0.499256
0.477397 / 0.628871 =
CPU Standard: 0.759134
MIC Standard: 0.759134
MIC Iterative: 0.759134
MIC FISR: 0.757703
0.364784 / 0.513401 =
CPU Standard: 0.710526
MIC Standard: 0.710526
MIC Iterative: 0.710526
MIC FISR: 0.705917
0.952230 / 0.916195 =
CPU Standard: 1.039331
MIC Standard: 1.039331
MIC Iterative: 1.039331
MIC FISR: 1.039327
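To put a number on the gap between the FISR column and the reference, the listing above can be reduced to relative errors. The sketch below is only an illustration: `relative_error` and `max_fisr_relative_error` are hypothetical helpers, and the value pairs are copied from the "CPU Standard" and "MIC FISR" lines above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Relative error of `approx` with respect to a nonzero `reference`.
double relative_error(double approx, double reference) {
    assert(reference != 0.0);
    return std::fabs(approx - reference) / std::fabs(reference);
}

// {CPU Standard, MIC FISR} pairs copied from the listing above.
const double kPairs[][2] = {
    {2.130386, 2.089152}, {0.980787, 0.980659},
    {4.614736, 4.174708}, {0.436358, 0.436254},
    {0.501426, 0.499256}, {0.759134, 0.757703},
    {0.710526, 0.705917}, {1.039331, 1.039327},
};

// Largest relative error of the FISR results over all listed pairs.
double max_fisr_relative_error() {
    double worst = 0.0;
    for (std::size_t i = 0; i < sizeof kPairs / sizeof kPairs[0]; ++i) {
        const double e = relative_error(kPairs[i][1], kPairs[i][0]);
        if (e > worst) worst = e;
    }
    return worst;
}
```

On these eight samples the largest deviation is about 9.5%, for 0.911647 / 0.197551.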
lei-april commented a day ago
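I performed 2^15 random accesses on a double array of length 1024.
With the prefetch instruction, the L1 cache hit rate improved by about 1% (96.6% -> 97.9%).
After a few more runs, prefetch does not appear to bring any noticeable performance improvement.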
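A rough C++ sketch of this kind of experiment is shown below. It is not the code that produced the numbers above: `make_indices` and `sum_random` are hypothetical helpers, the prefetch distance of 8 is an arbitrary assumption, and `__builtin_prefetch` is the GCC/Clang builtin rather than a MIC-specific intrinsic. Note that 1024 doubles occupy 8 KiB, so if the L1 data cache is 32 KiB the whole array fits in it, which may be why prefetching changes so little.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reproducible pseudo-random indices in [0, array_len) from a simple LCG.
std::vector<std::uint32_t> make_indices(std::size_t count,
                                        std::uint32_t array_len) {
    assert(array_len > 0);
    std::vector<std::uint32_t> idx(count);
    std::uint32_t state = 12345u;
    for (auto& v : idx) {
        state = state * 1664525u + 1013904223u;
        v = (state >> 8) % array_len;
    }
    return idx;
}

// Sum data[idx[i]] over all i, optionally prefetching the element that
// will be needed kDistance iterations later.
double sum_random(const std::vector<double>& data,
                  const std::vector<std::uint32_t>& idx,
                  bool use_prefetch) {
    constexpr std::size_t kDistance = 8;  // assumption: arbitrary distance
    double sum = 0.0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (use_prefetch && i + kDistance < idx.size())
            __builtin_prefetch(&data[idx[i + kDistance]], 0, 3);
#else
        (void)use_prefetch;  // no prefetch builtin available here
#endif
        sum += data[idx[i]];
    }
    return sum;
}
```

Calling `sum_random` with `make_indices(1u << 15, 1024)` on a 1024-element vector, once with and once without prefetching, and timing both runs (or reading hardware counters) would mirror the comparison described above. The returned sums are identical either way, since prefetching only affects cache behavior.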
Posting the performance data presented at the group meeting the other day here, so it is easy to look up later.