The division from konjac

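FMADD performance on MIC is lower than expected

Running the tests here, the performance of the fused multiply-add operation (fmadd) is clearly too low.

The theoretical single-precision peak of the MIC is 1 TFLOPS; Teacher Chen previously measured 700+ GFLOPS.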

**************** Test for FMADD ****************
fmadd_intrin flops = 62.070177 Gflops
fmadd_intrin flops = 153.658040 Gflops
fmadd_autovec flops = 3.188641 Gflops
fmadd_autovec flops = 3.175300 Gflops
**************** Test for DIV ****************
division_cpu flops = 2.377036 Gflops
division_cpu flops = 2.408519 Gflops
division_intrin flops = 10.930253 Gflops
division_intrin flops = 10.779101 Gflops
division_autovec flops = 0.874387 Gflops
division_autovec flops = 0.806637 Gflops
newdiv_autovec flops = 34.460859 Gflops
newdiv_autovec flops = 34.163176 Gflops
newdiv_intrin flops = 35.127830 Gflops
newdiv_intrin flops = 31.881515 Gflops
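
For readers who want to reproduce Gflops figures like the ones in the log above, the following is a minimal sketch of how such a number can be derived. It assumes each fmadd counts as two floating-point operations (one multiply, one add). The names `fma_gflops` and `measure_fma_gflops` and the lane count of 16 are hypothetical and are not taken from the repository's benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Convert a count of fused multiply-adds and an elapsed time into Gflops,
// counting each FMA as 2 floating-point operations (assumption).
double fma_gflops(std::uint64_t n_fma, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(n_fma) / seconds / 1e9;
}

// Hypothetical micro-benchmark: `iters` rounds of FMA over 16 independent
// accumulators, so the compiler has a chance to vectorize the inner loop.
double measure_fma_gflops(std::size_t iters) {
    assert(iters > 0);
    constexpr std::size_t kLanes = 16;  // illustrative lane count
    float acc[kLanes];
    for (std::size_t j = 0; j < kLanes; ++j) acc[j] = 1.0f + 0.001f * j;
    const float a = 0.999f, b = 0.001f;  // keeps the values bounded near 1

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i)
        for (std::size_t j = 0; j < kLanes; ++j)
            acc[j] = std::fma(acc[j], a, b);
    auto t1 = std::chrono::steady_clock::now();

    // Consume the results so the loop is not optimized away.
    float sink = 0.0f;
    for (std::size_t j = 0; j < kLanes; ++j) sink += acc[j];
    volatile float keep = sink;
    (void)keep;

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return fma_gflops(static_cast<std::uint64_t>(iters) * kLanes, seconds);
}
```

A call such as `measure_fma_gflops(100000000)` returns a single-thread figure. Reaching anything near the device peak would additionally require using all cores and hardware threads, which this sketch does not attempt.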

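Table lookup is very inefficient on MIC

The version without table lookup is 3x as fast as the version with table lookup.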

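Division test on GPU: fast inverse square root + Newton iteration

The code is on the branch yesx.

For accuracy, see the program https://github.com/konjac/division/blob/yesx/yesx/test.cpp
It uses the fast inverse square root to obtain a close initial estimate, then refines it with Newton's method.
I tested a number of inputs myself:
Number of Newton iterations: float converges after 2 iterations, double after 3.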
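
The idea described above can be sketched as follows. This is an illustrative version under stated assumptions, not the code in test.cpp: it assumes a positive single-precision divisor, seeds the reciprocal as the square of the classic bit-trick estimate of 1/sqrt(b) with the well-known constant 0x5f3759df, and refines with the Newton step x <- x * (2 - b * x). The names `fisr` and `approx_div` are hypothetical. Because this seed is unrefined, the number of iterations needed here may differ from the counts reported above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit-trick estimate of 1/sqrt(x) for positive finite x, with no refinement
// step. The constant is the widely known one; whether the repository uses
// it is an assumption.
float fisr(float x) {
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);  // reinterpret the float bits safely
    i = 0x5f3759dfu - (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y;
}

// Approximate a / b for b > 0.
// Seed: (1/sqrt(b))^2 is an estimate of 1/b.
// Refinement: Newton's method on f(x) = 1/x - b gives x <- x * (2 - b * x),
// which roughly squares the relative error at every step.
float approx_div(float a, float b, int iters) {
    assert(b > 0.0f && iters >= 0);
    float r = fisr(b);
    float x = r * r;
    for (int k = 0; k < iters; ++k) {
        x = x * (2.0f - b * x);
    }
    return a * x;
}
```

For example, `approx_div(0.840188f, 0.394383f, 3)` can be compared against the plain `0.840188f / 0.394383f` to see how the error shrinks as `iters` grows.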

For speed, see the program https://github.com/konjac/division/blob/yesx/gputest-yesx/divisionflops.cu
time = 544.192322 ms

The program used for comparison is https://github.com/konjac/division/blob/yesx/gputest/divisionflops.cu
time = 748.940674 ms

ARCH = sm_21
Test environment
Device 0: "GeForce GT 630"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 2) Multiprocessors x ( 48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1620 MHz (1.62 GHz)
Memory Clock rate: 667 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GT 630

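Division accuracy test on MIC

Comparison of division accuracy:
(Respectively: CPU division, MIC library division, Teacher Chen's method with 5 iterations, and Ye Shuxiong's method)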

0.840188 / 0.394383 =
CPU Standard: 2.130386
MIC Standard: 2.130386
MIC Iterative: 2.130386
MIC FISR: 2.089152

0.783099 / 0.798440 =
CPU Standard: 0.980787
MIC Standard: 0.980787
MIC Iterative: 0.980787
MIC FISR: 0.980659

0.911647 / 0.197551 =
CPU Standard: 4.614736
MIC Standard: 4.614736
MIC Iterative: 4.614732
MIC FISR: 4.174708

0.335223 / 0.768230 =
CPU Standard: 0.436358
MIC Standard: 0.436358
MIC Iterative: 0.436358
MIC FISR: 0.436254

0.277775 / 0.553970 =
CPU Standard: 0.501426
MIC Standard: 0.501426
MIC Iterative: 0.501426
MIC FISR: 0.499256

0.477397 / 0.628871 =
CPU Standard: 0.759134
MIC Standard: 0.759134
MIC Iterative: 0.759134
MIC FISR: 0.757703

0.364784 / 0.513401 =
CPU Standard: 0.710526
MIC Standard: 0.710526
MIC Iterative: 0.710526
MIC FISR: 0.705917

0.952230 / 0.916195 =
CPU Standard: 1.039331
MIC Standard: 1.039331
MIC Iterative: 1.039331
MIC FISR: 1.039327
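
To put the differences listed above on a common scale, one option is to compute the relative error of each result against the CPU Standard value. The helper below is a minimal sketch; the name `relative_error` is hypothetical and is not part of the repository.

```cpp
#include <cassert>
#include <cmath>

// Relative error of an approximate quotient against a reference quotient.
double relative_error(double approx, double reference) {
    assert(reference != 0.0);
    return std::fabs(approx - reference) / std::fabs(reference);
}
```

Applied to the printed values, `relative_error(2.089152, 2.130386)` is about 0.0194 (roughly 1.9%) for MIC FISR on the first case, and `relative_error(4.174708, 4.614736)` is about 0.0954 (roughly 9.5%) on the third case, whereas MIC Iterative on the third case differs from the reference only in the last printed digit.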

Cache optimization on MIC

lei-april commented a day ago

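With 2^15 random accesses to a double array of length 1024,
using the prefetch instruction raises the L1 cache hit rate by roughly 1 percentage point (96.6% -> 97.9%).

I ran the test several more times; prefetch does not appear to bring any noticeable performance improvement.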
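
The experiment described above could be set up along the following lines. This is a sketch under assumptions: it uses the GCC/Clang builtin `__builtin_prefetch` rather than whatever prefetch intrinsic the original test used, and the helper names, the xorshift generator, and the prefetch distance of 8 are all hypothetical. Note that 1024 doubles occupy 8 KiB, so the array plausibly stays resident in L1 after warm-up, which would be consistent with the small effect reported; that is an interpretation, not a measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Small xorshift32 generator so the access pattern is reproducible.
static std::uint32_t xorshift32(std::uint32_t& s) {
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
}

// Build `count` pseudo-random indices in the range [0, table_size).
std::vector<std::size_t> make_indices(std::size_t count,
                                      std::size_t table_size,
                                      std::uint32_t seed) {
    assert(table_size > 0 && seed != 0);
    std::vector<std::size_t> idx(count);
    for (auto& i : idx) i = xorshift32(seed) % table_size;
    return idx;
}

// Sum table[idx[k]] over all k, optionally prefetching the element that
// will be needed `distance` accesses later.
double random_sum(const std::vector<double>& table,
                  const std::vector<std::size_t>& idx,
                  bool use_prefetch,
                  std::size_t distance = 8) {
    double sum = 0.0;
    for (std::size_t k = 0; k < idx.size(); ++k) {
        if (use_prefetch && k + distance < idx.size()) {
            // Arguments: address, 0 = read access, 3 = high temporal locality.
            __builtin_prefetch(&table[idx[k + distance]], 0, 3);
        }
        sum += table[idx[k]];
    }
    return sum;
}
```

Timing `random_sum` with `use_prefetch` set to `true` and `false` on a 1024-element table and `1u << 15` indices gives the two variants to compare; hit rates themselves would have to come from hardware performance counters, which this sketch does not read.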

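Division test on GPU

Posting the performance data presented at the group meeting the other day here, so it is easy to look up later if needed.

Implementing the table-lookup algorithm in CUDA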

~~https://github.com/konjac/division/raw/d7e842d0cd96557f44a359693c901e34a1ea8817/doc/algorithm-description.pdf~~

https://github.com/konjac/division/raw/master/doc/algorithm-description.pdf
