Comments (20)
Looking at the jevoisinc camera module, which has a very similar GPU, I can't understand how they manage to get this insane framerate ^^
from darknet.
The "yolo_layer.c" file contains the following code in lines 364-367:
if(!net.train || l.onlyforward){
    opencl_pull_array(l.output_gpu, l.output, l.batch*l.outputs);
    return;
}
You may comment out the pull:
if(!net.train || l.onlyforward){
    //opencl_pull_array(l.output_gpu, l.output, l.batch*l.outputs);
    return;
}
The pull is done anyway in network.c, so the functionality is unaffected.
from darknet.
Jevois uses darknet-NNPACK. I think it's https://github.com/digitalbrain79/darknet-nnpack.
I get about 1 fps with darknet/yolo, which I don't think is insane.
from darknet.
@r0l1 You may try setting BENCHMARK=1 in the Makefile, then rebuild and test. It produces a lot of stats, but they show what is slow, so you/we can fix it. You may attach the output as a txt file here for analysis. Note that if you build with RPI=1, I use a naive gemm_gpu version instead of the one from clBLAS. To change that, look in the code for the RPI pre-processor definition: disable it in "blas_kernels.c" (at the end of the file) and enable it in "gemm.c" (in the middle of the file there should be about 3 RPI "ifdef"s). Thanks!
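To make the RPI switch above more concrete, here is a minimal sketch of selecting a GEMM implementation at compile time with a pre-processor definition. The names gemm_naive, gemm_tuned_placeholder and gemm_dispatch are hypothetical and are not taken from the repo; the placeholder simply calls the naive loop, because the clBLAS path is not reproduced here.

```c
/* Naive GEMM, row-major, no transposes: C += ALPHA * A * B.
   A is M x K, B is K x N, C is M x N. */
static void gemm_naive(int M, int N, int K, float ALPHA,
                       const float *A, const float *B, float *C)
{
    for (int i = 0; i < M; i++) {
        for (int k = 0; k < K; k++) {
            float a = ALPHA * A[i * K + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
    }
}

/* Hypothetical stand-in for an optimized library call. It only forwards
   to the naive loop so that this sketch stays self-contained. */
static void gemm_tuned_placeholder(int M, int N, int K, float ALPHA,
                                   const float *A, const float *B, float *C)
{
    gemm_naive(M, N, K, ALPHA, A, B, C);
}

/* Compile with -DRPI to pick the naive path, otherwise the placeholder. */
static void gemm_dispatch(int M, int N, int K, float ALPHA,
                          const float *A, const float *B, float *C)
{
#ifdef RPI
    gemm_naive(M, N, K, ALPHA, A, B, C);
#else
    gemm_tuned_placeholder(M, N, K, ALPHA, A, B, C);
#endif
}
```

Both paths must produce the same result, so a quick check against a small known product is a cheap way to confirm that flipping the definition did not change the output.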
from darknet.
Thanks for the fast response. I didn't build with RPI enabled. I patched the Makefile and removed the -mfpmath=sse flag.
Makefile
GPU=1
GPU_FAST=1
GPU_MULTI=0
OPENCV=0
OPENMP=0
RPI=0
BENCHMARK=1
DEBUG=0
Benchmark Results
First Iteration
opencl_push_array a35f4008 1363
fill_kernel 5044
fill_kernel 4786
im2col_gpu_kernel 11953
copy_kernel 5059
normalize_kernel 247
scale_bias_kernel 272
add_bias_kernel 212
activate_array_kernel 145
FW CONVOLUTIONAL 592493
fill_kernel 1279
forward_maxpool_layer_kernel 8148
FW MAXPOOL 8224
fill_kernel 8181
fill_kernel 8313
im2col_gpu_kernel 160
copy_kernel 8565
normalize_kernel 172
scale_bias_kernel 116
add_bias_kernel 99
activate_array_kernel 143
FW CONVOLUTIONAL 5079465
fill_kernel 2352
forward_maxpool_layer_kernel 4620
FW MAXPOOL 4696
fill_kernel 4579
fill_kernel 4485
im2col_gpu_kernel 165
copy_kernel 4444
normalize_kernel 111
scale_bias_kernel 99
add_bias_kernel 103
activate_array_kernel 138
FW CONVOLUTIONAL 470649
fill_kernel 1213
forward_maxpool_layer_kernel 2392
FW MAXPOOL 2463
fill_kernel 2311
fill_kernel 2265
im2col_gpu_kernel 140
copy_kernel 2210
normalize_kernel 113
scale_bias_kernel 97
add_bias_kernel 89
activate_array_kernel 148
FW CONVOLUTIONAL 158572
fill_kernel 653
forward_maxpool_layer_kernel 1299
FW MAXPOOL 1367
fill_kernel 1248
fill_kernel 1177
im2col_gpu_kernel 158
copy_kernel 530
normalize_kernel 96
scale_bias_kernel 97
add_bias_kernel 125
activate_array_kernel 106
FW CONVOLUTIONAL 5000
fill_kernel 305
forward_maxpool_layer_kernel 542
FW MAXPOOL 605
fill_kernel 337
fill_kernel 321
im2col_gpu_kernel 134
copy_kernel 317
normalize_kernel 102
scale_bias_kernel 147
add_bias_kernel 89
activate_array_kernel 129
FW CONVOLUTIONAL 5614
fill_kernel 322
forward_maxpool_layer_kernel 633
FW MAXPOOL 697
fill_kernel 527
fill_kernel 559
im2col_gpu_kernel 99
copy_kernel 559
normalize_kernel 136
scale_bias_kernel 87
add_bias_kernel 128
activate_array_kernel 96
FW CONVOLUTIONAL 14786
fill_kernel 302
fill_kernel 300
copy_kernel 266
normalize_kernel 131
scale_bias_kernel 92
add_bias_kernel 89
activate_array_kernel 128
FW CONVOLUTIONAL 3114
fill_kernel 335
fill_kernel 372
im2col_gpu_kernel 104
copy_kernel 351
normalize_kernel 124
scale_bias_kernel 85
add_bias_kernel 110
activate_array_kernel 99
FW CONVOLUTIONAL 5635
fill_kernel 123
fill_kernel 76
add_bias_kernel 71
activate_array_kernel 126
FW CONVOLUTIONAL 1611
fill_kernel 77
copy_kernel 83
activate_array_kernel 122
activate_array_kernel 125
activate_array_kernel 127
activate_array_kernel 121
activate_array_kernel 129
activate_array_kernel 123
opencl_pull_array 14037e8 283816
FW YOLO 284938
fill_kernel 381
copy_kernel 315
FW ROUTE 375
fill_kernel 77
fill_kernel 80
copy_kernel 79
normalize_kernel 102
scale_bias_kernel 77
add_bias_kernel 95
activate_array_kernel 93
FW CONVOLUTIONAL 898
fill_kernel 383
fill_kernel 325
upsample_kernel 83
FW UPSAMPLE 511
fill_kernel 857
copy_kernel 851
copy_kernel 77
FW ROUTE 1056
fill_kernel 630
fill_kernel 574
im2col_gpu_kernel 97
copy_kernel 628
normalize_kernel 89
scale_bias_kernel 87
add_bias_kernel 79
activate_array_kernel 93
FW CONVOLUTIONAL 4327
fill_kernel 91
fill_kernel 70
add_bias_kernel 75
activate_array_kernel 86
FW CONVOLUTIONAL 444
fill_kernel 73
copy_kernel 81
activate_array_kernel 90
activate_array_kernel 88
activate_array_kernel 86
activate_array_kernel 83
activate_array_kernel 86
activate_array_kernel 84
opencl_pull_array 152b2b0 22436
FW YOLO 23370
opencl_pull_array 152b2b0 291
../db-dnn2/img.jpg: Predicted in 13.945973 seconds.
Object: 100%
Second Iteration
opencl_push_array a35f4008 2445
fill_kernel 114
fill_kernel 90
im2col_gpu_kernel 102
copy_kernel 71
normalize_kernel 88
scale_bias_kernel 89
add_bias_kernel 81
activate_array_kernel 93
FW CONVOLUTIONAL 1123
fill_kernel 72
forward_maxpool_layer_kernel 111
FW MAXPOOL 172
fill_kernel 72
fill_kernel 81
im2col_gpu_kernel 90
copy_kernel 66
normalize_kernel 83
scale_bias_kernel 83
add_bias_kernel 78
activate_array_kernel 98
FW CONVOLUTIONAL 1041
fill_kernel 72
forward_maxpool_layer_kernel 109
FW MAXPOOL 169
fill_kernel 71
fill_kernel 93
im2col_gpu_kernel 100
copy_kernel 68
normalize_kernel 82
scale_bias_kernel 76
add_bias_kernel 85
activate_array_kernel 89
FW CONVOLUTIONAL 1006
fill_kernel 72
forward_maxpool_layer_kernel 100
FW MAXPOOL 158
fill_kernel 65
fill_kernel 62
im2col_gpu_kernel 90
copy_kernel 77
normalize_kernel 85
scale_bias_kernel 75
add_bias_kernel 77
activate_array_kernel 89
FW CONVOLUTIONAL 923
fill_kernel 65
forward_maxpool_layer_kernel 92
FW MAXPOOL 154
fill_kernel 64
fill_kernel 63
im2col_gpu_kernel 95
copy_kernel 70
normalize_kernel 82
scale_bias_kernel 74
add_bias_kernel 75
activate_array_kernel 91
FW CONVOLUTIONAL 909
fill_kernel 65
forward_maxpool_layer_kernel 95
FW MAXPOOL 158
fill_kernel 64
fill_kernel 64
im2col_gpu_kernel 86
copy_kernel 67
normalize_kernel 82
scale_bias_kernel 90
add_bias_kernel 82
activate_array_kernel 93
FW CONVOLUTIONAL 932
fill_kernel 65
forward_maxpool_layer_kernel 94
FW MAXPOOL 154
fill_kernel 63
fill_kernel 63
im2col_gpu_kernel 91
copy_kernel 76
normalize_kernel 83
scale_bias_kernel 76
add_bias_kernel 74
activate_array_kernel 91
FW CONVOLUTIONAL 918
fill_kernel 66
fill_kernel 63
copy_kernel 77
normalize_kernel 82
scale_bias_kernel 76
add_bias_kernel 77
activate_array_kernel 88
FW CONVOLUTIONAL 785
fill_kernel 63
fill_kernel 61
im2col_gpu_kernel 93
copy_kernel 73
normalize_kernel 80
scale_bias_kernel 70
add_bias_kernel 71
activate_array_kernel 86
FW CONVOLUTIONAL 889
fill_kernel 77
fill_kernel 61
add_bias_kernel 64
activate_array_kernel 91
FW CONVOLUTIONAL 429
fill_kernel 75
copy_kernel 82
activate_array_kernel 91
activate_array_kernel 93
activate_array_kernel 91
activate_array_kernel 97
activate_array_kernel 91
activate_array_kernel 91
opencl_pull_array 14037e8 24525
FW YOLO 25489
fill_kernel 105
copy_kernel 104
FW ROUTE 184
fill_kernel 77
fill_kernel 70
copy_kernel 80
normalize_kernel 79
scale_bias_kernel 88
add_bias_kernel 79
activate_array_kernel 99
FW CONVOLUTIONAL 847
fill_kernel 70
fill_kernel 80
upsample_kernel 84
FW UPSAMPLE 279
fill_kernel 72
copy_kernel 86
copy_kernel 77
FW ROUTE 271
fill_kernel 71
fill_kernel 84
im2col_gpu_kernel 97
copy_kernel 74
normalize_kernel 77
scale_bias_kernel 79
add_bias_kernel 73
activate_array_kernel 98
FW CONVOLUTIONAL 954
fill_kernel 73
fill_kernel 76
add_bias_kernel 63
activate_array_kernel 85
FW CONVOLUTIONAL 441
fill_kernel 70
copy_kernel 86
activate_array_kernel 84
activate_array_kernel 98
activate_array_kernel 83
activate_array_kernel 84
activate_array_kernel 100
activate_array_kernel 89
opencl_pull_array 152b2b0 7621
FW YOLO 8599
opencl_pull_array 152b2b0 311
../db-dnn2/img.jpg: Predicted in 7.252359 seconds.
Object: 100%
from darknet.
I just enabled RPI in the Makefile and it is slightly faster:
opencl_push_array a4ae8008 2446
fill_kernel 120
fill_kernel 77
im2col_gpu_kernel 105
gemm_kernel 93
copy_kernel 88
normalize_kernel 85
scale_bias_kernel 87
add_bias_kernel 74
activate_array_kernel 103
FW CONVOLUTIONAL 1108
fill_kernel 77
forward_maxpool_layer_kernel 100
FW MAXPOOL 173
fill_kernel 72
fill_kernel 104
im2col_gpu_kernel 94
gemm_kernel 87
copy_kernel 85
normalize_kernel 90
scale_bias_kernel 78
add_bias_kernel 84
activate_array_kernel 102
FW CONVOLUTIONAL 1099
fill_kernel 76
forward_maxpool_layer_kernel 95
FW MAXPOOL 151
fill_kernel 78
fill_kernel 74
im2col_gpu_kernel 93
gemm_kernel 93
copy_kernel 83
normalize_kernel 84
scale_bias_kernel 69
add_bias_kernel 82
activate_array_kernel 105
FW CONVOLUTIONAL 1027
fill_kernel 90
forward_maxpool_layer_kernel 104
FW MAXPOOL 171
fill_kernel 73
fill_kernel 72
im2col_gpu_kernel 92
gemm_kernel 92
copy_kernel 86
normalize_kernel 96
scale_bias_kernel 76
add_bias_kernel 87
activate_array_kernel 90
FW CONVOLUTIONAL 1042
fill_kernel 74
forward_maxpool_layer_kernel 107
FW MAXPOOL 168
fill_kernel 74
fill_kernel 76
im2col_gpu_kernel 87
gemm_kernel 92
copy_kernel 87
normalize_kernel 83
scale_bias_kernel 76
add_bias_kernel 75
activate_array_kernel 100
FW CONVOLUTIONAL 1036
fill_kernel 70
forward_maxpool_layer_kernel 107
FW MAXPOOL 167
fill_kernel 77
fill_kernel 72
im2col_gpu_kernel 92
gemm_kernel 103
copy_kernel 92
normalize_kernel 95
scale_bias_kernel 75
add_bias_kernel 79
activate_array_kernel 100
FW CONVOLUTIONAL 1074
fill_kernel 66
forward_maxpool_layer_kernel 92
FW MAXPOOL 156
fill_kernel 62
fill_kernel 62
im2col_gpu_kernel 89
gemm_kernel 80
copy_kernel 83
normalize_kernel 88
scale_bias_kernel 75
add_bias_kernel 77
activate_array_kernel 91
FW CONVOLUTIONAL 1002
fill_kernel 65
fill_kernel 61
gemm_kernel 85
copy_kernel 80
normalize_kernel 81
scale_bias_kernel 77
add_bias_kernel 86
activate_array_kernel 89
FW CONVOLUTIONAL 857
fill_kernel 64
fill_kernel 62
im2col_gpu_kernel 91
gemm_kernel 86
copy_kernel 82
normalize_kernel 83
scale_bias_kernel 75
add_bias_kernel 75
activate_array_kernel 89
FW CONVOLUTIONAL 973
fill_kernel 63
fill_kernel 61
gemm_kernel 100
add_bias_kernel 77
activate_array_kernel 88
FW CONVOLUTIONAL 512
fill_kernel 66
copy_kernel 69
activate_array_kernel 93
activate_array_kernel 96
activate_array_kernel 94
activate_array_kernel 92
activate_array_kernel 93
activate_array_kernel 89
opencl_pull_array 13d05e8 23169
FW YOLO 24129
fill_kernel 93
copy_kernel 80
FW ROUTE 165
fill_kernel 85
fill_kernel 73
gemm_kernel 86
copy_kernel 80
normalize_kernel 83
scale_bias_kernel 79
add_bias_kernel 69
activate_array_kernel 90
FW CONVOLUTIONAL 860
fill_kernel 64
fill_kernel 62
upsample_kernel 77
FW UPSAMPLE 250
fill_kernel 64
copy_kernel 70
copy_kernel 83
FW ROUTE 258
fill_kernel 64
fill_kernel 65
im2col_gpu_kernel 93
gemm_kernel 88
copy_kernel 82
normalize_kernel 87
scale_bias_kernel 93
add_bias_kernel 96
activate_array_kernel 86
FW CONVOLUTIONAL 1070
fill_kernel 69
fill_kernel 68
gemm_kernel 91
add_bias_kernel 73
activate_array_kernel 90
FW CONVOLUTIONAL 506
fill_kernel 68
copy_kernel 81
activate_array_kernel 91
activate_array_kernel 89
activate_array_kernel 92
activate_array_kernel 88
activate_array_kernel 85
activate_array_kernel 85
opencl_pull_array 14f8098 7321
FW YOLO 8267
opencl_pull_array 14f8098 271
../db-dnn2/img.jpg: Predicted in 6.395759 seconds.
Object: 100%
from darknet.
Hmm, there is not a big difference. If I understand it right, the bottleneck is the link between CPU and GPU?
opencl_push_array a353d008 2429
fill_kernel 108
fill_kernel 75
im2col_gpu_kernel 100
copy_kernel 70
normalize_kernel 88
scale_bias_kernel 87
add_bias_kernel 86
activate_array_kernel 98
FW CONVOLUTIONAL 1100
fill_kernel 97
forward_maxpool_layer_kernel 117
FW MAXPOOL 179
fill_kernel 75
fill_kernel 89
im2col_gpu_kernel 93
copy_kernel 65
normalize_kernel 80
scale_bias_kernel 82
add_bias_kernel 87
activate_array_kernel 98
FW CONVOLUTIONAL 1047
fill_kernel 69
forward_maxpool_layer_kernel 109
FW MAXPOOL 170
fill_kernel 74
fill_kernel 87
im2col_gpu_kernel 102
copy_kernel 68
normalize_kernel 82
scale_bias_kernel 77
add_bias_kernel 88
activate_array_kernel 91
FW CONVOLUTIONAL 1020
fill_kernel 64
forward_maxpool_layer_kernel 92
FW MAXPOOL 154
fill_kernel 64
fill_kernel 65
im2col_gpu_kernel 92
copy_kernel 75
normalize_kernel 92
scale_bias_kernel 75
add_bias_kernel 78
activate_array_kernel 94
FW CONVOLUTIONAL 935
fill_kernel 63
forward_maxpool_layer_kernel 91
FW MAXPOOL 154
fill_kernel 65
fill_kernel 65
im2col_gpu_kernel 92
copy_kernel 74
normalize_kernel 80
scale_bias_kernel 75
add_bias_kernel 83
activate_array_kernel 90
FW CONVOLUTIONAL 923
fill_kernel 67
forward_maxpool_layer_kernel 92
FW MAXPOOL 149
fill_kernel 64
fill_kernel 65
im2col_gpu_kernel 90
copy_kernel 72
normalize_kernel 83
scale_bias_kernel 90
add_bias_kernel 80
activate_array_kernel 93
FW CONVOLUTIONAL 928
fill_kernel 66
forward_maxpool_layer_kernel 92
FW MAXPOOL 153
fill_kernel 64
fill_kernel 63
im2col_gpu_kernel 92
copy_kernel 78
normalize_kernel 83
scale_bias_kernel 77
add_bias_kernel 76
activate_array_kernel 90
FW CONVOLUTIONAL 917
fill_kernel 64
fill_kernel 59
copy_kernel 75
normalize_kernel 81
scale_bias_kernel 76
add_bias_kernel 75
activate_array_kernel 91
FW CONVOLUTIONAL 787
fill_kernel 66
fill_kernel 63
im2col_gpu_kernel 90
copy_kernel 73
normalize_kernel 83
scale_bias_kernel 75
add_bias_kernel 79
activate_array_kernel 87
FW CONVOLUTIONAL 910
fill_kernel 78
fill_kernel 61
add_bias_kernel 65
activate_array_kernel 91
FW CONVOLUTIONAL 424
fill_kernel 64
copy_kernel 66
activate_array_kernel 90
activate_array_kernel 90
activate_array_kernel 92
activate_array_kernel 90
activate_array_kernel 106
activate_array_kernel 92
FW YOLO 911
fill_kernel 63
copy_kernel 74
FW ROUTE 142
fill_kernel 65
fill_kernel 64
copy_kernel 78
normalize_kernel 82
scale_bias_kernel 78
add_bias_kernel 75
activate_array_kernel 88
FW CONVOLUTIONAL 769
fill_kernel 65
fill_kernel 64
upsample_kernel 77
FW UPSAMPLE 253
fill_kernel 63
copy_kernel 71
copy_kernel 83
FW ROUTE 254
fill_kernel 63
fill_kernel 69
im2col_gpu_kernel 89
copy_kernel 75
normalize_kernel 90
scale_bias_kernel 77
add_bias_kernel 76
activate_array_kernel 87
FW CONVOLUTIONAL 939
fill_kernel 65
fill_kernel 63
add_bias_kernel 65
activate_array_kernel 91
FW CONVOLUTIONAL 424
fill_kernel 65
copy_kernel 82
activate_array_kernel 94
activate_array_kernel 88
activate_array_kernel 91
activate_array_kernel 89
activate_array_kernel 90
activate_array_kernel 88
FW YOLO 917
opencl_pull_array 152cf00 32873
../db-dnn2/img.jpg: Predicted in 7.266204 seconds.
Object: 100%
from darknet.
@r0l1 The last value in each line is the time. The issue is the copy between VRAM and RAM. Take a look at the last line: on a good, fast PC it can be 0, and you have 32873 for the memory pull from VRAM to RAM. :)
opencl_pull_array 152cf00 32873
Thanks!
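To compare runs by the last value in each line, a small helper can sum that value for a given entry name. This is only a sketch: last_field and sum_for are hypothetical names, and it assumes the benchmark lines look like the ones posted in this thread (name, optional address, then the time).

```c
#include <stdlib.h>
#include <string.h>

/* Return the last whitespace-separated field of a line as a number,
   or -1 if the line has a single field or the field is not an integer. */
static long last_field(const char *line)
{
    const char *end = line + strlen(line);
    while (end > line && (end[-1] == ' ' || end[-1] == '\t' || end[-1] == '\n'))
        end--;
    const char *start = end;
    while (start > line && start[-1] != ' ' && start[-1] != '\t')
        start--;
    if (start == line || start == end)
        return -1;
    char *stop;
    long value = strtol(start, &stop, 10);
    return (stop == end) ? value : -1;
}

/* Sum the last field over all lines that begin with `name` followed by a space. */
static long sum_for(const char *const *lines, size_t n, const char *name)
{
    size_t len = strlen(name);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (strncmp(lines[i], name, len) == 0 && lines[i][len] == ' ') {
            long value = last_field(lines[i]);
            if (value >= 0)
                total += value;
        }
    }
    return total;
}
```

Applied to the opencl_pull_array lines posted above, the second iteration with the pull still in the YOLO layer adds up to 24525 + 7621 + 311 = 32457, while the run with the pull commented out shows a single pull of 32873. The totals are nearly the same, which suggests the cost moved to the final pull rather than disappearing.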
from darknet.
@r0l1 And how about RPI=1 ?
from darknet.
@sowson Understood. 32873 clock ticks are quite a lot... I hoped that I could reach at least 1fps for video stream analysis... The folks at the ODroid forum reached 3fps with another approach.
RPI=1
opencl_push_array a4a3d008 2453
fill_kernel 121
fill_kernel 77
im2col_gpu_kernel 101
gemm_kernel 103
copy_kernel 94
normalize_kernel 88
scale_bias_kernel 81
add_bias_kernel 81
activate_array_kernel 105
FW CONVOLUTIONAL 1109
fill_kernel 79
forward_maxpool_layer_kernel 116
FW MAXPOOL 176
fill_kernel 79
fill_kernel 71
im2col_gpu_kernel 107
gemm_kernel 76
copy_kernel 68
normalize_kernel 80
scale_bias_kernel 70
add_bias_kernel 69
activate_array_kernel 84
FW CONVOLUTIONAL 961
fill_kernel 64
forward_maxpool_layer_kernel 115
FW MAXPOOL 165
fill_kernel 68
fill_kernel 63
im2col_gpu_kernel 83
gemm_kernel 86
copy_kernel 70
normalize_kernel 71
scale_bias_kernel 67
add_bias_kernel 68
activate_array_kernel 84
FW CONVOLUTIONAL 861
fill_kernel 78
forward_maxpool_layer_kernel 92
FW MAXPOOL 143
fill_kernel 64
fill_kernel 63
im2col_gpu_kernel 80
gemm_kernel 84
copy_kernel 69
normalize_kernel 73
scale_bias_kernel 66
add_bias_kernel 72
activate_array_kernel 82
FW CONVOLUTIONAL 855
fill_kernel 67
forward_maxpool_layer_kernel 89
FW MAXPOOL 139
fill_kernel 66
fill_kernel 67
im2col_gpu_kernel 84
gemm_kernel 78
copy_kernel 72
normalize_kernel 72
scale_bias_kernel 71
add_bias_kernel 69
activate_array_kernel 84
FW CONVOLUTIONAL 870
fill_kernel 64
forward_maxpool_layer_kernel 90
FW MAXPOOL 139
fill_kernel 64
fill_kernel 63
im2col_gpu_kernel 79
gemm_kernel 77
copy_kernel 71
normalize_kernel 71
scale_bias_kernel 66
add_bias_kernel 69
activate_array_kernel 93
FW CONVOLUTIONAL 861
fill_kernel 69
forward_maxpool_layer_kernel 90
FW MAXPOOL 138
fill_kernel 65
fill_kernel 62
im2col_gpu_kernel 81
gemm_kernel 77
copy_kernel 82
normalize_kernel 72
scale_bias_kernel 68
add_bias_kernel 68
activate_array_kernel 81
FW CONVOLUTIONAL 857
fill_kernel 65
fill_kernel 61
gemm_kernel 78
copy_kernel 77
normalize_kernel 70
scale_bias_kernel 78
add_bias_kernel 80
activate_array_kernel 92
FW CONVOLUTIONAL 788
fill_kernel 73
fill_kernel 70
im2col_gpu_kernel 88
gemm_kernel 89
copy_kernel 80
normalize_kernel 80
scale_bias_kernel 89
add_bias_kernel 75
activate_array_kernel 87
FW CONVOLUTIONAL 928
fill_kernel 75
fill_kernel 60
gemm_kernel 92
add_bias_kernel 68
activate_array_kernel 81
FW CONVOLUTIONAL 442
fill_kernel 69
copy_kernel 66
activate_array_kernel 83
activate_array_kernel 81
activate_array_kernel 84
activate_array_kernel 82
activate_array_kernel 83
activate_array_kernel 82
FW YOLO 786
fill_kernel 65
copy_kernel 71
FW ROUTE 119
fill_kernel 64
fill_kernel 63
gemm_kernel 78
copy_kernel 71
normalize_kernel 73
scale_bias_kernel 67
add_bias_kernel 67
activate_array_kernel 87
FW CONVOLUTIONAL 748
fill_kernel 65
fill_kernel 62
upsample_kernel 71
FW UPSAMPLE 219
fill_kernel 66
copy_kernel 69
copy_kernel 71
FW ROUTE 217
fill_kernel 63
fill_kernel 61
im2col_gpu_kernel 83
gemm_kernel 77
copy_kernel 73
normalize_kernel 81
scale_bias_kernel 74
add_bias_kernel 68
activate_array_kernel 79
FW CONVOLUTIONAL 875
fill_kernel 65
fill_kernel 64
gemm_kernel 83
add_bias_kernel 67
activate_array_kernel 80
FW CONVOLUTIONAL 433
fill_kernel 66
copy_kernel 69
activate_array_kernel 80
activate_array_kernel 81
activate_array_kernel 79
activate_array_kernel 84
activate_array_kernel 77
activate_array_kernel 80
FW YOLO 774
opencl_pull_array 14f98e0 27580
../db-dnn2/img.jpg: Predicted in 6.348443 seconds.
Object: 100%
from darknet.
@r0l1 I am now studying https://rocm-documentation.readthedocs.io/en/latest/Programming_Guides/Opencl-optimization.html and the "1.3.4.1 Zero Copy Memory Objects" part looks interesting. I will see what I can do with it.
from darknet.
Sounds great! I'll have a detailed look at the document this weekend. Maybe I can help...
from darknet.
Maybe a good starting point for you..?
MemMap.patch.txt
;-).
Thx!
from darknet.
That was fast! Thanks! Sadly this didn't make any difference. It was worth a try.
I'll test one more thing and if this doesn't succeed, then this device might not be suitable for yolo...
opencl_push_array a353d008 28
could not push array to device. error: CL_INVALID_OPERATION
fill_kernel 173
fill_kernel 85
im2col_gpu_kernel 111
copy_kernel 72
normalize_kernel 82
scale_bias_kernel 87
add_bias_kernel 76
activate_array_kernel 97
FW CONVOLUTIONAL 1103
fill_kernel 75
forward_maxpool_layer_kernel 105
FW MAXPOOL 174
fill_kernel 76
fill_kernel 76
im2col_gpu_kernel 90
copy_kernel 66
normalize_kernel 79
scale_bias_kernel 82
add_bias_kernel 73
activate_array_kernel 91
FW CONVOLUTIONAL 999
fill_kernel 65
forward_maxpool_layer_kernel 96
FW MAXPOOL 159
fill_kernel 64
fill_kernel 81
im2col_gpu_kernel 88
copy_kernel 77
normalize_kernel 83
scale_bias_kernel 79
add_bias_kernel 88
activate_array_kernel 96
FW CONVOLUTIONAL 966
fill_kernel 64
forward_maxpool_layer_kernel 91
FW MAXPOOL 156
fill_kernel 64
fill_kernel 62
im2col_gpu_kernel 104
copy_kernel 83
normalize_kernel 90
scale_bias_kernel 77
add_bias_kernel 76
activate_array_kernel 92
FW CONVOLUTIONAL 979
fill_kernel 65
forward_maxpool_layer_kernel 90
FW MAXPOOL 151
fill_kernel 64
fill_kernel 62
im2col_gpu_kernel 106
copy_kernel 69
normalize_kernel 82
scale_bias_kernel 74
add_bias_kernel 77
activate_array_kernel 93
FW CONVOLUTIONAL 942
fill_kernel 65
forward_maxpool_layer_kernel 94
FW MAXPOOL 157
fill_kernel 65
fill_kernel 62
im2col_gpu_kernel 91
copy_kernel 70
normalize_kernel 83
scale_bias_kernel 87
add_bias_kernel 81
activate_array_kernel 89
FW CONVOLUTIONAL 928
fill_kernel 65
forward_maxpool_layer_kernel 99
FW MAXPOOL 161
fill_kernel 64
fill_kernel 63
im2col_gpu_kernel 90
copy_kernel 73
normalize_kernel 85
scale_bias_kernel 75
add_bias_kernel 75
activate_array_kernel 89
FW CONVOLUTIONAL 915
fill_kernel 64
fill_kernel 64
copy_kernel 70
normalize_kernel 86
scale_bias_kernel 68
add_bias_kernel 68
activate_array_kernel 82
FW CONVOLUTIONAL 741
fill_kernel 64
fill_kernel 64
im2col_gpu_kernel 82
copy_kernel 66
normalize_kernel 70
scale_bias_kernel 68
add_bias_kernel 67
activate_array_kernel 81
FW CONVOLUTIONAL 813
fill_kernel 78
fill_kernel 65
add_bias_kernel 59
activate_array_kernel 83
FW CONVOLUTIONAL 402
fill_kernel 64
copy_kernel 67
activate_array_kernel 81
activate_array_kernel 84
activate_array_kernel 83
activate_array_kernel 82
activate_array_kernel 85
activate_array_kernel 82
FW YOLO 790
fill_kernel 64
copy_kernel 72
FW ROUTE 122
fill_kernel 64
fill_kernel 62
copy_kernel 65
normalize_kernel 73
scale_bias_kernel 72
add_bias_kernel 67
activate_array_kernel 80
FW CONVOLUTIONAL 695
fill_kernel 65
fill_kernel 66
upsample_kernel 78
FW UPSAMPLE 250
fill_kernel 64
copy_kernel 72
copy_kernel 81
FW ROUTE 257
fill_kernel 62
fill_kernel 63
im2col_gpu_kernel 91
copy_kernel 77
normalize_kernel 83
scale_bias_kernel 74
add_bias_kernel 82
activate_array_kernel 96
FW CONVOLUTIONAL 947
fill_kernel 65
fill_kernel 62
add_bias_kernel 63
activate_array_kernel 87
FW CONVOLUTIONAL 424
fill_kernel 76
copy_kernel 80
activate_array_kernel 95
activate_array_kernel 87
activate_array_kernel 90
activate_array_kernel 90
activate_array_kernel 89
activate_array_kernel 91
FW YOLO 897
opencl_pull_array 152aee0 29886
../db-dnn2/img.jpg: Predicted in 7.195652 seconds.
Object: 100%
from darknet.
That is the best I can do. I am testing that it works now before committing to the repo. It is not as fast as I expected.
MemMapTest.patch.txt
Thx!
from darknet.
After careful testing, I pushed the last commit, based on the patch, to the repo. It turns out that training stability is now much better. I think this is all I can do on this issue, at least for now.
from darknet.
@PeterQuinn925 They made all sorts of statements on their website:
Visual Attention detects interesting things in the world, running at 73 frames/s on the JeVois smart camera's quad-core processor.
TensorFlow: recognize 1000 different types of objects at up to 83 frames/second using deep learning
Darknet and Darknet YOLO: detect and recognize up to 1000 different types of objects using deep neural networks
I expected that YOLO would run with similarly good performance. However, after digging into the source code, I found the following statement (Source):
The YOLO network is currently quite slow, hence it is only run once in a while.
@sowson Thank you for all your efforts! Right now I don't have access to my testing device (ODroid), but next week I'll give you feedback on the performance with the latest source.
Why did you set the compiler optimization flag to -O0? This could be removed now, if you agree.
Edit: I'll close this issue after the benchmark.
from darknet.
Hi everyone,
How about implementing an OpenCL + NNPACK variant for this? I think it would, hopefully, increase fps at least on OpenCL-supported SoCs.
from darknet.
@sowson I just tested the new source on the ODroid. Sadly, there are no improvements. Thanks for your support! Please feel free to close this issue.
from darknet.
r0l1 "Looking at the jevoisinc camera module, which has a very similar GPU, I can't understand, how they manage to get this insane framerate"
After looking at the video (around 1:02):
https://youtu.be/aJp-mIBytno
... the jevoisinc mentions that it uses built-in NPU (Tensor Core) hardware. This is proper AI (matrix multiplier) hardware that boosts AI computation performance, similar to how Nvidia RTX GPUs have built-in Tensor Cores. This means the jevoisinc is not relying heavily on the Mali GPU for AI computation. Running AI on regular GPU cores is not ideal, but it does help offload the CPU. If your SBC does not have an NPU (tensor cores), do not expect great outcomes. I hope this satisfies your curiosity.
That was a good find about jevoisinc; they are using Allwinner SoCs. I highly recommend MediaTek SoCs, which also have built-in Tensor Cores that they call APUs. I believe there are some MediaTek dev boards out there.
from darknet.