
CUDPP documentation {#mainpage}

Introduction

CUDPP is the CUDA Data Parallel Primitives Library: a library of data-parallel algorithm primitives such as parallel prefix sum ("scan"), parallel sort, and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.

Overview Presentation

A brief set of slides that describe the features, design principles, applications and impact of CUDPP is available: CUDPP Presentation.

Home Page

Homepage for CUDPP: http://cudpp.github.io/

Announcements and discussion of CUDPP are hosted on the CUDPP Google Group.

Getting Started with CUDPP

You may want to start by browsing the [CUDPP Public Interface](@ref publicInterface). For information on building CUDPP, see [Building CUDPP](@ref building-cudpp). See [Overview of CUDPP hash tables](@ref hash_overview) for an overview of CUDPP's hash table support.

The "apps" subdirectory included with CUDPP has a few source code samples that use CUDPP:

  • [simpleCUDPP](@ref example_simpleCUDPP), a simple example of using cudppScan()
  • satGL, an example of using cudppMultiScan() to generate a summed-area table (SAT) of a scene rendered in real time. The SAT is then used to simulate depth of field blur. This example is not currently working due to CUDA graphics interop changes; volunteers to update it are welcome!
  • cudpp_testrig, a comprehensive test application for all the functionality of CUDPP
  • cudpp_hash_testrig, a comprehensive test application for CUDPP's hash table data structures

We have also provided a code walkthrough of the [simpleCUDPP](@ref example_simpleCUDPP) example.

Getting Help and Reporting Problems

To get help using CUDPP, please use the CUDPP Google Group.

To report CUDPP bugs or request features, please file an issue directly on GitHub.

Release Notes {#release-notes}

For specific release details see the [Change Log](@ref changelog).

Known Issues

For a complete list of issues, see the CUDPP issues list on GitHub.

  • Compile times for CUDPP are long and the compiled library file is large. On some systems with < 4GB of available memory (or virtual memory: e.g. a 32-bit OS), the CUDA compiler can run out of memory and compilation can fail. We will be working on these issues in future releases. You can reduce compile time by targeting only the GPU architectures that you plan to run on, using the CUDPP_GENCODE_* CMake options.
  • We have seen "invalid configuration" errors when running SM-2.0-compiled suffix array tests on GPUs with SM versions greater than 2.0. We see no problems with compiling directly for the GPU's native SM version, so the workaround is to compile directly for the SM version of your GPU. If you have results or comments on this issue, please comment on CUDPP issue 148.

Algorithm Input Size Limitations

The following maximum size limitations currently apply. In some cases these are theoretical limits; the algorithms may not have been tested at the maximum size. Also, for operations such as 32-bit integer scans, arithmetic precision often limits the useful maximum size well before the element-count limit is reached.

| Algorithm | Maximum Supported Size |
| --- | --- |
| CUDPP_SCAN | 67,107,840 elements |
| CUDPP_SEGMENTED_SCAN | 67,107,840 elements |
| CUDPP_COMPACT | 67,107,840 elements |
| CUDPP_COMPRESS | 1,048,576 elements |
| CUDPP_LISTRANK | No limit |
| CUDPP_MTF | Bounded by GPU memory |
| CUDPP_BWT | 1,048,576 elements |
| CUDPP_SA | 0.14 GPU memory |
| CUDPP_STRINGSORT | 2,147,450,880 elements |
| CUDPP_MERGESORT | 2,147,450,880 elements |
| CUDPP_MULTISPLIT | Bounded by GPU memory |
| CUDPP_REDUCE | No limit |
| CUDPP_RAND | 33,554,432 elements |
| CUDPP_SPMVMULT | 67,107,840 non-zero elements |
| CUDPP_HASH | See [Hash Space Limitations](@ref hash_space_limitations) |
| CUDPP_TRIDIAGONAL | 65,535 systems; 1,024 equations per system (compute capability 2.x); 512 equations per system (compute capability < 2.0) |

Operating System Support and Requirements

This release (2.3) has been tested on the following OSes. For more information, visit our test results page.

  • Windows 7 (64-bit) (CUDA 6.5)
  • Ubuntu Linux (64-bit) (CUDA 6.5)
  • Mac OS X 10.12.1 (64-bit) (CUDA 8.0)

We expect CUDPP to build and run correctly on other flavors of Linux and Windows, but only the above are actively tested at this time. Version 2.3 does not currently support 32-bit operating systems.

Requirements

From release 2.3 onwards, CUDPP requires a minimum of SM 3.0. CUDPP 2.3 has not been tested with any CUDA version older than 6.5.

CUDA

CUDPP is implemented in CUDA C/C++. It requires the CUDA Toolkit. Please see the NVIDIA CUDA homepage to download CUDA as well as the CUDA Programming Guide and CUDA SDK, which includes many CUDA code examples.

Design Goals

Design goals for CUDPP include:

  • Performance. We aim to provide best-of-class performance for our primitives. We welcome suggestions and contributions that will improve CUDPP performance. We also want to provide primitives that can be easily benchmarked, and compared against other implementations on GPUs and other processors.
  • Modularity. We want our primitives to be easily included in other applications. To that end we have made the following design decisions:
    • CUDPP is provided as a library that other applications can link against.
    • CUDPP calls run on the GPU on GPU data. Thus they can be used as standalone calls on the GPU (on GPU data initialized by the calling application) and, more importantly, as GPU components in larger CPU/GPU applications.
  • CUDPP is implemented as 4 layers:
    • The [Public Interface](@ref publicInterface) is the external library interface, which is the intended entry point for most applications. The public interface calls into the [Application-Level API](@ref cudpp_app).
    • The [Application-Level API](@ref cudpp_app) comprises functions callable from CPU code. These functions execute code jointly on the CPU (host) and the GPU by calling into the [Kernel-Level API](@ref cudpp_kernel) below them.
    • The [Kernel-Level API](@ref cudpp_kernel) comprises functions that run entirely on the GPU across an entire grid of thread blocks. These functions may call into the [CTA-Level API](@ref cudpp_cta) below them.
    • The [CTA-Level API](@ref cudpp_cta) comprises functions that run entirely on the GPU within a single Cooperative Thread Array (CTA, aka a CUDA thread block). These are low-level functions that implement core data-parallel algorithms, typically by processing data within shared (CUDA __shared__) memory.

Programmers may use any of the lower three CUDPP layers in their own programs by building the source directly into their application. However, the typical usage of CUDPP is to link to the library and invoke functions in the CUDPP [Public Interface](@ref publicInterface), as in the [simpleCUDPP](@ref example_simpleCUDPP), satGL, cudpp_testrig, and cudpp_hash_testrig application examples included in the CUDPP distribution.

Use Cases

We expect the normal use of CUDPP will be in one of two ways:

  • Linking the CUDPP library against another application.
  • Running the "test" applications, cudpp_testrig and cudpp_hash_testrig, that exercise CUDPP functionality.

References {#references}

The following publications describe work incorporated in CUDPP.

  • Mark Harris, Shubhabrata Sengupta, and John D. Owens. "Parallel Prefix Sum (Scan) with CUDA". In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison Wesley, August 2007. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=916
  • Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. "Scan Primitives for GPU Computing". In Graphics Hardware 2007, pages 97–106, August 2007. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=915
  • Nadathur Satish, Mark Harris, and Michael Garland. "Designing Efficient Sorting Algorithms for Manycore GPUs". In Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium, May 2009. http://mgarland.org/papers.html#gpusort
  • Stanley Tzeng, Li-Yi Wei. "Parallel White Noise Generation on a GPU via Cryptographic Hash". In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 79–87, February 2008. http://research.microsoft.com/apps/pubs/default.aspx?id=70502
  • Yao Zhang, Jonathan Cohen, and John D. Owens. Fast Tridiagonal Solvers on the GPU. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010), pages 127–136, January 2010. http://www.cs.ucdavis.edu/publications/print_pub?pub_id=978
  • Yao Zhang, Jonathan Cohen, Andrew A. Davidson, and John D. Owens. A Hybrid Method for Solving Tridiagonal Systems on the GPU. In Wen-mei W. Hwu, editor, GPU Computing Gems. Morgan Kaufmann. July 2011.
  • Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. "Efficient Parallel Scan Algorithms for many-core GPUs". In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pages 413–442. Taylor & Francis, January 2011. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1041
  • Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Real-Time Parallel Hashing on the GPU. ACM Transactions on Graphics, 28(5):154:1–154:9, December 2009. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=973
  • Dan A. Alcantara, Vasily Volkov, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Building an Efficient Hash Table on the GPU. In Wen-mei W. Hwu, editor, GPU Computing Gems, volume 2, chapter 1. Morgan Kaufmann, August 2011.
  • Ritesh A. Patel, Yao Zhang, Jason Mak, Andrew Davidson, John D. Owens. "Parallel Lossless Data Compression on the GPU". In Proceedings of Innovative Parallel Computing (InPar '12), May 2012. http://idav.ucdavis.edu/publications/print_pub?pub_id=1087
  • Andrew Davidson, David Tarjan, Michael Garland, and John D. Owens. Efficient Parallel Merge Sort for Fixed and Variable Length Keys. In Proceedings of Innovative Parallel Computing (InPar '12), May 2012. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1085
  • Saman Ashkiani, Andrew Davidson, Ulrich Meyer, and John D. Owens. GPU Multisplit. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16), March 2016, http://escholarship.org/uc/item/346486j8
Many researchers are using CUDPP in their work, and there are many publications that have used it ([references](@ref cudpp_refs)). If your work uses CUDPP, please let us know by sending us a reference (preferably in BibTeX format) to your work.

Citing CUDPP

If you make use of CUDPP primitives in your work and want to cite CUDPP (thanks!), we would prefer for you to cite the appropriate papers above, since they form the core of CUDPP. To be more specific, the GPU Gems paper (Harris et al.) describes (unsegmented) scan, multi-scan for summed-area tables, and stream compaction. The Sengupta et al. book chapter describes the current scan and segmented scan algorithms used in the library, and the Sengupta et al. Graphics Hardware paper describes an earlier implementation of segmented scan, quicksort, and sparse matrix-vector multiply. The IPDPS paper (Satish et al.) describes the radix sort used prior to CUDPP 2.0 (later releases use thrust::sort), and the I3D paper (Tzeng and Wei) describes the random number generation algorithm. The two Alcantara papers describe the hash algorithms. The two Zhang papers describe the tridiagonal solvers.

Credits

CUDPP Developers

Other CUDPP Contributors

  • Jason Mak, University of California, Davis [Release Manager]
  • Anjul Patney, University of California, Davis [general help]
  • Edmund Yan, University of California, Davis [Release Manager]
  • Dan Alcantara, University of California, Davis [hash tables]
  • Nadathur Satish, University of California, Berkeley [(old) radix sort]

Acknowledgments

Thanks to Jim Ahrens, Timo Aila, Nathan Bell, Ian Buck, Guy Blelloch, Jeff Bolz, Michael Garland, Jeff Inman, Eric Lengyel, Samuli Laine, David Luebke, Pat McCormick, Duane Merrill, and Richard Vuduc for their contributions during the development of this library.

CUDPP Developers from UC Davis thank their funding agencies:

  • National Science Foundation (grants CCF-0541448, IIS-0964357, and particularly OCI-1032859)
  • Department of Energy Early Career Principal Investigator Award DE-FG02-04ER25609
  • SciDAC Institute for Ultrascale Visualization (http://www.iusv.org/)
  • Los Alamos National Laboratory
  • Generous hardware donations from NVIDIA

CUDPP Copyright and Software License

CUDPP is copyright The Regents of the University of California, Davis campus and NVIDIA Corporation. The library, examples, and all source code are released under the BSD license, designed to encourage reuse of this software in other projects, both commercial and non-commercial. For details, please see the [license](@ref license) page.

Non source-code content (such as documentation, web pages, etc.) from CUDPP is distributed under a Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.

Note that prior to release 1.1 of CUDPP, the license used was a modified BSD license. With release 1.1, this license was replaced with the pure BSD license to facilitate the use of open source hosting of the code.

CUDPP also includes the Mersenne twister code of Makoto Matsumoto, also licensed under BSD.

CUDPP also calls functions in the Thrust template library, which is included with the CUDA Toolkit and licensed under the Apache 2.0 open source license.

CUDPP also includes a modified version of FindGLEW.cmake from nvidia-texture-tools, licensed under the MIT license.


cudpp's Issues

cudpp_1.1 does not compile with CUDA 2.3 in Debian

Original author: [email protected] (September 27, 2009 19:13:22)

What steps will reproduce the problem?

  1. compile cutil (succeeds)
  2. compile cudpp

What is the expected output? What do you see instead?
I expect cudpp to compile successfully. Instead I get:

gauguin:/raid/filipe/cudpp_1.1/cudpp> make verbose=1
nvcc -o obj/release/segmented_scan_app.cu_o -c src/app/segmented_scan_app.cu --host-compilation=C --compiler-options -fno-strict-aliasing -I./ -I./include/ -Isrc/ -Isrc/app/ -Isrc/kernel/ -Isrc/cta/ -I. -I/opt/cuda/include -I./../common/inc -DUNIX -O
In file included from /tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c:6,
from src/app/segmented_scan_app.cu:247:
/opt/cuda/bin/../include/crt/host_runtime.h:178: warning: 'struct surfaceReference' declared inside parameter list
/opt/cuda/bin/../include/crt/host_runtime.h:178: warning: its scope is only this definition or declaration, which is probably not what you want
In file included from src/app/segmented_scan_app.cu:247:
/tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c: In function '__sti____cudaRegisterAll_53_tmpxft_00004836_00000000_4_segmented_scan_app_cpp1_ii_999fefc3':
/tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c:11623: error: '__fatDeviceText' undeclared (first use in this function)
/tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c:11623: error: (Each undeclared identifier is reported only once
/tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c:11623: error: for each function it appears in.)
make: *** [obj/release/segmented_scan_app.cu_o] Error 255

What version of the product are you using? On what operating system?
cudpp 1.1 with CUDA 2.3 on Debian sid with both gcc 4.3 and 4.1.

Please provide any additional information below.

I tried compiling with emu=1 and that did work. I also tried dbg=1 but it didn't seem to make any difference. This might well be a CUDA bug and not cudpp's fault.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=41

cudppRand tests in cudpp_testrig fail when run from the command line

Original author: [email protected] (June 16, 2009 06:44:03)

What steps will reproduce the problem?

  1. Run "cudpp_testrig -all" from the command line

What is the expected output? What do you see instead?

Expect "test passed" for all of them, but instead get a warning about not
finding a regression file. Also, when this happens, the tests still pass!
These should be test failures.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=3

Wrong Results from multiScan for large number of rows

Original author: [email protected] (September 29, 2009 00:00:22)

What steps will reproduce the problem?
I've included a function that runs the CUDPP multiScan and checks it
against what I think should appear using the CPU. For me it fails with a
datasize of 50000 and 100 rows.

What is the expected output? What do you see instead?
I would expect the test to pass. If you use the cudppScan inside the For
loop instead the test passes.

What version of the product are you using? On what operating system?

Using CUDPP 1.1 on Vista 64-bit with Visual Studio 2008. I'm using CUDA
2.2. I won't have time to test it using 2.3 as I'm about to leave the
country so maybe someone can confirm that it's still a fault with 2.3.

Please provide any additional information below.

I believe the problem lies in the scan_cta.cu file. In the scanCTA
function, a syncthreads is called only in emulation mode for backwards
scans. I think this needs to get called for device mode as well. At least
that fixed the problem for me. It looks like there would be race conditions
for large numbers of threads.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=42

CUDPP Errors with OpenMP and Multi-GPU

Original author: [email protected] (July 29, 2009 23:01:07)

What steps will reproduce the problem?

  1. Extract attachment
  2. Open cudpp/cudpp.sln in Visual Studio 2008
  3. Rebuild Debug solution
  4. Open apps/simpleCUDPP_openMP/simpleCUDPP.sln
  5. Rebuild Debug solution
  6. Start debugging simpleCUDPP

What is the expected output?

All tests should pass

What do you see instead?


- Run 1:

Windows has triggered a breakpoint in simpleCUDPP.exe.

This may be due to a corruption of the heap, which indicates a bug in
simpleCUDPP.exe or any of the DLLs it has loaded.

This may also be due to the user pressing F12 while simpleCUDPP.exe has focus.

The output window may have more diagnostic information.


- Run 2:

Error destroying CUDPPPlan


- Run 3:

Unhandled exception at 0x006ca87e (cudpp32d.dll) in simpleCUDPP.exe:
0xC0000005: Access violation writing location 0xddddddf1.


- Run 4:

Unhandled exception at 0x007c32e4 (cudpp32d.dll) in simpleCUDPP.exe:
0xC0000005: Access violation reading location 0xfeeefee8.


- Run 5:

Error creating CUDPPPlan

What version of the product are you using? On what operating system?

CUDPP 1.1 with CUDA 2.3 beta, Windows XP 32 bit, Visual Studio 2008, GTX 295

Please provide any additional information below.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=32

sorting test failed.

Original author: [email protected] (July 19, 2009 10:07:35)

Hi,
I just built cudpp and ran cudpp_testrig, which failed with

(all previous tests were correct)
Running a sort of 1048581 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value1048577 3530798281 != 0
GPU test FAILED
Average execution time: 2.586515 ms
Running a sort of 2097152 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value1048577 3530798281 != 0
GPU test FAILED
Average execution time: 0.000000 ms
Running a sort of 4194304 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value1048577 3530798281 != 0
GPU test FAILED
Average execution time: 0.000000 ms
Running a sort of 8388608 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value1048577 3530798281 != 0
GPU test FAILED
Average execution time: 0.000000 ms

My gpu card is a Tesla C1060.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=30

cudppSort error for a large array

Original author: [email protected] (March 30, 2010 12:06:11)

What steps will reproduce the problem?

  1. tar -xzvf sort_test.tar.gz
  2. cd sort_test
  3. make
  4. ./testsort 1000000

What is the expected output? What do you see instead?

expected :
before sort
radix sort : 0.00833379 s 1000000 elements

what I see :
before sort
radix sort : 0.00833379 s 1000000 elements
sort error 4 720476 541723

What version of the product are you using? On what operating system?

Using device 0: Quadroplex 2200 S4
Quadroplex 2200 S4; global mem: 4294705152B; compute v1.3; clock: 1296000
kHz

cudpp 1.1.1
CUDA SDK 2.3
on linux 2.6 kernel

$uname -a
Linux tesla 2.6.18-128.1.1.el5 #1 SMP Tue Feb 10 11:36:29 EST 2009 x86_64
x86_64 x86_64 GNU/Linux

$cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 190.53 Wed Dec 9
15:29:46 PST 2009
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)

Please provide any additional information below.

This is a simple test to use cudppSort.
For a small array, it passes the test, but for a large array, it fails.
It also fails in cudpp_testrig as follows.

$./cudpp_testrig -sort -n=1000000
Using device 0: Quadroplex 2200 S4
Quadroplex 2200 S4; global mem: 4294705152B; compute v1.3; clock: 1296000
kHz
Running a sort of 1000000 unsigned int key-value pairs
Unordered key[3]:746051 > key[4]:16173
Incorrectly sorted value0 1153083146 != 460036
GPU test FAILED
Average execution time: 8.024296 ms

1 tests failed

If this is a driver version mismatch, please let me know which driver
version is needed. Thank you.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=50

Sorting in emulation mode broken

Original author: [email protected] (November 20, 2009 09:46:42)

What steps will reproduce the problem?

Compile and run the following in emulation mode:

#include <stdio.h>

#include <cudpp/cudpp.h>

#include <cuda_runtime.h>

#include <cutil_inline.h>

typedef unsigned int uint;

#define N 12

uint keys[N]   = {111, 37, 430, 433, 431, 357, 6190, 6193, 6191, 6117, 6837, 6911};
uint values[N] = {37, 111, 433, 430, 357, 431, 6193, 6190, 6117, 6191, 6911, 6837};

int main() {
    cudaSetDevice(0);
    int* keys_dev = 0;
    int* vals_dev = 0;
    cutilSafeCall(cudaMalloc((void**)&keys_dev, sizeof(uint) * N));
    cutilSafeCall(cudaMalloc((void**)&vals_dev, sizeof(uint) * N));
    CUDPPConfiguration sortConfig;
    sortConfig.algorithm = CUDPP_SORT_RADIX;
    sortConfig.datatype = CUDPP_UINT;
    sortConfig.op = CUDPP_ADD;
    sortConfig.options = CUDPP_OPTION_KEY_VALUE_PAIRS;
    CUDPPHandle sortPlan;
    cudppPlan(&sortPlan, sortConfig, 100 /* num elements */, 1 /* num rows */, 100 /* pitch */);
    printf("Before\n");
    for (uint i = 0; i < N; i++) {
        printf("(%d,\t%d)\n", keys[i], values[i]);
    }
    cutilSafeCall(cudaMemcpy(keys_dev, keys, sizeof(uint) * N, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(vals_dev, values, sizeof(uint) * N, cudaMemcpyHostToDevice));
    cudppSort(sortPlan, keys_dev, vals_dev, 32, N);
    cutilSafeCall(cudaMemcpy(keys, keys_dev, sizeof(uint) * N, cudaMemcpyDeviceToHost));
    cutilSafeCall(cudaMemcpy(values, vals_dev, sizeof(uint) * N, cudaMemcpyDeviceToHost));
    printf("After\n");
    for (uint i = 0; i < N; i++) {
        printf("(%d,\t%d)\n", keys[i], values[i]);
    }
}

What is the expected output? What do you see instead?

The output should be a list of sorted keys + values. Instead:

Before
(111, 37)
(37, 111)
(430, 433)
(433, 430)
(431, 357)
(357, 431)
(6190, 6193)
(6193, 6190)
(6191, 6117)
(6117, 6191)
(6837, 6911)
(6911, 6837)

After
(37, 111)
(111, 37)
(357, 431)
(357, 431)
(357, 431)
(430, 433)
(6117, 6191)
(6117, 6191)
(6117, 6191)
(6190, 6193)
(6837, 6911)
(6911, 6837)
(6911, 6837)

Key/value pairs are indeed sorted however some pairs have been duplicated whereas others have
been deleted.

What version of the product are you using? On what operating system?

Using the version bundled with the CUDA toolkit v3.0 beta1 on both MacOS 10.6 and Ubuntu 9.04

Please provide any additional information below.

Works correctly when run on the device.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=46

Modify rand seed generation

Original author: [email protected] (July 13, 2009 14:57:31)

Current rand seed generation only uses a basic seed XOR'd with the
threadIdx and blockIdx. A more clever way would be to use an LCG.
Original e-mail suggesting the change:

From Thomas Bradley:
The threadIdx and blockIdx are 16-bit quantities and even fewer bits will
actually be non-zero, therefore you are really only changing the low bits
of your seed. It may be more robust to use an LCG to generate the “input”
fields, for example a=69069 m=32 is easy and not a bad LCG:

state = (state * 69069) & 0xffffffffUL; return state;

Where state is initialized to the seed (combined somehow with the threadIdx
and blockIdx).

Original issue: http://code.google.com/p/cudpp/issues/detail?id=29

test_rand.cu on blaze is compiling incorrectly

Original author: [email protected] (June 23, 2009 01:24:58)

What steps will reproduce the problem?

  1. Compile test_rand.cu on blaze (with makefile)

What is the expected output? What do you see instead?

I expect a clean build. Instead I get:

[jowens@blaze cudpp_testrig]$ make
test_rand.cu(63): error: pointer to incomplete class type is not allowed

test_rand.cu(64): error: pointer to incomplete class type is not allowed

2 errors detected in the compilation of "/tmp/tmpxft_00005b19_00000000-4_cudpp_testrig.cpp1.ii".

Original issue: http://code.google.com/p/cudpp/issues/detail?id=15

Rand tests need to have a more standard output that says passed/failed

Original author: [email protected] (June 17, 2009 13:53:16)

What steps will reproduce the problem?

  1. Run random tests
  2. Look at output

What is the expected output? What do you see instead?

I see:

128
number of elements: 128, devOutputSize: 32
number of blocks: 1 blocksize: 32 devOutputsize = 32
number of threads: 32

What I want to see is something more like:
Generating 128 random numbers (1 block, 32 threads) ...
GPU test FAILED (x/y correct)

or something like that. (Look at the other ones.)

Also make sure the -q (quiet) option works, as Mark has previously described.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=11

make issues: missing typinfo, cstdlib, etc. includes in cutil and testrig

Original author: [email protected] (July 22, 2009 13:38:05)

Hi,

  1. cudpp ships with a precompiled libcutil.a, but compiling the example
    apps fails due to it being in the wrong format. I guess it is compiled
    for a 32-bit host, and I am running 64-bit. So the cudpp make, or some
    other make, should rebuild it. Running make in common/ fails with

./../common/inc/cmd_arg_reader.h: In member function ‘const T* CmdArgReader::getArgHelper(const std::string&)’:
./../common/inc/cmd_arg_reader.h:416: error: must #include <typeinfo> before using typeid
./../common/inc/cmd_arg_reader.h:432: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp: In destructor ‘CmdArgReader::~CmdArgReader()’:
src/cmd_arg_reader.cpp:101: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:106: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:111: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:116: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:121: error: must #include <typeinfo> before using typeid
make: *** [obj/release/cmd_arg_reader.cpp_o] Error 1

  2. If I change anything in the kernels, I have to manually delete the
    compiled .o files; make doesn't recreate them automatically.
  3. Building testrig fails with

In file included from spmvmult_gold.cpp:13:
sparse.h: In constructor ‘MMMatrix::MMMatrix(unsigned int, unsigned int, unsigned int)’:
sparse.h:46: error: ‘malloc’ was not declared in this scope
spmvmult_gold.cpp: In function ‘void readMatrixMarket(MMMatrix*, const char*)’:
spmvmult_gold.cpp:94: error: ‘exit’ was not declared in this scope
spmvmult_gold.cpp:122: error: ‘qsort’ was not declared in this scope
make: *** [obj/release/spmvmult_gold.cpp_o] Error 1

Original issue: http://code.google.com/p/cudpp/issues/detail?id=31

CUDPP 1.1 compile errors (and fixes) (gcc 4.3.3-5ubuntu4)

Original author: [email protected] (August 13, 2009 09:52:47)

What version of the product are you using? On what operating system?
CUDPP 1.1, gcc 4.3.3-5ubuntu4 on Ubuntu 9.04 x64

Please provide any additional information below.

Building CUDPP 1.1 did not work out-of-the-box for me, and I believe
that some includes that should have been there are missing:

cudpp_1.1$ cd common
common$ make
[...]
./../common/inc/cmd_arg_reader.h:417: error: must #include <typeinfo> before using typeid
[...]
src/cutil.cpp:620: error: ‘strlen’ was not declared in this scope
[...]

To fix these:
in cmd_arg_reader.h

#include <typeinfo>

and in cutil.cpp

#include <cstring>

common$ make
[...]
./../common/inc/exception.h:89: error: ‘EXIT_FAILURE’ was not declared in this scope
[...]

To fix:

#include <cstdlib>

in exception.h

Original issue: http://code.google.com/p/cudpp/issues/detail?id=36

findFile/findDir search in wrong direction, and therefore find wrong path if the startDir is repeated in the path

Original author: [email protected] (June 29, 2009 07:42:36)

When CUDPP is in a path that has the name "cudpp" in it twice, for example,
the way I keep branches:

~/src/idav/branches/proj/cudpp/release1.1/cudpp/

cudpp_testrig -rand fails to find its files. This is because cutupPath
goes from the root of the path above, finding the first /cudpp first. It
should instead work backwards up the tree, so it finds the closest instance
of "startDir", rather than the farthest -- I think this is what users will
expect.

I think the correct way to do this is not using strtok, but by using the
chdir() to traverse up the tree until either the startDir is found or the
root is hit. I find it hard to believe each OS doesn't have a built-in
function to do this, but a quick google search turns up nothing easy...

This needs to be fixed. However I think we can leave it until after the
release.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=25

Code review request: tools.[h | cpp]

Original author: [email protected] (June 24, 2009 12:27:05)

Added tools.h and tools.cpp into the trunk. Once the code is accepted, I
will update the code in testrig

I checked the file-finding code in the cutil library and that only searches
./data/ and ../../../projects/<executable_name>/data/, which is not general
enough for our purposes.

Based on John's SPMVMult testrig and my rand testrig, I've written two
types of file searching: finding a directory and finding a file. I use the
directory finding to find the data/ directory while it seems that John's
needs one to find a specific filename. The idea for both is that the
function will ascend to a parent directory and do a recursive search down
its children from there.

So for example, if I were looking for the data directory and I am in, say,
/cudpp/bin/, then I'd call findDir("cudpp", "data", output), where output
is a character array. At the end of the function, output will contain
"../apps/data". Note that the recursive search does not descend into .svn
directories (I don't think any data file would be put in there...). Ditto
for the findFile function.

The code for both file and directory finding uses OS-dependent calls and
libraries. The Linux / Mac version uses dirent.h and unistd.h to find the
files, while the Windows version uses io.h and direct.h. Right now I have
only checked the two files tools.cpp and tools.h into the trunk, and once
they are accepted I will check in the revised testrig files. I have
already tried the code on Blaze and it does fix Issue 3 (it works in
Windows on my laptop as well). I've tried running the code from various
directories and it finds the regression files no sweat.

Stanley

Original issue: http://code.google.com/p/cudpp/issues/detail?id=16

Wrong result for UINT seg min scan

Original author: [email protected] (August 05, 2009 22:53:53)

Reproduction of the problem:

  1. Download files from http://www.ilab.sztaki.hu/~erikbodzsar/cudpp/
  2. Compile test.cu
  3. Run ./a.out <error.txt

The test program runs a segmented min scan on the input data contained in
error.txt. CUDPP gets some elements of the result wrong (the test program
prints the first wrong element, along with some preceding and following
elements).

I'm using CUDPP 1.1 on a 64-bit Debian 5.0.2 system with CUDA 2.2 and
g++/gcc 4.1.3.
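For context, a minimal host-side setup for this kind of scan with the CUDPP 1.1 API might look like the following sketch (the wrapper function and pointer names are illustrative; device buffers are assumed to be already allocated and filled):

```cpp
#include <cudpp.h>

// Sketch only: d_out, d_in, and d_flags are assumed to be device pointers
// that have already been allocated with cudaMalloc and populated.
void segmentedMinScan(unsigned int *d_out, const unsigned int *d_in,
                      const unsigned int *d_flags, size_t numElements)
{
    CUDPPConfiguration config;
    config.algorithm = CUDPP_SEGMENTED_SCAN;
    config.op        = CUDPP_MIN;
    config.datatype  = CUDPP_UINT;
    config.options   = CUDPP_OPTION_FORWARD | CUDPP_OPTION_INCLUSIVE;

    CUDPPHandle plan;
    cudppPlan(&plan, config, numElements, 1, 0);
    cudppSegmentedScan(plan, d_out, d_in, d_flags, numElements);
    cudppDestroyPlan(plan);
}
```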

Original issue: http://code.google.com/p/cudpp/issues/detail?id=34

cudpp Makefile: NVCCFLAGS missing -Xcompiler -fPIC

Original author: [email protected] (September 04, 2009 09:12:40)

When trying to build a shared library using CUDPP on Linux (x86_64), the
version of CUDPP delivered with the NVIDIA SDK (any version), as well as
any version of CUDPP up to and including 1.1, results in:

relocation R_X86_64_32 against `a local symbol' can not be used when making
a shared object; recompile with -fPIC

The problem is known and has been discussed and solved at
http://forums.nvidia.com/lofiversion/index.php?t63748.html

It would be a good idea to include the changes (see the attached patch for
version 1.1) in future releases of CUDPP, as well as in the version
shipped with the NVIDIA SDK.
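The fix from that thread amounts to forwarding -fPIC through nvcc to the host compiler, roughly like this in the Makefile (variable names follow the SDK's common.mk conventions; treat this as a sketch, not the exact patch):

```make
# Pass -fPIC through nvcc to the host compiler so the resulting objects
# can be linked into a shared library.
NVCCFLAGS += -Xcompiler -fPIC
CXXFLAGS  += -fPIC
CFLAGS    += -fPIC
```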

Original issue: http://code.google.com/p/cudpp/issues/detail?id=40

Document all members of enum CUDPPAlgorithm

Original author: [email protected] (July 01, 2009 08:02:42)

Right now we have:

enum CUDPPAlgorithm
{
    CUDPP_SCAN,
    CUDPP_SEGMENTED_SCAN,
    CUDPP_COMPACT,
    CUDPP_REDUCE,
    CUDPP_SORT_RADIX,
    CUDPP_SPMVMULT,          /**< Sparse matrix-dense vector multiplication */
    CUDPP_RAND_MD5,          /**< Pseudorandom number generator using the MD5 hash algorithm */
    CUDPP_ALGORITHM_INVALID, /**< Placeholder at end of enum */
};

I didn't catch this in time for release1.1.
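For reference, a fully documented version could look like the sketch below; the comments on the first five members are my own descriptions, not official documentation:

```cpp
/** Supported algorithms. */
enum CUDPPAlgorithm
{
    CUDPP_SCAN,              /**< Parallel prefix sum (scan) */
    CUDPP_SEGMENTED_SCAN,    /**< Segmented parallel prefix sum */
    CUDPP_COMPACT,           /**< Stream compaction */
    CUDPP_REDUCE,            /**< Parallel reduction */
    CUDPP_SORT_RADIX,        /**< Radix sort */
    CUDPP_SPMVMULT,          /**< Sparse matrix-dense vector multiplication */
    CUDPP_RAND_MD5,          /**< Pseudorandom number generator using the MD5 hash algorithm */
    CUDPP_ALGORITHM_INVALID, /**< Placeholder at end of enum */
};
```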

Original issue: http://code.google.com/p/cudpp/issues/detail?id=27

Provide example of multiple value array sorting by the same index

Original author: [email protected] (September 02, 2009 21:43:44)

See issue 17 for more information. It is not efficient to sort multiple
value arrays inside CUDPP -- one can sort key-index pairs and then use the
sorted indices to shuffle/gather the multiple arrays. This is more efficient
and more general, but it may not be obvious to users how to do it. So we
should provide an example in the "apps" directory.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=39

Investigate compile time

Original author: [email protected] (June 25, 2009 00:17:22)

Compile time continues to get longer as we add more functionality. CUDA is
really slow at compiling template functions with multiple parameters, and
we use a lot of them. There are something like 384 different scan kernels,
for example, and a similar number for segmented scan.

How can we reduce this code explosion? Can we give feedback to the CUDA
compiler team? (Emulation mode compiles WAY faster, for example.)

Original issue: http://code.google.com/p/cudpp/issues/detail?id=19

Building a MEX-wrapper using CUDPP in Matlab

Original author: [email protected] (September 02, 2009 14:04:35)

This is not a bug report, but a feature request.

I wish there were a MEX wrapper that allowed me to use CUDPP from M-code.
This (CUDA) MEX file could be compiled at first use and would afterwards
allow people like me, who don't know much C, to use GPGPU very easily in
Matlab.

For my part, sort seems very interesting, provided it gives back indices
(as needed for sortrows).

At the moment there are two different toolboxes available for using CUDA
in Matlab: AccelerEyes' Jacket and GPUmat (from gp-you.org). Neither of
them allows a sortrows; Jacket's sort, on the other hand, is very slow.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=38

cudpp_segmented_scan_cta.cu "Removed dead synchronization intrinsic" advisories

Original author: [email protected] (June 17, 2009 01:38:40)

What steps will reproduce the problem?

  1. Build cudpp release or debug

What is the expected output? What do you see instead?

Expect no errors, warnings, or advisories. Instead I get lots of these:

src/cta/segmented_scan_cta.cu(868): Advisory: Removed dead synchronization
intrinsic from function
_Z14segmentedScan4If19SegmentedScanTraitsIfL13CUDPPOperator2ELb0ELb0ELb0ELb0ELb1ELb0EEEvPT_PKS3_PKjjS4_PjS9_

Suggested fix:
I realize that removing this __syncthreads() causes failures. I believe,
though, that the compiler is only removing it from some instantiations of
the function that includes it, not all. So instead of putting the
__syncthreads() inside this function, put it right before the call to the
function, only where it is needed.
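In sketch form, the suggested restructuring looks like this (the names are illustrative, not the actual CUDPP kernels):

```cuda
// Before: the barrier lives inside the helper, and the compiler may
// drop it as dead code in template instantiations that don't need it.
template <bool doFlag>
__device__ void scanStep(unsigned int *s_data)
{
    __syncthreads();          // sometimes "dead", sometimes required
    // ... work on s_data ...
}

// After: the helper contains no barrier; callers insert __syncthreads()
// immediately before the call, but only at call sites that need it.
template <bool doFlag>
__device__ void scanStepNoSync(unsigned int *s_data)
{
    // ... work on s_data ...
}

__global__ void scanKernel(unsigned int *d_data)
{
    __shared__ unsigned int s_data[256];
    // ...
    __syncthreads();          // explicit barrier only where required
    scanStepNoSync<true>(s_data);
}
```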


Original issue: http://code.google.com/p/cudpp/issues/detail?id=9

Sorting Error

Original author: [email protected] (February 19, 2010 07:17:19)

What steps will reproduce the problem?

  1. Unzip gtc_to_sort_test.tar.gz in the NVIDIA_CUDA_SDK projects folder
  2. Run make in ~/NVIDIA_CUDA_SDK/projects/gtc_to_sort_test/
  3. Execute from ~/NVIDIA_CUDA_SDK/bin/linux/release
     (e.g.: ./gtc_sort_test
     ~/NVIDIA_CUDA_SDK/projects/gtc_to_sort_test/input/1.txt 5 30 32)

What is the expected output? What do you see instead?
[expected output]
Finished reading input file.
mi: 161795, mgrid: 32449
Sorting : Success
0.0425751 s Checksum: 0.000000
Sorting : Success
0.0424822 s Checksum: 0.000000
Sorting : Success
0.0425396 s Checksum: 0.000000
Sorting : Success
0.0425428 s Checksum: 0.000000
Sorting : Success
0.0427907 s Checksum: 0.000000
Sorting : Success
0.042537 s Checksum: 0.000000
Sorting : Success
0.0425729 s Checksum: 0.000000
Sorting : Success
0.0425132 s Checksum: 0.000000
Sorting : Success
0.0426874 s Checksum: 0.000000
Sorting : Success
0.0428964 s Checksum: 0.000000
=== Performance summary: BENCH_GPU A0 5057 blocks 32 threads/block ===
0.0286377 Gflops
Min: 0.0424822 s -- 0.674 Gflop/s
Mean: 0.0426137 s -- 0.672 Gflop/s
Max: 0.0428964 s -- 0.668 Gflop/s
Stddev: 0.000134837 s (+/- 0.3164%)

[output]
Finished reading input file.
mi: 161795, mgrid: 32449
Sorting : Success
0.0426965 s Checksum: 0.000000
Sorting : Success
0.0425468 s Checksum: 0.000000
Sorting : Success
0.0426379 s Checksum: 0.000000
Sorting : Success
0.0425811 s Checksum: 0.000000
Sorting : Success
0.0426666 s Checksum: 0.000000
Unordered key[983]: 138 > key[984]: 27
Sorting : FAIL
0.0436186 s Checksum: 0.000000
Unordered key[45]: 6392 > key[46]: 6384
Sorting : FAIL
0.0434239 s Checksum: 0.000000
Unordered key[147]: 3 > key[148]: 0
Sorting : FAIL
0.0435097 s Checksum: 0.000000
Unordered key[210]: 218 > key[211]: 0
Sorting : FAIL
0.0436116 s Checksum: 0.000000
Unordered key[132]: 14 > key[133]: 0
Sorting : FAIL
0.0435575 s Checksum: 0.000000
=== Performance summary: BENCH_GPU A0 5057 blocks 32 threads/block ===
0.0286377 Gflops
Min: 0.0425468 s -- 0.673 Gflop/s
Mean: 0.043085 s -- 0.665 Gflop/s
Max: 0.0436186 s -- 0.657 Gflop/s
Stddev: 0.000488756 s (+/- 1.134%)

What version of the product are you using? On what operating system?
GTX280
Ubuntu 8.04
cuda 2.2

Please provide any additional information below.
The sorting error occurs intermittently, as in the output example above.
The same error also occurs in the cudpp 1.1 and cudpp 1.1.1 test programs.

Original issue: http://code.google.com/p/cudpp/issues/detail?id=48
