ROCm Parallel Primitives

Home Page: https://rocm.docs.amd.com/projects/rocPRIM/

License: MIT License

rocPRIM

rocPRIM is a header-only library that provides HIP parallel primitives. You can use this library to develop performant GPU-accelerated code on AMD ROCm platforms.

Requirements

Git
CMake (3.16 or later)
AMD ROCm platform (1.8.2 or later)
- Including HIP-clang compiler
C++14
Python 3.6 or higher (HIP on Windows only, required only for install script)
Visual Studio 2019 with Clang support (HIP on Windows only)
Strawberry Perl (HIP on Windows only)

Optional:

GoogleTest
- Required only for tests. Building tests is off by default (see BUILD_TEST under Build and install).
- This is automatically downloaded and built by the CMake script.
Google Benchmark
- Required only for benchmarks. Building benchmarks is off by default.
- This is automatically downloaded and built by the CMake script.

Documentation

Documentation for rocPRIM is available at https://rocm.docs.amd.com/projects/rocPRIM/en/latest/.

To build our documentation locally, use the following code:

# Go to rocPRIM docs directory
cd rocPRIM; cd docs

# Install Python dependencies
python3 -m pip install -r sphinx/requirements.txt

# Build the documentation
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

# For local HTML version
cd _build/html
python3 -m http.server

Build and install

You can build and install rocPRIM on Linux or Windows.

Linux:

git clone https://github.com/ROCm/rocPRIM.git

# Go to rocPRIM directory, create and go to the build directory.
cd rocPRIM; mkdir build; cd build

# Configure rocPRIM, setup options for your system.
# Build options:
#   ONLY_INSTALL - OFF by default. If this flag is ON, the build ignores the BUILD_* flags.
#   BUILD_TEST - OFF by default.
#   BUILD_EXAMPLE - OFF by default.
#   BUILD_BENCHMARK - OFF by default.
#   BENCHMARK_CONFIG_TUNING - OFF by default. This flag is used to find the best kernel config parameters.
#     When ON, compilation time can increase significantly.
#   AMDGPU_TARGETS - list of AMD architectures, default: gfx803;gfx900;gfx906;gfx908.
#     You can make compilation faster if you want to test/benchmark only on one architecture,
#     for example, add -DAMDGPU_TARGETS=gfx906 to 'cmake' parameters.
#   AMDGPU_TEST_TARGETS - list of AMD architectures, default: "" (default system device).
#     To detect failures on a per-GFX-IP basis, set it to a list of IPs; this creates
#     separate tests with the IP name embedded in the test name. You can build for all targets but
#     run only the tests of specific architectures, e.g.: ctest -R "gfx803|gfx900"
#
# ! IMPORTANT !
# Set C++ compiler to HIP-clang. You can do it by adding 'CXX=<path-to-compiler>'
# before 'cmake' or setting cmake option 'CMAKE_CXX_COMPILER' to path to the compiler.
# Using HIP-clang:
[CXX=hipcc] cmake -DBUILD_BENCHMARK=ON ../.
#
# ! EXPERIMENTAL !
# Alternatively one may build using the experimental (and highly incomplete) HIP-CPU back-end for host-side
# execution using any C++17 conforming compiler (supported by HIP-CPU). AMDGPU_* options are unavailable in this case. 
#   USE_HIP_CPU - OFF by default

# Build
make -j4

# Optionally, run tests if they're enabled.
ctest --output-on-failure

# Install
[sudo] make install

Windows:

We've added initial support for HIP on Windows; to install, use the provided rmake.py python script:

git clone https://github.com/ROCm/rocPRIM.git
cd rocPRIM

# the -i option will install rocPRIM to C:\hipSDK by default
python rmake.py -i

# the -c option will build all clients including unit tests
python rmake.py -c

Using rocPRIM

Include the <rocprim/rocprim.hpp> header:

#include <rocprim/rocprim.hpp>

We recommend including rocPRIM in a CMake project by using the package configuration files. The rocPRIM package name is rocprim.

# "/opt/rocm" - default install prefix
find_package(rocprim REQUIRED CONFIG PATHS "/opt/rocm/rocprim")

...

# Includes only rocPRIM headers, HIP libraries have
# to be linked manually by user
target_link_libraries(<your_target> roc::rocprim)

# Includes rocPRIM headers and required HIP dependencies
target_link_libraries(<your_target> roc::rocprim_hip)

Running unit tests

Unit tests are implemented in terms of GoogleTest. Collections of tests are wrapped and invoked from CTest.

# Go to rocPRIM build directory
cd rocPRIM; cd build

# List available tests
ctest --show-only

# To run all tests
ctest

# Run specific test(s)
ctest -R <regex>

# To run the Google Test manually
./test/rocprim/test_<unit-test-name>

Using multiple GPUs concurrently for testing

This feature requires using CMake 3.16+ for building and testing.

Prior versions of CMake can't assign IDs to tests when running in parallel. Assigning tests to distinct
devices could only be done at the cost of extreme complexity.

Unit tests can make use of the CTest resource allocation feature, which you can use to distribute tests across multiple GPUs in an intelligent manner. This feature can accelerate testing when multiple GPUs of the same family are in a system. It can also test multiple product families from one invocation without having to use the HIP_VISIBLE_DEVICES environment variable. The feature relies on the presence of a resource specifications file.

Trying to use `RESOURCE_GROUPS` and `--resource-spec-file` with CMake and CTest for versions prior
to 3.16 silently omits the feature. No warnings are issued about unknown properties or command-line
arguments. Make sure that the `cmake` and `ctest` versions you invoke are sufficiently recent.

Auto resource specification generation

You can independently call the utility script located in the repository using the following code:

# Go to rocPRIM build directory
cd rocPRIM; cd build

# Invoke directly or use CMake script mode via cmake -P
../cmake/GenerateResourceSpec.cmake

# Assuming you have 2 compatible GPUs in the system
ctest --resource-spec-file ./resources.json --parallel 2

Manual

Assuming you have two GPUs from the gfx900 family and they are the first devices enumerated by the system, you can use -D AMDGPU_TEST_TARGETS=gfx900 during configuration to specify that only one family will be tested. Leaving this variable empty (the default) targets the default device in the system. To let CTest know there are two GPUs that should be targeted, you have to provide a JSON file via the --resource-spec-file <path_to_file> flag. For example:

{
  "version": {
    "major": 1,
    "minor": 0
  },
  "local": [
    {
      "gfx900": [
        {
          "id": "0"
        },
        {
          "id": "1"
        }
      ]
    }
  ]
}

Invoking CTest as ctest --resource-spec-file <path_to_file> --parallel 2 allows two tests to run concurrently, distributed between the two GPUs.
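When the number of GPUs varies between machines, the file shown above can be generated rather than written by hand. The following is a minimal sketch, assuming a single GPU family and consecutive device IDs starting at 0; make_resource_spec is a hypothetical helper and is not part of rocPRIM or of the GenerateResourceSpec.cmake script.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper (not part of rocPRIM): builds the text of a CTest
// resource specification for `count` devices of one GPU family, e.g. "gfx900".
// Device IDs are assumed to be consecutive and to start at 0.
std::string make_resource_spec(const std::string& family, std::size_t count)
{
    assert(count > 0 && "at least one device is required");
    std::ostringstream out;
    out << "{\n"
        << "  \"version\": { \"major\": 1, \"minor\": 0 },\n"
        << "  \"local\": [\n"
        << "    {\n"
        << "      \"" << family << "\": [\n";
    for(std::size_t i = 0; i < count; ++i)
    {
        // Every entry except the last is followed by a comma.
        out << "        { \"id\": \"" << i << "\" }" << (i + 1 < count ? "," : "") << "\n";
    }
    out << "      ]\n"
        << "    }\n"
        << "  ]\n"
        << "}\n";
    return out.str();
}
```

Writing the result of make_resource_spec("gfx900", 2) to a file gives the same structure as the example above, which can then be passed to ctest --resource-spec-file.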

Using custom seeds for the tests

Modify the rocPRIM/test/rocprim/test_seed.hpp file.

//(1)
static constexpr int random_seeds_count = 10;

//(2)
static constexpr unsigned int seeds [] = {0, 2, 10, 1000};

//(3)
static constexpr size_t seed_size = sizeof(seeds) / sizeof(seeds[0]);

(1) Defines a constant that sets how many passes over the tests will be done with runtime-generated seeds. Modify at will.

(2) Defines the user-generated seeds. Each of the array elements will be used as seed for all tests. Modify at will. If you don't want any static seeds, leave the array empty.

static constexpr unsigned int seeds [] = {};

(3) Never modify this line.
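To illustrate how the three definitions fit together, here is a minimal sketch of a routine that gathers the seed for every pass. It assumes that the runtime-generated seeds come first and the static seeds follow; collect_seeds is a hypothetical helper, and the actual test harness may consume the values differently.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Same three definitions as in test_seed.hpp above.
static constexpr int random_seeds_count = 10;
static constexpr unsigned int seeds[] = {0, 2, 10, 1000};
static constexpr std::size_t seed_size = sizeof(seeds) / sizeof(seeds[0]);

// Hypothetical helper (not taken from the rocPRIM sources): returns one seed
// per pass over the tests. The ordering is an assumption of this sketch.
std::vector<unsigned int> collect_seeds()
{
    std::vector<unsigned int> result;
    std::random_device rd;
    // (1) seeds generated at run time
    for(int i = 0; i < random_seeds_count; ++i)
    {
        result.push_back(rd());
    }
    // (2) user-provided static seeds; (3) seed_size follows the array length
    for(std::size_t i = 0; i < seed_size; ++i)
    {
        result.push_back(seeds[i]);
    }
    assert(result.size() == random_seeds_count + seed_size);
    return result;
}
```

With the values above, each test would run 14 times: 10 passes with random seeds and 4 passes with the fixed seeds 0, 2, 10, and 1000. Because seed_size is computed from the array, adding or removing static seeds needs no other change.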

Running benchmarks

# Go to rocPRIM build directory
cd rocPRIM; cd build

# To run benchmark for warp functions:
# Further options can be found using --help
# [] Fields are optional
./benchmark/benchmark_warp_<function_name> [--size <size>] [--trials <trials>]

# To run benchmark for block functions:
# Further options can be found using --help
# [] Fields are optional
./benchmark/benchmark_block_<function_name> [--size <size>] [--trials <trials>]

# To run benchmark for device functions:
# Further options can be found using --help
# [] Fields are optional
./benchmark/benchmark_device_<function_name> [--size <size>] [--trials <trials>]

Performance configuration

Most device-specific primitives provided by rocPRIM can be tuned for other AMD devices, and different types and operations, by passing compile-time configuration structures as a template parameter. The main "knobs" are usually the size of the block and the number of items processed by a single thread.

rocPRIM has built-in default configurations for each of its primitives. To use the included configurations, define the macro ROCPRIM_TARGET_ARCH as 803 if you want the algorithms optimized for the gfx803 GCN version, or as 900 for gfx900.
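The idea of a compile-time configuration structure passed as a template parameter can be sketched in plain C++. The names below (sketch_config, blocks_needed) are hypothetical and are not rocPRIM's API; the sketch only shows how the two main knobs, block size and items per thread, determine how many blocks a launch would need.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configuration type (not rocPRIM's API): both knobs are fixed
// at compile time, so the compiler can specialize code for them.
template<unsigned int BlockSize, unsigned int ItemsPerThread>
struct sketch_config
{
    static constexpr unsigned int block_size       = BlockSize;
    static constexpr unsigned int items_per_thread = ItemsPerThread;
    static constexpr unsigned int items_per_block  = BlockSize * ItemsPerThread;
};

// Number of blocks needed to cover `size` items with the given configuration
// (ceiling division by the number of items one block processes).
template<class Config>
constexpr std::size_t blocks_needed(std::size_t size)
{
    return (size + Config::items_per_block - 1) / Config::items_per_block;
}
```

For example, blocks_needed<sketch_config<256, 4>>(1000000) evaluates to 977, because each block covers 1024 items. Choosing a different configuration type changes the launch geometry without touching the algorithm code, which is the point of passing the configuration as a template parameter.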

hipCUB

hipCUB is a thin wrapper library on top of rocPRIM or CUB. You can use it to port projects that use the CUB library to the HIP layer and run them on AMD hardware. In the ROCm environment, hipCUB uses the rocPRIM library as a backend; on CUDA platforms, it uses CUB as a backend.

Support

You can report bugs and feature requests through our GitHub issue tracker.

Contributions and license

Contributions of any kind are most welcome! Contribution instructions are in CONTRIBUTING.

Licensing information is in LICENSE.

rocprim's Issues

rocPRIM on NAVI22 = gfx1031 only one test fails - can we fix this?

I added NAVI22 (gfx1031) as a device. For a start, I simply copied the sections for gfx1030, compiled, and ran the tests.

Only test 14 failed:

Test results:

[----------] Global test environment tear-down
[==========] 280 tests from 112 test suites ran. (781 ms total)
[ PASSED ] 272 tests.
[ FAILED ] 8 tests, listed below:
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/16.ReduceMultiplies, where TypeParam = block_params<__half, __half, 32u>
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/17.ReduceMultiplies, where TypeParam = block_params<__half, __half, 64u>
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/18.ReduceMultiplies, where TypeParam = block_params<__half, __half, 128u>
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/19.ReduceMultiplies, where TypeParam = block_params<__half, __half, 192u>
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/20.ReduceMultiplies, where TypeParam = block_params<__half, __half, 256u>
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/21.ReduceMultiplies, where TypeParam = block_params<__half, __half, 129u>
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/22.ReduceMultiplies, where TypeParam = block_params<__half, __half, 162u>
[ FAILED ] RocprimBlockReduceSingleValueTestsFloating/23.ReduceMultiplies, where TypeParam = block_params<__half, __half, 255u>

8 FAILED TESTS of Test 14 - rocprim.block_reduce (Failed)

98% tests passed, 1 tests failed out of 51

Label Time Summary:
hip = 500.81 sec*proc (49 tests)

Total Test time (real) = 502.70 sec

The following tests FAILED:
14 - rocprim.block_reduce (Failed)
Errors while running CTest

Three questions:

Could you fix this, or should I contact the people from the Gentoo distribution (I am using RHEL 9.1 (AlmaLinux 9.1))?
Would you mind adding the actual name of the GPU on which the tests run to the terminal output, not just an ID, for clarity*?
Is 502.70 sec a reasonable time?

*My problem is that the IDs (node numbers) which I set using export HIP_VISIBLE_DEVICES=0; export ROCR_VISIBLE_DEVICES=0 before running the tests don't match rocminfo. I compiled rocPRIM specifically for gfx1031 using CXX=/opt/rocm/hip/bin/hipcc cmake -DAMDGPU_TARGETS=gfx1031 ../. but I am still not sure it runs on the dedicated GPU.

Building issue for rocPRIM exclusive_scan

Issue
Build issue in mxnet related to rocPRIM's exclusive_scan when integrating rocPRIM in place of cub-hip.
I am running into this issue and need some help.

/opt/rocm/hipcub/include/hipcub/rocprim/block/block_scan.hpp:193:20: error: no matching member function for call to 'exclusive_scan'
base_type::exclusive_scan(input, output, temp_storage_);
~~~~~~~~~~~^~~~~~~~~~~~~~
src/operator/tensor/./cast_storage-inl.cuh:445:33: note: in instantiation of member function 'hipcub::BlockScan<long, 256, hipcub::BLOCK_SCAN_RAKING, 1, 1, 1>::ExclusiveSum' requested here
BlockScan(temp_storage).ExclusiveSum(nnz, nnz);

The error is that no matching member function is found for the exclusive_scan call in block_scan.hpp: https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/master/hipcub/include/hipcub/rocprim/block/block_scan.hpp#L193

Add support for NAVI22 and NAVI23 i.e. gfx1031 and gfx1032

NAVI22 and NAVI23, i.e. gfx1031 and gfx1032, are needed for HIP programming/software development using rocThrust on notebooks, as ALL AMD mobile GPUs are NAVI22 or NAVI23, i.e. gfx1031 and gfx1032. rocPRIM is a rocThrust dependency.

The target operating system is RHEL9 = AlmaLinux 9, RockyLinux...

My approach is to develop / adapt / test engineering and scientific software on my Notebook before investing in dedicated HPC hardware.

merge_sort

Does device wide merge_sort support in-place sorting (keys_input == keys_output and values_input == values_output)?
What about radix sort? I guess it does not support in-place sorting, but I could not find anything in the docs about it.

Following the rocPRIM custom types example fails to compile

I have tried to write some custom float16 type code by following the example in rocPRIM/test/rocprim/test_utils_custom_test_types.hpp. But after including the rocPRIM header files as follows:

#include <rocprim/type_traits.hpp>
#include <rocprim/detail/radix_sort.hpp>

the following compilation error appears:

...
/opt/rocm/include/rocprim/types.hpp:164:21: error: ‘_Float16’ does not name a type; did you mean ‘bfloat16’?
 using native_half = _Float16;
                     ^~~~~~~~
                     bfloat16
...

I wonder what is missing. Are some parameters configured wrongly, or has rocPRIM been installed incorrectly? What could be the cause? Please help! Thanks in advance!

ROCm 3.9.1 fails to compile benchmark_warp_scan

Describe the bug
Compiler crashes when compiling benchmark_warp_scan with specific parameters.

To Reproduce
Steps to reproduce the behavior:

Install rocm-dkms version 3.9.1
Checkout rocPRIM develop at this commit
Set HCC_AMDGPU_TARGET env var to match CMake AMDGPU_TARGETS, as per this issue. (otherwise you get unwanted targets in the binary too)
Configure
Build benchmark_warp_scan
See crash or open logs named Build.log

PowerShell script (port at your convenience) which enumerates all possible combinations:

$configs = @(
     @('gfx803','gfx900','gfx906','gfx908'),
     @('gfx803','gfx900','gfx906'),
     @('gfx803','gfx900','gfx908'),
     @('gfx803','gfx906','gfx908'),
     @('gfx900','gfx906','gfx908'),
     @('gfx803','gfx900'),
     @('gfx803','gfx908'),
     @('gfx906','gfx908'),
     @('gfx900','gfx908'),
     @('gfx803','gfx906'),
     @('gfx803'),
     @('gfx900'),
     @('gfx906'),
     @('gfx908')
 );
$source_dir = "where you cloned your repo";
$build_root = "an existing empty folder";
foreach ($config in $configs) {
  $build_dir = ("$build_root/" + ($config -join "_"));
  New-Item -Type Directory $build_dir | Out-Null;
  $env:HCC_AMDGPU_TARGET = ($config -join ","); # hipcc expects comma delimited list
  cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -DCMAKE_BUILD_TYPE=Release -DBUILD_TEST=OFF -DBUILD_BENCHMARK=ON ("-DAMDGPU_TARGETS=" + ($config -join ";")) -S $source_dir -B $build_dir 2>&1 | Out-File $build_dir/Configure.log; # CMake expects semi-colon delimited list, need to guard from shell with quotes
  cmake --build $build_dir --target benchmark_warp_scan -- VERBOSE=1 2>&1 | Out-File $build_dir/Build.log;
}
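As a cross-check on the hand-written list of configurations, the non-empty subsets of the four targets can also be enumerated programmatically. Four targets give 2^4 - 1 = 15 non-empty subsets, while the list above has 14 entries. The sketch below is a C++ illustration, not a port of the script; join and non_empty_subsets are hypothetical helpers. It also shows the two delimiter forms the script relies on: commas for HCC_AMDGPU_TARGET and semicolons for AMDGPU_TARGETS.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Joins the parts with the given separator, e.g. "," or ";".
std::string join(const std::vector<std::string>& parts, const std::string& sep)
{
    std::string out;
    for(std::size_t i = 0; i < parts.size(); ++i)
    {
        if(i != 0)
        {
            out += sep;
        }
        out += parts[i];
    }
    return out;
}

// Enumerates every non-empty subset of `targets` using a bit mask:
// bit i of the mask decides whether targets[i] is part of the subset.
std::vector<std::vector<std::string>> non_empty_subsets(const std::vector<std::string>& targets)
{
    assert(targets.size() < 32 && "the mask is a 32-bit value");
    std::vector<std::vector<std::string>> result;
    const unsigned int limit = 1u << targets.size();
    for(unsigned int mask = 1; mask < limit; ++mask)
    {
        std::vector<std::string> subset;
        for(std::size_t bit = 0; bit < targets.size(); ++bit)
        {
            if(mask & (1u << bit))
            {
                subset.push_back(targets[bit]);
            }
        }
        result.push_back(subset);
    }
    return result;
}
```

For each subset, join(subset, ",") gives the value for the environment variable, join(subset, ";") gives the CMake list, and join(subset, "_") gives a build directory name in the style used above.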

These are the configurations that compile successfully:

# List folders that somewhere recursively contain the executable
(Get-ChildItem $build_root -Recurse -filter benchmark_warp_scan).Directory.Parent.BaseName
gfx900
gfx803
gfx908
gfx803_gfx908
gfx803_gfx900_gfx908
gfx803_gfx900_gfx906
gfx803_gfx906_gfx908

These are the configurations that didn't compile successfully:

# Do the same, but this time do an extra `ls` of $build_root and exclude folders with the executables in them.
(Get-ChildItem $build_root -Exclude (gci $build_root -Recurse -filter benchmark_warp_scan).Directory.Parent.BaseName).BaseName
gfx803_gfx900
gfx803_gfx900_gfx906_gfx908
gfx803_gfx906
gfx900_gfx906_gfx908
gfx900_gfx908
gfx906
gfx906_gfx908

Note that moving to the latest develop commit which removed XNACK_FLAGS in favor of more complex target names, the very same analysis can be done with a slightly altered script to omit funky path names:

$configs = @(
@('gfx900:xnack-','gfx906:xnack-','gfx908:xnack-'),
@('gfx900:xnack-','gfx906:xnack-'),
@('gfx900:xnack-','gfx908:xnack-'),
@('gfx906:xnack-','gfx908:xnack-'),
@('gfx900:xnack-'),
@('gfx906:xnack-'),
@('gfx908:xnack-')
)
foreach ($config in $configs) {
$build_dir = ("$build_root/" + ($config -join "_").Replace(":xnack-",""));
...
}

Successful builds:

gfx900
gfx906
gfx908
gfx900_gfx908
gfx900_gfx906

Failing builds:

gfx900_gfx906_gfx908                                                         
gfx906_gfx908

Expected behavior
Compiler emitting a descriptive error if it was asked to do something impossible, or compile correctly otherwise.

Log-files
This is the output of one such failure.
The funky arch names produce the same crash with a few extra hipcc and preprocessor warnings:

Warning: The specified HIP target: gfx906:xnack- is unknown. Correct compilation is not guaranteed.
Warning: The specified HIP target: gfx908:xnack- is unknown. Correct compilation is not guaranteed.
In file included from <built-in>:752:
<command line>:3:26: warning: ISO C99 requires whitespace after the macro name [-Wc99-extensions]
#define __HIP_ARCH_GFX906:XNACK-__ 1
                         ^
<command line>:4:26: warning: ISO C99 requires whitespace after the macro name [-Wc99-extensions]
#define __HIP_ARCH_GFX908:XNACK-__ 1
                         ^

Environment
environment.txt

Additional context

error: invalid operands to binary expression

Describe the bug
I'm trying to build pytorch with rocm support but I'm getting an error related to rocprim, specifically the file device_scan.hpp

In file included from /media/nvme/scratch/yay/python-pytorch-rocm/src/pytorch-1.10.2-rocm/aten/src/ATen/native/hip/IndexKernel.hip:13:
In file included from /media/nvme/scratch/yay/python-pytorch-rocm/src/pytorch-1.10.2-rocm/aten/src/ATen/hip/cub.cuh:26:
In file included from /opt/rocm/include/hipcub/hipcub.hpp:36:
In file included from /opt/rocm/include/hipcub/backend/rocprim/hipcub.hpp:77:
In file included from /opt/rocm/include/hipcub/backend/rocprim/device/device_run_length_encode.hpp:35:
In file included from /opt/rocm/include/rocprim/device/device_run_length_encode.hpp:37:
In file included from /opt/rocm/include/rocprim/device/device_select.hpp:33:
/opt/rocm/include/rocprim/device/device_scan.hpp:531:27: error: invalid operands to binary expression ('at::cuda::cub::impl::chained_iterator<long, unsigned char *>' and 'size_t' (aka 'unsigned long'))
                    input + offset, output + offset, current_size, initial_value,
                    ~~~~~ ^ ~~~~~~
/opt/rocm/hip/include/hip/amd_detail/amd_hip_runtime.h:270:87: note: expanded from macro 'hipLaunchKernelGGL'
#define hipLaunchKernelGGL(kernelName, ...)  hipLaunchKernelGGLInternal((kernelName), __VA_ARGS__)
                                                                                      ^~~~~~~~~~~
/opt/rocm/hip/include/hip/amd_detail/amd_hip_runtime.h:267:78: note: expanded from macro 'hipLaunchKernelGGLInternal'
        kernelName<<<(numBlocks), (numThreads), (memPerBlock), (streamId)>>>(__VA_ARGS__);         \
                                                                             ^~~~~~~~~~~

To Reproduce
Trying to build pytorch with rocm support using the variable PYTORCH_ROCM_ARCH=gfx1030 should trigger this issue.

Expected behavior
Pytorch should build without any errors

Environment
environment.txt is attached.

Thanks.
environment.txt

rocPRIM defines a __atomic_work_item_fence with conflicting types to original

AMD Internal Bug:

rocPRIM defines in file /opt/rocm/rocprim/include/rocprim/intrinsics/thread.hpp:282:36:
extern "C" ROCPRIM_DEVICE void __atomic_work_item_fence(unsigned int, unsigned int, unsigned int);

Original definition in /opt/rocm/include/hip/hcc_detail/device_library_decls.h:115:1 is:
__atomic_work_item_fence(__cl_mem_fence_flags, __memory_order, __memory_scope);

This causes compile time errors for PyTorch.
It looks like a forward declaration gone wrong.

HIP_PLATFORM=nvcc, cmake error

I am trying to compare performance between an MI50 and an NVIDIA V100. On the AMD system, build and test are successful, but on the NVIDIA system, with HIP_PLATFORM=nvcc set, CMake reports an error:
CXX=hipcc cmake -DBUILD_BENCHMARK=ON ..
-- The CXX compiler identification is GNU 7.3.1
-- Check for working CXX compiler: /opt/rocm/bin/hipcc
-- Check for working CXX compiler: /opt/rocm/bin/hipcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting build type to 'Release' as none was specified.
CMake Warning (dev) at CMakeLists.txt:44 (set):
implicitly converting 'BOOLEAN' to 'STRING' type.
This warning is for project developers. Use -Wno-dev to suppress it.

-- Found HIP: /opt/rocm (found suitable version "3.0.19493-36529b16", minimum required is "1.5.18263")
CMake Error at cmake/VerifyCompiler.cmake:61 (message):
HIP_PLATFORM must be 'hcc' (AMD ROCm platform)
Call Stack (most recent call first):
CMakeLists.txt:47 (include)

-- Configuring incomplete, errors occurred!
See also "/home/alice/test/rocPRIM/build/CMakeFiles/CMakeOutput.log".
[alice@prj47-rack-96 build]$ CXX=hcc cmake -DBUILD_BENCHMARK=ON ..
CMake Warning (dev) at CMakeLists.txt:44 (set):
implicitly converting 'BOOLEAN' to 'STRING' type.
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Error at cmake/VerifyCompiler.cmake:61 (message):
HIP_PLATFORM must be 'hcc' (AMD ROCm platform)
Call Stack (most recent call first):
CMakeLists.txt:47 (include)

-- Configuring incomplete, errors occurred!
See also "/home/alice/test/rocPRIM/build/CMakeFiles/CMakeOutput.log".
[alice@prj47-rack-96 build]$ CXX=hipcc cmake -DBUILD_BENCHMARK=ON ..
CMake Warning (dev) at CMakeLists.txt:44 (set):
implicitly converting 'BOOLEAN' to 'STRING' type.
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Error at cmake/VerifyCompiler.cmake:61 (message):
HIP_PLATFORM must be 'hcc' (AMD ROCm platform)
Call Stack (most recent call first):
CMakeLists.txt:47 (include)

-- Configuring incomplete, errors occurred!
See also "/home/alice/test/rocPRIM/build/CMakeFiles/CMakeOutput.log".

Unable to build rocPRIM

I am still observing issue #10.

I also tried the fix__half_tests_on_new_hip branch, and I am observing the issue with that branch as well.
rocPRIM_MasterBranch_issue.log

I have attached the complete logs for both the master and fix__half_tests_on_new_hip branches.
fix_half_log.txt

"warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation" observed with rocprim

Steps to build rocprim
git clone -b master https://github.com/ROCmSoftwarePlatform/rocprim.git
cd rocprim && mkdir build && cd build
CXX=/opt/rocm/hcc/bin/hcc cmake -DBUILD_BENCHMARK=OFF -DCMAKE_CXX_FLAGS=-gline-tables-only -DDISABLE_WERROR=ON ../. | tee rocPRIM_build-gline.log
make -j16 | tee -a rocPRIM_build-gline.log

Please see file below for output
rocPRIM_build-gline.log

CMake 3.10 requirement

Hi, is there a special reason for the cmake 3.10 requirement? Ubuntu 16.04 hosts only version 3.5. Also, cmake 3.10 fails to find rocm package, whereas 3.5 works fine for me.

Build error: no template named 'texture_cache_iterator'

When I build a CUDA program using CUB and rocPRIM, I get the following error:

In file included from ../src/acc/hip/hip_utils_cub.hpp:32:
In file included from /opt/rocm/rocPRIM/include/hipcub/hipcub.hpp:34:
In file included from /opt/rocm/rocPRIM/include/hipcub/rocprim/../rocprim/hipcub.hpp:33:
/opt/rocm/rocPRIM/include/hipcub/rocprim/iterator/tex_obj_input_iterator.hpp:32:40: error: no template named 'texture_cache_iterator'
in namespace 'rocprim'
using TexObjInputIterator = ::rocprim::texture_cache_iterator<T, OffsetT>;
~~~~~~~~~~~^

The included file hip_utils_cub.hpp is as follows (line numbers added):

22 #include <rocprim/rocprim.hpp>
23 #include <rocprim/rocprim_hip.hpp>
24
25 #include <rocprim/device/device_radix_sort_hip.hpp>
26 #include <rocprim/device/device_reduce_hip.hpp>
27 #include <rocprim/device/device_scan_hip.hpp>
28 #include <rocprim/device/device_select_hip.hpp>
29
30
31 #include <hipcub/rocprim/util_type.hpp>
32 #include <hipcub/hipcub.hpp> // error line

Compiler Build issue

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

Install '...' version '...'
Run '...' with data '...'
See error on logfile '...'

Expected behavior
A clear and concise description of what you expected to happen.

Log-files
Add full logfiles to help explain your problem.

Environment
Make sure that ROCm is correctly installed and run the following command:

printf '=== environment\n' > environment.txt && 
printf '\n\n=== date\n' >> environment.txt && date >> environment.txt && 
printf '\n\n=== Linux Kernel\n' >> environment.txt && uname -a  >> environment.txt && 
printf '\n\n=== rocm-smi' >> environment.txt && rocm-smi  >> environment.txt && 
printf '\n\n' >> environment.txt && hipconfig  >> environment.txt && 
printf '\n\n=== rocminfo\n' >> environment.txt && rocminfo  >> environment.txt && 
printf '\n\n=== lspci VGA\n' >> environment.txt && lspci | grep -i vga >> environment.txt

Attach environment.txt

Additional context
Add any other context about the problem here.

Building rocPRIM warnings

I'm running into some implicit conversion diagnostics, which are normally warnings, but with -Werror set they show up as errors. Could you help fix these issues?

/work/rocPRIM/test/rocprim/test_hip_block_histogram.cpp:167:73: error: implicit conversion from
      'unsigned long' to 'unsigned char' changes value from 1023 to 255 [-Werror,-Wconstant-conversion]
    std::vector<T> output = test_utils::get_random_data<T>(size, 0, bin - 1);
                            ~~~~~~~~~~                              ~~~~^~~
/work/rocPRIM/build/gtest/include/gtest/internal/gtest-internal.h:475:44: note: in instantiation of member
      function 'RocprimBlockHistogramInputArrayTests_Histogram_Test<params<unsigned char, unsigned char,
      1024, 1, 1024, rocprim::block_histogram_algorithm::using_sort> >::TestBody' requested here
  Test* CreateTest() override { return new TestClass; }
                                           ^
/work/rocPRIM/build/gtest/include/gtest/internal/gtest-internal.h:726:13: note: in instantiation of member
      function
      'testing::internal::TestFactoryImpl<RocprimBlockHistogramInputArrayTests_Histogram_Test<params<unsigned
      char, unsigned char, 1024, 1, 1024, rocprim::block_histogram_algorithm::using_sort> > >::CreateTest'
      requested here
        new TestFactoryImpl<TestClass>);
            ^

/work/rocPRIM/test/rocprim/test_hip_device_partition.cpp:274:48: error: implicit conversion from 'int' to
      'unsigned char' changes value from 345 to 89 [-Werror,-Wconstant-conversion]
            rocprim::make_constant_iterator<T>(345),
            ~~~~~~~                            ^~~
/work/rocPRIM/test/rocprim/test_hip_device_partition.cpp:37:60: note: expanded from macro 'HIP_CHECK'
#define HIP_CHECK(error) ASSERT_EQ(static_cast<hipError_t>(error),hipSuccess)
                                                           ^~~~~
/work/rocPRIM/build/gtest/include/gtest/gtest.h:2078:48: note: expanded from macro 'ASSERT_EQ'
# define ASSERT_EQ(val1, val2) GTEST_ASSERT_EQ(val1, val2)
                                               ^~~~
/work/rocPRIM/build/gtest/include/gtest/gtest.h:2062:23: note: expanded from macro 'GTEST_ASSERT_EQ'
                      val1, val2)
                      ^~~~
/work/rocPRIM/build/gtest/include/gtest/gtest_pred_impl.h:168:36: note: expanded from macro
      'ASSERT_PRED_FORMAT2'
  GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_FATAL_FAILURE_)
                                   ^~
/work/rocPRIM/build/gtest/include/gtest/gtest_pred_impl.h:149:39: note: expanded from macro
      'GTEST_PRED_FORMAT2_'
  GTEST_ASSERT_(pred_format(#v1, #v2, v1, v2), \
                                      ^~
/work/rocPRIM/build/gtest/include/gtest/gtest_pred_impl.h:77:52: note: expanded from macro 'GTEST_ASSERT_'
  if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                   ^~~~~~~~~~
/work/rocPRIM/build/gtest/include/gtest/internal/gtest-internal.h:475:44: note: in instantiation of member
      function 'RocprimDevicePartitionTests_PredicateEmptyInput_Test<DevicePartitionParams<unsigned char,
      float, unsigned int, false> >::TestBody' requested here
  Test* CreateTest() override { return new TestClass; }
                                           ^
/work/rocPRIM/build/gtest/include/gtest/internal/gtest-internal.h:726:13: note: in instantiation of member
      function
      'testing::internal::TestFactoryImpl<RocprimDevicePartitionTests_PredicateEmptyInput_Test<DevicePartitionParams<unsigned
      char, float, unsigned int, false> > >::CreateTest' requested here
        new TestFactoryImpl<TestClass>);
            ^

These show up when building rocPRIM tests
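The two diagnostics above report values that do not fit into unsigned char: 1023 becomes 255 and 345 becomes 89, because conversion to an unsigned 8-bit type keeps the value modulo 256. The sketch below reproduces that arithmetic in plain C++ and shows one possible way to make the narrowing explicit; clamp_to is a hypothetical helper, not something taken from the rocPRIM tests, and whether clamping is the right fix for those tests is an assumption.

```cpp
#include <cassert>
#include <limits>

// Explicit casts make the narrowing visible and avoid the implicit
// constant-conversion diagnostic. The value is reduced modulo 256.
constexpr unsigned char wrapped_bin   = static_cast<unsigned char>(1023UL); // 1023 mod 256
constexpr unsigned char wrapped_value = static_cast<unsigned char>(345);    // 345 mod 256
static_assert(wrapped_bin == 255, "matches 'changes value from 1023 to 255'");
static_assert(wrapped_value == 89, "matches 'changes value from 345 to 89'");

// Hypothetical helper: saturates `value` to the largest value T can hold
// instead of letting it wrap around. Intended for unsigned integer types.
template<class T>
constexpr T clamp_to(unsigned long value)
{
    return value > static_cast<unsigned long>(std::numeric_limits<T>::max())
               ? std::numeric_limits<T>::max()
               : static_cast<T>(value);
}
```

With clamping, clamp_to<unsigned char>(345) yields 255 rather than 89, while values that already fit, such as clamp_to<unsigned char>(200), are returned unchanged.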

Add device function version for HIP platform.

I'm running into an error with the latest rocPRIM develop branch on HIP-Clang. The error is due to the function half_to_native in test_utils.hpp, which doesn't have a host version but is being called by a host function.

In file included from /work/rocprim/test/rocprim/test_hip_device_scan.cpp:35:
/work/rocprim/test/rocprim/test_utils.hpp:606:19: error: no matching function for call to 'half_to_native'
        ASSERT_EQ(half_to_native(result[i]), half_to_native(expected[i])) << "where index = " << i;
                  ^~~~~~~~~~~~~~
/work/rocprim/build/gtest/include/gtest/gtest.h:2078:48: note: expanded from macro 'ASSERT_EQ'
# define ASSERT_EQ(val1, val2) GTEST_ASSERT_EQ(val1, val2)
                                               ^~~~
/work/rocprim/build/gtest/include/gtest/gtest.h:2061:55: note: expanded from macro 'GTEST_ASSERT_EQ'
                      EqHelper<GTEST_IS_NULL_LITERAL_(val1)>::Compare, \
                                                      ^~~~
/work/rocprim/build/gtest/include/gtest/internal/gtest-internal.h:155:7: note: expanded from macro 'GTEST_IS_NULL_LITERAL_'
      x,                                             \
      ^
/work/rocprim/build/gtest/include/gtest/gtest_pred_impl.h:168:23: note: expanded from macro 'ASSERT_PRED_FORMAT2'
  GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_FATAL_FAILURE_)
                      ^~~~~~~~~~~
/work/rocprim/build/gtest/include/gtest/gtest_pred_impl.h:149:17: note: expanded from macro 'GTEST_PRED_FORMAT2_'
  GTEST_ASSERT_(pred_format(#v1, #v2, v1, v2), \
                ^~~~~~~~~~~
/work/rocprim/build/gtest/include/gtest/gtest_pred_impl.h:77:52: note: expanded from macro 'GTEST_ASSERT_'
  if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                   ^~~~~~~~~~
/work/rocprim/test/rocprim/test_utils.hpp:57:15: note: candidate function not viable: call to __device__ function from __host__ function
rocprim::half half_to_native(const rocprim::half& x)
              ^

As you can see in the final line, the host version of half_to_native is removed by the #else clause. HIP-Clang is very strict here (same as CUDA): host functions cannot call functions that have only the device attribute. Is it possible to use ROCPRIM_HOST_DEVICE instead on lines 56 and 62 of test_utils.hpp?
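
The request amounts to marking the helper as callable from both host and device code. A minimal sketch of that idea, assuming a hypothetical macro SKETCH_HOST_DEVICE standing in for ROCPRIM_HOST_DEVICE and a hypothetical fake_half type standing in for rocprim::half; it is not the code in test_utils.hpp:

```cpp
#include <cassert>

// Hypothetical stand-in for ROCPRIM_HOST_DEVICE: expands to both attributes
// under a HIP/CUDA compiler and to nothing under a plain host compiler.
#if defined(__HIPCC__) || defined(__CUDACC__)
    #define SKETCH_HOST_DEVICE __host__ __device__
#else
    #define SKETCH_HOST_DEVICE
#endif

// Hypothetical stand-in for rocprim::half; it stores a float for illustration only.
struct fake_half
{
    float value;
};

// Because the function carries both attributes, host code (such as a
// GoogleTest assertion) and device code (such as a kernel) can both call it.
SKETCH_HOST_DEVICE inline float half_to_native_sketch(const fake_half& x)
{
    return x.value;
}
```

With only the device attribute, the call inside ASSERT_EQ has no viable candidate on the host side, which is what the last note in the log reports.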

Question: Performance of radix sort with radix 6 and 7

I see that the radix sort implementation uses 5 iterations: 7 + 7 + 6 + 6 + 6 = 32 bits.
Usually we use a radix-4 implementation. How much performance do you get with radix 6 and 7?
As I understand it, increasing the radix increases the histogram size: for radix 7 it is 2^7 = 128 items per thread. I see that AMD GPUs have only 64 KB of LDS. With 256 threads, the histogram size would be 128 * 256 = 32768 integers. This would greatly reduce performance, because only a few wavefronts could run on each CU due to the lack of LDS space.

Correct me if I am wrong.
Can you shed some light on the implementation with radix 6 and 7?
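
The histogram-size estimate in this question can be checked with a few lines of arithmetic. A sketch, assuming one 4-byte counter per digit per thread (the layout the question describes, not necessarily what rocPRIM does) and the 64 KB LDS figure quoted in the question; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of histogram bins for a given radix width in bits.
constexpr std::size_t bins_for_radix(unsigned radix_bits)
{
    return std::size_t{1} << radix_bits;
}

// Bytes needed if every thread keeps its own histogram of 4-byte counters.
constexpr std::size_t per_thread_histogram_bytes(unsigned radix_bits, std::size_t threads)
{
    return bins_for_radix(radix_bits) * threads * sizeof(std::uint32_t);
}

// LDS capacity quoted in the question.
constexpr std::size_t lds_bytes = 64 * 1024;

// Whether the per-thread layout fits into LDS at all.
constexpr bool fits_in_lds(unsigned radix_bits, std::size_t threads)
{
    return per_thread_histogram_bytes(radix_bits, threads) <= lds_bytes;
}
```

Under these assumptions, radix 7 with 256 threads needs 128 * 256 counters, or 128 KiB, which does not fit into 64 KB, while radix 4 needs 16 * 256 counters, or 16 KiB. A per-thread histogram is therefore not a workable layout for radix 7, so the implementation presumably shares bins between threads.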

Building rocPRIM issue

Hi @jszuppe, I need some help with building rocPRIM. I am running into this:

root@29f5659f5f83:~/rocPRIM/build# cmake -DBUILD_BENCHMARK=ON ../.
CMake Error at cmake/Dependencies.cmake:138 (find_package):
  Could not find a package configuration file provided by "ROCM" with any of
  the following names:

    ROCMConfig.cmake
    rocm-config.cmake

  Add the installation prefix of "ROCM" to CMAKE_PREFIX_PATH or set
  "ROCM_DIR" to a directory containing one of the above files.  If "ROCM"
  provides a separate development package or SDK, be sure it has been
  installed.
Call Stack (most recent call first):
  CMakeLists.txt:63 (include)


-- Configuring incomplete, errors occurred!
See also "/root/rocPRIM/build/CMakeFiles/CMakeOutput.log".
See also "/root/rocPRIM/build/CMakeFiles/CMakeError.log".

I have the rocm-dkms package installed from https://github.com/RadeonOpenCompute/ROCm and have verified that rocminfo and the HIP samples run. I'm not sure what else is needed for rocPRIM.

Ignoring return value warning

Describe the bug

When building with Clang (ROCM-4.1.0)

/rocthrust/rocprim/intrinsics/thread.hpp:45:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    hipGetDevice(&default_hip_device);
    ^~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~
/rocthrust/rocprim/intrinsics/thread.hpp:47:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    hipGetDeviceProperties(&device_prop,default_hip_device);

To Reproduce
Build with Clang from:
https://github.com/RadeonOpenCompute/ROCm-Device-Libs/archive/rocm-4.1.0.tar.gz
https://github.com/RadeonOpenCompute/llvm-project/archive/rocm-4.1.0.tar.gz

Expected behavior
No warnings
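
The warnings go away once the return values are consumed. A sketch of the two usual options, using a hypothetical fake_get_device in place of hipGetDevice so that it builds without HIP (the [[nodiscard]] attribute is standard from C++17 on):

```cpp
#include <cassert>

// Hypothetical stand-ins for hipError_t and hipGetDevice.
enum class fake_error { success, invalid_value };

[[nodiscard]] inline fake_error fake_get_device(int* device)
{
    if (device == nullptr) return fake_error::invalid_value;
    *device = 0;
    return fake_error::success;
}

// Option 1: check the result and report failure to the caller.
inline bool query_device_checked(int& device)
{
    const fake_error err = fake_get_device(&device);
    return err == fake_error::success;
}

// Option 2: discard the result explicitly when failure is acceptable.
inline void query_device_ignored(int& device)
{
    static_cast<void>(fake_get_device(&device));
}
```

Option 1 is usually preferable in library code, since a failed device query would otherwise leave the output variable untouched.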

error: no matching function for call to 'ceiling_div'

I am using the header file "device_vector.h" from the rocThrust library to run code on an AMD GPU. When I compile the code using hipcc, the following error appears from rocPRIM: /opt/rocm/rocprim/include/rocprim/../../../include/rocprim/device/device_transform_config.hpp:50:9: error: no matching function for call to 'ceiling_div'
::rocprim::detail::ceiling_div(sizeof(Value), sizeof(int));

Here is the full output message:

In file included from /opt/rocm/hip/include/thrust/device_vector.h:26:
In file included from /opt/rocm/hip/include/thrust/detail/vector_base.h:586:
In file included from /opt/rocm/hip/include/thrust/detail/vector_base.inl:26:
In file included from /opt/rocm/hip/include/thrust/equal.h:235:
In file included from /opt/rocm/hip/include/thrust/detail/equal.inl:26:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/equal.h:46:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/equal.inl:21:
In file included from /opt/rocm/hip/include/thrust/mismatch.h:257:
In file included from /opt/rocm/hip/include/thrust/detail/mismatch.inl:27:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/mismatch.h:56:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/mismatch.inl:21:
In file included from /opt/rocm/hip/include/thrust/find.h:381:
In file included from /opt/rocm/hip/include/thrust/detail/find.inl:25:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/find.h:61:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/find.inl:19:
In file included from /opt/rocm/hip/include/thrust/reduce.h:781:
In file included from /opt/rocm/hip/include/thrust/detail/reduce.inl:28:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/reduce_by_key.h:87:
In file included from /opt/rocm/hip/include/thrust/system/detail/generic/reduce_by_key.inl:37:
In file included from /opt/rocm/hip/include/thrust/scan.h:1656:
In file included from /opt/rocm/hip/include/thrust/detail/scan.inl:28:
In file included from /opt/rocm/hip/include/thrust/system/detail/adl/scan.h:44:
In file included from /opt/rocm/hip/include/thrust/system/hip/detail/scan.h:48:
In file included from /opt/rocm/rocprim/include/rocprim/rocprim.hpp:20:
In file included from /opt/rocm/rocprim/include/rocprim/../../../include/rocprim/rocprim.hpp:52:
In file included from /opt/rocm/rocprim/include/rocprim/../../../include/rocprim/device/device_adjacent_difference.hpp:29:
In file included from /opt/rocm/rocprim/include/rocprim/../../../include/rocprim/device/device_transform.hpp:35:
/opt/rocm/rocprim/include/rocprim/../../../include/rocprim/device/device_transform_config.hpp:50:9: error: no matching function for call to 'ceiling_div'
::rocprim::detail::ceiling_div(sizeof(Value), sizeof(int));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/rocprim/include/rocprim/../../../include/rocprim/device/device_transform_config.hpp:59:9: error: no matching function for call to 'ceiling_div'
::rocprim::detail::ceiling_div(sizeof(Value), sizeof(int));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/rocprim/include/rocprim/../../../include/rocprim/device/device_transform_config.hpp:68:9: error: no matching function for call to 'ceiling_div'
::rocprim::detail::ceiling_div(sizeof(Value), sizeof(int));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/rocprim/include/rocprim/../../../include/rocprim/device/device_transform_config.hpp:77:9: error: no matching function for call to 'ceiling_div'
::rocprim::detail::ceiling_div(sizeof(Value), sizeof(int));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
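
For context, the failing expression rounds a size in bytes up to a whole number of int-sized words. A sketch of that computation, using a hypothetical ceiling_div_sketch rather than rocPRIM's own rocprim::detail::ceiling_div, whose declaration is not shown in this report:

```cpp
#include <cassert>
#include <cstddef>

// Integer division rounded up; a hypothetical stand-in, not rocPRIM's implementation.
constexpr std::size_t ceiling_div_sketch(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b;
}

// Number of int-sized words needed to hold one Value.
template <class Value>
constexpr std::size_t words_per_value()
{
    return ceiling_div_sketch(sizeof(Value), sizeof(int));
}
```

A "no matching function" error for such a call usually means that the visible overloads do not accept the argument types, or that the header providing the function is not the version the caller expects.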

rocPRIM reduction (block_reduce_int) issue

Describe the bug
My simple reduction kernel that uses block_reduce_int() produces incorrect results with rocPRIM release branches 4.4+. Release branch 4.3 works fine. The same code compiles and runs fine with HIP on V100 and A100 based NVIDIA systems (CSC Puhti and Mahti supercomputers). I'm not 100% sure it is a rocPRIM-side issue, but it looks like it.

To Reproduce
Compiling and running https://github.com/hokkanen/rocprim_issue/blob/master/hipcub_demo.cpp with rocPRIM release branches 4.4+ (see lines 14 and 17 to include the rocPRIM header) on Lumi supercomputer produces the issue.

Expected behavior
I expect to see the following:

The results calculated by GPU = 499500 and CPU = 499500 match!

Instead, I see this:

The results calculated by GPU = 460320 and CPU = 499500 do not match!

Environment
environment.txt

memory access fault when linking dynamic lib that uses hipcub

I am running into a memory access fault when linking to a dynamic library that uses hipcub if I use hipcub also in the main program (that links to the package that uses hipcub).
Please see the following minimal example hipcub_test.tar.gz

Edit: This does not happen when using rocprim or cub instead of hipcub.

Please add gfx906 to support Vega20 in cmakelists

Can you please add gfx906 to support Vega20 in CMakeLists.txt at https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/master/CMakeLists.txt#L70

Porting rocPRIM onto HIP-Clang

Hi rocPRIM team, I'm working on porting rocPRIM to be compiled with HIP-Clang. I'm running into some issues that I'm not sure how to resolve since I don't understand the nature of the test. Here is what it looks like:

/root/rocPRIM/test/rocprim/test_hip_block_histogram.cpp:167:73: error: implicit conversion from
      'unsigned long' to 'unsigned char' changes value from 1023 to 255 [-Werror,-Wconstant-conversion]
    std::vector<T> output = test_utils::get_random_data<T>(size, 0, bin - 1);
                            ~~~~~~~~~~                              ~~~~^~~
/root/rocPRIM/build/gtest/include/gtest/internal/gtest-internal.h:468:43: note: in instantiation of member
      function 'RocprimBlockHistogramInputArrayTests_Histogram_Test<params<unsigned char, unsigned char,
      1024, 1, 1024, rocprim::block_histogram_algorithm::using_sort> >::TestBody' requested here
  virtual Test* CreateTest() { return new TestClass; }
                                          ^
/root/rocPRIM/build/gtest/include/gtest/internal/gtest-internal.h:634:13: note: in instantiation of member
      function
      'testing::internal::TestFactoryImpl<RocprimBlockHistogramInputArrayTests_Histogram_Test<params<unsigned
      char, unsigned char, 1024, 1, 1024, rocprim::block_histogram_algorithm::using_sort> > >::CreateTest'
      requested here
        new TestFactoryImpl<TestClass>);
            ^

I've had this same error for a few other tests as well. I think the HIP-Clang compiler is a little stricter than HCC. Let me know what I should do to fix this. Thanks!
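
The diagnostic fires because bin - 1 (1023 here) does not fit into unsigned char. One way a test could avoid the narrowing is to clamp the upper bound to what T can represent before passing it on. A sketch with a hypothetical helper clamp_upper_bound; whether clamping matches what the test intends is an assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <type_traits>

// Clamp an upper bound to the largest value representable in the integer
// type T, so that the final conversion to T never changes the value.
template <class T>
constexpr T clamp_upper_bound(std::size_t bound)
{
    static_assert(std::is_integral<T>::value, "sketch covers integer types only");
    return static_cast<T>(
        std::min<std::size_t>(bound, static_cast<std::size_t>(std::numeric_limits<T>::max())));
}
```

For T = unsigned char the bound 1023 becomes 255, which is the same value the implicit conversion produces, but the intent is now explicit and the warning no longer applies.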

change packaging for GPU-less hosts

To package rocPRIM, we compile on GPU-less hosts. Currently, the build system defaults to NVIDIA in that case. Additionally, it would be great if there were an option to build and package only the bare minimum required (headers), to avoid needing the HIP stack at compile time.

Alternative documentation

Hello,
I've built alternative documentation that is a little more aesthetically pleasing than the current one.

Here is a preview of it: https://v01dxyz.github.io/rocprim-documentation

Are you interested in this?

Failed to install rocPRIM

Hi, what's the recommended command to install rocPRIM?
The last step failed:

~/rocPRIM/build# make install 
make: *** No rule to make target 'install'.  Stop.

Google test deprecated

The tests do not build anymore.
Building results in:
TYPED_TEST_CASE is deprecated, please use TYPED_TEST_SUITE

GoogleTest has deprecated this macro. Should googletest be pinned to an older version, or should the tests be fixed?
https://github.com/google/googletest/blob/9a502a5b14b4a6160103c1f2c64331772878d86a/googletest/include/gtest/gtest-typed-test.h#L230

Block_reduce fails to distribute correct answer to all lanes when hipBlockDim > 64

Describe the bug

Block_reduce seems to output incorrect values when hipBlockDim > 64 on my Vega64. There seems to be a bug in distributing the output to other warps. The attached example.cpp provides an easy demonstration of the bug.

The #define for NUM_THREADS provides an easy set of tests for the values 64, 128, 192, and 256. NUM_THREADS == 64 outputs all correct answers. However, 128 only produces the correct answer in two threads (Thread0 and Thread1). 192 produces the correct answer in one lane (only Thread0), while 256 produces the correct answer in only four lanes.

The block_reduce code is clearly missing the "broadcast" step in some cases, where Thread0 broadcasts the value back to all 256 threads.

To Reproduce

example.cpp.txt

Compile the attached "example.cpp" file with hipcc example.cpp -I/opt/rocm/rocprim/include/. Play with the #define NUM_THREADS value as you see fit.

Expected behavior
In the case of 192 threads, I expect 18336 (0+1+2+3+4...+191) to be the output for all 192 lanes.

Log-files

The following output is from a NUM_THREADS of 192. The value of 18336 in thread0 is correct, but has failed to be distributed to other lanes.

18336
16320
10208
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
6112
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
10208
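
The values in this log are not arbitrary. Assuming a wavefront size of 64 (as on Vega64) and that thread i contributes the value i, they match per-wavefront partial sums, which is consistent with the final result not being broadcast. The following host-side sketch reproduces the numbers; range_sum is a helper made up for illustration:

```cpp
#include <cassert>

// Sum of the integers in the half-open range [first, last).
constexpr int range_sum(int first, int last)
{
    return (last - first) * (first + last - 1) / 2;
}
```

range_sum(0, 192) is 18336, the expected result. range_sum(0, 64) is 2016, range_sum(64, 128) is 6112, and range_sum(128, 192) is 10208, which are the values most lanes of the first, second, and third wavefront print. range_sum(64, 192) is 16320, the value printed by the second thread.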

Please enable two factor authentication in your github account

@VincentSC;@sbalint98;@Maetveis;@neon60;@mfep;@nolmoonen;@vince-streamhpc

We are going to enforce two-factor authentication in the ROCmSoftwarePlatform organization (https://github.com/ROCmSoftwarePlatform/) on 29th April, 2022.
Since we identified you as an outside collaborator of the ROCmSoftwarePlatform organization, you need to enable two-factor authentication in your GitHub account; otherwise you will be removed from the organization after the enforcement.
Please skip this if already done.

To set up two factor authentication, please go through the steps in below link:

https://docs.github.com/en/free-pro-team@latest/github/authenticating-to-github/configuring-two-factor-authentication

Please email "[email protected]" for queries

error: reference to __host__ function 'inclusive_scan<rocprim::default_config, double *, double *, thrust::plus<double>>' in __host__ __device__ function

Describe the bug
error: reference to __host__ function 'inclusive_scan<rocprim::default_config, double *, double *, thrust::plus<double>>' in __host__ __device__ function

To Reproduce
I implemented the device function below, and it fails to compile with ROCm 3.5.1:

template <typename T>
__global__ void GetCumulativeProbs(T* norm_probs_data,
                                   int64_t num_distributions,
                                   int64_t num_categories,
                                   T* cumulative_probs) {
  for (int id = blockIdx.x; id < num_distributions; id += gridDim.x) {
    thrust::inclusive_scan(thrust::device,
                           norm_probs_data + id * num_categories,
                           norm_probs_data + (id + 1) * num_categories,
                           cumulative_probs + id * num_categories);
  }
}
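
For comparison, this is what the kernel is meant to compute, written as a host-only reference: one inclusive scan per distribution over num_categories values. It is a sketch for checking results, not a fix for the compile error, and the function name is made up:

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Host reference: row-wise inclusive scan of a row-major
// [num_distributions x num_categories] array of probabilities.
template <typename T>
std::vector<T> cumulative_probs_reference(const std::vector<T>& norm_probs,
                                          std::int64_t num_distributions,
                                          std::int64_t num_categories)
{
    assert(static_cast<std::int64_t>(norm_probs.size())
           == num_distributions * num_categories);
    std::vector<T> out(norm_probs.size());
    for (std::int64_t id = 0; id < num_distributions; ++id) {
        const auto first = norm_probs.begin() + id * num_categories;
        // std::partial_sum is an inclusive scan with operator+.
        std::partial_sum(first, first + num_categories,
                         out.begin() + id * num_categories);
    }
    return out;
}
```

The compile error itself comes from the device-wide rocprim::inclusive_scan being a host function: as written, the kernel asks thrust to launch that host-side algorithm from device code.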

Log-files

In file included from /workspace/Github-qili93/Paddle/paddle/fluid/operators/multinomial_op.cu:15:
In file included from /opt/rocm/include/thrust/execution_policy.h:31:
In file included from /opt/rocm/include/thrust/system/cpp/execution_policy.h:64:
In file included from /opt/rocm/include/thrust/system/cpp/detail/sort.h:22:
In file included from /opt/rocm/include/thrust/system/detail/sequential/sort.h:63:
In file included from /opt/rocm/include/thrust/system/detail/sequential/sort.inl:23:
In file included from /opt/rocm/include/thrust/system/detail/sequential/stable_primitive_sort.h:55:
In file included from /opt/rocm/include/thrust/system/detail/sequential/stable_primitive_sort.inl:21:
In file included from /opt/rocm/include/thrust/system/detail/sequential/stable_radix_sort.h:55:
In file included from /opt/rocm/include/thrust/system/detail/sequential/stable_radix_sort.inl:20:
In file included from /opt/rocm/include/thrust/copy.h:513:
In file included from /opt/rocm/include/thrust/detail/copy_if.h:74:
In file included from /opt/rocm/include/thrust/detail/copy_if.inl:20:
In file included from /opt/rocm/include/thrust/system/detail/generic/copy_if.h:63:
In file included from /opt/rocm/include/thrust/system/detail/generic/copy_if.inl:31:
In file included from /opt/rocm/include/thrust/scan.h:1563:
In file included from /opt/rocm/include/thrust/detail/scan.inl:28:
In file included from /opt/rocm/include/thrust/system/detail/adl/scan.h:44:
/opt/rocm/include/thrust/system/hip/detail/scan.h:193:19: error: reference to __host__ function 'inclusive_scan<rocprim::default_config, float *, float *, thrust::plus<float>>' in __host__ __device__ function
        (rocprim::inclusive_scan<rocprim::default_config, InputIt, OutputIt, ScanOp>)
                  ^
/opt/rocm/include/rocprim/device/device_scan.hpp:529:12: note: 'inclusive_scan<rocprim::default_config, float *, float *, thrust::plus<float>>' declared here
hipError_t inclusive_scan(void * temporary_storage,
           ^
3 warnings and 2 errors generated when compiling for gfx906.
CMake Error at multinomial_op_generated_multinomial_op.cu.o.cmake:192 (message):
  Error generating file
  /workspace/Github-qili93/Paddle/build_rocm_nccl/paddle/fluid/operators/CMakeFiles/multinomial_op.dir//./multinomial_op_generated_multinomial_op.cu.o

Environment
Make sure that ROCm is correctly installed and run the following command:

printf '=== environment\n' > environment.txt && 
printf '\n\n=== date\n' >> environment.txt && date >> environment.txt && 
printf '\n\n=== Linux Kernel\n' >> environment.txt && uname -a  >> environment.txt && 
printf '\n\n=== rocm-smi' >> environment.txt && rocm-smi  >> environment.txt && 
printf '\n\n' >> environment.txt && hipconfig  >> environment.txt && 
printf '\n\n=== rocminfo\n' >> environment.txt && rocminfo  >> environment.txt && 
printf '\n\n=== lspci VGA\n' >> environment.txt && lspci | grep -i vga >> environment.txt

Attach environment.txt

temporary buffer allocation

I am running into a gpu memory access fault when I do not explicitly allocate the temporary buffer that is required in rocprim for each function call.

What works:

hipcub::DeviceRadixSort::SortPairs(nullptr, size, ...);
hipMalloc(&buffer, size);
hipcub::DeviceRadixSort::SortPairs(buffer, size, ...);
hipFree(buffer);

hipcub::DeviceRunLengthEncode::Encode(nullptr, size, ...);
hipMalloc(&buffer, size);
hipcub::DeviceRunLengthEncode::Encode(buffer, size, ...);
hipFree(buffer);

hipcub::DeviceScan::ExclusiveSum(nullptr, size, ...);
hipMalloc(&buffer, size);
hipcub::DeviceScan::ExclusiveSum(buffer, size, ...);
hipFree(buffer);

hipcub::DeviceSegmentedRadixSort::SortPairs(nullptr, size, ...);
hipMalloc(&buffer, size);
hipcub::DeviceSegmentedRadixSort::SortPairs(buffer, size, ...);
hipFree(buffer);

What does not work:

size_t total_size = 0;
hipcub::DeviceRadixSort::SortPairs(nullptr, size, ...);
total_size = std::max(total_size, size);
hipcub::DeviceRunLengthEncode::Encode(nullptr, size, ...);
total_size = std::max(total_size, size);
hipcub::DeviceScan::ExclusiveSum(nullptr, size, ...);
total_size = std::max(total_size, size);
hipcub::DeviceSegmentedRadixSort::SortPairs(nullptr, size, ...);
total_size = std::max(total_size, size);

total_size += sizeof(int) * some_other_size * 3;

void* buffer;
hipMalloc(&buffer, total_size);

// do some other stuff...

char* ptr = reinterpret_cast<char*>(buffer);

int* work1 = reinterpret_cast<int*>(ptr);
ptr += sizeof(int) * some_other_size;

int* work2 = reinterpret_cast<int*>(ptr);
ptr += sizeof(int) * some_other_size;

int* work3 = reinterpret_cast<int*>(ptr);
ptr += sizeof(int) * some_other_size;

void* hipcub_buffer = reinterpret_cast<void*>(ptr);

hipcub::DeviceRadixSort::SortPairs(hipcub_buffer, size, ...);
hipcub::DeviceRunLengthEncode::Encode(hipcub_buffer, size, ...);
hipcub::DeviceScan::ExclusiveSum(hipcub_buffer, size, ...);
hipcub::DeviceSegmentedRadixSort::SortPairs(hipcub_buffer, size, ...);
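
When several primitives share one allocation, two details may matter: each call may need the size that its own query reported, and the carved-out pointer may need to satisfy an alignment requirement. Whether either of these causes the fault described here is not established. A host-side sketch of the bookkeeping, with a hypothetical 256-byte alignment in the usage and made-up names throughout:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Round offset up to the next multiple of alignment (a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Layout of one shared allocation: the caller's work arrays first,
// then the temporary storage region at an aligned offset.
struct shared_layout
{
    std::size_t temp_offset; // where the temporary storage starts
    std::size_t total_bytes; // size to request from the allocator
};

// work_bytes: bytes used by the caller's own arrays at the front of the buffer.
// temp_sizes: the temporary storage size reported by each size query.
inline shared_layout plan_shared_buffer(std::size_t work_bytes,
                                        std::initializer_list<std::size_t> temp_sizes,
                                        std::size_t alignment)
{
    assert(temp_sizes.size() != 0);
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::size_t temp_offset = align_up(work_bytes, alignment);
    const std::size_t largest = std::max(temp_sizes); // largest queried size
    return {temp_offset, temp_offset + largest};
}
```

The caller would keep the four queried sizes in separate variables, allocate total_bytes once, and pass buffer + temp_offset together with the matching size to each call.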

Question about intrinsics documentation

Is there documentation on intrinsics such as __mbcnt_hi and __mbcnt_lo? It is not clear how some of the intrinsics can be emulated using OpenCL.

Thanks
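
As a starting point for emulation, here is a host-side sketch of what these two operations compute, following the usual description of the GCN instructions v_mbcnt_lo_u32_b32 and v_mbcnt_hi_u32_b32. Treat the semantics as an assumption to verify against the ISA manual. The lane index is passed explicitly because host code has no current lane, and the function names are made up:

```cpp
#include <cassert>
#include <cstdint>

// Count the set bits of a 32-bit value.
inline std::uint32_t popcount32(std::uint32_t x)
{
    std::uint32_t n = 0;
    for (; x != 0; x &= x - 1) ++n; // clears the lowest set bit each step
    return n;
}

// Assumed mbcnt_lo: bits of mask set among lanes 0..31 that lie below `lane`, plus src.
inline std::uint32_t mbcnt_lo_emulated(std::uint32_t mask, std::uint32_t src, unsigned lane)
{
    assert(lane < 64);
    const std::uint64_t below = (std::uint64_t{1} << lane) - 1; // one bit per lane < lane
    return popcount32(mask & static_cast<std::uint32_t>(below)) + src;
}

// Assumed mbcnt_hi: the same count for lanes 32..63, plus src.
inline std::uint32_t mbcnt_hi_emulated(std::uint32_t mask, std::uint32_t src, unsigned lane)
{
    assert(lane < 64);
    const std::uint64_t below = (std::uint64_t{1} << lane) - 1;
    return popcount32(mask & static_cast<std::uint32_t>(below >> 32)) + src;
}
```

With an all-ones mask, mbcnt_hi_emulated(~0u, mbcnt_lo_emulated(~0u, 0, lane), lane) returns lane, which is the common use of the pair: computing the lane index within a 64-wide wavefront.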

is the benchmark results time correct?

Is the time in the benchmark results correct?
Take one radix sort entry as an example:
sort_pairs<int, float>/iterations:1/manual_time 12 ms 1 ms 1 6.73374GB/s 861.919M items/s
If items_per_second = 8.16E+08 and Baidu needs to sort 1E+06 items, it can finish roughly 810 times per second.
1/810 = 0.001234 second = 1.234 ms
Maybe the 1.2 ms in the console output is right?
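
The throughput column allows a consistency check. A sketch, assuming the 12 ms column is the measured time per iteration and that throughput is simply items divided by time; the helper names are made up:

```cpp
#include <cassert>

// Items processed in `seconds` at a given throughput.
constexpr double items_processed(double items_per_second, double seconds)
{
    return items_per_second * seconds;
}

// Time in milliseconds needed to process `items` at a given throughput.
constexpr double time_ms(double items, double items_per_second)
{
    return items / items_per_second * 1000.0;
}
```

At 861.919M items/s, 12 ms corresponds to roughly 10.3 million items, while 1E+06 items would take about 1.16 ms. Under these assumptions the 12 ms figure is plausible if the benchmark sorts on the order of ten million items per iteration; the input size is not stated in the output line.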

complex workaround does not work for TensorFlow

The complex workaround branch does not work for the complex reduction kernels in TensorFlow. Using it in the TF build system and enabling either reduction_ops_gpu_complex64 or complex128 ends with compilation errors like:

In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:754:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:870:5: note: in instantiation of function template specialization 'tensorflow::functor::Launch3DXZReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    Launch3DXZReduction(ctx, out, in, in_dim0, in_dim1, in_dim2, op, init,
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:896:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, true, std::complex<float>, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:181:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, true, std::complex<float>, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:931:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, TransformOutputIterator<T, T, DividesBy<T>>, T*,
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:596:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:861:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchRowReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *>' requested here
    LaunchRowReduction(ctx, out, in, in_dim0, in_dim1, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:931:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, TransformOutputIterator<T, T, DividesBy<T>>, T*,
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:754:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:870:5: note: in instantiation of function template specialization 'tensorflow::functor::Launch3DXZReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *>' requested here
    Launch3DXZReduction(ctx, out, in, in_dim0, in_dim1, in_dim2, op, init,
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:931:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, TransformOutputIterator<T, T, DividesBy<T>>, T*,
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, false, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:147:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, false, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, std::complex<float> *, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, true, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:181:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, true, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, std::complex<float> *, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:596:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:861:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchRowReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchRowReduction(ctx, out, in, in_dim0, in_dim1, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:754:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:870:5: note: in instantiation of function template specialization 'tensorflow::functor::Launch3DXZReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    Launch3DXZReduction(ctx, out, in, in_dim0, in_dim1, in_dim2, op, init,
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
1 warning and 16 errors generated.
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:20:
In file included from ./third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:
In file included from external/eigen_archive/unsupported/Eigen/CXX11/Tensor:14:
In file included from external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/Core:69:
In file included from /opt/rocm/include/hip/math_functions.h:32:
In file included from /opt/rocm/include/hip/hcc_detail/math_functions.h:31:
In file included from /opt/rocm/include/hip/hip_runtime.h:53:
/opt/rocm/include/hip/hip_common.h:30:9: warning: '__HIP_PLATFORM_HCC__' macro redefined [-Wmacro-redefined]
#define __HIP_PLATFORM_HCC__
        ^
<command line>:20:9: note: previous definition is here
#define __HIP_PLATFORM_HCC__ 1
        ^
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, false, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:147:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, false, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, std::complex<float> *, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:896:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, true, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:181:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, true, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, std::complex<float> *, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:896:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:596:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:861:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchRowReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchRowReduction(ctx, out, in, in_dim0, in_dim1, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:896:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:754:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:870:5: note: in instantiation of function template specialization 'tensorflow::functor::Launch3DXZReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    Launch3DXZReduction(ctx, out, in, in_dim0, in_dim1, in_dim2, op, init,
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:896:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, true, std::complex<float>, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:181:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, true, std::complex<float>, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:931:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, TransformOutputIterator<T, T, DividesBy<T>>, T*,
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:596:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<std::complex<float> *, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:861:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchRowReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *>' requested here
    LaunchRowReduction(ctx, out, in, in_dim0, in_dim1, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:931:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, TransformOutputIterator<T, T, DividesBy<T>>, T*,
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Sum<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:754:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Sum<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:870:5: note: in instantiation of function template specialization 'tensorflow::functor::Launch3DXZReduction<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *>' requested here
    Launch3DXZReduction(ctx, out, in, in_dim0, in_dim1, in_dim2, op, init,
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:931:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Sum<std::complex<float> >, tensorflow::TransformOutputIterator<std::complex<float>, std::complex<float>, tensorflow::functor::DividesBy<std::complex<float>, std::complex<float> >, long>, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Sum<T>, TransformOutputIterator<T, T, DividesBy<T>>, T*,
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, false, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:147:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, false, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, std::complex<float> *, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:110:17: error: no matching constructor for initialization of 'output_type [4]'
    output_type values[ItemsPerThread];
                ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel_impl<256, 4, true, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    block_reduce_kernel_impl<BlockSize, ItemsPerThread, WithInitialValue, ResultType>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:181:37: note: in instantiation of function template specialization 'rocprim::detail::block_reduce_kernel<256, 4, true, std::complex<float>, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
            HIP_KERNEL_NAME(detail::block_reduce_kernel<
                                    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:302:20: note: in instantiation of function template specialization 'rocprim::detail::reduce_impl<256, 4, true, std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::reduce_impl<block_size, items_per_thread, true>(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_reduce.hpp:91:27: note: in instantiation of function template specialization 'rocprim::reduce<std::complex<float> *, std::complex<float> *, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        return ::rocprim::reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:556:43: note: in instantiation of function template specialization 'hipcub::DeviceReduce::Reduce<std::complex<float> *, std::complex<float> *, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceReduce::Reduce(
                                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:858:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchScalarReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchScalarReduction(ctx, out, in, in_size, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:75:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_reduce.hpp:111:17: error: no matching constructor for initialization of 'output_type' (aka 'std::complex<float>')
    output_type output_value;
                ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:596:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<std::complex<float> *, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:861:5: note: in instantiation of function template specialization 'tensorflow::functor::LaunchRowReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    LaunchRowReduction(ctx, out, in, in_dim0, in_dim1, op, init, cu_stream);
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
In file included from tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.cc:20:
In file included from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/hipcub.hpp:29:
In file included from ./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/../../config.hpp:38:
In file included from external/rocprim_archive/rocprim/include/rocprim/rocprim.hpp:80:
In file included from external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:31:
external/rocprim_archive/rocprim/include/rocprim/device/detail/device_segmented_reduce.hpp:81:16: error: no matching constructor for initialization of 'std::complex<float>'
    ResultType result;
               ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:58:5: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    segmented_reduce<BlockSize, ItemsPerThread>(
    ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:121:25: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_kernel<256, 8, tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
        HIP_KERNEL_NAME(segmented_reduce_kernel<block_size, items_per_thread>),
                        ^
external/rocprim_archive/rocprim/include/rocprim/device/device_segmented_reduce_hip.hpp:248:20: note: in instantiation of function template specialization 'rocprim::detail::segmented_reduce_impl<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, std::complex<float>, tensorflow::functor::Prod<std::complex<float> > >' requested here
    return detail::segmented_reduce_impl(
                   ^
./external/rocprim_archive/hipcub/include/hipcub/rocprim/device/device_segmented_reduce.hpp:55:27: note: in instantiation of function template specialization 'rocprim::segmented_reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
        return ::rocprim::segmented_reduce(
                          ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:754:52: note: in instantiation of function template specialization 'hipcub::DeviceSegmentedReduce::Reduce<tensorflow::PermutationInputIterator<std::complex<float>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::GatherOp, int>, long>, std::complex<float> *, rocprim::transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::functor::RowOffset, int>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> >' requested here
    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
                                                   ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:870:5: note: in instantiation of function template specialization 'tensorflow::functor::Launch3DXZReduction<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *>' requested here
    Launch3DXZReduction(ctx, out, in, in_dim0, in_dim1, in_dim2, op, init,
    ^
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:1036:5: note: in instantiation of function template specialization 'tensorflow::functor::ReduceImpl<std::complex<float>, tensorflow::functor::Prod<std::complex<float> >, std::complex<float> *, std::complex<float> *, Eigen::array<long, 1> >' requested here
    ReduceImpl<T, Prod<T>, T*, T*, ReductionAxes>(
    ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
    struct complex<float>
           ^
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 0 were provided
/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/complex:1053:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 0 were provided
1 warning and 16 errors generated.
Died at /opt/rocm/bin/hipcc line 496.

rocPRIM 5.4.3 failed device_adjacent_difference on gfx906

Describe the bug
On Radeon VII, rocPRIM 5.4.3 failed one test suite: device_adjacent_difference:

[ RUN      ] RocprimDeviceAdjacentDifferenceLargeTests/0.LargeIndices
/fast/portage/sci-libs/rocPRIM-5.4.3/work/rocPRIM-rocm-5.4.3/test/rocprim/test_device_adjacent_difference.cpp:556: Failure
Expected equality of these values:
  incorrect_flag
    Which is: 1
  0
Google Test trace:
/fast/portage/sci-libs/rocPRIM-5.4.3/work/rocPRIM-rocm-5.4.3/test/rocprim/test_device_adjacent_difference.cpp:495: with size = 3860949257
/fast/portage/sci-libs/rocPRIM-5.4.3/work/rocPRIM-rocm-5.4.3/test/rocprim/test_device_adjacent_difference.cpp:489: with seed= 1649760492
/fast/portage/sci-libs/rocPRIM-5.4.3/work/rocPRIM-rocm-5.4.3/test/rocprim/test_device_adjacent_difference.cpp:481: is_left = true, is_in_place = false
/fast/portage/sci-libs/rocPRIM-5.4.3/work/rocPRIM-rocm-5.4.3/test/rocprim/test_device_adjacent_difference.cpp:469: with device_id= 0
[  FAILED  ] RocprimDeviceAdjacentDifferenceLargeTests/0.LargeIndices, where TypeParam = DeviceAdjacentDifferenceLargeParams<true, false> (67 ms)
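
One thing that can be read directly off the trace above: the logged size, 3860949257, is larger than the largest signed 32-bit value but still fits in an unsigned 32-bit value. The log does not establish that this is why the test fails; the sketch below only checks the arithmetic, and the helper names are hypothetical rather than taken from rocPRIM.

```cpp
#include <cstdint>
#include <limits>

// Size reported in the Google Test trace ("with size = 3860949257").
constexpr std::uint64_t logged_size = 3860949257ULL;

// True when n cannot be represented as a signed 32-bit index.
constexpr bool exceeds_int32(std::uint64_t n)
{
    return n > static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// True when n cannot be represented as an unsigned 32-bit index.
constexpr bool exceeds_uint32(std::uint64_t n)
{
    return n > std::numeric_limits<std::uint32_t>::max();
}
```

Here `exceeds_int32(logged_size)` is true and `exceeds_uint32(logged_size)` is false, so any index computed in a signed 32-bit type would be the first thing to rule out for this input.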

Log-files
The full build log:
build.log.gz

Log of tests:
LastTest.log.gz

Environment

Attach environment.txt: environment.txt

Can you please add gfx906 (Vega20) to CMakeLists

TF unit test fails due to a crash in rocprim

I have attached a testcase which is a truncated version of a TF unit testcase.
(rename file to drop the .txt extension, that was added so I can attach the file to this issue)

Run it as shown below (within a container with ROCm tensorflow 1.8)

"python3 histogram_ops_test.py"

and you should see the crash. The crash corresponds to this call in rocprim

https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/master/rocprim/include/rocprim/device/device_histogram_hip.hpp#L212

The testcase passes if I hardcode the rocprim code to take the "else" block
(i.e. call "histogram_global_kernel" instead of "histogram_shared_kernel").

Please investigate and fix this bug.

Thanks

deven

histogram_ops_test.py.txt

Cannot build master on roc-1.8.x HCC and HIP

I am getting this error when building with HCC and HIP on branches roc-1.8.x:

[ 65%] Linking CXX executable test_hc_device_segmented_radix_sort
[ 65%] Linking CXX executable test_hipcub_device_radix_sort
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `RocprimDeviceSegmentedRadixSort_SortKeys_Test<params<__half, int, true, 0u, 16u, 2000u, 10000u> >::TestBody()':
(.text+0x37fed): undefined reference to `operator==(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__inplace_stable_sort<__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> >)':
(.text+0x3931a): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__inplace_stable_sort<__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> >)':
(.text+0x39370): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__inplace_stable_sort<__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> >)':
(.text+0x39393): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__merge_without_buffer<__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, long, long, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> >)':
(.text+0x3955b): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__merge_without_buffer<__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, __gnu_cxx::__normal_iterator<__half*, std::vector<__half, std::allocator<__half> > >, long, long, __gnu_cxx::__ops::_Iter_comp_iter<key_comparator<__half, true, 0u, 16u> >)':
(.text+0x395cb): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o:(.text+0x39682): more undefined references to `operator<(__half const&, __half const&)' follow
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `RocprimDeviceSegmentedRadixSort_SortPairs_Test<params<__half, int, true, 0u, 16u, 2000u, 10000u> >::TestBody()':
(.text+0x87d6e): undefined reference to `operator==(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__insertion_sort<__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> >)':
(.text+0x89987): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__insertion_sort<__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> >)':
(.text+0x899ff): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__insertion_sort<__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> >)':
(.text+0x89a29): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__merge_without_buffer<__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, long, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, long, long, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> >)':
(.text+0x89b0b): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `void std::__merge_without_buffer<__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, long, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> > >(__gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, __gnu_cxx::__normal_iterator<std::pair<__half, int>*, std::vector<std::pair<__half, int>, std::allocator<std::pair<__half, int> > > >, long, long, __gnu_cxx::__ops::_Iter_comp_iter<key_value_comparator<__half, int, true, 0u, 16u> >)':
(.text+0x89b7b): undefined reference to `operator<(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o:(.text+0x89c28): more undefined references to `operator<(__half const&, __half const&)' follow
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `RocprimDeviceSegmentedRadixSort_SortKeysDoubleBuffer_Test<params<__half, int, true, 0u, 16u, 2000u, 10000u> >::TestBody()':
(.text+0xd0d14): undefined reference to `operator==(__half const&, __half const&)'
/tmp/tmp.1u2OOtEjNp/test_hip_device_segmented_radix_sort.cpp.host.o: In function `RocprimDeviceSegmentedRadixSort_SortPairsDoubleBuffer_Test<params<__half, int, true, 0u, 16u, 2000u, 10000u> >::TestBody()':
(.text+0x1032ea): undefined reference to `operator==(__half const&, __half const&)'
clang-7.0: error: linker command failed with exit code 1 (use -v to see invocation)
test/rocprim/CMakeFiles/test_hip_device_segmented_radix_sort.dir/build.make:90: recipe for target 'test/rocprim/test_hip_device_segmented_radix_sort' failed
make[2]: *** [test/rocprim/test_hip_device_segmented_radix_sort] Error 1
CMakeFiles/Makefile2:502: recipe for target 'test/rocprim/CMakeFiles/test_hip_device_segmented_radix_sort.dir/all' failed
make[1]: *** [test/rocprim/CMakeFiles/test_hip_device_segmented_radix_sort.dir/all] Error 2

Invalid use of inline assembly for GFX1010 target

Describe the bug
When building PyTorch the following error is observed:

[3680/4568] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/operators/hip/torch_hip_generated_batch_sparse_to_dense_op.hip.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/operators/hip/torch_hip_generated_batch_sparse_to_dense_op.hip.o 
cd /home/erkki/Downloads/rocm2/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/operators/hip && /usr/bin/cmake -E make_directory /home/erkki/Downloads/rocm2/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/operators/hip/. && /usr/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/home/erkki/Downloads/rocm2/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/operators/hip/./torch_hip_generated_batch_sparse_to_dense_op.hip.o -P /home/erkki/Downloads/rocm2/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/operators/hip/torch_hip_generated_batch_sparse_to_dense_op.hip.o.cmake
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr23 = V_MOV_B32_dpp killed $vgpr23(tied-def 0), $vgpr13, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp killed $vgpr47(tied-def 0), $vgpr14, 322, 15, 15, 0, implicit $exec

I traced this to rocPRIM library:

/opt/rocm-3.7.0/rocprim/include/rocprim/intrinsics/warp_shuffle.hpp
59:int __amdgcn_update_dpp(int old, int src, int dpp_ctrl, int row_mask, int bank_mask, bool bound_ctrl)
60:    __asm("llvm.amdgcn.update.dpp.i32");
62:template<class T, int dpp_ctrl, int row_mask = 0xf, int bank_mask = 0xf, bool bound_ctrl = false>
64:T warp_move_dpp(T input)
74:        words[i] = __amdgcn_update_dpp(
76:            dpp_ctrl, row_mask, bank_mask, bound_ctrl

/opt/rocm-3.7.0/rocprim/include/rocprim/warp/detail/warp_scan_dpp.hpp
41:class warp_scan_dpp
59:            T t = scan_op(warp_move_dpp<T, 0x111>(output), output); // row_shr:1
64:            T t = scan_op(warp_move_dpp<T, 0x112>(output), output); // row_shr:2
69:            T t = scan_op(warp_move_dpp<T, 0x114>(output), output); // row_shr:4
74:            T t = scan_op(warp_move_dpp<T, 0x118>(output), output); // row_shr:8
79:            T t = scan_op(warp_move_dpp<T, 0x142>(output), output); // row_bcast:15
84:            T t = scan_op(warp_move_dpp<T, 0x143>(output), output); // row_bcast:31

/opt/rocm-3.7.0/rocprim/include/rocprim/warp/detail/warp_reduce_dpp.hpp
43:class warp_reduce_dpp
59:            output = reduce_op(warp_move_dpp<T, 0xb1>(output), output);
64:            output = reduce_op(warp_move_dpp<T, 0x4e>(output), output);
69:            output = reduce_op(warp_move_dpp<T, 0x114>(output), output);
74:            output = reduce_op(warp_move_dpp<T, 0x118>(output), output);
79:            output = reduce_op(warp_move_dpp<T, 0x142>(output), output);
84:            output = reduce_op(warp_move_dpp<T, 0x143>(output), output);
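
To connect the listings above with the compiler output: the operand 322 in the rejected V_MOV_B32_dpp instructions is 0x142 in hexadecimal, which the comments in warp_scan_dpp.hpp label row_bcast:15. The decoder below is a minimal sketch that covers only the dpp_ctrl values appearing in this report. The function names are hypothetical, and the quad_perm and row_shr ranges are assumptions based on the DPP control encoding in AMD's public GCN ISA documentation, not something taken from rocPRIM.

```cpp
#include <string>

// Hypothetical helper: names the dpp_ctrl values seen in the listings above.
// Values outside the handled ranges are reported as "other" rather than guessed.
inline std::string describe_dpp_ctrl(int dpp_ctrl)
{
    if (dpp_ctrl >= 0x000 && dpp_ctrl <= 0x0ff) {
        // quad_perm: four 2-bit lane selectors, lane 0 in the lowest bits.
        std::string s = "quad_perm:[";
        for (int lane = 0; lane < 4; ++lane) {
            s += std::to_string((dpp_ctrl >> (2 * lane)) & 0x3);
            s += (lane < 3) ? "," : "]";
        }
        return s;
    }
    if (dpp_ctrl >= 0x111 && dpp_ctrl <= 0x11f) {
        return "row_shr:" + std::to_string(dpp_ctrl - 0x110);
    }
    if (dpp_ctrl == 0x142) return "row_bcast:15";
    if (dpp_ctrl == 0x143) return "row_bcast:31";
    return "other";
}

// True for the two broadcast controls, the kind the GFX10+ error message rejects.
inline bool is_row_broadcast(int dpp_ctrl)
{
    return dpp_ctrl == 0x142 || dpp_ctrl == 0x143;
}
```

With this, `describe_dpp_ctrl(322)` gives "row_bcast:15", which corresponds to line 79 in both warp_scan_dpp.hpp and warp_reduce_dpp.hpp as listed.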

If I comment out the following lines in warp_shuffle.hpp:

74:        words[i] = __amdgcn_update_dpp(
76:            dpp_ctrl, row_mask, bank_mask, bound_ctrl

The PyTorch build progresses much further, but still fails due to a different code generation issue.

To Reproduce

Build PyTorch with ROCm, see ROCm/pytorch#718

Expected behavior

rocPRIM should work with GFX10+ devices.

Compatibility code for hip-clang and C++14

#99 discusses adding compatibility code for hip-clang and C++14. This code should be removed when the hip-clang project adds the functionality.

DeviceSegmentedRadixSort performance issue

I am facing performance issues with the segmented radix sort functionality of rocPRIM. My test case is an array of integer segments with a fixed segment size of 32, so, for example, an array of 4M elements has 125k segments of 32 integers each. Sorting all segments sequentially on the host (std::sort) takes roughly 55ms, sorting the segments using OpenMP takes about 15ms, and sorting the segments using a Vega10 card and rocPRIM takes 65ms.
Decreasing the segment size further improves performance on the host and degrades it with rocPRIM. I need a segment size of at least 64 to outperform sequential sorting on the host with rocPRIM, and a segment size of 256 to beat the OpenMP version.

Nico
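
For reference, the sequential host baseline described in this report (std::sort applied to each fixed-size segment) might look like the sketch below. The function names are hypothetical and this is not the benchmark code that produced the timings above; it only illustrates the per-segment sort and the segment-count arithmetic (4,000,000 elements / 32 per segment = 125,000 segments).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of segments needed to cover `size` elements; the last one may be short.
inline std::size_t segment_count(std::size_t size, std::size_t segment_size)
{
    if (segment_size == 0) return 0; // avoid division by zero
    return (size + segment_size - 1) / segment_size;
}

// Sorts each consecutive run of `segment_size` elements independently, in place.
inline void sort_fixed_segments(std::vector<int>& data, std::size_t segment_size)
{
    if (segment_size == 0) return; // nothing sensible to sort
    for (std::size_t begin = 0; begin < data.size(); begin += segment_size) {
        const std::size_t end = std::min(begin + segment_size, data.size());
        std::sort(data.begin() + begin, data.begin() + end);
    }
}
```

Each segment is independent of the others, so the loop can be parallelized over segments, which is presumably what the OpenMP variant mentioned above does.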

cmake configure failed

Hi, I am trying to build and test rocPRIM locally, and I hit the following error during CMake configure:

[ 22%] Performing download step (git clone) for 'googletest-download'
CMake Error at googletest-download/googletest-download-prefix/src/googletest-download-stamp/googletest-download-download-.cmake:16 (message):
  Command failed: 1

   '/usr/local/bin/cmake' '-P' '/root/rocPRIM/build/googletest-download/googletest-download-prefix/tmp/googletest-download-gitclone.cmake'

  See also

    /root/rocPRIM/build/googletest-download/googletest-download-prefix/src/googletest-download-stamp/googletest-download-download-*.log


CMakeFiles/googletest-download.dir/build.make:89: recipe for target 'googletest-download-prefix/src/googletest-download-stamp/googletest-download-download' failed
make[2]: *** [googletest-download-prefix/src/googletest-download-stamp/googletest-download-download] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/googletest-download.dir/all' failed
make[1]: *** [CMakeFiles/googletest-download.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
CMake Error at cmake/DownloadProject.cmake:167 (message):
  Build step for googletest failed: 2
Call Stack (most recent call first):
  cmake/Dependencies.cmake:64 (download_project)
  CMakeLists.txt:61 (include)

-- Configuring incomplete, errors occurred!
See also "/root/rocPRIM/build/CMakeFiles/CMakeOutput.log".

Please note that I had to upgrade CMake to build the tests, because the default version on Ubuntu 16.04 won't work; however, I didn't see any documentation of that requirement.
Here's what I have:

~/rocPRIM/build# cmake --version
cmake version 3.11.0

It would be helpful if you could host a Dockerfile to test rocPRIM.
Thanks :-)

rocprim.hc.warp_sort test fails on ROCm 1.8

I tried building rocPRIM and running ctest on my ROCm 1.8 system with Vega, and I see test#31: rocprim.hc.warp_sort failing.
Running the test separately ./test/rocprim/test_hc_warp_sort gives:

[==========] 12 tests from 6 test cases ran. (127 ms total)
[  PASSED  ] 6 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] RocprimWarpSortShuffleBasedTests/0.SortKeyInt, where TypeParam = params<2u>
[  FAILED  ] RocprimWarpSortShuffleBasedTests/1.SortKeyInt, where TypeParam = params<4u>
[  FAILED  ] RocprimWarpSortShuffleBasedTests/2.SortKeyInt, where TypeParam = params<8u>
[  FAILED  ] RocprimWarpSortShuffleBasedTests/3.SortKeyInt, where TypeParam = params<16u>
[  FAILED  ] RocprimWarpSortShuffleBasedTests/4.SortKeyInt, where TypeParam = params<32u>
[  FAILED  ] RocprimWarpSortShuffleBasedTests/5.SortKeyInt, where TypeParam = params<64u>

 6 FAILED TESTS

That is the only test that seems to fail out of 81 cases.
With the same tree on the same hardware but on ROCm 1.7, all the tests pass.

9 of 90 tests are failing on Fiji but not on Vega10

Hi Team,
I am observing unit test failures on Fiji, but all tests pass on Vega10.
Can you please look into these?
The Fiji log is attached.
Tested on ROCm 1.8.2 <1.8-173>. Need resolution.

rocPRIM_unittestT_vg20.log
rocPRIM_unittestT_Fiji.log

error: Explicit call type does not match pointee type of callee operand

The following error occurs on the ROCm 2.4 preview kernel. Please use the instructions below to investigate.

2. Copy the tar file from http://rocm-ci.amd.com/job/compute-roc-master/10209/ (provided in vault).
3. Extract it, install aria2 with "apt-get install aria2", run "aria2c deb.meta4", and go to the "utils" folder.
4. Install the driver using "./install.sh" and reboot.

[ 29%] Linking CXX executable test_hip_device_reduce
error: Explicit call type does not match pointee type of callee operand (Producer: 'LLVM9.0.0svn' Reader: 'LLVM 9.0.0svn')
clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
test/rocprim/CMakeFiles/test_hc_device_merge_sort.dir/build.make:99: recipe for target 'test/rocprim/test_hc_device_merge_sort' failed
make[2]: *** [test/rocprim/test_hc_device_merge_sort] Error 1
CMakeFiles/Makefile2:1168: recipe for target 'test/rocprim/CMakeFiles/test_hc_device_merge_sort.dir/all' failed
make[1]: *** [test/rocprim/CMakeFiles/test_hc_device_merge_sort.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 29%] Built target test_hip_block_radix_sort

rocPRIM -- rocThrust dependency: error: no member named 'init_offset_scan_state_kernel' in rocprim::detail

When I try to compile rocThrust, I get the following error several times:

...
In file included from /home/klaus/Programme/rocalution_install/rocThrust-rocm-5.1.3/thrust/../thrust/system/hip/detail/scan_by_key.h:36:
In file included from /home/klaus/Programme/rocalution_install/rocThrust-rocm-5.1.3/thrust/../thrust/system/hip/execution_policy.h:81:
/home/klaus/Programme/rocalution_install/rocThrust-rocm-5.1.3/thrust/../thrust/system/hip/detail/set_operations.h:956:61: error: no member named 'init_offset_scan_state_kernel' in namespace 'rocprim::detail'; did you mean 'init_lookback_scan_state_kernel'?
    hipLaunchKernelGGL(HIP_KERNEL_NAME(rocprim::detail::init_offset_scan_state_kernel),
    ~~~~~~~~~~~~~~~~~^
/opt/rocm-5.4.0/include/hip/amd_detail/amd_hip_runtime.h:199:30: note: expanded from macro 'HIP_KERNEL_NAME'
    #define HIP_KERNEL_NAME(...) __VA_ARGS__
    ^~~~~~~~~~~
/opt/rocm-5.4.0/include/hip/amd_detail/amd_hip_runtime.h:251:74: note: expanded from macro 'hipLaunchKernelGGL'
    #define hipLaunchKernelGGL(kernelName, ...) hipLaunchKernelGGLInternal((kernelName), __VA_ARGS__)
    ^~~~~~~~~~
/opt/rocm-5.4.0/include/hip/amd_detail/amd_hip_runtime.h:248:9: note: expanded from macro 'hipLaunchKernelGGLInternal'
    kernelName<<<(numBlocks), (numThreads), (memPerBlock), (streamId)>>>(__VA_ARGS__); \
    ^~~~~~~~~~
/opt/rocm-5.4.0/include/rocprim/device/detail/device_scan_common.hpp:76:60: note: 'init_lookback_scan_state_kernel' declared here
In file included from /home/klaus/Programme/rocalution_install/rocThrust-rocm-5.1.3/test/test_zip_iterator_sort.cpp:19:
In file included from /home/klaus/Programme/rocalution_install/rocThrust-rocm-5.1.3/thrust/../thrust/sort.h:1358:
In file included from /home/klaus/Programme/rocalution_install/rocThrust-rocm-5.1.3/thrust/../thrust/detail/sort.inl:26
    __launch_bounds__(ROCPRIM_DEFAULT_MAX_BLOCK_SIZE) void init_lookback_scan_state_kernel(:
...

When are we going to push changes from fix_half_tests_on_new_hip branch to master

Cannot run cmake

I'm trying to run cmake on rocPRIM develop branch, and I encounter this issue:

[ 22%] Performing configure step for 'googlebenchmark-download'
CMake Error at /root/rocPRIM/build/googlebenchmark-download/googlebenchmark-download-prefix/src/googlebenchmark-download-stamp/googlebenchmark-download-configure-.cmake:16 (message):
  Command failed: 1

   '/usr/bin/cmake' '-DCMAKE_BUILD_TYPE=RELEASE' '-DBENCHMARK_ENABLE_TESTING=OFF' '-DBUILD_SHARED_LIBS=ON' '-DCMAKE_INSTALL_PREFIX=/root/rocPRIM/build/googlebenchmark' '-GUnix Makefiles' '/root/rocPRIM/build/googlebenchmark-src'

  See also

    /root/rocPRIM/build/googlebenchmark-download/googlebenchmark-download-prefix/src/googlebenchmark-download-stamp/googlebenchmark-download-configure-*.log


CMakeFiles/googlebenchmark-download.dir/build.make:101: recipe for target 'googlebenchmark-download-prefix/src/googlebenchmark-download-stamp/googlebenchmark-download-configure' failed
make[2]: *** [googlebenchmark-download-prefix/src/googlebenchmark-download-stamp/googlebenchmark-download-configure] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/googlebenchmark-download.dir/all' failed
make[1]: *** [CMakeFiles/googlebenchmark-download.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
CMake Error at cmake/DownloadProject.cmake:168 (message):
  Build step for googlebenchmark failed: 2
Call Stack (most recent call first):
  cmake/Dependencies.cmake:102 (download_project)
  CMakeLists.txt:63 (include)


-- Configuring incomplete, errors occurred!
See also "/root/rocPRIM/build/CMakeFiles/CMakeOutput.log".

Is this due to some other issue with downloading google benchmark?
