Hi Hugh, I ran into some issues when attempting to test a CUDA proje

Enable dynamic memory allocation inside GPU kernels, about hughperkins/coriander

hughperkins commented on May 19, 2024

Can you try to create a very tiny test case, eg use the examples in the test/cocl folder as a basis, so I can reproduce the issue on my own machine?

from coriander.

hughperkins commented on May 19, 2024

Oh, you mean, you are calling new inside the kernel?

AJcodes commented on May 19, 2024

> Oh, you mean, you are calling new inside the kernel?

Yes, new and delete operators are called from inside the kernel

hughperkins commented on May 19, 2024

Oh wow. This is new information for me :-) . A very simple test case I can use would be good though. I doubt I'm going to implement this any time soon. But depends. As far as how to implement it ....

It could be possible, actually. This would tie into the new virtual memory management that's sort of evolving. Basically, we'd allocate one gi-normous GPU buffer right at the start, from the hostside, inside coriander, and then just dole out little bites of this when people do cudaMalloc etc. hostside. We could then pass this single buffer into kernels, and dole out bits of that to the kernel itself.
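A minimal host-side sketch of the single-buffer idea described above, written in C++. The `Arena` class, its method names, and the 128-byte default alignment are assumptions made for illustration and are not taken from Coriander. It only hands out offsets into one large buffer and never reclaims them.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical host-side arena: hands out byte offsets into one big buffer.
// Names and defaults are illustrative, not Coriander's.
class Arena {
public:
    explicit Arena(std::size_t capacity) : capacity_(capacity), next_(0) {}

    // Returns the offset of a fresh block, aligned to `align` bytes.
    // `align` must be a power of two. Throws std::bad_alloc when full.
    std::size_t allocate(std::size_t bytes, std::size_t align = 128) {
        assert(align != 0 && (align & (align - 1)) == 0);
        std::size_t start = (next_ + align - 1) & ~(align - 1);
        if (start > capacity_ || bytes > capacity_ - start) {
            throw std::bad_alloc();
        }
        next_ = start + bytes;
        return start;
    }

    std::size_t used() const { return next_; }
    std::size_t capacity() const { return capacity_; }

private:
    std::size_t capacity_;  // total bytes in the big buffer
    std::size_t next_;      // offset of the first unused byte
};
```

A hostside `cudaMalloc` replacement could call `allocate` and return the offset (plus some virtual base) as the device pointer. Freeing individual blocks would need a free list on top of this.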

Well... hmmm... yeah... that should work. We already started to handle virtual memory device-side,

coriander/src/kernel_dumper.cpp

Lines 178 to 188 in 31739d9

    
    #define __vmem2__

    struct GlobalVars {
        local int *scratch;
        global char *clmem0;
        unsigned long clmem_vmem_offset0;
    };

    inline global float *getGlobalPointer(__vmem__ unsigned long vmemloc, const struct GlobalVars* const globalVars) {
        return (global float *)(globalVars->clmem0 + vmemloc - globalVars->clmem_vmem_offset0);
    }

What is your use-case? To what extent can you work around this issue for now?
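The address translation in the snippet from kernel_dumper.cpp can be tried out on the host. The following is a rough C++ analogue, not the generated OpenCL itself: the `Sketch` names are hypothetical, and the OpenCL address-space qualifiers (`global`, `local`) are simply dropped.

```cpp
#include <cassert>
#include <cstdint>

// Host-side analogue of the GlobalVars struct (illustrative names only).
struct GlobalVarsSketch {
    char *clmem0;                      // start of the real buffer
    std::uint64_t clmem_vmem_offset0;  // virtual address that maps to clmem0[0]
};

// Mirrors getGlobalPointer: real = base + (virtual - virtual_base).
inline float *getGlobalPointerSketch(std::uint64_t vmemloc, const GlobalVarsSketch *g) {
    assert(vmemloc >= g->clmem_vmem_offset0);  // address must lie inside this buffer
    return reinterpret_cast<float *>(g->clmem0 + (vmemloc - g->clmem_vmem_offset0));
}
```

For example, if the buffer's virtual base is 4096, the virtual address `4096 + 3 * sizeof(float)` resolves to element 3 of the underlying float array.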

AJcodes commented on May 19, 2024

The use case requires allocating and de-allocating memory on the fly and in parallel, though I'll have to estimate the effort to work around the issue for now.

As for a test case, I've taken a sample from the CUDA samples and tweaked it. There is another issue I forgot to bring up: when trying to set the malloc heap size limit, the following error is thrown:

error: use of undeclared identifier
      'cudaLimitMallocHeapSize'
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128 * (1 << 20));

I've commented it out so you can see the runtime error
newdelete.tar.gz

hughperkins commented on May 19, 2024

Allocating in parallel is possible, as long as we store the vmem table in local memory. Note that only one kernel can run at a time, and no hostside allocations of GPU memory should occur whilst it is running.

On 8 June 2017 10:27:05 BST, Adel Johar ***@***.***> wrote:

> The use case requires allocating and de-allocating memory on the fly and in parallel, though I'll have to estimate the effort to work around the issue for now.
>
> As for a test case I've taken a sample from the CUDA samples and tweaked it. There is another issue I forgot to bring up, when trying to allocate the allocation limit for a thread, the following error is thrown:
>
> ```
> error: use of undeclared identifier 'cudaLimitMallocHeapSize'
>     cudaThreadSetLimit(cudaLimitMallocHeapSize, 128 * (1 << 20));
> ```
>
> I've commented it out so you can see the runtime error
> [newdelete.tar.gz](https://github.com/hughperkins/coriander/files/1060524/newdelete.tar.gz)
>
> -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: #35 (comment)

hughperkins commented on May 19, 2024

Actually... If we store the vmem table in global memory, we can allocate in parallel across multiple kernels and hostside.

hughperkins commented on May 19, 2024

We can store the vmem table at the start of the ginormous buffer, in global memory.
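One way to picture a table at the start of the buffer that many threads can allocate from at once is an atomic bump pointer. The sketch below is host-side C++ under several assumptions: `VmemHeader`, `initHeap`, and `heapAlloc` are hypothetical names, `std::atomic` stands in for device-side atomics on global memory, and nothing is ever freed, so it does not yet cover `delete`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical table kept in the first bytes of the shared buffer.
struct VmemHeader {
    std::atomic<std::uint64_t> next;  // offset of the first free byte
    std::uint64_t capacity;           // total size of the buffer in bytes
};

// Place the header at the start of `buffer`; usable space begins after it.
// `buffer` must be suitably aligned for VmemHeader.
inline VmemHeader *initHeap(void *buffer, std::uint64_t capacity) {
    assert(capacity >= sizeof(VmemHeader));
    VmemHeader *h = new (buffer) VmemHeader;
    h->next.store(sizeof(VmemHeader));
    h->capacity = capacity;
    return h;
}

// Thread-safe bump allocation. Returns an offset into the buffer,
// or 0 on failure (offset 0 is always occupied by the header).
inline std::uint64_t heapAlloc(VmemHeader *h, std::uint64_t bytes) {
    if (bytes > h->capacity) return 0;  // also guards the rounding below
    std::uint64_t rounded = (bytes + 15) & ~std::uint64_t(15);  // multiple of 16
    std::uint64_t old = h->next.load();
    while (true) {
        if (rounded > h->capacity - old) return 0;  // not enough room left
        // On failure, compare_exchange_weak reloads `old` and we retry.
        if (h->next.compare_exchange_weak(old, old + rounded)) return old;
    }
}
```

Because the only shared state is the `next` counter inside the buffer, hostside code and any number of kernels could in principle allocate from the same table, which matches the motivation for putting it in global memory.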

AJcodes commented on May 19, 2024

> We can store the vmem table at the start of the ginormous buffer, in global memory.

I imagine it would be possible to allocate a certain size, though support for the following function should be considered too:
cudaThreadSetLimit(cudaLimitMallocHeapSize, <value>);
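A heap-size limit of this kind could be modelled as a setting that is recorded before the device heap buffer is created and then used to size it. The sketch below is a guess at the shape of such a shim in C++. `LimitSketch`, `RuntimeConfigSketch`, and `setLimitSketch` are hypothetical stand-ins, not CUDA or Coriander declarations, and the 8 MiB default is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; these are NOT the CUDA or Coriander declarations.
enum class LimitSketch { MallocHeapSize };

struct RuntimeConfigSketch {
    std::size_t mallocHeapBytes = 8u * (1u << 20);  // assumed default: 8 MiB
    bool heapCreated = false;  // set once the device heap buffer exists
};

// Records the requested heap size. Returns false if the limit is unknown
// or if the heap buffer has already been created and can no longer be resized.
inline bool setLimitSketch(RuntimeConfigSketch &cfg, LimitSketch which, std::size_t value) {
    assert(value > 0);
    if (which != LimitSketch::MallocHeapSize || cfg.heapCreated) return false;
    cfg.mallocHeapBytes = value;
    return true;
}
```

With the value from the error message, `128 * (1 << 20)` is 134217728 bytes, i.e. 128 MiB.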

How long would it take to have this implemented in Coriander? I ask this because the current project I'm porting has a lot of intertwining dependencies on new and delete, and it would take a long time just to work around these dependencies.
