Monday, 15 June 2015

CUDA shared memory under the hood questions

I have several questions regarding CUDA shared memory.

First, as mentioned in this post, shared memory may be declared in two different ways:

Either as dynamically allocated shared memory, as follows:

// launch kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

which is then used within the kernel like this:

extern __shared__ int s[];
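
For context, here is a minimal sketch of how those two pieces fit together; the kernel name comes from the launch line above, but the array-reversal body is an assumption:

__global__ void dynamicReverse(int* d, int n)
{
    extern __shared__ int s[];   // size supplied by the 3rd launch parameter
    int t = threadIdx.x;
    s[t] = d[t];                 // stage the input in shared memory
    __syncthreads();             // make all writes visible block-wide
    d[t] = s[n - t - 1];         // write back in reverse order
}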

Or as static shared memory, which is used in the kernel as follows:

__shared__ int s[64];
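
For contrast, a static version of the same sketch (staticReverse is an assumed name); the size must be a compile-time constant and the launch needs no third parameter:

__global__ void staticReverse(int* d, int n)
{
    __shared__ int s[64];        // size fixed at compile time; n must not exceed 64
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[n - t - 1];
}

// launched without a shared-memory size argument:
// staticReverse<<<1, n>>>(d_d, n);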

Both are used for different reasons; is one better than the other, and why?

Second,

I'm running a kernel with multiple blocks and 256 threads per block. I'm using static shared memory in both the __global__ and the __device__ functions; both of them use shared memory. For example:

__device__ float reduce(float data, unsigned int tid);  // forward declaration

__global__ void startKernel(float* p_d_array)
{
    __shared__ double matA[3*3];

    float a1 = 0;
    float a2 = 0;
    float a3 = 0;
    float b = p_d_array[threadIdx.x];

    a1 += reduce(b, threadIdx.x);
    a2 += reduce(b, threadIdx.x);
    a3 += reduce(b, threadIdx.x);

    // continue...
}

__device__ float reduce(float data, unsigned int tid)
{
    __shared__ float sData[256];

    // reduce...
}

I'd like to know how shared memory is allocated in such a case.

I presume each block gets its own shared memory.

What happens when block #0 reaches the reduce function?

Is the shared memory allocated in advance of the function call?

I call the reduce device function three times. In such a case, theoretically, in block #0, threads #[0,127] may still be executing the first reduce call (delayed due to hard work) while threads #[128,255] are already operating on the second reduce call. In that case, I'd like to know whether both reduce calls use the same shared memory,

even though they come from two different function calls?

On the other hand, is it possible that a single block is allocated 3*256*sizeof(float) of shared memory for the three function calls? That seems wasteful by CUDA standards, but I'd still like to know how CUDA operates in such a case.
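
To make the scenario concrete, here is a sketch of the hazard being described. It assumes (as the answer below suggests) that the function-scope array in reduce is a single per-block allocation reused by every call; the reduction body itself is an assumption:

__device__ float reduce(float data, unsigned int tid)
{
    __shared__ float sData[256];                   // ONE copy per block, reused by every call
    sData[tid] = data;
    __syncthreads();
    for (unsigned int s = 128; s > 0; s >>= 1) {   // simple tree reduction,
        if (tid < s) sData[tid] += sData[tid + s]; // assumes 256 threads per block
        __syncthreads();
    }
    return sData[0];
}

__global__ void startKernel(float* p_d_array)
{
    float b = p_d_array[threadIdx.x];
    float a1 = reduce(b, threadIdx.x);
    __syncthreads();                     // without this barrier, a warp still reading
                                         // sData in call 1 could race with a faster
                                         // warp already writing it in call 2
    float a2 = reduce(b, threadIdx.x);   // reuses the SAME sData bytes
    if (threadIdx.x == 0) p_d_array[0] = a1 + a2;
}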

Third, is it possible to gain higher performance from shared memory thanks to compiler optimization, by using const float* p_shared;

or the __restrict__ keyword after the data assignment section?
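
For reference, a sketch of the kind of declaration being asked about (p_shared is the question's name; the rest is assumed). For pointers into shared memory, const and __restrict__ mainly give the compiler aliasing guarantees; whether that actually helps depends on the compiler and the access pattern:

__global__ void copyScaled(const float* __restrict__ p_in, float* __restrict__ p_out)
{
    __shared__ float sData[256];
    sData[threadIdx.x] = p_in[threadIdx.x];       // data assignment section
    __syncthreads();

    const float* __restrict__ p_shared = sData;   // qualified view of shared memory
    p_out[threadIdx.x] = 2.0f * p_shared[threadIdx.x];
}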

Any help is appreciated.

AFAIR, there is little difference between requesting shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter, set either by your code or by code generated by the compiler.

Re: the 2nd question, the compiler will sum up the shared memory requirements of the kernel function and of the functions called by the kernel.
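
One way to verify this yourself is to query the kernel's static shared-memory footprint at runtime. A minimal sketch, using a trimmed stand-in for the question's kernel:

#include <cstdio>
#include <cuda_runtime.h>

__device__ float reduce(float data, unsigned int tid)
{
    __shared__ float sData[256];            // 256 * 4 = 1024 bytes
    sData[tid] = data;
    __syncthreads();
    return sData[0];                        // body trimmed; the footprint is what matters
}

__global__ void startKernel(float* p_d_array)
{
    __shared__ double matA[3 * 3];          // 9 * 8 = 72 bytes
    float r = reduce(p_d_array[threadIdx.x], threadIdx.x);
    if (threadIdx.x == 0) { matA[0] = r; p_d_array[0] = (float)matA[0]; }
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, startKernel);
    // If the compiler keeps one copy of sData (rather than one per call site),
    // this should report about 72 + 1024 = 1096 bytes.
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}

Compiling with nvcc --ptxas-options=-v prints the same per-kernel smem figure at build time.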

cuda
