c++ - OpenMP parallel region overhead increases when num_threads varies -
I am trying to use different numbers of threads in different parts of my program to achieve maximum acceleration. I found that switching the thread count with the num_threads clause incurs significant overhead. I am looking for an explanation of this, since to my understanding the thread pool should contain a given number of threads regardless of how many actually get invoked. I am also looking for possible workarounds. Thanks.
sample code:
#include <cstdio>
#include <omp.h>

void omp_sum(int ntd) {
    int s = 0;
    #pragma omp parallel num_threads(ntd)
    {
        int i = omp_get_thread_num();
        #pragma omp atomic
        s += i;
    }
}

int main() {
    int n = 100;
    int nt1 = 6, nt2 = 12;
    double t;

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt1);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt1);
        omp_sum(nt1);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt2);
        omp_sum(nt2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt1);
        omp_sum(nt2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);
}
sample output (in us):
1034.069001
1058.620000
1034.572000
2210.681000
18234.355000
Edit: the workstation running this code has 2 hexa-core Intel E5-2630L CPUs, so there should be a total of 12 hardware cores and 24 hyper-threads. It runs Fedora 19 with GCC 4.8.2.
I can reproduce your results with GCC 4.8 (g++ -O3 -fopenmp foo.cpp) on a four-core/eight-hyper-thread system. I changed nt1 to 4 and nt2 to 8.
Your function omp_sum is simple:

    pushq   %rbx
    movq    %rdi, %rbx
    call    omp_get_thread_num
    movq    (%rbx), %rdx
    lock addl   %eax, (%rdx)
    popq    %rbx
    ret
Here is the assembly code for the loop

    for (int i = 0; i < n; i++) { omp_sum(nt1); omp_sum(nt2); }

.L10:
    leaq    32(%rsp), %rsi
    xorl    %ecx, %ecx
    movl    $4, %edx
    movl    $_Z7omp_sumi._omp_fn.0, %edi
    movl    $0, 28(%rsp)
    movq    %rbx, 32(%rsp)
    call    GOMP_parallel
    leaq    32(%rsp), %rsi
    xorl    %ecx, %ecx
    movl    $8, %edx
    movl    $_Z7omp_sumi._omp_fn.0, %edi
    movl    $0, 28(%rsp)
    movq    %rbx, 32(%rsp)
    call    GOMP_parallel
    subl    $1, %ebp
    jne     .L10
This is identical to the assembly for the loop

    for (int i = 0; i < n; i++) { omp_sum(nt2); omp_sum(nt2); }

the only change being movl $4, %edx instead of movl $8, %edx. So from the caller's side it's hard to see what is causing the problem: the magic happens inside GOMP_parallel. One would have to look at the source code of GOMP_parallel. My guess is that GOMP_parallel checks the number of threads used in the last parallel call, and that if a new parallel call requests a different number of threads, there is some overhead to switch. That overhead is evidently much larger than your simple function.
But I'm not sure why this would ever be a problem in practice. It does not make sense to use such short parallel sections (normally one parallelizes the loop itself, and n is much larger), so the overhead should not matter.
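What "one parallelizes the loop" means can be sketched as follows: enter a single parallel region whose team shares all n iterations, so the region setup cost is paid once rather than once per call. The per-iteration body here is only a stand-in for the real work.

```cpp
#include <omp.h>

// One parallel region amortized over all n iterations, instead of n
// back-to-back parallel regions as in the question's benchmark.
long sum_over_iterations(int n, int nthreads) {
    long total = 0;
    #pragma omp parallel for num_threads(nthreads) reduction(+:total)
    for (int iter = 0; iter < n; ++iter)
        total += iter;             // stand-in for the per-iteration work
    return total;                  // n*(n-1)/2 regardless of thread count
}
```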
Edit: Section 2.4.1 of the OpenMP 3.1 specification, titled "Determining the Number of Threads for a parallel Region", gives the algorithm for determining the number of threads. The source code for GOMP_parallel in GCC 4.8 shows that it first calls gomp_resolve_num_threads.
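As a rough paraphrase (my own simplification, not the actual libgomp code), the core of that Section 2.4.1 algorithm looks like this: the num_threads clause, if present, overrides the nthreads-var ICV, and dynamic adjustment may then reduce the request.

```cpp
#include <algorithm>

// Simplified sketch of the OpenMP 3.1, Section 2.4.1 decision.
// threads_requested: value of the num_threads clause, 0 if absent.
// nthreads_var:      the ICV behind omp_get_max_threads().
// dyn_var:           whether dynamic adjustment (omp_get_dynamic()) is on.
int resolve_num_threads(int threads_requested, int nthreads_var,
                        bool dyn_var, int threads_available) {
    int n = threads_requested > 0 ? threads_requested : nthreads_var;
    if (dyn_var)                   // implementation may deliver fewer threads
        n = std::min(n, threads_available);
    return n;
}
```

The real gomp_resolve_num_threads also handles nesting and thread-limit ICVs, which this sketch omits.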
c++ multithreading openmp