c++ - OpenMP parallel region overhead increases when num_threads varies -
I am trying to use different numbers of threads in different parts of my program to achieve maximum acceleration. I found that switching the thread count with the num_threads clause incurs significant overhead. I am looking for an explanation of this, since to my understanding the thread pool should contain a given number of threads regardless of how many actually get invoked. I am also looking for possible workarounds. Thanks.
sample code:
#include <cstdio>
#include <omp.h>

void omp_sum(int ntd) {
    int s = 0;
    #pragma omp parallel num_threads(ntd)
    {
        int i = omp_get_thread_num();
        #pragma omp atomic
        s += i;
    }
}

int main() {
    int n = 100;
    int nt1 = 6, nt2 = 12;
    double t;

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt1);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt1);
        omp_sum(nt1);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt2);
        omp_sum(nt2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);

    t = omp_get_wtime();
    for (int i = 0; i < n; i++) {
        omp_sum(nt1);
        omp_sum(nt2);
    }
    printf("%lf\n", (omp_get_wtime() - t) * 1e6);
}
sample output (in us):
1034.069001
1058.620000
1034.572000
2210.681000
18234.355000
Edit: the workstation running this code has 2 hexa-core Intel E5-2630L CPUs, so there should be a total of 12 hardware cores and 24 hyper-threads. It runs Fedora 19 with GCC 4.8.2.
I can reproduce your results with GCC 4.8 (g++ -O3 -fopenmp foo.cpp) on a four-core/eight-hyper-thread system. I changed nt1 to 4 and nt2 to 8.
Your function omp_sum is simple:

    pushq   %rbx
    movq    %rdi, %rbx
    call    omp_get_thread_num
    movq    (%rbx), %rdx
    lock addl   %eax, (%rdx)
    popq    %rbx
    ret
Here is the assembly code for the loop

    for (int i = 0; i < n; i++) { omp_sum(nt1); omp_sum(nt2); }

.L10:
    leaq    32(%rsp), %rsi
    xorl    %ecx, %ecx
    movl    $4, %edx
    movl    $_Z7omp_sumi._omp_fn.0, %edi
    movl    $0, 28(%rsp)
    movq    %rbx, 32(%rsp)
    call    GOMP_parallel
    leaq    32(%rsp), %rsi
    xorl    %ecx, %ecx
    movl    $8, %edx
    movl    $_Z7omp_sumi._omp_fn.0, %edi
    movl    $0, 28(%rsp)
    movq    %rbx, 32(%rsp)
    call    GOMP_parallel
    subl    $1, %ebp
    jne     .L10
This is identical to the assembly for the loop

    for (int i = 0; i < n; i++) { omp_sum(nt2); omp_sum(nt2); }

the only change being movl $4, %edx instead of movl $8, %edx. So from the caller's side it's hard to see what is causing the problem: the magic happens inside GOMP_parallel. One would have to look at the source code of GOMP_parallel. My guess is that GOMP_parallel checks the number of threads used in the last parallel call, and that if a new parallel call requests a different number of threads, there is some overhead to switch. That overhead is evidently much larger than your simple function.
But I'm not sure why this would ever be a problem in practice. It does not make sense to use such short parallel sections (normally one parallelizes the loop itself, and n is much larger), so the overhead should not matter.
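What "one parallelizes the loop" means can be sketched as follows: enter a single parallel region whose team shares all n iterations, so the region setup cost is paid once rather than once per call. The per-iteration body here is only a stand-in for the real work.

```cpp
#include <omp.h>

// One parallel region amortized over all n iterations, instead of n
// back-to-back parallel regions as in the question's benchmark.
long sum_over_iterations(int n, int nthreads) {
    long total = 0;
    #pragma omp parallel for num_threads(nthreads) reduction(+:total)
    for (int iter = 0; iter < n; ++iter)
        total += iter;             // stand-in for the per-iteration work
    return total;                  // n*(n-1)/2 regardless of thread count
}
```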
Edit: Section 2.4.1 of the OpenMP 3.1 specification, titled "Determining the Number of Threads for a parallel Region", gives the algorithm for determining the number of threads. The source code for GOMP_parallel in GCC 4.8 shows that it first calls gomp_resolve_num_threads.
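As a rough paraphrase (my own simplification, not the actual libgomp code), the core of that Section 2.4.1 algorithm looks like this: the num_threads clause, if present, overrides the nthreads-var ICV, and dynamic adjustment may then reduce the request.

```cpp
#include <algorithm>

// Simplified sketch of the OpenMP 3.1, Section 2.4.1 decision.
// threads_requested: value of the num_threads clause, 0 if absent.
// nthreads_var:      the ICV behind omp_get_max_threads().
// dyn_var:           whether dynamic adjustment (omp_get_dynamic()) is on.
int resolve_num_threads(int threads_requested, int nthreads_var,
                        bool dyn_var, int threads_available) {
    int n = threads_requested > 0 ? threads_requested : nthreads_var;
    if (dyn_var)                   // implementation may deliver fewer threads
        n = std::min(n, threads_available);
    return n;
}
```

The real gomp_resolve_num_threads also handles nesting and thread-limit ICVs, which this sketch omits.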
c++ multithreading openmp