Tuesday, 15 June 2010

bash - GNU Parallel - which job failed? -



bash - GNU Parallel - which job failed? -

i'm running job on several different servers (up 25) using gnu parallel.

the shell script implements does:

parallel --tag --nonall -s $some_list_of_servers "some_command" state=$? echo -n "result: " if [ "$state" -eq "0" ] echo "all jobs successful" else echo "$state jobs failed" fi homecoming $state

where some_list_of_servers array, , install_command is, instance, git fetch.

what want lot more info how many jobs failed. want know command, , server, failed.

i've been through man page, , google, , can't find switch(es) i'm looking for.

any help gratefully appreciated.

weedom

edit in response reply 1:

i tried that, , odd happening.

weedom@host1: ~/$ parallel --tag --nonall -j8 --joblog test.log -s host1,host2 uptime host2 10:41:17 36 days, 20:45, 1 user, load average: 0.00, 0.00, 0.00 host1 10:41:17 22:34, 3 users, load average: 0.06, 0.11, 0.04 weedom@host1: ~/$ cat test.log seq host starttime runtime send receive exitval signal command 1 host1 1403689277.067 0.519999980926514 0 0 0 0 uptime

no matter how many hosts add together -s, seem lastly 1 finish test.log

i've added follow-up question here: gnu parallel - --joblog logging lastly job

you want utilize --joblog option, shown in docs. gnu parallel allows restarting failed ones --resume-failed.

eg, running script:

#!/bin/bash jobmod=$(( $1 % 3 )) if [ $jobmod == 0 ] exit 1 else exit 0 fi

on several hosts this:

$ seq 1 10 | parallel --joblog out.log -s "srv01,srv02,srv03,srv04" ./failjob

gives

$ more out.log seq host starttime runtime send receive exitval signal command 1 srv01 1403542514.713 0.267 0 0 0 0 ./failjob 1 3 srv02 1403542514.717 0.266 0 0 1 0 ./failjob 3 4 srv03 1403542514.719 0.266 0 0 0 0 ./failjob 4 2 srv04 1403542514.715 0.397 0 0 0 0 ./failjob 2 5 srv01 1403542514.983 0.231 0 0 0 0 ./failjob 5 6 srv02 1403542514.986 0.368 0 0 1 0 ./failjob 6 7 srv03 1403542514.988 0.388 0 0 0 0 ./failjob 7 8 srv04 1403542515.121 0.437 0 0 0 0 ./failjob 8 9 srv01 1403542515.221 0.343 0 0 1 0 ./failjob 9 10 srv02 1403542515.356 0.388 0 0 0 0 ./failjob 10

bash parallel-processing

No comments:

Post a Comment