Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
Does this help? (I have not tried it yet.) I will have a look soon. What I would really like is something that shows cache usage by application in much the same way that top shows CPU usage. Thanks. I think that is a good first indication. It is not surprising that they use a lot of cache. I had assumed the slow down was the writing to swap file because of lack of RAM. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I had assumed the slow down was the writing to swap file because of lack of RAM. With my 64 GBytes of RAM, I seldom use the swap file at all. Now with the OpenIFS stuff, if it ever gets into production, it may use more of course. That link I showed is not very useful, at least as far as I can tell. I found another way to get at this. Right now I have my main machine running (the one with 65GBytes RAM and 16384+512K bytes memory cache). It is running four N216 CPDN tasks and four WCG tasks (two ARP1's and two CPN1's). I found something I downloaded, the perf command, and ran it for a while. It looks like one can learn a lot with it. The manual page helps and running perf --help should help too. Running two N216, 5 WCG, and one Rosetta: localhost:root[/home/jeandavid8]# perf stat -aB -e cache-references,cache-misses ^C Performance counter stats for 'system wide': 12,019,648,573 cache-references 6,494,633,179 cache-misses # 54.033 % of all cache refs 23.651410187 seconds time elapsed Running no CPDN, but some WCG: localhost:root[/home/jeandavid8]# perf stat -aB -e cache-references,cache-misses ^C Performance counter stats for 'system wide': 3,727,867,972 cache-references 1,368,824,386 cache-misses # 36.719 % of all cache refs 14.735167255 seconds time elapsed Running only BOINC client, no tasks: localhost:root[/home/jeandavid8]# perf stat -aB -e cache-references,cache-misses ^C Performance counter stats for 'system wide': 128,714,374 cache-references 5,195,723 cache-misses # 4.037 % of all cache refs 25.357159122 seconds time elapsed |
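For anyone who wants to compare several perf runs without reading the percentages by eye, here is a minimal sketch that pulls the two counters out of the text that perf stat prints and recomputes the miss ratio. The function name and the regular expression are my own choices for illustration, and the sample string is just the first run quoted above.

```python
import re

# Matches a perf-stat counter line such as "12,019,648,573 cache-references".
COUNTER = re.compile(r"([\d,]+)\s+(cache-references|cache-misses)\b")


def cache_miss_ratio(perf_output: str) -> float:
    """Return cache-misses / cache-references from `perf stat` text output."""
    counts = {
        name: int(value.replace(",", ""))
        for value, name in COUNTER.findall(perf_output)
    }
    return counts["cache-misses"] / counts["cache-references"]


# First run from the post above (two N216, 5 WCG, one Rosetta).
sample = (
    "12,019,648,573 cache-references "
    "6,494,633,179 cache-misses # 54.033 % of all cache refs"
)
print(f"{cache_miss_ratio(sample):.3%}")
```

The runs above were system wide (-a). perf stat also accepts -p with a process ID, which may get closer to the per-application view asked for earlier in the thread, though I have not tried that on CPDN tasks.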
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,926,017 RAC: 7,296 |
Am I the only one with problems with the new short tasks (UK Met Office HadCM3 short v8.36)? It seems all error with: <core_client_version>7.16.6</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20 Image PC Routine Line Source hadcm3s_um_8.36_i 0851D9E5 Unknown Unknown Unknown hadcm3s_um_8.36_i 085429B6 Unknown Unknown Unknown hadcm3s_um_8.36_i 0832EC95 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FD206 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FED33 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848CCB5 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848BE04 Unknown Unknown Unknown hadcm3s_um_8.36_i 08496BAD Unknown Unknown Unknown Suspended CPDN Monitor - Suspend request from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20 Image PC Routine Line Source hadcm3s_um_8.36_i 0851D9E5 Unknown Unknown Unknown hadcm3s_um_8.36_i 085429B6 Unknown Unknown Unknown hadcm3s_um_8.36_i 0832EC95 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FD206 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FED33 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848CCB5 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848BE04 Unknown Unknown Unknown hadcm3s_um_8.36_i 08496BAD Unknown Unknown Unknown Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1 Model crash detected, will try to restart... 
Suspended CPDN Monitor - Suspend request from BOINC... forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20 Image PC Routine Line Source hadcm3s_um_8.36_i 0851D9E5 Unknown Unknown Unknown hadcm3s_um_8.36_i 085429B6 Unknown Unknown Unknown hadcm3s_um_8.36_i 0832EC95 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FD206 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FED33 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848CCB5 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848BE04 Unknown Unknown Unknown hadcm3s_um_8.36_i 08496BAD Unknown Unknown Unknown Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20 Image PC Routine Line Source hadcm3s_um_8.36_i 0851D9E5 Unknown Unknown Unknown hadcm3s_um_8.36_i 085429B6 Unknown Unknown Unknown hadcm3s_um_8.36_i 0832EC95 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FD206 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FED33 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848CCB5 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848BE04 Unknown Unknown Unknown hadcm3s_um_8.36_i 08496BAD Unknown Unknown Unknown Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... 
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20 Image PC Routine Line Source hadcm3s_um_8.36_i 0851D9E5 Unknown Unknown Unknown hadcm3s_um_8.36_i 085429B6 Unknown Unknown Unknown hadcm3s_um_8.36_i 0832EC95 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FD206 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FED33 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848CCB5 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848BE04 Unknown Unknown Unknown hadcm3s_um_8.36_i 08496BAD Unknown Unknown Unknown Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20 Image PC Routine Line Source hadcm3s_um_8.36_i 0851D9E5 Unknown Unknown Unknown hadcm3s_um_8.36_i 085429B6 Unknown Unknown Unknown hadcm3s_um_8.36_i 0832EC95 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FD206 Unknown Unknown Unknown hadcm3s_um_8.36_i 081FED33 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848CCB5 Unknown Unknown Unknown hadcm3s_um_8.36_i 0848BE04 Unknown Unknown Unknown hadcm3s_um_8.36_i 08496BAD Unknown Unknown Unknown Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1 Model crash detected, will try to restart... Sorry, too many model crashes! 
:-( 17:23:42 (240): called boinc_finish(22) </stderr_txt> ]]> Sorry for not being precise: this is one of my WSL computers. On the Linux computers I get: <core_client_version>7.16.5</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> SIGSEGV: segmentation violation Stack trace (10 frames): /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f1eb60] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7bfcee5] Exiting... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1 Model crash detected, will try to restart... 
SIGSEGV: segmentation violation Stack trace (10 frames): /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f65b60] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7c43ee5] Exiting... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1 Model crash detected, will try to restart... 
SIGSEGV: segmentation violation Stack trace (10 frames): /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f4db60] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7c2bee5] Exiting... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1 Model crash detected, will try to restart... 
SIGSEGV: segmentation violation Stack trace (10 frames): /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f63b60] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7c41ee5] Exiting... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1 Model crash detected, will try to restart... 
SIGSEGV: segmentation violation Stack trace (10 frames): /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f16b60] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7bf4ee5] Exiting... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1 Model crash detected, will try to restart... 
SIGSEGV: segmentation violation Stack trace (10 frames): /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7fc8b60] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7ca6ee5] Exiting... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1 Model crash detected, will try to restart... Sorry, too many model crashes! 
:-( 03:08:41 (3927): called boinc_finish(22) </stderr_txt> ]]> I know this SIGSEGV: segmentation violation is normally associated with RAM overclocking, but these computers handle the long WUs (hadam4h) quite well. |
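Since the two logs above fail in different ways (a NAMELIST syntax error under WSL, a segmentation violation on native Linux), a small helper can sort saved stderr text into those two buckets and count the restart attempts. This is only a sketch: the function names and category labels are made up for illustration, and the match strings are the ones that appear in the logs quoted in this post.

```python
def classify_stderr(stderr_text: str) -> str:
    """Label a CPDN stderr dump using strings seen in the logs above."""
    if "syntax error in NAMELIST input" in stderr_text:
        return "namelist"
    if "SIGSEGV: segmentation violation" in stderr_text:
        return "segfault"
    return "other"


def restart_attempts(stderr_text: str) -> int:
    """Count how many times the controller tried to restart the model."""
    return stderr_text.count("Model crash detected, will try to restart")


wsl_log = (
    "forrtl: severe (17): syntax error in NAMELIST input, unit 5 "
    "Model crash detected, will try to restart... "
    "Sorry, too many model crashes!"
)
print(classify_stderr(wsl_log), restart_attempts(wsl_log))
```

Anything that matches neither string falls into "other", so missing-library or process-creation failures would need their own checks once their exact messages are known.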
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I just got one and it errored out with the same type of namelist errors. E-mail sent to Andy and Sarah. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
So far, mine all seem OK. One has been running long enough to produce 5 zip files. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Yeah, I checked some out and I've seen some trickling. I jumped the gun with the e-mail, and my errors on the two I downloaded were segmentation faults instead of namelist errors. The hadcm3s has also historically had a relatively high failure rate with segmentation faults. |
Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
14 failures here today, all CM3, all after about 44 seconds elapsed / 4 seconds CPU, same SIGSEGV as reported. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,815,352 RAC: 5,242 |
I've got one running OK on a Mac, so far. |
Send message Joined: 17 Aug 07 Posts: 8 Credit: 37,197,498 RAC: 13,412 |
Hi, I looked in here because all CM3's are failing. It happens on all 4 Linux machines. The AM4's are working fine. I'm using Ubuntu 20.04 Server 64bit with the 32bit libraries as mentioned in the other thread. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
My laptop downloaded 4 of them and all died with segmentation faults. It then grabbed a hadam4h, which, of course, is running fine. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
Hi, I looked in here because all CM3's are failing. It happens on all 4 Linux machines. The AM4's are working fine. I'm using Ubuntu 20.04 Server 64bit with the 32bit libraries as mentioned in the other thread. Six of these are running OK here so far; four have produced trickles and uploaded zips. Off to work soon; when I get back I will poke around to see if I can find any common factors in the machines with failures and/or those that are getting far enough to produce trickles. Ubuntu 21.10 and BOINC 7.19.0 (the odd number after "7." indicates a pre-release version I compiled from source). |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
Off to work soon; when I get back I will poke around to see if I can find any common factors in the machines with failures and/or those that are getting far enough to produce trickles. So far, error types are missing libraries, seg fault, process creation error (computers with this seem to be crashing everything as they do with missing libraries) and bad cpu type error (all on machines running Darwin). As to spotting any pattern, nothing has emerged from the noise yet. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Oh good! It is not just me. I think three out of three hadcm3s have just failed after a very few seconds each. I have had three hadam4h tasks running for at least a full day each. My machine is not overclocked. Here are the task details and the beginning of the stderr file for one of the failures: Name hadcm3s_1gxf_200012_168_926_012129163_0 Workunit 12129163 Created 4 Jan 2022, 11:47:50 UTC Sent 6 Jan 2022, 6:13:06 UTC Report deadline 19 Dec 2022, 11:33:06 UTC Received 6 Jan 2022, 12:03:52 UTC Server state Over Outcome Computation error Client state Compute error Exit status 22 (0x00000016) Unknown error code Computer ID 1511241 Run time 36 sec CPU time 2 sec Validate state Invalid Credit 0.00 Device peak FLOPS 6.57 GFLOPS Application version UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu Peak working set size 122.46 MB Peak swap size 181.44 MB Peak disk usage 4.50 MB Stderr <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> SIGSEGV: segmentation violation Stack trace (10 frames): /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f5c140] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /usr/lib/libc.so.6(__libc_start_main+0xf9)[0xf7cd01e9] Exiting... 
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=602350, iMonCtr=1 Model crash detected, will try to restart... SIGSEGV: segmentation violation Stack trace (10 frames): /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f30140] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /usr/lib/libc.so.6(__libc_start_main+0xf9)[0xf7ca41e9] ... hadcm3s_1lbw_200012_168_926_012129901_0 failed the same way. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
Oh good! It is not just me. A long way from being just you. Sadly, I am not finding enough data to work out whether it is just some work units that have the problem or whether it is something about the computers involved. So far I have seen that both Intel and AMD machines are implicated in the seg faults, but both also include some machines, like my own, that have produced trickles. This is true across both Darwin and Linux computers, and certainly both Ubuntu and Debian have tasks sending trickles as well as failures of this type. Yours is the only Red Hat machine I have looked at, so there is a lack of data there. I don't know if there is anyone out there who could write a script to search for patterns, but my brain is failing to spot anything useful. |
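As a starting point for the kind of pattern-hunting script mentioned above, here is a minimal sketch that computes a failure rate per value of one attribute (OS, CPU vendor, and so on). Everything here is an assumption: the record layout, the field names, and above all the sample rows, which are invented placeholders and not real task results. Someone would still have to collect the real host and task data by hand or from the project pages.

```python
from collections import defaultdict


def failure_rates(records, key):
    """Return {attribute value: fraction of tasks that failed}."""
    totals = defaultdict(lambda: [0, 0])  # value -> [failed, total]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += rec["failed"]  # True counts as 1, False as 0
        bucket[1] += 1
    return {value: failed / total for value, (failed, total) in totals.items()}


# Invented placeholder rows, for illustration only.
records = [
    {"os": "Linux", "cpu": "Intel", "failed": True},
    {"os": "Linux", "cpu": "AMD", "failed": False},
    {"os": "Darwin", "cpu": "Intel", "failed": True},
    {"os": "Darwin", "cpu": "Intel", "failed": False},
]

for attribute in ("os", "cpu"):
    print(attribute, failure_rates(records, attribute))
```

With only a handful of hosts per category the rates will be noisy, so a difference between two groups would need a lot more rows before it means anything.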
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Yours is the only Red Hat machine I have looked at so a lack of data there. The CentOS Linux 8 machines should be the same as mine... Of course, there may not be (m)any of those either. Red Hat must be paid for, but IIRC, CentOS is free. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I notice my machine successfully completed about a dozen hadcm3s work units in March and April. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
So far seg faults make up about 25% of the failures, while with the hadam4h, they usually make up <5%. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Could this excerpt from the BOINC Manager event log be of any use? It is from one of my failed tasks. They all seem to look like this. Thu 06 Jan 2022 06:59:36 AM EST | climateprediction.net | Starting task hadcm3s_1gxf_200012_168_926_012129163_0 Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Computation for task hadcm3s_1gxf_200012_168_926_012129163_0 finished Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_1.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_2.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_3.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_4.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_5.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_6.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_7.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_8.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_9.zip for task 
hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_10.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_11.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_12.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_13.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_14.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_restart.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent Thu 06 Jan 2022 07:00:16 AM EST | climateprediction.net | Started upload of hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_out.zip Thu 06 Jan 2022 07:00:18 AM EST | climateprediction.net | Finished upload of hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_out.zip |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Jean-David, that's just the listing of files that should have been produced if it had run to the end. These used to be written to stderr.txt, which is what is displayed on the task web pages when they finish, but now they are just listed in the message log at the end of failures when one or more files that should have been produced in a successful run weren't. |
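To make that concrete, here is a minimal sketch that collects the "Output file ... absent" entries from a saved copy of the event log and groups them by task. The function name and the regular expression are assumptions based only on the line format in the excerpt above; other client versions may word the message differently.

```python
import re

# Line format taken from the event log excerpt quoted above.
ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")


def absent_outputs(log_text: str) -> dict:
    """Map each task name to the list of output files reported absent."""
    missing = {}
    for filename, task in ABSENT.findall(log_text):
        missing.setdefault(task, []).append(filename)
    return missing


log_excerpt = (
    "Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file "
    "hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_1.zip for task "
    "hadcm3s_1gxf_200012_168_926_012129163_0 absent "
    "Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file "
    "hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_restart.zip for task "
    "hadcm3s_1gxf_200012_168_926_012129163_0 absent"
)

for task, files in absent_outputs(log_excerpt).items():
    print(task, len(files), "files absent")
```

A task that crashes in its first minute will list every expected zip as absent, so the count mostly tells you how far the run got rather than why it stopped.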
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
So far, error types are missing libraries, seg fault, process creation error (computers with this seem to be crashing everything as they do with missing libraries) and bad cpu type error (all on machines running Darwin). As to spotting any pattern, nothing has emerged from the noise yet. You guys have my sympathy. I looked at all 6 of the tasks I tried to run that failed to see if others had trouble too. IIRC, some had trouble. Some have not been retried. Two had missing 32-bit libraries. They had a variety of operating systems, even FreeBSD. One failed with this; I have no idea how this happened. Process creation (../../projects/climateprediction.net/hadcm3s_8.36_i686-pc-linux-gnu) failed: Error -1, errno=2 execv: No such file or directory |
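On that last error: errno 2 is ENOENT. On Linux, the exec call can return ENOENT not only when the program file itself is missing but also when the ELF interpreter (the dynamic loader) it names cannot be found, which is one way a 32-bit binary can fail on a 64-bit system without 32-bit libraries. Whether that is what happened on that host is a guess on my part. The sketch below just decodes the errno and reads the class byte of an ELF header to tell a 32-bit binary from a 64-bit one; the helper name is my own.

```python
import errno
import os


def elf_class(header: bytes) -> str:
    """Report whether the first bytes of a file describe a 32- or 64-bit ELF."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return "not ELF"
    # Byte 4 is EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64.
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")


# errno=2 from the error message above.
print(errno.errorcode[2], "-", os.strerror(2))

# Synthetic headers for illustration; for a real file use open(path, "rb").read(5).
print(elf_class(b"\x7fELF\x01"))
print(elf_class(b"\x7fELF\x02"))
```

Running this against the hadcm3s executable in the project directory should report 32-bit, given the i686 in its name.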