New work discussion

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606	Message 66114 - Posted: 19 Sep 2022, 13:45:22 UTC - in response to Message 66112. The Windows task you got will be a resend with _1 or _2 at the end of the task name meaning it is on its second or third try after failing on one or two machines, or possibly being aborted. Seems WAH has a problem with a machine reboot: it was working fine until I had to reboot the machine after a patch install, and then it failed with a computation error after it restarted. Unfortunately it disappeared too quickly for me to see the detailed logs and keep the files for tests. As a developer, that is a bit of a nuisance. It would be nice if I could tell the client not to delete the files in the event of a crash/failure but to leave the slot files as-is (or make a backup). I had a look through the client options and it's possible to exit the client after a task has finished, but that would affect non-CPDN tasks, which isn't what I need. Other than running another process to periodically rsync the suspect slot directory to somewhere else, I can't see how to do it within BOINC. Does anyone know how to do this? (Richard H maybe?) --- CPDN Visiting Scientist ID: 66114 ·
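The "another process that periodically copies the slot directory" workaround mentioned above could look roughly like the sketch below. It is only an illustration, not a BOINC feature: the function names, the backup location and the `keep` limit are all made up for the example, and it uses a plain Python copy instead of rsync. Files that change or vanish while the task is running may make a snapshot incomplete.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir: Path, backup_root: Path, keep: int = 5) -> Path:
    """Copy slot_dir into a new, uniquely named folder under backup_root.

    Only the newest `keep` snapshots of this slot are retained.
    All names here are hypothetical; nothing in this function is BOINC API.
    """
    backup_root.mkdir(parents=True, exist_ok=True)
    prefix = f"{slot_dir.name}-"
    dest = backup_root / f"{prefix}{time.time_ns()}"
    try:
        shutil.copytree(slot_dir, dest, symlinks=True)
    except shutil.Error:
        # Some files changed or disappeared mid-copy; keep the partial copy.
        pass

    # Nanosecond timestamps have a fixed width, so name order is time order.
    snapshots = sorted(
        p for p in backup_root.iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return dest


def watch(slot_dir: Path, backup_root: Path, interval_s: float, rounds: int) -> None:
    """Take `rounds` snapshots, sleeping interval_s seconds between them."""
    for _ in range(rounds):
        if slot_dir.is_dir():
            snapshot_slot(slot_dir, backup_root)
        time.sleep(interval_s)
```

For example, `watch(Path("/var/lib/boinc/slots/0"), Path("/tmp/slot-backups"), 60, 1440)` would take a copy every minute for a day. The slot path is an assumption based on a typical Linux install; the slot number changes from task to task.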

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,980,824 RAC: 21,902	Message 66115 - Posted: 19 Sep 2022, 18:07:33 UTC Last modified: 19 Sep 2022, 18:07:54 UTC Seems WAH has a problem with a machine reboot It isn't just the WAH tasks. I lost three with a reboot recently but in my experience they are more likely to survive a reboot than the Linux ones where in my experience the failure rate can be as high as one in four on reboots. With the Windows ones running under Wine, I find it less than one in ten losses from reboots. ID: 66115 ·

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449	Message 66116 - Posted: 19 Sep 2022, 18:22:34 UTC - in response to Message 66114. Was it result 22235033? That's a curious one. Exit status 0 (0x00000000) (zero normally signifies success), nothing at all recorded from stderr. But it's a resend (replication _1). The _0 copy also failed, leaving rather more evidence behind. Result 22229598, Exit status 15, stderr ends with Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=22328, selfPID=22328, iMonCtr=1 Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=22328, selfPID=21096, iMonCtr=1 after many, many interruptions. You've thought of the obvious ideas. Beyond that, I can only suggest: Start another task and let it get into its stride. Stop BOINC prior to reboot, and examine the state of the files. Disable automatic BOINC start at reboot/login. Reboot, and examine the state of the files again before BOINC has a chance to run. Pull the network cable, and allow BOINC to start. Assuming it crashes as before, the slot folder will be cleared, but the upload files and report should be held until after BOINC has reported them to the server and got an ack response. Stderr will be embedded in client_state.xml, not kept as a separate file. ID: 66116 ·
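Since stderr ends up embedded in client_state.xml, a small script can pull it back out while the client is held offline. The sketch below is an assumption-laden illustration: it assumes each task sits in a `<result>` block with a `<name>` and a `<stderr_out>` element wrapped in CDATA, and it uses tolerant regular expressions instead of a strict XML parser in case the file is not perfectly well-formed. Check the tag names against your own file before relying on it.

```python
import re

# Assumed layout: <result> ... <name>..</name> ... <stderr_out>..</stderr_out> ... </result>
RESULT_RE = re.compile(r"<result>(.*?)</result>", re.DOTALL)


def _field(block, tag):
    """Return the text of the first <tag>...</tag> in block, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return match.group(1).strip() if match else None


def stderr_by_result(state_text):
    """Map task name -> embedded stderr text for results that have any."""
    found = {}
    for block in RESULT_RE.findall(state_text):
        name = _field(block, "name")
        stderr = _field(block, "stderr_out")
        if name and stderr:
            # Drop the CDATA wrapper markers, keep everything inside them.
            stderr = stderr.replace("<![CDATA[", "").replace("]]>", "").strip()
            found[name] = stderr
    return found
```

Typical use would be to read a copy of the file, e.g. `stderr_by_result(open("client_state.xml").read())`, and print the entry for the failed task name. Work on a copy, since the client rewrites this file while it runs.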

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606	Message 66117 - Posted: 19 Sep 2022, 19:21:03 UTC - in response to Message 66116. Last modified: 19 Sep 2022, 19:43:36 UTC Richard, yes it was result 22235033. But it failed as soon as it started so that's probably why no stderr? Thanks for the tips, I shall bear them in mind, though it seems rather poor to me that the volunteers should have to do this for a safe restart for CPDN tasks. The zero exit status may be a red herring, it's possible the real error code is not propagated to the top level software layer correctly. The only way to tell would be to put the code in the debugger. I found the same thing with the HadSM4. Dave, that's a very poor survival for the linux tasks. Other projects seem to handle a cold restart just fine. I am surprised because operational models are pretty resilient to hardware & data failures but it could be something in the wrapper code that's not tolerating restarts properly. I'll ask the CPDN team as I'm interested to find out. ID: 66117 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66118 - Posted: 19 Sep 2022, 22:00:28 UTC - in response to Message 66117. Linux tasks are not that bad. I have not gotten any since July, but my failures seem mostly like this: Task 22227751 Name hadsm4_a10i_201310_6_935_012148076_1 Workunit 12148076 Created 28 Jul 2022, 5:20:01 UTC Sent 28 Jul 2022, 6:08:55 UTC Report deadline 10 Jul 2023, 11:28:55 UTC Received 28 Jul 2022, 9:43:21 UTC Server state Over Outcome Computation error Client state Compute error Exit status 22 (0x00000016) Unknown error code Computer ID 1511241 Run time 13 min 56 sec CPU time 13 min 20 sec Validate state Invalid Credit 0.00 Device peak FLOPS 6.58 GFLOPS Application version UK Met Office HadSM4 at N144 resolution v8.02 i686-pc-linux-gnu Peak working set size 656.03 MB Peak swap size 787.57 MB Peak disk usage 0.02 MB Stderr <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Sorry, too many model crashes! :-( 04:59:14 (154169): called boinc_finish(22) </stderr_txt> ]]> I quit doing cold restarts a while ago, but IIRC, the offending program that cause those problems has long since been fixed. At some point my machine crashed in such a way that I could not even do a shutdown. I powered it off and started it back up. I never found out what the trouble was, but when I powered it back up, Boinc and its children, probably including CPDN jobs, picked up where they left off with no problems. ID: 66118 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,980,824 RAC: 21,902	Message 66119 - Posted: 20 Sep 2022, 6:06:31 UTC Dave, that's a very poor survival for the linux tasks. Other projects seem to handle a cold restart just fine. I am surprised because operational models are pretty resilient to hardware & data failures but it could be something in the wrapper code that's not tolerating restarts properly. I'll ask the CPDN team as I'm interested to find out. May not be quite that bad. I will when work appears again, start keeping some real data on this rather than relying on my impressions. ID: 66119 ·

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 149 Credit: 12,830,559 RAC: 228	Message 66120 - Posted: 20 Sep 2022, 8:50:01 UTC - in response to Message 66119. Dave, that's a very poor survival for the linux tasks. Other projects seem to handle a cold restart just fine. I am surprised because operational models are pretty resilient to hardware & data failures but it could be something in the wrapper code that's not tolerating restarts properly. I'll ask the CPDN team as I'm interested to find out. May not be quite that bad. I will when work appears again, start keeping some real data on this rather than relying on my impressions. I can only report my experiences. I do not take any precautions when rebooting (Ubuntu 20.04) and I have not had any CPDN fails in a couple of years. ID: 66120 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606	Message 66121 - Posted: 20 Sep 2022, 10:49:28 UTC - in response to Message 66118. Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. That's a model error rather than a technical boinc related problem like a restart. If I remember the hadley centre models properly it may indicate the model levels have touched (or crossed), probably because the vertical windspeed is too high or unstable. Usually that kind of thing happens in certain forecast conditions over high orography, where the model levels are naturally closer together. For interest, OpenIFS has a different way of calculating where the winds are blowing. It tries to work out a trajectory of an air parcel between model timesteps. If we use a too large timestep or the winds get very strong, those trajectories near the surface can go underground and you'll see messages to that effect in the model logs. It can correct but if there are too many, the model will stop. --- CPDN Visiting Scientist ID: 66121 ·
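As a toy illustration of the "trajectory goes underground" idea described above (this is not OpenIFS code, and the real scheme is far more elaborate), consider tracing an air parcel straight back over one timestep using only its vertical wind. The heights, wind speed and timesteps below are invented for the example.

```python
def departure_height(arrival_height_m, vertical_wind_ms, timestep_s):
    """Height the parcel came from one timestep ago (straight-line guess)."""
    return arrival_height_m - vertical_wind_ms * timestep_s


def count_underground(levels_m, winds_ms, timestep_s, surface_m=0.0):
    """Count levels whose back-traced departure point lies below the surface."""
    return sum(
        1
        for z, w in zip(levels_m, winds_ms)
        if departure_height(z, w, timestep_s) < surface_m
    )


# Hypothetical near-surface model levels (metres) with a steady updraft.
levels = [10.0, 50.0, 100.0]
winds = [0.05, 0.05, 0.05]  # m/s, upward

for dt in (600, 1800):
    n = count_underground(levels, winds, dt)
    print(f"timestep {dt} s: {n} of {len(levels)} trajectories underground")
```

With these made-up numbers the longer timestep pushes more departure points below the surface, which is the qualitative behaviour described: too large a timestep or very strong winds produce more underground trajectories.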

AndreyOR Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,795,377 RAC: 19,573	Message 66122 - Posted: 20 Sep 2022, 11:25:20 UTC I'd like to put it more bluntly and say that CPDN tasks are definitely very sensitive to interruptions (and I believe it's relatively well documented in the forums). By far the worst of any project I'm aware of. Even a couple of LHC subprojects that must be run to completion without interruption, will just restart from the beginning. CPDN's error rate is at least 10%, Bryn Mawr's (who posted above) is over 11%. Mine is over 22%. Many of those are due to restarts (especially if happens more than once). I'd expect CPDN to have a higher error rate than other projects due to valid reasons (i.e. "Negative Pressure Detected"). But for a project that has workunits that take days to weeks to complete, 10%+ error rate is too high, I think, as that means that days' and weeks' worth of processing time is wasted because the tasks can't handle interruptions well. Glenn, it's encouraging to hear that you'd like to look into this and potentially fix it. I'm not sure which OS is worse but the issue affects Windows, macOS, and Linux tasks. ID: 66122 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,980,824 RAC: 21,902	Message 66123 - Posted: 20 Sep 2022, 12:20:19 UTC - in response to Message 66122. Last modified: 20 Sep 2022, 12:41:27 UTC I'd like to put it more bluntly and say that CPDN tasks are definitely very sensitive to interruptions (and I believe it's relatively well documented in the forums). By far the worst of any project I'm aware of. Even a couple of LHC subprojects that must be run to completion without interruption, will just restart from the beginning. CPDN's error rate is at least 10%, Bryn Mawr's (who posted above) is over 11%. Mine is over 22%. Many of those are due to restarts (especially if happens more than once). I'd expect CPDN to have a higher error rate than other projects due to valid reasons (i.e. "Negative Pressure Detected"). But for a project that has workunits that take days to weeks to complete, 10%+ error rate is too high, I think, as that means that days' and weeks' worth of processing time is wasted because the tasks can't handle interruptions well. Glenn, it's encouraging to hear that you'd like to look into this and potentially fix it. I'm not sure which OS is worse but the issue affects Windows, macOS, and Linux tasks. Edit: re my previous post about errors on Linux tasks, one in four or five is the error rate when the computer is being turned off every night, so on a ballpark figure of seven days for a task, that is closer to one in thirty falling over per task/shutdown event. Still a lot higher than ideal though. The zips that are uploaded at the same time as the trickle-ups for credit are generated still provide some data that can be used, I believe, even if the task is a hard fail and all attempts crash. Edit 2: In contrast, the error rate for my tasks on the testing site, where the nature of testing might lead one to expect a higher error rate, is one in 20 over the last 60 tasks.
(Reboots while running testing work are only for emergencies or when a workman requires power to go off.) ID: 66123 ·
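The ballpark conversion from a per-task failure rate to a per-shutdown failure rate can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that every nightly shutdown is an independent event with the same failure probability and that a task sees seven of them; neither assumption comes from measured data.

```python
def per_event_failure(per_task_failure, events_per_task):
    """Per-shutdown failure probability, given the per-task failure
    probability and the number of shutdowns a task goes through.

    Assumes independent, identical events: the task survives only if it
    survives every single shutdown.
    """
    survive_task = 1.0 - per_task_failure
    survive_event = survive_task ** (1.0 / events_per_task)
    return 1.0 - survive_event


for label, p_task in (("one in four", 0.25), ("one in five", 0.20)):
    q = per_event_failure(p_task, 7)
    print(f"{label} per task -> roughly one in {1 / q:.0f} per shutdown")
```

Under those assumptions the per-shutdown loss comes out somewhere between about one in 25 and one in 32, in line with the "closer to one in thirty" estimate.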

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606	Message 66124 - Posted: 20 Sep 2022, 13:42:03 UTC - in response to Message 66122. In defence of CPDN, I think it's quite impressive that operational weather forecast models, 5-10 million lines of code, designed to run on highly parallel high performance computer systems, can be made to work on a range of Intel & AMD home/server hardware, across multiple operating systems. One of the complications with this setup is boinc, which imposes certain constraints, e.g. we have to make sure restarts work whether from clean or sudden shutdowns, and that the model responds well to being suspended, swapped in/out of memory etc. It took 2 yrs of work for OpenIFS to run in CPDN and a lot of that was on the boinc side and testing. I'd say 10-15% failures is acceptable given the wide range of computers it's running on. As I'm only volunteering I'm not promising to fix restart issues. I thought I'd ask to understand if there are any quick fixes. The more pressing issue should be to eradicate the need for 32bit libraries if possible. I need to finish working on OpenIFS first though. ID: 66124 ·

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 149 Credit: 12,830,559 RAC: 228	Message 66125 - Posted: 20 Sep 2022, 16:47:23 UTC - in response to Message 66122. I'd like to put it more bluntly and say that CPDN tasks are definitely very sensitive to interruptions (and I believe it's relatively well documented in the forums). By far the worst of any project I'm aware of. Even a couple of LHC subprojects that must be run to completion without interruption, will just restart from the beginning. CPDN's error rate is at least 10%, Bryn Mawr's (who posted above) is over 11%. Mine is over 22%. Many of those are due to restarts (especially if happens more than once). I'd expect CPDN to have a higher error rate than other projects due to valid reasons (i.e. "Negative Pressure Detected"). But for a project that has workunits that take days to weeks to complete, 10%+ error rate is too high, I think, as that means that days' and weeks' worth of processing time is wasted because the tasks can't handle interruptions well. Glenn, it's encouraging to hear that you'd like to look into this and potentially fix it. I'm not sure which OS is worse but the issue affects Windows, macOS, and Linux tasks. Whilst I have had errors, mostly negative theta, I have not had a task fail on restart in a long time. Then, I very rarely restart more than once during the running of a single task. ID: 66125 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,980,824 RAC: 21,902	Message 66126 - Posted: 21 Sep 2022, 14:59:59 UTC Another batch of Hadcm3s is on its way for testing site. I believe at some point this will result in more main site work but given that they won't run on recent releases of MacOS increasingly they will only be available to those who are willing and able to go down the virtualisation route. (Didn't work when I tried it, though others with the same CPU and OS have got it to work. I will try again next time I do a clean install.) I looked at my Africa Rain Project tasks on WCG today. Not a single failed task despite this time of year when I don't have so much solar, the machine being turned off every night. ID: 66126 ·

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 66127 - Posted: 21 Sep 2022, 15:45:53 UTC - in response to Message 66126. Another batch of Hadcm3s is on its way for testing site. I believe at some point this will result in more main site work but given that they won't run on recent releases of MacOS increasingly they will only be available to those who are willing and able to go down the virtualisation route. I'll have to get my VMs up and running again and waiting for the work! Need to spin up a few more of those, some hardware has rotated since the last batch. Not a single failed task despite this time of year when I don't have so much solar, the machine being turned off every night. Is there a reason you shut them down instead of sleep them? I do almost all of my compute in my solar powered, off-grid office, and I just put the machines to sleep at night - they don't pull enough power to matter, and it avoids task restarts as they're never being terminated and restarted - the machine just goes to sleep. There was one old Xeon box I couldn't do this with because it pulled 150W asleep, so I just pointed it at other projects - but the rest of my stuff is quite happy with sleep/resume cycles and CPDN works with that just fine. ID: 66127 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66128 - Posted: 21 Sep 2022, 21:24:23 UTC - in response to Message 66126. Another batch of Hadcm3s is on its way for testing site. Will they be MAC only, or Linux also? The last one I got worked OK. Task 22191699 Name hadcm3s_1k9d_200012_168_926_012129726_2 Workunit 12129726 Created 29 Jan 2022, 20:46:55 UTC Sent 29 Jan 2022, 20:48:05 UTC Report deadline 12 Jan 2023, 2:08:05 UTC Received 1 Feb 2022, 13:43:03 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x00000000) Computer ID 1511241 [Linux] Run time 2 days 10 hours 49 min 14 sec CPU time 2 days 10 hours 24 min 3 sec Validate state Valid Credit 4,354.56 ID: 66128 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,980,824 RAC: 21,902	Message 66129 - Posted: 22 Sep 2022, 4:45:25 UTC Will they be MAC only, or Linux also? The last one I got worked OK. Mac only. The error rate on even known reliable Linux machines has been so much higher than on Macs. And the tests have been going on for a couple of months with no hints as to when they will transfer over to main site so it could be months it could be days. ID: 66129 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606	Message 66130 - Posted: 22 Sep 2022, 20:22:00 UTC - in response to Message 66129. Mac only. The error rate on even known reliable Linux machines has been so much higher than on Macs. And the tests have been going on for a couple of months with no hints as to when they will transfer over to main site so it could be months it could be days. HadCM3 is mac only? I didn't know that. Odd, because I've seen the code repository and the build script was (I thought) set up for linux/unix. There should be no reason why it can't be linux as well - something else I'll ask Andy about. ID: 66130 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,980,824 RAC: 21,902	Message 66131 - Posted: 23 Sep 2022, 5:07:18 UTC - in response to Message 66130. HadCM3 is mac only? I didn't know that. Odd, because I've seen the code repository and the build script was (I thought) set up for linux/unix. There should be no reason why it can't be linux as well - something else I'll ask Andy about. Till fairly recently, batches of hadcm3s tasks were for Linux as well. The high error rate on Linux machines with the last few batches is why since then they have only been released for Macs. But you are right that the code allows for them to run on Linux machines. ID: 66131 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66132 - Posted: 23 Sep 2022, 5:59:04 UTC - in response to Message 66131. Till fairly recently, batches of hadcm3s tasks were for Linux as well. The high error rate on Linux machines with the last few batches is why since then they have only been released for Macs. But you are right that the code allows for them to run on Linux machines. I looked at a whole bunch of hadcm3s failures on my machine and they were mostly due to this: Computer ID 1511241 Run time 36 sec CPU time 2 sec Validate state Invalid Application version UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu Stderr <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> SIGSEGV: segmentation violation <---<<< Stack trace (10 frames): /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f4c140] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04] /var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad] /usr/lib/libc.so.6(__libc_start_main+0xf9)[0xf7cc01e9] [snip] It is my contention that a segmentation violation in Linux can only be caused by hardware problems (RAM not working right, overclocking, etc.)
or software bugs, such as using a pointer with an incorrect address in it (typically a value that does not point to an address in the address space of the process, if the programming language is one that uses pointers) or going off either end of an array in programs that do not use pointers. It is my understanding that programs such as UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu are written in FORTRAN, which does not use pointers, so my conjecture is that there is a bug in the source program. Trouble is that the source code is private and not fixable by the ClimatePrediction team even if they were inclined to look at the enormous program there. They would have to run debugging tools (e.g., sdb if it still exists) to find where this is happening and fix it. From the stack trace above, it happened just as the program was starting up, so a whole lot of the source would probably not need looking at. But that assumes I understand the stack trace better than I have confidence in, since I have not done any Linux programming in over 10 years. ID: 66132 ·

AndreyOR Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,795,377 RAC: 19,573	Message 66133 - Posted: 23 Sep 2022, 7:20:03 UTC - in response to Message 66132. Last modified: 23 Sep 2022, 7:28:17 UTC A while back I tried to get LHC projects running on WSL2 (Ubuntu 20.04) and one of them, native (as opposed to VBox) Theory, was failing within a minute or so with SIGSEGV errors. After some time of searching and trying some things, I ran into a suggestion which I decided to try, and it worked. No more SIGSEGV errors, and Theory tasks started running to completion. The fix was to change a kernel parameter via the WSL2 config file to emulate vsyscall. I think it changes how system calls are made, and I believe these types of problems come up when running older Linux programs. I wonder if Linux HadCM3 is experiencing similar issues. My Linux HadCM3 failures had a different error, NAMELIST input, for example: https://www.cpdn.org/result.php?resultid=22182053. It'd be interesting to try running Linux HadCM3 with vsyscall emulated and see if it works. I'd add that running Theory in Ubuntu 20.04 in Hyper-V (Windows native type 1 hypervisor) was no problem. The errors are specific to WSL2. The WSL2 kernel is not the same as regular Linux. One of the differences is that WSL2 uses init.d, not systemd. ID: 66133 ·
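To check whether vsyscall emulation is actually in effect before testing a task, one option is to look at the running kernel's command line. The post does not give the exact setting, so the following is an assumption: the parameter is taken to be `vsyscall=emulate`, passed through the `kernelCommandLine` key in the `[wsl2]` section of `.wslconfig`. The helper names are made up for this sketch.

```python
from pathlib import Path
from typing import Optional


def vsyscall_mode(cmdline: str) -> Optional[str]:
    """Return the value of the vsyscall= kernel parameter, or None if absent.

    If the parameter appears more than once, the last value is returned
    (an assumption about how the kernel resolves duplicates).
    """
    mode = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "vsyscall" and sep:
            mode = value
    return mode


def current_vsyscall_mode(proc_cmdline: Path = Path("/proc/cmdline")) -> Optional[str]:
    """Read the running kernel's command line; None if it cannot be read."""
    try:
        return vsyscall_mode(proc_cmdline.read_text())
    except OSError:
        return None
```

Calling `current_vsyscall_mode()` inside the WSL2 distribution and getting `"emulate"` back would suggest the setting took effect; `None` only means the parameter was not passed explicitly, not that vsyscall is necessarily disabled.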

New work discussion - 2