Batch 1008, and test batches 1009 to 1014 for Windows

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812	Message 70700 - Posted: 3 Apr 2024, 7:14:09 UTC Last modified: 3 Apr 2024, 7:25:13 UTC
I started 8 tasks yesterday, starting soon after release (2 tasks each on four machines, all Intel i5). Sample task: wah2_eas25_n2nl_201712_24_1008_012274697_0
The only clue I can see so far is:
    Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 7432, iMonCtr = 2
    Model crash detected, will try to restart...
    Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 4880, iMonCtr = 2
All failed after about 8 hours, round about where the first trickle would have been expected. I'll have another look round in more detail later.
Edit - looks like they didn't send either a credit trickle or a data trickle. But this machine did send an out.zip, which may help.
ID: 70700 · Reply Quote

Harri Liljeroos Send message Joined: 9 Dec 05 Posts: 116 Credit: 12,547,934 RAC: 2,738	Message 70701 - Posted: 3 Apr 2024, 7:55:38 UTC I'm seeing the same, here: https://www.cpdn.org/result.php?resultid=22415908 and here: https://www.cpdn.org/result.php?resultid=22417236 ID: 70701 · Reply Quote

AndreyOR Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,777,132 RAC: 19,418	Message 70702 - Posted: 3 Apr 2024, 7:59:54 UTC I had 2 crash just shy of the 12-hour mark on the same PC (i7-4790) with the same errors. Seems like the global model is crashing? https://www.cpdn.org/result.php?resultid=22417487 https://www.cpdn.org/result.php?resultid=22416101 ID: 70702 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269	Message 70703 - Posted: 3 Apr 2024, 8:38:50 UTC - in response to Message 70702. Last modified: 3 Apr 2024, 8:46:39 UTC
I've found this myself and done some preliminary investigation. In the 3 task failures I've had, all were due to the regional model crashing as it tried to run 1/Jan. The forecasts all start from 1/Dec. I'm looking into it. The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD has got 3 tasks past 1/Jan. I'll be running a failed workunit standalone to debug what's going on. The other two batches have been held pending investigation of possible issues with this one.
p.s. to determine which model has failed, look in the stderr for these lines:
    executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252
    23:57:52 (252): called boinc_finish(193)
    Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 252, selfPID = 10012, iMonCtr = 2
    Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 10012, selfPID = 8904, iMonCtr = 1
'Global Worker' is the global model and it says it's checking process id = 252. From the executeModelProcess line above it, this process id belongs to the regional model (RCM_PID). If the regional model dies then the global model dies as well, hence the 'CPDN process is not running, exiting.' The monitor controller process then reports that the global model has died, and then it dies too.
To find out where the model was, navigate to the task folder in your boinc 'data' folder in 'projects/climateprediction.net' and you'll find a stdout_mon.txt file with the timesteps listed.
--- CPDN Visiting Scientist
ID: 70703 · Reply Quote
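The stderr check described in the post above can be sketched in Python. This is only a minimal sketch, not CPDN code: the function name `first_failed_process`, the regular expressions and the role labels are assumptions based solely on the log lines quoted in this thread, and it assumes that the process which calls boinc_finish first is the one that failed.

```python
import re

# Patterns inferred from the stderr lines quoted in this thread (assumption).
PIDS_RE = re.compile(r"executeModelProcess: MonID=(\d+), GCM_PID=(\d+), RCM_PID=(\d+)")
FINISH_RE = re.compile(r"\((\d+)\): called boinc_finish\((\d+)\)")

# Sample taken from the post above.
SAMPLE_STDERR = """\
executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252
23:57:52 (252): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 252, selfPID = 10012, iMonCtr = 2
Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 10012, selfPID = 8904, iMonCtr = 1
"""


def first_failed_process(stderr_text):
    """Return (role, exit_code) for the first process that called boinc_finish.

    Returns None if either the PID line or the boinc_finish line is missing.
    """
    pids = PIDS_RE.search(stderr_text)
    finish = FINISH_RE.search(stderr_text)
    if pids is None or finish is None:
        return None
    mon_pid, gcm_pid, rcm_pid = (int(p) for p in pids.groups())
    # Map each process id to the role named on the executeModelProcess line.
    roles = {
        mon_pid: "monitor",
        gcm_pid: "global model",
        rcm_pid: "regional model",
    }
    pid = int(finish.group(1))
    exit_code = int(finish.group(2))
    return roles.get(pid, "unknown"), exit_code


if __name__ == "__main__":
    print(first_failed_process(SAMPLE_STDERR))
```

Run against the sample lines from the post, the boinc_finish caller (252) matches RCM_PID, so the sketch reports the regional model, in line with the explanation given.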

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812	Message 70704 - Posted: 3 Apr 2024, 9:06:16 UTC - in response to Message 70703.
A sample from one of mine: Task 22416552
std_err:
    executeModelProcess: MonID=4880, GCM_PID=3548, RCM_PID=4548
    04:27:51 (4548): called boinc_finish(193)
    Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = Controller4548
stdout_mon.txt:
    ...
    wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011616 A - 02/01/2011 00:00 - H:M:S=0008:25:01 AVG= 2.61 DLT= 1.22
    wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011617 P - 01/01/2011 00:05 - H:M:S=0008:25:07 AVG= 2.61 DLT= 5.41
    wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011618 P - 01/01/2011 00:10 - H:M:S=0008:25:16 AVG= 2.61 DLT= 9.38
    Model crash detected, will try to restart...
Slight garble in std_err, but seems to be the same thing. My machines are all Intel hardware, mostly running Windows 7 Professional x64 without any emulation layer. My Windows 11 laptop also got two tasks, which are still running - I'll keep an eye on them.
ID: 70704 · Reply Quote
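Finding the last timestep reached in stdout_mon.txt, as done by eye in the post above, could be sketched as follows. This is a rough illustration only: the line layout is inferred from the three sample lines quoted in this thread, the function name `last_timestep` and the field names are hypothetical, and the model date is kept as a plain string because the thread does not state the date format.

```python
import re

# Layout inferred from the stdout_mon.txt lines quoted in this thread (assumption):
# <task name> - PH <phase> TS <timestep> <flag> - <model date and time> - H:M:S=...
TS_RE = re.compile(
    r"^(\S+)\s+-\s+PH\s+(\d+)\s+TS\s+(\d+)\s+(\w)\s+-\s+"
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})"
)

# Sample taken from the post above.
SAMPLE_MON = """\
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011616 A - 02/01/2011 00:00 - H:M:S=0008:25:01 AVG= 2.61 DLT= 1.22
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011617 P - 01/01/2011 00:05 - H:M:S=0008:25:07 AVG= 2.61 DLT= 5.41
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011618 P - 01/01/2011 00:10 - H:M:S=0008:25:16 AVG= 2.61 DLT= 9.38
Model crash detected, will try to restart...
"""


def last_timestep(mon_text):
    """Return details of the last timestep line found, or None if there is none."""
    last = None
    for line in mon_text.splitlines():
        match = TS_RE.match(line.strip())
        if match:
            last = {
                "task": match.group(1),
                "phase": int(match.group(2)),
                "timestep": int(match.group(3)),
                "flag": match.group(4),
                "model_time": match.group(5),  # left as text, format not confirmed
            }
    return last


if __name__ == "__main__":
    print(last_timestep(SAMPLE_MON))
```

For the sample above, the last matching line is timestep 11618 at model time 01/01/2011 00:10, just before the 'Model crash detected' message.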

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269	Message 70705 - Posted: 3 Apr 2024, 9:17:02 UTC - in response to Message 70704. They will all be the same output. There does appear to be some difference in success rate between Intel and AMD, but for now I'm running a failed task standalone to see what's going on in more detail. --- CPDN Visiting Scientist ID: 70705 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812	Message 70708 - Posted: 3 Apr 2024, 11:14:54 UTC - in response to Message 70705. Last modified: 3 Apr 2024, 11:33:04 UTC My Windows 11 laptop has now crashed its two tasks as well, at the same place. I've held back the out.zip file for the time being, in case it's any use, but it sounds like the offline debug run will be a better bet. Edit - in view of the reply, I'll let them go. Holding back on even requesting new tasks until we get the go-ahead. ID: 70708 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269	Message 70709 - Posted: 3 Apr 2024, 11:19:08 UTC - in response to Message 70708. Thanks Richard, but it won't be of any use. There is not enough information in the returned files to determine the cause. Workunits use different input files to get the forecast spread. It might be related to a problem in one of the files some of the workunits use. First step is to reproduce it locally and we'll go from there. --- CPDN Visiting Scientist ID: 70709 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639	Message 70710 - Posted: 3 Apr 2024, 11:30:28 UTC
"The only pattern I've noticed (if it is a pattern), is that my failures were on a Win10 VM running on a intel chip, whereas the same VM running on a AMD has got 3 tasks past 1/Jan."
Any idea of the percentage of Intel vs AMD chips? I have been trawling and every single failure I have looked at has been on Intel, but the overwhelming majority of tasks have not returned a zip yet, so there is no evidence they are running correctly. Mine which have returned zips are all Win10 in a VM, as opposed to WINE, which might mask failures. (All on AMD Ryzen 7 3700X.) I guess we might have more data by tomorrow morning, when most computers running 24/7 should have either failed tasks or produced zips.
ID: 70710 · Reply Quote

Yeti Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,739,538 RAC: 58,142	Message 70711 - Posted: 3 Apr 2024, 12:06:35 UTC
My tasks have all failed, on Intel Xeons of varying generations.
Supporting BOINC, a great concept!
ID: 70711 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 70712 - Posted: 3 Apr 2024, 13:04:48 UTC - in response to Message 70700.
I got four tasks yesterday, separated by an hour each. Machine is running Windows 10 with Intel processor.
Computer 1512658
Computer information
CPU type: GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors: 8
Coprocessors: ---
Virtualization: None
Operating System: Microsoft Windows 10 Core x64 Edition, (10.00.19045.00)
BOINC version: 7.24.1
Memory: 15.64 GB
Cache: 256 KB
Swap space: 18.02 GB
Total disk space: 460.73 GB
Free Disk Space: 366.06 GB
Measured floating point speed: 3.91 billion ops/sec
Measured integer speed: 21.76 billion ops/sec
Average upload rate: 113.53 KB/sec
Average download rate: 7120.42 KB/sec
Average turnaround time: 12.32 days
Tasks were 22418337 22417523 22415964 22419311
They all died after running about 12 1/2 hours (45000 seconds).
ID: 70712 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269	Message 70713 - Posted: 3 Apr 2024, 14:08:27 UTC - in response to Message 70710.
Dave, you might recall your dev test did fail and that was on AMD. Without running some analysis on the database I can't give you a good answer. Repeating a failed task standalone reproduces the failure, so I've got something to debug now.
"'The only pattern I've noticed (if it is a pattern), is that my failures were on a Win10 VM running on a intel chip, whereas the same VM running on a AMD has got 3 tasks past 1/Jan.' Any idea of the percentage of Intel vs AMD chips. I have been trawling and every single failure I have looked at has been Intel but, the overwhelming majority of tasks have not returned a zip yet so there is no evidence they are running correctly. Mine which have returned zips are all Wind10 in a VM as opposed to WINE which might mask failures. (All on AMD Ryzen 7 3700X )"
--- CPDN Visiting Scientist
ID: 70713 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639	Message 70714 - Posted: 3 Apr 2024, 14:16:08 UTC
"Dave, you might recall your dev test did fail and that was on AMD."
And that one completed for Richard on an Intel machine.
ID: 70714 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269	Message 70715 - Posted: 3 Apr 2024, 14:24:57 UTC - in response to Message 70714. Last modified: 3 Apr 2024, 14:26:33 UTC
"Dave, you might recall your dev test did fail and that was on AMD."
"And that one completed for Richard on an Intel machine"
Yup. But all my Intel-based workunits are failing for 1008 and the only ones working at the minute are on AMD (scratching of head). I don't think it's a particular input file, as they are different between the failed tasks. So for the time being, the focus is on understanding what the code is doing.
--- CPDN Visiting Scientist
ID: 70715 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812	Message 70718 - Posted: 3 Apr 2024, 15:11:48 UTC - in response to Message 70715. I second that opinion. Just seen my final two fail - that's 12 out of 12, all on Intel - and it includes the machine that processed the dev site task that failed for Dave. ID: 70718 · Reply Quote

rob Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,731,885 RAC: 4,631	Message 70721 - Posted: 3 Apr 2024, 15:59:51 UTC In support of Glenn's comment about tasks running on AMD processors getting further along - the first 1008 batch task on my PC running Windows 10 has passed the first trickle; the rest of my collection are a few hours behind. More will be revealed in the next few hours, or, hopefully, days... ID: 70721 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812	Message 70723 - Posted: 3 Apr 2024, 16:38:46 UTC Brainstorming it through with myself, could it be a compiler switch gone rogue? Might you inadvertently be compiling it with optimisations that work on AMD chips only, inserting opcodes that are valid for AMD but aren't available on Intel? ID: 70723 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269	Message 70724 - Posted: 3 Apr 2024, 17:13:48 UTC - in response to Message 70723. I doubt it. Wrong flags would have been picked up at compile time. It's exactly the same executable used successfully for the 1006 and 1007 batches. But this input data is causing a problem. Optimization is enabled up to O2 and code dispatch up to SSE 4.2. I'm not an expert on AMD but I believe it also supports SSE 4.2. I note the models all seem to fail on 1/Jan, which suggests a problem with the input data in some way, maybe related to precision. Optimisation of Fortran 77 code by a modern compiler could be playing a role too. I've got a fun few days ahead :) --- CPDN Visiting Scientist ID: 70724 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639	Message 70727 - Posted: 3 Apr 2024, 19:29:49 UTC Yes, phenom2, all Ryzen and thread ripper CPUs support SSE4.2 ID: 70727 · Reply Quote

AndreyOR Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,777,132 RAC: 19,418	Message 70729 - Posted: 3 Apr 2024, 21:09:17 UTC All 6 on my Intel PC crashed too; the ones on AMD are humming along. Sounds like a version of the old Y2K problem: switch to a new year - crash. :-D ID: 70729 · Reply Quote
