Batch 1008, and test batches 1009 to 1014 for Windows - issues
Author | Message |
---|---|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
I started 8 tasks yesterday, soon after release (2 tasks each on four machines, all Intel i5). Sample task: wah2_eas25_n2nl_201712_24_1008_012274697_0 The only clue I can see so far is: Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 7432, iMonCtr = 2 Model crash detected, will try to restart... Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 4880, iMonCtr = 2 All failed after about 8 hours, roughly where the first trickle would have been expected. I'll have another look round in more detail later. Edit - looks like they didn't send either a credit trickle or a data trickle. But this machine did send an out.zip, which may help. |
Send message Joined: 9 Dec 05 Posts: 116 Credit: 12,547,934 RAC: 2,738 |
I'm seeing the same, here: https://www.cpdn.org/result.php?resultid=22415908 and here: https://www.cpdn.org/result.php?resultid=22417236 |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,777,132 RAC: 19,418 |
I had 2 tasks crash just shy of the 12-hour mark on the same PC (i7-4790) with the same errors. Seems like the global model is crashing? https://www.cpdn.org/result.php?resultid=22417487 https://www.cpdn.org/result.php?resultid=22416101 |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
I've found this myself and done some preliminary investigation. In the 3 task failures I've had, all were due to the regional model crashing as it tried to run 1/Jan. The forecasts all start from 1/Dec. I'm looking into it. The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD chip has got 3 tasks past 1/Jan. I'll be running a failed workunit standalone to debug what's going on. The other two batches have been held pending investigation of possible issues with this one. P.S. To determine which model has failed, look in the stderr for these lines: executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252 23:57:52 (252): called boinc_finish(193) Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 252, selfPID = 10012, iMonCtr = 2 Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 10012, selfPID = 8904, iMonCtr = 1 'Global Worker' is the global model, and it says it's checking process id 252. From the executeModelProcess line above it, this process id belongs to the regional model (RCM_PID). If the regional model dies then the global model dies as well, hence the 'CPDN process is not running, exiting'. The monitor controller process then reports that the global model has died, and then exits itself. To find out where the model was, navigate to the task folder in your BOINC 'data' folder under 'projects/climateprediction.net' and you'll find a stdout_mon.txt file with the timesteps listed. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
A sample from one of mine: Task 22416552 std_err: executeModelProcess: MonID=4880, GCM_PID=3548, RCM_PID=4548 04:27:51 (4548): called boinc_finish(193) Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = Controller4548 stdout_mon.txt: ... wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011616 A - 02/01/2011 00:00 - H:M:S=0008:25:01 AVG= 2.61 DLT= 1.22 wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011617 P - 01/01/2011 00:05 - H:M:S=0008:25:07 AVG= 2.61 DLT= 5.41 wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011618 P - 01/01/2011 00:10 - H:M:S=0008:25:16 AVG= 2.61 DLT= 9.38 Model crash detected, will try to restart... Slight garble in std_err, but it seems to be the same thing. My machines are all Intel hardware, mostly running Windows 7 Professional x64 without any emulation layer. My Windows 11 laptop also got two tasks, which are still running - I'll keep an eye on them. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
They will all have the same output. There does appear to be some difference in success rate between Intel and AMD, but for now I'm running a failed task standalone to see what's going on in more detail. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
My Windows 11 laptop has now crashed its two tasks as well, at the same place. I've held back the out.zip file for the time being, in case it's any use, but it sounds like the offline debug run will be a better bet. Edit - in view of the reply, I'll let them go. Holding back on even requesting new tasks until we get the go-ahead. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
Thanks Richard, but it won't be of any use. There is not enough information in the returned files to determine the cause. Workunits use different input files to get the forecast spread. It might be related to a problem in one of the files some of the workunits use. First step is to reproduce it locally and we'll go from there. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
"The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD chip has got 3 tasks past 1/Jan." Any idea of the percentage of Intel vs AMD chips? I have been trawling, and every single failure I have looked at has been on Intel, but the overwhelming majority of tasks have not returned a zip yet, so there is no evidence they are running correctly. Mine which have returned zips are all Win10 in a VM, as opposed to WINE, which might mask failures. (All on AMD Ryzen 7 3700X.) I guess we might have more data by tomorrow morning, when most computers running 24/7 should have either failed tasks or produced zips. |
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,739,538 RAC: 58,142 |
My tasks have all failed on Intel Xeons of varying generations. Supporting BOINC, a great concept! |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I got four tasks yesterday, separated by an hour each. Machine is running Windows 10 with Intel processor. Computer 1512658 Computer information CPU type GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1] Number of processors 8 Coprocessors --- Virtualization None Operating System Microsoft Windows 10 Core x64 Edition, (10.00.19045.00) BOINC version 7.24.1 Memory 15.64 GB Cache 256 KB Swap space 18.02 GB Total disk space 460.73 GB Free Disk Space 366.06 GB Measured floating point speed 3.91 billion ops/sec Measured integer speed 21.76 billion ops/sec Average upload rate 113.53 KB/sec Average download rate 7120.42 KB/sec Average turnaround time 12.32 days Tasks were 22418337 22417523 22415964 22419311 They all died after running about 12 1/2 hours (45000 seconds). |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
Dave, you might recall your dev test did fail and that was on AMD. Without running some analysis on the database I can't give you a good answer. Repeating a failed task standalone reproduces the failure, so I've got something to debug now. "The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD chip has got 3 tasks past 1/Jan." --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
"Dave, you might recall your dev test did fail and that was on AMD." And that one completed for Richard on an Intel machine. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
Yup. But all my Intel-based workunits are failing for 1008 and the only ones working at the minute are on AMD (scratching of head). I don't think it's a particular input file, as they are different between the failed tasks. So for the time being, the focus is on understanding what the code is doing. "Dave, you might recall your dev test did fail and that was on AMD." "And that one completed for Richard on an Intel machine." --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
I second that opinion. Just seen my final two fail - that's 12 out of 12, all on Intel - and it includes the machine that processed the dev site task that failed for Dave. |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,731,885 RAC: 4,631 |
In support of Glenn's comment about tasks running on AMD processors getting further along - the first 1008 batch task on my PC running Windows 10 has passed the first trickle, and the rest of my collection are a few hours behind. More will be revealed in the next few hours or, hopefully, days... |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
Brainstorming it through with myself, could it be a compiler switch gone rogue? Might you inadvertently be compiling it with optimisations that work on AMD chips only, inserting opcodes that are valid for AMD but aren't available on Intel? |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
I doubt it. Wrong flags would have been picked up at compile time. It's exactly the same executable used successfully for the 1006 and 1007 batches, but this input data is causing a problem. Optimization is enabled up to O2 and code dispatch up to SSE 4.2. I'm not an expert on AMD but I believe it also supports SSE 4.2. I note the models all seem to fail on 1/Jan, which suggests a problem with the input data in some way, maybe related to precision. Optimisation of Fortran 77 code by a modern compiler could be playing a role too. I've got a fun few days ahead :) --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
Yes, all Ryzen and Threadripper CPUs support SSE4.2. (Phenom II does not; it only has SSE4a.) |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,777,132 RAC: 19,418 |
All 6 on my Intel PC crashed too, the ones on AMD are humming along. Sounds like a version of the old Y2K problem, switch to a new year - crash. :-D |
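
Glenn's post above explains how to tell which model died from the stderr output, and that stdout_mon.txt shows how far the run got. Below is a minimal Python sketch of that check. It assumes the log lines look like the samples quoted in this thread; the function names are made up for illustration and are not part of any CPDN or BOINC tool. The thread does not say what the single-letter flag after the timestep number means, so it is returned uninterpreted.

```python
import re

# One pattern with two alternatives, modelled on the stderr lines quoted in this thread:
#   executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252
#   23:57:52 (252): called boinc_finish(193)
TOKEN_RE = re.compile(
    r"executeModelProcess:\s*MonID=(\d+),\s*GCM_PID=(\d+),\s*RCM_PID=(\d+)"
    r"|\((\d+)\):\s*called boinc_finish\((\d+)\)"
)

# Timestep entries as they appear in the quoted stdout_mon.txt sample:
#   ... - PH 1 TS 0011618 P - 01/01/2011 00:10 - H:M:S=0008:25:16 ...
STEP_RE = re.compile(r"TS\s+(\d+)\s+(\w)\s+-\s+(\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2})")


def identify_failed_model(stderr_text):
    """Return (role, exit_code) for the first known process that called
    boinc_finish after an executeModelProcess line, or None if not found."""
    roles = {}
    for m in TOKEN_RE.finditer(stderr_text):
        if m.group(1) is not None:
            # executeModelProcess line: remember which PID belongs to which process.
            roles = {
                m.group(1): "monitor",
                m.group(2): "global model (GCM)",
                m.group(3): "regional model (RCM)",
            }
        elif m.group(4) in roles:
            # boinc_finish line whose PID matches one of the recorded processes.
            return roles[m.group(4)], int(m.group(5))
    return None


def last_timestep(stdout_mon_text):
    """Return (timestep, flag, model_date) for the last timestep entry, or None."""
    last = None
    for m in STEP_RE.finditer(stdout_mon_text):
        last = (int(m.group(1)), m.group(2), m.group(3))
    return last


if __name__ == "__main__":
    # File names are assumptions; point these at the files in your own task folder.
    with open("stderr.txt", encoding="utf-8", errors="replace") as f:
        print(identify_failed_model(f.read()))
    with open("stdout_mon.txt", encoding="utf-8", errors="replace") as f:
        print(last_timestep(f.read()))
```

The patterns are searched across the whole text rather than line by line, so the sketch should also cope with stderr copied from the results web page, where line breaks are sometimes lost.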
©2024 cpdn.org