Batch 1008, and test batches 1009 to 1014 for Windows - issues
Author | Message |
---|---|
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
There will be some more small batch tests going out this week. Progress has been made on the problem with batch 1008. --- CPDN Visiting Scientist |
Send message Joined: 25 Sep 08 Posts: 5 Credit: 28,325,115 RAC: 581 |
For what it is worth, it seems that my host machine chews through the recent tasks fairly reliably, though not faultlessly. AMD machine with ECC memory. |
Send message Joined: 17 Jan 09 Posts: 124 Credit: 2,027,010 RAC: 2,694 |
Small batch 1011 appears to have dropped. I have one running and 1st trickle has uploaded. Bill F |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"Small batch 1011 appears to have dropped. I have one running and 1st trickle has uploaded." Sounds like progress as all your machines are Intel. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
"Small batch 1011 appears to have dropped. I have one running and 1st trickle has uploaded." "Sounds like progress as all your machines are Intel." Both new batches appear to be the older "Weather At Home 2 (wah2) v8.24" application, so it looks like a bit of benchmarking is going on. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"Both new batches appear to be the older 'Weather At Home 2 (wah2) v8.24' application, so it looks like a bit of benchmarking is going on." Small batch of 60 tasks. Checking a couple specific ancillary files for natural forcing I believe. |
Send message Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013 |
"Small batch of 60 tasks. Checking a couple specific ancillary files for natural forcing I believe." I've got one of these v8.24 tasks. It's survived one restart, so is there some difference to the tasks that almost always crashed on restart? |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"I've got one of these v8.24 tasks. It's survived one restart, so is there some difference to the tasks that almost always crashed on restart?" It's impossible to say yet whether surviving one restart is due to any change. This test is to confirm that two particular ancillary files were not responsible for the failures that seemed to happen on all Intel machines. I would guess that if these run OK, there will be two or three other small batches to test the other ancillary files and narrow down further what the issue is. Glenn is the one who could really answer that question, but he may be too busy trying to find the answer right now! |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
One of my VMs (AMD system) picked up a 1012 task, on WAH 8.29. It's about 12h in, but most of my batch 1008 tasks on that machine are still churning away nicely. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
There have been two batches of 60 with the 8.24 app and two with 8.29. Looking at some of the tasks, all four batches have tasks that have got past the first zip on both Intel and AMD machines. That rules out some causes of the tasks crashing but, from where I am sitting, doesn't say exactly what the problem is. |
Send message Joined: 17 Jan 09 Posts: 124 Credit: 2,027,010 RAC: 2,694 |
These mini-batches seem to be a real mixture. EAS 24 and EAS 29 have been mentioned; my new task below looks like EAS 25, batch 1011: wah2_eas25_a00p_201412_24_1011_012276481_0. Now at 5 trickles and cooking right along. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
They are all EAS25 configuration (East Asia 25km resolution). The batch using version 8.24 of the wah2 was a mistake. It was stopped and rereleased as 8.29. We're testing the input files to see which ones are causing the problems and what we're going to do about it. --- CPDN Visiting Scientist |
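For anyone wanting to tell the configuration and batch apart from the app version at a glance, here is a rough sketch that splits a task name on underscores. It assumes, based only on the two task names quoted in this thread, that the second field is the configuration (eas25) and the sixth is the batch number; `parse_task_name` is a hypothetical helper, and the meaning of the other fields is not stated here, so they are left unlabelled.

```python
def parse_task_name(name):
    """Split a wah2 task name into the fields this thread identifies.

    Assumption: names look like the two quoted in this thread, with eight
    underscore-separated fields. Only the app, configuration and batch
    positions are inferred from the discussion; the rest are kept raw.
    """
    parts = name.split("_")
    if len(parts) != 8:
        raise ValueError(f"unexpected task name layout: {name!r}")
    return {
        "app": parts[0],            # e.g. "wah2"
        "configuration": parts[1],  # e.g. "eas25" (East Asia, 25 km)
        "batch": int(parts[5]),     # e.g. 1011
        "other_fields": [parts[2], parts[3], parts[4], parts[6], parts[7]],
    }


if __name__ == "__main__":
    info = parse_task_name("wah2_eas25_a00p_201412_24_1011_012276481_0")
    print(info["app"], info["configuration"], info["batch"])
```

Note that the app version (8.24 or 8.29) does not appear in the task name at all, which is probably why the "24" in the name was read as a version number.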
Send message Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013 |
"The batch using version 8.24 of the wah2 was a mistake." So at least I don't need to feel so bad about mine crashing on the latest restart. 🙂 |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
As announced in the 2024 new work thread, there's another test batch today. I've got wah2_eas25_a00l_201312_24_1014_012276657_0, so I'll amend the thread title accordingly. It's running a little slowly, because of changes I've made locally to accommodate another project - I can amend those to speed this task up, without risking a restart. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
My batch 1014 test task has passed the first trickle point and is continuing to crunch. Good news. |
Send message Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013 |
I just noticed that the trickles on one task aren't sorting correctly as the top result is dated 11 April and the second result is dated 12 April: https://www.cpdn.org/result.php?resultid=22421336 No idea whether it's a problem. The task appears to be running okay. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
Another oddity is that the two most recent trickles - the ones you're referring to - both refer to the same timestep. It might be helpful if you could look in BOINC's "Event Log" around the times the two last trickles were recorded. Each trickle should show up in two ways, as a scheduler request and as an upload file: 12/04/2024 05:25:35 | climateprediction.net | Sending scheduler request: To send trickle-up message. 12/04/2024 05:25:53 | climateprediction.net | Started upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip 12/04/2024 05:30:43 | climateprediction.net | Finished upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip (99389892 bytes) How many do you see around that time? Glenn may need to work out whether the duplication was caused by the application, or by the post-processing on the server. (One possibility is that the task was interrupted close to the 'trickle point', and restarted from an earlier checkpoint.) |
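If counting those entries by eye is tedious, a small script along these lines could tally them from text copied out of the Event Log. It is only a sketch: the line layout ("date time | project | message") is assumed from the three sample lines above, and `count_trickle_events` is a hypothetical helper, not part of BOINC.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample lines quoted above:
# "DD/MM/YYYY HH:MM:SS | project | message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| ([^|]+) \| (.*)$"
)


def count_trickle_events(lines, project="climateprediction.net"):
    """Count trickle scheduler requests and uploads for one project."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group(2).strip() != project:
            continue  # skip other projects and lines in another layout
        message = match.group(3)
        if "To send trickle-up message" in message:
            counts["scheduler_requests"] += 1
        elif message.startswith("Started upload of"):
            counts["uploads_started"] += 1
        elif message.startswith("Finished upload of"):
            counts["uploads_finished"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "12/04/2024 05:25:35 | climateprediction.net | Sending scheduler request: To send trickle-up message.",
        "12/04/2024 05:25:53 | climateprediction.net | Started upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip",
        "12/04/2024 05:30:43 | climateprediction.net | Finished upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip (99389892 bytes)",
    ]
    print(dict(count_trickle_events(sample)))
```

A mismatch between the number of scheduler requests and the number of finished uploads around the time in question would point to where the two records came from.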
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
It's probably where the model was restarted and ran over a trickle up timestep again. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"It's probably where the model was restarted and ran over a trickle up timestep again." Just so long as this doesn't allow gaming of the system for credit purposes! Edit: Probably not an issue as repeated instances of trying that would probably crash the task. |
Send message Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013 |
"It's probably where the model was restarted and ran over a trickle up timestep again." Having thought a bit more: when I shut down yesterday, BOINC was trying to do an upload, but couldn't connect for some reason. I was in a hurry so I just shut down. What seems to have happened is that the upload went through (at least in part), but the confirmation "Scheduler request" message didn't go through until BOINC started again today, and that has resulted in two records. And for Dave Jackson: "Just so long as this doesn't allow gaming of the system for credit purposes!" The task had exactly the same credit as two other tasks that I'm running that don't have this anomaly, so there doesn't seem to be any credit gaming possibility. Thanks for all your input! |