climateprediction.net (CPDN) home page
Thread 'Batch 1008, and test batches 1009 to 1014 for Windows - issues'


Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 70765 - Posted: 8 Apr 2024, 13:06:36 UTC

There will be some more small batch tests going out this week.
Progress has been made on the problem with batch 1008.
---
CPDN Visiting Scientist
ID: 70765
mjsunkiter

Joined: 25 Sep 08
Posts: 5
Credit: 28,336,711
RAC: 1,360
Message 70772 - Posted: 9 Apr 2024, 1:17:24 UTC

For what it is worth, it seems that my host machine chews through the recent tasks fairly reliably, though not faultlessly. AMD machine with ECC memory.
ID: 70772
Bill F

Joined: 17 Jan 09
Posts: 124
Credit: 2,037,778
RAC: 2,752
Message 70773 - Posted: 9 Apr 2024, 1:30:12 UTC

Small batch 1011 appears to have dropped. I have one running and 1st trickle has uploaded.

Bill F
ID: 70773
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70774 - Posted: 9 Apr 2024, 7:31:41 UTC - in response to Message 70773.  

> Small batch 1011 appears to have dropped. I have one running and 1st trickle has uploaded.
>
> Bill F

Sounds like progress as all your machines are Intel.
ID: 70774
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,823,657
RAC: 4,997
Message 70775 - Posted: 9 Apr 2024, 8:01:42 UTC - in response to Message 70774.  

>> Small batch 1011 appears to have dropped. I have one running and 1st trickle has uploaded.
>>
>> Bill F
>
> Sounds like progress as all your machines are Intel.

Both new batches appear to be the older "Weather At Home 2 (wah2) v8.24" application, so it looks like a bit of benchmarking is going on.
ID: 70775
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70776 - Posted: 9 Apr 2024, 8:40:08 UTC - in response to Message 70775.  

> Both new batches appear to be the older "Weather At Home 2 (wah2) v8.24" application, so it looks like a bit of benchmarking is going on.

Small batch of 60 tasks. Checking a couple specific ancillary files for natural forcing I believe.
ID: 70776
Paul

Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 70782 - Posted: 9 Apr 2024, 17:22:49 UTC - in response to Message 70776.  

> Small batch of 60 tasks. Checking a couple specific ancillary files for natural forcing I believe.

I've got one of these v8.24 tasks. It's survived one restart, so is there some difference to the tasks that almost always crashed on restart?
ID: 70782
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70783 - Posted: 9 Apr 2024, 18:15:26 UTC - in response to Message 70782.  

> I've got one of these v8.24 tasks. It's survived one restart, so is there some difference to the tasks that almost always crashed on restart?


Impossible to say yet whether surviving one restart is due to any change. This test is to confirm that two particular ancillary files were not responsible for the failures that seemed to happen on all Intel machines. I would guess that if these run OK, there will be two or three other small batches to test the other ancillary files and narrow down further what the issue is. Glenn is the one who could really answer that question, but he may be too busy trying to find the answer right now!
ID: 70783
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70787 - Posted: 9 Apr 2024, 22:08:33 UTC

One of my VMs (AMD system) picked up a 1012 task, on WAH 8.29. It's about 12h in, but most of my batch 1008 tasks on that machine are still churning away nicely.
ID: 70787
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70790 - Posted: 10 Apr 2024, 9:37:59 UTC

There have been two batches of 60 with the 8.24 app and two with 8.29. Looking at some of the tasks, all four batches have tasks that have got past the first zip on both Intel and AMD machines. That rules out some causes of the tasks crashing but, from where I am sitting, doesn't say exactly what the problem is.
ID: 70790
Bill F

Joined: 17 Jan 09
Posts: 124
Credit: 2,037,778
RAC: 2,752
Message 70802 - Posted: 10 Apr 2024, 19:54:41 UTC - in response to Message 70775.  

These mini-batches seem to be a real mixture. EAS 24 and EAS 29 have been mentioned, but my new task below looks like EAS 25, batch 1011:

wah2_eas25_a00p_201412_24_1011_012276481_0

Now at 5 trickles and cooking right along.
ID: 70802
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 70803 - Posted: 10 Apr 2024, 20:45:51 UTC - in response to Message 70802.  

They are all EAS25 configuration (East Asia 25km resolution). The batch using version 8.24 of the wah2 was a mistake. It was stopped and rereleased as 8.29. We're testing the input files to see which ones are causing the problems and what we're going to do about it.
---
CPDN Visiting Scientist
ID: 70803
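The task names quoted in this thread can be split into fields to check which configuration and batch a task belongs to. The sketch below is only an illustration: `parse_task_name` is a hypothetical helper, not part of BOINC or CPDN. It assumes the eight underscore-separated fields seen in the names posted here and interprets only the three the thread confirms: the application (wah2), the configuration (eas25, East Asia at 25 km) and the batch number. The remaining fields are left uninterpreted because the thread does not say what they mean.

```python
def parse_task_name(name):
    """Split a task name such as wah2_eas25_a00p_201412_24_1011_012276481_0.

    Assumption: names always have eight underscore-separated fields, with the
    application first, the configuration second and the batch number sixth.
    """
    parts = name.split("_")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {name!r}")
    # Fields 3, 4, 5, 7 and 8 are deliberately ignored: their meaning is not
    # stated in the thread.
    app, config, _, _, _, batch, _, _ = parts
    return {"app": app, "config": config, "batch": int(batch)}


if __name__ == "__main__":
    print(parse_task_name("wah2_eas25_a00p_201412_24_1011_012276481_0"))
```

Run on the task name Bill F posted, this reports application wah2, configuration eas25 and batch 1011. Upload file names carry extra fields, so the helper rejects them rather than guessing.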
Paul

Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 70804 - Posted: 11 Apr 2024, 5:59:15 UTC - in response to Message 70803.  

> The batch using version 8.24 of the wah2 was a mistake.

So at least I don't need to feel so bad about mine crashing on the latest restart. 🙂
ID: 70804
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,716,561
RAC: 8,355
Message 70807 - Posted: 11 Apr 2024, 14:17:19 UTC
Last modified: 11 Apr 2024, 14:18:40 UTC

As announced in the 2024 new work thread, there's another test batch today.

I've got wah2_eas25_a00l_201312_24_1014_012276657_0, so I'll amend the thread title accordingly. It's running a little slowly, because of changes I've made locally to accommodate another project - I can amend those to speed this task up, without risking a restart.
ID: 70807
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,716,561
RAC: 8,355
Message 70812 - Posted: 11 Apr 2024, 20:57:06 UTC

My batch 1014 test task has passed the first trickle point and is continuing to crunch. Good news.
ID: 70812
Paul

Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 70813 - Posted: 12 Apr 2024, 8:20:33 UTC
Last modified: 12 Apr 2024, 8:21:21 UTC

I just noticed that the trickles on one task aren't sorting correctly as the top result is dated 11 April and the second result is dated 12 April: https://www.cpdn.org/result.php?resultid=22421336

No idea whether it's a problem. The task appears to be running okay.
ID: 70813
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,716,561
RAC: 8,355
Message 70814 - Posted: 12 Apr 2024, 9:06:55 UTC - in response to Message 70813.  

Another oddity is that the two most recent trickles - the ones you're referring to - both refer to the same timestep.

It might be helpful if you could look in BOINC's "Event Log" around the times the two last trickles were recorded. Each trickle should show up in two ways:

12/04/2024 05:25:35 | climateprediction.net | Sending scheduler request: To send trickle-up message.
12/04/2024 05:25:53 | climateprediction.net | Started upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip
12/04/2024 05:30:43 | climateprediction.net | Finished upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip (99389892 bytes)
- a scheduler request, and an upload file. How many do you see around that time?

Glenn may need to work out whether the duplication was caused by the application, or by the post-processing on the server. (One possibility is that the task was interrupted close to the 'trickle point', and restarted from an earlier checkpoint.)
ID: 70814
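The check suggested above can be done by eye in the Event Log, or roughly automated on a saved copy of it. The following is a minimal sketch, not a CPDN or BOINC tool: `count_trickle_events` is a hypothetical helper that assumes the `timestamp | project | message` line layout and the message wording shown in the excerpt above, both of which may differ between BOINC versions and locales.

```python
import re

# Assumed layout of one Event Log line: "DD/MM/YYYY HH:MM:SS | project | message".
LINE_RE = re.compile(
    r"^(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<project>[^|]+?) \| (?P<msg>.*)$"
)


def count_trickle_events(log_text, project="climateprediction.net"):
    """Count trickle-up scheduler requests and file uploads for one project."""
    counts = {"scheduler_requests": 0, "uploads_started": 0, "uploads_finished": 0}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None or match.group("project") != project:
            continue  # blank line, other project, or unrecognised format
        msg = match.group("msg")
        if msg.startswith("Sending scheduler request: To send trickle-up message"):
            counts["scheduler_requests"] += 1
        elif msg.startswith("Started upload of "):
            # Note: this counts every upload, not only trickle zip files.
            counts["uploads_started"] += 1
        elif msg.startswith("Finished upload of "):
            counts["uploads_finished"] += 1
    return counts


# The three lines quoted in the post above.
SAMPLE = """
12/04/2024 05:25:35 | climateprediction.net | Sending scheduler request: To send trickle-up message.
12/04/2024 05:25:53 | climateprediction.net | Started upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip
12/04/2024 05:30:43 | climateprediction.net | Finished upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip (99389892 bytes)
"""

if __name__ == "__main__":
    print(count_trickle_events(SAMPLE))
```

For a single healthy trickle, each of the three counts should be 1. A scheduler request without a matching finished upload (or the reverse) around the time in question would be worth reporting.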
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 70815 - Posted: 12 Apr 2024, 9:21:24 UTC - in response to Message 70814.  

It's probably where the model was restarted and ran over a trickle up timestep again.
---
CPDN Visiting Scientist
ID: 70815
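If a restarted model does run over a trickle-up timestep a second time, the task's trickle list would show the same timestep twice. Below is a minimal sketch of spotting that, assuming each trickle record can be reduced to a (received time, timestep) pair. The function name and the sample values are made up for illustration and are not taken from any real task.

```python
from collections import defaultdict


def find_repeated_timesteps(trickles):
    """Return {timestep: [received, ...]} for timesteps reported more than once.

    `trickles` is an iterable of (received, timestep) pairs in any order.
    """
    seen = defaultdict(list)
    for received, timestep in trickles:
        seen[timestep].append(received)
    return {ts: times for ts, times in seen.items() if len(times) > 1}


if __name__ == "__main__":
    # Hypothetical records: the last two trickles report the same timestep.
    sample = [
        ("11 Apr 2024 18:02", 100),
        ("11 Apr 2024 23:40", 200),
        ("12 Apr 2024 08:05", 200),
    ]
    print(find_repeated_timesteps(sample))
```

A repeated timestep only shows that two records exist for the same point in the run. It does not say whether the model re-ran the step or the server recorded one trickle twice.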
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70816 - Posted: 12 Apr 2024, 9:25:58 UTC - in response to Message 70815.  
Last modified: 12 Apr 2024, 9:26:55 UTC

> It's probably where the model was restarted and ran over a trickle up timestep again.

Just so long as this doesn't allow gaming of the system for credit purposes!

Edit: Probably not an issue as repeated instances of trying that would probably crash the task.
ID: 70816
Paul

Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 70817 - Posted: 12 Apr 2024, 17:56:32 UTC - in response to Message 70815.  

> It's probably where the model was restarted and ran over a trickle up timestep again.

Having thought a bit more, when I shut down yesterday, BOINC was trying to do an upload, but couldn't connect for some reason. I was in a hurry so I just shut down.

But what seems to have happened is that the upload went through (at least in part), but the confirmation "Scheduler request" message didn't go through until BOINC started again today. And so that has resulted in two records.

And for Dave Jackson:
> Just so long as this doesn't allow gaming of the system for credit purposes!

The task had exactly the same credit as two other tasks that I'm running that don't have this anomaly, so there doesn't seem to be any credit gaming possibility.

Thanks for all your input!
ID: 70817

©2024 cpdn.org