climateprediction.net (CPDN) home page
Thread 'Batch 996 Weather@Home2 East Asia25'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 69687 - Posted: 7 Oct 2023, 9:33:18 UTC - in response to Message 69684.  

Trickle files uploading OK but zips are now getting stuck.
Alan, if things still don't shift once the rather long back-off is over, I will let Andy know. My zips have all been going through, but that server seems to have an issue with uploads that get interrupted, or at least it did with the last batch.
ID: 69687
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,702,480
RAC: 9,812
Message 69688 - Posted: 7 Oct 2023, 12:24:47 UTC

I'm seeing a wide variation in the time taken to upload the .zip files. Zips 1-5:

00:05:52
00:01:03
00:07:41
00:03:28
00:04:24

But all have completed successfully so far.
ID: 69688
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69689 - Posted: 7 Oct 2023, 13:06:50 UTC - in response to Message 69686.  

So far the failure rate is below that of the recent NZ batch, 995, which had a hard fail rate of 2% and a total task failure rate of 39% (a hard fail is when all 3 tasks in a single workunit fail). This batch is currently below those figures. Signal 11 is just the most common error we get with this model; it's not an indication of problems with the configuration.

WAH is susceptible to crashes from power on/off and suspend/restart cycles. I've already lost several WAH tasks from this batch to this, and that's despite suspending the project, shutting down the client and then powering off. There is a problem with the way one of the models in WAH handles its restarts. I suspect it's not closing them properly, so it's probably more to do with what point in the model timestep it has reached than with how the client is terminated. That may not be the reason for your errors, though.

Hi Rob. Had hoped that the signal11 failures would be a lot lower with this batch but it seems this might not be the case. This is to do with the batch and not your computer. Just hoping there are enough good tasks between this and the last lot for the researcher to get what she needs.

---
CPDN Visiting Scientist
ID: 69689
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69690 - Posted: 7 Oct 2023, 13:34:14 UTC - in response to Message 69686.  

Had hoped that the signal11 failures would be a lot lower with this batch but it seems this might not be the case. This is to do with the batch and not your computer. Just hoping there are enough good tasks between this and the last lot for the researcher to get what she needs.


I have only one computer running Windows and I do not run WINE on the other (Linux) machine. How do you distinguish failures due to the machine from those due to the batch? I assume mine are all from the same batch and they show no signs of failure yet. I guess you see results from many other machines, so you have more data from which to draw conclusions.

My three work units have about two days of work done on each. Each has uploaded 3 zip files. No failures yet. These are on my Windows 10 machine.
Computer 1512658

22339022 	12226566 	5 Oct 2023, 18:39:55 UTC 	16 Oct 2024, 23:59:55 UTC 	In progress 	--- 	--- 	2,506.49 	Weather At Home 2 (wah2) v8.24 windows_intelx86
22339081 	12226625 	5 Oct 2023, 17:39:17 UTC 	16 Oct 2024, 22:59:17 UTC 	In progress 	--- 	--- 	2,506.49 	Weather At Home 2 (wah2) v8.24 windows_intelx86
22340449 	12227993 	5 Oct 2023, 16:38:36 UTC 	16 Oct 2024, 21:58:36 UTC 	In progress 	--- 	--- 	2,506.49 	Weather At Home 2 (wah2) v8.24 windows_intelx86


Task 22339022
Name 	wah2_eas25_a2bu_200012_24_996_012226566_0
Workunit 	12226566

Computer ID 	1512658

Credit 	2,506.49
Device peak FLOPS 	4.23 GFLOPS
Application version 	Weather At Home 2 (wah2) v8.24 windows_intelx86

ID: 69690
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 69691 - Posted: 7 Oct 2023, 14:19:34 UTC

Now downloading Tiny10 (a minimalist version of Windows 10) to run in a VM. Don't want to dual boot as this is my main machine and non BOINC work all happens in Linux. This will let me go back to running testing work for Windows.
ID: 69691
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69692 - Posted: 7 Oct 2023, 17:00:08 UTC - in response to Message 69691.  
Last modified: 7 Oct 2023, 17:03:02 UTC

Don't want to dual boot as this is my main machine and non BOINC work all happens in Linux.


Good Idea.

I, too, hate dual booting because I run almost everything in Linux. I need Windows only to run TaxAct each year to do my income taxes (Federal and my state). And four times a year to keep my Garmin GPS unit up to date.

I could get a Windows license to run Windows on this machine, but a few years ago I got sick of that, so I got a little desktop machine (it looks just like a monitor, but the computer is inside the monitor). That little computer runs Windows 10 and has nothing else to do, so I downloaded BOINC onto it.

I signed it up for CPDN, WCG, DENIS, Rosetta, Einstein, and Universe.

My main machine is ID: 1511241 and has lots of RAM and processor cache. My pipsqueak machine is ID: 1512658 and has much less RAM and a slower processor with only 8 cores. My Linux machine has a pretty fast processor with 16 cores.
ID: 69692
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,027
RAC: 4,083
Message 69693 - Posted: 7 Oct 2023, 17:10:39 UTC - in response to Message 69689.  

Thanks Glenn & Dave.
All but one of the failures was after shutting down for the night. It's somewhat reassuring that it's not my computer that's got an issue, but it's a bit disappointing that the stop/restart issue hasn't been fully cured yet.
I'll see what happens if I leave it on overnight and report back tomorrow.
ID: 69693
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 69694 - Posted: 7 Oct 2023, 17:39:37 UTC - in response to Message 69693.  

I'll see what happens if I leave it on overnight and report back tomorrow.
That is what I have been doing, though using WINE, which may well be suspect as Glenn says. I seem to get very few failures.

What I would really like to do once I have Tiny10 running is to run the same task on Tiny10 and using WINE to see what differences there are in results.
ID: 69694
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69695 - Posted: 7 Oct 2023, 17:42:48 UTC - in response to Message 69693.  

All but one of the failures was after shutting down for the night. It's somewhat reassuring that it's not my computer that's got an issue, but it's a bit disappointing that the stop/restart issue hasn't been fully cured yet.


That may be why I seem to get fewer crashes than others. I let my machines run 24/7 and reboot them only when installing updates.
ID: 69695
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,027
RAC: 4,083
Message 69696 - Posted: 8 Oct 2023, 9:39:11 UTC - in response to Message 69695.  

Left the PC on overnight.
The task that was running last night is still running this morning.
However, a new task arrived and promptly crashed (less than 2 minutes running) with a segment violation error: https://www.cpdn.org/result.php?resultid=22344887
So it looks as if that problem, while reduced, is still around.
ID: 69696
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69697 - Posted: 8 Oct 2023, 12:30:50 UTC - in response to Message 69696.  

Left the PC on overnight.
The task that was running last night is still running this morning.
However, a new task arrived and promptly crashed (less than 2 minutes running) with a segment violation error: https://www.cpdn.org/result.php?resultid=22344887
So it looks as if that problem, while reduced, is still around.


I wonder what your problem is.
My three tasks are still running and have now uploaded 5 trickles.
Oldest one is:
22340449 	12227993 	5 Oct 2023, 16:38:36 UTC 	16 Oct 2024, 21:58:36 UTC 	In progress

ID: 69697
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69698 - Posted: 8 Oct 2023, 16:16:51 UTC - in response to Message 69694.  

What I would really like to do once I have Tiny10 running is to run the same task on Tiny10 and using WINE to see what differences there are in results.
You may well get two different answers - but the real question is, which one is right? ;)
---
CPDN Visiting Scientist
ID: 69698
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 69699 - Posted: 8 Oct 2023, 16:53:41 UTC - in response to Message 69698.  

I am going to try copying from WINE when I have fewer tasks running and see if transferring all the BOINC data across allows it to run on Windows 10. Internet access will be suspended on the Tiny10 machine and I can see if the zips produced are identical. That will be a first step. At least we will be able to rule out differences due to hardware. If I can make it work as I describe, I am more than happy to send zip files over for someone who knows what to look for to have a gander, assuming they are different. If they come back identical I will report in a new thread.
ID: 69699
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 69700 - Posted: 8 Oct 2023, 16:57:08 UTC

I've now got ten Zips waiting to upload, three nominally in progress with the farthest at 7.49%. The <rsc_disk_bound> is 2e9 bytes and 24 zips at average ~99 MB plus "_restart" and "_out" zips are going to bust that limit.
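
As a rough check of the arithmetic in the paragraph above, the following sketch multiplies out the quoted figures. It assumes 1 MB = 10**6 bytes and ignores the "_restart" and "_out" zips and the model's own working files, so the real headroom would be smaller still. The function names are illustrative only.

```python
# Figures quoted in the post above.
RSC_DISK_BOUND_BYTES = 2e9   # <rsc_disk_bound> for the task
ZIP_COUNT = 24               # zips produced over a full run
AVG_ZIP_BYTES = 99e6         # ~99 MB each (assuming 1 MB = 10**6 bytes)


def zips_total_bytes(count, avg_bytes):
    """Disk space used if `count` zips of `avg_bytes` each are all waiting."""
    return count * avg_bytes


def max_zips_within_bound(bound_bytes, avg_bytes):
    """Largest whole number of zips that fits under the bound on its own."""
    return int(bound_bytes // avg_bytes)


if __name__ == "__main__":
    total = zips_total_bytes(ZIP_COUNT, AVG_ZIP_BYTES)
    fits = max_zips_within_bound(RSC_DISK_BOUND_BYTES, AVG_ZIP_BYTES)
    print(f"{ZIP_COUNT} zips need {total / 1e9:.3f} GB; "
          f"bound is {RSC_DISK_BOUND_BYTES / 1e9:.1f} GB")
    print(f"zips that fit under the bound by themselves: {fits}")
```

On these figures the zips alone exceed the bound well before the 24th is produced if none of them upload.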

One other task downloaded during the upload failures and trickles are uploading, so the general comms appear to be working. The model will need to be suspended before completion unless the upload problem is fixed. I normally run tasks without restarts, but as the disk bound is approached that will be an option and might fix a local problem (though I can see no other evidence of that). Any other suggestions?
ID: 69700
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 69701 - Posted: 8 Oct 2023, 17:24:46 UTC - in response to Message 69700.  
Last modified: 8 Oct 2023, 19:18:23 UTC

I've now got ten Zips waiting to upload,
The behaviour is odd. I have 8 tasks running and a very slow connection, yet all my zips have gone through, while some people seem to be having issues.

I thought for a brief period that, my client in Tiny10 having picked up a couple of resends which both seem to be running fine, maybe virtualisation gave some sort of protection against the memory issue behind the sig11 failures. But close inspection of the first failures tells me that I need to wait rather longer to be sure of that. One failure has no stderr on the task page, and the other two failures between the tasks are a sig11 and an exit code 15.
ID: 69701
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69704 - Posted: 8 Oct 2023, 20:24:24 UTC - in response to Message 69699.  
Last modified: 8 Oct 2023, 20:24:50 UTC

I am going to try copying from WINE when I have fewer tasks running and see if transferring all the BOINC data across allows it to run on Windows 10. Internet access will be suspended on the Tiny10 machine and I can see if the zips produced are identical. That will be a first step. At least we will be able to rule out differences due to hardware. If I can make it work as I describe, I am more than happy to send zip files over for someone who knows what to look for to have a gander, assuming they are different. If they come back identical I will report in a new thread.
I don't think the zips will be identical, Dave. If I remember right, there are date strings (when the model ran) in the model output, which will make them different even if the model results are identical. But I could be wrong; it's interesting to test, if it's not too much work.
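
One way to make such a comparison more informative than a whole-file check is to compare the archives member by member, so that it is clear which files inside the zips differ. The sketch below uses Python's standard `zipfile` module and the CRC-32 stored for each member; the helper names are hypothetical, and it says nothing about why a member differs (for example an embedded date string versus a genuinely different result).

```python
import zipfile


def member_crcs(zip_path):
    """Map each member name in the archive to its stored CRC-32."""
    with zipfile.ZipFile(zip_path) as zf:
        return {info.filename: info.CRC for info in zf.infolist()}


def compare_zips(path_a, path_b):
    """Return (only_in_a, only_in_b, differing) lists of member names."""
    a = member_crcs(path_a)
    b = member_crcs(path_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    # Members present in both archives whose contents have different CRCs.
    differing = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_a, only_b, differing
```

Calling `compare_zips("wine_run.zip", "tiny10_run.zip")` (placeholder file names) would list the members to inspect by hand; an empty `differing` list means the member contents match even if the zip files themselves are not byte-identical.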

If they are the same though that's not a general statement about all the forecasts, just the one your machine ran. Ideally we'd look for biases in the ensemble results. Say run 1,000s of forecasts & look at the statistics of the event we're interested in, then take out all the results for all tasks run on WINE (can't do because can't easily identify such hosts) and then compare. Any biases from WINE based tasks would need to be statistically significant. Could also do the same thing to compare Intel & AMD chips for example.
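
To illustrate what "statistically significant" could mean here, the sketch below runs a simple two-sided permutation test on the difference in means between two groups of results. This is only one possible test, not the method CPDN uses; the function name, the choice of statistic and the number of permutations are all assumptions made for illustration.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reassign results to the two groups at random and recompute the gap.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

Here `group_a` might hold a diagnostic from tasks run under WINE and `group_b` the same diagnostic from native Windows hosts; a small p-value would suggest a bias worth investigating, subject to being able to identify the WINE hosts in the first place.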

It is possible to capture the task files from the slot directory and then run the code standalone; that's how I do some of the testing. You can use the task manager to capture the command-line arguments for running the task wrapper executable.
---
CPDN Visiting Scientist
ID: 69704
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69705 - Posted: 8 Oct 2023, 20:37:28 UTC - in response to Message 69700.  

One other task downloaded during upload failures and trickles are uploading, so the general comms appears to be working. The model will need to be suspended before completion unless the upload problem is fixed. I normally run tasks without restarts but as the disk bound is approached that will be an option and might fix a local problem (though I can see no other evidence of that). Any other suggestions?
You probably know the downloads, trickles & uploads all go to different servers. I'm not aware of any problems with the upload server. I'll bring this up in the CPDN tech meeting tomorrow and report back if there are any problems, as I don't have direct access to the upload servers myself.
---
CPDN Visiting Scientist
ID: 69705
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69706 - Posted: 8 Oct 2023, 20:47:32 UTC - in response to Message 69696.  

However, a new task arrived and promptly crashed (less than 2 minutes running) with a segment violation error. So it looks as if that problem, while reduced, is still around.
Yep, the first 2 mins is when the global model runs. It then produces the boundary initial files for the regional model to use. It's when the regional model starts up and reads those files that it goes wrong. We think it's because there is some strong convection/updraught which causes an array bound to be exceeded in the code (which isn't caught). I am currently looking at this in a test case. It's related to the size of the region, which is why the larger region for East Asia25 had a very high failure rate. But why it happens further on in the run after a power on/off, I'm less sure, though I suspect a problem with the restart files.
---
CPDN Visiting Scientist
ID: 69706
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 69707 - Posted: 8 Oct 2023, 22:25:28 UTC - in response to Message 69687.  

Trickles still going through OK. Most zips have gone as well but one is stuck -

08/10/2023 23:09:16 | climateprediction.net | Backing off 03:26:52 on upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_3.zip

whereas the 4th, 5th and 6th zips have gone:

08/10/2023 11:21:55 | climateprediction.net | Finished upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_4.zip (99468575 bytes)
08/10/2023 08:58:22 | climateprediction.net | Finished upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_5.zip (99652526 bytes)
08/10/2023 20:21:02 | climateprediction.net | Finished upload of wah2_eas25_a49c_201212_24_996_012229068_0_r1003289668_6.zip (99535063 bytes)
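
For anyone wanting to tally which zips have gone through, event-log lines in the format shown above can be picked apart with a regular expression. This is a minimal sketch that assumes the exact line layout quoted in this post (timestamp, project, message, separated by " | "); other client versions or locales may format lines differently, and the function name is hypothetical.

```python
import re

# Matches lines such as:
# 08/10/2023 11:21:55 | climateprediction.net | Finished upload of <name>_4.zip (99468575 bytes)
FINISHED = re.compile(
    r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} \| [^|]+ \| "
    r"Finished upload of (?P<name>\S+?)_(?P<index>\d+)\.zip "
    r"\((?P<size>\d+) bytes\)$"
)


def finished_uploads(lines):
    """Return {zip index: size in bytes} for each 'Finished upload' line."""
    done = {}
    for line in lines:
        match = FINISHED.match(line.strip())
        if match:
            done[int(match.group("index"))] = int(match.group("size"))
    return done
```

Lines that do not report a finished upload, such as the "Backing off" message, are simply skipped, so any zip index missing from the result is one that has not completed.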

Should I just abort the transfer or keep my fingers crossed that it will go at some point?
ID: 69707
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 69708 - Posted: 9 Oct 2023, 1:12:20 UTC - in response to Message 69707.  

Should I just abort the transfer or keep my fingers crossed that it will go at some point?
I would keep them at least till Glenn reports back from the meeting tomorrow morning.
ID: 69708