Thread 'New work discussion - 2'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 69153 - Posted: 7 Jul 2023, 5:24:16 UTC

Fresh batch of 120month spin up tasks has been released on testing. We are looking at 25 days to completion for the three on my Ryzen so looking at about a month before the new batch of main site tasks if all goes well with them.
ID: 69153
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69155 - Posted: 7 Jul 2023, 8:32:07 UTC - in response to Message 69153.  
Last modified: 7 Jul 2023, 8:38:21 UTC

Fresh batch of 120month spin up tasks has been released on testing. We are looking at 25 days to completion for the three on my Ryzen so looking at about a month before the new batch of main site tasks if all goes well with them.
That is a batch of test spinup workunits from the current failing batch that have already failed on my machine. I'm not sure why you picked up the resends. CPDN are going to discuss redefining the region for the limited area model with the scientists. We think the size of it is causing the segv.

I'm still suspicious that your WINE implementation avoids the segv by sandboxing the environment, which is why your tasks are running. I'll check with Sarah as I think you can abort those. I'll PM you.
---
CPDN Visiting Scientist
ID: 69155
Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 69159 - Posted: 7 Jul 2023, 13:03:25 UTC

Hi!
Where are those Big OIFS work units that were talked about earlier?
I upgraded my systems but haven't seen any.
ID: 69159
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69160 - Posted: 7 Jul 2023, 13:08:51 UTC - in response to Message 69159.  

Hi!
Where are those Big OIFS work units that were talked about earlier?
I upgraded my systems but haven't seen any.
You must have read my mind, was going to post an update today!

It's taken longer than expected for U.Oxford to sort out my visiting scientist post, which will now start beginning of Oct. That will give me login access to their systems so I can properly run tests etc. CPDN also had to unexpectedly move off their existing servers and install new ones which added another delay.

So come the autumn, we'll be getting back to OpenIFS testing with high resolution, multicore, batches. The scientist who ran the Baroclinic Lifecycle experiments is also keen to do some high resolution work in the autumn as well.

Currently I'm looking at a problem we found where some model output files get lost when they are returned.
---
CPDN Visiting Scientist
ID: 69160
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 69161 - Posted: 7 Jul 2023, 14:13:14 UTC - in response to Message 69155.  

I'm still suspicious that your WINE implementation avoids the segv by sandboxing the environment, which is why your tasks are running. I'll check with Sarah as I think you can abort those. I'll PM you.


Thanks Glenn. Now aborted. So if fresh spin-ups are required when the changes are made, it will probably be a bit longer until the fixed batch arrives.
ID: 69161
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69164 - Posted: 7 Jul 2023, 18:11:13 UTC

I wish someone would tell me if I'm to keep running these WAH Windows tasks. Not interested in credits, interested in if they will help the scientists.
ID: 69164
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 69182 - Posted: 9 Jul 2023, 7:58:39 UTC - in response to Message 69164.  

I wish someone would tell me if I'm to keep running these WAH Windows tasks. Not interested in credits, interested in if they will help the scientists.
Sorry for the delay.

Yes the results will be used by the scientists as part of designing the next batch which hopefully will be less prone to failures.
ID: 69182
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69184 - Posted: 9 Jul 2023, 11:23:06 UTC - in response to Message 69182.  
Last modified: 9 Jul 2023, 11:27:02 UTC

Yes the results will be used by the scientists as part of designing the next batch which hopefully will be less prone to failures.
I wish they'd fix the computation error on Windows restart problem, I've just busted all the 24 tasks I have. I did as instructed, I suspended them, waited 2 minutes, closed Boinc, waited 2 minutes, then rebooted. Upon continuing them, every single one crashed.

What on earth is it doing to cause this problem? There must be an easy fix. It's wasting a lot of computation cycles, burning fossil fuels for a project which is supposed to be against it.

It's the most recent ones here: https://www.cpdn.org/results.php?hostid=1509739&offset=0&show_names=0&state=6&appid=
ID: 69184
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 69185 - Posted: 9 Jul 2023, 11:41:38 UTC - in response to Message 69184.  

There must be an easy fix.
I think if the fix was that easy, it would have been sorted long ago. My personal preference would be to go over completely to the OIFS tasks which don't have the problem. They were written from the ground up, though. I have been crunching for CPDN since 2009 with this ID and before that on an ID I lost when my ISP got taken over. The issue has been around on both Linux and Windows tasks for as long as I have been with the project. Before the days of multiple cores, I would take backups that could be restored before a reboot but the procedure for that is much more complicated with multiple tasks running, even if only running one project.
ID: 69185
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69186 - Posted: 9 Jul 2023, 14:18:02 UTC - in response to Message 69184.  

I wish they'd fix the computation error on Windows restart problem, I've just busted all the 24 tasks I have. I did as instructed, I suspended them, waited 2 minutes, closed Boinc, waited 2 minutes, then rebooted. Upon continuing them, every single one crashed. What on earth is it doing to cause this problem? There must be an easy fix.
I wish there was a fix too just like I wish there was less moaning on these forums, but I suspect one will be easier to fix than the other. They fail on restarts and suspend/resume on my machine too and CPDN are aware of this. As you know CPDN has scant resources mostly spent 'firefighting' this year rather than focussing on issues like this. It also doesn't seem to be a problem for all machines for some reason.

The model throws away the useful logs before it returns them to the server, which means someone needs to set up a local test, add code to print more information, to pin down what's causing the problem. So it's several weeks of work at least. Although it's something I could do I prefer to finish working through the current issues with OpenIFS first and Andy has his hands full with more pressing issues.
---
CPDN Visiting Scientist
ID: 69186
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69187 - Posted: 9 Jul 2023, 14:31:45 UTC - in response to Message 69185.  

I think if the fix was that easy, it would have been sorted long ago. My personal preference would be to go over completely to the OIFS tasks which don't have the problem.
That won't happen because WaH and OIFS have different capabilities. WaH is a nested model; a global that drives a much higher resolution regional model. OpenIFS is only a global model. Although it could run at the resolution the regional model is using it could only do that globally and the memory requirements for that would need a compute cluster.

OpenIFS tasks have the same problem but not as often. If the restart files (or 'checkpoint' in BOINC lingo) are corrupted or not written properly, OpenIFS will fail to restart. We saw this happen with the last batches that went out. From the little I know of WaH, it handles its restart files differently from OIFS, which closes them after each write. WaH, I believe, keeps the files open, which could mean they are not flushed to disk properly on shutdown. That would be where I'd start looking.
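
To illustrate the difference between closing a restart file after each write and leaving it open, here is a minimal sketch in Python. It is not the WaH or OpenIFS code (neither is shown in this thread); the function name write_checkpoint and the file name are made up for the example. The idea is simply: write to a temporary file, flush and fsync it, close it, then rename it over the old checkpoint so that a shutdown part-way through never leaves a half-written file.

```python
import os
import tempfile


def write_checkpoint(path, data):
    """Write bytes to `path` so an interrupted write cannot corrupt it.

    Hypothetical helper for illustration only; not taken from any CPDN model.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to disk
        # The file is closed here; swap it in place of the old checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A file that is kept open and only flushed when the program exits normally has none of these guarantees, which would fit the restart failures described above if that is indeed what WaH does.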
---
CPDN Visiting Scientist
ID: 69187
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 69188 - Posted: 9 Jul 2023, 15:18:23 UTC - in response to Message 69187.  
Last modified: 9 Jul 2023, 15:18:54 UTC

OpenIFS tasks have the same problem but not as often. If the restart files (or 'checkpoint' in BOINC lingo) are corrupted or not written properly, OpenIFS will fail to restart. We saw this happen with the last batches that went out. From the little I know of WaH, it handles its restart files differently from OIFS, which closes them after each write. WaH, I believe, keeps the files open, which could mean they are not flushed to disk properly on shutdown. That would be where I'd start looking.
Couldn't we help with that?

It might be a write error on closedown, or a read error on restart. If you have a task to run, watch to see what happens on checkpoint (set <checkpoint_debug> in the Event Log). If the program closes the files, the timestamp will change - if no files have a new timestamp, they were being kept open.

If the timestamp changes, wait until just after a checkpoint, and copy all the new files to somewhere outside BOINC's control while the program is still running. [If you can't copy them, then I'm wrong about when the timestamp changes, and they're being kept open]

If you catch a set, keep them safe and post their vital statistics here: names, original location, bytecount, perhaps even first and last few lines if they can be rendered in human-readable format. Then, shut down BOINC, and restart it - report whether yours is a 'crasher' or a 'runner'. [another reason for only testing one task at a time - and perhaps preferably early in the run]
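
For anyone who wants to try the steps above without watching timestamps by hand, here is a rough Python sketch. It assumes you know which directory the task is writing to; the function names (snapshot, changed_files, describe_and_copy) are invented for this example and nothing here is part of BOINC. Take one snapshot before a checkpoint and one just after, see which files changed, then copy them out and collect the "vital statistics".

```python
import os
import shutil


def snapshot(directory):
    """Map each regular file name in `directory` to (mtime_ns, size)."""
    result = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            st = entry.stat()
            result[entry.name] = (st.st_mtime_ns, st.st_size)
    return result


def changed_files(before, after):
    """Names that are new, or whose timestamp or size differs between snapshots."""
    return sorted(name for name, info in after.items() if before.get(name) != info)


def describe_and_copy(directory, names, dest, preview=64):
    """Copy the named files to `dest` and return their basic details."""
    os.makedirs(dest, exist_ok=True)
    report = []
    for name in names:
        src = os.path.join(directory, name)
        shutil.copy2(src, os.path.join(dest, name))  # copy2 keeps the timestamp
        with open(src, "rb") as fh:
            head = fh.read(preview)  # first few bytes, for a human-readable peek
        report.append({
            "name": name,
            "location": os.path.abspath(src),
            "bytes": os.path.getsize(src),
            "head": head,
        })
    return report
```

If changed_files comes back empty across a checkpoint, that would suggest the files are being kept open, as described above. If the copy fails, that points the same way.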

But if you can't help, please don't waste time by posting the same complaint over and over again. They know!
ID: 69188
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,967,615
RAC: 14,422
Message 69191 - Posted: 9 Jul 2023, 22:18:07 UTC - in response to Message 69112.  

Still having problems with the out file though the 25th zip has gone. Out stuck at 59% and it is going to upload7.
ID: 69191
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,735,198
RAC: 4,318
Message 69230 - Posted: 11 Jul 2023, 13:35:29 UTC

After a run of wah2 ea25 tasks, all of which failed in the first couple of minutes, I've now got a wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion.
ID: 69230
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69235 - Posted: 11 Jul 2023, 13:51:39 UTC - in response to Message 69230.  
Last modified: 11 Jul 2023, 13:51:57 UTC

After a run of wah2 ea25 tasks, all of which failed in the first couple of minutes, I've now got a wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion.
Just don't restart the computer. I hope you have a UPS. I've lost all 30 non-non-starters to that problem.
ID: 69235
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69238 - Posted: 11 Jul 2023, 15:34:31 UTC - in response to Message 69230.  

After a run of wah2 ea25 tasks, all of which failed in the first couple of minutes, I've now got a wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion.
The NZ25 tasks use a different configuration; a much smaller grid for the regional model so don't suffer the same problem as the EAS25 config. Smaller domains for the EAS25 batch are currently being tested.
---
CPDN Visiting Scientist
ID: 69238
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,735,198
RAC: 4,318
Message 69240 - Posted: 11 Jul 2023, 16:28:22 UTC - in response to Message 69238.  

Thanks Glenn, I had a feeling there was a difference between the EAS & NZ data sets, but wasn't certain. It will be interesting to see how the "new, smaller" EAS data sets behave - hopefully better than the "old large" ones that have been so much trouble for so many (I'm so glad mine all failed in a couple of minutes, not several days).
ID: 69240
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69241 - Posted: 11 Jul 2023, 17:27:47 UTC - in response to Message 69230.  

After a run of wah2 ea25 tasks, all of which failed in the first couple of minutes, I've now got a wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion.


After a run a couple of weeks ago in which every task failed, my pipsqueak Windows 10 box just got a new CPDN task, and instead of failing after about 4 minutes, it has now run for almost 25 minutes, with a little over 10 days predicted to go.
ID: 69241
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,735,198
RAC: 4,318
Message 69242 - Posted: 11 Jul 2023, 17:44:31 UTC - in response to Message 69235.  

I will be stopping and restarting the computer, so I'll keep you posted. It will be interesting as there are a couple of differences, one is I'm only allowing CPDN to run a single task just now, and also these tasks are from a different data set to those that were failing in minutes.

p.s. Still running after 4.5 hours, but fingers and toes still crossed for the next 9 days of run time which is more like a couple of weeks clock time.
ID: 69242
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69243 - Posted: 11 Jul 2023, 18:19:48 UTC - in response to Message 69242.  

I will be stopping and restarting the computer, so I'll keep you posted. It will be interesting as there are a couple of differences, one is I'm only allowing CPDN to run a single task just now, and also these tasks are from a different data set to those that were failing in minutes.
p.s. Still running after 4.5 hours, but fingers and toes still crossed for the next 9 days of run time which is more like a couple of weeks clock time.
I've had trouble with WAH2 NZ tasks before. They failed after restarting from sleep. Your experience may vary. It's a known issue that the restart is problematic.

To avoid a task fail, I alter the computing options at night to 20% cpu instead of 100% and only have 1 task running at a time. Also change the power options in Windows to 'energy efficient' from performance. Bit tedious but it does get the power usage of the PC down to hopefully keep the batteries going a little longer.
ID: 69243