climateprediction.net (CPDN) home page
Thread '#1020,1,2,3...'

AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,829,360
RAC: 19,940
Message 71247 - Posted: 15 Aug 2024, 7:00:04 UTC - in response to Message 71243.  

I assume if I suspend them and then reboot, they will not come back. So should I abort them?

No, don't abort. You don't even have to suspend the tasks; just exit BOINC as normal (File -> Exit BOINC) and restart the PC. I've gone through several restarts, including ones without exiting BOINC first, and have yet to lose a task from that. All of these new batches use the new app version, 8.32, which doesn't suffer from crashes when the PC restarts or BOINC shuts down. The chances of you losing those tasks from a restart are de minimis.
ID: 71247
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71253 - Posted: 15 Aug 2024, 9:46:22 UTC - in response to Message 71247.  

Just to echo the previous message: I've fixed the bugs that sometimes caused the model to crash on a restart, and the v8.32 app no longer suffers from those issues. Just leave the tasks as you would for any project.

Suspending tasks before shutting down was an 'urban myth'. It didn't actually make any difference to the chances of the model crashing. The problems were related to the way the model processes were talking to each other on startup, nothing to do with BOINC.
---
CPDN Visiting Scientist
ID: 71253
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71260 - Posted: 15 Aug 2024, 12:35:07 UTC - in response to Message 71247.  

No, don't abort. You don't even have to suspend the tasks, just exit BOINC like normal (File -> Exit BOINC) and restart the PC.


I do not think I understand this.
I know how to stop the boinc-manager, but I do not know how to start and stop the client. (I do know how to do that with Linux, but not in Windows 11.)

Do you mean File->Shutdown connected client in the boinc manager?
ID: 71260
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 71261 - Posted: 15 Aug 2024, 12:47:29 UTC

Do you mean File->Shutdown connected client in the boinc manager?
That is how I would do it, unless that no longer works with Windows 11 in the same way the option has gone in Linux. (It still worked on a self-compiled BOINC last time I tried on Ubuntu.)
ID: 71261
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71262 - Posted: 15 Aug 2024, 12:54:15 UTC - in response to Message 71261.  

I just shut down the PC and let Windows shut down the BOINC client. Works fine with the new WaH app.
---
CPDN Visiting Scientist
ID: 71262
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71263 - Posted: 15 Aug 2024, 13:34:02 UTC - in response to Message 71261.  

Do you mean File->Shutdown connected client in the boinc manager?

That is how I would do it, unless that no longer works with Windows 11 in the same way the option has gone in Linux. (It still worked on a self-compiled BOINC last time I tried on Ubuntu.)


OK. That seemed to work. Not only did the two tasks come back up where they left off, I got a new third one! They are all 8.32 tasks.
ID: 71263
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,704,136
RAC: 9,629
Message 71265 - Posted: 15 Aug 2024, 15:58:53 UTC

Just triggered a manual update of my Windows 11 lappy between jobs, and watched the restart with Task Manager. As soon as I could open it, that is - it took ages.

BOINC started automatically as a service, before I logged in. But it took more than eight minutes before things had settled down enough for me to feel confident enough to start the replacement CPDN tasks I downloaded last night. All sorts of other things were using up to 100% CPU, and hundreds of megabytes of memory, and flickering up and down the usage list.

I've suppressed updates for another 5 weeks so they can run in peace ... [the button says 1 week, but I have a drop-down list for up to 5 weeks - that may be version specific]
ID: 71265
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,829,360
RAC: 19,940
Message 71269 - Posted: 16 Aug 2024, 6:59:15 UTC - in response to Message 71263.  

OK. That seemed to work. Not only did the two tasks come back up where they left off, I got a new third one! They are all 8.32 tasks.

I'm glad it worked. I did mean File --> Exit BOINC. It's the last menu option, and it shuts down both the client and the manager. It's always been available on Windows BOINC, although perhaps not on Linux. File --> Shutdown connected client just shuts down the client. You can use that option to shut down individual clients if you're controlling multiple ones from the same manager.

Although it sounds like restarting the PC without exiting BOINC works too. I'm in the habit of closing programs before restarting though.
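
For anyone who would rather script the shutdown than use the menus, here is a minimal Python sketch that asks the client to exit with boinccmd --quit. It assumes boinccmd is on the PATH and is allowed to talk to the local client (it may need to be run from the BOINC data directory so it can read the GUI RPC password). The function names are made up for illustration.

```python
import shutil
import subprocess


def build_quit_command(boinccmd_path="boinccmd"):
    """Return the argument list that asks the local BOINC client to exit."""
    return [boinccmd_path, "--quit"]


def quit_boinc_client(boinccmd_path="boinccmd", timeout=60):
    """Ask the running client to shut down.

    Returns True if boinccmd ran and reported success, False if the
    boinccmd executable could not be found or it returned an error.
    """
    resolved = shutil.which(boinccmd_path)
    if resolved is None:
        # Assumption: boinccmd normally lives in the BOINC install folder,
        # which may not be on PATH on every machine.
        return False
    result = subprocess.run(
        build_quit_command(resolved),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

Calling quit_boinc_client() before a planned reboot should have the same effect as File --> Shutdown connected client; the Manager window itself stays open.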
ID: 71269
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,704,136
RAC: 9,629
Message 71270 - Posted: 16 Aug 2024, 7:17:30 UTC - in response to Message 71269.  

I did mean File --> Exit BOINC. It's the last menu option, and it shuts down both the client and the manager.
Careful with that advice - it depends on another setting, which isn't visible by default.

If you go to the menu Options --> Other options, you'll see an option for "Enable Manager exit dialog?". If you check that and follow the Exit BOINC route, you'll see another dialog with a checkbox for 'Shut down connected client on exit?', or words to that effect. I don't want to go too far down that road just at the moment!

Work your way through that route just once, and check that the setting is right for your choice - you can turn off the Exit dialog when you're satisfied. Your choice will be remembered.
ID: 71270
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,829,360
RAC: 19,940
Message 71272 - Posted: 16 Aug 2024, 7:34:28 UTC - in response to Message 71265.  

Just triggered a manual update of my Windows 11 lappy between jobs, and watched the restart with Task Manager. As soon as I could open it, that is - it took ages.

Wow, a 1.6GHz CPU, just above the minimum of 1GHz for Windows 11. Your tasks still finish in about 11 days, which, looking at the stats, seems about average. Are the tasks running boosted? It seems that processor can boost up to 3.4GHz.

My i7-4790 is also very slow when updating, but that's almost certainly due to it still having the original HDD. Your tasks run a bit faster than mine though!? Maybe that's because I run 4 at a time and you seem to run 2.
ID: 71272
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,829,360
RAC: 19,940
Message 71273 - Posted: 16 Aug 2024, 7:42:07 UTC - in response to Message 71270.  

I did mean File --> Exit BOINC. It's the last menu option, and it shuts down both the client and the manager.
Careful with that advice - it depends on another setting, which isn't visible by default.

If you go to the menu Options --> Other options, you'll see an option for "Enable Manager exit dialog?". If you check that and follow the Exit BOINC route, you'll see another dialog with a checkbox for 'Shut down connected client on exit?', or words to that effect. I don't want to go too far down that road just at the moment!

Work your way through that route just once, and check that the setting is right for your choice - you can turn off the Exit dialog when you're satisfied. Your choice will be remembered.

Yes, that's true, you can make some customizations. I believe by default it asks for confirmations. I think mine is on default.
ID: 71273
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,704,136
RAC: 9,629
Message 71275 - Posted: 16 Aug 2024, 8:26:02 UTC - in response to Message 71272.  

Wow, a 1.6GHz CPU, just above the minimum of 1GHz for Windows 11. Your tasks still finish in about 11 days, which, looking at the stats, seems about average. Are the tasks running boosted? It seems that processor can boost up to 3.4GHz.
Yes, it's a nice machine. It's an ultra-portable Dell XPS 13, which got very good reviews on release. This is the 2018 model, but I got it well below list price in the end-of-line factory clearance in early 2019. It makes an excellent travelling companion.

I'm processing these batches at two tasks per machine (max_concurrent), but with a server limit of 4 - that way, I can download new tasks while the old ones are still running: avoids the 1 hour delay after the last trickle, before I can download a replacement. The lappy has an SSD, but only 8 GB of RAM: I'm keeping the loading below maximum, to avoid over-stressing the cooling system - that's probably the weak spot for this sort of work. That probably helps with the boost, although I haven't deliberately tried to set that myself.
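
The two-at-a-time limit mentioned above is normally set with an app_config.xml file in the project folder. As a rough sketch, the Python below builds the contents of such a file. The app name "wah2" is taken from the application line shown on task pages and is an assumption, so check the name your client actually reports; the helper name build_app_config is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name="wah2", max_concurrent=2):
    """Return app_config.xml contents limiting how many tasks of one app run at once."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    # <name> must match the application's short name as the client knows it
    # (assumed here to be "wah2").
    ET.SubElement(app, "name").text = app_name
    # <max_concurrent> caps the number of tasks of this app running together.
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config())
```

Save the output as app_config.xml in the climateprediction.net project folder; the client should pick it up after Options --> Read config files in the Manager, or after a client restart. The server-side limit of 4 is a separate project preference and is not touched by this file.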
ID: 71275
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71398 - Posted: 4 Sep 2024, 15:39:52 UTC
Last modified: 4 Sep 2024, 15:40:05 UTC

Currently ~310 workunits are completed per day from the East Asia 25km batches, numbers 1020 to 1027, the results of which are going to the South Korean upload server.
---
CPDN Visiting Scientist
ID: 71398
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71413 - Posted: 14 Sep 2024, 18:52:33 UTC

Well, another four tasks failed at the same time. I infer an unscheduled system update occurred on my Windows box. This was very early Tuesday morning.
Here is one of them.

Task                   22473689
Name                   wah2_eas25_n2yc_201812_24_1022_012316003_0
Workunit               12316003
Created                24 Jul 2024, 13:55:41 UTC
Sent                   30 Aug 2024, 3:22:27 UTC
Report deadline        8 Dec 2024, 3:22:27 UTC
Received               11 Sep 2024, 3:32:06 UTC
Server state           Over
Outcome                Computation error
Client state           Compute error
Exit status            9 (0x00000009) Unknown error code
Computer ID            1512658
Run time               10 days 19 hours 54 min 33 sec
CPU time               9 days 14 hours 57 min 27 sec
Validate state         Invalid
Credit                 14,931.45
Device peak FLOPS      3.68 GFLOPS
Application version    Weather At Home 2 (wah2) (region independent) v8.32 windows_intelx86
Peak working set size  342.49 MB
Peak swap size         309.91 MB
Peak disk usage        94.95 MB
Stderr

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The storage control block address is invalid.
 (0x9) - exit code 9 (0x9)</message>
<stderr_txt>
modelGetExecutables: check control files, strTemp0 & 1 : 
C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_n2yc_201812_24_1022_012316003/jobs/xadae.namelists
C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_n2yc_201812_24_1022_012316003/jobs/xacxf.namelists
modelGetExecutables: unzipping control files : strInput & strTmp 
wah2_eas25_n2yc_201812_24_1022_012316003.zip
wah2_eas25_n2yc_201812_24_1022_012316003/jobs
gstrDump[0] = generic_phase1_spinup_eas25_global_aabaka_f
gstrDump[1] = generic_phase1_spinup_eas25_regional_aabaka_f
global model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.32_windows_intelx86.exe" wah2_eas25_n2yc_201812_24_1022_012316003 generic_phase1_spinup_eas25_global_aabaka_f ic19610310_14_N96 NATclim_ancil_168months_CMIP6-ACCESS-CM2_SST_2009-01-01_2022-12-30_v2404b NATclim_ancil_168months_CMIP6-ACCESS-CM2_SIC_2009-01-01_2022-12-30_v2404b so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
regional model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.32_windows_intelx86.exe" wah2_eas25_n2yc_201812_24_1022_012316003
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
executeModelProcess: MonID=13968, GCM_PID=7496, RCM_PID=3268
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout1.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout2.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout3.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout4.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout5.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout6.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout7.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout8.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout9.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout10.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout11.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout_restart.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout12.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout13.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout14.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout15.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout16.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Queuing intermediate upload for CPDN/BOINC: cpdnout17.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout18.zip
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 7496, selfPID = 7496, iMonCtr = 1
Regional Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 7496, selfPID = 3268, iMonCtr = 1

</stderr_txt>
]]>
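
To put the run time and CPU time from the report above side by side, here is a small Python sketch that converts the durations as printed on the task page into seconds and takes their ratio. Reading the gap as time spent suspended or waiting is only an assumption, although the log does show many suspend requests; the function name is made up.

```python
import re

# Seconds per unit, keyed by the unit stems used on the task page.
_UNIT_SECONDS = {"day": 86400, "hour": 3600, "min": 60, "sec": 1}


def duration_to_seconds(text):
    """Convert e.g. '10 days 19 hours 54 min 33 sec' to a number of seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*(day|hour|min|sec)", text):
        total += int(amount) * _UNIT_SECONDS[unit]
    return total


run_seconds = duration_to_seconds("10 days 19 hours 54 min 33 sec")
cpu_seconds = duration_to_seconds("9 days 14 hours 57 min 27 sec")
cpu_fraction = cpu_seconds / run_seconds  # share of wall-clock time spent computing

if __name__ == "__main__":
    print(run_seconds, cpu_seconds, round(cpu_fraction, 3))
```

For this task the CPU time works out to roughly 89% of the run time.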

ID: 71413
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 71414 - Posted: 14 Sep 2024, 19:09:47 UTC

Well, another four tasks failed at the same time. I infer an unscheduled system update occurred on my Windows box. This was very early Tuesday morning.
Here is one of them.

Perhaps not. Mine are running under Wine so Windows updates have no bearing on them and I have had several fail on starting today. I tried exiting BOINC and restarting. Running tasks all carried on but another two new ones failed. Subsequent ones have started fine. I currently have three completed tasks and four failed ones waiting to report on Monday when Andy sorts out the server issues.
ID: 71414
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71418 - Posted: 16 Sep 2024, 8:56:39 UTC - in response to Message 71413.  

Yes, this message in your log is typical of the 'fail caused by Windows Update'. If I understand the online posts, it's related to the installation of new software. The best advice is to shut down the client while Update is running, reboot, then restart the client.
<message>
The storage control block address is invalid.
(0x9) - exit code 9 (0x9)</message>
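
If you want to check a batch of failed tasks for this signature, a minimal Python sketch might look like the following. It simply looks for the message text and exit code 9 in the stderr copied from a task page. The function names are hypothetical, and a match is only a hint, since failures with other causes have been reported in this thread as well.

```python
import re

# Message text quoted from the failed task's stderr.
WINDOWS_UPDATE_SIGNATURE = "The storage control block address is invalid."


def exit_code_from_message(stderr_text):
    """Return the decimal exit code from an 'exit code N (0x...)' line, or None."""
    match = re.search(r"exit code (\d+) \(0x[0-9a-fA-F]+\)", stderr_text)
    return int(match.group(1)) if match else None


def looks_like_update_failure(stderr_text):
    """True when both the message text and exit code 9 appear in the stderr."""
    return (
        WINDOWS_UPDATE_SIGNATURE in stderr_text
        and exit_code_from_message(stderr_text) == 9
    )


sample = (
    "<message>\n"
    "The storage control block address is invalid.\n"
    " (0x9) - exit code 9 (0x9)</message>"
)

if __name__ == "__main__":
    print(looks_like_update_failure(sample))
```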

---
CPDN Visiting Scientist
ID: 71418
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 30,984,181
RAC: 14,575
Message 71838 - Posted: 5 Nov 2024, 23:27:29 UTC

Just picked up some resends from these three batches and also batches 1024-1027, 19 in total. Some were aborted but lots were timed out on the original machines. Are the results still needed?
ID: 71838
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 71840 - Posted: 6 Nov 2024, 6:19:53 UTC - in response to Message 71838.  
Last modified: 6 Nov 2024, 6:26:35 UTC

I am letting mine run. I think the results are wanted or the batches would have been cancelled to prevent resends.
If you can be bothered, you can do what I did. Three of those I was sent were still running on the original hosts but were being resent because they were past the deadline. They finished after I had been sent them, and I then cancelled them. I think the rest are unlikely to do so, but I will have another check later today and if any finish on other hosts I will abort them too.
Edit: for this task, for example, you can see that I have leapfrogged the original cruncher.
ID: 71840
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71843 - Posted: 6 Nov 2024, 9:39:01 UTC - in response to Message 71838.  
Last modified: 6 Nov 2024, 9:40:37 UTC

Just picked up some resends from these three batches and also batches 1024-1027, 19 in total. Some were aborted but lots were timed out on the original machines. Are the results still needed?

Yes, absolutely.

If results were not needed you would not have got the resends. I am managing the batches and any finished ones are closed promptly so no resends go out.
---
CPDN Visiting Scientist
ID: 71843
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,829,360
RAC: 19,940
Message 71844 - Posted: 6 Nov 2024, 9:58:52 UTC
Last modified: 6 Nov 2024, 10:00:26 UTC

It seems like there have been a lot of timed-out tasks lately. Not sure if it's normal and I just haven't noticed before.
ID: 71844

©2024 cpdn.org