#1020,1,2,3...

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71084 - Posted: 22 Jul 2024, 12:53:41 UTC
Last modified: 22 Jul 2024, 12:55:11 UTC

The first of these has gone out. I have about 10 minutes left on the project time out before I can get some. This is the one with all forcings.

1020 EASHA 5,000 tasks WAH2 East Asia 25km 1986-2010
ID: 71084
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 71087 - Posted: 22 Jul 2024, 16:47:29 UTC - in response to Message 71084.  
Last modified: 22 Jul 2024, 16:48:05 UTC

In the last 4 hours, my pipsqueak Windows 11 machine (computer 1512658) got four 1020 tasks and they all seem to be running OK. I got one each hour.
ID: 71087
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71088 - Posted: 22 Jul 2024, 17:11:33 UTC

Just two running on my Ryzen 9. I have upped the number of cores the VM can use, so I will probably get some more before too long.
ID: 71088
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71089 - Posted: 22 Jul 2024, 20:59:52 UTC - in response to Message 71088.  
Last modified: 22 Jul 2024, 21:02:28 UTC

4 more running. Server status page says down to 457 of the first batch left. They will all be gone by the time I get up to check on things tomorrow.
ID: 71089
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71103 - Posted: 24 Jul 2024, 13:20:46 UTC
Last modified: 24 Jul 2024, 13:33:13 UTC

#1021 6048 tasks ALL WAH2 East Asia 25km has been released. I have four running alongside six from #1020.

#1022 5040 tasks NAT WAH2 East Asia 25km has gone out too. Time to get those cores crunching, but please don't download lots for the cache, as the researcher wants results back as quickly as possible.

In winter my 16 CPU Ryzen9 will heat my small office.
ID: 71103
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71104 - Posted: 24 Jul 2024, 16:23:33 UTC

#1023 5040 tasks GHG WAH2 East Asia 25km

And another one!
ID: 71104
Glenn Carver

Joined: 29 Oct 17
Posts: 1044
Credit: 16,196,312
RAC: 12,647
Message 71106 - Posted: 24 Jul 2024, 16:59:46 UTC - in response to Message 71104.  

#1023 5040 tasks GHG WAH2 East Asia 25km
And another one!
Dave, I posted a list of the forthcoming batches already. See: https://www.cpdn.org/forum_thread.php?id=9232&postid=71086
---
CPDN Visiting Scientist
ID: 71106
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71127 - Posted: 26 Jul 2024, 13:03:53 UTC
Last modified: 26 Jul 2024, 13:07:25 UTC

That's all of them out there. So far the failure rate does not look too high. It is another two days until the first of mine is due to finish. With a tad over 5,000 tasks with recent credit they should last a little while.
ID: 71127
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71134 - Posted: 28 Jul 2024, 5:38:05 UTC

My first one has completed. CPU time: 5 days 6 hours 35 min 19 sec. Another should finish later this morning.
ID: 71134
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 71229 - Posted: 14 Aug 2024, 4:01:37 UTC
Last modified: 14 Aug 2024, 4:03:34 UTC

FYI: I got a whole bunch of failures today. Here is a typical one.
Several of them failed at the same time, but not all of them. These were all on my Windows 11 machine.

I normally leave the Boinc manager running, but it was not running when I turned the monitor on. So it is possible that Windows did an update and reboot without telling me. Normally I run SupportAssist for updates, but it has not run lately. (It keeps a log of recent activity).

Task 22470326
Name 	wah2_eas25_n0d1_201012_24_1022_012312644_0
Workunit 	12312644   <---<<<
Created 	24 Jul 2024, 13:26:05 UTC
Sent 	9 Aug 2024, 22:47:14 UTC
Report deadline 	17 Nov 2024, 22:47:14 UTC
Received 	14 Aug 2024, 3:16:50 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	9 (0x00000009) Unknown error code
Computer ID 	1512658   <---<<<
Run time 	4 days 3 hours 10 min 33 sec
CPU time 	3 days 14 hours 59 min 52 sec
Validate state 	Invalid
Credit 	5,819.81
Device peak FLOPS 	3.68 GFLOPS
Application version 	Weather At Home 2 (wah2) (region independent) v8.32 windows_intelx86
Peak working set size 	341.16 MB
Peak swap size 	308.56 MB
Peak disk usage 	94.80 MB
Stderr 	

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The storage control block address is invalid.   <----<<<
 (0x9) - exit code 9 (0x9)</message>
<stderr_txt>
modelGetExecutables: check control files, strTemp0 & 1 : 
C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_n0d1_201012_24_1022_012312644/jobs/xadae.namelists
C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_n0d1_201012_24_1022_012312644/jobs/xacxf.namelists
modelGetExecutables: unzipping control files : strInput & strTmp 
wah2_eas25_n0d1_201012_24_1022_012312644.zip
wah2_eas25_n0d1_201012_24_1022_012312644/jobs
gstrDump[0] = generic_phase1_spinup_eas25_global_aabaka_f
gstrDump[1] = generic_phase1_spinup_eas25_regional_aabaka_f
global model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.32_windows_intelx86.exe" wah2_eas25_n0d1_201012_24_1022_012312644 generic_phase1_spinup_eas25_global_aabaka_f ic19611128_10_N96 NATclim_ancil_168months_CMIP6-ACCESS-CM2_SST_2009-01-01_2022-12-30_v2404b NATclim_ancil_168months_CMIP6-ACCESS-CM2_SIC_2009-01-01_2022-12-30_v2404b so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
regional model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.32_windows_intelx86.exe" wah2_eas25_n0d1_201012_24_1022_012312644
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
executeModelProcess: MonID=2964, GCM_PID=16812, RCM_PID=2028
Queuing intermediate upload for CPDN/BOINC: cpdnout1.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout2.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout3.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout4.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout5.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout6.zip
Queuing intermediate upload for CPDN/BOINC: cpdnout7.zip
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 16812, selfPID = 16812, iMonCtr = 1
No Process Handle
Regional Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 16812, selfPID = 2028, iMonCtr = 1

</stderr_txt>
]]>
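
The exit code and the operating system's error text can be pulled out of a stderr dump like the one above with a short script. This is only a minimal Python sketch, assuming the <message> block always has the layout shown in this log; extract_exit_info is an illustrative name, not part of BOINC. For reference, Windows system error 9 is ERROR_INVALID_BLOCK, whose standard text is the "storage control block address is invalid" line.

```python
import re

# Matches the summary BOINC writes at the end of <message>,
# e.g. " (0x9) - exit code 9 (0x9)". Layout assumed from the log above.
EXIT_RE = re.compile(r"exit code (-?\d+) \((0x[0-9a-fA-F]+)\)")


def extract_exit_info(stderr_dump):
    """Return (exit_code, error_text) from a task's stderr dump, or None."""
    block = re.search(r"<message>(.*?)</message>", stderr_dump, re.DOTALL)
    if block is None:
        return None
    body = block.group(1)
    code = EXIT_RE.search(body)
    if code is None:
        return None
    # The first non-empty line inside <message> is the OS error text.
    text = next((ln.strip() for ln in body.splitlines() if ln.strip()), "")
    return int(code.group(1)), text


# Sample copied from the log above, without the hand-added arrow marker.
sample = """<message>
The storage control block address is invalid.
 (0x9) - exit code 9 (0x9)</message>"""

print(extract_exit_info(sample))
```

Grouping failed tasks by the returned pair would make it easier to see whether several failures share one cause.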

ID: 71229
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 71230 - Posted: 14 Aug 2024, 5:42:54 UTC

Normally I run SupportAssist for updates, but it has not run lately. (It keeps a log of recent activity).
Surely there is still some record of when updates have run? - Found this.
Open Start.
Search for Command Prompt, right-click the top result and click the Run as administrator option.
Type the following command to query the device's last boot time and press Enter: wmic path Win32_OperatingSystem get LastBootUpTime
ID: 71230
AndreyOR

Joined: 12 Apr 21
Posts: 314
Credit: 14,554,903
RAC: 18,109
Message 71231 - Posted: 14 Aug 2024, 7:38:37 UTC - in response to Message 71229.  

I normally leave the Boinc manager running, but it was not running when I turned the monitor on. So it is possible that Windows did an update and reboot without telling me.

I'd guess it unlikely that a reboot would lead to tasks crashing with this new app version. I'd want to know if you rebooted the PC between any of the crashes. If not, I'd suggest a reboot; it's probably one of the first troubleshooting steps to try for Windows-related errors, which this one appears to be. There certainly seems to be a common cause behind your tasks crashing.

Checking the Update History in Settings, you'll be able to see the dates, but not the times, of both successful and failed updates. Reliability History and Event Viewer can also show whether anything happened around the crash times. Checking stdoutdae.txt in the BOINC data directory and the different std....txt files in the task directories of the failed tasks might provide some clues.
ID: 71231
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1058
Credit: 36,583,114
RAC: 15,886
Message 71232 - Posted: 14 Aug 2024, 8:39:28 UTC - in response to Message 71230.  

It's likely that Windows update has run recently - yesterday was Microsoft's "Patch Tuesday" (the day that they release major updates each month). Usually, Windows 11 restarts automatically after applying the patches - sometime after the end of your defined 'working day'.

The other route to finding information about, and controlling, updates is Start --> Settings --> Windows Update. There's an 'Update history' link on that page, and also controls for delaying automatic updates for up to five weeks. Using that, you can arrange to apply the updates yourself between tasks, and then allow the next task to run uninterrupted.
ID: 71232
Glenn Carver

Joined: 29 Oct 17
Posts: 1044
Credit: 16,196,312
RAC: 12,647
Message 71233 - Posted: 14 Aug 2024, 9:46:39 UTC - in response to Message 71232.  
Last modified: 14 Aug 2024, 9:47:47 UTC

The error message:
<message>
The storage control block address is invalid. <----<<<
(0x9) - exit code 9 (0x9)</message>
seen in your log has been mentioned and discussed before on the forums, by yourself in fact, Jean:
https://www.cpdn.org/forum_thread.php?id=9233&postid=70386 and https://www.cpdn.org/forum_thread.php?id=9277&postid=70852. I looked it up on the web, and there are plenty of reports that it is associated with Windows Update in some way.

This particular error message occurs in 10% of the failures we see in a batch. So it's quite common. If I get a moment, I'll look through the database and check what day of the week we see these fails. That would back up Richard's suggestion.
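
A weekday tally of that kind could look like the sketch below. The project database layout is not shown in this thread, so the input is assumed to be a plain list of timestamp strings in the format the task pages use. The function name is hypothetical, and parsing the month abbreviation assumes an English locale.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout as shown on the task pages, e.g. "14 Aug 2024, 3:16:50 UTC".
TASK_TIME_FORMAT = "%d %b %Y, %H:%M:%S UTC"


def failures_by_weekday(timestamps):
    """Count timestamps per weekday name (Monday, Tuesday, ...)."""
    counts = Counter()
    for text in timestamps:
        when = datetime.strptime(text, TASK_TIME_FORMAT)
        counts[when.strftime("%A")] += 1
    return counts


# The only full failure-related timestamp quoted in this thread:
# the Received time of task 22470326.
print(failures_by_weekday(["14 Aug 2024, 3:16:50 UTC"]))
```

A cluster on the Wednesday after Patch Tuesday (UTC) would support the Windows Update explanation; an even spread across the week would not.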

We also see a high number of disk (or storage) related errors such as: 'system cannot find drive specified', 'drive cannot find specific area or track', 'code 193 error, e.g. boinc_finish(193)', and 'extended attributes are inconsistent'. The last one may also be associated with Windows Update. I think the rest are likely to be hardware related. Those errors combined account for ~25-30% of failed tasks in a batch.
---
CPDN Visiting Scientist
ID: 71233
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1058
Credit: 36,583,114
RAC: 15,886
Message 71234 - Posted: 14 Aug 2024, 10:33:33 UTC - in response to Message 71233.  

'Patch Tuesday' is always the second Tuesday of the month (US Pacific time), and it usually reaches the UK on 'the Wednesday after the second Tuesday' of the month, which is not necessarily 'the second Wednesday'. Time zones, and all that.
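
The difference between the two Wednesdays is easy to check with date arithmetic. Here is a minimal Python sketch using only the standard library; the function names are illustrative.

```python
from datetime import date, timedelta


def patch_tuesday(year, month):
    """Second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, Tuesday is 1.
    to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=to_first_tuesday + 7)


def wednesday_after_patch_tuesday(year, month):
    """The day after the second Tuesday."""
    return patch_tuesday(year, month) + timedelta(days=1)


def second_wednesday(year, month):
    """Second Wednesday of the given month."""
    first = date(year, month, 1)
    to_first_wednesday = (2 - first.weekday()) % 7
    return first + timedelta(days=to_first_wednesday + 7)


for year, month in [(2024, 8), (2024, 5)]:
    print(patch_tuesday(year, month),
          wednesday_after_patch_tuesday(year, month),
          second_wednesday(year, month))
```

For August 2024 this gives Tuesday 13 August, so the Wednesday after is 14 August, which that month is also the second Wednesday. In a month that starts on a Wednesday, such as May 2024, the two differ: the Wednesday after Patch Tuesday is 15 May, while the second Wednesday is 8 May.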

I have a strong feeling that Windows 11 continues to install bits of the update for a significant period after the reboot. If BOINC is installed as a service, it will be auto-launched while these residual processes are still happening - they may be responsible for these otherwise surprising errors apparently originating deep in the hardware.

My Windows 11 laptop has two running tasks currently at 90% complete - but I've blocked updates until they finish. Depending on the time of day they're predicted to finish, I may download replacements in advance - but I'll suspend them from running until well after I've dealt with the updates. Rinse and repeat.
ID: 71234
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 71237 - Posted: 14 Aug 2024, 12:48:48 UTC - in response to Message 71230.  
Last modified: 14 Aug 2024, 12:53:31 UTC

Normally I run SupportAssist for updates, but it has not run lately. (It keeps a log of recent activity).

Surely there is still some record of when updates have run? - Found this.

Open Start.
Search for Command Prompt, right-click the top result and click the Run as administrator option.
Type the following command to query the device's last boot time and press Enter: wmic path Win32_OperatingSystem get LastBootUpTime



C:\Windows\System32>wmic path Win32_OperatingSystem get LastBootUpTime
LastBootUpTime
20240813231633.500000-240

One group failed August 14 03:16:50; (The other group failed August 9) I am 4 time zones behind GMT.
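
The wmic output above is a CIM datetime: the local date and time, a fractional-seconds field, then the offset from UTC in minutes (-240 here, i.e. four hours behind). A minimal Python sketch, assuming that layout, converts the value to UTC so it can be compared with the UTC times shown on the task page. The name parse_cim_datetime is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_cim_datetime(value):
    """Parse a CIM datetime such as '20240813231633.500000-240'.

    Assumed layout: yyyymmddHHMMSS.ffffff followed by a signed
    UTC offset in minutes.
    """
    stamp, offset_minutes = value[:21], int(value[21:])
    local = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    return local.replace(tzinfo=timezone(timedelta(minutes=offset_minutes)))


# LastBootUpTime from the wmic output above.
boot_utc = parse_cim_datetime("20240813231633.500000-240").astimezone(timezone.utc)

# Received time of task 22470326 from the task page (already UTC).
received = datetime(2024, 8, 14, 3, 16, 50, tzinfo=timezone.utc)

print(boot_utc.isoformat())                  # 2024-08-14T03:16:33.500000+00:00
print((received - boot_utc).total_seconds())  # 16.5
```

On this reading the last boot was at 03:16:33 UTC on 14 August, about 16.5 seconds before the Received time of the failed task, which is consistent with the failures being reported right after a restart. It does not by itself show what triggered the restart.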

I notice there were two groups of failures. For each group of failures all members failures happened at the same time. Each group had four tasks. My app_config.xml allows up to four of these to run at a time.

I prefer Linux that does not do updates until I tell it to. It does tell me when there are some.
ID: 71237
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 71238 - Posted: 14 Aug 2024, 13:27:23 UTC - in response to Message 71232.  

The other route to finding information about, and controlling, updates is Start --> Settings --> Windows Update. There's an 'Update history' link on that page, and also controls for delaying automatic updates for up to five weeks.


I have it set to 1 week. I tried to set it to 3 weeks, but it will not allow me to set it to anything but 1 week.
ID: 71238
AndreyOR

Joined: 12 Apr 21
Posts: 314
Credit: 14,554,903
RAC: 18,109
Message 71242 - Posted: 14 Aug 2024, 20:33:33 UTC - in response to Message 71237.  

One group failed August 14 03:16:50; (The other group failed August 9) I am 4 time zones behind GMT.

I notice there were two groups of failures. For each group of failures all members failures happened at the same time. Each group had four tasks.

Since the first group failed before patch Tuesday, these failures may not be related to it. Check the Update History to see if any happened on the days of failures.
I'd say a reboot is in order, with the new app version your current tasks are almost certain to be fine. But I'd be concerned that there may be a non-trivial chance that in 3-4 days same thing will happen again.
ID: 71242
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 71243 - Posted: 14 Aug 2024, 23:59:42 UTC - in response to Message 71242.  

I'd say a reboot is in order, with the new app version your current tasks are almost certain to be fine. But I'd be concerned that there may be a non-trivial chance that in 3-4 days same thing will happen again.


Well, there are two tasks running on that machine that have a little over 15 days to go.

I assume if I suspend them and then reboot, they will not come back. So should I abort them?
ID: 71243
rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,673,031
RAC: 4,752
Message 71246 - Posted: 15 Aug 2024, 6:27:35 UTC - in response to Message 71243.  

I suspend running tasks before shutting my PC down at night, and the current load has resumed OK in the morning with no problems (provided I remember to resume them). They've survived about a dozen nights of this so far, and I assume they will survive the last few and complete soon.
ID: 71246

©2024 cpdn.org