climateprediction.net (CPDN) home page
Thread 'Batch 1008, and test batches 1009 to 1014 for Windows - issues'

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,855,177
RAC: 4,773
Message 70819 - Posted: 13 Apr 2024, 22:45:48 UTC

Unfortunately, my 1012 task has failed at the first trickle (here), just as my 1008 did (here).

Might be a different cause but might be the same.
ID: 70819
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,578,380
RAC: 15,009
Message 70820 - Posted: 13 Apr 2024, 23:58:10 UTC - in response to Message 70819.  
Last modified: 13 Apr 2024, 23:58:38 UTC

Batch 1012 will fail on Intel machines. Batches 1013 & 1014 should continue running.
---
CPDN Visiting Scientist
ID: 70820
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70833 - Posted: 17 Apr 2024, 15:23:26 UTC

I have a 1008 task (https://www.cpdn.org/result.php?resultid=22417298) that seems "stuck" - it's been at 6% and change for 6 days now while the rest of the tasks blow past it.

Windows 10 VM on an AMD system.

Is there any value to letting it continue spinning, or should I just abort it and let some other system try?
ID: 70833
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,578,380
RAC: 15,009
Message 70834 - Posted: 17 Apr 2024, 15:47:14 UTC - in response to Message 70833.  

Rather than abort, could you please do me a favour? Open up Resource Monitor, click the 'CPU' tab and scroll down to find the 'wah2' list of processes. You should have 3 processes per task. For your task, n15e, please let me know how many you see. I think you'll only see one process: wah2_8.29_windows_intelx86.exe and not the wah2am_* and wah2rm_* processes. Can you confirm?

Also, if you know your way around the BOINC folder layout, would be great if you could locate the task directory and check a file for me. I'd like to see the last few lines of a file called 'stdout_mon.txt'. It can be found in the task directory, which will be under your BOINC install 'data' directory:
e.g. c:\Program Files\BOINC\data\projects\climateprediction.net\wah2_eas25_n15e_201212_24_1008_012272746_0\stdout_mon.txt
Note that the 'data' directory under BOINC is usually hidden, so you'll need to show hidden folders in File Explorer.
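
For anyone who would rather not open the file by hand, here is a minimal Python sketch that prints the last few lines of a log such as 'stdout_mon.txt'. The function name `tail_lines` is made up for illustration, and the path you pass in is an assumption that depends on where your BOINC data directory actually lives.

```python
from collections import deque


def tail_lines(path, count=5):
    """Return the last `count` lines of a text file, without line endings."""
    # errors="replace" avoids a crash if the log contains odd bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the most recent `count` lines in memory.
        return [line.rstrip("\r\n") for line in deque(handle, maxlen=count)]
```

Calling `tail_lines(r"<task directory>\stdout_mon.txt", 5)` and printing each returned line gives the same view as scrolling to the bottom of the file in a text editor.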

The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results). It sounds like yours has crashed and I'd like to see how far it got.

After this, rather than Abort I suggest doing 'End Task' in Resource Monitor (right click on the correct process name). I *think* this will avoid your host being marked down for aborting tasks.

Many thanks.
---
CPDN Visiting Scientist
ID: 70834
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70836 - Posted: 17 Apr 2024, 20:43:03 UTC - in response to Message 70834.  

I have 9 tasks currently running on the system (it's a bit short on disk, I should fix that).

I see 9 wah2_8.29 tasks, 9 wah2am3m2_8.29 tasks, and 9 of the wah2am3m2t_8.29 tasks. One or two of them are showing no CPU use, though. Not sure how to link PIDs to tasks, though.

But looking more closely at the task, "CPU Time" reports "1d 03:51:19" on an elapsed time of 6d and change.
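
To put a number on that, here is a rough Python sketch that converts the "1d 03:51:19" style of duration into seconds and compares CPU time with elapsed time. The helper names are hypothetical, and the elapsed time of exactly 6 days is an assumption standing in for "6d and change".

```python
import re

# Matches durations such as "1d 03:51:19" or "03:51:19" (the day part is optional).
_DURATION = re.compile(r"^(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})$")


def duration_to_seconds(text):
    """Convert a 'Nd HH:MM:SS' or 'HH:MM:SS' string to a number of seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_fraction(cpu_time, elapsed_time):
    """Fraction of the elapsed time that the task actually spent on the CPU."""
    return duration_to_seconds(cpu_time) / duration_to_seconds(elapsed_time)


# Assumed elapsed time: exactly 6 days, so the real fraction is slightly lower.
print(f"{cpu_fraction('1d 03:51:19', '6d 00:00:00'):.1%}")
```

With those inputs the task has used under 20% of its elapsed time on the CPU, which fits a task that stopped doing work after roughly the first day.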

There's no C:\Program Files\BOINC\data directory, but I've got a C:\ProgramData\BOINC directory with that sort of stuff.

stderr_rm and stderr_um are both empty.

stdout_mon.txt:

worker: Created shared memory region key = wah2_eas25_n15e_201212_24_1008_012272746 of size 73278744 bytes (version 608)
Run for 2 Years and 0 Months
pShMem->PRECIS_LATITUDE 185
pShMem->PRECIS_LONGITUDE 285
pShMem->EWSPACEA 0.220000
pShMem->NSSPACEA 0.220000
pShMem->FRSTLATA 19.100000
pShMem->FRSTLONA 328.500000
pShMem->POLELATA 55.500000
pShMem->POLELONA 308.000000
pShMem->L_RUN_REGION 1
pShMem->UPLOAD_INTERVAL 0
ulTotalPhaseTimestep 276864
Starting model ID wah2_eas25_n15e_201212_24_1008_012272746   Phase 1
Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.29_windows_intelx86.exe" wah2_eas25_n15e_201212_24_1008_012272746 generic_phase1_spinup_eas25_global_aabaka ic19610319_14_N96 NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SST_2009-01-01_2022-12-30_v2403 NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SIC_2009-01-01_2022-12-30_v2403 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_n15e_201212_24_1008_012272746
executeModelProcess: MonID=7868, GCM_PID=9088, RCM_PID=9824


stdout_rm.txt:

Starting HadRM3 model for ID# wah2_eas25_n15e_201212_24_1008_012272746...
Attached to shared memory segment with ID
Setting run-time Fortran environment...

UM environment variables in use:
ASTART=dataout/region_restart.day
UNIT11=dataout/xacxf.phist
UM_SECTOR_SIZE=2048
UNIT02=jobs/xacxf
UM_LBC_COUP=0
VN=4.5
TYPE=CRUN
UNIT09=tmp/xacxf.namelists
UNIT22=datain/ancil/ctldata/STASHmaster
STASETS_DIR=datain/ancil/ctldata/stasets
CACHE2=tmp/xacxf.cache2
UNIT08=tmp/xacxf.pipe_dummy
UNIT14=tmp/xacxf.errors
APSUM1=tmp/xacxf.apsum1
APSTMP1=tmp/xacxf.apstmp1
AOTRANS=tmp/xacxf.aotrans
UNIT04=jobs/xacxf.stashc
UNIT05=jobs/xacxf.namelists
DATAM=dataout/
UNIT12=dataout/xacxf.thist
UNIT10=dataout/xacxf.phist
UNIT06=dataout/xacxf.out
UNIT00=dataout/xacxf.err
AINITIAL=dataout/region_restart.day
UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3
UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3
SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3
LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3

Changing to slots dir C:\ProgramData\BOINC\slots\1


stdout_um.txt:

Starting HadAM3P model for ID# wah2_eas25_n15e_201212_24_1008_012272746...
Attached to shared memory segment with ID
Setting run-time Fortran environment...

UM environment variables in use:
ASTART=datain/dumps/generic_phase1_spinup_eas25_global_aabaka
UNIT15=datain/ancil/ic19610319_14_N96
SSTIN=datain/ancil/NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SST_2009-01-01_2022-12-30_v2403
SICEIN=datain/ancil/NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SIC_2009-01-01_2022-12-30_v2403
SULPEMIS=datain/ancil/so2dms_prei_N96_1855_0000P
CHEMOXID=datain/ancil/oxi.addfa
OZONE=datain/ancil/ozone_preind_N96_1879_0000Pv5
UM_LBC_COUP=0
UNIT11=dataout/xadae.phist
UM_SECTOR_SIZE=2048
UNIT02=jobs/xadae
VN=4.5
TYPE=CRUN
UNIT09=tmp/xadae.namelists
AINITIAL=dataout/atmos_restart.day
UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3
UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3
SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3
LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3
UNIT22=datain/ancil/ctldata/STASHmaster
STASETS_DIR=datain/ancil/ctldata/stasets
CACHE2=tmp/xadae.cache2
UNIT08=tmp/xadae.pipe_dummy
UNIT14=tmp/xadae.errors
APSUM1=tmp/xadae.apsum1
APSTMP1=tmp/xadae.apstmp1
AOTRANS=tmp/xadae.aotrans
UNIT04=jobs/xadae.stashc
UNIT05=jobs/xadae.namelists
DATAM=dataout/
UNIT12=dataout/xadae.thist
UNIT10=dataout/xadae.phist
UNIT06=dataout/xadae.out
UNIT00=dataout/xadae.err
UM_ATM_NPROCX=1
UM_ATM_NPROCY=1
UM_NPES=1
RUNID=xadae

Changing to slots dir C:\ProgramData\BOINC\slots\1
Closing model...
Detaching shared memory segment...


I don't see anything obviously wrong in them... I'm tempted to suspend and resume that task, see if it comes back up properly.
ID: 70836
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,578,380
RAC: 15,009
Message 70839 - Posted: 18 Apr 2024, 11:37:57 UTC - in response to Message 70836.  
Last modified: 18 Apr 2024, 11:40:12 UTC

Thanks. I can see the problem.
I have 9 tasks ... I see 9 wah2_8.29 tasks, 9 wah2am3m2_8.29 tasks, and 9 of the wah2am3m2t_8.29 tasks. One or two of them are showing no CPU use, though. Not sure how to link PIDs to tasks, though.
There are several ways to link a PID to a task; I like Resource Monitor. Start it up and, on the CPU tab, scroll to find the CPDN task process you are interested in. Tick the little checkbox to the left of that process, then open the 'Associated Handles' section (little up/down arrow on the title bar below). It will show you all the files and folders associated with the process.
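
Another possible route, sketched here in Python, is to read the PIDs straight out of the task's 'stdout_mon.txt', which in the log posted above contains a line of the form "executeModelProcess: MonID=..., GCM_PID=..., RCM_PID=...". The function name is hypothetical, and reading MonID as the monitor process, GCM_PID as the global model and RCM_PID as the regional model is an assumption based on the names alone.

```python
import re

# Pattern for the line seen in stdout_mon.txt, e.g.
# "executeModelProcess: MonID=7868, GCM_PID=9088, RCM_PID=9824"
_PID_LINE = re.compile(
    r"executeModelProcess:\s*MonID=(\d+),\s*GCM_PID=(\d+),\s*RCM_PID=(\d+)"
)


def pids_from_monitor_log(log_text):
    """Return the PIDs from the last executeModelProcess line, or None if absent."""
    matches = _PID_LINE.findall(log_text)
    if not matches:
        return None
    # Use the last match in case the task was restarted and logged new PIDs.
    monitor, global_model, regional_model = (int(value) for value in matches[-1])
    return {
        "monitor": monitor,  # assumed meaning of MonID
        "global_model": global_model,  # assumed meaning of GCM_PID
        "regional_model": regional_model,  # assumed meaning of RCM_PID
    }
```

The returned numbers can then be matched against the PID column in Resource Monitor or Task Manager.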

stdout_um.txt:
Starting HadAM3P model for ID# wah2_eas25_n15e_201212_24_1008_012272746...
... <snip>...
RUNID=xadae
Changing to slots dir C:\ProgramData\BOINC\slots\1
Closing model...
Detaching shared memory segment...

I don't see anything obviously wrong in them... I'm tempted to suspend and resume that task, see if it comes back up properly.
The last lines of the global model's output in 'stdout_um.txt' show the problem: 'Closing model'. That means the model has stopped, but for some reason BOINC hasn't recognised this and the process hasn't exited. That's why the task has hung: the global model isn't running, so the other two processes are just sitting waiting.
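
As a rough illustration of that check, the Python sketch below looks for 'Closing model' among the final non-blank lines of a log. The function names are invented for this example, and treating that message as the only sign of a stopped model is an assumption drawn from this one task.

```python
def last_nonblank_lines(log_text, count=2):
    """Return the final `count` non-blank lines of a log, stripped of whitespace."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return lines[-count:]


def model_has_closed(log_text):
    """True if 'Closing model' appears among the final non-blank log lines."""
    return any(
        line.startswith("Closing model") for line in last_nonblank_lines(log_text)
    )
```

If this returns True for 'stdout_um.txt' while the task still shows as running, the task is probably in the hung state described here.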

Rather than suspend/resume, I would shut down the client to kill the processes. Make sure they really have gone (check Resource Monitor) and then start up the client again. It's possible the tasks will then error but that's what you need anyway.

HTH

p.s. I've just checked the machine this was running on. I noticed it's only got 8GB RAM. How many CPDN tasks are running simultaneously and how much of that 8GB is BOINC allowed to use? Am thinking you might have hit a memory limit causing this odd behaviour.
---
CPDN Visiting Scientist
ID: 70839
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70840 - Posted: 18 Apr 2024, 17:09:35 UTC - in response to Message 70839.  


p.s. I've just checked the machine this was running on. I noticed it's only got 8GB RAM. How many CPDN tasks are running simultaneously and how much of that 8GB is BOINC allowed to use? Am thinking you might have hit a memory limit causing this odd behaviour.


It's running 9 tasks, due to disk limits, showing 6.8GB of 8 in use, and BOINC is allowed 90% of RAM.
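
For a sense of scale, here is a back-of-the-envelope Python sketch of what those figures leave per task. The function names are made up, and the even split across tasks is an assumption for illustration only, since real tasks do not share memory equally and the 6.8GB figure presumably includes everything else running in the VM.

```python
def boinc_memory_budget_gb(total_ram_gb, allowed_percent):
    """RAM that BOINC may use, given the 'use at most N%' preference."""
    return total_ram_gb * allowed_percent / 100.0


def per_task_share_gb(total_ram_gb, allowed_percent, task_count):
    """Even split of the BOINC budget across tasks (a simplifying assumption)."""
    return boinc_memory_budget_gb(total_ram_gb, allowed_percent) / task_count


for ram_gb in (8, 12):
    budget = boinc_memory_budget_gb(ram_gb, 90)
    share = per_task_share_gb(ram_gb, 90, 9)
    print(f"{ram_gb} GB RAM: budget {budget:.1f} GB, about {share:.1f} GB per task")
```

With 8GB and a 90% limit the budget is 7.2GB, or about 0.8GB per task across 9 tasks; with 12GB it rises to 10.8GB, or about 1.2GB per task.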

I'll just reboot the VM and give it more RAM - it's able to have at least 12GB, the system isn't running anything else.
ID: 70840
f300
Joined: 17 Feb 06
Posts: 2
Credit: 1,157,548
RAC: 2,831
Message 70842 - Posted: 19 Apr 2024, 1:45:47 UTC - in response to Message 70834.  

The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results).

Hello,
I have 4 1008 tasks at about 60% on a Ryzen 3600x. Is it worth continuing them if they don't output correct results or should I abort them?
Thanks!
ID: 70842
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70843 - Posted: 19 Apr 2024, 9:05:01 UTC - in response to Message 70842.  

I am going to let mine continue unless I see a message from Glenn here or someone else at the project asking for them to be aborted. (Mine are suspended currently to let some testing branch tasks go through.)
ID: 70843
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,578,380
RAC: 15,009
Message 70844 - Posted: 19 Apr 2024, 11:49:51 UTC - in response to Message 70842.  

Please let it run. We're not 100% certain of the results. They are a useful comparison to the failed Intel runs.

The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results).

Hello,
I have 4 1008 tasks at about 60% on a Ryzen 3600x. Is it worth continuing them if they don't output correct results or should I abort them?
Thanks!

---
CPDN Visiting Scientist
ID: 70844
f300
Joined: 17 Feb 06
Posts: 2
Credit: 1,157,548
RAC: 2,831
Message 70846 - Posted: 19 Apr 2024, 15:21:41 UTC - in response to Message 70844.  

Ok, will do! Thanks 👍
ID: 70846
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,743,089
RAC: 6,177
Message 70847 - Posted: 19 Apr 2024, 16:29:07 UTC

Task 22425060 (test batch 1014, Intel) has finished successfully.
ID: 70847
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,578,380
RAC: 15,009
Message 70914 - Posted: 10 May 2024, 20:40:20 UTC
Last modified: 10 May 2024, 20:41:32 UTC

Weather@Home running on Intel crashes at start of new year.

Followers of this thread will recall there were problems with batches 1008-1012, where the regional model would crash when it started the new calendar year. This only happened on Intel CPUs and not on AMD ones. I've been spending time understanding what's going on. The behaviour of the model was "correct" on Intel: it should have crashed.

The problem is caused by a bug in the model code which causes a memory overwrite. Not a lot but enough to do some damage. It turns out this bug has been in the code from the time CPDN originally obtained it from the UK MetO (who have since moved this code on I hasten to add). The impact of the code bug was data dependent and also compiler optimization dependent. The problem was in a part of the model code that recomputed the solar flux variability at the start of a new year. A scalar variable was being passed to a subroutine when it should have been an array of values. As the solar variability is small year on year and Weather@Home runs are relatively short, analysis shows this only has a minimal effect on model results. Certainly less than the variability introduced by the ensemble of forcings.
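
To illustrate the kind of bug described, here is a toy Python model of a flat block of memory. It is not the model code, and the names and values are invented. It only shows how a routine that expects an array, but is handed the location of a single scalar, ends up writing over whatever happens to sit next to it.

```python
def recompute_flux(memory, start, count, value):
    """Callee that assumes `count` consecutive slots from `start` are its array."""
    for offset in range(count):
        memory[start + offset] = value


def demo():
    # Toy layout: slot 0 is the scalar the caller owns,
    # slots 1-3 belong to unrelated variables.
    memory = [1.0, 10.0, 20.0, 30.0]
    # The caller passes the scalar's location, but the callee writes 3 values.
    recompute_flux(memory, start=0, count=3, value=0.5)
    return memory


print(demo())
```

In this toy case two neighbouring slots are overwritten. What actually sits in those neighbouring locations depends on how the compiler lays out memory, which would be consistent with the impact differing between builds and optimisation levels.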

Investigating the crashes also identified another problem. There was a slight discrepancy between the land/sea mask used by the new sea-surface temperature input data and the one used by the model itself for some of the EAS25 batches. This led to some extra bogus sea-ice points appearing off the western edge of some coasts. This has now been corrected and verified with tests.
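
A mismatch of this kind can be pictured with a small Python sketch that compares two masks point by point. The grids below are invented toy data (1 for land, 0 for sea), and the function name is hypothetical; the real masks and the tooling used to check them are not shown in this thread.

```python
def mask_mismatches(model_mask, input_mask):
    """Return (row, col) positions where two land/sea masks disagree."""
    if len(model_mask) != len(input_mask) or any(
        len(a) != len(b) for a, b in zip(model_mask, input_mask)
    ):
        raise ValueError("masks must have the same shape")
    mismatches = []
    for row, (model_row, input_row) in enumerate(zip(model_mask, input_mask)):
        for col, (model_point, input_point) in enumerate(zip(model_row, input_row)):
            if model_point != input_point:
                mismatches.append((row, col))
    return mismatches


# Toy 3x4 grids: the input data treats one coastal point as sea
# where the model has land.
model = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 1, 0]]
sst_input = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0]]
print(mask_mismatches(model, sst_input))
```

Any position reported is a grid point where the two datasets disagree about land versus sea, which is where spurious values could appear.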

The code changes will require a new app version. This is being prepared though I am also making some more improvements to the exception handling and a few other aspects. It will be a couple of weeks before a new app appears. We will then rerun one of the earlier batches to do an analysis of the differences.
---
CPDN Visiting Scientist
ID: 70914
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,439,502
RAC: 10,510
Message 70915 - Posted: 11 May 2024, 14:44:28 UTC - in response to Message 70914.  

Glenn, Thank you for the explanation.
ID: 70915