Batch 1008, and test batches 1009 to 1014 for Windows - issues

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 70733 - Posted: 4 Apr 2024, 7:54:31 UTC - in response to Message 70727.  

Yes, Phenom II, all Ryzen and Threadripper CPUs support SSE4.2
I think I had a vague memory of SSE4a, but that'll be ancient history for current processors.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,662,422
RAC: 14,491
Message 70734 - Posted: 4 Apr 2024, 13:47:29 UTC
Last modified: 4 Apr 2024, 17:40:13 UTC

I would suggest that those with Intel processors set CPDN to no new tasks till this is sorted.

Edit: It is possible the batch might be closed which would stop resends and let those with work on AMD machines complete it.

Edit: I think it is being paused, which will stop resends. I have looked at over 20 hard fails, and every single one is at the same point on an Intel machine. I have seven tasks from the batch on my machine: four have produced 5 zips and trickle-up messages, one has produced 4 zips, and two are waiting to start. It is most odd.

Glenn Carver
Joined: 29 Oct 17
Posts: 1044
Credit: 16,196,312
RAC: 12,647
Message 70736 - Posted: 4 Apr 2024, 18:07:10 UTC - in response to Message 70734.  

I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving.

Yes, this batch will be stopped from producing resends until we understand why testing did not show this problem.
---
CPDN Visiting Scientist

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 70737 - Posted: 4 Apr 2024, 18:28:14 UTC - in response to Message 70727.  

My pipsqueak computer, which crashed my latest four CPDN tasks, has a CPU with these features.
Computer 1512658

CPU type:              GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors:  8
Coprocessors:          ---
Virtualization:        None
Operating System:      Microsoft Windows 10 Core x64 Edition, (10.00.19045.00)
BOINC version:         7.24.1
Memory:                15.64 GB
Cache:                 256 KB

Instruction Set Extensions: Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,662,422
RAC: 14,491
Message 70738 - Posted: 4 Apr 2024, 20:26:44 UTC

Quote: "I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving."
Should I just abort the two that are yet to start? I have five others that I can save files from that have all produced either 4 or 5 zips. Or would looking at what happens at the point where they fail on Intel machines be more useful?

Glenn Carver
Joined: 29 Oct 17
Posts: 1044
Credit: 16,196,312
RAC: 12,647
Message 70740 - Posted: 4 Apr 2024, 20:54:44 UTC - in response to Message 70738.  
Last modified: 4 Apr 2024, 20:55:40 UTC

Quote: "I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving."
Quote: "Should I just abort the two that are yet to start? I have five others that I can save files from that have all produced either 4 or 5 zips. Or would looking at what happens at the point where they fail on Intel machines be more useful?"
Hard to answer that, as I'm not the project scientist and it's really their call together with CPDN. Personally, as a developer I have all the kit I need to debug on Intel and AMD, so don't spend time saving files. As a volunteer, if it was me, I'd abort the tasks yet to start and keep running the tasks currently going until told otherwise. They might be useful for comparison later. Sorry Dave, that's the best answer I can give at the moment.
---
CPDN Visiting Scientist

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,662,422
RAC: 14,491
Message 70742 - Posted: 5 Apr 2024, 4:57:06 UTC

Thanks Glenn. I will abort the two not started yet, as credit isn't an issue for me.

Glenn Carver
Joined: 29 Oct 17
Posts: 1044
Credit: 16,196,312
RAC: 12,647
Message 70743 - Posted: 5 Apr 2024, 9:37:31 UTC
Last modified: 5 Apr 2024, 9:37:52 UTC

There will be a small batch of about 100 workunits going out soon to test whether the issue we're seeing with this batch is related to some of the input files.
---
CPDN Visiting Scientist

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,662,422
RAC: 14,491
Message 70744 - Posted: 5 Apr 2024, 9:54:37 UTC - in response to Message 70742.  

Quote: "Thanks Glenn. I will abort the two not started yet, as credit isn't an issue for me."
I was clearly a bit premature with that as I have picked up one more resend from 1008.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 70745 - Posted: 5 Apr 2024, 10:25:35 UTC
Last modified: 5 Apr 2024, 10:38:39 UTC

I'm restarting work fetch on my 6 Windows machines, on a 10-minute stagger and with a limit of one per machine - that should maximise my chances of being one of the 'select 100'.

Edit - and the next one in line got a task. Unfortunately, like Dave's, it's a resend from the previous (failing) run.

Glenn, should I keep it, or send it straight back? It's the third copy, so should kill the workunit if I abort it.

Workunit 12273481

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,662,422
RAC: 14,491
Message 70746 - Posted: 5 Apr 2024, 14:13:25 UTC

Quote: "I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving."
The only reason I haven't asked why is I almost certainly will not understand the answer! ;)

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 70747 - Posted: 5 Apr 2024, 14:39:17 UTC

Four of my six machines have now got resends from the 2nd April batch, but there's no sign of the test batch yet. I'll keep these out of circulation for the time being, until and unless Glenn can give us a more precise ETA.

The trouble is that if our clients get consistent "no tasks" replies from the server, they stop asking (or at least, they ask less frequently). BOINC doesn't really take the needs of this type of test into account.
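The slowing-down described above is the usual effect of a backoff policy. As a minimal sketch only, assuming a plain exponential backoff (the base delay, the cap and the function name are hypothetical and are not taken from the BOINC client source), the gap between requests could grow like this:

```python
def next_request_delay(consecutive_empty_replies, base_seconds=60, cap_seconds=86400):
    """Seconds to wait before the next work request.

    Hypothetical values: this is a generic exponential backoff,
    not BOINC's actual schedule.
    """
    if consecutive_empty_replies <= 0:
        return 0
    # Double the wait after each further "no tasks" reply, up to the cap.
    return min(cap_seconds, base_seconds * 2 ** (consecutive_empty_replies - 1))


delays = [next_request_delay(n) for n in range(8)]
print(delays)
```

With a policy of this shape, a host that has been told "no tasks" several times in a row can easily miss a short-lived batch of 100 workunits, which is why manually triggering a project update helps for this kind of test.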

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,662,422
RAC: 14,491
Message 70748 - Posted: 5 Apr 2024, 15:30:35 UTC
Last modified: 5 Apr 2024, 16:01:39 UTC

I have deleted the last resend. It was a _2, so it won't be sent again now. I have left the five started tasks from 1008 going and there is a resend from 1007 at 88%. I have also set the machine to no new tasks until I get some hint about how imminent the release of the 100 tasks is.

Edit: I think if BOINC were to cater for this type of test it would almost certainly mess something else up!

Edit2: Given the time, I would not be surprised if the test doesn't arrive till Monday, though I have been caught out before by batches being released over the weekend.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 70749 - Posted: 5 Apr 2024, 17:02:17 UTC
Last modified: 5 Apr 2024, 17:03:05 UTC

Starting to get some tasks from batch 1009 - I assume these are the test run.

So far, got tasks 22424380 and 22424396.

Not seeing them on the server status page yet, but that doesn't update in real time.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,662,422
RAC: 14,491
Message 70750 - Posted: 5 Apr 2024, 17:47:38 UTC
Last modified: 5 Apr 2024, 18:47:24 UTC

Quote: "Starting to get some tasks from batch 1009 - I assume these are the test run."
I can confirm these are from the test batch of 100 tasks.

Edit: And I would guess they have all gone now so I won't get any unless there are failures.

bullschuck
Joined: 22 May 21
Posts: 39
Credit: 1,126,387
RAC: 3,191
Message 70752 - Posted: 6 Apr 2024, 2:19:56 UTC - in response to Message 70749.  
Last modified: 6 Apr 2024, 2:21:14 UTC

Quote: "So far, got tasks 22424380 and 22424396."


Looks like both of these errored out as well.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 70753 - Posted: 6 Apr 2024, 6:46:19 UTC - in response to Message 70752.  

Quote: "Looks like both of these errored out as well."
Yes, and at exactly the same place.

The stdout_mon.txt file for 22424396 ends with:

...
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011616 A - 02/01/2010 00:00 - H:M:S=0007:43:16 AVG= 2.39 DLT= 1.15
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011617 P - 01/01/2010 00:05 - H:M:S=0007:43:21 AVG= 2.39 DLT= 5.10
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011618 P - 01/01/2010 00:10 - H:M:S=0007:43:29 AVG= 2.39 DLT= 8.17
Model crash detected, will try to restart...
The other machine is in a different room, and I'll check it later.
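For anyone comparing several hosts, the progress lines quoted above have a regular enough layout to parse. The following is a rough sketch that assumes every progress line looks like the three shown; the field names (`flag`, `avg`, `dlt` and so on) are labels chosen here for illustration, not documented names from the model.

```python
import re

# Pattern inferred from the three sample lines only; other log layouts may differ.
LINE_RE = re.compile(
    r"^(?P<task>\S+) - PH (?P<phase>\d+) TS (?P<timestep>\d+) (?P<flag>\S) - "
    r"(?P<model_time>\d{2}/\d{2}/\d{4} \d{2}:\d{2}) - "
    r"H:M:S=(?P<elapsed>\d+:\d{2}:\d{2}) AVG=\s*(?P<avg>[\d.]+) DLT=\s*(?P<dlt>[\d.]+)\s*$"
)


def summarise_log(lines):
    """Return (last parsed progress line as a dict, whether a crash was reported)."""
    last = None
    crashed = False
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            last = match.groupdict()
            last["timestep"] = int(last["timestep"])
        elif "Model crash detected" in line:
            crashed = True
    return last, crashed


sample = [
    "wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011616 A - 02/01/2010 00:00 - H:M:S=0007:43:16 AVG= 2.39 DLT= 1.15",
    "wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011617 P - 01/01/2010 00:05 - H:M:S=0007:43:21 AVG= 2.39 DLT= 5.10",
    "wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011618 P - 01/01/2010 00:10 - H:M:S=0007:43:29 AVG= 2.39 DLT= 8.17",
    "Model crash detected, will try to restart...",
]
last_line, crashed = summarise_log(sample)
print(last_line["timestep"], last_line["model_time"], crashed)
```

Running the same function over the log from a second machine would show directly whether it stops at the same timestep.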

Glenn Carver
Joined: 29 Oct 17
Posts: 1044
Credit: 16,196,312
RAC: 12,647
Message 70756 - Posted: 6 Apr 2024, 11:28:37 UTC

I don't want anyone spending any time looking at their failed tasks. Appreciate the response for the small test.

There is one clue in the log output (which might be a red herring). The regional model calls boinc_finish with an error code of 193. On Windows that means a bad executable, so I'm looking at the library the model loads dynamically during the run to handle converting the model output. It's possible it's been corrupted in some way. If that's not it, then it's back to the model code.
---
CPDN Visiting Scientist

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 70757 - Posted: 6 Apr 2024, 12:10:40 UTC - in response to Message 70756.  

Hmmm. I wouldn't be totally sure about that one. A Windows app which fails to start because of a missing DLL usually bombs out with:

- exit code -1073741515 (0xc0000135)
and the generic description is "The application failed to initialize properly".
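The two forms of that exit code are the same 32-bit value read as signed or unsigned. A small sketch of the conversion (the helper names are made up for this example):

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as unsigned."""
    return code & 0xFFFFFFFF


def to_signed32(code):
    """Reinterpret an unsigned 32-bit status value as signed."""
    code &= 0xFFFFFFFF
    return code - 0x100000000 if code & 0x80000000 else code


print(hex(to_unsigned32(-1073741515)))  # 0xc0000135
print(to_signed32(0xC0000135))          # -1073741515
```

A small positive value such as 193 is unchanged by either conversion, so it does not look like one of these large negative Windows status codes.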

BOINC (and hence the BOINC library which is linked into the app or the wrapper) has its own set of error codes, which you can find at:

https://github.com/BOINC/boinc/blob/master/lib/error_numbers.h

They include both positive and negative values, so I'd suspect both of these, in addition to the MS Windows numbers:

#define EXIT_SIGNAL 193 // app was killed by signal
#define ERR_INVALID_EVENT -193

That doesn't get us much further forward, but I'm still in brainstorming mode.
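To keep the competing readings of "193" side by side, here is a small lookup sketch. The two BOINC entries are the defines quoted above; the Windows entry is the system error code 193 (ERROR_BAD_EXE_FORMAT). The table and function names are invented for this example, and the list is not exhaustive.

```python
# (namespace, value) -> meaning. Only the three candidates discussed in this thread.
CANDIDATES = {
    ("windows", 193): "ERROR_BAD_EXE_FORMAT: not a valid Win32 application",
    ("boinc", 193): "EXIT_SIGNAL: app was killed by signal",
    ("boinc", -193): "ERR_INVALID_EVENT",
}


def interpretations(code):
    """List every known meaning of a code, ignoring its sign."""
    return sorted(
        f"{namespace} {value}: {meaning}"
        for (namespace, value), meaning in CANDIDATES.items()
        if abs(value) == abs(code)
    )


for entry in interpretations(193):
    print(entry)
```

Which of the three applies depends on where the value passed to boinc_finish originated, which the log alone does not say.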

Glenn Carver
Joined: 29 Oct 17
Posts: 1044
Credit: 16,196,312
RAC: 12,647
Message 70758 - Posted: 6 Apr 2024, 15:53:02 UTC - in response to Message 70757.  
Last modified: 6 Apr 2024, 15:55:55 UTC

Hi Richard, the model only loads the external library when it needs to convert the model raw output ready for sending. That doesn't happen at model start, but at fixed points in the forecast. So the model will start fine and load the library after some time. Hence a possible explanation for why they all fail on 1 Jan.

The boinc_finish error code is whatever value was passed to it. It could come from the return/errno value of the LoadLibrary() call, or it might come from a Fortran operation. I'm still looking for the exact point of failure in the code.

(https://stackoverflow.com/questions/38579909/loadlibrary-fails-with-error-code-193)
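The deferred-load behaviour described above can be illustrated with a toy loop. This is only a sketch of the idea, not the model's code: `run_model`, the step counts and the deliberately invalid library file are all invented for the example, and Python's `ctypes.CDLL` stands in for the LoadLibrary() call.

```python
import ctypes
import os
import tempfile


def run_model(total_steps, convert_every, library_path):
    """Toy time loop that loads the conversion library only at output steps."""
    for step in range(1, total_steps + 1):
        # ... one model timestep would be computed here ...
        if step % convert_every == 0:
            try:
                ctypes.CDLL(library_path)  # deferred load of the converter
            except OSError as exc:
                # On Windows a bad image typically reports winerror 193;
                # elsewhere fall back to errno, or 1 if none is set.
                code = getattr(exc, "winerror", None) or exc.errno or 1
                return step, code
    return total_steps, 0


# A file that is certainly not a valid shared library.
with tempfile.NamedTemporaryFile(suffix=".dll", delete=False) as handle:
    handle.write(b"this is not a valid shared library")
    bad_library = handle.name

try:
    failed_step, exit_code = run_model(total_steps=10, convert_every=4, library_path=bad_library)
finally:
    os.remove(bad_library)

print(failed_step, exit_code)
```

The loop runs cleanly for the first three steps and only fails at the first conversion point, which mirrors tasks that start normally and then all stop at the same model date.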
---
CPDN Visiting Scientist

©2024 cpdn.org