Batch 1008, and test batches 1009 to 1014 for Windows - issues
Author | Message |
---|---|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
"Yes, Phenom II, all Ryzen and Threadripper CPUs support SSE4.2." I think I had a vague memory of SSE4a, but that'll be ancient history for current processors. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
I would suggest that those with Intel processors set CPDN to no new tasks until this is sorted. Edit: It is possible the batch might be closed, which would stop resends and let those with work on AMD machines complete it. Edit: I think it is being paused, which will stop resends. I have looked at over 20 hard fails; every single one is at the same point on an Intel machine. I have seven from the batch on my machine: four have produced 5 zips and trickle-up messages, one has produced four zips, and two are waiting to start. It is most odd. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving. Yes, this batch will be stopped from producing resends until we understand why testing did not show this problem. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
My pipsqueak computer, which crashed my latest four CPDN tasks, has a CPU with these features: Computer 1512658 CPU type GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1] Number of processors 8 Coprocessors --- Virtualization None Operating System Microsoft Windows 10 Core x64 Edition, (10.00.19045.00) BOINC version 7.24.1 Memory 15.64 GB Cache 256 KB Instruction Set Extensions Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512 |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
"I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving." Should I just abort the two that are yet to start? I have five others that I can save files from that have all produced either 4 or 5 zips. Or would looking at what happens at the point where they fail on Intel machines be more useful? |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
Hard to answer that as I'm not the project scientist and it's really their call together with CPDN. Personally, as a developer I have all the kit I need to debug on Intel & AMD, so don't spend time saving files. As a volunteer, if it was me, I'd abort the tasks yet to start and keep running the tasks currently going until told otherwise. They might be useful for comparison later. Sorry Dave, that's the best answer I can give at the moment. "I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving." "Should I just abort the two that are yet to start? I have five others that I can save files from that have all produced either 4 or 5 zips. Or would looking at what happens at the point where they fail on Intel machines be more useful?" --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
Thanks Glenn. I will abort the two not started yet as credit isn't an issue for me. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
There will be a small batch of about 100 workunits going out soon to test whether the issue we're seeing with this batch is related to some of the input files. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
"Thanks Glenn. I will abort the two not started yet as credit isn't an issue for me." I was clearly a bit premature with that, as I have picked up one more resend from 1008. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
I'm restarting work fetch on my 6 Windows machines, on a 10-minute stagger and with a limit of one per machine - that should maximise my chances of being one of the 'select 100'. Edit - and the next one in line got a task. Unfortunately, like Dave's, it's a resend from the previous (failing) run. Glenn, should I keep it, or send it straight back? It's the third copy, so should kill the workunit if I abort it. Workunit 12273481 |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
"I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving." The only reason I haven't asked why is I almost certainly will not understand the answer! ;) |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
Four of my six machines have now got resends from the 2nd April batch, but there's no sign of the test batch yet. I'll keep these out of circulation for the time being, until and unless Glenn can give us a more precise ETA. The trouble is that if our clients get consistent "no tasks" replies from the server, they stop asking (or at least, they ask less frequently). BOINC doesn't really take the needs of this type of test into account. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
I have deleted the last resend. It was a _2 so won't be sent again now. I have left the five started tasks from 1008 going and there is a resend from 1007 at 88%. I have also set the machine to no new tasks till I get some hints about the imminence of the 100 tasks being released. Edit: I think if BOINC were to cater for this type of test it would almost certainly mess something else up! Edit2: Given the time, I would not be surprised if the test doesn't arrive till Monday, though I have been caught out before by batches being released over the weekend. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
|
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,962,600 RAC: 21,639 |
"Starting to get some tasks from batch 1009 - I assume these are the test run." I can confirm these are from the test batch of 100 tasks. Edit: And I would guess they have all gone now so I won't get any unless there are failures. |
Send message Joined: 22 May 21 Posts: 39 Credit: 1,173,623 RAC: 3,907 |
So far, got tasks 22424380 and 22424396. Looks like both of these errored out as well. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
"Looks like both of these errored out as well." Yes, and at exactly the same place. The stdout_mon.txt file for 22424396 ends with: ... wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011616 A - 02/01/2010 00:00 - H:M:S=0007:43:16 AVG= 2.39 DLT= 1.15 wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011617 P - 01/01/2010 00:05 - H:M:S=0007:43:21 AVG= 2.39 DLT= 5.10 wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011618 P - 01/01/2010 00:10 - H:M:S=0007:43:29 AVG= 2.39 DLT= 8.17 Model crash detected, will try to restart... The other machine is in a different room, and I'll check it later. |
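For anyone who wants to compare where several tasks stopped without reading each log by eye, here is a minimal Python sketch. It assumes the stdout_mon.txt lines look like the ones quoted above and that the crash message sits on its own line in the real file; the function name and the regular expression are illustrative, not part of any CPDN or BOINC tool.

```python
import re

# Matches the "TS <step> <letter> - <dd/mm/yyyy hh:mm>" part of a log line.
# The layout is inferred from the lines quoted in this thread (an assumption).
STEP_RE = re.compile(r"TS\s+(\d+)\s+(\w)\s+-\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2})")

def last_step_before_crash(log_text):
    """Return (timestep, flag, model_date) of the last step logged before
    the first 'Model crash detected' line, or None if no crash is logged."""
    last = None
    for line in log_text.splitlines():
        if "Model crash detected" in line:
            return last
        match = STEP_RE.search(line)
        if match:
            last = (int(match.group(1)), match.group(2), match.group(3))
    return None

# The tail of the log quoted above, one entry per line.
SAMPLE = """\
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011616 A - 02/01/2010 00:00 - H:M:S=0007:43:16 AVG= 2.39 DLT= 1.15
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011617 P - 01/01/2010 00:05 - H:M:S=0007:43:21 AVG= 2.39 DLT= 5.10
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011618 P - 01/01/2010 00:10 - H:M:S=0007:43:29 AVG= 2.39 DLT= 8.17
Model crash detected, will try to restart...
"""

print(last_step_before_crash(SAMPLE))
```

Running the same function over the logs from different hosts would show at a glance whether they all stop at the same timestep.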
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
I don't want anyone spending any time looking at their failed tasks. Appreciate the response for the small test. There is one clue in the log output (which might be a red herring). The regional model calls boinc_finish with an error code of 193. On Windows that means a bad executable, so I'm looking at the library the model loads dynamically during the run to handle converting the model output. It's possible it's been corrupted in some way. If that's not it, then it's back to the model code. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,033 RAC: 10,812 |
Hmmm. I wouldn't be totally sure about that one. A Windows app which fails to start because of a missing DLL usually bombs out with: - exit code -1073741515 (0xc0000135) and the generic description is "The application failed to initialize properly". BOINC (and hence the BOINC library which is linked into the app or the wrapper) has its own set of error codes, which you can find at: https://github.com/BOINC/boinc/blob/master/lib/error_numbers.h They include both positive and negative values, so I'd suspect both of these, in addition to the MS Windows numbers: #define EXIT_SIGNAL 193 // app was killed by signal; #define ERR_INVALID_EVENT -193. That doesn't get us much further forward, but I'm still in brainstorming mode. |
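To make the overlap between these numbering schemes concrete, here is a small Python sketch. It shows how the signed exit code -1073741515 and the hex value 0xc0000135 are the same 32-bit number, and lists the candidate meanings of 193 and -193 mentioned in this thread. The lookup table is illustrative and deliberately incomplete; the helper names are made up for this example.

```python
def to_unsigned32(code):
    """Reinterpret an exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF

def to_signed32(code):
    """Reinterpret an exit code as a signed 32-bit value."""
    code &= 0xFFFFFFFF
    return code - 0x100000000 if code & 0x80000000 else code

# Candidate meanings only: which scheme applies depends on who set the code.
CANDIDATES = {
    193: ["Windows ERROR_BAD_EXE_FORMAT", "BOINC EXIT_SIGNAL"],
    -193: ["BOINC ERR_INVALID_EVENT"],
    -1073741515: ["Windows STATUS_DLL_NOT_FOUND"],
}

def describe(code):
    """Format a code in signed and hex form with its candidate meanings."""
    signed = to_signed32(code)
    names = CANDIDATES.get(signed, ["unknown"])
    return f"{signed} (0x{to_unsigned32(signed):08X}): " + "; ".join(names)

for value in (193, -193, 0xC0000135):
    print(describe(value))
```

The point of the table is that 193 alone is ambiguous: the number says nothing about whether it came from Windows, from the BOINC library, or from the model code itself.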
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,390,249 RAC: 15,269 |
Hi Richard, the model only loads the external library when it needs to convert the model raw output ready for sending. That doesn't happen at model start, but at fixed points in the forecast. So the model will start fine and load the library after some time. Hence a possible explanation for why they all fail on 1 Jan. The boinc_finish error code is whatever value was passed to it. It could come from the return/errno value of the LoadLibrary() call, or it might come from a Fortran operation. I'm still looking for the exact point of failure in the code. (https://stackoverflow.com/questions/38579909/loadlibrary-fails-with-error-code-193) --- CPDN Visiting Scientist |
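If the LoadLibrary() theory holds, error 193 would mean Windows did not accept the file as a valid image for the calling process, typically because the file is damaged or built for a different architecture. As a rough way to check a DLL by hand, the Python sketch below reads the PE header fields directly. It is an illustration only, not the project's own tooling, and it recognises just the two common machine types.

```python
import struct

# COFF machine-type values for the two common Windows targets.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)"}

def pe_machine(data):
    """Classify raw file bytes by their PE header, or report why they fail."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return "not a PE file (missing MZ header)"
    # The 4-byte offset of the PE signature is stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 6 or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return "corrupt PE file (missing PE signature)"
    # The machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown machine 0x{machine:04X}")

# Build a tiny fake header in memory to demonstrate the checks.
fake = bytearray(0x80)
fake[:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x40)
fake[0x40:0x44] = b"PE\0\0"
struct.pack_into("<H", fake, 0x44, 0x8664)

print(pe_machine(bytes(fake)))
print(pe_machine(b"garbage"))
```

On a real file the bytes would come from reading the DLL in binary mode. A 32-bit result for a library loaded by a 64-bit model executable, or a failed signature check, would both be consistent with error 193.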