climateprediction.net (CPDN) home page
Thread 'New work discussion - 2'

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68942 - Posted: 24 Jun 2023, 0:42:49 UTC - in response to Message 68940.  
Last modified: 24 Jun 2023, 1:10:07 UTC

Looks like intel(r) xeon(r) cpu x5650 doesn't have AVX support.
So why was I sent the tasks? Take Asteroids, for example: they actually send out different programs depending on whether you have AVX, SSE3, etc. BOINC has a system to note the capabilities of the CPU; it's in the messages tab on startup. Big long list like fma,avx,sse2,ssse3,etc,etc,etc. The CPDN server should look at that. Maybe it's too much hassle and they just send them out to everyone, then keep resending until they happen to go to an AVX machine. But since they're set to be tried only three times, a lot are going to go to a non-AVX machine three times in a row. I guess someone will tweak something when they get a pile of them back unprocessed.

And why do I have one task working ok on the Xeon?

Also previous batches have run on older computers ok here.

Also, my other fast machine (Ryzen 9 3900X) crashed them all: https://www.cpdn.org/results.php?hostid=1535126
But this machine (Ryzen 9 3900XT - pretty much the same processor) is fine.
ID: 68942
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68944 - Posted: 24 Jun 2023, 1:54:43 UTC - in response to Message 68942.  

Big long list like fma,avx,sse2,ssse3,etc,etc,etc. The CPDN server should look at that. Maybe it's too much hassle


Do you mean like this:

Processor features:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512_vnni md_clear flush_l1d arch_capabilities


That is the processor on my main (Linux) machine. Scanning that for things is not trivial, though: who knows what matters and what does not? And are they always in the same order?
ID: 68944
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68945 - Posted: 24 Jun 2023, 2:04:35 UTC - in response to Message 68944.  
Last modified: 24 Jun 2023, 2:05:22 UTC

Do you mean like this:

Processor features:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512_vnni md_clear flush_l1d arch_capabilities


That is the processor on my main (Linux) machine. Scanning that for things is not trivial, though: who knows what matters and what does not? And are they always in the same order?
Yes, like that. It should be easy to search the string for "avx". The Asteroids server must be doing it: it tries fma, avx, sse2, and sse3 versions of the program on my computers to see which is fastest. It never tried the avx version on the older ones, so it must have checked the string first. They either need to check, or allow the task to crash more than twice. At the moment, if a task is sent to three non-AVX machines in a row (very likely), it will fail and not be resent, and require human intervention.
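A minimal sketch of the kind of check being suggested, in Python. How the Asteroids or CPDN servers actually inspect the feature list is not stated in this thread, so the function names `parse_features` and `has_feature` are hypothetical, as is the `old_cpu` list. Splitting the list into a set of whole flags means the order of the flags does not matter, and it avoids a plain substring search reporting "avx" when only "avx2" or "avx512f" is listed.

```python
def parse_features(feature_line: str) -> set[str]:
    """Split a space- or comma-separated CPU feature list into a set of flags."""
    return {flag.lower() for flag in feature_line.replace(",", " ").split()}


def has_feature(feature_line: str, flag: str) -> bool:
    """Return True only if `flag` appears as a whole entry in the list."""
    return flag.lower() in parse_features(feature_line)


# Hypothetical feature list for an older CPU without AVX.
old_cpu = "fpu vme de pse sse sse2 ssse3 sse4_1 sse4_2 popcnt aes"
# A few flags taken from the list posted above.
new_cpu = "fpu sse sse2 ssse3 fma sse4_1 sse4_2 aes xsave avx f16c avx2 avx512f"

print(has_feature(old_cpu, "avx"))
print(has_feature(new_cpu, "avx"))
```

A whole-flag match is a design choice, not a requirement: a server that only cares whether any AVX variant is present could search for the substring instead.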
ID: 68945
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68946 - Posted: 24 Jun 2023, 5:34:14 UTC

And why do I have one task working ok on the Xeon?

Also previous batches have run on older computers ok here.

I think the AVX thing might be a red herring. Sarah thinks one particular ancil file may be the problem so just those two older machines happened to get a run of tasks with the dodgy file.

Looking at some of the hard fails of which there are a fair few now, some have crashed on Ryzen9 machines so I think it is more likely an issue with part of the batch.
ID: 68946
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 68948 - Posted: 24 Jun 2023, 5:45:08 UTC - in response to Message 68945.  

To my knowledge, the wah2 application was not compiled with an AVX optimization switch. The wah2 executables were last compiled in November 2016. The last I knew, SSE2 was the highest-level optimization used for compiling these models.

If it was the AVX thing, then every Windows batch since late 2016 would be displaying similar behavior with older PCs throwing errors. However, that has not occurred.

In this batch, of the 3 tasks I have been running on my Ryzen 5600 for 17 hours, 2 belong to work units with 2 previous SEGV (signal 11) errors, and 1 belongs to a work unit with 1 error of that type. All of the PCs that failed tasks in those work units have AVX capability, as does my Ryzen, which hasn't failed any of those 3 tasks so far.

If it was an ancillary file error, I would think all tasks in the work unit should have failed.

This is very frustrating, and it is not at all obvious what the problem might be. Hopefully Sarah's analysis of the errors will turn up some correlation that explains why this is happening.

Also frustrating: while my trickles are going up fine, the first zip file is failing to upload to upload7 saying it can't connect.
ID: 68948
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68949 - Posted: 24 Jun 2023, 6:48:14 UTC - in response to Message 68946.  

I think the AVX thing might be a red herring. Sarah thinks one particular ancil file may be the problem so just those two older machines happened to get a run of tasks with the dodgy file.

Looking at some of the hard fails of which there are a fair few now, some have crashed on Ryzen9 machines so I think it is more likely an issue with part of the batch.
Well, I've had 24 out of 24 work perfectly on this Ryzen 9 3900XT, and all my other computers screwed up several tasks before the server banned them for a day. My other Ryzen 9 (3900X) also screwed up, but don't count that, as it messes up other programs at random; it's a dodgy CPU/MB. Seems a bit of a coincidence. If it's not instruction sets, could it be something to do with timings? Is there a tendency for slower machines to be more likely to crash them?
ID: 68949
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68950 - Posted: 24 Jun 2023, 6:52:14 UTC - in response to Message 68949.  

Is there a tendency for slower machines to be more likely to crash them?


Not from the hard fails I have looked at. Out of ten I examined, every one failed on at least one more recent and faster machine.
ID: 68950
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68951 - Posted: 24 Jun 2023, 7:04:09 UTC

My failures were all memory faults, with the task accessing memory it shouldn't. Not sure why tasks would do that, then not do that on the next machine. Also, my failing machines have 64 GB of RAM, so it's not a shortage. The good one does happen to have the most at 96 GB, but it's nowhere near using it all. This randomness doesn't make any sense.
ID: 68951
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,716,561
RAC: 8,355
Message 68953 - Posted: 24 Jun 2023, 7:35:52 UTC - in response to Message 68948.  

... the first zip file is failing to upload to upload7 saying it can't connect.
That one, at least, would be capable of diagnosis by enabling the <http_debug> event log flag and analysing the log output from a retry.
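For anyone wanting to try this, here is a rough Python sketch that builds the text of a `cc_config.xml` with the `http_debug` log flag switched on. The helper name `build_cc_config` is made up for illustration; the element names (`cc_config`, `log_flags`, `http_debug`) follow the BOINC client configuration format as I understand it. It only prints the XML and does not touch any file, so an existing `cc_config.xml` with other settings is not overwritten.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags: dict[str, bool]) -> str:
    """Return cc_config.xml text with the given log flags set to 1 or 0."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name, enabled in flags.items():
        ET.SubElement(log_flags, name).text = "1" if enabled else "0"
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


print(build_cc_config({"http_debug": True}))
```

The printed text would go into `cc_config.xml` in the BOINC data directory (merged by hand with anything already there), after which the client needs to re-read its config files before the upload is retried. The same flag can usually be ticked in the BOINC Manager event log options instead.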
ID: 68953
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68955 - Posted: 24 Jun 2023, 8:53:42 UTC - in response to Message 68953.  

... the first zip file is failing to upload to upload7 saying it can't connect.


Sarah is going to ask Andy to check out the Korean server.
ID: 68955
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,716,561
RAC: 8,355
Message 68956 - Posted: 24 Jun 2023, 9:40:39 UTC - in response to Message 68955.  

Andy said some time ago, and repeated late yesterday evening (in response to my request for more RAC data), that he would be unavailable all day Monday. Do they have anyone else who could check?
ID: 68956
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68958 - Posted: 24 Jun 2023, 9:51:31 UTC - in response to Message 68956.  

Sarah has also asked the researcher in Korea to check.
ID: 68958
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68963 - Posted: 25 Jun 2023, 0:15:16 UTC - in response to Message 68950.  
Last modified: 25 Jun 2023, 0:19:11 UTC

Is there a tendency for slower machines to be more likely to crash them?
Not from the hard fails I have looked at. Out of ten I examined, every one failed on at least one more recent and faster machine.
I'm convinced there's something different about one of my machines. It has worked on every single one of 24 tasks. All the other machines are failing three quarters of them within 10 minutes. It's a modern Ryzen with memory protection turned off in the Windows settings. Since the tasks are producing a memory fault, could this have anything to do with it? I keep memory protection off because otherwise VirtualBox doesn't work for other projects. The failing machines also have it off, but they're older machines. Maybe these tasks only work on a modern CPU with no memory protection? Ryzens also have a bigger cache, which might reduce these problems?

The good machine: https://www.cpdn.org/show_host_detail.php?hostid=1509739

The other machines: https://www.cpdn.org/hosts_user.php?userid=2002390
ID: 68963
Bill F
Joined: 17 Jan 09
Posts: 124
Credit: 2,037,778
RAC: 2,752
Message 68964 - Posted: 25 Jun 2023, 2:37:35 UTC

I have three Windows 10 systems: two are failing every task and one is chugging along, working happily on a task. Age-wise it is the girl in the middle. The other two systems are either slower or faster, and all are similar 4-processor systems with the same version of BOINC and Win 10.

When the answer is finally figured out we will probably all shake our heads.
ID: 68964
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68965 - Posted: 25 Jun 2023, 2:39:38 UTC - in response to Message 68964.  

All mine can do a task properly now and again. But my Ryzen is the only one to have done 24 of 24 correctly. The others are 1 in 4. Your one probably got lucky.
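To put rough numbers on the "got lucky" argument, here is a small Python sketch. It assumes each task succeeds independently with the 1-in-4 rate quoted above, which is a simplification and not something established in this thread; `prob_all_succeed` is just an illustrative helper.

```python
def prob_all_succeed(p_success: float, n_tasks: int) -> float:
    """Chance that n independent tasks all succeed, each with probability p_success."""
    return p_success ** n_tasks


one_task = prob_all_succeed(0.25, 1)    # a single task finishing on a "1 in 4" machine
all_24 = prob_all_succeed(0.25, 24)     # 24 of 24 finishing on such a machine

print(f"1 of 1:   {one_task:.2f}")
print(f"24 of 24: {all_24:.2e}")
```

Under that assumption a single good task is unremarkable, while 24 out of 24 would be vanishingly unlikely by chance alone, which is why a machine with a perfect record points at something specific to that machine.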
ID: 68965
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68966 - Posted: 25 Jun 2023, 6:34:09 UTC - in response to Message 68965.  

Hard fails are now over 1,400. I have now looked at over 20. Every single one has at least one failure on a recent CPU. Lots are failing on the same Ryzen 7 model I have. What I have noticed is that every single computer I looked at that was failing these was failing all tasks from this batch, even if it had a good record with previous batches. Also, Windows version didn't seem to make any difference.

My seven have between them got 20 zips waiting to go so I hope they get the server sorted soon.

On a bit of a tangent, does anyone know if there is anything in the Windows version number posted on computer details pages to identify machines like mine running tasks under WINE using Linux?
ID: 68966
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68967 - Posted: 25 Jun 2023, 7:06:47 UTC - in response to Message 68966.  

So there's something specific about certain machines that makes them good or bad? I thought there must be, since I have a perfect one and 6 dud ones.
ID: 68967
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68968 - Posted: 25 Jun 2023, 7:44:38 UTC - in response to Message 68967.  

Starting to look that way but so far, I am not seeing any pattern.
ID: 68968
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68969 - Posted: 25 Jun 2023, 8:35:59 UTC - in response to Message 68968.  
Last modified: 25 Jun 2023, 8:36:43 UTC

Starting to look that way but so far, I am not seeing any pattern.
Maybe it's not worth spending time fixing, as they seem to fail very quickly and can then be resent, and we have way more computers than tasks. But you should increase the max number of failures allowed per work unit.
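A rough Python sketch of what raising that limit would buy. It assumes every attempt fails independently with the three-quarters rate seen on my failing machines; the real project-wide rate is not known here, and if failures cluster on particular machines the independence assumption will not hold. `prob_hard_fail` is a made-up helper name.

```python
def prob_hard_fail(p_task_fail: float, max_attempts: int) -> float:
    """Chance that a work unit uses up all its attempts, each failing independently."""
    return p_task_fail ** max_attempts


# Assumed per-attempt failure rate of 0.75; 3 is the current limit mentioned in this thread.
for attempts in (3, 5, 10):
    print(attempts, round(prob_hard_fail(0.75, attempts), 3))
```

With these assumed numbers, a limit of three attempts leaves a large share of work units as hard fails, and the share drops quickly as the limit goes up.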
ID: 68969
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68970 - Posted: 25 Jun 2023, 9:06:31 UTC - in response to Message 68969.  

Maybe it's not worth spending time fixing,
I am going to stick my neck out and predict that it is something to do with the ancil file that Sarah had trouble with and that fixing that would be a better solution than increasing the number of resends. I will also suggest again that the deadlines should be shortened.
ID: 68970

©2024 cpdn.org