Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Looks like intel(r) xeon(r) cpu x5650 doesn't have AVX support. So why was I sent the tasks? Take Asteroids for example, they actually send out different programs depending on whether you have AVX, SSE3, etc. BOINC has a system to note the capabilities of the CPU, it's in the messages tab on startup. Big long list like fma,avx,sse2,ssse3,etc,etc,etc. The CPDN server should look at that. Maybe it's too much hassle and they just send them out to everyone, then keep resending until they happen to go to an AVX machine. But since they're set to only be tried three times, a lot are going to go to a non-AVX machine three times in a row. I guess someone will tweak something when they get a pile of them back unprocessed. And why do I have one task working ok on the Xeon? Also previous batches have run on older computers ok here. Also, my other fast machine (Ryzen 9 3900X) crashed them all: https://www.cpdn.org/results.php?hostid=1535126 But this machine (Ryzen 9 3900XT - pretty much the same processor) is fine. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Big long list like fma,avx,sse2,ssse3,etc,etc,etc. The CPDN server should look at that. Maybe it's too much hassle Do you mean like this: Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512_vnni md_clear flush_l1d arch_capabilities That is the processor on my main (Linux) machine. Scanning that for things (who knows what matters and what does not? And are they always in the same order?). |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Do you mean like this: Yes, like that. It should be easy to search the string for "avx". The Asteroids server must be doing it: it tries the fma, avx, sse2, and sse3 versions of the program on my computers to see which is fastest. It never tried the avx version on the older ones, so it must have checked the string first. They either need to check, or allow the task to crash more than twice. At the moment, if a task is sent to three non-AVX machines in a row (very likely), it will fail and not be resent, and require human intervention. |
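A minimal sketch of the check being discussed, assuming the feature list arrives as one string of space- or comma-separated flags like the one posted above. The function names and the shortened sample strings are hypothetical; this is not how the CPDN or Asteroids servers are known to do it. Splitting the string into a set of flags means the order of the flags does not matter, and an exact match avoids confusing "avx" with "avx2" or "avx512f".

```python
def parse_features(feature_line: str) -> set[str]:
    """Split a 'Processor features' string into a set of lowercase flag names."""
    # Accept both space-separated and comma-separated lists.
    return set(feature_line.lower().replace(",", " ").split())


def supports(feature_line: str, flag: str) -> bool:
    """True if the exact flag is present, wherever it sits in the list."""
    return flag.lower() in parse_features(feature_line)


# Hypothetical, shortened feature strings for illustration only.
old_host = "fpu vme de pse tsc msr sse sse2 ssse3 sse4_1 sse4_2 popcnt aes"
new_host = "fpu sse sse2 ssse3 fma sse4_1 sse4_2 aes xsave avx f16c avx2 avx512f"

print(supports(old_host, "avx"))
print(supports(new_host, "avx"))

# A plain substring search also matches longer flag names such as "avx2",
# which is why the exact-flag test is the safer of the two.
print("avx" in "sse2 avx2", supports("sse2 avx2", "avx"))
```

Whether a server-side check like this would help here is a separate question, since later posts suggest the application may not use AVX at all.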
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
And why do I have one task working ok on the Xeon? I think the AVX thing might be a red herring. Sarah thinks one particular ancil file may be the problem so just those two older machines happened to get a run of tasks with the dodgy file. Looking at some of the hard fails of which there are a fair few now, some have crashed on Ryzen9 machines so I think it is more likely an issue with part of the batch. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
To my knowledge, the wah2 application was not compiled with an avx optimization switch. The wah2 executables were last compiled in November 2016. The last I knew, SSE2 is the highest level optimization used for compiling these models. If it was the AVX thing, then every Windows batch since late 2016 would be displaying similar behavior with older PCs throwing errors. However, that has not occurred. In this batch, on the 3 I have been running on my Ryzen 5600 for 17 hours, 2 of them had 2 previous errors in their work units with SEGV signal 11 errors and 1 had 1 error of that type. All of the PCs in those work units with the failed tasks have AVX capability as does my Ryzen which hasn't failed any of those 3 tasks so far. If it was an ancillary file error, I would think all tasks in the work unit should have failed. This is very frustrating and it does not appear at all obvious what the problem might be. Hopefully Sarah analyzing the errors can find some correlation as to why this is happening as it is. Also frustrating, while my trickles are going up fine, the first zip file is failing to upload to upload7 saying it can't connect. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I think the AVX thing might be a red herring. Sarah thinks one particular ancil file may be the problem so just those two older machines happened to get a run of tasks with the dodgy file. Well, I've had 24 out of 24 work perfectly on this Ryzen 9 3900XT, and all my other computers screwed up several tasks before the server banned them for a day. My other Ryzen 9 (3900X) also screwed up, but don't count that one, as it messes up other programs at random; it's a dodgy CPU/MB. Seems a bit of a coincidence. If it's not instruction sets, could it be something to do with timings? Is there a tendency for slower machines to be more likely to crash them? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Is there a tendency for slower machines to be more likely to crash them? Not from the hard fails I have looked at. Out of the ten I examined, every one failed on at least one more recent and faster machine. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
My fails were all a memory fault to do with accessing memory it shouldn't. Not sure why tasks would do that, then not do that on the next machine. Also my machines failing have 64GB RAM, so it's not a shortage. The good one does happen to have the most at 96GB, but it's nowhere near using it all. This randomness doesn't make any sense. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,716,561 RAC: 8,355 |
... the first zip file is failing to upload to upload7 saying it can't connect. That one, at least, could be diagnosed by enabling the <http_debug> event log flag and analysing the log output from a retry. |
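For anyone wanting to try that, here is a small sketch that prints a configuration fragment with the `http_debug` log flag switched on. It assumes the usual BOINC client layout, where log flags live under `<log_flags>` in a `cc_config.xml` file in the BOINC data directory; the function name is made up for illustration, and the output should be merged by hand into any existing file rather than written over it.

```python
import xml.etree.ElementTree as ET


def cc_config_with_http_debug() -> str:
    """Build a minimal cc_config.xml document with http_debug enabled."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    ET.SubElement(log_flags, "http_debug").text = "1"
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


print(cc_config_with_http_debug())
```

The client has to re-read its config files (or be restarted) before the flag takes effect, after which retrying the upload should put the HTTP exchange into the event log.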
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
... the first zip file is failing to upload to upload7 saying it can't connect. Sarah is going to ask Andy to check out the Korean server. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,716,561 RAC: 8,355 |
Andy said some time ago, and repeated late yesterday evening (in response to my request for more RAC data), that he would be unavailable all day Monday. Do they have anyone else who could check? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Sarah has also asked the researcher in Korea to check. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I'm convinced there's something different about one of my machines. It's working on every single one of 24 tasks. All the other machines are failing three quarters of them within 10 minutes. It's a modern Ryzen with memory protection turned off in the Windows settings. Since the tasks are producing a memory fault, could this be anything to do with it? I have memory protection off because otherwise VirtualBox doesn't work for other projects. The failing machines also have it off, but they're older machines. Maybe these tasks only work on a modern CPU with no memory protection? Ryzens also have a bigger cache, which might reduce these problems? Is there a tendency for slower machines to be more likely to crash them? Not from the hard fails I have looked at. Out of ten I examined every one failed on at least one more recent and faster machine. The good machine: https://www.cpdn.org/show_host_detail.php?hostid=1509739 The other machines: https://www.cpdn.org/hosts_user.php?userid=2002390 |
Send message Joined: 17 Jan 09 Posts: 124 Credit: 2,037,778 RAC: 2,752 |
I have three Windows 10 systems: two are failing every task and one is chugging along, working happily on a task. Age-wise it is the girl in the middle. The other two systems are either slower or faster, and all are similar 4-processor systems with the same version of BOINC and Win 10. When the answer is finally figured out we will probably all shake our heads. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
All mine can do a task properly now and again. But my Ryzen is the only one to have done 24 of 24 correctly. The others are 1 in 4. Your one probably got lucky. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Hard fails now up to over 1,400. I have now looked at over 20. Every single one has at least one failure on a recent CPU. Lots failing with the same Ryzen7 model I have. What I have noticed is that every single computer I looked at that was failing these was failing all tasks from this batch even if they had a good record with previous batches. Also Windows version didn't seem to make any difference. My seven have between them got 20 zips waiting to go so I hope they get the server sorted soon. On a bit of a tangent, does anyone know if there is anything in the Windows version number posted on computer details pages to identify machines like mine running tasks under WINE using Linux? |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
So there's something specific about certain machines that makes them good or bad? I thought there must be, since I have a perfect one and 6 dud ones. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Starting to look that way but so far, I am not seeing any pattern. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Starting to look that way but so far, I am not seeing any pattern. Maybe it's not worth spending time fixing, as they seem to fail very quickly and can then be resent, and we have way more computers than tasks. But you should increase the max number of failures allowed per work unit. |
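A rough back-of-the-envelope sketch of why the failure limit matters. It assumes each attempt fails independently with the same probability, which is a strong assumption given the earlier observation that a failing machine tends to fail every task from this batch. The 0.75 figure is taken from the "three quarters" failure rate reported above for the bad machines, the limits other than 3 are hypothetical, and the function name is made up.

```python
def p_hard_fail(p_task_fail: float, max_errors: int) -> float:
    """Probability that every one of max_errors attempts fails,
    assuming the attempts are independent of each other."""
    return p_task_fail ** max_errors


# Compare the current limit of three tries with two hypothetical larger limits.
for max_errors in (3, 5, 8):
    print(max_errors, round(p_hard_fail(0.75, max_errors), 3))
```

Under those assumptions a limit of three leaves roughly 42% of work units (0.75 cubed) as hard fails, and raising the limit shrinks that share quickly. If the real cause is a bad input file, though, extra resends would not help the affected work units at all.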
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Maybe it's not worth spending time fixing, I am going to stick my neck out and predict that it is something to do with the ancil file that Sarah had trouble with, and that fixing that would be a better solution than increasing the number of resends. I will also suggest again that the deadlines should be shortened. |
©2024 cpdn.org