Message boards : Number crunching : New Work Announcements 2024
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
The NZ batch had a missing file so the submission failed. It should be resubmitted soon. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"The server side rules for this need to be modified. Other projects don't use these same impossible rules." I disagree - the rule protects the server from wasting time sending out tasks to a machine likely to break the next one. It doesn't matter if it's the task's fault, the point is it's better off sending the next task to another machine. The server is doing what we want it to do. There are only 3 retries allowed and as these batches have high failure rates, it makes sense to target machines returning completed tasks. The aim is to get the tasks to complete successfully so the server should not continually push tasks to machines that have a high failure rate (for whatever reason). So, yes, there is 'harm' in sending tasks that are known to likely fail on machines. We end up with more hard fails. CPDN had really hoped to get the code working before sending out these new batches. There was discussion about using the Linux version since that works but the feeling was it was better to keep the Windows version for the time being. Unfortunately I wasn't able to fix all the bugs before the batches had to go out. --- CPDN Visiting Scientist |
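The argument in the post above can be put into rough numbers. This is a minimal sketch, not the actual BOINC or CPDN scheduler logic: it assumes each attempt at a workunit fails independently with a fixed per-host probability, and that a workunit becomes a hard fail once every allowed attempt has failed. The function name, the count of three attempts, and the failure rates are all illustrative assumptions.

```python
def hard_fail_probability(per_attempt_failure_rates):
    """Probability that every attempt fails, assuming the attempts are independent."""
    prob = 1.0
    for rate in per_attempt_failure_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("failure rates must be between 0 and 1")
        prob *= rate
    return prob


# Hypothetical failure rates, for illustration only.
unreliable_host = 0.9  # a host that fails most of these tasks
reliable_host = 0.2    # a host that usually returns completed tasks

# All three attempts land on unreliable hosts.
all_unreliable = hard_fail_probability([unreliable_host] * 3)

# The first attempt fails on an unreliable host, later ones go to reliable hosts.
steered = hard_fail_probability([unreliable_host, reliable_host, reliable_host])

print(f"all attempts on unreliable hosts: {all_unreliable:.3f}")
print(f"later attempts steered to reliable hosts: {steered:.3f}")
```

Under these assumed numbers, steering the remaining attempts towards hosts that return completed tasks cuts the chance of a hard fail by a large factor, which is the effect the server rule is aiming for.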
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Oops, sorry Dave, just seen this! "I am going to open a new thread for the East Asia batches 1001-4, to free this thread for new work announcements rather than discussion. It would be good if anyone starting discussions for subsequent batches such as the NZ ones that should appear tomorrow could do the same." |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,704,964 RAC: 9,670 |
"So, yes, there is 'harm' in sending tasks that are known to likely fail on machines. We end up with more hard fails." Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms." The incremental bandwidth for me took a big step up recently when Verizon replaced my FiOS hardware. My old hardware was installed in about 2004 and they did not want to support it any more. The new hardware is about 10x faster than the old. Speed test results, listed as timestamp, download, upload, latency, jitter, quality score, test server: (1) 1/18/2024 8:54:7, 840.78 Mbps, 906.51 Mbps, 7 ms, 1 ms, Excellent, newyork02.speedtest.windstream.net; (2) 12/1/2023 10:26:27, 750.33 Mbps, 926.59 Mbps, 5 ms, 1 ms, Excellent, speedtest1.nyc1.nitelusa.net.prod.hosts.ooklaserver.net; (3) 11/30/2023 21:38:48, 836.55 Mbps, 846.46 Mbps, 5 ms, 4 ms, Excellent, newyork02.speedtest.windstream.net |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms." "The incremental bandwidth for me took a big step-up recently when Verizon replaced my FiOS hardware. My old hardware was installed in about 2004 and they did not want to support it any more. The new is about 10x faster than the old." I assume he meant for the university sending them out. For most of us, it costs no more to use more bandwidth, as it's flat rate. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"Don't waste your time looking into these segmentation failures. I know exactly where the problem is in the code, I've been working on this for weeks. The same code works fine under Linux but fails on Windows (same compiler too). Am trying to find a workaround that doesn't involve rewriting the code too much." I don't know if it helps, but in years past, I found Windows was far less tolerant of "out of bounds array accesses" than Linux. In general, if you read an entry or two beyond the end of an array, Linux is unlikely to segfault. Windows lays things out differently, and I have absolutely seen "things that work fine under Linux and segfault under Windows" turn out to be off-by-one errors in end-of-array access. Are you familiar with Valgrind? It's a memory correctness testing tool, and will flag stuff like this. If you can build a small reproduction case, point Valgrind at it, and it'll pop out exactly what and where you're doing something wrong with your memory accesses. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Are you familiar with Valgrind? It's a memory correctness testing tool, and will flag stuff like this. If you can build a small reproduction case, point Valgrind at it, and it'll pop out exactly what and where you're doing something wrong with your memory accesses." Yes, I've used valgrind. It wouldn't help for this case though, as the code fails accessing the first array element. It also fails to create a pointer to the same element. I'm not sure what's causing it; a stack problem is my current theory. --- CPDN Visiting Scientist |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Yes, I've used valgrind. It wouldn't help for this case though as the code fails accessing the first array element. Also fails to create a pointer to same element. I'm not sure what's causing it, stack problem is my current theory." I can't find it now, but I think someone (you?) said somewhere that they're more likely to fail on newer machines. I'm seeing something different and inexplicable. I have two newer machines, Ryzen 9 3900X and Ryzen 9 3900XT, and they're fine. I have two older machines, both dual Xeon X5650, and one of them fails every task. But the other is fine! The only difference is the motherboard: the one with the older R410 board crashes, while the one with the newer R510 board is OK. There may also be a minor difference in RAM: one of them has better matched RAM sticks and is running triple channel and the other isn't. I have various other old machines and they're also fine. Bad Xeon: https://www.cpdn.org/show_host_detail.php?hostid=1509742 (Ignore the GPU, that was added a couple of days ago and didn't change the crashability of CPDN). Good Xeon: https://www.cpdn.org/show_host_detail.php?hostid=1544690 Good Ryzens: https://www.cpdn.org/show_host_detail.php?hostid=1509739 and https://www.cpdn.org/show_host_detail.php?hostid=1535126 I hope something in there helps. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
Batch 1005: 4650 WAH2 tasks for the NZ region have joined the East Asia ones still waiting to be snapped up. If the testing site ones are anything to go by, then on my box they will take two or three days less to complete. |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
Just landed (well about 4 hours ago) a wah2_nz25 task. Let's see how this one does..... |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Just landed (well about 4 hours ago) a wah2_nz25 task. Let's see how this one does....." The NZ batch uses a smaller domain than the EAS ones and is much less likely to fail. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
OpenIFS Linux batch: An OpenIFS Linux batch will be released in the next 7 days. It's about to go into testing. This batch is based on the earlier batch 993 but with reduced model output (and hence smaller upload files). All of the forecasts in this batch will be exactly the same. The aim is to see how much variation we get when running multiple identical forecasts across all the Linux machines attached to CPDN, and whether we get the same result from the exact same forecast on each host (which is not a given). The objective is to compare the perturbations from running across different hosts to the perturbations previously applied to batch 993's initial conditions. --- CPDN Visiting Scientist |
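Why tiny host-to-host differences could matter at all can be illustrated with a toy chaotic system. The sketch below uses the logistic map, which is an assumption chosen purely for illustration and has nothing to do with the OpenIFS code itself; the starting value, perturbation size, and step count are likewise arbitrary. It shows a difference at the level of floating-point rounding growing into a clearly visible divergence.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), returning every state."""
    states = [x0]
    for _ in range(steps):
        states.append(r * states[-1] * (1.0 - states[-1]))
    return states


steps = 120
base_run = logistic_trajectory(0.3, steps)
# Perturb the initial condition by roughly the size of a rounding error.
perturbed_run = logistic_trajectory(0.3 + 1e-15, steps)

differences = [abs(a - b) for a, b in zip(base_run, perturbed_run)]

print(f"initial difference: {differences[0]:.3e}")
print(f"largest difference over {steps} steps: {max(differences):.3e}")
```

In a chaotic system of this kind, a rounding-sized difference between two hosts behaves much like a deliberately applied initial-condition perturbation, which is why comparing the two kinds of spread is a meaningful experiment.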
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"An OpenIFS linux batch will be released in the next 7 days. It's about to go into testing. This batch is based on the earlier batch 993 but with reduced model output (and hence smaller upload files)." My most recent one of those was this one. It worked, so perhaps this new batch should work too. Right? I must have those compatibility libraries in there although, IIRC, these OIFS programs do not need them. Task: 22318024; Name: oifs_43r3_0187_2019110100_123_993_12215029_2; Workunit: 12215029; Created: 25 Apr 2023, 18:24:32 UTC; Sent: 25 Apr 2023, 18:24:40 UTC; Report deadline: 24 Jun 2023, 18:24:40 UTC; Received: 26 Apr 2023, 10:24:47 UTC; Server state: Over; Outcome: Success; Client state: Done; Exit status: 0 (0x00000000); Computer ID: 1511241; Run time: 15 hours 25 min 7 sec; CPU time: 15 hours 14 min 11 sec; Validate state: Valid; Credit: 14,873.04; Device peak FLOPS: 6.06 GFLOPS; Application version: OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu; Peak working set size: 4,780.11 MB; Peak swap size: 4,974.23 MB; Peak disk usage: 1,267.49 MB |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
"although, IIRC, these OIFS programs do not need them." Correct. OIFS is 64 bit. Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit?" 32 bit. Going to 64 bit is on the todo list, not least because BOINC stopped supporting 32 bit libs a year ago, but it's not trivial. Let's get bugs out of the Hadley models first. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
"Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit?" "32 bit. Going to 64bit is on the todo list, not least because boinc stopped supporting 32bit libs a year ago, but it's not trivial. Let's get bugs out of the Hadley models first." I suspected as much but thought I would check. Thanks Glenn. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"The aim is to see how much variation we get in running multiple identical forecasts across all the linux machines attached to CPDN, and, if we get the same result from exact same forecasts from each host (which is not a given)." Interesting. There... shouldn't be any variation in results for the same code on the same host with the same initial conditions. If so, look for uninitialized memory reads somewhere, I guess? I know floating point is messy, but it should at least be consistently messy. I've no shortage of starved machines I can point at stuff when it shows up! |
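The expectation stated in this post, that identical code and inputs on one host give bit-identical output, can be checked by hashing the raw result bytes rather than comparing printed values. A minimal sketch follows, assuming a stand-in computation: `toy_forecast` and `result_fingerprint` are hypothetical names and the arithmetic is unrelated to the real models.

```python
import hashlib
import math
import struct


def toy_forecast(initial_value, steps):
    """Hypothetical stand-in for a model run: a fixed sequence of float operations."""
    state = initial_value
    history = []
    for step in range(1, steps + 1):
        state = math.sin(state) + 1.0 / step
        history.append(state)
    return history


def result_fingerprint(values):
    """Hash the exact IEEE 754 bit patterns, so a one-bit change alters the digest."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


first_run = result_fingerprint(toy_forecast(0.5, 1000))
second_run = result_fingerprint(toy_forecast(0.5, 1000))
print("bit-identical on this host:", first_run == second_run)
```

The same fingerprint could be collected from several hosts and compared. A mismatch on a single host would suggest something like an uninitialized read, whereas a mismatch between hosts would more likely point to the compiler, maths library, or hardware.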
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
"Interesting. There... shouldn't be any variation in results for the same code on the same host with the same initial conditions. If so, look for uninitialized memory reads somewhere, I guess? I know floating point is messy, but it should at least be consistently messy." My understanding from work on the Hadley models a long time ago is that, with that model, some variation between hosts is possible due to FP rounding being different between operating systems and CPU manufacturers. In those days all model types went out on all platforms. To me, it makes sense to actually check this. My coding experience is with different languages and is also very, very rusty, but it may be that there are things that could be done in the code to mitigate this if there is significant variance with the OIFS models. |
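One well-known mechanism behind this kind of host-to-host variation is that floating-point addition is not associative, so a compiler or library that reorders a sum can change the result. The sketch below only illustrates that general fact; it makes no claim about where, or whether, this happens in the Hadley or OIFS code. The helper `add_in_order` is a hypothetical name, written as an explicit loop so the order of additions is exactly what the code says.

```python
import math


def add_in_order(values):
    """Plain left-to-right summation with no compensation."""
    total = 0.0
    for value in values:
        total += value
    return total


# Same three numbers, different grouping, different last bits.
left_grouping = (0.1 + 0.2) + 0.3
right_grouping = 0.1 + (0.2 + 0.3)
print("groupings agree:", left_grouping == right_grouping)

# Same four numbers, different order, different answers.
values = [1e16, 1.0, -1e16, 1.0]
as_given = add_in_order(values)            # 1.0: the first 1.0 is lost to rounding
reordered = add_in_order(sorted(values))   # 0.0: both 1.0 terms are lost
exact = math.fsum(values)                  # 2.0: correctly rounded sum

print(as_given, reordered, exact)
```

Each individual run is perfectly repeatable, which matches the "consistently messy" expectation; the variation only appears when the order of operations differs, as it may between builds or platforms.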
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"I've no shortage of starved machines I can point at stuff when it shows up!" I have a Ryzen 9 3900X and a Ryzen 9 3900XT running Linux in an Oracle VirtualBox. Will these be useful? I'm guessing you want to check they're ok on virtual machines too, although I don't know if you can tell they're virtual machines from your end: https://www.cpdn.org/show_host_detail.php?hostid=1542648 https://www.cpdn.org/show_host_detail.php?hostid=1539015 |
©2024 cpdn.org