New Work Announcements 2024

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 70146 - Posted: 18 Jan 2024, 11:29:55 UTC

The NZ batch had a missing file so the submission failed. It should be resubmitted soon.
ID: 70146
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,399,360
RAC: 15,979
Message 70147 - Posted: 18 Jan 2024, 11:35:01 UTC - in response to Message 70136.  
Last modified: 18 Jan 2024, 11:37:18 UTC

The server side rules for this need to be modified. Other projects don't use these same impossible rules.
I disagree - the rule protects the server from wasting time sending out tasks to a machine likely to break the next one. It doesn't matter if it's the task's fault, the point is it's better off sending the next task to another machine.

I disagree with your disagreement. There are still 16k tasks just waiting to be sent. There are no "other machines" at this point.
Edit: And there is no harm sending a task to a bad machine. It just gets resent to the next. This is a feature, not a fault.

The server is doing what we want it to do. Only 3 retries are allowed and, as these batches have high failure rates, it makes sense to target machines that are returning completed tasks. The aim is to get the tasks to complete successfully, so the server should not continually push tasks to machines that have a high failure rate (for whatever reason). So, yes, there is 'harm' in sending tasks that are known to likely fail on machines. We end up with more hard fails.
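
To make that rule concrete, here is a minimal sketch of this kind of host selection. It is not the BOINC or CPDN server code: the `Host` class, the `MAX_FAILURE_RATE` threshold and the `pick_host` function are made up for illustration, and "3 retries" is read here as one first attempt plus three resends, which is an assumption.

```python
from dataclasses import dataclass

MAX_RETRIES = 3          # from the post: only 3 retries are allowed per task
MAX_FAILURE_RATE = 0.5   # hypothetical cut-off, not a real server setting


@dataclass
class Host:
    name: str
    completed: int = 0
    failed: int = 0

    @property
    def failure_rate(self) -> float:
        total = self.completed + self.failed
        # A host with no history is treated as reliable until proven otherwise.
        return self.failed / total if total else 0.0


def pick_host(hosts, already_tried):
    """Return the most reliable eligible host, or None if nobody qualifies."""
    if len(already_tried) >= 1 + MAX_RETRIES:
        return None  # first attempt plus all retries used up: a hard fail
    eligible = [
        h for h in hosts
        if h.name not in already_tried and h.failure_rate <= MAX_FAILURE_RATE
    ]
    # Prefer the host with the lowest observed failure rate.
    return min(eligible, key=lambda h: h.failure_rate, default=None)
```

With only a handful of attempts per task, every send to a host that usually fails uses up one of them, which is the 'harm' referred to above.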

CPDN had really hoped to get the code working before sending out these new batches. There was discussion about using the Linux version since that works, but the feeling was it was better to keep the Windows version for the time being. Unfortunately I wasn't able to fix all the bugs before the batches had to go out.
---
CPDN Visiting Scientist
ID: 70147
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,399,360
RAC: 15,979
Message 70148 - Posted: 18 Jan 2024, 11:35:39 UTC - in response to Message 70143.  

Oops, sorry Dave, just seen this!

I am going to open a new thread for the East Asia batches 1001-4, to free this thread for new work announcements rather than discussion. It would be good if anyone starting discussions for subsequent batches, such as the NZ ones that should appear tomorrow, could do the same.

Thank you.
ID: 70148
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,690,033
RAC: 10,812
Message 70149 - Posted: 18 Jan 2024, 12:30:32 UTC - in response to Message 70147.  

So, yes, there is 'harm' in sending tasks that are known to likely fail on machines. We end up with more hard fails.
Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms.
ID: 70149
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70150 - Posted: 18 Jan 2024, 14:17:58 UTC - in response to Message 70149.  

Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms.


The incremental bandwidth for me took a big step-up recently when Verizon replaced my FiOS hardware. My old hardware was installed in about 2004 and they did not want to support it any more. The new is about 10x faster than the old.
Timestamp            Download     Upload       Latency  Jitter  Quality Score  Test Server
1/18/2024 8:54:7     840.78 Mbps  906.51 Mbps  7 ms     1 ms    Excellent      newyork02.speedtest.windstream.net
12/1/2023 10:26:27   750.33 Mbps  926.59 Mbps  5 ms     1 ms    Excellent      speedtest1.nyc1.nitelusa.net.prod.hosts.ooklaserver.net
11/30/2023 21:38:48  836.55 Mbps  846.46 Mbps  5 ms     4 ms    Excellent      newyork02.speedtest.windstream.net

ID: 70150
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70151 - Posted: 18 Jan 2024, 14:51:29 UTC - in response to Message 70150.  

Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms.
The incremental bandwidth for me took a big step-up recently when Verizon replaced my FiOS hardware. My old hardware was installed in about 2004 and they did not want to support it any more. The new is about 10x faster than the old.
I assume he meant for the university sending them out. For most of us, it costs no more to use more bandwidth, as it's flat rate.
ID: 70151
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70152 - Posted: 18 Jan 2024, 18:33:38 UTC - in response to Message 70145.  

Don't waste your time looking into these segmentation failures. I know exactly where the problem is in the code; I've been working on this for weeks. The same code works fine under Linux but fails on Windows (same compiler too). I'm trying to find a workaround that doesn't involve rewriting the code too much.

As I know you're technically minded: it relates to the old way in which Fortran was coded for low-memory machines years ago, where arrays were "misused" and shared between data of different types. A very large REAL array is being equivalenced to both an integer and a logical array. It should work (and does on Linux) but we get a bad memory address under Windows (which only serves to reinforce my dislike of Windows :P).


I don't know if it helps, but in years past, I found Windows was considerably less tolerant of "out of bounds array accesses" than Linux. In general, if you read an entry or two beyond the end of an array, Linux is unlikely to segfault. Windows lays things out differently, and I have absolutely seen "things that work fine under Linux and segfault under Windows" turn out to be off-by-one errors in end-of-array access.

Are you familiar with Valgrind? It's a memory correctness testing tool, and will flag stuff like this. If you can build a small reproduction case, point Valgrind at it, and it'll pop out exactly what and where you're doing something wrong with your memory accesses.
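
For anyone following along who hasn't met Fortran's EQUIVALENCE: the array sharing described in the quote means one block of storage is read under more than one type. The snippet below is only a rough Python analogue using the standard `array` and `memoryview` types, not the model's code; the names `real_store` and `int_view` are invented for the illustration, and it assumes 4-byte floats and ints, as on typical platforms.

```python
from array import array

# One block of storage, declared as 4-byte REALs (like the large Fortran array).
real_store = array("f", [0.0] * 4)

# View the very same bytes as 4-byte integers, roughly what EQUIVALENCE does.
int_view = memoryview(real_store).cast("B").cast("i")
assert real_store.itemsize == 4 and int_view.itemsize == 4

real_store[0] = 1.0
print(hex(int_view[0]))   # 0x3f800000, the bit pattern of 1.0 as a 32-bit float

int_view[1] = 42          # write through the integer view ...
print(real_store[1])      # ... and the REAL view shows a tiny denormal, not 42.0
```

Both names refer to the same memory, so nothing is copied. That is why the trick saved memory on small machines, and also why it depends on how the compiler and operating system lay that memory out.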
ID: 70152
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,399,360
RAC: 15,979
Message 70153 - Posted: 18 Jan 2024, 21:15:07 UTC - in response to Message 70152.  

Are you familiar with Valgrind? It's a memory correctness testing tool, and will flag stuff like this. If you can build a small reproduction case, point Valgrind at it, and it'll pop out exactly what and where you're doing something wrong with your memory accesses.
Yes, I've used valgrind. It wouldn't help for this case though, as the code fails accessing the first array element. It also fails to create a pointer to the same element. I'm not sure what's causing it; a stack problem is my current theory.
---
CPDN Visiting Scientist
ID: 70153
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70172 - Posted: 21 Jan 2024, 7:17:19 UTC - in response to Message 70153.  

Yes, I've used valgrind. It wouldn't help for this case though, as the code fails accessing the first array element. It also fails to create a pointer to the same element. I'm not sure what's causing it; a stack problem is my current theory.
I can't find it now, but I think someone (you?) said somewhere that they're more likely to fail on newer machines. I'm seeing something different and inexplicable. I have two newer machines, a Ryzen 9 3900X and a Ryzen 9 3900XT, and they're fine. I have two older machines, both dual Xeon X5650, and one of them fails every task. But the other is fine! The only difference is the motherboard: the one with the older R410 board crashes, the one with the newer R510 board is OK. There may also be a minor difference in RAM: one of them has better matched RAM sticks and is running triple channel and the other isn't. I have various other old machines and they're also fine.

Bad Xeon: https://www.cpdn.org/show_host_detail.php?hostid=1509742 (Ignore the GPU, that was added a couple of days ago and didn't change the crashability of CPDN).

Good Xeon: https://www.cpdn.org/show_host_detail.php?hostid=1544690

Good Ryzens: https://www.cpdn.org/show_host_detail.php?hostid=1509739 and https://www.cpdn.org/show_host_detail.php?hostid=1535126

I hope something in there helps.
ID: 70172
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 70188 - Posted: 23 Jan 2024, 12:31:10 UTC

Batch 1005: 4650 WAH2 tasks for the NZ region have joined the East Asia ones still waiting to be snapped up. If the testing site ones are anything to go by, then on my box they will take two or three days less to complete.
ID: 70188
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,732,713
RAC: 4,609
Message 70191 - Posted: 24 Jan 2024, 11:56:43 UTC

Just landed (well about 4 hours ago) a wah2_nz25 task. Let's see how this one does.....
ID: 70191
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,399,360
RAC: 15,979
Message 70193 - Posted: 24 Jan 2024, 12:34:22 UTC - in response to Message 70191.  

The NZ batch uses a smaller domain than the EAS ones and is much less likely to fail.

Just landed (well about 4 hours ago) a wah2_nz25 task. Let's see how this one does.....

---
CPDN Visiting Scientist
ID: 70193
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,399,360
RAC: 15,979
Message 70241 - Posted: 30 Jan 2024, 20:40:14 UTC
Last modified: 30 Jan 2024, 20:40:32 UTC

OpenIFS linux batch

An OpenIFS linux batch will be released in the next 7 days. It's about to go into testing. This batch is based on the earlier batch 993 but with reduced model output (and hence smaller upload files). All of the forecasts in this batch will be exactly the same. The aim is to see how much variation we get in running multiple identical forecasts across all the linux machines attached to CPDN, and whether we get the same result from the exact same forecast on each host (which is not a given).

The objective is to compare the perturbations from running across different hosts to the perturbations previously applied to batch 993's initial conditions.
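
As a rough idea of what such a comparison could look like, here is a minimal sketch. It is not the project's analysis code: the `spread_ratio` function and every number in it are made up purely to show the calculation, and using a simple standard deviation as the measure of spread is an assumption.

```python
from statistics import pstdev


def spread_ratio(host_results, perturbed_results):
    """Spread across hosts divided by spread across perturbed initial conditions."""
    return pstdev(host_results) / pstdev(perturbed_results)


# Made-up values for one forecast quantity, purely illustrative.
identical_runs = [1012.40, 1012.41, 1012.40, 1012.39]  # same forecast, different hosts
perturbed_runs = [1010.2, 1013.5, 1011.8, 1014.1]      # perturbed initial conditions

ratio = spread_ratio(identical_runs, perturbed_runs)
print(f"host spread / perturbation spread = {ratio:.4f}")
```

A ratio close to zero would suggest host-to-host differences are small next to the deliberate perturbations; a ratio near one would suggest they are comparable.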
---
CPDN Visiting Scientist
ID: 70241
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70244 - Posted: 31 Jan 2024, 3:03:16 UTC - in response to Message 70241.  

An OpenIFS linux batch will be released in the next 7 days. It's about to go into testing. This batch is based on the earlier batch 993 but with reduced model output (and hence smaller upload files).


My most recent one of those was this one. It worked, so perhaps this new batch should work too. Right? I must have those compatibility libraries in there although, IIRC, these OIFS programs do not need them.

Task                   22318024
Name                   oifs_43r3_0187_2019110100_123_993_12215029_2
Workunit               12215029
Created                25 Apr 2023, 18:24:32 UTC
Sent                   25 Apr 2023, 18:24:40 UTC
Report deadline        24 Jun 2023, 18:24:40 UTC
Received               26 Apr 2023, 10:24:47 UTC
Server state           Over
Outcome                Success
Client state           Done
Exit status            0 (0x00000000)
Computer ID            1511241
Run time               15 hours 25 min 7 sec
CPU time               15 hours 14 min 11 sec
Validate state         Valid
Credit                 14,873.04
Device peak FLOPS      6.06 GFLOPS
Application version    OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
Peak working set size  4,780.11 MB
Peak swap size         4,974.23 MB
Peak disk usage        1,267.49 MB

ID: 70244
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 70245 - Posted: 31 Jan 2024, 8:14:34 UTC

although, IIRC, these OIFS programs do not need them.
Correct. OIFS is 64 bit.

Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit?
ID: 70245
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,399,360
RAC: 15,979
Message 70248 - Posted: 31 Jan 2024, 10:20:38 UTC - in response to Message 70245.  

Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit?
32 bit. Going to 64 bit is on the todo list, not least because BOINC stopped supporting 32 bit libs a year ago, but it's not trivial. Let's get the bugs out of the Hadley models first.
---
CPDN Visiting Scientist
ID: 70248
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 70252 - Posted: 31 Jan 2024, 16:10:40 UTC - in response to Message 70248.  

Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit?
32 bit. Going to 64 bit is on the todo list, not least because BOINC stopped supporting 32 bit libs a year ago, but it's not trivial. Let's get the bugs out of the Hadley models first.
I suspected as much but thought I would check.

Thanks Glenn.
ID: 70252
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70253 - Posted: 31 Jan 2024, 20:00:01 UTC - in response to Message 70241.  
Last modified: 31 Jan 2024, 20:00:24 UTC

The aim is to see how much variation we get in running multiple identical forecasts across all the linux machines attached to CPDN, and whether we get the same result from the exact same forecast on each host (which is not a given).


Interesting. There... shouldn't be any variation in results for the same code on the same host with the same initial conditions. If so, look for uninitialized memory reads somewhere, I guess? I know floating point is messy, but it should at least be consistently messy.

I've no shortage of starved machines I can point at stuff when it shows up!
ID: 70253
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 70254 - Posted: 31 Jan 2024, 20:50:38 UTC

Interesting. There... shouldn't be any variation in results for the same code on the same host with the same initial conditions. If so, look for uninitialized memory reads somewhere, I guess? I know floating point is messy, but it should at least be consistently messy.


My understanding from work on the Hadley models a long time ago is that, with that model, some variation between hosts is possible due to FP rounding being different between operating systems/CPU manufacturers. In those days all model types went out on all platforms. To me, it makes sense to actually check this. My coding experience is with different languages and is also very, very rusty, but it may be that there are things that could be done in the code to mitigate this if there is significant variance with the OIFS models.
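
A small illustration of why rounding can differ at all: floating-point addition is not associative, so the same numbers added in a different order can give a different last digit. Compilers, maths libraries and CPUs are free to differ in things like evaluation order, which is one plausible route to host-to-host differences. Whether that is what happens in the OIFS models is not something this snippet can show; it only demonstrates the arithmetic effect.

```python
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # add a and b first
right_to_left = a + (b + c)   # add b and c first

print(left_to_right)                    # 0.6000000000000001
print(right_to_left)                    # 0.6
print(left_to_right == right_to_left)   # False
```

Each result is consistent for a given order, so a single machine repeats itself exactly. A model run performs a vast number of such operations, so a last-digit difference at one step can grow over the course of a forecast.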
ID: 70254
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70256 - Posted: 31 Jan 2024, 22:03:51 UTC - in response to Message 70253.  
Last modified: 31 Jan 2024, 22:06:17 UTC

I've no shortage of starved machines I can point at stuff when it shows up!
I have a Ryzen 9 3900X and a Ryzen 9 3900XT running Linux in an Oracle VirtualBox. Will these be useful? I'm guessing you want to check they're ok on virtual machines too, although I don't know if you can tell they're virtual machines from your end:

https://www.cpdn.org/show_host_detail.php?hostid=1542648
https://www.cpdn.org/show_host_detail.php?hostid=1539015
ID: 70256

©2024 cpdn.org