climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 66601 - Posted: 28 Nov 2022, 17:50:04 UTC - in response to Message 66599.  

I got four from testing; all failed in just over one and a half minutes with
 ABORT!    1 RRTM_KGB16:ERROR READING FILE RADRRTM
At least that is the only line that leaps out at me. Batch D523.

Don't know if it is worth setting up Trello cards for these or not?

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66602 - Posted: 28 Nov 2022, 18:06:59 UTC - in response to Message 66600.  

Sorry - being an idiot. I was looking in projects/climateprediction.net, which has an app_config.xml, instead of dev.cpdn, which doesn't and is where the apps were running... Oh, I wish the forums would let me delete posts :D

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66603 - Posted: 28 Nov 2022, 18:08:18 UTC - in response to Message 66601.  
Last modified: 28 Nov 2022, 18:09:46 UTC

> I got four from testing; all failed in just over one and a half minutes with
>  ABORT!    1 RRTM_KGB16:ERROR READING FILE RADRRTM
> At least that is the only line that leaps out at me. Batch D523.
>
> Don't know if it is worth setting up Trello cards for these or not?

No, don't bother. I've already emailed the scientist. The file is there (I've checked), but the model configuration is wrong. I wish they would stop sending out tasks, though.

P.S. Dave - top marks for finding the one-line error message in the very long traceback! If you are interested, the file the model is looking for is in slots/?/ifsdata/RADRRTM. My guess is the model has been told the wrong directory name.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 66604 - Posted: 28 Nov 2022, 18:28:39 UTC
Last modified: 28 Nov 2022, 20:04:21 UTC

Two running from #945.

Edit: Running two at once, no problem with my bored band keeping up. Peak memory usage at the moment seems to be about 13% of 32 GB.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66609 - Posted: 28 Nov 2022, 21:26:46 UTC - in response to Message 66604.  

> Two running from #945.
>
> Edit: Running two at once, no problem with my bored band keeping up. Peak memory usage at the moment seems to be about 13% of 32 GB.

I, too, have two running at once. How would I tell if I were choking my broadband connection? I get 75 megabits/second up and down if the server at the other end can keep up.

Memory usage at the moment is:
$ free -hw
              total        used        free      shared     buffers       cache   available
Mem:           62Gi        12Gi       1.6Gi       102Mi       332Mi        48Gi        49Gi
Swap:          15Gi        82Mi        15Gi


so I am really using 12 GiB out of 62 GiB total. This includes 10 other BOINC tasks that are not CPDN.

Not only are there 1.6 GiB free, but there are also 49 GiB available by reclaiming some of the disk cache if needed (without swapping).
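
For anyone who wants the same figures without reading the columns by eye, here is a minimal sketch that parses the `free -hw` output quoted above. The helper names (`to_bytes`, `parse_mem_row`) are made up for this example, and it assumes the binary-suffix format (`Mi`, `Gi`) shown in the post.

```python
# Sample copied from the post above (output of `free -hw`).
FREE_OUTPUT = """\
              total        used        free      shared     buffers       cache   available
Mem:           62Gi        12Gi       1.6Gi       102Mi       332Mi        48Gi        49Gi
Swap:          15Gi        82Mi        15Gi
"""

# Binary suffixes as printed by `free -h`.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(text: str) -> float:
    """Convert a human-readable size such as '1.6Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text.rstrip("B"))  # plain byte count, e.g. '0B'


def parse_mem_row(output: str) -> dict:
    """Map each column header to the value on the 'Mem:' row, in bytes."""
    lines = output.splitlines()
    headers = lines[0].split()
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = line.split()[1:]
            return {h: to_bytes(v) for h, v in zip(headers, values)}
    raise ValueError("no 'Mem:' row found")


mem = parse_mem_row(FREE_OUTPUT)
used_fraction = mem["used"] / mem["total"]
available_fraction = mem["available"] / mem["total"]
print(f"used: {used_fraction:.0%}, available: {available_fraction:.0%}")
```

The "available" column is the one to watch when deciding whether another 5 GB model will fit, since it already accounts for cache the kernel can give back.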

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 66611 - Posted: 28 Nov 2022, 22:02:24 UTC

They released them while I was out at the pub! Never mind - got my first couple, and have preserved the detail for the morning.

123 output files! Waiting till I see the sizes, but I hope Oxford know what they've unleashed on their creaking infrastructure.

Initial runtime estimate 60 hours 46 minutes. Again, I'll do the maths in the morning.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 66612 - Posted: 28 Nov 2022, 22:24:39 UTC - in response to Message 66611.  

> 123 output files! Waiting till I see the sizes, but I hope Oxford know what they've unleashed on their creaking infrastructure.

Averaging about 14.2 MB, looking at mine. (Haven't actually done the arithmetic to calculate the mean.)
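
As a rough check on what that means per task, the two numbers in this exchange can simply be multiplied. This is only a sketch: the 14.2 MB figure is the eyeballed average from the post, not a measured mean, and `total_upload_mb` is a name invented for the example.

```python
def total_upload_mb(n_files: int, mean_file_mb: float) -> float:
    """Total upload volume for one task, in MB."""
    return n_files * mean_file_mb


# 123 output files at roughly 14.2 MB each (both figures from the posts above).
total_mb = total_upload_mb(123, 14.2)
print(f"{total_mb:.0f} MB per task (~{total_mb / 1000:.2f} GB)")
```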

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 66613 - Posted: 28 Nov 2022, 22:49:12 UTC - in response to Message 66612.  

One file every 11 minutes:

28/11/2022 22:35:43 | climateprediction.net | [cpu_sched] Starting task oifs_43r3_ps_0923_2021050100_123_945_12164012_0 using oifs_43r3_ps version 101 in slot 2
28/11/2022 22:46:32 | climateprediction.net | Started upload of oifs_43r3_ps_0923_2021050100_123_945_12164012_0_r1958904230_0.zip
28/11/2022 22:46:44 | climateprediction.net | Finished upload of oifs_43r3_ps_0923_2021050100_123_945_12164012_0_r1958904230_0.zip

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,970,928
RAC: 14,160
Message 66614 - Posted: 28 Nov 2022, 23:25:35 UTC

Got 3. Posting zips about every 10 mins. Estimated completion about 17 hours (3.5 GHz i5, 24 GB RAM).

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66619 - Posted: 29 Nov 2022, 1:21:10 UTC - in response to Message 66613.  

> One file every 11 minutes:
>
> 28/11/2022 22:35:43 | climateprediction.net | [cpu_sched] Starting task oifs_43r3_ps_0923_2021050100_123_945_12164012_0 using oifs_43r3_ps version 101 in slot 2
> 28/11/2022 22:46:32 | climateprediction.net | Started upload of oifs_43r3_ps_0923_2021050100_123_945_12164012_0_r1958904230_0.zip
> 28/11/2022 22:46:44 | climateprediction.net | Finished upload of oifs_43r3_ps_0923_2021050100_123_945_12164012_0_r1958904230_0.zip

Mine is a little different; I have two of these running, with one upload every 7 minutes for each of them. If I knew how big they were, I could tell how much bandwidth I need to send them. They seem to take my machine about 5 seconds to upload each one.

Mon 28 Nov 2022 08:06:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_45.zip
Mon 28 Nov 2022 08:06:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_45.zip

Mon 28 Nov 2022 08:07:31 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_45.zip
Mon 28 Nov 2022 08:07:36 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_45.zip

Mon 28 Nov 2022 08:13:25 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_46.zip
Mon 28 Nov 2022 08:13:30 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_46.zip

Mon 28 Nov 2022 08:14:42 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_46.zip
Mon 28 Nov 2022 08:14:47 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0030_2021050100_123_945_12163119_0_r632876908_46.zip
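
The bandwidth question can be answered from these log lines plus a file size. The sketch below parses two consecutive uploads of one task from the log above and combines them with the roughly 14.2 MB average mentioned earlier in the thread, which is an assumption here, not a measurement on this machine. The function name `parse_uploads` is invented for the example, and the timestamp parsing assumes an English locale.

```python
from datetime import datetime

# Two consecutive uploads of one task, copied from the log above.
LOG = """\
Mon 28 Nov 2022 08:06:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_45.zip
Mon 28 Nov 2022 08:06:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_45.zip
Mon 28 Nov 2022 08:13:25 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_46.zip
Mon 28 Nov 2022 08:13:30 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0002_2021050100_123_945_12163091_0_r1172930429_46.zip
"""


def parse_uploads(log: str) -> dict:
    """Return {file_name: (start, finish)} from BOINC event-log text."""
    started, uploads = {}, {}
    for line in log.splitlines():
        if not line.strip():
            continue
        stamp, _project, message = (part.strip() for part in line.split("|", 2))
        # Drop the trailing time-zone name; every line here uses the same zone.
        local = stamp.rsplit(" ", 1)[0]
        when = datetime.strptime(local, "%a %d %b %Y %I:%M:%S %p")
        action, _, name = message.partition(" upload of ")
        if action == "Started":
            started[name] = when
        elif action == "Finished" and name in started:
            uploads[name] = (started[name], when)
    return uploads


times = sorted(parse_uploads(LOG).values())
durations = [(end - start).total_seconds() for start, end in times]
intervals = [(b[0] - a[0]).total_seconds() for a, b in zip(times, times[1:])]

ASSUMED_FILE_MB = 14.2  # assumption: rough average quoted earlier in the thread
burst_mbit_s = ASSUMED_FILE_MB * 8 / durations[0]      # rate while a file is in flight
sustained_mbit_s = ASSUMED_FILE_MB * 8 / intervals[0]  # average rate needed per task
print(f"burst ~{burst_mbit_s:.1f} Mbit/s, sustained ~{sustained_mbit_s:.2f} Mbit/s per task")
```

The sustained figure is the one that matters for a slow uplink: if the link cannot average that rate per running task, uploads will queue up faster than they drain.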

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,970,928
RAC: 14,160
Message 66626 - Posted: 29 Nov 2022, 11:08:28 UTC - in response to Message 66614.  

> Got 3. Posting zips about every 10 mins. Estimated completion about 17 hours (3.5 GHz i5, 24 GB RAM).

Got another 4 and set CPU to 100% (i.e. 4 cores). Getting a message that one task is running or waiting for memory, as expected.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 66627 - Posted: 29 Nov 2022, 11:49:42 UTC - in response to Message 66611.  

Running the numbers from last night. I see the server has started me off with an expected speed that exactly matches my Whetstone benchmark. The actual running speed seems to be much faster than that, which is no bad thing: better that we don't underestimate it and risk missing deadlines.

My machines normally run primarily as GPU platforms, so my CPU efficiency is low - CPU time is currently barely two-thirds of wall-clock time. I'm running down my GPU cache and other work, so I'll get some 'normal' times from the next batch.

Uploads are being generated as I saw last night, and all are going through cleanly. Trickle reports are being batched up and sent once per hour, as per server delay request. Trickle data is minimal, but that's probably all it needs to be.

<msg_from_host>
    <result_name>oifs_43r3_ps_0799_2021050100_123_945_12163888_0</result_name>
    <time>1669718045</time>
    <variety>orig</variety>
    <wu>oifs_43r3_ps_0799_2021050100_123_945_12163888</wu>
    <result>oifs_43r3_ps_0799_2021050100_123_945_12163888_0_r987065464</result>
    <ph></ph>
    <ts>6307200</ts>
    <cp>24355</cp>
    <vr></vr>
</msg_from_host>
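
A small sketch of how the trickle above could be read programmatically, using only Python's standard library. The thread does not say what `ts` and `cp` mean; the code assumes `ts` is model time in seconds and `cp` is CPU time in seconds, which is a guess based on the "Uploading trickle at timestep" lines elsewhere in the thread.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Trickle message copied from the post above.
TRICKLE = """<msg_from_host>
    <result_name>oifs_43r3_ps_0799_2021050100_123_945_12163888_0</result_name>
    <time>1669718045</time>
    <variety>orig</variety>
    <wu>oifs_43r3_ps_0799_2021050100_123_945_12163888</wu>
    <result>oifs_43r3_ps_0799_2021050100_123_945_12163888_0_r987065464</result>
    <ph></ph>
    <ts>6307200</ts>
    <cp>24355</cp>
    <vr></vr>
</msg_from_host>"""

root = ET.fromstring(TRICKLE)

# <time> is a Unix timestamp; convert it to an aware UTC datetime.
sent = datetime.fromtimestamp(int(root.findtext("time")), tz=timezone.utc)

# ASSUMPTION: <ts> is elapsed model time in seconds.
model_days = int(root.findtext("ts")) / 86400
# ASSUMPTION: <cp> is accumulated CPU time in seconds.
cpu_hours = int(root.findtext("cp")) / 3600

print(f"sent {sent:%Y-%m-%d %H:%M:%S} UTC, "
      f"{model_days:.0f} model days, {cpu_hours:.1f} CPU hours")
```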

Dark Angel
Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66634 - Posted: 29 Nov 2022, 20:20:35 UTC

I've got a few of these new units. So far two completed ok and two with errors.

The first error log ends with:
Uploading the intermediate file: upload_file_21.zip
00:22:21 STEP 529 H= 529:00 +CPU= 12.302
Uploading trickle at timestep: 1900800
00:22:36 STEP 530 H= 530:00 +CPU= 15.541
double free or corruption (out)


The other:
Uploading the intermediate file: upload_file_19.zip
18:58:27 STEP 481 H= 481:00 +CPU= 9.772
18:58:37 STEP 482 H= 482:00 +CPU= 10.168
free(): invalid next size (fast)

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66635 - Posted: 29 Nov 2022, 21:07:25 UTC - in response to Message 66614.  

> Got 3. Posting zips about every 10 mins. Estimated completion about 17 hours (3.5 GHz i5, 24 GB RAM).

We can adjust the trickle frequency if it causes a problem.

Please just ignore what the BOINC client reports as the estimated time to completion. It's not going to get it right at all because it's a new app. Work it out from the '% done' and the elapsed time.
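
That calculation is a simple linear extrapolation, sketched below. It assumes the model progresses at a roughly constant rate; the 3 hours and 25% in the example are made-up values, not figures from any task in this thread.

```python
def estimate_remaining_seconds(elapsed_seconds: float, fraction_done: float) -> float:
    """Extrapolate remaining run time from elapsed time and fraction complete."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    estimated_total = elapsed_seconds / fraction_done
    return estimated_total - elapsed_seconds


# Hypothetical example: 3 hours elapsed with the task showing 25% done.
remaining = estimate_remaining_seconds(elapsed_seconds=3 * 3600, fraction_done=0.25)
print(f"about {remaining / 3600:.1f} hours left")
```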

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 66636 - Posted: 29 Nov 2022, 21:13:44 UTC - in response to Message 66634.  

I've seen at least one of those 'double free or corruption' errors, but only on an old i7-7700 with non-ECC memory.

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66637 - Posted: 29 Nov 2022, 21:23:14 UTC - in response to Message 66634.  

> I've got a few of these new units. So far two completed ok and two with errors.
> The first error log ends with:
> Uploading trickle at timestep: 1900800
> 00:22:36 STEP 530 H= 530:00 +CPU= 15.541
> double free or corruption (out)
>
> The other:
> 18:58:37 STEP 482 H= 482:00 +CPU= 10.168
> free(): invalid next size (fast)

Ah! Excellent. I've been trying to understand why some tasks are apparently stopping with nothing in the stderr.txt returned to the server to explain why they stopped.

@DarkAngel - can you tell me which result IDs those were so I can look them up?
Also, what machine and OS are you running these on?

This kind of error message indicates a memory problem, often caused by a bug in the code, but I've also seen it caused by certain versions of compilers/system libraries. I've never seen it with the model itself, but then I've never run the model on such a wide range of systems as this. It could also be the wrapper code we use.

Quick question. When the tasks are running, if you do 'ps -ef' you should see the same number of 'master.exe' processes as 'oifs_43r3_ps_1.01_x86_64-pc-linux-gnu'. The latter is the 'controller' for the model itself (master.exe). Do you have the same number of each? I ask because we know of one issue that can kill the 'oifs_43r3....' process running but still leave the model 'master.exe' running.
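
For anyone who prefers a count to reading the listing by eye, here is a minimal sketch that tallies the two process names in `ps -ef` output. The sample listing is entirely hypothetical (user name, PIDs, and paths are invented), and `count_processes` is a name made up for this example; it assumes the standard eight-column `ps -ef` layout.

```python
# HYPOTHETICAL `ps -ef` listing: two controllers but three model processes,
# i.e. one master.exe has been left behind by a dead controller.
SAMPLE_PS = """\
UID          PID    PPID  C STIME TTY          TIME CMD
boinc       2101    1890  0 21:05 ?        00:00:03 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu
boinc       2110    2101 98 21:05 ?        01:12:40 ./master.exe
boinc       2205    1890  0 21:40 ?        00:00:02 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu
boinc       2214    2205 97 21:40 ?        00:38:11 ./master.exe
boinc       1977       1 96 19:12 ?        02:51:07 ./master.exe
"""


def count_processes(ps_output: str) -> dict:
    """Count controller and model processes in `ps -ef` style output."""
    counts = {"controller": 0, "model": 0}
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0].rsplit("/", 1)[-1]
        if executable == "master.exe":
            counts["model"] += 1
        elif executable.startswith("oifs_43r3_ps_"):
            counts["controller"] += 1
    return counts


counts = count_processes(SAMPLE_PS)
print(counts)
if counts["model"] != counts["controller"]:
    print("mismatch: a model may be running without its controller")
```

On a live machine the same function could be fed the text returned by `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`.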

Thanks for your help.

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66638 - Posted: 29 Nov 2022, 21:31:18 UTC - in response to Message 66636.  

> I've seen at least one of those 'double free or corruption' errors, but only on an old i7-7700 with non-ECC memory.

It's not the hardware. I have an even older i7-3770 which I've never seen this issue on. It's a software/OS issue, which unfortunately won't be easy to track down.

If anyone who gets these can report the URL of the result ID (e.g. https://www.cpdn.org/cpdnboinc/result.php?resultid=22248137), that would help. (Please send a Private Message so as not to flood this thread, thx.)

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 66639 - Posted: 29 Nov 2022, 21:33:11 UTC

Sadness. My ADSL, which has been adequate until now, isn't adequate anymore for the many uploads per model - that's about a GiB and a half per work unit. I'm throttling downloads until my very asymmetric ISP upload bottleneck gets replaced with Gbit (likely soon).
Models run in about 11 hours on my slowest and fastest multicore machines, but as was disclosed well in advance, they need at least 5 GB per running model; if they get less, they slow way down.
I've ordered an AMD 5800X3D to see if the bigger L3 cache helps with this kind of work.
Thanks to all for supporting this work with your time and compute capacity.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66640 - Posted: 29 Nov 2022, 21:48:03 UTC - in response to Message 66635.  

> We can adjust the trickle frequency if it causes a problem.

I do not see any problem. I have completed three work units without error (no credit assigned yet, but that is to be expected).

As far as a problem is concerned, would that be too many trickles?
ps_1016 took 5 seconds to upload. Then 8 minutes until the next one.
ps_1785 took 6 seconds to upload. Then 8 minutes until the next one.
ps_0961 took 4 seconds to upload. Then 8 minutes until the next one.

Tue 29 Nov 2022 04:00:11 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_76.zip
Tue 29 Nov 2022 04:00:16 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_76.zip
Tue 29 Nov 2022 04:04:41 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_9.zip
Tue 29 Nov 2022 04:04:47 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_9.zip
Tue 29 Nov 2022 04:05:10 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_77.zip
Tue 29 Nov 2022 04:05:14 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_77.zip
Tue 29 Nov 2022 04:08:13 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_77.zip
Tue 29 Nov 2022 04:08:18 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1016_2021050100_123_946_12164105_0_r1596679317_77.zip
Tue 29 Nov 2022 04:12:33 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_10.zip
Tue 29 Nov 2022 04:12:39 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1785_2021050100_123_946_12164874_0_r29850111_10.zip
Tue 29 Nov 2022 04:13:03 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_78.zip
Tue 29 Nov 2022 04:13:07 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0961_2021050100_123_945_12164050_0_r1728848570_78.zip


Running three at a time seems to be no problem at all.

I do have a fast Internet link. According to my CPDN computer (Computer 1511241) page, I get

Average upload rate 3170 KB/sec
Average download rate 15674.33 KB/sec

And according to the Speakeasy speed test site,

Timestamp            Download    Upload      Latency  Jitter  Quality Score  Test Server
11/29/2022 16:30:21  78.70 Mbps  89.08 Mbps  6 ms     1 ms    Excellent      nyc.speedtest.clouvider.net.prod.hosts.ooklaserver.net

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66641 - Posted: 29 Nov 2022, 22:49:19 UTC

> My ADSL that's been adequate isn't adequate anymore for the many uploads per model

Just so I understand properly:

Is it the number of trickles that's the issue, or the total amount of data? The model output for the complete forecast is split into the smaller trickle files (to ease the data upload burden). We could do fewer trickles, but the total data size would be the same (each trickle would be larger).

I'm assuming it's the total size of the upload (sum of all trickle sizes) that's a problem? We can ask the scientist to reduce the model output if necessary.
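
To illustrate the point about trickle count versus total size, here is a short sketch. The total uses the 123 files and roughly 14.2 MB average quoted earlier in the thread; the 1 Mbit/s uplink is a hypothetical ADSL upstream rate chosen for the example, not a figure anyone reported, and `upload_plan` is an invented name.

```python
def upload_plan(total_mb: float, n_trickles: int, uplink_mbit_s: float) -> tuple:
    """Return (MB per trickle, seconds per trickle, total seconds)."""
    per_trickle_mb = total_mb / n_trickles
    seconds_each = per_trickle_mb * 8 / uplink_mbit_s  # MB -> Mbit, then / rate
    return per_trickle_mb, seconds_each, seconds_each * n_trickles


TOTAL_MB = 123 * 14.2        # figures quoted earlier in the thread
ASSUMED_UPLINK_MBIT_S = 1.0  # HYPOTHETICAL ADSL upstream rate

plans = {n: upload_plan(TOTAL_MB, n, ASSUMED_UPLINK_MBIT_S) for n in (123, 41)}
for n, (size_mb, each_s, total_s) in plans.items():
    print(f"{n:3d} trickles: {size_mb:5.1f} MB each, "
          f"{each_s / 60:4.1f} min each, {total_s / 3600:.2f} h in total")
```

Cutting the number of trickles to a third makes each one three times larger, while the total time on the wire stays the same; only reducing the model output itself shrinks that total.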