climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68278 - Posted: 12 Feb 2023, 20:41:45 UTC - in response to Message 68277.  

I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?
I tested both and the Intel compiler produces faster code (at the expense of slightly more application memory). I don't know of any reason why gnu compilers would offer 'better compatibility' for this application?
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 68279 - Posted: 12 Feb 2023, 20:44:49 UTC - in response to Message 68274.  

Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is
free(): invalid next size (fast)

Dave, I think you mixed those up.

The Intel machine had the double corruption and the AMD was the invalid next size.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68280 - Posted: 12 Feb 2023, 20:49:27 UTC - in response to Message 68277.  
Last modified: 12 Feb 2023, 20:56:32 UTC

I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?


I do not have the slightest idea, but I imagine the results will not be the same as on my
Red Hat Enterprise Linux release 8.7 (Ootpa) machine.

For the record, this is what is on my machine. But I have never compiled anything related to Boinc on my machine.
Oh! For C++, you invoke the compiler with g++ instead of gcc.
That compiler has an incredible number of user-specified optimization options if you choose to use them.
$ rpm -qf /usr/bin/gcc /usr/lib64/libc.so.6 /usr/lib64/libm.so.6
gcc-8.5.0-16.el8_7.x86_64
glibc-2.28-211.el8.x86_64
glibc-2.28-211.el8.x86_64

$ ls -l /usr/bin/gcc /usr/lib64/libc.so.6 /usr/lib64/libm.so.6
Nov 18 17:32 /usr/bin/gcc
Aug 25 17:15 /usr/lib64/libc.so.6 -> libc-2.28.so
Aug 25 17:15 /usr/lib64/libm.so.6 -> libm-2.28.so
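
For anyone not on an RPM-based distribution, roughly the same information can be gathered with a few lines of Python. This is only a sketch: `toolchain_report` is a made-up helper name, and it assumes `gcc` is on the PATH (it reports `None` if it is not).

```python
import platform
import shutil
import subprocess


def toolchain_report():
    """Collect the C library version and the first line of `gcc --version`."""
    report = {"libc": platform.libc_ver(), "gcc": None}
    gcc_path = shutil.which("gcc")  # None when gcc is not installed
    if gcc_path:
        try:
            result = subprocess.run(
                [gcc_path, "--version"],
                capture_output=True, text=True, check=False,
            )
        except OSError:
            return report  # gcc was found but could not be started
        if result.stdout:
            report["gcc"] = result.stdout.splitlines()[0]
    return report


if __name__ == "__main__":
    for name, value in toolchain_report().items():
        print(name, value)
```

This reports what is installed on the volunteer's machine; it says nothing about which compiler was used to build the project's applications.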

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,033,008
RAC: 19,749
Message 68281 - Posted: 12 Feb 2023, 21:05:23 UTC - in response to Message 68279.  
Last modified: 12 Feb 2023, 21:42:06 UTC

Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is
free(): invalid next size (fast)

Dave, I think you mixed those up.

The Intel machine had the double corruption and the AMD was the invalid next size.


Thanks George. You are, as expected, right.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68282 - Posted: 12 Feb 2023, 21:09:23 UTC - in response to Message 68281.  

Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is
free(): invalid next size (fast)
Dave, I think you mixed those up.
The Intel machine had the double corruption and the AMD was the invalid next size.
Thanks George. You are almost certainly correct but I will look again.
Don't trouble yourself Dave. I've already seen them and it's the same error (just a different message) in both cases.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68283 - Posted: 12 Feb 2023, 21:12:49 UTC - in response to Message 68278.  

I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?
I tested both and the Intel compiler produces faster code (at the expense of slightly more application memory). I don't know of any reason why gnu compilers would offer 'better compatibility' for this application?
Actually I should qualify this. I'm referring to the model here and not the controlling 'wrapper'. The model is mostly Fortran compiled with Intel and the C++ wrapper uses gnu. We only care about the speed of the model.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,033,008
RAC: 19,749
Message 68285 - Posted: 13 Feb 2023, 8:37:32 UTC - in response to Message 68282.  
Last modified: 13 Feb 2023, 8:54:50 UTC

Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is
free(): invalid next size (fast)
Dave, I think you mixed those up.
The Intel machine had the double corruption and the AMD was the invalid next size.
Thanks George. You are almost certainly correct but I will look again.
Don't trouble yourself Dave. I've already seen them and it's the same error (just a different message) in both cases.
Mine failed with
double free or corruption (out)


I probably should have looked at the other machines that attempted the task. One only has 8GB of RAM and has crashed all of the tasks it has attempted, I suspect by trying to run more than one at a time. The other has a better record, particularly more recently, and has 64GB of RAM (actually a fraction less, as it looks like some is taken for video). Mine made it as far as uploading zip 106. The only other task that ran during the process was one from the testing branch that failed two seconds in, but that was a good 20 or more zips before this one failed.

Now it is time for me to display my ignorance. Doing an internet search tells me there is at least one AMD optimising compiler out there for Fortran. Would there be any mileage in trying that?

My attempt here

Edit: And that is the only hard fail from #990 so far.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,033,008
RAC: 19,749
Message 68286 - Posted: 13 Feb 2023, 9:06:15 UTC

Just to summarise: the Intel machine ran for the shortest time, but with only 8GB of RAM it was always likely to fail, especially given it has yet to complete a task. The first AMD machine got as far as uploading zip 105. Mine made it to 106.

The other AMD is a 12-core Ryzen 9 running Fedora Linux; mine is a Ryzen 7 running Ubuntu. I have 32GB of RAM; the other has 64GB, though listed as slightly less, presumably due to video.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68288 - Posted: 13 Feb 2023, 11:35:06 UTC

Detailed information about the next OpenIFS batch here: https://www.cpdn.org/forum_thread.php?id=9187#68287

Going out today.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68290 - Posted: 13 Feb 2023, 14:38:02 UTC
Last modified: 13 Feb 2023, 14:58:43 UTC

The current batch (see prev post) is for the new Baroclinic Lifecycle 'bl' version of OpenIFS. As the boinc client may not have seen this app before, the estimated finish time will be significantly wrong until it has run a few of these tasks. Compared to the previous 'PS' app, this batch will complete sooner (about half the time of the PS tasks), as it's a shorter forecast and the model has a simpler configuration. Batch size is 6000 and there will only be one batch.

A number of issues we saw with the PS app have been fixed. The task disk limit and disk-filling problems have been solved, so there is no need to manually remove 'srf' files from the slot directory. A few memory handling bugs were found and fixed, but we still see some failures from a remaining memory issue which we're working on. Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.

No upload issues are expected as the data will not be stored on JASMIN. Once the data arrives at upload11, it will be transferred to Helsinki for the scientists.

Another project using the default OpenIFS app 'oifs_43r3' will also be releasing about 2000 workunits sometime this week. These will have a very similar runtime & upload size to the previous PS batches. Batch size is 2000 for this. There will be a News item (& client Notice) about this batch soon.
mikey

Joined: 18 Nov 18
Posts: 21
Credit: 6,598,476
RAC: 2,046
Message 68291 - Posted: 13 Feb 2023, 15:02:10 UTC - in response to Message 68290.  

The current batch (see prev post) is for the new Baroclinic Lifecycle 'bl' version of OpenIFS.

Another project using the default OpenIFS app 'oifs_43r3' will also be releasing about 2000 workunits sometime this week. These will have a very similar runtime & upload size to the previous PS batches. Batch size is 2000 for this. There will be a News item (& client Notice) about this batch soon.


Is there much of a chance we will ever see more of the apps being made to run on Windows PCs? As you know, currently only the WAH2 apps run on Windows PCs.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68292 - Posted: 13 Feb 2023, 15:56:39 UTC - in response to Message 68291.  

Is there much of a chance we will ever see more of the apps being made to run on Windows PCs? As you know, currently only the WAH2 apps run on Windows PCs.
I'm working on a Windows version of OpenIFS now. The bulk of the model code is fine; it's the lower-level calls to operating system functions which need time to work around, because Linux is very different to Windows in that respect. I can't give you an estimate.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,033,008
RAC: 19,749
Message 68293 - Posted: 13 Feb 2023, 17:44:13 UTC

My first one of the new batch now uploading. 4 hours 45 minutes. I notice that the zips steadily increase in size as the task progresses rather than all being more or less the same size.
cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,068,446
RAC: 36,055
Message 68295 - Posted: 13 Feb 2023, 18:49:28 UTC - in response to Message 68290.  

Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.


I had one task with this issue: https://www.cpdn.org/result.php?resultid=22307151
Before I aborted it, I copied the slot directory that it was running in. If you want to see any of those files, let me know.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68297 - Posted: 13 Feb 2023, 19:39:36 UTC - in response to Message 68295.  

Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.
I had one task with this issue: https://www.cpdn.org/result.php?resultid=22307151
Before I aborted it, I copied the slot directory that it was running in. If you want to see any of those files, let me know.
I think you may have killed a normal task. How long was it 'hanging' for? I ask because if I look at the task output, there is no line at the bottom:
called boinc_finish(0)
It's in that boinc function that it gets stuck.
The task will wait for 2 minutes after the model finishes to make sure all files have been flushed to disk, before it then calls boinc_finish. My guess is the task was aborted in those 2 minutes; the progress bar will sit at 99.99% during that time. If it's like that for more than 5 minutes then you have a stuck task.
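
As a rough illustration of that rule of thumb, here is a small sketch. It is not part of the OpenIFS wrapper or of BOINC: `diagnose` is a hypothetical helper, and the only facts it uses are the 2 minute wait, the 5 minute threshold and the `called boinc_finish(0)` line mentioned above.

```python
FLUSH_WAIT_S = 2 * 60    # wait after the model finishes (from the post)
STUCK_AFTER_S = 5 * 60   # rule-of-thumb threshold for a stuck task (from the post)
FINISH_MARKER = "called boinc_finish(0)"  # last line expected in the task output


def diagnose(stderr_text, seconds_at_9999):
    """Classify a task whose progress bar has been sitting at 99.99%.

    stderr_text     -- the task's stderr output so far
    seconds_at_9999 -- how long the progress bar has shown 99.99%
    """
    if seconds_at_9999 <= FLUSH_WAIT_S:
        return "normal: waiting for files to be flushed to disk"
    if seconds_at_9999 <= STUCK_AFTER_S:
        return "probably normal: give it a few more minutes"
    if FINISH_MARKER in stderr_text:
        return "stuck: hanging inside boinc_finish"
    return "stuck: but boinc_finish was never reached"


if __name__ == "__main__":
    print(diagnose("", 90))
    print(diagnose("... called boinc_finish(0)\n", 90 * 60))
```

The point of checking for the marker line is only to tell the known hang inside `boinc_finish` apart from a task that stalled earlier.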

Thanks for copying the files but I have to catch one in the act on one of my machines so I can use the debugger on it whilst it's still running. Appreciate the sentiment.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68298 - Posted: 13 Feb 2023, 19:44:45 UTC - in response to Message 68293.  

My first one of the new batch now uploading. 4 hours 45 minutes. I notice that the zips steadily increase in size as the task progresses rather than all being more or less the same size.
They are all the same apart from the last one. Each bar the last contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.
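
To make the grouping concrete, here is a minimal sketch of how output steps would fall into zips under that layout. The 8 steps per zip and 3 files per step come from the description above; the function name and the total of 20 steps are made up for illustration, and this is not the wrapper's actual code.

```python
STEPS_PER_ZIP = 8    # output timesteps per zip (from the post)
FILES_PER_STEP = 3   # files written per timestep (from the post)


def group_into_zips(n_steps):
    """Return one list per zip, each holding (step, file_index) pairs."""
    zips = []
    for start in range(0, n_steps, STEPS_PER_ZIP):
        # The final chunk may hold fewer than STEPS_PER_ZIP steps.
        steps = range(start, min(start + STEPS_PER_ZIP, n_steps))
        zips.append([(s, f) for s in steps for f in range(FILES_PER_STEP)])
    return zips


if __name__ == "__main__":
    example = group_into_zips(20)  # 20 steps is an arbitrary example
    print([len(z) for z in example])
```

So every zip holds the same number of files (24) except possibly the last; any difference in zip size between the full ones has to come from the file contents.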
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,033,008
RAC: 19,749
Message 68299 - Posted: 13 Feb 2023, 19:57:22 UTC - in response to Message 68298.  
Last modified: 13 Feb 2023, 20:01:50 UTC

They are all the same apart from the last one. Each bar the last contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.


Task currently running
7.zip 90.8MB
8.zip 94.71MB
9.zip 97.9MB
10.zip 99.65MB
11.zip 100.86MB
Edit: 12.zip 101.70MB

this task

I first noticed it on the first task which has completed successfully. here
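
A quick check on the sizes listed above, as a sketch using only those six numbers: the zips do grow, but by a smaller amount each time.

```python
# Zip sizes in MB as listed in the post, keyed by zip number.
sizes_mb = {7: 90.8, 8: 94.71, 9: 97.9, 10: 99.65, 11: 100.86, 12: 101.70}

numbers = sorted(sizes_mb)
# Growth of each zip relative to the one before it, rounded to 0.01 MB.
growth = [
    round(sizes_mb[b] - sizes_mb[a], 2)
    for a, b in zip(numbers, numbers[1:])
]

for number, delta in zip(numbers[1:], growth):
    print(f"{number}.zip: +{delta:.2f} MB")
```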
cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,068,446
RAC: 36,055
Message 68300 - Posted: 13 Feb 2023, 20:09:27 UTC - in response to Message 68297.  

I think you may have killed a normal task. How long was it 'hanging' for?

It was running for about 1.5 hours after the model seemed to have finished. It looked like it was in the same state as that 2 minute pause at the end, but it just never finished.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68302 - Posted: 13 Feb 2023, 20:56:34 UTC - in response to Message 68293.  

My first one of the new batch now uploading. 4 hours 45 minutes.


My machine is ID: 1511241

My first one took 7 hours, 50 minutes, 27 seconds.
I did not catch the time for my second one, but it was about the same.
My third one took 7 hours, 49 minutes, 50 seconds.
My fourth one took 7 hours, 49 minutes, 25 seconds.
My fifth one took 7 hours, 52 minutes, 21 seconds.

Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,454,052
RAC: 15,294
Message 68303 - Posted: 13 Feb 2023, 21:09:31 UTC - in response to Message 68299.  
Last modified: 13 Feb 2023, 21:10:27 UTC

They are all the same apart from the last one. Each bar the last contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.
Task currently running
7.zip 90.8MB
8.zip 94.71MB
9.zip 97.9MB
10.zip 99.65MB
11.zip 100.86MB edit:
12.zip 101.70MB
this task
I first noticed it on the first task which has completed successfully. here
I can't see any trickles or stderr on that first task you note, only the second. But the second task is fine. If I scroll through the stderr output I can see 8 sets of 3 files going into each zip file, except the last one.

The zipfile size depends on the data being compressed. The degree of compression varies with the content because the compressor looks for patterns in the data. These runs are 'idealized', so the initial state is very simple. As the model runs on, more length scales appear in the model fields (short waves -> long waves), hence the data can't be compressed as well. If I had the time, I'd plot some of the early & later model fields to illustrate. I don't, I'm afraid, so hopefully you'll take my word for it!
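
In place of plots, the effect can be reproduced with a toy example. This is only a sketch under stated assumptions: the "fields" are synthetic one-dimensional sine waves packed as 16-bit integers, the wavelengths and point count are arbitrary, and it uses Python's `zlib` (DEFLATE) rather than whatever the task itself uses to build its zips. It has nothing to do with real OpenIFS output.

```python
import math
import struct
import zlib

N_POINTS = 65536  # arbitrary number of values in one synthetic field


def make_field(wavelengths):
    """Sum of equal-amplitude sine waves, packed as 16-bit integers."""
    values = []
    for i in range(N_POINTS):
        total = sum(math.sin(2.0 * math.pi * i / w) for w in wavelengths)
        # Normalise to [-1, 1] and scale into the signed 16-bit range.
        values.append(round(32767 * total / len(wavelengths)))
    return struct.pack("<%dh" % N_POINTS, *values)


def compressed_size(raw):
    """Size in bytes after DEFLATE compression at level 6."""
    return len(zlib.compress(raw, 6))


early = make_field([64.0])                  # one wave: a very simple state
later = make_field([64.0, 23.0, 7.3, 3.1])  # several length scales mixed

if __name__ == "__main__":
    print("raw bytes:       ", len(early), len(later))
    print("compressed bytes:", compressed_size(early), compressed_size(later))
```

Both fields have the same raw size, but the single-wave field repeats itself exactly and so compresses far better than the mixed one, which is the same qualitative behaviour as zips growing while the file count stays fixed.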