Thread 'OpenIFS Discussion'

Author	Message
Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68278 - Posted: 12 Feb 2023, 20:41:45 UTC - in response to Message 68277.
Quote: I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?
I tested both and the Intel compiler produces faster code (at the expense of slightly more application memory). I don't know of any reason why gnu compilers would offer 'better compatibility' for this application?

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 68279 - Posted: 12 Feb 2023, 20:44:49 UTC - in response to Message 68274.
Quote: Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)
Dave, I think you mixed those up. The Intel machine had the double corruption and the AMD was the invalid next size.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 68280 - Posted: 12 Feb 2023, 20:49:27 UTC - in response to Message 68277. Last modified: 12 Feb 2023, 20:56:32 UTC
Quote: I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?
I do not have the slightest idea, but I imagine the results will not be the same as on my Red Hat Enterprise Linux release 8.7 (Ootpa) machine. For the record, this is what is on my machine. But I have never compiled anything related to Boinc on my machine. Oh! For C++, you invoke the compiler with g++ instead of gcc. That compiler has an incredible number of user-specified optimization options if you choose to use them.
$ rpm -qf /usr/bin/gcc /usr/lib64/libc.so.6 /usr/lib64/libm.so.6
gcc-8.5.0-16.el8_7.x86_64
glibc-2.28-211.el8.x86_64
glibc-2.28-211.el8.x86_64
$ ls -l /usr/bin/gcc /usr/lib64/libc.so.6 /usr/lib64/libm.so.6
Nov 18 17:32 /usr/bin/gcc
Aug 25 17:15 /usr/lib64/libc.so.6 -> libc-2.28.so
Aug 25 17:15 /usr/lib64/libm.so.6 -> libm-2.28.so

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68281 - Posted: 12 Feb 2023, 21:05:23 UTC - in response to Message 68279. Last modified: 12 Feb 2023, 21:42:06 UTC
Quote: Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)
Quote: Dave, I think you mixed those up. The Intel machine had the double corruption and the AMD was the invalid next size.
Thanks George. You are, as expected, right.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68282 - Posted: 12 Feb 2023, 21:09:23 UTC - in response to Message 68281.
Quote: Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)
Quote: Dave, I think you mixed those up. The Intel machine had the double corruption and the AMD was the invalid next size.
Quote: Thanks George. You are almost certainly correct but I will look again.
Don't trouble yourself Dave. I've already seen them and it's the same error (just a different message) in both cases.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68283 - Posted: 12 Feb 2023, 21:12:49 UTC - in response to Message 68278.
Quote: I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?
Quote: I tested both and the Intel compiler produces faster code (at the expense of slightly more application memory). I don't know of any reason why gnu compilers would offer 'better compatibility' for this application?
Actually I should qualify this. I'm referring to the model here and not the controlling 'wrapper'. The model is mostly Fortran compiled with Intel and the C++ wrapper uses gnu. We only care about the speed of the model.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68285 - Posted: 13 Feb 2023, 8:37:32 UTC - in response to Message 68282. Last modified: 13 Feb 2023, 8:54:50 UTC
Quote: Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)
Quote: Dave, I think you mixed those up. The Intel machine had the double corruption and the AMD was the invalid next size.
Quote: Thanks George. You are almost certainly correct but I will look again.
Quote: Don't trouble yourself Dave. I've already seen them and it's the same error (just a different message) in both cases.
Mine failed with: double free or corruption (out)
I probably should have looked at the other machines that attempted the task. One has only 8GB of RAM and has crashed all of the tasks it has attempted, I suspect by trying to run more than one at a time. The other has a better record, particularly more recently, and has 64GB of RAM (actually a fraction less, as it looks like some is taken for video). Mine made it as far as uploading zip 106. The only other task that ran during the process was one from the testing branch that failed two seconds in, but that was a good 20 or more zips before this one failed.
Now it is time for me to display my ignorance. Doing an internet search tells me there is at least one AMD optimising compiler out there for Fortran. Would there be any mileage in trying that?
My attempt here
Edit: And that is the only hard fail from #990 so far.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68286 - Posted: 13 Feb 2023, 9:06:15 UTC
Just to summarise, the Intel machine managed the least time, but with only 8GB of RAM this was always likely to fail, especially given it has yet to complete a task. The first AMD machine got as far as uploading zip 105. Mine made it to 106. The other AMD is a 12-core Ryzen 9 running Fedora Linux; mine is a Ryzen 7 running Ubuntu. I have 32GB of RAM; the other has 64GB, though listed as slightly less, presumably due to video.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68288 - Posted: 13 Feb 2023, 11:35:06 UTC
Detailed information about the next OpenIFS batch here: https://www.cpdn.org/forum_thread.php?id=9187#68287
Going out today.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68290 - Posted: 13 Feb 2023, 14:38:02 UTC Last modified: 13 Feb 2023, 14:58:43 UTC
The current batch (see prev post) is for the new Baroclinic Lifecycle 'bl' version of OpenIFS. As the boinc client may not have seen this before, the estimated finish time will be significantly wrong until it's run a few of these tasks. Compared to the previous 'PS' app, this batch will complete sooner (about half the time of the PS tasks), as it's a shorter forecast and the model has a simpler configuration. Batch size is 6000 and there will only be one batch.
A number of issues we saw with the PS app have been fixed. The task disk limit and disk-filling problems have been solved. There is no need to manually remove 'srf' files from the slot directory. A few memory handling bugs were found and fixed, but we still see some fails from a remaining memory issue which we're working on. Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.
No upload issues are expected as the data will not be stored on JASMIN. Once the data arrives at upload11, it will be transferred to Helsinki for the scientists.
Another project using the default OpenIFS app 'oifs_43r3' will also be releasing about 2000 workunits sometime this week. These will have a very similar runtime & upload size to the previous PS batches. Batch size is 2000 for this. There will be a News item (& client Notice) about this batch soon.

mikey Joined: 18 Nov 18 Posts: 22 Credit: 6,710,387 RAC: 2,649	Message 68291 - Posted: 13 Feb 2023, 15:02:10 UTC - in response to Message 68290.
Quote: The current batch (see prev post) is for the new Baroclinic Lifecycle 'bl' version of OpenIFS. Another project using the default OpenIFS app 'oifs_43r3' will also be releasing about 2000 workunits sometime this week. These will have a very similar runtime & upload size to the previous PS batches. Batch size is 2000 for this. There will be a News item (& client Notice) about this batch soon.
Is there much of a chance we will ever see more of the apps being made to run on Windows PCs? As you know, currently only the WAH2 apps run on Windows PCs.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68292 - Posted: 13 Feb 2023, 15:56:39 UTC - in response to Message 68291.
Quote: Is there much of a chance we will ever see more of the apps being made to run on Windows PCs? As you know, currently only the WAH2 apps run on Windows PCs.
I'm working on a Windows version of OpenIFS now. The bulk of the model code is fine; it's the lower-level calls to operating system functions which need time to work around, because Linux is very different to Windows in that respect. I can't give you an estimate.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68293 - Posted: 13 Feb 2023, 17:44:13 UTC
My first one of the new batch now uploading. 4hours 45minutes. I notice that the zips steadily increase in size as the task progresses rather than all being more or less the same size.

cetus Joined: 7 Aug 04 Posts: 10 Credit: 148,100,750 RAC: 29,951	Message 68295 - Posted: 13 Feb 2023, 18:49:28 UTC - in response to Message 68290.
Quote: Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.
I had one task with this issue: https://www.cpdn.org/result.php?resultid=22307151
Before I aborted it, I copied the slot directory that it was running in. If you want to see any of those files, let me know.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68297 - Posted: 13 Feb 2023, 19:39:36 UTC - in response to Message 68295.
Quote: Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.
Quote: I had one task with this issue: https://www.cpdn.org/result.php?resultid=22307151 Before I aborted it, I copied the slot directory that it was running in. If you want to see any of those files, let me know.
I think you may have killed a normal task. How long was it 'hanging' for? I ask because if I look at the task output, there is no line at the bottom:
called boinc_finish(0)
It's in that boinc function that it gets stuck. The task will wait for 2 minutes after the model finishes to make sure all files have been flushed to disk, before it then calls boinc_finish. My guess is the task was aborted in that 2 minutes; the progress bar will sit at 99.99% during that time. If it's like that for more than 5 minutes then you have a stuck task.
Thanks for copying the files but I have to catch one in the act on one of my machines so I can use the debugger on it whilst it's still running. Appreciate the sentiment.
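The rule of thumb in this post can be sketched roughly as below. This is only an illustration of the timings quoted above (a 2-minute flush wait, then "stuck" after more than 5 minutes at 99.99%); the function names, return labels and the idea of passing in the stderr text are assumptions for the sketch, not part of the OpenIFS wrapper or of BOINC.

```python
# Thresholds taken from the post above; all names here are hypothetical.
FLUSH_WAIT_S = 2 * 60    # task waits ~2 min after the model finishes
STUCK_AFTER_S = 5 * 60   # more than ~5 min at 99.99% suggests a stuck task


def called_boinc_finish(stderr_text: str) -> bool:
    """True if the task output contains the 'called boinc_finish' line."""
    return "called boinc_finish" in stderr_text


def classify_task_end(seconds_at_9999: float) -> str:
    """Classify a task by how long its progress bar has sat at 99.99%."""
    if seconds_at_9999 <= FLUSH_WAIT_S:
        return "flushing"   # normal: files are being flushed to disk
    if seconds_at_9999 <= STUCK_AFTER_S:
        return "wait"       # past the flush wait, but not yet suspicious
    return "stuck"          # longer than 5 minutes: likely hung


if __name__ == "__main__":
    print(classify_task_end(90))        # inside the 2-minute wait
    print(classify_task_end(90 * 60))   # e.g. 1.5 hours at 99.99%
```

By this rule, anything aborted inside the first two minutes would have been a normal task, while one sitting at 99.99% for well over five minutes would count as stuck.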

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68298 - Posted: 13 Feb 2023, 19:44:45 UTC - in response to Message 68293.
Quote: My first one of the new batch now uploading. 4hours 45minutes. I notice that the zips steadily increase in size as the task progresses rather than all being more or less the same size.
They are all the same apart from the last one. Each bar the last contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68299 - Posted: 13 Feb 2023, 19:57:22 UTC - in response to Message 68298. Last modified: 13 Feb 2023, 20:01:50 UTC
Quote: They are all the same apart from the last one. Each bar the last contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.
Task currently running:
7.zip 90.8MB
8.zip 94.71MB
9.zip 97.9MB
10.zip 99.65MB
11.zip 100.86MB
edit: 12.zip 101.70MB
this task
I first noticed it on the first task which has completed successfully. here
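As a quick check on the sizes listed in this post, the short sketch below works out how much each zip grew over the previous one. The figures are copied from the post; the variable names are just for illustration.

```python
# Zip sizes in MB as reported in the post (7.zip through 12.zip).
sizes_mb = {7: 90.8, 8: 94.71, 9: 97.9, 10: 99.65, 11: 100.86, 12: 101.70}

zips = sorted(sizes_mb)
# Growth of each zip relative to the one before it, rounded to 2 decimals.
increments = [
    round(sizes_mb[b] - sizes_mb[a], 2) for a, b in zip(zips, zips[1:])
]

for (a, b), inc in zip(zip(zips, zips[1:]), increments):
    print(f"{a}.zip -> {b}.zip: +{inc:.2f} MB")
```

The zips do get bigger, but by a smaller amount each time, so the sizes look as if they are levelling off rather than growing without limit.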

cetus Joined: 7 Aug 04 Posts: 10 Credit: 148,100,750 RAC: 29,951	Message 68300 - Posted: 13 Feb 2023, 20:09:27 UTC - in response to Message 68297.
Quote: I think you may have killed a normal task. How long was it 'hanging' for?
It was running for about 1.5 hours after the model seemed to have finished. It looked like it was in the same state as that 2 minute pause at the end, but it just never finished.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 68302 - Posted: 13 Feb 2023, 20:56:34 UTC - in response to Message 68293.
Quote: My first one of the new batch now uploading. 4hours 45minutes.
My machine is ID: 1511241
My first one took 7 hours, 50 minutes, 27 seconds.
I did not catch the time for my second one, but it was about the same.
My third one took 7 hours, 49 minutes, 50 seconds.
My fourth one took 7 hours, 49 minutes, 25 seconds.
My fifth one took 7 hours, 52 minutes, 21 seconds.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68303 - Posted: 13 Feb 2023, 21:09:31 UTC - in response to Message 68299. Last modified: 13 Feb 2023, 21:10:27 UTC
Quote: They are all the same apart from the last one. Each bar the last contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.
Quote: Task currently running 7.zip 90.8MB 8.zip 94.71MB 9.zip 97.9MB 10.zip 99.65MB 11.zip 100.86MB edit: 12.zip 101.70MB this task I first noticed it on the first task which has completed successfully. here
I can't see any trickles or stderr on that first task you note, only the second. But the second task is fine. If I scroll through the stderr output I can see 8 sets of 3 files going into each zip file, except the last one.
The zipfile size depends on the data being compressed. The degree of compression will vary with differing content because the compressor looks for patterns in the data. These runs are 'idealized' so the initial state is very simple. As the model runs on, the model fields get more length scales appearing (short waves -> long waves), hence the data can't be compressed as well. If I had the time, I'd plot some of the early & later model fields to illustrate. I don't, I'm afraid, so you'll hopefully take my word for it!
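The compression point can be illustrated with a toy example. The sketch below is not OpenIFS output and not the project's actual zip step. It assumes a made-up 1-D "field" quantised to bytes and uses Python's zlib (deflate, the same family of compression commonly used in zip files). The "early" field has a single long wave; the "late" field mixes in several shorter length scales. The wavenumbers and field size are arbitrary choices for the illustration.

```python
import math
import zlib

N = 8192  # number of points in the toy 1-D field (arbitrary choice)


def quantised_field(wavenumbers, n=N):
    """Sum of equal-amplitude sine waves, scaled to [-1, 1], stored as bytes."""
    scale = len(wavenumbers)
    out = bytearray()
    for i in range(n):
        v = sum(math.sin(2 * math.pi * k * i / n) for k in wavenumbers) / scale
        out.append(int(round((v + 1.0) / 2.0 * 255)))  # map [-1, 1] -> 0..255
    return bytes(out)


# "Early" state: one long wave. "Late" state: many length scales present.
early = quantised_field([1])
late = quantised_field([1, 3, 7, 15, 31, 63, 127])

early_size = len(zlib.compress(early, 9))
late_size = len(zlib.compress(late, 9))

print(f"raw size of each field: {N} bytes")
print(f"early (simple) field compresses to {early_size} bytes")
print(f"late (multi-scale) field compresses to {late_size} bytes")
```

Both fields hold the same number of raw bytes, but the multi-scale one should compress noticeably less well, which is the same effect as the zips growing while the number of files per zip stays the same.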