Message boards : Number crunching : OpenIFS Discussion
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
> I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?

I tested both and the Intel compiler produces faster code (at the expense of slightly more application memory). I don't know of any reason why the GNU compilers would offer 'better compatibility' for this application.
Joined: 7 Aug 04 Posts: 2186 Credit: 64,822,615 RAC: 5,275
> Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)

Dave, I think you mixed those up. The Intel machine had the double corruption and the AMD was the invalid next size.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
> I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?

I do not have the slightest idea, but I imagine the results will not be the same as on my Red Hat Enterprise Linux release 8.7 (Ootpa) machine. For the record, this is what is on my machine, but I have never compiled anything related to Boinc on it. Oh! For C++, you invoke the compiler with g++ instead of gcc. That compiler has an incredible number of user-specified optimization options if you choose to use them.

$ rpm -qf /usr/bin/gcc /usr/lib64/libc.so.6 /usr/lib64/libm.so.6
gcc-8.5.0-16.el8_7.x86_64
glibc-2.28-211.el8.x86_64
glibc-2.28-211.el8.x86_64

$ ls -l /usr/bin/gcc /usr/lib64/libc.so.6 /usr/lib64/libm.so.6
Nov 18 17:32 /usr/bin/gcc
Aug 25 17:15 /usr/lib64/libc.so.6 -> libc-2.28.so
Aug 25 17:15 /usr/lib64/libm.so.6 -> libm-2.28.so
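For readers who have never driven the compiler by hand, a minimal check looks like the sketch below. The flags in the comment are common examples only, not CPDN's actual build options.

```cpp
// hello.cpp -- minimal check that the system C++ compiler works.
// Typical invocations (flags are common examples, not CPDN's build line):
//   g++ -O0 -g hello.cpp -o hello              # unoptimized, with debug info
//   g++ -O2 -march=native hello.cpp -o hello   # optimized for the local CPU
#include <iostream>

int main() {
    std::cout << "compiled with g++ " << __VERSION__ << "\n";  // GCC's version string macro
    return 0;
}
```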
Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788
> Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)

Thanks George. You are, as expected, right.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
> Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)

> Dave, I think you mixed those up.

> Thanks George. You are almost certainly correct but I will look again.

Don't trouble yourself Dave. I've already seen them and it's the same error (just a different message) in both cases.
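For anyone curious what those messages mean: both free(): invalid next size and double free or corruption come from glibc's heap-consistency checks, which fire when malloc's bookkeeping data has been overwritten, so different messages can point at the same underlying class of bug. The sketch below is purely illustrative and deliberately broken; it is not the OpenIFS or wrapper code, just the classic pattern that produces this kind of abort.

```cpp
// Illustrative only -- not the OpenIFS or wrapper code. Deliberately broken to
// show the kind of bug behind messages like "free(): invalid next size (fast)".
// Building with  g++ -g -fsanitize=address  (or running under valgrind) will
// pinpoint the bad write; plain glibc only notices later, at free() time.
#include <cstdlib>
#include <cstring>

int main() {
    char *buf = static_cast<char *>(std::malloc(16));
    std::memset(buf, 0, 32);   // heap overflow: 16 bytes past the allocation,
                               // trampling the next chunk's size field
    std::free(buf);            // glibc's heap check typically aborts here with
                               // "free(): invalid next size (fast)"
    return 0;                  // (a double free trips the same checks, just
                               // with a different message)
}
```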
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
> I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?

> I tested both and the Intel compiler produces faster code (at the expense of slightly more application memory). I don't know of any reason why the GNU compilers would offer 'better compatibility' for this application.

Actually, I should qualify this. I'm referring to the model here and not the controlling 'wrapper'. The model is mostly Fortran compiled with Intel, and the C++ wrapper uses GNU. We only care about the speed of the model.
Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788
> Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)

> Dave, I think you mixed those up.

> Thanks George. You are almost certainly correct but I will look again.

> Don't trouble yourself Dave. I've already seen them and it's the same error (just a different message) in both cases.

Mine failed with double free or corruption (out)

I probably should have looked at the other machines to attempt the task. One only has 8GB of RAM and has crashed all of the tasks it has attempted, I suspect by trying to run more than one at a time. The other has a better record, particularly more recently, and has 64GB of RAM (actually a fraction less, as it looks like some is taken for video).

Mine made it as far as uploading zip 106. The only other task that ran during the process was one from the testing branch that failed two seconds in, but that was a good 20 or more zips before this one failed.

Now it is time for me to display my ignorance. Doing an internet search tells me there is at least one AMD optimising compiler out there for Fortran. Would there be any mileage in trying that?

My attempt here

Edit: And that is the only hard fail from #990 so far.
Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788
Just to summarise: the Intel machine managed the least time, but with only 8GB of RAM it was always likely to fail, especially given it has yet to complete a task. The first AMD machine got as far as uploading zip 105; mine made it to 106. The other AMD is a 12-core Ryzen 9 running Fedora Linux, while mine is a 7 running Ubuntu. I have 32GB of RAM; the other has 64GB, though listed as slightly less, presumably due to video.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
Detailed information about the next OpenIFS batch is here: https://www.cpdn.org/forum_thread.php?id=9187#68287 Going out today.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
The current batch (see prev post) is for the new Baroclinic Lifecycle 'bl' version of OpenIFS. As the boinc client may not have seen this app before, the estimated finish time will be significantly wrong until it has run a few of these tasks. Compared to the previous 'PS' app, this batch will complete sooner (about half the time of the PS tasks), as it's a shorter forecast and the model has a simpler configuration. Batch size is 6000 and there will only be one batch.

A number of issues we saw with the PS app have been fixed. The task disk limit and disks filling up have been solved, and there is no need to manually remove 'srf' files from the slot directory. A few memory-handling bugs were found and fixed, but we still see some failures from a remaining memory issue which we're working on. Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.

No upload issues are expected, as the data will not be stored on JASMIN. Once the data arrives at upload11, it will be transferred to Helsinki for the scientists.

Another project using the default OpenIFS app 'oifs_43r3' will also be releasing about 2000 workunits sometime this week. These will have a very similar runtime and upload size to the previous PS batches; batch size is 2000 for this one. There will be a News item (and client Notice) about this batch soon.
Joined: 18 Nov 18 Posts: 21 Credit: 6,591,021 RAC: 1,915
> The current batch (see prev post) is for the new Baroclinic Lifecycle 'bl' version of OpenIFS.

Is there much of a chance we will ever see more of the apps being made to run on Windows PCs? As you know, currently only the WAH2 apps run on Windows PCs.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
> Is there much of a chance we will ever see more of the apps being made to run on Windows PCs? As you know, currently only the WAH2 apps run on Windows PCs.

I'm working on a Windows version of OpenIFS now. The bulk of the model code is fine; it's the lower-level calls to operating system functions which need time to work around, because Linux is very different from Windows in that respect. I can't give you an estimate.
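To give a flavour of what 'lower-level calls' means here, the sketch below shows the sort of conditional code a Linux-to-Windows port tends to accumulate around OS facilities such as sleeping and file paths. It is a hypothetical helper for illustration only, not taken from the OpenIFS or wrapper source.

```cpp
// Hypothetical example of a portability shim -- not CPDN/OpenIFS code.
// Linux (POSIX) and Windows expose sleeping and path conventions through
// different system APIs, so ported code ends up wrapping them like this.
#include <string>

#ifdef _WIN32
  #include <windows.h>
#else
  #include <unistd.h>
#endif

// Sleep for a whole number of seconds on either platform.
void portable_sleep(unsigned seconds) {
#ifdef _WIN32
    Sleep(seconds * 1000);      // Windows API takes milliseconds
#else
    sleep(seconds);             // POSIX call takes seconds
#endif
}

// Join a directory and file name using the platform's path separator.
std::string join_path(const std::string& dir, const std::string& file) {
#ifdef _WIN32
    return dir + "\\" + file;
#else
    return dir + "/" + file;
#endif
}
```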
Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788
My first one of the new batch is now uploading: 4 hours 45 minutes. I notice that the zips steadily increase in size as the task progresses, rather than all being more or less the same size.
Joined: 7 Aug 04 Posts: 10 Credit: 147,994,703 RAC: 39,971
> Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.

I had one task with this issue: https://www.cpdn.org/result.php?resultid=22307151

Before I aborted it, I copied the slot directory that it was running in. If you want to see any of those files, let me know.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
> Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this but it seems this is a known issue with boinc.

> I had one task with this issue: https://www.cpdn.org/result.php?resultid=22307151

I think you may have killed a normal task. How long was it 'hanging' for? I ask because if I look at the task output, there is no line at the bottom saying:

called boinc_finish(0)

It's in that boinc function that it gets stuck. The task will wait for 2 minutes after the model finishes, to make sure all files have been flushed to disk, before it then calls boinc_finish. My guess is the task was aborted in those 2 minutes; the progress bar will sit at 99.99% during that time. If it's like that for more than 5 minutes then you have a stuck task.

Thanks for copying the files, but I have to catch one in the act on one of my machines so I can use the debugger on it whilst it's still running. Appreciate the sentiment.
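For anyone unfamiliar with what that final call looks like, here is a rough sketch of the shape of the end-of-task sequence described above, written against the standard BOINC API (boinc_init, boinc_fraction_done, boinc_finish). It is not the actual CPDN wrapper code; the two-minute flush wait is simply modelled on the behaviour described in the post, and building it assumes the BOINC API headers and library are available.

```cpp
// Rough sketch of an end-of-task sequence in a BOINC wrapper application.
// Not the actual CPDN/OpenIFS wrapper -- just the shape of the behaviour
// described above: wait ~2 minutes for output files to be flushed to disk,
// then hand control back to the BOINC client via boinc_finish().
#include <thread>
#include <chrono>
#include "boinc_api.h"   // boinc_init(), boinc_fraction_done(), boinc_finish()

int main() {
    boinc_init();                       // attach to the BOINC client

    // ... run the model and package/upload the intermediate zips here ...

    boinc_fraction_done(0.9999);        // progress bar sits at 99.99% during the wait
    std::this_thread::sleep_for(std::chrono::minutes(2));   // let output files flush to disk

    // boinc_finish() reports success to the client and does not return; the
    // "called boinc_finish(0)" line in a task's stderr output is what a task
    // hung at this point never gets to show.
    boinc_finish(0);
}
```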
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
> My first one of the new batch is now uploading: 4 hours 45 minutes. I notice that the zips steadily increase in size as the task progresses, rather than all being more or less the same size.

They are all the same apart from the last one. Each, bar the last, contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.
Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788
> They are all the same apart from the last one. Each, bar the last, contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.

Task currently running (this task):
7.zip 90.8MB
8.zip 94.71MB
9.zip 97.9MB
10.zip 99.65MB
11.zip 100.86MB
Edit: 12.zip 101.70MB

I first noticed it on the first task which has completed successfully (here).
Joined: 7 Aug 04 Posts: 10 Credit: 147,994,703 RAC: 39,971
> I think you may have killed a normal task. How long was it 'hanging' for?

It was running for about 1.5 hours after the model seemed to have finished. It looked like it was in the same state as that 2 minute pause at the end, but it just never finished.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
> My first one of the new batch is now uploading: 4 hours 45 minutes.

My machine is ID: 1511241

My first one took 7 hours, 50 minutes, 27 seconds. I did not catch the time for my second one, but it was about the same. My third one took 7 hours, 49 minutes, 50 seconds. My fourth one took 7 hours, 49 minutes, 25 seconds. My fifth one took 7 hours, 52 minutes, 21 seconds.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512
> They are all the same apart from the last one. Each, bar the last, contains 8 timesteps (3 files per step). If that's not the case, please point me to the task and I'll investigate.

> Task currently running

I can't see any trickles or stderr on that first task you note, only the second. But the second task is fine: if I scroll through the stderr output I can see 8 sets of 3 files going into each zip file, except the last one.

The zipfile size depends on the data being compressed. The degree of compression will vary with differing content, because the compressor is looking for patterns in the data. These runs are 'idealized', so the initial state is very simple. As the model runs on, more length scales appear in the model fields (short waves -> long waves), hence the data can't be compressed as well. If I had the time I'd plot some of the early and later model fields to illustrate; I don't, I'm afraid, so you'll hopefully take my word for it!
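A quick way to see why more structure means bigger zips is to compress a smooth, repetitive field and a noisier one with the same algorithm and compare the output sizes. The sketch below uses zlib directly for simplicity (an assumption made for illustration; it is not how the CPDN wrapper actually packages its output), and the generated fields are synthetic stand-ins for model data.

```cpp
// Illustration of why later zips are larger: the same compressor squeezes a
// smooth, repetitive field much harder than one with fine-scale structure.
// Uses zlib (build with -lz); not the CPDN wrapper's actual packaging code.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>
#include <zlib.h>

// Compress a buffer with zlib and return the compressed size in bytes.
static uLongf compressed_size(const std::vector<float>& field) {
    const Bytef* src = reinterpret_cast<const Bytef*>(field.data());
    uLong srcLen = static_cast<uLong>(field.size() * sizeof(float));
    std::vector<Bytef> out(compressBound(srcLen));
    uLongf outLen = out.size();
    compress(out.data(), &outLen, src, srcLen);
    return outLen;
}

int main() {
    const std::size_t n = 1 << 18;             // ~1 MB of float data per field
    std::vector<float> smooth(n), rough(n);
    std::mt19937 rng(42);
    std::normal_distribution<float> noise(0.0f, 1.0f);

    const float two_pi = 6.28318530718f;
    for (std::size_t i = 0; i < n; ++i) {
        float wave = std::sin((i % 4096) * two_pi / 4096.0f);  // one long wave, repeated exactly
        smooth[i] = wave;                 // very regular: compresses well
        rough[i]  = wave + noise(rng);    // short-scale structure added: compresses poorly
    }

    std::printf("smooth field: %lu -> %lu bytes\n",
                (unsigned long)(n * sizeof(float)), (unsigned long)compressed_size(smooth));
    std::printf("rough field:  %lu -> %lu bytes\n",
                (unsigned long)(n * sizeof(float)), (unsigned long)compressed_size(rough));
    return 0;
}
```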