climateprediction.net (CPDN)

Thread 'OpenIFS Discussion'

Message boards : Number crunching : OpenIFS Discussion

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68491 - Posted: 26 Feb 2023, 21:40:04 UTC - in response to Message 68490.  

It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the Intel compiler instead of Intel+Intel and see what happens :)


Is there a reason you can't send out a couple hundred otherwise identical WUs in a few batches and compare/contrast the results?

I don't think there's a shortage of willing CPU cores right now.

Or even just have some people run the binaries manually and send you the results somehow. I've got a range of AMD systems that are mostly bored!


What about compiling with a Ryzen compiler? I didn't even know they existed till I did a search!
ID: 68491

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68492 - Posted: 26 Feb 2023, 21:45:53 UTC - in response to Message 68490.  

It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the Intel compiler instead of Intel+Intel and see what happens :)

Is there a reason you can't send out a couple hundred otherwise identical WUs in a few batches and compare/contrast the results?


What is the reason for this experiment? Do you think there is an error in the GNU compilers, gcc and g++, and that the Intel compiler is free from that error?
What if both compilers gave identical results? Or worse, what if they gave inconsistent, non-identical results? How do you propose to analyze the results of this experiment to resolve such possibilities?

Will the stuff you compile on a Ryzen run on my Intel machine running Linux? I sure would not wish to debug it if it did not work.
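
As a toy illustration (this is not OpenIFS code, and the file name is made up), the same arithmetic can legitimately give slightly different answers once a compiler is allowed to reorder or fuse floating-point operations, which is the kind of difference such an experiment would be probing:

    // toy_sum.cpp -- illustration only, not OpenIFS code.
    // Summing the same numbers in a different order (something an optimising
    // compiler may do under flags like -Ofast/-ffast-math) can change the
    // last bits of a double-precision result.
    #include <cstdio>

    int main() {
        const double big = 1.0e16, small = 1.0;
        double a = (big + small) - big;   // left-to-right: 'small' is lost to rounding
        double b = (big - big) + small;   // reassociated: 'small' survives
        std::printf("a = %.1f, b = %.1f\n", a, b);   // prints a = 0.0, b = 1.0
        return 0;
    }

Two binaries built from identical source by different compilers can both be standards-conforming and still drift apart over thousands of model timesteps, so "different results" would not by itself prove either compiler wrong.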
ID: 68492

alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 68494 - Posted: 27 Feb 2023, 0:35:06 UTC
Last modified: 27 Feb 2023, 0:41:18 UTC

I don't know whether the below is of any diagnostic use, but I'll report it in case...

I just noticed that one of my tasks (for work unit 12215433) had stalled over 24 hours ago, and it appeared that model.exe had finished but for some reason the wrapper (which was still present but quiescent) hadn't dealt with it as the stderr.txt file ended with the following:
  15:37:36 STEP 2952 H=2952:00 +CPU= 20.358

That 15:37:36 was on the 25th, and when I checked my boinc log for around that time I saw the usual flurry of checkpoint messages that seem to accompany the construction and submission of a trickle, but the next scheduler request was for new work, not a trickle, and there was no sign of the files being uploaded. As well as checking the boinc log, I checked the system logs to see if there was anything odd around that time -- there wasn't anything obvious.

Rather than just aborting it I decided to suspend and resume it to see what would happen; I wasn't optimistic that it would recover successfully (as something had obviously broken initially) but it did seem to restart and, of course, it shut down more or less immediately (nothing more to do!). This time, it managed to upload the files and flesh out the end of stderr.txt; unfortunately it then reported "double free or corruption (!prev)", so no luck... ((!prev) instead of the seemingly more usual (out) --- an effect of not really having anything to do?)

I see that a retry has gone out promptly, and I suspect it'll run to completion without problems -- ah, well...

Cheers - Al.
ID: 68494

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68504 - Posted: 27 Feb 2023, 13:55:15 UTC - in response to Message 68494.  

Yes, thanks. This behaviour has been noted and reported by others. It seems to be something going wrong when the task reports to the client that it's finished. For some unknown reason, it appears to get stuck in the client. Shutting down & restarting the client has been successful at getting the task to complete. Other projects, not just CPDN, have observed this behaviour according to other forum posts I've read. It doesn't appear to happen very often. I was going to look at the code to make sure we tidy everything up in terms of closing files etc., to see if that might cure it.
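
Roughly the shape of tidy-up I mean (a sketch only, using the standard BOINC API calls, not the actual CPDN wrapper source; the progress file here is a made-up example):

    // Sketch of an orderly finish for a BOINC app: flush and close everything
    // the task opened, then hand a single exit status to the client.
    #include <cstdio>
    #include "boinc_api.h"

    int main() {
        boinc_init();
        FILE* progress = std::fopen("progress.txt", "w");   // hypothetical task output

        // ... run / monitor the model here ...

        if (progress) {                  // close before reporting completion so no
            std::fflush(progress);       // buffered output or open handle survives
            std::fclose(progress);
        }
        boinc_finish(0);                 // reports completion to the client; never returns
    }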

I don't know whether the below is of any diagnostic use, but I'll report it in case...

I just noticed that one of my tasks (for work unit 12215433) had stalled over 24 hours ago, and it appeared that model.exe had finished but for some reason the wrapper (which was still present but quiescent) hadn't dealt with it as the stderr.txt file ended with the following:
  15:37:36 STEP 2952 H=2952:00 +CPU= 20.358

That 15:37:36 was on the 25th, and when I checked my boinc log for around that time I saw the usual flurry of checkpoint messages that seem to accompany the construction and submission of a trickle, but the next scheduler request was for new work, not a trickle, and there was no sign of the files being uploaded. As well as checking the boinc log, I checked the system logs to see if there was anything odd around that time -- there wasn't anything obvious.

Rather than just aborting it I decided to suspend and resume it to see what would happen; I wasn't optimistic that it would recover successfully (as something had obviously broken initially) but it did seem to restart and, of course, it shut down more or less immediately (nothing more to do!). This time, it managed to upload the files and flesh out the end of stderr.txt; unfortunately it then reported "double free or corruption (!prev)", so no luck... ((!prev) instead of the seemingly more usual (out) --- an effect of not really having anything to do?)

I see that a retry has gone out promptly, and I suspect it'll run to completion without problems -- ah, well...

Cheers - Al.
ID: 68504

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68518 - Posted: 1 Mar 2023, 11:01:21 UTC

Just got this as a resend that appears to have finished but no stderr on the original task.
ID: 68518

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68519 - Posted: 1 Mar 2023, 12:35:02 UTC - in response to Message 68518.  

Just got this as a resend that appears to have finished but no stderr on the original task.
Haha. That was one of mine. I switched the downloaded oifs_*x86_64-pc-linux-gnu control executable to my development version so I could test it 'live' for this batch. But I made a mistake for this task. Glad it's in safe hands! :D

Pretty confident the 'double corruption' problem was in the trickle code, which has now been rewritten. Will need a big test batch to confirm though.
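
For the curious, a trickle-up is just a short text message the app hands to the client; the kind of thing the rewrite tidies up looks roughly like this (an illustrative sketch, not the real CPDN code; the message fields and the "openifs" variety tag are made up):

    // Illustrative sketch, not the actual trickle code. Letting std::string own
    // the buffer means it is freed exactly once, which is the simplest way to
    // avoid "double free or corruption" style bugs in this path.
    #include <string>
    #include "boinc_api.h"

    void send_progress_trickle(int timestep, double cpu_time) {
        std::string msg = "<trickle>\n"
                          "  <timestep>" + std::to_string(timestep) + "</timestep>\n"
                          "  <cpu_time>" + std::to_string(cpu_time) + "</cpu_time>\n"
                          "</trickle>\n";
        // The client-side API takes plain char pointers but does not keep them,
        // so nothing here needs freeing by hand.
        boinc_send_trickle_up(const_cast<char*>("openifs"),
                              const_cast<char*>(msg.c_str()));
    }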
---
CPDN Visiting Scientist
ID: 68519

JagDoc
Joined: 21 Dec 22
Posts: 5
Credit: 7,825,862
RAC: 5,485
Message 68530 - Posted: 1 Mar 2023, 18:09:09 UTC
Last modified: 1 Mar 2023, 18:12:15 UTC

One of my hosts has some WUs with a different error.
https://www.cpdn.org/show_host_detail.php?hostid=1538124

It runs 2 x IFS tasks and 2 x ODLK1 tasks.

htop shows 3 x IFS tasks running, 2 of them in one slot; how can that be?
   PID USER      PRI  NI  VIRT   RES   SHR S CPU%▽MEM%   TIME+  Command
  30914 boinc      39  19 4230M 3782M 33456 R 100. 11.9 12h47:38 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  31490 boinc      39  19 2782M 2585M 33456 R 100.  8.1  8h39:56 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  30791 boinc      39  19 4269M 3807M 33456 R 99.7 12.0 14h02:11 /var/lib/boinc-client/slots/0/oifs_43r3_model.exe


This is what top shows:
top - 19:10:15 up 6 days,  7:58,  2 users,  load average: 3.02, 3.14, 3.69
Tasks: 217 total,   4 running, 213 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  3.0 sy, 72.1 ni, 24.8 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  31848.8 total,  13202.1 free,  11289.3 used,   7357.4 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  20045.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  30791 boinc     39  19 3862844   3.0g  33456 R 100.0   9.6 859:50.07 oifs_43r3_model
  31490 boinc     39  19 4331988   3.6g  33456 R 100.0  11.7 537:34.95 oifs_43r3_model
  30914 boinc     39  19 4331992   3.6g  33456 R 100.0  11.7 785:16.60 oifs_43r3_model
ID: 68530

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68531 - Posted: 1 Mar 2023, 18:53:05 UTC
Last modified: 1 Mar 2023, 18:55:34 UTC

Glenn can tell you more about this. It is a problem with tasks suspending and restarting, I think. This may be one of the issues that has been sorted ready for the next batch. Glenn is currently on Zoom at the BOINC workshop.
ID: 68531

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68533 - Posted: 1 Mar 2023, 21:39:05 UTC - in response to Message 68530.  

htop shows 3 x IFS tasks running, 2 of them in one slot; how can that be?
   PID USER      PRI  NI  VIRT   RES   SHR S CPU%▽MEM%   TIME+  Command
  30914 boinc      39  19 4230M 3782M 33456 R 100. 11.9 12h47:38 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  31490 boinc      39  19 2782M 2585M 33456 R 100.  8.1  8h39:56 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  30791 boinc      39  19 4269M 3807M 33456 R 99.7 12.0 14h02:11 /var/lib/boinc-client/slots/0/oifs_43r3_model.exe
This happens because of the 'memory corruption' problem oft reported here. Aside from the boinc client, there are two processes involved in the OpenIFS tasks. One is the model itself (oifs_43r3_model.exe), the other is a controlling process (oifs_43r3_1.21_x86_64-linux-gnu-pc) which monitors the model and reports back to the client. It's this second process that has the memory fault which kills it. Normally, when this process dies it *should* also kill the model, but for some reason, on odd occasions, it leaves the model running. Eventually the boinc client spots a rogue process is still running and kills it. Unfortunately by then, two models in the same slot will have corrupted some of the files and the task will eventually fail.

If you see this happen, the best thing to do is to shut down the boinc client, then restart it. That will clear out any rogue processes.

I think we have solved this problem with the latest code, which will go out to production once tested. Hope that's understandable.
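
For anyone curious what "the controller should also kill the model" looks like in practice, this is the usual Linux pattern (a general sketch, not the actual CPDN wrapper code): start the model in its own process group, ask the kernel to signal it if the parent dies, and signal the whole group on the way out.

    // Sketch of the "child must not outlive its parent" pattern on Linux.
    // Not the CPDN wrapper source -- just the general technique.
    #include <signal.h>
    #include <sys/prctl.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        pid_t child = fork();
        if (child == 0) {
            prctl(PR_SET_PDEATHSIG, SIGKILL);   // kernel kills us if the parent dies
            setpgid(0, 0);                      // own process group, so the parent can
                                                // signal the model and anything it spawns
            execl("./oifs_43r3_model.exe", "oifs_43r3_model.exe", (char*)nullptr);
            _exit(127);                         // only reached if exec failed
        }

        // Parent (the controlling process): monitor the model, send trickles, etc.
        int status = 0;
        waitpid(child, &status, 0);

        kill(-child, SIGKILL);   // belt and braces: clear the child's whole group on exit
        return 0;
    }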
ID: 68533

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68535 - Posted: 1 Mar 2023, 22:03:56 UTC - in response to Message 68519.  

Just got this as a resend that appears to have finished but no stderr on the original task.
Haha. That was one of mine. I switched the downloaded oifs_*x86_64-pc-linux-gnu control executable to my development version so I could test it 'live' for this batch. But I made a mistake for this task. Glad it's in safe hands! :D

Pretty confident the 'double corruption' problem was in the trickle code, which has now been rewritten. Will need a big test batch to confirm though.
And completed successfully.
ID: 68535

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68563 - Posted: 6 Mar 2023, 8:42:11 UTC
Last modified: 6 Mar 2023, 10:42:51 UTC

#993 is looking good.
Success: 1803 (90%)
Fails: 418 (21%)
Hard Fail: 5 (0%)
Running: 192 (10%)
Especially when you consider that the 21% of fails includes the model failures due to the physics and the ones that fail because users run too many tasks for the amount of RAM they have. Assuming Glenn is right about having sorted out the 'double corruption' errors (makes it sound like politics), the next lot should be better still.
ID: 68563

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68615 - Posted: 21 Mar 2023, 19:35:41 UTC - in response to Message 68614.  

Are there any more batches of work from OpenIFS to be released, or is that the lot for now?
Nothing new in testing, and nothing has appeared on the moderators' email list. (New work doesn't usually get mentioned there unless someone wants us to post something about it anyway.) My WAH2 Windows Hadley models in testing are less than halfway through their 50 days, so I don't expect main-site work from them to arrive for a while yet. I check for signs of new work on both the testing and main sites about three times a week and post when something looks hopeful. Glenn is more likely to know about new OIFS work than I am, but he is a volunteer programmer and if he is busy with other things he may not know what the current state of play is. Sorry I am not able to say more at the moment.
ID: 68615

Drago75
Joined: 8 Jan 22
Posts: 9
Credit: 1,780,471
RAC: 3,152
Message 68617 - Posted: 22 Mar 2023, 12:27:08 UTC - in response to Message 68616.  
Last modified: 22 Mar 2023, 12:27:32 UTC

This has probably been raised on a number of occasions but it still puzzles me, so I would like to ask it again. The project has 45,600 active work units which don't seem to ever finish. Over the past few months I have noticed that the majority of work is being completed within 10-14 days. Wouldn't it be a good idea to reduce their expiry date to less than 4 weeks? Maybe even to 14 days? Those WUs run for approx. 18-24 hours and they don't seem to like being paused. The only real way to run them is either continuously or by interrupting them by sending the PC to standby. Either way, once started they should finish within days. When I look at the WAH units, they still allow one year to be completed. If a calculation run takes that long, it isn't any faster than the real weather outside. So if the project's aim is to predict the weather for the future, don't the scientists need the data as quickly as possible? There seem to be a lot of crunchers here who would be willing to process more data but don't get enough work.
ID: 68617

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68618 - Posted: 22 Mar 2023, 12:32:15 UTC - in response to Message 68614.  

Are there any more batches of work from OpenIFS to be released, or is that the lot for now?
That's it for now. There are no OpenIFS batches planned for the near future; the scientists need time to look at the data collected from the previous ones, and then there might be some more. There may also be some testing batches in due course, but don't hold your breath.
---
CPDN Visiting Scientist
ID: 68618

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68619 - Posted: 22 Mar 2023, 12:38:43 UTC - in response to Message 68617.  

This has probably been raised on a number of occasions but it still puzzles me, so I would like to ask it again. The project has 45,600 active work units which don't seem to ever finish. Over the past few months I have noticed that the majority of work is being completed within 10-14 days. Wouldn't it be a good idea to reduce their expiry date to less than 4 weeks? Maybe even to 14 days? Those WUs run for approx. 18-24 hours and they don't seem to like being paused. The only real way to run them is either continuously or by interrupting them by sending the PC to standby. Either way, once started they should finish within days. When I look at the WAH units, they still allow one year to be completed. If a calculation run takes that long, it isn't any faster than the real weather outside. So if the project's aim is to predict the weather for the future, don't the scientists need the data as quickly as possible? There seem to be a lot of crunchers here who would be willing to process more data but don't get enough work.
It's a leftover from the early days of CPDN, when model runs used to take 6 months or so (I think it was). It's just one thing they haven't got around to changing for the Hadley models. I changed it for OpenIFS, though we got caught out when the server went down and tasks started timing out. But, yes, if you get one of the old reruns from a workunit that's been around for over ~4 months or so, I'd abort it. The scientist would have got the data by then and moved on, I suspect.

I'll bring it up again when I next talk to them. As has been said many times, they are a very small team and little things like this have to make way for bigger issues.
---
CPDN Visiting Scientist
ID: 68619

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68620 - Posted: 22 Mar 2023, 13:08:05 UTC - in response to Message 68619.  

I'll bring it up again when I next talk to them. As has been said many times, they are a very small team and little things like this have to make way for bigger issues.
Thanks Glenn, though it's worth noting that even on my Ryzen 7 3700X the four testing tasks I am running under WINE are going to take a few hours over 50 days to complete. (WAH2 25 km grid SE Asia, so covering a much greater area than the ANZ regional models.)
ID: 68620

SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 68621 - Posted: 22 Mar 2023, 15:37:18 UTC - in response to Message 68617.  

Wouldn't it be a good idea to reduce their expiry date to less than 4 weeks? Maybe even to 14 days? Those WUs run for approx. 18-24 hours and they don't seem to like being paused. The only real way to run them is either continuously or by interrupting them by sending the PC to standby.


Standby works quite nicely. I have a mild preference that they not be shorter, just because I do all my crunching on solar, off grid, and we get weeks without a lot of sun, but... if it's useful to the project to be shorter, fine. I would request they not be shortened for no good reason, though. And, as noted, the whole "server outage" really fouled up a lot of stuff, partly due to the lower time to return.
ID: 68621

zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 68626 - Posted: 23 Mar 2023, 13:58:32 UTC - in response to Message 68617.  

The project has 45.600 active work units which don't seem to finish ever.


Don't believe the numbers on the server status page. If you add up all the tasks in progress for the individual projects at the bottom of the page (25,646), it is not even close to the total number in the upper right (45,600). I don't believe either of those numbers. If I had to guess, there are no more than a few thousand tasks actually in progress.
ID: 68626

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 68627 - Posted: 23 Mar 2023, 15:03:53 UTC - in response to Message 68626.  
Last modified: 23 Mar 2023, 16:14:04 UTC

Don't believe the numbers on the server status page. If you add up all the tasks in progress for the individual projects at the bottom of the page (25,646), it is not even close to the total number in the upper right (45,600). I don't believe either of those numbers. If I had to guess, there are no more than a few thousand tasks actually in progress.
Sometimes I think the only correct number is "Tasks ready to send = 0". Though I doubt the numbers for the OIFS tasks are very far out, if at all.

Edit: The trickles on tasks on the testing site are now showing correctly, so there is progress. I expect Andy will let us know, or post himself, when the upgrade of the server software here is going to happen. There was nothing till after it had happened on testing, but there were no tasks waiting to go out at the time and it wouldn't have been more than a couple of hours at most.
ID: 68627

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69766 - Posted: 11 Oct 2023, 13:51:31 UTC

OpenIFS Perturbed Surface batches

Volunteers might be interested in this article that appeared in the ECMWF Newsletter, based on the OpenIFS Perturbed Surface batches earlier this year.

https://www.ecmwf.int/en/newsletter/175/news/openifshome-using-land-surface-uncertainties-and-large-ensembles-seasonal

This appears in the list of CPDN publications but the batch information is missing due to a minor technical hitch, which will be fixed. The batches in question were: 944, 945, 946, 947, 990.
---
CPDN Visiting Scientist
ID: 69766