climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Message boards : Number crunching : OpenIFS Discussion

Previous · 1 . . . 4 · 5 · 6 · 7 · 8 · 9 · 10 . . . 32 · Next

Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66733 - Posted: 2 Dec 2022, 20:00:57 UTC

A few more errors
These ones are different:
https://www.cpdn.org/result.php?resultid=22246542
https://www.cpdn.org/result.php?resultid=22248013
https://www.cpdn.org/result.php?resultid=22247983
No logs at all.

Noted activities:
I ran kernel updates on the rig and rebooted. As a precaution I shut down the BOINC client first and gave it time to clean things up before restarting, but the tasks still errored out.
The error itself I can understand, as it's what CPDN tasks do on a reboot, but the lack of any stderr contents is odd.
ID: 66733
cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,051,051
RAC: 37,629
Message 66734 - Posted: 3 Dec 2022, 1:11:13 UTC

I noticed an issue that I don't think has been reported yet.
One of my machines has run 46 OpenIFS jobs: 12 ended with computation errors and the rest appear to have completed successfully. After the BOINC client finished all the jobs, there were still three oifs processes running, but no master.exe processes.

$ ps  -flU boinc
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY      TIME CMD
4 S boinc       2449       1  0  99  19 - 76998 -      Nov29 ?    00:16:34 /usr/bin/boinc --gui_rpc_port 31418
0 S boinc    1657533    2449  0  99  19 - 35465 -      Dec01 ?    00:01:18 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1734 946 12164823 123 oifs_43r3_ps 1
0 S boinc    2001745    2449  0  99  19 - 35464 -      Nov30 ?    00:02:00 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1833 946 12164922 123 oifs_43r3_ps 1
0 S boinc    2147924    2449  0  99  19 - 35465 -      Nov30 ?    00:02:15 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1942 946 12165031 123 oifs_43r3_ps 1

Two of the slots directories still have 6 files in them; another has more than 300. The rest of the slots are empty.
In the projects/climateprediction.net directory there are 9 directories with names like oifs_43r3_ps_12163845 that appear to be job folders which did not get deleted after the job finished. I have the BOINC directory archived if the contents are of interest.
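For anyone wanting to check their own client for the same leftovers, a quick survey along these lines should work. This is only a sketch: the /var/lib/boinc data directory is an assumption based on a typical Linux package install, and the oifs_43r3_ps_* pattern is taken from the directory names quoted above.

```shell
#!/bin/sh
# Count files left behind in each BOINC slot directory and list any
# orphaned OpenIFS job folders. Adjust BOINC_DIR for your install.
BOINC_DIR=/var/lib/boinc
for d in "$BOINC_DIR"/slots/*/; do
  n=$(find "$d" -type f | wc -l)
  [ "$n" -gt 0 ] && echo "$d: $n files remaining"
done
# Job folders that should have been deleted when their task finished
ls -d "$BOINC_DIR"/projects/climateprediction.net/oifs_43r3_ps_* 2>/dev/null
```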

The computer is a 12-core 5900X with 64 GB of RAM. The oifs jobs were run 8 at a time. I never saw less than 15 GB of free RAM while 8 were running, though of course I wasn't watching most of the time.
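Free RAM during a run can be sampled without sitting and watching. A minimal watcher, assuming a Linux host (it reads MemAvailable straight from /proc/meminfo; the 60-second interval and log name are arbitrary choices):

```shell
#!/bin/sh
# Append a timestamped "available memory" sample every 60 seconds.
# Stop with Ctrl-C; inspect mem_watch.log afterwards for the minimum.
while true; do
  printf '%s ' "$(date +'%F %T')"
  awk '/^MemAvailable:/ {printf "available: %.1f GiB\n", $2/1048576}' /proc/meminfo
  sleep 60
done >> mem_watch.log
```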

Here are all the error jobs:
https://www.cpdn.org/result.php?resultid=22248825 double free or corruption
https://www.cpdn.org/result.php?resultid=22248783 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164823
https://www.cpdn.org/result.php?resultid=22248662 double free or corruption
https://www.cpdn.org/result.php?resultid=22246507 double free or corruption
https://www.cpdn.org/result.php?resultid=22248118 double free or corruption
https://www.cpdn.org/result.php?resultid=22246441 double free or corruption
https://www.cpdn.org/result.php?resultid=22246293 double free or corruption
https://www.cpdn.org/result.php?resultid=22247041 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12165031
https://www.cpdn.org/result.php?resultid=22246587 double free or corruption
https://www.cpdn.org/result.php?resultid=22248533 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164922
https://www.cpdn.org/result.php?resultid=22248053 double free or corruption
https://www.cpdn.org/result.php?resultid=22246923 double free or corruption

The three jobs that were "aborted" look like the same three processes that are still running. I did not abort any manually.
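The leftover processes can be tied to their job numbers directly from the command line. In the ps listing above, the job number (e.g. 12164823) appears as the fifth program argument; the field position is inferred from that listing, not from any wrapper documentation, so treat this as a sketch:

```shell
#!/bin/sh
# Print the PID and CPDN job number of every oifs worker still alive.
# With pid prepended, the job number is the 7th whitespace field.
ps -o pid= -o args= -U boinc | awk '/oifs_43r3/ {print "pid " $1 ": job " $7}'
```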
ID: 66734
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66735 - Posted: 3 Dec 2022, 5:48:50 UTC

Three more. No idea what happened, as there's nothing in the stderr:
https://www.cpdn.org/result.php?resultid=22247956
https://www.cpdn.org/result.php?resultid=22246962
https://www.cpdn.org/result.php?resultid=22247713

I'm running three of these at a time alongside LHC ATLAS (5 tasks of 4 cores each) and SRBase on the GPU.
There don't appear to be any extra master.exe instances running, and as far as I can see I'm not running out of RAM (64 GB).
ID: 66735
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66736 - Posted: 3 Dec 2022, 10:23:55 UTC - in response to Message 66732.  

xii5ku wrote:
About OpenIFS failure modes:
All of [my] error results come from only one out of three hosts. All three hosts have the same hardware, OS, boinc client configs, and same split workload of OpenIFS and PrimeGrid llrSGS. The one host with errors was the only one on which I suspended all tasks to disk, rebooted the host, and resumed the tasks. I strongly believe that all of these 54 tasks went through this suspend–resume cycle. [...] The stderr.txts of these tasks are of two types: One type contains just "--". The other shows that the last one to five zip files were missing.

The host with errors has reported only successful tasks for a while now, which is another hint that the error episode was just the aftermath of the suspend-resume cycle.

I now got a single new error:
    terminate called after throwing an instance of 'std::invalid_argument'
    what(): stoi

result ID 22248133, workunit name oifs_43r3_ps_3007_2021050100_123_947_12166096

This happened at ~50% computing progress, whereas all of my previous suspend/resume related errors happened at ~100% progress.

ID: 66736
Conan

Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 66737 - Posted: 3 Dec 2022, 10:26:20 UTC - in response to Message 66718.  

Just downloaded a resend of a Work Unit that failed due to an error.

This Task 22245903

It failed due to running longer than 5 minutes after the work unit had finished.

The WU was run by mikey and, apart from continuing to run after finishing, seemed to have completed successfully after over 2 days of run time.

That run time seems overly long for a Ryzen, but it did complete.

It is now running as Task 22249047 on my Ryzen computer.

Will see how it runs for me.

Conan


Completed successfully after 16 1/2 hours.

Conan
ID: 66737
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66738 - Posted: 3 Dec 2022, 10:58:56 UTC

An observation ... I just sat here and watched a unit upload a trickle. Waited until it was well and truly done. Checked all my work units listed as "in progress" and ... none of them have trickles registered.
ID: 66738
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,845,098
RAC: 19,856
Message 66739 - Posted: 3 Dec 2022, 12:03:56 UTC - in response to Message 66738.  

OpenIFS trickles don't show up on the website like they do for the Hadley models. Progress can be viewed in the stderr.txt file in the appropriate slots directory for a given task. That file gets uploaded and shows up on the website once the task reports as completed.
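A quick way to spot-check that progress from the command line; the /var/lib/boinc data directory below is an assumption for a typical Linux install, so adjust the path to your own setup:

```shell
#!/bin/sh
# Show the last few stderr.txt lines for every active slot, which is
# where OpenIFS task progress appears before the task reports.
for f in /var/lib/boinc/slots/*/stderr.txt; do
  [ -f "$f" ] || continue
  echo "== $f =="
  tail -n 3 "$f"
done
```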
ID: 66739
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 66741 - Posted: 3 Dec 2022, 12:34:40 UTC

Technically, a 'trickle' is not the same as an 'upload'. Uploads are processed immediately they're ready, but trickles are bunched and reported to the server at the end of the 'server backoff' delay - possibly up to an hour later. Both processes can be seen, with timestamps, in the Event Log.

Having said that, none of my completed (and valid) tasks are showing any trickles yet, either. I think that trickles for other task types are processed by one of the innumerable scripts which run in the background on the server. That may have failed; it may require extending for the IFS tasks; or it may be scheduled to run tomorrow, as part of the credit updating process. Let's take another look on Monday.
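Both kinds of event can also be pulled from the client without opening the Manager. A sketch using boinccmd, assuming the client accepts local RPC connections; the exact message wording varies between client versions, so the grep pattern is a guess:

```shell
#!/bin/sh
# List recent client messages mentioning trickles or uploads.
# boinccmd must be able to reach the running client (local RPC).
boinccmd --get_messages | grep -Ei 'trickle|upload' | tail -n 20
```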
ID: 66741
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 66742 - Posted: 3 Dec 2022, 14:04:48 UTC - in response to Message 66741.  

Technically, a 'trickle' is not the same as an 'upload'. Uploads are processed immediately they're ready, but trickles are bunched and reported to the server at the end of the 'server backoff' delay - possibly up to an hour later. Both processes can be seen, with timestamps, in the Event Log.

Having said that, none of my completed (and valid) tasks are showing any trickles yet, either. I think that trickles for other task types are processed by one of the innumerable scripts which run in the background on the server. That may have failed; it may require extending for the IFS tasks; or it may be scheduled to run tomorrow, as part of the credit updating process. Let's take another look on Monday.


I had some HadSM4s running at the start of the week. They are showing trickles but, I assume, won't show credit till tomorrow. My last OpenIFS tasks on testing did show trickles. Often in the past, trickles have not shown up on the website but credit has still been awarded, so I suspect a script needs the new model types added before the trickles will show. Whether credit will show tomorrow, for those who care about such things, who knows? I will put a note on the Trello card for these batches tomorrow, when I will know whether credit is being awarded or not. Andy is on the Trello card, so he will see it there fairly quickly.
ID: 66742
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 69,923,532
RAC: 8,011
Message 66743 - Posted: 3 Dec 2022, 14:27:36 UTC - in response to Message 66732.  

About OpenIFS failure modes:
The one host with errors was the only one on which I suspended all tasks to disk, rebooted the host, and resumed the tasks. I strongly believe that all of these 54 tasks went through this suspend–resume.[…]
The host with errors has reported only successful tasks for a while now, which is another hint that the error episode was just the aftermath of the suspend-resume cycle.
I think I had the same error. The computer was shut down and restarted. Although BOINCView reported progress, no trickle/upload file was created.
https://www.cpdn.org/result.php?resultid=22248984 As you can see, there is a difference between CPU time and execution time.

Then there was a WU with an exit code 9 error, https://www.cpdn.org/result.php?resultid=22248970:
9 (0x00000009) Unknown error code

Hope this helps.
ID: 66743
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66748 - Posted: 3 Dec 2022, 17:56:24 UTC - in response to Message 66743.  

'Fraid I don't have time to read all these posts right now as I'm busy looking into the various problems. I think they have all been pretty much captured by posts here. And the detailed forensic reports are extremely helpful, so thanks very much for that. I have fed the list back to CPDN.

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shut down and restarted. However, the controlling wrapper code then appears to miscalculate where the model is in the forecast, and this leads to the 'missing file' problem that's been reported.

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)
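For anyone wanting to try that option without going through BOINC Manager, it maps to the standard client preference leave_apps_in_memory. A minimal global_prefs_override.xml fragment (placed in the BOINC data directory; check your existing override file before replacing it):

```xml
<!-- global_prefs_override.xml: keep suspended (non-GPU) tasks in memory -->
<global_preferences>
   <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

Then have the client re-read it with `boinccmd --read_global_prefs_override`.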
ID: 66748
PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,559,938
RAC: 89,548
Message 66753 - Posted: 3 Dec 2022, 19:34:54 UTC - in response to Message 66748.  

Had a task fail after 12 hours with error: free(): unaligned chunk detected in tcache 2

Have completed 5 okay so far, running the same additional project all the time, so no change in memory usage. No suspending of work (manually or by BOINC).

The master process for the aborted task was still running, so I terminated it; it showed the same slot as the new task that had started up.
ID: 66753
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,347,450
RAC: 10,508
Message 66755 - Posted: 3 Dec 2022, 22:44:31 UTC - in response to Message 66748.  

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shut down and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)
Thanks Glenn. After a lot of hit-and-miss tasks, the last six 'uninterrupted' OpenIFS tasks here have completed!
ID: 66755
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66756 - Posted: 3 Dec 2022, 22:47:58 UTC - in response to Message 66732.  

All three of the hosts which I have active at OpenIFS have plenty of RAM, and are set to "leave non-GPU tasks in memory while suspended".¹ That's possibly a factor why they run error-free.


Me too.

All 21 of the OpenIFS tasks I have received have now completed successfully. They ran three at a time when I had enough tasks available (which was most of the time); by the end, the last one was running all by itself.

I, too, have plenty of RAM (64 GB), and "leave non-GPU tasks in memory while suspended" is enabled.
ID: 66756
Vato

Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 66757 - Posted: 3 Dec 2022, 22:51:09 UTC - in response to Message 66748.  

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)


I deliberately suspended a task for 8+ hours with the 'keep non-GPU tasks in memory' option enabled, and the task completed and returned seemingly cleanly and successfully.
ID: 66757
Vato

Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 66763 - Posted: 4 Dec 2022, 10:16:31 UTC

Credit granted to all my tasks - 2,353 each
ID: 66763
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 66765 - Posted: 4 Dec 2022, 10:48:19 UTC - in response to Message 66763.  

Credit granted to all my tasks - 2,353 each
Yes, but still no trickles visible. They are visible on the dev site, so I think it's a scripting issue on the main server.
ID: 66765
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66766 - Posted: 4 Dec 2022, 11:58:08 UTC - in response to Message 66763.  

Credit granted to all my tasks - 2,353 each


Yes, but too late. My boincmgr Statistics screen shows a big drop last night, and the Projects screen shows about a 50% drop since my previous report. Sad, since I completed 21 or so of the OpenIFS tasks successfully. Sigh!
ID: 66766
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 66767 - Posted: 4 Dec 2022, 11:58:48 UTC - in response to Message 66765.  

Credit granted to all my tasks - 2,353 each
Yes, but still no trickles visible. They are visible on the dev site, so I think it's a scripting issue on the main server.
Message left for Andy on Trello board.
ID: 66767
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,845,098
RAC: 19,856
Message 66777 - Posted: 5 Dec 2022, 2:43:16 UTC - in response to Message 66748.  

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shut down and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)

I wonder whether the Hadley models' restart issues are also due to a similar wrapper problem, or whether it's an issue with the models themselves?
ID: 66777

©2024 cpdn.org