Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
First Africa task from batch 992 failed: Task 22314299. Reported on the Trello board. |
Send message Joined: 4 Oct 15 Posts: 34 Credit: 9,075,151 RAC: 374 |
Also got two, both _2, both failed with the same error |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,855,867 RAC: 19,834 |
"Is that the name of the tasks I use in app_config.xml?"
Yes, that's the name of the app for this latest OIFS batch that's used in app_config. However, the last 2 (l255 & l319) are not valid app names. Those are the ones Glenn would use for a test run, if there's enough interest, but it'll be outside of BOINC so app_config won't be read. Whether those apps will ever come to BOINC, or with those names, we don't know yet. It could be that the current 3 apps will be able to run both lower and higher resolution models. If you leave them there, not commented out, BOINC will be prompting you with error messages as it does whenever there's something invalid with the app_config file. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
We have been urged to move a thread from News to here, so here one is:
I can easily remove the (l255 & l319) parts from my app_config.xml file, if those warnings get to be too annoying. (I do not know how to put comments into an xml file.) I prefer that they be officially used (when the tasks are available) because I certainly wish to control the number of each type of task that uses radically different amounts of RAM. I do not know any other way of doing that. (Even though I should be getting 64 GBytes more DDR4 RDIMM 2933MHz ECC RAM tomorrow afternoon.) |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Replying to a post in News ...
"Also, does anyone know how to stop and start the Boinc client in Linux?"
In Red Hat Enterprise Linux release 8.7 (Ootpa), background processes are started and stopped with systemd. Other releases of Linux may still use the old way to start and stop background processes, using the scripts in /etc/rc.d, especially init.d and one of the rc?.d directories (most likely rc5.d, but I have not run those releases in many years, so I am a bit hazy as to the details).
/etc/rc.d]$ ls -l
total 4
drwxr-xr-x. 2 root root 37 Nov 21 09:24 init.d
drwxr-xr-x. 2 root root 6 Aug 10 2022 rc0.d
drwxr-xr-x. 2 root root 6 Aug 10 2022 rc1.d
drwxr-xr-x. 2 root root 6 Aug 10 2022 rc2.d
drwxr-xr-x. 2 root root 6 Aug 10 2022 rc3.d
drwxr-xr-x. 2 root root 6 Aug 10 2022 rc4.d
drwxr-xr-x. 2 root root 6 Aug 10 2022 rc5.d
drwxr-xr-x. 2 root root 6 Aug 10 2022 rc6.d
-rw-r--r--. 1 root root 474 Nov 21 09:24 rc.local
If it is set up right, it will automatically start up the boinc-client when the system is booted up, and shut it down when the system is taken down or rebooted.
To stop it, be the root user and type
systemctl stop boinc-client
I suppose you can start it by typing systemctl start boinc-client, but I have never tried it. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
"To stop it, be the root user and type systemctl stop boinc-client"
Works in current releases of Ubuntu, Debian and derivatives. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,855,867 RAC: 19,834 |
"I can easily remove the (l255 & l319) parts from my app_config.xml file, if those warnings get to be too annoying."
When/if those apps come out, we can ask Glenn to post the names ahead of time so as to be prepared. The app names will also be found in client_state.xml. The way to have a section of an xml file be skipped is by surrounding it with <!-- -->. It doesn't have to be line by line; one set of those can surround many consecutive lines of code.
<!--code to be skipped or comment--> |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,709,934 RAC: 9,107 |
"The way to have a section of an xml file be skipped is by surrounding it with <!-- -->."
Be careful with that - try with a simple comment, and check for error messages when it's read in. The boinc client doesn't use a fully-featured XML parser - it uses its own simplified code, only implementing the features it needs. |
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,789,280 RAC: 44,109 |
<!-- --> It works fine; I use it often in cc_config.xml and app_config.xml.
"The way to have a section of an xml file be skipped is by surrounding it with <!-- -->."
"Be careful with that - try with a simple comment, and check for error messages when it's read in."
Supporting BOINC, a great concept ! |
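A quick way to sanity-check a commented-out section before handing the file to BOINC is to parse it with a standard XML library. The sketch below uses Python's `xml.etree.ElementTree` and lists only the `<app>` entries that remain active. The app names are hypothetical placeholders (the real ones are in client_state.xml), and, as noted above, the BOINC client uses its own simplified parser, so this only confirms the file is well-formed XML; the client's event log is still the final word.

```python
import xml.etree.ElementTree as ET

# Hypothetical app names for illustration only; use the names from client_state.xml.
APP_CONFIG = """<app_config>
  <app>
    <name>example_app_a</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <!--
  <app>
    <name>example_app_b</name>
    <max_concurrent>1</max_concurrent>
  </app>
  -->
</app_config>
"""


def active_app_names(xml_text):
    """Return the <name> of every <app> element that is not commented out."""
    root = ET.fromstring(xml_text)  # the default parser discards comments
    return [app.findtext("name") for app in root.findall("app")]


if __name__ == "__main__":
    print(active_app_names(APP_CONFIG))
```

Only the uncommented entry is reported, which matches the behaviour described in the posts: one `<!-- -->` pair can hide a whole multi-line section.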
Send message Joined: 4 Oct 19 Posts: 15 Credit: 9,174,915 RAC: 3,722 |
At PrimeGrid, the names used in app_config.xml are displayed in the headers of the listing on the apps.php page, which makes a useful reference page. |
Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,608,003 RAC: 868 |
Additional information for all that would like to know: required OS: Linux running on an AMD x86_64 or Intel EM64T CPU |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Two of these failed for me. I was the third loser in each case. I think they are still failing right at the beginning.
Run time 1 min 6 sec
CPU time 1 sec
Validate state Invalid
Task 22314647 Name oifs_43r3_000t_2019110100_123_992_12213316_2 Workunit 12213316
Task 22314608 Name oifs_43r3_001m_2019110100_123_992_12213345_1 Workunit 12213345
Here is part of the end of stderr, with some "boring" parts deleted.
The child process has been launched with process id: 346005
Executing the command: /var/lib/boinc/slots/0/oifs_43r3_model.exe
[EC_DRHOOK:hostname:myproc:omptid:pid:unixtid] [YYYYMMDD:HHMMSS:epoch:walltime] [function@file:lineno] -- Max OpenMP threads = 1
/drhook.c:1973] New signal handler 'signal_drhook' for signal#31 (SIGSYS) at 0x810730 (old at (nil))
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223619:1677036979:0.000] [catch_signals@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1111] DR_HOOK_CATCH_SIGNALS=<undef>
ABORT! 1 SUECOZC:ERROR OPENING FILE ECOZC
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3843] 1209 MB (maxheap), 861 MB (maxrss), 0 MB (maxstack)
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : MASTER
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : CNT0
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : CNT1
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : CNT2
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : CNT3
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : CNT4
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : UPDTIM
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.271] [c_drhook_print_@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:3897] : SUECOZC
SDL_TRACEBACK: Calling INTEL_TRBK, THRD = 1
Process 1 thread 1 calling linux_trbk from intel_trbk()
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:240] Backtrace(s) for program '/var/lib/boinc/slots/0/oifs_43r3_model.exe' : sigcontextptr=0x7ffe27d567e0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:277] Backtrace (size = 15) with addr2line-cmd
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:302] /usr/bin/addr2line -fs -e '/var/lib/boinc/slots/0/oifs_43r3_model.exe' 0x838003 0x838609 0x8ecfd7 0x8256ee 0x14a3e9c 0x1418374 0x4322a0 0x41d979 0x41c8f2 0x407bd8 0x40734d 0x40220f 0x4021a2 0x1dc9c50 0x40206e
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [0]: [0x838003] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [1]: [0x838609] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [2]: [0x8ecfd7] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [3]: [0x8256ee] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [4]: [0x14a3e9c] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [5]: [0x1418374] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [6]: [0x4322a0] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [7]: [0x41d979] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [8]: [0x41c8f2] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [9]: [0x407bd8] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [10]: [0x40734d] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [11]: [0x40220f] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [12]: [0x4021a2] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [13]: [0x1dc9c50] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:319] [14]: [0x40206e] : ??() at ??:0
[LinuxTraceBack@/home/glenn/github/gc_oifs43r3/src/ifsaux/utilities/linuxtrbk.c:370] End of backtrace(s)
SDL_TRACEBACK: Done INTEL_TRBK, THRD = 1
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.286] [signal_drhook@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1538] Received signal#6 (SIGABRT) :: 1209MB (heap), 861MB (maxrss), 0MB (maxstack), 0 (paging), nsigs = 1
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.286] [signal_drhook@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1542] Also activating Harakiri-alarm (SIGALRM=14) to expire after 500s elapsed to prevent hangs, nsigs = 1
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.286] [signal_drhook@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1544] Harakiri signal handler 'signal_harakiri' for signal#14 (SIGALRM) installed at 0x8102d0 (old at (nil))
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.286] [signal_drhook@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1617] Signal#6 was caused by unrecognized si_code [memaddr=0x1f200054795], nsigs = 1
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223620:1677036980:1.286] [signal_drhook@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1686] Starting DrHook backtrace for signal#6, nsigs = 1
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223621:1677036981:2.287] [signal_drhook@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#6, nsigs = 1
[EC_DRHOOK:localhost.localdomain:1:1:346005:346005] [20230221:223621:1677036981:2.287] [signal_drhook@/home/glenn/github/gc_oifs43r3/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1ce8440 for signal#6, nsigs = 1
forrtl: error (76): Abort trap signal
..The child process has been killed with signal: 6
ABOR1 CALLED SUECOZC:ERROR OPENING FILE ECOZC
------------------------------------------------
oifs_get_stat: Error. ifs.stat file is not open
CNT0 not found; string returned was: ''
>>> Printing last 1 lines from file: ifs.stat
22:36:19 000000000 CNT3 -999 0.185 0.185 0.188 0:00 0:00 0.00000000000000E+00 1GB 0MB
------------------------------------------------
..Failed, model did not complete successfully
</stderr_txt>
]]> |
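Most of a dump like this is backtrace noise; the useful facts are the abort message and the signal that ended the model. As a rough sketch, assuming the stderr text has been saved as a string, the following Python pulls out just those two items. The regular expressions are written against the lines quoted in this post only, and the function name is made up for illustration; other failure modes may print differently.

```python
import re

# A few lines copied from the stderr excerpt in this thread.
SAMPLE_STDERR = """\
ABORT! 1 SUECOZC:ERROR OPENING FILE ECOZC
Received signal#6 (SIGABRT) :: 1209MB (heap), 861MB (maxrss), 0MB (maxstack), 0 (paging), nsigs = 1
forrtl: error (76): Abort trap signal
ABOR1 CALLED SUECOZC:ERROR OPENING FILE ECOZC
..Failed, model did not complete successfully
"""

# "ABORT! <n> ROUTINE:message" or "ABOR1 CALLED ROUTINE:message"
ABORT_RE = re.compile(r"(?:ABORT!\s+\d+|ABOR1 CALLED)\s+(\w+):([^\n]+)")
# "Received signal#6 (SIGABRT)"
SIGNAL_RE = re.compile(r"Received signal#(\d+) \((\w+)\)")


def summarize_stderr(text):
    """Return the distinct abort messages and received signals found in text."""
    aborts = sorted({(routine, msg.strip()) for routine, msg in ABORT_RE.findall(text)})
    signals = sorted({(int(num), name) for num, name in SIGNAL_RE.findall(text)})
    return {"aborts": aborts, "signals": signals}


if __name__ == "__main__":
    print(summarize_stderr(SAMPLE_STDERR))
```

Run over the stderr of several failed tasks, identical summaries would support the idea that the failures share one cause rather than being host-specific.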
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
"ABOR1 CALLED SUECOZC:ERROR OPENING FILE ECOZC"
I suspect there is something about the file that doesn't match the expected format. I didn't manage to get a look at the slots directory as my two failed and reported while I was away from the computer. |
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,483,649 RAC: 1,954 |
"To stop it, be the root user and type systemctl stop boinc-client"
The second line works as you figured. There's also systemctl restart boinc-client. I don't know if it stops gracefully and then starts again or just does some kind of forced stop and restart. I mainly use this for PrimeGrid when fiddling with the xml files for the projects' rock solid apps, or when I just want to get things done really quick.
Btw.: Some distributions or Boinc packages have problems with the Boinc installation and forget to install the autostarting of Boinc. If you happen to have a non-autostarting Boinc client you can use systemctl enable boinc-client.
- - - - - - - - - -
Greetings, Jens |
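For anyone who scripts these steps, here is a minimal Python sketch that wraps the four systemctl actions mentioned in this thread (start, stop, restart, enable). The unit name boinc-client is taken from the posts; the function names are invented for illustration. It defaults to a dry run that only prints the command, since actually running it needs root.

```python
import subprocess

# Only the actions mentioned in this thread; other subcommands are left out on purpose.
ALLOWED_ACTIONS = ("start", "stop", "restart", "enable")


def systemctl_command(action, unit="boinc-client"):
    """Build the argument list for `systemctl <action> <unit>`."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return ["systemctl", action, unit]


def run_systemctl(action, unit="boinc-client", dry_run=True):
    """Print the command; execute it only when dry_run is False (needs root)."""
    cmd = systemctl_command(action, unit)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    run_systemctl("restart")  # dry run: prints the command without executing it
```

Calling `run_systemctl("stop", dry_run=False)` as root would be the scripted equivalent of typing systemctl stop boinc-client.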
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Failed another one. Same as the other two.
Task 22314676
Name oifs_43r3_002q_2019110100_123_992_12213385_2
Workunit 12213385
Created 22 Feb 2023, 6:22:47 UTC
Sent 22 Feb 2023, 6:23:59 UTC
Report deadline 23 Apr 2023, 6:23:59 UTC
Received 22 Feb 2023, 7:24:41 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 1 (0x00000001) Unknown error code
Computer ID 1511241
Run time 1 min 6 sec
CPU time 1 sec
Validate state Invalid |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
"Failed another one. Same as the other two."
I got a resend of one of these that needless to say failed. On the plus side a resend from batch 911 did complete. My guess is it failed on the first machine because of lack of memory. - 11GB RAM and 4 cores- it looks likely that the user is running all four cores at once which would explain a relatively high failure rate. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,709,934 RAC: 9,107 |
The consistency of the errors makes it much more likely that this is a task data error. I reported my single-case failure as quickly as possible, to warn the project and other users - I don't have any further way to analyse the data, save to say that it failed on a machine with 32 GB of RAM, much earlier (67.31 seconds elapsed time, 2.35 seconds CPU) than I would expect memory to fill up. I'm trying to get another for fuller examination, but at one request an hour, they're proving elusive. Apart from the single email from Andy Bowery on Monday evening, confirming that distribution had been paused, I haven't seen anything from the team. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"My guess is it failed on the first machine because of lack of memory. - 11GB RAM and 4 cores- it looks likely that the user is running all four cores at once which would explain a relatively high failure rate."
Not my problem. Another 64 GBytes on order.
Computer 1511241
Computer information
CPU type GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 16
Operating System Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
BOINC version 7.20.2
Memory 62.4 GB
Cache 16896 KB |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
"The consistency of the errors makes it much more likely that this is a task data error."
I don't know anything about the file that wouldn't open. I was assuming the task data is formatted incorrectly, which I have seen in the past, though not resulting in this particular error. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I looked at the event log of one of these. Here it is, but I deleted most of the boring lines. What I found interesting was not so much that the .zip files were missing (not produced or deleted by mistake), but how quickly whatever computation was presumably done finished. And why does it go on when files are missing?
Tue 21 Feb 2023 10:36:15 PM EST | climateprediction.net | Starting task oifs_43r3_000t_2019110100_123_992_12213316_2
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Computation for task oifs_43r3_000t_2019110100_123_992_12213316_2 finished
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file oifs_43r3_000t_2019110100_123_992_12213316_2_r1577626144_0.zip for task oifs_43r3_000t_2019110100_123_992_12213316_2 absent
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file oifs_43r3_000t_2019110100_123_992_12213316_2_r1577626144_1.zip for task oifs_43r3_000t_2019110100_123_992_12213316_2 absent
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file oifs_43r3_000t_2019110100_123_992_12213316_2_r1577626144_2.zip for task oifs_43r3_000t_2019110100_123_992_12213316_2 absent
[big snip]
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file oifs_43r3_000t_2019110100_123_992_12213316_2_r1577626144_120.zip for task oifs_43r3_000t_2019110100_123_992_12213316_2 absent
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file oifs_43r3_000t_2019110100_123_992_12213316_2_r1577626144_121.zip for task oifs_43r3_000t_2019110100_123_992_12213316_2 absent
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file oifs_43r3_000t_2019110100_123_992_12213316_2_r1577626144_122.zip for task oifs_43r3_000t_2019110100_123_992_12213316_2 absent
Tue 21 Feb 2023 10:49:25 PM EST | climateprediction.net | update requested by user |
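To put a number on "how fast", the event log can be summarised mechanically. The sketch below, in Python, assumes the log has been copied to text with one entry per line in the "timestamp | project | message" layout shown in this post. It reports the wall-clock seconds between the "Starting task" and "finished" entries and counts the "absent" output files. The function names are made up for illustration, and the time zone abbreviation is simply dropped, which is fine as long as both entries are in the same zone.

```python
from datetime import datetime

TASK = "oifs_43r3_000t_2019110100_123_992_12213316_2"

# Four entries copied from the event log above (two of the "absent" lines).
SAMPLE_LOG = f"""\
Tue 21 Feb 2023 10:36:15 PM EST | climateprediction.net | Starting task {TASK}
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Computation for task {TASK} finished
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file {TASK}_r1577626144_0.zip for task {TASK} absent
Tue 21 Feb 2023 10:37:23 PM EST | climateprediction.net | Output file {TASK}_r1577626144_1.zip for task {TASK} absent
"""


def parse_time(stamp):
    """Parse 'Tue 21 Feb 2023 10:36:15 PM EST', ignoring the trailing zone name."""
    without_zone = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(without_zone, "%a %d %b %Y %I:%M:%S %p")


def summarize_task(log_text, task):
    """Return (elapsed seconds or None, number of absent output files) for task."""
    started = finished = None
    absent = 0
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # skip anything that is not a timestamp|project|message entry
        stamp, _project, message = parts
        if message == f"Starting task {task}":
            started = parse_time(stamp)
        elif message == f"Computation for task {task} finished":
            finished = parse_time(stamp)
        elif message.startswith("Output file ") and message.endswith(f"for task {task} absent"):
            absent += 1
    elapsed = (finished - started).total_seconds() if started and finished else None
    return elapsed, absent


if __name__ == "__main__":
    print(summarize_task(SAMPLE_LOG, TASK))
```

For the entries above the elapsed time works out to 68 seconds, in line with the "Run time 1 min 6 sec" reported for these tasks on the server.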