OpenIFS Discussion
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
> A new version (1.05) of the OpenIFS 43r3 Perturbed Surface application was distributed on Dec. 12th. The last 4 resends I received used the new app version. Three completed successfully and 1 is in progress.

Mine, too. One has completed successfully, and two are about 2/3 done, though the estimate of the time remaining is grossly exaggerated. The one that completed successfully is interesting because the previous machine it ran on used the 1.01 version and I ran the 1.05 version; i.e., the task does not seem to care which version it uses. Notice the process name is now this, instead of the old task name:

```
/var/lib/boinc/slots/9/oifs_43r3_model.exe
/var/lib/boinc/slots/10/oifs_43r3_model.exe
```

I always wonder why the failing tasks use more time than the ones I get that work correctly. What are they doing with all the wasted time? His machine is a little bit faster than mine, so that is not the reason.

Workunit 12164019
- name: oifs_43r3_ps_0930_2021050100_123_945_12164019
- application: OpenIFS 43r3 Perturbed Surface
- created: 28 Nov 2022, 21:33:48 UTC
- canonical result: 22250176
- granted credit: 0.00
- minimum quorum: 1
- initial replication: 1
- max # of error/total/success tasks: 3, 3, 1

Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
---|---|---|---|---|---|---|---|---
22250176 | 1511241 | 13 Dec 2022, 8:25:20 UTC | 13 Dec 2022, 23:24:41 UTC | Completed | 52,607.22 | 52,023.88 | 0.00 | OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu
22245964 | 1518460 | 29 Nov 2022, 4:15:46 UTC | 13 Dec 2022, 8:22:42 UTC | Error while computing | 98,458.51 | 91,682.23 | --- | OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu
Joined: 29 Oct 17 Posts: 1048 Credit: 16,408,472 RAC: 15,705
> And another oddity. Two of the resends, from batch 947, went up to 100% and then dropped back to 99.990% as I was watching. When time remaining dropped to 0 they kept showing as running despite negligible CPU usage in top. I have suspended them in case getting any information from the slot files might be scuppered by letting them continue, though it may be I have to kill the processes to stop them showing as running. I will wait till the third task from 945 finishes just to ensure I don't kill the wrong process.

Dave, this 100% -> 99.9% is just an oddity of the way the BOINC client computes the time remaining. Ignore it; there's nothing wrong with the task. I can probably tweak the fraction-done computation, but I can never eliminate it completely. What's happening is that the model itself proceeds at, say, 1% every 10 minutes. When the model finishes, there is a bit more work to do to zip up the remaining files, do housekeeping etc. This doesn't proceed at 1% every 10 minutes, it's slower, but the client seems to use the previous rate of 1% every 10 minutes and therefore thinks the task will finish sooner. But then the task gets to a point in the code where it sends a message to the client saying it's actually only at 99.99%, and you see the fraction done change. This is why I turn off 'time remaining'; it's not accurate, only an estimate. The tasks know what they are doing, there is no error here, it's just the client trying to be clever. The proper way of working out where the task is would be to go into the slot directory and look at the 'stderr.txt' output.
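To illustrate the effect Glenn describes, here is a rough sketch (my own illustration, not the actual client code, and the numbers are invented): if remaining time is extrapolated from the model's earlier progress rate, the slower zip/housekeeping phase at the end makes the estimate reach zero before the task really finishes, after which the reported fraction done appears to fall back.

```cpp
#include <iostream>

int main() {
    // Hypothetical numbers, purely illustrative.
    double frac_done   = 0.99;      // last fraction reported by the task
    double elapsed_sec = 49500.0;   // wall-clock time so far (~1% per 10 min)

    // Simple extrapolation from the average rate so far:
    // remaining = elapsed * (1 - f) / f
    double remaining = elapsed_sec * (1.0 - frac_done) / frac_done;
    std::cout << "Estimated remaining: " << remaining / 60.0 << " min\n";

    // The final zip/housekeeping phase progresses much more slowly than the
    // model did, so the real remaining time is longer than this estimate;
    // the display reaches 100%, then the task reports 99.99% and the value
    // appears to drop back.
    return 0;
}
```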
Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449
> And another oddity. Two of the resends, from batch 947, went up to 100% and then dropped back to 99.990% as I was watching. When time remaining dropped to 0 they kept showing as running despite negligible CPU usage in top.

I saw that during the test runs too. Those finished normally around 4 minutes after the 100% flash - it doesn't seem to be detrimental to the run.
Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818
> > And another oddity. Two of the resends, from batch 947, went up to 100% and then dropped back to 99.990% as I was watching. When time remaining dropped to 0 they kept showing as running despite negligible CPU usage in top.
>
> I saw that during the test runs too. Those finished normally around 4 minutes after the 100% flash - it doesn't seem to be detrimental to the run.

Thanks for that, Richard and Glen. They obviously keep doing the zipping up after being suspended, as one has now completed and the other is now uploading. Obviously, I haven't had my eyes glued to the screen at that point with these tasks before!
Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818
And the last of my resends has survived a reboot, albeit with suspending computation, waiting, and exiting BOINC first. If the success rate proves to be substantially higher with the next batch, I might try the hard reboot test.

Edit: three backed-up zips remaining, all getting a transient upload error, but new zips from the still-running task are going through. Same whether I use the router and broadband or 4G via my phone, which is 4 times faster on a bad day and 10 times faster on a good day, but with a 15 GB data limit I am only going to upload a few tasks a month using it!

Edit2: Started working again after half an hour. Probably still some congestion on the system?
Joined: 29 Oct 17 Posts: 1048 Credit: 16,408,472 RAC: 15,705
> And the last of my resends has survived a reboot, albeit with suspending computation, waiting, and exiting BOINC first. If the success rate proves to be substantially higher with the next batch, I might try the hard reboot test.

A power on/off reboot works fine now with OpenIFS and the updated control code. From a programming point of view, it's the same as restarting the client (without powering off). Just remember to tick/check 'keep non-GPU apps in memory', otherwise if the openifs exe gets swapped out it will have to do a restart. This is why people are reporting seeing the task running for a long time: if it restarts, it will repeat a few timesteps on each restart. If it's restarting frequently, it could potentially never finish. This is why we've upped the memory bounds to keep it away from 8 GB RAM machines. If it stays in memory suspended, that's fine. CPDN have moved onto a newer managed cloud upload server, so that too should be faster & more stable.
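A toy illustration of why restarts repeat work (this is only a schematic, not the OpenIFS or wrapper code, and the checkpoint interval and step numbers are invented): a model that checkpoints every N steps can only resume from the last checkpoint, so every restart redoes the steps since that checkpoint, and frequent restarts can stall net progress.

```cpp
#include <iostream>

int main() {
    // Illustrative numbers only.
    const int checkpoint_every = 12;   // model steps between checkpoints
    const int total_steps      = 48;

    int step = 0, last_checkpoint = 0;
    bool evicted_once = false;

    while (step < total_steps) {
        ++step;                                          // one model timestep
        if (step % checkpoint_every == 0) last_checkpoint = step;

        // Suppose the task is swapped out of memory at step 30 and restarted:
        if (step == 30 && !evicted_once) {
            evicted_once = true;
            std::cout << "Restart: falling back from step " << step
                      << " to checkpoint " << last_checkpoint << "\n";
            step = last_checkpoint;                      // steps 25-30 are redone
        }
    }
    std::cout << "Finished at step " << step << "\n";
    return 0;
}
```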
Joined: 29 Oct 17 Posts: 1048 Credit: 16,408,472 RAC: 15,705
I understand there will be some more OpenIFS work with the BL & PS apps appearing in the next week, now that the recent issues have been resolved.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
> This is why I turn off time remaining, it's not accurate, only an estimate.

With the "1.01" tasks, I noticed that after I had completed a few, my BOINC client learned pretty quickly to make much better estimates. I expect the same will apply to the "1.05" tasks.
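That learning is roughly what the client's per-project runtime correction does: the ratio of actual to estimated runtime from completed tasks is folded into future estimates. A simplified sketch of the idea only (not the BOINC client's actual algorithm; all numbers are hypothetical):

```cpp
#include <iostream>

int main() {
    double correction = 1.0;              // starts neutral
    const double est_sec    = 30000.0;    // server-side runtime estimate
    const double actual_sec = 52000.0;    // what tasks really take on this host

    for (int completed = 1; completed <= 4; ++completed) {
        // Blend the observed actual/estimated ratio into the correction factor.
        double ratio = actual_sec / est_sec;
        correction   = 0.5 * correction + 0.5 * ratio;
        std::cout << "After " << completed << " completed task(s), estimate = "
                  << est_sec * correction / 3600.0 << " h\n";
    }
    return 0;
}
```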
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
> Edit2: Started working again after half an hour. Probably still some congestion on the system?

I have not noticed any congestion. I expected some when it started working the other day, but it just started taking about two at a time every second or so until all 900 uploaded. The upload speed is fabulous; my 75 megabit/second fibre-optic Internet connection has not changed, so I attribute the speedup to the server at the other (CPDN) end. This is now typically what happens, and I have noticed no exceptions:

```
Wed 14 Dec 2022 08:54:12 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_118.zip
Wed 14 Dec 2022 08:54:17 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_118.zip
Wed 14 Dec 2022 08:54:34 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_118.zip
Wed 14 Dec 2022 08:54:39 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_118.zip
Wed 14 Dec 2022 09:01:14 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_119.zip
Wed 14 Dec 2022 09:01:19 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_119.zip
Wed 14 Dec 2022 09:01:45 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_119.zip
Wed 14 Dec 2022 09:01:49 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_119.zip
```
Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818
Good to see #949 is now at 84% success as of 04:00 UTC (GMT in old money). That is a significant increase on the three previous batches, which have been out for longer. Not sure what happened to #948?
Joined: 29 Nov 17 Posts: 82 Credit: 14,369,121 RAC: 91,650
In case it hasn't been mentioned previously: if an oifs task has bugged out part way through, a directory for that task may be left in the project folder, which you may want to delete manually to reclaim the disk space.

Another task failed with error 9 today. The host had previously run 7 of the last batch successfully and 1 more since without a problem. The host has been running the same projects continuously with 8-11 GB of free memory according to top, and roughly 70 GB of disk space available.

I believe a wrapper is being used to run the oifs tasks, is that correct? Regarding the entries that are put into the stderr.txt file while the 'task' is running, are they written just by the actual application, by the wrapper, or both? These entries, for example:

```
04:51:49 STEP 1440 H= 360:00 +CPU= 25.486
The child process terminated with status: 0
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMGGh7zg+001440
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMSHh7zg+001440
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMUAh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001392
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001416
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001404
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001416
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001344
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001356
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001344
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001428
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001380
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001368
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001404
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001392
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001368
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001344
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001356
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001356
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001416
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001428
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001368
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001428
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001392
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001380
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001380
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001404
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_a05k_2016092300_15_949_12166594_0_r812719879_14.zip
Uploading the final file: upload_file_14.zip
Uploading trickle at timestep: 1295100
04:55:23 (1414522): called boinc_finish(0)
```

The code in app_control.cpp in the client that handles the error 9 reporting is this section:

```cpp
if (WIFEXITED(stat)) {
    result->exit_status = WEXITSTATUS(stat);
    double x;
    char buf[256];
    bool is_notice;
    int e;
    if (temporary_exit_file_present(x, buf, is_notice)) {
        handle_temporary_exit(will_restart, x, buf, is_notice);
    } else {
        if (log_flags.task_debug) {
            msg_printf(result->project, MSG_INFO,
                "[task] process exited with status %d\n",
                result->exit_status
            );
        }
        if (result->exit_status) {
            set_task_state(PROCESS_EXITED, "handle_exited_app");
            snprintf(err_msg, sizeof(err_msg),
                "process exited with code %d (0x%x, %d)",
                result->exit_status, result->exit_status,
                (~0xff)|result->exit_status
            );
            gstate.report_result_error(*result, err_msg);
        } else {
            if (finish_file_present(e)) {
                set_task_state(PROCESS_EXITED, "handle_exited_app");
            } else {
                handle_premature_exit(will_restart);
            }
        }
    }
} else if (WIFSIGNALED(stat)) {
    int got_signal = WTERMSIG(stat);
    if (log_flags.task_debug) {
        msg_printf(result->project, MSG_INFO,
            "[task] process got signal %d", got_signal
        );
    }
```

The path taken in this case is:

```cpp
if (WIFEXITED(stat)) {                        // <=== evaluates to TRUE, the exit was NORMAL
    result->exit_status = WEXITSTATUS(stat);  // <=== sets the result's exit_status to that of the child, in this case 9
    if (result->exit_status) {                // <=== because it is non-zero we take this branch
        set_task_state(PROCESS_EXITED, "handle_exited_app");
        snprintf(err_msg, sizeof(err_msg),
            "process exited with code %d (0x%x, %d)",
            result->exit_status, result->exit_status,
            (~0xff)|result->exit_status
        );
        gstate.report_result_error(*result, err_msg);
```

So that is why I'm asking whether a wrapper is being used and whether it is the wrapper code that I need to go and look at. Is the wrapper a standard BOINC flavour, or have you made your own version that is not available to the public?

It would seem to me that the application/model that is doing all the work has done a splendid job, got all the right answers and sent them to the server, but at the very last moment the wrapper has indicated a problem of its own that marks all that work, and the task, as invalid. I've seen it mentioned several times here that error 9 is signal 9, but it isn't. There is a separate bit of code, at the end of that posted above, that handles signals, and you get a different message in the log. I'm looking solely at those tasks that get an error 9 after completing the work correctly, as best I can tell. You could check that by requiring a quorum to validate against, but then you are duplicating every task rather than just having the ones that fail repeated by a second or third host.
Linux says error 9 is EBADF, reported in the case of a bad file descriptor. I have been looking for any remnants in the slot or project directories that would indicate something went amiss, but have found no such evidence. I enabled slot_debug, which showed approximately 1,950 files got cleared out of the slot on termination without a problem. The temporary task directory under the projects folder is gone. The task job files (jf_*) have gone.

To get any further with the cause of this problem will require a better understanding of what is happening right at the end. If it is the wrapper code that is returning an exit code of 9 right at the end, then that is where we need to look to determine the cause.
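For readers following along, a minimal standalone sketch (my own, not CPDN or BOINC code) of why "exit code 9" and "killed by signal 9" are reported through different branches of the code quoted above: a normal exit(9) is picked up via WIFEXITED/WEXITSTATUS, while a SIGKILL shows up via WIFSIGNALED/WTERMSIG.

```cpp
#include <csignal>
#include <iostream>
#include <sys/wait.h>
#include <unistd.h>

static void report(pid_t pid) {
    int stat = 0;
    waitpid(pid, &stat, 0);
    if (WIFEXITED(stat))
        std::cout << "normal exit, code " << WEXITSTATUS(stat) << "\n";
    else if (WIFSIGNALED(stat))
        std::cout << "killed by signal " << WTERMSIG(stat) << "\n";
}

int main() {
    pid_t a = fork();
    if (a == 0) _exit(9);                       // child exits normally with code 9
    report(a);                                  // -> "normal exit, code 9"

    pid_t b = fork();
    if (b == 0) { raise(SIGKILL); _exit(0); }   // child killed by signal 9
    report(b);                                  // -> "killed by signal 9"
    return 0;
}
```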
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
I just got 8 oifs_43r3_ps tasks, and three of them are running, with about 1.5 hours on each so far. My guess is they are not going to crash. All are re-runs. They seem to crash for no apparent reason; each one is different. Many leave no stderr, but some do. IIRC, the ones that leave stderr seem to have the model crash, which seems to confuse the wrapper. Just my impression.
Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818
#949 is now up to 88% success with only 2 hard fails.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
> #949 is now up to 88% success with only 2 hard fails.

My ten are all re-runs that have failed for previous users. Most are from #946 and #947; none are from #949. The three that are running now are a little over 90% complete, so my guess is they will complete correctly. So far, all oifs_43r3_ps tasks have run successfully for me.
Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818
> The three that are running now are a little over 90% complete, so my guess is they will complete correctly.

Batches 945, 946 and 947 were at 79%, 82% and 82% complete when I looked earlier, but 945 is over 70% failures. I think that figure is artificially high, though, as I suspect a task that fails three times counts as three failures - see the example below. I don't know enough to check that against the BOINC server code. In contrast, #949 has only 11% fails and only 2 hard fails.
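To illustrate that point with made-up numbers (a hypothetical example only; the real counting depends on the server configuration): if each failing workunit is resent up to the 3-error limit shown in the workunit details earlier in the thread, the per-task failure percentage overstates the per-workunit failure rate.

```latex
% 100 hypothetical workunits, 20 of which fail on every attempt and are
% resent until they hit the 3-error limit; the other 80 succeed first time.
\[
\text{task failure rate} = \frac{20 \times 3}{80 + 20 \times 3}
                         = \frac{60}{140} \approx 43\%,
\qquad
\text{workunit failure rate} = \frac{20}{100} = 20\%.
\]
```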
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
> The three that are running now are a little over 90% complete, so my guess is they will complete correctly.

The three re-runs just completed successfully.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,408,472 RAC: 15,705
OpenIFS model fails due to inappropriate perturbations.

Some model tasks (but very few) are failing due to inappropriate perturbations pushing the model too far. Think of this as equivalent to the 'theta level' type error that the Hadley models give.

For the BL OpenIFS app, you will see a long stack trace at the bottom of the task's report on the CPDN website which mentions the function 'vdfmain'. This is the model's vertical diffusion scheme, and it indicates the vertical air velocity is too high somewhere (typically near the surface). For the PS OpenIFS app, you will see a similar stack trace which refers to 'gp_norm'. Similar kind of thing: numbers at or near the surface have got too large.

The scientists in both these projects are perturbing some of the model's fixed parameters across quite a range of values, so some fails are expected - though, to date, surprisingly few.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,408,472 RAC: 15,705
Hi, I can answer some of this. As it's getting technical, maybe the moderators might want to move it to a separate thread?

> In case it hasn't been mentioned previously: if an oifs task has bugged out part way through, a directory for that task may be left in the project folder, which you may want to delete manually to reclaim the disk space.

Yes, this is a known issue that's being looked into.

> Another task failed with error 9 today. The host had previously run 7 of the last batch successfully and 1 more since without a problem.

Tasks are failing with error codes 1 & 9. Some seem to report more information: 'double free corruption' or 'free()... fast' errors appear at the end of the stderr, but most don't. My guess is that the error is not being flushed to the log before the process dies. With the help of others here, we know that 'double free' corruption seems to be symptomatic of the model itself failing. The fail with error code 9 happens after the model has finished, as you say, and sometimes with the 'free()...' message as well.

What I find interesting is that both of these only seem to happen on AMD hardware. I didn't do an exhaustive trawl through the logs, but I could not find a single Intel machine with these fails. My suspicion is that both these errors are memory related: 'double free' corruption obviously is, and the error code 9 with the 'free()..' error could also refer to a memory-resident file, but quite what I am not sure. Both codes were compiled on Intel with the latest Intel compiler; whether additional compiler options are required I don't know. I have not been able to reproduce these errors on my little AMD box. It's possible AMD chips are triggering memory bugs in the code depending on what else happens to be in memory at the same time (hence the seemingly random nature of the fail). Hard to say exactly at the moment, but it could also be something system/hardware related specific to Ryzens. I have never seen the model fail like this before on the processors I've worked with in the past (none of which were AMD, unfortunately). I am tempted to turn down the optimization and see what happens...

> I believe a wrapper is being used to run the oifs tasks, is that correct?

Yes, and it's a code created by CPDN, which I've also recently worked on. I don't know the history of where it came from. It's undergoing some refactoring & bugfixing. To answer your other question, the code is on GitHub here: https://github.com/CPDN-git/openifs, just be aware Andy & I both have private forks which are ahead of this.

> Regarding the entries that are put into the stderr.txt file while the 'task' is running, are they written just by the actual application, by the wrapper, or both?

Both. The lines with 'STEP' are coming from the model's output to stderr; everything else is coming from the separate controlling wrapper process.

> It would seem to me that the application/model that is doing all the work has done a splendid job, got all the right answers and sent them to the server, but at the very last moment the wrapper has indicated a problem of its own that marks all that work, and the task, as invalid.

Agreed. I've asked CPDN if there is a way of getting the server to check that the upload was received OK and reclassify this as a success. It may not be easy, as the uploads go to a cloud server first. Not my expertise.

> Linux says error 9 is EBADF, reported in the case of a bad file descriptor

There are some possible issues at the end of the code which I've already noted with Andy. I've been working on the more urgent fixes up to now. As I mentioned above, this could refer to a memory-resident file.

Comments/suggestions/help/advice: all welcome.
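For anyone unfamiliar with the 'double free' message, a trivial standalone example of the class of error glibc is reporting (this has nothing to do with the actual model code; it just shows what triggers that diagnostic):

```cpp
#include <cstdlib>

int main() {
    char* buf = static_cast<char*>(std::malloc(64));
    std::free(buf);
    // Freeing the same pointer again corrupts the allocator's bookkeeping;
    // glibc typically aborts with a message along the lines of
    // "free(): double free detected".
    std::free(buf);
    return 0;
}
```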
Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449
I have a theory on this, which may suggest where to look.

> > It would seem to me that the application/model that is doing all the work has done a splendid job, got all the right answers and sent them to the server, but at the very last moment the wrapper has indicated a problem of its own that marks all that work, and the task, as invalid.
>
> Agreed. I've asked CPDN if there is a way of getting the server to check that the upload was received OK and reclassify this as a success. It may not be easy, as the uploads go to a cloud server first. Not my expertise.

BOINC clients communicate with projects on two completely separate levels. There are very simple-minded file copying processes, both download and upload, which simply move complete files from one place to another, with no information on or interest in what those files might contain. Those files belong to the project's scientists. And separately, BOINC clients communicate with the project's "scheduler" about the administration and inner workings of BOINC itself.

CPDN is unusual in producing intermediate returns of both types: 'uploads' go to the scientists, and 'trickles' go to the administrators. Most projects only communicate once, when the task has completely finished, and BOINC is careful to wait until the data transfer has completed before finalising the administration. My suspicion is that the completion of intermediate upload file transfers isn't checked so carefully by the client. Some volunteers with powerful computers but slower data lines may still have intermediate files chugging along when the rest of the task has completed, and I'm worried that the final administrative 'all done' message may be sent too early in those cases.

Next time I get some tasks, I'll try holding some files back and studying the logs, to see what order the various 'finished' processes are handled in.
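A sketch of the ordering being described, using entirely hypothetical stand-in functions (uploads_still_pending, poll_uploads, report_all_done are inventions for illustration, not real wrapper or client calls): the safe sequence is to drain the intermediate uploads before the final administrative report, and the theory above is that on slow links the real equivalent of the final report may currently happen first.

```cpp
#include <iostream>

// Hypothetical stand-ins for whatever the wrapper and client actually do;
// here the "uploads" are just a counter that drains by one per poll.
static int pending_zips = 3;

bool uploads_still_pending() { return pending_zips > 0; }
void poll_uploads()          { if (pending_zips > 0) --pending_zips; }
void report_all_done()       { std::cout << "final 'all done' reported\n"; }

int main() {
    // Safe ordering: wait for intermediate uploads before the final report.
    while (uploads_still_pending()) {
        poll_uploads();
        std::cout << pending_zips << " zip(s) still uploading\n";
    }
    // The concern is that, on slow connections, the real 'all done' may be
    // sent before this point would have been reached.
    report_all_done();
    return 0;
}
```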
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918
I take it all these problems mean you haven't even considered starting on VirtualBox to get Windows machines running it? I've got 126 cores sat waiting.... On desktops and laptops, Linux = 1.46%, Windows = 82.56%. Add Windows and you'd get over 57 times as many users.