Message boards : Number crunching : New Model Type HadAM4
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
A new Linux-only model type. Aside from the crashes due to missing 32-bit libs, there have been a number of crashes with the error: Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy. The project would like to know a bit more about what is causing this. I have managed to recreate it by: 1. Suspending computation. 2. Exiting BOINC. 3. Rebooting. 4. Restarting BOINC. 5. Resuming computation. Three times the task survived exiting BOINC and restarting without the reboot, even though one of those attempts didn't include suspending computation. My experience in the past is that a task is more likely to crash if there has been a kernel upgrade prior to the reboot, which there had been in this case, though a hadcm3s task did survive. If you experience tasks crashing with this error, please do post and say what was happening at the time. Thank you. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I'm running Ubuntu 18.04 in VMs on Windows 10 PCs. The only time I've interrupted one was when I shut down my virtual machine to allocate more logical cores to it. I waited for a checkpoint, stopped BOINC a few timesteps past that checkpoint through the File -> Exit Boinc Manager menu choice, shut down Ubuntu in the VM using the normal Ubuntu shutdown method, changed the number of logical cores that the VM uses from 2 to 4, restarted Ubuntu in the VM, and restarted BOINC Manager. The task crashed immediately with the BAD BUFFIN error. There was no PC reboot or Ubuntu update during this time. |
Send message Joined: 15 Nov 18 Posts: 1 Credit: 15,365 RAC: 0 |
only linux :( |
Send message Joined: 28 May 17 Posts: 49 Credit: 17,405,877 RAC: 5,319 |
My Linux PC grabbed 6 of these this morning. About 18.5 day ETA. Still running after 12 hours. ~630 MB of memory usage. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just noticed it existed this morning. I checked to see if it would run on Linux, so I added it to my list. So far, I have received none. I get so few tasks from climateprediction that my client only tries about every two or three days. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I see that over one in five have fallen to the three strikes and out rule. I think the advice for this batch is something those of us above a certain age will remember. Das machine is nicht fur gefingerpoken und mittengrabben. Ist easy |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Back when the first computer I saw took up a room the size of a small aircraft hangar, comedians would put little messages on the machines. The first one I saw said: "If you can remain calm in all this confusion, you obviously do not understand the problem." But the best one I ever saw went something like this: We have not answered all our questions. Sometimes we think we have not answered any of them. Those questions we have answered have served only to raise a host of new questions. So now we are as confused as ever, but on a higher level, and about more important things. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Currently uploading an atmos_restart.day file for Sarah to have a look at, to see if the problem with these tasks can be identified. Makes some of the recent uploads look small. (187.2MB) Edit: and unless someone else has a different experience, as with the hadcm3s tasks, only the first trickle shows up on the task web pages. |
Send message Joined: 28 May 17 Posts: 49 Credit: 17,405,877 RAC: 5,319 |
I realized that, because of these CPDN tasks, tasks from other projects were taking many times as long, so I paused all but one. Today I received tasks from BURP, which paused that one task since they have a short deadline and are multithreaded (mt). Upon resuming, it had an error, along with every other one I had. Pretty much a waste of time. Model crashed: READDUMP: BAD BUFFIN OF DATA Sorry, too many model crashes! :- |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
These new models REALLY do not like being interrupted, even more than climate models usually do. |
Send message Joined: 28 May 17 Posts: 49 Credit: 17,405,877 RAC: 5,319 |
It's hard for them not to be interrupted when there isn't enough CPDN work to fill the queue, they have a runtime of 18 days, and they have a one-year deadline. Something will end up interrupting them. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
These new models REALLY do not like being interrupted, even more than climate models usually do. Is it enough to check the "Leave applications in memory while suspended" box? I reboot only when Red Hat updates the kernel, which is usually a little less often than once a month. And even then, the shutdown procedure sends a shutdown signal to all running processes, then waits about 10 seconds before killing any that remain. |
Send message Joined: 30 Aug 04 Posts: 4 Credit: 535,502 RAC: 0 |
additional messages - immediately after stopping the task <stderr_txt> Signal 15 received: Software termination signal from kill Signal 15 received: Abnormal termination triggered by abort call Signal 15 received, exiting... 07:16:01 (6392): called boinc_finish(193) *** Error in `../../projects/climateprediction.net/hadam4_8.08_i686-pc-linux-gnu': free(): invalid pointer: 0xf7371008 *** . . . . . . *** Error in `../../projects/climateprediction.net/hadam4_8.08_i686-pc-linux-gnu': free(): invalid pointer: 0xf7371008 *** then, after re-start Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy Sorry, too many model crashes! :-( 08:58:05 (2041): called boinc_finish(22) |
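For anyone wanting to check their own task output for the same pattern, here is a minimal Python sketch that scans stderr text for the crash message and the exit codes quoted in the post above. The function name `summarize_stderr` and the `sample` string are illustrative assumptions, not part of BOINC or the CPDN application; the patterns are copied from the excerpt in this thread and may not cover every variant of the error.

```python
import re

# Patterns copied from the stderr excerpt quoted in this thread.
BAD_BUFFIN = "Model crashed: READDUMP: BAD BUFFIN OF DATA"
FINISH_RE = re.compile(r"called boinc_finish\((\d+)\)")


def summarize_stderr(text):
    """Return (number of BAD BUFFIN crashes, list of boinc_finish exit codes).

    Hypothetical helper for illustration only.
    """
    crashes = text.count(BAD_BUFFIN)
    exit_codes = [int(code) for code in FINISH_RE.findall(text)]
    return crashes, exit_codes


# Sample assembled from the lines quoted in the post above.
sample = (
    "Signal 15 received, exiting... "
    "07:16:01 (6392): called boinc_finish(193) "
    "Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy "
    "Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy "
    "Sorry, too many model crashes! :-( "
    "08:58:05 (2041): called boinc_finish(22)"
)

print(summarize_stderr(sample))  # (2, [193, 22])
```

Two crash messages followed by a `boinc_finish` call with a different exit code than the one logged at shutdown is the signature reported here: the task stopped on signal 15, then failed on restart.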
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I am currently downloading three tasks to see if the issue with stopping and restarting is resolved. Sarah believes they have identified the issue. Edit: It might be useful if any moderators with faster boxes than mine could get the remaining two of these. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just got one. According to my client, it is expected to take about 44 days on my machine. The other two computers that tried this work unit bombed out, one very quickly. Name hadam4_a04k_200811_12_785_011729940_2 Workunit 11729940 I do not expect to reboot my machine before it finishes. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I am currently downloading three tasks to see if the issue with stopping and restarting is resolved. Sarah believes they have identified the issue. Well, one of the three on the laptop has just crashed when I exited BOINC and restarted it, even without the reboot, along with the four main-site tasks I had running. So, not fixed yet. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,705,191 RAC: 5,539 |
I got two at their 2nd attempt, so far at 13%, and it looks like 20-23 days for these WUs. I do not intend to interrupt them just to check how they behave. I will use the machine normally, and if they crash I will report. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
8 at 58% after 5 days. Six zips created and returned. And all trickles showing on the Task's pages. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Name hadam4_a04k_200811_12_785_011729940_2 This work unit now has 15 hours 32 minutes on it and has not crashed, so it does not seem to have a congenital weakness. It has not uploaded any .zip files yet. Claims to be a trifle over 4% done now. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are 12 months to these, so 12 zips, which is about 8% intervals. |
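The arithmetic behind the "about 8%" figure can be sketched in a few lines of Python. This assumes one zip per model month and progress that advances evenly across the 12 months; both are simplifications for illustration, and the function names are made up here.

```python
import math

MONTHS = 12  # one zip per model month, per the post above


def zip_milestones(months=MONTHS):
    """Percent-complete values at which each zip would be expected."""
    return [round(100 * m / months, 1) for m in range(1, months + 1)]


def zips_expected(percent_done, months=MONTHS):
    """Number of zips expected so far, assuming evenly spaced months."""
    return math.floor(percent_done * months / 100)


print(zip_milestones()[:3])  # [8.3, 16.7, 25.0]
print(zips_expected(4))      # 0
print(zips_expected(58))     # 6
```

Under these assumptions, a task at about 4% would not yet have produced a zip, and tasks at 58% would have produced six, which agrees with the reports earlier in the thread.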
©2024 cpdn.org