Message boards : Number crunching : New Model Type HadAM4
Author | Message |
---|---|
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Two tasks finished successfully on my PCs today. These two were never interrupted/removed from memory which explains why they had a chance to finish without crashing. |
Send message Joined: 15 May 09 Posts: 4544 Credit: 19,039,635 RAC: 18,944 |
"Two tasks finished successfully on my PCs today. These two were never interrupted/removed from memory which explains why they had a chance to finish without crashing." What has become clear is that these tasks have a very high probability of failing with the BAD BUFFIN error if BOINC is stopped, or the tasks are suspended, while "Leave non-GPU tasks in memory while suspended" (under Options > Computing preferences > Disk and memory) is not ticked. I have determined that if suspend or hibernate (i.e. suspend to RAM or to disk) is used, the computer can be stopped safely, but as a full reboot means restarting BOINC, a reboot will almost inevitably crash the tasks. This may make these tasks unsuitable for some users until, or unless, the issue is resolved. |
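For anyone who would rather check this setting from a script than through the BOINC Manager, here is a minimal sketch in Python. It assumes the preference is stored as a `<leave_apps_in_memory>` element in a BOINC preferences file such as `global_prefs_override.xml`; the file name, its location, and the sample XML used below are assumptions for illustration, so check your own installation before relying on it.

```python
import xml.etree.ElementTree as ET


def leaves_apps_in_memory(prefs_xml: str) -> bool:
    """Return True if the preferences XML enables 'leave tasks in memory'.

    Assumes the flag is stored as <leave_apps_in_memory>1</leave_apps_in_memory>;
    a missing or unparsable element is treated as 'not enabled'.
    """
    root = ET.fromstring(prefs_xml)
    node = root.find(".//leave_apps_in_memory")
    if node is None or node.text is None:
        return False
    try:
        return float(node.text.strip()) != 0
    except ValueError:
        return False


# Hypothetical sample content, not copied from a real installation.
sample = (
    "<global_preferences>"
    "<leave_apps_in_memory>1</leave_apps_in_memory>"
    "</global_preferences>"
)
print(leaves_apps_in_memory(sample))
```

To use it on a real machine, read the preferences file into a string first and pass that string to `leaves_apps_in_memory`.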
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Two tasks finished successfully on my PCs today. These two were never interrupted/removed from memory which explains why they had a chance to finish without crashing." I am not sure what this actually means, especially on machines running Linux. The memory manager will try to keep the working set of the program in RAM, but will page out inactive pages if RAM is needed for something else. And since BOINC processes run with the lowest priority (i.e., they run only if nothing with higher priority wants a processor), the entire process could get paged out. I always leave "Leave non-GPU tasks in memory while suspended" checked, but "in memory" just means it can be paged in very quickly, not that it is physically in RAM all the time. Since my machine these days does little other than the occasional e-mail and some web browsing, BOINC usually gets over 90% of all processor time. And at night, when I am logged out of the machine, the BOINC tasks get a little over 99% of the processor time. |
Send message Joined: 15 May 09 Posts: 4544 Credit: 19,039,635 RAC: 18,944 |
Well over one in ten of these tasks are crashing due to not having the requisite 32-bit libs. Tomorrow, Sarah will speak to Andy about setting the misbehaving computers' flags to -1 so they will not get any more tasks. The owners should also get emails telling them about this. Once they have added the relevant libs to their Linux installation, they will have to report back to get their computers reinstated. The most recent thread explaining how to install the libs is https://www.cpdn.org/cpdnboinc/forum_thread.php?id=7828. There is also a specific thread in the Linux section of the forums should anyone have problems installing the missing files. There are about 15 of you out there, judging by my perusal of the task pages! |
Send message Joined: 15 May 09 Posts: 4544 Credit: 19,039,635 RAC: 18,944 |
I am sure George can answer this better than I, but I don't think the issue is whether the information is in RAM or in the swap file. What ticking "Leave non-GPU tasks in memory while suspended" does is stop the information being written over while waiting for the task to be resumed. Going to swap would slightly increase the risk of corruption happening, but probably not by very much. Unticking it on some testing tasks didn't actually crash them when I suspended them, I suspect because I didn't leave it long enough for anything else to write over the information. The computer has enough memory that it probably didn't get written out to swap anyway. Still a bit of learning for me when trying to do these diagnostic tests. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Dave is correct. My wording was careless. One can interrupt the task as long as it remains in memory, real or virtual. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Well over one in ten of these tasks are crashing due to not having the requisite 32bit libs. Tomorrow, Sarah will speak to Andy about setting the misbehaving computers' flags to -1 so they will not get any more tasks. The owners should also get emails telling them about this." Excellent idea. I would also think about a user-selectable checkbox to enable work units to go only to 24/7 machines. The ones that are shut down/suspended will fail, and so those crunchers won't want them anyway. |
Send message Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
RE - Leaving in memory - Jean-David Beyer "I always leave Leave non-GPU tasks in memory while suspended checked, but "in memory" just means it can be paged in very quickly, not that it is physically in RAM all the time." Are you sure about that? I might be wrong, but all the evidence I can find seems to point to a task state physically staying in system memory when suspended with a 'Leave'. Obviously, if unchecked, suspended tasks are removed from memory and resume from their last checkpoint. Hence the problems. Since paging is virtual memory methodology, it would only occur if the system needed more RAM than was available for some process. I wouldn't have thought that leaving a task state in memory would trigger paging of that state if another task needed more memory. Rather, elements of the second task would be paged. Obviously, when the machine is switched off everything in RAM is lost, but the system copies the RAM state to disk to be read back on startup. Having said all that, I'm always willing to be corrected. Out of interest, I have used 'Leave non-GPU tasks in memory while suspended' from shortly after I joined CPDN, having had a task crash after a suspend and restart of my Mac. Since then, I have never had a problem, even when running CPDN in a Virtual Machine. I sometimes have to switch between dual-booted systems, so the 'Leave' option is invaluable. I still take care to ensure I don't do this if trickles are pending or uploads are occurring, although pending zips don't appear problematic. I have Mac systems. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I think that you're right there, Dave. I remember something from years ago about people saying they don't use "Leave ..." because it uses up too much memory, which is part of the reason why we also suggest, as a rough rule of thumb, 2 GB of memory per processor core. I think what the "Leave ..." option does is tell the OS that the data in memory is important, so don't delete it just because we're not using it at the moment. And if you REALLY need to use the memory for something else, then swap out all of the data to the HD first, so that it's not lost. The problem with this new model type is thought to be something similar; some data in one of the ancils isn't getting saved, which is OK if the model just chugs along, but fatal if it's stopped in some specific way. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
On my Red Hat Enterprise Linux Server release 6.10 (Santiago) system (an older release, but Red Hat support their releases for 10 years), the dependencies are:
$ ldd hadcm3s_8.34_i686-pc-linux-gnu
linux-gate.so.1 => (0x006b7000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00cef000)
libdl.so.2 => /lib/libdl.so.2 (0x00d0c000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00347000)
libm.so.6 => /lib/libm.so.6 (0x00da1000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00fd5000)
libc.so.6 => /lib/libc.so.6 (0x00b4b000)
/lib/ld-linux.so.2 (0x565c9000)
$ ldd hadam4_8.08_i686-pc-linux-gnu
linux-gate.so.1 => (0x0038e000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00cef000)
libdl.so.2 => /lib/libdl.so.2 (0x00d0c000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x006cc000)
libm.so.6 => /lib/libm.so.6 (0x00da1000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x002f4000)
libc.so.6 => /lib/libc.so.6 (0x00b4b000)
/lib/ld-linux.so.2 (0x56607000)
$ rpm -qf /usr/lib/libstdc++.so.6 /lib/libm.so.6 /lib/libgcc_s.so.1 /lib/libc.so.6
libstdc++-4.4.7-23.el6.i686
glibc-2.12-1.212.el6.i686
libgcc-4.4.7-23.el6.i686
glibc-2.12-1.212.el6.i686 |
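On a machine that lacks some of the 32-bit libraries, `ldd` normally marks each unresolved dependency with `not found` instead of a path. The Python sketch below picks those entries out of captured `ldd` output. The sample text is made up for illustration (it is not output from any of the machines in this thread), and the helper name `missing_libraries` is my own.

```python
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines look like 'libm.so.6 => /lib/libm.so.6 (0x...)'
        # or 'libstdc++.so.6 => not found'.
        name, arrow, target = line.strip().partition("=>")
        if arrow and target.strip() == "not found":
            missing.append(name.strip())
    return missing


# Hypothetical ldd output for a system without 32-bit libstdc++.
sample = """\
linux-gate.so.1 => (0x0038e000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00cef000)
libstdc++.so.6 => not found
libc.so.6 => /lib/libc.so.6 (0x00b4b000)
/lib/ld-linux.so.2 (0x56607000)
"""
print(missing_libraries(sample))
```

An empty list means every dependency in the captured output was resolved.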
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I always leave Leave non-GPU tasks in memory while suspended checked, but "in memory" just means it can be paged in very quickly, not that it is physically in RAM all the time." I am very sure about what I say (still admitting that I have been wrong before). I do not know what you mean by "a task state physically staying in system memory when suspended with a 'Leave'". In Linux, if the process scheduler decides to interrupt a process and give the processor to another process, it requires no cooperation from the to-be-interrupted process. The interrupted process does not know if it lost the processor because of a hardware interrupt or anything else. And once it is interrupted, as far as the memory manager is concerned, the interrupted process does not need any RAM at all, so all of it is a candidate for being paged out. Now, if there is enough RAM in the machine, it may not be paged out at all. When RAM is required, it is grabbed from free RAM if there is any. If not, it is grabbed from the input cache. If that is not available, output cache must be written out, and can then be re-used. Lacking that, the least recently used pages of processes can be written out. But even if there is enough physical RAM, if a process stops running, its memory translation registers are remapped to the new process and the other RAM would be inaccessible. And when the interrupted process again gets a processor, the memory translation registers will be mapped back to where its data are. I could put 512 GBytes of RAM in my 64-bit machine. So in one sense, it would "never" run out of RAM. But even so, I do not know the maximum virtual address space a process can get. In the old days, it was quite easy to put more RAM in a 32-bit machine than can be addressed by 32 bits (if you had the PAE hardware in the chip set; I did). The OS kernel had to diddle the memory translation registers so a process could address all the memory it needed.
Generally, this meant about 32 bits worth of address space per process. Sometimes (on PDP-11 machines), they could have 32 bits worth of data space and 32 bits of instruction space, but that required trickiness in the hardware. The exact details may have changed since I worked on these things, but that is the general idea. "Since paging is virtual memory methodology, it would only occur if the system needed more RAM than was available for some process. I wouldn't have thought that leaving a task state in memory would trigger paging of that state if another task needed more memory. Rather that elements of the second task would be paged." Leaving a process suspended would not trigger paging in and of itself. The kernel tends to try to keep such stuff in physical memory if possible, so that when the process is resumed it can do so quickly. Remember, the memory manager does not usually know how long a process will be suspended, and it may be resumed much more quickly than it takes to write out a page and read it back. |
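On Linux, one way to see how much of a suspended task is physically resident and how much has been paged out is the `VmRSS` and `VmSwap` fields of `/proc/<pid>/status`. A minimal Python sketch of reading those two fields follows. The sample text and its numbers are invented for illustration, and the function name `memory_residency` is my own, not part of BOINC or the model.

```python
from typing import Tuple


def memory_residency(status_text: str) -> Tuple[int, int]:
    """Return (resident_kB, swapped_kB) parsed from /proc/<pid>/status text.

    A field that is absent from the text is reported as 0.
    """
    fields = {}
    for line in status_text.splitlines():
        key, colon, value = line.partition(":")
        if colon and key in ("VmRSS", "VmSwap"):
            # Values look like '   524288 kB'; keep only the number.
            fields[key] = int(value.split()[0])
    return fields.get("VmRSS", 0), fields.get("VmSwap", 0)


# Invented sample, shaped like an excerpt of /proc/<pid>/status.
sample = "Name:\thadam4\nVmRSS:\t  524288 kB\nVmSwap:\t   10240 kB\n"
print(memory_residency(sample))
```

To inspect a live task, read `/proc/<pid>/status` for the model's process ID into a string and pass it in. A non-zero swapped figure would show that "in memory" has included swap for that process.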
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Name hadam4_a04k_200811_12_785_011729940_2
12-Feb-2019 18:41:52 Started upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip
12-Feb-2019 18:47:17 Finished upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip
Has not crashed yet. The other two workers have both crashed: one because it had no 32-bit libraries, the other, after about three days, with a computation error. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Good for you. :) Mine have uploaded zip 8, and another mod has completed 2, so that's 3 of us running OK. |
Send message Joined: 30 Sep 15 Posts: 11 Credit: 91,760 RAC: 0 |
Hello all, Firstly, I just wanted to say a big thank you for all your useful posts on this model. They have been very gratefully received, and trying to resolve issues such as this would be much harder without your input. To give you a quick update, we are still trying to identify what is happening with the BAD BUFFIN error so that we can resolve it. My initial tests, which tried to ensure that ancillary information is read when the model starts, do not appear to have fixed the problem, so this is proving to be a more complicated issue and I will need to think further about how to fix it. I will talk to Andy about the machines with missing 32-bit libraries so we can clear those sorts of errors for this model. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I will talk to Andy about the machines with missing 32-bit libraries so we can clear those sort of errors for this model." I suppose the information is there, since when I look at the stderr file in the results entries, it says why they failed. One of my failed partners got this:
stderr out
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)</message>
<stderr_txt>
../../projects/climateprediction.net/hadam4_8.08_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
</stderr_txt>
]]>
So all you have to do is look at the error returns from any failed work units, and if one says "error while loading shared libraries", send that user no more work units until he or she gets in touch. This test need be done only once. Of course this is easier for me to say than for you to implement. It does show that the needed information is there, but not how to use it in a way that makes low demands on both human and hardware resources. |
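The check described above could be sketched in Python roughly as follows. This only illustrates matching the loader message in a returned stderr text; it is not how the project's server works, and the function name `missing_shared_library` is made up for this example.

```python
import re
from typing import Optional

# Matches the dynamic loader's message and captures the library name.
_LOADER_ERROR = re.compile(
    r"error while loading shared libraries: (\S+): "
    r"cannot open shared object file"
)


def missing_shared_library(stderr_text: str) -> Optional[str]:
    """Return the name of the library that failed to load, or None."""
    match = _LOADER_ERROR.search(stderr_text)
    return match.group(1) if match else None


# The stderr line quoted in the post above.
sample = (
    "../../projects/climateprediction.net/hadam4_8.08_i686-pc-linux-gnu: "
    "error while loading shared libraries: libstdc++.so.6: "
    "cannot open shared object file: No such file or directory"
)
print(missing_shared_library(sample))
```

A result other than `None` would mark the host as one needing the 32-bit libraries, so the test only has to succeed once per host.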
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Name hadam4_a04k_200811_12_785_011729940_2
Not only did I get a second trickle done, when my two partners did not seem to, but I got a second trickle actually acknowledged. So this runs better than the hadcm3s models that also upload further trickles, but do not appear
12-Feb-2019 18:41:52 [climateprediction.net] Started upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip
12-Feb-2019 18:47:17 [climateprediction.net] Finished upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip
14-Feb-2019 02:41:10 [climateprediction.net] Started upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_2.zip
14-Feb-2019 02:42:18 [climateprediction.net] Finished upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_2.zip
Latest Trickles Received
Time Sent (UTC) | Host ID | Result ID | Result Name | Phase | Timestep | CPU Time (sec) | Average (sec/TS) |
---|---|---|---|---|---|---|---|
14 Feb 2019 07:42:10 | 1256552 | 21490395 | hadam4_a04k_200811_12_785_011729940_2 | 1 | 8,741 | 221,125 | 25.2974 |
12 Feb 2019 23:44:55 | 1256552 | 21490395 | hadam4_a04k_200811_12_785_011729940_2 | 1 | 4,421 | 112,144 | 25.3662 |
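The "Average (sec/TS)" column in the trickle listing appears to be simply CPU time divided by timestep. Here is a small Python check using the two rows above; the reading of the column is my inference from the numbers, not something the project has stated.

```python
def average_sec_per_timestep(cpu_time_sec: float, timestep: int) -> float:
    """CPU seconds spent per model timestep."""
    return cpu_time_sec / timestep


# (timestep, CPU time in seconds) taken from the two trickle rows.
trickles = [(8741, 221125), (4421, 112144)]

for timestep, cpu_time in trickles:
    print(timestep, round(average_sec_per_timestep(cpu_time, timestep), 4))
```

Both rows reproduce the listed averages to four decimal places, which suggests the per-timestep cost stayed steady at about 25.3 seconds between the first and second trickle.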
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Well! My machine just crashed for the first time since I got it a little over six years ago (1 Dec 2012, 10:10:08 UTC was when I registered it with BOINC and CPDN). The screen put up a curious pattern (not the one I get if there is no video), and nothing worked as far as I could tell. I tried to run a regular terminal, but if I did, I could not tell. I powered off the monitor and waited a bit, and restarted it: no change. I pushed the reset button on the monitor with no change. I tried to shut down the windowing system (Control-Alt-Backspace) but no change. So I powered the whole thing off, waited a bunch of seconds for the hard drives to spin down, and rebooted. It came up as normal, and everything seems to be running, including my current CPDN task. So we will see what we will see. I wish it had not done that. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Oh! Poo! It died. Because my machine crashed, and that was too much for the program. It did restart and ran almost 1/2 hour, but by then it could not continue.
Name hadam4_a04k_200811_12_785_011729940_2
Workunit 11729940
Created 11 Feb 2019, 14:54:33 UTC
Sent 11 Feb 2019, 14:54:46 UTC
Report deadline 24 Jan 2020, 20:14:46 UTC
Received 14 Feb 2019, 19:11:16 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 22 (0x16) Unknown error number
Computer ID 1256552
Run time 3 days 2 hours 38 min 42 sec
CPU time 2 days 22 hours 48 min 23 sec
Validate state Invalid
Credit 389.03
Device peak FLOPS 1.28 GFLOPS
Application version UK Met Office HadAM4 at N144 resolution v8.08 i686-pc-linux-gnu
stderr out
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
13:45:02 (3083): called boinc_finish(22)
</stderr_txt>
]]> |
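When comparing failed results, it can help to pull just the exit code and the number of BAD BUFFIN crashes out of each stderr text. A minimal Python sketch, assuming the message formats shown in the post above; the helper name `summarize_stderr` and the shortened sample are mine.

```python
import re
from typing import Optional, Tuple


def summarize_stderr(stderr_text: str) -> Tuple[Optional[int], int]:
    """Return (exit_code, number_of_BAD_BUFFIN_crashes) for a stderr text.

    exit_code is None if no 'process exited with code N' message is present.
    """
    match = re.search(r"process exited with code (\d+)", stderr_text)
    exit_code = int(match.group(1)) if match else None
    crashes = stderr_text.count("Model crashed: READDUMP: BAD BUFFIN")
    return exit_code, crashes


# Shortened sample in the same format as the stderr above (two crashes only).
sample = (
    "process exited with code 22 (0x16, -234)\n"
    "Suspended CPDN Monitor - Suspend request from BOINC...\n"
    "Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy\n"
    "Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy\n"
    "Sorry, too many model crashes! :-(\n"
)
print(summarize_stderr(sample))
```

Run against the full stderr in the post, this would report exit code 22 and six crash messages.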
Send message Joined: 15 May 09 Posts: 4544 Credit: 19,039,635 RAC: 18,944 |
The good news is Sarah has worked out what is causing the problem. The bad news is, so far she doesn't know how to fix it. :( |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"The good news is Sarah has worked out what is causing the problem." I wonder if they ever figured out what was causing the original Linux problem, or whether it just worked with a new batch? That is, can we expect more Linux work in the future? I find it easier to put a Linux machine on CPDN, as I already have them set up to run BOINC anyway. |
©2025 cpdn.org