Thread 'HadCM3N o series not compatible with Linux?'

Author	Message
Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 43193 - Posted: 11 Oct 2011, 20:08:50 UTC My record with the 'o' series: downloaded: 15. completed: 1. Failed, code 193: 7. (3 @ 25%, 2 @ 50%, 1 @ 75%, 1 @ 100%.) In progress: 7. (2 of which are due to finish/fail later today.) What is the remedy for code 193 errors?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 43194 - Posted: 11 Oct 2011, 21:24:58 UTC - in response to Message 43193. "Thyme Lawn" posted a few words about that error code here. However, failures at the 25% points are possibly a different matter. This is when crunching stops while the files accumulated since the start (or the last zip point) are collected, have their extensions changed, and then get zipped up for sending back to the server. ANY interruption at this time usually causes the model to fail. The danger zone extends from late November until after the first checkpoint in December. Backups: Here

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 43200 - Posted: 12 Oct 2011, 8:54:37 UTC - in response to Message 43194. It's not a permissions problem, because other models complete normally, or fail with "invalid theta" - i.e., problems with the physics in the model. And some 'o' series models succeed. I don't think that it's other activity on the computer, because 'o' series models fail at 25% or 50% or 75% even when I make special efforts to eliminate other activities: suspending all models but one, shutting down the GUI, shutting down cron ('scheduled tasks'), etc. It could be that the models need the computer to be doing other things, and it's not doing enough of them. But that seems unlikely.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762	Message 43201 - Posted: 12 Oct 2011, 15:12:08 UTC - in response to Message 43200. Doesn't seem to be a problem of compatibility with Linux. I am about to complete my first one of these models, however - now over 99% complete!

wateroakley Joined: 6 Aug 04 Posts: 195 Credit: 28,347,450 RAC: 10,508	Message 43202 - Posted: 12 Oct 2011, 17:53:30 UTC - in response to Message 43201.
Greg, my Intel/Linux ubuntu host has been chugging away mostly successfully on these CM3. I have made a comparison of your CM3 model WU set completion rate with the WU sets on two Intel/Win hosts and the one Intel/Linux host. I have ignored non-starters, early crashes, detached hosts, serial crashers, etc.
a. Intel/Win host. 22 tasks attempted. WU set size 29, 7 in progress, 11 completed, 11 crashed (frequently at the 25/50/75/100 point, as observed by credit assigned). That is a 50% task completion rate.
b. Intel/Linux ubuntu host. 5 tasks attempted. WU set size 10, 4 in progress, 3 completed, 3 crashed. That is a 50% task completion rate.
c. Your Intel/Linux host. 20 tasks attempted. WU set size 45, 18 in progress, 6 completed, 21 crashed. That is a 22% task completion rate.
Hmm - a much lower success rate with a similar set size of tasks? Results a & b: no difference between Intel/Win and Intel/Linux. Result c: about half the a & b Win/Linux success rate, for your Intel/Linux host. Only found one Intel/Darwin that made a brave effort before crashing after about 30 hours. All others croaked after a few seconds.
The totals of Crashes (Error while computing) whilst in progress and Completions (Completed) by cpu/OS mix:
cpu/OS        Crashes     Completions
Intel/Win     18 (62%)    11 (38%)
AMD/Win        6 (86%)     1 (14%)
Intel/Linux    7 (47%)     8 (53%)
The topic has been discussed in other threads, at other times, and in published papers. This set of 51 task results suggests that the cpu/OS mix does have some bearing on the outcome (evidenced by noticeably variable completion/crash rates) of current CM3 WUs. That leads to the first possibility: the previously documented variations between results from models running on different cpu/OS types.
If you consider that CM3s run slower on Linux, that leads to a second possibility: that the compilation is different, perhaps slightly less aggressive on the Linux version? What the evidence of better overall completion rates in Linux does not readily explain is the lower success rate of 'o' WUs in Linux on your host, unless it is simply the variation of results due to 'o' model parameter settings combined with the external parameter of cpu/OS. (I have not looked at any of the WU initial settings.) There may be many more possibilities, perhaps even a latent software bug? Perplexed in Water Oakley.
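For anyone wanting to redo the percentages in the totals above, here is a minimal Python sketch. The counts are copied from the post; the names `counts`, `rates` and `summary` are made up for illustration, and the rate is simply taken over tasks that either crashed or completed.

```python
# Crash/completion counts per cpu/OS mix, copied from the totals in the post.
counts = {
    "Intel/Win": {"crashed": 18, "completed": 11},
    "AMD/Win": {"crashed": 6, "completed": 1},
    "Intel/Linux": {"crashed": 7, "completed": 8},
}


def rates(crashed, completed):
    """Return (crash %, completion %) among finished tasks, as whole percents."""
    finished = crashed + completed
    return round(100 * crashed / finished), round(100 * completed / finished)


# One (crash %, completion %) pair per cpu/OS mix.
summary = {mix: rates(**c) for mix, c in counts.items()}
total_tasks = sum(c["crashed"] + c["completed"] for c in counts.values())

for mix, (crash_pct, done_pct) in summary.items():
    print(f"{mix:12s} crashed {crash_pct}%  completed {done_pct}%")
print("tasks counted:", total_tasks)
```

The sample is small: with only 7 AMD/Win tasks, a single result moves that rate by about 14 percentage points, so the differences between rows should be read with caution.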

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 43209 - Posted: 13 Oct 2011, 20:19:14 UTC Last modified: 13 Oct 2011, 20:24:29 UTC Just another data point but probably won't help any diagnosis... The only linux PC I've been running since May has 23 hadcm3n successes with 3 failures (excepting those crashed/aborted because of the improper ancillary input files). The three failures were with "invalid theta" and not at a multiple of 25%. It's only had 3 "o" series runs and those completed successfully. It's an AMD Phenom II X4 940 running 64 bit Fedora 14 and BOINC 6.10.58. It's pretty much a dedicated crunching PC, and the only boinc project it runs is cpdn.

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 43210 - Posted: 13 Oct 2011, 21:12:55 UTC Yes, code 193 is not just a linux problem, is it. Looking back through work units that my PC has been allocated tasks for, the first 'o' series back in May-early June seems free of code 193 errors. They first show up in late June in the 't' series, but seem rare. The 'y' series has more, the 'P' series more still, and the latest 'o' series has a lot. Of the 15 recent 'o' series work units for which my PC has been allocated a task there can be at most 11 successful completions, with a more likely value being 9. Without the code 193 errors, that would be 12 or more. Code 193 appears not to relate to the physics of the model, but rather to the mechanics of preparing zip files, so this is skewing the results. Of the 63 tasks issued so far for the 15 work units, 29 failed immediately, on computers that almost invariably do this. Perhaps the project could invest a couple of weeks' programmer time in creating 'config check' tasks, in order to pre-qualify PCs asking for work. This would significantly reduce bandwidth demand and database load. (For comparison with the 29 'immediate crash' errors, there have been 10 'code 193' errors and 6 'other', and 14 in-progress tasks have made it to the first trickle.) geophi, my PC also only runs CPDN (no other BOINC work). It does also act as a file server and backup host on my home network, and I word-process and web-browse on it. But those activities use negligible CPU even on an Atom. I'll have a think about rehabilitating an old PC and moving the word processing, browsing, file serving and backup to that. But I'll wait to see if more work is forthcoming from CPDN, first.

old_user170894 Joined: 3 Mar 06 Posts: 96 Credit: 353,185 RAC: 0	Message 43211 - Posted: 14 Oct 2011, 0:09:55 UTC - in response to Message 43210. BOINC error codes (errors in the BOINC binaries) are all negative numbers. Code 193 is positive so that means it's an application error. The BOINC FAQ says exit code 193 means: "Code 193 is a segmentation violation error. You either have problems with your memory or swap file, or the application attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location, or to overwrite part of the operating system). Use a memory checking program like memtest86+ to rigorously test your memory. And always when you have this error, report it on the forums of the application it happens with. It may well be an error in the application's code." My Fedora installer DVD has memtest86+ on it, IIRC. Maybe your distro does too?

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 43212 - Posted: 14 Oct 2011, 3:14:28 UTC - in response to Message 43211. Last modified: 14 Oct 2011, 3:29:32 UTC Yes, I have memtest86+, and I have run it - but only for 48 hours. Perhaps I need to run it for three weeks. But why are other people getting increasing numbers of code 193 errors? Edit: And why does it only happen to the models that are at 25%, 50%, 75%, or 100%? Why not randomly?

old_user170894 Joined: 3 Mar 06 Posts: 96 Credit: 353,185 RAC: 0	Message 43213 - Posted: 14 Oct 2011, 5:20:37 UTC - in response to Message 43212. Last modified: 14 Oct 2011, 5:25:06 UTC Those are the points at which files are zipped and uploaded, right? It could be a bug in the compression binary or a shared lib it uses. If it's happening only to you then it could be that your binary or shared libs are corrupted. You could try:
1) replacing them with fresh copies
2) comparing MD5 checksums of the binary and shared lib(s) with MD5 checksums of known good files
3) running the compression tools yourself to see if they crash, using a script to run them repeatedly against a variety of files
Otherwise, there might be clues in your system error logs as to what process is failing, if you can stand reading those damn things for more than 10 minutes. Instead you could grep them for terms like error, zip, filename strings common to CPDN uploads, seg(mentation error), boinc, CPDN executable names...
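Suggestions 2) and 3) above could be scripted roughly as follows. This is only a sketch under assumptions: `md5_of` and `zip_roundtrip_ok` are made-up helper names, and Python's own `zipfile` module stands in for whatever compression code the CPDN application really uses, so a clean run here exercises the disk, RAM and filesystem path but says nothing about the project's own binary.

```python
import hashlib
import os
import tempfile
import zipfile


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks so large files are fine."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def zip_roundtrip_ok(paths, rounds=10):
    """Zip the files repeatedly and verify each archive's CRCs; True if all pass."""
    with tempfile.TemporaryDirectory() as workdir:
        for i in range(rounds):
            archive = os.path.join(workdir, f"round_{i}.zip")
            with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
                for path in paths:
                    zf.write(path, arcname=os.path.basename(path))
            with zipfile.ZipFile(archive) as zf:
                # testzip() returns the first corrupt member name, or None.
                if zf.testzip() is not None:
                    return False
    return True


if __name__ == "__main__":
    # Demo on a throwaway file of random bytes; point it at real files instead.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.dat")
        with open(sample, "wb") as handle:
            handle.write(os.urandom(256 * 1024))
        print("md5:", md5_of(sample))
        print("zip round trips ok:", zip_roundtrip_ok([sample], rounds=3))
```

To check a binary or shared lib, compare the `md5_of` result with a digest taken from a known good copy of the same version of the file.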

wateroakley Joined: 6 Aug 04 Posts: 195 Credit: 28,347,450 RAC: 10,508	Message 43214 - Posted: 14 Oct 2011, 9:33:07 UTC Greg, an overnight memtest is normally long enough to pick up most soft memory errors, so 48 hours without error is probably long enough to say the memory (and don't forget the BIOS settings) is good to go. If this is happening at the zip point, it could also be exacerbated by hard drive errors. I've had four HDDs fail (RIP) this year on XP & ubuntu (including warranty replacements) that exhibited random errors with CPDN when writing to disc. You may wish to run your hard drive diagnostics. Any reported soft or recoverable or fixable errors should be treated with suspicion, as they all led ultimately to a terminal event on the (RIP) four. Good luck.

wateroakley Joined: 6 Aug 04 Posts: 195 Credit: 28,347,450 RAC: 10,508	Message 43215 - Posted: 14 Oct 2011, 16:57:22 UTC Mea culpa. I paused this 'o' task hadcm3n_o5yd_1940_40_007443919_4 in the normal manner (snooze) to close the laptop down, and ker-ching! It tripped out with a 193 error.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 43216 - Posted: 14 Oct 2011, 19:11:44 UTC - in response to Message 43215. 'Snooze' is the same as 'pull out the power plug' as far as climate models are concerned. It shuts down the computer faster than the program can shut down all of its open files. Sometimes you can get away with it, but sooner or later ...

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 43217 - Posted: 14 Oct 2011, 19:44:49 UTC I have an Intel Linux machine which had a serious number of -193 crashes: hostid=1170809

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 43218 - Posted: 15 Oct 2011, 1:46:27 UTC - in response to Message 43212. Last modified: 15 Oct 2011, 2:16:57 UTC Newtonian, thanks for chipping in. But the code 193 errors on that PC aren't happening only at the zip points. Some of them are segmentation violations and some of them are due to a too-old libc. It looks like you fixed the problems, though. Hagar: thanks for the advice. I do run smartctl -t short about weekly, and check the disks' "health" and error logs (smartctl -H, smartctl -l error), but there's nothing there. It could be a bug in md, but I doubt that. There would have been screams from the enterprise people. (I'm running a software RAID5, having long ago and many times learned that hard disks have two states: "not failed yet" and "failed", and that backups follow the law of inverse significance - they work fine, except when you really need them.) Dagorath, about shared library corruption: it's possible. But: since my last message, four zip milestones for the tasks on my PC all passed uneventfully yesterday. Then a few minutes ago the very latest one failed at 50%. I would have thought that corrupted files would be consistently corrupted...? Edit: Mo.V provides more examples of computers crashing with code 193 at 25%, 50%, 75%, and 100% here.

wateroakley Joined: 6 Aug 04 Posts: 195 Credit: 28,347,450 RAC: 10,508	Message 43225 - Posted: 15 Oct 2011, 15:24:35 UTC - in response to Message 43218. "(I'm running a software RAID5, having long ago and many times learned that hard disks have two states: "not failed yet" and "failed", and that backups follow the law of inverse significance - they work fine, except when you really need them.)" Indeed, SOP for hard drives, and always assuming that your client actually takes a backup.

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 43668 - Posted: 14 Jan 2012, 0:42:02 UTC For the record, the problem is much reduced after these changes:
1. More RAM. I added another 8GB of RAM, bringing the per-core total to 1.5 GB. In retrospect, allowing only 500MB per core might have been a bit tight for HadCM3Ns; even though they appear to use only 220 - 240 MB each, perhaps there is a bit of copying between memory locations at checkpoints, which could double the memory requirement temporarily. Also, more total RAM means that the OS will make its process tables and file-access tables a bit bigger.
2. OS tweaks. Reduced vm.vfs_cache_pressure. This setting relates to the OS's memory caches for disk filesystem data; reducing the value makes the OS keep these caches, rather than shrinking them if they haven't been used "for a while". Probably handy during the creation of the zip files, which seems to involve a sudden storm of reading from a lot of files, and therefore a lot of looking up filesystem information. Also: reduced vm.swappiness - a lower value tells the OS not to swap out application code; having to read it back in would only make the disk I/O situation worse at zip-file creation time.
3. OS helper. Installed irqbalance to share the interrupt-servicing load among the four cores of the processor, so the task that happens to be scheduled on processor 0 when there is an "interrupt storm" (for disk reads) is not delayed so much.
4. Possible helper. Installed rtirq.sh to speed up servicing of certain interrupts. (This was done mainly to fix audio under-run errors, but it seemed to improve overall stability. Maybe this is psychological, though.)
If I had to do just one of these, I'd pick no. 2, with no. 1 next.
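Before changing the two settings in item 2, it may help to see what a host is currently using. The Python sketch below only reads the values; the helper name `read_sysctl` and its `proc_root` parameter are made up for illustration, and it assumes a Linux host where sysctls are exposed under /proc/sys. The post does not say which values were chosen, so none are suggested here.

```python
from pathlib import Path

# The two settings named in the post; dots in a sysctl name map to directories.
SETTINGS = ("vm.vfs_cache_pressure", "vm.swappiness")


def read_sysctl(name, proc_root="/proc/sys"):
    """Return the current integer value of a sysctl, or None if unreadable."""
    path = Path(proc_root, *name.split("."))
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for name in SETTINGS:
        print(f"{name} = {read_sysctl(name)}")
```

Changing a value needs root, typically with `sysctl -w name=value` for the running system or an entry in /etc/sysctl.conf to make it persist across reboots.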

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 43669 - Posted: 14 Jan 2012, 19:57:40 UTC Thanks for posting that, Greg. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest.