Message boards : Number crunching : UK Met Office HadCM3 Short
Author | Message |
---|---|
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
This kind of WU should not be stopped! The WU works fine and will finish if it is not interrupted. But if you stop it to do backups and then resume, it results in WU errors... |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Yes, unfortunately. :( But I think that's only with Windows, with a service install of BOINC. :) It's nice to see some short, speedy little models, even if the zips ARE big. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Well, as I fit Les's description, I've disconnected from HadCM3 Short, since the BOINC service is stopped every night for backups and then restarted. That said, my success/error rate has been around 50/50 on the shorts, and I caused a good number of those errors myself. For example, I managed to trigger a Windows update while 8 of them were running, and that crashed them all. What a surprise! I always stop BOINC before doing things like that, but I guess we're all allowed our bad days :-( I'm not sure that a huge number were triggered by stopping and restarting the service. The other approach would be not to stop BOINC for backups. Any thoughts? BOINC data is on a separate HD which is not backed up at all. The BOINC programs are backed up, as an image is taken of that drive. |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
Don't get me wrong: 1. It's NOT installed as a service. 2. I ALWAYS stop BOINC and then check Task Manager for remaining CPU activity and for a boincmgr.exe that is still resident (not stopped). 3. It's EASY to do several backups a day with an SSD installation, and I can't afford to skip them because of the demands of other projects. 4. Finally, up to now, E V E R Y WU has failed because of interruption. Have a nice day |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
4. Finally, up to now, E V E R Y WU has failed because of interruption |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Weell buddy -- if you was running the Linux version -- smiley smiley. Saving, backing up, and restarting would work OK (this particular issue only). The hadcm3s continue OK after restart on Linux. The weird bit is -- download about 50 meg, run a day or two, upload 63 meg twice. And then leave 800 meg sitting in the WU's folder. Have to clear it out myself. Have done so for many dozens of WUs. The down, up, remainder seems mathematically strange. |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
Well Eirik, that's strange, because I don't have that remaining folder over here. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
BOINC 7.2.42 (64bit) on Ubuntu Trusty (64bit). BOINC folder is in BOINC user's home directory with good permissions. When a hadcm3s_ fails, the subfolder BOINC/projects/climateprediction.net/hadcm3s_<task-id> gets removed along with the other task-specific files. When a hadcm3s succeeds --- that's when the 814 megabyte folder gets left behind. Seen several dozen examples last few weeks. No idea why. |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
In case of success I don't have that folder either. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Here's what it looks like: cpdn@thistle:~$ du -s BOINC/projects/climateprediction.net/* | sort -g -r | head -n3 Output: 814848 BOINC/projects/climateprediction.net/hadcm3s_1bb6_1990_2_008918940; 720684 BOINC/projects/climateprediction.net/hadam3p_anz_rudx_2012_1_008965960; 673152 BOINC/projects/climateprediction.net/hadam3p_anz_rue2_2012_1_008965965. 1bb6 completed OK. Time to browse client.state and the log files. |
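For anyone without `du` and `sort` to hand (on Windows, for example), the same check can be roughly sketched in Python. This is only an illustration, not a BOINC tool: the function names `folder_size_bytes` and `largest_entries` are made up for this example, and the project path in the usage block is the layout from the post above, which will differ on other installs. It sums file sizes in bytes, so the numbers will not match the block counts that `du -s` prints.

```python
import os


def folder_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # A running task may delete a file while we are scanning.
                pass
    return total


def largest_entries(project_dir, n=3):
    """Return the n largest entries in project_dir as (bytes, name) pairs."""
    sizes = []
    with os.scandir(project_dir) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = folder_size_bytes(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((size, entry.name))
    sizes.sort(reverse=True)
    return sizes[:n]


if __name__ == "__main__":
    # Assumed location, taken from the post above; adjust for your install.
    project = os.path.expanduser("~/BOINC/projects/climateprediction.net")
    if os.path.isdir(project):
        for size, name in largest_entries(project):
            print(f"{size / 1024 ** 2:8.1f} MiB  {name}")
```

A leftover `hadcm3s_` folder for a task that has already reported should show up near the top of that list.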
Send message Joined: 21 Oct 10 Posts: 53 Credit: 2,101,753 RAC: 3,985 |
I got 2 of those WUs on my iMac and they both failed after more than one day of calculation: <core_client_version>7.5.0</core_client_version> But I did not suspend BOINC or anything. The only thing I can think of is that I changed (long ago) the parameter that tells BOINC to switch applications after one hour; I set it to one day instead (1440 minutes). So is this "killing" this application? But then these would be failing for almost everybody, since this parameter is set to 60 minutes by default in the BOINC installation and most people are probably not changing it...? |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
"... But these would be failing for almost everybody then, since this parameter is set to 60 mins by default in boinc installation and most people are probably not changing it... ?" The variability between machines is very large for this model. Some users (e.g. astroWX) have completed many of these models and others (including me) have not succeeded in starting a single one. Some of my crashed models have reported "INVALID THETA DETECTED", which is normally interpreted as an unphysical model. That so many should crash in that way, so early, and others crash with different errors suggests to me some model configuration error or BOINC compatibility problem, so I have excluded HADCM3S from my project preferences. I have not yet seen any explanation for why a particular model should fail to start on a completely reliable machine (as you can see, I don't believe the "filtering parameter space for viable points" explanation) ... |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Briefly wondering why this has never been an issue for me till I realised that it doesn't affect those of us who will only be running one project at a time. |
Send message Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
"Briefly wondering why this has never been an issue for me till I realised that it doesn't affect those of us who will only be running one project at a time." I'm only running CPDN while I have CPDN work, without interruptions. I crashed 8 hadcm3s units (and successfully ran none) before giving up. Two hadam3p_anz units, a hadam3p_pnw unit and a hadcm3n unit are all currently running normally. I don't think that's where the problem lies. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The fail/succeed difference is Windows/Linux. Mostly, anyway. During beta testing, I tried all sorts of things to crash them, including setting the prefs for "don't keep in memory", and shutting down both BOINC and the computer. They just kept on running. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
"This kind of wu should not be stopped!" Oh, it should be stopped; it doesn't work properly. 20000 new ones on the server, yikes! |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
"Oh, it should be stopped, doesn't work properly." Or just not made available to Windows users. As Les has said, they seem bulletproof on *nix. I have had them survive two power failures here. |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
Just to show a different perspective, here are my results. My computer is running Windows 8.1 and has a ratio of 44 successfully completed to 3 failures (though I was given full credit for 2 of the failures). My son's computer runs Windows 7 and has a ratio of 52 successes to 4 failures. I run CPDN exclusively and continuously on both computers. |
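To put those two ratios on the same footing as the "50/50" figure reported earlier in the thread, they can be turned into success rates. The short sketch below just does that arithmetic; the function name `success_rate` is made up for this example, and the counts are the ones quoted in the post above.

```python
def success_rate(successes, failures):
    """Fraction of finished tasks that completed successfully."""
    total = successes + failures
    if total == 0:
        raise ValueError("no finished tasks to compute a rate from")
    return successes / total


# Counts as reported in the post above.
for label, ok, bad in [("Windows 8.1", 44, 3), ("Windows 7", 52, 4)]:
    print(f"{label}: {success_rate(ok, bad):.1%} of {ok + bad} tasks")
# prints "Windows 8.1: 93.6% of 47 tasks" and "Windows 7: 92.9% of 56 tasks"
```

Both machines come out at roughly 93% success, well above the 50/50 rate mentioned earlier.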
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Takes us back to the question as to why some boxes but not others? |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
Seven new wu. E V E R Y wu crashed without interruption. |