Message boards : Number crunching : List of Errors with TRIFFIDs models - Linux only
Author | Message |
---|---|
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Hi, since a big chunk of models is currently piling up and not being processed by Linux users, namely the MOSES II land scheme and TRIFFID models, I thought that if we list some of the problems with them we may be able to get the project people's attention to fix these models. So far these models: 1) Miss an upload whenever the BOINC client (or the system) is restarted, no matter how: gently or by power outage. As a result these models finish with status: Error while computing. This is probably true for the wider family of MOSES II models as well. 2) Do not upload their last zip even if they run 24/7. In my case the model reported missing 12 zip and some other error. The model ran uninterrupted until the end. Status: Error while computing. 3) TRIFFID models, both the global and the EU ones, run for the same time as non-TRIFFID MOSES ones, but get 25% credit compared to them, as already reported by Eirik here (and a few others) and taken up with the project people by geophi. 4) When interrupted (gently or not so), they seem to lose a trickle-up message as well, though I could not find the relevant posts in the forum. Please list any other anomalies you've encountered with these models; I hope they get fixed so one can crunch them successfully. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Bernard, thanks for putting it all in one place, but the project people already know about these things. The moderators keep telling them, but unfortunately nothing happens. *** My computers quite happily run them, without 1) or 2) happening. Unless the Network access option in BOINC is set to Off, some zips and trickles accumulate, and then there's a power failure. Afterwards, even though those 2 items are still there, BOINC has forgotten about them, so they don't get uploaded. If there are NO zips/trickles waiting, shutting down BOINC, turning off the power for a few minutes, and then restarting everything has no effect on the models. They keep running, and produce and upload ALL files. BUT ... there may be an interval just before the zips/trickles are created when it IS fatal to interrupt them. How long this interval is could be worked out the hard way, if it weren't for the long, slow period of time it takes to run them. *** Something may be about to happen, but everyone is going to have to wait. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
From what I see here http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?userid=3970&offset=0&show_names=0&state=2 problems 1) and 2) are either gone and past, or my minifarm ran the last few of them end-to-end with no interruptions. Problem 3): yup. But that's just a credit thing, probably fixed sooner or later. Right now (since April 1), the current batch of (Linux only) "UK Met Office HadAM3P and HadRM3P model with MOSES II and TRIFFID Europe v7.01" is huge, and I've only seen a few misconfigured instant failures in the past week or two. The only problem with these is that there are so MANY of them, and only the fraction of Linux machines with the right 32-bit libs can run them. About 13000 are waiting to send. I have an almost entirely zero-evidence intuition that about 1000 per week will complete and report success. Yes, I've checked, and this batch can restart, reboot, and hibernate just fine without losing uploads. And yes, there are still times when the BOINC-climate-model interface is vulnerable. As for getting all models working flawlessly on all platforms from first submission: it will never happen, even with a USA DOD-size budget. The base Hadley model is stable; the BOINC infrastructure, the various OSes, and even the hardware (virtualized? cloud? shirimasen - no se), not so much. And the contributor hardware and software? That's part of the whole distributed thing. |
©2024 cpdn.org