New Model Type HadAM4
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4544 Credit: 19,039,635 RAC: 18,944 |
"I wonder if they ever figured out what was causing the original Linux problem, or whether it just worked with a new batch?" I can't remember, there are a lot of messages to trawl through to check that and the crucial one may be one I have deleted! "That is, can we expect more Linux in the future? I find it easier to put a Linux machine on CPDN, as I already have them set up to run BOINC anyway." Yes, we can certainly expect some more Linux tasks. I think I am right in saying that the Windows version of this particular task type didn't work and didn't make it to the testing stage. Be warned however, the restart uploads for higher resolution HadAM4 tasks are likely to be around 190-200MB but discussions about how to handle that are still ongoing. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Be warned however, the restart uploads for higher resolution HadAM4 tasks are likely to be around 190-200MB but discussions about how to handle that are still ongoing." No problem. I can upload at 10 Mbps, and need something to justify my monthly fee. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The good news is Sarah has worked out what is causing the problem. It's a long way back now, and the details are getting a bit hazy, but I think it went something like: The office computers/software are supplied by the uni/IT, and when our people tried to compile a 32 bit Linux model on a 64 bit computer, the exe didn't work correctly. And I think there was urgent work to get out at around the same time. (Isn't there always, when you're trying to cram in a bit of "other stuff"?) So it got dropped. And another opportunity didn't show up any time soon. Until now. And now it's the other way around - the Windows version won't co-operate. But that's all with new programs; they still have the older programs for them to use. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Thank you, that is quite sufficient. And I won't need to build a Windows machine, which would only crash the project anyway. |
Joined: 22 Aug 06 Posts: 6 Credit: 2,836,837 RAC: 0 |
Another data point: I've had 8 models running on my 8-core machine. 5 have just finished with 22 days run time, and the other three look like they'll make it within the next 18 hours. BOINC is configured to suspend computation if other processes are demanding CPU time, but I have "Leave in memory" checked. The models were suspended dozens of times but didn't seem to mind. Sadly I can't continue with CPDN because my PC is unlikely to be on 24/7 beyond this week. |
Joined: 15 May 09 Posts: 4544 Credit: 19,039,635 RAC: 18,944 |
"Sadly I can't continue with CPDN because my PC is unlikely to be on 24/7 beyond this week." I have found that tasks which crash if the machine is fully shut down survive if suspend or hibernate is used instead. |
Joined: 18 Jul 13 Posts: 438 Credit: 25,898,299 RAC: 9,018 |
I'm leaving mine to finish, then I'll do some software updates, and then I'll test restart, suspend etc. with HadAM4 if I get any. |
Joined: 9 Oct 04 Posts: 82 Credit: 70,030,430 RAC: 2,252 |
I have had several errors: "Model crashed: READDUMP: BAD BUFFIN OF DATA". The WUs were quite far along when they failed, like this one: https://www.cpdn.org/result.php?resultid=21924162 I have limited climateprediction to two concurrent WUs on this computer, so I was wondering if there is a cure. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"I have several errors: 'Model crashed: READDUMP: BAD BUFFIN OF DATA'. The Wus have been quite advanced." It looks like the problems started in early to mid April. Did anything change on that PC or the environment it's in during that time frame? Some of the crashes, and even some of the tasks that said they completed successfully, had negative theta errors in stderr.txt. While that is sometimes a problem with the initial conditions or parameters for a given task or set of tasks, it can also indicate some hardware instability. If the PC is in a particularly dusty or warm environment, that could cause problems, and a thorough cleaning plus a check for good air flow through the system might remove that possibility. Or perhaps CPU, memory and hard disk integrity checking software could be run to see whether any obvious errors are evident? Just a shot in the dark here, as I'm not certain it is a hardware/cooling issue, but checking those things would at least rule them out as causes of the problems. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Not enough memory for a computer that size, either. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"Not enough memory for a computer that size, either." He said he's only running a couple at a time though. If he were trying to run 32, or even 16, that would be a whole different matter. |
Joined: 9 Oct 04 Posts: 82 Credit: 70,030,430 RAC: 2,252 |
Just to clarify: a) the computer is limited to "use at most = 99%" b) climateprediction is limited to 2 WUs with app_config c) the rest of the tasks are CPU: tn-grid and GPU: gpugrid d) the CPU runs at 4100 MHz - I tried to go higher (4150 MHz - around April), but then tn-grid had some random errors, "SIGSEGV: segmentation violation", so I limited it again to 4100 MHz; I have had some rare shutdowns since then.* e) with this setup RAM does not seem to be an issue f) I sometimes run dispersion models on the computer for work, but not during this shutdown g) I clean the computer periodically, so dirt should not be an issue. * This explains some errors. tn-grid works fine again, which is why I am wondering why climateprediction still has some errors. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I thought it might be a "can't get the data fast enough" problem, caused by tasks from other projects running. It's going to crash on the next computer as well, because of the missing library problem. |
Joined: 18 Jul 13 Posts: 438 Credit: 25,898,299 RAC: 9,018 |
I also have a few WUs with this error; however, they all finished successfully: https://www.cpdn.org/cpdnboinc/result.php?resultid=21927138 https://www.cpdn.org/cpdnboinc/result.php?resultid=21920061 This computer is under heavy use with other tasks, hence only 2 WUs, but I suspect it just can't handle all the load (i7-3520M, 8 GB RAM). |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The computer that got it next DID crash it. (hostid=1482854) It has a perfect record: 803 tasks run, 803 tasks failed. And it's now been blocked. If the owner shows up, could someone please point them to the Linux thread that has all the various updates that need to be run for the different types of Linux. |
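
For anyone wondering what the "limited to 2 WUs with app_config" setting mentioned earlier in the thread might look like, here is a minimal sketch in Python that builds such a file. It assumes the limit was set with BOINC's `project_max_concurrent` option in `app_config.xml`; the poster may equally have used a per-application `max_concurrent` entry, and the thread does not say which. The function name `build_app_config` is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent: int) -> str:
    """Return the text of an app_config.xml that caps a project's running tasks.

    Assumption: the cap is expressed with BOINC's <project_max_concurrent>
    element. The thread does not show the poster's actual file.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    xml_text = (
        "<app_config>\n"
        f"    <project_max_concurrent>{max_concurrent}</project_max_concurrent>\n"
        "</app_config>\n"
    )

    # Parse the text back to confirm it is well-formed XML before returning it.
    ET.fromstring(xml_text)
    return xml_text


if __name__ == "__main__":
    # Print a config that limits the project to two concurrent tasks.
    print(build_app_config(2), end="")
```

The output would normally be saved as `app_config.xml` in the project's folder inside the BOINC data directory, and then picked up by asking the BOINC Manager to re-read the config files or by restarting the client. The exact folder depends on the installation, so check your own setup before copying anything.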