Message boards : Number crunching : hadam3p_eu crash 45 seconds in.
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It doesn't look like there'll be any EU models left by Monday morning. All gone. Possibly deprecated by one of the project people, as requested by a moderator. Backups: Here |
Send message Joined: 28 Mar 11 Posts: 35 Credit: 82,588 RAC: 0 |
"Why never test these loser WUs before testing the volunteers' bandwidth?" Dear Mr Redd, I certainly do dare to apologise for the inconvenience and annoyance that the current round of problems has caused you, and all other supporters of CPDN. The recent Hadam3p release was, as far as I am aware, produced in exactly the same manner as previous hadam3p releases (which have been running successfully). I apologise for the wasted bandwidth that we have caused. We have been made fully aware how much this has upset people, and we will, in future, take steps to minimise these problems. I will be posting an apology on the news items on this site on Monday. Finally, I would ask that, annoyed as you obviously are, you please refrain from abusing the project staff. Jonathan Miller CPDN System Administrator |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
Two latest regional models crashed in 9 and 10 seconds respectively. Even faster than the 45ish seconds from the last few I have had! |
Send message Joined: 28 Mar 11 Posts: 35 Credit: 82,588 RAC: 0 |
We have been investigating the problem with the Hadam3p work units. It appears that the crash is caused by the combination of two perfectly normal forcing files. The SST and SI files were altered in the previous suspect run. If either of these files is replaced with its previous version, the model runs perfectly well. The crash only occurs when both files are specified as inputs to the same work unit. We are conducting tests on the Met Office UM in order to try to find out why this should be the case. In the meantime, the work units in the current release of Hadam3p are resubmission jobs of proven work units. These should be fully functional since we are extending the duration of previous experimental runs. Jonathan |
Send message Joined: 31 Dec 04 Posts: 5 Credit: 37,523 RAC: 0 |
Thanks Jonathan. A supplementary question: are there enough work units going out yet to lift the restriction? Having had these two latest fail, I can't get any more till tomorrow. It is only a question, and I can live with it if the answer is "No". Dave |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The answer will be no, as there are still at least twice as many computers attached as there are models. However, looking at your list of computers, (where it's a bit hard to tell just how many you have), I think that it's possible that they're too old, as per this thread at the top of Number crunching. All of the current work requires SSE2. Backups: Here |
Send message Joined: 31 Dec 04 Posts: 5 Credit: 37,523 RAC: 0 |
Thanks Les. All except the two computers with recent credit no longer exist, and those two are SSE2 compatible. I will spend some time deleting the old ones to clear up any confusion. Dave |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
I posted last from a different computer than this one. Looking at it, it is an old identity and the computers listed on it don't exist any more. Only two of those on this login are still relevant to the project, and these are SSE2 compatible. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Dave, a bit more on your post about not getting more models until 'tomorrow'. There are now 2 different delays: 1) If a computer crashes models serially without returning one that's successful, then the Maximum daily WU quota per CPU keeps being decreased until it reaches zero, at which point the computer has to wait until midnight project time (in the case of CPDN, this is UK time) to get another model. This is to stop serial crashers from wasting too many models. 2) A newly introduced delay: a 1 hour backoff between requests, to try to spread out the limited amount of work. Each time the computer contacts the server for ANY reason, this delay is reset back to 1 hour. Only a request made after the countdown has timed out will be considered for the allocation of a model, if any still remain at that point. This delay is not dependent on waiting until the following day. Backups: Here |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK. Sadly, the 2 models downloaded to mine from this latest batch have both experienced the 45 second crash, so it looks from this end of the telescope as though the problem is still extant. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Please provide a link to them, so that I can have a look at the batch numbers. Backups: Here |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
"Please provide a link to them, so that I can have a look at the batch numbers." Hi Les, hadam3p_eu_vbww_1993_1_007337686, hadam3p_eu_vbwh_1978_1_007337685, hadam3p_eu_vbwa_1971_1_007337684, hadam3p_eu_vbvn_1996_1_007337682. All tasks for computer 948812. Byron |
Send message Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967 |
The same thing has happened to me on the two computers I have running CPDN - one a Windows 7 machine, and one an iMac. I don't want to be rude, but was this batch of models tested before release? iMac tasks: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539655 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535832 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535713 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539464 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535460 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535459 Windows tasks http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7537097 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7536546 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539251 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7536772 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535216 Derrick Ashby |
Send message Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0 |
3 of 3 new downloads failed. First two failed earlier today, twice! hadam3p_eu_2k6q_2000_1_007340694_2 WU ID 7538124 Task 13080679 hadam3p_eu_2kgn_1985_1_007340817_2 WU ID 7538247 Task 13080666 hadam3p_eu_va95_2001_1_007337165_0 WU ID 7534595 Task 13055425 |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I'll leave a message for the project people for when they're out of bed and at work. For anyone else who wants to provide details, it's preferable to use links such as Byron's and Darmok's, as it saves having to go to each linked model to see the name. Backups: Here |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK. Model numbers were hadam3p_eu_2qx6_1990_1_007341826_1 and hadam3p_eu_2rbo_1990_1_007342014_1 |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hmmm. It looks like there's a 2 series, and a v series. Not long now until they're back at work. Backups: Here |
Send message Joined: 28 Nov 06 Posts: 89 Credit: 11,986,335 RAC: 2,269 |
It is sad: the project has given out a second batch of permanently corrupted work units. Typical error messages look like these:
05/07/2011 06:49:48 | climateprediction.net | Computation for task hadam3p_eu_v9mp_1962_1_007337397_2 finished
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_1.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_2.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_3.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_4.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_5.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_6.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_7.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_8.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_9.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_10.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_11.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_12.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_13.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK. Also model numbers: hadam3p_eu_v9yf_2000_1_007339072_1 and hadam3p_eu_vbwa_1970_1_007336518_2 |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It looks like all of them are faulty. Too much pressure to produce work units, especially late in the week. More later. In the meantime, everybody put away the axes, knives, halberds, and spears, and go home. Backups: Here |
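
For anyone trying to picture how the two delays Les describes earlier in the thread interact (the shrinking daily quota and the 1 hour backoff that restarts on every contact), here is a rough Python sketch. It is not the project's or BOINC's actual scheduler code. The class `HostState`, its field and method names, the starting quota of 3, the step of one per crash, and the rule that a zeroed quota recovers to 1 at midnight are all assumptions made up for illustration. Timestamps are assumed to already be in project (UK) time.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional

BACKOFF = timedelta(hours=1)  # the 1 hour backoff between requests


@dataclass
class HostState:
    """Hypothetical per-computer record; all names here are illustrative."""
    daily_quota: int                         # models allowed per day (starting value assumed)
    sent_today: int = 0                      # models handed out since midnight
    day: Optional[date] = None               # project-time day the counters belong to
    next_allowed: Optional[datetime] = None  # end of the current backoff countdown

    def _roll_day(self, now: datetime) -> None:
        # At midnight project time the daily count starts again. Letting a
        # zeroed quota recover to 1 is an assumption made for this sketch.
        if self.day != now.date():
            self.day = now.date()
            self.sent_today = 0
            self.daily_quota = max(self.daily_quota, 1)

    def _touch(self, now: datetime) -> bool:
        # Any contact restarts the countdown; return whether it had expired.
        expired = self.next_allowed is None or now >= self.next_allowed
        self.next_allowed = now + BACKOFF
        return expired

    def report_crash(self, now: datetime) -> None:
        # Delay 1: each crashed model lowers the daily quota, never below zero.
        self._roll_day(now)
        self._touch(now)
        self.daily_quota = max(self.daily_quota - 1, 0)

    def request_model(self, now: datetime, models_left: int) -> bool:
        # Delay 2: only a request made after the countdown ran out is considered.
        self._roll_day(now)
        if not self._touch(now):
            return False  # asked too soon, and the countdown has restarted
        if self.sent_today >= self.daily_quota or models_left <= 0:
            return False  # quota used up for today, or no work remains
        self.sent_today += 1
        return True


# Example: a host gets a model, crashes it, then asks again too early.
host = HostState(daily_quota=3)
start = datetime(2011, 7, 5, 9, 0)
got_first = host.request_model(start, models_left=10)
host.report_crash(start + timedelta(minutes=1))
got_second = host.request_model(start + timedelta(minutes=30), models_left=10)
```

In this sketch the early second request is refused and also pushes the countdown out by another hour, which is why repeatedly hitting "update" would only make the wait longer. The quota check is separate: a host that has used up its reduced quota stays blocked until the day rolls over, however long it waits between requests.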
©2024 cpdn.org