Thread 'hadam3p_eu crash 45 seconds in.'

Author	Message
Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42537 - Posted: 3 Jul 2011, 0:26:55 UTC - in response to Message 42536. It doesn't look like there'll be any EU models left by Monday morning, All gone. Possibly deprecated by one of the project people, as requested by a moderator. Backups: Here ID: 42537 · Reply Quote

old_user651284 Send message Joined: 28 Mar 11 Posts: 35 Credit: 82,588 RAC: 0	Message 42538 - Posted: 3 Jul 2011, 10:29:50 UTC - in response to Message 42529. Why never test these loser WUs before testing the volunteers bandwidth? A few hundred thousand times 100 or so MB -- what's that to a volunteer? These obviously never tested WU -- EU4 -- yeah, you can figure it out later, after wasting my time and bandwidth. Why never test before sending a gazillion to us? Huh? Don't you newb clowns test anything before you send a few bazillion WUs out? Forgive me, I've been volunteering my machine's time for a decade -- Did you try even one of these loser models at home before you sent a few hundred thou out to us? Don't think so.. It's obvious. Please -- don't abuse the volunteers. Do some minimal testing before you send a totally wasteful broken model times 300,000 to us crunchers. OK? Actually, I'm really annoyed by this last batch of broken s*** that I download, it breaks, ----- Do you do ANY testing before sending this stuff? No, obviously not. And yes, I, and a few others, are annoyed. If you dare, apologize. Eric
Dear Mr Redd, I certainly do dare to apologise for the inconvenience and annoyance that the current round of problems have caused you, and all other supporters of CPDN. The recent Hadam3p release was, as far as I am aware, produced in exactly the same manner as previous hadam3p releases (which have been running successfully). I apologise for the wasted bandwidth that we have caused. We have been made fully aware how much this has upset people, and we will, in future, take steps to minimise these problems. I will be posting an apology on the news items on this site on Monday. Finally, I would ask that, annoyed as you obviously are, please don't abuse the project staff. Jonathan Miller CPDN System Administrator ID: 42538 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 42546 - Posted: 4 Jul 2011, 12:08:33 UTC - in response to Message 42538. Two latest regional models crashed in 9 and 10 seconds respectively. Even faster than the 45ish seconds from the last few I have had! ID: 42546 · Reply Quote

old_user651284 Send message Joined: 28 Mar 11 Posts: 35 Credit: 82,588 RAC: 0	Message 42547 - Posted: 4 Jul 2011, 14:03:11 UTC Last modified: 4 Jul 2011, 14:03:48 UTC We have been investigating the problem with the Hadam3p work units. It appears that the crash is caused by the combination of two perfectly normal forcing files. The SST and SI files were altered in the previous suspect run. If either of these files is replaced with its previous version, the model runs perfectly well. The crash only occurs when both altered files are specified as inputs to the same work unit. We are conducting tests on the Met Office UM in order to try to find out why this should be the case. In the meantime, the current release of Hadam3p work units consists of resubmissions of proven work units. These should be fully functional since we are extending the duration of previous experimental runs. Jonathan ID: 42547 · Reply Quote

old_user34451 Send message Joined: 31 Dec 04 Posts: 5 Credit: 37,523 RAC: 0	Message 42549 - Posted: 4 Jul 2011, 17:40:29 UTC - in response to Message 42547. Thanks Jonathan. A supplementary question: are there enough work units going out yet to lose the restriction? Having had these two latest fail, I can't get any more till tomorrow. It is a question, and I can live with it if the answer is "No". Dave ID: 42549 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42556 - Posted: 4 Jul 2011, 19:49:52 UTC - in response to Message 42549. The answer will be no, as there are still at least twice as many computers attached as there are models. However, looking at your list of computers, (where it's a bit hard to tell just how many you have), I think that it's possible that they're too old, as per this thread at the top of Number crunching. All of the current work requires SSE2. Backups: Here ID: 42556 · Reply Quote

old_user34451 Send message Joined: 31 Dec 04 Posts: 5 Credit: 37,523 RAC: 0	Message 42557 - Posted: 4 Jul 2011, 20:05:39 UTC - in response to Message 42556. Thanks Les. All of the computers except the two with recent credit don't exist any more, and those two are sse2 compatible. I will spend some time deleting the old ones to clear up any confusion. Dave ID: 42557 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 42559 - Posted: 4 Jul 2011, 21:25:32 UTC - in response to Message 42557. I posted last from a different computer than this one. Looking at it, it is an old identity and the computers listed on it don't exist any more. Only two of those on this login are still relevant to the project and these are sse2 compatible. ID: 42559 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42560 - Posted: 4 Jul 2011, 22:01:03 UTC Dave, a bit more on your post about not getting more models until 'tomorrow'. There are now 2 different delays:
1) If a computer crashes models serially without returning one that's successful, then the Maximum daily WU quota per CPU will keep being decreased until it reaches zero, at which point the computer has to wait until midnight project time (in the case of cpdn, this is UK time) to get another model. This is to slow down serial crashers from wasting too many models.
2) A newly introduced delay, of a 1 hour backoff between requests, to try and spread out the limited amount of work. Each time the computer contacts the server for ANY reason, this delay is reset back to 1 hour. Only a request made after the countdown has timed out will be considered for the allocation of a model, if any still remain at that point. This delay is not dependent on waiting until the following day.
Backups: Here ID: 42560 · Reply Quote
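
The two delays described above can be sketched roughly as follows. This is only an illustration of the rules as stated in the post, not the project's server code: the `Host` record, the function names, the starting quota of 2, and the idea that a zeroed quota allows one model again after midnight are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

BACKOFF_SECONDS = 3600  # the 1 hour backoff between requests described above


@dataclass
class Host:
    """Hypothetical per-computer record; not the project's real schema."""
    daily_quota: int = 2                  # assumed starting value
    last_contact: Optional[float] = None  # time of the previous server contact, in seconds


def record_crash(host: Host) -> None:
    """Delay 1: each crashed model lowers the daily quota, down to zero."""
    host.daily_quota = max(0, host.daily_quota - 1)


def new_project_day(host: Host) -> None:
    """Assumption: at midnight project time a zeroed quota allows one model again."""
    host.daily_quota = max(1, host.daily_quota)


def request_work(host: Host, now: float, models_left: int) -> bool:
    """Return True if this contact would be allocated a model."""
    waited = host.last_contact is None or now - host.last_contact >= BACKOFF_SECONDS
    host.last_contact = now  # delay 2: ANY contact restarts the countdown
    return waited and host.daily_quota > 0 and models_left > 0
```

Under these assumptions, a computer that contacts the server every 30 minutes would never be allocated a model, because each contact restarts the one hour countdown before it has run out.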

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 42561 - Posted: 4 Jul 2011, 22:27:06 UTC A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK. Sadly, the 2 models downloaded to mine from this latest batch have both experienced the 45 second crash, so it looks from this end of the telescope as though the problem is still extant. ID: 42561 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42562 - Posted: 4 Jul 2011, 22:48:44 UTC - in response to Message 42561. Please provide a link to them, so that I can have a look at the batch numbers. Backups: Here ID: 42562 · Reply Quote

Byron Leigh Hatch @ team Carl ... Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 42563 - Posted: 4 Jul 2011, 23:58:43 UTC - in response to Message 42562. Please provide a link to them, so that I can have a look at the batch numbers. Hi Les hadam3p_eu_vbww_1993_1_007337686 hadam3p_eu_vbwh_1978_1_007337685 hadam3p_eu_vbwa_1971_1_007337684 hadam3p_eu_vbvn_1996_1_007337682 All tasks for computer 948812 Byron ID: 42563 · Reply Quote

dajashby Send message Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967	Message 42564 - Posted: 5 Jul 2011, 0:14:22 UTC Last modified: 5 Jul 2011, 0:21:57 UTC The same thing has happened to me on the two computers I have running CPDN - one a Windows 7 machine, and one an iMac. I don't want to be rude, but was this batch of models tested before release? iMac tasks: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539655 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535832 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535713 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539464 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535460 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535459 Windows tasks http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7537097 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7536546 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539251 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7536772 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535216 Derrick Ashby ID: 42564 · Reply Quote

Darmok Send message Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0	Message 42565 - Posted: 5 Jul 2011, 0:26:40 UTC - in response to Message 42562. 3 of 3 new downloads failed. First two failed earlier today, twice! hadam3p_eu_2k6q_2000_1_007340694_2 WU ID 7538124 Task 13080679 hadam3p_eu_2kgn_1985_1_007340817_2 WU ID 7538247 Task 13080666 hadam3p_eu_va95_2001_1_007337165_0 WU ID 7534595 Task 13055425 ID: 42565 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42566 - Posted: 5 Jul 2011, 2:29:12 UTC I've left a message for the project people for when they're out of bed and at work. For anyone else who wants to provide details, it's preferable to use links such as Byron's and Darmok's, as it saves having to go to each linked model to see the name. Backups: Here ID: 42566 · Reply Quote

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 42569 - Posted: 5 Jul 2011, 6:49:29 UTC - in response to Message 42561. A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK. Sadly, the 2 models downloaded to mine from this latest batch have both experienced the 45 second crash, so it looks from this end of the telescope as though the problem is still extant. Model numbers were hadam3p_eu_2qx6_1990_1_007341826_1 and hadam3p_eu_2rbo_1990_1_007342014_1 ID: 42569 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42571 - Posted: 5 Jul 2011, 7:19:30 UTC Hmmm. It looks like there's a 2 series, and a v series. Not long now until they're back at work. Backups: Here ID: 42571 · Reply Quote
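
The split into a 2 series and a v series can be checked against the task names posted so far in this thread. The sketch below assumes the names follow the layout `hadam3p_eu_<4-character ID>_<year>_...`; that layout is inferred from the names in the posts above, not taken from project documentation, and `group_by_series` is a made-up helper name.

```python
from collections import defaultdict

# Task names reported earlier in this thread.
TASK_NAMES = [
    "hadam3p_eu_vbww_1993_1_007337686",
    "hadam3p_eu_vbwh_1978_1_007337685",
    "hadam3p_eu_vbwa_1971_1_007337684",
    "hadam3p_eu_vbvn_1996_1_007337682",
    "hadam3p_eu_2k6q_2000_1_007340694_2",
    "hadam3p_eu_2kgn_1985_1_007340817_2",
    "hadam3p_eu_va95_2001_1_007337165_0",
    "hadam3p_eu_2qx6_1990_1_007341826_1",
    "hadam3p_eu_2rbo_1990_1_007342014_1",
]


def group_by_series(names):
    """Group task names by the first character of their 4-character ID."""
    groups = defaultdict(list)
    for name in names:
        run_id = name.split("_")[2]  # assumed position of the ID field
        groups[run_id[0]].append(name)
    return dict(groups)


for series, members in sorted(group_by_series(TASK_NAMES).items()):
    print(series, len(members))
```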

metalius Send message Joined: 28 Nov 06 Posts: 89 Credit: 12,007,915 RAC: 3,381	Message 42573 - Posted: 5 Jul 2011, 7:41:56 UTC - in response to Message 42571. Last modified: 5 Jul 2011, 7:45:03 UTC It is sad - the project gives a second batch of permanently corrupted work units. Typical error messages are like these:
05/07/2011 06:49:12 | climateprediction.net | Starting task hadam3p_eu_v9mp_1962_1_007337397_2 using hadam3p_eu version 609
05/07/2011 06:49:48 | climateprediction.net | Computation for task hadam3p_eu_v9mp_1962_1_007337397_2 finished
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_1.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_2.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_3.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_4.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_5.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_6.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_7.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_8.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_9.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_10.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_11.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_12.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_13.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
ID: 42573 · Reply Quote
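
For anyone wanting to check their own message log for the same pattern, here is a minimal sketch that works out how long a task ran and how many output files were reported absent. It assumes the three-field layout seen in the messages above (timestamp, project, message) and a day/month/year timestamp; `summarise` is a made-up helper, not part of BOINC, and the sample is cut down to two of the "absent" lines.

```python
from datetime import datetime

SAMPLE_LOG = """\
05/07/2011 06:49:12 | climateprediction.net | Starting task hadam3p_eu_v9mp_1962_1_007337397_2 using hadam3p_eu version 609
05/07/2011 06:49:48 | climateprediction.net | Computation for task hadam3p_eu_v9mp_1962_1_007337397_2 finished
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_1.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_2.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
"""


def summarise(log_text):
    """Map each task name to its run time in seconds and its count of absent output files."""
    started = {}
    summary = {}
    for raw in log_text.splitlines():
        # Tolerate pipes that were pasted with a backslash in front of them.
        fields = [f.strip() for f in raw.replace("\\|", "|").split("|")]
        if len(fields) != 3:
            continue
        when = datetime.strptime(fields[0], "%d/%m/%Y %H:%M:%S")  # assumed day-first
        message = fields[2]
        words = message.split()
        if message.startswith("Starting task "):
            started[words[2]] = when
            summary.setdefault(words[2], {"seconds": None, "absent": 0})
        elif message.startswith("Computation for task ") and words[-1] == "finished":
            task = words[3]
            if task in started:
                entry = summary.setdefault(task, {"seconds": None, "absent": 0})
                entry["seconds"] = (when - started[task]).total_seconds()
        elif message.startswith("Output file ") and words[-1] == "absent":
            entry = summary.setdefault(words[-2], {"seconds": None, "absent": 0})
            entry["absent"] += 1
    return summary


print(summarise(SAMPLE_LOG))
```

On the sample above the task runs for 36 seconds between "Starting task" and "Computation ... finished", which is in line with the roughly 45 second crashes reported in this thread.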

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 42574 - Posted: 5 Jul 2011, 8:23:36 UTC - in response to Message 42569. A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK. Sadly, the 2 models downloaded to mine from this latest batch have both experienced the 45 second crash, so it looks from this end of the telescope as though the problem is still extant. Model numbers were hadam3p_eu_2qx6_1990_1_007341826_1 and hadam3p_eu_2rbo_1990_1_007342014_1 Also model numbers: hadam3p_eu_v9yf_2000_1_007339072_1 and hadam3p_eu_vbwa_1970_1_007336518_2 ID: 42574 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42577 - Posted: 5 Jul 2011, 11:38:23 UTC It looks like all of them are faulty. Too much pressure to produce work units, especially late in the week. More later. In the meantime, everybody put away the axes, knives, halberds, and spears, and go home. Backups: Here ID: 42577 · Reply Quote