climateprediction.net (CPDN) home page
Thread 'hadam3p_eu crash 45 seconds in.'

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42537 - Posted: 3 Jul 2011, 0:26:55 UTC - in response to Message 42536.  

It doesn't look like there'll be any EU models left by Monday morning,

All gone.
Possibly deprecated by one of the project people, as requested by a moderator.


Backups: Here
ID: 42537
old_user651284
Joined: 28 Mar 11
Posts: 35
Credit: 82,588
RAC: 0
Message 42538 - Posted: 3 Jul 2011, 10:29:50 UTC - in response to Message 42529.  

Why never test these loser WUs before testing the volunteers bandwidth?
A few hundred thousand times 100 or so MB -- what's that to a volunteer?
These obviously never tested WU -- EU4 -- yeah, you can figure it out later, after wasting my time and bandwidth.
Why never test before sending a gazillion to us?
Huh?

Don't you newb clowns test anything before you send a few bazillion WUs out?

Forgive me, I've been volunteering my machine's time for a decade --
Did you try even one of these loser models at home before you sent a few hundred thou out to us? Don't think so.. It's obvious.

Please -- don't abuse the volunteers.

Do some minimal testing before you send a totally wasteful broken model times 300,000 to us crunchers. OK?

Actually, I'm really annoyed by this last batch of broken s*** that I download, it breaks, -----

Do you do ANY testing before sending this stuff?

No, obviously not.

And yes, I, and a few others, are annoyed.

If you dare, apologize.

Eric


Dear Mr Redd,

I certainly do dare to apologise for the inconvenience and annoyance that the current round of problems has caused you, and all other supporters of CPDN.

The recent Hadam3p release was, as far as I am aware, produced in exactly the same manner as previous hadam3p releases (which have been running successfully).

I apologise for the wasted bandwidth that we have caused. We have been made fully aware how much this has upset people, and we will, in future, take steps to minimise these problems.

I will be posting an apology on the news items on this site on Monday.

Finally, I would ask, annoyed as you obviously are, that you please don't abuse the project staff.

Jonathan Miller
CPDN System Administrator
ID: 42538
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 42546 - Posted: 4 Jul 2011, 12:08:33 UTC - in response to Message 42538.  

Two latest regional models crashed in 9 and 10 seconds respectively. Even faster than the 45ish seconds from the last few I have had!
ID: 42546
old_user651284
Joined: 28 Mar 11
Posts: 35
Credit: 82,588
RAC: 0
Message 42547 - Posted: 4 Jul 2011, 14:03:11 UTC
Last modified: 4 Jul 2011, 14:03:48 UTC

We have been investigating the problem with the Hadam3p work units. It appears that the crash is caused by the combination of two perfectly normal forcing files.

The SST and SI files were altered in the previous suspect run. If either of these files is replaced with its previous version, the model runs perfectly well. The crash only occurs when both altered files are specified as inputs to the same work unit.

We are conducting tests on the Met Office UM in order to try to find out why this should be the case.
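
The isolation result described above amounts to a 2x2 test matrix over the two forcing files. A minimal sketch of that matrix follows. It is not the project's test harness: `run_model` is a hypothetical stand-in that simply encodes the reported outcome (a crash only when both altered files are used together), and the labels "previous" and "altered" are illustrative.

```python
from itertools import product

VERSIONS = ("previous", "altered")


def run_model(sst: str, si: str) -> bool:
    """Hypothetical stand-in for a real test run; True means the run survives.

    It encodes only what the post reports: the crash needs BOTH altered
    files in the same work unit.
    """
    return not (sst == "altered" and si == "altered")


def interaction_table() -> dict:
    """Try every SST/SI combination and record whether the model runs."""
    return {(sst, si): run_model(sst, si) for sst, si in product(VERSIONS, VERSIONS)}


if __name__ == "__main__":
    for (sst, si), ok in interaction_table().items():
        print(f"SST={sst:8} SI={si:8} -> {'runs' if ok else 'crashes'}")
```

Testing all four combinations, rather than swapping one file at a time, is what shows the fault to be an interaction between the files and not a defect in either file alone.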

In the meantime, the current release of Hadam3p work units consists of resubmissions of proven work units. These should be fully functional since we are extending the duration of previous experimental runs.

Jonathan
ID: 42547
old_user34451
Joined: 31 Dec 04
Posts: 5
Credit: 37,523
RAC: 0
Message 42549 - Posted: 4 Jul 2011, 17:40:29 UTC - in response to Message 42547.  

Thanks Jonathan. A supplementary question: are there enough work units going out yet to lift the restriction? Having had these two latest fail, I can't get any more till tomorrow. It is a question, and I can live with it if the answer is "No".

Dave
ID: 42549
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42556 - Posted: 4 Jul 2011, 19:49:52 UTC - in response to Message 42549.  

The answer will be no, as there are still at least twice as many computers attached as there are models.

However, looking at your list of computers, (where it's a bit hard to tell just how many you have), I think that it's possible that they're too old, as per this thread at the top of Number crunching.

All of the current work requires SSE2.


Backups: Here
ID: 42556
old_user34451
Joined: 31 Dec 04
Posts: 5
Credit: 37,523
RAC: 0
Message 42557 - Posted: 4 Jul 2011, 20:05:39 UTC - in response to Message 42556.  

Thanks Les. All except the two computers with recent credit don't exist any more, and those two are SSE2 compatible. I will spend some time deleting the old ones to clear up any confusion.

Dave
ID: 42557
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 42559 - Posted: 4 Jul 2011, 21:25:32 UTC - in response to Message 42557.  

I last posted from a different computer than this one. Looking at it, it is an old identity and the computers listed on it don't exist any more. Only two of those on this login are still relevant to the project, and these are SSE2 compatible.
ID: 42559
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42560 - Posted: 4 Jul 2011, 22:01:03 UTC

Dave

A bit more on your post about not getting more models until 'tomorrow'.

There are now 2 different delays:
1) If a computer crashes models serially without returning one that's successful, then the Maximum daily WU quota per CPU will keep being decreased until it reaches zero, at which point the computer has to wait until midnight project time, (in the case of cpdn, this is UK time), to get another model.
This is to slow down serial crashers from wasting too many models.

2) A newly introduced delay, of a 1 hour backoff between requests, to try and spread out the limited amount of work.
Each time the computer contacts the server for ANY reason, this delay is reset back to 1 hour. Only a request made after the countdown has timed out will be considered for the allocation of a model, IF any still remain at that point.
This delay is not dependent on waiting until the following day.
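
The two delays above can be sketched as a small simulation. This is not the BOINC scheduler's code, only a model of the behaviour as described in this post. The names `HostState`, `report_crash`, `contact_server` and `midnight_reset` are hypothetical, and so are the starting quota, the step of one per crash, and the assumption that midnight restores a quota of at least one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

BACKOFF = timedelta(hours=1)  # the 1 hour backoff between requests


@dataclass
class HostState:
    daily_quota: int  # assumed starting value; the post gives no number
    sent_today: int = 0
    last_contact: Optional[datetime] = None


def report_crash(host: HostState) -> None:
    """Delay 1: each crash lowers the daily quota (assumed step of 1), down to zero."""
    host.daily_quota = max(0, host.daily_quota - 1)


def contact_server(host: HostState, now: datetime, wants_work: bool,
                   models_available: int) -> bool:
    """Return True if a model is allocated. ANY contact restarts the countdown."""
    waited = host.last_contact is None or now - host.last_contact >= BACKOFF
    host.last_contact = now  # delay 2: the backoff is reset on every contact
    if not (wants_work and waited):
        return False
    if host.sent_today >= host.daily_quota or models_available <= 0:
        return False
    host.sent_today += 1
    return True


def midnight_reset(host: HostState) -> None:
    """At midnight project time the host may try again (assumed: quota of at least 1)."""
    host.sent_today = 0
    host.daily_quota = max(host.daily_quota, 1)
```

In this model a client that polls more often than once an hour never gets work, because each early contact restarts the countdown.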


Backups: Here
ID: 42560
Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 42561 - Posted: 4 Jul 2011, 22:27:06 UTC

A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK.


Sadly, the 2 models downloaded to mine from this latest batch have both experienced the 45 second crash, so it looks from this end of the telescope as though the problem is still extant.
ID: 42561
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42562 - Posted: 4 Jul 2011, 22:48:44 UTC - in response to Message 42561.  

Please provide a link to them, so that I can have a look at the batch numbers.
Backups: Here
ID: 42562
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 42563 - Posted: 4 Jul 2011, 23:58:43 UTC - in response to Message 42562.  

ID: 42563
dajashby
Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 42564 - Posted: 5 Jul 2011, 0:14:22 UTC
Last modified: 5 Jul 2011, 0:21:57 UTC

The same thing has happened to me on the two computers I have running CPDN - one a Windows 7 machine, and one an iMac. I don't want to be rude, but was this batch of models tested before release?

iMac tasks:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539655
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535832
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535713
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539464
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535460
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535459

Windows tasks

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7537097
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7536546
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7539251
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7536772
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7535216
Derrick Ashby
ID: 42564
Darmok
Joined: 29 Dec 09
Posts: 34
Credit: 18,395,130
RAC: 0
Message 42565 - Posted: 5 Jul 2011, 0:26:40 UTC - in response to Message 42562.  

3 of 3 new downloads failed.
First two failed earlier today, twice!

hadam3p_eu_2k6q_2000_1_007340694_2 WU ID 7538124 Task 13080679
hadam3p_eu_2kgn_1985_1_007340817_2 WU ID 7538247 Task 13080666
hadam3p_eu_va95_2001_1_007337165_0 WU ID 7534595 Task 13055425
ID: 42565
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42566 - Posted: 5 Jul 2011, 2:29:12 UTC

I've left a message for the project people for when they're out of bed and at work.

For anyone else who wants to provide details, it's preferable to use links like Byron's and Darmok's, as it saves having to go to each linked model to see the name.


Backups: Here
ID: 42566
Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 42569 - Posted: 5 Jul 2011, 6:49:29 UTC - in response to Message 42561.  

A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK.


Sadly, the 2 models downloaded to mine from this latest batch have both experienced the 45 second crash, so it looks from this end of the telescope as though the problem is still extant.


Model numbers were hadam3p_eu_2qx6_1990_1_007341826_1 and hadam3p_eu_2rbo_1990_1_007342014_1
ID: 42569
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42571 - Posted: 5 Jul 2011, 7:19:30 UTC

Hmmm. It looks like there's a 2 series, and a v series.
Not long now until they're back at work.
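
The "2 series" and "v series" split can be read straight off the task names posted so far. A minimal sketch, assuming the series is the first character of the third underscore-separated field; that field layout is inferred from the names in this thread, not from project documentation, and `series_of` and `group_by_series` are illustrative names.

```python
from collections import defaultdict

# Task names reported earlier in this thread.
TASKS = [
    "hadam3p_eu_2k6q_2000_1_007340694_2",
    "hadam3p_eu_2kgn_1985_1_007340817_2",
    "hadam3p_eu_va95_2001_1_007337165_0",
    "hadam3p_eu_2qx6_1990_1_007341826_1",
    "hadam3p_eu_2rbo_1990_1_007342014_1",
]


def series_of(task_name: str) -> str:
    """Return the first character of the 4-character run ID (assumed field 3)."""
    return task_name.split("_")[2][0]


def group_by_series(task_names):
    """Group task names by their series character."""
    groups = defaultdict(list)
    for name in task_names:
        groups[series_of(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for series, names in sorted(group_by_series(TASKS).items()):
        print(series, len(names))
```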


Backups: Here
ID: 42571
metalius
Joined: 28 Nov 06
Posts: 89
Credit: 11,985,507
RAC: 2,216
Message 42573 - Posted: 5 Jul 2011, 7:41:56 UTC - in response to Message 42571.  
Last modified: 5 Jul 2011, 7:45:03 UTC

It is sad: the project has issued a second batch of permanently corrupted work units. Typical error messages look like these:
    05/07/2011 06:49:12 | climateprediction.net | Starting task hadam3p_eu_v9mp_1962_1_007337397_2 using hadam3p_eu version 609
    05/07/2011 06:49:48 | climateprediction.net | Computation for task hadam3p_eu_v9mp_1962_1_007337397_2 finished
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_1.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_2.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_3.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_4.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_5.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_6.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_7.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_8.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_9.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_10.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_11.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_12.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
    05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_13.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
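
A log like the one above can be summarised mechanically: how long each task ran, and how many output files were reported absent. The sketch below works on a three-line excerpt and assumes the `dd/mm/yyyy` timestamp layout and the message wording seen in this log; other BOINC client versions may word their messages differently, and `summarise` is an illustrative name.

```python
import re
from datetime import datetime

# Excerpt of the log above (first three lines only).
LOG = """\
05/07/2011 06:49:12 | climateprediction.net | Starting task hadam3p_eu_v9mp_1962_1_007337397_2 using hadam3p_eu version 609
05/07/2011 06:49:48 | climateprediction.net | Computation for task hadam3p_eu_v9mp_1962_1_007337397_2 finished
05/07/2011 06:49:48 | climateprediction.net | Output file hadam3p_eu_v9mp_1962_1_007337397_2_1.zip for task hadam3p_eu_v9mp_1962_1_007337397_2 absent
"""

LINE = re.compile(r"^\s*(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| [^|]+ \| (.*)$")
START = re.compile(r"Starting task (\S+)")
FINISH = re.compile(r"Computation for task (\S+) finished")
ABSENT = re.compile(r"Output file \S+ for task (\S+) absent")


def summarise(log_text: str) -> dict:
    """Map each task to its run time in seconds and its count of absent outputs."""
    started, summary = {}, {}
    for raw in log_text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%d/%m/%Y %H:%M:%S")
        msg = m.group(2)
        if (s := START.search(msg)):
            started[s.group(1)] = when
            summary[s.group(1)] = {"seconds": None, "absent": 0}
        elif (f := FINISH.search(msg)) and f.group(1) in started:
            summary[f.group(1)]["seconds"] = (when - started[f.group(1)]).total_seconds()
        elif (a := ABSENT.search(msg)) and a.group(1) in summary:
            summary[a.group(1)]["absent"] += 1
    return summary


if __name__ == "__main__":
    print(summarise(LOG))
```

For the task shown, the start and finish stamps are 36 seconds apart, in line with the crashes of roughly 45 seconds discussed in this thread.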


ID: 42573
Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 42574 - Posted: 5 Jul 2011, 8:23:36 UTC - in response to Message 42569.  

A new batch of regional models were created over the weekend, all of them regens from previously completed work, so they should be OK.


Sadly, the 2 models downloaded to mine from this latest batch have both experienced the 45 second crash, so it looks from this end of the telescope as though the problem is still extant.


Model numbers were hadam3p_eu_2qx6_1990_1_007341826_1 and hadam3p_eu_2rbo_1990_1_007342014_1


Also model numbers: hadam3p_eu_v9yf_2000_1_007339072_1 and hadam3p_eu_vbwa_1970_1_007336518_2
ID: 42574
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42577 - Posted: 5 Jul 2011, 11:38:23 UTC



It looks like all of them are faulty.
Too much pressure to produce work units, especially late in the week.

More later.

In the meantime, everybody put away the axes, knives, halberds, and spears, and go home.


Backups: Here
ID: 42577

©2024 cpdn.org