Thread 'Possible problem with new EU work units'

Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 49546 - Posted: 12 Jul 2014, 20:11:34 UTC

I'm currently crunching a couple of the latest batch of hadam3p-eu work units. Usually, these units send a dozen updates as zip files to the server at evenly spaced intervals (at roughly 8.3%, 16.7%, 25% complete, and so on).

Work unit 8958921 has failed to create zip number 11, now at 92.8% complete.
Work unit 9002138 has failed to create zip number 5, now at 47.7% complete.

Is this a problem?
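
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the uploads really are evenly spaced and that there are 12 of them ("a dozen", as described above). The function name `expected_zips` and the `total_zips` parameter are made up for illustration and are not part of BOINC or the CPDN software.

```python
import math


def expected_zips(percent_complete: float, total_zips: int = 12) -> int:
    """Return how many zips should already exist at a given completion level.

    Assumption (not confirmed by the project): zips are created at evenly
    spaced intervals, i.e. zip n appears at n / total_zips of the run.
    """
    if not 0.0 <= percent_complete <= 100.0:
        raise ValueError("percent_complete must be between 0 and 100")
    # Multiply before dividing so exact boundaries such as 25% stay exact.
    return min(total_zips, math.floor(percent_complete * total_zips / 100.0))


# The two progress figures reported in this post.
for pct in (92.8, 47.7):
    print(f"{pct}% complete: zips 1..{expected_zips(pct)} expected")
```

On those assumptions, a unit at 92.8% should already have produced zip 11, and one at 47.7% should already have produced zip 5, which is why the missing files looked suspicious.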
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49547 - Posted: 12 Jul 2014, 20:20:07 UTC - in response to Message 49546.  

Yes. The data is needed.
I don't know what the problem is though.

Could you post the 4-character code of the tasks in question, please? Or even just copy and paste the entire string of the full label from the top of its server page.

As an aside, I have a feeling that one of my zips went missing recently, but it'll be a while yet before I can upload the rest, and see if there's an error message.


Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 49548 - Posted: 12 Jul 2014, 20:27:55 UTC - in response to Message 49547.  

By 4-character code, I assume you mean the alphanumeric code after the hadam3p_eu in the work unit name - hadam3p_eu_*xxxx*_2013_1_nnnnnnnnn_0.

These are ne8f and g6dk.
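
In case it helps anyone pull the code out of a long name, here is a small Python sketch. The pattern is only inferred from the names quoted in this thread (four lowercase letters or digits after `hadam3p_eu_`); it is not an official naming specification, and `wu_code` is a hypothetical helper name.

```python
import re
from typing import Optional

# Pattern inferred from names seen in this thread, not from any project spec.
WU_CODE = re.compile(r"hadam3p_eu_([a-z0-9]{4})_")


def wu_code(name: str) -> Optional[str]:
    """Return the 4-character code from a hadam3p_eu name, or None if absent."""
    match = WU_CODE.search(name)
    return match.group(1) if match else None


print(wu_code("hadam3p_eu_ne8f_2013_1_008812943_0"))
```

The same helper also works on the zip file names, since they start with the task name.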
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49549 - Posted: 12 Jul 2014, 20:31:25 UTC - in response to Message 49548.  

That's them. Thanks.

Might have to wait a couple of days, it being the weekend. Again!
But I'll send an email.


Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 49550 - Posted: 12 Jul 2014, 20:41:58 UTC - in response to Message 49549.  

Welcome.

I have also noticed they have been crunching unusually fast. I've completed one of these already. Usually hadam3p_eu units take my system about 125 hours of crunching. That first one of the new batch (hadam3p_eu_i3gp_2013_1_008768925) took under 100 hours, and these two are on schedule to take about the same. It occurred to me the two phenomena might be connected.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49551 - Posted: 12 Jul 2014, 20:54:33 UTC - in response to Message 49550.  

Yes, good point about the speed. It could be that they're not doing all of the calcs or something. This might show up in the zip size.
I'll have to log more data from some of these for comparison. At present I only log the start and end times, the number of hours, and the size of zip 13.


Nigel Garvey

Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 49552 - Posted: 13 Jul 2014, 7:10:44 UTC

The one I've just finished returned 13 trickles! I thought the zip file numbers were out with the percentage completion in the later stages too, but the correct number (13) appears to have been uploaded, with no gaps or additions (according to my BOINC "stdoutdae.txt" file).
NG
Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 49553 - Posted: 13 Jul 2014, 9:29:49 UTC

I left the computer on overnight to see if this was a transient problem or a misconfiguration here.

WU ne8f apparently completed normally, and uploaded zips 12 and 13.
WU g6dk uploaded zips 6 and 7 as expected.

Oh, this is interesting. The whole thing may be a minor bug. I was going on the output from the BOINC manager event log, but I've had a look at the "stdoutdae.txt" file, which I wasn't aware of until Nigel Garvey mentioned it. The output from the two corresponds in all particulars, as far as I can see, except for two. The event log fails to mention two incidents that are in the "stdoutdae.txt" file:

12-Jul-2014 13:46:05 [climateprediction.net] Started upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip
12-Jul-2014 13:52:26 [climateprediction.net] Finished upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip

12-Jul-2014 19:29:29 [climateprediction.net] Started upload of hadam3p_eu_ne8f_2013_1_008812943_0_11.zip
12-Jul-2014 19:35:38 [climateprediction.net] Finished upload of hadam3p_eu_ne8f_2013_1_008812943_0_11.zip

The problem may lie with a minor error in my event log, and otherwise be a false alarm, in which case all I probably need to do is update BOINC.

I suppose the next question is, are those files on your server or not?
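
For anyone wanting to run the same check on their own log, here is a rough Python sketch. It assumes the upload lines in "stdoutdae.txt" look exactly like the four quoted above, and that zips are numbered from 1 upwards. The names `finished_zips` and `missing_zips` are hypothetical helpers, not anything provided by BOINC.

```python
import re
from collections import defaultdict

# Line format assumed from the four log lines quoted in this post.
UPLOAD_LINE = re.compile(
    r"\[climateprediction\.net\] (?P<event>Started|Finished) upload of "
    r"(?P<task>\S+)_(?P<zip>\d+)\.zip"
)


def finished_zips(log_lines):
    """Map each task name to the set of zip numbers whose upload finished."""
    done = defaultdict(set)
    for line in log_lines:
        match = UPLOAD_LINE.search(line)
        if match and match.group("event") == "Finished":
            done[match.group("task")].add(int(match.group("zip")))
    return dict(done)


def missing_zips(numbers):
    """List gaps between 1 and the highest zip number seen (assumes 1-based)."""
    if not numbers:
        return []
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]


sample = [
    "12-Jul-2014 13:46:05 [climateprediction.net] Started upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip",
    "12-Jul-2014 13:52:26 [climateprediction.net] Finished upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip",
]
for task, numbers in finished_zips(sample).items():
    print(task, sorted(numbers), "gaps:", missing_zips(numbers))
```

Fed the whole file (for example via `open(path).readlines()`, with the path depending on where the BOINC data directory lives on your system), this lists any zip number that never reached "Finished upload". It says nothing about whether the server actually holds the file.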
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 49554 - Posted: 13 Jul 2014, 11:18:51 UTC - in response to Message 49552.  

Nigel Garvey wrote:
The one I've just finished returned 13 trickles! I thought the zip file numbers were out with the percentage completion in the later stages too, but the correct number (13) appears to have been uploaded, with no gaps or additions (according to my BOINC "stdoutdae.txt" file).

That is odd. The alien trickle is this one:

08 Jul 2014 10:08:52 1073433 16706651 hadam3p_eu_j9dj_2013_1_008791627_0 1 69,222 225,267 3.2543

I've had a few "Model crashed: INITTIME: Atmosphere basis time mismatch" from the previous EU batch, which is a configuration error.

Both now reported to project team ...
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 49555 - Posted: 13 Jul 2014, 11:36:36 UTC - in response to Message 49553.  
Last modified: 13 Jul 2014, 11:37:55 UTC

Niall wrote:
... The whole thing may be a minor bug. ...

Thanks for reporting this. Twenty-five years or so of working as a software engineer have convinced me that bugs can only confidently be described as minor after they have been investigated and understood. For diagnostic purposes a minor error in output might be caused by a major internal error. I despair of the number of computer programmers who dismiss minor inconsistencies which subsequently prove fatal to the purpose of their application.
Nigel Garvey

Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 49557 - Posted: 13 Jul 2014, 20:40:34 UTC - in response to Message 49554.  

Iain Inglis wrote:
Nigel Garvey wrote:
The one I've just finished returned 13 trickles! I thought the zip file numbers were out with the percentage completion in the later stages too, but the correct number (13) appears to have been uploaded, with no gaps or additions (according to my BOINC "stdoutdae.txt" file).

That is odd. The alien trickle is this one:

08 Jul 2014 10:08:52 1073433 16706651 hadam3p_eu_j9dj_2013_1_008791627_0 1 69,222 225,267 3.2543


Hmm. It's timed shortly after trickles 5 and 6, which, because my BOINC client's set not to use the network overnight, were both sent at the same time one morning after I'd manually enabled network access earlier than the set time. The corresponding zip files, however, kept failing to upload. The reason given was "transient HTTP error", but I knew from previous experience that this probably meant my Mac was overdue for a reboot. So, at an opportune moment, I carefully suspended all BOINC computation, quit BOINC Manager, and shut down the machine. Then I restarted it and resumed BOINC computation once everything was up and running again. The two zip files immediately uploaded with no problem at all, and the additional trickle was apparently sent an hour after the uploads completed, being recorded on the task page five minutes after that.

This is probably all connected, but I can't think why a spurious trickle should have been sent and with an incorrect timestep.
NG
Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 49589 - Posted: 18 Jul 2014, 8:48:56 UTC
Last modified: 18 Jul 2014, 9:00:11 UTC

Sorry. Delete. My mistake this time.
