Possible problem with new EU work units
Author | Message |
---|---|
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
I'm currently crunching a couple of the latest batch of hadam3p-eu work units. Usually, these units send a dozen updates as zip files to the server at evenly spaced intervals (at roughly 8.3%, 16.7%, 25% complete, and so on). Work unit 8958921 has failed to create zip number 11, now at 92.8% complete. Work unit 9002138 has failed to create zip number 5, now at 47.7% complete. Is this a problem? |
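As a rough illustration of the arithmetic in the post above, here is a minimal Python sketch of where the zips would be expected. It assumes exactly 12 evenly spaced zips, as the post describes (later posts in this thread mention 13, so the count is left as a parameter). The function names are hypothetical and are not part of BOINC or the climateprediction.net software.

```python
def zip_thresholds(total_zips=12):
    """Completion percentages at which each zip would be expected.

    Assumes evenly spaced zips, as described in the post; the real
    schedule used by the model may differ.
    """
    return [round(100 * k / total_zips, 1) for k in range(1, total_zips + 1)]


def expected_zips(percent_complete, total_zips=12):
    """Number of zips that should already exist at a given completion level."""
    return int(percent_complete * total_zips // 100)


if __name__ == "__main__":
    print(zip_thresholds()[:3])   # first three thresholds
    print(expected_zips(92.8))    # zips expected at 92.8% complete
    print(expected_zips(47.7))    # zips expected at 47.7% complete
```

Under that assumption, a task at 92.8% should already have produced zip 11 and a task at 47.7% should have produced zip 5, which matches the two gaps reported.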
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Yes. The data is needed. I don't know what the problem is, though. Could you post the 4-character code of the tasks in question, please? Or even just copy and paste the entire string of the full label from the top of its server page. As an aside, I have a feeling that one of my zips went missing recently, but it'll be a while yet before I can upload the rest and see if there's an error message. |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
By 4-character code, I assume you mean the alphanumeric code after hadam3p_eu in the work unit name: hadam3p_eu_*xxxx*_2013_1_nnnnnnnnn_0. These are ne8f and g6dk. |
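For anyone who wants to pull that code out of a task or zip name automatically, a small sketch follows. The pattern is inferred only from the names quoted in this thread (hadam3p_eu, then four lowercase alphanumeric characters, then numeric fields); what the numeric fields mean is not stated here, so the regular expression and the `task_code` helper are assumptions for illustration.

```python
import re

# Pattern inferred from names quoted in this thread, e.g.
# hadam3p_eu_g6dk_2013_1_008856209_0 (assumption, not an official format).
TASK_NAME = re.compile(r"hadam3p_eu_([a-z0-9]{4})_\d{4}_\d+_\d+")


def task_code(name):
    """Return the 4-character code from a task or zip name, or None."""
    match = TASK_NAME.search(name)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in (
        "hadam3p_eu_g6dk_2013_1_008856209_0",
        "hadam3p_eu_ne8f_2013_1_008812943_0_11.zip",
    ):
        print(name, "->", task_code(name))
```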
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
That's them. Thanks. Might have to wait a couple of days, it being the weekend. Again! But I'll send an email. |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
Welcome. I have also noticed they have been crunching unusually fast. I've completed one of these already. Usually hadam3p_eu units take my system about 125 hours of crunching. That first one of the new batch (hadam3p_eu_i3gp_2013_1_008768925) took under 100 hours, and these two are on schedule to take about the same. It occurred to me the two phenomena might be connected. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Yes, good point about the speed. It could be that they're not doing all of the calcs or something. This might show up in the zip size. Have to log more data with some of these for comparison. At present I only log the start and end times, the number of hours, and the size of zip 13. |
Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258 |
The one I've just finished returned 13 trickles! I thought the zip file numbers were out with the percentage completion in the later stages too, but the correct number (13) appears to have been uploaded, with no gaps or additions (according to my BOINC "stdoutdae.txt" file). NG |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
I left the computer on overnight to see if this was a transient problem or a misconfiguration here. WU ne8f apparently completed normally, and uploaded zips 12 and 13. WU g6dk uploaded zips 6 and 7 as expected.

Oh, this is interesting. The whole thing may be a minor bug. I was going on the output from the BOINC manager event log, but I've had a look at the "stdoutdae.txt" file, which I wasn't aware of until Nigel Garvey mentioned it. The output from the two corresponds in all particulars, as far as I can see, except for two. The event log fails to mention two incidents that are in the "stdoutdae.txt" file:

12-Jul-2014 13:46:05 [climateprediction.net] Started upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip
12-Jul-2014 13:52:26 [climateprediction.net] Finished upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip
12-Jul-2014 19:29:29 [climateprediction.net] Started upload of hadam3p_eu_ne8f_2013_1_008812943_0_11.zip
12-Jul-2014 19:35:38 [climateprediction.net] Finished upload of hadam3p_eu_ne8f_2013_1_008812943_0_11.zip

The problem may lie with a minor error in my event log, and otherwise be a false alarm, in which case all I probably need to do is update BOINC. I suppose the next question is, are those files on your server or not? |
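Checking "stdoutdae.txt" by eye gets tedious, so here is a hedged sketch of how the check could be scripted. It assumes the log lines look exactly like the four quoted above ("Finished upload of <task>_<n>.zip") and that zips are numbered from 1. The helper names are made up for this example, and the sample data is just those four lines.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   12-Jul-2014 13:52:26 [climateprediction.net] Finished upload of <task>_<n>.zip
# The format is assumed from the lines quoted in the post above.
FINISHED = re.compile(r"Finished upload of (\S+)_(\d+)\.zip")


def uploaded_zips(lines):
    """Map each task name to the set of zip numbers whose upload finished."""
    seen = defaultdict(set)
    for line in lines:
        match = FINISHED.search(line)
        if match:
            seen[match.group(1)].add(int(match.group(2)))
    return seen


def missing_zips(numbers, highest=None):
    """Zip numbers absent from `numbers`, up to `highest` (default: largest seen)."""
    top = highest if highest is not None else max(numbers, default=0)
    return [n for n in range(1, top + 1) if n not in numbers]


if __name__ == "__main__":
    sample = [
        "12-Jul-2014 13:46:05 [climateprediction.net] Started upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip",
        "12-Jul-2014 13:52:26 [climateprediction.net] Finished upload of hadam3p_eu_g6dk_2013_1_008856209_0_5.zip",
        "12-Jul-2014 19:29:29 [climateprediction.net] Started upload of hadam3p_eu_ne8f_2013_1_008812943_0_11.zip",
        "12-Jul-2014 19:35:38 [climateprediction.net] Finished upload of hadam3p_eu_ne8f_2013_1_008812943_0_11.zip",
    ]
    for task, numbers in uploaded_zips(sample).items():
        print(task, sorted(numbers))
```

To run it against a real log, pass the lines of the file (for example `open(path)`) instead of `sample`, then call `missing_zips` on each task's set. This only shows what the local client recorded; it cannot confirm that the files actually reached the server.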
Joined: 16 Jan 10 Posts: 1084 Credit: 7,826,970 RAC: 5,066 |
"The one I've just finished returned 13 trickles! I thought the zip file numbers were out with the percentage completion in the later stages too, but the correct number (13) appear to have been uploaded, with no gaps or additions (according to my BOINC "stdoutdae.txt" file)."

That is odd. The alien trickle is this one:

08 Jul 2014 10:08:52 1073433 16706651 hadam3p_eu_j9dj_2013_1_008791627_0 1 69,222 225,267 3.2543

I've had a few "Model crashed: INITTIME: Atmosphere basis time mismatch" from the previous EU batch, which is a configuration error. Both now reported to project team ... |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,826,970 RAC: 5,066 |
"... The whole thing may be a minor bug. ..."

Thanks for reporting this. Twenty-five years or so of working as a software engineer have convinced me that bugs can only confidently be described as minor after they have been investigated and understood. For diagnostic purposes a minor error in output might be caused by a major internal error. I despair of the number of computer programmers who dismiss minor inconsistencies which subsequently prove fatal to the purpose of their application. |
Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258 |
Iain Inglis wrote: "The one I've just finished returned 13 trickles! I thought the zip file numbers were out with the percentage completion in the later stages too, but the correct number (13) appear to have been uploaded, with no gaps or additions (according to my BOINC "stdoutdae.txt" file)."

Hmm. It's timed shortly after trickles 5 and 6, which, because my BOINC client's set not to use the network overnight, were both sent at the same time one morning after I'd manually enabled network access earlier than the set time. The corresponding zip files, however, kept failing to upload. The reason given was "transient HTTP error", but I knew from previous experience that this probably meant my Mac was overdue for a reboot. So, at an opportune moment, I carefully suspended all BOINC computation, quit BOINC Manager, and shut down the machine. Then I restarted it and resumed BOINC computation once everything was up and running again. The two zip files immediately uploaded with no problem at all, and the additional trickle was apparently sent an hour after the uploads completed, being recorded on the task page five minutes after that. This is probably all connected, but I can't think why a spurious trickle should have been sent, and with an incorrect timestep. NG |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
Sorry. Delete. My mistake this time. |