Message boards : Number crunching : EAS batches 1001-4
Author | Message |
---|---|
Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013 |
Quote: "Just to be clear, this segv problem with these tasks is nothing to do with the files produced by the model - so don't waste time waiting for the OS to do its thing. It's a memory issue related to the model starting up, not reading the files." Thanks for the useful explanation, Glenn. I had noticed that the failures occurred at start-up, so having that confirmed is really useful. And having had 4 tasks fail today, I've set BOINC to no new work for CPDN for the moment. |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
FWIW, I am having a problem uploading a completed task. It has failed 7 times now: 1/22/2024 9:07:45 AM [error] Error reported by file upload server: [wah2_eas25_n1z2_201412_24_1002_012238572_0_r1345862682_out.zip] locked by file_upload_handler PID=2762255 Edit: I now have a total of 6 uploads across 3 machines, with this same problem. |
Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
Have posted to project on Trello board for batch numbers. I see that all my uploads have been going through fine. From memory, this may be the weird problem that only affects people with faster connections. Presumably there will be more reports soon, unless you have been caught by a temporary glitch which will resolve itself after some timeout period. |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
Thanks. I am not sure if someone fixed something, or if they just magically resolved themselves. In any case, all my pending uploads are gone now. |
Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
Quote: "Thanks. I am not sure if someone fixed something, or if they just magically resolved themselves. In any case, all my pending uploads are gone now." Pretty sure it is a server issue that is more prevalent with faster connections. The file lock put on by the server is something that happens at times when the upload is interrupted, for which there could be a number of reasons. I am not sure if hitting the retry button delays the lock getting taken off, in the same way that hitting the update project button restarts the backoff to get new work. Edit: I wouldn't be surprised if it crops up again at some point as the number of machines returning work ramps up. |
Joined: 6 Aug 04 Posts: 195 Credit: 28,350,763 RAC: 10,531 |
Most of the solar work is complete. Unfortunately, waking after 'hibernate' lost three of the five running tasks. The other two tasks should complete before the commissioning. Quote: "I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now." Quote: "Two more days for the first tasks to complete here. They survived Storm Isha brown-outs yesterday evening. We have planned power outages tomorrow, for connecting the solar panels, battery and a new charger. So fingers crossed that the WAH models behave after 'hibernate'." |
Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
Interesting that 1003 and 1004 seem to have over twice as many hard fails as 1001 and 1002, even though the latter batches have slightly lower numbers of tasks that have gone out. I get that GHG is greenhouse gas forcing. I am pretty sure I should know what AER is, but my mind has gone blank on that one! |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
AER = aerosol. It means the aerosol forcing, such as sulfate aerosol, has been changed. I wouldn't read anything into the number of fails yet; it is too early. But it will be similar to the early EAS batches. Quote: "Interesting that 1003 and 1004 seem to have over twice as many hard fails as 1001 and 1002, even though the latter batches have slightly lower numbers of tasks that have gone out. I get that GHG is greenhouse gas forcing. I am pretty sure I should know what AER is, but my mind has gone blank on that one!" |
Joined: 6 Aug 04 Posts: 195 Credit: 28,350,763 RAC: 10,531 |
Two tasks finished successfully. I've let just one task start and set 'no new tasks' while the contractor cracks on tomorrow with the solar and battery installation and commissioning. Quote: "Most of the solar work is complete. Unfortunately, waking after 'hibernate' lost three of the five running tasks. The other two tasks should complete before the commissioning." Quote: "I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now." Quote: "Two more days for the first tasks to complete here. They survived Storm Isha brown-outs yesterday evening. We have planned power outages tomorrow, for connecting the solar panels, battery and a new charger. So fingers crossed that the WAH models behave after 'hibernate'." |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
EAS Batches 1002, 1003, 1004 have been closed. An error in the production of the files used by these batches has been found, and the results are not scientifically accurate. These batches have been closed; the error will be corrected and the batches resubmitted. You can kill any tasks currently running with these batch numbers. The batch number can be found in the task title as the second-to-last number, e.g. wah2_eas25_h000_200912_24_1002_012229966 is for batch 1002. Batch 1001 is unaffected, please keep these tasks running! --- CPDN Visiting Scientist |
Joined: 6 Aug 04 Posts: 195 Credit: 28,350,763 RAC: 10,531 |
As requested, five EAS tasks from batches 1002, 1003 and 1004 have been aborted. One NZ task from batch 1005 is still ticking along. The CPU fan has gone quiet! |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
Quote: "EAS Batches 1002, 1003, 1004 have been closed" I just tried an update for the machines with those tasks, but nothing happened. Will a server-side abort be issued? That seems like the right way to do this. |
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
Killed three tasks, only 1001 and 1005 group tasks left! It's kind of a grey solar day, so I was debating suspending the machine for a day or two anyway (S3 suspend of the Linux host, with a Windows VM, doesn't disrupt tasks) - we're not into the sunny bits of the year yet. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
CPDN don't abort tasks from the server. They used to do it years ago but users complained. Abort them yourself if you want. Yes, the CPDN scheduler might restrict the number of tasks sent to your host, but that doesn't last long. Quote: "EAS Batches 1002, 1003, 1004 have been closed" --- CPDN Visiting Scientist |
Joined: 2 Jul 15 Posts: 21 Credit: 4,211,312 RAC: 1,512 |
Just to confirm, before I terminate something I shouldn't ... ... I should terminate: wah2_eas25_n20y_201412_24_1002_012238640 and wah2_eas25_n2y1_201612_24_1002_012239831 but not terminate: wah2_eas25_h03v_200912_24_1001_012230105. Correct? Thanks. |
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
David, looks reasonable to me! </random compute rig guy> |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
Quote: "CPDN don't abort tasks from the server. They used to do it years ago but users complained." With trickles, there are no (valid) reasons to complain any more. No need to complete the tasks to get paid for work done. They already get paid for partial work. So doing server-side aborts should be resumed. |
Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
Quote: "CPDN don't abort tasks from the server. They used to do it years ago but users complained." Trickles were used then as well. Not that I have any power over the policy. |
Joined: 2 Jul 15 Posts: 21 Credit: 4,211,312 RAC: 1,512 |
The deed is done. 2@1002 are terminated. 1@1001 remains. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I brought up the question of aborting tasks from the server in the CPDN meeting this morning, highlighting feedback from users on the forums regarding the closing of batches 1002-1004. CPDN agreed that in this case sending an abort from the server is appropriate, and they will 're-close' the batch with a 'send abort' for any existing tasks. To clarify their policy on closing batches, the head of CPDN told me that in future, if any issues are discovered with the scientific validity of a batch, it will be closed with 'Abort tasks'. Any other batches are typically closed once they reach 85-90% returns, but existing tasks are left running as the results are still scientifically useful. In the past, they have been concerned that volunteers may not be happy with the project suddenly aborting tasks on volunteer machines. Hope that helps. Quote: "CPDN don't abort tasks from the server. They used to do it years ago but users complained." --- CPDN Visiting Scientist |
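
The batch-number rule given in the closure announcement (the batch is the second-to-last number in the task name) can be checked mechanically rather than by eye. Below is a minimal Python sketch, assuming task names are underscore-separated exactly as in the examples quoted in this thread. The names `CLOSED_BATCHES`, `batch_number` and `should_abort` are hypothetical helpers made up for illustration; they are not part of BOINC or any CPDN software.

```python
# Batches announced as closed in this thread (1001 is explicitly unaffected).
CLOSED_BATCHES = {1002, 1003, 1004}


def batch_number(task_name: str) -> int:
    """Return the batch number, taken as the second-to-last
    underscore-separated field of the task name (assumed layout)."""
    parts = task_name.split("_")
    if len(parts) < 2:
        raise ValueError(f"unexpected task name: {task_name!r}")
    return int(parts[-2])


def should_abort(task_name: str) -> bool:
    """True if the task belongs to one of the closed batches."""
    return batch_number(task_name) in CLOSED_BATCHES


# Task names taken from the posts above.
tasks = [
    "wah2_eas25_n20y_201412_24_1002_012238640",
    "wah2_eas25_n2y1_201612_24_1002_012239831",
    "wah2_eas25_h03v_200912_24_1001_012230105",
]

for name in tasks:
    print(name, "abort" if should_abort(name) else "keep")
```

This only classifies names; it does not touch BOINC itself. Names carrying extra suffixes, such as the upload file name in the error message earlier in the thread, would need the suffix stripped before the second-to-last rule applies.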
©2024 cpdn.org