climateprediction.net (CPDN) home page
Thread 'EAS batches 1001-4'

Message boards : Number crunching : EAS batches 1001-4
Paul
Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 70180 - Posted: 22 Jan 2024, 16:57:02 UTC - in response to Message 70176.  

Just to be clear, this segv problem with these tasks is nothing to do with the files produced by the model - so don't waste time waiting for the OS to do its thing. It's a memory issue related to the model starting up, not reading the files.

Thanks for the useful explanation Glenn.

I had noticed that the failures occurred at start up, so having that confirmed is really useful.

And having had 4 tasks fail today, I've set BOINC to no new work for CPDN for the moment.
ID: 70180

zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70181 - Posted: 22 Jan 2024, 17:14:38 UTC
Last modified: 22 Jan 2024, 18:00:29 UTC

FWIW, I am having a problem uploading a completed task. It has failed 7 times now:

1/22/2024 9:07:45 AM	[error] Error reported by file upload server: [wah2_eas25_n1z2_201412_24_1002_012238572_0_r1345862682_out.zip] locked by file_upload_handler PID=2762255	


Edit: I now have a total of 6 uploads across 3 machines, with this same problem.
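For anyone comparing the same failure across several machines, here is a minimal sketch of pulling the file name and handler PID out of a log line like the one above. It assumes the message keeps exactly the wording quoted in this post; `LOCK_RE` and `parse_lock_error` are illustrative names, not part of BOINC.

```python
import re

# Pattern assumes the exact wording of the error line quoted in this post.
LOCK_RE = re.compile(
    r"\[error\] Error reported by file upload server: "
    r"\[(?P<file>[^\]]+)\] locked by file_upload_handler PID=(?P<pid>\d+)"
)


def parse_lock_error(line: str):
    """Return (file_name, pid) for a 'locked by file_upload_handler' line, else None."""
    match = LOCK_RE.search(line)
    if match is None:
        return None
    return match.group("file"), int(match.group("pid"))
```

Running this over the event log of each machine would show whether the stuck uploads all report a lock, and whether the PID changes between retries.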
ID: 70181

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 70183 - Posted: 22 Jan 2024, 19:13:49 UTC - in response to Message 70181.  

Have posted to project on Trello board for batch numbers. I see that all my uploads have been going through fine. From memory, this may be the weird problem that only affects people with faster connections. Presumably there will be more reports soon, unless you have been caught by a temporary glitch that will resolve itself after some timeout period.
ID: 70183

zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70186 - Posted: 23 Jan 2024, 0:30:15 UTC

Thanks. I am not sure if someone fixed something, or if they just magically resolved themselves. In any case, all my pending uploads are gone now.
ID: 70186

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 70187 - Posted: 23 Jan 2024, 8:10:53 UTC - in response to Message 70186.  
Last modified: 23 Jan 2024, 10:09:00 UTC

Thanks. I am not sure if someone fixed something, or if they just magically resolved themselves. In any case, all my pending uploads are gone now.
Pretty sure it is a server issue that is more prevalent with faster connections. The file lock put on by the server is something that happens at times when the upload is interrupted, for which there could be a number of reasons. I am not sure if hitting the retry button delays the lock getting taken off, in the same way that hitting the update project button restarts the backoff to get new work.

Edit: I wouldn't be surprised if it crops up again at some point as the number of machines returning work ramps up.
ID: 70187

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,350,763
RAC: 10,531
Message 70190 - Posted: 23 Jan 2024, 16:54:46 UTC - in response to Message 70179.  

I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now.
Two more days for the first tasks to complete here. They survived Storm Isha brown-outs yesterday evening. We have planned power outages tomorrow, for connecting the solar panels, battery and a new charger. So fingers crossed that the WAH models behave after 'hibernate'.
Most of the solar work is complete. Unfortunately, waking after 'hibernate' lost three of the five running tasks. The other two tasks should complete before the commissioning.
ID: 70190

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 70192 - Posted: 24 Jan 2024, 12:16:26 UTC

Interesting that 1003 and 1004 seem to have over twice as many hard fails as 1001 and 1002, even though the latter batches have slightly lower numbers of tasks that have gone out. I get that GHG is Greenhouse Gas forcing. I am pretty sure I should know what AER is but my mind has gone blank on that one!
ID: 70192

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70194 - Posted: 24 Jan 2024, 12:37:37 UTC - in response to Message 70192.  

AER = aerosol. It means the aerosol forcing, such as sulfate aerosol, has been changed.

I wouldn't read anything into the number of fails yet; it's too early. But it will be similar to the early EAS batches.

Interesting that 1003 and 1004 seem to have over twice as many hard fails as 1001 and 1002, even though the latter batches have slightly lower numbers of tasks that have gone out. I get that GHG is Greenhouse Gas forcing. I am pretty sure I should know what AER is but my mind has gone blank on that one!
ID: 70194

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,350,763
RAC: 10,531
Message 70198 - Posted: 24 Jan 2024, 20:39:21 UTC - in response to Message 70190.  

I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now.
Two more days for the first tasks to complete here. They survived Storm Isha brown-outs yesterday evening. We have planned power outages tomorrow, for connecting the solar panels, battery and a new charger. So fingers crossed that the WAH models behave after 'hibernate'.
Most of the solar work is complete. Unfortunately, waking after 'hibernate' lost three of the five running tasks. The other two tasks should complete before the commissioning.
Two tasks finished successfully. I've let just one task start and set 'no new tasks' while the contractor cracks on tomorrow with the solar and battery installation and commissioning.
ID: 70198

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70273 - Posted: 2 Feb 2024, 11:23:39 UTC

EAS Batches 1002, 1003, 1004 have been closed

An error in the production of the files used by these batches has been found and the results are not scientifically accurate. These batches have been closed, the error will be corrected and the batches resubmitted.

You can kill any tasks currently running with these batch numbers. The batch number can be found in the task title as the second-to-last number, e.g.
       wah2_eas25_h000_200912_24_1002_012229966
is for batch 1002.

Batch 1001 is unaffected, please keep these tasks running!
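
For anyone with a long task list, the check above can be sketched in a few lines of Python. This is only an illustration: it assumes every task name follows the underscore-separated layout of the names quoted in this thread, with the batch number as the sixth field, and `CLOSED_BATCHES`, `batch_number` and `should_abort` are hypothetical names. Reading the sixth field rather than the second-to-last one also copes with names that carry an extra trailing `_0`, as in the upload file name posted earlier.

```python
# Batches closed in this announcement.
CLOSED_BATCHES = {1002, 1003, 1004}


def batch_number(task_name: str) -> int:
    """Extract the batch number from a task name such as
    wah2_eas25_h000_200912_24_1002_012229966 (optionally followed by _0)."""
    fields = task_name.split("_")
    # Assumed layout: six leading fields, the sixth being the batch number,
    # followed by the workunit number and possibly a trailing suffix.
    if len(fields) < 7 or not fields[5].isdigit():
        raise ValueError(f"unexpected task name layout: {task_name!r}")
    return int(fields[5])


def should_abort(task_name: str) -> bool:
    """True if the task belongs to one of the closed batches."""
    return batch_number(task_name) in CLOSED_BATCHES


if __name__ == "__main__":
    for name in (
        "wah2_eas25_h000_200912_24_1002_012229966",
        "wah2_eas25_h03v_200912_24_1001_012230105",
    ):
        print(name, "abort" if should_abort(name) else "keep")
```

Names that do not fit the assumed layout raise an error rather than being silently classified, so nothing gets aborted by mistake.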
---
CPDN Visiting Scientist
ID: 70273

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,350,763
RAC: 10,531
Message 70274 - Posted: 2 Feb 2024, 11:33:42 UTC - in response to Message 70273.  

As requested, five EAS tasks from batches 1002, 1003 and 1004 have been aborted. One NZ task from batch 1005 is still ticking along. The CPU fan has gone quiet!
ID: 70274

zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70289 - Posted: 2 Feb 2024, 14:37:02 UTC - in response to Message 70273.  

EAS Batches 1002, 1003, 1004 have been closed


I just tried an update for the machines with those tasks, but nothing happened. Will a server-side abort be issued? That seems like the right way to do this.
ID: 70289

SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70294 - Posted: 2 Feb 2024, 19:59:37 UTC
Last modified: 2 Feb 2024, 20:00:33 UTC

Killed three tasks, only 1001 and 1005 group tasks left! It's kind of a grey solar day, so I was debating suspending the machine for a day or two anyway (S3 suspend of the Linux host, with a Windows VM, doesn't disrupt tasks) - we're not into the sunny bits of the year yet.
ID: 70294

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70295 - Posted: 2 Feb 2024, 20:23:59 UTC - in response to Message 70289.  

CPDN don't abort tasks from the server. They used to do it years ago but users complained.

Abort them yourself if you want. Yes, the CPDN scheduler might restrict the number of tasks sent to your host, but that doesn't last long.
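
For those who prefer the command line to the BOINC Manager, a rough sketch of aborting a task with `boinccmd`, whose `--task URL task_name abort` form is the command-line equivalent of the Abort button. The helper names `abort_command` and `abort_task` are illustrative only. The project URL is deliberately left as a parameter: it has to match what your own client reports (for example from `boinccmd --get_project_status`), and the task name must be given exactly as `boinccmd --get_tasks` lists it. Options for a remote host or a GUI RPC password are omitted.

```python
import subprocess


def abort_command(project_url: str, task_name: str) -> list[str]:
    """Build the boinccmd argument list that aborts one task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def abort_task(project_url: str, task_name: str) -> None:
    """Run boinccmd against the local client; raises if boinccmd reports failure."""
    subprocess.run(abort_command(project_url, task_name), check=True)
```

Printing the result of `abort_command(...)` first is a cheap way to double-check the task name before anything is actually aborted.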

EAS Batches 1002, 1003, 1004 have been closed


I just tried an update for the machines with those tasks, but nothing happened. Will a server-side abort be issued? That seems like the right way to do this.

---
CPDN Visiting Scientist
ID: 70295

David Berg
Joined: 2 Jul 15
Posts: 21
Credit: 4,211,312
RAC: 1,512
Message 70296 - Posted: 2 Feb 2024, 22:09:32 UTC - in response to Message 70273.  
Last modified: 2 Feb 2024, 22:18:04 UTC

Just to confirm, before I terminate something I shouldn't ...
... I should terminate:
wah2_eas25_n20y_201412_24_1002_012238640 and
wah2_eas25_n2y1_201612_24_1002_012239831

but not terminate:
wah2_eas25_h03v_200912_24_1001_012230105.

Correct?
Thanks.
ID: 70296

SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70297 - Posted: 2 Feb 2024, 23:54:37 UTC

David, looks reasonable to me! </random compute rig guy>
ID: 70297

zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70304 - Posted: 3 Feb 2024, 13:15:39 UTC - in response to Message 70295.  

CPDN don't abort tasks from the server. They used to do it years ago but users complained.


With trickles, there are no (valid) reasons to complain any more. No need to complete the tasks to get paid for work done. They already get paid for partial work. So doing server-side aborts should be resumed.
ID: 70304

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 70305 - Posted: 3 Feb 2024, 13:20:10 UTC - in response to Message 70304.  

CPDN don't abort tasks from the server. They used to do it years ago but users complained.


With trickles, there are no (valid) reasons to complain any more. No need to complete the tasks to get paid for work done. They already get paid for partial work. So doing server-side aborts should be resumed.

Trickles were used then as well. Not that I have any power over the policy.
ID: 70305

David Berg
Joined: 2 Jul 15
Posts: 21
Credit: 4,211,312
RAC: 1,512
Message 70311 - Posted: 5 Feb 2024, 7:35:52 UTC - in response to Message 70297.  

The deed is done. 2@1002 are terminated. 1@1001 remains.
ID: 70311

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70312 - Posted: 5 Feb 2024, 10:57:21 UTC - in response to Message 70304.  

I brought up the question of aborting tasks from the server in the CPDN meeting this morning, highlighting feedback from users on the forums regarding closing of batches 1002-1004.

CPDN agreed that in this case sending an Abort from the server is appropriate and they will 're-close' the batch with a 'send abort' for any existing tasks.

To clarify their policy on closing batches, the head of CPDN told me that in future, if there are any discovered issues with the scientific validity of a batch, it will be closed with 'Abort tasks'. Any other batches are typically closed once they reach 85-90% returns, but existing tasks are left running as the results are still scientifically useful. In the past, they have been concerned volunteers may not be happy with the project suddenly Aborting tasks on volunteer machines.

Hope that helps.

CPDN don't abort tasks from the server. They used to do it years ago but users complained.


With trickles, there are no (valid) reasons to complain any more. No need to complete the tasks to get paid for work done. They already get paid for partial work. So doing server-side aborts should be resumed.

---
CPDN Visiting Scientist
ID: 70312

©2024 cpdn.org