Thread 'error report for wah2_sas50'

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,750,791
RAC: 3,898
Message 57709 - Posted: 24 Jan 2018, 17:08:00 UTC
Last modified: 24 Jan 2018, 17:10:00 UTC

SAS50 workunits 'break down' after a few seconds.

<stderr_txt>

Model crashed: INANCILA:integer header error tmp/xadae.pipe_dummy
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=15784, selfPID=15784, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=15784, selfPID=16112, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
15:33:37 (16112): called boinc_finish(0)
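
For anyone comparing several failed tasks, here is a minimal sketch of how the crash reason could be pulled out of stderr text like the above. Only the `Model crashed:` line format comes from the log; the function name `crash_reasons` and the regular expression are illustrative assumptions, not part of BOINC or the CPDN application.

```python
import re
from collections import Counter

# Matches lines such as:
#   Model crashed: INANCILA:integer header error tmp/xadae.pipe_dummy
# The pattern is an assumption based on the single log excerpt above.
CRASH_LINE = re.compile(r"^Model crashed:\s*(?P<reason>.+?)\s*$", re.MULTILINE)


def crash_reasons(stderr_text: str) -> Counter:
    """Count each distinct 'Model crashed:' reason in a task's stderr text."""
    return Counter(m.group("reason") for m in CRASH_LINE.finditer(stderr_text))


sample = """Model crashed: INANCILA:integer header error tmp/xadae.pipe_dummy
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=15784, selfPID=15784, iMonCtr=2
Model crash detected, will try to restart...
15:33:37 (16112): called boinc_finish(0)
"""

for reason, count in crash_reasons(sample).items():
    print(count, reason)
```

Running the same function over the stderr of each failed task would show whether they all fail with the same INANCILA error or with different ones.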

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 57710 - Posted: 24 Jan 2018, 17:14:36 UTC
Last modified: 24 Jan 2018, 17:15:08 UTC

Have you had a bunch of them do so? Are any tasks that you might have downloaded today still running?

They think there's a corruption in a restart file (or restart files) but are unsure how widespread it is.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,750,791
RAC: 3,898
Message 57711 - Posted: 24 Jan 2018, 17:22:17 UTC

No, all workunits have crashed so far.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57712 - Posted: 24 Jan 2018, 17:22:19 UTC

I see that 5 on Bonsai's computer have crashed so far. https://www.cpdn.org/cpdnboinc/results.php?hostid=1377284

When a second one of mine crashed I had to wait an hour for the timeout before it would show as crashed on its page.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 57713 - Posted: 24 Jan 2018, 17:26:01 UTC - in response to Message 57712.  

> I see that 5 on Bonsai's computer have crashed so far. https://www.cpdn.org/cpdnboinc/results.php?hostid=1377284
>
> When a second one of mine crashed I had to wait an hour for the timeout before it would show as crashed on its page.

You can update the project in BOINC Manager to make the crashed task show on the webpage immediately. Of course, if you want it to request more tasks, that resets the communication timer to 1 hour. I've been doing that, since it appears I am having no luck downloading a good task, and now I've "fulfilled my daily quota" on both PCs.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,750,791
RAC: 3,898
Message 57714 - Posted: 24 Jan 2018, 17:28:25 UTC
Last modified: 24 Jan 2018, 17:38:34 UTC

> I see that 5 on Bonsai's computer have crashed so far.

That's right. But I received 5 more workunits, and they all crashed too, after 13 or 14 seconds.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57715 - Posted: 24 Jan 2018, 17:33:33 UTC

> You can update the project in BOINC Manager to show the crashed task immediately on the webpage.

Should have thought of that! Having also used up my daily quota on crashed tasks, this machine is now crunching other projects on one of its two cores. I am keeping my marginally faster machine running native Linux work, in the hope that at some stage testing might lead to main site work for it again.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,750,791
RAC: 3,898
Message 57716 - Posted: 24 Jan 2018, 18:03:36 UTC

Nothing new: the next five workunits crashed.

That makes 15 workunits so far with: Model crashed: INANCILA:integer header error

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57718 - Posted: 24 Jan 2018, 20:20:57 UTC
Last modified: 24 Jan 2018, 20:21:51 UTC

> Nothing new: The next five workunits crashed.

Thanks.
If anyone has any of these tasks (batches 703, 704 and 705) running past the first few seconds (they lasted 13 on my box) without crashing, please let us know.

Unsent tasks may be withdrawn. If some are working OK, though, the withdrawn ones will be re-issued, as that would point to a science issue rather than purely a dodgy XML file.

Edit: unsent tasks have been paused for now.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 57720 - Posted: 24 Jan 2018, 20:53:26 UTC - in response to Message 57718.  

I've got two of these from batch 705; they ran for 30 seconds and crashed with INANCILA:integer header error (running under WINE, BOINC 7.8.3).

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57721 - Posted: 24 Jan 2018, 21:09:10 UTC - in response to Message 57720.  

> I've got two of these from batch 705; they ran for 30 seconds and crashed with INANCILA:integer header error (running under WINE, BOINC 7.8.3).

17 seconds longer than my 703s managed! The priority now: anyone whose tasks are not crashing, please post!

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,985,838
RAC: 14,284
Message 57722 - Posted: 24 Jan 2018, 23:14:01 UTC - in response to Message 57721.  

Just had a 704 crash after about 12 seconds.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57725 - Posted: 25 Jan 2018, 10:04:04 UTC - in response to Message 57722.  

No reports so far of any of these tasks running. The team at Oxford are investigating the problem.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57726 - Posted: 25 Jan 2018, 14:23:21 UTC

The problem has been identified; once the correct files are uploaded to the system, the batches will go out again.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57728 - Posted: 25 Jan 2018, 16:49:29 UTC - in response to Message 57726.  

These have started pouring into the hopper, and one from batch 706 is now about 8 minutes in, well past the point where they were crashing before due to some misconfigured files.