EAS batches 1001-4

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,967,570
RAC: 21,693
Message 70313 - Posted: 5 Feb 2024, 12:26:40 UTC

Thanks Glenn.
ID: 70313
zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70314 - Posted: 5 Feb 2024, 12:53:06 UTC - in response to Message 70312.  

Great news! Thanks!
ID: 70314
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70315 - Posted: 5 Feb 2024, 15:50:32 UTC - in response to Message 70314.  

I believe I've successfully sent an 'abort task' for batches 1002, 1003, 1004. As it's the first time I've done this, let me know if that's not what happened!

As mentioned previously, the tasks were aborted because the scientific results were found to have errors caused by problems with the input files. These are being corrected now and the batches will be resubmitted (we hope to run the newly revised app with them).
---
CPDN Visiting Scientist
ID: 70315
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,967,570
RAC: 21,693
Message 70319 - Posted: 5 Feb 2024, 18:23:31 UTC - in response to Message 70315.  

I believe I've successfully sent an 'abort task' for batches 1002, 1003, 1004. As it's the first time I've done this, let me know if that's not what happened!
I had already aborted mine but the sudden drop in the number of tasks shown as running would suggest it has been successful.
ID: 70319
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70385 - Posted: 15 Feb 2024, 2:10:31 UTC
Last modified: 15 Feb 2024, 2:11:33 UTC

Four 1001 tasks failed at about the same time on my pipsqueak machine. They had all run for a very long time and were making progress.
My machine is like this:
Computer 1512658
Computer information
Total credit 	341,419
Average credit 	5,407.74
CPU type 	GenuineIntel
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors 	8
Coprocessors 	---
Virtualization 	None
Operating System 	Microsoft Windows 10
Core x64 Edition, (10.00.19045.00)
BOINC version 	7.24.1
Memory 	15.64 GB
Cache 	256 KB
Swap space 	18.02 GB
Total disk space 	460.73 GB
Free Disk Space 	307.5 GB
Measured floating point speed 	3.92 billion ops/sec
Measured integer speed 	29.31 billion ops/sec
Average upload rate 	195.13 KB/sec
Average download rate 	6044.66 KB/sec
Average turnaround time 	10.74 days


Here is one of them:
Task 22396678
Name 	wah2_eas25_h1kj_201312_24_1001_012232001_2
Workunit 	12232001
Created 	7 Feb 2024, 22:49:15 UTC
Sent 	7 Feb 2024, 22:50:32 UTC
Report deadline 	6 Jun 2024, 22:50:32 UTC
Received 	14 Feb 2024, 22:35:59 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	9 (0x00000009) Unknown error code
Computer ID 	1512658
Run time 	5 days 15 hours 45 min 54 sec
CPU time 	5 days 15 hours 31 min 43 sec
Validate state 	Invalid
Credit 	8,304.81
Device peak FLOPS 	3.92 GFLOPS
Application version 	Weather At Home 2 (wah2) v8.24
windows_intelx86
Peak working set size 	342.78 MB
Peak swap size 	310.43 MB
Peak disk usage 	94.46 MB
Stderr 	

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
The storage control block address is invalid.
 (0x9) - exit code 9 (0x9)</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5180, selfPID=13064, iMonCtr=1
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5180, selfPID=5180, iMonCtr=1

</stderr_txt>
]]>

ID: 70385
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70386 - Posted: 15 Feb 2024, 12:58:08 UTC - in response to Message 70385.  

Thanks for reporting.
<message>
The storage control block address is invalid.
 (0x9) - exit code 9 (0x9)</message>
Never seen that error before, but it's definitely coming from Windows, not the app itself, though it might be BOINC-related.

A quick google suggests it's possibly related to updates?
https://stackoverflow.com/questions/61939852/windows-process-activation-service-error-9-the-storage-control-block-address-is

Nothing I can do on the CPDN side but if it was me I'd reboot the machine.
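
For anyone who wants to check a failed task's stderr themselves, here is a minimal sketch that pulls the exit code out of the <message> text and looks it up in a short table of Win32 system error names. The function name and the table are purely illustrative, not part of BOINC or the CPDN app, and the table only covers a handful of codes.

```python
import re

# A few Win32 system error codes and their symbolic names.
# Code 9 is the one reported in the task above.
WIN32_ERROR_NAMES = {
    2: "ERROR_FILE_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    8: "ERROR_NOT_ENOUGH_MEMORY",
    9: "ERROR_INVALID_BLOCK",
}

def parse_exit_code(message_text):
    """Return (code, name) from text like '(0x9) - exit code 9 (0x9)'.

    Hypothetical helper: returns None if no exit code is found, and
    'UNKNOWN' as the name if the code is not in the small table above.
    """
    match = re.search(r"exit code (\d+) \(0x([0-9a-fA-F]+)\)", message_text)
    if match is None:
        return None
    code = int(match.group(1))
    return code, WIN32_ERROR_NAMES.get(code, "UNKNOWN")

sample = """The storage control block address is invalid.
 (0x9) - exit code 9 (0x9)"""
print(parse_exit_code(sample))
```

ERROR_INVALID_BLOCK is a Windows system error, which is consistent with the message coming from the operating system rather than from the model code.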
---
CPDN Visiting Scientist
ID: 70386
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,690,861
RAC: 10,559
Message 70387 - Posted: 15 Feb 2024, 13:03:47 UTC - in response to Message 70386.  

Particularly plausible because yesterday was 'patch Wednesday' in UTC: the Wednesday after the second Tuesday of the month. That's the usual day Microsoft releases major update packages - even large security update packages for Windows 7, which is otherwise out of support.
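
The date arithmetic can be checked with a short sketch. The function name is just for illustration; it finds the second Tuesday of a month and the Wednesday that follows it.

```python
from datetime import date, timedelta

def second_tuesday(year, month):
    """Return the date of the second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)

patch_tuesday = second_tuesday(2024, 2)
print(patch_tuesday, patch_tuesday + timedelta(days=1))
# prints "2024-02-13 2024-02-14"
```

That puts the following Wednesday on 14 Feb 2024, the same day the failed task was reported back.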
ID: 70387
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,967,570
RAC: 21,693
Message 70581 - Posted: 1 Mar 2024, 15:56:13 UTC

A resend I have picked up has this.

couldn't start app: Task file wah2_8.29_windows_intelx86.exe: file missing</message>
Not the virus scanner issue as it got going and has produced 7 zips gaining 407,756.40 in credit.

Task ID:1549001 The computer in question seems to be trashing everything with this though often not till several zips have been sent. Issue is there both with region independent and the older app versions.
ID: 70581
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2185
Credit: 64,822,615
RAC: 5,275
Message 70582 - Posted: 1 Mar 2024, 17:22:37 UTC - in response to Message 70581.  

A resend I have picked up has this.

couldn't start app: Task file wah2_8.29_windows_intelx86.exe: file missing</message>
Not the virus scanner issue as it got going and has produced 7 zips gaining 407,756.40 in credit.

Task ID:1549001 The computer in question seems to be trashing everything with this though often not till several zips have been sent. Issue is there both with region independent and the older app versions.

Dave, it looks like the 8.24 crashes were either signal 11 or they didn't have a stderr. For the region independent crashes, if 8.29 doesn't exist, how does it start the task, or the next RI task it crashes, etc.? Very strange.
ID: 70582
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,967,570
RAC: 21,693
Message 70583 - Posted: 1 Mar 2024, 18:17:19 UTC - in response to Message 70582.  

if 8.29 doesn't exist, how does it start the task, or the next RI task it crashes, etc.? Very strange.


My guess is an intermittent disk or memory issue, but that will probably remain a theory without access to the machine!
ID: 70583
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70584 - Posted: 1 Mar 2024, 19:24:47 UTC - in response to Message 70582.  
Last modified: 1 Mar 2024, 19:28:07 UTC

Dave, I presume you mean host ID 1549001, not task ID? It's an AV issue. Probably the app started OK the first time, uploaded a few zips, then on a restart the AV scanner decided it didn't like it (or its ruleset was updated, or it had been turned off before). Shame I can't tell which AV system killed it. McAfee have accepted the exe as a false positive and fixed the problem; I'm still waiting for a reply from Norton.

---
CPDN Visiting Scientist
ID: 70584
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,967,570
RAC: 21,693
Message 70585 - Posted: 1 Mar 2024, 22:20:57 UTC

Yes, that is what I meant, and yes, that makes sense. Running Linux, the only AV I use is ClamAV and I have yet to have an issue. Indeed the only things it has thrown up are a couple of dodgy emails that scraped through my ISP's system.
ID: 70585
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,967,570
RAC: 21,693
Message 70593 - Posted: 4 Mar 2024, 7:24:01 UTC
Last modified: 4 Mar 2024, 7:31:20 UTC

On two different machines I am getting
Error reported by file upload server: Server is out of disk space.
I have sent an email to Andy.

Edit: If this continues for more than a day or so, it may be worth halting tasks to prevent the disk_bound error that happens when the size of the task on disk exceeds that which has been allowed in the setup files. Unless I am mistaken, Andy will have to let someone in Korea know unless they see my post on the Trello Board for the batch.
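
To keep an eye on how close a task is getting to its limit while uploads are stuck, something like the sketch below would do. It is only an illustration: the directory path and the limit are values you would have to supply yourself, the function names are made up, and BOINC does its own check internally against the bound set for the workunit.

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (recursively)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total

def exceeds_disk_bound(path, bound_bytes):
    """True if the files under path take more space than bound_bytes.

    Hypothetical helper: bound_bytes stands in for whatever disk limit
    the task was given; it is not read from BOINC here.
    """
    return dir_size_bytes(path) > bound_bytes
```

For example, exceeds_disk_bound(task_dir, limit) with your own task directory and limit returns True once the unsent zips have pushed the task over the line.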
ID: 70593
Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70594 - Posted: 4 Mar 2024, 8:30:06 UTC - in response to Message 70593.  

I've contacted the project scientist directly. He's quicker at dealing with the IT people in S. Korea.

p.s. we should probably make this a separate thread as it also affects batches 1006 & 1007.
---
CPDN Visiting Scientist
ID: 70594
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,967,570
RAC: 21,693
Message 70595 - Posted: 4 Mar 2024, 9:04:23 UTC - in response to Message 70594.  

Thanks Glenn. It is now 6 PM in Korea. I don't know what hours he or his IT support people work. I guess we will get a clue from how quickly it gets sorted.
ID: 70595

©2024 cpdn.org