Thread 'Batch 996 Weather@Home2 East Asia25'

Author	Message
Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,491,370 RAC: 15,733	Message 69710 - Posted: 9 Oct 2023, 10:49:11 UTC - in response to Message 69708. Definitely do not abort the transfers. I raised this with Andy at the meeting this morning, and he will check. The server is definitely up and running. I've personally had uploads that took 12 retries before eventually getting through. Might be congestion at the Korean site. Should I just abort the transfer or keep my fingers crossed that it will go at some point? I would keep them at least till Glenn reports back from the meeting tomorrow morning. ID: 69710 · Reply Quote

zombie67 [MM] Send message Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128	Message 69711 - Posted: 9 Oct 2023, 11:49:55 UTC Last modified: 9 Oct 2023, 12:27:06 UTC FWIW, I have 7 zips that cannot upload. "transient HTTP error" ID: 69711 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 69712 - Posted: 9 Oct 2023, 15:46:34 UTC - in response to Message 69711. FWIW, I have 7 zips that cannot upload. "transient HTTP error" I have three tasks running and have had no trouble uploading zip files. Each has uploaded seven .zip files. Here is one of them:
Task 22340449
Name wah2_eas25_a3fh_200712_24_996_012227993_0
Workunit 12227993
Created 5 Oct 2023, 16:02:19 UTC
Sent 5 Oct 2023, 16:38:36 UTC
Report deadline 16 Oct 2024, 21:58:36 UTC
Received ---
Server state In progress
Outcome ---
Client state New
Exit status 0 (0x00000000)
Computer ID 1512658
Run time
CPU time
Validate state Initial
Credit 5,819.81
Device peak FLOPS 4.23 GFLOPS
Application version Weather At Home 2 (wah2) v8.24 windows_intelx86
Stderr --
Latest Trickles Received
Time Sent (UTC)       Host ID  Result ID  Result Name                                Timestep  CPU Time (sec)  Average (sec/TS)
09 Oct 2023 06:04:24  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  80,939    306,949         3.7923
08 Oct 2023 17:21:55  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  69,419    261,246         3.7633
08 Oct 2023 05:04:43  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  57,899    217,100         3.7496
07 Oct 2023 16:52:41  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  46,379    173,219         3.7349
07 Oct 2023 04:53:35  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  34,859    130,196         3.7349
06 Oct 2023 16:55:50  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  23,339    87,178          3.7353
06 Oct 2023 05:00:33  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  11,819    44,353          3.7527
ID: 69712 · Reply Quote
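
The "Average (sec/TS)" column in the trickle listing above looks like cumulative CPU time divided by the timestep count. A minimal Python sketch, assuming that relationship holds (the function name `avg_sec_per_timestep` and the `TRICKLES` list are illustrative only, not part of BOINC or CPDN tooling), checks it against the figures posted:

```python
def avg_sec_per_timestep(cpu_time_sec: int, timestep: int) -> float:
    """Cumulative CPU seconds divided by timesteps completed, rounded to 4 d.p."""
    if timestep <= 0:
        raise ValueError("timestep must be positive")
    return round(cpu_time_sec / timestep, 4)


# (timestep, CPU time in seconds, average as listed), copied from the post above.
TRICKLES = [
    (80_939, 306_949, 3.7923),
    (69_419, 261_246, 3.7633),
    (57_899, 217_100, 3.7496),
    (46_379, 173_219, 3.7349),
    (34_859, 130_196, 3.7349),
    (23_339, 87_178, 3.7353),
    (11_819, 44_353, 3.7527),
]

for timestep, cpu_time, listed in TRICKLES:
    computed = avg_sec_per_timestep(cpu_time, timestep)
    # Flag any row where the computed value drifts from the listed one.
    status = "ok" if abs(computed - listed) < 1e-4 else "MISMATCH"
    print(f"timestep {timestep:>6}: computed {computed:.4f}, listed {listed:.4f} [{status}]")
```

On these seven rows the computed and listed values agree to four decimal places, so the rate has stayed near 3.7 to 3.8 seconds per timestep throughout the run so far.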

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,491,370 RAC: 15,733	Message 69713 - Posted: 9 Oct 2023, 16:14:41 UTC - in response to Message 69711. FWIW, I have 7 zips that cannot upload. "transient HTTP error" Andy's just informed me that he's restarted the httpd server on the Korean machine. It was running and not out of space, but there were a lot of uploads and most likely stale connections. Hope that's got stuck uploads moving again. If it misbehaves again, pls post it here. ID: 69713 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 69714 - Posted: 9 Oct 2023, 16:15:49 UTC Last modified: 9 Oct 2023, 16:16:16 UTC No issues uploading zips so far here but this one failed after uploading nine zips with, <![CDATA[ <message> Invalid drive. (0xf) - exit code 15 (0xf)</message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... No Process Handle Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=920, selfPID=920, iMonCtr=1 No Process Handle Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=920, selfPID=936, iMonCtr=1 </stderr_txt> This is running the client under WINE. Tasks under Windows in a VM will, I suspect, fail, based on the failures at previous attempts. I read that exit code 15 means the process has been requested to exit gracefully! ID: 69714 · Reply Quote
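
The two readings of "15" in the post above come from different numbering schemes. As a rough Python sketch (the table `WINDOWS_ERRORS` and the function `describe_exit_code` are hand-written for illustration, and it is only an assumption that the task's exit status is being reported as a Windows system error code): Windows system error 15 is ERROR_INVALID_DRIVE, which fits the "Invalid drive" text, while 15 read as a POSIX signal number is SIGTERM, the polite request to terminate.

```python
import signal

# Hand-written, illustrative table with the single entry relevant here:
# Windows system error code 15 is ERROR_INVALID_DRIVE.
WINDOWS_ERRORS = {15: "ERROR_INVALID_DRIVE"}


def describe_exit_code(code: int) -> dict:
    """Show the hex form of an exit code and both candidate interpretations."""
    try:
        # Interpret the number as a signal; raises ValueError if there is no such signal.
        signal_name = signal.Signals(code).name
    except ValueError:
        signal_name = None
    return {
        "hex": hex(code),
        "windows_error": WINDOWS_ERRORS.get(code),
        "posix_signal": signal_name,
    }


print(describe_exit_code(15))
```

Which reading applies depends on who set the exit status; the "Invalid drive" wording in the message suggests the Windows-style interpretation, but that is an inference rather than something the log states.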

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,491,370 RAC: 15,733	Message 69717 - Posted: 9 Oct 2023, 18:12:06 UTC - in response to Message 69714. Funnily enough, I was looking over the hard fail workunits over the last couple of days and I've seen multiple tasks failing with that kind of error 'invalid device/drive, device not found'. I am starting to wonder if it's task related rather than just host specific. But I'd need to trawl through the logs of all the fails after the batch to see how prevalent it is to be sure. I've never seen that kind of error with previous batches. ID: 69717 · Reply Quote

ChelseaOilman Send message Joined: 24 Dec 19 Posts: 32 Credit: 41,242,061 RAC: 72,754	Message 69719 - Posted: 9 Oct 2023, 20:07:01 UTC Add me to the list of people having issues uploading zip files. Sometimes retry works, most of the time it doesn't. ID: 69719 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 69720 - Posted: 9 Oct 2023, 20:43:21 UTC - in response to Message 69717. I've never seen that kind of error with previous batches. A new one on me which is why I posted. ID: 69720 · Reply Quote

rob Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073	Message 69721 - Posted: 9 Oct 2023, 21:26:50 UTC This might be a red herring, but.... All three .exe files associated with the current wah2 tasks are 32-bit, so my thought is that the current batch of eas25 tasks (batch 996) covers a large (geographic?) area, and someone suggested that one of the problems may be that there is an overflow in an array, and this causes the task to crash in the first few minutes of execution. Could this be solved by compiling the application in 64 bit mode - or is my herring really red? Likewise the apparently random task crashes mid-run that a few have seen might be another array exceeding its (32-bit) bounds? ID: 69721 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 69722 - Posted: 9 Oct 2023, 22:55:14 UTC - in response to Message 69717. Funnily enough, I was looking over the hard fail workunits over the last couple of days and I've seen multiple tasks failing with that kind of error 'invalid device/drive, device not found'. I am starting to wonder if it's task related rather than just host specific. But I'd need to trawl through the logs of all the fails after the batch to see how prevalent it is to be sure. I've never seen that kind of error with previous batches. These have happened occasionally in the past for me with, I believe, the hadam4/h model series. The best I can figure is that it happens when lots of disk writes are occurring with multiple models, like when all the models are essentially in sync with each other and saving files, or finishing the model at the same time. I haven't had one for a long time though. When I'm running one or two models at a time, I've never seen it. ID: 69722 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 69723 - Posted: 9 Oct 2023, 23:12:04 UTC - in response to Message 69722. Looking back on the message board for drive not specified with Search limits set to no limit, Iain Inglis had some ideas about that error message. It goes back farther than the hadam4 models, and may have been on Windows tasks instead. My memory isn't performing too well today. ID: 69723 · Reply Quote

ChelseaOilman Send message Joined: 24 Dec 19 Posts: 32 Credit: 41,242,061 RAC: 72,754	Message 69724 - Posted: 9 Oct 2023, 23:23:06 UTC ID: 69724 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 69727 - Posted: 10 Oct 2023, 5:32:59 UTC - in response to Message 69721. Last modified: 10 Oct 2023, 5:44:00 UTC This might be a red herring, but.... All three .exe files associated with the current wah2 tasks are 32-bit Not sure. I do know that going 64bit has been suggested before and it would spare those of us who have taken the pledge from having to install 32bit libraries. Having looked at the old thread, its suggestion that BOINC is translating a FORTRAN error into a Windows error description makes sense. Perhaps something to look at on the BOINC fora or to raise as an issue on GitHub? ID: 69727 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 69730 - Posted: 10 Oct 2023, 6:37:11 UTC Last modified: 10 Oct 2023, 7:06:29 UTC The researcher in Korea reports files seem to be uploading normally. However, after over 70 zips going through without issue, I have got one that is being stubborn at the moment. This could be a bandwidth issue in which case all should clear eventually but before that happens some might run into problems with BOINC limits or disk space. I will have to keep an eye on my VM and pause processing if this becomes an issue. Edit: after over an hour of refusing to budge, another click on the retry now button and it has cleared. ID: 69730 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,719,896 RAC: 7,946	Message 69731 - Posted: 10 Oct 2023, 7:39:34 UTC I'm continuing to monitor the upload duration of the .zip files, all of which are about the same size. I was slightly surprised to see a restart file being uploaded at the same time as .zip_12 - that one's about 10% bigger. From memory, only one restart file is specified per task (I'll check later), so there won't be one on task completion. But there will be a surge of data for the upload server as users reach the mid-point of the run (assuming the restart files go to the same server - I'll check that too). ID: 69731 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 69732 - Posted: 10 Oct 2023, 7:45:39 UTC - in response to Message 69731. I was slightly surprised to see a restart file being uploaded at the same time as .zip_12 Pretty sure I first noticed that on testing for the previous batch. Also on previous main site batch. ID: 69732 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,719,896 RAC: 7,946	Message 69733 - Posted: 10 Oct 2023, 10:41:23 UTC - in response to Message 69732. OK, surprise resolved. There is only one restart.zip, but there's also an out.zip, which I assume will be sent at the very end - that makes sense. ID: 69733 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,491,370 RAC: 15,733	Message 69734 - Posted: 10 Oct 2023, 11:31:09 UTC - in response to Message 69721. It's a red herring. All 3 executables are supposed to be 32 bit. The problem with fails after restarts isn't anything to do with 32bit array sizes (and compiling into 64bit is not that easy, as the apps rely on 32bit addressing for some shared memory ops). I'm not going into details but the problem is related to the communication between the global & regional models - we have a pretty good idea what's causing it. It's not an easy fix though as the model doesn't have much control over the computing environment it's running in. This might be a red herring, but.... All three .exe files associated with the current wah2 tasks are 32-bit, so my thought is that the current batch of eas25 tasks (batch 996) covers a large (geographic?) area, and someone suggested that one of the problems may be that there is an overflow in an array, and this causes the task to crash in the first few minutes of execution. Could this be solved by compiling the application in 64 bit mode - or is my herring really red? Likewise the apparently random task crashes mid-run that a few have seen might be another array exceeding its (32-bit) bounds? --- CPDN Visiting Scientist ID: 69734 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,491,370 RAC: 15,733	Message 69735 - Posted: 10 Oct 2023, 11:44:41 UTC - in response to Message 69723. I did try a google search but it didn't return anything useful. I've emailed the CPDN folk to see if they recognise it. To me, it suggests the hardware is starting to fail. Might be time to check the drive health? But good to know this is not a new issue, thanks for that. Looking back on the message board for drive not specified with Search limits set to no limit, Iain Inglis had some ideas about that error message. It goes back farther than the hadam4 models, and may have been on Windows tasks instead. My memory isn't performing too well today. --- CPDN Visiting Scientist ID: 69735 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 69736 - Posted: 10 Oct 2023, 12:49:43 UTC - in response to Message 69735. @Glenn I should have said to use the Advanced search link at the top of the forum; that is how you would get to the search I was talking about. ID: 69736 · Reply Quote