Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25
Author | Message |
---|---|
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Should I just abort the transfer or keep my fingers crossed that it will go at some point?" "I would keep them at least till Glenn reports back from the meeting tomorrow morning." Definitely do not abort the transfers. I raised this with Andy at the meeting this morning, and he will check. The server is definitely up and running. I've personally had uploads that took 12 retries before eventually getting through. It might be congestion at the Korean site. |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
FWIW, I have 7 zips that cannot upload. "transient HTTP error" |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"FWIW, I have 7 zips that cannot upload. 'transient HTTP error'" I have three tasks running and have had no trouble uploading zip files. Each has uploaded seven .zip files. Here is one of them: Task: 22340449; Name: wah2_eas25_a3fh_200712_24_996_012227993_0; Workunit: 12227993; Created: 5 Oct 2023, 16:02:19 UTC; Sent: 5 Oct 2023, 16:38:36 UTC; Report deadline: 16 Oct 2024, 21:58:36 UTC; Received: ---; Server state: In progress; Outcome: ---; Client state: New; Exit status: 0 (0x00000000); Computer ID: 1512658; Run time: (blank); CPU time: (blank); Validate state: Initial; Credit: 5,819.81; Device peak FLOPS: 4.23 GFLOPS; Application version: Weather At Home 2 (wah2) v8.24 windows_intelx86; Stderr: --. Latest trickles received, all for Host ID 1512658, Result ID 22340449, Result Name wah2_eas25_a3fh_200712_24_996_012227993_0, listed as time sent (UTC): timestep, CPU time (sec), average (sec/TS). 09 Oct 2023 06:04:24: 80,939, 306,949, 3.7923; 08 Oct 2023 17:21:55: 69,419, 261,246, 3.7633; 08 Oct 2023 05:04:43: 57,899, 217,100, 3.7496; 07 Oct 2023 16:52:41: 46,379, 173,219, 3.7349; 07 Oct 2023 04:53:35: 34,859, 130,196, 3.7349; 06 Oct 2023 16:55:50: 23,339, 87,178, 3.7353; 06 Oct 2023 05:00:33: 11,819, 44,353, 3.7527. |
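For anyone wanting to sanity-check the trickle figures in the post above: the Average (sec/TS) column looks like it is simply cumulative CPU time divided by the timestep count. A minimal Python sketch, assuming that reading is right. The function and variable names are made up for illustration; the numbers are copied from the trickle list in the post.

```python
# Trickle data copied from the post, oldest first:
# (timestep, CPU time in seconds, reported average in sec/TS).
# Assumption: reported average = cumulative CPU time / timestep count.
TRICKLES = [
    (11_819, 44_353, 3.7527),
    (23_339, 87_178, 3.7353),
    (34_859, 130_196, 3.7349),
    (46_379, 173_219, 3.7349),
    (57_899, 217_100, 3.7496),
    (69_419, 261_246, 3.7633),
    (80_939, 306_949, 3.7923),
]


def average_sec_per_timestep(timestep, cpu_seconds):
    """CPU seconds spent per model timestep so far."""
    return cpu_seconds / timestep


def averages_match(trickles, tolerance=1e-4):
    """True if every reported average agrees with CPU time / timestep within tolerance."""
    return all(
        abs(average_sec_per_timestep(ts, cpu) - reported) < tolerance
        for ts, cpu, reported in trickles
    )


def timestep_gaps(trickles):
    """Number of timesteps between consecutive trickles."""
    return [later[0] - earlier[0] for earlier, later in zip(trickles, trickles[1:])]


print("averages consistent:", averages_match(TRICKLES))
print("timesteps between trickles:", timestep_gaps(TRICKLES))
```

The same gap list also shows how many timesteps pass between trickles for this task, which gives a rough way to estimate when the next one is due on a given host.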
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"FWIW, I have 7 zips that cannot upload. 'transient HTTP error'" Andy's just informed me that he's restarted the httpd server on the Korean machine. It was running and not out of space, but there were rather a lot of uploads and most likely stale connections. Hope that's got the stuck uploads moving again. If it misbehaves again, pls post it here. |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
No issues uploading zips so far here, but this one failed after uploading nine zips with: <![CDATA[ <message> Invalid drive. (0xf) - exit code 15 (0xf)</message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... No Process Handle Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=920, selfPID=920, iMonCtr=1 No Process Handle Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=920, selfPID=936, iMonCtr=1 </stderr_txt> This is running the client under WINE. I suspect tasks under Windows in a VM will fail too, based on the failures at previous attempts. From what I read, exit code 15 means the process has been requested to exit gracefully! |
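The two readings of "15" in the post above come from two different numbering conventions: in the Windows system error table, code 15 (0xF) is ERROR_INVALID_DRIVE, while on POSIX systems signal number 15 is SIGTERM, the polite "please exit" request. A small Python sketch of that ambiguity follows. It only shows how one number can be read two ways; it makes no claim about which reading applies to this task or about how BOINC builds its message. The lookup table and the function name are made up for illustration.

```python
import signal

# Windows system error code 15 (0xF) is ERROR_INVALID_DRIVE.
# Illustrative table with a single entry; it is not a complete list.
WINDOWS_SYSTEM_ERRORS = {
    15: "ERROR_INVALID_DRIVE (The system cannot find the drive specified.)",
}


def readings(exit_code):
    """Collect the possible readings of a numeric exit code under each convention."""
    result = {"hex": hex(exit_code)}
    if exit_code in WINDOWS_SYSTEM_ERRORS:
        result["windows_error"] = WINDOWS_SYSTEM_ERRORS[exit_code]
    try:
        # Raises ValueError if no signal with this number exists on this platform.
        result["signal"] = signal.Signals(exit_code).name
    except ValueError:
        pass
    return result


print(readings(15))
```

Which table is the right one depends on what actually produced the number, so the text next to an exit code is only as reliable as that assumption.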
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Funnily enough, I was looking over the hard-fail workunits over the last couple of days and I've seen multiple tasks failing with that kind of error: 'invalid device/drive, device not found'. I am starting to wonder if it's task-related rather than just host-specific. But I'd need to trawl through the logs of all the fails after the batch to see how prevalent it is to be sure. I've never seen that kind of error with previous batches. |
Joined: 24 Dec 19 Posts: 32 Credit: 41,012,505 RAC: 76,708 |
Add me to the list of people having issues uploading zip files. Sometimes retry works, most of the time it doesn't. |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
"I've never seen that kind of error with previous batches." A new one on me, which is why I posted. |
Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
This might be a red herring, but.... All three .exe files associated with the current wah2 tasks are 32-bit. The current batch of eas25 tasks (batch 996) covers a large (geographic?) area, and someone suggested that one of the problems may be an overflow in an array, which causes the task to crash in the first few minutes of execution. Could this be solved by compiling the application in 64-bit mode, or is my herring really red? Likewise, might the apparently random mid-run task crashes that a few have seen be another array exceeding its (32-bit) bounds? |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"Funnily enough, I was looking over the hard fail workunits last couple of days and I've seen multiple tasks failing with that kind of error 'invalid device/drive, device not found'. I am starting to wonder if it's task related rather than just host specific. But I'd need to trawl through the logs of all the fails after the batch to see how prevalent it is to be sure. I've never seen that kind of error with previous batches." These have happened occasionally in the past for me with, I believe, the hadam4/h model series. The best I can figure is that it happens when lots of disk writes are occurring with multiple models, such as when all the models are essentially in sync with each other and saving files, or finishing at the same time. I haven't had one for a long time though. When I'm running one or two models at a time, I've never seen it. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Looking back on the message board for "drive not specified" with Search limits set to no limit, Iain Inglis had some ideas about that error message. It goes back further than the hadam4 models, and may have been on Windows tasks instead. My memory isn't performing too well today. |
Joined: 24 Dec 19 Posts: 32 Credit: 41,012,505 RAC: 76,708 |
|
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
"This might be a red herring, but...." Not sure. I do know that going 64-bit has been suggested before, and it would spare those of us who have taken the pledge from having to install 32-bit libraries. Having looked at the old thread, the suggestion that BOINC is translating a FORTRAN error into a Windows error description makes sense. Perhaps something to look at on the BOINC fora or to raise as an issue on GitHub? |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
The researcher in Korea reports files seem to be uploading normally. However, after over 70 zips going through without issue, I have got one that is being stubborn at the moment. This could be a bandwidth issue, in which case all should clear eventually, but before that happens some might run into problems with BOINC limits or disk space. I will have to keep an eye on my VM and pause processing if this becomes an issue. Edit: after over an hour of refusing to budge, another click on the retry now button and it has cleared. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,705,793 RAC: 9,655 |
I'm continuing to monitor the upload duration of the .zip files, all of which are about the same size. I was slightly surprised to see a restart file being uploaded at the same time as .zip_12 - that one's about 10% bigger. From memory, only one restart file is specified per task (I'll check later), so there won't be one on task completion. But there will be a surge of data for the upload server as users reach the mid-point of the run (assuming the restart files go to the same server - I'll check that too). |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
"I was slightly surprised to see a restart file being uploaded at the same time as .zip_12" Pretty sure I first noticed that in testing for the previous batch. Also on the previous main site batch. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,705,793 RAC: 9,655 |
OK, surprise resolved. There is only one restart.zip, but there's also an out.zip, which I assume will be sent at the very end - that makes sense. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"This might be a red herring, but...." It's a red herring. All 3 executables are supposed to be 32-bit. The problem with fails after restarts isn't anything to do with 32-bit array sizes (and compiling to 64-bit is not easy, as the apps rely on 32-bit addressing for some shared memory ops). I'm not going into details, but the problem is related to the communication between the global & regional models - we have a pretty good idea what's causing it. It's not an easy fix though, as the model doesn't have much control over the computing environment it's running in. --- CPDN Visiting Scientist |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Looking back on the message board for" I did try a Google search but it didn't return anything useful. I've emailed the CPDN folk to see if they recognise it. To me, it suggests the hardware is starting to fail. Might be time to check the drive health? But good to know this is not a new issue, thanks for that. --- CPDN Visiting Scientist |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
@Glenn I should have said to use the Advanced search link at the top of the forum; that is how you would get to the search I was talking about. |
©2024 cpdn.org