Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25
Author | Message |
---|---|
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,374,828 RAC: 10,749 |
The five recalcitrant zips uploaded earlier today. |
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,891,165 RAC: 45,129 |
Uploads complete! +1 Supporting BOINC, a great concept ! |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
And the number of hosts reporting completed tasks in last 24 hours has doubled since I looked earlier this morning. My last two should finish shortly. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
"The big jump in the number of users reporting tasks is, I think, evidence that switching to Jasmine has worked." No, the switch to JASMIN hasn't happened -- CPDN are looking into moving the Korean machine outside the firewall first, as that would be easier for the scientists. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"No, the switch to JASMIN hasn't happened -- CPDN are looking into moving the Korean machine outside the firewall first, as that would be easier for the scientists." Something has changed if the stuck uploads have gone through. The increase in number of machines reporting could I suppose be due to slower machines now finishing tasks. My two in the VM have just reported. They take about 20% longer than those using WINE. Next batch I shall attempt running a task under both systems to see what differences there are or if I get a resend and catch it in time. |
Send message Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
My 40+ uploads finally went through overnight. First time the transfers tab has been empty in weeks. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
It's possible something has changed at the Korean side that I'm not aware of. I will ask & report back. I know there was a high-level exchange of emails yesterday. "No, the switch to JASMIN hasn't happened -- CPDN are looking into moving the Korean machine outside the firewall first, as that would be easier for the scientists." "Something has changed if the stuck uploads have gone through. The increase in number of machines reporting could I suppose be due to slower machines now finishing tasks. My two in the VM have just reported. They take about 20% longer than those using WINE. Next batch I shall attempt running a task under both systems to see what differences there are or if I get a resend and catch it in time." Anyway, whatever's happened, I'm glad! |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Anyway, whatever's happened, I'm glad!" Agreed! Knowing what, if anything, has changed is mainly to satisfy my curiosity, and secondly to have ideas in case it happens again. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Confirmation from the Korean site. Their IT staff have opened up the http port on their firewall -- effectively they're temporarily disabling protection against DDoS. --- CPDN Visiting Scientist |
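For anyone wondering whether an upload server's HTTP port is reachable from their own machine, a rough check might look like the sketch below. The function name `port_is_open` is made up for illustration, and the host name has to be supplied by you, since the thread does not give one. A successful TCP connection only shows the port is not blocked; it does not prove that BOINC uploads will go through.

```python
import socket


def port_is_open(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds.

    Hypothetical helper: it only tests reachability of the port, nothing
    about the upload handler behind it.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Call it with the upload server name shown in your own BOINC event log, for example `port_is_open(upload_host, 80)`, where `upload_host` is a placeholder.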
Send message Joined: 24 Dec 19 Posts: 32 Credit: 41,218,846 RAC: 73,465 |
Even though the server status page always shows no available tasks it seems there are a few available. Built a new Ryzen 7950X system yesterday and got BOINC running on it last night. Checked it this morning and it was crunching 2 CPDN tasks. Uploads still going great for me. No tasks waiting to upload. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Even though the server status page always shows no available tasks it seems there are a few available. Built a new Ryzen 7950X system yesterday and got BOINC running on it last night. Checked it this morning and it was crunching 2 CPDN tasks." There will be the odd retread that has failed on its first and possibly second attempt for a while yet. I see I picked up two a few minutes ago. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Because the model is flaky for this batch region when restarting (e.g. power off/on), we are losing a lot of the 1st & 2nd attempts. That's why we're getting more resends than normal. I am sure a lot of the hard fails are simply due to this and not because of an inherent problem with the model perturbations. Not sure yet whether CPDN will decide to rerun them. |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
Hopefully, if they do, someone will look for the root cause of the issue that has led to the poor restart performance of these tasks. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Two _1 tasks running here. I have the same tasks running both under WINE and in Windows in a VM, which hopefully will enable some comparisons to be made between the output files. Network activity is turned off for the WINE install of BOINC. In fact network activity is off for both, so the zips on the Windows install don't upload before I get a chance to look at the files. What I am not sure of, Glen, is whether, even if the science data is the same between both runs, tasks that complete under WINE but not under Windows are still invalid, or even whether there is a way of checking that. |
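One way to make the comparison described above is to hash every file inside the matching zips from the two installs and list the members that differ. This is only a sketch: the function names and the example paths are hypothetical, and byte-for-byte equality is a stricter test than scientific equivalence, so a mismatch does not by itself mean the results are useless.

```python
import hashlib
import zipfile


def member_digests(zip_path):
    """Map each file inside the zip to the SHA-256 of its uncompressed bytes."""
    digests = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories have no content to hash
            data = archive.read(info.filename)
            digests[info.filename] = hashlib.sha256(data).hexdigest()
    return digests


def compare_zips(zip_a, zip_b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of member names."""
    a = member_digests(zip_a)
    b = member_digests(zip_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_a, only_b, differing
```

For example, `compare_zips("wine/task_1.zip", "vm/task_1.zip")` (made-up paths) returns three lists; if all three are empty, the two zips hold identical files.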
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Do you want to see this? It failed pretty fast on a Windows 10 box.
Task 22347812
Name wah2_eas25_a11q_199112_24_996_012224906_2
Workunit 12224906
Created 16 Oct 2023, 23:44:03 UTC
Sent 16 Oct 2023, 23:44:37 UTC
Report deadline 28 Oct 2024, 5:04:37 UTC
Received 17 Oct 2023, 0:45:18 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 0 (0x00000000)
Computer ID 1512658
Run time 2 min 41 sec
CPU time 2 min 23 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 4.23 GFLOPS
Application version Weather At Home 2 (wah2) v8.24 windows_intelx86
Peak working set size 166.88 MB
Peak swap size 160.23 MB
Peak disk usage 0.01 MB
Stderr
<core_client_version>7.24.1</core_client_version>
<![CDATA[
<stderr_txt>
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
19:47:52 (7736): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2932, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=7736, selfPID=13976, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
19:47:56 (13976): called boinc_finish(0)
</stderr_txt>
<message>
upload failure:
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_1.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_2.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_3.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_4.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_5.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_6.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_7.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_8.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_9.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_10.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_11.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_12.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_13.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_14.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_15.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_16.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_17.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_18.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_19.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_20.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_21.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_22.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_23.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_24.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_restart.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
</message>
]]> |
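A stderr dump like the one above repeats the same upload error for every expected zip, so it can be easier to read as a tally. The sketch below is a hypothetical helper (`summarize_xfer_errors` is not part of BOINC) that assumes the `<file_xfer_error>` layout shown in this post and counts the failures per error code.

```python
import re
from collections import Counter

# Matches one <file_xfer_error> entry in the layout seen in the post above.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>.*?)\)</error_code>\s*"
    r"</file_xfer_error>"
)


def summarize_xfer_errors(stderr_text):
    """Count upload failures by (error code, reason) and collect the file names."""
    counts = Counter()
    names = []
    for match in XFER_ERROR.finditer(stderr_text):
        counts[(int(match["code"]), match["reason"])] += 1
        names.append(match["name"])
    return counts, names
```

Pasting the task's stderr text into `summarize_xfer_errors` gives a count per error code plus the list of affected files, which makes it quick to see whether every output file failed the same way.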
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Do you want to see this? It failed pretty fast on a Windows 10 box." That would be the swapping from Global to regional models at the end of the first model day. Sadly I don't think data from crunchers' machines is likely to help isolate what is happening there. It is proving difficult enough to track down on in-house machines where there is access to the code. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
I have the model compiled under Linux and am currently debugging what's going on. Thanks for the offer but am well past that point. "Do you want to see this? It failed pretty fast on a Windows 10 box." "That would be the swapping from Global to regional models at the end of the first model day. Sadly I don't think data from crunchers' machines is likely to help isolate what is happening there. It is proving difficult enough to track down on in-house machines where there is access to the code." It's restarting the model from a shutdown that risks the model failing like this. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
"Two _1 tasks running here. I have the same tasks running both under WINE and in Windows in a VM, which hopefully will enable some comparisons to be made between the output files. Network activity is turned off for the WINE install of BOINC. In fact network activity is off for both, so the zips on the Windows install don't upload before I get a chance to look at the files." What do you mean by 'invalid'? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Two _1 tasks running here. I have the same tasks running both under WINE and in Windows in a VM, which hopefully will enable some comparisons to be made between the output files. Network activity is turned off for the WINE install of BOINC. In fact network activity is off for both, so the zips on the Windows install don't upload before I get a chance to look at the files." "What do you mean by 'invalid'?" Not invalid as in rejected by the software, but invalid as in useless for the science. |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
"It's restarting the model from a shutdown that risks the model failing like this." None of my "two minute crashes" have been the result of a restart after a shutdown. |
©2024 cpdn.org