Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Get them while they're hot, folks. There are a lot of hungry computers out there. Server status shows them down to about 650. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
"Server status shows them down to about 650." Now only showing 8. I only know there were three batches because of the batch numbers on those I downloaded. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
And now another 4 thousand or so showing on the server status page. Still a case of grab 'em while you can, I think. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
Here are the batches since the restart:

Batch | Date | App. | Reg. | Res. | Period | Name (1st) | Size | Comment |
---|---|---|---|---|---|---|---|---|
570 | 16-May-17 | WAH2 8.24 | EU-r | 50 km | 3 month | wah2_eu50r_mw10_20174_3_570_011020987 | 3600 | |
569 | 16-May-17 | WAH2 8.24 | EU-r | 50 km | 3 month | wah2_eu50r_myt0_20174_3_569_011013787 | 7200 | |
568 | 15-May-17 | WAH2 8.24 | EAS | 50 km | 12 month | wah2_eas50_g000_201212_12_568_011012047 | 1740 | WAH2 East Asia 2013 GHG, 15 May 2017, WAH2 Attribution run over East Asia 2013: restarts from 13 out of 78 ensembles of batch 538 (1740 simulations). |
567 | 15-May-17 | WAH2 8.24 | EAS | 50 km | 12 month | wah2_eas50_n000_201212_12_567_011010307 | 1740 | WAH2 East Asia 2013 NAT, 15 May 2017, WAH2 Attribution run over East Asia 2013: restarts from 13 out of 78 ensembles of batch 537 (1740 simulations). |
566 | 15-May-17 | WAH2 8.24 | EAS | 50 km | 12 month | wah2_eas50_a000_201212_12_566_011008567 | 1740 | WAH2 East Asia 2013 ACT, 15 May 2017, WAH2 Attribution run over East Asia 2013: restarts from 13 out of 78 ensembles of batch 536 (1740 simulations). |
565 | 03-May-17 | WAH2 8.24 | PNW | 25 km | 21 month | wah2_pnw25_h000_198312_21_565_011008381 | 186 | Generate historic pnw restarts, 3 May 2017, Generate historic pnw restarts (186 simulations). |

[Edit: Batch #570 added.] |
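For anyone wanting to sort their own task lists by batch, the names in the listing above appear to follow a fixed pattern: app, region plus resolution, a run ID, a start date, the run length in months, the batch number, and the workunit number, with an optional trailing result index on individual tasks. Here is a rough Python sketch that splits a name along those lines. The field names (`run_id`, `region_suffix`, and so on) are my own guesses from comparing names with the table columns, not anything the project documents.

```python
import re
from typing import NamedTuple, Optional


class TaskName(NamedTuple):
    """Fields guessed from the batch listing; the names are hypothetical."""
    app: str
    region: str
    resolution_km: int
    region_suffix: str   # e.g. the "r" in "eu50r" (listed as EU-r)
    run_id: str
    start: str           # kept as text; the date format is not confirmed
    months: int
    batch: int
    workunit: int
    result_index: Optional[int]


# Assumed layout: app_regionRES[suffix]_runid_start_months_batch_workunit[_result]
_TASK_RE = re.compile(
    r"^(?P<app>[a-z0-9]+)"
    r"_(?P<region>[a-z]+)(?P<res>\d+)(?P<suffix>[a-z]*)"
    r"_(?P<run_id>[a-z0-9]+)"
    r"_(?P<start>\d+)"
    r"_(?P<months>\d+)"
    r"_(?P<batch>\d+)"
    r"_(?P<workunit>\d+)"
    r"(?:_(?P<result>\d+))?$"
)


def parse_task_name(name: str) -> Optional[TaskName]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    m = _TASK_RE.match(name.strip().lower())
    if m is None:
        return None
    result = m.group("result")
    return TaskName(
        app=m.group("app"),
        region=m.group("region"),
        resolution_km=int(m.group("res")),
        region_suffix=m.group("suffix"),
        run_id=m.group("run_id"),
        start=m.group("start"),
        months=int(m.group("months")),
        batch=int(m.group("batch")),
        workunit=int(m.group("workunit")),
        result_index=int(result) if result is not None else None,
    )


if __name__ == "__main__":
    print(parse_task_name("wah2_eas50_g000_201212_12_568_011012047"))
```

The start field is left as a string on purpose, since values like `20174` do not obviously fit a single date format.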
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
There are 3 batches of 2000 CAM50/12 in the queue (batch list thread). |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I got two pairs of these. All failed quickly. Here is one of them:

Name: wah2_cam50_ixzu_193012_12_571_011025877_1
Workunit: 11025877
Created: 19 May 2017, 15:37:01 UTC
Sent: 19 May 2017, 15:37:52 UTC
Report deadline: 1 May 2018, 20:57:52 UTC
Received: 19 May 2017, 17:11:05 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 0 (0x0)
Computer ID: 1256552
Run time: 8 min 30 sec
CPU time: 7 min 4 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 1.28 GFLOPS
Application version: Weather At Home 2 (wah2) v8.25 i686-pc-linux-gnu

stderr out:

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
[0x55555400]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814442b]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
/lib/libc.so.6(__libc_start_main+0xe6)[0x352d26]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x804c7a1]
Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=10383, iMonCtr=2
Model crash detected, will try to restart...
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=10385, iMonCtr=2
Leaving CPDN_ain::Monitor...
Calling boinc_finish...12:51:10 (10383): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0
</stderr_txt>
<message>
upload failure:
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_1.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_4.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_5.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_6.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_7.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_8.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_9.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_10.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_11.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_12.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_13.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_cam50_ixzu_193012_12_571_011025877_1_r1944199386_14.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
</message>
]]> |
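If you have several failed tasks and want to compare them quickly, the stderr output above can be boiled down to two things: whether a SIGSEGV was reported, and how many upload failures there were per error code. A minimal Python sketch, assuming you have pasted the stderr text into a string; the function name `summarize_stderr` and the returned dictionary keys are made up for illustration and are not part of BOINC.

```python
import re
from collections import Counter
from typing import Any, Dict

# Matches one <file_xfer_error> entry as it appears in the task's stderr page.
_XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def summarize_stderr(stderr_text: str) -> Dict[str, Any]:
    """Report whether a segfault was logged and count upload errors by code."""
    codes = Counter(int(m.group("code")) for m in _XFER_RE.finditer(stderr_text))
    return {
        "segfault": "SIGSEGV" in stderr_text,
        "upload_errors_by_code": dict(codes),
        "upload_error_total": sum(codes.values()),
    }


if __name__ == "__main__":
    # Shortened sample in the same shape as the log above (two entries only).
    sample = (
        "SIGSEGV: segmentation violation "
        "<message> upload failure: "
        "<file_xfer_error> <file_name>task_1.zip</file_name> "
        "<error_code>-161</error_code> </file_xfer_error> "
        "<file_xfer_error> <file_name>task_2.zip</file_name> "
        "<error_code>-161</error_code> </file_xfer_error> "
        "</message>"
    )
    print(summarize_stderr(sample))
```

In the task shown here, all 14 upload failures carry the same code, -161, which fits a model that crashed before it had produced any of its result zips.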
Send message Joined: 7 Aug 04 Posts: 2186 Credit: 64,822,615 RAC: 5,275 |
Yep, lots and lots of SIGSEGV/segmentation-type errors on these tasks. I have informed the project people...but it's the weekend...so... |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Sarah from the project has found the problem with these tasks: OK, I have spotted an error with these. There was a small error in the header of the XML which meant that the wrong namelist was trying to be used and so the model will crash. I will close these batches and resend out correctly. The faulty tasks will also be withdrawn so that retreads will not be sent out for second and third tries for those that are sitting in the ether somewhere. As many will have observed, quite a few have already failed three times. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
All the models I had from batches 571-573 crashed after a very early checkpoint. The replacement batches seem to be running fine - or, at least, the 576 I've got running ... |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Another hour, I am guessing, for my 575 to reach its first checkpoint. Not sure whether these are the resends with the XML sorted or if they are brand new - I wasn't really expecting the fixed ones to come out till normal working hours here in the UK. |
Send message Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
atmos_restart_batch_537_eas50_n022_2012-12-01.gz Tue 23 May 10:06:30 2017 climateprediction.net Temporarily failed download of wah2_data_8.24_i686-apple-darwin.zip: HTTP error. I'm getting download failures with this task and can't find any references in other posts. Anyone else with this problem in current batches? |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Just got a 574 retread. It has failed on two machines already, one with a segfault, but only after 9 zips had uploaded. Will let it run and see what happens. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
There's a batch of 10,000 global models just been added. [Edit: 10,000 x 2.] |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Feels like the first time all my cores have been crunching for quite a while. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
Plus 660 x ANZ50 (batch list). |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
All my cores (one 4-core 64-bit Xeon) run all the time anyway because I run WCG, seti@home, and rosetta in addition to climateprediction. But since I have received no CP work units in so long, and since my boinc client tries to give 50% of the time to climateprediction, it is now running three of those as I am typing this. Each has already returned three trickles. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
Plus 10 x WUS25/120, 450 x SAM50/13, 432 x WUS25/25, and today 10 x PNW25/120 (batch list). |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,803,756 RAC: 5,187 |
2160 NAWA25/13 somehow came and went this morning (batch list). |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
And another 13K tasks in the hopper this morning. (wus25) |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Note: the wus tasks (batch 583) are quite long, about 7 times the estimated computation size of the global model tasks I have running currently. (On my machine, admittedly slow by modern standards, they are estimating about 17 days.) Edit: I see, looking at the batch list, these are some extras from an already released batch. Edit 2: even more of them; by 09.31 the number available had gone up to over 14,000. |