Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
Out.zip files are part of the validity check on task results. Trickle up files are smaller still. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
If you can, please wait till after the CPDN meeting on Monday. Glen has asked the researcher to check whether there is any DDoS protection that might be causing the problems with so many computers from around the globe sending files at once. |
Send message Joined: 24 Dec 19 Posts: 32 Credit: 40,991,754 RAC: 77,248 |
I'll leave the files waiting to upload as long as needed. With 8 computers running this project right now I'm accumulating quite a few tasks that won't upload. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I've tried upload speeds of 30, 100 and 150 and it hasn't helped. I'm probably going to abort the transfer and limit my project participation to no more than 1 work unit at a time so I don't waste so much computer time on these in the future. I had 8 going this last week, 6 couldn't handle a reboot and the 2 that finished won't upload. Waste of time and energy at this point, it's just too fragile. That's sad because I think this is one of the more worthwhile projects out there and up until now I had given CPDN the absolute highest priority on my computers." I appreciate this sentiment; it's annoying me as well. My 'restart failure rate' is ~1 in 8 tasks per restart, which is poor. But to be fair to CPDN, WaH is only flakier for this batch because the project scientist asked to model a larger region than normal at 25km scale, which is essentially pushing the model too far. Previous WaH batches have been much more stable. We are working on fixing this. Regarding uploads, keep them if you can. If you don't, it'll invalidate that entire task's data. Data uploads will be discussed in the meeting next week. From my own experience, I've had transfers 'stuck' for days with ~50 retries but they do eventually get through. The IT guy in Korea is keen to find out why people are having problems. So again, it's getting looked at. The Korean server is new for CPDN. --- CPDN Visiting Scientist |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,978,383 RAC: 14,247 |
Some of my recent uploads: 15/10/2023 14:26:10 | climateprediction.net | Finished upload of wah2_eas25_a1rl_199612_24_996_012225837_0_r1703371973_13.zip (99338471 bytes) 15/10/2023 14:26:10 | climateprediction.net | [file_xfer] Throughput 181915 bytes/sec 15/10/2023 14:56:24 | climateprediction.net | Finished upload of wah2_eas25_a3is_200712_24_996_012228112_0_r685833828_13.zip (98900336 bytes) 15/10/2023 14:56:24 | climateprediction.net | [file_xfer] Throughput 198808 bytes/sec 15/10/2023 17:38:25 | climateprediction.net | Finished upload of wah2_eas25_a0q1_198912_24_996_012224485_2_r113967697_2.zip (99007240 bytes) 15/10/2023 17:38:25 | climateprediction.net | [file_xfer] Throughput 179704 bytes/sec which have all gone through OK. |
Send message Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
"The IT guy in Korea is keen to find out why people are having problems. So again, it's getting looked at. The Korean server is new for CPDN." Do any of the back end guys not run the projects on their own home machines? Shouldn't they be seeing the same thing? FWIW, I am up to 20 pending transfers, with the largest being at 50 attempts. I tried setting each of the machines to 100k upload speed, with no luck. But I am not sure that helps, since all my machines are coming from my single IP address. So even if the speed is limited per machine, the speed per IP address can be larger if there are multiple attempts at the same time. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
"Do any of the back end guys not run the projects on their own home machines? Shouldn't they be seeing the same thing?" Not sure how relevant that is. All of mine have been going through without problems. Unless they have several machines like you, they may be running tasks but, like me, having everything go through. Hence Glen's request to find out about anti-DDoS measures either at the server itself or in a gateway machine at the data centre they use. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Yes, I do run on my own machines (3 of them for Windows) and I am seeing the same thing: tasks failing and uploads stalling. I've learnt over the years to leave BOINC alone and not try to force transfers. So far, although I've had retries up to 50, all my transfers have eventually gone through. However, I'm probably not running as many tasks as some. I don't think limiting the upload speed helps, as mine never went through much faster than 100kbps anyway. People are looking into it; so far nothing obvious. The server is up and running fine. Uploads are coming in constantly. No disks filling up. It's under investigation at the Korean site. Any more news and I'll report it here. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Update from this morning's CPDN technical meeting: we'll look at increasing the max no. of concurrent connections to the Korean server. It's thought this is primarily why uploads are stalling. Otherwise the server is working normally. A query has been lodged with IT support at the Korean site regarding any DDoS protection that might also be playing a role. --- CPDN Visiting Scientist |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
While increasing the number of connections should help by reducing the number of "first time stalls", there appears to be an issue that's leading to tasks with high retry counts. The symptom is that once a task reaches a certain number of retries it becomes increasingly probable that it will fail on its next attempt. So, thinking out loud here: is the time-out before declaring a failure too short for the "find this zip" time? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
No joy from my logs here, as I still have fewer than 5 zips that have failed to go through first time and nothing for several days that has been delayed. 5 tasks completed so far, one waiting to report, two more due to complete in the next 24 hours. |
Send message Joined: 24 Dec 19 Posts: 32 Credit: 40,991,754 RAC: 77,248 |
Here's my situation: 6 computers currently running tasks; 95 tasks running; 98 tasks queued; in excess of 100 uploads waiting. I'm retired and at my summer home at 9,000 feet elevation in the Colorado Rockies. It's already snowing here. In a couple weeks I'll be packing up all but 1 computer running CPDN tasks and will be heading to my much warmer Texas winter home. The computer being left behind is a Threadripper 2990WX. It's running 33 tasks that won't finish before I leave, but with a 1 year deadline on the tasks I should be able to continue on them when I get back to Colorado next June. The tasks have been running 10 days but are only about 33% done. This system has 14 uploads waiting. If the Korean server has banned my IP address due to too many uploads, please ask them to unban me. Dave and Glenn, we're on the same UK BOINC Team, please help out a fellow teammate! LOL |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
They really need to shorten the deadlines. They are a hangover from when tasks could take six months or longer. While the deadline may be one year, more often than not, that is too late for the scientists to use. |
Send message Joined: 24 Dec 19 Posts: 32 Credit: 40,991,754 RAC: 77,248 |
"They really need to shorten the deadlines. They are a hangover from when tasks could take six months or longer." I agree 100% !!!!!!!!!! All that does is encourage task hoarding since tasks are few and rarely available. That's my biggest gripe about this project. Why is a deadline of more than 2 months necessary? Preferably even shorter. I'm debating taking the Threadripper with me to Texas. Normally I wouldn't. I retire my older and slower systems to my Colorado summer home. There's also a 2P Xeon server here in CO that stays here. It runs Linux and didn't seem to be able to get any CPDN tasks. At least any tasks that didn't fail. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
It's been said elsewhere several times that the long deadline of 1yr is a hangover from when CPDN started and they were running long multi-year simulations which ran for months. In practice, as Dave says, the real deadline for when the scientists start work on the data is ~6-8 weeks. The deadline is going to be shortened for future batches. Increasing the number of connections should help. The server was set to the default value and hadn't been increased for this batch. I really hope we can get all those results through! |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"There's also a 2P Xeon server here in CO that stays here. It runs Linux and didn't seem to be able to get any CPDN tasks. At least any tasks that didn't fail." My main machine runs Linux essentially 24/7. The most recent CPDN task I got was: 22318648 12138603 30 May 2023, 3:38:46 UTC 9 Jun 2023, 1:20:39 UTC Completed 852,578.34 843,274.30 33,854.34 UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu; computer 1511241; Task 22318648; Name hadam4h_a015_200011_5_931_012138603_1; up 46 days, 1 min. So you probably would not get any tasks either after that. They are very few and far between. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
"It's been said elsewhere several times the long deadline of 1yr is a hangover from when CPDN started and they were running long multi-year simulations which ran for months. In practice as Dave says the real deadline for when the scientists start work on the data is ~6-8 weeks. The deadline is going to be shortened for future batches." There was also another issue in which those long-running CPDN tasks would dominate processing to the exclusion of other projects. The long deadlines allowed those users who wanted to run a variety of projects to have completions on all the projects. Shortening the deadlines may bring that competition back, unless BOINC has got better at sharing the resources. (Not a problem for users such as myself who only run CPDN except during gaps.) |
Send message Joined: 28 Dec 17 Posts: 18 Credit: 1,097,261 RAC: 147 |
Hi all. Trying to upload one stubborn zip file, but no success. Is their server down or just overloaded? Thanks. 16-Oct-2023 17:20:37 [climateprediction.net] Started upload of wah2_eas25_a244_199812_24_996_012226288_0_r479624848_16.zip 16-Oct-2023 17:20:59 [---] Project communication failed: attempting access to reference site 16-Oct-2023 17:20:59 [climateprediction.net] Temporarily failed upload of wah2_eas25_a244_199812_24_996_012226288_0_r479624848_16.zip: transient HTTP error 16-Oct-2023 17:20:59 [climateprediction.net] Backing off 05:59:17 on upload of wah2_eas25_a244_199812_24_996_012226288_0_r479624848_16.zip 16-Oct-2023 17:21:00 [---] Internet access OK - project servers may be temporarily down. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
"unless BOINC has got better at sharing the resources." I don't think BOINC has got better, but there are now more resources to share. Multi-CPU machines were a rarity in those early days, but are more commonplace now. |
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,335,853 RAC: 10,368 |
"Hi all. Trying to upload one stubborn zip file, but no success. Is their server down or just overloaded?" From 11 WAH tasks, we had ten uploads waiting yesterday afternoon and five waiting today. Looking in the event log, there are a lot of upload starts and temporary upload failures, interspersed with a few groups of uploads finishing. We have found that it is best to be patient with BOINC and CPDN. Unless told otherwise by those that know better, I'd let BOINC do its own thing with a stubborn upload. |
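
For anyone who wants to sanity-check the figures in their own event log, here is a minimal Python sketch. It is not part of BOINC or CPDN, and the names `upload_minutes` and `summarize` are made up for illustration. It assumes log lines in the format quoted earlier in this thread, with each "Finished upload" line followed by its own `[file_xfer] Throughput` line, and it simply divides each zip's size by the reported throughput.

```python
import re

# Patterns for the two kinds of event-log line quoted in the thread.
UPLOAD_RE = re.compile(r"Finished upload of (\S+) \((\d+) bytes\)")
THROUGHPUT_RE = re.compile(r"\[file_xfer\] Throughput (\d+) bytes/sec")

# Log lines copied from the thread (one entry per line).
LOG = """\
15/10/2023 14:26:10 | climateprediction.net | Finished upload of wah2_eas25_a1rl_199612_24_996_012225837_0_r1703371973_13.zip (99338471 bytes)
15/10/2023 14:26:10 | climateprediction.net | [file_xfer] Throughput 181915 bytes/sec
15/10/2023 14:56:24 | climateprediction.net | Finished upload of wah2_eas25_a3is_200712_24_996_012228112_0_r685833828_13.zip (98900336 bytes)
15/10/2023 14:56:24 | climateprediction.net | [file_xfer] Throughput 198808 bytes/sec
15/10/2023 17:38:25 | climateprediction.net | Finished upload of wah2_eas25_a0q1_198912_24_996_012224485_2_r113967697_2.zip (99007240 bytes)
15/10/2023 17:38:25 | climateprediction.net | [file_xfer] Throughput 179704 bytes/sec
"""


def upload_minutes(size_bytes, bytes_per_sec):
    """Transfer time in minutes for a file of the given size at a steady rate."""
    return size_bytes / bytes_per_sec / 60.0


def summarize(log_text):
    """Pair each finished upload with the throughput reported after it.

    Assumption: uploads and throughput lines appear in matching order,
    as they do in the excerpt above.
    """
    uploads = [(m.group(1), int(m.group(2))) for m in UPLOAD_RE.finditer(log_text)]
    rates = [int(m.group(1)) for m in THROUGHPUT_RE.finditer(log_text)]
    return [
        (name, size, rate, upload_minutes(size, rate))
        for (name, size), rate in zip(uploads, rates)
    ]


if __name__ == "__main__":
    for name, size, rate, minutes in summarize(LOG):
        print(f"{name}: {size / 1e6:.1f} MB at {rate / 1e3:.0f} kB/s = {minutes:.1f} min")
```

On the three uploads quoted above this works out to roughly 8 to 9 minutes per zip, so each successful upload holds a connection to the server for several minutes at these speeds.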
©2024 cpdn.org