Thread 'Batch 996 Weather@Home2 East Asia25'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,008,987
RAC: 21,524
Message 69874 - Posted: 15 Oct 2023, 19:27:12 UTC - in response to Message 69872.  

Out.zip files are part of the validity check on task results. Trickle up files are smaller still.
ID: 69874
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,008,987
RAC: 21,524
Message 69875 - Posted: 15 Oct 2023, 19:31:21 UTC - in response to Message 69870.  

If you can, please wait till after the CPDN meeting on Monday. Glenn has asked the researcher to check whether there is any DDoS protection that might be causing the problems with so many computers from around the globe sending files at once.
ID: 69875
ChelseaOilman
Joined: 24 Dec 19
Posts: 32
Credit: 40,991,754
RAC: 77,248
Message 69876 - Posted: 15 Oct 2023, 19:40:57 UTC

I'll leave the files waiting to upload as long as needed. With 8 computers running this project right now I'm accumulating quite a few tasks that won't upload.
ID: 69876
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69877 - Posted: 15 Oct 2023, 20:00:46 UTC - in response to Message 69870.  

I've tried upload speeds of 30 100 and 150 and it hasn't helped. I'm probably going to abort the transfer and limit my project participation to no more than 1 work unit at a time so I don't waste so much computer time on these in the future. I had 8 going this last week, 6 couldn't handle a reboot and the 2 that finished won't upload. Waste of time and energy at this point, it's just too fragile. That's sad because I think this is one of the more worthwhile projects out there and up until now I had given CPDN the absolute highest priority on my computers.

I appreciate this sentiment, it's annoying me as well. My 'restart failure rate' is ~1 in 8 tasks per restart, which is poor. But to be fair to CPDN, WaH is only flakier for this batch because the project scientist asked to model a larger region than normal at 25km scale, which is pushing the model too far essentially. Previous WaH batches have been much more stable. We are working on fixing this.
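To put the ~1 in 8 figure in context, here is a rough sketch of how it compounds over several restarts. It assumes every restart is an independent event with the same failure chance, which is a simplification, and the function name is made up for illustration.

```python
def survival_probability(failure_rate_per_restart: float, restarts: int) -> float:
    """Chance that a task survives all `restarts` restarts.

    Assumption: restarts are independent and share one failure rate.
    """
    if not 0.0 <= failure_rate_per_restart <= 1.0:
        raise ValueError("failure rate must be between 0 and 1")
    if restarts < 0:
        raise ValueError("restarts must not be negative")
    return (1.0 - failure_rate_per_restart) ** restarts


if __name__ == "__main__":
    # 1/8 is the approximate rate quoted above; the restart counts are arbitrary.
    for n in (1, 2, 4, 8):
        p = survival_probability(1 / 8, n)
        print(f"{n} restart(s): {p:.1%} chance the task is still running")
```

Under those assumptions, the fewer times a machine is restarted during a task, the better its odds.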

Regarding uploads, keep them if you can. If you don't, it'll invalidate that entire task's data. Data uploads will be discussed in the meeting next week. From my own experience, I've had transfers 'stuck' for days with ~50 retries but they do eventually get through. The IT guy in Korea is keen to find out why people are having problems. So again, it's getting looked at. Korea are a new server for CPDN.
---
CPDN Visiting Scientist
ID: 69877
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,978,383
RAC: 14,247
Message 69881 - Posted: 15 Oct 2023, 22:19:15 UTC

Some of my recent uploads:

15/10/2023 14:26:10 | climateprediction.net | Finished upload of wah2_eas25_a1rl_199612_24_996_012225837_0_r1703371973_13.zip (99338471 bytes)
15/10/2023 14:26:10 | climateprediction.net | [file_xfer] Throughput 181915 bytes/sec

15/10/2023 14:56:24 | climateprediction.net | Finished upload of wah2_eas25_a3is_200712_24_996_012228112_0_r685833828_13.zip (98900336 bytes)
15/10/2023 14:56:24 | climateprediction.net | [file_xfer] Throughput 198808 bytes/sec

15/10/2023 17:38:25 | climateprediction.net | Finished upload of wah2_eas25_a0q1_198912_24_996_012224485_2_r113967697_2.zip (99007240 bytes)
15/10/2023 17:38:25 | climateprediction.net | [file_xfer] Throughput 179704 bytes/sec

which have all gone through OK.
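
For anyone who wants to turn log lines like these into rough upload times, here is a minimal sketch. It assumes the reported throughput held for the whole file, which may not be exactly how the client measures it, and the helper name is made up for illustration.

```python
import re

# Two of the entries quoted above, copied from the event log.
SAMPLE_LOG = """\
15/10/2023 14:26:10 | climateprediction.net | Finished upload of wah2_eas25_a1rl_199612_24_996_012225837_0_r1703371973_13.zip (99338471 bytes)
15/10/2023 14:26:10 | climateprediction.net | [file_xfer] Throughput 181915 bytes/sec
15/10/2023 14:56:24 | climateprediction.net | Finished upload of wah2_eas25_a3is_200712_24_996_012228112_0_r685833828_13.zip (98900336 bytes)
15/10/2023 14:56:24 | climateprediction.net | [file_xfer] Throughput 198808 bytes/sec
"""

FINISHED = re.compile(r"Finished upload of (\S+) \((\d+) bytes\)")
THROUGHPUT = re.compile(r"Throughput (\d+) bytes/sec")


def upload_durations(log_text):
    """Pair each 'Finished upload' line with the throughput line after it.

    Returns a list of (file_name, size_bytes, bytes_per_sec, est_seconds).
    """
    results = []
    pending = None  # (file_name, size_bytes) waiting for its throughput line
    for line in log_text.splitlines():
        finished = FINISHED.search(line)
        if finished:
            pending = (finished.group(1), int(finished.group(2)))
            continue
        throughput = THROUGHPUT.search(line)
        if throughput and pending:
            name, size = pending
            rate = int(throughput.group(1))
            results.append((name, size, rate, size / rate))
            pending = None
    return results


if __name__ == "__main__":
    for name, size, rate, seconds in upload_durations(SAMPLE_LOG):
        print(f"{name}: {size} bytes at {rate} bytes/sec, about {seconds / 60:.1f} min")
```

On that reading each of these zips took somewhere around nine minutes to send.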
ID: 69881
zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 69884 - Posted: 16 Oct 2023, 2:20:29 UTC - in response to Message 69877.  
Last modified: 16 Oct 2023, 2:22:28 UTC

The IT guy in Korea is keen to find out why people are having problems. So again, it's getting looked at. Korea are a new server for CPDN.


Do any of the back end guys not run the projects on their own home machines? Shouldn't they be seeing the same thing?

FWIW, I am up to 20 pending transfers, with the largest being 50 attempts. I tried setting each of the machines to 100k upload speed, with no luck. But I am not sure that helps, since all my machines are coming from my single IP address. So even if the speed is limited per machine, the speed per IP address can be larger if there are multiple attempts at the same time.
ID: 69884
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,008,987
RAC: 21,524
Message 69887 - Posted: 16 Oct 2023, 5:49:27 UTC

Do any of the back end guys not run the projects on their own home machines? Shouldn't they be seeing the same thing?
Not sure how relevant that is. All of mine have been going through without problems. Unless they have several machines like you, they may be running tasks but, like me, having everything go through. Hence Glenn's request to find out about anti-DDoS measures, either at the server itself or in a gateway machine at the data centre they use.
ID: 69887
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69891 - Posted: 16 Oct 2023, 8:43:47 UTC - in response to Message 69884.  
Last modified: 16 Oct 2023, 8:44:46 UTC

Yes, I do run on my own machines (3 of them for Windows) and I am seeing the same thing: tasks failing & uploads stalling. I've learnt over the years to leave BOINC alone and not try to force transfers. So far, although I've had retries up to 50, all my transfers have eventually gone through. However, I'm probably not running as many tasks as some. I don't think limiting the upload speed helps, as mine never went through much faster than 100kbps anyway.

People are looking into it, so far nothing obvious. The server is up, running fine. Uploads are coming in constantly. No disks filling up. It's under investigation at the Korean site. Any more news and I'll report it here.

---
CPDN Visiting Scientist
ID: 69891
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69893 - Posted: 16 Oct 2023, 10:09:27 UTC

Update from this morning's CPDN technical meeting: we'll look at increasing the max no. of concurrent connections to the Korean server. It's thought this is primarily why uploads are stalling. Otherwise the server is working normally. A query has been lodged with IT support at the Korean site regarding any DDoS protection that might also be playing a role.
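
As a rough illustration of why a cap on concurrent connections could show up as stalled uploads, here is a toy model. The numbers and the function name are hypothetical, the server's real limit has not been stated here, and a real server may queue rather than refuse connections.

```python
def count_rejected(start_times, duration, max_connections):
    """Count upload attempts refused because the server is already full.

    start_times: when each attempt begins, in seconds
    duration: how long an accepted upload holds its connection, in seconds
    max_connections: hypothetical cap on simultaneous uploads
    """
    active_ends = []  # finish times of uploads currently holding a slot
    rejected = 0
    for start in sorted(start_times):
        # Free the slots of uploads that finished before this attempt.
        active_ends = [end for end in active_ends if end > start]
        if len(active_ends) >= max_connections:
            rejected += 1
        else:
            active_ends.append(start + duration)
    return rejected


if __name__ == "__main__":
    # 200 attempts spread evenly over an hour, each holding a slot for 9 minutes.
    attempts = [i * 18 for i in range(200)]
    for cap in (10, 25, 50):
        print(f"cap {cap}: {count_rejected(attempts, 540, cap)} of 200 refused")
```

In this toy setup, raising the cap reduces the number of refused attempts, which is the effect being hoped for.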
---
CPDN Visiting Scientist
ID: 69893
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 69898 - Posted: 16 Oct 2023, 15:51:02 UTC - in response to Message 69893.  

While increasing the number of connections should help by reducing the number of "first time stalls", there appears to be an issue that's leading to tasks with high re-try counts. The symptom is that once a task reaches a certain number of re-tries it becomes increasingly probable that it will fail on its next attempt. So, thinking out loud here: is the time-out before declaring a failure too short for the "find this zip" time?
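
For reference, this is a generic sketch of randomized exponential backoff, the usual way a client spaces out retries. It does not answer the time-out question, and the base, cap and function name are made-up values, not BOINC's actual settings.

```python
import random


def backoff_delay(attempt, base=60.0, cap=6 * 3600.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    drawn at random from the upper half of that range so that many clients
    do not all retry at the same moment.
    """
    if attempt < 1:
        raise ValueError("attempt must be 1 or greater")
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(ceiling / 2, ceiling)


if __name__ == "__main__":
    for attempt in (1, 2, 5, 10, 50):
        print(f"attempt {attempt}: wait about {backoff_delay(attempt) / 60:.1f} min")
```

With a scheme like this, a file with a high re-try count only gets a few chances per day, so it can look stuck even when the server is accepting uploads.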
ID: 69898
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,008,987
RAC: 21,524
Message 69900 - Posted: 16 Oct 2023, 17:34:59 UTC - in response to Message 69898.  

No joy from my logs here, as I still have fewer than 5 zips that have failed to go through first time and nothing for several days that has been delayed. 5 tasks completed so far, one waiting to report, two more due to complete in the next 24 hours.
ID: 69900
ChelseaOilman
Joined: 24 Dec 19
Posts: 32
Credit: 40,991,754
RAC: 77,248
Message 69901 - Posted: 16 Oct 2023, 18:00:05 UTC

Here's my situation:
6 computers running tasks currently
95 tasks running
98 tasks queued
In excess of 100 files waiting to be uploaded

I'm retired and at my summer home at 9,000 feet elevation in the Colorado Rockies. It's already snowing here. In a couple weeks I'll be packing up all but 1 computer running CPDN tasks and will be heading to my much warmer Texas winter home.
The computer being left behind is a Threadripper 2990WX. It's running 33 tasks that won't finish before I leave but with a 1 year deadline on the tasks I should be able to continue on them when I get back to Colorado next June. The tasks have been running 10 days but are only about 33% done. This system is running 33 tasks and has 14 uploads waiting.

If the Korean server has banned my IP address due to too many uploads, please ask them to unban me.

Dave and Glenn, we're on the same UK BOINC Team, please help out a fellow teammate! LOL
ID: 69901
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,008,987
RAC: 21,524
Message 69902 - Posted: 16 Oct 2023, 18:14:31 UTC - in response to Message 69901.  

They really need to shorten the deadlines. They are a hangover from when tasks could take six months or longer.

While the deadline may be one year, more often than not, that is too late for the scientists to use.
ID: 69902
ChelseaOilman
Joined: 24 Dec 19
Posts: 32
Credit: 40,991,754
RAC: 77,248
Message 69903 - Posted: 16 Oct 2023, 19:17:42 UTC - in response to Message 69902.  
Last modified: 16 Oct 2023, 19:24:14 UTC

They really need to shorten the deadlines. They are a hangover from when tasks could take six months or longer.

While the deadline may be one year, more often than not, that is too late for the scientists to use.

I agree 100% !!!!!!!!!!
All that does is encourage task hoarding since tasks are few and rarely available.
That's my biggest gripe about this project.
Why is a deadline of more than 2 months necessary?
Preferably even shorter.

I'm debating taking the Threadripper with me to Texas. Normally I wouldn't. I retire my older and slower systems to my Colorado summer home. There's also a 2P Xeon server here in CO that stays here. It runs Linux and didn't seem to be able to get any CPDN tasks. At least any tasks that didn't fail.
ID: 69903
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69904 - Posted: 16 Oct 2023, 21:11:57 UTC - in response to Message 69903.  
Last modified: 16 Oct 2023, 21:13:51 UTC

It's been said elsewhere several times the long deadline of 1yr is a hangover from when CPDN started and they were running long multi-year simulations which ran for months. In practice as Dave says the real deadline for when the scientists start work on the data is ~6-8 weeks. The deadline is going to be shortened for future batches.

Increasing number of connections should help. The server was set to the default value and hadn't been increased for this batch. I really hope we can get all those results through!
ID: 69904
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 69905 - Posted: 16 Oct 2023, 21:43:12 UTC - in response to Message 69903.  

There's also a 2P Xeon server here in CO that stays here. It runs Linux and didn't seem to be able to get any CPDN tasks. At least any tasks that didn't fail.


My main machine runs Linux essentially 24/7. The most recent CPDN task I got was

Task 22318648
Work unit 12138603
Sent 30 May 2023, 3:38:46 UTC
Reported 9 Jun 2023, 1:20:39 UTC
Status Completed
Run time (sec) 852,578.34
CPU time (sec) 843,274.30
Credit 33,854.34
Application UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu


computer 1511241
Task 22318648
Name hadam4h_a015_200011_5_931_012138603_1
up 46 days, 1 min

So you probably would not get any tasks either after that. They are very few and far between.
ID: 69905
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 69906 - Posted: 16 Oct 2023, 22:08:38 UTC - in response to Message 69904.  

It's been said elsewhere several times the long deadline of 1yr is a hangover from when CPDN started and they were running long multi-year simulations which ran for months. In practice as Dave says the real deadline for when the scientists start work on the data is ~6-8 weeks. The deadline is going to be shortened for future batches.

Increasing number of connections should help. The server was set to the default value and hadn't been increased for this batch. I really hope we can get all those results through!


There was also another issue in which those long-running CPDN tasks would dominate processing to the exclusion of other projects. The long deadlines allowed those users who wanted to run a variety of projects to have completions on all the projects. Shortening the deadlines may bring that competition back, unless BOINC has got better at sharing the resources.

(Not a problem for users such as myself who only run CPDN except during gaps.)
ID: 69906
Iceberg
Joined: 28 Dec 17
Posts: 18
Credit: 1,097,261
RAC: 147
Message 69907 - Posted: 16 Oct 2023, 22:25:46 UTC

Hi all. Trying to upload one stubborn zip file, but no success. Is their server down or just overloaded?

Thanks.

16-Oct-2023 17:20:37 [climateprediction.net] Started upload of wah2_eas25_a244_199812_24_996_012226288_0_r479624848_16.zip
16-Oct-2023 17:20:59 [---] Project communication failed: attempting access to reference site
16-Oct-2023 17:20:59 [climateprediction.net] Temporarily failed upload of wah2_eas25_a244_199812_24_996_012226288_0_r479624848_16.zip: transient HTTP error
16-Oct-2023 17:20:59 [climateprediction.net] Backing off 05:59:17 on upload of wah2_eas25_a244_199812_24_996_012226288_0_r479624848_16.zip
16-Oct-2023 17:21:00 [---] Internet access OK - project servers may be temporarily down.
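
In case it helps anyone reading their own event log, here is a small sketch that pulls the back-off time out of a line like the one above and converts it to seconds. The regular expression is based only on the format shown in this log, and the helper name is made up.

```python
import re

BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def backoff_seconds(line):
    """Return (file_name, seconds) for a 'Backing off' line, otherwise None."""
    match = BACKOFF.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
    return match.group(4), hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    line = (
        "16-Oct-2023 17:20:59 [climateprediction.net] Backing off 05:59:17 "
        "on upload of wah2_eas25_a244_199812_24_996_012226288_0_r479624848_16.zip"
    )
    print(backoff_seconds(line))
```

A back-off of almost six hours means the client will not try this file again on its own until late in the evening, local log time.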
ID: 69907
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,703,308
RAC: 9,860
Message 69908 - Posted: 17 Oct 2023, 7:25:28 UTC - in response to Message 69906.  

unless BOINC has got better at sharing the resources.
I don't think BOINC has got better, but there are now more resources to share. Multi-CPU machines were a rarity in those early days, but are more commonplace now.
ID: 69908
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,336,682
RAC: 10,407
Message 69909 - Posted: 17 Oct 2023, 8:37:58 UTC - in response to Message 69907.  

Hi all. Trying to upload one stubborn zip file, but no success. Is their server down or just overloaded?
From 11 WAH tasks, we had ten uploads waiting yesterday afternoon and five waiting today. Looking in the event log, there are a lot of upload starts and temporary upload failures, interspersed with a few groups of uploads finishing. We have found that it is best to be patient with BOINC and CPDN. Unless told otherwise by those that know better, I'd let BOINC do its own thing with a stubborn upload.
ID: 69909
©2024 cpdn.org