climateprediction.net (CPDN) home page
Thread 'The uploads are stuck'

Message boards : Number crunching : The uploads are stuck

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 43,926,387
RAC: 52,738
Message 68100 - Posted: 28 Jan 2023, 19:18:21 UTC - in response to Message 68060.  

Reminder to reset <ncpus> tag in cc_config.xml if you changed it

If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863.

It would save CPDN trawling through their database to find these hosts and contact their owners.

Thanks!
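For anyone unsure whether they left an override in place, here is a minimal Python sketch that checks the tag. It assumes the usual cc_config.xml layout (`<cc_config>` / `<options>` / `<ncpus>`); the function names are illustrative, and the XML is passed in as text because the file's location differs between installs.

```python
import xml.etree.ElementTree as ET


def read_ncpus(cc_config_xml: str) -> int:
    """Return the <ncpus> value from cc_config.xml text, or -1 if the tag is absent."""
    root = ET.fromstring(cc_config_xml)
    node = root.find("./options/ncpus")
    if node is None or node.text is None:
        # No override present: the client uses the detected core count.
        return -1
    return int(node.text.strip())


def needs_reset(cc_config_xml: str) -> bool:
    """True if <ncpus> is set to anything other than -1."""
    return read_ncpus(cc_config_xml) != -1


# Hypothetical file contents with an inflated core count.
example = """<cc_config>
  <options>
    <ncpus>128</ncpus>
  </options>
</cc_config>"""

print(needs_reset(example))
```

If this reports an override, set the tag back to `<ncpus>-1</ncpus>` and have the client re-read its config files (or restart it).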

This reminds me of some other project's trick. They grant additional credits if tasks are returned within X days, to disincentivize excessive hoarding. No matter how people fake it, whether it's a bogus core count or multiple clients per machine, they can't fake the actual compute throughput of the machine. Given that CPDN runs its credit-granting script once a week instead of continuously, it might even be possible to adjust for upload server downtime if necessary.
ID: 68100
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68101 - Posted: 28 Jan 2023, 20:34:13 UTC - in response to Message 68100.  

This reminds me of some other project's trick. They grant additional credits if tasks are returned within X days, to disincentivize excessive hoarding.


Back when my machine could always get work from CPDN and WCG (and Seti@home), I had at least 0.35 days of work and about 0.65 days additional work in my preferences. Now that Seti@home is gone, WCG is sort-of back up, and CPDN is erratic in work availability, my machine is set to at least 0.50 days of work and 1.5 days of additional work, and that 1.5 days is not really enough. When the upload server went down, I mostly let my machine keep crunching and I got around 20 completed tasks before they started uploading again. I do not think of that as hoarding. IIRC, some of the more recent "classical" CPDN work took around 8 days to complete a task, and in the distant past (and on slower machines) tasks could take several months.

But I do not think, with the new Oifs tasks, there is much point grabbing a month's supply because they would time-out before I could get around to them. I usually leave my machine up 24/7.
ID: 68101
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 68102 - Posted: 29 Jan 2023, 6:31:16 UTC - in response to Message 68101.  

This reminds me of some other project's trick. They grant additional credits if tasks are returned within X days, to disincentivize excessive hoarding.


Back when my machine could always get work from CPDN and WCG (and Seti@home), I had at least 0.35 days of work and about 0.65 days additional work in my preferences. Now that Seti@home is gone, WCG is sort-of back up, and CPDN is erratic in work availability, my machine is set to at least 0.50 days of work and 1.5 days of additional work, and that 1.5 days is not really enough. When the upload server went down, I mostly let my machine keep crunching and I got around 20 completed tasks before they started uploading again. I do not think of that as hoarding. IIRC, some of the more recent "classical" CPDN work took around 8 days to complete a task, and in the distant past (and on slower machines) tasks could take several months.

But I do not think, with the new Oifs tasks, there is much point grabbing a month's supply because they would time-out before I could get around to them. I usually leave my machine up 24/7.


Yup, I'm an old-timer here. Times change: the new models have mucho memory needs, but we don't have to do interim backups any more (yet). We did those because two decades ago models took months to complete, and we didn't want to waste a quarter of a workunit after a couple of weeks.
Naah, we'll adapt to the new.
And I've noticed the price decline and bought some ECC UDIMMs for my current hosts, to good effect.
ID: 68102
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 68118 - Posted: 30 Jan 2023, 12:19:27 UTC
Last modified: 30 Jan 2023, 13:09:44 UTC

Just started getting this. Last time it all worked normally a few minutes later so not contacting Andy just yet.

Mon 30 Jan 2023 12:17:25 GMT |  | Project communication failed: attempting access to reference site
Mon 30 Jan 2023 12:17:25 GMT | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0348_1989050100_123_958_12174992_1_r944602151_121.zip: transient HTTP error
Mon 30 Jan 2023 12:17:25 GMT | climateprediction.net | Backing off 00:02:15 on upload of oifs_43r3_ps_0348_1989050100_123_958_12174992_1_r944602151_121.zip
Mon 30 Jan 2023 12:17:25 GMT | climateprediction.net | Started upload of oifs_43r3_ps_0085_1992050100_123_961_12177729_2_r1618474980_1.zip
Mon 30 Jan 2023 12:17:26 GMT |  | Internet access OK - project servers may be temporarily down.
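
For anyone who wants to track how the backoffs grow over a long outage, here is a small Python sketch that pulls the file name and backoff length out of a "Backing off" line. It assumes the line layout shown in the log above; the regular expression and function name are illustrative, not part of BOINC.

```python
import re

# Matches e.g. "Backing off 00:02:15 on upload of <file>"
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def parse_backoff(line: str):
    """Return (filename, backoff_seconds) for a 'Backing off' line, else None."""
    m = BACKOFF_RE.search(line)
    if m is None:
        return None
    hours, minutes, seconds = (int(g) for g in m.groups()[:3])
    return m.group(4), hours * 3600 + minutes * 60 + seconds


line = (
    "Mon 30 Jan 2023 12:17:25 GMT | climateprediction.net | Backing off 00:02:15 "
    "on upload of oifs_43r3_ps_0348_1989050100_123_958_12174992_1_r944602151_121.zip"
)
print(parse_backoff(line))  # the backoff here is 135 seconds
```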


Edit: Now going through again.

Edit2: All uploads here now stuck again.

Edit3: If anyone confirms it isn't just me I will message Andy.
ID: 68118
cetus
Joined: 7 Aug 04
Posts: 10
Credit: 148,100,750
RAC: 29,951
Message 68119 - Posted: 30 Jan 2023, 13:32:14 UTC - in response to Message 68118.  

I have the same problem
ID: 68119
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 68120 - Posted: 30 Jan 2023, 13:48:08 UTC - in response to Message 68119.  

I have the same problem
I have now messaged Andy.
ID: 68120
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,818,575
RAC: 4,615
Message 68122 - Posted: 30 Jan 2023, 14:02:04 UTC
Last modified: 30 Jan 2023, 14:16:41 UTC

Started here around quarter past twelve:

30/01/2023 12:14:07 | climateprediction.net | Started upload of oifs_43r3_ps_0783_2001050100_123_970_12187427_1_r1449012272_86.zip
30/01/2023 12:20:00 | climateprediction.net | Backing off 00:02:04 on upload of oifs_43r3_ps_0783_2001050100_123_970_12187427_1_r1449012272_86.zip
I don't know why it waited 6 minutes before backing off - the timeout is usually two minutes. I'll look at http_debug if it persists.
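
The gap between those two log lines can be worked out directly. A minimal Python sketch, assuming the dd/mm/yyyy timestamp format shown in this particular log (other locales print it differently, so `fmt` is a parameter); the function name is illustrative.

```python
from datetime import datetime


def elapsed_seconds(start_line: str, end_line: str,
                    fmt: str = "%d/%m/%Y %H:%M:%S") -> float:
    """Seconds between the timestamps that lead two event-log lines."""
    def stamp(line: str) -> datetime:
        # The timestamp is the first " | "-separated field of the line.
        return datetime.strptime(line.split(" | ")[0].strip(), fmt)

    return (stamp(end_line) - stamp(start_line)).total_seconds()


started = ("30/01/2023 12:14:07 | climateprediction.net | Started upload of "
           "oifs_43r3_ps_0783_2001050100_123_970_12187427_1_r1449012272_86.zip")
backed_off = ("30/01/2023 12:20:00 | climateprediction.net | Backing off 00:02:04 on upload of "
              "oifs_43r3_ps_0783_2001050100_123_970_12187427_1_r1449012272_86.zip")

print(elapsed_seconds(started, backed_off))  # 353.0 seconds, i.e. 5 min 53 s
```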

Edit - nothing more to see with debug - just the timeout.
ID: 68122
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 68124 - Posted: 30 Jan 2023, 14:45:38 UTC

Thanks, yes this server is inaccessible at the moment and I cannot get in through their online portal. I have emailed their helpdesk.

Best wishes,

Andy
ID: 68124
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 68129 - Posted: 30 Jan 2023, 17:23:52 UTC - in response to Message 68100.  
Last modified: 30 Jan 2023, 17:24:42 UTC

(on the avoidance of overly deep work buffers at client computers)
wujj123456 wrote:
This reminds me of some other project's trick. They grant additional credits if tasks are returned within X days, to disincentivize excessive hoarding. No matter how people fake it, whether it's a bogus core count or multiple clients per machine, they can't fake the actual compute throughput of the machine. Given that CPDN runs its credit-granting script once a week instead of continuously, it might even be possible to adjust for upload server downtime if necessary.
GPUGrid used to apply different credit if the turnaround time of a result was <24 h, 24…48 h, or >48 h. I am not sure if they are still doing it. AFAICS the corresponding FAQ vanished from their message board. Folding@Home practically prevents work buffering in their client and applies an extremely nonlinear credit based on turnaround time (to a degree which is ridiculous; little credit is given to work done, much credit is given to the speed at which the work was done). Presumably GPUGrid did this, and F@H does this, because newer workunit batches are built from the results of previous batches.
ID: 68129
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,818,575
RAC: 4,615
Message 68130 - Posted: 30 Jan 2023, 17:37:31 UTC - in response to Message 68129.  

Yes, they are. Their presentation of it is:

Standard credit if task is reported 48 hours or more after issue.
Plus 25% bonus if returned between 24 hours and 48 hours of issue.
Plus 50% bonus if returned within 24 hours of issue.
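
Written out as code, the three tiers look roughly like this. This is only a Python sketch of the rule as quoted, not GPUGrid's actual server code; the function name is made up, and how exactly 24 hours is treated (here it falls into the 25% tier) is an assumption.

```python
def credit_multiplier(turnaround_hours: float) -> float:
    """Multiplier applied to standard credit, by hours from issue to report."""
    if turnaround_hours < 0:
        raise ValueError("turnaround time cannot be negative")
    if turnaround_hours < 24:
        return 1.50  # returned within 24 hours: +50% bonus
    if turnaround_hours < 48:
        return 1.25  # between 24 and 48 hours: +25% bonus
    return 1.00      # 48 hours or more: standard credit


for hours in (12, 36, 72):
    print(hours, credit_multiplier(hours))
```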

Some of their tasks follow each other in sequence - the results of one task are used to create the starting conditions of the next, up to five times. They say they need to complete the task sequences for the entire cohort quickly, in order to make the research ready for publication. They also award badges - usually some years after the computing is completed - to recognise the contribution of each volunteer to each publication.
ID: 68130
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 68131 - Posted: 30 Jan 2023, 17:52:04 UTC

Back to the uploads. As we are now well past the time the JASMIN support desk closes, and last I heard Andy can't look at the machine using the online portal, I would guess the chances of seeing uploads resume before tomorrow afternoon are remote.
ID: 68131
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,818,575
RAC: 4,615
Message 68132 - Posted: 30 Jan 2023, 18:03:22 UTC - in response to Message 68131.  

My task has finished now, and can simply sit and wait until the server is ready. I'm doing other work on the machine now - including a GPUGrid task, as it happens - so networking will remain active, and it'll keep trying as the backoffs end, until it gets through.
ID: 68132
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68134 - Posted: 30 Jan 2023, 18:33:32 UTC - in response to Message 68118.  

If anyone confirms it isn't just me I will message Andy.


I see you have already done this.

I just got a work unit and it is in trouble right away.
N.B.: Times are EST.

Mon 30 Jan 2023 11:25:51 AM EST | climateprediction.net | Starting task oifs_43r3_ps_0171_2001050100_123_970_12186815_1
Mon 30 Jan 2023 11:34:17 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0171_2001050100_123_970_12186815_1_r1404095512_0.zip
Mon 30 Jan 2023 11:36:17 AM EST |  | Project communication failed: attempting access to reference site
Mon 30 Jan 2023 11:36:17 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0171_2001050100_123_970_12186815_1_r1404095512_0.zip: transient HTTP error
Mon 30 Jan 2023 11:36:17 AM EST | climateprediction.net | Backing off 00:02:59 on upload of oifs_43r3_ps_0171_2001050100_123_970_12186815_1_r1404095512_0.zip
Mon 30 Jan 2023 11:36:19 AM EST |  | Internet access OK - project servers may be temporarily down.

ID: 68134
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 68136 - Posted: 30 Jan 2023, 18:55:14 UTC - in response to Message 68130.  

Yes, they are. Their presentation of it is:

Standard credit if task is reported 48 hours or more after issue.
Plus 25% bonus if returned between 24 hours and 48 hours of issue.
Plus 50% bonus if returned within 24 hours of issue.

Some of their tasks follow each other in sequence - the results of one task are used to create the starting conditions of the next, up to five times. They say they need to complete the task sequences for the entire cohort quickly, in order to make the research ready for publication. They also award badges - usually some years after the computing is completed - to recognise the contribution of each volunteer to each publication.

It's a nice idea but bear in mind that CPDN are stupidly under-resourced and therefore highly unlikely to create work for themselves. I'll mention it and see what response I get, credit is actually quite a pain for a project to have to manage from what I've learnt.
ID: 68136
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,617,787
RAC: 9,624
Message 68138 - Posted: 30 Jan 2023, 19:57:46 UTC - in response to Message 68136.  

I'll mention it and see what response I get, credit is actually quite a pain for a project to have to manage from what I've learnt.
Returning valid results for the researchers to analyze is the important matter, credit is a 'nice to have'. Others may think otherwise, but I won't mind if the credit update is rare or doesn't happen.
ID: 68138
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 68139 - Posted: 30 Jan 2023, 23:09:34 UTC - in response to Message 68138.  

I'll mention it and see what response I get, credit is actually quite a pain for a project to have to manage from what I've learnt.
Returning valid results for the researchers to analyze is the important matter, credit is a 'nice to have'. Others may think otherwise, but I won't mind if the credit update is rare or doesn't happen.

I tend to agree. For myself, credit is more about another tool to spot problems than anything else. But I get that for some it is a motivating factor.
ID: 68139
AndreyOR
Joined: 12 Apr 21
Posts: 318
Credit: 15,031,602
RAC: 4,207
Message 68142 - Posted: 31 Jan 2023, 9:44:07 UTC - in response to Message 68139.  
Last modified: 31 Jan 2023, 10:07:45 UTC

I'll mention it and see what response I get, credit is actually quite a pain for a project to have to manage from what I've learnt.
Returning valid results for the researchers to analyze is the important matter, credit is a 'nice to have'. Others may think otherwise, but I won't mind if the credit update is rare or doesn't happen.

I tend to agree. For myself, credit is more about another tool to spot problems than anything else. But I get that for some it is a motivating factor.

Some kind of accounting of work done, in both relative and absolute terms, is important, I'd argue. I have little doubt that at one point or another everyone has looked at their credit and standings, even if just to get an idea of their level of contribution over time. It's pretty normal behavior, and I have no doubt that everyone would be asking for some way of knowing how much they've contributed if no such measure were provided at all. I agree that a complicated credit system is unnecessary and a weekly credit run is fine. Unlike on other projects, task counts on CPDN don't get erased, going back to when you first joined, so you can actually tell how many tasks you've processed and calculate your long-term error rate. I think that's kind of a nice thing, although my guess is that a cluttered and somewhat uninformative Project Status page is the trade-off. Badges might be nice, as they'd likely give many people an incentive, although I'd rather see a number of other things improved first before concerning ourselves with badges.
ID: 68142
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 68148 - Posted: 31 Jan 2023, 11:04:02 UTC

News from CPDN/Andy.

JASMIN maintenance issues are ongoing, no update on when it will be up yet. But it's out of CPDN's hands. CPDN are actively looking at ANZ server upload issues to work around recent reports of problems there.


p.s. also not fussed about credit -- but I do like a badge or two.. :)
ID: 68148
Bryn Mawr
Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 68154 - Posted: 31 Jan 2023, 15:55:54 UTC - in response to Message 68142.  

I'll mention it and see what response I get, credit is actually quite a pain for a project to have to manage from what I've learnt.
Returning valid results for the researchers to analyze is the important matter, credit is a 'nice to have'. Others may think otherwise, but I won't mind if the credit update is rare or doesn't happen.

I tend to agree. For myself, credit is more about another tool to spot problems than anything else. But I get that for some it is a motivating factor.

Some kind of accounting of work done, in both relative and absolute terms, is important, I'd argue. I have little doubt that at one point or another everyone has looked at their credit and standings, even if just to get an idea of their level of contribution over time. It's pretty normal behavior, and I have no doubt that everyone would be asking for some way of knowing how much they've contributed if no such measure were provided at all. I agree that a complicated credit system is unnecessary and a weekly credit run is fine. Unlike on other projects, task counts on CPDN don't get erased, going back to when you first joined, so you can actually tell how many tasks you've processed and calculate your long-term error rate. I think that's kind of a nice thing, although my guess is that a cluttered and somewhat uninformative Project Status page is the trade-off. Badges might be nice, as they'd likely give many people an incentive, although I'd rather see a number of other things improved first before concerning ourselves with badges.


Rather than the absolute credit count (although that is nice as well) I find the RAC a more useful measure of how well the system is performing - not that I’m hinting or anything :-)
ID: 68154
Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 68155 - Posted: 31 Jan 2023, 17:51:52 UTC - in response to Message 68154.  
Last modified: 31 Jan 2023, 17:56:00 UTC

Rather than the absolute credit count (although that is nice as well) I find the RAC a more useful measure of how well the system is performing - not that I’m hinting or anything :-)
As the RAC is a long-term average, over a month or two I think (someone correct me if I'm wrong), it's not a great measure of how well the system is performing.
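
On the averaging window: as far as I know, BOINC computes RAC as an exponentially decaying average rather than over a fixed window, and I believe the stock half-life is about a week, but treat that figure as an assumption. The Python sketch below models only the decay while no new credit is granted; the function name and the 7-day default are illustrative, not taken from this thread.

```python
def decayed_rac(rac: float, idle_days: float, half_life_days: float = 7.0) -> float:
    """RAC after `idle_days` with no new credit, under exponential decay.

    half_life_days=7.0 is an assumed default, not a confirmed project setting.
    """
    return rac * 0.5 ** (idle_days / half_life_days)


# With the assumed one-week half-life, a RAC of 1000 halves every 7 idle days.
for days in (0, 7, 14, 28):
    print(days, decayed_rac(1000.0, days))
```

With credit granted only once a week on CPDN, an average like this would jump after each credit run and sag in between, which is another reason it is a rough indicator at best.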
ID: 68155

©2024 cpdn.org