Thread 'The uploads are stuck'

DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 67216 - Posted: 2 Jan 2023, 13:22:17 UTC

Uploads still stuck for me. Hopefully, server can be fixed today or tomorrow.
ID: 67216
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 67217 - Posted: 2 Jan 2023, 13:45:37 UTC - in response to Message 67216.  

DJStarfox wrote:
Uploads still stuck for me. Hopefully, server can be fixed today or tomorrow.
The absolute earliest will be tomorrow, when the cloud provider's support staff go back to work. However, they will probably have a queue of requests to get through, so it might be a day or two after that.
ID: 67217
gemini8
Joined: 4 Dec 15
Posts: 52
Credit: 2,489,447
RAC: 2,080
Message 67218 - Posted: 2 Jan 2023, 13:47:51 UTC - in response to Message 67125.  

When you say "cores" do you mean real cpu cores or threads? (Just want to double check).

BOINC doesn't know about Hyper-Threading.
Every thread is counted as a core.
In every setting that refers to 'cpu' this actually means 'thread'.
- - - - - - - - - -
Greetings, Jens
ID: 67218
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67239 - Posted: 3 Jan 2023, 12:23:59 UTC

Messages flying about but no identification of exactly what the problem is yet. Andy thinks JASMIN have changed something in their infrastructure but is not sure what. What is clear is that those involved are working on it and not sitting on their hands. (With the possible exception of those managing JASMIN, which I assume is the successor to the Joint Academic Network. I never did work out what the rest of the new acronym is.)
ID: 67239
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 67241 - Posted: 3 Jan 2023, 12:41:26 UTC - in response to Message 67239.  

Dave Jackson wrote:
Messages flying about but no identification of exactly what the problem is yet. Andy thinks JASMIN have changed something in their infrastructure but is not sure what. What is clear is that those involved are working on it and not sitting on their hands. (With the possible exception of those managing JASMIN, which I assume is the successor to the Joint Academic Network. I never did work out what the rest of the new acronym is.)
Thanks Dave. I've been bouncing around all morning like a kitten on heat, wanting to (but knowing I mustn't) ask 'are we nearly there yet?'.
ID: 67241
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,758,952
RAC: 86,633
Message 67243 - Posted: 3 Jan 2023, 13:20:33 UTC - in response to Message 67239.  

Dave Jackson wrote:
Messages flying about but no identification of exactly what the problem is yet. Andy thinks JASMIN have changed something in their infrastructure but is not sure what. What is clear is that those involved are working on it and not sitting on their hands. (With the possible exception of those managing JASMIN, which I assume is the successor to the Joint Academic Network. I never did work out what the rest of the new acronym is.)

The Joint Analysis System Meeting Infrastructure Needs
ID: 67243
biodoc
Joined: 2 Oct 19
Posts: 21
Credit: 47,674,094
RAC: 24,265
Message 67245 - Posted: 3 Jan 2023, 13:36:44 UTC

traceroute eventually makes its way to the proper destination.

traceroute to upload11.cpdn.org (192.171.169.187), 64 hops max
  1   192.168.1.1 (Fios_Quantum_Gateway.fios-router.home)  0.342ms  0.285ms  0.264ms 
  2   100.0.197.1 (lo0-100.BSTNMA-VFTTP-308.verizon-gni.net)  7.776ms  9.234ms  12.042ms 
  3   100.41.214.178 (B3308.BSTNMA-LCR-21.verizon-gni.net)  11.290ms  7.794ms  14.045ms 
  4   *  *  * 
  5   140.222.236.255 (0.ae2.BR1.BOS30.ALTER.NET)  4.700ms  9.752ms  9.719ms 
  6   62.115.170.72 (bost-b2-link.telia.net)  10.007ms  *  * 
  7   62.115.122.202 (nyk-bb1-link.ip.twelve99.net)  15.936ms  9.395ms  9.848ms 
  8   62.115.112.245 (ldn-bb4-link.ip.twelve99.net)  79.272ms  78.822ms  78.587ms 
  9   62.115.120.239 (ldn-b2-link.ip.twelve99.net)  79.447ms  79.003ms  78.299ms 
 10   62.115.175.131 (jisc-ic345131-ldn-b2.ip.twelve99-cust.net)  76.696ms  78.259ms  78.755ms 
 11   146.97.35.197 (ae24.londhx-sbr1.ja.net)  79.159ms  78.027ms  78.719ms 
 12   146.97.33.2 (ae29.londpg-sbr2.ja.net)  79.844ms  77.670ms  81.317ms 
 13   146.97.33.22 (ae31.erdiss-sbr2.ja.net)  90.225ms  87.908ms  88.663ms 
 14   *  *  * 
 15   146.97.41.34 (ral-r26.ja.net)  88.691ms  88.240ms  88.613ms 
 16   *  *  * 
 17   *  *  * 
 18   *  *  * 
 19   *  *  * 
 20   192.171.169.187 (192.171.169.187)  85.539ms !*  85.660ms !*  88.446ms !* 
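
To compare traces like the one above without reading them line by line, the hop lines can be parsed mechanically. This is a minimal Python sketch, assuming output in exactly the format pasted here; `parse_hops` and its regular expressions are illustrative names, not part of any traceroute tool, and other traceroute builds format their output differently.

```python
import re

# Patterns are assumptions based on the trace format pasted in this thread.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")          # hop number, then the rest
IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")  # leading IPv4 address
RTT_RE = re.compile(r"(\d+(?:\.\d+)?)ms")            # round-trip times in ms


def parse_hops(text):
    """Return one dict per hop line: hop number, IP (or None), list of RTTs."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if not m:
            continue  # skips the "traceroute to ..." header line
        rest = m.group(2).strip()
        ip = IP_RE.match(rest)
        hops.append({
            "hop": int(m.group(1)),
            "ip": ip.group(1) if ip else None,
            "rtts_ms": [float(x) for x in RTT_RE.findall(rest)],
        })
    return hops


# A few lines copied from the trace above.
SAMPLE = """\
traceroute to upload11.cpdn.org (192.171.169.187), 64 hops max
  1   192.168.1.1 (Fios_Quantum_Gateway.fios-router.home)  0.342ms  0.285ms  0.264ms
  4   *  *  *
  6   62.115.170.72 (bost-b2-link.telia.net)  10.007ms  *  *
 20   192.171.169.187 (192.171.169.187)  85.539ms !*  85.660ms !*  88.446ms !*
"""

hops = parse_hops(SAMPLE)
silent = [h["hop"] for h in hops if not h["rtts_ms"]]
print("silent hops:", silent)
print("last hop:", hops[-1]["ip"], hops[-1]["rtts_ms"])
```

A hop that shows only `*` did not answer the probes, which by itself does not mean traffic stops there; in the trace above the final hop does answer, which fits the reading that the route itself is intact.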
ID: 67245
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 67246 - Posted: 3 Jan 2023, 13:42:37 UTC - in response to Message 67245.  

And "transient HTTP error" after two minutes (timeout) has changed to "connect() failed" after 1 second (software not running yet).
ID: 67246
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67247 - Posted: 3 Jan 2023, 13:43:55 UTC - in response to Message 67243.  

PDW wrote:
The Joint Analysis System Meeting Infrastructure Needs

In case anyone cares, this describes their systems and services...

https://www.bnlawrence.net/assets/talks/2014-07-02-lawrence_eresearchNZ14_jasmin.pdf
ID: 67247
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67248 - Posted: 3 Jan 2023, 13:54:51 UTC - in response to Message 67246.  

Richard Haselgrove wrote:
And "transient HTTP error" after two minutes (timeout) has changed to "connect() failed" after 1 second (software not running yet).

True: I guess it is better to fail fast. At least they are doing something.
Traceroute is the same as last week (so I will not bother to post the results here).

Tue 03 Jan 2023 08:30:18 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_113.zip
Tue 03 Jan 2023 08:30:18 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_114.zip
Tue 03 Jan 2023 08:30:21 AM EST |  | Project communication failed: attempting access to reference site
Tue 03 Jan 2023 08:30:21 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_113.zip: connect() failed
Tue 03 Jan 2023 08:30:21 AM EST | climateprediction.net | Backing off 00:53:19 on upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_113.zip
Tue 03 Jan 2023 08:30:21 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_114.zip: connect() failed
Tue 03 Jan 2023 08:30:21 AM EST | climateprediction.net | Backing off 00:18:48 on upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_114.zip
Tue 03 Jan 2023 08:30:23 AM EST |  | Internet access OK - project servers may be temporarily down.
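
With hundreds of files queued, it can help to pull the backoff times out of the event log rather than scan them by eye. A minimal sketch, assuming log lines shaped like the ones above; `parse_backoffs` is a hypothetical helper, not a BOINC tool, and the regular expression is only matched against this message wording.

```python
import re
from datetime import timedelta

# Matches "Backing off HH:MM:SS on upload of <file>" as seen in the log above.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def parse_backoffs(log_text):
    """Return {filename: timedelta} for every 'Backing off' line found."""
    result = {}
    for line in log_text.splitlines():
        m = BACKOFF_RE.search(line)
        if m:
            hours, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
            result[m.group(4)] = timedelta(
                hours=hours, minutes=minutes, seconds=seconds
            )
    return result


# Two lines copied from the log above.
LOG = """\
Tue 03 Jan 2023 08:30:21 AM EST | climateprediction.net | Backing off 00:53:19 on upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_113.zip
Tue 03 Jan 2023 08:30:21 AM EST | climateprediction.net | Backing off 00:18:48 on upload of oifs_43r3_ps_0144_2001050100_123_970_12186788_0_r1420963935_114.zip
"""

backoffs = parse_backoffs(LOG)
for name, delay in sorted(backoffs.items(), key=lambda kv: kv[1]):
    print(delay, name)
```

Sorting by delay shows which file the client will retry first; note that the two files failed at the same second but were given different backoff periods.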

ID: 67248
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67249 - Posted: 3 Jan 2023, 13:58:09 UTC

David can now log in.
More messages flying about.
ID: 67249
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,306,920
RAC: 297
Message 67250 - Posted: 3 Jan 2023, 14:30:27 UTC - in response to Message 67132.  
Last modified: 3 Jan 2023, 14:31:27 UTC

AndreyOR wrote:
I can't get any more work because of too many uploads in progress, according to event log.
Richard Haselgrove wrote:
The "too many uploads in progress" limit? It doesn't count the files, just the number of tasks that can't report because they have at least one file still to upload.

The limit is twice the number of CPU cores in the system.
Richard Haselgrove wrote:
It's what BOINC reads as 'number of CPUs' in the system - so that's probably what the OS reads from the BIOS. If you have a physical CPU that supports hyperthreading, you could double the number by turning hyperthreading on.
Actually it's twice the number of "usable CPUs".
"Usable CPUs" is AFAIU the number of logical CPUs of the host system, possibly overridden by cc_config::options::ncpus, _and_ modified by computing preferences' percentage of CPUs allowed to be used by BOINC.
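
The rule described above can be written out as arithmetic. This is a minimal sketch under stated assumptions: the factor of two and the three inputs come from this thread, while the function names, the rounding down, the floor of one CPU, and treating a non-positive `ncpus` as "no override" are assumptions made for illustration, not taken from the BOINC source.

```python
import math


def usable_cpus(logical_cpus, ncpus_override=None, max_cpus_pct=100.0):
    """Estimate BOINC's 'usable CPUs' from the three inputs named above.

    logical_cpus    -- threads reported by the OS (hyperthreads count as CPUs)
    ncpus_override  -- value of <ncpus> in cc_config.xml, if set (assumption:
                       a missing or non-positive value means no override)
    max_cpus_pct    -- computing preference "use at most N % of the CPUs"
    """
    base = ncpus_override if ncpus_override and ncpus_override > 0 else logical_cpus
    # Rounding down and never going below 1 are assumptions of this sketch.
    return max(1, math.floor(base * max_cpus_pct / 100.0))


def stalled_upload_task_limit(logical_cpus, ncpus_override=None, max_cpus_pct=100.0):
    """Tasks with pending uploads allowed before no new work is fetched."""
    return 2 * usable_cpus(logical_cpus, ncpus_override, max_cpus_pct)


print(stalled_upload_task_limit(16))            # 16 threads, all usable
print(stalled_upload_task_limit(16, None, 50))  # same host limited to 50 %
print(stalled_upload_task_limit(16, 4))         # same host with ncpus set to 4
```

On these assumptions a 16-thread host limited to 50 % of its CPUs would stop fetching work at 16 stalled tasks rather than 32.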
ID: 67250
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 67251 - Posted: 3 Jan 2023, 14:36:34 UTC

But back to the matter in hand - I was misled (over-optimistic) about the changed error message. In full, it reads:

03/01/2023 14:31:57 | climateprediction.net | [http] [ID#4304] Info: Trying 192.171.169.187:80...
03/01/2023 14:31:57 | climateprediction.net | [http] [ID#4304] Info: connect to 192.171.169.187 port 80 failed: No route to host
03/01/2023 14:31:57 | climateprediction.net | [http] [ID#4304] Info: Failed to connect to upload11.cpdn.org port 80 after 19 ms: No route to host
03/01/2023 14:31:57 | climateprediction.net | [http] [ID#4304] Info: Closing connection 4403
03/01/2023 14:31:57 | climateprediction.net | [http] HTTP error: Couldn't connect to server
03/01/2023 14:31:58 | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0264_2002050100_123_971_12187908_2_r743750888_75.zip: connect() failed
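
The difference between "no route to host", a refused connection, and a timeout can be checked outside BOINC with a plain TCP connect. A minimal sketch using Python's standard `socket` module; `classify_connect_error` and `probe` are hypothetical helper names and the labels they return are illustrative, not BOINC's own messages.

```python
import errno
import socket


def classify_connect_error(exc):
    """Map a connection exception to a short label."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"      # nothing answered before the deadline
    if isinstance(exc, ConnectionRefusedError):
        return "refused"      # host reachable, nothing listening on the port
    if isinstance(exc, OSError) and exc.errno in (
        errno.EHOSTUNREACH,
        errno.ENETUNREACH,
    ):
        return "no route"     # the network reported the host as unreachable
    return "other"


def probe(host, port=80, timeout=5.0):
    """Try one TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return classify_connect_error(exc)


# Example call (needs network access, so it is left commented out):
# print(probe("upload11.cpdn.org", 80))
```

A "refused" result would fit the earlier guess that the server software was not running yet, whereas the log above corresponds to the "no route" case.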
ID: 67251
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 67253 - Posted: 3 Jan 2023, 14:43:01 UTC - in response to Message 67239.  

Dave Jackson wrote:
Messages flying about but no identification of exactly what the problem is yet. Andy thinks JASMIN have changed something in their infrastructure but is not sure what. What is clear is that those involved are working on it and not sitting on their hands. (With the possible exception of those managing JASMIN, which I assume is the successor to the Joint Academic Network. I never did work out what the rest of the new acronym is.)
JASMIN is a cloud service managed by people at the Rutherford Appleton Labs for the benefit of UK academia. JANet is the UK academic network provider. A traceroute will show a connection from your broadband to a janet hub before a hop or two to JASMIN.

I would not bother trying to force uploads through; the server is up and will be at capacity for some time. The connect() message just means the httpd servers were busy and there were none spare. It should improve after a couple of hours.
ID: 67253
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67254 - Posted: 3 Jan 2023, 14:47:46 UTC

Just tell the volunteers to sit tight and be patient. It's not all quite working yet.
(I think we guessed that.)
ID: 67254
leloft
Joined: 7 Jun 17
Posts: 23
Credit: 44,434,789
RAC: 2,600,991
Message 67255 - Posted: 3 Jan 2023, 15:05:21 UTC - in response to Message 67254.  

I'm Uploading.
ID: 67255
[SG]Felix
Joined: 4 Oct 15
Posts: 34
Credit: 9,075,151
RAC: 374
Message 67256 - Posted: 3 Jan 2023, 16:02:32 UTC

Me Too :)
And to let everybody else have a part of it, I limit my upload slots to one for now. It doesn't hurt me, since the server can use the full bandwidth of my upload with one slot.

Greets
Felix
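
The post does not say how the upload slots were limited. One way that should work is the `max_file_xfers_per_project` option in the BOINC client's `cc_config.xml`; the sketch below only builds that fragment as a string, and the function name `build_cc_config` is hypothetical. It is an illustration under that assumption, not a description of Felix's setup.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_xfers_per_project=1):
    """Return a minimal cc_config.xml body limiting transfers per project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    limit = ET.SubElement(options, "max_file_xfers_per_project")
    limit.text = str(max_xfers_per_project)
    return ET.tostring(root, encoding="unicode")


config_xml = build_cc_config(1)
print(config_xml)
```

If a `cc_config.xml` already exists, the option would need to be merged into its `<options>` section by hand rather than overwriting the file, and the client has to re-read its config files before the change takes effect.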
ID: 67256
biodoc
Joined: 2 Oct 19
Posts: 21
Credit: 47,674,094
RAC: 24,265
Message 67257 - Posted: 3 Jan 2023, 17:27:25 UTC

I've had a few files upload, so only 39,113 to go. That's from 340 completed tasks.
ID: 67257
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67258 - Posted: 3 Jan 2023, 18:14:11 UTC - in response to Message 67257.  

biodoc wrote:
I've had a few files upload, so only 39,113 to go. That's from 340 completed tasks.
A bunch of files uploaded, then it started getting HTTP errors at about 15:30. At 15:50 it started working again, then there were more errors from 17:10 onwards. It may just be the server being overloaded, though.
ID: 67258
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 67259 - Posted: 3 Jan 2023, 18:20:58 UTC - in response to Message 67258.  

I saw it working quite a bit during the late afternoon, but with occasional breaks. I don't know whether that was just a pause to cool down, or a full stop requiring manual intervention. But it seems to have stopped for good (or at least, for the night) now.
ID: 67259

©2024 cpdn.org