climateprediction.net (CPDN) home page
Thread 'Upload failures'

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 60220 - Posted: 27 May 2019, 14:41:39 UTC - in response to Message 60218.  

Ah, Jasmin.
That's at Oxford, and it's working fine for me.
So it looks like it's something at your end.

And DON'T switch to the secure URL while you have tasks for the project, or you'll lose the lot.
ID: 60220
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 60224 - Posted: 27 May 2019, 21:48:17 UTC - in response to Message 60216.  

I now have 50 uploads queued, 3700 MB of data.

That may be the problem. There's a limit to either the number of files, or the amount of data that BOINC is happy with.

If it IS the problem, then the cure is painful:

1. Suspend network access
2. Suspend each and every one of the tasks in the Tasks tab. (To stop more files from being created.)
3. Create a temporary folder somewhere nearby.
4. Move all but 4-5 of the cpdn zip files to this folder. The ones left should be the lowest numbered zips.
5. Resume network access and see if the zips left behind upload OK.
6. If so, move 4-5 of the zips back to their normal place, and upload them.
7. Repeat.
8. Unsuspend all of the tasks.
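
Steps 3 and 4 of the list above can be sketched in Python. This is only a rough illustration, not a tested tool. It assumes the zips follow the `..._N.zip` naming seen in this thread and that "lowest numbered" means the smallest trailing N. The function names and the holding folder are hypothetical. Whether the BOINC client tolerates files being moved out from under it is not guaranteed, so only try it with network access and tasks suspended (steps 1 and 2), and start with `dry_run=True`.

```python
import re
import shutil
from pathlib import Path

# Trailing index of an upload zip, e.g. "..._r734526431_6.zip" -> 6.
# Assumption: every upload zip ends in "_<number>.zip".
_ZIP_INDEX = re.compile(r"_(\d+)\.zip$")


def zip_index(name: str) -> int:
    """Return the trailing number of a zip name, or a large value if absent."""
    match = _ZIP_INDEX.search(name)
    return int(match.group(1)) if match else 10**9


def split_keep_and_park(names, keep=5):
    """Split names into (keep, park); the `keep` lowest-numbered zips stay."""
    ordered = sorted(names, key=lambda n: (zip_index(n), n))
    return ordered[:keep], ordered[keep:]


def park_zips(project_dir: Path, holding_dir: Path, keep=5, dry_run=True):
    """Move all but the `keep` lowest-numbered zips into holding_dir.

    With dry_run=True nothing is moved; the list of candidates is returned.
    """
    names = [p.name for p in project_dir.glob("*.zip")]
    _, park = split_keep_and_park(names, keep)
    if not dry_run:
        holding_dir.mkdir(parents=True, exist_ok=True)
        for name in park:
            shutil.move(str(project_dir / name), str(holding_dir / name))
    return park
```

Calling `park_zips(Path(project_folder), Path(temp_folder))` with the default `dry_run=True` only lists what would be moved. Moving a batch back for step 6 is the same `shutil.move` in the other direction.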

If this doesn't work, post here again, and we'll all go down to the pub for a few beers and a good whinge.
ID: 60224
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60230 - Posted: 28 May 2019, 8:06:45 UTC - in response to Message 60224.  

Hmm, that's strange. I have never had these problems before. I used to have a computer with a 3G modem attached, which I connected once a month to upload. I then had hundreds of tasks to upload and did not run into any problems. That was a year or two ago, though.

I'll have to try this, then.
ID: 60230
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60231 - Posted: 28 May 2019, 8:08:09 UTC
Last modified: 28 May 2019, 8:08:48 UTC

I should also say that I'm running a proxy (CCProxy) to be able to upload, but it has been working fine until now. :) Since the problem started I have tried both the HTTP proxy and the SOCKS proxy, but I get the same error. Downloading new tasks still works fine, though.
ID: 60231
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60258 - Posted: 10 Jun 2019, 7:22:41 UTC - in response to Message 60224.  
Last modified: 10 Jun 2019, 7:23:20 UTC

I have tried this now, but removing the zip files does not remove them from the transfer list. I tried stopping and restarting boincmgr and the BOINC service, and it made no difference.

Also there seem to be problems reaching jasmin-upload.cpdn.org to upload files.

I moved the files from this location:
C:\ProgramData\BOINC\projects\climateprediction.net
ID: 60258
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 60259 - Posted: 10 Jun 2019, 9:24:12 UTC

During the offline period of the project, one of my machines killed 3 safr50 WUs with the following error.

<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
03:20:25 (12508): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2576, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=13092, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
03:20:29 (13092): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>wah2_safr50_a0lb_201512_13_817_011859012_0_r844198882_9.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>

I guess BOINC killed them because of the upload failures, which I noticed last week, but I was unable to check the full log of the machine.
ID: 60259
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 60260 - Posted: 10 Jun 2019, 11:11:49 UTC
Last modified: 10 Jun 2019, 11:16:09 UTC

I guess BOINC killed them because of the upload failures, which I noticed last week, but I was unable to check the full log of the machine.


I don't think so.
A segmentation violation is a program problem.

In computing, a segmentation fault or access violation is a fault, or failure condition, raised by hardware with memory protection, notifying an operating system the software has attempted to access a restricted area of memory. (Wikipedia)


Some model types seem much more prone to this than others, but I don't think anyone has really worked out what is causing it. The problem may be somewhere in the Met Office code that CPDN uses under license.
ID: 60260
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 60261 - Posted: 10 Jun 2019, 21:35:32 UTC - in response to Message 60258.  

Mephist0

It's been perhaps 10 years since this trick of reducing the number of zips in the Transfers queue was last used.
I guess that the way BOINC works has changed a lot since then.

As for not being able to reach the server Jasmin, could you please post the line from the Event log that says this.
ID: 60261
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60265 - Posted: 11 Jun 2019, 13:11:35 UTC - in response to Message 60261.  

Sure, here it is. One strange thing: <file_size> says 1310720 and 2610815, but the files are around 74 MB each. That seems wrong?

The file transfer also looks strange. When it looks finished, the transfer bar jumps to around 30%, and after that it finishes. It looks like it does some kind of resume.

Is it possible to reset the resume function and not use it? Could that be the problem: it tries to resume the files, but Jasmin no longer has them?


2019-06-11 15:06:57 | | Resuming network activity
2019-06-11 15:06:57 | climateprediction.net | [fxd] starting upload, upload_offset -1
2019-06-11 15:06:57 | climateprediction.net | Started upload of wah2_sam50_n088_201412_24_814_011846908_0_r734526431_1.zip
2019-06-11 15:06:57 | climateprediction.net | [file_xfer] URL: http://jasmin-upload.cpdn.org/cgi-bin/file_upload_handler
2019-06-11 15:06:57 | climateprediction.net | [fxd] starting upload, upload_offset -1
2019-06-11 15:06:57 | climateprediction.net | Started upload of wah2_sam50_n088_201412_24_814_011846908_0_r734526431_6.zip
2019-06-11 15:06:57 | climateprediction.net | [file_xfer] URL: http://jasmin-upload.cpdn.org/cgi-bin/file_upload_handler
2019-06-11 15:07:00 | climateprediction.net | [file_xfer] http op done; retval 0 (Success)
2019-06-11 15:07:00 | climateprediction.net | [file_xfer] parsing upload response: <data_server_reply> <status>0</status> <file_size>1310720</file_size></data_server_reply>
2019-06-11 15:07:00 | climateprediction.net | [file_xfer] parsing status: 0
2019-06-11 15:07:00 | climateprediction.net | [fxd] starting upload, upload_offset 1310720
2019-06-11 15:07:00 | climateprediction.net | [file_xfer] http op done; retval 0 (Success)
2019-06-11 15:07:00 | climateprediction.net | [file_xfer] parsing upload response: <data_server_reply> <status>0</status> <file_size>2610815</file_size></data_server_reply>
2019-06-11 15:07:00 | climateprediction.net | [file_xfer] parsing status: 0
2019-06-11 15:07:00 | climateprediction.net | [fxd] starting upload, upload_offset 2610815
2019-06-11 15:09:34 | | Project communication failed: attempting access to reference site
2019-06-11 15:09:34 | climateprediction.net | [file_xfer] http op done; retval -184 (transient HTTP error)
2019-06-11 15:09:34 | climateprediction.net | [file_xfer] http op done; retval -184 (transient HTTP error)
2019-06-11 15:09:34 | climateprediction.net | [file_xfer] file transfer status -184 (transient HTTP error)
2019-06-11 15:09:34 | climateprediction.net | Temporarily failed upload of wah2_sam50_n088_201412_24_814_011846908_0_r734526431_1.zip: transient HTTP error
2019-06-11 15:09:34 | climateprediction.net | Backing off 02:21:39 on upload of wah2_sam50_n088_201412_24_814_011846908_0_r734526431_1.zip
2019-06-11 15:09:34 | climateprediction.net | [file_xfer] file transfer status -184 (transient HTTP error)
2019-06-11 15:09:34 | climateprediction.net | Temporarily failed upload of wah2_sam50_n088_201412_24_814_011846908_0_r734526431_6.zip: transient HTTP error
2019-06-11 15:09:34 | climateprediction.net | Backing off 00:08:45 on upload of wah2_sam50_n088_201412_24_814_011846908_0_r734526431_6.zip
2019-06-11 15:09:34 | climateprediction.net | [fxd] starting upload, upload_offset -1
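
On the <file_size> question: as far as I understand the BOINC upload protocol, the client first asks the upload handler how many bytes of the file the server already holds, and the `<file_size>` in the reply is that byte count, not the full size of the zip. The client then continues from that offset, which matches the `upload_offset 1310720` line that follows the reply in the log above. An `upload_offset` of -1 appears to mean the client has not asked yet. Below is a small Python sketch that pulls those numbers out of Event Log lines. The function names are made up for illustration, and the 74 MB total is only the approximate figure quoted in the post.

```python
import re

# Byte count the upload server says it already holds for a file.
REPLY_SIZE = re.compile(r"<file_size>(\d+)</file_size>")
# Offset the client announces before it starts sending data.
OFFSET = re.compile(r"starting upload, upload_offset (-?\d+)")


def server_reported_sizes(log_lines):
    """Return every <file_size> value found in the given log lines."""
    return [int(m.group(1)) for line in log_lines for m in REPLY_SIZE.finditer(line)]


def resume_offsets(log_lines):
    """Return announced upload offsets, skipping -1 (size not yet queried)."""
    offsets = []
    for line in log_lines:
        match = OFFSET.search(line)
        if match and int(match.group(1)) >= 0:
            offsets.append(int(match.group(1)))
    return offsets


def percent_already_uploaded(bytes_on_server, total_bytes):
    """Share of the file the server already has, as a percentage."""
    return 100.0 * bytes_on_server / total_bytes


log = [
    "2019-06-11 15:06:57 | climateprediction.net | [fxd] starting upload, upload_offset -1",
    "2019-06-11 15:07:00 | climateprediction.net | [file_xfer] parsing upload response: "
    "<data_server_reply> <status>0</status> <file_size>1310720</file_size></data_server_reply>",
    "2019-06-11 15:07:00 | climateprediction.net | [fxd] starting upload, upload_offset 1310720",
]

approx_total = 74 * 1024 * 1024  # "around 74 MB", taken from the post
for size in server_reported_sizes(log):
    print(size, "bytes on server,", round(percent_already_uploaded(size, approx_total), 1), "% of the file")
```

Read this way, the small `<file_size>` values are not wrong by themselves: they suggest the server kept a partial copy from an earlier attempt and the client is resuming from it.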
ID: 60265
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60266 - Posted: 11 Jun 2019, 13:49:29 UTC

I also tried downgrading BOINC to 7.12.1 (x64), without success. :(
ID: 60266
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 60269 - Posted: 12 Jun 2019, 6:26:18 UTC

I've emailed Andy about this.
ID: 60269
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60270 - Posted: 12 Jun 2019, 7:54:52 UTC - in response to Message 60269.  

OK, thank you. I could also delete the project and add it again as https instead. Maybe that's what is causing the problem. I would then have to delete the completed work, though, so I'll wait for the response first. :)
ID: 60270
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60271 - Posted: 12 Jun 2019, 8:31:39 UTC

In the meantime I'm running World Community Grid tasks to check that my proxy server works correctly when uploading results to them. It has worked correctly with ClimatePrediction before, though.
ID: 60271
Mephist0

Joined: 21 Feb 08
Posts: 47
Credit: 7,929,915
RAC: 0
Message 60272 - Posted: 12 Jun 2019, 13:04:19 UTC

WCG worked fine:

2019-06-12 15:03:16 | | Resuming network activity
2019-06-12 15:03:37 | World Community Grid | [fxd] starting upload, upload_offset -1
2019-06-12 15:03:37 | World Community Grid | Started upload of MIP1_00197817_1787_0_r2014039693_0
2019-06-12 15:03:37 | World Community Grid | [file_xfer] URL: https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler
2019-06-12 15:03:38 | World Community Grid | [file_xfer] http op done; retval 0 (Success)
2019-06-12 15:03:38 | World Community Grid | [file_xfer] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>
2019-06-12 15:03:38 | World Community Grid | [file_xfer] parsing status: 0
2019-06-12 15:03:38 | World Community Grid | [fxd] starting upload, upload_offset 0
2019-06-12 15:03:39 | World Community Grid | [file_xfer] http op done; retval 0 (Success)
2019-06-12 15:03:39 | World Community Grid | [file_xfer] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
2019-06-12 15:03:39 | World Community Grid | [file_xfer] parsing status: 0
2019-06-12 15:03:39 | World Community Grid | [file_xfer] file transfer status 0 (Success)
2019-06-12 15:03:39 | World Community Grid | Finished upload of MIP1_00197817_1787_0_r2014039693_0
2019-06-12 15:03:39 | World Community Grid | [file_xfer] Throughput 53790 bytes/sec
2019-06-12 15:03:42 | World Community Grid | Sending scheduler request: To report completed tasks.
2019-06-12 15:03:42 | World Community Grid | Reporting 1 completed tasks
2019-06-12 15:03:42 | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
2019-06-12 15:03:43 | World Community Grid | Scheduler request completed
ID: 60272
[P3D] Crashtest

Joined: 2 Apr 05
Posts: 16
Credit: 19,190,081
RAC: 10,804
Message 60276 - Posted: 12 Jun 2019, 20:38:36 UTC

Over 12 GB of uploads are waiting:

12.06.2019 22:34:41 | climateprediction.net | [error] Error reported by file upload server: can't write file wah2_safr50_n2af_199512_13_820_011867777_1_r1455705735_3.zip: No space left on server

So we have lots of new work units, but the server disks are full!?
ID: 60276
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 60277 - Posted: 12 Jun 2019, 20:50:42 UTC

Crashtest

Which of the dozen servers is it?
ID: 60277
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 60278 - Posted: 12 Jun 2019, 21:01:41 UTC

Mephist0

Apparently "Jasmin" isn't a single server, it's a data center.

They ARE having problems, and our IT people are liaising with their IT people about it.

No idea of when it will be "fixed".
ID: 60278
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 60279 - Posted: 12 Jun 2019, 21:15:44 UTC

I am in the same boat. It just started in the last hour apparently. Only one is stuck for me thus far. How do I determine which server?

16178 climateprediction.net 6/12/2019 5:13:24 PM Started upload of wah2_safr50_n13c_201512_13_819_011864198_0_r670590459_5.zip
16179 climateprediction.net 6/12/2019 5:13:25 PM [error] Error reported by file upload server: Server is out of disk space
16180 climateprediction.net 6/12/2019 5:13:25 PM Temporarily failed upload of wah2_safr50_n13c_201512_13_819_011864198_0_r670590459_5.zip: transient upload error
16181 climateprediction.net 6/12/2019 5:13:25 PM Backing off 00:24:11 on upload of wah2_safr50_n13c_201512_13_819_011864198_0_r670590459_5.zip
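
The failures quoted in this thread look different in the Event Log: "transient HTTP error" lines on one hand, and "No space left on server" or "Server is out of disk space" on the other. A minimal Python sketch for sorting log lines into those groups and reading the back-off time is below. The marker strings are copied from the logs posted here; the category labels and function names are my own invention, and other BOINC versions may word these messages differently.

```python
import re

# Phrases copied from Event Log lines quoted in this thread.
SERVER_DISK_MARKERS = ("No space left on server", "Server is out of disk space")
TRANSIENT_HTTP_MARKERS = ("transient HTTP error",)

BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2})")


def classify_upload_line(line: str) -> str:
    """Label a log line as 'server-disk-full', 'transient-http' or 'other'."""
    if any(marker in line for marker in SERVER_DISK_MARKERS):
        return "server-disk-full"
    if any(marker in line for marker in TRANSIENT_HTTP_MARKERS):
        return "transient-http"
    return "other"


def backoff_seconds(line: str):
    """Return the back-off in seconds from a 'Backing off HH:MM:SS' line, else None."""
    match = BACKOFF.search(line)
    if not match:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```

A "server-disk-full" line points at the upload server rather than the volunteer's machine, so there is nothing to fix locally in that case; the client simply retries after the back-off.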
ID: 60279
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 60284 - Posted: 13 Jun 2019, 2:23:51 UTC

The upload server is recorded in the client_state.xml file.
Look for the file name in the upload section.
CAREFULLY.

And if you look at the posts below by Mephist0, you can see that it's in one of the BOINC "flags". Probably [file_xfer], as that's at the start of the lines.

Event Log options by the look of it.

Also, I think that this space problem triggers an email alarm to the project people, so I won't do anything in the middle of their night.
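
For looking up the server in client_state.xml without editing the file by hand, a read-only Python sketch follows. It makes assumptions I have not checked against every BOINC version: that the file parses as XML, and that the entry for an upload has a `<name>` child plus one or more children whose tag ends in `url`. The function name is hypothetical. Work on a copy of the file, never the live one.

```python
import xml.etree.ElementTree as ET


def upload_urls_for(client_state_xml: str, file_name: str):
    """Return URL-like child values of any element whose <name> equals file_name.

    Assumption: the layout is <something><name>..</name><..url>..</..url></something>.
    Nothing is written; the XML text is only parsed and searched.
    """
    root = ET.fromstring(client_state_xml)
    urls = []
    for element in root.iter():
        name = element.find("name")
        if name is not None and (name.text or "").strip() == file_name:
            for child in element:
                if child.tag.endswith("url") and child.text:
                    urls.append(child.text.strip())
    return urls


# Made-up sample in the assumed layout, only to show the call.
sample = (
    "<client_state>"
    "<file><name>a_1.zip</name>"
    "<upload_url>http://example.invalid/cgi-bin/file_upload_handler</upload_url></file>"
    "<file><name>b.zip</name></file>"
    "</client_state>"
)
print(upload_urls_for(sample, "a_1.zip"))
```

With a real copy of the file, read it with `pathlib.Path(copy_of_client_state).read_text()` and pass in the zip name from the Transfers tab. If parsing fails, searching the copy for the file name in a text editor is the fallback.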
ID: 60284
Max Ringler

Joined: 14 Jun 10
Posts: 2
Credit: 6,623,376
RAC: 34,518
Message 60287 - Posted: 13 Jun 2019, 8:07:52 UTC

If you look at the project stats, there hasn't been a single successful upload for more than 2 weeks, since 25.05.2019: https://boincstats.com/en/stats/2/project/detail/lastDays
ID: 60287
©2024 cpdn.org