Thread 'The uploads are stuck'

Author	Message
Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969	Message 67787 - Posted: 17 Jan 2023, 9:25:21 UTC All my stacked-up uploads have cleared, and I just have four tasks in the final stages. So today is maintenance day, and afterwards I have a plan to try and grab a memory usage log to illustrate the startup problem. I'll be using a machine with 6 cores and 64 GB RAM, so no CPDN work should be harmed in the process (though it may take a couple of tries to get it right), and then we'll have something to show CPDN staff in the first instance, and BOINC developers later on. ID: 67787 · Reply Quote

ncoded.com Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038	Message 67789 - Posted: 17 Jan 2023, 12:23:48 UTC Last modified: 17 Jan 2023, 12:49:36 UTC Dave, Richard, et al. Can I ask, where have these completed tasks gone? https://www.cpdn.org/results.php?hostid=1535374&offset=0&show_names=0&state=1&appid= It says In-Progress, but most (if not all) of these have already been completed, uploaded, and reported. E.g. I uploaded and reported one task just a few minutes ago, which I downloaded 15 hours ago, but there is nothing showing in the list, just 'in-progress'. I think the problem is that CPDN is treating 2 hosts as a single host. E.g. L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just 1 host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host). The two disks are completely separate. Both have a full install of Ubuntu and BOINC. Only one drive is inserted into the server at any one time. I have run this server on many BOINC projects, at different times with each drive, and all projects (except CPDN) see it as 2 different hosts. Are all the reported tasks from this server since Dec 24th now orphaned? If so, that would mean it's not just the 17 tasks on this list, but also the 250+ tasks that this server has reported and is currently reporting. If I report a task, should the in-progress count not decrease by 1, and the Valid, Invalid, or Error count increase by 1? This has not happened for ANY of the 250+ tasks reported (or being reported) by this server since Dec 24th. However, it has happened, and does happen, for all our other hosts (devices) from before and after this date. ID: 67789 · Reply Quote

PDW Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,020,070 RAC: 84,566	Message 67790 - Posted: 17 Jan 2023, 12:45:57 UTC - in response to Message 67789. Dave, Richard, et al. Can I ask, where have these completed tasks gone? https://www.cpdn.org/results.php?hostid=1535374&offset=0&show_names=0&state=1&appid= It says In-Progress, but most (if not all) of these have already been completed, uploaded, and reported. E.g. I uploaded and reported one task just a few minutes ago, which I downloaded 15 hours ago, but there is nothing showing in the list, just 'in-progress'. I think the problem is that CPDN is treating 2 hosts as a single host. E.g. L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just 1 host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host). The two disks are completely separate. Both have a full install of Ubuntu and BOINC. Only one drive is inserted into the server at any one time. I have run this server on many BOINC projects, at different times with each drive, and all projects (except CPDN) see it as 2 different hosts. The link you give shows ONLY in-progress tasks; this link shows all tasks for that host: https://www.cpdn.org/results.php?hostid=1535374 There is a server setting that doesn't allow multiple clients. The way you have your 2 drives set up means BOINC sees them as the same, so when you swap them over the tasks will get abandoned, as shown in your full list. Go and try GPUGrid: it does not allow multiple clients, and its results will be abandoned if you swap your drives over whilst tasks are still active on the drive you swap out. ID: 67790 · Reply Quote

ncoded.com Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038	Message 67791 - Posted: 17 Jan 2023, 12:54:53 UTC - in response to Message 67790. Last modified: 17 Jan 2023, 13:10:16 UTC Thank you PDW, that is my point. There are no tasks in progress on this server. It is in jail so we only get 1 or 2 tasks per day. We complete them in around 15 hours, upload and report them. Yet this is showing 17 In-progress. Thank you also for confirming there is some kind of lock. I guessed there was, and that it was using the MAC address and/or the local IP. Obviously I am not sure what to do about the missing tasks now. I have run 300 threads on CPDN since the OpenIFS large batch(es) were issued a couple of weeks ago. It's going to be a hard and bitter pill to swallow if it turns out that was all for nothing and wasted. ID: 67791 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,551,831 RAC: 17,001	Message 67792 - Posted: 17 Jan 2023, 12:58:52 UTC - in response to Message 67790. Last modified: 17 Jan 2023, 13:00:53 UTC I think the problem is that CPDN is treating 2 hosts as a single host. E.g. L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just 1 host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host). The link you give shows ONLY in-progress tasks; this link shows all tasks for that host: https://www.cpdn.org/results.php?hostid=1535374 There is a server setting that doesn't allow multiple clients. The way you have your 2 drives set up means BOINC sees them as the same, so when you swap them over the tasks will get abandoned, as shown in your full list. This should work - it's equivalent to running two clients on the same machine, and just shutting one down whilst the 2nd drive is in the machine. It's perfectly possible to run 2 clients on the same host for CPDN (I do it), but there must be two separate client IDs. CPDN's server does not see the MAC address, only your external (router) IP. To swap out the disks you'd need to have created a new client instance on the 2nd disk, whilst keeping the original client on the first disk without detaching from the project. That way, CPDN's server will see two clients, one for each disk, and that should work. If each disk's BOINC client data directory has the same client ID (check the 'client_state.xml' file) then I suspect you'll get the behaviour you describe. ID: 67792 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,551,831 RAC: 17,001	Message 67793 - Posted: 17 Jan 2023, 13:08:10 UTC - in response to Message 67791. All the tasks on your machine: https://www.cpdn.org/results.php?hostid=1535374 are showing as Abandoned on the 31st Dec, which is before the deadline, so I'm not sure what happened there. Or they failed because they hit their disk quota limit (this is probably because 'leave non-GPU in memory' was not enabled). I can't see any tasks that have worked on this host after listing several pages :( The disk limit error is what I'm working on improving now. To work around it, make sure 'leave non-GPU in memory' is enabled and try not to shut down the BOINC client too many times whilst the task is running (no more than 2). That should help. So I think those have already been lost. Thank you PDW, that is my point. There are no tasks in progress on this server. It is in jail so we only get 1 or 2 tasks per day. We complete them in around 15 hours, upload them and report. Yet this is showing 17 In-progress. Thank you also for confirming there is some kind of lock. I guessed there was, and that it was using the MAC address and/or the local IP. Obviously I am not sure what to do about the missing tasks now. I have run 300 vcores on CPDN since the OpenIFS large batch(es) were issued a couple of weeks ago. It's going to be a hard and bitter pill to swallow if it turns out that was all for nothing and wasted. ID: 67793 · Reply Quote

ncoded.com Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038	Message 67794 - Posted: 17 Jan 2023, 13:11:15 UTC Last modified: 17 Jan 2023, 13:22:35 UTC I have to wonder why I am bothering to upload 500GB worth of uploads, if they are going to just be abandoned as soon as they get reported. ID: 67794 · Reply Quote

PDW Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,020,070 RAC: 84,566	Message 67795 - Posted: 17 Jan 2023, 13:17:51 UTC - in response to Message 67792. I think the problem is that CPDN is treating 2 hosts as a single host. E.g. L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just 1 host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host). The link you give shows ONLY in-progress tasks; this link shows all tasks for that host: https://www.cpdn.org/results.php?hostid=1535374 There is a server setting that doesn't allow multiple clients. The way you have your 2 drives set up means BOINC sees them as the same, so when you swap them over the tasks will get abandoned, as shown in your full list. This should work - it's equivalent to running two clients on the same machine, and just shutting one down whilst the 2nd drive is in the machine. It's perfectly possible to run 2 clients on the same host for CPDN (I do it), but there must be two separate client IDs. CPDN's server does not see the MAC address, only your external (router) IP. To swap out the disks you'd need to have created a new client instance on the 2nd disk, whilst keeping the original client on the first disk without detaching from the project. That way, CPDN's server will see two clients, one for each disk, and that should work. If each disk's BOINC client data directory has the same client ID (check the 'client_state.xml' file) then I suspect you'll get the behaviour you describe. I didn't know how CPDN was using the setting; I do know there is one, and I wasn't going to try running multiple clients to test it before posting. As I said, "The way you have your 2 drives setup means BOINC sees them as the same", so ncoded could change their setup to make it work if you say it works for you. ID: 67795 · Reply Quote

ncoded.com Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038	Message 67796 - Posted: 17 Jan 2023, 13:26:20 UTC Last modified: 17 Jan 2023, 13:28:57 UTC I have to wonder why I am bothering to upload 500GB worth of uploads, if they are going to just be abandoned as soon as they get reported. Also I am not sure if people realise what I am saying here. ANY task I crunch on this server now will just disappear and get stuck In-progress, even after it gets uploaded and reported. ID: 67796 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969	Message 67797 - Posted: 17 Jan 2023, 13:29:08 UTC The actual field you'll be looking for is <hostid>. You have a different one for each project: make sure you look at the right one. This is a known, and deliberate, design feature in BOINC. It's more commonly encountered when people clone an existing BOINC installation to a new machine, either because they didn't know how to do it safely, or because they want one hostid to rack up all the points from several separate bits of hardware. The latter would be regarded as cheating, and is discouraged. The best way, as Glenn has described, is to enable the 'Allow multiple clients' flags on both client instances, and keep both attached and visible to the project. BOINC should keep both hostids separate, and allow both to communicate (but do check that has worked properly). If you have run two separate clones of the same hostid, perhaps because of the restriction on fetching new work while uploads are stuck, you can retrieve the situation with care. 1) Let the currently-active instance complete all outstanding tasks, and report them. Shut down that instance completely, so it doesn't contact the server again. 2) Before you do anything else, look on this website for the details of the computer, and find the line "Number of times client has contacted server". Make a note of that number. 3) Before you start the second instance, look for the tag <rpc_seqno> in the second client_state.xml file. Edit the number to be one greater than the one you just noted. Save the file, making sure you don't change the file type from plain text. It should now be safe to re-start the second instance, without causing the associated tasks to be abandoned. ID: 67797 · Reply Quote
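The check and edit described above (find the right <hostid>, then set <rpc_seqno> to one more than the website's "Number of times client has contacted server") can be sketched in Python. This is only an illustration under assumptions: it assumes each project sits in a <project> block containing <master_url>, <hostid> and <rpc_seqno>, the sample XML and its URLs and numbers are made up, and the function names are hypothetical. It works on an in-memory sample rather than a real client_state.xml; for the real file, stop the client first and keep a backup copy.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for client_state.xml (assumed layout, not a real file).
SAMPLE_CLIENT_STATE = """<client_state>
  <project>
    <master_url>https://example.invalid/project_a/</master_url>
    <hostid>111</hostid>
    <rpc_seqno>40</rpc_seqno>
  </project>
  <project>
    <master_url>https://example.invalid/project_b/</master_url>
    <hostid>222</hostid>
    <rpc_seqno>12</rpc_seqno>
  </project>
</client_state>"""


def list_projects(root):
    """Return (master_url, hostid, rpc_seqno) for every <project> block."""
    rows = []
    for project in root.iter("project"):
        rows.append((
            project.findtext("master_url", "").strip(),
            project.findtext("hostid", "").strip(),
            project.findtext("rpc_seqno", "").strip(),
        ))
    return rows


def set_rpc_seqno(root, master_url, server_contact_count):
    """Set <rpc_seqno> for one project to server_contact_count + 1."""
    for project in root.iter("project"):
        if project.findtext("master_url", "").strip() == master_url:
            node = project.find("rpc_seqno")
            if node is None:
                raise ValueError("project has no <rpc_seqno> tag")
            node.text = str(server_contact_count + 1)
            return int(node.text)
    raise ValueError("no project with that master_url")


if __name__ == "__main__":
    state = ET.fromstring(SAMPLE_CLIENT_STATE)
    for url, hostid, seqno in list_projects(state):
        print(url, "hostid:", hostid, "rpc_seqno:", seqno)
    # 57 stands in for the number noted from the website in step 2.
    print("new rpc_seqno:", set_rpc_seqno(state, "https://example.invalid/project_b/", 57))
```

Listing every project first makes it harder to edit the wrong one, since each project has its own hostid and its own sequence number.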

ncoded.com Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038	Message 67798 - Posted: 17 Jan 2023, 13:35:14 UTC Last modified: 17 Jan 2023, 13:51:41 UTC Okay, the easy solution is just to remove CPDN from every device. Do you want me to abort all the uploads? Or do nothing and just remove the project? Or let the uploads complete, and then remove the project? I will let any running tasks complete before removing anything. Let me know if you have a specific preference, Glenn et al. ID: 67798 · Reply Quote

Yeti Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,974,041 RAC: 39,376	Message 67800 - Posted: 17 Jan 2023, 13:56:56 UTC It is not easy to run more than one instance of BOINC on the same hardware. If you create a new instance, the server checks whether it has already seen this machine before; this is done by name and IP address. As they are the same, the server assumes that you lost your last instance and cancels all previously assigned tasks. You may upload already crunched results, but as the server has already cancelled these tasks, it cannot use your uploads. In the latest BOINC versions you can set an instance name in cc_config.xml to prevent the server from assigning the old ID to a new instance: <device_name>HereTheNameForTheNewInstance</device_name> What to do now? I would cancel both instances, remove them, delete them and then set up the first one. Then create a second one, with a different name via cc_config.xml and with a different directory than Instance 1. Before you start time-consuming crunching, check that the server has recognized both instances as separate machines. Supporting BOINC, a great concept ! ID: 67800 · Reply Quote

ncoded.com Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038	Message 67801 - Posted: 17 Jan 2023, 14:09:03 UTC - in response to Message 67800. Last modified: 17 Jan 2023, 14:17:47 UTC The thing is, none of this problem is about BOINC instances. All I did in this case was buy a new drive so I could continue crunching for cpdn as the old drive was full of uploads. I then swapped the drives over, and did a fresh install of Ubuntu and BOINC on the new drive. As that is now causing loads of problems then clearly CPDN is not the right project for us at this time. ID: 67801 · Reply Quote

PDW Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,020,070 RAC: 84,566	Message 67802 - Posted: 17 Jan 2023, 14:17:36 UTC - in response to Message 67801. As you didn't make an effort to make the second OS drive look different from the first, when you installed BOINC it came up with the same (or possibly a very similar) identifier for that new host. When the host was attached to CPDN it was recognised as the same host that you had been using, resulting in abandonment of the old results. It is much like running multiple clients on the same disk without using the allow_multiple_clients flag in cc_config.xml. ID: 67802 · Reply Quote
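For anyone wanting to confirm how an instance is configured before attaching it, here is a small Python sketch that reads the two cc_config.xml settings mentioned in this thread, allow_multiple_clients and device_name. It assumes both sit inside an <options> block under <cc_config>; the sample contents and the function name read_instance_options are made up for illustration, and the sketch parses a string rather than a real file.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for one instance's cc_config.xml (assumed layout).
SAMPLE_CC_CONFIG = """<cc_config>
  <options>
    <allow_multiple_clients>1</allow_multiple_clients>
    <device_name>L-7113-2</device_name>
  </options>
</cc_config>"""


def read_instance_options(xml_text):
    """Report the multiple-clients flag and the device name, if present."""
    root = ET.fromstring(xml_text)
    options = root.find("options")
    if options is None:
        return {"allow_multiple_clients": False, "device_name": None}
    flag = (options.findtext("allow_multiple_clients") or "0").strip()
    name = options.findtext("device_name")
    return {
        "allow_multiple_clients": flag == "1",
        "device_name": name.strip() if name else None,
    }


if __name__ == "__main__":
    print(read_instance_options(SAMPLE_CC_CONFIG))
```

Running this against the config of each drive or instance, and comparing the device names, is a cheap check to make before any long tasks are started.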

ncoded.com Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038	Message 67803 - Posted: 17 Jan 2023, 14:22:29 UTC Last modified: 17 Jan 2023, 14:58:57 UTC Okay thanks ID: 67803 · Reply Quote

Boone Send message Joined: 8 Aug 05 Posts: 3 Credit: 13,689,587 RAC: 4,982	Message 67811 - Posted: 17 Jan 2023, 17:11:08 UTC - in response to Message 67745. Hi, I would like to inform you that all my WUs have been uploaded, so far 88GB :-) I am glad that I made it in time. Thank you for making this possible. ID: 67811 · Reply Quote

gemini8 Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,493,589 RAC: 2,146	Message 67812 - Posted: 17 Jan 2023, 17:22:36 UTC ncoded.com: The point is that BOINC has this feature, which the CPDN project can't circumvent, and neither can any other project. You as a user could have, but you didn't know about this. This is quite a bitter pill, and I'm sorry you have to gulp it down. - - - - - - - - - - Greetings, Jens ID: 67812 · Reply Quote

wujj123456 Send message Joined: 14 Sep 08 Posts: 127 Credit: 42,285,401 RAC: 73,468	Message 67813 - Posted: 17 Jan 2023, 17:22:46 UTC - in response to Message 67785. Last modified: 17 Jan 2023, 17:30:01 UTC I'd argue against doing this or that there's even a need. It seems to me that BOINC upload is for the most part a background process that does its job relatively well. Pretty much the only times uploading generates user complaints are when upload servers aren't working right. The length of this upload outage is rather unique but even so the progress has been very good so far. Even the users who've had a hard time getting a connection slot are starting to report completed uploads. Even though CPDN put in a due date grace period I suspect it'll hardly be needed, which I believe was also Glenn's position in an earlier post. Unfortunately, not everyone's upload is that fast relative to their compute, and I will very likely need the grace period. I've had a good connection for the past day and a half. So far, I've uploaded around 60, with 170 pending. I have a few WUs due in 2-3 days and they seem determined to be the very last to go. If the BOINC client ordered uploads properly, they would all have been reported by now, removing the need for extending the grace period. On the other hand, I don't agree this should require user intervention either. The BOINC client should simply order this correctly by itself, just like how it prioritizes compute deadlines. After all, the goal that matters is to get the WUs reported by the deadline, and upload is part of the process to get there. Edit: Thinking more, I realized it's possible BOINC would order this properly as the deadline approaches, just like how it does with compute. Perhaps I will learn whether that's the case in two days... ID: 67813 · Reply Quote

xii5ku Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,311,890 RAC: 633	Message 67815 - Posted: 17 Jan 2023, 19:31:30 UTC - in response to Message 67785. Last modified: 17 Jan 2023, 19:37:00 UTC AndreyOR wrote: It seems to me that BOINC upload is for the most part a background process that does its job relatively well. Pretty much the only times uploading generates user complaints are when upload servers aren't working right. The length of this upload outage is rather unique but even so the progress has been very good so far. It's not only the length of the server outage which is a one-off edge case here. The extreme ratio of result data size to CPU time is also unique. AFAIK, it's very unlike any of the currently active projects. (And it's atypical for Distributed Computing, which requires client-server communications to be minimal in order to be effective. Client-server bandwidth and latency in Distributed Computing are, obviously, worlds apart from those of an HPC cluster.) (On a positive note, both the result data size and the CPU time of oifs_43r3_ps tasks are very predictable, making it easy for users to control their output accordingly, if they care.) ID: 67815 · Reply Quote

[SG]Felix Send message Joined: 4 Oct 15 Posts: 34 Credit: 9,075,151 RAC: 374	Message 67821 - Posted: 17 Jan 2023, 21:12:37 UTC - in response to Message 67813. On the other hand, I don't agree this should require user intervention either. The BOINC client should simply order this correctly by itself, just like how it prioritizes compute deadlines. After all, the goal that matters is to get the WUs reported by the deadline, and upload is part of the process to get there. Edit: Thinking more, I realized it's possible BOINC would order this properly as the deadline approaches, just like how it does with compute. Perhaps I will learn whether that's the case in two days... As far as I could observe on my machine, the zips were uploaded in the order in which they were created. If one failed, it was simply retried at the end, as if it had been put at the end of the queue again. If many failed, they were retried in the order in which they failed. So it should work out that the oldest WUs are uploaded first. Greets Felix ID: 67821 · Reply Quote
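The behaviour described above (oldest first, with a failed upload going to the back of the queue) can be modelled with a toy Python sketch. This is only a model of what was observed on one machine, not the BOINC client's actual transfer logic; the function name upload_order and the file names are hypothetical, and each listed failure is assumed to fail exactly once.

```python
from collections import deque


def upload_order(files, fails_once):
    """Return the order in which uploads complete.

    files: names in creation order (oldest first).
    fails_once: names whose first attempt fails; they are re-queued at the end.
    """
    queue = deque(files)
    pending_failures = set(fails_once)
    completed = []
    while queue:
        name = queue.popleft()
        if name in pending_failures:
            # First attempt fails: retry later, after everything already queued.
            pending_failures.discard(name)
            queue.append(name)
        else:
            completed.append(name)
    return completed


if __name__ == "__main__":
    print(upload_order(["wu1.zip", "wu2.zip", "wu3.zip", "wu4.zip"], {"wu2.zip"}))
```

In this model a file that never fails keeps its creation-order position, while one failure is enough to push an old file behind every newer file already waiting, which would fit the earlier report of near-deadline WUs being among the last to go.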