CPDN under Wine: not getting new tasks

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 53433 - Posted: 15 Feb 2016, 10:21:55 UTC
BOINC was working under Wine (Ubuntu 15.04), and 4 CPDN tasks were running OK. All 4 wah2 WUs are nearly complete (>95%). Unfortunately, something happened earlier today while I wasn't around, apparently causing a reboot. After logging in and restarting Windows BOINC, trickles for all 4 tasks were uploaded immediately. However, the trickles haven't shown up yet on the task details pages, where they were previously listed a couple of hours after upload. Also, boincmgr didn't request new tasks and none were received. In the past, as tasks approached completion, new ones were lined up to start running, but that hasn't happened this time even though work is available.
I'm wondering if the reboot (whatever the reason) messed things up. Indeed, running tasks under Wine feels riskier than on a "real" Windows platform. On the good side, none of the tasks failed and everything looks on track for them to finish satisfactorily. It would be nice to get new work rather than have the machine sit idle.
Maybe the issues will be self-correcting. I requested a project update but that didn't do a whole lot. Anything else worth doing?

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948	Message 53434 - Posted: 15 Feb 2016, 11:16:54 UTC - in response to Message 53433.
With regard to requesting new tasks when there is space: the default amount of time that new tasks are expected to fill is ridiculously small for CPDN. I quickly changed it to the maximum of ten days plus an additional ten, which sorted that out. I have also noticed trickles sometimes taking longer to show up.
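
Editor's sketch of the setting described above. As far as I know, the two values correspond to the `work_buf_min_days` and `work_buf_additional_days` preferences, which can be overridden locally in a `global_prefs_override.xml` file in the BOINC data directory. The helper name `build_override` is made up for illustration, and where the data directory lives under Wine is not something the thread states, so the snippet only builds the XML text rather than writing it anywhere.

```python
import xml.etree.ElementTree as ET


def build_override(min_days: float, additional_days: float) -> str:
    """Build the text of a local preferences override (hypothetical helper).

    Only the two work-buffer settings are included; any other preference
    keeps whatever value the client already has.
    """
    root = ET.Element("global_preferences")
    # "Store at least N days of work"
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:g}"
    # "Store up to an additional N days of work"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:g}"
    return ET.tostring(root, encoding="unicode")


# Ten days plus an additional ten, as in the post above.
override_xml = build_override(10, 10)
print(override_xml)
```

The client has to be told to re-read the file after it changes (the Manager has a menu entry for reading the local prefs file), so treat this as a starting point rather than a recipe.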

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 53436 - Posted: 15 Feb 2016, 21:13:21 UTC - in response to Message 53434.
BOINC is showing notices saying the new tasks need more disk space than it thinks is available. Precisely:
"UK Met Office HadAM3P-HadRM3P Australia New Zealand needs 840.86MB more disk space. You currently have 1066.49 MB available and it needs 1907.35 MB."
That doesn't make sense to me, since the OS reports 672GB available on the drive "c:" as Wine knows it. It must be something in the Wine config, but it's not clear what the problem is. Anyway, if anyone has a clue I'd appreciate the info.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53437 - Posted: 15 Feb 2016, 21:33:20 UTC - in response to Message 53436.
The "usual suspect" here is crashed models, which always leave bits of files lying around. I'm not sure what else there is. And I don't know what effect leftovers on the Linux system have on the Wine/Windows system. It could be a reporting problem with the OS when running Wine.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 53438 - Posted: 15 Feb 2016, 23:26:43 UTC
Look at the Disk tab in BOINC Manager and see what it's reporting for "used by boinc", "free, available to boinc" and "free, not available to boinc". You may need to change some of the options under the Tools menu (Computing preferences, Disk and memory usage, Disk usage) so that more of the C: drive is available to BOINC.
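
Editor's sketch of why a nearly empty drive can still be reported as "not enough disk space". My understanding is that the client takes the smallest of the three disk limits in the preferences (an absolute cap in GB, a percentage of the whole disk, and a minimum amount to leave free) and subtracts what BOINC already uses. The function name and the sample values are illustrative assumptions, not a dump of the poster's actual settings.

```python
def boinc_disk_allowance(total_gb, free_gb, boinc_used_gb,
                         max_used_gb, min_free_gb, max_used_pct):
    """Return (allowed_gb, available_gb) under three disk limits.

    Hypothetical helper: the most restrictive limit wins, so a small
    absolute cap hides any amount of free space on the drive.
    """
    limits = [
        max_used_gb,                            # "use no more than X GB"
        total_gb * max_used_pct / 100.0,        # "use no more than Z% of total"
        free_gb + boinc_used_gb - min_free_gb,  # "leave at least Y GB free"
    ]
    allowed = max(0.0, min(limits))
    available = max(0.0, allowed - boinc_used_gb)
    return allowed, available


# Illustrative values only: a large, mostly free drive with a 10 GB cap.
allowed, available = boinc_disk_allowance(
    total_gb=900.0, free_gb=672.0, boinc_used_gb=3.23,
    max_used_gb=10.0, min_free_gb=0.1, max_used_pct=90.0,
)
print(f"allowed {allowed:.2f} GB, still available {available:.2f} GB")

# The shortfall quoted in the notice is simply needed minus available (MB).
shortfall_mb = round(1907.35 - 1066.49, 2)
print(f"shortfall {shortfall_mb} MB")
```

With these numbers the 10 GB cap is the binding limit, so raising only the percentage setting would change nothing.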

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 53439 - Posted: 16 Feb 2016, 1:05:35 UTC Last modified: 16 Feb 2016, 1:34:52 UTC
Wow. Do something else for a couple of hours, come back and everything's changed! In the interim, 6 WUs were downloaded and errored out. The error was "file not found", e.g., wah2_sas50_fe5a_201412_13_341_010314204_2_[1..14].zip (for WU 10314204). Seems like we've seen this before. However, three fresh tasks were sent which are running now. Not sure what accounts for the difference; they're all wah2 tasks. Maybe it would be useful for someone more familiar with the programs to take a look at computer 139186 (Win10) to assess the errors.
The problem I reported earlier is still a mystery to me. According to BOINC, 10GB is the total disk allocation, of which 3.23GB is in use, leaving 6.77GB available. Obviously there was and is ample disk space. There are some crashed tasks now, but there weren't any earlier, so that can probably be ruled out as a cause of the problem.
I do appreciate the responses to my query! The new developments above raise a bunch of interesting questions about what's going on, and what to do to help things run more smoothly.
Edit: I spoke too soon. On closer inspection it appears the tasks that seemed to be running actually terminated due to errors and were replaced by tasks which also had errors, etc. I may have to stop accepting downloads for a while, until this situation is clarified.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 53440 - Posted: 16 Feb 2016, 1:45:46 UTC - in response to Message 53439.
It looks like all those SAS tasks are erroring on all the other PCs too, so it's likely a problem with the batch 341 setup.

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 53441 - Posted: 16 Feb 2016, 4:58:07 UTC - in response to Message 53440.
According to the "Server Status" page, there are 15704 wah2 tasks ready to send. Are all of these afflicted with "batch 341 setup"? If not all, what proportion? As it is, I saw no point in continuing to download only to have tasks immediately error out. Of course I'd like to have my computer get back to work, but it's hard to know when doing that will be "safe".
Seems like a problem that could cause a lot of consternation for participants. I'm guessing how quickly it's resolved could depend on the size of the bad batch. I imagine that if it's not too large, it could be easier to remove the error-causing WUs from the ready-to-send list, and task startup would go back to normal that much sooner. Resolving the problem, however it's done, will be a good thing.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53442 - Posted: 16 Feb 2016, 6:06:10 UTC - in response to Message 53441.
"Resolving the problem, however it's done, will be a good thing."
In hand. Time is needed for things like this, and one of the servers will be down in a few hours as per "News". As for the "number waiting to be sent", keep in mind that there's a lot happening these days: several production batches, as well as any test models. (Front page, right hand side.) New year, new approach to how things are done. :)

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 53443 - Posted: 16 Feb 2016, 6:58:17 UTC - in response to Message 53442.
Thanks for your reply. In the morning (local time) the servers should be cleaned up and back online, so I can give it another try...

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 53444 - Posted: 16 Feb 2016, 7:05:15 UTC - in response to Message 53441.
I sent an e-mail to the CPDN tech people about the apparently bad SAS batch. Unfortunately, WAH2 now contains different areas/experiments, and it's not possible to tell from the server page how many of a given batch are left in the queue.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948	Message 53445 - Posted: 16 Feb 2016, 8:03:53 UTC
"and it's not possible to tell from the server page how many of a given batch are left in the queue."
Peeking around, it looks like over 80% are still unsent.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53446 - Posted: 16 Feb 2016, 8:24:52 UTC
It's all a matter of being patient for a few hours. And going Ommmmmmm.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948	Message 53447 - Posted: 16 Feb 2016, 9:06:40 UTC - in response to Message 53446.
The batch 341 tasks are being taken out of the queue to see what's wrong.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948	Message 53448 - Posted: 16 Feb 2016, 11:38:44 UTC
"The batch 341 tasks are being taken out of the queue to see what's wrong."
They still seem to be there, but the batch size was only 7,200, so about half of those on the server are from other batches.

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 53449 - Posted: 16 Feb 2016, 18:09:04 UTC - in response to Message 53448.
Just checked the Server Status page; it still gives the wah2 "ready to send" number as 15,274, same as before. It's not clear whether the work of removing the error-prone WUs is complete and it's OK to resume downloading tasks.
As it happens I need to do some updating on my computer, so I'll take advantage of this hiatus to do that and check back later. I'm thinking the number of available tasks will be <15274 after sorting out good vs. bad batches.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53454 - Posted: 16 Feb 2016, 21:45:57 UTC - in response to Message 53449.
What's on the Server Status page doesn't reflect what's going on behind the scenes. It could be that the relevant researcher was notified at the same time as the faulty models were being removed, found the problem, and issued a new batch. Or perhaps the percentage of faulty models was small.
The best way to test for what's available (and "good") is to set your computer to NNW (No New Work) and/or set Network to Off, then set your prefs for 1 processor, update your client, and then only allow new work (and/or network access) once every X hours (perhaps 6 hours) and see what you get.
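
Editor's sketch of the "let a little work in, then close the door again" routine described above, driven through `boinccmd` project operations (`allowmorework`, `update`, `nomorework`). The function name `probe_once`, the two-minute settle time and the example URL are assumptions for illustration; the path to `boinccmd` under Wine is not given in the thread, and the 1-processor preference is not handled here and would still be set in the computing preferences.

```python
import subprocess
import time


def probe_once(project_url, run=subprocess.run, sleep=time.sleep,
               settle_seconds=120, boinccmd="boinccmd"):
    """Open the door to new work briefly, then close it again.

    `run` and `sleep` are injectable so the sequence can be exercised
    without a real BOINC client (hypothetical design, not from the thread).
    """
    base = [boinccmd, "--project", project_url]
    run(base + ["allowmorework"], check=True)  # lift "no new tasks"
    run(base + ["update"], check=True)         # ask the client to contact the scheduler
    sleep(settle_seconds)                      # give the request time to complete
    run(base + ["nomorework"], check=True)     # back to "no new tasks"


if __name__ == "__main__":
    # Assumed project URL; check the one shown in your own BOINC Manager.
    url = "http://climateprediction.net/"
    while True:
        probe_once(url)
        time.sleep(6 * 3600)  # "perhaps 6 hours", as suggested above
```

Pausing between `update` and `nomorework` matters because the update is only a request; switching new work off immediately could block the very fetch being tested.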

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53456 - Posted: 16 Feb 2016, 22:59:31 UTC
Annndddd SS now says 9,336 ready to send.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53457 - Posted: 17 Feb 2016, 0:50:24 UTC
"The error was "file not found", e.g., wah2_sas50_fe5a_201412_13_341_010314204_2_[1..14].zip (for WU 10314204). Seems like we've seen this before."
You'll see that "error" every time a model fails, because it's NOT an error; it's a BOINC information message, to tell the user that it couldn't find one or more of the upload zip files. And how could it find the file(s), when the model never got far enough to create it/them?

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 53459 - Posted: 17 Feb 2016, 6:08:30 UTC - in response to Message 53457.
"...it's NOT an error; it's a BOINC information message..."
Indeed, a closer look shows a dozen tasks exited on SIGSEGV. It's a familiar result of mistakes I've made, like dereferencing a NULL pointer, calling free() twice on the same pointer, or some similar bug. I understand what you mean re: "file not found" messages. Taking too quick a glance at stderr, my attention was drawn to the prominent and repeated "file xfer error" rather than the "Signal 11" that's kind of buried in the stuff at the top.
More relevantly, a few hours ago BOINC downloaded 4 new tasks; unfortunately one promptly crashed with the above error. After rebooting, it turns out it requires just the right magical incantation to get Wine and BOINC going correctly. Once that got sorted out, 3 wah2 tasks seem to be running fine.
I'm now waiting for a fourth task to land. However, when requesting a project update, I get messages like "... Not sending work - last request too recent: 3174 sec". I take it an interval needs to expire, and that 3174 sec is too soon. It's not clear how long a delay is enough or how/where the interval is set. Also, does the "timer" reset with each attempt to update the project? Eventually I'll get these things figured out.
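
Editor's sketch of the stderr triage discussed in the last two posts: look for the crash signal first and treat the repeated upload messages as follow-on noise. The marker strings ("Signal 11", "SIGSEGV", "file xfer error", "file not found") are taken from the posts above; the function name and the sample text are made up for illustration and are not a real task log.

```python
import re

# Lines that point at the actual failure.
CRASH_PATTERNS = [re.compile(r"signal\s+11", re.IGNORECASE),
                  re.compile(r"SIGSEGV")]
# Lines that only report missing upload files after the model has died.
FOLLOW_ON_PATTERNS = [re.compile(r"file xfer error", re.IGNORECASE),
                      re.compile(r"file not found", re.IGNORECASE)]


def triage_stderr(stderr_text):
    """Split stderr lines into likely root cause vs. follow-on messages."""
    result = {"root_cause": [], "follow_on": []}
    for line in stderr_text.splitlines():
        line = line.strip()
        if any(p.search(line) for p in CRASH_PATTERNS):
            result["root_cause"].append(line)
        elif any(p.search(line) for p in FOLLOW_ON_PATTERNS):
            result["follow_on"].append(line)
    return result


# Invented sample, shaped like the description above rather than copied from a log.
sample = """\
model startup
Signal 11 received
upload 1: file xfer error
upload 2: file xfer error
"""
report = triage_stderr(sample)
print("root cause:", report["root_cause"])
print("follow-on messages:", len(report["follow_on"]))
```

Checking the crash patterns before the upload patterns mirrors the advice in the thread: the transfer messages are a consequence of the failure, not its cause.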