Message boards : Number crunching : Error on File Upload
Author | Message |
---|---|
Joined: 9 Jan 07 Posts: 497 Credit: 342,899 RAC: 0 |
MikeMarsUK wrote
Umm, since it's already "towards the end of the week", I take it this means next week, Mike? Visit the Scotland team |
Joined: 23 Nov 05 Posts: 18 Credit: 407,491 RAC: 0 |
Help with this error: should I wait for the server upgrade, or is this another problem? Thanks, DP
4/25/2007 10:21:38 PM|climateprediction.net|[error] Error on file upload: can't write file /home/boinc/data/hadcm3pbb_bsk3_05824302_0_2.zip: No space left on device |
Joined: 9 Jan 07 Posts: 497 Credit: 342,899 RAC: 0 |
This is the same problem we've all got, DP :-( We can send trickles and get credits but not upload the 10-yearly data file. Visit the Scotland team |
Joined: 5 Aug 04 Posts: 250 Credit: 93,274 RAC: 0 |
Perhaps the NEWS thread at the top should be flashing and have animated arrows pointing to it, and even then people would miss it. ;-) Jord. |
Joined: 28 Aug 04 Posts: 42 Credit: 1,443,857 RAC: 0 |
What exactly is the long-term issue if a user does nothing? I can see this could be a pain for dialup users, but I see no impact for non-dialup. It seems there might be log messages saying that the upload failed, but won't things recover cleanly when the servers are upgraded? (There might be multiple uploads, I understand that.) If stuff is still crunching correctly, why do anything? |
Joined: 28 Aug 04 Posts: 42 Credit: 1,443,857 RAC: 0 |
If people already have the failing upload problem, then it's too late to do much about it. I sure hope you meant to say "resume" and not "restart"! |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are 2 issues:
1) People not familiar with these error messages will see all of the repeated messages and worry/panic. (See all recent threads in all areas of these boards.)
2) People who have been doing nothing for several days before posting here are in danger of losing their 10 year zip before the new server is in place. This has already happened to at least one person.
***********
Restart / Resume / Get-it-going-again, or whatever wording is used in whichever version of BOINC people are using. The BOINC people keep changing the wording, and I've given up trying to keep up with it. And for those who are "a bit geeky", it's apparently possible to edit an xml file and alter some slot files, to extend the 14 day time limit. But all of the advice here assumes that people actually look at the boards regularly to find out about what is happening, which a lot don't. |
Joined: 28 Aug 04 Posts: 42 Credit: 1,443,857 RAC: 0 |
There are 2 issues: But again, what is the result? Will the WU crash and burn? Will the next 10 year zip fill in the missing data? As you said, most folks don't check here, so will there be major confusion when all kinds of WUs "crap out"? I'm still not clear why it's not just recommended to "sit back and don't worry". Isn't that the way BOINC was designed to work when a project has problems or is down?
There's a big difference - in the common terms of everyday people. "Restart" is "start again, from the beginning". "Resume" is "pick up from where you left off". "Pause/resume" is a function that most know from their DVD/VCR/TIVO... "restart" means watching the recording from the beginning. |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
I run a number of projects at the same time (about 6), so stopping network activity is not really an option. I have suspended the CPDN task AND the CPDN project, and this seems to have stopped my computer trying to upload, as you get a 'communication failed' error. It still keeps counting down in the transfer box but can't send. This will have to do for now. EDIT: Although it stopped the first attempt, it did not stop the second, and the file has tried to upload again despite being suspended. Some of my projects have short deadlines of a week, so I am not too keen on stopping network activity. I also don't sit at the computers all the time. |
Joined: 28 Aug 04 Posts: 42 Credit: 1,443,857 RAC: 0 |
Ok Les, if there's a real problem if the 10 year zip gets lost (and that's still a big question to me!), why not post the info on how the xml might be modified to extend the limit? (what needs to be changed and where...) "suspend everything CPDN and network access, and wait" seems to be what you're saying, but that means no other projects can "phone home" to request work or report results... As you are a Moderator here, I'd really hoped your posts were informational..... |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Azwoody
Trickles are getting through to cpdn in Oxford. The problem is only with the 10-year zip uploads. These appear in the Transfers tab of boinc manager at the beginning of December of every model year ending in 0, e.g. 1980, 2030. During all the other model years, as long as there's no cpdn zip file waiting in the Transfers tab, we can allow network activity. But as our models reach a year ending in 0, Les suggests for multi-project crunchers:
For those APPROACHING a 10 year point in their crunching, I'd recommend that you:
1) Make sure that the project is set to "No new tasks" in the Projects tab
2) Suspend the model in the Tasks tab
3) Wait until the problem is resolved
4) Then restart the model
Doing things this way allows other projects to crunch and contact their server. It avoids the zip file problem by stopping the climate model before this file is produced.
However, single-project cpdn crunchers can, if they wish, allow the 10-year file to be produced but avoid problems by suspending network activity before the file tries to upload, i.e. before December of any year ending in 0.
Anyone (single or multi-project) with a zip file already waiting in the Transfers tab to upload (whether it's already produced upload error messages or not) should suspend network activity until the problem is solved. But the model may continue running. This should keep most computers busy until the extra space becomes available, possibly next Thursday. Workunits from other projects could be suspended to keep the computer busy with the climate model.
What we must try to avoid is multiple attempts to upload the same zip file. If we can avoid ALL attempts to upload them, that's better still. Every failed upload puts the zip file at risk. A 10-year zip file can lie in the Transfers tab for up to two weeks after the first attempted upload and still be accepted by the server. If no attempt is made to upload zip files, I think they will be accepted up to 6 weeks later.
If we can avoid editing the xml files, this is preferable. Anyone with a model reaching 2080 should suspend it before the end of that year, following Les's instructions quoted above. If this is done, network activity can be allowed. The computer could be kept busy by attaching to another project and crunching something different for a few days for variety. Cpdn news |
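The timing rules in the post above can be sketched as a small calculation. This is only an illustration: the function names and the `GRACE_AFTER_FIRST_ATTEMPT` constant are hypothetical, the 14-day figure is the "two weeks" quoted in the post, and nothing here reads real BOINC state.

```python
from datetime import date, timedelta

# Assumption taken from the post: a zip is still accepted for up to
# two weeks after the first attempted upload.
GRACE_AFTER_FIRST_ATTEMPT = timedelta(days=14)


def produces_zip(model_year: int) -> bool:
    """True if the model year ends in 0 (e.g. 1980, 2030)."""
    return model_year % 10 == 0


def years_until_next_zip(model_year: int) -> int:
    """Whole model years left before the next year ending in 0."""
    return (-model_year) % 10


def last_safe_upload_day(first_attempt: date) -> date:
    """Last day the upload should still be accepted under the 14-day rule."""
    return first_attempt + GRACE_AFTER_FIRST_ATTEMPT


if __name__ == "__main__":
    print(produces_zip(2030))
    print(years_until_next_zip(2007))
    print(last_safe_upload_day(date(2007, 4, 25)))
```

A model currently crunching 2007, for instance, is three model years away from its next zip year (2010), which is the point before which the post suggests suspending it.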
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
azwoody
The zips are the accumulated result of the 10 years just crunched. If one is lost, it's not possible to get it back, and neither will it get recreated to be sent again later. Lost is lost! And I'd be interested in knowing how you would restart a model from the very beginning, if this is what you think I meant. And why everyone would assume that's what I meant. As for the 'edit fix', this is still being discussed at admin level on the php board. I don't feel inclined to tell everyone just yet, because several thousand are from the BBC project, and they aren't up to speed on what will be required. Have you ever visited there? It's sort of the "coffee shop", where most of us "hang out". Including the occasional visit from a project person. And do you make regular backups, as is recommended for this project? Doing this will ensure that you have a copy of the file(s), just in case. Backups: Here |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Les, isn't there one way to recreate a 10-year zip file in a worst-case scenario? Even if the model hasn't crashed, restore a backup made before the file creation point (December of a year ending in 0). For example, I have a model crunching 2007. I must be certain to back up the complete contents of the boinc folder before the model reaches Autumn 2010. Backup and restore instructions are available through my sig. Les's method there, item #1 in the README about avoiding crashes, is really easy to follow. Cpdn news |
Joined: 28 Aug 04 Posts: 42 Credit: 1,443,857 RAC: 0 |
Hi Azwoody Seems you also don't understand that "restart vs resume" is kind of bogus. I do understand, but it's bad info to others, IMHO!
But most won't check for a problem at CPDN until there is a zip file waiting to upload! You say I can't do any work on the machine, for any project that requires network access, until CPDN gets things fixed? ("suspend network activity") That's nuts, and not the way BOINC was designed! Seems it's a server problem! I've got a dual core machine, with zip files waiting to send, and only a cache of other projects for .75 days!
Why? - do you expect folks to keep an eye on BOINC 24/7?? Things should work unattended 24/7/365.25. If the zip gets lost, it's a server problem! Seems CPDN needs more help than just new HW!
So, how do "geeks" extend this????
Come on, get real... Most folks (like me) won't know there is a problem UNTIL there's a zip file stuck in the transfers tab! There's bogus code on the server - or within the CPDN app - in dealing with a problem like this... What needs to be hacked in the xml, so the WU I've got on one box that has been crunching for over a year, and with days to complete, won't get trashed? I think we need to hear from someone that really understands the code, and not a moderator, to chime in on this one... That way I can "resume" the discussion and not "restart" it! |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Yes. And those people who make backups would, I feel, know this. I did this myself years ago when a "Report" upload disappeared. I just ran it again from a recent backup from just before the completion, removed the trickle files, and let it Report again. Or something like that. I have a feeling that those people asking about the current problem, and who are making their first post, are probably also those who have never made a backup. My BBC model is now at the start of 1935, so about 4 days to go. Then it gets suspended. Backups: Here |
Joined: 28 Aug 04 Posts: 42 Credit: 1,443,857 RAC: 0 |
Les, isn't there one way to recreate a 10-year zip file in a worst-case scenario. Even if the model hasn't crashed, restore a backup made before the file creation point (Dec of year ending ...0). Guys... No backups for 99.9999999999999% of the folks here. They assume that with BOINC, a project will recover from its own problems, and not require crunchers to do a darn thing. So what can I hack in the XML to help the project with their problem? |
Joined: 28 Aug 04 Posts: 42 Credit: 1,443,857 RAC: 0 |
azwoody Some may see this as "de-attach" followed by "attach" (i.e. "restarting" the project). You and the other moderators seem as confused as the rest of us on what this will really do... I could "restart" a model by hacking the XML! |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Apparently it IS possible to recover from a deleted file if the 14 day limit is passed. But it requires that a backup is available containing the file so that it can be copied back, along with some other work.
***************
And to extend the deadline past 14 days, just increase the number-of-days limit in client_state.xml. Backups: Here |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Azwoody
There are a number of moderators trying to help members. We are making our posts as clear as we can. It doesn't help if you object to almost everything we say; it simply distracts members from the real problem. To clarify a few points:
*Our projects are BBC-CCE, cpdn, Rosetta, Einstein etc. Our tasks/workunits are the climate models.
*The problem does not lie in the code for boinc or the models. There is no 'bogus code'. The problem is the current lack of space on the cpdn servers. The new server has been delivered, but a lot of data must be moved from one server to another before the new server becomes functional. This takes time.
*Once members realise that the words scheduler and device mean 'the server in Oxford', we think the vast majority of them, including newbies, will easily understand the concept of the server being so full of data that it can accept no more.
*We know that a lot of members will only realise there's a problem when they see the boinc manager error messages and the zip file stuck in the Transfers window. This is why we're offering ideas for everybody, whether the error messages have already started or not.
*We know this is not a standard boinc situation. But boinc is specifically designed to allow flexible workarounds for this sort of situation.
*Regular visitors to the forum will know that for more than a year we have been suggesting backups as a means of recovering models from almost every type of crash and disaster. Many members are making regular backups. In the READMEs accessible through my signature, you will find that in the README about avoiding model crashes, item #1 by Les gives simple step-by-step backup and restore instructions suitable for newbies. For those who want a more sophisticated backup method, there's an entire README offering a selection.
*Members will find it much easier to a) make backups b) suspend network activity etc. until (probably) next Thursday than to start 'hacking' into files for whatever reason.
*We have suggested ways of keeping computers busy until the problem is solved, though this may not always be with the project of choice. We have ample evidence that the vast majority of our members are good-tempered and flexible enough to take this in their stride. If some computers have to stop crunching for a few days, we think most of their owners will welcome the opportunity to stop worrying about the messages/get out and do other things/play computer games/let the kids play games. Cpdn news |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,699,166 RAC: 9,972 |
Speaking personally, I'm afraid I won't be rewinding models back to a 14-day backup, and re-crunching ~60 model-years, if the servers aren't ready to receive data within 14 days of my first upload attempt. Also, I don't propose to suspend the models I'm running - these things take too darn long in the first place! <g> What I am prepared to do is to copy and preserve the upload files - which can be found in the '..\BOINC\projects\climateprediction.net' folder, not the individual model sub-folders - so that the data is safeguarded against the possibility that the new server doesn't get online in time and BOINC starts deleting them. In that event, I'm then prepared to deliver the data to Oxford by whatever mechanism you're prepared to accept it - email, ftp, CD-R or whatever. The files are all uniquely named (workunit_nn.zip), so an ad-hoc, carrier-pigeon style of delivery should be manageable. I would hope it would be possible to: a) set up an emergency, non-BOINC, data upload path b) write fairly simple instructions which the majority of users could follow. Some of us might also be competent/daring enough to edit client_state.xml and give ourselves an extended deadline, but I would not advise that as a general project policy. |
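The precaution described in the post above (copying the pending upload zips somewhere safe) might be scripted along these lines. This is a sketch under assumptions: the `backup_pending_zips` helper, the example paths, and the plain `*.zip` pattern are illustrative choices, not anything the project has published, and the folder should be adjusted to wherever your BOINC installation actually lives.

```python
import shutil
from pathlib import Path
from typing import List


def backup_pending_zips(project_dir: Path, backup_dir: Path) -> List[Path]:
    """Copy every .zip directly inside project_dir into backup_dir.

    Sub-folders (the individual model folders) are not searched.
    Files already backed up with the same size are skipped.
    Returns the list of copies made on this run.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(project_dir.glob("*.zip")):
        dst = backup_dir / src.name
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue  # already preserved on an earlier run
        shutil.copy2(src, dst)  # copy2 also keeps the file timestamps
        copied.append(dst)
    return copied


if __name__ == "__main__":
    # Hypothetical locations; substitute your own paths.
    project = Path("BOINC/projects/climateprediction.net")
    if project.is_dir():
        for path in backup_pending_zips(project, Path("cpdn_zip_backup")):
            print("saved", path)
```

Because the upload files are uniquely named, copying them flat into one backup folder should not cause collisions, and the script can be re-run after each new zip appears.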