Full run still in state unknown / new
Author | Message |
---|---|
Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
The "owner" of this problem doesn't speak English, so I am posting this here on his behalf. The WU in question has wuid=16080 with resultid=25931. The model crashed on Sep. 24th and was restored from a 6-day-old backup. CPDN continued working on it, but no credits were granted at first. After catching up those 6 days, the server started to count credits for the trickles again, received all the data, and the model now _looks_ as if it has been completed successfully. The graphs are there, and they look complete and quite flawless. So the server _could_ accept the model as a full run, but it seems to stick to the 57 trickles that it counted before the crash and is still waiting for the 15 that came after. I know that one isn't supposed to restore BOINC project data, but it seems to have worked in this case. Conclusion: either the server should not accept trickles that do not match the progress (backwards trickles), or it should try to accept a full run from its completed data instead of the trickle count. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Ananas, you got one right and one wrong. Quote: "So the server _could_ accept the model as a full run but it seems to stick to those 57 trickles that it counted before the crash, still waiting for the 15 that came after." Answer: this is indeed how it works. Quote: "I know that one isn't supposed to restore BOINC project data but it seemed to have worked in this case." Answer: restoring from a backup IS acceptable. Frequent backups are recommended if one has problems. Some people have to do a lot of restores and restarts to get a WU to finish. Tell your friend to persevere. He'll get there eventually. Trickles contain only five lines of data: a header, the WU name, the phase, the current ts, and the CPU time. They just tell the server that the WU is still being worked on. ALL the result data is returned when processing has finished. Having said that, about 330 MB of extra data per WU remains stored on your computer. This is partly because it's a lot to send back via dial-up lines, and partly because the scientists haven't yet decided how best to store huge amounts of data. But they're working on it. If it's a problem, copy it to a CD. Two WUs fit on one CD, but if even this is too much, delete it. Les |
Joined: 5 Aug 04 Posts: 426 Credit: 2,426,069 RAC: 0 |
Quote (Les): "restoring from a backup IS acceptable. Frequent backups are recommended if one has problems. Some people have to do a lot of restores/start-agains, to get a wu to finish." Restoring from backup is only acceptable with CPDN. With other BOINC projects it can cause problems, especially if any of the host data has changed (CPU speed, RAM, or just the string BOINC uses to identify the host). BOINC WIKI BOINCing since 2002/12/8 |
Joined: 5 Aug 04 Posts: 39 Credit: 87,633 RAC: 0 |
Quote: "Restoring from backup is only acceptable with CPDN. With other BOINC projects it can cause problems especially if any of the host data has changed (CPU speed, ram, or just the string BOINC uses to identify them)." OK, I have forwarded this so far. Thanks to both of you :-) But it still doesn't explain why the model is not a full run in this case: all three phases have been completed, so it should have "success" state now. Any ideas? It is not a question of credits. The question is rather: will the data be used regardless of the state in the database, which doesn't reflect the real progress? (Sorry, I used the wrong ID for this reply; it should be "Ananas".) |
Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
Quote: "but it seems to stick to those 57 trickles that it counted before the crash, still waiting for the 15 that came after." If you click on the 57 trickles, it actually shows all 24 phase 3 trickles, all 24 phase 2 trickles, and the last 9 trickles of phase 1. I have no idea what happened to the first 15 trickles. Many units do not say success even though they have run to the end. The most common cause is the _1.zip file, which was too big until the allowed size was increased. I am sure they will make use of such results rather than ignoring all runs that simply do not say success as the outcome. Visit BOINC WIKI for help And join BOINC Synergy for all the news in one place. |
Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
The result name is 015q_300026487_0, which indicates that the job was (as crandles suggests) one of those susceptible to the zip file size problem and processed with version 4.03. However, the reason the state doesn't indicate successful completion is that the model was restored after a crash. I restored <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13">resultid 13</a> from a 4-hour-old backup after I inadvertently managed to crash it, and reactivated it by hacking the client_state.xml file (I don't know how the CPU time is formatted, hence the very low average sec/TS for the last 9 trickles). It also has outcome unknown and client state new, and it stays that way because it's one of the 4.03 jobs that Carl invalidated on the server (purely to prevent them from being sent out again if they failed; the results are still valid). Incidentally, that WU also shows 73 trickles because the backup was taken just before a trickle point. As for the missing trickles at the start, my guess is that there was a host merge about that time (in addition to the one on 30 Sept). There used to be a problem with host merging that meant trickles on the old hostid wouldn't show up (something missing from the database tables, if my memory doesn't fail me). My hostid 20652 only shows 3 trickles for <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=5">resultid 5</a>, but I know the earlier trickles were done by hostid 5 (which suffered a hard disk failure and was merged with 20652 after the system was rebuilt). "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
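
For anyone who wants to check the trickle arithmetic in this thread, here is a rough sketch in Python. It is only an illustration, not how the CPDN server actually counts trickles. The figure of 24 trickles per phase and the three phases are taken from the posts above; the `(phase, index)` representation and the `summarize` helper are made up for this example.

```python
from collections import Counter

PHASES = 3               # the thread mentions three phases
TRICKLES_PER_PHASE = 24  # from "all 24 phase 3 trickles, all 24 phase 2 trickles"


def summarize(recorded):
    """Compare recorded trickles against the full expected set.

    recorded: iterable of (phase, index) pairs, with index counted
    1..TRICKLES_PER_PHASE inside each phase (a hypothetical representation).
    Returns (missing, duplicates): expected trickles that were never
    recorded, and trickles that were recorded more than once.
    """
    counts = Counter(recorded)
    expected = [(p, i)
                for p in range(1, PHASES + 1)
                for i in range(1, TRICKLES_PER_PHASE + 1)]
    missing = [t for t in expected if counts[t] == 0]
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    return missing, duplicates


# The case described above: phases 2 and 3 complete,
# but only the last 9 trickles of phase 1 are shown.
recorded = [(1, i) for i in range(16, 25)]
recorded += [(p, i) for p in (2, 3) for i in range(1, 25)]

missing, duplicates = summarize(recorded)
print(len(recorded), len(missing), duplicates)  # prints: 57 15 []
```

With these numbers, 57 recorded plus 15 missing gives the 72 trickles of a full run. A restore taken just before a trickle point would show up in `duplicates`, which is one way a run could end up listing 73 trickles.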
©2024 cpdn.org