Full run still in state unknown / new

Ananas
Volunteer moderator

Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 7370 - Posted: 18 Jan 2005, 22:43:27 UTC

The "owner" of this problem doesn't speak English, so I am posting this here on his behalf:

The WU in question has wuid=16080 and resultid=25931.

The model crashed on Sep. 24th and was restored from a six-day-old backup. CPDN continued working on it, but at first no credit was granted.

Once the model had caught up on those six days, the server started counting credit for the trickles again and received all the data, and the model now _looks_ as if it completed successfully. The graphs are there, and they look complete and quite flawless.

So the server _could_ accept the model as a full run but it seems to stick to those 57 trickles that it counted before the crash, still waiting for the 15 that came after.

I know that one isn't supposed to restore BOINC project data but it seemed to have worked in this case.

Conclusion: Either the server should not accept trickles that do not match the progress (backwards trickles), or it should try to accept a full run based on its completed data instead of the trickle count.
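
To make the first option concrete, here is a minimal sketch of how a server could flag trickles that go backwards in progress. This is not CPDN's actual server code: the names `Trickle` and `find_backward_trickles` are hypothetical, and it assumes that progress can be ordered by phase and then by timestep.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trickle:
    phase: int      # model phase (assumed to count upwards)
    timestep: int   # timestep reached within that phase


def find_backward_trickles(trickles: List[Trickle]) -> List[int]:
    """Return indices of trickles that do not advance past the furthest progress seen."""
    backward = []
    furthest = None
    for i, t in enumerate(trickles):
        progress = (t.phase, t.timestep)
        if furthest is not None and progress <= furthest:
            backward.append(i)      # repeated or older progress, e.g. after a restore
        else:
            furthest = progress     # new high-water mark
    return backward


# Made-up values: the third and fourth trickles repeat work after a restore.
history = [Trickle(1, 100), Trickle(1, 200), Trickle(1, 100),
           Trickle(1, 200), Trickle(1, 300)]
print(find_backward_trickles(history))
```

A server using a check like this could either reject the flagged trickles or count each progress point only once.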
ID: 7370
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 7372 - Posted: 19 Jan 2005, 0:22:22 UTC

Ananas

You got one right & one wrong:

quote:
So the server _could_ accept the model as a full run but it seems to stick to those 57 trickles that it counted before the crash, still waiting for the 15 that came after.

answer: this is indeed how it works.

quote:
I know that one isn't supposed to restore BOINC project data but it seemed to have worked in this case.

answer: restoring from a backup IS acceptable. Frequent backups are recommended if one has problems.
Some people have to do a lot of restores/start-agains, to get a wu to finish.

Tell your friend to persevere. He'll get there eventually.

And trickles only contain five lines of data: a header, the wu name, the phase, current ts, and cpu time.
They just tell the server that the wu is still being worked on. ALL the result data is returned when
processing has finished.
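
As a rough illustration of how small a trickle is, the five fields could be modelled as below. This is only a sketch: the real trickle layout is not shown in this thread, so the one-field-per-line format, the `TrickleRecord` and `parse_trickle` names, and the sample values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TrickleRecord:
    header: str      # leading header line
    wu_name: str     # work unit name
    phase: int       # model phase
    timestep: int    # current timestep ("ts")
    cpu_time: float  # CPU time used so far, in seconds (unit assumed)


def parse_trickle(text: str) -> TrickleRecord:
    """Parse a hypothetical five-line trickle, one field per line."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if len(lines) != 5:
        raise ValueError(f"expected 5 lines, got {len(lines)}")
    header, wu_name, phase, timestep, cpu_time = lines
    return TrickleRecord(header, wu_name, int(phase), int(timestep), float(cpu_time))


# Invented sample data, not a real trickle.
sample = """trickle
example_wu_0001
2
10802
123456.7
"""
record = parse_trickle(sample)
print(record.phase, record.timestep)
```

Nothing in such a record carries model output, which is why a trickle only signals progress.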

Having said that, about 330Megs of extra data per wu remain stored on your computer.
This is partly because it's a lot to send back via dial-up lines, and partly because the scientists
haven't yet decided how best to store huge amounts of data. But they're working on it.
If it's a problem, copy it to a cd. Two wus fit on one cd, but if even this is too much, delete it.

Les
ID: 7372
Keck_Komputers

Joined: 5 Aug 04
Posts: 426
Credit: 2,426,069
RAC: 0
Message 7377 - Posted: 19 Jan 2005, 9:53:08 UTC - in response to Message 7372.  

> Ananas
>
> You got one right & one wrong:
>
> quote:
> So the server _could_ accept the model as a full run but it seems to stick to
> those 57 trickles that it counted before the crash, still waiting for the 15
> that came after.
>
> answer: this is indeed how it works.
>
> quote:
> I know that one isn't supposed to restore BOINC project data but it seemed to
> have worked in this case.
>
> answer: restoring from a backup IS acceptable. Frequent backups are
> recommended if one has problems.
> Some people have to do a lot of restores/start-agains, to get a wu to finish.
>
Restoring from backup is only acceptable with CPDN. With other BOINC projects it can cause problems especially if any of the host data has changed (CPU speed, ram, or just the string BOINC uses to identify them).

> Tell your friend to persevere. He'll get there eventually.
>
> And trickles only contain five lines of data: a header, the wu name, the
> phase, current ts, and cpu time.
> They just tell the server that the wu is still being worked on. ALL the result
> data is returned when
> processing has finished.
>
> Having said that, about 330Megs of extra data per wu remain stored on your
> computer.
> This is partly because it's a lot to send back via dial-up lines, and partly
> because the scientists
> haven't yet decided how best to store huge amounts of data. But they're
> working on it.
> If it's a problem, copy it to a cd. Two wus fit on one cd, but if even this is
> too much, delete it.
>
> Les
>
>
BOINC WIKI

BOINCing since 2002/12/8
ID: 7377
old_user169

Joined: 5 Aug 04
Posts: 39
Credit: 87,633
RAC: 0
Message 7379 - Posted: 19 Jan 2005, 11:00:30 UTC - in response to Message 7377.  
Last modified: 19 Jan 2005, 11:09:00 UTC

> > Ananas
> >
> > You got one right & one wrong:
> >
> > quote:
> > So the server _could_ accept the model as a full run but it seems to stick to
> > those 57 trickles that it counted before the crash, still waiting for the 15
> > that came after.
> >
> > answer: this is indeed how it works.
> >
> > quote:
> > I know that one isn't supposed to restore BOINC project data but it seemed to
> > have worked in this case.
> >
> > answer: restoring from a backup IS acceptable. Frequent backups are
> > recommended if one has problems.
> > Some people have to do a lot of restores/start-agains, to get a wu to finish.
> >
> Restoring from backup is only acceptable with CPDN. With other BOINC projects
> it can cause problems especially if any of the host data has changed (CPU
> speed, ram, or just the string BOINC uses to identify them).


OK, I have forwarded this so far. Thanks to both of you :-)


But it still doesn't explain why the model is not a full run in this case. All three phases have been completed, so it should be in the "success" state by now. Any ideas?

This is not a question about credit. The question is rather: will the data be used regardless of the state in the database, which does not reflect the real progress?

(Sorry, I used the wrong ID for this reply; it should have been "Ananas".)
ID: 7379
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 7380 - Posted: 19 Jan 2005, 11:32:47 UTC

> but it seems to stick to those 57 trickles that it counted before the crash,
> still waiting for the 15 that came after.

If you click on the 57 trickles, it actually shows all 24 phase 3 trickles, all 24 phase 2 trickles, and last 9 trickles of phase 1. I have no idea what happened to the first 15 trickles.
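
The counts can be checked with a short tally. It assumes three phases of 24 trickles each, which is what the figures above imply; the variable names are only illustrative.

```python
TRICKLES_PER_PHASE = 24  # from "all 24 phase 3 trickles" / "all 24 phase 2 trickles"
PHASES = 3

# Trickles listed on the result page, keyed by phase.
shown = {3: 24, 2: 24, 1: 9}

expected_total = TRICKLES_PER_PHASE * PHASES
shown_total = sum(shown.values())
missing = {phase: TRICKLES_PER_PHASE - count
           for phase, count in shown.items()
           if count < TRICKLES_PER_PHASE}

print(expected_total, shown_total, missing)
```

So the 57 trickles shown plus the 15 missing from phase 1 account for a full 72-trickle run.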

Many units do not say success even though they have run to the end. The most common cause is the _1.zip file, which was too big until the allowed size was increased. I am sure they will make use of such results rather than ignoring all runs that simply do not show success as the outcome.
Visit BOINC WIKI for help

And join BOINC Synergy for all the news in one place.
ID: 7380
Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 7404 - Posted: 20 Jan 2005, 12:25:31 UTC
Last modified: 21 Jan 2005, 7:48:20 UTC

The result name is 015q_300026487_0, which indicates that the job was (as crandles suggests) one of those susceptible to the zip file size problem and was processed with version 4.03. However, the reason the state doesn't indicate successful completion is that it was restored after a crash.

I restored resultid 13 (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13) from a 4-hour-old backup after I inadvertently managed to crash it, and reactivated it by hacking the client_state.xml file (I don't know how the CPU time is formatted, hence the very low average sec/TS for the last 9 trickles). It also has outcome unknown and client state new, and it stays that way because it's one of the 4.03 jobs that Carl invalidated on the server (purely to prevent them from being sent out again if they failed; the results are still valid). Incidentally, that WU also shows 73 trickles because the backup was taken just before a trickle point.

As for the missing trickles at the start, my guess is that there was a host merge about that time (in addition to the one on 30 Sept). There used to be a problem with host merging that meant trickles on the old hostid wouldn't show up (something missing from the database tables if my memory doesn't fail me). My hostid 20652 only shows 3 trickles for resultid 5 (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=5), but I know the earlier trickles were done by hostid 5 (which suffered a hard disk failure and was merged with 20652 after the system was rebuilt).
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 7404
