climateprediction.net (CPDN) home page
Thread 'Task won't finish?'

Message boards : Number crunching : Task won't finish?
Felix Kaeufer

Joined: 3 Feb 12
Posts: 4
Credit: 194,442
RAC: 0
Message 47098 - Posted: 18 Sep 2013, 17:07:34 UTC
Last modified: 18 Sep 2013, 17:08:00 UTC

Hi, I have one task that ran well for about 249 hours. Now BOINC Manager says 100% with no time left to crunch, but the task is still running. stderr reads:
hadcm3n_6.07_i686-apple-darwin(96996,0xac2ed2c0) malloc: *** error for object 0x13ca604: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
hadcm3n_6.07_i686-apple-darwin(96996,0xac2ed2c0) malloc: *** error for object 0x13ca600: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
SIGSEGV: segmentation violation

What can I do now?
ID: 47098 · Report as offensive     Reply Quote
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 47099 - Posted: 18 Sep 2013, 17:14:29 UTC


The model has to do work at the end (packaging up the files for upload, etc.), so it is normal for there to be a period when it is still running at 100%. However, if it is still running more than a couple of hours after this point, then it may have got stuck, and you'll have to abort it. This does sometimes happen (although it is rare).


I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 47099 · Report as offensive     Reply Quote
Felix Kaeufer

Joined: 3 Feb 12
Posts: 4
Credit: 194,442
RAC: 0
Message 47101 - Posted: 18 Sep 2013, 17:48:33 UTC

Well, I fear the task is broken because there are still about 2000 timesteps left.
I will wait until tomorrow morning and abort the task then.
ID: 47101 · Report as offensive     Reply Quote
Felix Kaeufer

Joined: 3 Feb 12
Posts: 4
Credit: 194,442
RAC: 0
Message 47103 - Posted: 18 Sep 2013, 18:09:26 UTC

I discovered that the task is constantly using 100% CPU.
ID: 47103 · Report as offensive     Reply Quote
Felix Kaeufer

Joined: 3 Feb 12
Posts: 4
Credit: 194,442
RAC: 0
Message 47105 - Posted: 18 Sep 2013, 19:05:31 UTC
Last modified: 18 Sep 2013, 19:11:52 UTC

I tried to close and reopen BOINC, but something went wrong and I was told to reinstall BOINC.
Now there's a line that says:
Crashed executable name: hadcm3n_6.07_i686-apple-darwin
built using BOINC library version 6.13.0
Machine type Intel 80486 (32-bit executable)


I don't want the progress to be lost. I know the trickles are saved, but other users crunched that WU too and I had to crunch all timesteps.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8484929
ID: 47105 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 47146 - Posted: 21 Sep 2013, 12:43:55 UTC

Felix, your model's web page shows that it finished at 1,036,800 timesteps, which is the correct number for a completed model. This means it will have generated and uploaded its four decadal files (one at the end of each 10 years); these files contain the data that the researchers need.

These hadcm models never reach 100% progress while processing and never complete the number of timesteps that are stated in the models' graphics window. Our hadcm models finish at the end of 6 December in the final year, but this is not at the last timestep listed. I think this is an error in the graphics window that should probably be corrected (the total number of timesteps is slightly smaller than it says in the graphics).
Cpdn news
ID: 47146 · Report as offensive     Reply Quote
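The timestep figures quoted in this thread can be checked with a little arithmetic. The sketch below uses only numbers posted here: 1,036,800 timesteps for a completed model, and the total of 1,039,392 that one poster quotes from the task display later in the thread. The names are made up for illustration and are not part of BOINC or the model.

```python
# Numbers taken from posts in this thread; the names are illustrative only.
FINISHED_TIMESTEPS = 1_036_800   # count reported for a completed model
LISTED_TIMESTEPS = 1_039_392     # total quoted from the task display in a later post


def percent_done(done: int, total: int) -> float:
    """Progress as a percentage of the listed total."""
    return 100.0 * done / total


shortfall = LISTED_TIMESTEPS - FINISHED_TIMESTEPS
final_percent = percent_done(FINISHED_TIMESTEPS, LISTED_TIMESTEPS)

print(f"Listed timesteps that are never run: {shortfall}")
print(f"Progress shown when the model has actually finished: {final_percent:.2f}%")
```

On these figures a finished model sits a little short of the listed total, which fits the remark above that the total in the graphics window is slightly too large.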
thyst

Joined: 27 Nov 12
Posts: 1
Credit: 69,121
RAC: 0
Message 47501 - Posted: 8 Nov 2013, 15:48:47 UTC - in response to Message 47146.  

Hi
also stuck on 100% in BOINC manager and 99.79% on the task (1037232 of 1039392 completed).
Looking on the website this task is still incomplete.
CPU usage is zero. Workunit is hadcm3n_7zo2_1980_40_008457509.

Is there a way to get BOINC Manager to re-activate this task?

ID: 47501 · Report as offensive     Reply Quote
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 47502 - Posted: 8 Nov 2013, 17:25:38 UTC - in response to Message 47501.  

[quote]
Hi
also stuck on 100% in BOINC manager and 99.79% on the task (1037232 of 1039392 completed).
Looking on the website this task is still incomplete.
CPU usage is zero. Workunit is hadcm3n_7zo2_1980_40_008457509.

Is there a way to get BOINC Manager to re-activate this task?
[/quote]

The model has submitted the expected number of trickles (i.e. 40) and has probably uploaded the expected number of Zip files (i.e. 4). Some of the HADCM3N models do then get stuck. Normally in that situation the model will terminate if the BOINC Manager is stopped and then restarted. The model will show as an error on the Web site, but all the data has been uploaded, so the scientific objectives of the model have been achieved. If the model does not terminate after a BOINC Manager restart, then the only option is to abort it.
ID: 47502 · Report as offensive     Reply Quote
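The advice given so far in the thread amounts to a small decision procedure, sketched below. This is only an illustration: `StuckTask`, `suggest_action` and `data_already_uploaded` are hypothetical names, not BOINC functions, and the 2-hour threshold is an assumed reading of "a couple of hours". The expected totals (40 trickles, 4 Zip files) are the ones quoted above for this model type.

```python
from dataclasses import dataclass

# Totals quoted in this thread for a 40-year HADCM3N model.
EXPECTED_TRICKLES = 40
EXPECTED_ZIPS = 4
# "A couple of hours" from the advice in the thread, taken as 2 (an assumption).
NORMAL_WRAP_UP_HOURS = 2.0


@dataclass
class StuckTask:
    hours_at_100_percent: float
    boinc_restarted: bool  # has BOINC Manager been stopped and restarted since?


def data_already_uploaded(trickles_sent: int, zips_uploaded: int) -> bool:
    """True if every expected trickle and Zip file has reached the server."""
    return trickles_sent >= EXPECTED_TRICKLES and zips_uploaded >= EXPECTED_ZIPS


def suggest_action(task: StuckTask) -> str:
    """Return 'wait', 'restart BOINC', or 'abort', following the thread's advice."""
    if task.hours_at_100_percent <= NORMAL_WRAP_UP_HOURS:
        return "wait"            # end-of-run packaging is normal
    if not task.boinc_restarted:
        return "restart BOINC"   # stuck models normally terminate on restart
    return "abort"               # still running after a restart


print(suggest_action(StuckTask(hours_at_100_percent=30.0, boinc_restarted=False)))
print(data_already_uploaded(trickles_sent=40, zips_uploaded=4))
```

The upload check is kept separate from the action because, as described above, a task whose data has all been uploaded may still need a restart or an abort to clear it from the client.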
Carolina Calling

Joined: 22 Jun 09
Posts: 8
Credit: 1,760,735
RAC: 0
Message 47773 - Posted: 11 Dec 2013, 23:05:51 UTC
Last modified: 11 Dec 2013, 23:16:55 UTC

I too have a never ending work unit: hadcm3n_o1z5_2020_40_008410991_2 (8561847)
It has submitted 20 trickle-ups and the listed number of timesteps is 1,036,800. There have been no trickle-ups since 28 Nov 2013 09:47:42 and it has been at 100% completion. All it does is add seconds to its count and occupy 0% of one of my CPUs (I have eight), using up a slot. Should this be aborted, and will I get credit for the work unit? I want it to finish before I update to BOINC 7.2.33.

Also, when the power blips and crashes my Windows XP system, the CP.N work units die with an "error while computing". It seems to be an error in the Windows run-time library loaded with the task. It doesn't recover and the work unit dies a horrible death. (Yes, I could replace the battery in my UPS and all this will go away, but no other BOINC tasks seem to have this sensitivity. (I'll check with Santa about the battery...[singing] all I want for Christmas is my UPS...))
ID: 47773 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 47776 - Posted: 12 Dec 2013, 12:15:00 UTC - in response to Message 47773.  

It's completed, so Abort it.
Some of these models just won't quit when they finish.

Credit is given for each trickle_up file as it's received, so if they've all been sent, you already have the credit.

ID: 47776 · Report as offensive     Reply Quote
Carolina Calling

Joined: 22 Jun 09
Posts: 8
Credit: 1,760,735
RAC: 0
Message 47782 - Posted: 12 Dec 2013, 23:33:36 UTC - in response to Message 47776.  
Last modified: 12 Dec 2013, 23:36:32 UTC

[quote]
It's completed, so Abort it.
Some of these models just won't quit when they finish.

Credit is given for each trickle_up file as it's received, so if they've all been sent, you already have the credit.
[/quote]

Good to know!  Thank you!  I'll dispose of the husk...


Does anyone have information/suggestions on the second issue? i.e. Work units not restarting after a power glitch due to something being not quite right with the Windows C++ runtime library? (I'd have said squirrelly but it might not translate well...) Is there a way to restart it?

Last but not least, can CP.N work units use more than one thread/cpu? How would one accomplish it?
ID: 47782 · Report as offensive     Reply Quote
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 47783 - Posted: 13 Dec 2013, 0:07:39 UTC - in response to Message 47782.  

[quote]
Does anyone have information/suggestions on the second issue? i.e. Work units not restarting after a power glitch due to something being not quite right with the Windows C++ runtime library? (I'd have said squirrelly but it might not translate well...) Is there a way to restart it?

Last but not least, can CP.N work units use more than one thread/cpu? How would one accomplish it?
[/quote]


I think it has more to do with the power cut. When the computer crashes due to a power interruption, the machine just stops dead. Unlike a car, it doesn't coast along long enough for you to pull over to the side of the road and stop safely. It just dies. There is no time for a proper shutdown and no time for the machine to write to the disc. There is nothing for the model to start up from. An uninterruptible power supply unit and the right software can shut the computer down slowly enough to allow the shutdown to run and the model to be saved.

As to multi-threading models, the answer is NO. Some projects like Milkyway can and do use more than one core per WU, but CPDN can't. This has been discussed many times on these boards. The answer is that it would take a complete rewrite of 1,000,000 lines of computer code to make that work, and it just isn't going to happen.

ID: 47783 · Report as offensive     Reply Quote
Carolina Calling

Joined: 22 Jun 09
Posts: 8
Credit: 1,760,735
RAC: 0
Message 47784 - Posted: 13 Dec 2013, 0:56:42 UTC - in response to Message 47783.  
Last modified: 13 Dec 2013, 1:44:35 UTC

In BOINC, there is a setting: Tools->disk and memory usage->tasks checkpoint to disk at most every (seconds)

I would think that this has to do with restarting tasks that have been interrupted by, say, a power outage. That might take care of BOINC's problems with restarts.

However, if the code doesn't properly initialize (we Yanks don't know how to spell, do we) the Windows C++ runtime on a restart and properly keep track of intermediate states, it might behave exactly as described... we are talking about a work unit restart from available data after all. Certainly no state information is left from before the crash other than what the code writes out as it goes along. It's just that the information is not sufficient about once in three or four crashes. (i.e. it doesn't do it every time...) Also, the flaw might just as well be in the runtime, could it not? We'll see what Windows updates will bring... My point was to make these failures known to the community. This is not about blame.

BTW, depending on the capacitance built into a system's power supply, once DC low is asserted, a system might accomplish rather a lot in the milliseconds it has left... no disk writes though (sans an SSD) ... a disc might be able to park its heads for instance... certainly not a proper checkpoint of the CP.N code (or am I just being argumentative? :)

I really do appreciate the help about the hung work unit...it's gone. Thanks again!

Good to know about the multi-threading. Sad though. I wouldn't want to rewrite 10^6 lines of code just for that either.

It seems that Father Christmas is going to be kind. I'm getting a new UPS battery...

(I probably should have made this another thread, sorry.)
ID: 47784 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 47785 - Posted: 13 Dec 2013, 2:26:11 UTC - in response to Message 47784.  
Last modified: 13 Dec 2013, 2:32:16 UTC

There are many points over the length of a model run where a power failure can hit.
I've had quite a few that did nothing serious, and a small number where one of the several models on the machine failed. Luck of the draw.

I suspect that the worst spot, and one which may account for ALL of these power failure crashes, is at a checkpoint. There are a lot of files to write to the disk, especially when running several models, and if some have been written and not others, then the files on the disk will be split: some from the current save, and some from the previous save, perhaps many minutes earlier.
The only thing that can prevent that is to NOT have a power failure at a checkpoint. And even a UPS may not prevent that.

BUT ...

A properly managed backup WILL save the day.
See my sig for more about this.

PS
These models don't take any notice of the BOINC checkpoint times; they checkpoint at a certain point in the modelling process.
If you use the Show graphics to look at them while running, it's easy to see when this is.
Backups: Here
ID: 47785 · Report as offensive     Reply Quote
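As a rough illustration of the backup idea mentioned above, here is a minimal sketch that copies a BOINC data directory into a new timestamped folder. It is not the procedure linked in the signature. The function name and the paths in the usage comment are hypothetical, and the sketch assumes the BOINC client has been shut down completely before the copy starts, so that no model is partway through writing its files.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: str, backup_root: str) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root.

    Assumes the BOINC client is fully stopped, so no file is mid-write.
    """
    source = Path(data_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree refuses to overwrite an existing folder,
    # so every backup lands in its own directory.
    shutil.copytree(source, target)
    return target


# Example call (hypothetical paths, adjust to your own machine):
# backup_boinc_data("/path/to/BOINC/data", "/path/to/backups")
```

Restoring would be the reverse copy, again with the client stopped. Whether a restored model is still accepted by the project server is not something this sketch can tell you.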
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 47789 - Posted: 13 Dec 2013, 12:52:31 UTC

... since an error while writing a checkpoint is a danger the obvious thing to do is to keep the old one until the new one is known to be correctly generated. It's one of those Pareto splits, I would guess: 20% of the effort gets 80% of the benefit. Without a single-checkpoint system very few tasks would succeed; with only a single-checkpoint system very few tasks fail (at least because of that).
ID: 47789 · Report as offensive     Reply Quote
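The "keep the old one until the new one is known to be good" idea can be sketched in general terms as follows. This is a common file-handling pattern, not a description of how the CPDN models write their checkpoints, and `write_checkpoint_atomically` is a name invented for the example. It handles a single file; a checkpoint made of many files, as described earlier in the thread, would need the same treatment applied to a whole directory.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, data: bytes) -> None:
    """Write data to path so a crash leaves either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint to a temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk first
        # The old checkpoint stays in place until this rename swaps in the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temporary file and leave the old checkpoint alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the power fails while the temporary file is being written, the previous checkpoint is untouched; on POSIX systems the final rename is atomic, so a reader sees either the old checkpoint or the new one.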
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 47800 - Posted: 17 Dec 2013, 18:55:09 UTC


At least when running 6 HadCM3 models simultaneously, I did find that a power cut killed them all. MartinNZ reported the same. If I had to guess, I would suggest that there is an open data file which is needed to restart properly.


A UPS is a good idea if you get frequent power cuts ... if you very rarely get power cuts, it would be an unnecessary expense and a risk in itself (e.g., accidentally brushing the 'test' button cut the power & killed all my tasks once).
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 47800 · Report as offensive     Reply Quote
Werinbert

Joined: 29 Jul 13
Posts: 4
Credit: 1,008,021
RAC: 0
Message 47816 - Posted: 19 Dec 2013, 9:25:04 UTC

FYI...
I just had one of my full ocean models run to 100%. The time taken was in the same ballpark as previous models (356 hours), so I thought nothing of it. Nonetheless, after getting to 100% it just kept chugging away. I looked at the trickles: 11 out of 40 sent, and intermittent at that. So I thought it was a bad WU. I stopped and started BOINC and it resulted in a computer error. As expected...such is life. Lo and behold, a couple of days later I was granted full credits.

Still unsure of what the problem was. I am also not so sure I deserve the credits (but I will accept them). So I hope the error was directly related to the trickles, and the final result files of the entire model are useable.
ID: 47816 · Report as offensive     Reply Quote
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 47817 - Posted: 19 Dec 2013, 10:20:58 UTC - in response to Message 47816.  

[quote]
FYI...
I just had one of my full ocean models run to 100%. The time taken was in the same ballpark as previous models (356 hours), so I thought nothing of it. Nonetheless, after getting to 100% it just kept chugging away. I looked at the trickles: 11 out of 40 sent, and intermittent at that. So I thought it was a bad WU. I stopped and started BOINC and it resulted in a computer error. As expected...such is life. Lo and behold, a couple of days later I was granted full credits.

Still unsure of what the problem was. I am also not so sure I deserve the credits (but I will accept them). So I hope the error was directly related to the trickles, and the final result files of the entire model are useable.
[/quote]

That's all the trickles for the first decade and then the last trickle of the last decade, which may explain the credits. There have been some changes in data storage over the course of that model, so perhaps some of the data has gone astray. However, as you say, the CPU time is the same as other models and the stderr log has errors for Zips 1-3 but not 4, so perhaps they got the most important Zip.

The HADCM3N models have a number of problems at decade points, when the Zip files are created and uploaded. This looks like a rather bizarre variant.
ID: 47817 · Report as offensive     Reply Quote

©2024 cpdn.org