Message boards : Number crunching : Task won't finish?
Joined: 3 Feb 12 Posts: 4 Credit: 194,442 RAC: 0
Hi, I have a task that ran well for about 249 hours. Now BOINC Manager says 100% and no time left to crunch, but the task is still running. stderr reads: hadcm3n_6.07_i686-apple-darwin(96996,0xac2ed2c0) malloc: *** error for object 0x13ca604: incorrect checksum for freed object - object was probably modified after being freed. What can I do now?
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0
The model has to do work at the end (packaging up the files for upload, etc.), so it is normal for there to be a period when it is still running at 100%. However, if it is still running for more than a couple of hours after this point, then it may have got stuck and you'll have to abort it. This does sometimes happen (although it is rare).
I'm a volunteer and my views are my own. News and Announcements and FAQ
Joined: 3 Feb 12 Posts: 4 Credit: 194,442 RAC: 0
Well, I fear the task is broken because there are still about 2000 timesteps left. I will wait until tomorrow morning and abort the task then.
Joined: 3 Feb 12 Posts: 4 Credit: 194,442 RAC: 0
I discovered the task permanently uses 100% CPU.
Joined: 3 Feb 12 Posts: 4 Credit: 194,442 RAC: 0
I tried to close and reopen BOINC, but something was wrong and I was told to reinstall BOINC. Now there's a line that says: Crashed executable name: hadcm3n_6.07_i686-apple-darwin. I don't want the progress to be lost. I know the trickles are saved, but other users crunched that WU too, and I had to crunch all the timesteps. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8484929
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0
Felix, your model's web page shows that it finished at 1,036,800 timesteps, which is the correct number for a completed model. This means it will have generated and uploaded its four decadal files (one at the end of each 10 years); these files contain the data that the researchers need. These hadcm models never reach 100% progress while processing and never complete the number of timesteps stated in the models' graphics window. Our hadcm models finish at the end of 6 December in the final year, but this is not at the last timestep listed. I think this is an error in the graphics window that should probably be corrected (the total number of timesteps is slightly smaller than the graphics say).
Cpdn news
Joined: 27 Nov 12 Posts: 1 Credit: 69,121 RAC: 0
Hi, I'm also stuck at 100% in BOINC Manager and 99.79% on the task (1,037,232 of 1,039,392 timesteps completed). Looking at the website, this task is still incomplete. CPU usage is zero. The workunit is hadcm3n_7zo2_1980_40_008457509. Is there a way to get BOINC Manager to reactivate this task?
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192
Hi, the model has submitted the expected number of trickles (i.e. 40) and has probably uploaded the expected number of Zip files (i.e. 4). Some of the HADCM3N models do then get stuck. Normally in that situation the model will terminate if the BOINC Manager is stopped and then restarted. The model will show as an error on the website, but all the data has been uploaded, so the scientific objectives of the model have been achieved. If the model does not terminate after a BOINC Manager restart, then the only option is to abort it.
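For anyone who prefers the command line, a rough equivalent of that restart-then-abort sequence can be done with boinccmd. This is only a sketch: the <project_url> and <task_name> placeholders are values you would take from your own boinccmd --get_tasks output, not from this thread.

```
# List current tasks; note the name and project URL of the stuck one
boinccmd --get_tasks

# Ask the running client to shut down, then start BOINC again
# (from the Manager, the service, or however it is normally launched)
boinccmd --quit

# If the task is still wedged after the restart, abort it
boinccmd --task <project_url> <task_name> abort
```

As noted above, by this point the data has already been uploaded, so aborting only clears the stuck slot.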
Joined: 22 Jun 09 Posts: 8 Credit: 1,760,735 RAC: 0
I too have a never-ending work unit: hadcm3n_o1z5_2020_40_008410991_2 (8561847). It has submitted 20 trickle-ups, and the listed number of timesteps is 1,036,800. There have been no trickle-ups since 28 Nov 2013 09:47:42, and it has been at 100% completion. All it does is add seconds to its count and occupy one of my CPUs (I have eight) at 0%, using up a slot. Should this be aborted, and will I get credit for the work unit? I want it to finish before I update to BOINC 7.2.33.
Also, when the power blips and crashes my Windows XP system, the CP.N work units die with an "error while computing". It seems to be an error in the Windows run-time library loaded with the task. It doesn't recover, and the work unit dies a horrible death. (Yes, I could replace the battery in my UPS and all this would go away, but no other BOINC task seems to have this sensitivity. (I'll check with Santa about the battery... [singing] all I want for Christmas is my UPS...))
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
It's completed, so Abort it. Some of these models just won't quit when they finish. Credit is given for each trickle_up file as it's received, so if they've all been sent, you already have the credit.
Joined: 22 Jun 09 Posts: 8 Credit: 1,760,735 RAC: 0
"It's completed, so Abort it." Good to know! Thank you! I'll dispose of the husk... Does anyone have information/suggestions on the second issue, i.e. work units not restarting after a power glitch due to something being not quite right with the Windows C++ runtime library? (I'd have said squirrelly, but it might not translate well...) Is there a way to restart it? Last but not least, can CP.N work units use more than one thread/CPU? How would one accomplish that?
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022
I think it has more to do with the power cut. When the computer crashes due to a power interruption, the machine just stops dead. Unlike a car, it doesn't coast along long enough for you to pull over to the side of the road and stop safely. It just dies. There is no time for a proper shutdown and no time for the machine to write to the disc. There is nothing for the model to start up from. An uninterruptible power supply unit and the right software can shut the computer down slowly enough to allow the shutdown to run and the model to be saved.
As to multi-threaded models, the answer is NO. Some projects like Milkyway can and do use more than one core per WU, but CPDN can't. This has been discussed many times on these boards. The answer is that it would take a complete rewrite of 1,000,000 lines of computer code to make that work, and it just isn't going to happen.
Joined: 22 Jun 09 Posts: 8 Credit: 1,760,735 RAC: 0
In BOINC there is a setting: Tools->disk and memory usage->tasks checkpoint to disk at most every (seconds). I would think that this has to do with restarting tasks that have been interrupted by, say, a power outage. That might take care of BOINC's problems with restarts. However, if the code doesn't properly initialize (we Yanks don't know how to spell, do we) the Windows C++ runtime on a restart and properly keep track of intermediate states, it might behave exactly as described... we are talking about a work unit restart from available data, after all. Certainly no state information is left from before the crash other than what the code writes out as it goes along. It's just that the information is not sufficient in about one in three or four crashes (i.e. it doesn't happen every time...). Also, the flaw might just as well be in the runtime, could it not? We'll see what Windows updates will bring... My point was to make these failures known to the community. This is not about blame.
BTW, depending on the capacitance built into a system's power supply, once DC low is asserted, a system might accomplish rather a lot in the milliseconds it has left... no disk writes though (sans an SSD)... a disc might be able to park its heads, for instance... certainly not a proper checkpoint of the CP.N code (or am I just being argumentative? :)
I really do appreciate the help with the hung work unit... it's gone. Thanks again! Good to know about the multi-threading. Sad, though. I wouldn't want to rewrite 10^6 lines of code just for that either. It seems that Father Christmas is going to be kind. I'm getting a new UPS battery... (I probably should have made this another thread, sorry.)
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
There are many points in the length of a model where a power failure can hit. I've had quite a few that did nothing serious, and a small number where one of the several models on the machine failed. Luck of the draw. I suspect that the worst spot, and one which may account for ALL of these power-failure crashes, is at a checkpoint. There are a lot of files to write to the disk, especially when running several models, and if some have been written and not others, then the files on the disk will be split: some from the current save, and some from the previous save, perhaps many minutes earlier. The only thing that can prevent that is to NOT have a power failure at a checkpoint. And even a UPS may not prevent that. BUT... a properly managed backup WILL save the day. See my sig for more about this.
PS: These models don't take any notice of the BOINC checkpoint times; they checkpoint at a certain point in the modelling process. If you use Show graphics to look at them while running, it's easy to see when this is.
Backups: Here
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192
... since an error while writing a checkpoint is a danger, the obvious thing to do is to keep the old one until the new one is known to be correctly generated. It's one of those Pareto splits, I would guess: 20% of the effort gets 80% of the benefit. Without even a single-checkpoint system very few tasks would succeed; with only a single-checkpoint system very few tasks fail (at least because of that).
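To make the "keep the old checkpoint until the new one is known to be good" idea concrete, here is a minimal sketch of the usual write-then-atomic-rename pattern. It is a generic illustration in Python under my own assumptions, not CPDN's or BOINC's actual code, and the function name save_checkpoint is just an example.

```python
import os
import tempfile

def save_checkpoint(path: str, data: bytes) -> None:
    """Write a new checkpoint without ever clobbering the last good one.

    The new state goes to a temporary file in the same directory; only
    after it has been flushed to disk is it renamed over the old file.
    os.replace() is atomic on POSIX and Windows, so a crash at any point
    leaves either the old checkpoint or the new one intact, never a
    half-written mix of the two.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt_")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the bytes really are on disk
        os.replace(tmp_path, path)   # atomic swap; the old file survives any earlier crash
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)      # clean up the temporary file on failure
        raise
```

A crash while the temporary file is being written leaves the previous checkpoint untouched. A model that writes many separate files per checkpoint would need the same idea applied to the whole set, which is presumably part of why a power cut landing exactly at a checkpoint can still be fatal.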
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0
At least when running 6 HadCM3 models simultaneously, I did find that a power cut killed them all. MartinNZ reported the same. If I had to guess, I would suggest that there is an open data file which is needed to restart properly. A UPS is a good idea if you get frequent power cuts... if you very rarely get power cuts, it would be an unnecessary expense and a risk in itself (e.g., accidentally brushing the 'test' button cut the power and killed all my tasks once).
I'm a volunteer and my views are my own. News and Announcements and FAQ
Joined: 29 Jul 13 Posts: 4 Credit: 1,008,021 RAC: 0
FYI... I just had one of my full ocean models run to 100%. The time taken was in the same ballpark as previous models (356 hours), so I thought nothing of it. Nonetheless, after getting to 100% it just kept chugging away. I looked at the trickles: 11 out of 40 sent, and intermittent at that. So I thought it was a bad WU. I stopped and started BOINC and it resulted in a computation error. As expected... such is life. Lo and behold, a couple of days later I was granted full credits. I'm still unsure what the problem was. I am also not so sure I deserve the credits (but I will accept them). So I hope the error was directly related to the trickles, and that the final result files of the entire model are usable.
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192
That's all the trickles for the first decade and then the last trickle of the last decade, which may explain the credits. There have been some changes in data storage over the course of that model, so perhaps some of the data has gone astray. However, as you say, the CPU time is the same as other models and the stderr log has errors for Zips 1-3 but not 4, so perhaps they got the most important Zip. The HADCM3N models have a number of problems at decade points, when the Zip files are created and uploaded. This looks like a rather bizarre variant.