climateprediction.net (CPDN) home page
Thread '"Time remaining" going up'

Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52279 - Posted: 20 Jul 2015, 9:16:25 UTC

I'm running a HadAM3P task on a Linux machine (Id 18553136) and all was OK until two days ago. The "time remaining" had gone down to 23 hrs but since then has been steadily increasing; it is now at ~33 hrs.
Trickles are still being generated on a daily basis.
Can't find anything related to this phenomenon in the forums and wondered if anyone else has seen this.
ID: 52279
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,025,554
RAC: 20,468
Message 52280 - Posted: 20 Jul 2015, 11:10:46 UTC - in response to Message 52279.  

I have certainly seen this before. I put it down to the quirks of the algorithm that estimates the time left. Is something else using a lot of cpu time on the computer?

It tends to get better as more tasks of a particular type are completed.

Incidentally, I notice the two failed tasks on that machine are both showing over 1,000 seconds of run time but only 19.something and 3.something seconds of CPU time.

Is this computer doing some very demanding CPU-intensive work, or was it doing so when these tasks were attempted?
ID: 52280
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52281 - Posted: 20 Jul 2015, 12:42:44 UTC - in response to Message 52280.  

Ah, I'll just let it run. Thanks for the info.

As far as the two failed tasks are concerned, I think there was a problem with that particular model :- UK Met Office HadAM3P and HadRM3P model with MOSES II and TRIFFID Europe v7.01

This is just an old spare laptop that I'm using to investigate Linux. It's not doing much more than running CPDN stuff.
ID: 52281
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,025,554
RAC: 20,468
Message 52282 - Posted: 20 Jul 2015, 13:30:26 UTC

As far as the two failed tasks are concerned, I think there was a problem with that particular model :- UK Met Office HadAM3P and HadRM3P model with MOSES II and TRIFFID Europe v7.01


I can't remember the discussion on the number crunching section for these models, but on looking I see that some of the ones I have had gave the same type of error you had. Others (also on a laptop, which may or may not be relevant) have fallen over right at the end with a missing library message. This is despite the fact that ldd doesn't show anything missing when I run it on the executables. I have just re-installed Ubuntu on the laptop, going for Xubuntu rather than Kubuntu. That was the only difference between the installations on my desktop and laptop, and the desktop has recently completed one of these models successfully.

Incidentally, tasks of the global-only type, which is the model you are currently running, tend to fall over if interrupted in any way.
ID: 52282
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52293 - Posted: 22 Jul 2015, 8:45:31 UTC - in response to Message 52279.  

HadAM3P (global only) - Task 18553136

I've been trying to find other tasks for this model, to see if I can get an idea as to how long they should run.
My task has been running for 740 hours and "time remaining" is now at 33 hrs (up from 23 hours two days ago).
Is there a way of looking at the data of work units other than the one relating to this task?

The WU related to my task has not had a successful task completion, so not much joy there.

I don't know if this info is useful as well, but the machine that was running the following associated task has been disconnected for some time.
18187719 1356184 21 Mar 2015 14:09:12 UTC 17 Jun 2015 18:03:34 UTC Client detached 0.00 0.00 3,328.36 3,328.36 UK Met Office HadAM3P (global only) with MOSES II landsurface scheme v7.03

As far as interruptions go, my task has been running continuously with only one interruption from which it seemed to recover OK.
ID: 52293
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,025,554
RAC: 20,468
Message 52294 - Posted: 22 Jul 2015, 9:08:48 UTC

Should have thought of this earlier: is the percentage under the progress tab going up? If not, the task may well be stuck in a loop, and if so there is not much choice other than to abort it.

Casting around by going to the work unit and changing the end digits I didn't find any completed tasks of this type to look at to get the answers you are after. Many are marked, "No resubmission" but sadly the vast majority have fallen to the lack of 32bit libraries problem.

It has been suggested elsewhere that packaging the missing libraries with the tasks could be a way around this.
ID: 52294
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52295 - Posted: 22 Jul 2015, 9:15:29 UTC

I had a look at my tasks, and only found a few of these back in February. 1 failed while computing and the rest were aborted. This may have been due to a bad batch. Can't remember.
However, mine were running 8 times faster than yours, and the time-to-run is often set close to what my computers take to run them, so perhaps yours is just still adjusting to the fact that it's going to take a lot longer than "the average".

But yours seems to be returning trickles regularly, so I'd suggest "just hang in there".


ID: 52295
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52303 - Posted: 22 Jul 2015, 16:42:23 UTC - in response to Message 52279.  

Well, it is an old machine with a 1.5GHz core but 8x slower than Ian's does seem a little tardy. I guess I'll 'just hang in there' - at least for a while.
ID: 52303
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,025,554
RAC: 20,468
Message 52304 - Posted: 22 Jul 2015, 18:02:01 UTC

Well, it is an old machine with a 1.5GHz core but 8x slower than Ian's does seem a little tardy.


Not sure it is tardy if Ian's machine is 4GHz plus and has a much more efficient instruction set on an up-to-date processor. You should have seen the time things took when I was running them on a 1.2GHz Atom!
ID: 52304
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52325 - Posted: 24 Jul 2015, 16:32:34 UTC - in response to Message 52279.  

Are there usually 10 zip files associated with each task? I've just had the 7th uploaded with this task, which would give an idea of the probable run time length.
ID: 52325
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 52331 - Posted: 25 Jul 2015, 3:39:48 UTC - in response to Message 52325.  

Are there usually 10 zip files associated with each task? I've Just had the 7th uploaded with this task, which would give an idea of the probable run time length.


There are more than that. Depending on which type of hadam3p task you are running there are either 13 or 19 zips, one for each month the model crunches plus one more at the end for the restart dump needed to create the next segment in the model.

ID: 52331
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52332 - Posted: 25 Jul 2015, 9:22:28 UTC - in response to Message 52279.  

The task is from the model HadAM3P (global only) with MOSES II landsurface scheme v7.03, and has a 10 year run length which would imply 120 zips.
The task has been running for 780 hours - ie ~110 hrs/zip - which would imply ~13,400 hrs for the complete task (~560 days).
If that's true, it will cost me about 680 KWH (the best part of £100), having used 40 KWH so far.
If this is right, I shall have to pull the plug.
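
The arithmetic above can be reproduced with a short Python sketch. It assumes a simple linear extrapolation from the 7 zips uploaded so far and a constant average power draw; the function name `project_run` and its parameters are illustrative only, not part of BOINC or CPDN.

```python
def project_run(hours_so_far, zips_done, zips_total, kwh_so_far):
    """Linearly extrapolate run time and energy use from progress so far."""
    hours_per_zip = hours_so_far / zips_done
    total_hours = hours_per_zip * zips_total
    # Average power draw in kW, assumed constant for the rest of the run.
    kw_average = kwh_so_far / hours_so_far
    return {
        "hours_per_zip": hours_per_zip,
        "total_hours": total_hours,
        "total_days": total_hours / 24,
        "total_kwh": kw_average * total_hours,
    }


if __name__ == "__main__":
    # Figures from the post: 780 hours, 7 zips uploaded, 40 KWH used,
    # and the assumed 120 zips for a 10 year run.
    estimate = project_run(hours_so_far=780, zips_done=7, zips_total=120, kwh_so_far=40)
    for key, value in estimate.items():
        print(f"{key}: {value:.0f}")
```

The whole estimate hinges on `zips_total`, so a different zip count can be substituted to see how strongly the projection depends on it.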

ID: 52332
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52334 - Posted: 25 Jul 2015, 10:37:11 UTC - in response to Message 52332.  

You can see the intended number of zips for any model in BOINC's To Do list, aka client_state.xml
Although it's best to make a copy of the file, and look at that, rather than mess with the live file.
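
One way to do that without touching the live file is sketched below in Python. It copies client_state.xml to a temporary directory and scans the copy for any element text ending in .zip. It deliberately makes no assumption about the tag names used in the file, and `list_zip_names` is a made-up helper name. The result covers every task the client knows about, so the names still need to be matched to the task of interest by eye.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Matches element text such as ">something.zip<" without relying on tag names.
ZIP_NAME = re.compile(r">\s*([^<>\s]+\.zip)\s*<")


def list_zip_names(client_state_path):
    """Copy client_state.xml to a temp dir and return the .zip names in the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / "client_state_copy.xml"
        shutil.copy2(client_state_path, copy_path)  # the live file is only read
        text = copy_path.read_text(errors="replace")
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(ZIP_NAME.findall(text)))
```

Call it with the path to client_state.xml in the BOINC data directory, which varies between installations.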

None of the recent models have had an expected run time of more than a couple of months. The hadam3prm3pm2t_eu models that I'm currently running are talking about 6 days.
The very long coupled ocean models from early 2006 took about 3 months on my core 2 computers.

ID: 52334
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52335 - Posted: 25 Jul 2015, 18:06:57 UTC - in response to Message 52279.  

Thanks Les, the xml file has a list of 10 zip files, which is somewhat more reassuring. It's 'only' going to take another 250 hours to finish.
That looks like 1 zip per model year.
I don't think I'll be using this machine again, this run will have used 70 KWH which is about 13 Euros.
I'll wait until some Mac jobs are available; a coupled ocean model took about 15 days when I last had one, using 14 KWH.

ID: 52335
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 52392 - Posted: 8 Aug 2015, 17:29:30 UTC

It finally finished after 46 days (which isn't too bad considering the age of the PC) but errored at the end, unable to find the last zip file.
I see that it's thought that missing libraries could be the cause.
My OS is Linux Mint Rebecca. (32bit machine)

sudo find /lib -name 'libz*' came up with
/lib/i386-linux-gnu/libz.so.1
/lib/i386-linux-gnu/libz.so.1.2.8

sudo ldd hadam3prm3pm2t_eu_se_7.01_i686-pc-linux-gnu.so came up with
No such file or directory

I'm a complete newbie with Linux so am blundering around a bit. Help welcome.
ID: 52392
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52393 - Posted: 9 Aug 2015, 3:05:17 UTC - in response to Message 52392.  

I think that there's a library that's only used for the final stage of processing.

I had problems getting 32 bit files to stay where I put them, and I found from watching them being installed, that they were being deleted as part of the clean up at the end, probably due to them being older versions.

So I loaded a 32 bit version of Mint onto an old laptop, found the files, copied them to a usb ram stick, and then copied them to the same place as the other, similar, files. And it's worked OK since then. (I may have had to set permissions to be the same as the other files, don't remember.)

ID: 52393
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,025,554
RAC: 20,468
Message 52394 - Posted: 9 Aug 2015, 12:23:03 UTC

I think that there's a library that's only used for the final stage of processing.


The problem is, Les, that it doesn't show up as missing when an ldd is done on any of the executables. I don't know if this is only a problem when downloading the BOINC installer or if it is also present when installing via whatever package manager is in place. (I have always downloaded the tarball and extracted it where I want BOINC to be.)

For some reason, I get this problem with KDE when I install Kubuntu but not with Xubuntu (XFCE). I am going to be away from my machines for a couple of weeks soon. When I get back I will have a play and try to work out what the problem is, probably by copying my BOINC folders from one of my crunching computers and running the same tasks offline.

If I get any definitive results I will of course feed back.
ID: 52394
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 52396 - Posted: 9 Aug 2015, 14:19:11 UTC - in response to Message 52392.  


sudo ldd hadam3prm3pm2t_eu_se_7.01_i686-pc-linux-gnu.so came up with
No such file or directory

I'm a complete newby with Linux so am blundering around a bit. Help welcome.


If you are in the same directory as the file you are "ldd'ing", try

sudo ldd ./hadam3prm3pm2t_eu_se_7.01_i686-pc-linux-gnu.so
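
Once ldd runs, its output can be scanned for unresolved libraries. The Python sketch below is one hedged way to do that: it relies on ldd printing `=> not found` for a library it cannot resolve, and the helper names `missing_libraries` and `check_binary` are invented for illustration.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # The library name is the part before the "=>" arrow.
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on the given path and list any unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

An empty list only means every library named at link time was found. Anything the program loads itself at run time does not appear in ldd output, which may be why ldd shows nothing missing for these models.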


However, that doesn't appear to be the problem on that model with your Celeron. That is a "MOSES II global only" model. If that type of model is removed from memory for any reason, the trickles and the yearly upload file for the model year in which it was interrupted are not generated. At the end of the model, it sees it didn't send some of the expected yearly upload files and gives an error. So, if you exited BOINC for any reason, on purpose or by accident, the end result will be an error.

It's terrible behavior for that type of model and it's why I don't run them. The MOSES II EU type model is much better and doesn't have that type of behavior. That said, that PC had errors of a different, difficult-to-diagnose type on the 2 EU models it tried to run.

ID: 52396