climateprediction.net (CPDN) home page
Posts by Dave Jackson

1) Message boards : Number crunching : incoherent progress numbers
Message 71889
Posted 14 days ago by Dave Jackson
What causes a WU to completely restart?
I wish I knew. I can't add more than what Glenn wrote. He understands the code and mechanisms far better than I am ever likely to. On the task I linked to, the first trickle repeated several times even though the computer was running continuously during that time, with 24GB of RAM allocated to the VM running Windows 10 and nothing memory-intensive running on the host machine. BOINC and the tasks I was running were all that was running on the guest. I have still to go through the stderr on that task, but I suspect it will need someone with greater knowledge than I possess to pick anything out.
2) Message boards : Number crunching : incoherent progress numbers
Message 71887
Posted 14 days ago by Dave Jackson
In reply to Friedrich S.'s message of 23 Nov 2024:
I know that it would be better to have it running 24/7, but due to the energy cost...

But as it checkpoints approx. every 45 minutes, it should nevertheless progress much better than it did this time. And most of the older WU did. It is just the second WU that shows this different behavior.
But I am assuming that at every checkpoint it saves the work achieved. Or is that a wrong assumption? And if so, how often is the achieved work being saved?

Thanks!

Friedrich

The time between checkpoints can vary quite a bit between task types and processor speeds. My understanding is that it is based on progress rather than elapsed time. If these particular tasks are checkpointing every 45 minutes on your machine, then unless you check before shutting down there will be occasions when you shut down just before a checkpoint, so close to 45 minutes' worth of computation has to be repeated. This will significantly increase your electricity cost per completed task. As Bryan suggests, sleep or hibernation would be better. Both use very little electricity. Indeed, even if I just stop all work rather than sleeping or hibernating, my current box consumes so little power that the "intelligent" multi-socket I use, with the computer plugged into the master socket, does not draw enough to keep the other sockets live. Before I realised this, I thought I had a problem with my then brand-new computer, on which I had just installed the latest Xubuntu.
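To put rough numbers on the checkpoint point above, here is a minimal sketch. It assumes shutdowns happen at a uniformly random moment between checkpoints, which is my assumption and not something the project states, and the 6-hour session length is purely hypothetical. The function names are made up for illustration.

```python
def expected_lost_minutes(checkpoint_interval_min: float) -> float:
    """Average work lost per shutdown, assuming the shutdown falls at a
    uniformly random point between two checkpoints (an assumption)."""
    return checkpoint_interval_min / 2.0


def wasted_fraction(checkpoint_interval_min: float, session_hours: float) -> float:
    """Fraction of one crunching session that has to be repeated on average."""
    session_min = session_hours * 60.0
    return expected_lost_minutes(checkpoint_interval_min) / session_min


if __name__ == "__main__":
    interval = 45.0  # minutes between checkpoints, as reported in the thread
    session = 6.0    # hypothetical hours of crunching per day before shutdown
    print(f"Worst case lost per shutdown: {interval:.0f} min")
    print(f"Average lost per shutdown:    {expected_lost_minutes(interval):.1f} min")
    print(f"Average share of a {session:.0f} h session repeated: "
          f"{wasted_fraction(interval, session):.1%}")
```

Sleep or hibernation avoids this loss altogether because the task resumes from memory rather than from the last checkpoint.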
3) Message boards : Number crunching : incoherent progress numbers
Message 71883
Posted 14 days ago by Dave Jackson
The only reason for memory problems I can see is that I only have 16GB of RAM

How many tasks are you running at once? In theory at least, even using 12 real cores, that should be enough RAM, assuming you are not running anything other than BOINC that uses a lot.
4) Message boards : Number crunching : incoherent progress numbers
Message 71881
Posted 14 days ago by Dave Jackson
This is my one where the first trickle repeated multiple times. I am not quite sure why it might have been running out of memory, as I had previously been running 12 tasks in the VM, and this was happening with only 8 while the Windows VM has 24GB allocated to it. I shall do some more digging if it happens again.

Edit: BOINC is set to use 95% of available memory both when computer is in use and not in use.
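For anyone wanting to sanity-check the memory figures, a rough sketch of the per-task budget follows. It only divides the memory BOINC is allowed to use by the number of tasks; it ignores what Windows itself needs and says nothing about how much a given task actually requires. The function name is hypothetical.

```python
def per_task_budget_gb(vm_ram_gb: float, boinc_fraction: float, n_tasks: int) -> float:
    """Memory available to each task if BOINC's allowance is shared equally.

    vm_ram_gb      -- RAM allocated to the VM (24 GB in this post)
    boinc_fraction -- share of memory BOINC may use (0.95 in this post)
    n_tasks        -- number of tasks running at once
    """
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return vm_ram_gb * boinc_fraction / n_tasks


if __name__ == "__main__":
    for tasks in (8, 12):
        budget = per_task_budget_gb(24.0, 0.95, tasks)
        print(f"{tasks} tasks: {budget:.2f} GB per task")
```

With the settings above that works out at about 2.85 GB per task for 8 tasks and about 1.9 GB per task for 12, before allowing for the guest OS.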
5) Message boards : Number crunching : incoherent progress numbers
Message 71878
Posted 14 days ago by Dave Jackson
It may not be related, but tomorrow I will look at the page for a task I aborted that wasn't making progress.
6) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Message 71872
Posted 17 days ago by Dave Jackson
And currently there are no ARP tasks available, which explains my uploads going through a lot faster!
7) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Message 71870
Posted 18 days ago by Dave Jackson
As the last of the resends of CPDN work finish and the number of ARP tasks running increases, I am building up a large backlog of uploads that need to clear in order to avoid the "too many uploads in progress" message explaining why no more work is being sent. I think I will set WCG to no new tasks till they clear a bit.
8) Questions and Answers : Windows : "Calculation failure" after whenever i reboot the PC
Message 71864
Posted 21 days ago by Dave Jackson
Windows applies the updates and triggers the restart "outside working hours", so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns?


Shouldn't MS be doing a soft shutdown rather than a hard shutdown? Could this be a separate issue from when a task fails following a restart after a power cut?
9) Message boards : Number crunching : Connection and Download issues Oct24
Message 71859
Posted 25 days ago by Dave Jackson
A week wasn't enough !

It reported complete at 12 Nov 2024, 4:47:50 UTC.
The last trickle had been at 01 Nov 2024 15:46:51 UTC.

Of the resends due to timeouts that I have, I found three had completed when I checked. Of the nine I still have running, there are just three I have yet to overtake, and that should happen by the end of today unless the _0 crunchers get a move on.
10) Message boards : Cafe CPDN : Moving working installation on same computer. (Wine to VM)
Message 71855
Posted 27 days ago by Dave Jackson
The one that stalled made about 2% more progress and then stalled again. I will wait till the others finish before trying stopping and restarting the client, but I suspect I will need to abort.
11) Message boards : Cafe CPDN : Moving working installation on same computer. (Wine to VM)
Message 71854
Posted 27 days ago by Dave Jackson
I have had problems with BOINC (8.0.4) manager crashing and/or freezing under WINE. Having previously run a test to compare the output between a WINE installation and one under Windows in a VM, I decided to move it all to the VM for Windows work.

First I removed the already present Windows installation to start with a clean slate. I then did a fresh install of 8.0.4 in the guest OS (Win10). I then moved the data folder from the WINE installation on the host to the guest OS.

I can report that everything is working, including uploads and reporting of work despite the change of name of the computer from Swarm to Tiny10.

Clearly this swap is between machines of identical hardware.

On the original WINE install, I could have just kept going using command tools to manage the client as that seemed unaffected.

Today I noticed one stalled task where the estimated completion time was in the past. Pausing computation and then restarting the client got it going again. I know in this scenario they often crash. I do not know whether the stalled task had anything to do with the swap or not.

Edit: On WCG the VM installation is still called "Swarm", while here on CPDN it has changed to "Tiny10".
12) Message boards : Number crunching : #1020,1,2,3...
Message 71846
Posted 6 Nov 2024 by Dave Jackson
It seems like there's been a lot of timed out tasks lately. Not sure if it's normal and I just haven't noticed before.
Fairly normal. Especially for older computers that have long periods turned off or don't compute while the computer is in use.
13) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Message 71841
Posted 6 Nov 2024 by Dave Jackson
Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.
My guess is that the researchers have concluded they need this high resolution to develop really accurate forecasts for the areas in question.

With regard to the infrastructure, my completed tasks after the first couple have all uploaded without any intervention via the retry pending uploads button. But longer term they do need a major upgrade of their infrastructure, and from CPDN we know that major changes to infrastructure can, at least initially, cause more problems than they solve!
14) Message boards : Number crunching : #1020,1,2,3...
Message 71840
Posted 6 Nov 2024 by Dave Jackson
I am letting mine run. I think the results are wanted or the batches would have been cancelled to prevent resends.
If you can be bothered, you can do what I did. Three of those I was sent were still running on the original hosts but were being resent because they were past the deadline. They finished after I had been sent them, and I then cancelled them. I think the rest are unlikely to do so, but I will have another check later today, and if any finish on other hosts I will abort them too.
Edit: In this task, for example, you can see that I have leapfrogged the original cruncher.
15) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Message 71833
Posted 5 Nov 2024 by Dave Jackson
CPDN does the opposite, long tasks but less data to deal with for the project. Would that be a correct assessment?

I think the longer tasks for CPDN are more about what works better for the science than about data considerations, though CPDN has had issues with large amounts of data. It happened with some of the IFS batches.
16) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Message 71830
Posted 5 Nov 2024 by Dave Jackson
I don't know how many tasks they are sending out for this, but downloads and uploads are often in single figures of KB/s. I would certainly rather crunch longer tasks for them, and fewer of them, to reduce the amount of data going to and from crunchers. I don't know if any of the scientists involved ever look at the WCG forums. I know you, Glenn, are the first to regularly come on the CPDN ones, and that has, I think, made a massive difference to understanding among the crunchers who regularly read your posts, even if they don't always agree with you.

WCG would, I feel, benefit greatly from more direct communication between those running projects and those who crunch.
17) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Message 71828
Posted 5 Nov 2024 by Dave Jackson
I got a few also over half a day ago and still download issues. The short test they did over the weekend downloaded ok but just very slowly.
Yep. Taking longer to download tasks than it did when I downloaded CPDN tasks on dial-up! This is where it would be nice if BOINC could pause all the downloads except for one task. That way you could get one downloaded and running a bit more quickly.

Edit: The downloads for CPDN were not quite as big then as they are now!
18) Message boards : Number crunching : Almost 2025. Why doesn’t this project support multithreading?
Message 71825
Posted 4 Nov 2024 by Dave Jackson
Sorry, enabling & testing the multiprocessing in the older models is not something I'm going to spend time doing. They work fine as they are. I have more pressing things to do.

Fair enough. There are always going to be priorities that we won't know anything about. If climate science had the funding and resources it deserves....
19) Message boards : Number crunching : Almost 2025. Why doesn’t this project support multithreading?
Message 71823
Posted 4 Nov 2024 by Dave Jackson
All the meteorological codes that CPDN run are capable of multiprocessing, even the older ones.
Of course they are capable of multiprocessing. They were written to run on supercomputers. I blame my last post on having a head stuffed full of cold at the moment!

I can see that total throughput of tasks might be higher without multithreading. However, when there are relatively small batches and the first few hundred computers grab them all, multithreading would spread the tasks out between more computers and would get them returned more quickly.
20) Message boards : Number crunching : Connection and Download issues Oct24
Message 71820
Posted 4 Nov 2024 by Dave Jackson
I checked with Andy about this. CPDN doesn't issue a 'not needed' response if an earlier task in the workunit finishes. Experience has taught them users get annoyed by tasks being killed. So, yes, you'll need to abort it yourself.

If only BOINC had an option to say you were more interested in the science than in credit, allowing unwanted tasks to be killed by the project for those people. On checking through the tasks, it was just three on my box that had completed by today. At least two hadn't even started, so unless the person (not) running them has a very fast computer, there isn't much doubt my Ryzen9 will get in first.

Edit: If I had a vote, it would be for the tasks to be deleted. It might cut down on the numbers crunching for CPDN, but over time it might weed out some habitual very slow returners. But I get that such decisions are way above my pay grade. I am not intending to make waves by expressing my opinion!

©2024 cpdn.org