Thread 'World Community Grid mostly down for 2 months while transitioning'


AuthorMessage
geophi
Volunteer moderator

Send message
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 65130 - Posted: 10 Feb 2022, 1:44:24 UTC

My backup project isn't expected to issue new work for at least two months while servers are transitioned from IBM to Krembil Research Institute.

https://www.worldcommunitygrid.org/about_us/article.s?articleId=757

I hope cpdn actually comes up with some new work soon.
ID: 65130 · Report as offensive     Reply Quote
Jim1348

Send message
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 65131 - Posted: 10 Feb 2022, 2:18:21 UTC - in response to Message 65130.  
Last modified: 10 Feb 2022, 2:18:50 UTC

I hope cpdn actually comes up with some new work soon.

Me too. But I am perplexed that with 2163 unsent HadCM3 shorts, they would make them Mac only, or so I understand it.
It slows down the usual glacial speed to continental drift speed.
ID: 65131 · Report as offensive     Reply Quote
geophi
Volunteer moderator

Send message
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 65132 - Posted: 10 Feb 2022, 3:05:03 UTC - in response to Message 65131.  
Last modified: 10 Feb 2022, 3:08:05 UTC

I hope cpdn actually comes up with some new work soon.

Me too. But I am perplexed that with 2163 unsent HadCM3 shorts, they would make them Mac only, or so I understand it.
It slows down the usual glacial speed to continental drift speed.

So far 65% of the previous hadcm3s batch (926) have hard failed (all 3 tasks in the work unit errored out). This is a huge percentage. A significant majority of those failures, by percentage, were on Linux, with segmentation violations and missing libraries. The seg violations were through the roof. Macs, on the other hand, had significantly fewer failures percentage-wise. But given how long we went without any Mac work here, we probably don't have all that many Mac users running models, so it's going to take a while. I've got a feeling it would take quite a while to determine why these things are failing with seg violations on Linux PCs at a much higher percentage than previous hadcm3s batches.
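
As a rough sanity check on that 65% figure, here is a minimal sketch in Python. It assumes, purely for illustration, that the three tasks in a work unit fail independently and with the same probability, which the platform-specific failures described above suggest is not really the case. The function name is hypothetical and not part of any BOINC or CPDN tooling.

```python
# Back-of-envelope: what per-task failure rate would give a 65% hard-fail rate?
# ASSUMPTION: every task in a work unit fails independently with the same
# probability p, so P(work unit hard fails) = p ** tasks_per_workunit.

def per_task_failure_rate(hard_fail_fraction: float, tasks_per_workunit: int = 3) -> float:
    """Invert p ** n = hard_fail_fraction to recover the per-task rate p."""
    if not 0.0 <= hard_fail_fraction <= 1.0:
        raise ValueError("hard_fail_fraction must be between 0 and 1")
    if tasks_per_workunit < 1:
        raise ValueError("tasks_per_workunit must be at least 1")
    return hard_fail_fraction ** (1.0 / tasks_per_workunit)


p = per_task_failure_rate(0.65)  # 65% of work units lost all 3 tasks
print(f"implied per-task failure rate: {p:.1%}")
```

Under that independence assumption the per-task failure rate works out to roughly 87%, which gives a sense of how bad the batch was.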
ID: 65132 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65133 - Posted: 10 Feb 2022, 3:53:24 UTC - in response to Message 65132.  

I've got a feeling it would take quite a while to determine why these things are failing with seg violations on Linux PCs at a much higher percentage than previous hadcm3s batches.


I am sorry to say I agree with you. I would have preferred to believe it would be easy to locate the segmentation violation problem. I do know that the same programs worked last April (I think it was) where I got only one failure that was due to
CPU type 	GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.12.2.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version 	7.16.11
Memory 	62.4 GB
Cache 	16896 KB

Task 22024864
Name 	hadcm3s_r157_190012_240_837_011897728_1
Workunit 	11897728
Created 	28 Feb 2021, 11:33:16 UTC
Sent 	9 Mar 2021, 12:13:36 UTC
Report deadline 	19 Feb 2022, 17:33:36 UTC
Received 	12 Mar 2021, 11:59:59 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	22 (0x00000016) Unknown error code
Computer ID 	1511241

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(


This one died for a valid reason, and does not count as a bug. Of the January 2022 ones, all of mine failed with a segmentation violation except for one that completed successfully.

The ones with missing 32-bit compatibility libraries are regrettable, but belong in a different thread I believe.

My system is quite reliable, both the software and the hardware. It seems to me that the only way I should be getting segmentation violations would be if I had hardware problems (over-temperature, overclocking) or bugs in the application program. I keep track of the temperatures and I do not overclock things. It seems to me my processor chip does not even allow me to adjust the clock speed.

Every 11.0s: sensors  localhost.localdomain: Wed Feb  9 22:29:17 2022

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +68.0°C  (high = +88.0°C, crit = +98.0°C)
Core 1:        +64.0°C  (high = +88.0°C, crit = +98.0°C)
Core 2:        +63.0°C  (high = +88.0°C, crit = +98.0°C)
Core 3:        +64.0°C  (high = +88.0°C, crit = +98.0°C)
Core 5:        +67.0°C  (high = +88.0°C, crit = +98.0°C)
Core 8:        +67.0°C  (high = +88.0°C, crit = +98.0°C)
Core 9:        +64.0°C  (high = +88.0°C, crit = +98.0°C)
Core 11:       +68.0°C  (high = +88.0°C, crit = +98.0°C)
Core 12:       +61.0°C  (high = +88.0°C, crit = +98.0°C)

amdgpu-pci-6500
Adapter: PCI adapter
vddgfx:       +0.79 V
fan1:        2087 RPM  (min = 1800 RPM, max = 6000 RPM)
edge:         +33.0°C  (crit = +97.0°C, hyst = -273.1°C)
power1:        4.25 W  (cap =  25.00 W)

dell_smm-virtual-0
Adapter: Virtual device
fan1:        4279 RPM
fan2:         891 RPM
fan3:        2907 RPM
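
For anyone who wants to check their own box the same way, here is a minimal sketch that reads text in the format of the `sensors` readout above and reports how far each reading sits below its "high" threshold. It is an illustration only: the function name and regular expression are hypothetical, it only understands lines that carry a `high =` value (so the `edge` GPU line is skipped), and it works on pasted text rather than calling `sensors` itself.

```python
import re

# Matches lines such as:
#   "Core 1:        +64.0°C  (high = +88.0°C, crit = +98.0°C)"
# Lines without a "high =" threshold (fans, voltages, the GPU "edge" line) do not match.
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s+"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C"
)


def thermal_headroom(sensors_text: str) -> dict:
    """Return {label: degrees C below the 'high' threshold} for each matching line."""
    headroom = {}
    for line in sensors_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            headroom[match.group("label")] = (
                float(match.group("high")) - float(match.group("temp"))
            )
    return headroom


# A few lines copied from the readout above.
sample = """\
Package id 0:  +68.0°C  (high = +88.0°C, crit = +98.0°C)
Core 1:        +64.0°C  (high = +88.0°C, crit = +98.0°C)
Core 11:       +68.0°C  (high = +88.0°C, crit = +98.0°C)
edge:         +33.0°C  (crit = +97.0°C, hyst = -273.1°C)
fan1:        4279 RPM
"""

for label, margin in thermal_headroom(sample).items():
    print(f"{label}: {margin:.1f} C below 'high'")
```

On the readings shown, every core is at least 20 °C under its "high" limit, which is consistent with overheating not being the cause.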


Since the programs are FORTRAN, it seems to me the only bug that could be in there would be going off the end of an array (either end). I do not know whether CPDN people can even look at the million lines of FORTRAN code, but there may be no point, since they would not be allowed to change it anyway. If they are using a dialect of FORTRAN that allows them to call on the OS to allocate more RAM to the process (and later give it back), they can mess up by using more space than they got, or by continuing to use it after they gave it back. I am thinking of the malloc(3) functions in Linux/UNIX, but they deal with pointers and FORTRAN, strictly speaking, does not.

But whatever the problem is, it causes a crash within about 3 seconds from start-up, so that should make it easier (not easy) to find where the problem is.

If there is any way I can help with this, perhaps by running a fake application that deletes most of the code of a real application, let me know and we will see what we can do.
ID: 65133 · Report as offensive     Reply Quote
SolarSyonyk

Send message
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 65192 - Posted: 20 Feb 2022, 21:10:21 UTC

I don't think the people managing the BOINC side overlap in the slightest with those who run the models and want the results, and who have "it has to match the previous runs" constraints on the code. :/

Over in the Linux forum, I do have a writeup on how to run the Mac tasks on your Linux boxes with a VM, if you're up for some experimentation. I've been chewing through quite a few on my Linux compute nodes as I have spare power.
ID: 65192 · Report as offensive     Reply Quote
SolarSyonyk

Send message
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 65384 - Posted: 25 Apr 2022, 15:08:48 UTC

Well, April 22nd has come and gone, with the new WCG expectation being May 9th.

https://www.worldcommunitygrid.org/news/0421

I suppose I could toss a few Einstein tasks on the random compute nodes, but that's far less interesting than the physical simulation stuff WCG and CPDN are doing... I suppose I'll just let them idle down for a while. :( Keeps my office quieter, at least.
ID: 65384 · Report as offensive     Reply Quote
SolarSyonyk

Send message
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 65470 - Posted: 31 May 2022, 23:36:43 UTC

May 23rd: Almost ready to restart.

May 28th on Twitter:

We were unable to bring our production environment to the same state as the QA environment this week. As we also have yet to resolve an issue that prevents BOINC clients from downloading workunits, the effort to bring the Grid back online has stretched into next week.


twiddles his idle CPUs

I'm throwing some cycles at Einstein@Home, but I just can't get excited about pulsars. I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit!

I guess I could see if Folding@Home needs CPU cycles.
ID: 65470 · Report as offensive     Reply Quote
Jim1348

Send message
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 65471 - Posted: 1 Jun 2022, 12:01:55 UTC - in response to Message 65470.  
Last modified: 1 Jun 2022, 12:02:52 UTC

I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit!

I guess I could see if Folding@Home needs CPU cycles.

Folding is a great project, and I do a lot of it (both CPU and GPU). But the 3900X has a large cache/core ratio, which is helpful on some projects.

I have one on QuChemPedIA, and it works very well (Ubuntu 20.04.4).
https://quchempedia.univ-angers.fr/athome/
Ignore the large number of invalids. It is part of the science that they don't know which ones will work beforehand.
ID: 65471 · Report as offensive     Reply Quote
SolarSyonyk

Send message
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 65472 - Posted: 1 Jun 2022, 16:38:47 UTC

I used to do a bunch of F@H work on GPUs, but then the Great GPU Shortage happened, and I sent my GPUs to help out a friend who's using them as actual GPUs for some CAD work and such.

I'll take a look at that - thanks! I got F@H running on my 3900Xs for now. I'd like to get them chewing on CPDN or WCG tasks again... but as long as they've got something to munch on, I'm good. My office is solar powered and off grid, so power on a sunny day is use-it-or-lose-it - may as well use as much as I can during the good solar conditions.
ID: 65472 · Report as offensive     Reply Quote
Mr. P Hucker

Send message
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 65473 - Posted: 2 Jun 2022, 23:13:29 UTC - in response to Message 65470.  

May 23rd: Almost ready to restart.

May 28th on Twitter:

We were unable to bring our production environment to the same state as the QA environment this week. As we also have yet to resolve an issue that prevents BOINC clients from downloading workunits, the effort to bring the Grid back online has stretched into next week.
twiddles his idle CPUs
They'll probably be idle forever more. How long can it take to move a server?! 1.5 weeks ago they said it was working but they didn't trust their own work and are testing in house for a few days. Now it's been 9 days with no further news. Something else broke, didn't it?

I'm throwing some cycles at Einstein@Home, but I just can't get excited about pulsars. I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit!

I guess I could see if Folding@Home needs CPU cycles.
Their cancer research seems to have fallen off my vast array of GPUs; they need CPUs for the next phase. GPUs are now concentrating on Alzheimer's. Annoyingly, you can't choose which project you prefer there (there is an option but it's entirely ignored). They also lack basic functions like being able to abort a task it downloaded by mistake, because it has a nasty habit of starting up when you had it disabled then rebooted the machine. My CPUs are playing with Cosmology.
ID: 65473 · Report as offensive     Reply Quote
Mr. P Hucker

Send message
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 65474 - Posted: 2 Jun 2022, 23:13:54 UTC - in response to Message 65472.  

I used to do a bunch of F@H work on GPUs, but then the Great GPU Shortage happened, and I sent my GPUs to help out a friend who's using them as actual GPUs for some CAD work and such.

I'll take a look at that - thanks! I got F@H running on my 3900Xs for now. I'd like to get them chewing on CPDN or WCG tasks again... but as long as they've got something to munch on, I'm good. My office is solar powered and off grid, so power on a sunny day is use-it-or-lose-it - may as well use as much as I can during the good solar conditions.
No batteries?
ID: 65474 · Report as offensive     Reply Quote
SolarSyonyk

Send message
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 65477 - Posted: 3 Jun 2022, 2:12:43 UTC - in response to Message 65473.  

They'll probably be idle forever more.


https://www.youtube.com/watch?v=CGyTLvS_Ruo


I was born under the star,
never meant to journey far
From all the faces and the
place that I called home;
And my father lived the same,
and his father before him,
But now I see in my son's eyes
something has changed.

And the smoke it has stopped rising
from the chimney up the road,
And the light no longer shines over the door;
Last year I lent a hand to
haul the boats onto the land,
They've been lying there for
nineteen months or more,
And I wonder will they lie there evermore?

Wasn't many years ago that
the men 'round here would go
Out in their skiffs and haul
their traps out on the bay;
And then shortly they'd return
loaded down from stem to stern,
And weigh off the fish,
and store their gear away.

Now the waters are as barren
as the cliffs that guard the cove
And catch the north wind blowing off the shore;
And I wonder how an ocean
turns as lifeless as a stone,
And I wonder can the sea revive once more?
And I wonder will they lie there evermore?

Well, I hear some people say
we'd be better off to stay ashore
And train for jobs outside the fishery;
Now wouldn't I look like a fool
to go traipsing off to school,
After forty years of living off the sea?

Now, my son, he's barely twenty-one,
and handy at the trawl,
For years he helped me fish the Labrador;
Now he's moving to Ontario before the first snowfall,
"Dad, there's nothing left for me 'round here no more."
And I wonder will I see his children born?
And I wonder will they lie there evermore?


How long can it take to move a server?! 1.5 weeks ago they said it was working but they didn't trust their own work and are testing in house for a few days. Now it's been 9 days with no further news. Something else broke, didn't it?


What's concerning to me is that BOINC is not using fancy, bleeding edge, failure prone technology. It's an absolutely ancient technology stack - straight up LAMP, as far as I can tell from the server install guides (Linux, Apache, MySQL, PHP). It's not the sort of thing that should be hard to port, and while WCG is more complex than some others, unless they've gone absolutely nuts or have zero "legacy Linux sysadmins," it shouldn't take more than a couple weeks to move. I assume the bulk of the time was moving data around, but... even then, just drive a server around and hook up 10G server to server. I don't get how this is nearly so long a transition as it is, and I'm not at all optimistic that they're "Nearly, almost, just a tippy tappy bit more... almost... any day now, soon... *crickets*" about the move.
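
To put a number on the "hook up 10G server to server" idea, here is a minimal sketch of the transfer-time arithmetic. The data volume is a made-up placeholder (WCG has not said how much data was moved, as far as this thread shows), and the `efficiency` parameter is a hypothetical knob for protocol and disk overhead.

```python
# Rough transfer-time estimate for copying data over a direct network link.
# ASSUMPTIONS: decimal units (1 TB = 1e12 bytes, 1 Gb/s = 1e9 bits/s), and
# the data volume used below is hypothetical, not a figure from WCG.

def transfer_hours(data_terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_terabytes over a link_gbps link.

    efficiency is the fraction of line rate actually achieved (0 < efficiency <= 1).
    """
    if data_terabytes < 0 or link_gbps <= 0:
        raise ValueError("data volume must be non-negative and link speed positive")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0


# Hypothetical 100 TB over 10 Gb/s, at full line rate and at 50% of it.
for eff in (1.0, 0.5):
    print(f"efficiency {eff:.0%}: {transfer_hours(100, 10, eff):.1f} hours")
```

Even with a pessimistic efficiency, a placeholder volume of that size is a matter of days rather than months, which is the point being made above; the real bottleneck is presumably elsewhere.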

Their cancer research seems to have fallen off my vast array of GPUs; they need CPUs for the next phase. GPUs are now concentrating on Alzheimer's. Annoyingly, you can't choose which project you prefer there (there is an option but it's entirely ignored). They also lack basic functions like being able to abort a task it downloaded by mistake, because it has a nasty habit of starting up when you had it disabled then rebooted the machine. My CPUs are playing with Cosmology.


Well, I've got CPUs to throw at them. One of the 3900Xs is purring away; the other is waiting on a better heatsink. I got a used combo and the heatsink on it is... not suited to that power-hungry a chip: it slammed into 95C and pulled back clocks. It should be up and online fully in another couple days, and I'll point it at F@H until I get either CPDN tasks (Mac, Windows, Linux, I don't care, I'll run 'em in a VM!), WCG tasks, or... I just give up. It seems like there was a resurgence in BOINC stuff in the early days of Covid as a last hurrah, and now... :/

No batteries?


I have batteries, but they're not sized to run serious loads throughout the night.

My office has about 5kW of panel hung, and a 10kWh flooded lead acid bank that is basically "surge current during the day," and "keep the property area network running and the systems sleeping overnight." I can run some limited compute 24/7 out there during the summer months, and my 5775C is running 24/7 right now, but the bigger stuff goes to sleep in the evening and I wake it in the morning.
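
For a feel of the sizing, here is a minimal sketch of the battery arithmetic. Only the 10 kWh bank size comes from the description above; the usable fraction (a common rule of thumb for lead acid is to stay above roughly 50% charge) and the load figures are assumptions for illustration, and the function ignores inverter losses and the other loads on the system.

```python
# How long could a battery bank carry a steady load?
# ASSUMPTIONS: usable_fraction and the loads below are illustrative guesses;
# inverter losses and capacity loss at high discharge rates are ignored.

def runtime_hours(bank_kwh: float, usable_fraction: float, load_watts: float) -> float:
    """Hours a bank can supply load_watts using only usable_fraction of its capacity."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_wh = bank_kwh * 1000.0 * usable_fraction
    return usable_wh / load_watts


BANK_KWH = 10.0  # bank size stated above
for load in (100, 250, 500):  # hypothetical steady loads in watts
    hours = runtime_hours(BANK_KWH, usable_fraction=0.5, load_watts=load)
    print(f"{load} W load: about {hours:.0f} h to the assumed 50% limit")
```

A light always-on machine fits comfortably, while a few hundred watts of compute would push the bank to that limit every night, which matches the "bigger stuff goes to sleep in the evening" approach.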

When I replace the bank at some point, I'll put a larger bank in (I actually have it, but it's serving duty in a power trailer at the moment), and I can run more, but the office is still fundamentally designed as a daylight hours workplace.
ID: 65477 · Report as offensive     Reply Quote
Bryn Mawr

Send message
Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 65483 - Posted: 3 Jun 2022, 9:30:01 UTC - in response to Message 65477.  


What's concerning to me is that BOINC is not using fancy, bleeding-edge, failure-prone technology. It's an absolutely ancient technology stack - straight up LAMP, as far as I can tell from the server install guides (Linux, Apache, MySQL, PHP). It's not the sort of thing that should be hard to port, and while WCG is more complex than some others, unless they've gone absolutely nuts or have zero "legacy Linux sysadmins," it shouldn't take more than a couple weeks to move. I assume the bulk of the time was moving data around, but... even then, just drive a server around and hook up 10G server to server. I don't get how this is nearly so long a transition as it is, and I'm not at all optimistic that they're "Nearly, almost, just a tippy tappy bit more... almost... any day now, soon... *crickets*" about the move.


If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing. It is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback.

This I know having worked on just such a system and seen the problems first hand.
ID: 65483 · Report as offensive     Reply Quote
Mr. P Hucker

Send message
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 65485 - Posted: 3 Jun 2022, 10:25:02 UTC - in response to Message 65483.  

If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing. It is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback.

This I know having worked on just such a system and seen the problems first hand.
It worked fine before, why are they messing about?
ID: 65485 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 65488 - Posted: 3 Jun 2022, 12:02:05 UTC - in response to Message 65485.  

It worked fine before, why are they messing about?


I suspect the CPDN fora are not where you are most likely to find someone who knows the answer to that one. ;)
ID: 65488 · Report as offensive     Reply Quote
Mr. P Hucker

Send message
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 65489 - Posted: 3 Jun 2022, 12:53:57 UTC - in response to Message 65488.  
Last modified: 3 Jun 2022, 12:54:32 UTC

It worked fine before, why are they messing about?


I suspect the CPDN fora are not where you are most likely to find someone who knows the answer to that one. ;)
Well I won't get it from theirs as I've been banned permanently, twice.

Odd that when they ban you, they don't stop you crunching for them....
ID: 65489 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 65490 - Posted: 3 Jun 2022, 13:00:18 UTC

Well I won't get it from theirs as I've been banned permanently, twice.


My guess is that will be where it appears first, just not in response to a question from you.
ID: 65490 · Report as offensive     Reply Quote
Bryn Mawr

Send message
Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 65491 - Posted: 3 Jun 2022, 14:52:42 UTC - in response to Message 65485.  

If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing. It is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback.

This I know having worked on just such a system and seen the problems first hand.
It worked fine before, why are they messing about?


Because they need back-end systems to create WUs in the first place, and to validate and post-process the WUs on return, all of which is project-related and not part of BOINC.
ID: 65491 · Report as offensive     Reply Quote
Mr. P Hucker

Send message
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 65492 - Posted: 3 Jun 2022, 23:29:35 UTC - in response to Message 65490.  

Well I won't get it from theirs as I've been banned permanently, twice.
My guess is that will be where it appears first, just not in response to a question from you.
They haven't even got the forums working yet. They're using Facebook, where they have answered literally zero questions. They just post annoying empty promises repeatedly.
ID: 65492 · Report as offensive     Reply Quote
Mr. P Hucker

Send message
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 65493 - Posted: 3 Jun 2022, 23:31:36 UTC - in response to Message 65491.  

It worked fine before, why are they messing about?
Because they need back-end systems to create WUs in the first place, and to validate and post-process the WUs on return, all of which is project-related and not part of BOINC.
But they already had this and will be using the same scientific programs as before, they're not going to change all that. And why on earth didn't they get this one up and running before they stopped using the other one?! Imagine if Google shut down for 3 months while they moved house.
ID: 65493 · Report as offensive     Reply Quote