Thread 'World Community Grid mostly down for 2 months while transitioning'

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 65130 - Posted: 10 Feb 2022, 1:44:24 UTC My backup project isn't expected to issue new work for at least two months while servers are transitioned from IBM to Krembil Research Institute. https://www.worldcommunitygrid.org/about_us/article.s?articleId=757 I hope cpdn actually comes up with some new work soon. ID: 65130 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 65131 - Posted: 10 Feb 2022, 2:18:21 UTC - in response to Message 65130. Last modified: 10 Feb 2022, 2:18:50 UTC I hope cpdn actually comes up with some new work soon. Me too. But I am perplexed that with 2163 unsent HadCM3 shorts, they would make them Mac only, or so I understand it. It slows down the usual glacial speed to continental drift speed. ID: 65131 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 65132 - Posted: 10 Feb 2022, 3:05:03 UTC - in response to Message 65131. Last modified: 10 Feb 2022, 3:08:05 UTC I hope cpdn actually comes up with some new work soon. Me too. But I am perplexed that with 2163 unsent HadCM3 shorts, they would make them Mac only, or so I understand it. It slows down the usual glacial speed to continental drift speed. So far, 65% of the previous hadcm3s batch (926) has hard failed (all 3 tasks in the work unit errored out). This is a huge percentage. A significant majority of those failures, by percentage, were on Linux, with segmentation violations and missing libraries. The seg violations were through the roof. Macs, on the other hand, had significantly fewer failures percentage-wise. But given how long we went without any Mac work here, we probably don't have all that many Mac users running models, so it's going to take a while. I've got a feeling it would take quite a while to determine why these things are failing with seg violations on Linux PCs at a much higher percentage than previous hadcm3s batches. ID: 65132 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 65133 - Posted: 10 Feb 2022, 3:53:24 UTC - in response to Message 65132. I've got a feeling it would take quite awhile to determine why these things are failing with seg violations on Linux PCs at a much higher percentage than previous hadcm3s batches. I am sorry to say I agree with you. I would have preferred to believe it would be easy to locate the segmentation violation problem. I do know that the same programs worked last April (I think it was), when I got only one failure, and that one was due to the model itself crashing:
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.12.2.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version: 7.16.11
Memory: 62.4 GB
Cache: 16896 KB
Task: 22024864
Name: hadcm3s_r157_190012_240_837_011897728_1
Workunit: 11897728
Created: 28 Feb 2021, 11:33:16 UTC
Sent: 9 Mar 2021, 12:13:36 UTC
Report deadline: 19 Feb 2022, 17:33:36 UTC
Received: 12 Mar 2021, 11:59:59 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 22 (0x00000016) Unknown error code
Computer ID: 1511241
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
This one died for a valid reason, and does not count as a bug. Of the January 2022 ones, all of mine failed with a segmentation violation except for one that completed successfully. The ones with missing 32-bit compatibility libraries are regrettable, but belong in a different thread, I believe.
My system is quite reliable, both the software and the hardware. It seems to me that the only way I should be getting segmentation violations would be if I had hardware problems (overheating or overclocking) or bugs in the application program. I keep track of the temperatures and I do not overclock things. It seems to me my processor chip does not even allow me to adjust the clock speed.
Every 11.0s: sensors localhost.localdomain: Wed Feb 9 22:29:17 2022
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 1: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 2: +63.0°C (high = +88.0°C, crit = +98.0°C)
Core 3: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 5: +67.0°C (high = +88.0°C, crit = +98.0°C)
Core 8: +67.0°C (high = +88.0°C, crit = +98.0°C)
Core 9: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 11: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 12: +61.0°C (high = +88.0°C, crit = +98.0°C)
amdgpu-pci-6500
Adapter: PCI adapter
vddgfx: +0.79 V
fan1: 2087 RPM (min = 1800 RPM, max = 6000 RPM)
edge: +33.0°C (crit = +97.0°C, hyst = -273.1°C)
power1: 4.25 W (cap = 25.00 W)
dell_smm-virtual-0
Adapter: Virtual device
fan1: 4279 RPM
fan2: 891 RPM
fan3: 2907 RPM
Since the programs are written in FORTRAN, it seems to me the only kind of bug that could be in there would be going off the end of an array (either end of the array). I do not know if CPDN people can even look at the million lines of FORTRAN code or not, but there may be no point since they would not be allowed to change it anyway. If they are using a dialect of FORTRAN that allows them to call on the OS to allocate more RAM to the process (and later give it back), they can mess up by using more space than they got, or by continuing to use it after they gave it back. I am thinking of the malloc(3) functions in Linux/UNIX, but they deal with pointers and FORTRAN, strictly speaking, does not.
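For anyone who wants to rule out overheating the same way, the check against the temperature readings quoted earlier in this post can be done mechanically. What follows is only a minimal sketch in Python, assuming the coretemp lines keep the `label: +NN.N°C (high = +NN.N°C, ...)` layout shown above; the function name `thermal_margins` and the regular expression are made up for illustration and are not part of lm-sensors or BOINC.

```python
import re

# Readings copied from the `sensors` output quoted in this post.
SENSORS_TEXT = """\
Package id 0: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 1: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 2: +63.0°C (high = +88.0°C, crit = +98.0°C)
Core 3: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 5: +67.0°C (high = +88.0°C, crit = +98.0°C)
Core 8: +67.0°C (high = +88.0°C, crit = +98.0°C)
Core 9: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 11: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 12: +61.0°C (high = +88.0°C, crit = +98.0°C)
"""

# Hypothetical pattern: label, current temperature, then the "high" threshold.
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s*"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C"
)


def thermal_margins(text):
    """Return {label: high threshold minus current temperature} per sensor line."""
    margins = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            high = float(match.group("high"))
            temp = float(match.group("temp"))
            margins[match.group("label")] = high - temp
    return margins


margins = thermal_margins(SENSORS_TEXT)
tightest = min(margins, key=margins.get)
print(f"{len(margins)} sensors read; tightest margin is {tightest}, "
      f"{margins[tightest]:.1f} degrees below its 'high' threshold")
```

Every sensor in that snapshot sits well below its "high" threshold, which is consistent with the claim that temperature is not the cause. A single snapshot says nothing about transient spikes, so it narrows the suspects rather than proving anything.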
But whatever the problem is, it causes a crash within about 3 seconds from start-up, so that should make it easier (not easy) to find where the problem is. If there is any way I can help with this, perhaps by running a fake application that deletes most of the code of a real application, let me know and we will see what we can do. ID: 65133 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 65192 - Posted: 20 Feb 2022, 21:10:21 UTC I don't think any of the people managing the BOINC side are in the slightest bit overlapping with those who run the models and want the results, with "it has to match the previous runs" constraints on code. :/ Over in the Linux forum, I do have a writeup on how to run the Mac tasks on your Linux boxes with a VM, if you're up for some experimentation. I've been chewing through quite a few on my Linux compute nodes as I have spare power. ID: 65192 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 65384 - Posted: 25 Apr 2022, 15:08:48 UTC Well, April 22nd has come and gone, with the new WCG expectation being May 9th. https://www.worldcommunitygrid.org/news/0421 I suppose I could toss a few Einstein tasks on the random compute nodes, but that's far less interesting than the physical simulation stuff WCG and CPDN are doing... I suppose I'll just let them idle down for a while. :( Keeps my office quieter, at least. ID: 65384 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 65470 - Posted: 31 May 2022, 23:36:43 UTC May 23rd: Almost ready to restart. May 28th on Twitter: "We were unable to bring our production environment to the same state as the QA environment this week. As we also have yet to resolve an issue that prevents BOINC clients from downloading workunits, the effort to bring the Grid back online has stretched into next week." *twiddles his idle CPUs* I'm throwing some cycles at Einstein@Home, but I just can't get excited about pulsars. I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit! I guess I could see if Folding@Home needs CPU cycles. ID: 65470 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 65471 - Posted: 1 Jun 2022, 12:01:55 UTC - in response to Message 65470. Last modified: 1 Jun 2022, 12:02:52 UTC I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit! I guess I could see if Folding@Home needs CPU cycles. Folding is a great project, and I do a lot of it (both CPU and GPU). But the 3900X has a large cache/core ratio, which is helpful on some projects. I have one on QuChemPedIA, and it works very well (Ubuntu 20.04.4). https://quchempedia.univ-angers.fr/athome/ Ignore the large number of invalids. It is part of the science that they don't know which ones will work beforehand. ID: 65471 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 65472 - Posted: 1 Jun 2022, 16:38:47 UTC I used to do a bunch of F@H work on GPUs, but then the Great GPU Shortage happened, and I sent my GPUs to help out a friend who's using them as actual GPUs for some CAD work and such. I'll take a look at that - thanks! I got F@H running on my 3900Xs for now. I'd like to get them chewing on CPDN or WCG tasks again... but as long as they've got something to munch on, I'm good. My office is solar powered and off grid, so power on a sunny day is use-it-or-lose-it - may as well use as much as I can during the good solar conditions. ID: 65472 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 65473 - Posted: 2 Jun 2022, 23:13:29 UTC - in response to Message 65470. May 23rd: Almost ready to restart. May 28th on Twitter: We were unable to bring our production environment to the same state as the QA environment this week. As we also have yet to resolve an issue that prevents BOINC clients from downloading workunits, the effort to bring the Grid back online has stretched into next week. twiddles his idle CPUs They'll probably be idle forever more. How long can it take to move a server?! 1.5 weeks ago they said it was working, but they didn't trust their own work and were testing in house for a few days. Now it's been 9 days with no further news. Something else broke, didn't it? I'm throwing some cycles at Einstein@Home, but I just can't get excited about pulsars. I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit! I guess I could see if Folding@Home needs CPU cycles. Their cancer research seems to have fallen off my vast array of GPUs; they need CPUs for the next phase. GPUs are now concentrating on Alzheimer's. Annoyingly, you can't choose which project you prefer there (there is an option, but it's entirely ignored). They also lack basic functions like being able to abort a task downloaded by mistake, which matters because the client has a nasty habit of starting up after a reboot even when you had it disabled. My CPUs are playing with Cosmology. ID: 65473 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 65474 - Posted: 2 Jun 2022, 23:13:54 UTC - in response to Message 65472. I used to do a bunch of F@H work on GPUs, but then the Great GPU Shortage happened, and I sent my GPUs to help out a friend who's using them as actual GPUs for some CAD work and such. I'll take a look at that - thanks! I got F@H running on my 3900Xs for now. I'd like to get them chewing on CPDN or WCG tasks again... but as long as they've got something to munch on, I'm good. My office is solar powered and off grid, so power on a sunny day is use-it-or-lose-it - may as well use as much as I can during the good solar conditions. No batteries? ID: 65474 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 65477 - Posted: 3 Jun 2022, 2:12:43 UTC - in response to Message 65473. They'll probably be idle forever more. https://www.youtube.com/watch?v=CGyTLvS_Ruo I was born under the star, never meant to journey far From all the faces and the place that I called home; And my father lived the same, and his father before him, But now I see in my son's eyes something has changed. And the smoke it has stopped rising from the chimney up the road, And the light no longer shines over the door; Last year I lent a hand to haul the boats onto the land, They've been lying there for nineteen months or more, And I wonder will they lie there evermore? Wasn't many years ago that the men 'round here would go Out in their skiffs and haul their traps out on the bay; And then shortly they'd return loaded down from stem to stern, And weigh off the fish, and store their gear away. Now the waters are as barren as the cliffs that guard the cove And catch the north wind blowing off the shore; And I wonder how an ocean turns as lifeless as a stone, And I wonder can the sea revive once more? And I wonder will they lie there evermore? Well, I hear some people say we'd be better off to stay ashore And train for jobs outside the fishery; Now wouldn't I look like a fool to go traipsing off to school, After forty years of living off the sea? Now, my son, he's barely twenty-one, and handy at the trawl, For years he helped me fish the Labrador; Now he's moving to Ontario before the first snowfall, "Dad, there's nothing left for me 'round here no more." And I wonder will I see his children born? And I wonder will they lie there evermore? How long can it take to move a server?! 1.5 weeks ago they said it was working but they didn't trust their own work and are testing in house for a few days. Now it's been 9 days with no further news. Something else broke didn't it? 
What's concerning to me is that BOINC is not using fancy, bleeding edge, failure prone technology. It's an absolutely ancient technology stack - straight up LAMP, as far as I can tell from the server install guides (Linux, Apache, MySQL, PHP). It's not the sort of thing that should be hard to port, and while WCG is more complex than some others, unless they've gone absolutely nuts or have zero "legacy Linux sysadmins," it shouldn't take more than a couple weeks to move. I assume the bulk of the time was moving data around, but... even then, just drive a server around and hook up 10G server to server. I don't get how this is nearly so long a transition as it is, and I'm not at all optimistic that they're "Nearly, almost, just a tippy tappy bit more... almost... any day now, soon... crickets" about the move. Their cancer research seems to have fallen off my vast array of GPUs, they need CPUs for the next phase. GPUs are now concentrating on Alzheimers. Annoyingly, you can't choose which project you prefer there (there is an option but it's entirely ignored). They also lack basic functions like being able to abort a task it downloaded by mistake, because it has a nasty habit of starting up when you had it disabled then rebooted the machine. My CPUs are playing with Cosmology. Well, I've got CPUs to throw at them. One of the 3900Xs is purring away, the other is waiting on a better heatsink - I got a used combo and the heatsink on it is... not suited to that power hungry a chip, it slammed into 95C and pulled back clocks. It should be up and online fully in another couple days, and I'll point it at F@H until I get either CPDN tasks (Mac, Windows, Linux, I don't care, I'll run 'em in a VM!), WCG tasks, or... I just give up. It seems like there was a resurgence in BOINC stuff in the early days of Covid as a last hurrah, and now... :/ No batteries? I have batteries, but they're not sized to run serious loads throughout the night. 
My office has about 5kW of panel hung, and a 10kWh flooded lead acid bank that is basically "surge current during the day," and "keep the property area network running and the systems sleeping overnight." I can run some limited compute 24/7 out there during the summer months, and my 5775C is running 24/7 right now, but the bigger stuff goes to sleep in the evening and I wake it in the morning. When I replace the bank at some point, I'll put a larger bank in (I actually have it, but it's serving duty in a power trailer at the moment), and I can run more, but the office is still fundamentally designed as a daylight hours workplace. ID: 65477 · Reply Quote

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 149 Credit: 12,830,559 RAC: 228	Message 65483 - Posted: 3 Jun 2022, 9:30:01 UTC - in response to Message 65477. What's concerning to me is that BOINC is not using fancy, bleeding edge, failure prone technology. It's an absolutely ancient technology stack - straight up LAMP, as far as I can tell from the server install guides (Linux, Apache, MySQL, PHP). It's not the sort of thing that should be hard to port, and while WCG is more complex than some others, unless they've gone absolutely nuts or have zero "legacy Linux sysadmins," it shouldn't take more than a couple weeks to move. I assume the bulk of the time was moving data around, but... even then, just drive a server around and hook up 10G server to server. I don't get how this is nearly so long a transition as it is, and I'm not at all optimistic that they're "Nearly, almost, just a tippy tappy bit more... almost... any day now, soon... crickets" about the move. If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing. It is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback. This I know, having worked on just such a system and seen the problems first hand. ID: 65483 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 65485 - Posted: 3 Jun 2022, 10:25:02 UTC - in response to Message 65483. If you look at the updates they’ve provided the development they’re doing is WebSphere / Message Broker and whilst MB has been around for a long time WS is quite new and the combination is very much current technology for real time transaction processing and is complex and difficult to get right - especially if you get into scenarios like dual centre working for security fallback. This I know having worked on just such a system and seen the problems first hand. It worked fine before, why are they messing about? ID: 65485 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 65488 - Posted: 3 Jun 2022, 12:02:05 UTC - in response to Message 65485. It worked fine before, why are they messing about? I suspect the CPDN fora are not where you are most likely to find someone who knows the answer to that one. ;) ID: 65488 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 65489 - Posted: 3 Jun 2022, 12:53:57 UTC - in response to Message 65488. Last modified: 3 Jun 2022, 12:54:32 UTC It worked fine before, why are they messing about? I suspect the CPDN fora are not where you are most likely to find someone who knows the answer to that one. ;) Well I won't get it from theirs as I've been banned permanently, twice. Odd that when they ban you, they don't stop you crunching for them.... ID: 65489 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 65490 - Posted: 3 Jun 2022, 13:00:18 UTC Well I won't get it from theirs as I've been banned permanently, twice. My guess is that will be where it appears first, just not in response to a question from you. ID: 65490 · Reply Quote

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 149 Credit: 12,830,559 RAC: 228	Message 65491 - Posted: 3 Jun 2022, 14:52:42 UTC - in response to Message 65485. If you look at the updates they’ve provided the development they’re doing is WebSphere / Message Broker and whilst MB has been around for a long time WS is quite new and the combination is very much current technology for real time transaction processing and is complex and difficult to get right - especially if you get into scenarios like dual centre working for security fallback. This I know having worked on just such a system and seen the problems first hand. It worked fine before, why are they messing about? Because they need back end systems to create WUs in the first place and validate and post process the WUs on return, all of which is project related and not part of Boinc. ID: 65491 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 65492 - Posted: 3 Jun 2022, 23:29:35 UTC - in response to Message 65490. Well I won't get it from theirs as I've been banned permanently, twice. My guess is that will be where it appears first, just not in response to a question from you. They haven't even got the forums working yet. They're using Facebook, where they have answered literally zero questions. They just post annoying empty promises repeatedly. ID: 65492 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 65493 - Posted: 3 Jun 2022, 23:31:36 UTC - in response to Message 65491. It worked fine before, why are they messing about? Because they need back end systems to create WUs in the first place and validate and post process the WUs on return, all of which is project related and not part of Boinc. But they already had this and will be using the same scientific programs as before, they're not going to change all that. And why on earth didn't they get this one up and running before they stopped using the other one?! Imagine if Google shut down for 3 months while they moved house. ID: 65493 · Reply Quote