World Community Grid mostly down for 2 months while transitioning
Author | Message |
---|---|
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
My backup project isn't expected to issue new work for at least two months while servers are transitioned from IBM to Krembil Research Institute. https://www.worldcommunitygrid.org/about_us/article.s?articleId=757 I hope cpdn actually comes up with some new work soon. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I hope cpdn actually comes up with some new work soon. Me too. But I am perplexed that with 2163 unsent HadCM3 shorts, they would make them Mac only, or so I understand it. It slows down the usual glacial speed to continental drift speed. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I hope cpdn actually comes up with some new work soon. So far 65% of the previous hadcm3s batch (926) have hard failed (all 3 tasks in the work units errored out). This is a huge percentage. And a significant majority of those failures by percentage were Linux with segmentation violations and missing libraries. The seg violations were through the roof. On the other hand, Macs had significantly fewer failures percentage-wise. But given how long we went without any Mac work here, we probably don't have all that many Mac users running models. So, it's going to take a while. I've got a feeling it would take quite a while to determine why these things are failing with seg violations on Linux PCs at a much higher percentage than previous hadcm3s batches. |
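As a rough illustration of the "hard failed" bookkeeping described above (a work unit counts as hard failed only once all 3 of its tasks have errored out), here is a minimal Python sketch. The data layout, the outcome strings, the work unit ids, and the function name are all assumptions made for illustration; this is not CPDN's actual server code.

```python
from typing import Dict, List

MAX_TASKS_PER_WORKUNIT = 3  # from the post: all 3 tasks must error out


def hard_failure_rate(outcomes: Dict[str, List[str]]) -> float:
    """Return the fraction of work units whose every task errored out.

    `outcomes` maps a hypothetical work unit id to the outcomes
    ("error" or "success") of the tasks issued for it so far.
    """
    if not outcomes:
        return 0.0
    hard_failed = sum(
        1
        for results in outcomes.values()
        if len(results) >= MAX_TASKS_PER_WORKUNIT
        and all(r == "error" for r in results)
    )
    return hard_failed / len(outcomes)


# Made-up sample data, only to show the counting rule.
example = {
    "wu_a": ["error", "error", "error"],  # hard failure
    "wu_b": ["error", "success"],         # rescued by the second task
    "wu_c": ["error", "error"],           # one attempt still left
    "wu_d": ["success"],
}
print(f"{hard_failure_rate(example):.0%}")  # 1 of 4 work units
```

Splitting the same count by operating system would be one way to compare the Linux and Mac failure percentages mentioned in the post.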
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I've got a feeling it would take quite a while to determine why these things are failing with seg violations on Linux PCs at a much higher percentage than previous hadcm3s batches. I am sorry to say I agree with you. I would have preferred to believe it would be easy to locate the segmentation violation problem. I do know that the same programs worked last April (I think it was) where I got only one failure that was due to:
CPU type GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 16
Operating System Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.12.2.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version 7.16.11
Memory 62.4 GB
Cache 16896 KB
Task 22024864
Name hadcm3s_r157_190012_240_837_011897728_1
Workunit 11897728
Created 28 Feb 2021, 11:33:16 UTC
Sent 9 Mar 2021, 12:13:36 UTC
Report deadline 19 Feb 2022, 17:33:36 UTC
Received 12 Mar 2021, 11:59:59 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 22 (0x00000016) Unknown error code
Computer ID 1511241
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
This one died for a valid reason, and does not count as a bug. Of the January 2022 ones, all of mine failed with a segmentation violation except for one that completed successfully. The ones with missing 32-bit compatibility libraries are regrettable, but belong in a different thread I believe. My system is quite reliable, both the software and the hardware. 
It seems to me that the only way I should be getting segmentation violations would be if I had hardware problems (overtemperature, overclocking) or bugs in the application program. I keep track of the temperatures and I do not overclock things. It seems to me my processor chip does not even allow me to adjust the clock speed.
Every 11.0s: sensors localhost.localdomain: Wed Feb 9 22:29:17 2022
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 1: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 2: +63.0°C (high = +88.0°C, crit = +98.0°C)
Core 3: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 5: +67.0°C (high = +88.0°C, crit = +98.0°C)
Core 8: +67.0°C (high = +88.0°C, crit = +98.0°C)
Core 9: +64.0°C (high = +88.0°C, crit = +98.0°C)
Core 11: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 12: +61.0°C (high = +88.0°C, crit = +98.0°C)
amdgpu-pci-6500
Adapter: PCI adapter
vddgfx: +0.79 V
fan1: 2087 RPM (min = 1800 RPM, max = 6000 RPM)
edge: +33.0°C (crit = +97.0°C, hyst = -273.1°C)
power1: 4.25 W (cap = 25.00 W)
dell_smm-virtual-0
Adapter: Virtual device
fan1: 4279 RPM
fan2: 891 RPM
fan3: 2907 RPM
Since the programs are FORTRAN, it seems to me the only bug that could be in there would be going off the end of an array (either end of the array). I do not know if CPDN people can even look at the million lines of FORTRAN code or not, but there may be no point since they would not be allowed to change it anyway. If they are using a dialect of FORTRAN that allows them to call on the OS to allocate more RAM to the process (and later give it back), they can mess up by using more space than they got, or by continuing to use it after they gave it back. I am thinking of the malloc(3) functions in Linux/UNIX, but they deal with pointers and FORTRAN, strictly speaking, does not. But whatever the problem is, it causes a crash within about 3 seconds from start-up, so that should make it easier (not easy) to find where the problem is. 
If there is any way I can help with this, perhaps by running a fake application that deletes most of the code of a real application, let me know and we will see what we can do. |
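The temperature check described in this post could be automated along these lines. This is a minimal sketch, assuming text in the same format as the `sensors` output quoted above; the function name and the regular expression are illustrative only, and the sample is a shortened copy of the readings from the post.

```python
import re
from typing import List, Tuple

# Matches entries such as "Core 1: +64.0°C (high = +88.0°C, crit = +98.0°C)"
READING = re.compile(
    r"(?P<label>Package id \d+|Core \d+):\s*\+(?P<temp>\d+(?:\.\d+)?)°C\s*"
    r"\(high = \+(?P<high>\d+(?:\.\d+)?)°C"
)


def over_limit(sensors_text: str) -> List[Tuple[str, float, float]]:
    """Return (label, temperature, high limit) for readings at or above 'high'."""
    hot = []
    for m in READING.finditer(sensors_text):
        temp, high = float(m["temp"]), float(m["high"])
        if temp >= high:
            hot.append((m["label"], temp, high))
    return hot


sample = (
    "Package id 0: +68.0°C (high = +88.0°C, crit = +98.0°C) "
    "Core 1: +64.0°C (high = +88.0°C, crit = +98.0°C) "
    "Core 11: +68.0°C (high = +88.0°C, crit = +98.0°C)"
)
print(over_limit(sample))  # [] : every reading is below its high limit
```

An empty result for readings like these is consistent with the poster's point that overheating is an unlikely cause of the segmentation violations, though it says nothing about the other possible causes.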
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I don't think any of the people managing the BOINC side are in the slightest bit overlapping with those who run the models and want the results, with "it has to match the previous runs" constraints on code. :/ Over in the Linux forum, I do have a writeup on how to run the Mac tasks on your Linux boxes with a VM, if you're up for some experimentation. I've been chewing through quite a few on my Linux compute nodes as I have spare power. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
Well, April 22nd has come and gone, with the new WCG expectation being May 9th. https://www.worldcommunitygrid.org/news/0421 I suppose I could toss a few Einstein tasks on the random compute nodes, but that's far less interesting than the physical simulation stuff WCG and CPDN are doing... I suppose I'll just let them idle down for a while. :( Keeps my office quieter, at least. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
May 23rd: Almost ready to restart. May 28th on Twitter:
twiddles his idle CPUs I'm throwing some cycles at Einstein@Home, but I just can't get excited about pulsars. I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit! I guess I could see if Folding@Home needs CPU cycles. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit! Folding is a great project, and I do a lot of it (both CPU and GPU). But the 3900X has a large cache/core ratio, which is helpful on some projects. I have one on QuChemPedIA, and it works very well (Ubuntu 20.04.4). https://quchempedia.univ-angers.fr/athome/ Ignore the large number of invalids. It is part of the science that they don't know which ones will work beforehand. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I used to do a bunch of F@H work on GPUs, but then the Great GPU Shortage happened, and I sent my GPUs to help out a friend who's using them as actual GPUs for some CAD work and such. I'll take a look at that - thanks! I got F@H running on my 3900Xs for now. I'd like to get them chewing on CPDN or WCG tasks again... but as long as they've got something to munch on, I'm good. My office is solar powered and off grid, so power on a sunny day is use-it-or-lose-it - may as well use as much as I can during the good solar conditions. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
May 23rd: Almost ready to restart. They'll probably be idle forever more. How long can it take to move a server?! 1.5 weeks ago they said it was working but they didn't trust their own work and are testing in house for a few days. Now it's been 9 days with no further news. Something else broke didn't it? I'm throwing some cycles at Einstein@Home, but I just can't get excited about pulsars. I've got a shiny new pair of 3900Xs hanging out ready for real work to chew into, be it 32-bit Linux, 32-bit MacOS, or (dare I hope?) 64-bit! Their cancer research seems to have fallen off my vast array of GPUs, they need CPUs for the next phase. GPUs are now concentrating on Alzheimers. Annoyingly, you can't choose which project you prefer there (there is an option but it's entirely ignored). They also lack basic functions like being able to abort a task it downloaded by mistake, because it has a nasty habit of starting up when you had it disabled then rebooted the machine. My CPUs are playing with Cosmology. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I used to do a bunch of F@H work on GPUs, but then the Great GPU Shortage happened, and I sent my GPUs to help out a friend who's using them as actual GPUs for some CAD work and such. No batteries? |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
They'll probably be idle forever more. https://www.youtube.com/watch?v=CGyTLvS_Ruo
How long can it take to move a server?! 1.5 weeks ago they said it was working but they didn't trust their own work and are testing in house for a few days. Now it's been 9 days with no further news. Something else broke didn't it? What's concerning to me is that BOINC is not using fancy, bleeding edge, failure prone technology. It's an absolutely ancient technology stack - straight up LAMP, as far as I can tell from the server install guides (Linux, Apache, MySQL, PHP). It's not the sort of thing that should be hard to port, and while WCG is more complex than some others, unless they've gone absolutely nuts or have zero "legacy Linux sysadmins," it shouldn't take more than a couple weeks to move. I assume the bulk of the time was moving data around, but... even then, just drive a server around and hook up 10G server to server. I don't get how this is nearly so long a transition as it is, and I'm not at all optimistic that they're "Nearly, almost, just a tippy tappy bit more... almost... any day now, soon... *crickets*" about the move. Their cancer research seems to have fallen off my vast array of GPUs, they need CPUs for the next phase. GPUs are now concentrating on Alzheimers. Annoyingly, you can't choose which project you prefer there (there is an option but it's entirely ignored). They also lack basic functions like being able to abort a task it downloaded by mistake, because it has a nasty habit of starting up when you had it disabled then rebooted the machine. My CPUs are playing with Cosmology. Well, I've got CPUs to throw at them. One of the 3900Xs is purring away, the other is waiting on a better heatsink - I got a used combo and the heatsink on it is... not suited to that power hungry a chip, it slammed into 95C and pulled back clocks. It should be up and online fully in another couple days, and I'll point it at F@H until I get either CPDN tasks (Mac, Windows, Linux, I don't care, I'll run 'em in a VM!), WCG tasks, or... I just give up. 
It seems like there was a resurgence in BOINC stuff in the early days of Covid as a last hurrah, and now... :/ No batteries? I have batteries, but they're not sized to run serious loads throughout the night. My office has about 5kW of panel hung, and a 10kWh flooded lead acid bank that is basically "surge current during the day," and "keep the property area network running and the systems sleeping overnight." I can run some limited compute 24/7 out there during the summer months, and my 5775C is running 24/7 right now, but the bigger stuff goes to sleep in the evening and I wake it in the morning. When I replace the bank at some point, I'll put a larger bank in (I actually have it, but it's serving duty in a power trailer at the moment), and I can run more, but the office is still fundamentally designed as a daylight hours workplace. |
Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing; it is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback. This I know, having worked on just such a system and seen the problems first hand. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing; it is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback. It worked fine before, why are they messing about? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
It worked fine before, why are they messing about? I suspect the CPDN fora are not where you are most likely to find someone who knows the answer to that one. ;) |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
It worked fine before, why are they messing about? Well I won't get it from theirs as I've been banned permanently, twice. Odd that when they ban you, they don't stop you crunching for them.... |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Well I won't get it from theirs as I've been banned permanently, twice. My guess is that will be where it appears first, just not in response to a question from you. |
Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing; it is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback. It worked fine before, why are they messing about? Because they need back end systems to create WUs in the first place and validate and post process the WUs on return, all of which is project related and not part of BOINC. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Well I won't get it from theirs as I've been banned permanently, twice. My guess is that will be where it appears first, just not in response to a question from you. They haven't even got the forums working yet. They're using Facebook, where they have answered literally zero questions. They just post annoying empty promises repeatedly. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
It worked fine before, why are they messing about? Because they need back end systems to create WUs in the first place and validate and post process the WUs on return, all of which is project related and not part of BOINC. But they already had this and will be using the same scientific programs as before, they're not going to change all that. And why on earth didn't they get this one up and running before they stopped using the other one?! Imagine if Google shut down for 3 months while they moved house. |