climateprediction.net (CPDN) home page
Thread 'Nearly there'

Message boards : Number crunching : Nearly there

zaphod80013
Joined: 25 May 12
Posts: 8
Credit: 7,633,965
RAC: 3,387
Message 58312 - Posted: 30 Jun 2018, 0:13:15 UTC - in response to Message 58303.  

Thanks for the insight; I knew there were issues but not the nature of them. This kind of problem sucks. I know first hand: back in the days when I was a computer operator (remember them?), we had a tape deck with bad internal memory that silently corrupted everything written to it for about 4 months. Worst of all, the corruption was calculated into the checksum, so everything looked fine until we had to do a restore.
ID: 58312
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58313 - Posted: 30 Jun 2018, 3:36:15 UTC

That is nasty. Sounds like the machines were starting to take over way back then. ;)

I go back to the mid 70's. We had punched cards and two line printers, and every now and then I had to take a data tape (12 inch spool?) to a CDC data center a few miles away, where they had big tape machines and those big hard disk packs (14 inch?).
Been out of all that for 20 years now.

As for "our" problem, I've been wondering if it had anything to do with one of those chip level bugs, and someone got into the system.
Having to re-build so many servers wouldn't have been fun.

But just to "keep up my level of expertise", my machines all had HD failures while we were down.
The first was a second hand HP machine running Windows 7, when the Smart Drive (or some such) said: "We have detected failures with the hard disk. Please run this diagnostic."
Then: "This may take a while to run."
20-30 seconds later: "There are problems with the hard disk. Please take it to a service center for replacement."
It also said to run a backup program, so I've got "something", but am not sure whether to bother.

Problem is, there are/were 4 climate models on it that I was just about to start. :)

A couple of weeks later the Ivy Bridge machine had a problem. OK, I'll just re-boot. And then lots of sector errors started to show up.
So shut down and replace the HD.

Another couple of weeks, and the Haswell did the same. But this machine is what I use daily, so lots of files plus emails.
Eventually worked out what needed to be backed up and how to do it, then an afternoon spent replacing it.
An hour or so loading a new OS version (Mint 18.3), and then over two hours updating it all. Slowly.

I'm getting fibre-to-the-curb in a few months, which will take my landline speed from about 66 Kbs to several Megs, perhaps 10 Mbs.
So next time ...
ID: 58313
zaphod80013
Joined: 25 May 12
Posts: 8
Credit: 7,633,965
RAC: 3,387
Message 58315 - Posted: 30 Jun 2018, 21:15:04 UTC - in response to Message 58313.  

I started as a computer operator in 1979, so I too remember punch cards, 12" tape & 14" disk (when I started, on ICL 2900 series kit, they were a whole 60Mb each, yes that's Mb not Gb).

Funny, my NAS had a HD fail about 3 wks ago. Fortunately I run RAID 1+0: OK, I lose 50% capacity, but in principle I can survive 4 of the 8 disks failing, provided they're all in different RAID 1 pairs. I replaced the failed drive & ordered a couple of spares.
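
To put a rough number on "in principle", here is a minimal sketch of the chance that a RAID 10 array survives a given number of drive failures. It assumes failures hit drives independently and uniformly at random (real drives from one batch, or sharing one bad fan, often don't), and the function name is made up for illustration.

```python
from math import comb


def raid10_survival_probability(pairs: int, failed: int) -> float:
    """Chance that `failed` randomly chosen drives all sit in different mirror pairs."""
    drives = 2 * pairs
    if failed < 0 or failed > drives:
        raise ValueError("failed must be between 0 and the number of drives")
    if failed > pairs:
        # More failures than pairs means at least one pair has lost both drives.
        return 0.0
    # Pick which pairs are hit, then which of the two drives in each hit pair.
    surviving_patterns = comb(pairs, failed) * 2 ** failed
    return surviving_patterns / comb(drives, failed)


# 8 drives arranged as 4 mirrored pairs, as in the post above.
for k in range(1, 5):
    print(k, "failed:", round(raid10_survival_probability(4, k), 3))
```

Under those assumptions, surviving 2, 3 or 4 failures on the 8-disk array works out to roughly 86%, 57% and 23% respectively, so "can survive 4" is very much the lucky case.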
ID: 58315
Jord
Joined: 5 Aug 04
Posts: 250
Credit: 93,274
RAC: 0
Message 58318 - Posted: 2 Jul 2018, 13:18:36 UTC - in response to Message 58315.  

Funny, my NAS had a HD fail about 3 wks ago. Fortunately I run RAID 1+0: OK, I lose 50% capacity, but in principle I can survive 4 of the 8 disks failing, provided they're all in different RAID 1 pairs. I replaced the failed drive & ordered a couple of spares.

Do watch out though. My first NAS had two 4TB drives, both of which failed at the same time (due to a bad fan, so they overheated), meaning we lost 5TB of data.
So the next NAS had 4 drives, 4x4TB, RAID5. Lost two drives and also the whole array, so again 5TB of data gone. Luckily, this time I had 3.5TB of that backed up.

We now have that same NAS with 4x4TB HDDs, all single drive setups. I have a 4, 2 and 1 TB drive for backups via a USB 3.0 docking station. But we will at max just lose 4TB of data next time. Unless the whole NAS goes kaboom. ;-)
ID: 58318
Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 58320 - Posted: 2 Jul 2018, 19:13:14 UTC - in response to Message 58318.  

I wouldn't recommend anything less than Raid 6, because by the time one drive fails, another drive may have errors too, and without redundancy you cannot rebuild the set.
ID: 58320
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 58321 - Posted: 2 Jul 2018, 20:21:13 UTC

Hi all...I'm assuming the certificate problem remains -- since I'm still getting the errors below, correct?

So...we are still not back in business. I'm thinking once BOINC can communicate successfully, I'll abort any outstanding tasks from all seven of my machines and wait for new ones...does this make sense? Meantime I've got CPDN suspended on all machines.


7/2/2018 3:15:07 PM | climateprediction.net | Sending scheduler request: Requested by user.
7/2/2018 3:15:07 PM | climateprediction.net | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: not highest priority project)
7/2/2018 3:15:09 PM | | Project communication failed: attempting access to reference site
7/2/2018 3:15:09 PM | climateprediction.net | Scheduler request failed: Peer certificate cannot be authenticated with given CA certificates
7/2/2018 3:15:11 PM | | Internet access OK - project servers may be temporarily down.
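
For anyone wanting to scan their own event log for the same failure, a minimal sketch follows. It assumes the three-field `timestamp | project | message` layout seen in the lines above; the function names and the matched phrase are illustrative choices, not anything provided by BOINC itself.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    project: str  # empty string for client-wide messages
    message: str


def parse_event_line(line: str) -> Optional[LogEntry]:
    """Split one event log line into its three fields, or return None if it doesn't fit."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return LogEntry(*parts)


def is_certificate_failure(entry: LogEntry) -> bool:
    """True when the message matches the certificate error quoted above (assumed wording)."""
    return "certificate cannot be authenticated" in entry.message.lower()


sample = [
    "7/2/2018 3:15:07 PM | climateprediction.net | Sending scheduler request: Requested by user.",
    "7/2/2018 3:15:09 PM | | Project communication failed: attempting access to reference site",
    "7/2/2018 3:15:09 PM | climateprediction.net | Scheduler request failed: "
    "Peer certificate cannot be authenticated with given CA certificates",
]

for line in sample:
    entry = parse_event_line(line)
    if entry is not None and is_certificate_failure(entry):
        print(entry.timestamp, entry.project or "(client)", "->", entry.message)
```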
ID: 58321
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58322 - Posted: 2 Jul 2018, 20:48:04 UTC

Hi Art

Not sure what's going on any more. The Status page shows all the BOINC functions are shut down, and all of the urls were unreachable over the weekend, so something happened.
Don't abort just yet. I'll ask about already issued tasks.
ID: 58322
Art Masson
Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 0
Message 58323 - Posted: 2 Jul 2018, 21:24:36 UTC

OK, Thanks, Les.

I suggest some overall guidance to crunchers (after the smoke clears) to provide direction on currently processing and/or pending tasks given that it's not clear what the state of the database is on outstanding tasks. Depending on individual actions in the BOINC clients it seems highly likely that status will not be aligned between clients and the CPDN database -- especially for those who tried to continue processing while the backup database was running or aborted individual tasks that were processing.

We certainly don't want to contribute to additional data errors by processing tasks that the system thinks are in a different status/state. Given that the system has been down for almost two months, perhaps it's better to just zero out all the outstanding tasks for all clients centrally and "start over" processing new ones...(just a thought). Hopefully just retaining the cumulative credit for each account.

By the way...who remembers toggling in "bootstrap loaders" on DEC PDP equipment...so that the system could read a paper tape to load the operating system?? (I do!!! and you had to do this over and over every time your new program failed)

Art Masson
ID: 58323
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58324 - Posted: 2 Jul 2018, 22:03:01 UTC

Starting again was the suggestion in my email, but it'll probably miss the hardcore "set and forget people".
More wait and see.

I remember something about a boot strap loader on the line printers, but not what for.
ID: 58324
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 58325 - Posted: 3 Jul 2018, 1:06:14 UTC - in response to Message 58323.  

By the way...who remembers toggling in "bootstrap loaders" on DEC PDP equipment...so that the system could read a paper tape to load the operating system?? (I do!!! and you had to do this over and over every time your new program failed)


I had a PDP-11/45, and I wanted to boot both RSX-11D OS and UNIX easily. The machine had a card in it with 32 (16-bit) words of data that it would boot from. I wrote binary code on that for both OSs to boot. It was tricky. The last instruction in the boot sequence required a branch (jump, branch) to absolute address 0. But that instruction required 32 bits and there were only 15 bits left. I solved that by clearing register 7 (or whatever the one was that was the program counter). So I never had to enter the initial boot sequence on that machine.
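
The space saving can be sketched as follows. This is an illustration only, assuming the usual PDP-11 octal encodings (CLR is 0050DD, JMP is 0001DD, and absolute addressing is mode 3 on register 7); those values are worth checking against a processor handbook, and the helper names are invented for the example.

```python
from typing import List

PC = 7  # register 7 is the program counter on the PDP-11


def encode_clr_register(reg: int) -> List[int]:
    """CLR Rn: opcode 0050DD with mode 0 (register direct), one 16-bit word."""
    if not 0 <= reg <= 7:
        raise ValueError("register must be 0..7")
    return [0o005000 | reg]


def encode_jmp_absolute(address: int) -> List[int]:
    """JMP @#address: opcode 0001DD with mode 3 on R7, plus a second word holding the address."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    return [0o000137, address]


for name, words in (
    ("JMP @#0", encode_jmp_absolute(0)),
    ("CLR PC", encode_clr_register(PC)),
):
    octal = " ".join(f"{word:06o}" for word in words)
    print(f"{name:8s} {len(words)} word(s): {octal}")
```

Both sequences leave the program counter at 0, but clearing the PC needs only one word of the boot card where the explicit jump needs two.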

The PDP-5, though ... .
ID: 58325
Bill F
Joined: 17 Jan 09
Posts: 124
Credit: 2,037,778
RAC: 2,752
Message 58326 - Posted: 3 Jul 2018, 3:56:06 UTC

Yes old memories of paper tape loaded instruction sets. And 5 bit parallel Direct Current circuits provided for Sheraton Hotels by Western Union. It would bite real good if you put your finger where it should not have been.

The 5-bit code was called Baudot code. Does anyone remember that code? Or where it, and the Baud in a 2400 Baud modem, came from?

Any users of Wire Sonic Delay Lines out there? A whole 1K of quick, 600ms access, storage.

Started in 1968 and retired in 2016 ... 48 years one company with lots of name changes and mergers etc.

Bill F
Dallas TX
In October 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.


ID: 58326
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 58327 - Posted: 3 Jul 2018, 5:24:14 UTC - in response to Message 58326.  

Paper tape and punched cards on an Eliott 22,000 machine is what I cut my teeth on, oh and as a first year being allowed a maximum of 16KB to fit my programs into!

(Some of us obviously need a diversion.)
ID: 58327
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 58331 - Posted: 3 Jul 2018, 14:28:57 UTC - in response to Message 58327.  

Latest is that I could connect, but the message was:

03/07/2018 14:43:42 | climateprediction.net | [http] [ID#1] Info: TLSv1.2 (OUT), TLS alert, Client hello (1):
03/07/2018 14:43:43 | climateprediction.net | Scheduler request completed: got 0 new tasks
03/07/2018 14:43:43 | climateprediction.net | Project is temporarily shut down for maintenance




This is consistent with the latest news from Andy.


Hi All,


I am afraid there has been an issue with the underlying storage of CPDN. The storage of CPDN is held on a GPFS storage unit, which has experienced an issue with one of its power cooling modules. The servers of the project are unaffected; however, the download files of the project are held on the GPFS storage and these are inaccessible. This means that we cannot start the project again until this is solved.

Best regards,

Andy

Hi All,


We are going to use the down period as a chance to take a backup of the database, so I will need to disconnect the database from the website for 24 hours. As we currently have no slave database machine, we have to schedule regular downtime in order to take dumps of the database. A new slave machine has been ordered, so we will only need to do this for a limited period.

Best regards,

Andy


Originally posted under adjacent thread, "what happened."
ID: 58331
Brummig
Joined: 3 Nov 05
Posts: 26
Credit: 687,388
RAC: 529
Message 58332 - Posted: 4 Jul 2018, 11:19:57 UTC

By the way...who remembers toggling in "bootstrap loaders" on DEC PDP equipment...so that the system could read a paper tape to load the operating system??

7756 ... Address Load ... Clear ... Continue (PDP8), and hope it hadn't forgotten the loader or it was frantic switch-toggling time. Forty years on, and I can still remember it. Has that been tried on the CPDN server?
ID: 58332
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58333 - Posted: 4 Jul 2018, 12:53:38 UTC

Switches and buttons! I'd forgotten about that. Now I'm going to keep wondering what it was that I'm remembering.

I think the problem with the cpdn servers is getting a long enough piece of duct tape to wrap around it to hold it together. It really needs to l o s e w e i g h t, but you know how touchy computers are. :)
ID: 58333
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 58334 - Posted: 4 Jul 2018, 13:24:04 UTC - in response to Message 58326.  

The 5-bit code was called Baudot code. Does anyone remember that code? Or where it, and the Baud in a 2400 Baud modem, came from?


https://en.wikipedia.org/wiki/Baudot_code

Any users of Wire Sonic Delay Lines out there? A whole 1K of quick, 600ms access, storage.


I used them (eight in parallel) to store a frame of video data that I read into high speed RAM on a Computer Control Corporation DDP-224 computer. This would have been sometime in the late 1960s or early 1970s.

https://vdocuments.site/level-reassignment-a-technique-for-bit-rate-reduction.html
ID: 58334
Brummig
Joined: 3 Nov 05
Posts: 26
Credit: 687,388
RAC: 529
Message 58335 - Posted: 4 Jul 2018, 17:26:59 UTC - in response to Message 58334.  
Last modified: 4 Jul 2018, 17:28:25 UTC

I used them (eight in parallel) to store a frame of video data

Make them long enough to store more than one frame and you would have the basis of a digital video player.

I used to know someone who could read Baudot tapes as if they were text on a sheet of paper.
ID: 58335
zaphod80013
Joined: 25 May 12
Posts: 8
Credit: 7,633,965
RAC: 3,387
Message 58339 - Posted: 6 Jul 2018, 3:25:09 UTC - in response to Message 58320.  

With RAID 1 or 10 (mirror + stripe), recovery only has to read a single drive; with RAID 5 or 6 it has to read all the remaining drives. Taking my case of 8 x 4TB: with RAID 10 I only get 16TB total space, but to recover a disk I only need to read 4TB. With RAID 5 I'd get 28TB of storage but would need to read 28TB to recover (not 100% sure, but I believe that's the same for RAID 6, although the capacity would be lower at 24TB). This significantly increases the recovery time and therefore the risk window for loss of service.
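
The arithmetic above can be sketched as a small calculator. This is a simplified model, not a description of any particular NAS: it assumes equal-size drives, a single failed drive, and that a RAID 5 or RAID 6 rebuild reads every surviving drive in full. The function name is hypothetical.

```python
from typing import Tuple


def raid_capacity_and_rebuild_read(level: str, drives: int, size_tb: float) -> Tuple[float, float]:
    """Return (usable capacity in TB, TB read to rebuild one failed drive)."""
    if level == "10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        # Half the drives hold mirror copies; a rebuild copies from the surviving partner only.
        return (drives // 2) * size_tb, size_tb
    if level == "5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        # One drive's worth of parity; rebuild reads all surviving drives.
        return (drives - 1) * size_tb, (drives - 1) * size_tb
    if level == "6":
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        # Two drives' worth of parity; assumed here to read all surviving drives as well.
        return (drives - 2) * size_tb, (drives - 1) * size_tb
    raise ValueError(f"unsupported RAID level: {level}")


for level in ("10", "5", "6"):
    usable, read = raid_capacity_and_rebuild_read(level, drives=8, size_tb=4)
    print(f"RAID {level}: {usable:g} TB usable, {read:g} TB read to rebuild one drive")
```

For 8 x 4TB this reproduces the figures in the post: 16TB usable and 4TB read for RAID 10, 28TB and 28TB for RAID 5, and 24TB usable for RAID 6 with the same 28TB rebuild read under the stated assumption.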

While I'll accept that RAID 6 may be better than RAID 10 in terms of data loss risk (by about 2 orders of magnitude), that is not the only factor to consider. RAID is primarily about continuity of service, not data loss; that's what backups are for. With RAID 6 there is a significant write performance penalty (x6, vs x4 for RAID 5 & x2 for RAID 10). I use my Synology NAS mostly for local backup of the laptops & desktop in the house (I've 2 adult kids still living at home, which adds up to a lot of hardware), so for us (and this is a pure judgment call) this is 'secondary storage': write performance & recovery time have greater significance than data loss & capacity.
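
To show what those write penalties mean in practice, here is a rough sketch using the textbook rule of thumb that each logical write costs that many physical I/Os. The per-drive IOPS figure is a made-up illustrative number, the function name is hypothetical, and caching or full-stripe writes on a real NAS will change the picture.

```python
# Physical I/Os per small random logical write, as quoted in the post above.
WRITE_PENALTY = {"10": 2, "5": 4, "6": 6}


def effective_write_iops(level: str, drives: int, iops_per_drive: float) -> float:
    """Rule-of-thumb random write IOPS for the whole array."""
    if level not in WRITE_PENALTY:
        raise ValueError(f"unsupported RAID level: {level}")
    return drives * iops_per_drive / WRITE_PENALTY[level]


# 8 drives at an assumed 100 random IOPS each.
for level in ("10", "5", "6"):
    print(f"RAID {level}: about {effective_write_iops(level, 8, 100):.0f} write IOPS")
```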

While I could be wide of the mark, your comment suggests to me a possible confusion regarding the purpose of RAID & backup.
ID: 58339
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 58341 - Posted: 6 Jul 2018, 13:20:51 UTC - in response to Message 58335.  

Make them long enough to store more than one frame and you would have the basis of a digital video player.


You couldn't in those days. 30 milliseconds was about the limit (late 1960s).

We had an analogue disk to store a short "movie" (the "instant replay" technology of the day) and we could read a frame at a time into the delay lines (8-bits per sample), and from the delay lines, we built a controller that could read a few scan lines at a time into the computer RAM.

From there, we could simulate all manner of encoding and compression schemes to reduce the bandwidth of a television signal. We were trying to make Picturephone possible. Wasted effort, in spite of the progress we made, because with the advent of fiber optics, the need for bandwidth reduction just about disappeared, but that was 10 years or so later.
ID: 58341
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 58345 - Posted: 10 Jul 2018, 1:38:35 UTC

Is this true? Should I do it, or wait it out?

Mon 09 Jul 2018 09:15:16 PM EDT | climateprediction.net |
You used the wrong URL for this project.
When convenient, remove this project, then add https://climateprediction.net/

My boinc client is using: MASTER Url http://climateprediction.net/
ID: 58345

©2024 cpdn.org