Message boards : Number crunching : New small batches of long runs --> with problems
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
From Sarah Sparrow, Oxford CPDN scientist: Hi all, Two appeared, one on each of two of my machines. Both failed early: after about two minutes on i7, ~5.5 minutes on a Q9550. The i7 boinc installation was rendered useless -- boinc couldn't reconnect. (All data was wiped from all pages; boinc framework and labels remain.) The situation persisted after reboot, after "repair" installation, and after boinc 'uninstall'/'reinstall'. The Q9550 suffered page data-wipe but recovered. Email sent to Sarah but it's middle of the night in England ... Until we receive guidance from on high, my suggestion is to suspend, upon receipt, any tasks from the mentioned batches (539-550). It seems luck of the draw whether boinc suffers a knockdown or a knockout. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Any updates on how to recognize these killer WUs other than that they come from batches 539 – 550? Have they been pulled from the queue? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Jim, Most of these batches (up to 552 now) seem to be running OK, so I don't think that there's a general problem. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,363,965 RAC: 15,559 |
I've got one from batch 545 waiting to start but it won't get to the front of the queue for a few days yet. Might suspend it as a task and set no new tasks until it's ready to go. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
"long simulations" - well there's an understatement. Currently typical completion time estimate of 3 days, but one of these is in my queue with an estimate of 72 days. Not sure I remember anything running that long before. No more feedback on the reliability of these? I've suspended it waiting for some more feedback. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Not sure I remember anything running that long before. There was a time when tasks would last six months or more! |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,706,848 RAC: 5,644 |
Not sure I remember anything running that long before. Well, wah2_global_a04o..145_520. runs around 30 days on my i5-2520M @2.5 GHz so I can only imagine how long the "long runs" will be. |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
The estimate for mine is 52 days (AMD Phenom II X4 945, so not the latest and fastest CPU). Windows XP. It has been running for almost 24 hours and is 1.526% complete. No problems. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,363,965 RAC: 15,559 |
Now running this task. 1 hour down, 121 days to go! |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Well the 72 days turned out to be about 3 minutes before it crashed. Surprised me as the new PC build is proving to have a pretty low error rate. Still, blessing in disguise as I see the supply of tasks is petering out. With tasks available the PC runs 24/7, but if none around it now gets turned off at night. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,363,965 RAC: 15,559 |
2.5% in 2 days, so it looks as if the 121 day estimate will be out by about 40 days. We shall see. Also running a global with a 145 month run which is generating trickles about every 2,800 timesteps. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Those people running these very long tasks might want to think about going back to the practice of making frequent backups. Four months is a long time and a lot can happen. You don’t want to invest 2 or 3 months in a task and then lose it because of a power failure or unexpected reboot. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
That's about 80 days with an i5 4690, so a fast PC. Running this type of model on a slow PC would really take forever. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I don’t know if you have been with the project long enough to remember the 160 year models that took about 8 months to run on 1.2 GHz processors with 256 KB of RAM. In those days, they used to post every time someone finished one. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
I no longer back up CPDN data as it is just too difficult to restore. OK, you can restore it but then you have overlapping trickles etc etc. Then there is the issue if a PC glitch knocks out your tasks (I run 12) then another series of tasks start running so how do you deal with that & the restore. I decided that the project probably realises that tasks fail and there are enough running to cover this. (If they don't they shouldn't be using distributed computing!) Besides, any failed task gets reallocated to another PC a couple of times. If all 3 fail it probably points to a model problem or is the result they were after anyway. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Running this type of model on a slow PC would really take forever. Interesting. Looking back at my first CPDN PC in 2006 (Pentium 4 CPU 3.06GHz) it had a floating point speed of 1420 million ops/sec. My latest creation (i7-6900K CPU @ 3.20GHz) is 4870 million ops/sec. Now although that is a factor of about 3.4 difference, it is not as large a difference as I was expecting given it is 11 years on. Although agreed, it would still be taking forever :-( |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Or a 650MHz Duron in my case. Cue Monty Python sketch https://www.youtube.com/watch?v=Xe1a1wHxTyo |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The fast running chickens are starting to come home to roost. |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,382,602 RAC: 6,424 |
Running this type of model on a slow PC would really take forever. Reminiscing about old runs and machines, I remember running one of the first HADCM3 Spinups back in 2005/6 in advance of the BBC Expt on a Pentium 4 @ 3.2GHz and 1GB RAM. This ran at around 2sec/TS on that machine. I recently found I still had an old backup of the spinup files in an as downloaded but not started state. I made all the adjustments to the files to get the spinup to run on my current i7 6700 with 16GB RAM in 2017. I ran it for a day and checking the zip files it was returning 0.45sec/TS, about 4.5 times faster than on the 12 year old Pentium 4 machine. The old Pentium 4 machine was single core (2 if running hyperthreading). This is 4 core (8 if running hyperthreading). Taking both into account, some difference :) I'm only actually running 2 models together on it though just now. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,363,965 RAC: 15,559 |
Updated progress: task now about 24% complete after 12 days and a few hours. This would imply about 53 days total run time. So about another 41 days to go! BOINC still thinks about 90 days left!! |
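
The run-time guesses in this thread (2.5% in 2 days, 24% in about 12 days) are simple linear extrapolations from progress so far. A minimal sketch of that arithmetic is below. It assumes progress is linear in wall-clock time, which may not hold for a real model run, and the function name `estimate_runtime` is made up for illustration and is not part of BOINC or CPDN.

```python
from typing import Tuple


def estimate_runtime(elapsed_days: float, fraction_done: float) -> Tuple[float, float]:
    """Return (total_days, remaining_days), assuming progress is linear in time.

    Hypothetical helper for illustration only; not a BOINC or CPDN function.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_days = elapsed_days / fraction_done
    return total_days, total_days - elapsed_days


if __name__ == "__main__":
    # Progress figures quoted by posters in the thread.
    samples = [
        ("2.5% after 2 days", 2.0, 0.025),
        ("24% after 12 days", 12.0, 0.24),
    ]
    for label, elapsed, fraction in samples:
        total, remaining = estimate_runtime(elapsed, fraction)
        print(f"{label}: about {total:.0f} days total, {remaining:.0f} to go")
```

With exactly 12 days elapsed, the second case works out to about 50 days in total. The "few hours" beyond 12 days push the figure a little higher, toward the 53 days quoted above.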
©2024 cpdn.org