Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
My i7 grabbed 2 of the batch 797 tasks and they each failed with a Signal 11 error 2 minutes into the run. TNC 2005(act799, nat**), 2010(act798, nat**), 2015 (act797, nat**) topup runs Interesting: since these are top-up runs, any issues with the tasks should already have been sorted out. I haven't yet worked out which batch they are a top-up for. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
Small batch #796 of 38 global models at 25 km resolution for 1 month (batch list). This is a test batch, so if you see anything untoward on these, please let us know so the project can be informed. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
My i7 grabbed 2 of the batch 797 tasks and they each failed with a Signal 11 error 2 minutes into the run. It's now crashed all 4 of these models from the SAM25 batches that it's downloaded with signal 11, all at ~2 min 20 sec. This is when the regional part of the model starts. There are no attempted restarts; they just die after the last global timestep as the regional model starts. Edit...on the other hand, I allowed my i5 to download one and it's gotten into the regional model without crashing. Very weird. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Edit...on the other hand, I allowed my i5 to download one and it's gotten into the regional model without crashing. Very weird. Some years ago, I noted that the first work units of a group crashed often, and then the later ones ran OK. I don't know any reason for that, but maybe it is happening here. But if so, it is probably how they generate their models, with the more extreme initial conditions going first. On the other hand, no one has explained what "Signal 11" is, so it may be something else. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
A signal 11 error, commonly known as a segmentation fault, means that the program accessed a memory location that was not assigned to it. A signal 11 error may be due to a bug in one of the software programs that is installed, or faulty hardware. ********************* This is happening so often now that perhaps it needs looking into. |
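For anyone wanting to check which signal a number refers to, here is a minimal sketch using Python's standard `signal` module. The helper name `describe_signal` is made up for illustration; it only looks up the symbolic name and says nothing about why a particular CPDN task crashed.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number on this platform.

    describe_signal is a hypothetical helper, not part of BOINC or CPDN.
    """
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # On Linux, macOS and Windows, signal 11 is the segmentation-fault signal.
    print(describe_signal(11))
```

The lookup is platform dependent for most numbers, but 11 maps to the segmentation-fault signal on the common desktop systems.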
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
A signal 11 error, commonly known as a segmentation fault, means that the program accessed a memory location that was not assigned to it. A signal 11 error may be due to a bug in one of the software programs that is installed, or faulty hardware. It seems like certain processors get a lot more of these than others. I don't know if it's a generation of processor thing or a Windows thing. It'd be interesting to see a breakdown of Signal 11 errors by CPU. The i7 I have running is a laptop, but high end. I'm running at most 2 models at a time on it, and it's run plenty of models before. It's prime95 stable for 12 hours on 4 cores, and cpdn doesn't tax the processor near as much as prime95. So I'm at a loss as to what's going on here. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Another Signal 11 fish floats belly-up: wah2_sam25_a0bi_201412_24_797_011771638_0 This one died on my oldest I5 Desktop box (I5-3550) in Win10 after 2m50s CPU time and 3m14s wall time. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
This is happening so often now, that perhaps it needs looking into. I agree. Below are the batch statistics for another that looks like it has major problems. I have messaged the "owner" of 789, '90 and '91 and will update him along with batch statistics in the morning, but it isn't just his and seems to be across all those who submit batches to the system. Batch: 795 |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
I've just had 1 from batch 797 and 2 from 798 fail with segmentation error. One batch 797 is still going though. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
I have 2 797's, both on their second attempt. The first, even allowing for my laptop being much slower than the machine that first tried it, is now well past the point at which it failed. Second one just started so will wait and see. Unfortunately, I can't see the breakdown by processor type and OS to look for patterns. There is also the issue that my system is lying to CPDN and is running Windows tasks under WINE. I don't know how many of us are doing this, or whether it will skew the statistics. Edit: The second one didn't fail till over 3 hours in, so I won't know whether it is failing at the same point till about fifteen hours in on my slower system. Edit2: Looked at around 30 failures and the running ones in between on batch 797. All showing Windows10, but then so were all of those still running! Looks like M$ have convinced nearly everyone to change. Failures I looked at covered i5, i7, xeon and AMD CPUs. My guess would be that the split is within the margin of error of the proportion of each in use. So in the absence of a statistical analysis by the project, I don't see anything of value in that line of enquiry. |
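The kind of by-CPU breakdown wished for above could be tallied along these lines. This is only a sketch: the records are a made-up sample typed in by hand, not real batch 797 data, and `failure_share_by_cpu` is a hypothetical helper. Real task pages would have to be collected first.

```python
from collections import Counter

# Made-up sample of (cpu_family, outcome) pairs, purely for illustration.
SAMPLE_TASKS = [
    ("i5", "failed"), ("i5", "running"),
    ("i7", "failed"), ("i7", "failed"), ("i7", "running"),
    ("xeon", "running"),
    ("amd", "failed"), ("amd", "running"),
]


def failure_share_by_cpu(records):
    """Return {cpu_family: failed / total} for the given records."""
    totals = Counter(cpu for cpu, _ in records)
    failures = Counter(cpu for cpu, outcome in records if outcome == "failed")
    # Counter returns 0 for families with no failures.
    return {cpu: failures[cpu] / totals[cpu] for cpu in totals}


if __name__ == "__main__":
    for cpu, share in sorted(failure_share_by_cpu(SAMPLE_TASKS).items()):
        print(f"{cpu}: {share:.0%} failed")
```

With only around 30 tasks looked at, differences between families would probably sit within noise, which matches the guess in the post.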
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
All three of the 797s I have run have failed consistently at around 3 1/2 hours on a Ryzen 2600 (Win 10). It is the same for three 798's and 799's. At least they are failing quickly. And the earlier ones (788 to 794) are running fine, at up to three days now. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
All three of the 797s I have run have failed consistently at around 3 1/2 hours on a Ryzen 2600 (Win 10). It is the same for three 798's and 799's. Of the two 797s I have, one failed at about four or five minutes on its first attempt. The second at about the 3 1/2 hour mark. The first is now at 6 1/2 hours so way past where it failed on first go round but on my much slower machine probably another six hours before the second hurdle. Most of those I looked at on the task pages failed at a few minutes in with a much lower percentage failing a few hours in. I only found 2 that had gotten as far as the first zip. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Frustrating day. Ten SAM25 tasks were downloaded across four Intel boxes in Windoze 10. Tasks were approximately equal in distribution across the boxes and across SAM25 797/798/799. (One or two tasks at a time were downloaded on each box.) ALL TEN DIED AFTER ~3 SECONDS ON i5/i7 desktops. (Adding to the fun, this was M$ bug-fix day - and, as a self-defense measure, I micro-manage Windoze 'updates'.) Perhaps that SAMxx scientist needs a bit of retraining in configuring input file structure -- and/or more attention to detail, eh? Too common in SAMxx tasks... "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Commiserations Astro. For the non-mods, the following is part of a reply to my email about the high failures: Yes we noticed the high failure rate with this region and we think it is when the model is setup to do the vegetation as well as the climate that the failure rates increase. ************************ Welcome to the World of Advanced Climate Modelling. And it'll get worse when they start using the new high res models. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Looks like the ones that are failing on Jim's/astroWX's PCs are doing the same thing they did on mine...going through the global first day, then failing when starting the regional model on that day. I tried again on my i7 and it grabbed one 799 task and has gone a month with it now with no Signal 11. But it's only running one right now. Before it was running an ANZ and the SAM25. No problems with the ANZ models that I've seen. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
And hopefully away from discussions on failures, Batch 800, 3,300 EU25 13 month tasks have been released. Edit: And batch 801 another 8658 as part of the same experiment. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
I may be tempting fate, but of the two 797's I have, one is well past the point where it failed on the first try. One failed a few minutes in and has now been running for almost 31 hours. The other failed at about the 3.5 hour point and is now just over 16 hours in, so I suppose it could still be about to fall over on my much slower machine. It would be interesting to know how many are on Linux machines using WINE and how easy it is or isn't to pick that up from the sched_request_climateprediction.net.xml that the server gets its information from. I need to change this machine to say it is using win10 rather than XP to see if it shows up as an identical win10 version to my laptop. I will then need to look at what win10 machines show. Reason for these musings is I don't know whether there are enough WINE machines to skew any OS based statistics that the project are looking at. |
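As a rough sketch of how the reported OS could be read out of a scheduler request file: the code below assumes the file contains a `host_info` element with `os_name`, `os_version`, `p_vendor` and `p_model` children, which is how BOINC clients are generally understood to report the host. The sample XML and its values are invented for illustration, and nothing here detects WINE by itself; it only shows what the server would be told.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real sched_request_climateprediction.net.xml is much larger.
SAMPLE_REQUEST = """<scheduler_request>
  <host_info>
    <p_vendor>GenuineIntel</p_vendor>
    <p_model>Example CPU model</p_model>
    <os_name>Microsoft Windows XP</os_name>
    <os_version>Example version string</os_version>
  </host_info>
</scheduler_request>"""


def reported_host(xml_text):
    """Return the OS and CPU fields the client reports, or {} if absent.

    The element names are an assumption about the request layout.
    """
    root = ET.fromstring(xml_text)
    host = root.find(".//host_info")
    if host is None:
        return {}
    fields = ("os_name", "os_version", "p_vendor", "p_model")
    # findtext returns None for a missing child, so fall back to "".
    return {name: (host.findtext(name) or "").strip() for name in fields}


if __name__ == "__main__":
    print(reported_host(SAMPLE_REQUEST))
```

Comparing the `os_version` string from a WINE host with one from a native Windows 10 machine would show whether the two can be told apart.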
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Is there something unusual about the wah2_sam25 models of batch 797? Are they unusually high resolution or something? I started one yesterday and it is progressing extremely slowly. At 24 hours it is only 0.31% complete. At this rate it will take more than 300 days to complete. The computer is an I3 2.7 GHz with 8 GB of RAM running Win10. It runs other WU’s at normal speed. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
At this rate it will take more than 300 days to complete. The computer is an I3 2.7 GHz with 8 GB of RAM running Win10. It runs other WU’s at normal speed. I have two of this batch running on my 2.16 GHz laptop. One is 3.274% complete after 35 hours, the other is 1.5% complete after a tad under 19 hours. So even on my slower box they should complete in under 50 days. I suspect there is something wrong with that particular task. Have you tried suspending other tasks to see if it speeds up? (clutching at straws rather than expecting it to make a difference.) |
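The run-time estimates in these posts are straight-line extrapolations from elapsed time and percent complete. A minimal sketch of that arithmetic follows; `projected_total_days` is a made-up name, and it assumes progress stays at the same rate for the whole run, which a climate model may not do.

```python
def projected_total_days(elapsed_hours: float, percent_complete: float) -> float:
    """Extrapolate total run time in days, assuming a constant progress rate."""
    if percent_complete <= 0:
        raise ValueError("percent_complete must be greater than zero")
    total_hours = elapsed_hours / (percent_complete / 100.0)
    return total_hours / 24.0


if __name__ == "__main__":
    # Figures quoted in the posts above.
    print(round(projected_total_days(24, 0.31)))    # the slow I3 task
    print(round(projected_total_days(35, 3.274)))   # the 2.16 GHz laptop task
```

For the 0.31% after 24 hours case this gives a little over 320 days, in line with the "more than 300 days" figure quoted.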
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
I have one that is just over 6% after 1 day on my 3.5 GHz i5. One on my slower i5 failed after 4 minutes - seg violation! |
©2024 cpdn.org