Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Why would this file behave differently on different machines? Is it a file which has not been downloaded when it should have been, and once a computer gets it, it stays there and everything works? Yes, I was going to ask why we still had 1-year deadlines. I often find my computers leaving them and going off to do something else. I had to crank up the resource share of CPDN to stop them doing so. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
Why would this file behave differently on different machines? Why should anything else in the tasks do that? Sarah identified a potential problem with a file, which makes me think there is a good chance it is the culprit. There are so many variables between computers that pinning down the common link between either those that work or those that don't is never going to be as straightforward as it is with the missing 32-bit library files for the older Linux tasks. I have pretty much eliminated CPU type and OS versions from my list. None of those I looked at were short on RAM, which can be an issue when running a lot of tasks. That over a hundred batches of this type of task have run in the past without this problem suggests one of the batch-specific files. Once the file in question is identified, it would be nice to know why it affects some and not others, but I am not sure we will ever know. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference, might put a -ve value in a memory reference and cause a segv. Without seeing the process traceback and the model log file it's very difficult to know. If the CPDN server decides to give me some more tasks I'll disable networking to keep the files so I can look at them. However, all of my tasks' workunits failed, so I suspect it's a bad input problem. I'll join the CPDN technical meeting tomorrow to find out more. --- CPDN Visiting Scientist |
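To make the mechanism in the post above concrete, here is a minimal simulation, not anything taken from the CPDN or Met Office code. It assumes a hypothetical lookup table indexed by pressure and a tiny flat "address space"; `level_index`, `unchecked_read`, `SegmentationFault` and all the constants are made up for illustration. It shows how a slightly negative value can silently read the wrong memory, while a more negative one lands outside the address space and faults.

```python
class SegmentationFault(Exception):
    """Raised when a simulated address falls outside the process's memory."""


MEMORY_WORDS = 64   # size of the simulated address space (hypothetical)
TABLE_BASE = 4      # address where the hypothetical lookup table starts
TABLE_LEN = 20      # number of entries in the table
STEP_HPA = 50.0     # pressure spacing between table entries (hypothetical)


def level_index(pressure_hpa):
    """Turn a pressure into a table index with no validation at all."""
    return int(pressure_hpa // STEP_HPA)


def unchecked_read(memory, index):
    """Read table[index] the way unchecked compiled code would: base + index."""
    address = TABLE_BASE + index
    if not 0 <= address < len(memory):
        # Only an address outside the whole address space faults.
        raise SegmentationFault(f"bad address {address}")
    return memory[address]


memory = list(range(MEMORY_WORDS))  # each word simply holds its own address

print(unchecked_read(memory, level_index(500.0)))   # index 10, inside the table
print(unchecked_read(memory, level_index(-100.0)))  # index -2, outside the table but no fault
try:
    unchecked_read(memory, level_index(-500.0))     # index -10, outside the address space
except SegmentationFault as err:
    print("segv:", err)
```

The middle case is the nasty one: the read succeeds and returns data that does not belong to the table, so the failure may only show up later or not at all.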
Send message Joined: 17 Jan 09 Posts: 124 Credit: 2,027,010 RAC: 2,694 |
Well my slowest machine after 4 back to back failures is now 33 minutes into a task and running it by all appearances. It has not trickled yet but it is early. My other system that has a task in progress has done 3 trickles and is still happy. Bill F |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv. Should this not be proven impossible, or checked by the program, before a bad memory reference is even generated or used? I.e., when all is said and done, no matter what bad data is presented to a program, it should never get a segmentation violation. The only thing that should cause a segmentation violation in a correct program would be a hardware error. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
The only thing that should cause a segmentation violation in a correct program would be a hardware error. It is possible the error is down to how Windows handles the data rather than the Met Office programs. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
That's not correct. Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv. The only thing that should cause a segmentation violation in a correct program would be a hardware error. It is possible the error is down to how Windows handles the data rather than the Met Office programs. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv. The program knows the dimensions of the array, so it should be able to determine if the array reference is in bounds or not. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
That's true, but not all codes add this extra protection all the time. The code may have the correct computation but the data can still cause the code to fail. Although compilers can add in automatic array bounds checking, this is generally not turned on in production codes because of the performance hit. Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv. The program knows the dimensions of the array, so it should be able to determine if the array reference is in bounds or not. |
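One common compromise between the two positions above is to validate the input data once when it is read, so that the inner loops can index without per-access checks. The sketch below only illustrates that idea under stated assumptions: `validate_pressures`, `lookup_all`, the 20-entry table and the 50 hPa spacing are hypothetical and not taken from the model. Python itself checks list bounds on every access, so the "unchecked" loop here is only a stand-in for compiled code.

```python
import math


def validate_pressures(values, table_len, step_hpa):
    """Check every input once, up front, so the hot loop can index unchecked."""
    for position, value in enumerate(values):
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"input {position}: bad pressure {value!r}")
        if int(value // step_hpa) >= table_len:
            raise ValueError(f"input {position}: pressure {value!r} is off the table")
    return values


def lookup_all(table, values, step_hpa):
    """Hot loop: no per-access checks, relies on validate_pressures having run."""
    return [table[int(v // step_hpa)] for v in values]


table = [float(i) for i in range(20)]  # hypothetical 20-entry lookup table
good = validate_pressures([0.0, 500.0, 975.0], len(table), 50.0)
print(lookup_all(table, good, 50.0))

try:
    validate_pressures([500.0, -500.0], len(table), 50.0)
except ValueError as err:
    print("rejected:", err)  # the bad value is caught before any indexing happens
```

The cost of the check is paid once per input value rather than once per array access, which is why this pattern is usually cheaper than blanket compiler bounds checking.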
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
The only thing that should cause a segmentation violation in a correct program would be a hardware error. It is possible the error is down to how Windows handles the data rather than the Met Office programs. If you get different behavior, particularly around SIGSEGV, with the same code on different platforms, it's usually related to how memory is allocated and being "a little bit off" the end of an array in one direction or another. I don't do cross-platform stuff anymore, but Windows and Linux absolutely handle memory allocation differently enough that the same memory access error (what should be an invalid access) will segfault on one platform, but not the other. They're both "wrong," but "how wrong you have to be to segfault" is different between the platforms. But it almost certainly means the code isn't bounds checking stuff somewhere, and probably could use some Valgrind-based love to catch those. |
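The "how wrong you have to be to segfault" point can be sketched with a toy model. It assumes that memory is mapped in whole pages and that a fault only happens when an access lands on an unmapped page; the 4096-byte page size, the two "platforms" and their slack values are invented for illustration and are not measurements of Windows or Linux allocators.

```python
PAGE_BYTES = 4096  # a common page size, assumed here for illustration


def mapped_bytes(array_bytes, slack_bytes):
    """Bytes readable from the array start: array plus slack, rounded up to whole pages."""
    total = array_bytes + slack_bytes
    pages = -(-total // PAGE_BYTES)  # ceiling division
    return pages * PAGE_BYTES


def access_outcome(offset, array_bytes, mapped):
    """Classify one byte access relative to the start of the array."""
    if 0 <= offset < array_bytes:
        return "valid"
    if 0 <= offset < mapped:
        return "invalid but silent"
    return "segfault"


ARRAY_BYTES = 4000
platform_a = mapped_bytes(ARRAY_BYTES, slack_bytes=0)     # hypothetical tight layout
platform_b = mapped_bytes(ARRAY_BYTES, slack_bytes=8192)  # hypothetical roomy layout

for offset in (3999, 4050, 5000, 20000):
    print(offset,
          access_outcome(offset, ARRAY_BYTES, platform_a),
          access_outcome(offset, ARRAY_BYTES, platform_b))
```

In this model the access at offset 5000 is equally wrong on both layouts, but only the tight one faults. That is the kind of difference that can make the same bug look platform-specific.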
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv. Presumably these files you wish to keep are now on the server from all of us who failed, so somebody can check. Do you not have access to them? |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Well my slowest machine after 4 back to back failures is now 33 minutes into a task and running it by all appearances. It has not trickled yet but it is early. My other system that has a task in progress has done 3 trickles and is still happy. It's difficult with the 1 task a day limit when your computer has been a naughty boy and been forced to strip in front of the headmaster, but I've managed to get three "dodgy" machines to get one task running. So for some reason they can sometimes get a good task. I have had a couple fail after several hours, although most fail after several minutes. They're actually running faster than my fast machine, which filled all 24 threads with tasks. It shows full CPU usage on my monitoring software, but the temperature is a lot lower, and BOINC says they're only getting 2/3 of a CPU core each. I'm guessing these things have big data sets and are overloading the CPU cache? |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv. I only have 3 running on my Ryzen, but they are almost through 7 model months now. Of the work units associated with these tasks, two of the work units had two SEGV failure tasks each, very early in their runs, prior to my download. The third task running on my Ryzen had a similar early SEGV task failure prior to my downloading the 2nd task from that work unit. So, if it's an input file problem, that can't be the reason for the SEGV failures in the work units my three tasks came from. The work units are: https://www.cpdn.org/workunit.php?wuid=12217926 https://www.cpdn.org/workunit.php?wuid=12216852 https://www.cpdn.org/workunit.php?wuid=12217357 Like Dave, my Ryzen is running a version of Ubuntu, with Windows BOINC running under Wine. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Update: The current Wah batch will be suspended due to the very high number of fails with the same error. We'll be running some tests with the model to understand what's happened before the batch is resubmitted. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
Makes sense. Should we let working tasks run to completion or abort? I have seven that have all made it to at least the fourth or fifth model month. |
Send message Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080 |
I get "climateprediction.net | [http] [ID#21943] Info: Failed to connect to upload7.cpdn.org port 80 after 4356 ms: Couldn't connect to server" when uploading preliminary results. I have 22 stuck uploads for wah2 |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
I have 22 stuck uploads for wah2 Andy, Sarah and the researcher in Korea are all aware of this. I currently have over 30 zips waiting to go. Andy is in meetings all day today but tomorrow should be able to give things a nudge, be that a tweak from Oxford or an email to the owners of the server in Korea. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Makes sense. Should we let working tasks run to completion or abort? I have seven that have all made it to at least 4th or fifth model month. I don't have a good answer to that. I don't know (yet) what CPDN will decide to do about this batch. If it was me, I'd let them run. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
If it was me, I'd let them run. That was what I was going to do in the absence of being told otherwise. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Dave - I have sent you a private message regarding the model logs. Could you please check? Thanks. If it was me, I'd let them run. That was what I was going to do in the absence of being told otherwise. |
©2024 cpdn.org