climateprediction.net (CPDN) home page
Thread 'New work discussion - 2'

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68971 - Posted: 25 Jun 2023, 10:02:12 UTC - in response to Message 68970.  
Last modified: 25 Jun 2023, 10:15:56 UTC

Why would this file behave differently on different machines? Is it a file which has not been downloaded when it should have been, and once a computer gets it, it stays there and everything works?

Yes I was going to ask why we still had 1 year deadlines. I often find my computers leaving them and going off to do something else. I had to crank up the resource share of CPDN to stop them doing so.
ID: 68971
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 68973 - Posted: 25 Jun 2023, 10:19:59 UTC - in response to Message 68971.  

Why would this file behave differently on different machines?


Why should anything else in the tasks do that? Sarah identified a potential problem with a file, which makes me think there is a good chance it is the culprit. There are so many variables between computers that pinning down the common link between either those that work or those that don't is never going to be as straightforward as it is with the missing 32-bit library files for the older Linux tasks. I have pretty much eliminated CPU type and OS versions from my list. None of those I looked at were short on RAM, which can be an issue when running a lot of tasks. That over a hundred batches of this type of task have run in the past without this problem suggests one of the batch-specific files. Once the file in question is identified, it would be nice to know why it affects some and not others, but I am not sure we will ever know.
ID: 68973
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68974 - Posted: 25 Jun 2023, 10:56:38 UTC - in response to Message 68973.  

As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv.

Without seeing the process traceback and the model log file it's v difficult to know. If the CPDN server decides to give me some more tasks I'll disable networking to keep the files so I can look at them. However, all my tasks' workunits all failed so I suspect it's a bad input problem.

I'll join the CPDN technical meeting tomorrow to find out more.
---
CPDN Visiting Scientist
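A minimal Python sketch of the mechanism described in this post, not taken from the model code. The table, the band width and the function names are all hypothetical. It shows how a value read from an input file can turn into an invalid index even though the arithmetic itself is correct. Python wraps a negative index around to the end of the list and so silently returns the wrong element; compiled code without bounds checks would instead read outside the array, which is where a segv can come from.

```python
# Hypothetical lookup table: one entry per 100 hPa pressure band (0-1000 hPa).
BAND_WIDTH_HPA = 100.0
TABLE = [float(i) for i in range(10)]  # placeholder contents


def band_index(pressure_hpa: float) -> int:
    """Turn a pressure value from an input file into a table index."""
    return int(pressure_hpa // BAND_WIDTH_HPA)


def lookup_unchecked(pressure_hpa: float) -> float:
    # No validation: the index is trusted because the arithmetic is "correct".
    return TABLE[band_index(pressure_hpa)]


def lookup_checked(pressure_hpa: float) -> float:
    # Validate the derived index before touching the table.
    i = band_index(pressure_hpa)
    if not 0 <= i < len(TABLE):
        raise ValueError(
            f"pressure {pressure_hpa} hPa maps to index {i}, "
            f"outside 0..{len(TABLE) - 1}"
        )
    return TABLE[i]
```

With a sane value such as 250.0 both functions return the same entry. With -250.0 the unchecked version quietly returns an unrelated entry, while the checked version raises ValueError and names the offending value.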
ID: 68974
Bill F
Joined: 17 Jan 09
Posts: 124
Credit: 2,027,010
RAC: 2,694
Message 68975 - Posted: 25 Jun 2023, 11:55:44 UTC

Well my slowest machine after 4 back to back failures is now 33 minutes into a task and running it by all appearances. It has not trickled yet but it is early. My other system that has a task in progress has done 3 trickles and is still happy.

Bill F
ID: 68975
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68976 - Posted: 25 Jun 2023, 13:20:40 UTC - in response to Message 68974.  

As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv.


Should this not be proven impossible, or checked by the program, before a bad memory reference is even generated or used? I.e., when all is said and done, no matter what bad data is presented to a program, it should never get a segmentation violation. The only thing that should cause a segmentation violation in a correct program would be a hardware error.
ID: 68976
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 68977 - Posted: 25 Jun 2023, 14:13:17 UTC

The only thing that should cause a segmentation violation in a correct program would be a hardware error.
It is possible the error is down to how Windows handles the data rather than the Met Office programs.
ID: 68977
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68979 - Posted: 25 Jun 2023, 16:01:37 UTC - in response to Message 68977.  
Last modified: 25 Jun 2023, 16:04:32 UTC

The only thing that should cause a segmentation violation in a correct program would be a hardware error.
It is possible the error is down to how Windows handles the data rather than the Met Office programs.
That's not correct. Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv.
---
CPDN Visiting Scientist
ID: 68979
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68980 - Posted: 25 Jun 2023, 16:53:13 UTC - in response to Message 68979.  

Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv.


The program knows the dimensions of the array, so it should be able to determine if the array reference is in bounds or not.
ID: 68980
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68982 - Posted: 25 Jun 2023, 17:11:07 UTC - in response to Message 68980.  
Last modified: 25 Jun 2023, 17:12:41 UTC

Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv.
The program knows the dimensions of the array, so it should be able to determine if the array reference is in bounds or not.
That's true, but not all codes add this extra protection all the time. The code may have the correct computation but the data can still cause the code to fail. Although compilers can add in automatic array bounds checking, this is never turned on in production codes as it's a performance hit.
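As a rough Python analogy for checks that exist in a debug build but are stripped from a production one (the function and its arguments are hypothetical, not from the model): Python removes `assert` statements when it is run with the `-O` option, so the bounds check below costs nothing in an optimised run, and also protects nothing.

```python
from typing import Sequence


def column_sum(field: Sequence[float], ncols: int, col: int) -> float:
    """Sum one column of a row-major 2-D field stored as a flat sequence."""
    # Debug-only bounds check: this line is removed when Python runs with -O,
    # much like a compiler bounds-checking option left off in production.
    assert 0 <= col < ncols, f"column {col} outside 0..{ncols - 1}"
    # Every ncols-th element starting at col belongs to that column.
    return sum(field[col::ncols])
```

In a normal run an out-of-range column stops at the assertion. Under `python -O` the same call carries on and returns a sum over the wrong elements, which is the trade-off described above: the computation is unchanged, only the safety net is gone.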
ID: 68982
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 68985 - Posted: 25 Jun 2023, 20:52:07 UTC - in response to Message 68977.  
Last modified: 25 Jun 2023, 20:52:23 UTC

The only thing that should cause a segmentation violation in a correct program would be a hardware error.
It is possible the error is down to how Windows handles the data rather than the Met Office programs.


If you get different behavior, particularly around SIGSEGV, with the same code on different platforms, it's usually related to how memory is allocated and being "a little bit off" the end of an array in one direction or another.

I don't do cross platform stuff anymore, but Windows and Linux absolutely handle memory allocation differently enough that the same memory access error (what should be an invalid access) will segfault on one platform, but not the other. They're both "wrong," but "how wrong you have to be to segfault" is different between the platforms. But it almost certainly means the code isn't bounds checking stuff somewhere, and probably could use some Valgrind-based love to catch those.
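A toy Python model of the "how wrong you have to be" point, under loud assumptions: the page size, the allocator slack figures and the function names are invented for illustration and do not describe how Windows or Linux actually lay out memory. The only fact relied on is that an access faults when it touches memory that is not mapped, so an overrun that still lands on a mapped page goes unnoticed.

```python
PAGE = 4096  # a common page size; an assumption for this toy model


def mapped_bytes(array_bytes: int, slack_bytes: int) -> int:
    """Bytes actually mapped: array plus allocator slack, rounded up to whole pages."""
    total = array_bytes + slack_bytes
    return -(-total // PAGE) * PAGE  # ceiling division


def access_outcome(offset: int, array_bytes: int, slack_bytes: int) -> str:
    """Classify a byte access at `offset` from the start of the array.

    Only overruns past the end are modelled; negative offsets are rejected.
    """
    if offset < 0:
        raise ValueError("this toy model only covers accesses at or past the start")
    if offset < array_bytes:
        return "ok"
    if offset < mapped_bytes(array_bytes, slack_bytes):
        return "silent overrun"  # invalid, but lands on mapped memory
    return "segfault"            # touches an unmapped page
```

For a 4000-byte array, an access at offset 4096 is classed as a segfault when the hypothetical allocator adds no slack, and as a silent overrun when it adds 200 bytes, because the mapping then extends to a second page. Same bug, different symptom.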
ID: 68985
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68986 - Posted: 25 Jun 2023, 21:56:16 UTC - in response to Message 68974.  

As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv.

Without seeing the process traceback and the model log file it's v difficult to know. If the CPDN server decides to give me some more tasks I'll disable networking to keep the files so I can look at them. However, all my tasks' workunits all failed so I suspect it's a bad input problem.

I'll join the CPDN technical meeting tomorrow to find out more.
Presumably these files you wish to keep are now on the server from all of us who failed, so somebody can check, do you not have access to them?
ID: 68986
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68987 - Posted: 25 Jun 2023, 22:00:13 UTC - in response to Message 68975.  

Well my slowest machine after 4 back to back failures is now 33 minutes into a task and running it by all appearances. It has not trickled yet but it is early. My other system that has a task in progress has done 3 trickles and is still happy.

Bill F
It's difficult with the 1 task a day limit when your computer has been a naughty boy and been forced to strip in front of the headmaster, but I've managed to get three "dodgy" machines to get one task running. So for some reason they can sometimes get a good task. I have had a couple fail after several hours, although most fail after several minutes. They're actually running faster than my fast machine, which filled all 24 threads with tasks. It shows full CPU usage on my monitoring software, but the temperature is a lot lower, and BOINC says they're only getting 2/3 of a CPU core each. I'm guessing these things have big data sets and are overloading the CPU cache?
ID: 68987
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 68988 - Posted: 26 Jun 2023, 7:30:06 UTC - in response to Message 68974.  

As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv.

Without seeing the process traceback and the model log file it's v difficult to know. If the CPDN server decides to give me some more tasks I'll disable networking to keep the files so I can look at them. However, all my tasks' workunits all failed so I suspect it's a bad input problem.

I'll join the CPDN technical meeting tomorrow to find out more.

I only have 3 running on my Ryzen, but they are almost through 7 model months now. Of the work units associated with these tasks, two of the work units had two SEGV failure tasks each, very early in their runs, prior to my download. The third task running on my Ryzen had a similar early SEGV task failure prior to my downloading the 2nd task from that work unit. So, if it's an input file problem, that can't be the reason for the SEGV failures in the work units my three tasks came from. The work units are:

https://www.cpdn.org/workunit.php?wuid=12217926
https://www.cpdn.org/workunit.php?wuid=12216852
https://www.cpdn.org/workunit.php?wuid=12217357

Like Dave, my Ryzen is running a version of Ubuntu, with Windows BOINC running under Wine.
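The reasoning above can be sketched in Python, with the caveat that the outcome labels and the work-unit keys are hypothetical stand-ins rather than anything the server reports. The idea is that tasks in one work unit share the same inputs, so a work unit showing both an early SEGV and a task that keeps running points away from the inputs alone.

```python
from typing import Dict, List

EARLY_FAIL = "segv"              # hypothetical label for an early SEGV failure
ALIVE = {"running", "success"}   # hypothetical labels for tasks that got going


def diagnose(outcomes: List[str]) -> str:
    """Classify one work unit from the outcomes of its tasks."""
    failed = any(o == EARLY_FAIL for o in outcomes)
    alive = any(o in ALIVE for o in outcomes)
    if failed and alive:
        return "host-dependent"            # same inputs, different results
    if failed:
        return "consistent with bad input"
    return "no early failures"


# Outcomes as described in the post; which work unit is which is not stated,
# so the keys are placeholders.
work_units: Dict[str, List[str]] = {
    "wu_a": ["segv", "segv", "running"],
    "wu_b": ["segv", "segv", "running"],
    "wu_c": ["segv", "running"],
}

for name, outcomes in work_units.items():
    print(name, diagnose(outcomes))
```

All three work units come out as host-dependent under this rule. It is only a rule of thumb: it cannot rule out an input file that misbehaves on some hosts and not others.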
ID: 68988
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68989 - Posted: 26 Jun 2023, 10:12:58 UTC

Update: The current Wah batch will be suspended due to the v high number of fails with the same error. We'll be running some tests with the model to understand what's happened before the batch is resubmitted.
---
CPDN Visiting Scientist
ID: 68989
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 68990 - Posted: 26 Jun 2023, 10:19:12 UTC - in response to Message 68989.  

Makes sense. Should we let working tasks run to completion or abort? I have seven that have all made it to at least the fourth or fifth model month.
ID: 68990
kotenok2000
Joined: 22 Feb 11
Posts: 32
Credit: 226,546
RAC: 4,080
Message 68991 - Posted: 26 Jun 2023, 11:45:48 UTC
Last modified: 26 Jun 2023, 11:49:07 UTC

I get "climateprediction.net | [http] [ID#21943] Info: Failed to connect to upload7.cpdn.org port 80 after 4356 ms: Couldn't connect to server" when uploading preliminary results.
I have 22 stuck uploads for wah2
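For anyone wanting to pull the host, port and timeout out of messages like the one quoted above, here is a small Python sketch. The pattern is inferred from that single line only, so treat the format as an assumption; other BOINC client versions may word it differently.

```python
import re

# The message exactly as quoted in the post.
LINE = (
    "climateprediction.net | [http] [ID#21943] Info: Failed to connect to "
    "upload7.cpdn.org port 80 after 4356 ms: Couldn't connect to server"
)

# Pattern inferred from the one example line above (an assumption).
PATTERN = re.compile(
    r"Failed to connect to (?P<host>\S+) port (?P<port>\d+) after (?P<ms>\d+) ms"
)


def parse_connect_failure(line: str):
    """Return (host, port, elapsed_ms) for a connect-failure line, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match["host"], int(match["port"]), int(match["ms"])


print(parse_connect_failure(LINE))
```

Run over a saved event log, this makes it easy to count how many failures point at the same upload server, which is the useful detail to pass on when reporting stuck uploads.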
ID: 68991
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 68992 - Posted: 26 Jun 2023, 14:41:13 UTC - in response to Message 68991.  

I have 22 stuck uploads for wah2
Andy, Sarah and the researcher in Korea are all aware of this. I currently have over 30 zips waiting to go. Andy is in meetings all day today, but tomorrow he should be able to give things a nudge, be that a tweak from Oxford or an email to the owners of the server in Korea.
ID: 68992
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68993 - Posted: 26 Jun 2023, 15:56:47 UTC - in response to Message 68990.  

Makes sense. Should we let working tasks run to completion or abort? I have seven that have all made it to at least the fourth or fifth model month.
I don't have a good answer to that. I don't know (yet) what CPDN will decide to do about this batch. If it was me, I'd let them run.
ID: 68993
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 68994 - Posted: 26 Jun 2023, 16:21:44 UTC - in response to Message 68993.  

If it was me, I'd let them run.
That was what I was going to do in the absence of being told otherwise.
ID: 68994
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68995 - Posted: 26 Jun 2023, 19:46:12 UTC - in response to Message 68994.  

If it was me, I'd let them run.
That was what I was going to do in the absence of being told otherwise.
Dave - I have sent you a private message regarding the model logs. Could you please check? Thx.
ID: 68995