climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,967,615
RAC: 14,422
Message 68251 - Posted: 10 Feb 2023, 23:27:00 UTC - in response to Message 68249.  
Last modified: 10 Feb 2023, 23:35:00 UTC

"Initially the BOINC estimated run time is off likely due to the new app version that BOINC has no data for yet."

For the six that I have from batch 990, the estimated run time is 2 days 23 hrs, compared to 16 hrs (ish) for the previous batches.

Edit: Actually running at 5.04% per hour. First one 73% complete after 14 hrs, remaining estimated at 19hrs so adjusting as it goes.
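As a rough cross-check on those figures, a naive linear extrapolation can be sketched as below. This is only an illustration that assumes progress stays constant for the rest of the run; remaining_hours is a hypothetical helper, not the calculation the BOINC client itself performs.

```python
def remaining_hours(fraction_done, elapsed_hours):
    """Linear extrapolation: assume the observed progress rate continues."""
    rate_per_hour = fraction_done / elapsed_hours   # fraction completed per hour
    return (1.0 - fraction_done) / rate_per_hour    # hours left at that rate

# Figures quoted in the post: 73% complete after 14 hours.
estimate = remaining_hours(0.73, 14.0)
print(f"about {estimate:.1f} hours left at the observed rate")
```

On these numbers the extrapolation comes out at a little over 5 hours, so the 19 hrs shown by the client presumably still carries a good deal of the initial overestimate.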
ID: 68251
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68252 - Posted: 11 Feb 2023, 3:31:10 UTC - in response to Message 68251.  

"Initially the BOINC estimated run time is off likely due to the new app version that BOINC has no data for yet."

Unfortunately, I have no estimate of how long they were to take.
Task 22250483  First one done on my Linux machine...
Name 	oifs_43r3_bl_a051_2016092300_15_949_12166575_0
Workunit 	12166575
Created 	14 Dec 2022, 14:15:27 UTC
Sent 	14 Dec 2022, 14:24:00 UTC
Report deadline 	13 Jan 2023, 14:24:00 UTC
Received 	15 Dec 2022, 12:25:25 UTC
Server state 	Over
Outcome 	Success
Client state 	Done
Exit status 	0 (0x00000000)
Computer ID 	1511241
Run time 	6 hours 46 min 55 sec
CPU time 	6 hours 41 min 2 sec
Validate state 	Valid
Credit 	1,232.00
Application version 	OpenIFS 43r3 Baroclinic Lifecycle v1.07
                        x86_64-pc-linux-gnu

Task 22250807 Most recent one done on my Linux machine.
Name 	oifs_43r3_bl_a04c_2016092300_15_949_12166550_2
Workunit 	12166550
Created 	19 Dec 2022, 2:21:53 UTC
Sent 	19 Dec 2022, 2:23:58 UTC
Report deadline 	18 Jan 2023, 2:23:58 UTC
Received 	19 Dec 2022, 9:23:21 UTC
Server state 	Over
Outcome 	Success
Client state 	Done
Exit status 	0 (0x00000000)
Computer ID 	1511241
Run time 	6 hours 12 min 40 sec
CPU time 	6 hours 7 min 11 sec
Validate state 	Valid
Credit 	1,232.00

ID: 68252
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,807,823
RAC: 19,824
Message 68253 - Posted: 11 Feb 2023, 6:59:15 UTC - in response to Message 68252.  

Unfortunately, I have no estimate of how long they were to take.

Those 2 tasks are from a BL test batch (949) from a couple of months ago using the old app version (1.07). I wouldn't rely on them for any significant info or comparison, as they were just part of the initial test runs in preparation for the OIFS release. Production runs are likely to be different and will use the latest app version (1.11 or newer).
ID: 68253
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68254 - Posted: 11 Feb 2023, 11:43:49 UTC

Got this on the last of my tasks from 990

[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1
[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1ce8cf0 for signal#8, nsigs = 1
forrtl: error (65): floating invalid
ID: 68254
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68255 - Posted: 11 Feb 2023, 12:36:54 UTC - in response to Message 68254.  

If you want to know what the signals mean in Linux, consider the following table where they are defined. Especially #8. Floating point exception. You might wish to keep it around for reference.

And here is an explanation on how it can occur.
https://itslinuxfoss.com/floating-point-exception-core-dumped/

#define SIGHUP           1
#define SIGINT           2
#define SIGQUIT          3
#define SIGILL           4
#define SIGTRAP          5
#define SIGABRT          6
#define SIGIOT           6
#define SIGBUS           7
#define SIGFPE           8
#define SIGKILL          9
#define SIGUSR1         10
#define SIGSEGV         11
#define SIGUSR2         12
#define SIGPIPE         13
#define SIGALRM         14
#define SIGTERM         15
#define SIGSTKFLT       16
#define SIGCHLD         17
#define SIGCONT         18
#define SIGSTOP         19
#define SIGTSTP         20
#define SIGTTIN         21
#define SIGTTOU         22
#define SIGURG          23
#define SIGXCPU         24
#define SIGXFSZ         25
#define SIGVTALRM       26
#define SIGPROF         27
#define SIGWINCH        28
#define SIGIO           29
#define SIGPOLL         SIGIO
/*
#define SIGLOST         29
*/
#define SIGPWR          30
#define SIGSYS          31
#define SIGUNUSED       31
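For a quick lookup without digging out the header, Python's standard signal module exposes the numbering used by the platform it runs on. A minimal sketch, assuming a Linux x86_64 host like the machines in this thread (a few numbers differ on other architectures); describe_signal is just an illustrative helper name.

```python
import signal

def describe_signal(number):
    """Return (symbolic name, human-readable description) for a signal number."""
    sig = signal.Signals(number)              # raises ValueError for unknown numbers
    return sig.name, signal.strsignal(number)

# Signal #8 is the one reported in the DrHook trace above.
name, description = describe_signal(8)
print(name, "-", description)
```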

ID: 68255
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68256 - Posted: 11 Feb 2023, 13:20:19 UTC

If you want to know what the signals mean in Linux, consider the following table where they are defined. Especially #8. Floating point exception. You might wish to keep it around for reference.

And here is an explanation on how it can occur.
https://itslinuxfoss.com/floating-point-exception-core-dumped/
Thanks, looking at the link and also in a couple of other places, this one is I suspect down to the physics of the model producing a value that the program doesn't like. It will be interesting to see what happens on subsequent attempts.
ID: 68256
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68257 - Posted: 11 Feb 2023, 14:21:14 UTC - in response to Message 68256.  

Thanks, looking at the link and also in a couple of other places, this one is I suspect down to the physics of the model producing a value that the program doesn't like. It will be interesting to see what happens on subsequent attempts.


I agree.

But do not overlook the possibility of bad addresses, bad subscripts in arrays, or using dynamically allocated memory that has been freed, yet still used by defective programs. These can all give very strange, difficult-to-reproduce, errors.
ID: 68257
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68258 - Posted: 11 Feb 2023, 14:23:57 UTC

But do not overlook the possibility of bad addresses, bad subscripts in arrays, or using dynamically allocated memory that has been freed, yet still used by defective programs. These can all give very strange, difficult-to-reproduce, errors.


Of course. Though very little running on this computer. Only programs open apart from BOINC were Firefox with only a couple of tabs open, Thunderbird and Libre Office.
ID: 68258
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68259 - Posted: 11 Feb 2023, 15:27:01 UTC - in response to Message 68258.  
Last modified: 11 Feb 2023, 15:40:50 UTC

Got this on the last of my tasks from 990


Of course. Though very little running on this computer. Only programs open apart from BOINC were Firefox with only a couple of tabs open, Thunderbird and Libre Office.


If you could somehow send me that work unit, I could run it on my machine. IIRC my machine has never failed to complete any of these Oifs work units. Neither the _ps nor the _bl ones. Almost 300 tasks altogether.

P.s.: I just got
Name oifs_43r3_ps_0561_2021050100_123_990_12206681_2
Workunit 12206681
that has failed for the two previous attempts. Each has failed for very different reasons. I betcha it works on my machine.
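The trailing fields of these task names appear to line up with details shown elsewhere in this thread: 990 is the batch, 12206681 the workunit, and the final _2 the result index (BOINC appends _N to the workunit name for each copy it issues). A small sketch that splits those three fields off from the right; parse_task_name is a hypothetical helper, and the earlier fields are deliberately left uninterpreted.

```python
def parse_task_name(task_name):
    """Split a CPDN OpenIFS task name into prefix, batch, workunit and result index.

    Only the last three underscore-separated fields are interpreted; the
    meaning of the leading fields is not assumed here.
    """
    prefix, batch, workunit, result_index = task_name.rsplit("_", 3)
    return {
        "prefix": prefix,
        "batch": int(batch),
        "workunit": int(workunit),
        "result_index": int(result_index),  # 0 for the first copy, 2 for the third
    }

info = parse_task_name("oifs_43r3_ps_0561_2021050100_123_990_12206681_2")
print(info)
```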
ID: 68259
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68260 - Posted: 11 Feb 2023, 15:49:05 UTC

If you could somehow send me that work unit, I could run it on my machine. IIRC my machine has never failed to complete any of these Oifs work units. Neither the _ps nor the _bl ones. Almost 300 tasks altogether.
The Intel machine running the second attempt has a very good record (about 1% failure rate.) I don't know if the failure on my machine is something AMD ones are more prone to or not. I know the facility to send a work unit to only a specific machine exists as it has been used on the testing site at times but I am not aware of it ever being used on the Main site.
ID: 68260
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68261 - Posted: 11 Feb 2023, 15:57:09 UTC - in response to Message 68260.  

The Intel machine running the second attempt has a very good record (about 1% failure rate.) I don't know if the failure on my machine is something AMD ones are more prone to or not.


Well, my machine is Intel; at present I am allowing 12 cores to run BOINC tasks.

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16

Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	62.4 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	479.24 GB

ID: 68261
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68264 - Posted: 11 Feb 2023, 17:22:26 UTC - in response to Message 68254.  
Last modified: 11 Feb 2023, 17:24:47 UTC

Got this on the last of my tasks from 990
[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1
[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1ce8cf0 for signal#8, nsigs = 1
forrtl: error (65): floating invalid
I've seen that one. If you're interested, look further back in the traceback and you'll see:

             >OMP-RADINTG-RADLSW       (1210) 
              RADIATION_SCHEME 
               radiation_interface:radiation 
                radiation_cloud_optics:cloud_optics 

The model has failed in the radiation code. Floating invalid is usually a divide-by-zero. There were a few WUs that failed each try because the butterfly wings were perhaps too big :) Interesting though, there were a few other cases where the model failed like this on AMD hardware, the resend went to an Intel CPU and worked fine. Which is why they've been tried again.

It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs).
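A toy illustration of that last point, not taken from the OpenIFS radiation code: merely changing the order in which three numbers are added changes the last bit of the sum, and that is enough to decide whether a later denominator is exactly zero. The values and the safe_inverse helper are made up for the example; Python raises ZeroDivisionError where compiled Fortran with trapping enabled would raise a floating-point signal.

```python
def safe_inverse(denominator):
    """Return 1/denominator, or None if the division traps on zero."""
    try:
        return 1.0 / denominator
    except ZeroDivisionError:
        return None

# The same three terms, summed in two different orders.
sum_left_first = (0.1 + 0.2) + 0.3    # rounds to 0.6000000000000001
sum_right_first = 0.1 + (0.2 + 0.3)   # rounds to 0.6

# A denominator that should be "about zero" in both cases.
inverse_a = safe_inverse(sum_left_first - 0.6)    # tiny but non-zero: finite result
inverse_b = safe_inverse(sum_right_first - 0.6)   # exactly zero: division fails

print(sum_left_first == sum_right_first, inverse_a, inverse_b)
```

Different compilers, optimization levels and math libraries are free to make this kind of reordering, which is one way the same workunit can fail on one host and complete on another.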
ID: 68264
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68265 - Posted: 11 Feb 2023, 17:52:18 UTC - in response to Message 68264.  

It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs).


I agree that this is the most likely explanation.

I did not mean that it was a failure of the memory of the machine doing the task. I meant that the current (and probably all future) Oifs models do a lot of memory allocation and freeing during their execution, and some failures seem to complain about freeing the same memory more than once; indicating, most likely, a programming error. And that being a possible thing, it is most vexing to find.

In a former life, I was involved in writing (part of) the optimizer for the C compiler in UNIX. And people accused the optimizer of being defective because it gave different results than when code was not optimized. It turns out that the optimizer was not at fault. We guaranteed that our optimizer would give the same result for correctly-written code, but were silent about what would happen for incorrect code. We even compiled and ran the UNIX kernel and all the libraries with the optimizer turned on. It turns out that there was a lot of code out there that used pointers that were not initialized, so G.O.K. what values they had. Most were zero, and it was easy to trap those since we never stored anything in the bottom page of RAM, so all traps to there were uninitialized pointers. We found so many of those that they would not even read my MRs after a while. We had a secretary file my MRs with her name on them for a while, but then they caught on. By the time I left, they had never fixed those problems.
ID: 68265
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68266 - Posted: 11 Feb 2023, 19:46:12 UTC - in response to Message 68265.  
Last modified: 11 Feb 2023, 19:48:08 UTC

It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs).
I agree that this is the most likely explanation.

I did not mean that it was a failure of the memory of the machine doing the task. I meant that the current (and probably all future) Oifs models do a lot of memory allocation and freeing during their execution, and some failures seem to complain about freeing the same memory more than once; indicating, most likely, a programming error. And that being a possible thing, it is most vexing to find.

In a former life, I was involved in writing (part of) the optimizer for the C compiler in UNIX. And people accused the optimizer of being defective because it gave different results than when code was not optimized. It turns out that the optimizer was not at fault. We guaranteed that our optimizer would give the same result for correctly-written code, but were silent about what would happen for incorrect code. We even compiled and ran the UNIX kernel and all the libraries with the optimizer turned on. It turns out that there was a lot of code out there that used pointers that were not initialized, so G.O.K. what values they had. Most were zero, and it was easy to trap those since we never stored anything in the bottom page of RAM, so all traps to there were uninitialized pointers. We found so many of those that they would not even read my MRs after a while. We had a secretary file my MRs with her name on them for a while, but then they caught on. By the time I left, they had never fixed those problems.
Don't get me started on code optimizers - especially when dealing with vector instructions. I have a couple of stories there...

Anyway, the OpenIFS code does do a lot of heap allocate/free (it's mostly Fortran code), but the memory problems that have been reported here are not from the model but from the C++ wrapper code that monitors it and talks to BOINC, just in case I've confused things. The wrapper is newer code and not as tried and tested as the model.

I agree completely about being careful with code & optimizers. I once saw a model go from radiative heating in the model stratosphere to radiative cooling just by moving the code to a new machine & compiler (I forget what that was now). That wasn't a good thing, which took time to understand.

Before we put out these batches which have slight model perturbations, the idea of how much perturbation occurs from different computers was discussed. The machine perturbations are relatively small compared to the model changes being made, so the "hardware-only" model outcomes will still be part of the perturbation space explored by the scientist's perturbations.
ID: 68266
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68270 - Posted: 12 Feb 2023, 4:06:15 UTC - in response to Message 68259.  

I just got
Name oifs_43r3_ps_0561_2021050100_123_990_12206681_2
Workunit 12206681
that has failed for the two previous attempts. Each has failed for very different reasons. I betcha it works on my machine.


I win. My attempt worked just fine:

Task 22306953
Name 	oifs_43r3_ps_0561_2021050100_123_990_12206681_2
Workunit 	12206681
Created 	11 Feb 2023, 12:22:34 UTC
Sent 	11 Feb 2023, 12:25:27 UTC
Report deadline 	12 Apr 2023, 12:25:27 UTC
Received 	12 Feb 2023, 3:56:11 UTC
Server state 	Over
Outcome 	Success
Client state 	Done
Exit status 	0 (0x00000000)
Computer ID 	1511241
Run time 	15 hours 20 min 39 sec
CPU time 	15 hours 1 min 51 sec
Validate state 	Valid
Credit 	0.00
Device peak FLOPS 	6.06 GFLOPS
Application version 	OpenIFS 43r3 Perturbed Surface v1.09
x86_64-pc-linux-gnu

OpenIFS 43r3 Perturbed Surface 1.09 x86_64-pc-linux-gnu
Number of tasks completed 	2
Max tasks per day 	6
Number of tasks today 	0
Consecutive valid tasks 	2
Average processing rate 	28.85 GFLOPS
Average turnaround time 	0.64 days

ID: 68270
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68271 - Posted: 12 Feb 2023, 7:18:26 UTC

And the one that failed on my Ryzen has completed on its second attempt.
ID: 68271
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68274 - Posted: 12 Feb 2023, 18:20:44 UTC - in response to Message 68271.  

Now running one that has failed once on an Intel machine and once on AMD. The AMD is a double corruption and the Intel is
free(): invalid next size (fast)
ID: 68274
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68275 - Posted: 12 Feb 2023, 19:03:19 UTC - in response to Message 68274.  

Now running one that has failed once on an Intel machine and once on AMD. The AMD is a double corruption and the Intel is
free(): invalid next size (fast)
Don't take this the wrong way, but I sincerely hope that fails as well. Then we may have found a repeatable failure - which has eluded me so far.

As for the other AMD:fail, Intel:Ok, I am wondering whether to turn down the optimization level on the Intel compiler I use for the model.

Thx for reporting. Links to the WUs pages are useful too.
ID: 68275
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68276 - Posted: 12 Feb 2023, 19:38:24 UTC - in response to Message 68275.  
Last modified: 12 Feb 2023, 19:47:29 UTC

Thx for reporting. Links to the WUs pages are useful too.

Work unit

I am only running the one task at the moment and set to a maximum of 2 which will minimise the chances of other tasks interfering.

Edited to provide the correct work unit.

Edit2: Intel failed after uploading zip 95. The AMD managed another 10 zips so possibly not a smoking gun.
ID: 68276
biodoc

Joined: 2 Oct 19
Posts: 21
Credit: 47,674,094
RAC: 24,265
Message 68277 - Posted: 12 Feb 2023, 20:16:00 UTC - in response to Message 68275.  


As for the other AMD:fail, Intel:Ok, I am wondering whether to turn down the optimization level on the Intel compiler I use for the model.

I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project?
ID: 68277

©2024 cpdn.org