Thread 'New Work Announcements 2024'

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70935 - Posted: 7 Jun 2024, 15:00:05 UTC - in response to Message 70933.  
Last modified: 7 Jun 2024, 15:00:35 UTC

Batch 1017 is a production batch for the OpenIFS baroclinic lifecycle version. This batch is being used to test a difficult background state which arose from comments made by referees to the submitted paper using the results obtained from last year. The scientists want to rerun the batch before making the data available.

The baroclinic lifecycle experiment uses an 'aquaplanet' configuration; i.e. there's no land. This is a specialized configuration used for testing various theories in atmospheric science. From a technical point of view it means the model needs less memory and tasks complete quicker, because the land processes are not needed.

Batch 1016 was a testing batch.
---
CPDN Visiting Scientist
ID: 70935
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70938 - Posted: 7 Jun 2024, 17:21:44 UTC - in response to Message 70935.  
Last modified: 7 Jun 2024, 17:31:03 UTC

> The baroclinic lifecycle experiment uses an 'aquaplanet' configuration; i.e. there's no land. This is a specialized configuration used for testing various theories in atmospheric science. From a technical point of view it means the model needs less memory and tasks complete quicker, because the land processes are not needed.


My Linux machine has lots of RAM, so I am unlikely to run out of it. The memory these tasks consume varies a lot over time; a task takes some, then gives some back. My app_config file allows only two of these tasks to run at a time, and the machine is running 14 Boinc tasks in total. Since summer is approaching and I have no AC, I may have to cut that to 13 or 12 Boinc tasks at a time. IIRC, last summer I had to cut it to 8 for a while.

When these two tasks started, Boinc thought they would take about 12 hours each.
top - 13:15:13 up 2 days,  1:43,  2 users,  load average: 14.55, 14.60, 14.44
Tasks: 477 total,  15 running, 462 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.6 us,  0.5 sy, 86.7 ni, 11.5 id,  0.5 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem : 128086.0 total,  25635.9 free,  14783.6 used,  87666.5 buff/cache
MiB Swap:  15992.0 total,  15992.0 free,      0.0 used. 111669.4 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 390792  390785 boinc     39  19 R   5.3g   4.3  99.2  5 189:05.37 /var/lib/boinc/slots/3/oifs_43r3_model.exe                                
 389086  389078 boinc     39  19 R   4.1g   3.3  99.3 14 204:29.21 /var/lib/boinc/slots/0/oifs_43r3_model.exe    

ID: 70938
alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 70939 - Posted: 7 Jun 2024, 19:33:11 UTC
Last modified: 7 Jun 2024, 19:39:54 UTC

Has anyone had one of these get past the 15-day point? I've had reported failures on the only "completed" tasks so far on each of three systems; the stderr.txt for one of them has this sequence
  19:04:24 STEP 1438 H= 359:30 +CPU= 11.542
  19:04:36 STEP 1439 H= 359:45 +CPU= 11.581
  19:04:53 STEP 1440 H= 360:00 +CPU= 16.735
..The child process terminated with status: 0
>>> Printing last 70 lines from file: NODE.001_01

followed by some statistics and the information about building the 14.zip file, which it calls the final file when it uploads it :-)

Then, of course, it complains that it can't find the other 5 files to upload when boinc_finish() has been called :-(
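For anyone wanting to check where their own task stopped, here is a minimal Python sketch that pulls the last STEP line out of a stderr excerpt and converts it to model days. The line layout is taken from the excerpt above; the function name `last_step` and the regular expression are illustrative assumptions, not part of the CPDN or OpenIFS code.

```python
import re

# Matches lines such as "19:04:53 STEP 1440 H= 360:00 +CPU= 16.735".
# The pattern is an assumption based on the excerpt quoted in this thread.
STEP_RE = re.compile(r"STEP\s+(\d+)\s+H=\s*(\d+):(\d+)")


def last_step(stderr_text):
    """Return (step_number, model_hours) for the last STEP line, or None."""
    last = None
    for match in STEP_RE.finditer(stderr_text):
        step, hours, minutes = (int(group) for group in match.groups())
        last = (step, hours + minutes / 60)
    return last


sample = """\
  19:04:24 STEP 1438 H= 359:30 +CPU= 11.542
  19:04:36 STEP 1439 H= 359:45 +CPU= 11.581
  19:04:53 STEP 1440 H= 360:00 +CPU= 16.735
"""

step, model_hours = last_step(sample)
# Step number, model days reached, and minutes of model time per step.
print(step, model_hours / 24, 60 * model_hours / step)  # 1440 15.0 15.0
```

On the excerpt above this gives step 1440 at 15 model days, i.e. 15 minutes of model time per step, which matches the "15-day point" mentioned at the start of this post.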

I'm only running one at a time, by the way, so the next failures are anticipated at about midnight -- I hope I'm wrong, but...

Cheers - Al.

P.S. Apologies if this isn't really the right place to post this...

[Edited to try to improve clarity...]
ID: 70939
Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 70940 - Posted: 7 Jun 2024, 19:54:36 UTC

Yeah, it stops too early; all mine fail.
ID: 70940
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 70941 - Posted: 7 Jun 2024, 19:57:00 UTC
Last modified: 7 Jun 2024, 20:09:23 UTC

Thrilled that I'm getting so many Linux oifs_43r3_bl WUs but many are crashing. They keep using more RAM until each gets to 6 GB. WUs and browser tabs start crashing when the RAM is fully committed and it starts using Swap. I'm trying to limit the number running:
<app_config>
  <!-- i9-7980XE 18c36t 4x16=64 GB L3 Cache 24.75 MB -->
  <app>
    <name>oifs_43r3_bl</name>
    <!-- OpenIFS 43r3 Baroclinic Lifecycle -->
    <!-- needs 6 GB RAM per WU -->
    <max_concurrent>10</max_concurrent>
    <fraction_done_exact/>
  </app>
  <project_max_concurrent>10</project_max_concurrent>
</app_config>
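As a rough check on the `max_concurrent` value, here is a small Python sketch of the arithmetic, assuming the figures in the comments above (64 GB of RAM, about 6 GB per WU). The function name and the 8 GB reserve for the OS, browser and other tasks are assumptions chosen for illustration, not a project recommendation.

```python
def max_concurrent_tasks(total_ram_gb, per_task_gb, reserve_gb=8.0):
    """Return how many tasks fit in RAM after setting aside some headroom.

    reserve_gb is a hypothetical allowance for the OS, browser and other
    BOINC work; adjust it to suit the machine.
    """
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    usable_gb = max(total_ram_gb - reserve_gb, 0.0)
    return int(usable_gb // per_task_gb)


# 64 GB machine, 6 GB per task, 8 GB held back for everything else.
print(max_concurrent_tasks(64, 6))                # 9
# With no headroom at all, 10 tasks fit, but only 4 GB is left over.
print(max_concurrent_tasks(64, 6, reserve_gb=0))  # 10
```

With no reserve, 10 tasks at 6 GB each leave only 4 GB for everything else on a 64 GB machine, which would be consistent with the swapping described above.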
ID: 70941
klepel
Joined: 9 Oct 04
Posts: 82
Credit: 69,926,017
RAC: 7,296
Message 70942 - Posted: 7 Jun 2024, 21:07:36 UTC - in response to Message 70940.  

Same problem with Zip14.
ID: 70942
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70943 - Posted: 7 Jun 2024, 21:22:02 UTC - in response to Message 70939.  

> Has anyone had one of these get past the 15-day point? I've had reported failures on the only "completed" tasks so far on each of three systems;


I have now received 9 of these tasks, and have set app_config to allow up to three at a time to run, which my machine is now doing. Two of them have done a trickle.
I do not suppose I will get to 15 days for any of them. Boinc manager thinks they will take a trifle over 12 hours to run, but ...

top - 17:07:14 up 2 days,  5:35,  2 users,  load average: 14.46, 14.85, 15.76
Tasks: 478 total,  15 running, 463 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.1 us,  0.5 sy, 86.6 ni, 10.5 id,  0.0 wa,  0.3 hi,  0.1 si,  0.0 st
MiB Mem : 128086.0 total,  22679.4 free,  16173.7 used,  89232.9 buff/cache
MiB Swap:  15992.0 total,  15992.0 free,      0.0 used. 110244.9 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 389086  389078 boinc     39  19 R   4.1g   3.3  99.3 10 432:12.06 /var/lib/boinc/slots/0/oifs_43r3_model.exe                                
 421785  421781 boinc     39  19 R   4.1g   3.3  99.3 12 161:20.33 /var/lib/boinc/slots/14/oifs_43r3_model.exe                               
 390792  390785 boinc     39  19 R   2.3g   1.9  99.3  9 416:46.12 /var/lib/boinc/slots/3/oifs_43r3_model.exe          


Two of them are about 3/4 done after about 7 hours of processing.
ID: 70943
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70944 - Posted: 7 Jun 2024, 21:50:51 UTC - in response to Message 70939.  

OK: one of mine just failed. The end of my stderr file (which is huge) looks like this:

Uploading the final file: upload_file_14.zip
Uploading trickle at timestep: 1295100
17:14:10 (389078): called boinc_finish(0)
</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_15.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_16.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_17.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_18.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_19.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>
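To see at a glance which uploads the client could not find, a short Python sketch like the following can list them. It assumes the `<file_xfer_error>` layout shown above and that the upload index is the number just before `.zip`; the function name `missing_uploads` and the pattern are illustrative, not part of BOINC.

```python
import re

# Pattern for one <file_xfer_error> entry, based on the excerpt in this post.
XFER_RE = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>[^)]*)\)</error_code>"
)


def missing_uploads(message):
    """Return a list of (upload_index, error_code, reason) tuples."""
    found = []
    for match in XFER_RE.finditer(message):
        # Assumption: the upload index is the trailing "_NN" before ".zip".
        index = re.search(r"_(\d+)\.zip$", match["name"])
        if index:
            found.append((int(index.group(1)), int(match["code"]), match["reason"]))
    return found


sample = """\
<file_xfer_error>
  <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_15.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_19.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
"""

for index, code, reason in missing_uploads(sample):
    print(f"upload {index}: error {code} ({reason})")
```

Run against the full message above, this would report uploads 15 to 19 as not found, while upload 14 was the last one actually produced.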


Here is everything except the Stderr stuff that you can get from

Task 22435193
Name 	oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0
Workunit 	12283614
Created 	7 Jun 2024, 13:26:01 UTC
Sent 	7 Jun 2024, 13:27:20 UTC
Report deadline 	6 Aug 2024, 13:27:20 UTC
Received 	7 Jun 2024, 21:30:58 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	0 (0x00000000)
Computer ID 	1511241
Run time 	7 hours 24 min 37 sec
CPU time 	7 hours 15 min 35 sec
Validate state 	Invalid
Credit 	1,318.46
Device peak FLOPS 	5.93 GFLOPS
Application version 	OpenIFS 43r3 Baroclinic Lifecycle v1.13
x86_64-pc-linux-gnu
Peak working set size 	5,566.54 MB
Peak swap size 	5,980.80 MB
Peak disk usage 	1,283.83 MB

ID: 70944
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70945 - Posted: 7 Jun 2024, 21:52:42 UTC - in response to Message 70941.  

There is a bug in the boinc client where it does not accurately total up the memory required by the tasks it starts. It starts too many unless you are controlling them with an app_config.xml file. I have reported this and the fix is scheduled for boinc 8.20.

I thought CPDN had set a limit in the scheduler on the maximum number of tasks in progress, which helped with this. I can check. In the meantime, either use an app_config.xml or control it with the percentage of CPUs used.

> Thrilled that I'm getting so many Linux oifs_43r3_bl WUs but many are crashing. They keep using more RAM until each gets to 6 GB. WUs and browser tabs start crashing when the RAM is fully committed and it starts using Swap. I'm trying to limit the number running:
> <app_config>
>   <!-- i9-7980XE 18c36t 4x16=64 GB L3 Cache 24.75 MB -->
>   <app>
>     <name>oifs_43r3_bl</name>
>     <!-- OpenIFS 43r3 Baroclinic Lifecycle -->
>     <!-- needs 6 GB RAM per WU -->
>     <max_concurrent>10</max_concurrent>
>     <fraction_done_exact/>
>   </app>
>   <project_max_concurrent>10</project_max_concurrent>
> </app_config>

---
CPDN Visiting Scientist
ID: 70945
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70946 - Posted: 7 Jun 2024, 21:53:39 UTC - in response to Message 70942.  
Last modified: 7 Jun 2024, 21:53:51 UTC

> Same problem with Zip14.
Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?
---
CPDN Visiting Scientist
ID: 70946
alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 70947 - Posted: 7 Jun 2024, 23:07:22 UTC - in response to Message 70946.  
Last modified: 7 Jun 2024, 23:12:08 UTC

> > Same problem with Zip14.
> Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?
The task(s) I talked about up-thread were batch 1017 -- the one I actually linked to was oifs_43r3_bl_a0mt_2016092300_20_1017_12282648_0. Hope that helps.

Cheers - Al.

P.S. it's really handy that the WU number is in the task name, isn't it :-)
ID: 70947
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70950 - Posted: 8 Jun 2024, 0:39:09 UTC - in response to Message 70945.  

> There is a bug in the boinc client where it does not accurately total up the memory required by the tasks it starts. It starts too many unless you are controlling them with an app_config.xml file. I have reported this and the fix is scheduled for boinc 8.20.


I have 128 GBytes of RAM in my Linux machine. My app_config file limits me to running 3 oifs_43r3_bl tasks at a time, and they are confined to a small (to me) amount of RAM. Running 14 Boinc processes and everything else currently uses about 16 GBytes of RAM, so that can hardly be the reason these tasks fail.

Good thing too, because I doubt my Linux distro will ever upgrade past 7.20.2.

 
MiB Mem : 128086.0 total,  20744.5 free,  18040.2 used,  89301.2 buff/cache
MiB Swap:  15992.0 total,  15992.0 free,      0.0 used. 108365.3 avail Mem 

PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 445502  445497 boinc     39  19 R   4.7g   3.8  99.2  3 192:51.32 /var/lib/boinc/slots/0/oifs_43r3_model.exe                                
 421785  421781 boinc     39  19 R   4.1g   3.3  99.1  8 361:08.70 /var/lib/boinc/slots/14/oifs_43r3_model.exe                               
 448361  448354 boinc     39  19 R   2.3g   1.8  99.1  0 168:39.41 /var/lib/boinc/slots/3/oifs_43r3_model.exe        
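To total the resident memory of the model processes from a `top` listing like the one above, a minimal Python sketch could look like this. It assumes the column order shown here (RES is the seventh field) and that an RES value without a suffix is in KiB, which is top's usual default; the function names are hypothetical.

```python
# Multipliers that convert top's RES suffixes to GiB.
UNIT_TO_GIB = {"k": 1 / 1024**2, "m": 1 / 1024, "g": 1.0, "t": 1024.0}


def res_to_gib(res):
    """Convert a top RES field such as '4.7g' or '812344' to GiB."""
    suffix = res[-1].lower()
    if suffix in UNIT_TO_GIB:
        return float(res[:-1]) * UNIT_TO_GIB[suffix]
    return float(res) / 1024**2  # assumed to be plain KiB


def total_res_gib(top_text, command_substring, res_column=6):
    """Sum RES over lines whose last field contains command_substring."""
    total = 0.0
    for line in top_text.splitlines():
        fields = line.split()
        if len(fields) > res_column and command_substring in fields[-1]:
            total += res_to_gib(fields[res_column])
    return total


sample = """\
PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND
 445502  445497 boinc     39  19 R   4.7g   3.8  99.2  3 192:51.32 /var/lib/boinc/slots/0/oifs_43r3_model.exe
 421785  421781 boinc     39  19 R   4.1g   3.3  99.1  8 361:08.70 /var/lib/boinc/slots/14/oifs_43r3_model.exe
 448361  448354 boinc     39  19 R   2.3g   1.8  99.1  0 168:39.41 /var/lib/boinc/slots/3/oifs_43r3_model.exe
"""

print(f"{total_res_gib(sample, 'oifs_43r3_model.exe'):.1f} GiB resident")
```

For the three tasks listed above the total comes to roughly 11 GiB, a small fraction of the 128 GB in this machine.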

ID: 70950
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,709,934
RAC: 9,107
Message 70952 - Posted: 8 Jun 2024, 6:28:03 UTC - in response to Message 70946.  

> > Same problem with Zip14.
> Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?
It's batch 1017.

The server has already set my quota per day for v1.13 of this application to one. But the server still has 3770 tasks ready to send: I'd advise everyone to set 'No new tasks' until the dust settles - at least, until the staff have had time to assess the situation on Monday.
ID: 70952
klepel
Joined: 9 Oct 04
Posts: 82
Credit: 69,926,017
RAC: 7,296
Message 70954 - Posted: 8 Jun 2024, 8:42:38 UTC - in response to Message 70946.  

> Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?

It is batch 1017.
ID: 70954
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70955 - Posted: 8 Jun 2024, 9:16:38 UTC - in response to Message 70954.  
Last modified: 8 Jun 2024, 9:19:21 UTC

> > Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?
>
> It is batch 1017.
Yep. Mine are also failing.

What's happening is that the model completes successfully, but the code controlling the model has miscalculated the number of uploads, so the task is registered as a failure even though it actually worked.

Please keep computing batch 1017: the results are all there and can still be used, just with fewer uploads than expected.
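The mismatch can be illustrated with a small Python sketch. This is only one reading consistent with the logs posted in this thread (the final file was upload 14, and uploads 15 to 19 were reported as not found); the assumptions that there is one upload per model day and that uploads are numbered from 0 are guesses for illustration, not a description of the actual controlling code.

```python
def upload_shortfall(model_days, expected_uploads, days_per_upload=1):
    """Return (uploads_produced, missing_indices).

    Assumes uploads are numbered from 0 and one is written every
    days_per_upload model days; both are illustrative assumptions.
    """
    produced = model_days // days_per_upload
    missing = list(range(produced, expected_uploads))
    return produced, missing


# The runs reported in this thread stopped at 15 model days, while the
# client was told to expect upload files numbered up to 19.
produced, missing = upload_shortfall(model_days=15, expected_uploads=20)
print(produced, missing)  # 15 [15, 16, 17, 18, 19]
```

Under those assumptions the model writes uploads 0 to 14 and finishes normally, and the five files the client then fails to find are the ones that were never due to be written.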
---
CPDN Visiting Scientist
ID: 70955
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 70956 - Posted: 8 Jun 2024, 9:45:37 UTC - in response to Message 70945.  

All of my WUs have failed. Is 6 GB enough for an oifs_43r3_bl WU?

> There is a bug in the boinc client where it does not accurately total up the memory required by the tasks it starts. It starts too many unless you are controlling them with an app_config.xml file. I have reported this and the fix is scheduled for boinc 8.20.
>
> I thought CPDN has set a limit in the scheduler of max tasks in progress that helped with this. I can check. In the meantime, either use an app_config.xml or control it with the percentage of cpus used.
>
> > Thrilled that I'm getting so many Linux oifs_43r3_bl WUs but many are crashing. They keep using more RAM until each gets to 6 GB. WUs and browser tabs start crashing when the RAM is fully committed and it starts using Swap. I'm trying to limit the number running:
> > <app_config>
> >   <!-- i9-7980XE 18c36t 4x16=64 GB L3 Cache 24.75 MB -->
> >   <app>
> >     <name>oifs_43r3_bl</name>
> >     <!-- OpenIFS 43r3 Baroclinic Lifecycle -->
> >     <!-- needs 6 GB RAM per WU -->
> >     <max_concurrent>10</max_concurrent>
> >     <fraction_done_exact/>
> >   </app>
> >   <project_max_concurrent>10</project_max_concurrent>
> > </app_config>
ID: 70956
wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 41,817,007
RAC: 65,023
Message 70957 - Posted: 8 Jun 2024, 9:57:15 UTC - in response to Message 70944.  

I got the same failure too: https://www.cpdn.org/result.php?resultid=22439755

It seems that the calculation happily finished at 14.zip but the result is expecting more? This is on a machine with enough memory, runs no other projects and has never paused the WU.
ID: 70957
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 70958 - Posted: 8 Jun 2024, 10:37:27 UTC

> All of my WUs have failed. Is 6 GB enough for an oifs_43r3_bl WU?
The machine I am using now is borked, so I am not running any tasks right now, but in testing I have certainly had tasks go over 9 GB RAM per task. I don't remember offhand which of the OIFS variants that was, though.
ID: 70958
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70959 - Posted: 8 Jun 2024, 10:43:53 UTC - in response to Message 70958.  
Last modified: 8 Jun 2024, 10:47:00 UTC

I've now disabled resends for batch 1017.

As I mentioned in an earlier post, the model finishes correctly but the controlling code has miscalculated the number of upload files expected so it fails the batch, even though all the results are there. So please let the tasks run as the results are still usable.

The BL OIFS app only needs ~3.5 GB RAM. The normal OIFS app needs more, ~6 GB.
---
CPDN Visiting Scientist
ID: 70959
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,709,934
RAC: 9,107
Message 70960 - Posted: 8 Jun 2024, 11:07:17 UTC - in response to Message 70959.  

> So please let the tasks run as the results are still usable.
Sure, we can give that a go.

But it'll be slow progress. Because v1.13 is a new iteration of the app (released 3 Jun 2024), none of us will have built up a reputation as reliable crunchers yet. We'll all hit

08/06/2024 11:32:18 | climateprediction.net | This computer has finished a daily quota of 1 tasks
quickly, as I have already. But allow work fetch again, and we'll trickle through them.

On a positive note, the enforced restriction to one task in progress at a time will help bypass the risk of 'out of memory' errors.
ID: 70960