Message boards : Number crunching : New Work Announcements 2024
Author | Message |
---|---|
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Batch 1017 is a production batch for the OpenIFS baroclinic lifecycle version. It tests a difficult background state, prompted by referees' comments on the submitted paper that used last year's results. The scientists want to rerun the batch before making the data available. The baroclinic lifecycle experiment uses an 'aquaplanet' configuration; i.e. there's no land. This is a specialized configuration used for testing various theories in atmospheric science. From a technical point of view it means the model needs less memory and tasks complete quicker, because the land processes are not needed. Batch 1016 was a testing batch. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"The baroclinic lifecycle experiment uses an 'aquaplanet' configuration; i.e. there's no land. This is a specialized configuration used for testing various theories in atmospheric science. From a technical point of view it means the model needs less memory and tasks complete quicker, because the land processes are not needed." My Linux machine has lots of RAM, so I am unlikely to run out of it. The memory these tasks consume varies a lot over time; it takes some, it gives some back. My app_config file will run only two of these tasks at a time. The machine is running 14 Boinc tasks at a time. Since summer is approaching and I have no AC, I may have to cut that to 13 or 12 Boinc tasks at a time. IIRC, last summer I had to cut it to 8 for a while. When these two tasks started, Boinc thought they would take about 12 hours each. top - 13:15:13 up 2 days, 1:43, 2 users, load average: 14.55, 14.60, 14.44 Tasks: 477 total, 15 running, 462 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.6 us, 0.5 sy, 86.7 ni, 11.5 id, 0.5 wa, 0.2 hi, 0.0 si, 0.0 st MiB Mem : 128086.0 total, 25635.9 free, 14783.6 used, 87666.5 buff/cache MiB Swap: 15992.0 total, 15992.0 free, 0.0 used. 111669.4 avail Mem PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND 390792 390785 boinc 39 19 R 5.3g 4.3 99.2 5 189:05.37 /var/lib/boinc/slots/3/oifs_43r3_model.exe 389086 389078 boinc 39 19 R 4.1g 3.3 99.3 14 204:29.21 /var/lib/boinc/slots/0/oifs_43r3_model.exe |
Send message Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853 |
Has anyone had one of these get past the 15-day point? I've had reported failures on the only "completed" tasks so far on each of three systems; the stderr.txt for one of them has this sequence: 19:04:24 STEP 1438 H= 359:30 +CPU= 11.542 19:04:36 STEP 1439 H= 359:45 +CPU= 11.581 19:04:53 STEP 1440 H= 360:00 +CPU= 16.735 ..The child process terminated with status: 0 >>> Printing last 70 lines from file: NODE.001_01, followed by some statistics and the information about building the 14.zip file, which it calls the final file when it uploads it :-) Then, of course, it complains that it can't find the other 5 files to upload when boinc_finish() has been called :-( I'm only running one at a time, by the way, so the next failures are anticipated at about midnight -- I hope I'm wrong, but... Cheers - Al. P.S. Apologies if this isn't really the right place to post this... [Edited to try to improve clarity...] |
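For anyone wanting to check how far their own run got, the STEP lines quoted above can be read mechanically. This is only a rough Python sketch: it assumes every line looks like `HH:MM:SS STEP n H= hhh:mm +CPU= x`, as in the excerpt, and `parse_steps` is a made-up helper name, not part of the OpenIFS or BOINC tooling.

```python
import re

# Matches lines such as "19:04:53 STEP 1440 H= 360:00 +CPU= 16.735".
# The layout is taken from the excerpt in the post; other builds may differ.
STEP_RE = re.compile(r"STEP\s+(\d+)\s+H=\s*(\d+):(\d+)\s+\+CPU=\s*([\d.]+)")

def parse_steps(text):
    """Return (step, model_hours, cpu_value) tuples for each STEP line found."""
    found = []
    for m in STEP_RE.finditer(text):
        step = int(m.group(1))
        model_hours = int(m.group(2)) + int(m.group(3)) / 60.0
        cpu_value = float(m.group(4))  # the +CPU figure as printed; units not stated in the post
        found.append((step, model_hours, cpu_value))
    return found

sample = (
    "19:04:24 STEP 1438 H= 359:30 +CPU= 11.542 "
    "19:04:36 STEP 1439 H= 359:45 +CPU= 11.581 "
    "19:04:53 STEP 1440 H= 360:00 +CPU= 16.735"
)

steps = parse_steps(sample)
last_step, last_hours, _ = steps[-1]
minutes_per_step = last_hours * 60 / last_step  # model minutes covered by one step
model_days = last_hours / 24                    # model time reached, in days
print(last_step, minutes_per_step, model_days)
```

On the three lines quoted, this gives 15 model minutes per step and a final model time of 15 days, which matches the "15-day point" in the question.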
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Yeah, it stops too early; all mine fail. |
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
Thrilled that I'm getting so many Linux oifs_43r3_bl WUs but many are crashing. They keep using more RAM until each gets to 6 GB. WUs and browser tabs start crashing when the RAM is fully committed and it starts using Swap. I'm trying to limit the number running: <app_config> <!-- i9-7980XE 18c36t 4x16=64 GB L3 Cache 24.75 MB --> <app> <name>oifs_43r3_bl</name> <!-- OpenIFS 43r3 Baroclinic Lifecycle --> <!-- needs 6 GB RAM per WU --> <max_concurrent>10</max_concurrent> <fraction_done_exact/> </app> <project_max_concurrent>10</project_max_concurrent> </app_config> |
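One rough way to pick a `<max_concurrent>` value like the one above is to divide the usable RAM by the peak memory per task. The Python sketch below is back-of-envelope only: the 6 GB per task and the 64 GB total come from the post, while the 8 GB held back for the OS and browser, and the function name, are assumptions for illustration.

```python
def max_concurrent(total_ram_gb, per_task_gb, reserve_gb=8.0, cpu_threads=None):
    """Largest number of tasks whose peak memory fits in RAM after a reserve.

    reserve_gb is an assumed allowance for the OS, browser and other programs.
    """
    usable = total_ram_gb - reserve_gb
    if usable <= 0 or per_task_gb <= 0:
        return 0
    n = int(usable // per_task_gb)
    if cpu_threads is not None:
        n = min(n, cpu_threads)  # never more tasks than hardware threads
    return n

# 64 GB machine with 36 threads, 6 GB per task as reported in the post
print(max_concurrent(64, 6, reserve_gb=8, cpu_threads=36))
```

With those figures the estimate comes out at 9, slightly below the 10 set in the app_config, so there is little headroom if every task peaks at once.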
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,926,017 RAC: 7,296 |
Same problem with Zip14. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Has anyone had one of these get past the 15-day point? I've had reported failures on the only 'completed' tasks so far on each of three systems;" I have now received 9 of these tasks, and have set app_config to allow up to three at a time to run, which my machine is now doing. Two of them have done a trickle. I do not suppose I will get to 15 days for any of them. Boinc manager thinks they will take a trifle over 12 hours to run, but ... top - 17:07:14 up 2 days, 5:35, 2 users, load average: 14.46, 14.85, 15.76 Tasks: 478 total, 15 running, 463 sleeping, 0 stopped, 0 zombie %Cpu(s): 2.1 us, 0.5 sy, 86.6 ni, 10.5 id, 0.0 wa, 0.3 hi, 0.1 si, 0.0 st MiB Mem : 128086.0 total, 22679.4 free, 16173.7 used, 89232.9 buff/cache MiB Swap: 15992.0 total, 15992.0 free, 0.0 used. 110244.9 avail Mem PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND 389086 389078 boinc 39 19 R 4.1g 3.3 99.3 10 432:12.06 /var/lib/boinc/slots/0/oifs_43r3_model.exe 421785 421781 boinc 39 19 R 4.1g 3.3 99.3 12 161:20.33 /var/lib/boinc/slots/14/oifs_43r3_model.exe 390792 390785 boinc 39 19 R 2.3g 1.9 99.3 9 416:46.12 /var/lib/boinc/slots/3/oifs_43r3_model.exe Two of them are about 3/4 done at about 7 hours of processing. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
OK: one of mine just failed. The end of my stderr file (huge) looks like this: Uploading the final file: upload_file_14.zip Uploading trickle at timestep: 1295100 17:14:10 (389078): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_15.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_16.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_17.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_18.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> <file_xfer_error> <file_name>oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044_19.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> ]]> Here is everything, apart from the stderr output, that the task page shows for Task 22435193: Name: oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0; Workunit: 12283614; Created: 7 Jun 2024, 13:26:01 UTC; Sent: 7 Jun 2024, 13:27:20 UTC; Report deadline: 6 Aug 2024, 13:27:20 UTC; Received: 7 Jun 2024, 21:30:58 UTC; Server state: Over; Outcome: Computation error; Client state: Compute error; Exit status: 0 (0x00000000); Computer ID: 1511241; Run time: 7 hours 24 min 37 sec; CPU time: 7 hours 15 min 35 sec; Validate state: Invalid; Credit: 1,318.46; Device peak FLOPS: 5.93 GFLOPS; Application version: OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu; Peak working set size: 5,566.54 MB; Peak swap size: 5,980.80 MB; Peak disk usage: 1,283.83 MB |
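To see at a glance which upload files the client could not find, the `<file_xfer_error>` entries in a message like the one above can be pulled out with a regular expression. The Python sketch below assumes the entries are laid out exactly as in that paste; `failed_uploads` and `zip_index` are hypothetical helper names, not BOINC functions.

```python
import re

# One <file_xfer_error> entry: a file name plus "code (reason)".
FILE_ERR_RE = re.compile(
    r"<file_xfer_error>\s*<file_name>(.*?)</file_name>\s*"
    r"<error_code>(-?\d+)\s*\((.*?)\)</error_code>\s*</file_xfer_error>",
    re.S,
)

def failed_uploads(message):
    """Return (file_name, error_code, reason) for each <file_xfer_error> entry."""
    return [(name, int(code), reason)
            for name, code, reason in FILE_ERR_RE.findall(message)]

def zip_index(file_name):
    """Trailing number before '.zip', e.g. '..._15.zip' -> 15 (None if absent)."""
    m = re.search(r"_(\d+)\.zip$", file_name)
    return int(m.group(1)) if m else None

# Rebuild the five entries from the post rather than pasting them again.
base = "oifs_43r3_bl_a1dn_2016092300_20_1017_12283614_0_r699848044"
sample = " ".join(
    f"<file_xfer_error> <file_name>{base}_{i}.zip</file_name> "
    f"<error_code>-161 (not found)</error_code> </file_xfer_error>"
    for i in range(15, 20)
)

errors = failed_uploads(sample)
missing = [zip_index(name) for name, code, reason in errors]
print(missing)
```

For this task the missing indices are 15 to 19, i.e. everything after the upload_file_14.zip that the log calls the final file.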
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
There is a bug in the boinc client where it does not accurately total up the memory required by the tasks it starts. It starts too many unless you are controlling them with an app_config.xml file. I have reported this and the fix is scheduled for boinc 8.20. I thought CPDN had set a limit in the scheduler on the maximum number of tasks in progress, which helped with this. I can check. In the meantime, either use an app_config.xml or control it with the percentage of CPUs used. "Thrilled that I'm getting so many Linux oifs_43r3_bl WUs but many are crashing. They keep using more RAM until each gets to 6 GB. WUs and browser tabs start crashing when the RAM is fully committed and it starts using Swap. I'm trying to limit the number running:" --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Same problem with Zip14." Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different? --- CPDN Visiting Scientist |
Send message Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853 |
The task(s) I talked about up-thread were batch 1017 -- the one I actually linked to was oifs_43r3_bl_a0mt_2016092300_20_1017_12282648_0. Hope that helps. "Same problem with Zip14." "Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?" Cheers - Al. P.S. it's really handy that the WU number is in the task name, isn't it :-) |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"There is a bug in the boinc client where it does not accurately total up the memory required by the tasks it starts. It starts too many unless you are controlling them with an app_config.xml file. I have reported this and the fix is scheduled for boinc 8.20." I have 128 GBytes of RAM in my Linux machine. My app_config file limits me to running 3 oifs_43r3_bl tasks at a time, and they are confined to a small (to me) amount of RAM. Running 14 Boinc processes and everything else is currently using about 16 GBytes of RAM. So that can hardly be the reason for it failing to run these tasks. Good thing too, because I doubt my Linux distro will ever upgrade past 7.20.2. MiB Mem : 128086.0 total, 20744.5 free, 18040.2 used, 89301.2 buff/cache MiB Swap: 15992.0 total, 15992.0 free, 0.0 used. 108365.3 avail Mem PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND 445502 445497 boinc 39 19 R 4.7g 3.8 99.2 3 192:51.32 /var/lib/boinc/slots/0/oifs_43r3_model.exe 421785 421781 boinc 39 19 R 4.1g 3.3 99.1 8 361:08.70 /var/lib/boinc/slots/14/oifs_43r3_model.exe 448361 448354 boinc 39 19 R 2.3g 1.8 99.1 0 168:39.41 /var/lib/boinc/slots/3/oifs_43r3_model.exe |
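The resident memory of the three model processes listed above can be added up directly from the `top` rows. This Python sketch assumes the custom column order shown in the post (RES is the seventh column) and the usual `top` convention that a bare RES number is KiB while `m`, `g` and `t` suffixes mean MiB, GiB and TiB; the helper names are hypothetical.

```python
import re

# Conversion factors to GiB for top's scaled memory fields.
UNIT_TO_GIB = {"": 1 / (1024 * 1024), "k": 1 / (1024 * 1024),
               "m": 1 / 1024, "g": 1.0, "t": 1024.0}

def res_to_gib(field):
    """Convert a RES field such as '4.7g' or '512m' to GiB."""
    m = re.fullmatch(r"([\d.]+)([kmgt]?)", field.strip().lower())
    if not m:
        raise ValueError(f"unrecognised RES field: {field!r}")
    return float(m.group(1)) * UNIT_TO_GIB[m.group(2)]

def total_res_gib(rows, command_substring, res_column=6):
    """Sum RES over rows whose COMMAND (last column) contains command_substring."""
    total = 0.0
    for row in rows:
        cols = row.split()
        if command_substring in cols[-1]:
            total += res_to_gib(cols[res_column])
    return total

rows = [
    "445502 445497 boinc 39 19 R 4.7g 3.8 99.2 3 192:51.32 /var/lib/boinc/slots/0/oifs_43r3_model.exe",
    "421785 421781 boinc 39 19 R 4.1g 3.3 99.1 8 361:08.70 /var/lib/boinc/slots/14/oifs_43r3_model.exe",
    "448361 448354 boinc 39 19 R 2.3g 1.8 99.1 0 168:39.41 /var/lib/boinc/slots/3/oifs_43r3_model.exe",
]

total = total_res_gib(rows, "oifs_43r3_model.exe")
print(round(total, 1))
```

For the snapshot in the post the three tasks together hold roughly 11 GiB, a small fraction of the 128 GB installed.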
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,709,934 RAC: 9,107 |
It's batch 1017. "Same problem with Zip14." "Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?" The server has already set my quota per day for v1.13 of this application to one. But the server still has 3770 tasks ready to send: I'd advise everyone to set 'No new tasks' until the dust settles - at least, until the staff have had time to assess the situation on Monday. |
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,926,017 RAC: 7,296 |
"Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?" It is batch 1017. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Yep. Mine are also failing. "Which batch is this? If it's batch 1016, some tasks were released in error. Batch 1017 should be ok unless you know different?" What's happening is that the model completes successfully, but the code controlling the model has miscalculated the number of uploads, so the task is registered as a fail even though it has actually worked. Please keep computing 1017, as it's still possible to use the results: they are all there, just with fewer uploads than expected. --- CPDN Visiting Scientist |
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
All of my WUs have failed. Is 6 GB enough for an oifs_43r3_bl WU? "There is a bug in the boinc client where it does not accurately total up the memory required by the tasks it starts. It starts too many unless you are controlling them with an app_config.xml file. I have reported this and the fix is scheduled for boinc 8.20." |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,817,007 RAC: 65,023 |
I got the same failure too: https://www.cpdn.org/result.php?resultid=22439755 It seems that the calculation happily finished at 14.zip but the result is expecting more? This is on a machine with enough memory, runs no other projects and has never paused the WU. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
"All of my WUs have failed. Is 6 GB enough for an oifs_43r3_bl WU?" The machine I am using now is borked, so I am not running any tasks right now, but in testing I have certainly had tasks go up over 9 GB RAM per task. I don't remember offhand which of the oifs variants that was, though. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I've now disabled resends for batch 1017. As I mentioned in an earlier post, the model finishes correctly but the controlling code has miscalculated the number of upload files expected, so it fails the batch even though all the results are there. So please let the tasks run as the results are still usable. The BL OIFS app only needs ~3.5 GB RAM. The normal OIFS app needs more, ~6 GB. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,709,934 RAC: 9,107 |
"So please let the tasks run as the results are still usable." Sure, we can give that a go. But it'll be slow progress. Because v1.13 is a new iteration of the app (released 3 Jun 2024), none of us will have built up a reputation as reliable crunchers yet. We'll all hit 08/06/2024 11:32:18 | climateprediction.net | This computer has finished a daily quota of 1 tasks quickly, as I have already. But allow work fetch again, and we'll trickle through them. On a positive note, the enforced restriction to one task in progress at a time will help bypass the risk of 'out of memory' errors. |
©2024 cpdn.org