Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
"Another name is /var/lib/boinc/slots/11/./master.exe which is pretty funny because my Linux machine will not really run .exe files." '.exe' is just a convention we adopted to indicate an executable file; it's not related to a Windows .exe. The normal Linux convention is to have no suffix, but for CPDN we preferred to have one. "My two take 2.5 and 3.5 GBytes working set but the amounts jump around a lot." The model uses dynamic memory a lot. The high-water mark for memory is when it goes into the radiation code, which involves recomputing look-up tables and matrix computations. It's the most expensive timestep too. "Predicted 2 days 18 hours to go, having done about 1 hour 18 minutes each." Ignore the predicted time; it's rather useless for these models. It depends on the client seeing this app enough times to work out a figure, and it's not under the control of the app. The problem is that OpenIFS apps can be run for varying lengths of time, so the boinc client will never get a good estimate of time remaining. The fraction done is accurate; use that to work out time to completion (remaining ≈ elapsed × (1 − fraction done) / fraction done). I turn off the display of 'time remaining'. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
"Typical runtime is ~7-10 hrs depending on your CPU. Memory should be ~6 GB." The two I have should finish in a fraction under 10 hours. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It's been so long since there was any work that I've forgotten how to get it to start. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Another name is /var/lib/boinc/slots/11/./master.exe which is pretty funny because my Linux machine will not really run .exe files." "'.exe' is just a convention we adopted to indicate an executable file, it's not related to a Windows .exe. Normal linux convention is not to have any suffix but for CPDN we preferred to have one." How do Linux users know what type of file they're looking at? |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"How do Linux users know what type of file they're looking at?" Linux follows the conventions of the original UNIX operating systems. Filenames are just names, and the dot is just another letter (the entries . and .. have special meanings in a directory, but that is another story). So they could have called that file master.exe master.jpeg for all the difference it would make. The easiest way to find out what a file is on Linux is to run the file command on it. For example: [/var/lib/boinc/slots/10]$ file master.exe gives: master.exe: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=c3f8ea54db10edfe769adb8096efc92f023410a8, for GNU/Linux 3.2.0, stripped. You can identify many files by looking at the first two bytes of the file. For example: [/var/lib/boinc/slots/10]$ od master.exe \| head -n 2 gives: 0000000 042577 043114 000402 001401 000000 000000 000000 000000; 0000020 000002 000076 000001 000000 020100 000100 000000 000000. IIRC, the 042577 tells you it is an executable file (together with the next word, 043114, it spells out the bytes 0x7F 'E' 'L' 'F', the ELF magic number, printed as little-endian 16-bit octal words). At least it did in the old days (early 1970s). Many files are not like this, however, so the file program must apply other heuristics to those files. It can even identify a Windows program correctly (TaxAct for my Windows machine): file ta22stpremier.exe gives: ta22stpremier.exe: PE32 executable (GUI) Intel 80386, for MS Windows |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
"So they could have called that file master.exe master.jpeg for all the difference it would make." True, but these days with desktop environments like Cinnamon or even macOS, the file suffix is used to identify files and assign default applications to handle them. If I had called the file master.jpg and double-clicked on it in the file explorer app in Mint, it would have tried to open an image browser. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"So they could have called that file master.exe master.jpeg for all the difference it would make." "True, but these days with desktop environments like Cinnamon or even macOS, the file suffix is used to identify files and assign default applications to handle them. If I had called the file master.jpg, and double clicked on it in the file explorer app in Mint, it would have tried to open an image browser." Yes, thankfully even Linux and Mac have realised an extension is so much easier for the user to see at a glance. Unfortunately the three operating systems often copy the worst aspects of each other. Windows 95 made a wonderful thing called the start menu, and also the taskbar. Apple copied this and screwed it up. Then Windows copied it back and I have to use a third-party utility to make it work as it used to. One click and I see x recently used apps. The taskbar shows nothing but the start button, the running apps (in words, not a silly unintelligible icon), and the clock. Not a mixture of links/aliases/shortcuts to apps as well as running ones, almost indistinguishable from each other. If I had a penny for every minute of my time wasted getting basic interfaces to do what I want them to... |
Send message Joined: 7 Aug 04 Posts: 2186 Credit: 64,822,615 RAC: 5,275 |
Looks like about 13 hours running two at a time on an i7-4790K and about 9 hours running two at a time on my Ryzen 5 5600X. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"It's been so long since there was any work that I've forgotten how to get it to start." If you are not joking, I did not have to do anything to get my (now 3) OIFS tasks to start, other than waiting for some other task to finish. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
"Looks like about 13 hours running two at a time on an i7-4790K and about 9 hours running two at a time on my Ryzen 5 5600X." 10 hrs 47 min on my Ryzen 7. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
"The first of the Perturbed Surface variant of OpenIFS are going out now. App name: oifs_43r3_ps" There only seem to be 1,000 in this batch. I presume the other 2,000 will arrive soon, seeing as the first lot are all gone. 7 are showing as completed on the batch statistics page, which is a bit out of date. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
Those times depend on what %cpu boinc is allowed to use. 100%? Perhaps add that info. Machine load affects wall clock time too. I'm more interested in any failures. If you get one, let me know. Thx. There's another 2000 ready to go as soon as Andy gets to it. And then there's plenty more after that; the scientist needs to run a minimum of ~42000. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
Restarts for oifs_43r3_ps work ok. As a test, I shut down my Ubuntu/WSL last night to make sure the task would restart. Before shutting the machine down I suspended the task(s) in boincmgr, made sure 'master.exe' had disappeared from the output of 'ps' (or top), and then shut down the machine. This morning, I restarted the boinc client, resumed the task and the model happily restarted from its last checkpoint. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
"Restarts for oifs_43r3_ps work ok." If you only did it the once, I wouldn't guarantee it. The last four restarts with hadam4s tasks didn't lose any for me, but I still occasionally lose one or more. I was going to wait till I had gotten a few tasks under my belt, so to speak, before trying. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Looks like about 13 hours running two at a time on an i7-4790K and about 9 hours running two at a time on my Ryzen 5 5600X." For me, my very first OIFS task ran like this: Task 22245034; Name oifs_43r3_ps_0002_2021050100_123_945_12163091_0; Workunit 12163091; Created 28 Nov 2022, 19:12:00 UTC; Sent 28 Nov 2022, 19:24:39 UTC; Report deadline 28 Dec 2022, 19:24:39 UTC; Received 29 Nov 2022, 11:23:20 UTC; Server state Over; Outcome Success; Client state Done; Exit status 0 (0x00000000); Computer ID 1511241; Run time 15 hours 33 min 50 sec <---<<<; CPU time 15 hours 22 min 37 sec <---<<<; Validate state Valid; Credit 0.00; Device peak FLOPS 6.13 GFLOPS <---<<<; Application version OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu; Peak working set size 4,619.30 MB <---<<<. My next one, Task 22245062, Name oifs_43r3_ps_0030_2021050100_123_945_12163119_0, was about the same. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
"This morning, restarted boinc client, resumed the task and the model happily restarted from its last checkpoint." OpenIFS is a completely different model with new controlling code. Experience with HadSM4 doesn't apply. I did a lot of testing to make sure it works. Yes, there are always edge cases, but 99% should be ok. |
Send message Joined: 7 Aug 04 Posts: 2186 Credit: 64,822,615 RAC: 5,275 |
"Those times depend on what %cpu boinc is allowed to use. 100%? Perhaps add that info. Machine load affects wall clock time too." One of the two on my i7-4790K crashed at the end with an exit status of "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT". In stderr, it has "Process still present 5 min after writing finish file; aborting". https://www.cpdn.org/result.php?resultid=22245298 Both the successful task and the errored task ran through step 2592. Both tasks on my Ryzen 5600X completed successfully in just under 9 hours of CPU and wall clock time. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Those times depend on what %cpu boinc is allowed to use. 100%? Perhaps add that info. Machine load affects wall clock time too." Well, the %cpu figures are pretty much 99+%. These are just the Boinc processes on my 16-core machine. I only allow 12 cores for Boinc, so everything else gets the other four cores. Output from top, one process per entry: PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND; 619112 619107 boinc 39 19 R 4.6g 7.3 99.0 13 359:48.53 /var/lib/boinc/slots/11/./master.exe; 596096 596091 boinc 39 19 R 3.9g 6.2 99.0 9 793:52.71 /var/lib/boinc/slots/9/./master.exe; 618959 618951 boinc 39 19 R 2.8g 4.5 99.0 0 362:41.64 /var/lib/boinc/slots/10/./master.exe; 621104 2146 boinc 39 19 R 213108 0.3 99.1 12 322:43.61 ../../projects/einstein.phys.uwm.edu/einstein+; 633168 2146 boinc 39 19 R 212988 0.3 99.0 2 132:49.51 ../../projects/einstein.phys.uwm.edu/einstein+; 640995 2146 boinc 39 19 R 133692 0.2 99.0 7 24:47.84 ../../projects/www.worldcommunitygrid.org/wcg+; 636512 2146 boinc 39 19 R 72996 0.1 99.3 4 92:53.62 ../../projects/www.worldcommunitygrid.org/wcg+; 641748 2146 boinc 39 19 R 63884 0.1 99.2 8 18:23.02 ../../projects/www.worldcommunitygrid.org/wcg+; 2146 1 boinc 30 10 S 46352 0.1 0.3 15 118009:13 /usr/bin/boinc <---<<< This is the Boinc client; 640361 2146 boinc 39 19 R 7172 0.0 99.0 5 33:15.25 ../../projects/milkyway.cs.rpi.edu_milkyway/m+; 642260 2146 boinc 39 19 R 5924 0.0 99.0 6 9:05.17 ../../projects/milkyway.cs.rpi.edu_milkyway/m+; 596091 2146 boinc 39 19 S 5360 0.0 0.1 10 1:43.66 ../../projects/climateprediction.net/oifs_43r+; 638162 2146 boinc 39 19 R 5008 0.0 99.1 3 74:30.71 ../../projects/universeathome.pl_universe/BHs+; 638038 2146 boinc 39 19 R 4932 0.0 99.2 14 75:41.48 ../../projects/universeathome.pl_universe/BHs+; 619107 2146 boinc 39 19 S 4912 0.0 0.1 3 0:46.86 ../../projects/climateprediction.net/oifs_43r+; 618951 2146 boinc 39 19 S 4868 0.0 0.1 14 0:46.95 ../../projects/climateprediction.net/oifs_43r+ |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
OpenIFS batch problems: First, thanks to those who have reported problems either in threads or to me via private messages. It has been very useful to track what's going wrong. And apologies to those volunteers having difficulties running these tasks. These errors were not found in testing and it's apparent a larger-scale test should have been done for this first use of OpenIFS. Task failures: There are examples of the tasks failing, either midway through or at the very end. It only seems to be happening on some machines and it's related to memory issues in the code (it's not related to hardware as far as I can tell). The process that starts with the name 'oifs_43r3_1.....' dies for some reason. As this controls the model, it leaves the model process called 'master.exe' still running in the same slot directory (it shouldn't do this but it does). If the client then restarts the task (in the same slot directory), it not only generates more output (filling the slot dir) but also corrupts the model files, confusing the client. There should always be the same number of 'master.exe' and 'oifs_43r3_1...' processes running. If you have more master.exe processes, one of them is the rogue one. Suspend all your oifs tasks and kill the one that's still running. Or use the 'ps' command to check the parent of each master.exe process. I think the boinc client will eventually kill off any rogue processes, though you may need to manually clean the slot directory. Error code 9: Some users have reported seeing 'task exited with error code 9'. This is an indication of a lack of system memory. Reduce the number of OpenIFS tasks you have running. If anyone has problems/questions with this, send me a Private Message and I'll help. Data volumes: Volunteers on slower internet lines (ADSL) have reported problems with transferring the model output. That's something we can deal with in subsequent batches.
Remedy: reduce the number of OpenIFS tasks. There have also been messages that the boinc client reports climateprediction.net needs very large storage (38 GB was mentioned in one post). This is a consequence of both the task failures leaving data behind and the data volumes. I suggest setting 'no new tasks', letting the openifs tasks finish and then manually deleting any files in the open slot directories. Last, if anyone who has more experience with boinc than me wants to add anything useful, please feel free. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Interesting. I have 2 processes running despite all tasks being suspended while I wait for uploads to clear: 52816 boinc 39 19 141824 1164 308 S 0.3 0.0 2:24.43 oifs_43r3_ps_1.; 59704 boinc 39 19 10752 896 312 S 0.3 0.0 0:47.65 oifs_43r3_ps_1. I currently have 4 successes and two of the crashes right at the end. |
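
The rogue-process check described in the 'Task failures' post above (every master.exe should have a live oifs_43r3... process as its parent) could be sketched roughly as follows. This is only an illustration, not CPDN code: the function names are made up here, and it assumes the controlling process's file name starts with oifs_43r3, the model is called master.exe, and the process list comes from ps -eo pid,ppid,args on Linux.

```python
import os
import subprocess


def parse_ps(ps_output):
    """Parse the text of 'ps -eo pid,ppid,args' into (pid, ppid, command) tuples."""
    procs = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and anything that is not "PID PPID COMMAND".
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        procs.append((int(parts[0]), int(parts[1]), parts[2]))
    return procs


def exe_name(command):
    """File name of the executable at the start of a command line."""
    return os.path.basename(command.split()[0])


def find_rogue_models(ps_output, model="master.exe", wrapper_prefix="oifs_43r3"):
    """Return PIDs of model processes whose parent is not a running wrapper.

    'model' and 'wrapper_prefix' are assumptions taken from the names
    mentioned in this thread; adjust them if your process names differ.
    """
    procs = parse_ps(ps_output)
    wrappers = {pid for pid, _, cmd in procs
                if exe_name(cmd).startswith(wrapper_prefix)}
    return [pid for pid, ppid, cmd in procs
            if exe_name(cmd) == model and ppid not in wrappers]


def current_ps_output():
    """Run ps on the local machine and return its output as text."""
    result = subprocess.run(["ps", "-eo", "pid,ppid,args"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

On a Linux host, find_rogue_models(current_ps_output()) would list candidate PIDs. It is worth checking each one by hand before killing anything, since a model whose controlling process has only just exited may still be finishing normally.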
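
The od output quoted earlier in the thread can be decoded by hand. By default od prints 16-bit words in octal, and on a little-endian x86 machine the low byte of each word is the one that comes first in the file. A small sketch, with function names of my own choosing and a deliberately tiny signature list that is no substitute for the file command:

```python
import struct


def od_words_to_bytes(words):
    """Turn od's default output (octal 16-bit little-endian words) back into bytes."""
    return b"".join(struct.pack("<H", int(word, 8)) for word in words)


def identify(header):
    """Guess a file type from its leading bytes using a few well-known signatures."""
    if header.startswith(b"\x7fELF"):
        # The byte after the magic number is the ELF class: 1 = 32-bit, 2 = 64-bit.
        ei_class = header[4] if len(header) > 4 else 0
        return "ELF " + {1: "32-bit", 2: "64-bit"}.get(ei_class, "unknown class")
    if header.startswith(b"MZ"):
        return "DOS/Windows executable"
    if header.startswith(b"#!"):
        return "script with interpreter line"
    return "unknown"


# First four words of the od output for master.exe quoted in the thread.
first_words = ["042577", "043114", "000402", "001401"]
header = od_words_to_bytes(first_words)
print(header)            # b'\x7fELF\x02\x01\x01\x03'
print(identify(header))  # ELF 64-bit
```

The four bytes after the magic number (2, 1, 1, 3) are the ELF class, data encoding, version and OS ABI fields, which agree with the "64-bit LSB ... version 1 (GNU/Linux)" line that file printed for the same binary.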