Message boards : Number crunching : OpenIFS Discussion
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
The 2min sleep is there just to let all the file operations complete and make sure everything is flushed from memory. I don't understand why you suspended it a few times? It would have worked without? The trickle file is quite small. New day, new task ! |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
Jean-David Beyer wrote:
Glenn Carver wrote: the server has been moved onto a newer, managed cloud service with better resources.
It sure has. I am now running four Oifs _ps tasks at a time and was watching the Transfers tab as things went up. Often I got 3 to 4 megabytes/second upload speeds each when two tasks were uploading at once, and I noticed one that got over 7 megabytes/second when uploading alone. Once in a while, I get one considerably slower. My Internet connection at my end is 75 megabits/second, or 9.375 megabytes/second.

As noted before, I still get occasional HTTP transfer timeouts. Not particularly frequent, but at the moment they nevertheless have a considerable negative effect:
– I am located in Germany.
– I have a 1 MByte/s uplink, which would allow me to run 33 OIFS tasks in parallel (based on a 16.5 h average task duration).
– Yet I cautiously set up my currently two connected computers to run only 20 + 2 tasks at once.
– Even so, since new work became available ~11 hours ago, the two computers have built up a backlog of a few files to be retried and many new files pending upload. That is, the timeouts are frequent enough that the client spends considerable time not uploading, but waiting for a timeout.

On the computer with 20 concurrently running tasks, I have now increased cc_config::options::max_file_xfers_per_project from 2 to 4 in the hope that the probability of such idle time is much reduced and that my upload backlog clears eventually. [Edit: Upped it to 6 now.]

(The files always upload at my full bandwidth of 1 MByte/s. It's just that a random file will eventually stop transferring at a random point in time, with the client waiting on it until the default transfer timeout. As I mentioned in New work discussion 2, #66939, the client's first several upload retries of such a file will then fail because the server's upload handler still has the file locked. Only a long while later will there eventually be a successful retry. In the meantime, the client will of course successfully transfer many other files. But even so, 2 max_file_xfers_per_project was evidently not enough to get me anywhere near my theoretical upload capacity of 1 MByte/s sustained.)
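For anyone who wants to make the same change: max_file_xfers_per_project is a standard BOINC client option set in cc_config.xml in the client's data directory. A minimal sketch with the values discussed above (the numbers are just examples, not a recommendation):

<cc_config>
  <options>
    <!-- simultaneous file transfers allowed for one project (BOINC default: 2) -->
    <max_file_xfers_per_project>6</max_file_xfers_per_project>
    <!-- overall cap across all projects (BOINC default: 8); keep it >= the per-project value -->
    <max_file_xfers>8</max_file_xfers>
  </options>
</cc_config>

After editing the file, use 'Options -> Read config files' in BOINC Manager (or restart the client) for it to take effect.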
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
These Oifs _ps tasks really test your system out. Running 9 at once, each using from 2.7 to 4.2 GB of RAM, after 2 hours run time they have written 11.3 GB of data to disk each (101.7 GB), which is huge. Hitting 50 GB of RAM in use out of 64 GB, but I am also running LODA tasks which each use 1 GB of RAM. All 24 threads are running. 12% in and running fine so far. Conan |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,388,172 RAC: 91,192 |
Glenn wrote: I don't understand why you suspended it a few times? It would have worked without? The trickle file is quite small.
I wanted to be around for when it finished to see what happened; part of that was to give the pending trickle a chance to go. I was under the impression we were thinking that the trickles failing to upload was a cause for concern and might be why some valid work was being marked invalid. I've now seen tasks both fail and succeed with trickle files still present after the tasks have completed.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Ah great, for a second I thought something else was going wrong! I think the trickles are ok. From what I can see, the zip file to be trickled gets moved to the main ./projects/climateprediction.net directory and handed over to the client to deal with.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Conan wrote: These Oifs _ps tasks really test your system out.

Conan, I'm puzzled by your file sizes. 11.3Gb per task is too high. Exactly how did you come up with that figure?

For example, if I go into my slot/0 directory with an oifs task running:

% cd slots/0
% du -hs .   # note the '.'
1.2G .

and then do the same in the projects/climateprediction.net directory where the trickle uploads live (this will include all the Hadley models as well, so it's an overestimate, but just for illustration):

% cd projects/climateprediction.net
% du -hs .
1.4G .

So that's a total of 2.6G; I'm nowhere near your value of 11G, even after the model has been running for 2 hrs. Something is not right there.

Can you please check something for me? Go back into your slot directory and type:

du -hs srf*

and report back with the output. The 'srf' file is one of the model's restart (checkpoint) files, the biggest one. If you have more than one of these, it's a sign the model is restarting frequently. In that case, check that you have enabled "Leave non-GPU tasks in memory while suspended" under the 'Disk & Memory' tab in 'Computing preferences' in boincmgr. This setting is important. If it's NOT selected, every time the boinc client suspends the task, the model is at risk of being pushed from memory, which effectively shuts it down and it has to restart again (I think).

The model can accumulate restart files in the slot directory if it is frequently restarted. The model will normally delete the old restart/checkpoint files as it runs, but if it has to restart, it leaves the old one behind as a safeguard. The problem of course is that if it frequently restarts, these files accumulate. I have seen several task reports with disk-full errors, which makes me think this is happening.

So, bottom line:
1/ Please check how many srf files you have in the slot directory and report back.
2/ You can safely delete the OLDER srf files to recover some space. By old, I mean the oldest dated files as shown by 'ls -l'. But you MUST leave the most recent one. (See the sketch below.)

Let me know. Hope that makes sense. Cheers, Glenn

p.s. I did have a smile at your 'test your system' comment. This is the model at its smallest configuration with minimal output. You haven't seen anything.. :)
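For step 2/, a possible one-liner (just a sketch, assuming GNU coreutils and that you run it from inside the task's slot directory — double-check the listing before deleting anything):

% cd /var/lib/boinc-client/slots/0            # adjust the path and slot number to your setup
% ls -lt srf*                                 # newest restart file is listed first; sanity-check this
% ls -1t srf* | tail -n +2 | xargs -r rm -v   # delete every srf file except the newest one

The last command lists the srf files newest-first, skips the first (most recent) entry, and removes the rest.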
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
Glenn, can we please be careful to distinguish between 'trickles', which are tiny administrative fragments, and 'uploads', which are substantial scientific data? They each have their own quirks and foibles. I suspect the query about file write aggregate sizes may be a confusion between 'process' and 'resultant' sizes. If the app writes a group of individual data files to disk, and then compresses them into a single zip, then the process as a whole writes far more to disk than the final size of the uploaded zip. I'm running the first four from this batch, and finalising the other work which has been keeping them warm between batches. I should be able to do the first full analysis of a file completion later this afternoon. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Richard wrote: Glenn, can we please be careful to distinguish between 'trickles', which are tiny administrative fragments, and 'uploads', which are substantial scientific data? They each have their own quirks and foibles.

Richard, perhaps there's a misunderstanding. CPDN use 'trickles' to upload model results. Unpack the trickle zip files and they will contain the model output files from the steps since the last trickle upload. The final upload file just contains the last remaining model outputs. I don't know how other projects use trickles, but they are not administrative fragments; they contain actual model (scientific) data.
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
G'Day Glenn,
You may have misread what I wrote, I think. The 11.3 GB was not a file size but the amount of disk writes made in the first 2 hours (now, after 5 hours, well over 30 GB). The 2.7 to 4.6 GB were the RAM amounts that each work unit was using. This was all taken from System Monitor.

I did what you asked:

% cd slots/26
% du -hs .   # note the '.'
1.2G .

This is the same as your example.

% cd projects/climateprediction.net
% du -hs .
1.2G .

This is similar to your example.

du -hs srf*
768 MB srf00370000.0001

So all running fine; maybe just a bit of a misunderstanding, I think, with data amounts and RAM usage.
Thanks
Conan
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Ah, excellent. That's clearer. Ok, so what you're seeing there is the I/O from the restart/checkpoint files (I call them restarts, boinc calls them checkpoints). Each set of restart files is just under 1Gb. The model writes out these files once per model day, i.e. every 24 model steps. As it deletes the previous restart files, you only ever see the most recent ones. These PS tasks are quite long and will run for 2952 hrs (steps), so you'll get 123 sets of restart files, or about 123Gb in write I/O from the restart files alone. The model output that goes back to CPDN is very much smaller by comparison. If that I/O proves too much, let me know.

The model has to write these files in full precision to allow an exact restart (64-bit floating point, not to be confused with 64-bit/32-bit operating systems). This is one of the things we have to balance when applying any meteorological model to boinc: reduce the I/O, which means the model has to repeat more steps and takes longer to finish after a restart, or minimise task run time at the expense of more I/O. I recall an earlier thread in which someone (I forget who) was asking why volunteers can't adjust the checkpointing frequency. Well, now you have the explanation. If you turned up the checkpointing frequency so it happened every model step, it would thrash the system and slow down the task, as the model blocks until the I/O is complete.
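If you're curious how much each running task has actually written, Linux keeps per-process I/O counters. A rough sketch (assumes the process is named oifs_43r3_model.exe, as it appears elsewhere in this thread, and that you can read the boinc user's /proc entries, e.g. via sudo):

# cumulative bytes written to storage by each running OpenIFS task
for pid in $(pgrep -f oifs_43r3_model.exe); do
    printf 'PID %s: ' "$pid"
    awk '/^write_bytes/ {printf "%.1f GB written\n", $2/1e9}' /proc/"$pid"/io
done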
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,388,172 RAC: 91,192 |
Glenn Carver wrote: Richard, perhaps there's a misunderstanding. CPDN use 'trickles' to upload model results. Unpack the trickle zip files and it will contain the model output files from the previous steps since the last trickle upload. The final upload file just contains the last remaining model outputs. I don't know how other projects use trickles but they are not administrative fragments, they contain actual model (scientific) data.

The trickle files (named as trickle files) and the zip files are being treated differently. The zip files appear in the Transfers tab in BOINC Manager, whereas the trickle files do not. The zip files are large and contain model data; the trickle files are tiny and look like this...

<variety>orig</variety>
<wu>oifs_43r3_ps_1351_2021050100_123_946_12164440</wu>
<result>oifs_43r3_ps_1351_2021050100_123_946_12164440_2_r1535625001</result>
<ph></ph>
<ts>10368000</ts>
<cp>60877</cp>
<vr></vr>
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
The trickle & zip upload files are initiated at the same time by a single boinc function call in the code. They are not separate steps. The trickle file contains the filename of the zip to be uploaded. It's terminology largely, but I think of the 'trickle' as this initiation of transfers.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Adjusting write I/O from OpenIFS tasks

Further to Conan's point about the amount of write I/O: it can be adjusted, but only once a task has started running. The adjustment made will reduce the checkpoint frequency, meaning if the model does have to restart from a shutdown, it will have to repeat more steps. This change does NOT affect the model's scientific output, as that's controlled differently.

ONLY make this change if you leave the model running near-continuously with minimal possibility of a restart. Do NOT do it if you often shut down the PC or the boinc client, otherwise it will hamper the progress of the task. If in doubt, just leave it.

To make the change:
1/ Shut down the boinc client & make sure all processes with 'oifs' in their name have gone.
2/ Change to the slot directory.
3/ Make a backup copy of the fort.4 file (just in case): cp fort.4 fort.4.old
4/ Edit the text file fort.4, locate the line: NFRRES=-24, and change it to: NFRRES=-72, (preserve the minus sign and the comma). This will reduce the checkpoint frequency from 1 day (24 model hrs) to 3 days (72 model hrs). But it will mean the model might have to repeat as many as 3 model days if it has to restart.
5/ Restart the boinc client.

The changes can only be made once the model has started in a slot directory, not before. (A command-line shortcut for steps 3/ and 4/ is sketched below.)
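If you're comfortable on the command line, steps 3/ and 4/ can be combined. A sketch, assuming GNU sed (the -i.old option edits the file in place and keeps the untouched original as fort.4.old):

# run from inside the task's slot directory, with the boinc client stopped
sed -i.old 's/NFRRES=-24,/NFRRES=-72,/' fort.4
grep NFRRES fort.4    # confirm the line now reads NFRRES=-72,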
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
Half a day ago xii5ku wrote: As noted before, I still get occasional HTTP transfer timeouts. [And this caused a quick buildup of an upload backlog, even though I am running much fewer tasks at once than my Internet connection could sustain if transfers always succeeded.] On the computer with 20 concurrently running tasks, I now increased cc_config::options::max_file_xfers_per_project from 2 to 4 in hope that the probability of such idle time is much reduced, and that my upload backlog clears eventually. [Edit: Upped it to 6 now.]

As desired, the upload backlog is now cleared. Just some files that still need to be retried after previous timeouts remain.
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,322,579 RAC: 10,225 |
Glenn Carver wrote: It's possible AMD chips are triggering memory bugs in the code depending on what else happens to be in memory at the same time (hence the seemingly random nature of the fail). Hard to say exactly at the moment, but it could also be something system/hardware related specific to Ryzens. I have never seen the model fail like this before on the processors I've worked with in the past (none of which were AMD unfortunately). I am tempted to turn down the optimization and see what happens....

Hello Glenn,
https://www.cpdn.org/result.php?resultid=22252269
This task crashed earlier today with "double free or corruption (out)". It's an IFS task running in VirtualBox, Ubuntu 20.04, on an Intel i7-8700 with a Win10 host: https://www.cpdn.org/show_host_detail.php?hostid=1512045. The VM has 32GB RAM assigned (40GB physical) and about 100GB disc (2TB physical).

The only touches today around 6:00am:
a) In the Win10 host, I updated our daily energy usage in Excel and saved the two files.
b) In the Ubuntu VM, I looked to see what had happened overnight in the BOINC event log.
No changes to the Ubuntu host or BOINC Manager. No stops or restarts, no config changes. The other five IFS tasks have about two hours to go.

Stderr:
06:00:54 STEP 1054 H=1054:00 +CPU= 22.385
06:01:16 STEP 1055 H=1055:00 +CPU= 21.665
06:01:57 STEP 1056 H=1056:00 +CPU= 39.203
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMGGhq0f+001056
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMSHhq0f+001056
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMUAhq0f+001056
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMGGhq0f+001032
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMSHhq0f+001032
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMUAhq0f+001032
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMGGhq0f+001044
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMSHhq0f+001044
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMUAhq0f+001044
Zipping up the intermediate file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_43.zip
Uploading the intermediate file: upload_file_43.zip
06:02:19 STEP 1057 H=1057:00 +CPU= 20.970
double free or corruption (out)

If there is anything particular you would like me to look at and report, please let me know.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
06:02:19 STEP 1057 H=1057:00 +CPU= 20.970

Yes, I've been watching the returned tasks for the errored ones (seeing error codes 1, 5 & 9 mostly). If you could kindly check your /var/log/syslog file for an entry around the time the task finished. There should be mention of 'oifs_43r3_' something being killed. Let me know what there is. If you don't have a syslog file you might have a /var/log/messages file. If you don't have either, it means the syslog service hasn't started (often an issue on WSL); run:

sudo service rsyslog start

which will create the /var/log/syslog file.

Out of interest, how many tasks did you have running compared to how many cores? I have an 11th & a 3rd gen Intel i7 and the model has never crashed like this for me. The only suggestion I can make is not to put too many tasks on the machine. Random memory issues like this can depend on how busy memory is. I run one fewer task than I have cores (note cores, not threads), i.e. 3 tasks max for a 4-core machine. So far, touch wood, it's never crashed and I'm nowhere near my total RAM. I was going to do a test by letting more tasks run to see what happens once I've done a few successfully. It's quite tough to debug without being able to reproduce. thx.
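Something like the following is what's meant here (just a sketch; the exact wording of the kernel's out-of-memory messages varies between distributions, so treat the search patterns as a starting point):

# look for out-of-memory kills or other kill messages mentioning the task
grep -iE 'oom|killed process|oifs_43r3' /var/log/syslog | tail -n 20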
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
All 9 work units that I had running overnight have completed successfully. Running on an AMD Ryzen 9 5900x, 64GB RAM, all 24 threads used to run BOINC programmes at the same time as the ClimatePrediction models. All took around 17 hours 10 minutes run time. Conan |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Conan wrote: All 9 work units that I had running overnight have completed successfully.

The memory requirement for the OpenIFS PS app is 5Gb => 9x5 = 45Gb RAM max; with other boinc apps as well, maybe not a lot of memory headroom? With a 5900X I would expect runtimes nearer 12hrs, so maybe there is memory contention (though that depends on what the %age CPU usage in boincmgr is set to). Personally, I wouldn't advise running like this. There's a memory issue with the OpenIFS PS app. The higher the memory pressure, the more likely you are to have a fail at some point, I suspect. On the other hand, if you never get a fail in all the tasks you run, that will be interesting (and a puzzle!). Let me know in the New Year!
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
My machine is running four oifs_43r3_ps_1.05 tasks at a time. If some other oifs_43r3_BaroclinicLifecycle_v1.07 tasks showed up, I could run them at the same time; that has happened only once. The three with around 14 hours on them will complete within about an hour. The last one, in slot 2, started later, so it will need about an additional hour. I have had no failures with any of the Oifs model tasks (so far, anyway). My machine is Computer 1511241. It is running eight more Boinc processes, but little else. I am nowhere near running out of RAM. It is true that there are only 4.337 GBytes of RAM listed as free, but the number that really matters is the 43.454 GBytes listed as avail Mem.

top - 20:00:30 up 5 days, 11:38, 1 user, load average: 12.37, 12.35, 12.47
Tasks: 469 total, 14 running, 455 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 6.4 sy, 68.5 ni, 24.3 id, 0.1 wa, 0.2 hi, 0.1 si, 0.0 st
MiB Mem : 63772.8 total, 4337.9 free, 19480.7 used, 39954.2 buff/cache
MiB Swap: 15992.0 total, 15756.7 free, 235.2 used. 43454.1 avail Mem

PID     PPID   USER   PR NI S RES   %MEM %CPU P  TIME+     COMMAND
477613  477610 boinc  39 19 R 3.9g  6.2  98.9 1  746:00.42 /var/lib/boinc/slots/2/oifs_43r3_model.exe
472285  472281 boinc  39 19 R 3.8g  6.2  98.9 5  853:39.76 /var/lib/boinc/slots/10/oifs_43r3_model.exe
472215  472212 boinc  39 19 R 3.7g  5.9  98.9 10 855:28.23 /var/lib/boinc/slots/9/oifs_43r3_model.exe
472332  472329 boinc  39 19 R 3.3g  5.3  98.9 9  852:20.58 /var/lib/boinc/slots/11/oifs_43r3_model.exe
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
Typical RAM usage is quite a bit less than peak RAM usage. The peak happens briefly and, I presume, periodically during every nth(?) timestep. Concurrently running tasks are not very likely to reach such a peak simultaneously. From this follows:
– On hosts without a lot of RAM, the number of concurrently running tasks should be sized with the peak RAM demand in mind.
– On hosts with a lot of RAM, the number of concurrently running tasks can be sized for a figure somewhere between the average and peak RAM demand per task.
The boinc client watches overall RAM usage and puts work into a waiting state if the configured RAM limit is exceeded, but from what I understand, this built-in control mechanism has difficulty coping with fast fluctuations of RAM usage like OIFS's. (See the sketch below for one way to cap concurrent tasks explicitly.)

(Edit: And let's not forget about disk usage per task. Sometimes, computers or VMs which are dedicated to distributed computing have comparably small mass storage attached. Though I suppose the workunit parameter of rsc_disk_bound = 7 GiB is already set suitably. From what I understand, this covers a worst case of a longer period of disconnected network, i.e. accumulated result data. I am not sure, though, in which way the boinc client takes rsc_disk_bound into account when starting tasks.)

(Edit 2: Exactly 7.0 GiB is not enough, though, when all of the 122 intermediate result files – already compressed – haven't been uploaded yet and the final result data is written and needs to be compressed into the 123rd file, plus the input data etc. are still around. Though that's quite a theoretical case.)
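One explicit way to do that sizing is to cap how many OIFS tasks the client may run at once via an app_config.xml in the projects/climateprediction.net directory. A minimal sketch — the app's short name must match what your client_state.xml reports, so treat 'oifs_43r3_ps' below as an assumption to verify, and pick the number from your own peak-RAM arithmetic:

<app_config>
  <app>
    <!-- short app name; check client_state.xml for the exact string -->
    <name>oifs_43r3_ps</name>
    <!-- run at most this many tasks of this app at once -->
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>

After saving it, 'Options -> Read config files' in BOINC Manager (or a client restart) makes it take effect.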