Message boards : Number crunching : New work Discussion
Joined: 18 Feb 17 Posts: 81 Credit: 14,026,949 RAC: 972 |
How are these new Linux tasks on RAM and disk space? I've got an i7-2600 with Ubuntu 18.04 with all the proper fixes for 32-bit work installed (I think). From what I remember, the WAH 2 tasks use around 1 GB per core on Windows, unless I'm mistaken; I forget how much disk space, though. Should I expect similar runtimes under Linux, e.g. 3-5 days? Thanks, and I look forward to giving this a shot. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Thomas, one of your computers had 2 models fail with "Model crashed: ATM_DYN : INVALID THETA DETECTED". This is a perfectly normal science result - ATM_DYN is Atmospheric Dynamics - and it means that the starting values used for that model eventually led to an impossible physical condition of some sort. This is one of the reasons for running these models: to find out what happens at some time in the model's future. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
wolfman, the time taken will depend on the particular research being done in a given batch. If the researcher gets ambitious and wants a long run, then that's what it will end up as. Disk space - give them plenty of room to start with, because you never know. One test batch of OpenIFS models used a bit over 9 GB per model. Some more work is being assembled at present, but I don't have any details. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I am running Red Hat Enterprise Linux 6.10 on a machine with a 4-core 64-bit Xeon processor. There are 4 CPDN tasks running, and this is the disk space they are using:

988M ./hadam4_a08h_209410_12_838_011900611
982M ./hadam4_a0hb_209810_12_838_011900929
778M ./hadam4_a0hb_209810_12_838_011900929/datain
778M ./hadam4_a08h_209410_12_838_011900611/datain
640K ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil/ctldata
640K ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil/ctldata
624K ./hadcm3s_hd57_190012_240_835_011892279/jobs
624K ./hadcm3s_hd55_190012_240_835_011892277/jobs
624K ./hadam4_a0hb_209810_12_838_011900929/datain/ancil/ctldata
624K ./hadam4_a08h_209410_12_838_011900611/datain/ancil/ctldata
603M ./hadam4_a0hb_209810_12_838_011900929/datain/ancil
603M ./hadam4_a08h_209410_12_838_011900611/datain/ancil
552K ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil/ctldata/STASHmaster
552K ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil/ctldata/STASHmaster
536K ./hadam4_a0hb_209810_12_838_011900929/datain/ancil/ctldata/STASHmaster
536K ./hadam4_a08h_209410_12_838_011900611/datain/ancil/ctldata/STASHmaster
470M ./hadcm3s_hd55_190012_240_835_011892277
351M ./hadcm3s_hd57_190012_240_835_011892279
276K ./hadam4_a0hb_209810_12_838_011900929/jobs
276K ./hadam4_a08h_209410_12_838_011900611/jobs
245M ./hadcm3s_hd57_190012_240_835_011892279/datain
245M ./hadcm3s_hd55_190012_240_835_011892277/datain
210M ./hadam4_a08h_209410_12_838_011900611/dataout
208M ./hadcm3s_hd55_190012_240_835_011892277/dataout
205M ./hadam4_a0hb_209810_12_838_011900929/dataout
180K ./hadcm3s_hd57_190012_240_835_011892279/tmp
180K ./hadcm3s_hd55_190012_240_835_011892277/tmp
175M ./hadam4_a0hb_209810_12_838_011900929/datain/dumps
175M ./hadam4_a08h_209410_12_838_011900611/datain/dumps
143M ./hadcm3s_hd57_190012_240_835_011892279/datain/masks
143M ./hadcm3s_hd55_190012_240_835_011892277/datain/masks
88M ./hadcm3s_hd57_190012_240_835_011892279/dataout
84K ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil/ctldata/stasets
84K ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil/ctldata/stasets
84K ./hadam4_a0hb_209810_12_838_011900929/datain/ancil/ctldata/stasets
84K ./hadam4_a08h_209410_12_838_011900611/datain/ancil/ctldata/stasets
70M ./hadcm3s_hd57_190012_240_835_011892279/datain/dumps
70M ./hadcm3s_hd55_190012_240_835_011892277/datain/dumps
33M ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil
33M ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil
28K ./hadam4_a0hb_209810_12_838_011900929/tmp
28K ./hadam4_a08h_209410_12_838_011900611/tmp
3.2G total

The two big ones have been running about a day, and the two little ones for about 5 days. |
Joined: 18 Feb 17 Posts: 81 Credit: 14,026,949 RAC: 972 |
Thank you both. Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available, so one can hope. I theoretically have the 32-bit libraries installed for Ubuntu 18.04; I didn't get the 'no tasks available for this operating system' message when adding the project. Is there a way to tell, similar to WAH 2, whether I got a shorter or longer task - or how exactly are these workunits different from the ones run on Windows? For instance: number of months run, the grid spacing in km it covers (similar to sam25 being 25 km), and batch number? The machine I'm most worried about as far as RAM usage is a Ryzen 7 1800X with 16 GB. It's running Windows; I'd just not like it to start thrashing a ton, so maybe I will limit CPDN to 12 or so tasks just to be on the safe side. I have considered having Linux run off a USB drive when I'm not using the machine, for alternative crunching and projects that don't run on Windows. Are these similar to, for instance, Rosetta, where RAM usage will slowly creep up until a checkpoint - or in this case a zip file - is created? Thanks, and sorry for the barrage of questions. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Thank you both. Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available, so one can hope. I theoretically have the 32-bit libraries installed for Ubuntu 18.04. I didn't get the 'no tasks available for this operating system' message when adding.

The hadcm3s models now on Linux have been run on Windows (and Linux and Mac) before. These are "simpler" models and run faster for a given model day/month/year. The ones we are running right now are 240 months, so 20 years, but I've seen other numbers of months in other batches in the past. The 240-month models might take 5 days or less on a very fast PC, as long as it's not running hyperthreading/SMT. They take less than 200 MB of RAM per task. It's a global model with no regional component, and has a rather large spacing between grid points.

The N144 models are currently being run for 12 model months and are also global models with no regional component. They have a grid spacing of N144, which is on the order of 100 km or so at the equator. These take about 650 MB of RAM per task, and loading up a multi-core PC with a whole bunch of them will slow down the progress considerably. We're talking 5 to 10 days on fast PCs that aren't loaded up with too many of them.

The N216 models have a grid spacing about 2/3 of the N144 ones and can take up to 1.5 GB of RAM per task. On the development site, we've been running them for 4 months. The most I've run at a time is 2, as it is desired to get back results quickly on some of these beta tests. I imagine they will really slow down progress if you load up a multi-core PC with a whole bunch of them, let alone trying to run them on HT/SMT logical cores. Just running a couple is 5+ days for a fast PC.

The machine I'm most worried about as far as RAM usage is a Ryzen 7 1800X with 16 GB. It's running Windows; I'd just not like it to start thrashing a ton, so maybe I will limit CPDN to 12 or so tasks just to be on the safe side. I have considered having Linux run off a USB drive when I'm not using it for alternative crunching and projects that don't run on Windows.

For the hadcm3s, N144, and N216 models, there isn't much fluctuation in the RAM used per task over the course of the run. For the OpenIFS models, which are 64-bit and may be coming later, RAM does fluctuate as the model runs along; we have run some that *only* take up 3.5 GB per task and others up to 5.2 GB per task. |
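As a rough illustration of the RAM arithmetic in the post above (the per-task figures are quoted from this thread; the 4 GB headroom reserved for the OS and other applications is my own assumption, not a CPDN recommendation):

```python
# Back-of-envelope: how many CPDN tasks of a given type fit in RAM.
# Per-task sizes are the figures quoted in this thread; the headroom
# margin is an assumed value.
def max_tasks(total_mb, per_task_mb, headroom_mb=4096):
    """Whole number of tasks that fit after reserving headroom."""
    return max(0, (total_mb - headroom_mb) // per_task_mb)

ram = 16 * 1024  # the 16 GB Ryzen 7 1800X mentioned above, in MB
print(max_tasks(ram, 200))   # hadcm3s, ~200 MB each  -> 61
print(max_tasks(ram, 650))   # N144,   ~650 MB each  -> 18
print(max_tasks(ram, 1536))  # N216, up to ~1.5 GB   -> 8
```

So on a 16 GB machine the N216 tasks, not the core count, would be the binding limit well before 12 concurrent tasks.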
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available

VERY important: DON'T click the Update button. Doing this resets the one-hour backoff to an hour and a couple of minutes, and you may only be a few seconds away from getting work. Although there should be a message in that case, along the lines of "too soon since last request". And the model names contain a lot of what you asked about; you just need to decode them. |
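Decoding a task name can be sketched like this. The field meanings below are inferred from the names seen earlier in this thread (model, unique model id, start year and month, run length in months, batch number, workunit id); treat them as an illustration rather than an official naming spec:

```python
# Sketch: split a CPDN task name into its apparent fields.
# Field interpretation is an assumption based on names in this thread,
# e.g. hadcm3s_hd57_190012_240_835_011892279.
def decode_task_name(name):
    model, umid, start, months, batch, wuid = name.split("_")
    return {
        "model": model,                   # e.g. hadcm3s, hadam4
        "umid": umid,                     # unique model identifier
        "start_year": int(start[:-2]),    # e.g. 1900
        "start_month": int(start[-2:]),   # e.g. 12
        "months": int(months),            # run length in model months
        "batch": int(batch),
        "workunit": int(wuid),
    }

info = decode_task_name("hadcm3s_hd57_190012_240_835_011892279")
print(info["model"], info["start_year"], info["months"], info["batch"])
# -> hadcm3s 1900 240 835
```

That matches the 240-month (20-year) hadcm3s runs described above, starting from model date December 1900, in batch 835.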
Joined: 18 Feb 17 Posts: 81 Credit: 14,026,949 RAC: 972 |
Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available

Excellent. Of course, about 5 minutes after posting my previous message I did exactly that. Though I have since left it alone, it doesn't seem to want to attempt to schedule more work, even hours later. I've got Asteroids queued up at the moment, but there is no more work available over there, so I'm hoping this straightens itself out before the tasks here are all gone. Both projects are set equally as far as resource management. I do have Rosetta set up, but it's at 0%, so if work was available here I should hope I would get it first. I see 271 workunits are still available, so I'm hoping that I'll have at least 1 or 2 when I wake up tomorrow. This is an i7-2600 with 16 GB of RAM, so certainly not the fastest, but I doubt I'll get enough work to hit all the threads at this rate. HT is enabled because I run other projects that benefit from it. Perhaps tomorrow I'll set up an app_config with max_concurrent, assuming I get anything. Work seems to be coming out fairly regularly, though. Thank you for the help. |
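For anyone else wanting to do the same, limiting concurrent tasks is done with an `app_config.xml` placed in the project's directory under the BOINC data folder. A minimal sketch - the app name `hadam4` is an assumption taken from the task names in this thread, so check the app names your client actually reports before using it:

```xml
<app_config>
    <!-- Cap a single app's concurrent tasks.
         The app name below is an assumption; verify it against
         your BOINC client's event log or client_state.xml. -->
    <app>
        <name>hadam4</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <!-- Cap total concurrent tasks for the whole project. -->
    <project_max_concurrent>4</project_max_concurrent>
</app_config>
```

After saving the file, "Options / Read config files" in the BOINC Manager (or restarting the client) picks it up.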
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
One Update now and then doesn't hurt; it depends on what you're doing. E.g. if I'm setting up the computer after reloading the OS, I'll waste some of the backoff by doing an Update to get changes to the Venue, or anything else that I want set differently to the default. THEN I leave the Update alone. But if you haven't had a response in that long, then it's time to get suspicious and try things - starting with an Update, then, if that doesn't work, shutting down BOINC and rebooting the computer. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
The N144 models are currently being run for 12 model months and are also global models with no regional component. They have a grid spacing of N144, which is on the order of 100 km or so at the equator. These take about 650 MB of RAM per task, and loading up a multi-core PC with a whole bunch of them will slow down the progress considerably.

Why does loading up a multi-core PC with a bunch of N144 models slow the progress down? (I am not referring to hyperthreaded processors.) It seems to me that if all my cores are running N144 models, the on-chip cache (memory 15.5 GB, cache 10240 KB) would get a greater hit rate (for the instructions, not the data, of course) than if, say, only one core were running N144 and the other three were running WCG or something. And I do not recall that the hadam4 models ran the hard drives very hard. Each takes 4% of my RAM, which is not a big deal. I have not seen any N216 models yet. |
Joined: 18 Feb 17 Posts: 81 Credit: 14,026,949 RAC: 972 |
Okay, this is a little odd. This is what I woke up to. Perhaps it's normal and I'm just not sure what I'm seeing.

2019-10-02 9:28:12 AM | climateprediction.net | [sched_op] Starting scheduler request
2019-10-02 9:28:12 AM | climateprediction.net | Sending scheduler request: To fetch work.
2019-10-02 9:28:12 AM | climateprediction.net | Requesting new tasks for CPU
2019-10-02 9:28:12 AM | climateprediction.net | [sched_op] CPU work request: 9387264.63 seconds; 8.00 devices
2019-10-02 9:28:13 AM | climateprediction.net | Scheduler request completed: got 0 new tasks
2019-10-02 9:28:13 AM | climateprediction.net | [sched_op] Server version 713
2019-10-02 9:28:13 AM | climateprediction.net | No tasks sent
2019-10-02 9:28:13 AM | climateprediction.net | Project requested delay of 3636 seconds
2019-10-02 9:28:13 AM | climateprediction.net | [sched_op] Deferring communication for 01:00:36
2019-10-02 9:28:13 AM | climateprediction.net | [sched_op] Reason: requested by project

Does this mean I have too much work queued up in the meantime? Since it isn't saying the project has no work or no tasks available, I'm trying to figure out why I'm not receiving any, since Asteroids has been running with 0 new tasks in its queue for several days now and I only have enough Rosetta tasks to fill the empty cores. Any help appreciated. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have noticed that it starts out with a one-hour delay in any case. You may get something after that. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Why does loading up a multi-core PC with a bunch of N144 models slow the progress down?

I only noticed a small hit on performance running 4 N144 tasks. With the IFS, having only 8 GB of RAM for 4 cores, there was a massive hit, though total throughput of tasks did still increase with each additional CPU. However, I think the hit on my SSD would be excessive if doing so more than occasionally. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
wolfman, it IS quite possible that BOINC has enough work to be getting on with. CPDN doesn't play well with other projects, with their tasks that only last minutes to a few hours. Having other projects active at the same time only really works when CPDN has a large, constant stream of work; then BOINC can slot in a climate model when other project work is scheduled to run out "soon". And now we're out of work again. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I only noticed a small hit on performance running 4 N144 tasks. With the IFS having only 8 GB of RAM for 4 cores, there was a massive hit, though total throughput of tasks did still increase with each additional CPU. However, I think the hit on my SSD would be excessive if doing so more than occasionally.

OK; I have 4 cores, and I ran 4 N144 tasks a while ago with 16 GB of RAM, with the machine doing little else than some web browsing and e-mail - and how fast can I read or type? I am currently running 2 hadam4 N144 tasks and two hadcm3s tasks. When the hadam4 tasks complete, I will see if they ran any faster than when I was running four. But I still wonder what the mechanism is that slows down four N144 tasks running at a time. Cache poisoning seems unlikely. Their disk requirements seem low, so that is not likely to be the problem. Shortage of memory right now is not a problem: the hadam4's take about 4% of my RAM each, and the hadcm3s tasks take 1.1% of my RAM each, so it is not memory thrashing to disk. Could it be that your processor(s) overheat and get automatically throttled down due to the work load? So far, I have received no IFS work units. I run Linux on a 64-bit Xeon processor, so I should be able to run them when they become more available. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
If you're not having problems, then don't worry. And the easiest way to tell if you have a problem may well be the sec/TS number on your Task page. I found it was obvious when running the big introductory batch of OpenIFS earlier in the year. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
If you're not having problems, then don't worry. Thank you. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Could it be that your processor(s) overheat and get automatically throttled down due to the work load? Pretty sure in my case it was the disk writes slowing things down. Though this was around the time when the temperature in Cambridge set a new UK record. The fan in my laptop is quieter now with 4 cores running than it was then with two! With the IFS I know it was swapping data out to the swap partition from main memory. |
Joined: 18 Feb 17 Posts: 81 Credit: 14,026,949 RAC: 972 |
wolfman: Would setting another project to, say, 10% and CPDN to, say, 190 work at all? Along those same lines, would having a third backup project set at 0% resources - just in case the former two are out of work, which in this case can happen with a fair bit of regularity - harm fetching of new work? It was a toss-up between this and LHC. I've got quite a few Windows 10 machines. I know BOINC can harness the Windows 10 Linux subsystem; I'm just not sure exactly how seamless it actually is. I'll throw the libraries into the "Linux" install in the terminal tomorrow and see what happens from there. Hopefully, with more than one machine looking for work, I'll get somewhere when work is available. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Windows tasks coming up. |
©2024 cpdn.org