Message boards : Number crunching : East Asia testing.
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856 |
Too early to say if main site work will come from this. (My instinct is yes, but don't hold your breath!) I am running four East Asia 25 km resolution tasks under WINE from the testing branch. Assuming they finish, they will take 48 days, which is why I say, "don't hold your breath!" I have also been warned that the region crossing the Himalayas may cause some of them to go unstable, so there may be more physics failures than normal. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I might as well hold my breath because I am not getting any other work from CPDN these days. (Nor from WCG.) And this after raising my RAM. The disk space listed here is just the partition allocated solely to Boinc. So "Ready when you are, chief!" (Punch line from a joke-story.) Memory 125.34 GB; Cache 16896 KB; Swap space 15.62 GB; Total disk space 488.04 GB; Free disk space 479.26 GB; Measured floating point speed 6.04 billion ops/sec; Measured integer speed 24.55 billion ops/sec; Average upload rate 146.64 KB/sec; Average download rate 15542.13 KB/sec |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Cecil B. DeMille wanted to make a really great scene of Moses and the Israelites crossing the Red Sea, so with all his skills, and all the producers' money, he arranged with On High to have the sea part for long enough to film the scene. Not only that, but he had three camera crews filming the event: one on each side of the Red Sea, and one on top of the hill nearby. He then got Moses and the Israelites (actors and extras) to cross, but he was a little late and the sea closed in and drowned all of them. He turned to his cameraman and asked if he got the scene, and the cameraman apologized because, for the first time in his career, he forgot to take the lens cap off the camera lens. No problem, said DeMille; Harry on the other side will have it. He shouts across, but Harry was so upset he could hardly answer: for the first time in his career, he forgot to load film into his camera. Thank goodness, said DeMille, Fred on the hill will have it. Fred! Did you get that? And Fred shouted back, "Ready when you are, chief!" |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856 |
Because the region covered is much bigger than the ANZ region, these tasks will be long if they get here. Currently looking like about 50 days on my Ryzen. Slower machines will be well over 2 months even if running 24/7. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,840,128 RAC: 20,043 |
Because the region covered is much bigger than the ANZ region, these tasks will be long if they get here. Currently looking like about 50 days on my Ryzen. Slower machines will be well over 2 months even if running 24/7. Sounds like these are Windows Hadley models? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856 |
Sounds like these are Windows Hadley models? Yes, the four I am running from the testing branch are running under WINE. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856 |
If these Windows work units are long running as you suggest, I hope there will be a mechanism in place on the server to ensure that everyone who wants some will get them, shared fairly and equally, instead of by greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline. I have no idea on that. My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. Slower machines will, I would guess, take over three months to complete these tasks, so setting the deadline at 3 months would stop some users from downloading them at all because they wouldn't finish in time. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline. Just so I am not considered greedy: I do not wish to hog an unfair number of tasks, but OTOH, I do want a large enough number of tasks on my machine to coast me over the dead spots. Right now, both CPDN and WCG have extremely long dead spots. CPDN just does not have tasks ready, and I would not wish to have many weeks of those work units that used to have 1-year deadlines. In the old days, new tasks were always available, so I did not need a large input queue. WCG is just plain down for extremely long intervals (months), and has not really been running right in over a year. I do not remember how long their tasks take, but some of them were 8 hours or so, and others less. Some project had some 8-day tasks, but I do not remember if it was one of the WCG ones or not. For one project I am on, DENIS, it downloaded about 100 tasks all at once. But they run pretty quickly (about 70 minutes) so I have no trouble completing them on time (deadline is about three days). Then they have several days of no work. This is not actually a complaint. For another project (Einstein), I get only half a dozen at most, and I can complete them on time too. They take longer to run (about 11 hours each). As far as I know, the only way to get more tasks to download would be to set Options->Computing Preferences->Computing->Other to have higher Days of Work settings. What do other people consider fair settings for these? My settings are: at least 0.5 days of work, additional 1.0 days of work. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
If these Windows work units are long running as you suggest, I hope there will be a mechanism in place on the server to ensure that everyone who wants some will get them, shared fairly and equally, instead of by greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline. I have no idea on that. My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. Slower machines will, I would guess, take over three months to complete these tasks, so setting the deadline at 3 months would stop some users from downloading them at all because they wouldn't finish in time. If they do it the way they did for the nz25 batches, the development site spinups ran for 113 model months and took about 20 days on my i7-4790K. When they later sent out stash/ancil test nz25 batches to the dev site, they were for 25 model months and took less than a quarter of the time that the spinups did. The nz25 batches sent to the main cpdn site were also 25 model months. Just guessing, but I don't think 119 model month batches will come to the main site. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,840,128 RAC: 20,043 |
My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. While it was helpful to add a 30 day grace period in addition to the 30 day deadline for OIFS tasks during the storage outage, it seems to me that it should've been removed once things stabilized. There are still almost 200 PS tasks out and from what I remember, the contract deadline was supposed to be end of February. BL, and regular OIFS apps also still have 100-150 tasks out each although I'm not sure if 30 days have passed yet on those. Just so I do not be considered greedy, I do not wish to hog an unfair number of tasks, but OTOH, I do want a large enough number of tasks on my machine to coast me over the dead spots. I agree, since work availability is not constant or consistent I also like to have a large enough cache store but not too big so that I can still finish all work by the 30 day deadline. It's not always possible as there are limits based on one's consecutive valid tasks. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856 |
If they do it the way they did for the nz25 batches, the development site spinups ran for 113 model months and took about 20 days on my i7-4790K. When they later sent out stash/ancil test nz25 batches to the dev site, they were for 25 model months and took less than a quarter of the time that the spinups did. The nz25 batches sent to the main cpdn site were also 25 model months. Just guessing but I don't think 119 model month batches will come to the main site. I missed this in a post from George about a month ago. When the four I have running get nearer completion I might ask what the plan is. I also missed noticing the long spinup NZ tasks or I might have twigged that main site tasks might well be considerably shorter. Thanks for the reminder/hint by PM George. |
Send message Joined: 31 Aug 04 Posts: 5 Credit: 17,401,474 RAC: 5,243 |
Hello! Now that the testing is over and the "real" work starts, I have got some WUs from batch #994 and, alas, so far several have errored out with Segment violation (Signal 11), and when trying to upload the result I get "upload failure: <file_xfer_error>" "<error_code>-240 (stat() failed)</error_code>". Is it my computers or the units that have some serious errors? Example: https://www.cpdn.org/workunit.php?wuid=12216822 Hans Sveen Oslo, Norway |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856 |
The file transfer error is just because the model crashed before the zip files could be created. One of the other moderators had the ones on his computer all crash, though at least one was due to a power outage while the task was running. I have one running so far and it is about two hours in. It is really too early to say if there are serious issues with these tasks or not yet. The ones of yours I looked at have all gone out again. I suspect the crashes you are experiencing, which are all in the same ballpark of CPU time, are happening at the end of the first model day, and as your machine has produced pretty consistent results in the past, it is probably something to do with the model. If it is, we can expect a flurry of reports over the next day or two. The region being examined includes the Himalayas, and the researcher did say when the testing tasks went out that it was possible that some of the tasks might be pushing the limits of the model. Edit: In about three quarters of an hour, I will be able to get some more work. (The one task I have running is under WINE in a Linux VM.) When the WCG work I have running finishes, I will shut down the Linux client and get some tasks under WINE in the host machine. That will give me a chance to see more than just one task at a time. Edit2: A further four tasks downloaded and have gotten past the point at which yours crashed. (All are over one hour in.) However, the sample size is still too small to draw any conclusions. How many tasks are actually running at once? Is anything else using a lot of memory? Those are two things that potentially can cause problems. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just got a WAH2 task on my Windows 10 machine. It failed after 3 minutes 31 seconds. Also, the out.zip failure message is having trouble uploading. I hit retry and it wants to wait. So I will wait. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Looks like the regional model portion of the task takes 400+ MB of resident memory while the global portion of each task takes about 200 MB, so 600 to 700 MB total resident memory for each task. My Ryzen 5600 is running 3 at a time and it works out to about 8.5 days to complete them. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Yep, I had 4 WaH tasks and all crashed with a segV. Unfortunately they disappeared too quickly for me to see the detailed logs but it was the model that crashed and not the wrapper process. Maybe it's a Win11 issue, AFAIK the model has not been recompiled for quite some time on Windows. update: checking the WU I see the previous host also failed the task and was running Win10. Unlike OIFS there's very little returned to the task page making it v difficult to diagnose the problem --- CPDN Visiting Scientist |
Send message Joined: 30 Aug 06 Posts: 27 Credit: 1,879,577 RAC: 1,213 |
I had 2 wah2_eas25_a3ny_201511_25_994_012220188 tasks and both crashed after 2 minutes. It's having trouble uploading them. Regards, DadX |
Send message Joined: 30 Aug 06 Posts: 27 Credit: 1,879,577 RAC: 1,213 |
Correction I had two tasks that failed wah2_eas25_a3ny_201511_25_994_012220188 and wah2_eas25_a1cm_199611_25_994_012217188 after 2 minutes. Regards, DadX |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I am still waiting for the out.zip failure message file to upload. I checked my Windows 10 machine hardware. I do not fully understand what these mean -- especially why the first two are different. It has: 16.0 GBytes Installed; 15.6 GBytes Total; 8.78 GBytes Available; 18.0 GBytes Virtual; 10.1 GBytes Available Virtual. The computer is running 6 Boinc tasks (no CPDN at the moment). The Boinc-client allows 7 tasks to run, but the various app_config files are limiting them to only six. The one for CPDN allows only one of those to run at a time. |
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,343,308 RAC: 10,415 |
Five tasks downloaded on this WIN 11 PC. Four crashed almost immediately, with the event log complaining about missing output files. The four took a little while to report, and std.err complains about Segment violations. Task number 5 is ticking along nicely. After four hours the task is 2.1 per cent through, giving a run-time estimate of 189 hours or about 8 days. |
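
For anyone wanting to sanity-check a run-time estimate like the one in the last post, here is a rough Python sketch of the arithmetic. It assumes progress is linear in elapsed time, which may not hold exactly for these models, and the function name `estimate_total_hours` is made up for illustration; it is not how the BOINC client computes its own estimate.

```python
def estimate_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linearly extrapolate total run time from the progress so far.

    Assumption: the task advances at a constant rate, so
    total = elapsed / fraction_done.
    """
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be greater than 0 and at most 100")
    return elapsed_hours * 100.0 / percent_done


# Figures from the post above: 4 hours elapsed, 2.1 per cent done.
total = estimate_total_hours(4.0, 2.1)
print(f"{total:.0f} hours, about {total / 24:.1f} days")
# prints "190 hours, about 7.9 days"
```

This lands within an hour or so of the 189 hours quoted; the small difference is probably just rounding in the displayed percentage.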
©2024 cpdn.org