Thread 'East Asia testing.'

Author	Message
Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856	Message 68589 - Posted: 14 Mar 2023, 10:38:56 UTC Too early to say if main site work will come from this. (My instinct is yes, but don't hold your breath!) I am running four East Asia 25 km resolution tasks under WINE from the testing branch. Assuming they finish, they will take 48 days, which is why I say, "don't hold your breath!" I have also been warned that the region crossing the Himalayas may cause some or all of them to go unstable, so there may be a higher than normal rate of physics failures.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 68590 - Posted: 14 Mar 2023, 13:33:37 UTC - in response to Message 68589. Last modified: 14 Mar 2023, 13:35:42 UTC I might as well hold my breath, because I am not getting any other work from CPDN these days. (Nor from WCG.) And this after adding more RAM. The disk space listed here is just the partition allocated solely to BOINC. So "Ready when you are, chief!" (Punch line from a joke-story.)
Memory: 125.34 GB
Cache: 16896 KB
Swap space: 15.62 GB
Total disk space: 488.04 GB
Free disk space: 479.26 GB
Measured floating point speed: 6.04 billion ops/sec
Measured integer speed: 24.55 billion ops/sec
Average upload rate: 146.64 KB/sec
Average download rate: 15542.13 KB/sec

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 68591 - Posted: 14 Mar 2023, 15:09:10 UTC - in response to Message 68590. Cecil B. DeMille wanted to make a really great scene of Moses and the Israelites crossing the Red Sea, so with all his skills and all the producers' money, he arranged with On High to have the sea part for long enough to film the scene. Not only that, but he had three camera crews filming the event: one on each side of the Red Sea, and one on top of the hill nearby. He then got Moses and the Israelites (actors and extras) to cross, but he was a little late and the sea closed in and drowned all of them. He turned to his cameraman and asked if he got the scene, and the cameraman apologized because, for the first time in his career, he had forgotten to take the lens cap off the camera lens. "No problem," said DeMille, "Harry on the other side will have it." He shouted across, but Harry was so upset he could hardly answer: for the first time in his career, he had forgotten to load film into his camera. "Thank goodness," said DeMille, "Fred on the hill will have it. Fred! Did you get that?" And Fred shouted back, "Ready when you are, chief!"

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856	Message 68592 - Posted: 15 Mar 2023, 7:11:38 UTC Because the region covered is much bigger than the ANZ region, these tasks will be long if they get here. Currently looking like about 50 days on my Ryzen. Slower machines will be well over 2 months even if running 24/7.

AndreyOR Joined: 12 Apr 21 Posts: 317 Credit: 14,840,128 RAC: 20,043	Message 68593 - Posted: 15 Mar 2023, 10:22:05 UTC - in response to Message 68592. Quote: "Because the region covered is much bigger than the ANZ region, these tasks will be long if they get here. Currently looking like about 50 days on my Ryzen. Slower machines will be well over 2 months even if running 24/7." Sounds like these are Windows Hadley models?

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856	Message 68594 - Posted: 15 Mar 2023, 10:48:21 UTC - in response to Message 68593. Quote: "Sounds like these are Windows Hadley models?" Yes, the four I am running from the testing branch are running under WINE.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856	Message 68596 - Posted: 17 Mar 2023, 16:39:14 UTC - in response to Message 68595. Quote: "If these Windows work units are long running as you suggest, I hope there will be a mechanism in place on the server to ensure that everyone who wants some will get them, shared fairly and equally, instead of by greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline." I have no idea on that. My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. Slower machines will, I would guess, take over three months to complete these tasks, so setting the deadline at 3 months would stop some users from downloading them at all, because they wouldn't finish in time.
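Dave's point about a deadline filtering out slower machines can be illustrated with a simple feasibility check. This is a simplified stand-in, not the BOINC scheduler's actual logic: it assumes run time scales inversely with a host's speed relative to Dave's Ryzen (about 50 days per task, from his earlier post) and with the fraction of each day the host is crunching. The function names, the relative-speed values and the 90-day deadline are illustrative assumptions.

```python
# Simplified deadline feasibility check (illustrative only; the real BOINC
# scheduler uses its own run-time estimates and also accounts for queued work).

REFERENCE_DAYS = 50.0  # approximate run time on Dave's Ryzen, from the thread
DEADLINE_DAYS = 90.0   # hypothetical 3-month deadline


def estimated_days(relative_speed: float, on_fraction: float = 1.0) -> float:
    """Estimate wall-clock days for one task on a host.

    relative_speed: host speed as a fraction of the reference Ryzen (assumed).
    on_fraction: fraction of each day the host actually crunches (1.0 = 24/7).
    """
    if relative_speed <= 0 or not 0 < on_fraction <= 1:
        raise ValueError("speed must be positive and on_fraction in (0, 1]")
    return REFERENCE_DAYS / (relative_speed * on_fraction)


def can_meet_deadline(relative_speed: float, on_fraction: float = 1.0) -> bool:
    """True if the estimated run time fits inside the deadline."""
    return estimated_days(relative_speed, on_fraction) <= DEADLINE_DAYS


if __name__ == "__main__":
    # Hypothetical hosts: (speed relative to the Ryzen, fraction of day running)
    for speed, on in [(1.0, 1.0), (0.6, 1.0), (1.0, 0.5)]:
        days = estimated_days(speed, on)
        print(f"speed={speed:.1f} on={on:.1f}: {days:.0f} days, "
              f"meets deadline: {can_meet_deadline(speed, on)}")
```

A host at 60% of the Ryzen's speed running 24/7 lands at roughly 83 days and still fits a 3-month deadline, while a host as fast as the Ryzen but running only half the day (100 days) would not.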

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 68597 - Posted: 17 Mar 2023, 17:03:26 UTC - in response to Message 68595. Quote: "greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline." Just so I am not considered greedy: I do not wish to hog an unfair number of tasks, but OTOH, I do want a large enough number of tasks on my machine to coast me over the dead spots. Right now, both CPDN and WCG have extremely long dead spots. CPDN just does not have tasks ready, and I would not wish to have many weeks of those work units, which used to have 1-year deadlines. In the old days, new tasks were always available, so I did not need a large input queue. WCG is just plain down for extremely long intervals (months), and has not really been running right in over a year. I do not remember how long their tasks take, but some of them were 8 hours or so, and others less. Some project had some 8-day tasks, but I do not remember if it was one of the WCG ones or not. For one project I am on, DENIS, the client downloaded about 100 tasks all at once. But they run pretty quickly (about 70 minutes), so I have no trouble completing them on time (the deadline is about three days). Then they have several days of no work. This is not actually a complaint. For another project (Einstein), I get only half a dozen at most, and I can complete them on time too. They take longer to run (about 11 hours each). As far as I know, the only way to get more tasks to download would be to set Options->Computing Preferences->Computing->Other to have higher Days of Work settings. What do other people consider fair settings for these? My settings are: At least 0.5 days of work; Additional 1.0 days of work.
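A rough sketch of how the two "days of work" settings translate into a task count. It assumes, as a simplification of BOINC's work-fetch behaviour, that once queued work drops below the "at least" level the client requests enough to reach "at least" plus "additional" days of work for every usable CPU, and that all of those CPUs work for a single project. Real clients also weigh resource shares, project-side limits and their own run-time estimates, so actual counts are usually lower. The function name and the 8-CPU host are hypothetical; the task lengths are the ones mentioned in the post.

```python
import math


def tasks_to_fill_buffer(min_days: float, additional_days: float,
                         n_cpus: int, task_hours: float) -> int:
    """Rough number of tasks needed to fill the work buffer for one project.

    Simplified model (assumption): the client tops the queue up to
    (min_days + additional_days) of work per usable CPU, and every CPU-hour
    goes to this one project.
    """
    if task_hours <= 0 or n_cpus <= 0:
        raise ValueError("task_hours and n_cpus must be positive")
    buffer_cpu_hours = (min_days + additional_days) * 24 * n_cpus
    # In this model an empty queue still gets at least one task, even when a
    # single task is longer than the whole buffer.
    return max(1, math.ceil(buffer_cpu_hours / task_hours))


if __name__ == "__main__":
    # Jean-David's settings: at least 0.5 days, additional 1.0 days.
    # The 8-CPU host is a hypothetical example.
    for name, hours in [("DENIS (~70 min)", 70 / 60),
                        ("Einstein (~11 h)", 11.0),
                        ("CPDN WaH2 (~50 days)", 50 * 24)]:
        print(name, tasks_to_fill_buffer(0.5, 1.0, 8, hours))
```

The point the model makes is that short tasks multiply quickly (hundreds of 70-minute tasks fill 1.5 days on 8 CPUs), whereas a multi-week CPDN task exceeds the whole buffer on its own, so the days-of-work settings barely affect how many of those are fetched.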

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 68598 - Posted: 17 Mar 2023, 19:58:55 UTC - in response to Message 68596. Quote: "If these Windows work units are long running as you suggest, I hope there will be a mechanism in place on the server to ensure that everyone who wants some will get them, shared fairly and equally, instead of by greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline. I have no idea on that. My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. Slower machines will, I would guess, take over three months to complete these tasks, so setting the deadline at 3 months would stop some users from downloading them at all, because they wouldn't finish in time." They may do it the way they did for the nz25 batches: the development site spinups ran for 113 model months and took about 20 days on my i7-4790K. When they later sent out stash/ancil test nz25 batches to the dev site, they were for 25 model months and took less than a quarter of the time that the spinups did. The nz25 batches sent to the main CPDN site were also 25 model months. Just guessing, but I don't think 119-model-month batches will come to the main site.
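geophi's reasoning assumes run time scales roughly in proportion to the number of model months simulated. A minimal sketch of that arithmetic under that proportionality assumption (same machine, same number of concurrent tasks); the function name is hypothetical, and treating Dave's roughly 50-day East Asia test tasks as 119-model-month runs is an inference from geophi's last sentence, not something stated outright.

```python
def scaled_runtime_days(ref_days: float, ref_months: float,
                        target_months: float) -> float:
    """Scale a measured run time to a different number of model months.

    Assumes run time is proportional to simulated model months on the same
    machine with the same number of concurrent tasks (a simplification).
    """
    if ref_months <= 0:
        raise ValueError("ref_months must be positive")
    return ref_days * target_months / ref_months


if __name__ == "__main__":
    # geophi's nz25 spinup: 113 model months in about 20 days on an i7-4790K.
    nz25_main = scaled_runtime_days(20, 113, 25)
    print(f"nz25, 25 model months: {nz25_main:.1f} days "
          f"({25 / 113:.0%} of the spinup time)")
    # Dave's East Asia test tasks: about 50 days, assumed to be 119 model months.
    eas25_main = scaled_runtime_days(50, 119, 25)
    print(f"eas25, 25 model months: {eas25_main:.1f} days")
```

The ratio 25/113 is about 0.22, consistent with "less than a quarter of the time"; on the same assumption a 25-model-month East Asia task would take roughly 10 to 11 days on Dave's Ryzen.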

AndreyOR Joined: 12 Apr 21 Posts: 317 Credit: 14,840,128 RAC: 20,043	Message 68599 - Posted: 17 Mar 2023, 21:24:46 UTC Last modified: 17 Mar 2023, 21:25:45 UTC Quote: "My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks." While it was helpful to add a 30-day grace period in addition to the 30-day deadline for OIFS tasks during the storage outage, it seems to me that it should have been removed once things stabilized. There are still almost 200 PS tasks out and, from what I remember, the contract deadline was supposed to be the end of February. BL and regular OIFS apps also still have 100-150 tasks out each, although I'm not sure if 30 days have passed yet on those. Quote: "Just so I am not considered greedy: I do not wish to hog an unfair number of tasks, but OTOH, I do want a large enough number of tasks on my machine to coast me over the dead spots." I agree. Since work availability is not constant or consistent, I also like to have a large enough cache, but not so big that I can't finish all the work by the 30-day deadline. It's not always possible, as there are limits based on one's consecutive valid tasks.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856	Message 68671 - Posted: 18 Apr 2023, 19:41:08 UTC Quote: "If they do it the way they did for the nz25 batches, the development site spinups ran for 113 model months and took about 20 days on my i7-4790K. When they later sent out stash/ancil test nz25 batches to the dev site, they were for 25 model months and took less than a quarter of the time that the spinups did. The nz25 batches sent to the main cpdn site were also 25 model months. Just guessing, but I don't think 119 model month batches will come to the main site." I missed this in a post from George about a month ago. When the four I have running get nearer completion, I might ask what the plan is. I also missed noticing the long spinup NZ tasks, or I might have twigged that main site tasks might well be considerably shorter. Thanks for the reminder/hint by PM, George.

Hans Sveen Joined: 31 Aug 04 Posts: 5 Credit: 17,401,474 RAC: 5,243	Message 68922 - Posted: 23 Jun 2023, 11:38:45 UTC Last modified: 23 Jun 2023, 11:43:48 UTC Hello! Now that the testing is over and the "real" work starts, I have got some WUs from Batch #994 and, alas, so far several have errored out with Segment violation (Signal 11), and when trying to upload the result: "upload failure: <file_xfer_error>" "<error_code>-240 (stat() failed)</error_code>". Is it my computers or the units that have some serious errors? Example: https://www.cpdn.org/workunit.php?wuid=12216822 Hans Sveen Oslo, Norway

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,018,099 RAC: 20,856	Message 68923 - Posted: 23 Jun 2023, 12:25:01 UTC - in response to Message 68922. Last modified: 23 Jun 2023, 14:17:17 UTC The file transfer error is just because the model crashed before the zip files could be created. One of the other moderators had all the ones on his computer crash, though at least one was due to a power outage while the task was running. I have one running so far and it is about two hours in. It is really too early to say whether there are serious issues with these tasks. The ones of yours I looked at have all gone out again. I suspect the crashes you are experiencing, which all happen at roughly the same CPU time, occur at the end of the first model day, and as your machine has produced pretty consistent results in the past, it is probably something to do with the model. If so, we can expect a flurry of reports over the next day or two. The region being examined includes the Himalayas, and the researcher did say when the testing tasks went out that some of the tasks might be pushing the limits of the model. Edit: In about three quarters of an hour, I will be able to get some more work. (The one task I have running is under WINE in a Linux VM.) When the WCG work I have running finishes, I will shut down the Linux client and get some tasks under WINE on the host machine. That will give me a chance to see more than just one task at a time. Edit2: A further four tasks downloaded and have gotten past the point at which yours crashed (all are over one hour in). However, the sample size is still too small to draw any conclusions. How many tasks are actually running at once? Is anything else using a lot of memory? Those are two things that can potentially cause problems.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 68924 - Posted: 23 Jun 2023, 14:59:11 UTC I just got a WAH2 task on my Windows 10 machine. It failed after 3 minutes 31 seconds. Also, the out.zip failure message is having trouble uploading. I hit retry and it wants to wait. So I will wait.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 68927 - Posted: 23 Jun 2023, 16:28:36 UTC - in response to Message 68923. Looks like the regional model portion of the task takes 400+ MB of resident memory while the global portion of each task takes about 200 MB, so 600 to 700 MB total resident memory for each task. My Ryzen 5600 is running 3 at a time and it works out to about 8.5 days to complete them.
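geophi's figures make it easy to budget memory before raising the number of concurrent tasks. A minimal sketch using only the numbers in the post (600 to 700 MB per task, 3 tasks at once, about 8.5 days each on the Ryzen 5600); the function names are hypothetical, and the throughput figure assumes the tasks run continuously side by side.

```python
from typing import Tuple

# Per-task resident memory, from geophi's observation: ~400+ MB for the
# regional model plus ~200 MB for the global model, i.e. 600-700 MB per task.
PER_TASK_MB = (600, 700)


def memory_needed_mb(n_tasks: int) -> Tuple[int, int]:
    """Low/high resident-memory estimate (MB) for n concurrent tasks."""
    low, high = PER_TASK_MB
    return low * n_tasks, high * n_tasks


def tasks_per_day(n_concurrent: int, days_per_task: float) -> float:
    """Steady-state throughput when n tasks run side by side, each taking
    days_per_task of wall-clock time."""
    return n_concurrent / days_per_task


if __name__ == "__main__":
    low, high = memory_needed_mb(3)
    print(f"3 concurrent tasks: {low}-{high} MB resident")  # 1800-2100 MB
    print(f"throughput: {tasks_per_day(3, 8.5):.2f} tasks/day")  # 0.35
```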

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 68928 - Posted: 23 Jun 2023, 17:20:29 UTC Last modified: 23 Jun 2023, 17:24:34 UTC Yep, I had 4 WaH tasks and all crashed with a segV. Unfortunately they disappeared too quickly for me to see the detailed logs, but it was the model that crashed and not the wrapper process. Maybe it's a Win11 issue; AFAIK the model has not been recompiled for quite some time on Windows. Update: checking the WU, I see the previous host also failed the task and was running Win10. Unlike OIFS, there's very little returned to the task page, making it very difficult to diagnose the problem. --- CPDN Visiting Scientist

DadX Joined: 30 Aug 06 Posts: 27 Credit: 1,879,577 RAC: 1,213	Message 68929 - Posted: 23 Jun 2023, 17:41:36 UTC I had 2 wah2_eas25_a3ny_201511_25_994_012220188 tasks and both crashed after 2 minutes. It's having trouble uploading them. Regards, DadX

DadX Joined: 30 Aug 06 Posts: 27 Credit: 1,879,577 RAC: 1,213	Message 68930 - Posted: 23 Jun 2023, 17:43:56 UTC - in response to Message 68929. Correction: I had two tasks that failed after 2 minutes, wah2_eas25_a3ny_201511_25_994_012220188 and wah2_eas25_a1cm_199611_25_994_012217188. Regards, DadX

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 68931 - Posted: 23 Jun 2023, 18:05:30 UTC - in response to Message 68924. I am still waiting for the out.zip failure message file to upload. I checked my Windows 10 machine's hardware. I do not fully understand what these figures mean, especially why the first two are different. It has:
Installed: 16.0 GBytes
Total: 15.6 GBytes
Available: 8.78 GBytes
Virtual: 18.0 GBytes
Available Virtual: 10.1 GBytes
The computer is running 6 BOINC tasks (no CPDN at the moment). The BOINC client allows 7 tasks to run, but the various app_config files limit them to only six. The one for CPDN allows only one of those to run at a time.
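For reference, these labels in Windows System Information have standard meanings: "Installed" is the physical RAM fitted, "Total" is what remains usable after hardware reservations (firmware, devices and any integrated graphics can claim some), and "Total Virtual" is total physical memory plus the page file. A small sketch of the arithmetic on the figures above; the 0.6 to 0.7 GB per task is taken from geophi's estimate of 600 to 700 MB per WaH2 task, and the result is an optimistic upper bound because it ignores other programs growing their memory use.

```python
import math

# Figures reported by Windows System Information on Jean-David's machine (GB).
installed = 16.0
total_physical = 15.6
available_physical = 8.78
total_virtual = 18.0

hardware_reserved = installed - total_physical  # ~0.4 GB held back by hardware
page_file = total_virtual - total_physical      # ~2.4 GB page file

# How many ~0.6-0.7 GB WaH2 tasks would fit in the currently free RAM
# (optimistic: ignores other programs growing their memory use).
fit_low = math.floor(available_physical / 0.7)
fit_high = math.floor(available_physical / 0.6)

print(f"hardware reserved: {hardware_reserved:.1f} GB")
print(f"page file: {page_file:.1f} GB")
print(f"free RAM fits about {fit_low}-{fit_high} more tasks")
```

On these numbers, memory is unlikely to be what limits a single CPDN task on this machine.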

wateroakley Joined: 6 Aug 04 Posts: 195 Credit: 28,343,308 RAC: 10,415	Message 68932 - Posted: 23 Jun 2023, 18:12:06 UTC Five tasks downloaded on this Win 11 PC. Four crashed almost immediately, with the event log complaining about missing output files. The four took a little while to report, and stderr complains about Segment violations. Task number 5 is ticking along nicely. After four hours the task is 2.1 per cent through, giving a run-time estimate of 189 hours, or about 8 days.
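The run-time estimate in this post is a straight-line extrapolation from progress so far. A minimal sketch, assuming progress advances at a constant rate (only approximately true for these models); the function name is hypothetical. With exactly 4.0 hours and 2.1 per cent it gives about 190 hours, close to the post's 189; the small gap presumably comes from rounding of the elapsed time or the progress percentage.

```python
def projected_runtime_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linear extrapolation of total run time from progress so far.

    Assumes progress advances at a constant rate, which is only roughly true
    for climate model tasks (BOINC's own estimate may differ).
    """
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * 100.0 / percent_done


if __name__ == "__main__":
    total = projected_runtime_hours(4.0, 2.1)
    print(f"{total:.0f} hours, about {total / 24:.1f} days")  # 190 hours, about 7.9 days
```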