Message boards : Number crunching : Late November batch of Windows work
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
OK, getting replies back, so several things: The following is mostly intended for the mods, but I've included it all: As the number of failures appeared to be steadily increasing for batch 238 (wah2_ri version 7.08), I have now removed the remaining workunits in this batch from the queue. However, I am still very interested in how the workunits that have already been taken up progress. If you have any of these workunits running on your machines and are able to capture the working directories or .out files, this would be very useful to us. Also, if you have other information on the failures (when they are failing, how long they ran for), this would also be helpful. Similarly, if you have information on these workunits running successfully, that would also be of benefit. Part of a separate email: This is the latest version of the newest incarnation of the W@H application: region-independent, start-date-independent and length-independent, with the latest land cover model. |
Send message Joined: 5 Jul 09 Posts: 63 Credit: 6,091,274 RAC: 0 |
OK, getting replies back, so several things: Just for those who have not done this before (like me): I have two of these, one of which has just started running; it failed on a previous machine. Where and when are the files present, and where would you like them sent? I am assuming this is only for files that fail. Thank you. Kevin |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Information about speed of progress and size of zip files when created may also be of interest. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766 |
Currently running four on my i5 machine: one repeat (computing error after 32 secs!) and three new ones. Up to about 7% after 12 hours. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Kevin, the files are in ..../BOINC/projects/climateprediction.net, and the main files of interest are those ending with _x.zip, where x is an integer between 1 and 13. If things work normally, these get transferred to the relevant server automatically, but I am currently 58% of the way through transferring about 1GB worth of files to Andy for the beta site, which is still down. |
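For anyone wanting to check which of those files are present and how big they are, here is a minimal Python sketch. It assumes only what the post says (names ending in _x.zip with x from 1 to 13); the function name and the `project_dir` argument are made up for illustration, and you would pass in your own full path to the climateprediction.net project folder.

```python
import re
from pathlib import Path

# Matches names ending in _<number>.zip, e.g. "something_3.zip"
ZIP_PATTERN = re.compile(r"_(\d+)\.zip$")

def find_upload_zips(project_dir, lo=1, hi=13):
    """Return (name, size_in_bytes) for files ending in _x.zip with lo <= x <= hi."""
    found = []
    for path in sorted(Path(project_dir).iterdir()):
        match = ZIP_PATTERN.search(path.name)
        if path.is_file() and match and lo <= int(match.group(1)) <= hi:
            found.append((path.name, path.stat().st_size))
    return found
```

Files whose trailing number falls outside 1 to 13 are skipped, so other zips in the same folder should not show up in the result.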
Send message Joined: 5 Jul 09 Posts: 63 Credit: 6,091,274 RAC: 0 |
Dave, OK thanks, mine is running fine so far. If I need it I will shout for the relevant address. Kevin |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,334,752 RAC: 12,890 |
I have 6 of these tasks running on my i7 Computer. Estimated time for the tasks was 122 hours but 9% has taken 19 hours for the first task, so they will probably take just over 200 hours. First trickle at about 9%:
Time Sent (UTC)       Host ID  Result ID  Result Name                           Phase  Timestep  CPU Time (sec)  Average (sec/TS)
30 Nov 2015 16:03:43  1305473  19108850   wah2_eu25_h9ig_197912_12_010206730_0  1      11,819    64,677          5.4723
I have 2 more tasks at 8.4%, so they should send a first trickle soon. |
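The "just over 200 hours" figure above is a straight-line extrapolation from progress so far. A small Python sketch of that arithmetic, assuming progress stays roughly linear for the rest of the run (the function name is only illustrative):

```python
def estimate_total_hours(elapsed_hours, percent_done):
    """Linear extrapolation: total run time if progress continues at the same rate."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * 100.0 / percent_done

total = estimate_total_hours(19, 9)
print(f"{total:.0f} hours total, {total - 19:.0f} remaining")
# prints "211 hours total, 192 remaining"
```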
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,334,752 RAC: 12,890 |
The next 2 tasks sent their first trickle at 8.6%, 18 hours 45 mins.
Time Sent (UTC)       Host ID  Result ID  Result Name                           Phase  Timestep  CPU Time (sec)  Average (sec/TS)
30 Nov 2015 17:03:50  1305473  19110047   wah2_eu25_i7gm_198712_12_010207898_0  1      11,819    65,809          5.5681
30 Nov 2015 17:03:50  1305473  19107829   wah2_eu25_h3bh_197312_12_010205732_0  1      11,819    65,614          5.5516
Hope this info. is a help. |
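The Average (sec/TS) column in these trickle rows appears to be CPU time divided by timestep. A quick Python check on the two rows above, assuming that is how the column is derived (the function name is ours, not the project's):

```python
def seconds_per_timestep(cpu_seconds, timestep):
    """Average CPU seconds spent per model timestep."""
    return cpu_seconds / timestep

# (result name, timestep, CPU time in seconds) copied from the trickle rows above
rows = [
    ("wah2_eu25_i7gm_198712_12_010207898_0", 11819, 65809),
    ("wah2_eu25_h3bh_197312_12_010205732_0", 11819, 65614),
]

for name, timestep, cpu_seconds in rows:
    print(f"{name}: {seconds_per_timestep(cpu_seconds, timestep):.4f} sec/TS")
```

Both values come out matching the reported 5.5681 and 5.5516 to four decimal places, which makes the figure handy for comparing machines.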
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
hello les my machine just attempted to download 4 of the PNW workunits...all 4 failed on download... frank |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766 |
These units have sent two trickles each so far. Taking about 4.8s/ts (i5, 3.5GHz). |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
well, another download attempt was more successful...4 out of 5 workunits are now running...the 5th failed the download...all are PNW units frank |
Send message Joined: 21 Aug 11 Posts: 10 Credit: 26,553,404 RAC: 1,491 |
Data from downloaded and running units.
Latest trickles received:
Result ID  Result Name                           Phase  Timestep  CPU Time (sec)  Average (sec/TS)
19107519   wah2_eu25_h1ao_197112_12_010205426_0  1      23,339    161,793         6.9323
19107519   wah2_eu25_h1ao_197112_12_010205426_0  1      11,819    82,173          6.9526
19107486   wah2_eu25_h0ik_197012_12_010205393_0  1      23,339    161,913         6.9374
19107486   wah2_eu25_h0ik_197012_12_010205393_0  1      11,819    82,149          6.9506
19107361   wah2_eu25_f2bd_195212_12_010202599_1  1      23,339    161,386         6.9149
19107361   wah2_eu25_f2bd_195212_12_010202599_1  1      11,819    82,047          6.9420
Zip file sizes:
16.407  wah2_eu25_j1ic_199112_12_010208514.zip
15.844  wah2_eu25_c0dm_192012_12_010197870.zip
15.631  wah2_eu25_f2bd_195212_12_010202599.zip
16.224  wah2_eu25_g2hd_196212_12_010204179.zip
15.735  wah2_eu25_g8il_196812_12_010205096.zip
15.629  wah2_eu25_h0ib_197012_12_010205384.zip
15.629  wah2_eu25_h0ik_197012_12_010205393.zip
15.631  wah2_eu25_h1ao_197112_12_010205426.zip |
Send message Joined: 1 Nov 06 Posts: 11 Credit: 579,556 RAC: 1,322 |
How often do these tasks checkpoint? Looking at the task running now, it seems it's been over 50 minutes of CPU time since the last checkpoint. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
How often do these tasks checkpoint? Looking at the task running now, it seems it's been over 50 minutes of CPU time since the last checkpoint. All CPDN models checkpoint at fixed points in the calculation. For these models it's at the end of each model day, with trickles and uploads being made every 30 model days. My i5 has a checkpoint interval of just under 50 minutes and for the Q6600 it's around 70 minutes. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
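Since a checkpoint falls at the end of each model day and a trickle every 30 model days, the time between trickles is roughly 30 checkpoint intervals. A rough Python sketch of that, assuming every model day takes about the same CPU time (the constant and function names are illustrative only):

```python
MODEL_DAYS_PER_TRICKLE = 30  # trickle/upload cadence stated in the post above

def hours_between_trickles(checkpoint_minutes):
    """Approximate CPU hours per trickle, given the minutes taken per model day."""
    return checkpoint_minutes * MODEL_DAYS_PER_TRICKLE / 60.0

print(hours_between_trickles(50))  # 50-minute checkpoints -> 25.0
print(hours_between_trickles(70))  # 70-minute checkpoints -> 35.0
```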
Send message Joined: 1 Nov 06 Posts: 11 Credit: 579,556 RAC: 1,322 |
Cheers, thanks for the reply! |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,824,485 RAC: 4,956 |
Two of my WAH2 models from the 29 November batch have completed. Some others have failed early on: at least one of those has made some progress on another computer, which makes me wonder whether they don't like being run with too many in parallel (my habit is to run 25% CPUs, except when getting new work, when I put CPUs back to 100%; the crashes all occurred during the 100% period). |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,334,752 RAC: 12,890 |
I'm running at 75% CPUs (6 tasks) during the day and 100% CPUs (8 tasks) during the night on my Win 10 64-bit i7. One WAH2 failed having trickled once. Another WAH2 failed after 8 trickles. Both failed during the day, so not at 100% CPUs. The other WAH2s are at 70% completion. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Oops. Wrong models. I'll move this to the CM3n thread. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Two of my WAH2 models from the 29 November batch have completed. Some others have failed early on: at least one of those has made some progress on another computer, which makes me wonder whether they don't like being run with too many in parallel (my habit is to run 25% CPUs, except when getting new work, when I put CPUs back to 100%; the crashes all occurred during the 100% period). The memory load for WAH2 seems to be much higher than was the case for previous applications. My wah2_eu25 tasks have a total working set size of around 460MB and I've changed the project resource shares on my Q6600 (which only has 2GB of RAM) to prevent it from running more than one of these tasks. |
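As a rough guide to how many of these tasks a machine can hold in memory, here is a small Python sketch. It only divides available RAM by the working set size quoted above; the `reserved_mb` figure for the OS and other programs is a made-up assumption, and the result is an upper bound on memory grounds only, not a recommendation (the poster above chose to run just one task on 2GB).

```python
def max_tasks_by_memory(total_mb, working_set_mb, reserved_mb=0):
    """Upper bound on concurrent tasks that fit in RAM after reserving some for other use."""
    usable_mb = max(total_mb - reserved_mb, 0)
    return usable_mb // working_set_mb

# 2GB machine, ~460MB per wah2_eu25 task, 1GB reserved (hypothetical figure)
print(max_tasks_by_memory(2048, 460, reserved_mb=1024))  # -> 2
```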