Message boards : Number crunching : New Work Announcements 2024
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Now a few of my computers according to the error log are limited to a quota of 1 task for the day. I don't believe my computers are the issue. I believe a batch of bad tasks are what put me in this position. Any way to fix this? The limitation will be lifted once your boxes return some completed tasks. |
Joined: 24 Dec 19 Posts: 32 Credit: 41,479,985 RAC: 18,320 |
Why am I getting the same daily quota message on a computer that is running 27 tasks? I get the quota message then an hour later the computer gets a task. Makes no sense to me.
1/16/2024 1:13:06 PM | climateprediction.net | Finished upload of wah2_eas25_h2wq_201612_24_1001_012233736_1_r485846348_2.zip (98883178 bytes)
1/16/2024 1:19:46 PM | climateprediction.net | Finished upload of wah2_eas25_n2b7_201412_24_1002_012239009_0_r1774178012_2.zip (99158378 bytes)
1/16/2024 1:22:30 PM | | Project communication failed: attempting access to reference site
1/16/2024 1:22:31 PM | | Internet access OK - project servers may be temporarily down.
1/16/2024 1:23:07 PM | climateprediction.net | Started upload of wah2_eas25_n47h_201912_24_1002_012241467_0_r1092062945_2.zip
1/16/2024 1:24:21 PM | | Project communication failed: attempting access to reference site
1/16/2024 1:24:22 PM | | Internet access OK - project servers may be temporarily down.
1/16/2024 1:34:04 PM | climateprediction.net | Finished upload of wah2_eas25_n47h_201912_24_1002_012241467_0_r1092062945_2.zip (99094328 bytes)
1/16/2024 1:51:53 PM | climateprediction.net | Started upload of wah2_eas25_n246_201412_24_1002_012238756_0_r1602600301_2.zip
1/16/2024 1:56:05 PM | climateprediction.net | Sending scheduler request: To send trickle-up message.
1/16/2024 1:56:05 PM | climateprediction.net | Requesting new tasks for CPU
1/16/2024 1:56:07 PM | climateprediction.net | Scheduler request completed: got 0 new tasks
1/16/2024 1:56:07 PM | climateprediction.net | No tasks sent
1/16/2024 1:56:07 PM | climateprediction.net | This computer has finished a daily quota of 1 tasks
1/16/2024 1:56:07 PM | climateprediction.net | Project requested delay of 3636 seconds
1/16/2024 2:03:31 PM | climateprediction.net | Finished upload of wah2_eas25_n246_201412_24_1002_012238756_0_r1602600301_2.zip (98890861 bytes)
1/16/2024 2:11:07 PM | climateprediction.net | Started upload of wah2_eas25_n47b_201912_24_1002_012241461_0_r1749094426_2.zip
1/16/2024 2:11:10 PM | climateprediction.net | Started upload of wah2_eas25_h2zu_201612_24_1001_012233848_1_r1912650644_2.zip
1/16/2024 2:20:05 PM | climateprediction.net | Started upload of wah2_eas25_h077_200912_24_1001_012230225_1_r1947292413_2.zip
1/16/2024 2:20:58 PM | climateprediction.net | Started upload of wah2_eas25_n2ru_201612_24_1002_012239608_0_r1644820096_2.zip
1/16/2024 2:22:10 PM | climateprediction.net | Finished upload of wah2_eas25_h2zu_201612_24_1001_012233848_1_r1912650644_2.zip (98543726 bytes)
1/16/2024 2:22:18 PM | climateprediction.net | Finished upload of wah2_eas25_n47b_201912_24_1002_012241461_0_r1749094426_2.zip (98872501 bytes)
1/16/2024 2:27:40 PM | climateprediction.net | Started upload of wah2_eas25_n3eq_201712_24_1002_012240432_1_r766709515_2.zip
1/16/2024 2:31:43 PM | climateprediction.net | Finished upload of wah2_eas25_h077_200912_24_1001_012230225_1_r1947292413_2.zip (99062272 bytes)
1/16/2024 2:33:03 PM | climateprediction.net | Finished upload of wah2_eas25_n2ru_201612_24_1002_012239608_0_r1644820096_2.zip (98983890 bytes)
1/16/2024 2:37:09 PM | climateprediction.net | Finished upload of wah2_eas25_n3eq_201712_24_1002_012240432_1_r766709515_2.zip (99020860 bytes)
1/16/2024 2:56:48 PM | climateprediction.net | Sending scheduler request: To send trickle-up message.
1/16/2024 2:56:48 PM | climateprediction.net | Requesting new tasks for CPU
1/16/2024 2:56:50 PM | climateprediction.net | Scheduler request completed: got 1 new tasks
1/16/2024 2:56:50 PM | climateprediction.net | Project requested delay of 3636 seconds
1/16/2024 2:56:52 PM | climateprediction.net | Started download of wah2_eas25_g2j2_201512_24_1003_012245340.zip
1/16/2024 2:56:52 PM | climateprediction.net | Started download of ic19610201_12_N96.gz
1/16/2024 2:56:52 PM | climateprediction.net | Started download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_SST_2009-01-01_2022-12-30.gz
1/16/2024 2:56:52 PM | climateprediction.net | Started download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_ice_2009-01-01_2022-12-30.gz
1/16/2024 2:56:54 PM | climateprediction.net | Finished download of wah2_eas25_g2j2_201512_24_1003_012245340.zip (19421 bytes)
1/16/2024 2:56:59 PM | climateprediction.net | Finished download of ic19610201_12_N96.gz (1312172 bytes)
1/16/2024 2:57:03 PM | climateprediction.net | Finished download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_ice_2009-01-01_2022-12-30.gz (5052864 bytes)
1/16/2024 2:58:02 PM | climateprediction.net | Finished download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_SST_2009-01-01_2022-12-30.gz (51595607 bytes)
1/16/2024 2:58:03 PM | climateprediction.net | Starting task wah2_eas25_g2j2_201512_24_1003_012245340_0
1/16/2024 3:15:35 PM | | Project communication failed: attempting access to reference site
1/16/2024 3:15:36 PM | | Internet access OK - project servers may be temporarily down. |
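For anyone wanting to pull the scheduler outcomes out of a long event log like the one above, here is a minimal Python sketch. It assumes the log has one entry per line in the `timestamp | project | message` layout seen in the paste, with US-style dates; the names `parse_line`, `summarize` and `SAMPLE` are made up for this illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Assumed layout, based only on the pasted lines:
#   "<m/d/yyyy h:mm:ss AM|PM> | <project, may be empty> | <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) \| "
    r"(?P<project>[^|]*)\| (?P<msg>.*)$"
)
GOT_RE = re.compile(r"Scheduler request completed: got (\d+) new tasks")


def parse_line(line):
    """Return (datetime, project, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%m/%d/%Y %I:%M:%S %p")
    return ts, m.group("project").strip(), m.group("msg")


def summarize(lines):
    """List each scheduler reply and whether a daily-quota notice followed it."""
    results = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _project, msg = parsed
        got = GOT_RE.search(msg)
        if got:
            results.append(
                {"time": ts, "granted": int(got.group(1)), "quota_hit": False}
            )
        elif "daily quota" in msg and results:
            # Attach the quota notice to the most recent scheduler reply.
            results[-1]["quota_hit"] = True
    return results


# A few lines copied from the log above.
SAMPLE = """\
1/16/2024 1:56:05 PM | climateprediction.net | Requesting new tasks for CPU
1/16/2024 1:56:07 PM | climateprediction.net | Scheduler request completed: got 0 new tasks
1/16/2024 1:56:07 PM | climateprediction.net | This computer has finished a daily quota of 1 tasks
1/16/2024 1:56:07 PM | climateprediction.net | Project requested delay of 3636 seconds
1/16/2024 2:56:50 PM | climateprediction.net | Scheduler request completed: got 1 new tasks
1/16/2024 3:15:35 PM | | Project communication failed: attempting access to reference site
"""

if __name__ == "__main__":
    for entry in summarize(SAMPLE.splitlines()):
        print(entry["time"], entry["granted"], entry["quota_hit"])
```

On the sample lines this reports one refused request flagged with the quota notice and, an hour later, one request that was granted a task, which is the pattern described in the post.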
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I don't know the answer to that one. If neither Richard nor any of the other moderators can answer this, I will ask Andy at the project. The moderators are all volunteer crunchers who have been around a while. I think I am the youngest in terms of time moderating by quite a few years. |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I don't know the answer to that one. If neither Richard nor any of the other moderators can answer this, I will ask Andy at the project. The moderators are all volunteer crunchers who have been around a while. I think I am the youngest in terms of time moderating by quite a few years. Perhaps a trickle-up makes the server forgive your host? It does show you're managing a task. |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
Now a few of my computers according to the error log are limited to a quota of 1 task for the day. I don't believe my computers are the issue. I believe a batch of bad tasks are what put me in this position. Any way to fix this? Yeah, this is a real problem. I have a 16-core machine with this issue. And even the faster machines will take 6 days to complete a task. So 16 cores...with a single task...and each day it gets to add one more. Except it got an error that second day. Now it is 16 days before it fills up the cores, unless there are more errors in the meantime. Perhaps after 6 days+, something changes? But with task errors mixed in, who knows? The server side rules for this need to be modified. Other projects don't use these same impossible rules. |
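A rough way to check the arithmetic in the post above is a small Python simulation. This is only a sketch under stated assumptions: one task per core, every task takes exactly 6 days and succeeds, and the host fetches at most `quota` new tasks per day. The `grow` rule that raises the quota after a success is hypothetical; it is not taken from this thread or from the BOINC server code.

```python
def simulate(cores=16, task_days=6, days=20, grow=lambda quota: quota):
    """Return the number of running tasks at the end of each day.

    Assumptions (not from the project): the quota starts at 1, no task
    ever fails, and `grow` is applied to the quota once per completed task.
    """
    quota = 1
    finish_days = []        # completion day of every task still running
    running_per_day = []
    for day in range(1, days + 1):
        done = [d for d in finish_days if d <= day]
        finish_days = [d for d in finish_days if d > day]
        for _ in done:
            quota = grow(quota)          # hypothetical reward for a success
        free_cores = cores - len(finish_days)
        fetched = min(quota, free_cores)
        finish_days.extend([day + task_days] * fetched)
        running_per_day.append(len(finish_days))
    return running_per_day


if __name__ == "__main__":
    fixed = simulate()                                   # quota stays at 1
    doubling = simulate(grow=lambda q: min(q * 2, 16))   # assumed doubling rule
    print("fixed quota:   ", fixed)
    print("doubling quota:", doubling)
```

Under these assumptions a quota that stays at 1 never fills 16 cores: the host levels off at 6 running tasks, because after day 6 one task finishes for every one fetched. With the assumed doubling rule the cores fill on day 9. Whether the real server behaves like either case is exactly the open question in the post.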
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
The server side rules for this need to be modified. Other projects don't use these same impossible rules. I disagree - the rule protects the server from wasting time sending out tasks to a machine likely to break the next one. It doesn't matter if it's the task's fault, the point is it's better off sending the next task to another machine. |
Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080 |
Why am I getting the same daily quota message on a computer that is running 27 tasks? Because the counter resets every 24 hours and gets one new task every day. |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
#1005, the NZ batch is being uploaded to the server about now so those tasks should start appearing in the mix soon. I will edit post when I know how many there are. |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
With respect to the error rates, 1002 has 10, 1001 has 5, and 1003 and 1004 have one between them, almost certainly reflecting that they went out later. 1001 and 1002 both still have 58% of tasks to be sent out, so I am hesitant about drawing any conclusions about failure rates at the moment. I would expect the different types of forcing being used in each to affect the failure rate during the run, but not the failure rate right at the start of the task, which is where all the failed tasks I have looked at have failed. I would, however, expect the changes Glenn is making to the code to significantly reduce the error rate. Even then, there will be a very small number of tasks where the model runs out of control, leading to an impossible climate. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,767,175 RAC: 3,168 |
Adding another failure: wah2_eas25_g3ue_201812_24_1003_012247044_0
This is my travelling laptop, so I don't usually run CPDN on it (because of the restarts). I'm not going anywhere until winter is over, so I thought I'd try it static to observe the 'daily quota' issue. Task failed after 2 minutes with
<stderr_out>
<![CDATA[
<stderr_txt>
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
14:30:48 (16964): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=16884, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=16964, selfPID=6608, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
14:30:52 (6608): called boinc_finish(0)
</stderr_txt>
'Signal 11' might be expected under Linux, but this is native Windows 11. Poking around before reporting it, I found a file boinc_ufs_cpdnout_out.zip in the slot directory, but the name is confusing: it's actually a plain-text file containing the single line
<status>0</status>
Which doesn't help, either. Willing to delve deeper if anyone has any questions/suggestions. |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I am guessing this is the problem of the models crashing at the point where they go from the global to the regional model. I am hoping the work Glenn is doing on the code will address this. Unfortunately, time constraints meant the work had to go out before he had finished. |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
The server side rules for this need to be modified. Other projects don't use these same impossible rules. I disagree - the rule protects the server from wasting time sending out tasks to a machine likely to break the next one. It doesn't matter if it's the task's fault, the point is it's better off sending the next task to another machine. I disagree with your disagreement. There are still 16k tasks just waiting to be sent. There are no "other machines" at this point. Edit: And there is no harm sending a task to a bad machine. It just gets resent to the next. This is a feature, not a fault. |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Other projects don't use these same impossible rules. I can't remember which but at least one other project does. I don't know enough about BOINC server to say whether it is possible to turn this feature on for some task types and not others but given the issue with the missing libraries problem for Linux tasks, even when the number of resends was upped to five, there were still a lot of tasks going to hard fail because of machines that crash everything. For me that is a valid reason to use that part of the server software. At the current rate there are over sixty tasks going out every hour, often over a hundred. When the first tasks start finishing, that will go up. Long story short, we will never all agree on exactly what the balance should be. I don't know if there are any tweaks that can make the penalties for tasks failing less severe? |
Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Since there are no Linux tasks available and still 16k+ tasks unsent, I turned on Wine. OK? Wah 8.24 is from 2016 and is for Windows 2000, XP, Vista, 7 and 8. Is that a factor? Got 3 tasks running fine with Win7. |
Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
Other projects don't use these same impossible rules. I can't remember which but at least one other project does. I don't know enough about BOINC server to say whether it is possible to turn this feature on for some task types and not others but given the issue with the missing libraries problem for Linux tasks, even when the number of resends was upped to five, there were still a lot of tasks going to hard fail because of machines that crash everything. The 1/day rule doesn't change the amount of tasks that "hard fail". It just delays the eventual result. And delays the valid completion of the rest. |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Since there are no Linux tasks available and still 16k+ tasks unsent, I turned on Wine. OK? Yes, OK to turn on WINE. That is how I am running Windows tasks at the moment. The ones on the Testing branch I am doing in a VM so as not to give a false picture, as some of the memory errors seem to be protected against with WINE. I get about a 20% performance hit if I run tasks using Windows in a VM compared with using WINE. Others may get a better or worse comparison depending on details of their machines. I haven't seen evidence of the Windows version affecting things but that doesn't totally rule it out. |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
The 1/day rule doesn't change the amount of tasks that "hard fail". It just delays the eventual result. And delays the valid completion of the rest. It does if there are machines trashing everything, and looking at some of the machines that crop up in the hard fail lists, to my way of thinking it is worth weeding them out. Not that I have any say in the matter either way! I also stand to be corrected by Glenn or Richard if my understanding of how it is working is incorrect. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I have received three tasks, one at a time, that are all working fine on my Windows 10 machine. One is 1002 and has returned three trickles, and two are 1003; one of those has returned a trickle and the other has not yet. They are all running at once and they are predicted to take almost 10 days.
22370239 12247973 17 Jan 2024, 15:55:24 UTC 16 May 2024, 15:55:24 UTC In progress --- --- --- Weather At Home 2 (wah2) v8.24 windows_intelx86
22369242 12246987 16 Jan 2024, 16:24:45 UTC 15 May 2024, 16:24:45 UTC In progress --- --- 1,678.16 Weather At Home 2 (wah2) v8.24 windows_intelx86
22359236 12238208 16 Jan 2024, 0:51:57 UTC 15 May 2024, 0:51:57 UTC In progress --- --- 2,506.49 Weather At Home 2 (wah2) v8.24 windows_intelx86 |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I am going to open a new thread for the East Asia batches 1001-4, to free this thread for new work announcements rather than discussion. It would be good if anyone starting discussions for subsequent batches, such as the NZ ones that should appear tomorrow, could do the same. Thank you. |
Joined: 29 Oct 17 Posts: 1052 Credit: 16,728,373 RAC: 12,646 |
Richard (or anyone else), don't waste your time looking into these segmentation failures. I know exactly where the problem is in the code; I've been working on this for weeks. The same code works fine under Linux but fails on Windows (same compiler too). Am trying to find a workaround that doesn't involve rewriting the code too much. As I know you're technically minded: it relates to the old way in which Fortran was coded for low-memory machines years ago, where arrays were "misused" and shared between data of different types. A very large REAL array is being equivalenced to both an integer and a logical array. It should work (and does on Linux) but we get a bad memory address under Windows (which only serves to reinforce my dislike of Windows :P) Glenn --- CPDN Visiting Scientist |
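For readers who have not met this kind of storage sharing, here is a loose Python analogy of one block of memory viewed as reals, integers and logicals at once. It is only an illustration: the buffer size, the names and the "nonzero means true" convention are assumptions made for this sketch, it is not the CPDN Fortran code, and it does not reproduce or explain the Windows failure.

```python
# One shared block of memory, standing in for the storage that several
# differently typed Fortran arrays would be equivalenced onto.
storage = bytearray(8)

# Two typed views over the SAME bytes (native 4-byte float and 4-byte int).
reals = memoryview(storage).cast("f")
ints = memoryview(storage).cast("i")

# Writing through the REAL view changes what the INTEGER view sees.
reals[0] = 1.0
word_as_int = ints[0]          # bit pattern of 1.0 read back as an integer

# Writing through the INTEGER view changes what the REAL view sees.
ints[1] = 0x40000000           # bit pattern of the 32-bit float 2.0
word_as_real = reals[1]

# A "logical" view, using an assumed convention that nonzero means true.
word_as_logical = ints[0] != 0

if __name__ == "__main__":
    print(hex(word_as_int), word_as_real, word_as_logical)
```

The point of the sketch is that no copy is made: all views address the same bytes, so any code using one view relies on the others being laid out at exactly the addresses it expects.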
©2024 cpdn.org