New Work Announcements 2024

Message boards : Number crunching : New Work Announcements 2024
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70124 - Posted: 16 Jan 2024, 22:02:39 UTC

Now a few of my computers, according to the error log, are limited to a quota of 1 task for the day. I don't believe my computers are the issue. I believe a batch of bad tasks is what put me in this position. Any way to fix this?

The limitation will be lifted once your boxes return some completed tasks.
ID: 70124
ChelseaOilman

Joined: 24 Dec 19
Posts: 32
Credit: 40,882,329
RAC: 85,111
Message 70125 - Posted: 16 Jan 2024, 22:27:32 UTC

Why am I getting the same daily quota message on a computer that is running 27 tasks?
I get the quota message then an hour later the computer gets a task. Makes no sense to me.


1/16/2024 1:13:06 PM | climateprediction.net | Finished upload of wah2_eas25_h2wq_201612_24_1001_012233736_1_r485846348_2.zip (98883178 bytes)
1/16/2024 1:19:46 PM | climateprediction.net | Finished upload of wah2_eas25_n2b7_201412_24_1002_012239009_0_r1774178012_2.zip (99158378 bytes)
1/16/2024 1:22:30 PM | | Project communication failed: attempting access to reference site
1/16/2024 1:22:31 PM | | Internet access OK - project servers may be temporarily down.
1/16/2024 1:23:07 PM | climateprediction.net | Started upload of wah2_eas25_n47h_201912_24_1002_012241467_0_r1092062945_2.zip
1/16/2024 1:24:21 PM | | Project communication failed: attempting access to reference site
1/16/2024 1:24:22 PM | | Internet access OK - project servers may be temporarily down.
1/16/2024 1:34:04 PM | climateprediction.net | Finished upload of wah2_eas25_n47h_201912_24_1002_012241467_0_r1092062945_2.zip (99094328 bytes)
1/16/2024 1:51:53 PM | climateprediction.net | Started upload of wah2_eas25_n246_201412_24_1002_012238756_0_r1602600301_2.zip
1/16/2024 1:56:05 PM | climateprediction.net | Sending scheduler request: To send trickle-up message.
1/16/2024 1:56:05 PM | climateprediction.net | Requesting new tasks for CPU
1/16/2024 1:56:07 PM | climateprediction.net | Scheduler request completed: got 0 new tasks
1/16/2024 1:56:07 PM | climateprediction.net | No tasks sent
1/16/2024 1:56:07 PM | climateprediction.net | This computer has finished a daily quota of 1 tasks
1/16/2024 1:56:07 PM | climateprediction.net | Project requested delay of 3636 seconds
1/16/2024 2:03:31 PM | climateprediction.net | Finished upload of wah2_eas25_n246_201412_24_1002_012238756_0_r1602600301_2.zip (98890861 bytes)
1/16/2024 2:11:07 PM | climateprediction.net | Started upload of wah2_eas25_n47b_201912_24_1002_012241461_0_r1749094426_2.zip
1/16/2024 2:11:10 PM | climateprediction.net | Started upload of wah2_eas25_h2zu_201612_24_1001_012233848_1_r1912650644_2.zip
1/16/2024 2:20:05 PM | climateprediction.net | Started upload of wah2_eas25_h077_200912_24_1001_012230225_1_r1947292413_2.zip
1/16/2024 2:20:58 PM | climateprediction.net | Started upload of wah2_eas25_n2ru_201612_24_1002_012239608_0_r1644820096_2.zip
1/16/2024 2:22:10 PM | climateprediction.net | Finished upload of wah2_eas25_h2zu_201612_24_1001_012233848_1_r1912650644_2.zip (98543726 bytes)
1/16/2024 2:22:18 PM | climateprediction.net | Finished upload of wah2_eas25_n47b_201912_24_1002_012241461_0_r1749094426_2.zip (98872501 bytes)
1/16/2024 2:27:40 PM | climateprediction.net | Started upload of wah2_eas25_n3eq_201712_24_1002_012240432_1_r766709515_2.zip
1/16/2024 2:31:43 PM | climateprediction.net | Finished upload of wah2_eas25_h077_200912_24_1001_012230225_1_r1947292413_2.zip (99062272 bytes)
1/16/2024 2:33:03 PM | climateprediction.net | Finished upload of wah2_eas25_n2ru_201612_24_1002_012239608_0_r1644820096_2.zip (98983890 bytes)
1/16/2024 2:37:09 PM | climateprediction.net | Finished upload of wah2_eas25_n3eq_201712_24_1002_012240432_1_r766709515_2.zip (99020860 bytes)
1/16/2024 2:56:48 PM | climateprediction.net | Sending scheduler request: To send trickle-up message.
1/16/2024 2:56:48 PM | climateprediction.net | Requesting new tasks for CPU
1/16/2024 2:56:50 PM | climateprediction.net | Scheduler request completed: got 1 new tasks
1/16/2024 2:56:50 PM | climateprediction.net | Project requested delay of 3636 seconds
1/16/2024 2:56:52 PM | climateprediction.net | Started download of wah2_eas25_g2j2_201512_24_1003_012245340.zip
1/16/2024 2:56:52 PM | climateprediction.net | Started download of ic19610201_12_N96.gz
1/16/2024 2:56:52 PM | climateprediction.net | Started download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_SST_2009-01-01_2022-12-30.gz
1/16/2024 2:56:52 PM | climateprediction.net | Started download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_ice_2009-01-01_2022-12-30.gz
1/16/2024 2:56:54 PM | climateprediction.net | Finished download of wah2_eas25_g2j2_201512_24_1003_012245340.zip (19421 bytes)
1/16/2024 2:56:59 PM | climateprediction.net | Finished download of ic19610201_12_N96.gz (1312172 bytes)
1/16/2024 2:57:03 PM | climateprediction.net | Finished download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_ice_2009-01-01_2022-12-30.gz (5052864 bytes)
1/16/2024 2:58:02 PM | climateprediction.net | Finished download of GHGclim_ancil_168months_CMIP6-GISS-E2-1-G_SST_2009-01-01_2022-12-30.gz (51595607 bytes)
1/16/2024 2:58:03 PM | climateprediction.net | Starting task wah2_eas25_g2j2_201512_24_1003_012245340_0
1/16/2024 3:15:35 PM | | Project communication failed: attempting access to reference site
1/16/2024 3:15:36 PM | | Internet access OK - project servers may be temporarily down.
ID: 70125
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70127 - Posted: 17 Jan 2024, 7:07:57 UTC

I don't know the answer to that one. If neither Richard nor any of the other moderators can answer this, I will ask Andy at the project. The moderators are all volunteer crunchers who have been around a while. I think I am the youngest in terms of time moderating by quite a few years.
ID: 70127
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70128 - Posted: 17 Jan 2024, 7:18:27 UTC - in response to Message 70127.  

I don't know the answer to that one. If neither Richard nor any of the other moderators can answer this, I will ask Andy at the project. The moderators are all volunteer crunchers who have been around a while. I think I am the youngest in terms of time moderating by quite a few years.
Perhaps a trickle up makes the server forgive your host? It does show you're managing a task.
ID: 70128
zombie67 [MM]

Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70129 - Posted: 17 Jan 2024, 8:02:31 UTC - in response to Message 70124.  
Last modified: 17 Jan 2024, 8:08:27 UTC

Now a few of my computers, according to the error log, are limited to a quota of 1 task for the day. I don't believe my computers are the issue. I believe a batch of bad tasks is what put me in this position. Any way to fix this?

The limitation will be lifted once your boxes return some completed tasks.

Yeah, this is a real problem. I have a 16-core machine with this issue. And even the faster machines will take 6 days to complete a task. So 16 cores... with a single task... and each day it gets to add one more. Except it got an error that second day. Now 16 days before it fills up the cores. Unless there are more errors along the way.

Perhaps after 6+ days, something changes? But with task errors mixed in, who knows?
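The arithmetic in this scenario can be sketched directly. This is a back-of-envelope illustration of the post above (assumed behaviour: the host is granted at most one new running task per day, and with ~6-day tasks none finish before the cores fill), not actual scheduler code:

```python
# Back-of-envelope sketch: a 16-core host whose daily quota has collapsed
# so it can start at most one new task per day. With 6-day tasks, nothing
# finishes before all cores are busy, so each errored day simply adds a day.

def days_until_full(cores: int = 16, new_per_day: int = 1, error_days: int = 0) -> int:
    """Days until every core has a running task."""
    running, day = 0, 0
    while running < cores:
        day += 1
        if day > error_days:          # errored days grant no usable task
            running += new_per_day
    return day

print(days_until_full())                 # 16 days to fill 16 cores
print(days_until_full(error_days=2))     # 18 days if two days are lost to errors
```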

The server side rules for this need to be modified. Other projects don't use these same impossible rules.
ID: 70129
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70130 - Posted: 17 Jan 2024, 8:11:04 UTC - in response to Message 70129.  

The server side rules for this need to be modified. Other projects don't use these same impossible rules.
I disagree - the rule protects the server from wasting time sending out tasks to a machine likely to break the next one. It doesn't matter if it's the task's fault, the point is it's better off sending the next task to another machine.
ID: 70130
kotenok2000

Joined: 22 Feb 11
Posts: 32
Credit: 226,546
RAC: 4,080
Message 70131 - Posted: 17 Jan 2024, 8:47:42 UTC - in response to Message 70125.  
Last modified: 17 Jan 2024, 8:48:49 UTC

Why am I getting the same daily quota message on a computer that is running 27 tasks?
I get the quota message then an hour later the computer gets a task. Makes no sense to me.



Because the quota counter resets every 24 hours, so the host gets one new task each day.
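This matches BOINC's adaptive per-host quota: errored results shrink the number of tasks a host may fetch per day, and valid results earn it back. A simplified sketch of that policy follows; the constants and exact update rule are illustrative, not the actual scheduler source:

```python
# Simplified model of BOINC's per-host daily task quota (illustrative;
# the real scheduler's constants and policy differ in detail). The idea:
# halve the quota on each errored result, double it on each valid one,
# clamped between 1 and the project's configured ceiling.

MAX_DAILY_QUOTA = 64  # project-configured ceiling (assumed value)

def update_quota(quota: int, result_ok: bool) -> int:
    """Return the new daily quota after one reported result."""
    if result_ok:
        return min(quota * 2, MAX_DAILY_QUOTA)
    return max(quota // 2, 1)

# A run of bad tasks drives the quota down to 1...
q = MAX_DAILY_QUOTA
for _ in range(8):
    q = update_quota(q, result_ok=False)
print(q)  # 1

# ...and each completed task then doubles it back up.
for _ in range(3):
    q = update_quota(q, result_ok=True)
print(q)  # 8
```

This is why a host stuck at "daily quota of 1 tasks" recovers quickly once it starts returning completed work.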
ID: 70131
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70132 - Posted: 17 Jan 2024, 13:37:26 UTC

Batch #1005, the NZ batch, is being uploaded to the server about now, so those tasks should start appearing in the mix soon. I will edit this post when I know how many there are.
ID: 70132
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70133 - Posted: 17 Jan 2024, 13:52:11 UTC

With respect to the error rates: 1002 has 10, 1001 has 5, and 1003 & 1004 have one between them, almost certainly reflecting that they went out later. 1001 and 1002 both still have 58% of their tasks yet to be sent out, so I am hesitant to draw any conclusions about failure rates at the moment. I would expect the different types of forcing used in each batch to affect the failure rate during the run, but not the failure rate right at the start of the task, which is where all the failed tasks I have looked at have been. I would, however, expect the changes Glen is making to the code to significantly reduce the error rate. Even then, there will be a very small number of tasks where the model runs out of control, leading to an impossible climate.
ID: 70133
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,690,033
RAC: 10,812
Message 70134 - Posted: 17 Jan 2024, 15:02:10 UTC - in response to Message 70133.  

Adding another failure:

wah2_eas25_g3ue_201812_24_1003_012247044_0

This is my travelling laptop, so I don't usually run CPDN on it (because of the restarts). I'm not going anywhere until winter is over, so I thought I'd try it static to observe the 'daily quota' issue.

Task failed after 2 minutes with

<stderr_out>
<![CDATA[
<stderr_txt>
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill 
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
14:30:48 (16964): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=16884, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=16964, selfPID=6608, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
14:30:52 (6608): called boinc_finish(0)
</stderr_txt>
'Signal 11' might be expected under Linux, but this is native Windows 11. Poking around before reporting it, I found a file

boinc_ufs_cpdnout_out.zip

in the slot directory, but the name is confusing: it's actually a plain-text file containing the single line

<status>0</status>

Which doesn't help, either. Willing to delve deeper if anyone has any questions/suggestions.
ID: 70134
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70135 - Posted: 17 Jan 2024, 15:44:27 UTC - in response to Message 70134.  

I am guessing this is the problem of the models crashing at the point where they go from the global to the regional model. I am hoping the work Glen is doing on the code will address this. Unfortunately, time constraints meant the work had to go out before he had finished.
ID: 70135
zombie67 [MM]

Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70136 - Posted: 17 Jan 2024, 16:19:42 UTC - in response to Message 70130.  
Last modified: 17 Jan 2024, 16:27:26 UTC

The server side rules for this need to be modified. Other projects don't use these same impossible rules.
I disagree - the rule protects the server from wasting time sending out tasks to a machine likely to break the next one. It doesn't matter if it's the task's fault, the point is it's better off sending the next task to another machine.

I disagree with your disagreement. There are still 16k tasks just waiting to be sent. There are no "other machines" at this point.

Edit: And there is no harm in sending a task to a bad machine. It just gets resent to the next one. This is a feature, not a fault.
ID: 70136
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70137 - Posted: 17 Jan 2024, 16:56:12 UTC - in response to Message 70136.  

Other projects don't use these same impossible rules.
I can't remember which, but at least one other project does. I don't know enough about the BOINC server to say whether it is possible to turn this feature on for some task types and not others. But given the missing-libraries problem for Linux tasks, even when the number of resends was upped to five, there were still a lot of tasks going to hard fail because of machines that crash everything. For me that is a valid reason to use that part of the server software. At the current rate there are over sixty tasks going out every hour, often over a hundred. When the first tasks start finishing, that will go up.

Long story short, we will never all agree on exactly what the balance should be. I don't know if there are any tweaks that can make the penalties for tasks failing less severe?
ID: 70137
Helmer Bryd

Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 70138 - Posted: 17 Jan 2024, 17:21:28 UTC

Since there are no Linux tasks available and still 16k+ tasks unsent, I turned on Wine, OK?
Wah2 8.24 is from 2016 and targets Windows 2000, XP, Vista, 7 and 8. Is that a factor?
I've got 3 tasks running fine as Win7.
ID: 70138
zombie67 [MM]

Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 70139 - Posted: 17 Jan 2024, 17:29:02 UTC - in response to Message 70137.  

Other projects don't use these same impossible rules.
I can't remember which, but at least one other project does. I don't know enough about the BOINC server to say whether it is possible to turn this feature on for some task types and not others. But given the missing-libraries problem for Linux tasks, even when the number of resends was upped to five, there were still a lot of tasks going to hard fail because of machines that crash everything.


The 1/day rule doesn't change the number of tasks that "hard fail". It just delays the eventual result. And delays the valid completion of the rest.
ID: 70139
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70140 - Posted: 17 Jan 2024, 17:47:36 UTC - in response to Message 70138.  

Since there are no Linux tasks available and still +16k tasks unsent I turned on Wine, OK?
Wah 8.24 is from 2016 and for win 2000, xp , vista, 7, 8. a factor?
Got 3 tasks running fine with win7,


Yes, it's OK to turn on WINE. That is how I am running Windows tasks at the moment. The ones on the testing branch I am doing in a VM, so as not to give a false picture, as some of the memory errors seem to be protected against under WINE.

I get about a 20% performance hit if I run tasks under Windows in a VM compared with using WINE. Others may see a better or worse comparison depending on the details of their machines. I haven't seen evidence of the Windows version affecting things, but that doesn't totally rule it out.
ID: 70140
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70141 - Posted: 17 Jan 2024, 17:50:48 UTC

The 1/day rule doesn't change the number of tasks that "hard fail". It just delays the eventual result. And delays the valid completion of the rest.
It does if there are machines trashing everything, and looking at some of the machines that crop up in the hard-fail lists, to my way of thinking it is worth weeding them out. Not that I have any say in the matter either way! I also stand to be corrected by Glen or Richard if my understanding of how it works is incorrect.
ID: 70141
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70142 - Posted: 17 Jan 2024, 20:19:09 UTC
Last modified: 17 Jan 2024, 20:20:11 UTC

I have received three tasks, one at a time, all working fine on my Windows 10 machine. One is from batch 1002 and has returned three trickles; two are from 1003, one of which has returned a trickle and the other not yet. They are all running at once and are predicted to take almost 10 days.

22370239 	12247973 	17 Jan 2024, 15:55:24 UTC 	16 May 2024, 15:55:24 UTC 	In progress 	--- 	--- 	--- 	Weather At Home 2 (wah2) v8.24
windows_intelx86
22369242 	12246987 	16 Jan 2024, 16:24:45 UTC 	15 May 2024, 16:24:45 UTC 	In progress 	--- 	--- 	1,678.16 	Weather At Home 2 (wah2) v8.24
windows_intelx86
22359236 	12238208 	16 Jan 2024, 0:51:57 UTC 	15 May 2024, 0:51:57 UTC 	In progress 	--- 	--- 	2,506.49 	Weather At Home 2 (wah2) v8.24
windows_intelx86

ID: 70142
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70143 - Posted: 17 Jan 2024, 21:55:22 UTC

I am going to open a new thread for the East Asia batches 1001-4, to free this thread for new work announcements rather than discussion. It would be good if anyone starting discussions for subsequent batches, such as the NZ ones that should appear tomorrow, could do the same.

Thank you.
ID: 70143
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,386,107
RAC: 14,921
Message 70145 - Posted: 18 Jan 2024, 11:21:55 UTC - in response to Message 70134.  

Richard, (or anyone else),

Don't waste your time looking into these segmentation failures. I know exactly where the problem is in the code; I've been working on this for weeks. The same code works fine under Linux but fails on Windows (same compiler, too). I'm trying to find a workaround that doesn't involve rewriting the code too much.

As I know you're technically minded: it relates to the old way Fortran was coded for low-memory machines years ago, where arrays were "misused" and shared between data of different types. A very large REAL array is being EQUIVALENCEd to both an integer and a logical array. It should work (and does on Linux), but we get a bad memory address under Windows (which only serves to reinforce my dislike of Windows :P)
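For readers unfamiliar with the technique: Fortran's EQUIVALENCE overlays arrays of different types on the same storage, so the same bytes are read back under different interpretations. A hypothetical analogue of that storage sharing (not the actual wah2 model code) can be shown in Python with one shared buffer and two typed views:

```python
import struct

# Hypothetical analogue of Fortran EQUIVALENCE (not the actual wah2 code):
# one block of storage is accessed as both a "REAL array" and an
# "INTEGER array". The bytes are shared; only the interpretation changes.

storage = bytearray(8)  # two 32-bit slots, shared by both views

# Write through the "REAL array" view (little-endian 32-bit floats)...
struct.pack_into("<2f", storage, 0, 1.0, 2.0)

# ...and read the same bytes back through the "INTEGER array" view.
ints = struct.unpack_from("<2i", storage, 0)
print(ints)  # (1065353216, 1073741824): the IEEE-754 bit patterns of 1.0 and 2.0
```

The trick is legal as long as every overlaid access stays in bounds and properly aligned; a compiler or runtime that lays the overlaid arrays out differently can turn the same code into a bad memory address, which is consistent with the Windows-only crash described here.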

Glenn


---
CPDN Visiting Scientist
ID: 70145

©2024 cpdn.org