Message boards : Number crunching : Batch 1005 WAH2 NZ region
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
Can someone who has one or more of these tasks let me know if zips are going through all right? The ones for that region on the testing site are stuck. |
Send message Joined: 7 Aug 04 Posts: 10 Credit: 148,031,171 RAC: 39,293 |
I have two of these running. Both seem to be uploading OK. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
Thanks, obviously going to a different server then. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Can someone who has one or more of these tasks let me know if zips are going through all right? The ones for that region on the testing site are stuck." I just got two of them. One has uploaded its first trickle. This is on my pipsqueak Windows 10 machine. Task: 22387098; Name: wah2_nz25_n31e_201205_25_1005_012258096_0; Workunit: 12258096; Created: 23 Jan 2024, 10:48:31 UTC; Sent: 25 Jan 2024, 19:27:25 UTC; Report deadline: 24 May 2024, 19:27:25 UTC |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
Thanks both. The testing task was sending data to the wrong server. Also, the test was just to make sure the task ran with a corrected ancillary file, so the zips were not needed. In view of that, the task has now been aborted. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Hi, I've got this error on several of the 1005 WUs I run: <core_client_version>7.24.1</core_client_version> <![CDATA[ <message> Disk usage limit exceeded</message> <stderr_txt> CPDN Monitor - Abort request from BOINC... 00:52:22 (7412): called boinc_finish(10) </stderr_txt> ]]> |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
BOINC reports that CPDN is using 1.71 GB for one eas25 task and one nz25 task (and probably including some residual program files from older runs). The nz25 task itself (at a little over 40% done) has a working set size of 263.69 MB. Check those figures against the amount of space remaining on your BOINC data drive, and check what proportion of the available space BOINC is allowed to use. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
That error can happen when the uploads are not going through and gradually eat up the allowed task space. Are your uploads working? Also, check in the /var/lib/boinc/projects/climateprediction.net directory. Sometimes the tasks do not manage to tidy up after failures. You may have some old task directories in there; check that they do not belong to running tasks before deleting them. |
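
A minimal sketch of the directory check described above, assuming a Linux host with the project directory at the path given in the post. The function names `dir_size_bytes` and `report` are made up for illustration; the script only lists sizes and deletes nothing.

```python
import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            try:
                if not file_path.is_symlink():
                    total += file_path.stat().st_size
            except OSError:
                # A file may vanish while a task is running; ignore it.
                pass
    return total


def report(project_dir: Path) -> list[tuple[str, int]]:
    """Return (entry name, size in bytes) pairs, largest first."""
    rows = []
    for entry in project_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            size = dir_size_bytes(entry)
        else:
            size = entry.lstat().st_size
        rows.append((entry.name, size))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Path taken from the post above; adjust for your own installation.
    project = Path("/var/lib/boinc/projects/climateprediction.net")
    if project.is_dir():
        for name, size in report(project):
            print(f"{size / 1e6:10.1f} MB  {name}")
```

Large entries whose names do not match any task currently listed in the BOINC Manager are the candidates for leftover task directories.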
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
There were only two ghost WUs occupying 0.5 GB of space, and there is more than 170 GB available. It seems, though, that the server abort only partially worked: I have zips from batch 1005 pending transfer with no corresponding WU in my tasks. There are 12 such zips, so I will abort them. Additionally, I have several 1003, 1004 and 1005 tasks that are still running. I guess I should cancel them manually. Or should the 1005 ones be left computing? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Only cancel any 1002, 1003 & 1004. Any 1001 & 1005 should be left running/uploading. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Only cancel any 1002, 1003 & 1004. Any 1001 & 1005 should be left running/uploading. Thanks, however upload11 server can't let me upload the zips I have of 1005 06/02/2024 14:07:25 | climateprediction.net | [http] [ID#33153] Info: processing: http://upload11.cpdn.org/cgi-bin/file_upload_handler 06/02/2024 14:07:25 | climateprediction.net | [http] [ID#33154] Info: processing: http://upload11.cpdn.org/cgi-bin/file_upload_handler 06/02/2024 14:07:25 | climateprediction.net | [http] [ID#33154] Info: Found bundle for host: 0x1f293a28eb0 [serially] 06/02/2024 14:07:25 | climateprediction.net | [http] [ID#33154] Info: Connection #6232 is still name resolving, can't reuse 06/02/2024 14:07:26 | climateprediction.net | [http] [ID#33153] Info: Trying 192.171.169.187:80... 06/02/2024 14:07:26 | climateprediction.net | [http] [ID#33154] Info: Hostname 'upload11.cpdn.org' was found in DNS cache 06/02/2024 14:07:26 | climateprediction.net | [http] [ID#33154] Info: Trying 192.171.169.187:80... 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33153] Info: connect to 192.171.169.187 port 80 failed: Timed out 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33153] Info: Failed to connect to upload11.cpdn.org port 80 after 21552 ms: Couldn't connect to server 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33153] Info: Closing connection 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33154] Info: connect to 192.171.169.187 port 80 failed: Timed out 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33154] Info: Failed to connect to upload11.cpdn.org port 80 after 21552 ms: Couldn't connect to server 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33154] Info: Closing connection 06/02/2024 14:07:47 | climateprediction.net | [http] HTTP error: Timeout was reached 06/02/2024 14:07:47 | climateprediction.net | [http] HTTP error: Timeout was reached 06/02/2024 14:07:47 | climateprediction.net | Backing off 04:43:25 on upload of 
wah2_nz25_n0kd_199005_25_1005_012254891_0_r710857397_1.zip 06/02/2024 14:07:47 | climateprediction.net | Backing off 05:07:36 on upload of wah2_nz25_n0kd_199005_25_1005_012254891_0_r710857397_2.zip 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33155] Info: processing: http://upload11.cpdn.org/cgi-bin/file_upload_handler 06/02/2024 14:07:47 | climateprediction.net | [http] [ID#33156] Info: processing: http://upload11.cpdn.org/cgi-bin/file_upload_handler 06/02/2024 14:07:48 | climateprediction.net | [http] [ID#33155] Info: Hostname upload11.cpdn.org was found in DNS cache 06/02/2024 14:07:48 | climateprediction.net | [http] [ID#33155] Info: Trying 192.171.169.187:80... 06/02/2024 14:07:48 | climateprediction.net | [http] [ID#33156] Info: Found bundle for host: 0x1f293a28660 [serially] 06/02/2024 14:07:48 | climateprediction.net | [http] [ID#33156] Info: Connection #6234 is still name resolving, can't reuse 06/02/2024 14:07:48 | climateprediction.net | [http] [ID#33156] Info: Hostname upload11.cpdn.org was found in DNS cache 06/02/2024 14:07:48 | climateprediction.net | [http] [ID#33156] Info: Trying 192.171.169.187:80... 06/02/2024 14:07:48 | | Project communication failed: attempting access to reference site 06/02/2024 14:07:48 | | [http] HTTP_OP::init_get(): https://www.google.com/ 06/02/2024 14:07:48 | | [http] [ID#0] Info: processing: https://www.google.com/ 06/02/2024 14:07:49 | | [http] [ID#0] Info: Trying 172.217.20.68:443... 
06/02/2024 14:07:49 | | [http] [ID#0] Info: Connected to www.google.com (172.217.20.68) port 443 06/02/2024 14:07:49 | | [http] [ID#0] Info: schannel: disabled automatic use of client certificate 06/02/2024 14:07:49 | | [http] [ID#0] Info: ALPN: offers http/1.1 06/02/2024 14:07:49 | | [http] [ID#0] Info: ALPN: server accepted http/1.1 06/02/2024 14:07:49 | | [http] [ID#0] Info: using HTTP/1.1 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: GET / HTTP/1.1 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: Host: www.google.com 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.24.1) 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: Accept: */* 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: Accept-Encoding: deflate, gzip 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: Accept-Language: en_GB 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: roject_name> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <name>wah2_nz25_n0kd_199005_25_1005_012254891_0_r710857397_1.zip</name> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <nbytes>90258709.000000</nbytes> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <max_nbytes>150000000.000000</max_nbytes> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <status>1</status> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <persistent_file_xfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <num_retries>15</num_retries> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <first_request_time>1706963052.704195</first_request_time> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <next_request_time>1707238272.680814</next_request_time> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <time_so_far>337.586294</time_so_far> 06/02/2024 14:07:49 | | 
[http] [ID#0] Sent header to server: <last_bytes_xferred>0.000000</last_bytes_xferred> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <is_upload>1</is_upload> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: </persistent_file_xfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: </file_transfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <file_transfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <project_url>https://climateprediction.net/</project_url> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <project_name>climateprediction.net</project_name> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <name>wah2_nz25_n0kd_199005_25_1005_012254891_0_r710857397_2.zip</name> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <nbytes>90517431.000000</nbytes> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <max_nbytes>150000000.000000</max_nbytes> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <status>1</status> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <persistent_file_xfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <num_retries>12</num_retries> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <first_request_time>1707005482.908285</first_request_time> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <next_request_time>1707239724.238784</next_request_time> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <time_so_far>271.271139</time_so_far> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <last_bytes_xferred>0.000000</last_bytes_xferred> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <is_upload>1</is_upload> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: </persistent_file_xfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: </file_transfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: 
<file_transfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <project_url>https://climateprediction.net/</project_url> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <project_name>climateprediction.net</project_name> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <name>wah2_nz25_n0kd_199005_25_1005_012254891_0_r710857397_3.zip</name> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <nbytes>90382194.000000</nbytes> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <max_nbytes>150000000.000000</max_nbytes> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <status>1</status> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <persistent_file_xfer> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <num_retries>11</num_retries> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <first_request_time>1707047627.895128</first_request_time> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <next_request_time>1707217908.505416</next_request_time> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <time_so_far>248.874489</time_so_far> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: <last_bytes_xferred>0.000000</last_bytes_xferred> 06/02/2024 14:07:49 | | [http] [ID#0] Sent header to server: 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: HTTP/1.1 200 OK 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Date: Tue, 06 Feb 2024 12:07:49 GMT 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Expires: -1 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Cache-Control: private, max-age=0 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Content-Type: text/html; charset=ISO-8859-1 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Content-Security-Policy-Report-Only: object-src 'none';base-uri 'self';script-src 'nonce-VWQJlPawV74E_3zTcnJdYw' 
'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/other-hp 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info." 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Content-Encoding: gzip 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Server: gws 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: X-XSS-Protection: 0 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: X-Frame-Options: SAMEORIGIN 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Set-Cookie: SOCS=CAAaBgiA7YWuBg; expires=Fri, 07-Mar-2025 12:07:49 GMT; path=/; domain=.google.com; Secure; SameSite=lax 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Set-Cookie: AEC=Ae3NU9NKF883ga8MHuHKh1r1SY9atTfhu4wRiWDXd6J6WUbvjQr9pgLqlA; expires=Sun, 04-Aug-2024 12:07:49 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Set-Cookie: __Secure-ENID=17.SE=mfM3qEBAhbiflfe_MhQU9awPmCSF3l85SGOT8J4-x1W6KIHkohvrYc8PwebTl6eeB_Z1RvZY2yunws1VeUBKG7vSf93m2q8hEyFtJjp-0QGGo4WXU-uDXLcyCKrnNnYst5McT1TwYuXxwl2DOIT-uK0CXzbAIxZ7iHNuX-OgFM0-ojBq0vQ; expires=Sat, 08-Mar-2025 04:26:07 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Set-Cookie: CONSENT=PENDING+273; expires=Thu, 05-Feb-2026 12:07:49 GMT; path=/; domain=.google.com; Secure 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: Transfer-Encoding: chunked 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: 00000001 06/02/2024 14:07:49 | | [http] [ID#0] 
Received header from server: 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: 00000001 06/02/2024 14:07:49 | | 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: 00000001 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: 06/02/2024 14:07:49 | | [http] [ID#0] Received header from server: 00000001 06/02/2024 14:07:49 | | [http] [ID#0] Info: Connection #6236 to host www.google.com left intact 06/02/2024 14:07:49 | | Internet access OK - project servers may be temporarily down. |
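
The log above shows plain TCP connects to upload11.cpdn.org on port 80 timing out while the reference site stays reachable. A rough sketch for repeating that check outside BOINC follows; it assumes nothing beyond the host and port that appear in the log, and `can_connect` and `check_targets` are illustrative names, not part of BOINC.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS lookup failures.
        return False


def check_targets(targets: list[tuple[str, int]], timeout: float = 10.0) -> dict[str, bool]:
    """Try each (host, port) pair and map 'host:port' to the result."""
    return {f"{host}:{port}": can_connect(host, port, timeout) for host, port in targets}
```

Calling `check_targets([("upload11.cpdn.org", 80), ("www.google.com", 443)])` mirrors what the client did here: if only the first entry comes back `False`, the problem is more likely on the route to the upload server than on the local connection.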
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
traceroute gets as far as 146.97.41.34 which is still in ja.net. I suspect these should be going to the Hobart server or somewhere in NZ? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
Putting the upload11 URL into a browser just gives you the Apache test page: "This page is used to test the proper operation of the Apache HTTP server after it has been installed. If you can read this page it means that this site is working properly. This server is powered by CentOS." |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
I still have problems uploading to upload11. I have two WUs, one of them at its 22nd zip, and their zips can't get through. I also had two more WUs fail with: <core_client_version>7.24.1</core_client_version> <![CDATA[ <message> Disk usage limit exceeded</message> <stderr_txt> CPDN Monitor - Abort request from BOINC... 22:22:29 (9908): called boinc_finish(10) </stderr_txt> ]]> And I have almost 200 GB allocated to BOINC, so there is plenty of disk space. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"I still have problems uploading to upload 11. I have two WUs one at 22 zip and these can't get through. I also had two more WUs failing with" Bernard, just out of curiosity, how much disk space is the entire BOINC data directory using? |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
We had errors like this before when a bunch of upload files couldn't be uploaded for a long time because of server problems and built up in the directory. It's not actually exceeding the total BOINC disk space allocated; it's exceeding the rsc_disk_bound value set for that work unit in client_state.xml. As described at https://boinc.bakerlab.org/rosetta/forum_thread.php?id=15160&postid=108374: "<rsc_disk_bound> </rsc_disk_bound> is the maximum amount of disk space your application should take up while running any given task. It includes all input, temporary and output files. It is set in bytes." Maybe someone with better memory and/or a better understanding of BOINC could expound on this. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
I suspect that this is the task going above the space allocated in one of the config files downloaded when the model starts. In client_state.xml you will find something like this for each task that is running: <name>wah2_eas25_h1pu_201312_24_1001_012232192</name> <app_name>wah2</app_name> <version_num>824</version_num> <rsc_fpops_est>3801388153458730.000000</rsc_fpops_est> <rsc_fpops_bound>38013881534587296.000000</rsc_fpops_bound> <rsc_memory_bound>364000000.000000</rsc_memory_bound> <rsc_disk_bound>2000000000.000000</rsc_disk_bound> If you have a few zips stuck, it is not difficult for an individual task to go above the limit. In the past, I have increased this limit by editing the file, but as this requires halting the client, there is a risk of crashing the tasks. So when it was a known problem on a batch, I did it before starting computation. I have emailed Andy to ask him to check on the server. |
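
A small read-only sketch for looking up the limit mentioned above without opening the file in an editor. It assumes the fields shown in the post sit inside `<workunit>` elements of client_state.xml and that the file parses as ordinary XML; both are assumptions about the file layout, and `disk_bounds` is an illustrative name. The sample text reuses the values from the post.

```python
import xml.etree.ElementTree as ET

# Sample fragment built from the values quoted in the post; the enclosing
# <client_state> and <workunit> elements are assumed, not taken from the post.
SAMPLE = """
<client_state>
  <workunit>
    <name>wah2_eas25_h1pu_201312_24_1001_012232192</name>
    <app_name>wah2</app_name>
    <version_num>824</version_num>
    <rsc_memory_bound>364000000.000000</rsc_memory_bound>
    <rsc_disk_bound>2000000000.000000</rsc_disk_bound>
  </workunit>
</client_state>
"""


def disk_bounds(xml_text: str) -> dict[str, float]:
    """Map each workunit name to its rsc_disk_bound in bytes."""
    root = ET.fromstring(xml_text)
    bounds = {}
    for workunit in root.iter("workunit"):
        name = workunit.findtext("name")
        bound = workunit.findtext("rsc_disk_bound")
        if name and bound:
            bounds[name.strip()] = float(bound)
    return bounds


if __name__ == "__main__":
    for name, bound in disk_bounds(SAMPLE).items():
        print(f"{name}: {bound / 1e9:.2f} GB")
```

To run it against a real installation, read the contents of client_state.xml into a string and pass that to `disk_bounds`; reading the file does not require stopping the client, unlike editing it.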
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
"I still have problems uploading to upload 11. I have two WUs one at 22 zip and these can't get through. I also had two more WUs failing with" Currently it uses 13.28 GB, with 190 GB allocated to BOINC. I would rather not adjust the config files; would it be possible to raise the limit server side for the next batches? Some zips have cleared and one WU (batch 1005) uploaded, but it gave a computation error due to the disk usage limit error above. It crashed at the 21st zip. The other WU (batch 1005) has 16 stuck zips plus restart.zip, and I think it will error in a few days when it hits the space limit. Other batches seem to upload just fine. |
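
A back-of-envelope sketch of why a task can fail around the 21st stuck zip. It uses the three zip sizes reported in the upload log earlier in the thread and the 2,000,000,000-byte rsc_disk_bound quoted for an eas25 task; that the nz25 tasks carry the same bound is an assumption, and `max_stuck_zips` is an illustrative name.

```python
# Sizes in bytes of the three stuck zips, taken from the upload log in this thread.
ZIP_SIZES = [90258709, 90517431, 90382194]

# Bound quoted for an eas25 task; assumed here to apply to nz25 tasks as well.
RSC_DISK_BOUND = 2_000_000_000


def max_stuck_zips(bound_bytes: float, avg_zip_bytes: float, other_bytes: float = 0.0) -> int:
    """Whole zips that fit under the bound once other_bytes of task files are counted."""
    return int((bound_bytes - other_bytes) // avg_zip_bytes)


average_zip = sum(ZIP_SIZES) / len(ZIP_SIZES)  # about 90.4 MB per zip
print(max_stuck_zips(RSC_DISK_BOUND, average_zip))
```

With no allowance for the task's own working files, about 22 zips of this size fit under the bound, so a failure at the 21st zip is consistent with the remaining space being taken by the model's input and restart files.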
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
I would suspend computation for the NZ tasks till zips clear. If you run out of work, you can turn them back on long enough to get more work then suspend them again. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
"I would suspend computation for the NZ tasks till zips clear. If you run out of work, you can turn them back on long enough to get more work then suspend them again." Thanks, Dave. Paused it. A new task started; all the others are EAS25. |