Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 22 May 21 Posts: 39 Credit: 1,180,250 RAC: 4,005 |
Anyone else having issues getting the results to upload? I'm just getting timeouts. See this post in the OpenIFS discussion thread. Thanks! |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Uploads update Mon 12th: Have just heard from CPDN. There was a major failure at the weekend in the cloud system they use. It will take a day or two to move over to a new system before uploads will work again. |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
Thanks Glen. I just looked at the server state and deduced from the "users in last 24 hours" column that it wasn't sorted yet. Also, there will almost certainly be some transient HTTP errors due to the number of people trying to upload data at once when it does start working again. (Based on past experience!) |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,699,166 RAC: 9,972 |
"Uploads update Mon 12th" Could this possibly be re-posted in the news board? After all, an upload failure mainly impinges on old work, not this thread? |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
"Could this possibly be re-posted in the news board? After all, an upload failure mainly impinges on old work, not this thread?" Done. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
A test batch of 250 tasks for the BL app has just been posted to the main site for OpenIFS. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just got 10 of them, and three are already running. I notice the Boinc client expects each to take 8 hours 40 minutes. We will see how fast the client learns the truth. Edit 1: I notice the trickles are about every half hour of my time. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I notice the Boinc client expects each to take 8 hours 40 minutes. We will see how fast the client learns the truth." Boinc-client learns pretty fast. Not too bad for a beginning. Task: 22250505; Name: oifs_43r3_bl_a05n_2016092300_15_949_12166597_0; Workunit: 12166597; Computer ID: 1511241; Run time: 7 hours 34 min 35 sec; CPU time: 7 hours 23 min 6 sec |
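The figures quoted in this post can be compared directly. A minimal Python sketch, using only the numbers reported above (the helper name `to_seconds` is made up for illustration):

```python
def to_seconds(hours: int, minutes: int, seconds: int = 0) -> int:
    """Convert an hours/minutes/seconds duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Figures taken from the post above.
initial_estimate = to_seconds(8, 40)   # client's first guess
run_time = to_seconds(7, 34, 35)       # reported wall-clock run time
cpu_time = to_seconds(7, 23, 6)        # reported CPU time

# Fraction of the wall-clock time actually spent on the CPU.
cpu_efficiency = cpu_time / run_time

# How far the client's first guess was above the real run time.
overestimate = initial_estimate / run_time

print(f"CPU efficiency: {cpu_efficiency:.1%}")
print(f"Initial estimate / actual run time: {overestimate:.2f}")
```

On these numbers the task spent roughly 97% of its wall-clock time on the CPU, and the client's first estimate was about 14% too long.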
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Boinc-client learns pretty fast." Pah! Try Primegrid where it's still predicting a 6 day task will take 70 days. After having done several of them. The programmers couldn't organise a jolly party in a beer creating factory. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Bear in mind though that OpenIFS models can be set to run for different lengths of forecast. So don't rely on that remaining time estimate to be accurate across projects. The fraction done is computed accurately though; I use that to work out runtime and ignore the client's guess-timate. "I notice the Boinc client expects each to take 8 hours 40 minutes. We will see how fast the client learns the truth." "Boinc-client learns pretty fast. Not too bad for a beginning." "Run time 7 hours 34 min 35 sec CPU time 7 hours 23 min 6 sec" The fastest runtime I've seen for this batch so far is 4 hr 5 min (on a 5950X). Am interested to see the spread of total CPU times across hardware; will ask CPDN if they have any tools to scrape the data. Wonder if there's any 13th gen Intel in use. Still seeing some fails but I can't get to the logs at the moment to see what the problem was. |
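Working out the runtime from the fraction done, as described in this post, is simple proportional arithmetic. A minimal sketch, assuming progress is roughly linear in time; the function names and the example numbers (2 hours elapsed at 25% done) are illustrative, not taken from a real task:

```python
def estimate_total_runtime(elapsed_seconds: float, fraction_done: float) -> float:
    """Estimate total runtime, assuming progress is linear in elapsed time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return elapsed_seconds / fraction_done


def estimate_remaining(elapsed_seconds: float, fraction_done: float) -> float:
    """Estimate the time still to go, in seconds."""
    return estimate_total_runtime(elapsed_seconds, fraction_done) - elapsed_seconds


# Hypothetical example: 2 hours elapsed, task reports 25% done.
elapsed = 2 * 3600
total = estimate_total_runtime(elapsed, 0.25)
remaining = estimate_remaining(elapsed, 0.25)
print(f"Estimated total: {total / 3600:.1f} h, remaining: {remaining / 3600:.1f} h")
```

The estimate is only as good as the linearity assumption, so it is most useful once a task has been running for a while.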
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Bear in mind though OpenIFS models can be set to run for different lengths of forecast. So don't rely on that remaining time estimate to be accurate across projects. The fraction done is computed accurately though, I use that to work out runtime and ignore the client's guess-timate." Boinc can be told to use "fraction done exact", but only per subproject; you have to add every one manually to the app config. And it only works after the task has been started. |
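The per-application setting mentioned here is the `<fraction_done_exact/>` element of `app_config.xml`, which lives in the project's directory under the BOINC data directory. Since every application needs its own entry, a small script can generate the file. This is a sketch only: the application names `oifs_43r3_bl` and `oifs_43r3_ps` are assumptions guessed from the task names in this thread, so check the names your client actually reports before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names):
    """Return app_config.xml text with <fraction_done_exact/> for each app."""
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        # Empty element: tells the client to base the remaining-time
        # estimate on the fraction done reported by the application.
        ET.SubElement(app, "fraction_done_exact")
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# ASSUMPTION: these application names are guesses, not confirmed by the thread.
config_text = build_app_config(["oifs_43r3_bl", "oifs_43r3_ps"])
print(config_text)
```

The printed text can be saved as `app_config.xml` in the project directory; the client has to re-read its config files (or be restarted) before the change takes effect.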
Send message Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265 |
"Still seeing some fails but I can't get to the logs at the moment to see what the problem was." So far I have 22 valid OpenIFS 43r3 Baroclinic Lifecycle v1.07 tasks and one computational error: https://www.cpdn.org/result.php?resultid=22250486 Exit status 9 (0x00000009) Unknown error code. End of stderr: Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_a054_2016092300_15_949_12166578_0_r1730349614_14.zip Uploading the final file: upload_file_14.zip Uploading trickle at timestep: 1295100 double free or corruption (out) </stderr_txt> |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Still seeing some fails but I can't get to the logs at the moment to see what the problem was." I do not have a clue what the problem is because I have had no failures with these work units. My latest batch of 10 were all fresh new ones. The older ones were all successful. It is still a mystery why my machine and OS should be so much more reliable than some others. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
biodoc: Thanks, that's useful because it pins down the error to a specific part of the code in the controlling wrapper process that runs the model. The model completed successfully; the failure appears right at the end, when the final file is about to be uploaded. When I was looking for memory leaks I noted that the boinc client functions, which we use, seem to leak memory. I use release/7.20 to link against whereas I note you have 7.16 installed. I wonder if that's a clue to what's happening. I wonder if Richard might know of memory leak issues with versions of the boinc client? biodoc: if we wanted to run some tests specifically on your machine, would you be willing? (We can force push tasks to specific machines if needed.) Thanks. "So far I have 22 valid OpenIFS 43r3 Baroclinic Lifecycle v1.07 tasks and one computational error." |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,699,166 RAC: 9,972 |
"I wonder if Richard might know of memory leak issues with versions of the boinc client?" I assume you're referring to the BOINC API library, which you would link into the science application? The client itself lives on the volunteer's computer, and may conceivably have memory issues of its own, but they should be independent of the memory space used by your application. I'm not aware of any specific memory leak problems, but it wouldn't surprise me if minor ones existed. I think the attitude in the past has been "it would be too tedious to go through all the compiler warnings - we'll just suppress that part of the output and live with them". One or two projects display memory leak warnings in stderr.txt - more often in Windows than Linux - but they don't affect the final outcome of the science run. BOINC is also very tolerant of version differences between the API library and the volunteer's client. Some projects compile with old libraries, some volunteers run old clients - but you have to go a long way back (at least 10 years) before the incompatibilities become serious. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"When I was looking for memory leaks I noted that the boinc client functions, which we use, seems to leak memory. I use release/7.20 to link against whereas I note you have 7.16 installed. I wonder if that's a clue to what's happening." I have been running version 7.20.2 of the boinc_client since Jul 29 09:36. It is started automatically by systemd when the system is booted up and runs continuously until the system is powered down. Mostly, when I am logged in, I run the boincmgr task on my machine to keep track of what is going on, which gives the client stuff to do. Currently, my system has been up 23 days, 9 hours 51 minutes, running 24/7 all that time. The client normally runs 12 Boinc tasks at once, provided the projects supply me with sufficient work. I have received no failures of boinc tasks that whole time. It is currently running three oifs_43r3_bl_1.07_x86_64-pc-linux-gnu tasks at a time. If the client were leaking memory, how long is it supposed to take before running the machine out of RAM? And what might be the symptoms of that happening? Would I notice before the system ran out of memory? I am running Red Hat Enterprise Linux release 8.6 (Ootpa) with the 4.18.0-372.26.1.el8_6.x86_64 kernel. My hardware is a Dell T5820 desktop "workstation" with 64 gigabytes of RAM. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Thanks Richard. That gives some reassurance about mixing versions. Error code 9 suggests a SIGKILL (I presume 'error code' means 'signal'), which means adding signal handlers would not help. Out-of-memory generates a SIGKILL, though in this example the machine has 64 GB RAM. Unless the RAM was being heavily used for other workloads I don't have any more clues at present. The very good news is that in batch 949 of 250 tasks, there have been only 14 failures and ~190 tasks completed on first attempt. That's significantly better than earlier batches. "I wonder if Richard might know of memory leak issues with versions of the boinc client?" "I assume you're referring to the BOINC API library, which you would link into the science application? The client itself lives on the volunteer's computer, and may conceivably have memory issues of its own, but they should be independent of the memory space used by your application." |
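The mapping from the number 9 to SIGKILL can be checked on any POSIX system. A minimal sketch, which only decodes the number under the post's own assumption that the reported exit status is a signal number (the function name `describe_signal` is made up for illustration):

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number, if the platform defines one."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


# Exit status reported for the failed tasks in this thread.
print(describe_signal(9))
```

SIGKILL cannot be caught, blocked or ignored by the process that receives it, which is why adding signal handlers would not help in that case.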
Send message Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265 |
"biodoc: Thanks, that's useful because it pins down the error to a specific part of the code in the controlling wrapper process that runs the model. The model completed successfully; the failure appears right at the end, when the final file is about to be uploaded. When I was looking for memory leaks I noted that the boinc client functions, which we use, seem to leak memory. I use release/7.20 to link against whereas I note you have 7.16 installed. I wonder if that's a clue to what's happening." I have 2 computers with Linux Mint 20.3 installed. The boinc version, as you pointed out, is 7.16, which is included in the Mint/Ubuntu repository. My other 2 computers have Mint 21 and Kubuntu 22.04 installed and the boinc version is 7.18, which is also included in the repository. Sure, you can push tasks to any one of my computers. BTW, I just picked up the same error on another task, for the OpenIFS 43r3 Perturbed Surface v1.05 app. This is the same computer as the other task: https://www.cpdn.org/result.php?resultid=22250622 This computer ran the v1.01 tasks error free. I can also upgrade Boinc on the one you pick to push tasks to. Let me know. |
Send message Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265 |
I can upgrade one computer to boinc 7.20.5 using this ppa: https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc I think that version is a development release. I guess it could cause other issues. |
Send message Joined: 7 Aug 04 Posts: 10 Credit: 148,011,291 RAC: 40,045 |
I also had a task that failed with that error; however, the model did not finish: https://www.cpdn.org/result.php?resultid=22250347 Exit status 9 (0x00000009) Unknown error code. End of stderr: ... 12:03:35 STEP 973 H= 243:15 +CPU= 10.376 12:03:45 STEP 974 H= 243:30 +CPU= 10.186 12:03:56 STEP 975 H= 243:45 +CPU= 10.185 double free or corruption (out) 12:04:06 STEP 976 H= 244:00 +CPU= 10.574 </stderr_txt> The same computer successfully completed 11 other jobs from the latest oifs batch. They were being run 6 at a time, with around 20 GB of free RAM available. I'm certainly fine with test jobs being sent to it, if you want to. |
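The step lines in the log excerpt above can be parsed to see how fast the model was advancing when it failed. A minimal sketch, assuming that `H=` is the forecast time in hours:minutes and that `+CPU=` is the CPU seconds spent on that step; both interpretations are inferred from the excerpt rather than confirmed in the thread:

```python
import re

# Excerpt copied from the post above, one log entry per line.
LOG = """\
12:03:35 STEP 973 H= 243:15 +CPU= 10.376
12:03:45 STEP 974 H= 243:30 +CPU= 10.186
12:03:56 STEP 975 H= 243:45 +CPU= 10.185
double free or corruption (out)
12:04:06 STEP 976 H= 244:00 +CPU= 10.574
"""

STEP_LINE = re.compile(
    r"(\d{2}:\d{2}:\d{2})\s+STEP\s+(\d+)\s+H=\s*(\d+):(\d{2})\s+\+CPU=\s*([\d.]+)"
)


def parse_steps(text):
    """Extract step number, forecast minutes and CPU seconds from each step line."""
    steps = []
    for match in STEP_LINE.finditer(text):
        _wall, step, hours, minutes, cpu = match.groups()
        steps.append({
            "step": int(step),
            "model_minutes": int(hours) * 60 + int(minutes),
            "cpu_seconds": float(cpu),
        })
    return steps


steps = parse_steps(LOG)
mean_cpu = sum(s["cpu_seconds"] for s in steps) / len(steps)
minutes_per_step = (
    (steps[-1]["model_minutes"] - steps[0]["model_minutes"])
    / (steps[-1]["step"] - steps[0]["step"])
)
print(f"{len(steps)} steps, mean CPU per step {mean_cpu:.2f} s, "
      f"{minutes_per_step:.0f} forecast minutes per step")
```

Lines that do not match the step pattern, such as the `double free or corruption (out)` message, are skipped by the parser.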
©2024 cpdn.org