climateprediction.net (CPDN) home page
Thread 'New work discussion - 2'

bullschuck

Joined: 22 May 21
Posts: 39
Credit: 1,180,250
RAC: 4,005
Message 66869 - Posted: 12 Dec 2022, 1:35:28 UTC - in response to Message 66865.  

Anyone else having issues getting the results to upload? I'm just getting timeouts.
See this post in OpenIFS discussion thread.


Thanks!
ID: 66869 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66870 - Posted: 12 Dec 2022, 11:40:33 UTC

Uploads update Mon 12th

Have just heard from CPDN. There was a major failure in the cloud system they use at the weekend. It will take a day or two to move over to a new system before uploads will work again.
ID: 66870 · Report as offensive
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4536
Credit: 18,997,390
RAC: 21,721
Message 66871 - Posted: 12 Dec 2022, 11:55:46 UTC - in response to Message 66870.  

Thanks Glenn. I just looked at the server state and deduced from the "users in last 24 hours" column that it wasn't sorted yet. Also, there will almost certainly be some transient HTTP errors due to the number of people trying to upload data at once when it does start working again. (Based on past experience!)
ID: 66871 · Report as offensive
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,699,166
RAC: 9,972
Message 66873 - Posted: 12 Dec 2022, 13:10:44 UTC - in response to Message 66870.  

Uploads update Mon 12th

Have just heard from CPDN. There was a major failure in the cloud system they use at the weekend. It will take a day or two to move over to a new system before uploads will work again.
Could this possibly be re-posted in the news board? After all, an upload failure mainly impinges on old work, not this thread?
ID: 66873 · Report as offensive
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4536
Credit: 18,997,390
RAC: 21,721
Message 66875 - Posted: 12 Dec 2022, 14:28:38 UTC - in response to Message 66871.  

Could this possibly be re-posted in the news board? After all, an upload failure mainly impinges on old work, not this thread?
Done
ID: 66875 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66905 - Posted: 14 Dec 2022, 14:26:18 UTC

A test batch of 250 tasks for the OpenIFS BL app has just been posted to the main site.
ID: 66905 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66906 - Posted: 14 Dec 2022, 14:31:14 UTC - in response to Message 66905.  
Last modified: 14 Dec 2022, 15:17:09 UTC

I just got 10 of them, and three are already running.

I notice the Boinc client expects each to take 8 hours 40 minutes. We will see how fast the client learns the truth.

Edit 1: I notice the trickles arrive about every half hour of wall-clock time on my machine.
ID: 66906 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66907 - Posted: 14 Dec 2022, 23:23:15 UTC - in response to Message 66906.  

I notice the Boinc client expects each to take 8 hours 40 minutes. We will see how fast the client learns the truth.


Boinc-client learns pretty fast. Not too bad for a beginning.

Task 22250505
Name 	oifs_43r3_bl_a05n_2016092300_15_949_12166597_0
Workunit 	12166597
Computer ID 	1511241
Run time 	7 hours 34 min 35 sec
CPU time 	7 hours 23 min 6 sec 

ID: 66907 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66908 - Posted: 14 Dec 2022, 23:35:03 UTC - in response to Message 66907.  

Boinc-client learns pretty fast.
Pah! Try PrimeGrid, where it's still predicting that a 6-day task will take 70 days, even after I have done several of them. The programmers couldn't organise a jolly party in a beer creating factory.
ID: 66908 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66909 - Posted: 15 Dec 2022, 0:20:41 UTC - in response to Message 66907.  
Last modified: 15 Dec 2022, 0:30:25 UTC

I notice the Boinc client expects each to take 8 hours 40 minutes. We will see how fast the client learns the truth.
Boinc-client learns pretty fast. Not too bad for a beginning.
Run time 	7 hours 34 min 35 sec
CPU time 	7 hours 23 min 6 sec
Bear in mind, though, that OpenIFS models can be set to run for different lengths of forecast, so don't rely on that remaining-time estimate to be accurate across projects. The fraction done is computed accurately, though; I use that to work out the runtime and ignore the client's guesstimate.
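
Working out the runtime from the fraction done can be sketched as below. This is only an illustration: it assumes progress is roughly linear in time, and the function names and the example numbers (2 hours elapsed at 25% done) are made up for the sketch, not taken from any task in this thread.

```python
from __future__ import annotations


def estimate_runtime(elapsed_seconds: float, fraction_done: float) -> tuple[float, float]:
    """Return (estimated_total_seconds, estimated_remaining_seconds).

    Assumes progress is linear in time, so total = elapsed / fraction_done.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in the range (0, 1]")
    total = elapsed_seconds / fraction_done
    return total, total - elapsed_seconds


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as 'Hh MMm SSs'."""
    whole = int(round(seconds))
    hours, rem = divmod(whole, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    # Hypothetical task: 2 hours elapsed, fraction done reported as 0.25.
    total, remaining = estimate_runtime(2 * 3600, 0.25)
    print("estimated total:    ", format_hms(total))
    print("estimated remaining:", format_hms(remaining))
```

The estimate is only as good as the linearity assumption; early in a run, start-up overhead can skew it.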

The fastest runtime I've seen for this batch so far is 4hr 5mins (on a 5950X). I am interested to see the spread of total CPU times across hardware; I will ask CPDN if they have any tools to scrape the data. I wonder if there are any 13th-gen Intel CPUs in use.

Still seeing some fails but I can't get to the logs at the moment to see what the problem was.
ID: 66909 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66910 - Posted: 15 Dec 2022, 0:35:28 UTC - in response to Message 66909.  

Bear in mind though OpenIFS models can be set to run for different lengths of forecast. So don't rely on that remaining time estimate to be accurate across projects. The fraction done is computed accurately though, I use that to work out runtime and ignore the client's guess-timate.
BOINC can be told to use "fraction done exact", but only per subproject (application): you have to add each one manually to app_config.xml. It also only takes effect once the task has started.
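
A rough sketch of generating such an app_config.xml follows. The `<fraction_done_exact/>` element inside each `<app>` block is the BOINC option being referred to; the application names used here (`oifs_43r3_bl`, `oifs_43r3_ps`) are guesses for illustration only, so check the real names in client_state.xml before using anything like this.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names):
    """Build app_config.xml text with <fraction_done_exact/> for each app."""
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        # Ask the client to base its remaining-time estimate on fraction done.
        ET.SubElement(app, "fraction_done_exact")
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical application names; replace with the ones on your machine.
    print(build_app_config(["oifs_43r3_bl", "oifs_43r3_ps"]))
```

The resulting file goes in the project's directory under the BOINC data directory, and the client has to be told to re-read its config files (or be restarted) before it takes effect.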
ID: 66910 · Report as offensive
biodoc

Joined: 2 Oct 19
Posts: 21
Credit: 47,674,094
RAC: 24,265
Message 66911 - Posted: 15 Dec 2022, 1:21:35 UTC - in response to Message 66909.  

Still seeing some fails but I can't get to the logs at the moment to see what the problem was.

So far I have 22 valid OpenIFS 43r3 Baroclinic Lifecycle v1.07 tasks and one computational error.
https://www.cpdn.org/result.php?resultid=22250486
Exit status	9 (0x00000009) Unknown error code

Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_a054_2016092300_15_949_12166578_0_r1730349614_14.zip
Uploading the final file: upload_file_14.zip
Uploading trickle at timestep: 1295100
double free or corruption (out)

</stderr_txt>
ID: 66911 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66912 - Posted: 15 Dec 2022, 1:51:22 UTC - in response to Message 66909.  

Still seeing some fails but I can't get to the logs at the moment to see what the problem was.


I do not have a clue what the problem is because I have had no failures with these work units. My latest batch of 10 were all fresh new ones. The older ones were all successful. It is still a mystery why my machine and OS should be so much more reliable than some others.
ID: 66912 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66913 - Posted: 15 Dec 2022, 12:43:32 UTC - in response to Message 66911.  
Last modified: 15 Dec 2022, 12:51:40 UTC

biodoc. Thanks, that's useful because it pins down the error to a specific part of the code in the controlling wrapper process that runs the model. The model completed successfully; the failure appears right at the end, when the final file is about to be uploaded. When I was looking for memory leaks I noted that the BOINC client functions, which we use, seem to leak memory. I link against release/7.20, whereas I note you have 7.16 installed. I wonder if that's a clue to what's happening.

I wonder if Richard might know of memory leak issues with versions of the boinc client?

biodoc: if we wanted to run some tests specifically on your machine would you be willing? (we can force push tasks to specific machines if needed).

Thanks.

So far I have 22 valid OpenIFS 43r3 Baroclinic Lifecycle v1.07 tasks and one computational error.
https://www.cpdn.org/result.php?resultid=22250486
Exit status	9 (0x00000009) Unknown error code
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_a054_2016092300_15_949_12166578_0_r1730349614_14.zip
Uploading the final file: upload_file_14.zip
Uploading trickle at timestep: 1295100
double free or corruption (out)

</stderr_txt>
ID: 66913 · Report as offensive
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,699,166
RAC: 9,972
Message 66915 - Posted: 15 Dec 2022, 14:44:12 UTC - in response to Message 66913.  

I wonder if Richard might know of memory leak issues with versions of the boinc client?
I assume you're referring to the BOINC API library, which you would link into the science application? The client itself lives on the volunteer's computer, and may conceivably have memory issues of its own, but they should be independent of the memory space used by your application.

I'm not aware of any specific memory leak problems, but it wouldn't surprise me if minor ones existed. I think the attitude in the past has been "it would be too tedious to go through all the compiler warnings - we'll just suppress that part of the output and live with them". One or two projects display memory leak warnings in stderr.txt - more often in Windows than Linux - but they don't affect the final outcome of the science run.

BOINC is also very tolerant of version differences between the API library and the volunteer's client. Some projects compile with old libraries, some volunteers run old clients - but you have to go a long way back (at least 10 years) before the incompatibilities become serious.
ID: 66915 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66916 - Posted: 15 Dec 2022, 15:18:18 UTC - in response to Message 66913.  

When I was looking for memory leaks I noted that the boinc client functions, which we use, seems to leak memory. I use release/7.20 to link against whereas I note you have 7.16 installed. I wonder if that's a clue to what's happening.


I have been running version 7.20.2 of the boinc_client since Jul 29 09:36. It is started automatically by systemd when the system is booted up and runs continuously until the system is powered down. Mostly, when I am logged in, I run the boincmgr task on my machine to keep track of what is going on, which gives the client stuff to do. Currently, my system has been up 23 days, 9 hours 51 minutes, running 24/7 all that time. The client normally runs 12 Boinc tasks at once, provided the projects supply me with sufficient work. I have received no failures of boinc tasks that whole time. It is currently running three oifs_43r3_bl_1.07_x86_64-pc-linux-gnu tasks at a time.

If the client were leaking memory, how long is it supposed to take before running the machine out of RAM? And what might be the symptoms of that happening? Would I notice before the system ran out of memory?
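
One way to get a feel for these questions on Linux is to sample a process's resident memory from /proc and extrapolate. The sketch below is only illustrative: the helper names are invented, the growth rate has to come from your own measurements, and the numbers in the example (20 GiB free, 1 MiB per hour of growth) are hypothetical.

```python
import re


def parse_vmrss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_vmrss_kib(pid: int) -> int:
    """Read the current resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kib(handle.read())


def hours_until_exhausted(free_kib: float, growth_kib_per_hour: float) -> float:
    """Linear extrapolation: hours until free memory is used up."""
    if growth_kib_per_hour <= 0:
        return float("inf")  # not growing, so never exhausted on this model
    return free_kib / growth_kib_per_hour


if __name__ == "__main__":
    # Hypothetical figures: 20 GiB free, process growing by 1 MiB per hour.
    free_kib = 20 * 1024 * 1024
    print(hours_until_exhausted(free_kib, 1024), "hours")
```

Sampling `read_vmrss_kib` for the client's PID a few hours apart gives a growth rate to feed into the extrapolation; a flat reading suggests there is no significant leak in that process.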

I am running Red Hat Enterprise Linux release 8.6 (Ootpa) with 4.18.0-372.26.1.el8_6.x86_64 kernel.
My hardware is a Dell T5820 desktop "workstation" with 64 GB of RAM.
ID: 66916 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66917 - Posted: 15 Dec 2022, 15:37:54 UTC - in response to Message 66915.  

Thanks Richard. That gives some reassurance about mixing versions. Error code 9 suggests a SIGKILL (I presume 'error code' means 'signal'), which means adding signal handlers would not help, since SIGKILL cannot be caught. Running out of memory generates a SIGKILL, though in this example the machine has 64 GB of RAM. Unless the RAM was being heavily used by other workloads, I don't have any more clues at present.
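
The point about signal handlers can be checked with a small sketch like the one below. It assumes a Linux (or other POSIX) system, and it takes the presumption above at face value, namely that the reported status 9 is a signal number; the helper names are invented for the illustration.

```python
import signal


def describe_signal(number: int) -> str:
    """Map a signal number to its name on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


def can_install_handler(sig: int) -> bool:
    """Try to install a do-nothing handler, then restore the previous one."""
    try:
        previous = signal.signal(sig, lambda signum, frame: None)
    except (OSError, ValueError):
        return False  # the OS (or Python) refused to install a handler
    if previous is not None:
        signal.signal(sig, previous)
    return True


if __name__ == "__main__":
    print("signal 9 is", describe_signal(9))
    print("handler for SIGTERM possible:", can_install_handler(signal.SIGTERM))
    print("handler for SIGKILL possible:", can_install_handler(signal.SIGKILL))
```

On POSIX systems the operating system rejects any attempt to catch SIGKILL, which is why a handler in the wrapper could not intercept it.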

The very good news is that in batch 949 of 250 tasks, there have been only 14 failures and ~190 tasks completed on the first attempt. That's significantly better than earlier batches.
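
As a quick check on those figures, here is the arithmetic as a small sketch. It uses only the numbers quoted above (250 tasks, 14 failures, roughly 190 first-attempt completions); the status of the remaining tasks is not stated here, so they are simply counted as unaccounted for.

```python
def percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    batch_size = 250
    failures = 14
    first_attempt_ok = 190  # approximate figure

    print("failure rate:        ", percent(failures, batch_size), "%")
    print("first-attempt success:", percent(first_attempt_ok, batch_size), "%")
    print("unaccounted for:      ", batch_size - failures - first_attempt_ok, "tasks")
```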

I wonder if Richard might know of memory leak issues with versions of the boinc client?
I assume you're referring to the BOINC API library, which you would link into the science application? The client itself lives on the volunteer's computer, and may conceivably have memory issues of its own, but they should be independent of the memory space used by your application.

I'm not aware of any specific memory leak problems, but it wouldn't surprise me if minor ones existed. I think the attitude in the past has been "it would be too tedious to go through all the compiler warnings - we'll just suppress that part of the output and live with them". One or two projects display memory leak warnings in stderr.txt - more often in Windows than Linux - but they don't affect the final outcome of the science run.

BOINC is also very tolerant of version differences between the API library and the volunteer's client. Some projects compile with old libraries, some volunteers run old clients - but you have to go a long way back (at least 10 years) before the incompatibilities become serious.
ID: 66917 · Report as offensive
biodoc

Joined: 2 Oct 19
Posts: 21
Credit: 47,674,094
RAC: 24,265
Message 66918 - Posted: 15 Dec 2022, 16:03:03 UTC - in response to Message 66913.  
Last modified: 15 Dec 2022, 16:08:30 UTC

biodoc. Thanks, that's useful because it pins down the error to a specific part of the code in the controlling wrapper process that runs the model. The model completed successfully, the fail appears right at the end when the final upload is about to be uploaded but at that point it fails. When I was looking for memory leaks I noted that the boinc client functions, which we use, seems to leak memory. I use release/7.20 to link against whereas I note you have 7.16 installed. I wonder if that's a clue to what's happening.

I wonder if Richard might know of memory leak issues with versions of the boinc client?

biodoc: if we wanted to run some tests specifically on your machine would you be willing? (we can force push tasks to specific machines if needed).

I have 2 computers with Linux Mint 20.3 installed. The BOINC version, as you pointed out, is 7.16, which is the one included in the Mint/Ubuntu repository.
My other 2 computers have Mint 21 and Kubuntu 22.04 installed, and the BOINC version is 7.18, which is also included in the repository.

Sure, you can push tasks to any one of my computers.
BTW, I just picked up the same error on another task, for OpenIFS 43r3 Perturbed Surface v1.05. This is the same computer as the other task: https://www.cpdn.org/result.php?resultid=22250622
This computer ran the v1.01 tasks error free.
I can also upgrade BOINC on the one you pick to push tasks to.
Let me know.
ID: 66918 · Report as offensive
biodoc

Joined: 2 Oct 19
Posts: 21
Credit: 47,674,094
RAC: 24,265
Message 66919 - Posted: 15 Dec 2022, 16:15:22 UTC

I can upgrade one computer to BOINC 7.20.5 using this PPA: https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc
I think that version is a development release, so I guess it could cause other issues.
ID: 66919 · Report as offensive
cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,011,291
RAC: 40,045
Message 66920 - Posted: 15 Dec 2022, 16:18:14 UTC

I also had a task that failed with that error; however, the model did not finish:

https://www.cpdn.org/result.php?resultid=22250347

Exit status 9 (0x00000009) Unknown error code
...
12:03:35 STEP 973 H= 243:15 +CPU= 10.376
12:03:45 STEP 974 H= 243:30 +CPU= 10.186
12:03:56 STEP 975 H= 243:45 +CPU= 10.185
double free or corruption (out)
12:04:06 STEP 976 H= 244:00 +CPU= 10.574

</stderr_txt>

The same computer successfully completed 11 other jobs from the latest oifs batch. They were being run 6 at a time, with around 20 GB of free RAM available.
I'm certainly fine with test jobs being sent to it, if you want.
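
For anyone wanting to pull the last completed step out of a stderr excerpt like the one above, a minimal parsing sketch follows. The interpretation of the fields is an assumption on my part (`H=` read as forecast hours:minutes, `+CPU=` as CPU seconds for the step), and the function and key names are invented for the example.

```python
import re

# Matches lines such as: "12:03:35 STEP 973 H= 243:15 +CPU= 10.376"
STEP_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+STEP\s+(\d+)\s+H=\s*(\d+):(\d{2})\s+\+CPU=\s*([\d.]+)"
)


def parse_steps(log_text: str) -> list:
    """Return one dict per STEP line; other lines are ignored."""
    steps = []
    for line in log_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append({
                "wall_clock": match.group(1),
                "step": int(match.group(2)),
                # Assumed meaning: forecast time as hours:minutes.
                "forecast_minutes": int(match.group(3)) * 60 + int(match.group(4)),
                # Assumed meaning: CPU seconds spent on this step.
                "cpu_seconds": float(match.group(5)),
            })
    return steps


SAMPLE_LOG = """\
12:03:35 STEP 973 H= 243:15 +CPU= 10.376
12:03:45 STEP 974 H= 243:30 +CPU= 10.186
12:03:56 STEP 975 H= 243:45 +CPU= 10.185
double free or corruption (out)
12:04:06 STEP 976 H= 244:00 +CPU= 10.574
"""

if __name__ == "__main__":
    steps = parse_steps(SAMPLE_LOG)
    last = steps[-1]
    mean_cpu = sum(s["cpu_seconds"] for s in steps) / len(steps)
    print("last step seen:", last["step"], "at", last["wall_clock"])
    print("mean CPU per step:", round(mean_cpu, 3))
```

Comparing the last step seen against the expected number of steps for the forecast length shows how far the model got before the error.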
ID: 66920 · Report as offensive

©2024 cpdn.org