climateprediction.net (CPDN) home page
Thread 'Computation Errors'

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64373 - Posted: 17 Aug 2021, 9:47:14 UTC - in response to Message 64371.  

"Trickles" are from the very beginning of this project, before BOINC existed.
They're xml files, and don't use the "BOINC file transfer" for uploading.

If you haven't been seeing zip files, look in the Event log, which will show the start and finish time of all zip transfers.
You may just have missed them.

As a rough rule of thumb, the number of zips (the number just before the batch file number in the task name) will give you a clue about when they'll get created.
My N216 tasks say "5", so about every 20%.
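The rule of thumb can be sketched in a few lines of Python. This is only an illustration: it assumes the task name splits on underscores and that the fourth field is the zip count, as in a hadam4h task name of the kind seen in this thread. The function name is made up for the example.

```python
def zip_interval_percent(task_name: str) -> float:
    """Rough percentage of progress between zip uploads for one task."""
    fields = task_name.split("_")
    # Assumed field layout (not confirmed by the project):
    # model, run id, start date, zip count, batch, workunit, attempt
    zip_count = int(fields[3])
    if zip_count <= 0:
        raise ValueError("zip count must be positive")
    return 100.0 / zip_count


print(zip_interval_percent("hadam4h_114c_209705_5_902_012078811_2"))
```

A zip count of 5 gives one zip roughly every 20% of progress, which matches the estimate above.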
ID: 64373
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 64374 - Posted: 17 Aug 2021, 11:20:02 UTC - in response to Message 64373.  

Zips for my tasks are certainly still going through; one finished uploading about ten minutes ago.

The trickle-up files are what enable most of the credit to be given for a task that fails at 90%.
ID: 64374
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,033,903
RAC: 14,766
Message 64376 - Posted: 17 Aug 2021, 22:12:35 UTC - in response to Message 64373.  

Example of trickle message in Event Log.


Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1628957381.xml.sent
Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1629026564.xml.sent
Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadam4h_114c_209705_5_902_012078811_2_1629152870.xml.sent
Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1629164830.xml.sent
Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1629095805.xml.sent
Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | Sending scheduler request: To send trickle-up message.
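The file names in these log lines can be picked apart with a short Python sketch. It assumes the pattern `trickle_up_<task name>_<number>.xml`, optionally followed by `.sent`, and it assumes the trailing 10-digit number is a Unix timestamp. Neither is documented in this thread; both are read off the examples, and the function name is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed pattern: trickle_up_<task name>_<Unix time>.xml, optionally followed by .sent
TRICKLE_RE = re.compile(
    r"trickle_up_(?P<task>.+)_(?P<stamp>\d{10})\.xml(?P<sent>\.sent)?$"
)


def parse_trickle_name(text: str):
    """Return (task name, creation time in UTC, sent flag), or None if no match."""
    match = TRICKLE_RE.search(text)
    if match is None:
        return None
    created = datetime.fromtimestamp(int(match.group("stamp")), tz=timezone.utc)
    return match.group("task"), created, match.group("sent") is not None


line = (
    "Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file "
    "projects/climateprediction.net/"
    "trickle_up_hadam4h_114c_209705_5_902_012078811_2_1629152870.xml.sent"
)
print(parse_trickle_name(line))
```

If the timestamp assumption holds, this gives the time each trickle was written, which can be compared with the time the client reports reading it.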
ID: 64376
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64377 - Posted: 18 Aug 2021, 1:37:10 UTC - in response to Message 64376.  
Last modified: 18 Aug 2021, 1:45:32 UTC

Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1628957381.xml.sent


I do not get any like these. I get only:
Tue 17 Aug 2021 08:46:23 PM EDT | climateprediction.net | Sending scheduler request: To send trickle-up message.
Tue 17 Aug 2021 08:46:23 PM EDT | climateprediction.net | Not requesting tasks: some download is stalled
Tue 17 Aug 2021 08:46:25 PM EDT | climateprediction.net | Scheduler request completed
Tue 17 Aug 2021 08:46:25 PM EDT | climateprediction.net | Project is temporarily shut down for maintenance
Tue 17 Aug 2021 08:46:25 PM EDT | climateprediction.net | Project requested delay of 3600 seconds


There are files like this lying around.
-rw-r--r--. 1 boinc boinc       191 Aug 15 22:00 trickle_up_hadam4h_10c9_209305_5_902_012077800_4_1629079217.xml.sent
-rw-r--r--. 1 boinc boinc       191 Aug 17 19:36 trickle_up_hadam4h_10c9_209305_5_902_012077800_4_1629243393.xml.sent
-rw-r--r--. 1 boinc boinc       191 Aug 15 22:34 trickle_up_hadam4h_10py_209505_5_902_012078293_4_1629081291.xml.sent
-rw-r--r--. 1 boinc boinc       191 Aug 17 13:51 trickle_up_hadam4h_10py_209505_5_902_012078293_4_1629222703.xml.sent
-rw-r--r--. 1 boinc boinc       191 Aug 15 22:37 trickle_up_hadam4h_c0fc_206511_5_883_012037240_0_1629081459.xml.sent
-rw-r--r--. 1 boinc boinc       191 Aug 17 20:52 trickle_up_hadam4h_c0fc_206511_5_883_012037240_0_1629247929.xml
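In this listing the last file has no `.sent` suffix. A small Python sketch can separate the two kinds of file. It assumes, without confirmation from the project, that a name ending in `.xml` is still waiting to go and that `.xml.sent` marks one the client has already handled. The function name is hypothetical.

```python
def split_trickles(file_names):
    """Separate trickle-up files into those marked .sent and those still pending."""
    sent, pending = [], []
    for name in file_names:
        if not name.startswith("trickle_up_"):
            continue  # ignore unrelated files
        if name.endswith(".xml.sent"):
            sent.append(name)
        elif name.endswith(".xml"):
            pending.append(name)
    return sent, pending


listing = [
    "trickle_up_hadam4h_c0fc_206511_5_883_012037240_0_1629081459.xml.sent",
    "trickle_up_hadam4h_c0fc_206511_5_883_012037240_0_1629247929.xml",
]
sent, pending = split_trickles(listing)
print("sent:", len(sent), "pending:", len(pending))
```

The sketch only looks at names. It does not touch the files themselves or anything else in the BOINC data directory.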

ID: 64377
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64378 - Posted: 18 Aug 2021, 2:34:19 UTC

Down towards the bottom of client_state.xml, there's a section for files, and BOINC has a number of flags for each one.
I think they start at flag=0, and go up.
Each time that BOINC reaches a new phase with a file, it moves to the next flag.
These are how it knows what to do next with each file.

Sent is one of the steps, with the next one being "this file has now been really sent". Or some such thing.

DON'T meddle with the client_state.xml file !!!!!

And please be Patient !!!
ID: 64378
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 64381 - Posted: 18 Aug 2021, 10:52:13 UTC

Les, I am not meddling with anything. Now I have WUs ready to report with nothing in the transfers folder. I was just wondering if the server tells the WU to generate trickles or whatever.
When you go to the Server Status page everything is dead. Then I check with my BOINC and WUs are ready to report? So, that Internet Black Hole Theory comes to mind. Stacie at Collatz Conjecture has made up a pretty realistic-looking theory about black holes. Never mind.
ID: 64381
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 64849 - Posted: 6 Dec 2021, 11:36:49 UTC

Just noticed this task which completed successfully yesterday has segmentation errors.

SIGSEGV: segmentation violation
Stack trace (21 frames):
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x80d4cf7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f4a560]
/lib/i386-linux-gnu/libc.so.6(getenv+0x9a)[0xf79d8e3a]
/lib/i386-linux-gnu/libc.so.6(+0xcfcfd)[0xf7a6acfd]
/lib/i386-linux-gnu/libc.so.6(+0xd006f)[0xf7a6b06f]


There is quite a lot more if anyone wants to follow the link. I don't remember seeing it before on a successful task. I wonder if it means the error was after the files to be uploaded were produced?
ID: 64849
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65659 - Posted: 19 Jul 2022, 1:00:20 UTC
Last modified: 19 Jul 2022, 1:01:46 UTC

I just completed some work units successfully.
For some of these work units, I am the second to attempt them.
Of those that worked for me that failed for others, many lacked the occasional 32-bit compatibility libraries.
But I got so many from one machine (All tasks for computer 1517479) that I looked up that machine, and it fails everything it attempts. OVER 11,000 FAILURES.
Something wrong with its file-system setup. He acts as though he never checks anything and does not know his machine is failing.

Stderr

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 12 (0xc, -244)</message>
<stderr_txt>
unzip:  cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.ZIP.
unzip:  cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.ZIP.
unzip:  cannot find or open hadsm4_data_8.02_i686-pc-linux-gnu.zip, hadsm4_data_8.02_i686-pc-linux-gnu.zip.zip or hadsm4_data_8.02_i686-pc-linux-gnu.zip.ZIP.
unzip:  cannot find or open hadsm4_a05e_201310_12_934_012146656.zip, hadsm4_a05e_201310_12_934_012146656.zip.zip or hadsm4_a05e_201310_12_934_012146656.zip.ZIP.
cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.so after 11 attempts
cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu after 11 attempts

</stderr_txt>
]]>
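
When comparing many failed tasks like this one, it can help to pull out just the archives that unzip could not open. The Python sketch below does that for stderr text in the format shown above. The regular expression is written against these sample lines only, and the function name is hypothetical.

```python
import re

# Matches the first path in lines such as
# "unzip:  cannot find or open <path>, <path>.zip or <path>.ZIP."
UNZIP_RE = re.compile(r"^unzip:\s+cannot find or open (?P<path>[^,]+),")


def missing_archives(stderr_text: str):
    """Return the archive paths that unzip reported it could not open."""
    paths = []
    for line in stderr_text.splitlines():
        match = UNZIP_RE.match(line.strip())
        if match:
            paths.append(match.group("path"))
    return paths


sample = """\
unzip:  cannot find or open hadsm4_data_8.02_i686-pc-linux-gnu.zip, hadsm4_data_8.02_i686-pc-linux-gnu.zip.zip or hadsm4_data_8.02_i686-pc-linux-gnu.zip.ZIP.
cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu after 11 attempts
"""
print(missing_archives(sample))
```

If the same project-directory paths turn up as missing on every task from a host, that points at the host's setup rather than at any one work unit.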


Can something be done about his machine, such as cut it off?
ID: 65659
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 65660 - Posted: 19 Jul 2022, 3:10:59 UTC

OK, email sent to Andy.
ID: 65660
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,885,708
RAC: 18,983
Message 65662 - Posted: 19 Jul 2022, 7:05:18 UTC - in response to Message 65659.  

It looks like that computer (1517479) belongs to Eric J Korpela, SETI@home director I believe. It seems like most of his computers are erroring out a lot of tasks here. https://www.cpdn.org/show_host_detail.php?hostid=1517479
ID: 65662
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65663 - Posted: 19 Jul 2022, 7:12:47 UTC - in response to Message 65662.  

It looks like that computer (1517479) belongs to Eric J Korpela, SETI@home director I believe. It seems like most of his computers are erroring out a lot of tasks here. https://www.cpdn.org/show_host_detail.php?hostid=1517479


I looked at a lot of his work units. He seems to gobble up a lot of work units at a time. And they all fail, no matter what model he tries to run. Except there are four or five that are still in progress from very early this year.

Seems to me he should know how to set up his system and check that it is running. I imagine SETI@home was the first user of BOINC.
ID: 65663
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,885,708
RAC: 18,983
Message 65664 - Posted: 19 Jul 2022, 7:28:57 UTC

A few weeks ago I decided to test and max out my Ryzen 5900X (12C/24T) with 50GB RAM dedicated to WSL2 Ubuntu 22.04. Ran 24 HadAM4 N144s at the same time and they all finished without errors. The CPU has 64MB of L3 cache so about 2.6MB per task available on average. They all got done in about 20 days so about 1.2 tasks per day average, not a bad throughput I thought.
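The cache figure is simple division, sketched here in Python. Sharing the L3 evenly across tasks is a simplification made for the example, not something measured in this thread, and the function name is hypothetical.

```python
def cache_per_task_mb(l3_cache_mb: float, concurrent_tasks: int) -> float:
    """Average L3 cache available to each task if it were shared evenly."""
    return l3_cache_mb / concurrent_tasks


for tasks in (24, 12):
    print(tasks, "tasks:", round(cache_per_task_mb(64, tasks), 2), "MB each")
```

Halving the number of concurrent tasks doubles the average share, which is one reason running fewer tasks than threads may be worth testing.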
ID: 65664
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,885,708
RAC: 18,983
Message 65665 - Posted: 19 Jul 2022, 7:35:09 UTC - in response to Message 64849.  

I believe I've seen successful tasks with SIGSEGV errors as well. Additionally with "Model crashed: INANCLA: Error opening file " such as this task https://www.cpdn.org/result.php?resultid=22206596
ID: 65665
geophi
Volunteer moderator

Send message
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 65666 - Posted: 19 Jul 2022, 19:37:52 UTC - in response to Message 65664.  

A few weeks ago I decided to test and max out my Ryzen 5900X (12C/24T) with 50GB RAM dedicated to WSL2 Ubuntu 22.04. Ran 24 HadAM4 N144s at the same time and they all finished without errors. The CPU has 64MB of L3 cache so about 2.6MB per task available on average. They all got done in about 20 days so about 1.2 tasks per day average, not a bad throughput I thought.

I'm assuming you are talking about the 13 month HADAM4 N144 tasks. Running 5 at a time on my 5600X, each task takes about 4 days, so in 20 days it would finish about 25.

I really think that you should test this with no use of the SMT threads, running 12 at a time. My guess is that total model throughput would be considerably higher than what happened running 24 at a time.

Now I realize that the comparison of my PC with yours is not apples to apples as you are running these in a VM, with the associated performance penalty, and my 5600X is running these natively in Linux. Also, it was running at 4.4 to 4.5 GHz and I'm sure yours is throttling more running that many. But it's been a long time since running a significant number of models above the total number of cores resulted in more total model throughput. Perhaps with something like hadcm3s (if it were again to be released for Linux), using some of the SMT threads would increase throughput, but I doubt the HADAM4 N144 models would see much, if any, by running more tasks than cores.
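
The comparison can be put into numbers with a short Python sketch. It uses only the rough figures quoted in this thread (24 tasks in about 20 days, and 5 at a time at about 4 days each) and treats throughput as concurrent tasks divided by days per task. The function name is hypothetical.

```python
def tasks_per_day(concurrent_tasks: int, days_per_task: float) -> float:
    """Steady-state throughput when a fixed number of tasks run side by side."""
    return concurrent_tasks / days_per_task


runs = {
    "5900X, 24 at a time, about 20 days each": tasks_per_day(24, 20),
    "5600X, 5 at a time, about 4 days each": tasks_per_day(5, 4),
}
for label, rate in runs.items():
    print(f"{label}: {rate:.2f} tasks/day")
```

On these figures the six-core machine running 5 tasks keeps pace with the twelve-core machine running 24, which is what prompts the suggestion to try 12 at a time.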
ID: 65666
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65667 - Posted: 19 Jul 2022, 23:59:27 UTC - in response to Message 65666.  

Perhaps with something like hadcm3s (if it were again to be released for Linux)

I had only one of these work on Linux:

All UK Met Office HadCM3 short tasks for computer 1511241

22191699 	12129726 	29 Jan 2022, 20:48:05 UTC 	1 Feb 2022, 13:43:03 UTC 	Completed 	211,754.62 	210,243.20 	4,354.56 	UK Met Office HadCM3 short v8.36
i686-pc-linux-gnu

ID: 65667
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,885,708
RAC: 18,983
Message 65668 - Posted: 20 Jul 2022, 8:28:36 UTC - in response to Message 65666.  

It was batch 929. The throughput was definitely not the best I've seen and I don't plan on running 24 at a time again, in part for that reason. I've run 12 at a time before, but with SMT on and the other 12 threads running other BOINC projects. I believe it took about 9 days.

WSL2 uses a lot less resources than a typical VM, one of the reasons I like it. I actually have both the CPU and GPU undervolted to reduce energy use as I have the PC on 24/7 running BOINC projects. The CPU is set to 3.7 GHz, which is the base speed of the CPU. RAM is 64GB 3200MHz 16-20-20-40. I'm not sure how much CPU speed and RAM speed and timings make a difference as I never tried to test. From reading, it sounds like having a larger cache makes the biggest difference. I'll probably never run more than 12 CPDN tasks at a time again but I was curious to see how it'd go maxing out.
ID: 65668
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 65669 - Posted: 20 Jul 2022, 15:09:47 UTC

I've got a couple 3900X boxes now, and running 12 threads instead of 24 seems to actually improve "instructions retired per second" with CPDN tasks. They're just too RAM/cache intensive to run that many at once.

Though... sorry, one of my machines just barfed out a bunch of tasks. :( I upgraded the RAM and I think the new RAM is bad. The system won't suspend/resume properly. It ran a clean memtest, but just... things aren't right and it'll probably error out the rest of the tasks from suspend/resume errors as I have to power it down to replace the RAM again. I'd rather have a smaller amount of fast RAM than a lot of slower RAM for this stuff - it runs fewer tasks, but does get through them faster, so I'll just put it back in that configuration.
ID: 65669
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 65670 - Posted: 20 Jul 2022, 16:26:22 UTC - in response to Message 65669.  

I'd rather have a smaller amount of fast RAM than a lot of slower RAM for this stuff - it runs fewer tasks, but does get through them faster, so I'll just put it back in that configuration.


Yep, I will be upgrading my RAM soon to get faster memory, though probably also going from 32GB to 64GB. I should probably do some tests to see exactly what difference it makes.
ID: 65670
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 65671 - Posted: 20 Jul 2022, 17:27:54 UTC

I'd be interested in the results, for sure! I went back to my 16GB of working DDR4-3600, vs 32GB of "my board doesn't like it" 3200.

I don't have enough tasks right now to properly load it up having crashed that set, but...

The 3900X that just errored out most of its tasks has 4 N216 tasks running, and is retiring about 30G instructions per second (in the 28-32G range).

The other 3900X, running 12 N216 tasks, is retiring... 30-34G instructions per second.

Same RAM speeds, just different capacities.

And when I was running 8 N216s on the low-RAM box, it was chewing through them quite a bit faster than the 12 task box.

I'm not sure that there aren't throughput gains with more tasks, but it isn't massive, for sure. And all the loaded cores are hitting the same speeds, I have good power and cooling on these rigs.
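
To make those numbers easier to compare, here is a rough per-task breakdown in Python. It takes the midpoints of the two ranges quoted above and divides by the number of running tasks. Treating retired instructions as a stand-in for model progress is an assumption made for this sketch, and the function name is hypothetical.

```python
def per_task_rate(total_ginstr_per_s: float, tasks: int) -> float:
    """Average instructions retired per second per task, in units of 1e9."""
    return total_ginstr_per_s / tasks


# Midpoints of the quoted ranges: 28-32G with 4 tasks, 30-34G with 12 tasks.
light = per_task_rate(30.0, 4)
heavy = per_task_rate(32.0, 12)
print(f"4 tasks: {light:.1f}G per task; 12 tasks: {heavy:.1f}G per task")
```

With the total nearly flat, each of the 12 tasks gets roughly a third of the rate each of the 4 tasks gets, so tripling the task count buys little extra throughput on these figures.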
ID: 65671
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 65672 - Posted: 20 Jul 2022, 20:49:33 UTC - in response to Message 65671.  

From memory (my own rather than RAM), maximum throughput on my box was with six N216 tasks (I have 8 real cores) and ten or twelve N144 tasks. But I shall, assuming there are tasks around, run some proper tests before and after swapping.
ID: 65672

©2024 cpdn.org