Message boards : Number crunching : Computation Errors
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"Trickles" are from the very beginning of this project, before BOINC existed. They're XML files, and don't use the "BOINC file transfer" for uploading. If you haven't been seeing zip files, look in the Event log, which will show the start and finish time of all zip transfers. You may just have missed them. As a rough rule of thumb, the number of zips (just before the batch file number in the task name) will give you a clue about when they'll get created. My N216 tasks say "5", so about every 20%. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
Zips for my tasks are certainly still going through; one finished uploading about ten minutes ago. The trickle-up files are what enable most of the credit to be given for a task that fails at 90%. |
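The rule of thumb about zip counts mentioned earlier in the thread can be sketched in a few lines of Python. This is only an illustration: the function name `zip_upload_interval` is made up, and the field layout (model, ID, start date, zip count, batch, work unit, attempt) is an assumption inferred from the task names quoted in this thread, not a documented format.

```python
def zip_upload_interval(task_name: str) -> float:
    """Return the approximate percentage of progress between zip uploads.

    Assumes (not confirmed by the project) that the fourth underscore-separated
    field, the one just before the batch number, is the number of zips.
    """
    fields = task_name.split("_")
    if len(fields) < 5:
        raise ValueError(f"unexpected task name layout: {task_name!r}")
    zip_count = int(fields[3])  # assumed position of the zip count
    if zip_count <= 0:
        raise ValueError(f"zip count must be positive, got {zip_count}")
    return 100.0 / zip_count


# Task name taken from the Event Log excerpt in this thread.
print(zip_upload_interval("hadam4h_114c_209705_5_902_012078811_2"))  # 20.0
```

With a zip count of 5 this gives one zip roughly every 20% of progress, which matches the estimate given for the N216 tasks.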
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
Example of trickle message in Event Log. Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1628957381.xml.sent Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1629026564.xml.sent Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadam4h_114c_209705_5_902_012078811_2_1629152870.xml.sent Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1629164830.xml.sent Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1629095805.xml.sent Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | Sending scheduler request: To send trickle-up message. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Tue 17 Aug 2021 19:44:02 BST | climateprediction.net | [trickle] read trickle file projects/climateprediction.net/trickle_up_hadcm3s_a0cv_192012_120_919_012114543_0_1628957381.xml.sent I do not get any like these. I get only Tue 17 Aug 2021 08:46:23 PM EDT | climateprediction.net | Sending scheduler request: To send trickle-up message. Tue 17 Aug 2021 08:46:23 PM EDT | climateprediction.net | Not requesting tasks: some download is stalled Tue 17 Aug 2021 08:46:25 PM EDT | climateprediction.net | Scheduler request completed Tue 17 Aug 2021 08:46:25 PM EDT | climateprediction.net | Project is temporarily shut down for maintenance Tue 17 Aug 2021 08:46:25 PM EDT | climateprediction.net | Project requested delay of 3600 seconds There are files like this lying around. -rw-r--r--. 1 boinc boinc 191 Aug 15 22:00 trickle_up_hadam4h_10c9_209305_5_902_012077800_4_1629079217.xml.sent -rw-r--r--. 1 boinc boinc 191 Aug 17 19:36 trickle_up_hadam4h_10c9_209305_5_902_012077800_4_1629243393.xml.sent -rw-r--r--. 1 boinc boinc 191 Aug 15 22:34 trickle_up_hadam4h_10py_209505_5_902_012078293_4_1629081291.xml.sent -rw-r--r--. 1 boinc boinc 191 Aug 17 13:51 trickle_up_hadam4h_10py_209505_5_902_012078293_4_1629222703.xml.sent -rw-r--r--. 1 boinc boinc 191 Aug 15 22:37 trickle_up_hadam4h_c0fc_206511_5_883_012037240_0_1629081459.xml.sent -rw-r--r--. 1 boinc boinc 191 Aug 17 20:52 trickle_up_hadam4h_c0fc_206511_5_883_012037240_0_1629247929.xml |
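For anyone wanting to see which trickle files in the project directory are still pending, here is a rough Python sketch that splits a trickle file name into its parts. The helper `parse_trickle_filename` and the `Trickle` tuple are hypothetical names. It assumes the trailing number is a Unix timestamp (the values in the listing above line up with the file modification times, but that is an inference) and that a `.sent` suffix marks a file the client has already handed to the scheduler.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Trickle(NamedTuple):
    task_name: str     # e.g. hadam4h_c0fc_206511_5_883_012037240_0
    created: datetime  # assumed: trailing number is a Unix timestamp
    sent: bool         # assumed: ".sent" suffix means already sent


def parse_trickle_filename(filename: str) -> Trickle:
    """Split 'trickle_up_<task>_<timestamp>.xml[.sent]' into its parts."""
    sent = filename.endswith(".sent")
    stem = filename[: -len(".sent")] if sent else filename
    if not (stem.startswith("trickle_up_") and stem.endswith(".xml")):
        raise ValueError(f"not a trickle-up file name: {filename!r}")
    body = stem[len("trickle_up_") : -len(".xml")]
    task_name, _, stamp = body.rpartition("_")
    created = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return Trickle(task_name, created, sent)


# The one file in the listing above without a ".sent" suffix.
pending = parse_trickle_filename(
    "trickle_up_hadam4h_c0fc_206511_5_883_012037240_0_1629247929.xml"
)
print(pending.task_name, pending.created.isoformat(), pending.sent)
```

Under those assumptions the last file in the listing decodes to 18 Aug 2021 00:52:09 UTC, which is 20:52 EDT on 17 Aug, the same time `ls` reports for it.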
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Down towards the bottom of client_state.xml, there's a section for files, and BOINC has a number of flags for each one. I think they start at flag=0, and go up. Each time that BOINC reaches a new phase with a file, it moves to the next flag. These are how it knows what to do next with each file. Sent is one of the steps, with the next one being "this file has now been really sent". Or some such thing. DON'T meddle with the client_state.xml file !!!!! And please be Patient !!! |
Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
Les, I am not meddling with anything. Now I have WUs ready to report with nothing in the transfers folder. I was just wondering if the server tells the WU to generate trickles or whatever. When you go to the Server Status page everything is dead. Then I check with my BOINC and WUs are ready to report? So, that Internet Black Hole Theory comes to mind. Stacie at Collatz Conjecture has made up a pretty realistic looking Theory about Black Holes. Never mind. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
Just noticed this task which completed successfully yesterday has segmentation errors. SIGSEGV: segmentation violation Stack trace (21 frames): ../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x80d4cf7] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f4a560] /lib/i386-linux-gnu/libc.so.6(getenv+0x9a)[0xf79d8e3a] /lib/i386-linux-gnu/libc.so.6(+0xcfcfd)[0xf7a6acfd] /lib/i386-linux-gnu/libc.so.6(+0xd006f)[0xf7a6b06f] There is quite a lot more if anyone wants to follow the link. I don't remember seeing it before on a successful task. I wonder if it means the error was after the files to be uploaded were produced? |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just completed some work units successfully. For some of these work units, I am the second to attempt them. Of those that worked for me that failed for others, many lacked the occasional 32-bit compatibility libraries. But I got so many from one machine (computer 1517479) that I looked up that machine, and it fails everything it attempts. OVER 11,000 FAILURES. Something is wrong with its file-system setup. The owner acts as though he never checks anything and does not know his machine is failing. Stderr <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process exited with code 12 (0xc, -244)</message> <stderr_txt> unzip: cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.ZIP. unzip: cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.ZIP. unzip: cannot find or open hadsm4_data_8.02_i686-pc-linux-gnu.zip, hadsm4_data_8.02_i686-pc-linux-gnu.zip.zip or hadsm4_data_8.02_i686-pc-linux-gnu.zip.ZIP. unzip: cannot find or open hadsm4_a05e_201310_12_934_012146656.zip, hadsm4_a05e_201310_12_934_012146656.zip.zip or hadsm4_a05e_201310_12_934_012146656.zip.ZIP. cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.so after 11 attempts cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu after 11 attempts </stderr_txt> ]]> Can something be done about his machine, such as cut it off? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
OK, email sent to Andy. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,821,077 RAC: 19,775 |
It looks like that computer (1517479) belongs to Eric J Korpela, SETI@home director I believe. It seems like most of his computers are erroring out a lot of tasks here. https://www.cpdn.org/show_host_detail.php?hostid=1517479 |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
It looks like that computer (1517479) belongs to Eric J Korpela, SETI@home director I believe. It seems like most of his computers are erroring out a lot of tasks here. https://www.cpdn.org/show_host_detail.php?hostid=1517479 I looked at a lot of his work units. He seems to gobble up a lot of work units at a time. And they all fail, no matter what model he tries to run. Except there are four or five that are still in progress from very early this year. Seems to me he should know how to set up his system and check that it is running. I imagine SETI@home was the first user of BOINC. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,821,077 RAC: 19,775 |
A few weeks ago I decided to test and max out my Ryzen 5900X (12C/24T) with 50GB RAM dedicated to WSL2 Ubuntu 22.04. Ran 24 HadAM4 N144s at the same time and they all finished without errors. The CPU has 64MB of L3 cache so about 2.6MB per task available on average. They all got done in about 20 days so about 1.2 tasks per day average, not a bad throughput I thought. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,821,077 RAC: 19,775 |
I believe I've seen successful tasks with SIGSEGV errors as well. Additionally, some with "Model crashed: INANCLA: Error opening file ", such as this task: https://www.cpdn.org/result.php?resultid=22206596 |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
A few weeks ago I decided to test and max out my Ryzen 5900X (12C/24T) with 50GB RAM dedicated to WSL2 Ubuntu 22.04. Ran 24 HadAM4 N144s at the same time and they all finished without errors. The CPU has 64MB of L3 cache so about 2.6MB per task available on average. They all got done in about 20 days so about 1.2 tasks per day average, not a bad throughput I thought. I'm assuming you are talking about the 13 month HADAM4 N144 tasks. Running 5 at a time on my 5600X, each task takes about 4 days, so in 20 days it would finish about 25. I really think that you should test this with no use of the SMT threads, running 12 at a time. My guess is that total model throughput would be considerably higher than what happened running 24 at a time. Now I realize that the comparison of my PC with yours is not apples to apples as you are running these in a VM, with the associated performance penalty, and my 5600X is running these natively in Linux. Also, it was running at 4.4 to 4.5 GHz and I'm sure yours is throttling more running that many. But it's been a long time since running a significant number of models above the total number of cores resulted in more total model throughput. Perhaps with something like hadcm3s (if it were again to be released for Linux), using some of the SMT threads would increase throughput, but I doubt the HADAM4 N144 models would see much, if any, by running more tasks than cores. |
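The comparison in this post is simple arithmetic, and a short Python sketch makes it easy to rerun with other numbers. The function name `tasks_per_day` is made up for illustration. The task counts and durations come from the two posts; the physical core counts (12 for the 5900X as stated in the post, 6 for the 5600X, which is not stated in the thread) are only used for the per-core figure.

```python
def tasks_per_day(concurrent_tasks: int, days_per_task: float) -> float:
    """Steady-state throughput when every slot finishes a task in days_per_task."""
    return concurrent_tasks / days_per_task


# 5900X under WSL2: 24 tasks at once, about 20 days each.
r5900x = tasks_per_day(24, 20)
# 5600X native Linux: 5 tasks at once, about 4 days each.
r5600x = tasks_per_day(5, 4)

print(f"5900X: {r5900x:.2f} tasks/day, {r5900x * 20:.0f} tasks in 20 days")
print(f"5600X: {r5600x:.2f} tasks/day, {r5600x * 20:.0f} tasks in 20 days")

# Per physical core (12 and 6 cores; the 6 is an assumption not in the thread).
print(f"per core: 5900X {r5900x / 12:.3f}, 5600X {r5600x / 6:.3f} tasks/day")
```

On these figures the six-core machine running 5 tasks slightly out-produces the twelve-core machine running 24, which is the basis for suggesting a rerun with 12 tasks and no SMT threads in use.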
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Perhaps with something like hadcm3s (if it were again to be released for Linux) I had only one of these work on Linux: All UK Met Office HadCM3 short tasks for computer 1511241 22191699 12129726 29 Jan 2022, 20:48:05 UTC 1 Feb 2022, 13:43:03 UTC Completed 211,754.62 210,243.20 4,354.56 UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,821,077 RAC: 19,775 |
It was batch 929. The throughput was definitely not the best I've seen and I don't plan on running 24 at a time again, partly for that reason. I've run 12 at a time before but with SMT on and the other 12 threads were running other BOINC projects. I believe it took about 9 days. WSL2 uses a lot less resources than a typical VM, one of the reasons I like it. I actually have both the CPU and GPU undervolted to reduce energy use as I have the PC on 24/7 running BOINC projects. The CPU is set to 3.7 GHz, which is the base speed of the CPU. RAM is 64GB 3200MHz 16-20-20-40. I'm not sure how much CPU speed and RAM speed and timings make a difference as I never tried to test. From reading, it sounds like having a larger cache makes the biggest difference. I'll probably never run more than 12 CPDN tasks at a time again but I was curious to see how it'd go maxing out. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I've got a couple 3900X boxes now, and running 12 threads instead of 24 seems to actually improve "instructions retired per second" with CPDN tasks. They're just too RAM/cache intensive to run that many at once. Though... sorry, one of my machines just barfed out a bunch of tasks. :( I upgraded the RAM and I think the new RAM is bad. The system won't suspend/resume properly. It ran a clean memtest, but just... things aren't right and it'll probably error out the rest of the tasks from suspend/resume errors as I have to power it down to replace the RAM again. I'd rather have a smaller amount of fast RAM than a lot of slower RAM for this stuff - it runs fewer tasks, but does get through them faster, so I'll just put it back in that configuration. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
I'd rather have a smaller amount of fast RAM than a lot of slower RAM for this stuff - it runs fewer tasks, but does get through them faster, so I'll just put it back in that configuration. Yep, will be upgrading my RAM soon to get faster though probably going from 32GB to 64. I should probably do some tests to see exactly what difference it makes. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I'd be interested in the results, for sure! I went back to my 16GB of working DDR4-3600, vs 32GB of "my board doesn't like it" 3200. I don't have enough tasks right now to properly load it up having crashed that set, but... The 3900X that just errored out most of its tasks has 4 N216 tasks running, and is retiring about 30G instructions per second (in the 28-32G range). The other 3900X, running 12 N216 tasks, is retiring... 30-34G instructions per second. Same RAM speeds, just different capacities. And when I was running 8 N216s on the low-RAM box, it was chewing through them quite a bit faster than the 12 task box. I'm not sure that there aren't throughput gains with more tasks, but they aren't massive, for sure. And all the loaded cores are hitting the same speeds; I have good power and cooling on these rigs. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
From memory (my own rather than RAM), maximum throughput on my box was with six N216 tasks (I have 8 real cores) and ten or twelve N144 tasks. But I shall, assuming there are tasks around, run some proper tests before and after swapping. |