Message boards : Number crunching : Computing error
Author | Message |
---|---|
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
The task hadcm3istd_cslz_1920_160_06021211 ended with an "error while computing" message after 4600 hours on my Linux box. The model run started in 1920 and had reached about 2070. Is it possible to find out what caused this error? I am running SuSE Linux 11.1 32-bit PAE on an Opteron 1210, not overclocked, and I very rarely get a computing error in any of my 6 BOINC projects. Tullio |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The list of what happens to a model, success OR failure, is in stderr, on the project's page for each model. Click on the plus sign alongside it. In this case, it was "cannot open input file ...", which may indicate that an AV program had the file locked while checking it, at the moment that the model's program wanted to use it. Backups: Here |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
From "stderr": Model crashed: My guess is that something external to boinc had the files locked, a virus scan, for example. "stderr" is visible on the model's page; click the " + " sign to see the diagnostics. If you have a recent backup, the model could be restarted from that point -- however, it would also restart work on your other projects. There is a convoluted way to get around the other projects but only CPDN would run until the CPDN Task completed. EDIT: Beat me to it again, Les! "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
Thanks Les and AstroWX. I am using Linux and have not run any virus scan. I am using only a firewall, plus a modem from Telecom Italia with built-in firewall protection. I shall read the stderr.txt file. Tullio |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
This is the message I got on my terminal:
12-Oct-2010 22:24:16 [climateprediction.net] Restarting task hadcm3istd_cslz_1920_160_06021211_4 using hadcm3i version 604
12-Oct-2010 22:24:28 [climateprediction.net] Computation for task hadcm3istd_cslz_1920_160_06021211_4 finished
12-Oct-2010 22:24:28 [climateprediction.net] Output file hadcm3istd_cslz_1920_160_06021211_4_16.zip for task hadcm3istd_cslz_1920_160_06021211_4 absent |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Which isn't very useful. Use the stderr in your account on this web site. |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
Which isn't very useful. Here is its last part:
CPDN Monitor - Quit request from BOINC...
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day
Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day
Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day
Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day
Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day
Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day
Model crashed:
Sorry, too many model crashes!
:-( called boinc_finish
</stderr_txt> |
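The log only says the restart files could not be opened; it does not say whether they were missing, unreadable, or locked. A minimal Python sketch along these lines could narrow that down on the affected machine. The `diagnose_file` helper is hypothetical (not part of BOINC or CPDN), and the directory is simply the one quoted in the stderr output above.

```python
import os


def diagnose_file(path):
    """Return a short description of why `path` may fail to open for reading."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a regular file"
    if not os.access(path, os.R_OK):
        return "exists but not readable by this user"
    try:
        # Actually try to open and read one byte, since access checks
        # alone do not catch every kind of failure.
        with open(path, "rb") as fh:
            fh.read(1)
    except OSError as exc:
        return f"open failed: {exc.strerror}"
    return "readable"


if __name__ == "__main__":
    # Directory copied from the stderr output; adjust for your own install.
    dataout = ("/home/tchersi/BOINC/projects/climateprediction.net/"
               "hadcm3istd_cslz_1920_160_06021211/dataout")
    for name in ("atmos_restart.day", "ocean_restart.day"):
        print(name, "->", diagnose_file(os.path.join(dataout, name)))
```

Run it as the same user that runs the BOINC client, otherwise the permission check says little about what the model itself could see.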
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I've never seen so many messages in stderr before, or many of these particular messages. All I can say is that whenever I've seen a mention of lockfile it has proved fatal for the model. The model tries again and again, but I don't think I've ever seen a case where a model has recovered from this (e.g. one or two instances of lockfile, after which the file miraculously unlocks and the model marches on again). The lockfile messages you had are not the same as the ones we see on some computers that mean the person must upgrade their Boinc version; and in any case that situation only happened on Windows. Is Boinc in the trusted zone of both your firewall and AV? I wouldn't let automatic AV scans run while Boinc is running. The only consolation is that the computer produced 15 decadal files, which will all be used by the researchers. Cpdn news |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
I have a backup made on October 9 (I make one every week) but I am not going to use it, partly because I have AQUA, Einstein, QuantumFIRE, QMC and SETI (when not down) all running happily. I will be glad if my 4600 hours of runtime and 4000 hours of CPU time have served any purpose. Tullio |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
If by chance you have a single-core machine with Linux, even a quite old one (I don't think HadCM needs SSE2), you could transfer the whole backup to it when it has no work, suspend (and never run) the tasks that are being run on your good computer, and just let the HadCM crunch to the end. You'd get the message from the server that the task had already been reported as completed, but the final file would still be accepted, added to the model's other results and used. I'm going to do this with a CPDN FAMOUS that crashed on my quad when I had a Big Problem. When my single-core machine has finished its current work next week, I'll let it complete just this one model from the restore of a multi-model backup. After it's finished I'll delete the restored contents of the Boinc Data folder and put back the original Data folder package. CPDN has a task indexing system that puts together all the files from models even if the files upload to several different servers or from more than one computer. Cpdn news |
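For the "suspend everything except CPDN" step on the second machine, a small Python sketch like the following could build the commands instead of clicking through the manager. It assumes the `boinccmd` tool that ships with the BOINC client and its `--project URL suspend` operation; the project URLs shown are placeholders, so substitute the ones your own client reports, and check `boinccmd --help` on your client version before relying on it.

```python
import subprocess


def suspend_commands(project_urls, keep_url):
    """Build one 'boinccmd --project URL suspend' command per project
    other than keep_url (the project that should keep running)."""
    return [
        ["boinccmd", "--project", url, "suspend"]
        for url in project_urls
        if url != keep_url
    ]


def run_all(commands):
    """Execute the commands; raises CalledProcessError if one fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    projects = [
        "http://example.org/cpdn/",
        "http://example.org/einstein/",
        "http://example.org/seti/",
    ]
    for cmd in suspend_commands(projects, keep_url="http://example.org/cpdn/"):
        print(" ".join(cmd))  # dry run: print instead of executing
```

The script only prints the commands by default; calling `run_all` on the list would actually issue them to a running client.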
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
I had a 400 MHz PII Deschutes but I gave it to my son. He lives in Tuscany and visits me about monthly. I could give him a Flash memory stick with the BOINC directory. I just bought a 1.4 TB external hard disk to save my personal files so I can upgrade my SuSE distro to 11.3. Thanks for your suggestion. Tullio |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
How much RAM has that computer got? HadCM is much more likely to crunch and not crash if it has 512 MB of RAM, not just 256 MB. Maybe you'll need to look around to see whether you have a spare old RAM card too. Of course the computer's memory may be built into its motherboard, in which case you'd just have to take a chance. If it has less than 256 MB of RAM I don't think the model would be likely to succeed. This opinion is based on what happened to BBC members' models, which were almost the same as yours. Trying this is all much more worthwhile for long models than for short ones, especially if they crashed near the end. Lucky son, living in Tuscany. And lucky that you manage to see each other so often. Cpdn news |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
If I remember correctly, the PII has only 384 MB RAM. I gave up running CPDN on it, but it did some work on the BBC model. It was also running SETI and Einstein. I now have 5 GB RAM on my Linux box. But I also have an AT&T Olivetti UNIX PC with 2.5 MB RAM and a 40 MB disk. It is still running UNIX System V, with a threadbare windowing system and a three-button mouse. Cheers. Tullio |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
If the computer could crunch a BBC model it should be able to crunch this HadCM. Cpdn news |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
My first Famous task ended with a compute error after 56 hours. I got a second one; I hope it works. Tullio |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
4 out of 7 computers, all with different CPUs and OS, errored on this task. Three are still working. Too many total results, says a red line. Tullio On my second Famous task, my wingman has already errored. |
Joined: 30 Aug 04 Posts: 142 Credit: 9,936,132 RAC: 0 |
There is a thread about famous here. To sum up, the error rate seems to be about one in three. It's a good idea to look at how other computers are doing, to check if there's a problem at your end. Note that there are also discrepancies according to OS and CPU. As an example, although I've been fairly successful, here are two quick failures with Invalid Theta: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6959088 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6961241 Good luck with your crunching. Forum search Site search |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
We can forget about the "Too many total results" message, which is irrelevant to CPDN and really shouldn't be there. What I'm going to say now refers to HadSM and FAMOUS only. If you look at the workunit page for a crashed model of one of these types, you can see which other computers got a model from the same WU. Computers with the same combination of operating system + CPU type (AMD or Intel) should generate bit-identical results for these two model types. For example, if one computer with Linux + AMD crashes one of these models after a particular timestep, any other computer with the same combination should crash it at the same point. Sometimes there's no other computer in the WU of the same type as our own, but often we can compare. If two computers of the same type crash a HadSM or FAMOUS at the same processing point, we know the problem lies in the model. But if our model crashes while another computer of the same type completes it, we know there's probably something wrong with one of those computers. The computer with the problem is more likely to be the one that crashed the model, though not necessarily. In any case, for FAMOUS, if you go to a model's web page and click on stderr +, you see its messages. If you see NEGATIVE PRESSURE or INVALID THETA it's almost certain that the crash was caused by the model's parameter values. If one of these messages appears 5 or 6 times in a row, it's even more certain. If these PRESSURE or THETA messages appear here and there, one at a time, interspersed with lots of other messages, then a problem with the computer is often the cause. Cpdn news |
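The rule of thumb about consecutive versus scattered PRESSURE/THETA messages can be sketched in Python roughly as follows. This is only an illustration of the heuristic described in the post, not a CPDN tool: the function names are made up, the input is assumed to be a list with one stderr message per entry, and the threshold of 5 is taken from the "5 or 6 times" remark.

```python
import re

# The two messages the post singles out.
FATAL = re.compile(r"NEGATIVE PRESSURE|INVALID THETA")


def longest_fatal_run(messages):
    """Length of the longest run of consecutive messages matching FATAL."""
    best = run = 0
    for msg in messages:
        if FATAL.search(msg):
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def likely_cause(messages, threshold=5):
    """Rough guess at the cause, following the forum rule of thumb."""
    hits = sum(1 for m in messages if FATAL.search(m))
    if hits == 0:
        return "no PRESSURE/THETA message"
    if longest_fatal_run(messages) >= threshold:
        return "probably model parameters"
    return "possibly the computer"


if __name__ == "__main__":
    sample = ["Model crashed: ATM_DYN : INVALID THETA DETECTED."] * 6
    print(likely_cause(sample))
```

Treat the result as a hint at most; comparing against other computers of the same OS + CPU type on the workunit page remains the stronger check.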
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
Here is my stderr.txt: <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234) </message> <stderr_txt> CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 60 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 61 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 68 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 69 - Return code = 1 CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... 
BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 60 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 61 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 68 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 69 - Return code = 1 CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 60 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 61 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 68 - Return code = 1 BUFFIN: Read Failed: No such file or directory BUFFIN: C I/O Error feof - Unit 69 - Return code = 1 CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy Sorry, too many model crashes! 
:-( (2478): called boinc_finish </stderr_txt> ]]> The second Famous unit is still crunching after 48+ hours. Tullio |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The relevant text for the failure is: INVALID THETA DETECTED Which is the cause of most FAMOUS failures. |
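Picking the relevant line out of a long stderr dump by eye is tedious, so here is a hedged Python sketch that counts whatever follows each "Model crashed:" marker. It assumes the stderr text has been pasted in as one string, and it treats `tmp/pipe_dummy` and the final "Sorry, too many model crashes!" line as trailers to cut off; that trimming is an assumption based only on the output quoted in this thread.

```python
import re
from collections import Counter

# Text that follows a crash reason in the quoted output (assumed trailers).
TRAILER = re.compile(r"tmp/pipe_dummy|Sorry, too many model crashes!")


def crash_reasons(stderr_text):
    """Count the reason text found after each 'Model crashed:' marker."""
    reasons = []
    # Everything before the first marker is not a crash reason, so skip it.
    for part in stderr_text.split("Model crashed:")[1:]:
        reason = TRAILER.split(part)[0].strip()
        if reason:
            reasons.append(reason)
    return Counter(reasons)


if __name__ == "__main__":
    sample = (
        "CPDN Monitor - Quit request from BOINC... "
        "Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy "
        "Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy "
        "Sorry, too many model crashes!"
    )
    for reason, count in crash_reasons(sample).items():
        print(count, "x", reason)
```

For the FAMOUS output above this would report the INVALID THETA reason once per crash, which matches what was read off by hand.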