Thread 'Windows Sulphur Cycle Reached Phase 2 but Linux Blew Up at the Gate'

Author	Message
old_user3335 Joined: 30 Aug 04 Posts: 29 Credit: 418,651 RAC: 0	Message 16346 - Posted: 1 Oct 2005, 1:56:48 UTC
I have a Sulphur on a P4 3.0GHz HT with Windows XP Home that has made it to Phase 2. I had a Sulphur waiting on an AMD Athlon 64 2800+ running Linux 2.6.10, BOINC 4.19 (optimized by Ned Slider). I was messing with the Linux machine and changed to BOINC 4.43. I have changed back and forth before with no ill effects. The slab WU was early in Phase 2. So when BOINC 4.43 woke up, it went to work on the WU with the earliest deadline, which was the sulphur. I kept the messages, in case they are of interest to anyone, but it quit with an error. Then it went back to crunching the slab, and it seems to be fine.
I stuck with optimized 4.19 because the benchmarks with 4.43 were so low. It turns out the timesteps with 4.43 are 2.2 sec compared to 2.4 sec. So at least for CPDN, optimized BOINC doesn't seem to help me.
No, I haven't been backing up. But I guess I need to start doing that. If I had, I could have suspended that sulphur cycle and tried running it under 4.19 when the time came.
I wanted to try 4.43 so I would have the ability to drain off the other projects' WUs instead of trashing them by detaching, since 4.19 doesn't have as many features. I didn't think I would have got the sulphur done in time sharing with other projects.
Thanks for listening and Happy Crunching.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 16347 - Posted: 1 Oct 2005, 2:45:28 UTC - in response to Message 16346.
"I kept the messages, in case they are of interest to anyone, but it quit with an error. Then it went back to crunching the slab, and it seems to be fine."
I would be interested in the error message. I, and several other people, had difficulty with 4.43 in Linux and sulphur associated with the BOINC benchmarks.
"I stuck with optimized 4.19 because the benchmarks with 4.43 were so low. It turns out the timesteps with 4.43 are 2.2 sec compared to 2.4 sec. So at least for CPDN, optimized BOINC doesn't seem to help me."
The version of BOINC, optimized or not, should have minuscule impact on CPDN performance. Maybe 0.02 sec/TS, if that. Certainly not 0.2 sec/TS.
"I wanted to try 4.43 so I would have the ability to drain off the other projects' WUs instead of trashing them by detaching, since 4.19 doesn't have as many features. I didn't think I would have got the sulphur done in time sharing with other projects."
Yes, 4.43 does have some nice features compared to 4.19, but it also has some problems with CPDN. Hopefully the next version of BOINC will iron those out.

Arnaud Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0	Message 16349 - Posted: 1 Oct 2005, 7:23:17 UTC
An optimized BOINC isn't useful for CPDN, as credits are not claimed according to benchmarks. And an optimized BOINC doesn't speed up the applications; it only affects credits. Using 4.19 with Sulphur is not a good idea because you can't upload results at the end of the phase.
Arnaud

old_user3335 Joined: 30 Aug 04 Posts: 29 Credit: 418,651 RAC: 0	Message 16364 - Posted: 1 Oct 2005, 14:46:40 UTC
Here are the messages. 2nqc is the slab that I had been running with 4.19, and 494c is the sulphur cycle. I have also noticed that there appear to be about three timesteps in a row missing from 2nqc, so I am not so sure 4.43 took up where 4.19 left off. I promise not to play with it anymore in the middle of a model without backing up.

2005-09-29 23:13:14 [---] Resuming computation and network activity
2005-09-29 23:13:14 [---] Computer is overcommitted
2005-09-29 23:13:14 [---] Nearly overcommitted.
2005-09-29 23:13:14 [---] New work fetch policy: no work fetch allowed.
2005-09-29 23:13:14 [---] New CPU scheduler policy: earliest deadline first.
2005-09-29 23:13:14 [---] schedule_cpus: must schedule
2005-09-29 23:13:14 [---] earliest deadline: 1139934000.000000 494c_b00298716_0
2005-09-29 23:13:14 [climateprediction.net] Starting result 494c_b00298716_0 using sulphur_cycle version 4.21
Starting model in /root/BOINC/projects/climateprediction.net...
Archive: sulphur_se_4.21_i686-pc-linux-gnu.zip
inflating: ./sulphur_se_4.21_i686-pc-linux-gnu
inflating: ./sulphur_gfx_4.21_i686-pc-linux-gnu
inflating: ./globe.rgb
extracting: ./gfx.sh
Archive: sulphur_um_4.21_i686-pc-linux-gnu.zip
inflating: ./sulphur_um_4.21_i686-pc-linux-gnu
Archive: sulphur_data_4.21_i686-pc-linux-gnu.zip
Archive: 494c_b00298716.zip
inflating: 494c_b00298716/jobs/climate.spin
inflating: 494c_b00298716/jobs/climate.cont
inflating: 494c_b00298716/jobs/climate.doub
inflating: 494c_b00298716/jobs/climate.so2.cont
inflating: 494c_b00298716/jobs/climate.so2.doub
inflating: 494c_b00298716/jobs/ncatts.cpdc
Created shared memory region key = 26600 .so shmem return code = 136446572
Copying files for startup...
In pre_initialise_phase (part 1 of 3)
In initialise_phase (part 2 of 3)
In startup_phase (part 3 of 3)
2005-09-29 23:13:17 [---] request_reschedule_cpus: process exited
2005-09-29 23:13:17 [climateprediction.net] Computation for result 494c_b00298716_0 finished
2005-09-29 23:13:17 [---] schedule_cpus: must schedule
2005-09-29 23:13:17 [---] New work fetch policy: work fetch allowed.
2005-09-29 23:13:17 [---] New CPU scheduler policy: highest debt first.
Starting model in /root/BOINC/projects/climateprediction.net...
Created shared memory region key = 25390
2005-09-29 23:13:17 [climateprediction.net] Restarting result 2nqc_000145322_0 using hadsm3 version 4.13
2005-09-29 23:13:17 [climateprediction.net] Unrecoverable error for result 494c_b00298716_0 (
<file_xfer_error> <file_name>494c_b00298716_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
)
2005-09-29 23:13:17 [climateprediction.net] Unrecoverable error for result 494c_b00298716_0 (
<file_xfer_error> <file_name>494c_b00298716_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
)
Env Used=LD_LIBRARY_PATH=/root/BOINC/projects/climateprediction.net:/usr/local/lib:/usr/lib:/lib
2005-09-29 23:13:18 [climateprediction.net] Deferring communication with project for 58 seconds
2005-09-29 23:13:18 [climateprediction.net] Deferring communication with project for 58 seconds
Starting model ID 2nqc_000145322 Phase 2
Stack size=48.00 MB
Waiting for model startup, this may take a minute...
2nqc_000145322 - PH 2 TS 169201 - 16/09/1835 00:30 - H:M:S=0268:09:45 AVG= 2.25 DLT= 0.00
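
The "Unrecoverable error" entries in the log above all follow the same pattern, so they can be condensed into a file-to-code table before posting. The following is only a sketch: the function name `summarize_xfer_errors` and the regular expression are illustrative and not part of BOINC, and it assumes the message text looks exactly like the entries pasted in this thread.

```python
import re

# Matches one <file_xfer_error> entry as it appears in the pasted message text.
# Assumption: whitespace between tags may vary, but tag order does not.
ENTRY = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]*)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def summarize_xfer_errors(message: str) -> dict[str, int]:
    """Return {file_name: error_code} for every file_xfer_error entry found."""
    return {m.group("name"): int(m.group("code")) for m in ENTRY.finditer(message)}


# Two entries copied from the log in this thread.
sample = (
    "<file_xfer_error> <file_name>494c_b00298716_0_1.zip</file_name> "
    "<error_code>-161</error_code> <error_message></error_message> </file_xfer_error> "
    "<file_xfer_error> <file_name>494c_b00298716_0_2.zip</file_name> "
    "<error_code>-161</error_code> <error_message></error_message> </file_xfer_error>"
)
print(summarize_xfer_errors(sample))
```

Because the result is a dictionary keyed by file name, the repeated "Unrecoverable error" message collapses into a single set of entries.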

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 16365 - Posted: 1 Oct 2005, 15:06:37 UTC
These were the error messages on the result page for that WU:

<core_client_version>4.43</core_client_version>
<stderr_txt>
End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of sulphur_data_4.21_i686-pc-linux-gnu.zip or sulphur_data_4.21_i686-pc-linux-gnu.zip.zip, and cannot find sulphur_data_4.21_i686-pc-linux-gnu.zip.ZIP, period.
cp: cannot create regular file `tmp/cp.namelists': No such file or directory
</stderr_txt>

I have never seen these before so I'm afraid I can't be much help.
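
The unzip message above says the data archive could not be read as a zip file. One way to confirm that independently is to test the file with Python's standard `zipfile` module. This is a sketch under assumptions: `check_archive` is a hypothetical helper, and the path in the usage comment is only a guess based on the project directory shown in the client log.

```python
import zipfile
from pathlib import Path


def check_archive(path: Path) -> str:
    """Classify an archive as 'missing', 'not a zip', 'corrupt member: X', or 'ok'."""
    if not path.exists():
        return "missing"
    # is_zipfile looks for the end-of-central-directory record that unzip
    # reported as not found.
    if not zipfile.is_zipfile(path):
        return "not a zip"
    with zipfile.ZipFile(path) as archive:
        bad = archive.testzip()  # name of the first member failing its CRC check, or None
    return f"corrupt member: {bad}" if bad else "ok"


# Hypothetical usage; the directory is an assumption taken from the log:
# check_archive(Path("/root/BOINC/projects/climateprediction.net")
#               / "sulphur_data_4.21_i686-pc-linux-gnu.zip")
```

A result of "not a zip" would match the unzip error and would suggest a truncated or incomplete download rather than a problem with the model itself.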

old_user5994 Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0	Message 16366 - Posted: 1 Oct 2005, 15:14:16 UTC
Well, if nothing else I would like to add the error logs to my collection for possibly adding to the Wiki ... Can you zip up the *.TXT and OLD files and send them to p.d.buck@comcast.net ... Thanks!

old_user3335 Joined: 30 Aug 04 Posts: 29 Credit: 418,651 RAC: 0	Message 16371 - Posted: 1 Oct 2005, 18:32:03 UTC - in response to Message 16366.
"Can you zip up the *.TXT and OLD files and send them to p.d.buck@comcast.net ... Thanks!"
I guess not, I am too dumb. Not sure what I am doing wrong getting Ark to zip things. I need to do some more reading. I don't see any OLD files in Linux. I see them in the Windows BOINC folder. In Linux, there are txt files in the slots folders. I do have "hidden files" turned on. My poor little 494c folder in Projects only has an xml file in it. I put it in a word document and it is only three or four pages long.

old_user5994 Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0	Message 16383 - Posted: 2 Oct 2005, 6:25:33 UTC
Hmm, you should have several TXT files ... For example, in my boinc directory on OS-X I have:

-rw-r--r-- 1 paulbuck admin 464817 Oct 1 23:28 stderrdae.txt
-rw-r--r-- 1 paulbuck admin 22545 Sep 28 11:30 stderrgui.txt
-rw-r--r-- 1 paulbuck admin 2097163 Sep 8 23:52 stdoutdae.old
-rw-r--r-- 1 paulbuck admin 1969502 Oct 1 23:28 stdoutdae.txt
-rw-r--r-- 1 paulbuck admin 9 Sep 8 06:42 stdoutgui.txt
G5a:/Library/application support/boinc data paulbuck$

These are the files I am interested in ... Not sure where they are "hidden" in Linux as I don't have a Linux system (yet).

Arnaud Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0	Message 16396 - Posted: 3 Oct 2005, 7:16:11 UTC Last modified: 3 Oct 2005, 7:17:39 UTC
In Linux, the log files are usually in the WU folder (~/BOINC/projects/climateprediction.net/xxxx_yyyyyyyyyy) and in the slot folder (~/BOINC/slots/X).
Arnaud
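
For anyone who has trouble getting a desktop archiver to bundle the requested files, the collection step can be sketched in a few lines of Python. This is an illustration only: `collect_logs` is a hypothetical helper, and `~/BOINC` in the usage comment is assumed from the paths mentioned above, so adjust it to wherever the client is actually installed.

```python
import zipfile
from pathlib import Path


def collect_logs(boinc_dir: Path, out_zip: Path) -> list[str]:
    """Zip every *.txt and *.old file under boinc_dir; return the archived names."""
    # Walk the whole tree so files in slots/ and projects/ are picked up too.
    wanted = sorted(
        p for p in boinc_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in {".txt", ".old"}
    )
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in wanted:
            # Store paths relative to the BOINC directory to keep the archive tidy.
            archive.write(path, path.relative_to(boinc_dir).as_posix())
    return [p.relative_to(boinc_dir).as_posix() for p in wanted]


# Hypothetical usage (directory name is an assumption):
# print(collect_logs(Path.home() / "BOINC", Path("boinc_logs.zip")))
```

The returned list doubles as a quick check of which log files actually exist, which is useful when it is unclear whether the client wrote any at all.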

old_user3335 Joined: 30 Aug 04 Posts: 29 Credit: 418,651 RAC: 0	Message 16414 - Posted: 4 Oct 2005, 3:29:17 UTC - in response to Message 16396. Last modified: 4 Oct 2005, 3:33:12 UTC
"In Linux, the log files are usually in the Wu folder (~/BOINC/projects/climateprediction.net/xxxx_yyyyyyyyyy) and in the slot folder (~/BOINC/slots/X)"
Well, this is the xml file from the 494c folder in projects:

[UMID]
[V]100[/V] [MD]SCYCLE[/MD] [N]494c_b00298716[/N]
[PH]0[/PH] [TS]1[/TS] [DAY]0[/DAY] [MTH]0[/MTH] [YR]0[/YR] [HR]0[/HR] [MIN]0[/MIN] [SEC]0[/SEC]
[CSF]0[/CSF] [TR]0[/TR] [ST]0[/ST] [RS]3[/RS] [RSC]1[/RSC] [RSDT]0[/RSDT] [RSMT]0[/RSMT] [RSYT]0[/RSYT]
[RSD attr="0"][/RSD] [RSD attr="1"][/RSD] [RSD attr="2"][/RSD] [RSD attr="3"][/RSD] [RSD attr="4"][/RSD] [RSD attr="5"][/RSD] [RSD attr="6"][/RSD] [RSD attr="7"][/RSD] [RSD attr="8"][/RSD] [RSD attr="9"][/RSD] [RSD attr="10"][/RSD]
[RSM attr="0"][/RSM] [RSM attr="1"][/RSM] [RSM attr="2"][/RSM] [RSM attr="3"][/RSM] [RSM attr="4"][/RSM] [RSM attr="5"][/RSM] [RSM attr="6"][/RSM] [RSM attr="7"][/RSM] [RSM attr="8"][/RSM] [RSM attr="9"][/RSM] [RSM attr="10"][/RSM]
[RSY attr="0"][/RSY] [RSY attr="1"][/RSY] [RSY attr="2"][/RSY] [RSY attr="3"][/RSY] [RSY attr="4"][/RSY] [RSY attr="5"][/RSY] [RSY attr="6"][/RSY] [RSY attr="7"][/RSY] [RSY attr="8"][/RSY] [RSY attr="9"][/RSY] [RSY attr="10"][/RSY]
[CS attr="0"]../sulphur_um_4.21_i686-pc-linux-gnu=0d60790beb86b831463989c82ae15185[/CS]
[CS attr="1"]jobs/climate.spin=1627afc00d4677ab01bcf34a5c90d48c[/CS]
[CS attr="2"]jobs/climate.cont=b8fa65d29109ae10a4b33378116c1f12[/CS]
[CS attr="3"]jobs/climate.doub=a39b44eec85dab48b7b806bb5576103f[/CS]
[CS attr="4"]jobs/climate.so2.cont=a337bc8f664f4d7dab01cb476e658e99[/CS]
[CS attr="5"]jobs/climate.so2.doub=0f628c3acc846f79678eaed577474d36[/CS]
[CS attr="6"]jobs/ncatts.cpdc=bb68dbfdc5bcf5d5173f01b2857e31dd[/CS]
[/UMID]

(All the angle brackets were replaced with [].)
There is also a stderr.txt file in slots that just says "no heartbeat from core client" over and over again. Perhaps if I had collected the stderr.txt right after the model crash I could have had something useful.
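
The CS entries in the file above pair a relative path with a 32-character hex string. If those strings are MD5 checksums (an assumption; the thread does not say so), the unpacked files could be checked against them. The helper names `parse_checksums` and `verify` below are hypothetical, and the parser expects the square-bracket form as pasted in this thread.

```python
import hashlib
import re
from pathlib import Path

# One [CS ...]path=hash[/CS] entry; the attribute text itself is ignored.
CS_ENTRY = re.compile(r"\[CS [^\]]*\]([^=\]]+)=([0-9a-f]{32})\[/CS\]")


def parse_checksums(text: str) -> dict[str, str]:
    """Return {relative_path: hex_digest} for every CS entry in the pasted text."""
    return dict(CS_ENTRY.findall(text))


def verify(base_dir: Path, checksums: dict[str, str]) -> dict[str, str]:
    """Report 'ok', 'mismatch' or 'missing' for each listed file under base_dir."""
    report = {}
    for rel_path, expected in checksums.items():
        target = base_dir / rel_path
        if not target.is_file():
            report[rel_path] = "missing"
        else:
            # Assumption: the stored value is an MD5 digest of the file contents.
            actual = hashlib.md5(target.read_bytes()).hexdigest()
            report[rel_path] = "ok" if actual == expected else "mismatch"
    return report
```

Run against the work unit folder, a report full of "missing" entries would be consistent with a model that never finished unpacking.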

old_user5994 Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0	Message 16422 - Posted: 4 Oct 2005, 11:45:24 UTC
Stealing liberally from the Wiki ... the text for that message (we have most of them, but are always looking for one more):
========
This error means that one or more of the running processes that make up the BOINC Client Software on the Participant's Computer has stopped running (it "crashed"). In other words, something real bad happened. The usual suspect is the BOINC Daemon, but it can also be from the failure of the Science Application.

Now, for those that want to know more, a "heart-beat" (or heartbeat) is a periodic message sent from one software component to another telling that other software component, "I am alive and well!". In the BOINC Client Software we have signals going from the Science Application to the BOINC Daemon, and a separate set of signals going in the other direction. If the BOINC Daemon stops running, we want the Science Application to also stop, and vice versa. If one dies, the other should die also.

These heart-beat signals are common in software systems where there are multiple components that run essentially independent of each other. They are just small messages repeated at short, regular intervals. So, they don't take much away from your hunt for maximum credit.

Courtesy of Walt Gribben (with minor edits by Paul):

This message means that the BOINC Client Software stopped communicating with the Science Application. The BOINC Daemon sends a heartbeat message out so that the Science Application programs know it's still alive and kicking. So if the messages stop, it's supposed to mean that the BOINC Daemon isn't running anymore (perhaps it crashed?) and the Science Applications are also supposed to exit. That's after they don't get a heartbeat message for 30 seconds. So, they print the "no heartbeat" error and exit.

The Science Applications use an exit code of zero to indicate there isn't any error, at least not with the Work Unit. Later, the BOINC Daemon sees that the Science Application exited (zero status) but wasn't finished with the Work Unit (there is no finished file), so it restarts the Work Unit from where it left off, or at least from the last Checkpoint.

There might be a problem with BOINC Daemon to Science Application communications, but it's not all that serious. Some time is lost in restarting the Work Unit, but it's not like it has to start from the beginning each time. After I saw the Work Units were completing in around the same time whether or not they got "no heartbeat" messages, I stopped looking into it.
==========
The log files are in the same directory that contains the "slot" directories and the main BOINC Client Software. If you are running one of the command line clients, then you have to use the UNIX redirects to write the files. It has been so long since I have run those that I do not recall the exact syntax. But with the later client software there is no real reason not to run the "graphical" clients. I don't know why they bother with the older CLI versions anymore ... but I digress ...
If you don't have the logs, you don't have the logs ... but it is strange ...
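
The heartbeat idea described in the post above can be illustrated with a small timeout tracker. This is a generic sketch of the pattern, not BOINC's actual code: the class `HeartbeatMonitor` and its methods are invented for illustration, and only the 30-second figure comes from the quoted text.

```python
import time
from typing import Callable

HEARTBEAT_TIMEOUT = 30.0  # seconds; the give-up period quoted in the post


class HeartbeatMonitor:
    """Tracks when the peer process was last heard from."""

    def __init__(
        self,
        timeout: float = HEARTBEAT_TIMEOUT,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_beat = clock()

    def beat(self) -> None:
        """Record that a heartbeat message has just arrived."""
        self._last_beat = self._clock()

    def peer_alive(self) -> bool:
        """True while the last heartbeat is no older than the timeout."""
        return self._clock() - self._last_beat <= self._timeout
```

In this sketch a science application would call `beat()` whenever a message arrives from the daemon and, on finding `peer_alive()` false, log "no heartbeat" and exit with status zero so the work unit can later be restarted from its last checkpoint.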

old_user5994 Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0	Message 16423 - Posted: 4 Oct 2005, 11:48:15 UTC Last modified: 4 Oct 2005, 11:50:26 UTC
I just looked in your account and found this: [url=http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1105225]Result[/url]

<core_client_version>4.43</core_client_version>
<stderr_txt>
End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of sulphur_data_4.21_i686-pc-linux-gnu.zip or sulphur_data_4.21_i686-pc-linux-gnu.zip.zip, and cannot find sulphur_data_4.21_i686-pc-linux-gnu.zip.ZIP, period.
cp: cannot create regular file `tmp/cp.namelists': No such file or directory
</stderr_txt>
<message>
<file_xfer_error> <file_name>494c_b00298716_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>494c_b00298716_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
</message>

Not sure if I got the right computer or not ... but this tells me that you did not even get the model unpacked ...

old_user3335 Joined: 30 Aug 04 Posts: 29 Credit: 418,651 RAC: 0	Message 16429 - Posted: 4 Oct 2005, 16:40:15 UTC - in response to Message 16423.
"I just looked in your account and found this: [snip] Not sure if I got the right computer or not ... but this tells me that you did not even get the model unpacked ..."
Paul, yes, that was my crashed sulphur cycle.
As to the heartbeat issue: when I look in the Linux system monitor, boinc is listed as sleeping, while the science app is listed as running. If a process is sleeping, does it still have a heartbeat? Or does it stop sending heartbeat messages because it is sleeping, which would explain all those "no heartbeat" messages?
Pam

Arnaud Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0	Message 16431 - Posted: 4 Oct 2005, 17:11:13 UTC - in response to Message 16429. Last modified: 4 Oct 2005, 17:19:53 UTC
"As to the heartbeat issue- when I look in the linux system monitor, boinc is listed as sleeping, while the science app is listed as running."
That's normal. Same thing on my machines. BOINC must run from time to time to communicate with the science apps, but otherwise it sleeps.
Edit: I checked in Ksysguard: the CPU usage of boinc goes to 0.25% about once a minute.
Arnaud

old_user5994 Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0	Message 16436 - Posted: 4 Oct 2005, 19:56:18 UTC
Pam,
The BOINC Client/BOINC Daemon is and should be mostly asleep. If it is not, then there is a problem. In essence, as Arnaud said, the BOINC Daemon should be using 1% or less CPU in almost all cases (the BENCHMARK period is obviously one of the special cases). In effect, the BOINC Daemon is like an operating system kernel: you need it, but you don't want it to take up too much time.
However, if the BOINC Daemon dies, the science application has just lost its "parent" process. And one of the means that is used to determine death is the "heart-beat" exchange. Not having studied that part of the code, and not having my own science application to test (yet), I cannot say for sure what other special cases could exist where you would lose the heartbeat. But they SHOULD be rare. If you are getting this all the time, and the models are crashing, there is something going on ...
This is why the logs are so important. You can tell a lot from the content. Like who died, when, sometimes ... :)
The point being, I don't know how/why you are not getting the log files, which is why I asked if you were using the CLI version. I had, for example, problems with CPDN on the PowerMac running OS-X ... turns out this is a "known" issue that is easily fixed but I did not pursue it hard enough ... looking at MY error messages in the result file got me the clue ... and I may be running again ... won't be real sure till I get past the first "trickle" ... but I do have an hour of run time ...
Anyway ... I am not the best trouble-shooter, but if you are still having problems we can TRY to get you back on the air again ... first we have to get your logs back though ...