Model Restarting Repeatedly

DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 29021 - Posted: 28 May 2007, 4:53:23 UTC
Last modified: 28 May 2007, 4:58:59 UTC

My model appears to restart repeatedly but it won't tell me the error!

Sure, I can restore from a backup, but what do I change after the restore to prevent the crashing?

After a CPU/motherboard upgrade (same OS and kernel), it does this:

2007-05-27 13:35:02 [---] Starting BOINC client version 5.8.15 for i686-pc-linux-gnu
2007-05-27 13:35:02 [---] log flags: task, file_xfer, sched_ops, unparsed_xml, benchmark_debug
2007-05-27 13:35:02 [---] Libraries: libcurl/7.16.0 OpenSSL/0.9.8d zlib/1.2.3
2007-05-27 13:35:02 [---] Data directory: /usr/local/boinc
2007-05-27 13:35:02 [---] Processor: 2 AuthenticAMD AMD Opteron(tm) Processor 248 HE [fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm]
2007-05-27 13:35:02 [---] Memory: 1.98 GB physical, 494.15 MB virtual
2007-05-27 13:35:02 [---] Disk: 44.46 GB total, 32.97 GB free
2007-05-27 13:35:02 [Einstein@Home] URL: http://einstein.phys.uwm.edu/; Computer ID: 903846; location: home; project prefs: default
2007-05-27 13:35:02 [climateprediction.net] URL: http://climateprediction.net/; Computer ID: 531684; location: home; project prefs: default
2007-05-27 13:35:02 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 3025937; location: home; project prefs: default
2007-05-27 13:35:02 [---] General prefs: from climateprediction.net (last modified 2007-05-22 23:43:45)
2007-05-27 13:35:02 [---] Host location: home
2007-05-27 13:35:02 [---] General prefs: no separate prefs for home; using your defaults
2007-05-27 13:47:54 [climateprediction.net] Restarting task hadcm3inct_cl6r_1920_160_05862500_3 using hadcm3i version 541
Beginning work on result hadcm3inct_cl6r_1920_160_05862500_3...
Starting model in /usr/local/boinc/projects/climateprediction.net...
Created shared memory region key = 171555 of size 655060 bytes (version 602)
.so shmem return code = 0
Starting model ID hadcm3inct_cl6r_1920_160_05862500 Phase 1
Program launched with process id # 3424
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Getting pthread attributes - retval=0
Setting pthread size (100663296 bytes) - retval=0
Executing program hadcm3transum_5.41_i686-pc-linux-gnu 171555
hadcm3inct_cl6r_1920_160_05862500 - PH 1 TS 0381025 A - 13/08/1935 00:30 - H:M:S=0483:49:45 AVG= 4.57 DLT= 0.00
Model restart required...
Preparing for restart attempt # 1...
Starting model ID hadcm3inct_cl6r_1920_160_05862500 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (100663296 bytes) - retval=0
Executing program hadcm3transum_5.41_i686-pc-linux-gnu 171555
Program launched with process id # 3439
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Model restart required...
Preparing for restart attempt # 2...
Starting model ID hadcm3inct_cl6r_1920_160_05862500 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (100663296 bytes) - retval=0
Executing program hadcm3transum_5.41_i686-pc-linux-gnu 171555
Program launched with process id # 3449
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Model restart required...
Preparing for restart attempt # 3...
Starting model ID hadcm3inct_cl6r_1920_160_05862500 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (100663296 bytes) - retval=0
Executing program hadcm3transum_5.41_i686-pc-linux-gnu 171555
Program launched with process id # 3454
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
hadcm3inct_cl6r_1920_160_05862500 - PH 1 TS 0381457 A - 19/08/1935 00:30 - H:M:S=0484:04:01 AVG= 4.57 DLT= 1.00
Model restart required...
...
Sorry, too many model crashes! :-(
Cleaning up from the run...
Cleaning up graphics data...
Detaching shared memory...
2007-05-27 19:26:50 [climateprediction.net] Deferring communication for 1 min 0 sec
2007-05-27 19:26:50 [climateprediction.net] Reason: Unrecoverable error for result hadcm3inct_cl6r_1920_160_05862500_3 (process exited with code 22 (0x16))
2007-05-27 19:26:50 [climateprediction.net] Computation for task hadcm3inct_cl6r_1920_160_05862500_3 finished
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 29359 - Posted: 28 Jun 2007, 23:51:22 UTC

DJStarfox, I'm sorry your question has gone unanswered for so long. We still haven't got to the bottom of what this 22 code means. I don't think it's anything specifically to do with Linux. I've asked one of the programmers to look at this thread plus another thread about code 22 describing an apparently different problem.
DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 29593 - Posted: 18 Jul 2007, 0:02:25 UTC - in response to Message 29359.  

mo.v wrote:
"DJStarfox, I'm sorry your question has gone unanswered for so long. We still haven't got to the bottom of what this 22 code means. I don't think it's anything specifically to do with Linux. I've asked one of the programmers to look at this thread plus another thread about code 22 describing an apparently different problem."


Well, I've reset the project since then. I couldn't get my project folder (restored from a tar file) to run the model anymore. So far, so good: it is 14% done, which is the farthest a model has ever gone for me.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 29599 - Posted: 18 Jul 2007, 3:58:33 UTC

The code 22 errors seem to be a defect in certain models, not a problem on the cruncher's computer. Tolu is working on a solution in Oxford. Only a small proportion of models go down with this error and it can affect Windows OS as well. So that was just bad luck last time.