Message boards : Number crunching : CRASHED HADCM3
Author | Message |
---|---|
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I seem to have developed a problem with hadcm3 models. I have completed several in the past, but the last two have crashed after being stopped to make backups. I don't understand this, as I stopped the manager the right way: first suspending the model and then closing the manager. The stderr: <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> =0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... 
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... The stderr seems to be saying that an important part of the program did not restart after it was stopped. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
Jim, Perversely, the success rate for HADCM3N models on my machines went up when I stopped making backups. I now run these models completely undisturbed: no suspends, no stops, no backups, no network activity. I make one backup at the start of each model batch, before the models have even unzipped (i.e. they are suspended immediately after downloading). After crashes (other than 'negative theta'), I found by repeated experiments that models would only succeed if restarted from the beginning. Since adopting this method, no model has crashed. When a model has finished, network activity is turned on again for one large upload. This approach is unsuitable for most of the machines I could use, since they inevitably have some work use, which takes priority. So, in practice, only one Mac that can be set aside for a three-week session is used for HADCM3N. The other machines run HADAM3P when available (machines and models). Iain |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Thanks for the advice, Iain. I am not sure it would work on my machine. At more than 900 hours to complete a CM model, I don't think I could go all that time without shutting it down at least once. My other machine runs CM just fine; I finished one just yesterday. When the problem machine has finished running all of the HADAM3P WUs now on it, I plan to reset the project. Maybe that will solve whatever the problem is. |
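
For anyone reading the stderr quoted in the first post, here is a rough Python sketch of two checks that can be done by hand on such a log. It shows that the signed exit code -1073741819 and the hex value 0xc0000005 are the same 32-bit number (on Windows that value is the access violation status code), and it counts how many restart attempts the controller logged. The function names are made up for illustration and are not part of BOINC or the CPDN software, and the sample text is just a short excerpt of the log above. This only helps read the log; it does not say why the model crashed.

```python
import re

def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as its unsigned value."""
    return exit_code & 0xFFFFFFFF

def count_restart_attempts(stderr_text: str) -> int:
    """Count the 'will try to restart' messages in a stderr dump.

    Hypothetical helper: it simply matches the message text seen in the
    log quoted in this thread.
    """
    return len(re.findall(r"Model crash detected, will try to restart", stderr_text))

# Short excerpt of the stderr from the first post (not the full log).
sample = (
    "Controller:: CPDN process is not running, exiting, bRetVal = 0, "
    "checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... "
    "Controller:: CPDN process is not running, exiting, bRetVal = 0, "
    "checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... "
)

print(hex(to_unsigned32(-1073741819)))   # same number as the hex form in the log
print(count_restart_attempts(sample))    # restart messages in the excerpt
```

Pasting the full stderr text in place of the excerpt would give the total number of restart attempts before the task gave up.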
©2024 cpdn.org