climateprediction.net (CPDN) home page
Thread 'Reporting - Errors while computing -'

3rkko
Joined: 12 Feb 08
Posts: 66
Credit: 4,877,652
RAC: 0
Message 46001 - Posted: 20 Apr 2013, 22:34:47 UTC

Three crashes
hadcm3n_3af0_1980_40_008349704_2
hadcm3n_49r8_1980_40_008350067_1
hadcm3n_3jf5_1980_40_008352170_0
with the same "(C++ Exception) (0xe06d7363) at address 0x7732C41F".
ID: 46001
Matthias Lehmkuhl
Joined: 24 Sep 05
Posts: 7
Credit: 3,465,472
RAC: 3,431
Message 46004 - Posted: 21 Apr 2013, 11:49:17 UTC

I also got one:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7637C41F

Engaging BOINC Windows Runtime Debugger...

hadcm3n_4gu4_2020_40_008351404
4 results have crashed with the error above

1 result (the short one) is on Darwin 12.3.0 with error
process exited with code 22 (0x16, -234)

Matthias
ID: 46004
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 46005 - Posted: 21 Apr 2013, 12:16:22 UTC

One more of my models crashed approximately two hours ago.
hadcm3n_4h2i_1980_40_008350145_3
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7560812F

ID: 46005
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 46012 - Posted: 22 Apr 2013, 3:49:14 UTC

I'm the 5th computer to crash this model: hadcm3n_4m8m_1980_40_008349532_4
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7560812F
However, I did have success on this model after 532 hours of 24/7 non-stop crunching: hadcm3n_zmjk_1920_40_008340870_4
Three other computers crashed this same model with various errors while computing.


ID: 46012
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 46015 - Posted: 22 Apr 2013, 9:19:39 UTC

Byron, in the case of the first WU you link to, all the computers have crashed the model with error code -529697949. If you have the same OS as a computer that has already suffered this, the chances of you succeeding must be almost zero. Something is wrong with the model. Fortunately these models didn't spend much time crunching.
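For reference, the error code -529697949 and the exception code 0xe06d7363 reported earlier in this thread are the same 32-bit value, read as signed and unsigned respectively. A minimal Python sketch of the conversion follows; the helper names are illustrative and not part of BOINC or CPDN.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(code: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    code &= 0xFFFFFFFF
    return code - 0x1_0000_0000 if code & 0x8000_0000 else code


if __name__ == "__main__":
    # Signed form shown on the workunit page, converted to the hex form in the crash dump
    print(hex(to_unsigned32(-529697949)))
    # Hex form from the crash dump, converted back to the signed form
    print(to_signed32(0xE06D7363))
    # The low three bytes of the code are plain ASCII letters
    print(bytes.fromhex("6d7363").decode("ascii"))
```

0xe06d7363 is the code Windows reports for an unhandled Microsoft Visual C++ exception, so it says how the model died (an uncaught C++ exception, here reported as out of memory), not where in the model the fault lies.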

In the case of the second model you linked to, the other computers are serial crashers. Look at the other computers belonging to your wingmen. Serial crashers usually kill models after no computing time at all, or after just a few seconds. They have very little credit for the number of models they've had. If you look at a page or two of their recent models you may see an unmitigated disaster. That Linux machine probably hasn't downloaded the 32-bit libraries it needs, so every model crashes.

After private messaging was introduced to the forums it was possible for a short time to send a PM to the people crashing models and they'd receive an email notification. But the email notification of BOINC forum PMs was turned off by default to protect members' privacy. How many members notice this detail in their accounts and turn email notification on? I think the current default situation is a mistake but my pleas to Berkeley were rejected.
ID: 46015
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 46021 - Posted: 22 Apr 2013, 12:04:56 UTC

This morning I had a "Run Time Error" message on my screen. I checked my memory allocation and the model had about 1.5 GB of real memory allocated. When I clicked OK the model aborted with a Computational Error.

The error in the error file was:

The system cannot find the path specified.
(0x3) - exit code 3 (0x3)

Maybe these out-of-memory errors have something to do with a program loop that keeps allocating memory when a file is missing.
ID: 46021
nenym
Joined: 13 Jan 09
Posts: 2
Credit: 9,575,518
RAC: 29,638
Message 46269 - Posted: 23 May 2013, 4:30:51 UTC

Workunit http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8503012 seems to have a memory bug. After 31 hours of CPU time (35 hours run time, progress bar at 8%), I noticed that no trickle had been received on the server side. The task had allocated 1.5 GB of memory and 3.5 GB of virtual memory. I deleted the task because the three crunchers before me got an error while computing after a long time. Another odd issue: the database shows zero CPU time.
ID: 46269
nenym
Joined: 13 Jan 09
Posts: 2
Credit: 9,575,518
RAC: 29,638
Message 46280 - Posted: 23 May 2013, 22:10:51 UTC - in response to Message 46269.  

ID: 46280
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46283 - Posted: 23 May 2013, 22:46:23 UTC - in response to Message 46280.  

All of those models were created about the time of the "malloc error". I posted about it in this thread on 19 April.
I don't remember the outcome of the thinking/testing.


ID: 46283
Ba
Joined: 27 Jan 11
Posts: 7
Credit: 67,315,445
RAC: 6,001
Message 46344 - Posted: 2 Jun 2013, 11:51:38 UTC
Last modified: 2 Jun 2013, 11:52:28 UTC

ID: 46344
ojum-le
Joined: 5 May 07
Posts: 27
Credit: 6,369,307
RAC: 0
Message 46346 - Posted: 2 Jun 2013, 17:49:50 UTC

Try to clean up your data-directory.

C:\Boinc\data\projects\climateprediction.net\

Delete all files. I had the same issues as you.
ID: 46346
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46347 - Posted: 2 Jun 2013, 22:49:40 UTC - in response to Message 46344.  

Ba

Do NOT delete all files! You'll lose the many running models as well if you do.

There are some 'model problems' with a lot of what you have, but I haven't checked all of them because you have so many. Those I looked at are not your fault. The errors mentioned have come up a few times over the past year, and have been talked about 'somewhere'.

I'm not sure about the last one on your list. But that machine is running an "old" version of BOINC, which won't help. I'd suggest upgrading to the next version (.28), which is a release version and will have fewer bugs.



ID: 46347
Ba
Joined: 27 Jan 11
Posts: 7
Credit: 67,315,445
RAC: 6,001
Message 46362 - Posted: 3 Jun 2013, 17:25:45 UTC

Thanks.

I haven't run this many models for a while and just didn't like the look of that many errors.

I've just checked through them, and most also error on the other machines running them, so I will just keep an eye on them for now.
ID: 46362
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 46363 - Posted: 3 Jun 2013, 20:32:00 UTC

Hi Ba

You have listed 4 computers. When models have crashed on the first 3 computers in your list, the reason is usually a defect in the model. Very often model defects are listed in uppercase, e.g. NAMELIST, REPLANCA, INITTIME. These problems are not the fault of your computers and I expect that other computers in the workunits also crashed them. That's just bad luck.

Computer #4 in the list is different:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1282401

This computer has crashed a lot of models with 'No heartbeat' messages which I think could be a problem with the computer. It's an AMD with 48 cores and lots of RAM. Is it overclocked? If so, I think you should test for stability because CPDN models are rather temperamental and any instability can push them over the edge.
ID: 46363
Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 46364 - Posted: 4 Jun 2013, 5:18:51 UTC

Hi Ba,

In addition to what Mo said, you have several crashes on 1179592 that look as though they are disk-related. HadCM3Ns are "disk-write-heavy" and seem to be sensitive to sluggish disk response, much more so than the regional models (HadAM3). HadCM3Ns seem to like neither having their code and static data swapped out, nor the disks taking too long when they're creating zip files at the 25%, 50%, 75%, and 100% marks.

Probably I'm teaching my grandmother to suck eggs here, but you might want to check the sysctls vm.swappiness, vm.dirty_background_ratio, and vm.dirty_ratio.

Avoid swapping if possible (low swappiness, say 20), and avoid big "surges" in disk activity.

With 64 GB of memory, letting pending disk writes accumulate to 5% of memory (IIRC, that's the default value for vm.dirty_background_ratio) before writing them out would produce noticeable delays when the writing does take place. HadCM3Ns won't like that. Try vm.dirty_background_ratio=1 and vm.dirty_ratio=3 (both are percent of memory), and see if that reduces the number of crashes at the 25% mark, 3110.40 credits. Reducing vm.vfs_cache_pressure may also help, since CPDN models are continually writing to the same files.

Alternatively (or as well), try the 'deadline' scheduler, if you're using CFQ.
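To put rough numbers on the above, here is a small Python sketch (illustrative only, not from BOINC or CPDN) that converts the ratios into approximate amounts of pending write data on a 64 GB machine and prints matching lines for /etc/sysctl.conf. It treats the percentages as shares of total RAM, which is a simplification: the kernel applies them to available memory, so the real thresholds will be somewhat lower. The values are the ones suggested in this post, not tested recommendations.

```python
GIB = 1024 ** 3


def dirty_threshold_bytes(total_ram_bytes: int, ratio_percent: int) -> int:
    """Approximate dirty-page threshold, treating the ratio as a share of total RAM."""
    return total_ram_bytes * ratio_percent // 100


def sysctl_conf_lines(settings: dict) -> str:
    """Render settings as 'key = value' lines suitable for /etc/sysctl.conf."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


# Values suggested in the post above; treat them as a starting point only.
suggested = {
    "vm.swappiness": 20,
    "vm.dirty_background_ratio": 1,
    "vm.dirty_ratio": 3,
}

if __name__ == "__main__":
    ram = 64 * GIB
    for percent in (5, 1, 3):
        gib = dirty_threshold_bytes(ram, percent) / GIB
        print(f"{percent}% of 64 GiB is about {gib:.2f} GiB")
    print(sysctl_conf_lines(suggested))
```

To try a value without rebooting, run sysctl -w vm.dirty_background_ratio=1 as root; to keep it, add the printed lines to /etc/sysctl.conf and reload with sysctl -p.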
ID: 46364
Ba
Joined: 27 Jan 11
Posts: 7
Credit: 67,315,445
RAC: 6,001
Message 46365 - Posted: 5 Jun 2013, 0:17:49 UTC

Thanks guys.

None of the server rigs I am running is overclocked so that should not be a problem.

One of them (1179592) has an older install. I think I will let that one finish its models and then reinstall.

I really don't know that much about Linux; I just use it on the big rigs because it's free. Thanks for the suggestions, I will give them a try over the weekend.






ID: 46365
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46367 - Posted: 5 Jun 2013, 10:31:41 UTC
Last modified: 5 Jun 2013, 10:36:21 UTC

Try running both a memory stress test and a CPU stress test on the computer which Mo identified, to see if there are any underlying issues (perhaps a bad memory stick). The best stress tests are the USB- or CD-bootable ones which run on the bare hardware.

It might also be worth taking a look at the CPU temperatures; a dislodged heatsink can cause problems too.


If you have lots of memory, it might be worth setting the 'stay in memory' flag so that tasks are not constantly stopped & restarted.
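In BOINC Manager this is the "leave applications in memory while suspended" computing preference (the exact wording varies between versions). As a hedged sketch, the same preference can also be placed in a global_prefs_override.xml file in the BOINC data directory. The Python below only builds the XML text; the tag name leave_apps_in_memory is quoted from memory of BOINC's preference format, so check it against the documentation for your client version before relying on it.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(leave_in_memory: bool) -> str:
    """Build the text of a minimal global_prefs_override.xml."""
    root = ET.Element("global_preferences")
    # Assumed tag name for the 'stay in memory' preference
    flag = ET.SubElement(root, "leave_apps_in_memory")
    flag.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_prefs_override(True))
```

Setting the preference through BOINC Manager or the project website is the simpler route; the file is mainly useful on headless machines.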
I'm a volunteer and my views are my own.
ID: 46367
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 46370 - Posted: 5 Jun 2013, 22:19:23 UTC - in response to Message 46367.  
Last modified: 5 Jun 2013, 22:19:52 UTC

Mike, which test do you suggest? Prime95 is often mentioned in the forums, but I don't think that is a bootable test.
ID: 46370
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 46371 - Posted: 6 Jun 2013, 7:45:53 UTC - in response to Message 46370.  

I use memtest (http://www.memtest86.com/download.htm), which can be booted from CD or USB, and prime95 as a stress tester under whichever OS you use.
ID: 46371
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 46711 - Posted: 27 Jul 2013, 2:26:43 UTC




I wanted to open a new thread for this subject but when I try I get:

Internal Server Error:
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, cpdn-sysadmin@oerc.ox.ac.uk and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log.
Apache Server at climateapps2.oerc.ox.ac.uk Port 80

I'm using Windows 7, IE 10 and BOINC 7.0.64 (x86), running as a single installation (not as a service).

... and my problem:

I haven't seen this one before, highlighted in red. Does anyone know what it means?

26/07/2013 4:16:15 PM | climateprediction.net | Requesting new tasks for CPU
26/07/2013 4:16:19 PM | climateprediction.net | Scheduler request completed: got 0 new tasks
26/07/2013 4:16:19 PM | climateprediction.net | Server can't open log file (../log_climateapps2/scheduler.log)


ID: 46711