Thread 'Cannot locate specific track - and similar disk-like failures on client side'

Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 46158 - Posted: 6 May 2013, 11:26:01 UTC

I've looked at the other hosts - "wingmen" for some of the few hadcm3n re-submitted jobs my computers have picked up recently.

One of the most typical errors (aside from broken user hosts and some misconfigured wu's that are now gone) is some kind of "disk error".

Like "cannot find specific track" or "device does not recognize command" or such.

The half-dozen computers whose time I contribute don't seem to report this type of error (they do have user and operator errors (my bad), and some of those "ghost" tasks), but it seems that a significant fraction of the tasks my machines are re-running have failed with (approximately) "device cannot recognize command" or "find specific track".

I know my disks are as cheap as any, and as error-prone, but none of the models they have tried to run have ever shown this type of error (various Ubuntu Linux versions, Intel and AMD here).

Could be my machines are not catching some type of error, or ? , or ?

Any ideas on what happens and how to avoid?

TIA

Anyone have any idea ?
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 46159 - Posted: 6 May 2013, 14:46:13 UTC

I've noted the same thing on others' tasks in the work units I run. I've had a couple of these errors myself, but can't remember if they were on a Windows PC or my Linux PCs. The thing I do remember is that they did occur immediately after a restart of boinc.

I was wondering if the shutdown of boinc was not clean, and a file was corrupted, and the error is a result of that? Or is it just a hardware error? Not sure.
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 46160 - Posted: 6 May 2013, 16:04:06 UTC

On a couple of occasions my computer has shut down abruptly through an electricity cut or a big computer problem. That killed some of the running models though not all and led to exit code 25.

I suspect that exit code 22 is very often caused by the computer being shut down correctly but without exiting from BOINC first. I think we have to assume that most CPDN members have no idea that they should exit from BOINC first.

But a lot of exit codes and messages are a mystery to me. I often don't know whether the error number comes from a Windows list, a BOINC list or some putative CPDN list.

Windows system errors: http://www.hiteksoftware.com/knowledge/articles/049.htm

IIRC, Carl once said there might be a CPDN error code list but he wasn't specific and if it exists we certainly never got it from him.

Some of the error messages when there's a problem inherent in the model must come from the Met Office, devised by them for their models. When I see stderr messages with things like INITTIME in uppercase I assume this is a phrase devised by the Met Office.
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 46161 - Posted: 6 May 2013, 18:34:29 UTC
Last modified: 6 May 2013, 18:47:20 UTC

One of the (many) problems with the BOINC error system is the silent translation of error codes between programming languages and between operating systems.

The following tables can usually sort things out:

Error Codes - BOINC
Error Codes - FORTRAN
Error Codes - Linux
Error Codes - Windows

So, for example, Windows error:

ERROR_SEEK 25 (0x19) The drive cannot locate a specific area or track on the disk.

is probably not Linux error:

#define ENOTTY 25 /* Not a typewriter */

but could well be FORTRAN error:

25 severe (25): Record number outside range: FOR$IOS_RECNUMOUT. A direct access READ, WRITE, or FIND statement specified a record number outside the range specified when the file was opened.
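To make the ambiguity concrete, here is a minimal Python sketch that looks up one exit code in several tables at once. The `CODE_TABLES` dictionary and the `candidate_meanings` helper are hypothetical names used only for illustration, and the dictionary holds just the three entries quoted above, not the full lists.

```python
# Illustrative only: each table holds just the single entry quoted in this post.
CODE_TABLES = {
    "Windows": {25: "ERROR_SEEK: The drive cannot locate a specific area or track on the disk."},
    "Linux": {25: "ENOTTY: Not a typewriter"},
    "FORTRAN": {25: "FOR$IOS_RECNUMOUT: Record number outside range"},
}


def candidate_meanings(code):
    """Return every (table, meaning) pair that defines the given exit code."""
    return [
        (table, entries[code])
        for table, entries in CODE_TABLES.items()
        if code in entries
    ]


# Print every possible reading of exit code 25.
for table, meaning in candidate_meanings(25):
    print(f"{table:8} {meaning}")
```

All three readings come back for code 25, so the number alone cannot say which table the failing layer was using.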
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 46162 - Posted: 6 May 2013, 19:41:42 UTC

I've made this thread sticky so Iain's useful links don't disappear from view.

It would certainly be useful if the exit codes contained some indication of which list the number was generated from. I don't know whether BOINC would be capable of doing this.
Greg van Paassen

Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 46163 - Posted: 6 May 2013, 23:11:28 UTC - in response to Message 46158.  

In my (anecdotal) experience, "exit 25" errors tend to be correlated with the presence of many "Suspended CPDN Monitor - Suspend request from BOINC..." messages.

For example: 1151 suspend messages in 62619 CPU seconds, an average of one suspension every 54 CPU seconds.
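For anyone who wants to check their own tasks, a rough Python sketch of that calculation follows. It assumes you have pasted the task's stderr output into a string; `SUSPEND_MARKER`, `count_suspends` and `seconds_per_suspend` are hypothetical names, and the marker is simply a substring of the message quoted above.

```python
# Substring of the stderr message quoted in this post.
SUSPEND_MARKER = "Suspend request from BOINC"


def count_suspends(stderr_text):
    """Count the stderr lines that report a suspend request."""
    return sum(1 for line in stderr_text.splitlines() if SUSPEND_MARKER in line)


def seconds_per_suspend(cpu_seconds, suspends):
    """Average CPU seconds between suspensions, or None if there were none."""
    if suspends == 0:
        return None
    return cpu_seconds / suspends


# The figures from the example task: 1151 suspends in 62619 CPU seconds.
print(round(seconds_per_suspend(62619, 1151), 1))  # prints 54.4
```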

There may be a race condition. If Boinc suspends a task just after it requests a disk read, at the point where the operating system thinks it has successfully delivered the disk data to the requesting task, the disk data might vanish from in-memory buffers before the task is re-animated. (Especially if "leave suspended applications in memory" is not selected, so that the task itself is written to disk -- which uses disk buffers.)

The remedy for "exit 25s" may be the same as for "exit 193s": set Boinc preferences to (1) allow high levels of CPU utilisation and (2) leave applications in memory when suspended. That is, reduce the number of task suspensions and reduce the amount of code executed when they are re-animated.
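Both settings can be changed in the BOINC Manager computing preferences. For hosts managed by script, the following Python sketch builds the equivalent local override. It is a sketch under assumptions, not a tested recipe: it assumes the client reads a `global_prefs_override.xml` file from its data directory, and that `leave_apps_in_memory`, `cpu_usage_limit` and `suspend_cpu_usage` are the relevant tags. Mapping "allow high levels of CPU utilisation" onto the last two is my reading of the preference names, and `build_override` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory=True, cpu_usage_limit=100.0, suspend_cpu_usage=0.0):
    """Build override XML text; the tag names are assumptions, check your client's docs."""
    root = ET.Element("global_preferences")
    # Keep suspended tasks in RAM rather than unloading them.
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_in_memory else "0"
    # Assumed meaning: 100 = do not throttle CPU time (throttling suspends tasks).
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.1f}"
    # Assumed meaning: 0 = never suspend because other programs are using the CPU.
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.1f}"
    return ET.tostring(root, encoding="unicode")


print(build_override())
```

The script only prints the XML. Where to save it, and how to make the client re-read its preferences, depends on your installation.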

Exit 22s seem to be different. From what I have seen, there is little correlation with anything. They may be due to power failures, or to conflicts with other software. I lost four tasks running in a virtual machine with exit 22, due to repeated power failures. (Those big switches on the power mains are so tempting to toddlers...) Interestingly, tasks running on the host machine at the same time were unscathed.