Thread 'Cannot locate specific track - and similar disk-like failures on client side'

Author	Message
Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 46158 - Posted: 6 May 2013, 11:26:01 UTC
I've looked at the other hosts, the "wingmen", for some of the few re-submitted hadcm3n jobs my computers have picked up recently. One of the most typical errors (aside from broken user hosts and some misconfigured WUs that were broken and are now gone) is some kind of "disk error", such as "cannot find specific track" or "device does not recognize command".
The half-dozen computers whose time I contribute don't seem to report this type of error. They do have user and operator errors (my bad) and some of those "ghost" tasks, but a significant fraction of the tasks my machines are re-running failed on other hosts with (approximately) "device cannot recognize command" or "find specific track".
I know my disks are as cheap as any, and as error-prone, but none of the models they have tried to run has ever shown this type of error (Linux, various Ubuntu releases, Intel and AMD here). Could it be that my machines are not catching some type of error, or is it something else?
Does anyone have any ideas on what happens and how to avoid it? TIA

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 46159 - Posted: 6 May 2013, 14:46:13 UTC
I've noted the same thing on others' tasks in the work units I run. I've had a couple of these errors myself, but can't remember whether they were on a Windows PC or one of my Linux PCs. What I do remember is that they occurred immediately after a restart of BOINC. I wonder whether the shutdown of BOINC was not clean, a file was corrupted, and the error is a result of that. Or is it just a hardware error? Not sure.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46160 - Posted: 6 May 2013, 16:04:06 UTC
On a couple of occasions my computer has shut down abruptly because of a power cut or a major computer problem. That killed some, though not all, of the running models and led to exit code 25. I suspect that exit code 22 is very often caused by the computer being shut down correctly but without exiting from BOINC first. I think we have to assume that most CPDN members have no idea that they should exit from BOINC first.
But a lot of exit codes and messages are a mystery to me. I often don't know whether the error number comes from a Windows list, a BOINC list or some putative CPDN list.
Windows system errors: http://www.hiteksoftware.com/knowledge/articles/049.htm
IIRC, Carl once said there might be a CPDN error code list, but he wasn't specific, and if it exists we certainly never got it from him. Some of the error messages that appear when there's a problem inherent in the model must come from the Met Office, devised by them for their models. When I see stderr messages with things like INITTIME in uppercase, I assume this is a phrase devised by the Met Office.

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038	Message 46161 - Posted: 6 May 2013, 18:34:29 UTC Last modified: 6 May 2013, 18:47:20 UTC
One of the (many) problems with the BOINC error system is the silent translation of error codes between programming languages and between operating systems. The following tables can usually sort things out:
Error Codes - BOINC
Error Codes - FORTRAN
Error Codes - Linux
Error Codes - Windows
So, for example, Windows error:
ERROR_SEEK 25 (0x19) The drive cannot locate a specific area or track on the disk.
is probably not Linux error:
#define ENOTTY 25 /* Not a typewriter */
but could well be FORTRAN error:
25 severe (25): Record number outside range: FOR$IOS_RECNUMOUT. A direct access READ, WRITE, or FIND statement specified a record number outside the range specified when the file was opened.
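To make the ambiguity concrete, here is a minimal sketch of looking up one bare exit code in several lists at once. The table and the function name `candidate_meanings` are hypothetical and exist only for illustration. The entries for 25 are the three quoted in the post above; the entries for 22 are taken from the standard Windows and Linux error lists, not from this thread. Nothing here says which reading applies to a given failed task.

```python
# Hypothetical lookup table: what a bare exit code means in each list.
# The entries for 25 are quoted in the thread; the entries for 22 come from
# the standard Windows and Linux error lists (an addition for illustration).
MEANINGS = {
    22: {
        "windows": "ERROR_BAD_COMMAND: The device does not recognize the command.",
        "linux": "EINVAL: Invalid argument",
    },
    25: {
        "windows": "ERROR_SEEK: The drive cannot locate a specific area "
                   "or track on the disk.",
        "linux": "ENOTTY: Not a typewriter",
        "fortran": "severe (25): Record number outside range (FOR$IOS_RECNUMOUT)",
    },
}


def candidate_meanings(exit_code):
    """Return every known reading of exit_code, keyed by the list it comes from."""
    return dict(MEANINGS.get(exit_code, {}))


# Show all readings of exit code 25 side by side.
for source, text in candidate_meanings(25).items():
    print(f"{source:8s} {text}")
```

The point of returning all readings rather than one is that the number alone does not say which list it came from, so the candidates have to be weighed against what the task was doing when it failed.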

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46162 - Posted: 6 May 2013, 19:41:42 UTC
I've made this thread sticky so Iain's useful links don't disappear from view. It would certainly be useful if the exit codes contained some indication of which list the number was generated from. I don't know whether BOINC would be capable of doing this.

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 46163 - Posted: 6 May 2013, 23:11:28 UTC - in response to Message 46158.
In my (anecdotal) experience, "exit 25" errors tend to be correlated with the presence of many "Suspended CPDN Monitor - Suspend request from BOINC..." messages. For example: 1151 suspend messages in 62619 CPU seconds, an average of one suspension every 54 CPU seconds.
There may be a race condition. If BOINC suspends a task just after it requests a disk read, at the point where the operating system thinks it has successfully delivered the disk data to the requesting task, the disk data might vanish from in-memory buffers before the task is re-animated. (This is especially likely if "leave suspended applications in memory" is not selected, so that the task itself is written to disk, which uses disk buffers.)
The remedy for "exit 25s" may be the same as for "exit 193s": set BOINC preferences to (1) allow high levels of CPU utilisation and (2) leave applications in memory when suspended. That is, reduce the number of task suspensions and reduce the amount of code executed when they are re-animated.
Exit 22s seem to be different. From what I have seen, there is little correlation with anything. They may be due to power failures, or to conflicts with other software. I lost four tasks running in a virtual machine with exit 22, due to repeated power failures. (Those big switches on the power mains are so tempting to toddlers...) Interestingly, tasks running on the host machine at the same time were unscathed.
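For anyone who wants to check their own tasks for the same pattern, the suspension rate quoted above can be worked out roughly like this. This is only a sketch: the function name `suspension_stats` is made up for illustration, it assumes the task's stderr text has been saved as a plain string, and the sample log below is synthetic, built to reproduce the quoted counts rather than copied from a real task.

```python
# Marker text taken from the suspend message quoted in the post above.
SUSPEND_MARKER = "Suspend request from BOINC"


def suspension_stats(stderr_text, cpu_seconds):
    """Count suspend messages and the average CPU seconds between them.

    Returns (count, average); average is None when no suspensions are found.
    """
    count = sum(1 for line in stderr_text.splitlines() if SUSPEND_MARKER in line)
    average = cpu_seconds / count if count else None
    return count, average


# Synthetic log that reproduces the quoted counts; not real task output.
sample = "Suspended CPDN Monitor - Suspend request from BOINC...\n" * 1151
count, average = suspension_stats(sample, 62619)
print(count, round(average, 1))  # prints: 1151 54.4
```

A low average (tens of CPU seconds between suspensions) would be consistent with the anecdotal correlation described above, though it does not by itself show that suspensions caused the failure.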