Thread 'Network connection problems and CPDN model crashes'

Andris Pavenis

Joined: 22 Oct 05
Posts: 15
Credit: 2,340,122
RAC: 0
Message 40900 - Posted: 23 Oct 2010, 11:04:35 UTC

Four CPDN models crashed recently on this host: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=826910

Both times the crashes seem to be related to temporary network connection problems. I have also noticed problems earlier when:
- BOINC manager is running
- there are network connection problems

In the second BOINC log one can see that Einstein@Home also had some problems (similar to what I have observed earlier).

Andris

22-Oct-2010 07:29:14 [Milkyway@home] Sending scheduler request: To fetch work.
22-Oct-2010 07:29:14 [Milkyway@home] Requesting new tasks for GPU
22-Oct-2010 07:29:19 [Milkyway@home] Scheduler request completed: got 1 new tasks
22-Oct-2010 07:29:21 [Milkyway@home] Started download of stars-td82-2stream_20.txt
22-Oct-2010 07:29:21 [Milkyway@home] Started download of de_separation_82_3s_20_1_185629_1287721425_search_parameters
22-Oct-2010 07:29:24 [Milkyway@home] Finished download of de_separation_82_3s_20_1_185629_1287721425_search_parameters
22-Oct-2010 07:29:36 [Milkyway@home] Finished download of stars-td82-2stream_20.txt
22-Oct-2010 07:29:42 [Einstein@Home] Computation for task h1_1099.75_S5R4__244_S5GC1a_0 finished
22-Oct-2010 07:29:42 [Einstein@Home] Restarting task h1_1099.75_S5R4__224_S5GC1a_1 using einstein_S5GC1 version 105
22-Oct-2010 07:29:44 [Einstein@Home] Started upload of h1_1099.75_S5R4__244_S5GC1a_0_0
22-Oct-2010 07:29:50 [Einstein@Home] Finished upload of h1_1099.75_S5R4__244_S5GC1a_0_0
22-Oct-2010 07:38:24 [Milkyway@home] Sending scheduler request: To fetch work.
22-Oct-2010 07:38:24 [Milkyway@home] Requesting new tasks for GPU
22-Oct-2010 07:38:52 [Milkyway@home] Scheduler request failed: Couldn't resolve host name
22-Oct-2010 07:38:57 [Collatz Conjecture] Sending scheduler request: To fetch work.
22-Oct-2010 07:38:57 [Collatz Conjecture] Reporting 1 completed tasks, requesting new tasks for GPU
22-Oct-2010 07:39:20 [Collatz Conjecture] Scheduler request completed: got 1 new tasks
22-Oct-2010 07:39:20 [---] Couldn't parse preferences file - using BOINC defaults
22-Oct-2010 07:39:20 [---] Reading preferences override file
22-Oct-2010 07:39:20 [---] Preferences:
22-Oct-2010 07:39:20 [---] max memory usage when active: 4000.55MB
22-Oct-2010 07:39:20 [---] max memory usage when idle: 7200.99MB
22-Oct-2010 07:39:20 [---] max disk usage: 10.00GB
22-Oct-2010 07:39:20 [---] max CPUs used: 3
22-Oct-2010 07:39:20 [---] (to change, visit the web site of an attached project,
22-Oct-2010 07:39:20 [---] or click on Preferences)
22-Oct-2010 07:39:22 [---] Project communication failed: attempting access to reference site
22-Oct-2010 07:39:22 [Collatz Conjecture] Started download of collatz_1286499150_1156660
22-Oct-2010 07:40:11 [---] BOINC can't access Internet - check network connection or proxy configuration.
22-Oct-2010 07:40:11 [Collatz Conjecture] Temporarily failed download of collatz_1286499150_1156660: can't resolve hostname
22-Oct-2010 07:40:11 [Collatz Conjecture] Backing off 1 min 0 sec on download of collatz_1286499150_1156660
22-Oct-2010 07:40:11 [climateprediction.net] Computation for task hadsm3dhet2_jme9_006592019_8 finished
22-Oct-2010 07:40:11 [climateprediction.net] Output file hadsm3dhet2_jme9_006592019_8_1.zip for task hadsm3dhet2_jme9_006592019_8 absent
22-Oct-2010 07:40:11 [climateprediction.net] Output file hadsm3dhet2_jme9_006592019_8_2.zip for task hadsm3dhet2_jme9_006592019_8 absent
22-Oct-2010 07:40:11 [climateprediction.net] Output file hadsm3dhet2_jme9_006592019_8_3.zip for task hadsm3dhet2_jme9_006592019_8 absent
22-Oct-2010 07:40:12 [Einstein@Home] Restarting task h1_1099.75_S5R4__214_S5GC1a_0 using einstein_S5GC1 version 105
22-Oct-2010 07:40:12 [Milkyway@home] Sending scheduler request: To fetch work.
22-Oct-2010 07:40:12 [Milkyway@home] Requesting new tasks for GPU
22-Oct-2010 07:40:28 [climateprediction.net] Computation for task hadsm3dhet2_jme8_006592018_3 finished
22-Oct-2010 07:40:28 [climateprediction.net] Output file hadsm3dhet2_jme8_006592018_3_1.zip for task hadsm3dhet2_jme8_006592018_3 absent
22-Oct-2010 07:40:28 [climateprediction.net] Output file hadsm3dhet2_jme8_006592018_3_2.zip for task hadsm3dhet2_jme8_006592018_3 absent
22-Oct-2010 07:40:28 [climateprediction.net] Output file hadsm3dhet2_jme8_006592018_3_3.zip for task hadsm3dhet2_jme8_006592018_3 absent
22-Oct-2010 07:40:28 [Einstein@Home] Starting p2030_54075_19693_0073_G189.73-02.68.C_0.dm_380_2
22-Oct-2010 07:40:29 [Einstein@Home] Starting task p2030_54075_19693_0073_G189.73-02.68.C_0.dm_380_2 using einsteinbinary_ABP2 version 108
22-Oct-2010 07:40:29 [Einstein@Home] Task h1_1099.75_S5R4__224_S5GC1a_1 exited with zero status but no 'finished' file
22-Oct-2010 07:40:29 [Einstein@Home] If this happens repeatedly you may need to reset the project.
22-Oct-2010 07:40:30 [Einstein@Home] Restarting task h1_1099.75_S5R4__224_S5GC1a_1 using einstein_S5GC1 version 105
22-Oct-2010 07:40:54 [Milkyway@home] Scheduler request failed: Couldn't connect to server
22-Oct-2010 07:41:12 [Collatz Conjecture] Started download of collatz_1286499150_1156660
22-Oct-2010 07:41:48 [Collatz Conjecture] Finished download of collatz_1286499150_1156660

23-Oct-2010 13:13:10 [Milkyway@home] Computation for task de_separation_82_3s_10_1_591473_1287779991_1 finished
23-Oct-2010 13:13:10 [Milkyway@home] [coproc_debug] Assigning CUDA instance 0 to de_separation_82_2s_20_1_595037_1287780659_1
23-Oct-2010 13:13:10 [Milkyway@home] Starting de_separation_82_2s_20_1_595037_1287780659_1
23-Oct-2010 13:13:10 [Milkyway@home] Starting task de_separation_82_2s_20_1_595037_1287780659_1 using milkyway version 24
23-Oct-2010 13:13:12 [Milkyway@home] Started upload of de_separation_82_3s_10_1_591473_1287779991_1_0
23-Oct-2010 13:13:19 [Milkyway@home] Finished upload of de_separation_82_3s_10_1_591473_1287779991_1_0
23-Oct-2010 13:26:43 [Milkyway@home] Computation for task de_separation_82_2s_20_1_595037_1287780659_1 finished
23-Oct-2010 13:26:43 [Collatz Conjecture] [coproc_debug] Assigning CUDA instance 0 to collatz_1286499150_1211437_1
23-Oct-2010 13:26:43 [Collatz Conjecture] Starting collatz_1286499150_1211437_1
23-Oct-2010 13:26:43 [Collatz Conjecture] Starting task collatz_1286499150_1211437_1 using collatz version 202
23-Oct-2010 13:26:45 [Milkyway@home] Started upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:27:21 [Einstein@Home] Task h1_1099.75_S5R4__154_S5GC1a_0 exited with zero status but no 'finished' file
23-Oct-2010 13:27:21 [Einstein@Home] If this happens repeatedly you may need to reset the project.
23-Oct-2010 13:27:21 [---] Project communication failed: attempting access to reference site
23-Oct-2010 13:27:21 [Milkyway@home] Temporarily failed upload of de_separation_82_2s_20_1_595037_1287780659_1_0: can't resolve hostname
23-Oct-2010 13:27:21 [Milkyway@home] Backing off 1 min 0 sec on upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:27:22 [Einstein@Home] Restarting task h1_1099.75_S5R4__154_S5GC1a_0 using einstein_S5GC1 version 105
23-Oct-2010 13:27:27 [climateprediction.net] Computation for task hadsm3dhet2_jkqz_006589885_7 finished
23-Oct-2010 13:27:27 [climateprediction.net] Output file hadsm3dhet2_jkqz_006589885_7_1.zip for task hadsm3dhet2_jkqz_006589885_7 absent
23-Oct-2010 13:27:27 [climateprediction.net] Output file hadsm3dhet2_jkqz_006589885_7_2.zip for task hadsm3dhet2_jkqz_006589885_7 absent
23-Oct-2010 13:27:27 [climateprediction.net] Output file hadsm3dhet2_jkqz_006589885_7_3.zip for task hadsm3dhet2_jkqz_006589885_7 absent
23-Oct-2010 13:27:27 [Einstein@Home] Resuming task h1_1099.75_S5R4__153_S5GC1a_1 using einstein_S5GC1 version 105
23-Oct-2010 13:27:28 [climateprediction.net] Computation for task hadsm3dhet2_jkqy_006589884_5 finished
23-Oct-2010 13:27:28 [climateprediction.net] Output file hadsm3dhet2_jkqy_006589884_5_1.zip for task hadsm3dhet2_jkqy_006589884_5 absent
23-Oct-2010 13:27:28 [climateprediction.net] Output file hadsm3dhet2_jkqy_006589884_5_2.zip for task hadsm3dhet2_jkqy_006589884_5 absent
23-Oct-2010 13:27:28 [climateprediction.net] Output file hadsm3dhet2_jkqy_006589884_5_3.zip for task hadsm3dhet2_jkqy_006589884_5 absent
23-Oct-2010 13:27:28 [Einstein@Home] Resuming task h1_1099.80_S5R4__146_S5GC1a_1 using einstein_S5GC1 version 105
23-Oct-2010 13:27:36 [---] Internet access OK - project servers may be temporarily down.
23-Oct-2010 13:28:21 [Milkyway@home] Started upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:28:31 [Milkyway@home] Finished upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:35:27 [---] Already attached - deleting project_init.xml
23-Oct-2010 13:37:34 [Milkyway@home] Sending scheduler request: To fetch work.
23-Oct-2010 13:37:34 [Milkyway@home] Reporting 2 completed tasks, requesting new tasks for GPU
23-Oct-2010 13:37:56 [---] Project communication failed: attempting access to reference site
23-Oct-2010 13:37:57 [---] BOINC can't access Internet - check network connection or proxy configuration.
23-Oct-2010 13:37:59 [Milkyway@home] Scheduler request failed: Couldn't connect to server
23-Oct-2010 13:38:04 [Collatz Conjecture] Sending scheduler request: To fetch work.
23-Oct-2010 13:38:04 [Collatz Conjecture] Requesting new tasks for GPU
23-Oct-2010 13:38:09 [Collatz Conjecture] Scheduler request failed: Couldn't resolve host name
23-Oct-2010 13:38:59 [Milkyway@home] Sending scheduler request: To fetch work.
23-Oct-2010 13:38:59 [Milkyway@home] Reporting 2 completed tasks, requesting new tasks for GPU
23-Oct-2010 13:39:24 [Milkyway@home] Scheduler request failed: Couldn't connect to server
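
One way to check the suspected link between network failures and the failed CPDN tasks is to scan a saved copy of the log for both kinds of event and compare their timestamps. The following is only a minimal sketch: the function names, the list of network error strings (copied from the log lines above) and the two-minute window are assumptions made for illustration, and a match in time does not prove that one event caused the other.

```python
import re
from datetime import datetime, timedelta

# One BOINC client log line looks like: "22-Oct-2010 07:40:11 [project] message"
LINE_RE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")
ABSENT_RE = re.compile(r"^Output file \S+ for task (\S+) absent$")

# Network-related messages seen in the logs above (assumed to be a useful set).
NETWORK_MARKERS = (
    "Couldn't resolve host name",
    "can't resolve hostname",
    "Couldn't connect to server",
    "Project communication failed",
    "BOINC can't access Internet",
)


def parse_line(line):
    """Return (timestamp, project, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # %b expects English month abbreviations (C or English locale).
    stamp = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def failed_tasks_near_network_errors(lines, window=timedelta(minutes=2)):
    """List CPDN tasks with absent output files logged close to a network error."""
    events = [e for e in (parse_line(l) for l in lines) if e is not None]
    net_times = [stamp for stamp, _, msg in events
                 if any(marker in msg for marker in NETWORK_MARKERS)]
    tasks = set()
    for stamp, project, msg in events:
        absent = ABSENT_RE.match(msg)
        if project != "climateprediction.net" or absent is None:
            continue
        if any(abs(stamp - t) <= window for t in net_times):
            tasks.add(absent.group(1))
    return sorted(tasks)
```

Calling `failed_tasks_near_network_errors(open("boinc.log"))` on a saved log (the file name is hypothetical) returns the task names to look at more closely.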


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40906 - Posted: 25 Oct 2010, 4:50:42 UTC

Hi Andris

Your computer page is here and your Tasks page is here.

One crashed task in July was a FAMOUS. The error messages show that the crash was caused by the parameter values of your model. So that is not a problem.

But four HadSM crashed. Probably two at one moment and the other two at another moment. All have exit code 11 and similar error messages. One of the tasks is here. Click on stderr + to see the messages. Signal 11 in every case.
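
For reference, on Linux signal number 11 is SIGSEGV (segmentation fault). The short sketch below shows how a process that died from a signal can be described; `describe_exit` is a hypothetical helper written for this illustration, and it relies on the convention of Python's `subprocess` module that a negative return code means "killed by that signal". It says nothing about how BOINC itself reports the failure.

```python
import signal


def describe_exit(returncode):
    """Describe a return code as reported by Python's subprocess module."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as the return code -N.
        sig = signal.Signals(-returncode)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited with status {returncode}"


# On Linux this prints: killed by signal 11 (SIGSEGV)
print(describe_exit(-11))
```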

This is not, as far as I know, caused by network connection problems. Jorden has some explanations of the Signal 11 error in his FAQ here. I see that your computer has processed a lot of CPDN models successfully in the past. But have you installed a new version of Linux and now need the 32bit compatibility libraries? See Geophi's post.

If this is the problem please tell us.
Andris Pavenis

Message 40907 - Posted: 25 Oct 2010, 6:18:57 UTC - in response to Message 40906.  


> One crashed task in July was a FAMOUS. The error messages show that the crash was caused by the parameter values of your model. So that is not a problem.


I'm not writing about that


> But four HadSM crashed. Probably two at one moment and the other two at another moment. All have exit code 11 and similar error messages. One of the tasks is here. Click on stderr + to see the messages. Signal 11 in every case.


I saw all that. Unfortunately, even though a core file was generated, the Fedora 13 crash handler found that the crash was not caused by any Fedora 13 package, so the core file was deleted automatically.


> This is not, as far as I know, caused by network connection problems. Jorden has some explanations of the Signal 11 error in his FAQ here. I see that your computer has processed a lot of CPDN models successfully in the past. But have you installed a new version of Linux and now need the 32bit compatibility libraries? See Geophi's post.
>
> If this is the problem please tell us.


The problem is not related to a Linux upgrade (there have been no serious upgrades in the last month). All 32-bit compatibility libraries are also in place. The 64-bit version of Linux has been in use on that machine for a long time.

What I saw is that, as has sometimes happened earlier, there was probably some bad interaction between boincmgr and the boinc client during network connection problems (as far as I have observed, when the network is still up but one gets no response and connections time out).

I could try to reproduce the problem by
- getting a new work unit on the same system
- trying to attach GDB to the process (unfortunately, without debug info GDB would be of little use)
- messing with the network (trying to break it in various ways)



Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 40908 - Posted: 25 Oct 2010, 7:27:05 UTC

This sort of problem has been mentioned a few times on the BOINC board, although not recently.
And for those computers where it happens, it can be with any project.

I don't remember what, if any, solution was found.

Andris Pavenis

Message 40930 - Posted: 29 Oct 2010, 4:59:33 UTC

The same problem (crash of a CPDN model) occurred once more, about 20 seconds after the model was restarted. BOINC was started by the startup scripts while the computer was booting. The network (WLAN) was not yet up. Unfortunately, due to NetworkManager problems, the WLAN connection is only established after the GNOME session has started.

This time boincmgr was not running, so I guess interaction between boincmgr and the boinc client can be excluded as the cause.

Running prime95 for 3 hours (4 threads, as one should on a Core 2 Quad) did not show any errors, and I did not expect any.

Would it be worth running some more models to get a core dump for somebody to examine if the crash repeats? (Core dump generation for binaries not belonging to installed RPM packages was off by default. It is on now.)
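
Separately from the crash-handler setting mentioned above, a process can only write a core file if its core-size resource limit allows it. The sketch below is a general illustration and not the Fedora setting that was changed: it raises the soft limit to the hard limit and starts a program with that limit in place. The function names are hypothetical, and where the core file ends up still depends on the kernel's core pattern and on any crash handler installed on the system.

```python
import resource
import subprocess


def allow_core_dumps():
    """Raise the soft core-size limit of this process to its hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def run_with_core_dumps(argv):
    """Start argv with the raised limit; its child processes inherit the limit."""
    # preexec_fn runs in the child process just before the program is executed.
    return subprocess.Popen(argv, preexec_fn=allow_core_dumps)
```

If the hard limit itself is 0, the soft limit cannot be raised this way and the limit has to be changed by the administrator.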


Les Bayliss
Volunteer moderator

Message 40931 - Posted: 29 Oct 2010, 6:48:42 UTC - in response to Message 40930.  

If you're using scripts then perhaps you could add to them.

I suspend BOINC before exiting it, so that when it starts again, it doesn't run until I do this manually.

If you did this, then you could wait until after everything was running, to run a start up script.
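
A start-up script along these lines could wait for the network and only then let BOINC compute. This is a rough sketch under several assumptions: that BOINC was suspended before the last shutdown (for example with `boinccmd --set_run_mode never`), that `boinccmd` is on the PATH and can authenticate to the client (it may need to be run from the BOINC data directory or given a password), and that being able to resolve a host name is a good enough sign that the WLAN is up. The function names, the host used for the check and the timings are illustrative only.

```python
import socket
import subprocess
import time


def wait_for_dns(host, timeout=300.0, interval=5.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True once host resolves, or False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            resolver(host, 80)
            return True
        except OSError:
            # socket.gaierror is a subclass of OSError: name not resolvable yet.
            if time.monotonic() >= deadline:
                return False
            sleep(interval)


def resume_boinc():
    """Switch the BOINC client back to preference-based running."""
    subprocess.run(["boinccmd", "--set_run_mode", "auto"], check=True)


def start_boinc_when_online(host="climateprediction.net"):
    """Resume BOINC only after the network looks usable."""
    if wait_for_dns(host):
        resume_boinc()
        return True
    return False
```

Running `start_boinc_when_online()` from the desktop session's start-up programs would then take the place of the manual resume.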


