Network connection problems and CPDN model crashes
Author | Message |
---|---|
Joined: 22 Oct 05 Posts: 15 Credit: 2,340,122 RAC: 0 |
I have had 4 CPDN models crash recently on http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=826910
Both times the crashes seem to be related to temporary network connection problems. I have also noticed problems earlier when:
- BOINC manager is running
- there are network connection problems
In the second BOINC log one can see that Einstein@Home also had some problems (similar to what I have observed earlier).
Andris

22-Oct-2010 07:29:14 [Milkyway@home] Sending scheduler request: To fetch work.
22-Oct-2010 07:29:14 [Milkyway@home] Requesting new tasks for GPU
22-Oct-2010 07:29:19 [Milkyway@home] Scheduler request completed: got 1 new tasks
22-Oct-2010 07:29:21 [Milkyway@home] Started download of stars-td82-2stream_20.txt
22-Oct-2010 07:29:21 [Milkyway@home] Started download of de_separation_82_3s_20_1_185629_1287721425_search_parameters
22-Oct-2010 07:29:24 [Milkyway@home] Finished download of de_separation_82_3s_20_1_185629_1287721425_search_parameters
22-Oct-2010 07:29:36 [Milkyway@home] Finished download of stars-td82-2stream_20.txt
22-Oct-2010 07:29:42 [Einstein@Home] Computation for task h1_1099.75_S5R4__244_S5GC1a_0 finished
22-Oct-2010 07:29:42 [Einstein@Home] Restarting task h1_1099.75_S5R4__224_S5GC1a_1 using einstein_S5GC1 version 105
22-Oct-2010 07:29:44 [Einstein@Home] Started upload of h1_1099.75_S5R4__244_S5GC1a_0_0
22-Oct-2010 07:29:50 [Einstein@Home] Finished upload of h1_1099.75_S5R4__244_S5GC1a_0_0
22-Oct-2010 07:38:24 [Milkyway@home] Sending scheduler request: To fetch work.
22-Oct-2010 07:38:24 [Milkyway@home] Requesting new tasks for GPU
22-Oct-2010 07:38:52 [Milkyway@home] Scheduler request failed: Couldn't resolve host name
22-Oct-2010 07:38:57 [Collatz Conjecture] Sending scheduler request: To fetch work.
22-Oct-2010 07:38:57 [Collatz Conjecture] Reporting 1 completed tasks, requesting new tasks for GPU
22-Oct-2010 07:39:20 [Collatz Conjecture] Scheduler request completed: got 1 new tasks
22-Oct-2010 07:39:20 [---] Couldn't parse preferences file - using BOINC defaults
22-Oct-2010 07:39:20 [---] Reading preferences override file
22-Oct-2010 07:39:20 [---] Preferences:
22-Oct-2010 07:39:20 [---] max memory usage when active: 4000.55MB
22-Oct-2010 07:39:20 [---] max memory usage when idle: 7200.99MB
22-Oct-2010 07:39:20 [---] max disk usage: 10.00GB
22-Oct-2010 07:39:20 [---] max CPUs used: 3
22-Oct-2010 07:39:20 [---] (to change, visit the web site of an attached project,
22-Oct-2010 07:39:20 [---] or click on Preferences)
22-Oct-2010 07:39:22 [---] Project communication failed: attempting access to reference site
22-Oct-2010 07:39:22 [Collatz Conjecture] Started download of collatz_1286499150_1156660
22-Oct-2010 07:40:11 [---] BOINC can't access Internet - check network connection or proxy configuration.
22-Oct-2010 07:40:11 [Collatz Conjecture] Temporarily failed download of collatz_1286499150_1156660: can't resolve hostname
22-Oct-2010 07:40:11 [Collatz Conjecture] Backing off 1 min 0 sec on download of collatz_1286499150_1156660
22-Oct-2010 07:40:11 [climateprediction.net] Computation for task hadsm3dhet2_jme9_006592019_8 finished
22-Oct-2010 07:40:11 [climateprediction.net] Output file hadsm3dhet2_jme9_006592019_8_1.zip for task hadsm3dhet2_jme9_006592019_8 absent
22-Oct-2010 07:40:11 [climateprediction.net] Output file hadsm3dhet2_jme9_006592019_8_2.zip for task hadsm3dhet2_jme9_006592019_8 absent
22-Oct-2010 07:40:11 [climateprediction.net] Output file hadsm3dhet2_jme9_006592019_8_3.zip for task hadsm3dhet2_jme9_006592019_8 absent
22-Oct-2010 07:40:12 [Einstein@Home] Restarting task h1_1099.75_S5R4__214_S5GC1a_0 using einstein_S5GC1 version 105
22-Oct-2010 07:40:12 [Milkyway@home] Sending scheduler request: To fetch work.
22-Oct-2010 07:40:12 [Milkyway@home] Requesting new tasks for GPU
22-Oct-2010 07:40:28 [climateprediction.net] Computation for task hadsm3dhet2_jme8_006592018_3 finished
22-Oct-2010 07:40:28 [climateprediction.net] Output file hadsm3dhet2_jme8_006592018_3_1.zip for task hadsm3dhet2_jme8_006592018_3 absent
22-Oct-2010 07:40:28 [climateprediction.net] Output file hadsm3dhet2_jme8_006592018_3_2.zip for task hadsm3dhet2_jme8_006592018_3 absent
22-Oct-2010 07:40:28 [climateprediction.net] Output file hadsm3dhet2_jme8_006592018_3_3.zip for task hadsm3dhet2_jme8_006592018_3 absent
22-Oct-2010 07:40:28 [Einstein@Home] Starting p2030_54075_19693_0073_G189.73-02.68.C_0.dm_380_2
22-Oct-2010 07:40:29 [Einstein@Home] Starting task p2030_54075_19693_0073_G189.73-02.68.C_0.dm_380_2 using einsteinbinary_ABP2 version 108
22-Oct-2010 07:40:29 [Einstein@Home] Task h1_1099.75_S5R4__224_S5GC1a_1 exited with zero status but no 'finished' file
22-Oct-2010 07:40:29 [Einstein@Home] If this happens repeatedly you may need to reset the project.
22-Oct-2010 07:40:30 [Einstein@Home] Restarting task h1_1099.75_S5R4__224_S5GC1a_1 using einstein_S5GC1 version 105
22-Oct-2010 07:40:54 [Milkyway@home] Scheduler request failed: Couldn't connect to server
22-Oct-2010 07:41:12 [Collatz Conjecture] Started download of collatz_1286499150_1156660
22-Oct-2010 07:41:48 [Collatz Conjecture] Finished download of collatz_1286499150_1156660

23-Oct-2010 13:13:10 [Milkyway@home] Computation for task de_separation_82_3s_10_1_591473_1287779991_1 finished
23-Oct-2010 13:13:10 [Milkyway@home] [coproc_debug] Assigning CUDA instance 0 to de_separation_82_2s_20_1_595037_1287780659_1
23-Oct-2010 13:13:10 [Milkyway@home] Starting de_separation_82_2s_20_1_595037_1287780659_1
23-Oct-2010 13:13:10 [Milkyway@home] Starting task de_separation_82_2s_20_1_595037_1287780659_1 using milkyway version 24
23-Oct-2010 13:13:12 [Milkyway@home] Started upload of de_separation_82_3s_10_1_591473_1287779991_1_0
23-Oct-2010 13:13:19 [Milkyway@home] Finished upload of de_separation_82_3s_10_1_591473_1287779991_1_0
23-Oct-2010 13:26:43 [Milkyway@home] Computation for task de_separation_82_2s_20_1_595037_1287780659_1 finished
23-Oct-2010 13:26:43 [Collatz Conjecture] [coproc_debug] Assigning CUDA instance 0 to collatz_1286499150_1211437_1
23-Oct-2010 13:26:43 [Collatz Conjecture] Starting collatz_1286499150_1211437_1
23-Oct-2010 13:26:43 [Collatz Conjecture] Starting task collatz_1286499150_1211437_1 using collatz version 202
23-Oct-2010 13:26:45 [Milkyway@home] Started upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:27:21 [Einstein@Home] Task h1_1099.75_S5R4__154_S5GC1a_0 exited with zero status but no 'finished' file
23-Oct-2010 13:27:21 [Einstein@Home] If this happens repeatedly you may need to reset the project.
23-Oct-2010 13:27:21 [---] Project communication failed: attempting access to reference site
23-Oct-2010 13:27:21 [Milkyway@home] Temporarily failed upload of de_separation_82_2s_20_1_595037_1287780659_1_0: can't resolve hostname
23-Oct-2010 13:27:21 [Milkyway@home] Backing off 1 min 0 sec on upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:27:22 [Einstein@Home] Restarting task h1_1099.75_S5R4__154_S5GC1a_0 using einstein_S5GC1 version 105
23-Oct-2010 13:27:27 [climateprediction.net] Computation for task hadsm3dhet2_jkqz_006589885_7 finished
23-Oct-2010 13:27:27 [climateprediction.net] Output file hadsm3dhet2_jkqz_006589885_7_1.zip for task hadsm3dhet2_jkqz_006589885_7 absent
23-Oct-2010 13:27:27 [climateprediction.net] Output file hadsm3dhet2_jkqz_006589885_7_2.zip for task hadsm3dhet2_jkqz_006589885_7 absent
23-Oct-2010 13:27:27 [climateprediction.net] Output file hadsm3dhet2_jkqz_006589885_7_3.zip for task hadsm3dhet2_jkqz_006589885_7 absent
23-Oct-2010 13:27:27 [Einstein@Home] Resuming task h1_1099.75_S5R4__153_S5GC1a_1 using einstein_S5GC1 version 105
23-Oct-2010 13:27:28 [climateprediction.net] Computation for task hadsm3dhet2_jkqy_006589884_5 finished
23-Oct-2010 13:27:28 [climateprediction.net] Output file hadsm3dhet2_jkqy_006589884_5_1.zip for task hadsm3dhet2_jkqy_006589884_5 absent
23-Oct-2010 13:27:28 [climateprediction.net] Output file hadsm3dhet2_jkqy_006589884_5_2.zip for task hadsm3dhet2_jkqy_006589884_5 absent
23-Oct-2010 13:27:28 [climateprediction.net] Output file hadsm3dhet2_jkqy_006589884_5_3.zip for task hadsm3dhet2_jkqy_006589884_5 absent
23-Oct-2010 13:27:28 [Einstein@Home] Resuming task h1_1099.80_S5R4__146_S5GC1a_1 using einstein_S5GC1 version 105
23-Oct-2010 13:27:36 [---] Internet access OK - project servers may be temporarily down.
23-Oct-2010 13:28:21 [Milkyway@home] Started upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:28:31 [Milkyway@home] Finished upload of de_separation_82_2s_20_1_595037_1287780659_1_0
23-Oct-2010 13:35:27 [---] Already attached - deleting project_init.xml
23-Oct-2010 13:37:34 [Milkyway@home] Sending scheduler request: To fetch work.
23-Oct-2010 13:37:34 [Milkyway@home] Reporting 2 completed tasks, requesting new tasks for GPU
23-Oct-2010 13:37:56 [---] Project communication failed: attempting access to reference site
23-Oct-2010 13:37:57 [---] BOINC can't access Internet - check network connection or proxy configuration.
23-Oct-2010 13:37:59 [Milkyway@home] Scheduler request failed: Couldn't connect to server
23-Oct-2010 13:38:04 [Collatz Conjecture] Sending scheduler request: To fetch work.
23-Oct-2010 13:38:04 [Collatz Conjecture] Requesting new tasks for GPU
23-Oct-2010 13:38:09 [Collatz Conjecture] Scheduler request failed: Couldn't resolve host name
23-Oct-2010 13:38:59 [Milkyway@home] Sending scheduler request: To fetch work.
23-Oct-2010 13:38:59 [Milkyway@home] Reporting 2 completed tasks, requesting new tasks for GPU
23-Oct-2010 13:39:24 [Milkyway@home] Scheduler request failed: Couldn't connect to server |
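One way to check whether the "Output file ... absent" messages really cluster around the network errors is to scan the log mechanically. The following is only a rough sketch: the function names, the list of network-error phrases and the two-minute window are assumptions made for illustration, and a match in time does not by itself show that the network problem caused the crash.

```python
import re
from datetime import datetime, timedelta

MONTHS = {name: num for num, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches lines such as: 22-Oct-2010 07:38:52 [Milkyway@home] Scheduler request failed: ...
LINE_RE = re.compile(
    r"^(\d{2})-([A-Z][a-z]{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}) \[(.+?)\] (.*)$")

# Phrases taken from the log above; this list is an assumption, not exhaustive.
NETWORK_MARKERS = (
    "Couldn't resolve host name",
    "can't resolve hostname",
    "Couldn't connect to server",
    "BOINC can't access Internet",
)

ABSENT_RE = re.compile(r"^Output file \S+ for task (\S+) absent$")


def parse_log(lines):
    """Return (timestamp, project, message) tuples; unparseable lines are skipped."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group(2) not in MONTHS:
            continue
        day, mon, year, hh, mm, ss, project, message = m.groups()
        ts = datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))
        events.append((ts, project, message))
    return events


def tasks_failing_near_network_errors(events, window=timedelta(minutes=2)):
    """Names of tasks with absent output files within `window` of a network error."""
    net_times = [ts for ts, _, msg in events
                 if any(marker in msg for marker in NETWORK_MARKERS)]
    tasks = set()
    for ts, _, msg in events:
        m = ABSENT_RE.match(msg)
        if m and any(abs(ts - t) <= window for t in net_times):
            tasks.add(m.group(1))
    return sorted(tasks)
```

Passing the lines of a saved log to `parse_log` and the result to `tasks_failing_near_network_errors` lists the tasks whose failure falls close to a network error, which at least makes the coincidence easy to confirm or rule out on later logs.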
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Andris

Your computer page is here and your Tasks page is here.

One crashed task in July was a FAMOUS model. The error messages show that the crash was caused by the parameter values of your model, so that is not a problem.

But four HadSM models crashed, probably two at one moment and the other two at another. All have exit code 11 and similar error messages. One of the tasks is here. Click on stderr + to see the messages: Signal 11 in every case. This is not, as far as I know, caused by network connection problems. Jorden has some explanations of the Signal 11 error in his FAQ here.

I see that your computer has processed a lot of CPDN models successfully in the past. But have you installed a new version of Linux and now need the 32-bit compatibility libraries? See Geophi's post. If this is the problem, please tell us. |
Joined: 22 Oct 05 Posts: 15 Credit: 2,340,122 RAC: 0 |
I'm not writing about that
I saw all that. Unfortunately, although a core file was generated, the Fedora 13 crash handler found that the crash was not caused by any Fedora 13 package, so the core file was automatically deleted.
The problem is not related to a Linux upgrade (there have been no serious upgrades in the last month). All 32-bit compatibility libraries are also in place; the 64-bit version of Linux has been in use there for a long time.
What I saw is that, as sometimes happened earlier, there has probably been some bad interaction between boincmgr and the BOINC client when there are network connection problems (as far as I have observed, when the network is still up but one gets no response and connections time out).
I could try to reproduce the problem by:
- getting a new work unit on the same system
- trying to attach GDB to the process (unfortunately, without debug info GDB would be of little use)
- messing with the network (trying to break it in various ways) |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
This sort of problem has been mentioned a few times on the BOINC board, although not recently. And for those computers where it happens, it can be with any project. I don't remember what, if any, solution was found. |
Joined: 22 Oct 05 Posts: 15 Credit: 2,340,122 RAC: 0 |
The same problem (crash of a CPDN model) occurred once more, about 20 seconds after the model was restarted. BOINC was started by the startup scripts while the computer was booting, and the network (WLAN) was not yet up. Unfortunately, due to NetworkManager problems, the WLAN connection only comes up after the GNOME session has started. This time boincmgr was not running, so I guess interaction between boincmgr and the BOINC client can be excluded as a cause.
Running prime95 for 3 hours (4 threads, as one should on a Core 2 Quad) did not show any errors, and I did not expect any.
Would it be worth trying some more models to get a core dump for somebody to examine if the crash repeats? (Core dump generation for binaries not belonging to installed RPM packages was off by default. It is on now.) |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
If you're using scripts, then perhaps you could add to them. I suspend BOINC before exiting it, so that when it starts again it doesn't run until I resume it manually. If you did the same, you could wait until everything else was running and then run a startup script to resume it. |
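A rough sketch of that idea, assuming BOINC was left suspended (for example with `boinccmd --set_run_mode never`) before the last shutdown: wait until a host name resolves, then switch the client back to `auto`. The function names, the host used for the check, and the retry counts are illustrative assumptions. `boinccmd` must be on the PATH and able to authenticate to the client, which usually means running it from the BOINC data directory or passing a password.

```python
import socket
import subprocess
import time


def network_is_up(host="climateprediction.net", resolver=socket.getaddrinfo):
    """True if `host` resolves; the host name here is only an example."""
    try:
        resolver(host, 80)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def wait_then_resume(check=network_is_up, run=subprocess.run,
                     sleep=time.sleep, attempts=60, delay=5.0):
    """Poll until the network check passes, then let BOINC compute again.

    Returns True if BOINC was resumed, False if the network never came up.
    """
    for _ in range(attempts):
        if check():
            # Assumes the client was suspended earlier with --set_run_mode never.
            run(["boinccmd", "--set_run_mode", "auto"], check=True)
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    if not wait_then_resume():
        print("Network did not come up; BOINC left suspended.")
```

Run from a session startup script (after the WLAN connection is expected to be up), this keeps the models idle during the window in which the crashes were observed. Whether that actually avoids the Signal 11 crashes is untested.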