climateprediction.net home page
New work discussion - 2

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66396 - Posted: 14 Nov 2022, 16:16:21 UTC - in response to Message 66358.  

I doubt using all 32 virtual cores will give a good throughput, but I'm Team Blue with little experience of 'the other side' :)


I am currently running 5 hadsm4_um_8.02_i686-pc-lin... tasks, one wcgrid_arp1_wrf_7.32_x86_64-pc... and 5 less important ones.

With those, I get a so-so processor cache hit ratio.

Computer 1511241
CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Memory 	62.28 GB
Cache 	16896 KB

# perf stat -aB -e cache-references,cache-misses

 Performance counter stats for 'system wide':

    41,547,407,402      cache-references                                            
    23,459,791,917      cache-misses              #   56.465 % of all cache refs    

      62.202908556 seconds time elapsed
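
For anyone reading the perf output: the percentage it prints is just cache-misses divided by cache-references. A minimal Python sketch of that arithmetic, using the two counter values above (the function name is made up for illustration and is not part of perf):

```python
def miss_ratio_percent(cache_misses: int, cache_references: int) -> float:
    """Return cache misses as a percentage of all cache references."""
    if cache_references == 0:
        raise ValueError("no cache references recorded")
    return 100.0 * cache_misses / cache_references


# Counter values copied from the perf stat output above.
ratio = miss_ratio_percent(23_459_791_917, 41_547_407_402)
print(f"{ratio:.3f} % of all cache refs")
```

The corresponding hit ratio is 100 minus this figure.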

ID: 66396 · Report as offensive
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,984,965
RAC: 21,892
Message 66398 - Posted: 14 Nov 2022, 17:46:06 UTC - in response to Message 66394.  

There'll be a steady stream of HadSM4 tasks now with OpenIFS tasks appearing soon.
Yes, I saw there were two more batches in the queue already.
ID: 66398 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,420,897
RAC: 16,732
Message 66401 - Posted: 14 Nov 2022, 18:29:38 UTC - in response to Message 66396.  

With those, I get a so-so processor cache hit ratio.
But that's system wide? So difficult to know how much the boinc tasks are affected? I think it's easier to monitor wall-clock run times to judge how well the tasks are running when altering preferences, and quieten the machine as much as possible by killing or suspending any process that will cause system jitters.
ID: 66401 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66402 - Posted: 14 Nov 2022, 20:55:44 UTC - in response to Message 66401.  

But that's system wide? So difficult to know how much the boinc tasks are affected?


Yes, but the rest of my system was pretty much idle. I did have Firefox up, but I was not doing anything with it.
My processor has 16 cores and 11 of them were saturated with Boinc work. Pretty much like this:
    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 273653   16165 boinc     39  19 R 765984   1.2  99.0  2 362:55.38 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+ 
 278785   16165 boinc     39  19 R 765936   1.2  99.3  7 504:23.35 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+ 
 272462   16165 boinc     39  19 R 761424   1.2  99.1  9 624:27.90 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_wrf_7.32_x86_64-pc+ 
 279494  279405 boinc     39  19 R 675768   1.0  98.8  6 491:05.20 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+ 
 280047  280043 boinc     39  19 R 675452   1.0  99.1  4 479:16.71 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+ 
 279909  279896 boinc     39  19 R 675372   1.0  98.8 11 481:47.97 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+ 
 281984  281979 boinc     39  19 R 674972   1.0  99.1  5 459:16.00 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+ 
 186760  186753 boinc     39  19 R 669356   1.0  98.9  1   1986:07 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+ 
 304836   16165 boinc     39  19 R 213132   0.3  99.2 10 111:13.14 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
 306110   16165 boinc     39  19 R 161796   0.2  99.0  0  85:47.34 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+ 
 307781   16165 boinc     39  19 R 143168   0.2  99.1 13  47:31.30 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+ 
  16165       1 boinc     30  10 S  44264   0.1   0.1  8  43401:04 /usr/bin/boinc                                                            
 186753   16165 boinc     39  19 S  11228   0.0   0.1 14   1:14.01 ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+ 
 279896   16165 boinc     39  19 S  10468   0.0   0.1 14   0:18.02 ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+ 
 281979   16165 boinc     39  19 S  10280   0.0   0.1 13   0:16.89 ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+ 
 280043   16165 boinc     39  19 S  10244   0.0   0.0 14   0:17.99 ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+ 
 279405   16165 boinc     39  19 S  10216   0.0   0.1 14   0:17.71 ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+ 


S is state (R is running; S is sleeping); %CPU is how busy that CPU is; P is the processor number (0-15). Since these are just my Boinc tasks, the processors running at much less than 98% might be working on other things.
ID: 66402 · Report as offensive
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,800,369
RAC: 19,765
Message 66408 - Posted: 15 Nov 2022, 10:25:09 UTC - in response to Message 66394.  

There'll be a steady stream of HadSM4 tasks now with OpenIFS tasks appearing soon.

How are they related? Does the data from HadSM4 serve as input for OpenIFS?
ID: 66408 · Report as offensive
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,984,965
RAC: 21,892
Message 66409 - Posted: 15 Nov 2022, 10:49:28 UTC - in response to Message 66408.  

How are they related? Does the data from HadSM4 serve as input for OpenIFS?
Not in this instance, they are separate research projects. It is not unusual for HadSM4 batches to provide inputs for future batches of the same type though. I don't remember seeing this done with any testing batches of OpenIFS but I couldn't say for certain that no batches ever provided input for subsequent ones.
ID: 66409 · Report as offensive
Drago75

Joined: 8 Jan 22
Posts: 9
Credit: 1,780,471
RAC: 3,152
Message 66413 - Posted: 15 Nov 2022, 12:08:52 UTC
Last modified: 15 Nov 2022, 12:09:41 UTC

I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then I wait 30 seconds before I shut 'em down to make sure all data is written to the SSD correctly. The next morning I get some aborts. Happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that, too?
ID: 66413 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66414 - Posted: 15 Nov 2022, 12:18:14 UTC - in response to Message 66413.  

I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then I wait 30 seconds before I shut 'em down to make sure all data is written to the SSD correctly. The next morning I get some aborts. Happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that, too?
Why on earth are you shutting them down? I leave everything on 24/7, otherwise why bother with Boinc?
ID: 66414 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,420,897
RAC: 16,732
Message 66416 - Posted: 15 Nov 2022, 13:01:43 UTC - in response to Message 66408.  

There'll be a steady stream of HadSM4 tasks now with OpenIFS tasks appearing soon.

How are they related? Does the data from HadSM4 serve as input for OpenIFS?
No, they are completely different models being used for completely different experiments. There will be a lot of OpenIFS tasks coming soon; we're just going slowly to make sure everything is correct. They will be single core, with a 5-6 GB RAM requirement (about the same as LHC Atlas).
ID: 66416 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66417 - Posted: 15 Nov 2022, 13:08:42 UTC - in response to Message 66416.  

There will be a lot of OpenIFS tasks coming soon; we're just going slowly to make sure everything is correct. They will be single core, with a 5-6 GB RAM requirement (about the same as LHC Atlas).
Except LHC Atlas is up to 8 cores, so uses a lot less RAM overall.
ID: 66417 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,420,897
RAC: 16,732
Message 66418 - Posted: 15 Nov 2022, 13:09:07 UTC - in response to Message 66413.  

I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then I wait 30 seconds before I shut 'em down to make sure all data is written to the SSD correctly. The next morning I get some aborts. Happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that, too?
I shut down, sometimes suspend, my machines too (as I prefer to use free solar elec during the day & not pay for boinc at night - to answer P.Hucker). I have not seen this behaviour normally and I don't bother to suspend the tasks first. The client should know how to do that. The only issue I've had is with HadAM (if that's the right one on Windows), which errored on a PC reboot. But that only happened once.

I'm not entirely sure what 'checkpointing' really means. It may only be the client that's checkpointing. OpenIFS does its own checkpointing and doesn't know anything about any checkpointing frequency set in the client. I can't say how the Hadley models behave. Checkpointing is relatively expensive, as it's a big I/O dump of the model's internal state, so we wanted to control that ourselves and not let boinc attempt it.
ID: 66418 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66420 - Posted: 15 Nov 2022, 13:52:16 UTC - in response to Message 66413.  

I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then I wait 30 seconds before I shut 'em down to make sure all data is written to the SSD correctly. The next morning I get some aborts. Happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that, too?


I run Red Hat Enterprise Linux release 8.6 (Ootpa) on my Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7] machine.

While I reboot my machine only every week or two (and sometimes a little longer), I do not get problems like you describe. Right now my machine has been up only 3 days, 19:43, but I am getting no errors. My most recent "errors" are all like this, which were not really crashes at all.
Task 22238436
Name 	hadsm4_a08x_201402_1_939_012156968_0
Workunit 	12156968
Created 	14 Nov 2022, 10:37:54 UTC
Sent 	14 Nov 2022, 12:23:58 UTC
Report deadline 	27 Oct 2023, 17:43:58 UTC
Received 	15 Nov 2022, 2:23:18 UTC

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.                                                                                                                                                                                                                              tmp/xnnuj.pipe_dummy                                                            

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.                                                                                                                                                                                                                              tmp/xnnuj.pipe_dummy                                                            

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.                                                                                                                                                                                                                              tmp/xnnuj.pipe_dummy                                                            

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.                                                                                                                                                                                                                              tmp/xnnuj.pipe_dummy                                                            

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.                                                                                                                                                                                                                              tmp/xnnuj.pipe_dummy                                                            

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.                                                                                                                                                                                                                              tmp/xnnuj.pipe_dummy                                                            
Sorry, too many model crashes! :-(
20:33:46 (322618): called boinc_finish(22)

</stderr_txt>
]]>

When I get ready to accept updates to my machine that require a reboot, I stop all new tasks, and let most Boinc tasks run to completion. I usually cannot get CPDN tasks to complete because most recent tasks (except in the last few days) take around a week to complete. So I suspend those that have not even started, then those that are running one at a time. I do not think I have gotten any crashes with this procedure, perhaps in a year (but I cannot remember). I did get a few crashes that had nothing to do with rebooting with a bad batch of tasks with segmentation violations, but these seem to have been bad batches of the tasks that are no longer supplied to Linux systems.
ID: 66420 · Report as offensive
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,695,853
RAC: 10,233
Message 66421 - Posted: 15 Nov 2022, 14:01:01 UTC - in response to Message 66418.  

I'm not entirely sure what 'checkpointing' really means.
'Checkpointing' as a concept applies primarily to the scientific data being processed by a project's scientific application. The idea is to record a complete and consistent state of the application's internal processes on non-volatile memory, in a form that the same application can read back and use as a starting point after a pause.

BOINC itself is aware of the process, but in general can't control it: it can't demand that a checkpoint is taken at a particular moment. But it can set some constraints on the process. For example, some people are concerned about the longevity of their SSDs in terms of lifetime write cycles. They may choose to extend the time interval between checkpoints, on the basis that they will only shut down their machine very rarely, and they are content to accept the risk that computing effort will be wasted in the event of an unplanned power outage. BOINC also takes account of the length of time that has elapsed from the last checkpoint when deciding to pause one project's application, and give a turn at the trough for a different one. If a task has never checkpointed, BOINC will try to avoid pausing it unless absolutely necessary.

CPDN has a particular problem with checkpoints. The amount of data that has to be recorded to catch the complete internal state of the model so far is much greater than for most other projects. In some cases - slower drives or interfaces, heavily contended devices, or cached 'lazy write' drives - it can take a significant amount of time before the stored data is complete and usable. I think the majority of problems in the past will have been caused by one or more of these delays causing the image on disk to be incomplete and unreadable on restart.
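
One common way to reduce the risk of an incomplete image on disk is to write the checkpoint to a temporary file, flush it, and only then rename it over the old one. The following is a minimal Python sketch of that general technique, not how CPDN or any of its models actually do it; the function names and the JSON format are assumptions chosen purely for illustration.

```python
import json
import os
import tempfile


def write_checkpoint(state: dict, path: str) -> None:
    """Write `state` to `path` so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory (same filesystem),
    # so the final rename is atomic on POSIX systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to the device
        os.replace(tmp_path, path)  # swap in the complete file
    except BaseException:
        # Remove the partial temporary file; the old checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_checkpoint(path: str) -> dict:
    """Load the state saved by write_checkpoint."""
    with open(path) as f:
        return json.load(f)
```

Note that a drive with a 'lazy write' cache may still acknowledge the flush before the data is physically stored, so this narrows the window rather than closing it.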
ID: 66421 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,420,897
RAC: 16,732
Message 66423 - Posted: 15 Nov 2022, 14:09:29 UTC - in response to Message 66421.  

I'm not entirely sure what 'checkpointing' really means.
CPDN has a particular problem with checkpoints. The amount of data that has to be recorded to catch the complete internal state of the model so far is much greater than for most other projects. In some cases - slower drives or interfaces, heavily contended devices, or cached 'lazy write' drives - it can take a significant amount of time before the stored data is complete and usable. I think the majority of problems in the past will have been caused by one or more of these delays causing the image on disk to be incomplete and unreadable on restart.
Thanks Richard. I should have said I understand the concept of checkpointing but not how the implementation is applied for CPDN. OpenIFS when it's running has no knowledge of what checkpointing is set on the boinc client side and I don't intend to implement it, exactly for the reasons you describe.

I had to alter the checkpointing in OpenIFS to only keep one set of checkpoint files on the machine. Normally we would not delete the 'older' checkpoint files so that if the most recent one is corrupt, we can fall back to an earlier one. Unfortunately, this puts too much data in the slot directory. So if the single checkpoint does corrupt, that's the end of the task. Never seen it happen to date though with OIFS under boinc.
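
To illustrate the fallback behaviour that keeping several checkpoint generations allows, here is a minimal Python sketch. It is not OpenIFS code: the JSON files and the function name are hypothetical stand-ins, and real restart files are in a different format. With a single file in the list, a corrupt checkpoint leaves nothing to fall back to, which is the situation described above.

```python
import json
from typing import Optional, Sequence


def load_newest_valid(paths: Sequence[str]) -> Optional[dict]:
    """Try checkpoint files newest-first; return the first that parses."""
    for path in paths:
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, ValueError):
            # Missing or corrupt file: fall back to the next older checkpoint.
            continue
    return None  # nothing usable, so the run cannot be restarted
```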
ID: 66423 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,420,897
RAC: 16,732
Message 66431 - Posted: 15 Nov 2022, 16:37:20 UTC

I'm hoping someone can help me make sense of this. What appears to be happening is a resource backoff which counts down, but when it reaches zero I get no tasks, even though I know there are plenty in the queue today and there are no tasks running on my machine. I've browsed the forums but not found a satisfactory answer.

If I turn on 'work_fetch_debug' (thanks Richard), I see this sequence (I've deleted a few lines for brevity):

...
Tue 15 Nov 2022 16:05:53 GMT | climateprediction.net | [work_fetch] share 0.000 project is backed off  (resource backoff: 44.33, inc 9600.00)  <<< about to go to zero
Tue 15 Nov 2022 16:05:53 GMT | climateprediction.net | can't fetch CPU: project is backed off
......
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | choose_project: scanning
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | can fetch CPU
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | CPU needs work - buffer low
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | checking CPU
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | [work_fetch] set_request() for CPU: ninst 4 nused_total 0.00 nidle_now 1.00 fetch share 1.00 req_inst 4.00 req_secs 51840.00
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | CPU set_request: 51840.000000
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | [work_fetch] request: CPU (51840.00 sec, 4.00 inst)
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | Sending scheduler request: To fetch work.
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | Requesting new tasks for CPU
Tue 15 Nov 2022 16:06:55 GMT | climateprediction.net | Scheduler request completed: got 0 new tasks  << server says no!
Tue 15 Nov 2022 16:06:55 GMT | climateprediction.net | No tasks sent
Tue 15 Nov 2022 16:06:55 GMT | climateprediction.net | Project requested delay of 3636 seconds  << and we go around again.
This is a Mint 21 machine which has successfully run HadSM4 before. I've checked the resources given to boinc. The only disk limit is to leave 100 GB free on the disk (there's 204 GB free). It has 32 GB RAM and boinc is allowed to use 75%.

Any thoughts from the experts? Thanks.

Would be nice if the server returned an error code I could look up, instead of just saying 'no'.
ID: 66431 · Report as offensive
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,695,853
RAC: 10,233
Message 66432 - Posted: 15 Nov 2022, 16:42:29 UTC - in response to Message 66423.  

It would be appreciated by users - even if only in v1.01 - if you could listen for and obey the setting "Request tasks to checkpoint at most every xxx seconds". In other words, if you've checkpointed within the last xxx seconds, skip the checkpoint loop this time round. That saves wear and tear on the hardware, and (by skipping code) might even speed things up slightly. Every little saving helps.
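
The "skip the checkpoint if one was taken within the last xxx seconds" logic can be sketched in a few lines of Python. This is a generic illustration, not the BOINC API and not OpenIFS code; as far as I know the real BOINC API (C/C++) covers this with boinc_time_to_checkpoint() and boinc_checkpoint_completed(). The class and method names below are invented for the example.

```python
import time


class CheckpointGate:
    """Allow a checkpoint only if `min_interval` seconds have passed."""

    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock
        self._last = clock()  # treat start-up as the most recent checkpoint

    def due(self) -> bool:
        """True once at least min_interval seconds have elapsed."""
        return self._clock() - self._last >= self.min_interval

    def done(self) -> None:
        """Record that a checkpoint has just been written."""
        self._last = self._clock()
```

At each point where the application could checkpoint, it would call `due()`, write the checkpoint only if that returns True, and then call `done()`.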
ID: 66432 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,420,897
RAC: 16,732
Message 66434 - Posted: 15 Nov 2022, 17:04:19 UTC - in response to Message 66432.  
Last modified: 15 Nov 2022, 17:11:45 UTC

It would be appreciated by users - even if only in v1.01 - if you could listen for and obey the setting "Request tasks to checkpoint at most every xxx seconds". In other words, if you've checkpointed within the last xxx seconds, skip the checkpoint loop this time round. That saves wear and tear on the hardware, and (by skipping code) might even speed things up slightly. Every little saving helps.
The model takes 30 secs to 2 mins to complete a timestep depending on resolution & machine speed. Suppose the user decides to set a checkpoint every 2 mins. That will make the model dump its memory every timestep, which will kill performance and result in a lot of unnecessary I/O to the hardware. It will also trigger extra code to be run (not less). We did tests in the early days to find the best balance between I/O load and the cost of repeating a few extra model steps should a restart be needed. It's not something I want volunteers to be altering without a good understanding of how the model works.

The model checkpoint frequency is fixed at run start, it's not dynamic as you suggest.
ID: 66434 · Report as offensive
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,695,853
RAC: 10,233
Message 66435 - Posted: 15 Nov 2022, 17:23:00 UTC

Thoughts on work fetch. The first obvious point is: "if you don't ask, you can't get". The main reason for not asking is "resource backoff" - BOINC applies this, ever more aggressively - if you ask, but the server gives you nothing - without any thought given to the reason for the failure. Sometimes a reason is given in the server reply, but usually not. As you've found out. And the other main reason is "not highest priority project" - only appears if you have multiple projects active at the same time. You've dodged that one.
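
As a rough picture of what "ever more aggressively" means, here is a Python sketch of a doubling backoff. It is not taken from the BOINC source: the function name and the minimum and maximum values are assumptions for illustration only, and the real client may add randomisation or use different limits.

```python
def next_backoff(current: float, minimum: float = 600.0,
                 maximum: float = 86400.0) -> float:
    """Double the backoff after a failed work request, within bounds."""
    if current <= 0:
        return minimum  # first failure: start at the (assumed) minimum
    return min(current * 2.0, maximum)


# Simulate five consecutive requests that return no work.
delay = 0.0
history = []
for _ in range(5):
    delay = next_backoff(delay)
    history.append(delay)
print(history)
```

With these assumed values the fifth consecutive failure gives 9600 seconds; whether that is how the "inc 9600.00" in the log above arose is only a guess.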

Next point. How do you maximise your chances of receiving work, once you're asking? I always suggest that asking for a smaller amount at a time helps. The more you ask for, the more work the server has to do, looping through the lists of available tasks and doing quite complicated tests on each one to see if it's suitable. BOINC servers tend to have multiple scheduler instances running at the same time and querying the same cached list of maybe 200 tasks offered by a process called the 'feeder'. The quicker you can nip in and out of that melee, the better.

The server does log its activity, with reasons. The only project that exposes that information is Einstein, so far as I know. For each host, the 'computer details' page on their website has a live link for 'Last time contacted server'. That leads to the server log for that transaction. Try my https://einsteinathome.org/host/12808716/log.

I'm still working with David and Laurence to track down that 'MT task oversupply' bug. This morning, I herded all the cats into line, and managed to send David this evidence and analysis:

For the deadline check, the units are "wallclock time per task", so the calculation is correct.

But for the 'need more work' check, the units are "cpu-core time per task", which for MT tasks is six times larger.

2022-11-15 09:50:33.2143 [PID=189622] [send_job] [WU#2242433] est delay 0, skipping deadline check
2022-11-15 09:50:33.2376 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2378 [PID=189622] [send_job] est. duration for WU 2242433: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2378 [PID=189622] [send] [HOST#4741] sending [RESULT#3147325 SsXLDmBy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmk7MKDmQ02bSo_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2382 [PID=189622] [send_job] est. duration for WU 2242442: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2382 [PID=189622] [send_job] [WU#2242442] meets deadline: 621.10 + 621.10 < 604800
2022-11-15 09:50:33.2447 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2449 [PID=189622] [send_job] est. duration for WU 2242442: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2449 [PID=189622] [send] [HOST#4741] sending [RESULT#3147334 BOMLDmLy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmt7MKDm6ueZmn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2454 [PID=189622] [send_job] est. duration for WU 2242460: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2454 [PID=189622] [send_job] [WU#2242460] meets deadline: 1242.19 + 621.10 < 604800
2022-11-15 09:50:33.2527 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2530 [PID=189622] [send_job] est. duration for WU 2242460: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2530 [PID=189622] [send] [HOST#4741] sending [RESULT#3147352 LriLDmfy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmB8MKDmQpCRXm_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2536 [PID=189622] [send_job] est. duration for WU 2242454: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2536 [PID=189622] [send_job] [WU#2242454] meets deadline: 1863.29 + 621.10 < 604800
2022-11-15 09:50:33.2604 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2605 [PID=189622] [send_job] est. duration for WU 2242454: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2605 [PID=189622] [send] [HOST#4741] sending [RESULT#3147346 648LDmZy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDm57MKDm3ZnlUm_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2607 [PID=189622] [send_job] est. duration for WU 2242451: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2607 [PID=189622] [send_job] [WU#2242451] meets deadline: 2484.38 + 621.10 < 604800
2022-11-15 09:50:33.2670 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2674 [PID=189622] [send_job] est. duration for WU 2242451: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2674 [PID=189622] [send] [HOST#4741] sending [RESULT#3147343 t98LDmWy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDm27MKDmi4tacn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2684 [PID=189622] [send_job] est. duration for WU 2242457: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2684 [PID=189622] [send_job] [WU#2242457] meets deadline: 3105.48 + 621.10 < 604800
2022-11-15 09:50:33.2757 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2758 [PID=189622] [send_job] est. duration for WU 2242457: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2758 [PID=189622] [send] [HOST#4741] sending [RESULT#3147349 hYXMDmcy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDm87MKDmmC5GUn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2760 [PID=189622] [send_job] est. duration for WU 2242461: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2760 [PID=189622] [send_job] [WU#2242461] meets deadline: 3726.58 + 621.10 < 604800
2022-11-15 09:50:33.2836 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2837 [PID=189622] [send_job] est. duration for WU 2242461: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2837 [PID=189622] [send] [HOST#4741] sending [RESULT#3147353 IUJMDmgy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmC8MKDmEJVcBn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2839 [PID=189622] [send_job] est. duration for WU 2242444: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2839 [PID=189622] [send_job] [WU#2242444] meets deadline: 4347.67 + 621.10 < 604800
2022-11-15 09:50:33.2896 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2897 [PID=189622] [send_job] est. duration for WU 2242444: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2898 [PID=189622] [send] [HOST#4741] sending [RESULT#3147336 egONDmNy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmv7MKDmCPtgMn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2900 [PID=189622] [send_job] est. duration for WU 2242446: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2900 [PID=189622] [send_job] [WU#2242446] meets deadline: 4968.77 + 621.10 < 604800
2022-11-15 09:50:33.2992 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2999 [PID=189622] [send_job] est. duration for WU 2242446: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.3000 [PID=189622] [send] [HOST#4741] sending [RESULT#3147338 rWONDmQy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmx7MKDm2Qitwm_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.3018 [PID=189622] [send] don't need more work
2022-11-15 09:50:33.3064 [PID=189622] Sending reply to [HOST#4741]: 9 results, delay req 61.00
Most of those are debug elements which David had asked for, chosen from https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Logging. If you ask Andy nicely ...
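
To make the units point concrete, here is a small Python sketch using figures from the log above: 9 results sent, each with an estimated duration of 621.10 s. It only shows how the same tasks add up to different totals depending on the unit used; it is not the server's actual calculation, the function name is hypothetical, and the factor of six is the one quoted above for MT tasks.

```python
def queued_seconds(n_tasks: int, est_duration: float,
                   cores_per_task: int = 1) -> float:
    """Total estimated time for n_tasks, optionally counted per CPU core."""
    return n_tasks * est_duration * cores_per_task


wallclock = queued_seconds(9, 621.10)                    # wallclock time per task
core_time = queued_seconds(9, 621.10, cores_per_task=6)  # cpu-core time per task
print(f"wallclock: {wallclock:.1f} s, cpu-core: {core_time:.1f} s")
```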
ID: 66435 · Report as offensive
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,695,853
RAC: 10,233
Message 66436 - Posted: 15 Nov 2022, 17:35:24 UTC - in response to Message 66434.  

The model checkpoint frequency is fixed at run start, it's not dynamic as you suggest.
I was describing the generic schema for BOINC projects as a whole, which in general can checkpoint very quickly and easily.

Your case is different, and you quite understandably want to set a longer minimum interval to allow the main app to get on with it. The default minimum is 60 seconds. But some users might want to increase even your extended minimum. I was only suggesting that the minimum might be extensible outwards, not reduced.
ID: 66436 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,420,897
RAC: 16,732
Message 66437 - Posted: 15 Nov 2022, 17:46:57 UTC - in response to Message 66436.  
Last modified: 15 Nov 2022, 17:57:44 UTC

Your case is different, and you quite understandably want to set a longer minimum interval to allow the main app to get on with it. The default minimum is 60 seconds. But some users might want to increase even your extended minimum. I was only suggesting that the minimum might be extensible outwards, not reduced.
In practice we've found few instances when restarts are needed if the model task completes in 10-12 hrs. If the checkpoint interval is lengthened by the user, a restart will result in longer runtimes as the model will have to repeat more timesteps from the previous checkpoint. As each step is relatively time-consuming, that needs to be balanced against reducing I/O. It also duplicates model output, resulting in bigger output files. This is why we spent time experimenting with different checkpointing options to get one that was as near optimum as possible for different model resolutions.
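
The balance between checkpoint I/O and repeated timesteps can be sketched with a very rough cost model. This Python example is an illustration only, not the analysis that was actually done for OpenIFS: the function name, the 11-hour run, the 20-second checkpoint cost and the single restart are all made-up numbers.

```python
def expected_overhead(run_seconds: float, interval: float,
                      checkpoint_cost: float, restarts: int) -> float:
    """Rough extra seconds spent on checkpoint I/O plus recomputation."""
    n_checkpoints = run_seconds // interval
    io_cost = n_checkpoints * checkpoint_cost
    # On average a restart loses about half an interval of work.
    rework = restarts * interval / 2.0
    return io_cost + rework


# Hypothetical 11-hour run, 20 s per checkpoint, one restart.
for interval in (600, 1800, 3600):
    extra = expected_overhead(39600, interval, 20, 1)
    print(f"checkpoint every {interval:>4} s: about {extra:.0f} s overhead")
```

Under these assumptions the middle interval comes out cheapest: frequent checkpoints are dominated by I/O cost, infrequent ones by the work repeated after a restart.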
ID: 66437 · Report as offensive

©2024 cpdn.org