climateprediction.net (CPDN) home page
Thread 'New Work Announcements 2024'

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70961 - Posted: 8 Jun 2024, 11:14:29 UTC - in response to Message 70959.  


> As I mentioned in an earlier post, the model finishes correctly, but the controlling code has miscalculated the number of upload files expected, so it fails the batch even though all the results are there. So please let the tasks run, as the results are still usable.
>
> The BL OIFS app only needs ~3.5 GB RAM. The normal OIFS app needs more, ~6 GB.


OK. All mine are failing the same way, but I am letting them complete.

Mine use a little more than 3. GB at times.
top - 07:04:32 up 2 days, 19:32,  2 users,  load average: 14.17, 14.43, 14.58
Tasks: 486 total,  15 running, 471 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.6 sy, 86.9 ni, 11.7 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem : 128086.0 total,   3521.9 free,  17893.0 used, 106671.1 buff/cache
MiB Swap:  15992.0 total,  15990.5 free,      1.5 used. 108432.6 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 537222  537215 boinc     39  19 R   5.1g   4.1  99.3 12  97:14.89 /var/lib/boinc/slots/14/oifs_43r3_model.exe                               
 508741  508738 boinc     39  19 R   4.6g   3.7  99.4  1 353:06.52 /var/lib/boinc/slots/3/oifs_43r3_model.exe                                
 504560  504516 boinc     39  19 R   2.4g   1.9  99.5 13 375:49.65 /var/lib/boinc/slots/0/oifs_43r3_model.exe                                
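
To see how much memory the running models hold in total, the RES column of top output like the above can be parsed and summed. This is only a minimal sketch: the helper names are made up for illustration, and it assumes top's usual unit suffixes (no suffix = KiB, m = MiB, g = GiB, t = TiB) and the 12-column layout shown here.

```python
# Hypothetical helpers: total the RES column for OpenIFS processes listed by top.
UNIT_TO_GIB = {"k": 1 / 1024**2, "m": 1 / 1024, "g": 1.0, "t": 1024.0}

def res_to_gib(res: str) -> float:
    """Convert a top RES value such as '5.1g' or '123456' (KiB) to GiB."""
    res = res.strip().lower()
    if res[-1] in UNIT_TO_GIB:
        return float(res[:-1]) * UNIT_TO_GIB[res[-1]]
    return float(res) / 1024**2  # a bare number is in KiB

def total_res_gib(top_lines, command_substring="oifs_43r3_model.exe"):
    """Sum RES (7th column in the layout above) over matching process lines."""
    total = 0.0
    for line in top_lines:
        fields = line.split()
        if len(fields) >= 12 and command_substring in fields[-1]:
            total += res_to_gib(fields[6])
    return total

sample = [
    "537222 537215 boinc 39 19 R 5.1g 4.1 99.3 12 97:14.89 /var/lib/boinc/slots/14/oifs_43r3_model.exe",
    "508741 508738 boinc 39 19 R 4.6g 3.7 99.4 1 353:06.52 /var/lib/boinc/slots/3/oifs_43r3_model.exe",
    "504560 504516 boinc 39 19 R 2.4g 1.9 99.5 13 375:49.65 /var/lib/boinc/slots/0/oifs_43r3_model.exe",
]
print(f"{total_res_gib(sample):.1f} GiB resident")  # prints "12.1 GiB resident"
```

Because each task's RES rises and falls over time, a single snapshot like this understates the peak; it is better treated as a spot check than as a sizing figure.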
 
Computer 1511241
CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.10 (Ootpa) [4.18.0-553.5.1.el8_10.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	125.08 GB
Cache 	16896 KB

ID: 70961

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70962 - Posted: 8 Jun 2024, 11:15:17 UTC - in response to Message 70960.  
Last modified: 8 Jun 2024, 11:18:49 UTC

And there might even be some left for when my new machine arrives! The trick might be to let it start downloading enough for all 24 threads and then, while they are downloading, reduce the number of CPUs BOINC can use to 6, which should leave plenty of headroom with 64 GB RAM.
ID: 70962

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,134,602
RAC: 71,196
Message 70965 - Posted: 8 Jun 2024, 15:05:56 UTC - in response to Message 70959.  
Last modified: 8 Jun 2024, 15:10:34 UTC

Ouch, I just read the "Batch 1017 Errors" post. I didn't know we'd use batch numbers across apps and thought that must be a continuation for WAH batches and skipped the post... Sorry for the duplicates.

On the other hand, as Jean-David Beyer also observed, RSS usage is not capped at 3.5 GB. It looks like the normal OIFS app. This is what I got when I collected RSS every second for 10 minutes:
2311604 - 2488436: ************** (82, 13.8%)
2488437 - 2665269: **************** (15, 16.3%)
2665270 - 2842101: ******************** (19, 19.5%)
2842102 - 3018934: *********************** (20, 22.9%)
3018935 - 3195766: ************************** (21, 26.4%)
3195767 - 3372599: ****************************** (20, 29.8%)
3372600 - 3549431: ******************************** (16, 32.5%)
3549432 - 3726264: ********************************** (10, 34.2%)
3726265 - 3903097: ************************************ (11, 36.0%)
3903098 - 4079929: ********************************************************************************** (272, 81.8%)
4079930 - 4256762: *********************************************************************************** (8, 83.2%)
4256763 - 4433594: ************************************************************************************* (9, 84.7%)
4433595 - 4610427: ************************************************************************************** (8, 86.0%)
4610428 - 4787259: **************************************************************************************** (12, 88.0%)
4787260 - 4964092: ****************************************************************************************** (13, 90.2%)
4964093 - 5140925: **************************************************************************************************** (58, 100.0%)
ID: 70965

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70966 - Posted: 8 Jun 2024, 17:35:35 UTC - in response to Message 70965.  

If I understand your graph correctly, it seems the working set is monotonically increasing. In the short run, that may be true. I only read what my top program shows, and it updates every 19 seconds. In my experience, the working set increases for a while, then drops back, rinse and repeat. I.e., the process allocates more and more RAM up to a certain point, gives some back, and does another cycle.
ID: 70966

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,134,602
RAC: 71,196
Message 70967 - Posted: 8 Jun 2024, 18:16:31 UTC - in response to Message 70966.  
Last modified: 8 Jun 2024, 18:17:29 UTC

Ah, sorry, I should have explained. It's not a time series but a histogram. It samples RSS over 10 minutes at one sample per second and groups the samples into buckets. RSS is whatever `ps` shows. The numbers on the left are the recorded RSS values (in KiB, as `ps` reports them), divided into equal-width buckets. The number on the right of each bar is the number of samples that fall into that bucket. The percentage is the cumulative share of samples that fall into this bucket and below. The stars are just visualization. You can think of this graph as a CDF rotated by 90 degrees.

Yes, the actual memory allocation pattern is as you described. My goal with this little script is to figure out the range of RSS this task actually uses over time, so that I can set the concurrency correctly.
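
For anyone wanting to try the same thing, here is a minimal sketch of that kind of script. It is not the script used for the histogram above: the function names, bucket count and bar scaling are assumptions made for illustration. It samples RSS with `ps -o rss= -p PID` (which reports KiB), groups the samples into equal-width buckets, and prints each bucket's count and cumulative percentage.

```python
import subprocess
import time

def sample_rss_kib(pid: int) -> int:
    """Read the current RSS of a process in KiB using ps."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def collect(pid: int, seconds: int = 600, interval: float = 1.0) -> list:
    """Take one RSS sample per interval, `seconds` times."""
    samples = []
    for _ in range(seconds):
        samples.append(sample_rss_kib(pid))
        time.sleep(interval)
    return samples

def cumulative_histogram(samples, buckets: int = 16):
    """Return (low, high, count, cumulative_percent) for each equal-width bucket."""
    lo, hi = min(samples), max(samples)
    step = max((hi - lo) / buckets, 1)
    counts = [0] * buckets
    for s in samples:
        # Clamp so the largest sample lands in the last bucket.
        idx = min(int((s - lo) / step), buckets - 1)
        counts[idx] += 1
    rows, running = [], 0
    for i, count in enumerate(counts):
        running += count
        rows.append((int(lo + i * step), int(lo + (i + 1) * step),
                     count, 100.0 * running / len(samples)))
    return rows

def print_histogram(rows):
    """Print one line per bucket, with one star per cumulative percentage point."""
    for low, high, count, cum in rows:
        print(f"{low} - {high}: {'*' * round(cum)} ({count}, {cum:.1f}%)")
```

For a running task with a known process ID, `print_histogram(cumulative_histogram(collect(pid)))` would sample for ten minutes and then print the table. The top bucket boundary gives a rough peak to budget for per task.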
ID: 70967

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,134,602
RAC: 71,196
Message 70968 - Posted: 8 Jun 2024, 18:52:47 UTC

A different topic. Are there any criteria gating which clients can get new tasks? Most of my Linux machines are happily crunching, except one host that I've migrated from a physical disk to a VM. I've since reset the project and waited for the 1 hour update interval many times, but each time I still get a reply of no new tasks. I also tried uninstalling BOINC, clearing the data directory and installing again. That didn't help either, though the new client gets associated with the same host ID, so if it's some server-side filtering it won't make a difference anyway.
ID: 70968

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,528,638
RAC: 17,959
Message 70970 - Posted: 8 Jun 2024, 19:31:56 UTC - in response to Message 70968.  

Probably because there are no more Linux tasks available, according to the server status. I have stopped resends for batch 1017, otherwise we'll be swamped by always-failing tasks.
> A different topic. Are there any criteria gating which clients can get new tasks? Most of my Linux machines are happily crunching, except one host that I've migrated from a physical disk to a VM. I've since reset the project and waited for the 1 hour update interval many times, but each time I still get a reply of no new tasks. I also tried uninstalling BOINC, clearing the data directory and installing again. That didn't help either, though the new client gets associated with the same host ID, so if it's some server-side filtering it won't make a difference anyway.
ID: 70970

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,134,602
RAC: 71,196
Message 70971 - Posted: 8 Jun 2024, 19:49:02 UTC - in response to Message 70970.  

> Probably because there are no more Linux tasks available, according to the server status. I have stopped resends for batch 1017, otherwise we'll be swamped by always-failing tasks.

Thanks. Oops, I read the wrong column and thought tasks were still available. Guess I will wait for the next batch of fun while figuring out how to not be upload bandwidth limited next time... :-)
ID: 70971

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,528,638
RAC: 17,959
Message 70973 - Posted: 11 Jun 2024, 11:30:13 UTC

I'm told there's more Windows & Linux work on the way.

Windows: More batches from Weather@Home for the New Zealand configuration (NZ25) will come first, followed by more batches for the East Asia configuration (EAS25). Note that the NZ batch will use WAH2 version 8.24, whereas the EAS25 batches will use a new WAH-RI version 8.31.

Linux: There's also a rerun of the flawed 1017 batch for OpenIFS on its way.
---
CPDN Visiting Scientist
ID: 70973

Yeti
Joined: 5 Aug 04
Posts: 178
Credit: 18,956,646
RAC: 44,988
Message 70982 - Posted: 13 Jun 2024, 20:21:03 UTC

Regarding wah2 region independent on Windows (Batch 1006 / 1015 ???), how much RAM should I calculate for each task?
Supporting BOINC, a great concept !
ID: 70982

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70983 - Posted: 13 Jun 2024, 20:37:56 UTC - in response to Message 70982.  

> Regarding wah2 region independent on Windows (Batch 1006 / 1015 ???), how much RAM should I calculate for each task?
I reckon on allowing 2 GB/task normally on WAH2, which leaves some spare. In practice on my new machine it is always going to be my upload bandwidth that limits me till my connection is upgraded.
ID: 70983

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,528,638
RAC: 17,959
Message 70987 - Posted: 14 Jun 2024, 7:46:58 UTC - in response to Message 70982.  

> Regarding wah2 region independent on Windows (Batch 1006 / 1015 ???), how much RAM should I calculate for each task?

The WaH tasks will take no more than 500 MB RAM. That applies to both wah2 and wah2-ri.
OpenIFS tasks take much more, 5 GB. Note the change of units.
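
As a rough way to turn per-task figures like these into a concurrency limit, divide the RAM you are willing to give BOINC by the per-task peak and cap the result at the number of threads. This is a minimal sketch only: the function name and the reserve left for the rest of the system are assumptions, and the per-task numbers are just the ones quoted in this thread.

```python
def max_concurrent_tasks(total_ram_gb: float, per_task_gb: float,
                         threads: int, reserve_gb: float = 4.0) -> int:
    """Largest task count that fits in RAM after a reserve, capped by thread count."""
    usable = max(total_ram_gb - reserve_gb, 0.0)
    return min(int(usable // per_task_gb), threads)

# 64 GB machine with 24 threads, OpenIFS at about 5 GB per task: RAM is the limit.
print(max_concurrent_tasks(64, 5.0, 24))   # prints 12
# Same machine, WaH at about 0.5 GB per task: the thread count is the limit.
print(max_concurrent_tasks(64, 0.5, 24))   # prints 24
```

The result could then be applied either by lowering the number of CPUs BOINC may use, or through the `max_concurrent` and `project_max_concurrent` settings in BOINC's `app_config.xml`. Since OpenIFS tasks have been seen above 5 GB at times, using a slightly larger per-task figure leaves a safer margin.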
---
CPDN Visiting Scientist
ID: 70987

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,528,638
RAC: 17,959
Message 71025 - Posted: 24 Jun 2024, 11:59:13 UTC

New Weather@Home batch going out today. NZ25 domain, Windows only, app version 8.24.
---
CPDN Visiting Scientist
ID: 71025

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71026 - Posted: 24 Jun 2024, 13:36:04 UTC - in response to Message 71025.  
Last modified: 24 Jun 2024, 13:38:02 UTC

> New Weather@Home batch going out today. NZ25 domain, Windows only, app version 8.24.
3150 25-month tasks. Roll up roll up, they won't last long!
ID: 71026

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,528,638
RAC: 17,959
Message 71027 - Posted: 24 Jun 2024, 14:25:51 UTC - in response to Message 71025.  

Some tasks were sent out as batch 995. This was a mistake. The correct batch is 1019. If you have a task from 995 it can be aborted. Don't waste time running it, as the results are not needed. It's a previously run batch.
---
CPDN Visiting Scientist
ID: 71027

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,528,638
RAC: 17,959
Message 71028 - Posted: 24 Jun 2024, 14:43:28 UTC - in response to Message 71026.  

>> New Weather@Home batch going out today. NZ25 domain, Windows only, app version 8.24.
> 3150 25-month tasks. Roll up roll up, they won't last long!
Please don't download and sit on a pile of unstarted tasks though...
---
CPDN Visiting Scientist
ID: 71028

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71030 - Posted: 24 Jun 2024, 15:28:20 UTC - in response to Message 71027.  

> Some tasks were sent out as batch 995. This was a mistake. The correct batch is 1019. If you have a task from 995 it can be aborted.


I got two of each on my pipsqueak machine. I just aborted the 995 ones.

My big machine is Linux only, so I got none of these (of course).
ID: 71030

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,386,425
RAC: 10,115
Message 71036 - Posted: 24 Jun 2024, 21:23:39 UTC - in response to Message 71027.  

> Some tasks were sent out as batch 995. This was a mistake. The correct batch is 1019. If you have a task from 995 it can be aborted. Don't waste time running it, as the results are not needed. It's a previously run batch.
28 deg C here, today. I wondered why the desktop PC was making extra fan-noise when I got home. Six tasks from batch 995 aborted as requested.
ID: 71036

Paul
Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 71038 - Posted: 25 Jun 2024, 9:11:24 UTC

Aren't the Weather At Home 2 (wah2) v8.24 the ones that crash on restart? Or has this been solved?

Just wondering whether to abort the one that I've got. It's survived 2 restarts so far, so if there is still a problem, its luck must run out soon.

Thanks!
ID: 71038

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,724,038
RAC: 7,570
Message 71039 - Posted: 25 Jun 2024, 9:22:11 UTC - in response to Message 71038.  

I can answer that! I've just had a brief (1 or 2 seconds) power outage, and everything shut down. On power up (and after waiting ages for the router to restart), I can see that the four tasks I got from this batch (v8.24 app, batch 1019 data) have picked up and restarted running from the point they'd reached.
ID: 71039

©2024 cpdn.org