climateprediction.net (CPDN) home page
Thread 'New work discussion - 2'

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66680 - Posted: 30 Nov 2022, 22:42:44 UTC - in response to Message 66660.  

Error code 9 : Some users have reported seeing 'task exited with error code 9'. This is an indication of lack of system memory. Reduce the number of OpenIFS tasks you have running.


Strange. I wonder where that error code is coming from? In Linux, when a process returns an error code, it is supposed to come from this list. This is file
/usr/include/asm-generic/errno-base.h
in my version of Red Hat Enterprise Linux 8. These have not changed in years (decades perhaps).

#define EPERM            1      /* Operation not permitted */
#define ENOENT           2      /* No such file or directory */
#define ESRCH            3      /* No such process */
#define EINTR            4      /* Interrupted system call */
#define EIO              5      /* I/O error */
#define ENXIO            6      /* No such device or address */
#define E2BIG            7      /* Argument list too long */
#define ENOEXEC          8      /* Exec format error */
#define EBADF            9      /* Bad file number */   <---<<<
#define ECHILD          10      /* No child processes */
#define EAGAIN          11      /* Try again */
#define ENOMEM          12      /* Out of memory */     <---<<<
#define EACCES          13      /* Permission denied */
#define EFAULT          14      /* Bad address */
#define ENOTBLK         15      /* Block device required */
#define EBUSY           16      /* Device or resource busy */
#define EEXIST          17      /* File exists */
#define EXDEV           18      /* Cross-device link */
#define ENODEV          19      /* No such device */
#define ENOTDIR         20      /* Not a directory */
#define EISDIR          21      /* Is a directory */
#define EINVAL          22      /* Invalid argument */
#define ENFILE          23      /* File table overflow */
#define EMFILE          24      /* Too many open files */
#define ENOTTY          25      /* Not a typewriter */
#define ETXTBSY         26      /* Text file busy */
#define EFBIG           27      /* File too large */
#define ENOSPC          28      /* No space left on device */
#define ESPIPE          29      /* Illegal seek */
#define EROFS           30      /* Read-only file system */
#define EMLINK          31      /* Too many links */
#define EPIPE           32      /* Broken pipe */
#define EDOM            33      /* Math argument out of domain of func */
#define ERANGE          34      /* Math result not representable */
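
For anyone who would rather query this table from a script than read the header, here is a minimal sketch in Python, assuming a Linux system. The helper name `describe_errno` is made up for illustration, and nothing in it says where the 'error code 9' reported by the tasks actually originates; it only looks up what the numbers 9 and 12 mean as errno values.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value such as 9 or 12."""
    # errno.errorcode maps numbers to symbolic names on the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's human-readable text for the number.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    for code in (9, 12):
        print(code, describe_errno(code))
```

The message text comes from the C library, so the wording may differ slightly from the comments in the header.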

ID: 66680

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,431,665
RAC: 17,512
Message 66697 - Posted: 1 Dec 2022, 16:43:48 UTC - in response to Message 66680.  

Error code 9 : Some users have reported seeing 'task exited with error code 9'. This is an indication of lack of system memory. Reduce the number of OpenIFS tasks you have running.
Strange. I wonder where that error code is coming from? In Linux, when a process returns an error code, it is supposed to come from this list. This is file
/usr/include/asm-generic/errno-base.h
in my version of Red Hat Enterprise Linux 8. These have not changed in years (decades perhaps).
I am not sure about the use of that file, but 'kill -l' (little 'ell') on any terminal gives the list of signals.
ID: 66697

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66702 - Posted: 1 Dec 2022, 20:45:22 UTC - in response to Message 66697.  

Error code 9 : Some users have reported seeing 'task exited with error code 9'. This is an indication of lack of system memory. Reduce the number of OpenIFS tasks you have running.

Strange. I wonder where that error code is coming from? In Linux, when a process returns an error code, it is supposed to come from this list. This is file
/usr/include/asm-generic/errno-base.h
in my version of Red Hat Enterprise Linux 8. These have not changed in years (decades perhaps).

I am not sure about the use of that file, but 'kill -l' (little 'ell') on any terminal gives the list of signals.


True, but the kill signals are not the same as the error codes.

A kill signal is a way one process, even a shell, can communicate with a running process.
An error code is a way for an exiting process to communicate to the user why it is exiting.

kill -15 (SIGTERM) urges a process to get off, but gives it a chance to do it gracefully.
kill -9 (SIGKILL) terminates a process with extreme prejudice.
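
The difference can be seen from a parent process. This is only a sketch, assuming Linux and Python 3; `run_child` is a hypothetical helper, and it does not claim to show how the BOINC client arrives at its 'error code 9'. It runs one child that chooses to exit with status 9 and another that is killed by signal 9, and shows that the parent can tell the two apart.

```python
import signal
import subprocess
import sys


def run_child(source: str) -> int:
    """Run a short Python snippet in a child process and return its returncode."""
    return subprocess.run([sys.executable, "-c", source]).returncode


if __name__ == "__main__":
    # The child chooses to exit with status 9: the parent sees a plain 9.
    exited = run_child("import sys; sys.exit(9)")
    print("exit status 9 ->", exited)

    # The child is killed by signal 9: subprocess reports that as -9.
    # A shell such as bash would show the same event as status 128 + 9 = 137.
    killed = run_child("import os, signal; os.kill(os.getpid(), signal.SIGKILL)")
    print("killed by", signal.Signals(-killed).name, "->", killed)
```

So a bare "9" in a log is ambiguous until you know whether it is an exit status, a signal number, or an errno value.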
ID: 66702

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66707 - Posted: 1 Dec 2022, 22:33:55 UTC - in response to Message 66702.  

kill -15 (SIGTERM) urges a process to get off, but gives it a chance to do it gracefully.
kill -9 (SIGKILL) terminates a process with extreme prejudice.
Those two descriptions made me laugh hysterically for many reasons.
ID: 66707

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66754 - Posted: 3 Dec 2022, 21:08:29 UTC - in response to Message 66707.  

kill -15 (SIGTERM) urges a process to get off, but gives it a chance to do it gracefully.
kill -9 (SIGKILL) terminates a process with extreme prejudice.

Those two descriptions made me laugh hysterically for many reasons.


Thank you. ;-)

In the old days, when running various releases of Linux, if you shut down the system or rebooted it in the normal way, part of the shut-down procedure in /etc/rc.d/rc0.d or rc6.d would send a kill -s 15 to all the remaining processes, wait (IIRC) 15 seconds, and then send a kill -s 9 that kicked them off.

A process that wanted to do an emergency cleanup would catch the -15 signal, close files, or whatever it needed to do, and then exit. If it ignored the signal, the -9 would kill it with no cooperation from the process.
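
From the process's side, that looks roughly like the sketch below (Python 3 on Linux assumed; the names `on_sigterm`, `can_catch` and `demo` are invented for the example). It installs a handler for signal 15, sends that signal to itself, and also checks that no handler can be installed for signal 9, which is why the -9 needs no cooperation.

```python
import os
import signal
import time

shutdown_requested = False


def on_sigterm(signum, frame):
    """Handler: only note the request; real cleanup belongs in the main flow."""
    global shutdown_requested
    shutdown_requested = True


def _restore(sig, previous):
    # signal.signal() returns None if the old handler was not set from Python.
    signal.signal(sig, previous if previous is not None else signal.SIG_DFL)


def can_catch(sig: int) -> bool:
    """Return True if a handler can be installed for this signal."""
    try:
        previous = signal.signal(sig, on_sigterm)
    except (OSError, ValueError):
        return False
    _restore(sig, previous)
    return True


def demo() -> bool:
    """Install the handler, send ourselves SIGTERM, report whether it was seen."""
    global shutdown_requested
    shutdown_requested = False
    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        os.kill(os.getpid(), signal.SIGTERM)
        # Give the interpreter a moment to run the handler.
        deadline = time.monotonic() + 1.0
        while not shutdown_requested and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        _restore(signal.SIGTERM, previous)
    return shutdown_requested


if __name__ == "__main__":
    print("SIGTERM caught:", demo())
    print("SIGTERM catchable:", can_catch(signal.SIGTERM))
    print("SIGKILL catchable:", can_catch(signal.SIGKILL))
```

A real daemon would close its files and exit once `shutdown_requested` is set; the flag-only handler is a design choice to keep the handler itself short.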

Nowadays, with systems controlled by systemd, it is somewhat different and I have never bothered to see how it works. It is certainly not in /etc/rc.d.

See the halt command manual page for how to do it these days.
ID: 66754

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66758 - Posted: 4 Dec 2022, 2:34:29 UTC - in response to Message 66754.  

In the old days, when running various releases of Linux, if you shut down the system or rebooted it in the normal way, part of the shut-down procedure in /etc/rc.d/rc0.d or rc6.d would send a kill -s 15 to all the remaining processes, wait (IIRC) 15 seconds, and then send a kill -s 9 that kicked them off.

A process that wanted to do an emergency cleanup would catch the -15 signal, close files, or whatever it needed to do, and then exit. If it ignored the signal, the -9 would kill it with no cooperation from the process.
And if it needed more than 15 seconds? There could be massive disk writes needed. Is the user not prompted to allow it more time?
ID: 66758

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66759 - Posted: 4 Dec 2022, 3:19:48 UTC - in response to Message 66758.  

In the old days, when running various releases of Linux, if you shut down the system or rebooted it in the normal way, part of the shut-down procedure in /etc/rc.d/rc0.d or rc6.d would send a kill -s 15 to all the remaining processes, wait (IIRC) 15 seconds, and then send a kill -s 9 that kicked them off.

A process that wanted to do an emergency cleanup would catch the -15 signal, close files, or whatever it needed to do, and then exit. If it ignored the signal, the -9 would kill it with no cooperation from the process.

And if it needed more than 15 seconds? There could be massive disk writes needed. Is the user not prompted to allow it more time?


No; tough luck. Remember that most programs running on a Linux system are unattended daemon processes waiting for an input message from another process. So there is no user there to prompt.

On my system at the moment, it says
Tasks: 470 total, 13 running, 456 sleeping, 0 stopped,

12 of those running are my Boinc tasks, which run even when I am asleep, logged out, and my monitor is turned off.

Think about it: If I wanted to shut down my system, I would first have to tell my Boinc client to do no new tasks for each project.
Then I would want to wait until all running processes ended, which could have been several months in the old days. It could be quicker, because most projects can tolerate normal shutdowns, so after setting no-new-tasks I could just do a Suspend on all the running processes.
But what if power fails and my UPS decides to do a controlled shutdown of the system? If it is a small UPS, it may have only a very few minutes to do all that needs to be done.

If you are the system administrator, you could change the 15 second interval that would apply to all processes.
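
The "ask nicely, wait, then insist" sequence with an adjustable interval can be sketched as follows. This is not how any particular init system implements it; `stop_process` and `start_stubborn_child` are hypothetical helpers, and the 15-second default merely mirrors the figure I recalled above. Python 3 on Linux is assumed.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 15.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL. Return the returncode."""
    proc.terminate()                      # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL on POSIX
        return proc.wait()


def start_stubborn_child() -> subprocess.Popen:
    """Start a child that ignores SIGTERM; block until it reports it is ready."""
    source = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", source],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()                # wait for the 'ready' line
    return proc


if __name__ == "__main__":
    child = start_stubborn_child()
    # A short grace period so the demonstration finishes quickly.
    print("returncode:", stop_process(child, grace_seconds=0.5))
    child.stdout.close()
```

A child that ignores signal 15 ends with returncode -9 once the grace period runs out; one that leaves signal 15 at its default ends with -15 almost immediately.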

But remember that the sysadmin could run an existing program, wall (write a message to all users), to send out a warning ahead of time before doing the system shutdown. That program sends a message to the terminals of all logged-in users. Depending on the release of Linux or UNIX you might be using, there are many other handy ways to handle these things.
ID: 66759

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66760 - Posted: 4 Dec 2022, 4:21:01 UTC - in response to Message 66759.  
Last modified: 4 Dec 2022, 4:22:57 UTC

Remember that most programs running on a Linux system are unattended daemon processes waiting for an input message from another process. So there is no user there to prompt.

On my system at the moment, it says
Tasks: 470 total, 13 running, 456 sleeping, 0 stopped,

12 of those running are my Boinc tasks, which run even when I am asleep, logged out, and my monitor is turned off.

Think about it: If I wanted to shut down my system, I would first have to tell my Boinc client to do no new tasks for each project.
Then I would want to wait until all running processes ended, which could have been several months in the old days. It could be quicker, because most projects can tolerate normal shutdowns, so after setting no-new-tasks I could just do a Suspend on all the running processes.
But what if power fails and my UPS decides to do a controlled shutdown of the system? If it is a small UPS, it may have only a very few minutes to do all that needs to be done.
I'd make it alert the user on the screen. If the user isn't there, shut down anyway after a bit longer. It would only have to be something like 15+15 seconds.

My UPS doesn't cause a shutdown. It causes a hibernate - zero power usage. Anyway with two caravan leisure batteries at 110Ah each, I've never seen that happen. Boinc suspends when it's on battery anyway, so it doesn't use much. Why do people pay so much for UPS batteries? They're several times more expensive than some big deep cycle lead acids.
ID: 66760

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66761 - Posted: 4 Dec 2022, 9:39:35 UTC
Last modified: 4 Dec 2022, 9:44:32 UTC

My personal biggest issue with the current new work is this one – i.e. that there is no new work.

From December 7 to 17, my CPUs have long been scheduled to run another project, and I had hoped to cache and complete as many OpenIFS tasks as possible beforehand, so that I could fill these 10 days with OpenIFS result file uploads through my narrow Internet uplink. But alas, and as usual, the one thing which cannot be done at CPDN is to follow a plan. Right now I am completing my last buffered OpenIFS task and am uploading 130 GB of pending result files, which will take me less than two days. So, the singular resource of mine which limits my CPDN contribution currently – upload bandwidth – will stay unused during these 10 spare days. The very brief (as usual) windows of work availability last Monday and Tuesday were too short for me to figure out how much work to buffer. On the other hand, since it was such a small amount of work, there are more than enough other participants around here to get these two little OpenIFS batches completed quickly anyway, so all is fine from that end.
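
As a rough check on those figures, the average rate needed to move 130 GB in two days can be worked out as below. This is only back-of-envelope arithmetic, assuming decimal gigabytes (10**9 bytes) and a link that is saturated the whole time; the function name `required_mbit_per_s` is made up for the example.

```python
def required_mbit_per_s(gigabytes: float, days: float) -> float:
    """Average rate in Mbit/s needed to transfer `gigabytes` within `days`."""
    bits = gigabytes * 1e9 * 8          # assumes decimal GB (10**9 bytes)
    seconds = days * 24 * 60 * 60
    return bits / seconds / 1e6


if __name__ == "__main__":
    # 130 GB in exactly two days; finishing sooner needs a higher rate.
    print(f"{required_mbit_per_s(130, 2):.2f} Mbit/s")
```

That comes to about 6 Mbit/s sustained, so "less than two days" implies an uplink somewhat faster than that.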

But if CPDN were to undertake a campaign like, just as an example, getting ≥42,000 results returned within a month, then more participants than are currently active would seem to be required. Which won't happen, since most people who run BOINC are not attending projects with irregular work availability. Which in turn limits the throughput available to CPDN = limits the scope of projects that are feasible to put onto CPDN. It's a catch-22. :-(
ID: 66761

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66762 - Posted: 4 Dec 2022, 9:52:41 UTC - in response to Message 66761.  
Last modified: 4 Dec 2022, 9:53:00 UTC

So, the singular resource of mine which limits my CPDN contribution currently – upload bandwidth – will stay unused during these 10 spare days.
It's not CPDN's fault you have a slow internet connection. I find it hard to believe those still exist.

But if CPDN were to undertake a campaign like, just as an example, getting ≥42,000 results returned within a month, then more participants than are currently active would seem to be required. Which won't happen, since most people who run BOINC are not attending projects with irregular work availability. Which in turn limits the throughput available to CPDN = limits the scope of projects that are feasible to put onto CPDN. It's a catch-22. :-(
Where did you get that idea? Most people such as myself just leave CPDN getting new work if and when it does and carry on with other projects. When CPDN work appears, Boinc will jump on it as it has a work debt.
ID: 66762

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,989,107
RAC: 21,788
Message 66764 - Posted: 4 Dec 2022, 10:17:52 UTC - in response to Message 66762.  

Where did you get that idea? Most people such as myself just leave CPDN getting new work if and when it does and carry on with other projects. When CPDN work appears, Boinc will jump on it as it has a work debt.
Agreed, Peter. The current batches are now all over 50% complete. The thing the project could do to get the tasks back even more quickly would be to set a still shorter deadline and reissue tasks after, say, ten or twelve days. Even running only two at a time, to keep the backlog of uploads from building up, I will finish the ones in my queue in less than 6 days from getting the last ones.
ID: 66764

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66768 - Posted: 4 Dec 2022, 14:00:59 UTC - in response to Message 66764.  

Deadline now is about one month.
My machine can do better than that, so IMAO, they could cut the deadline in half one more time.
Average upload rate           84.26 KB/sec
Average download rate      18781.17 KB/sec

OpenIFS 43r3 Perturbed Surface 1.01 x86_64-pc-linux-gnu
Number of tasks completed     21
Max tasks per day             25
Number of tasks today          0
Consecutive valid tasks       21
Average processing rate       27.91 GFLOPS
Average turnaround time        1.35 days    <---<<<

ID: 66768

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66769 - Posted: 4 Dec 2022, 15:14:58 UTC - in response to Message 66762.  
Last modified: 4 Dec 2022, 15:31:08 UTC

Mr. P Hucker wrote:
It's not CPDN's fault you have a slow internet connection. I find it hard to believe those still exist.
It's got a slow upstream link, but one which is still above average in the country where I live.

(Its slowness is only relative; I could still run 30 tasks of the last batch at once without building an upload backlog.)

Mr. P Hucker wrote:
[...] xii5ku wrote:
most people who run BOINC are not attending projects with irregular work availability.
[...] Where did you get that idea?
By looking around who is doing what.
ID: 66769

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66770 - Posted: 4 Dec 2022, 15:48:30 UTC - in response to Message 66769.  

It's got a slow upstream link, but one which is still above average in the country where I live.
Mine is slow compared to everyone I know, I only get 7 Mbit up, but 32 (was 54, they claim the trunk line got too busy, probably they don't like what I download) Mbit down. Everyone else seems to have symmetrical connections.

Do you live in the UK? Your team is British.
ID: 66770

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,431,665
RAC: 17,512
Message 66771 - Posted: 4 Dec 2022, 16:03:46 UTC - in response to Message 66761.  

xii5ku.
But alas, and as usual, the one thing which cannot be done at CPDN is to follow a plan.
And what plan exactly are you referring to? I've been attending regular meetings, and working closely with them for many years (they are a very small team as you may or may not know), so I know very well what's going on. There are multiple projects on the go and CPDN have to wait until the scientist/s is/are ready. That is often where the wait time is. I find that comment annoying and plainly inaccurate.

CPDN is following the needs of the science projects, thankfully not the needs of your computers. They survive on a shoestring budget from the science projects they are able to attract.
---
CPDN Visiting Scientist
ID: 66771

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,431,665
RAC: 17,512
Message 66772 - Posted: 4 Dec 2022, 16:05:55 UTC - in response to Message 66768.  

Deadline now is about one month.
My machine can do better than that, so IMAO, they could cut the deadline in half one more time.
Yes, we thought about that, but going from a year to 1 month was quite a big change, so the idea was to watch how things went. In any case, in practice what happens is that the very slow tasks are resent after a couple of weeks, before the deadline.
ID: 66772

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66773 - Posted: 4 Dec 2022, 16:37:39 UTC - in response to Message 66772.  

Deadline now is about one month.
My machine can do better than that, so IMAO, they could cut the deadline in half one more time.

Yes, we thought about that, but going from a year to 1 month was quite a big change, so the idea was to watch how things went. In any case, in practice what happens is that the very slow tasks are resent after a couple of weeks, before the deadline.


Looks like very good planning indeed!
ID: 66773

SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 66774 - Posted: 4 Dec 2022, 18:58:18 UTC - in response to Message 66762.  

I find it hard to believe those still exist.


Why? Lots of people, on this very board, are telling you that they have exactly the things you "find it hard to believe still exist." Great, you live somewhere with symmetrical gigabit, not everyone does.

Most US residential connections are asymmetrical, often badly so. Upload just hasn't been prioritized, and so the result is that there isn't much of the available spectrum allocated to it (be it wireless, cable, DSL, etc). The connections are tuned for download speed, which is what most users broadly care about, even more so now that streaming is a thing. The only reason I have more than a few Mbit upload is because I have Starlink, which is its own unique pain in the rear at times.

Deadline now is about one month.
My machine can do better than that, so IMAO, they could cut the deadline in half one more time.


Great, your machine isn't everyone's machine, is it? There are still plenty of older machines that run intermittently (I have an older Broadwell box in addition to my Ryzen rigs, and they all run when I've got sufficient power from the sun). Set the results based on when they're needed, not based on what some random computer can do on them.

Mine is slow compared to everyone I know, I only get 7 Mbit up, but 32 (was 54, they claim the trunk line got too busy, probably they don't like what I download) Mbit down. Everyone else seems to have symmetrical connections.


Faster than anything I've had before Starlink... and still frequently faster than Starlink. They're adding subscribers a lot faster than satellites.
ID: 66774

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66775 - Posted: 4 Dec 2022, 20:00:45 UTC - in response to Message 66774.  

Most US residential connections are asymmetrical, often badly so. Upload just hasn't been prioritized, and so the result is that there isn't much of the available spectrum allocated to it (be it wireless, cable, DSL, etc). The connections are tuned for download speed, which is what most users broadly care about, even more so now that streaming is a thing. The only reason I have more than a few Mbit upload is because I have Starlink, which is its own unique pain in the rear at times.


I do not know about most US residential connections, but mine have been symmetrical for most of the last 18 years. For the first two of those years they were not: IIRC, it was 5 Megabits/second down and 2 Megabits/second up. Big improvement from 56 Kilobits on dial-up. But then they increased it to 10 down and 5 up, and then 20 both down and up. Currently it is nominally 75 up and 75 down. I think I could get 500 or more, but I would have to pay more and have no use for it.

In my neighborhood, Verizon does not even supply copper connections anymore. It is fiber-optic for everyone. If you only have a single voice line, it happens to come down a fiber-optic channel. The maintenance for copper connections was just too much for Verizon around here, so everyone is on FiOS. Not everyone knows this.
ID: 66775

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66781 - Posted: 5 Dec 2022, 10:44:32 UTC - in response to Message 66774.  

Why? Lots of people, on this very board, are telling you that they have exactly the things you "find it hard to believe still exist." Great, you live somewhere with symmetrical gigabit, not everyone does.
As I said, I don't have symmetrical, others do, not where I live but on the boards. Everywhere but this board which seems to have a lot of people with slow connections. It's unusual.

I live in Scotland, the middle of nowhere compared to most folk. In a small town of 3000 people. Yet I've had 54Mbit down/7 up for 10 years. I'm getting 1Gbit in 4 years, and most of Scotland already has it.

Most US residential connections are asymmetrical, often badly so.
Amazing, the US used to be ahead of the UK, my friend in America had an absurdly fast connection in New York in 2000 when I still had dial up at 56Kbit! I can't remember what he had but it was something like 10Mbit, the same as I was getting in my university office until I got a server in my room and persuaded them to put me on the 100Mbit backbone, which was good for multiplayer games, I mean work.

Upload just hasn't been prioritized, and so the result is that there isn't much of the available spectrum allocated to it (be it wireless, cable, DSL, etc). The connections are tuned for download speed, which is what most users broadly care about, even more so now that streaming is a thing. The only reason I have more than a few Mbit upload is because I have Starlink, which is its own unique pain in the rear at times.
If you have more than a few Mbit up, yours must be almost as fast as mine, so no problem doing loads of CPDN. I've got 126 cores, but I've never got to do loads of CPDN stuff because they're still on that horrid Linux stuff I refuse to install because it's so complicated. I use Windows because it just works.

Great, your machine isn't everyone's machine, is it? There are still plenty of older machines that run intermittently (I have an older Broadwell box in addition to my Ryzen rigs, and they all run when I've got sufficient power from the sun). Set the results based on when they're needed, not based on what some random computer can do on them.
They need them ASAP. So if some people can do them in 4 days and some in 4 weeks, they want them done in 4 days. You are not the centre of their universe, their work is. Why not leave your machine on 24/7?

Faster than anything I've had before Starlink... and still frequently faster than Starlink. They're adding subscribers a lot faster than satellites.
Is Starlink the Musk one?! I thought that was gonna be the fastest thing ever. I just looked it up for the UK: "£75/mo with a one-time hardware cost of £460." I don't think so. My fibre was £50 installation and £28 a month. Looks like it's not fully set up yet, although it says available for me, when I click order (I wanted to see the predicted speeds) it said "WE'RE NOT ABLE TO PROCESS YOUR REQUEST AT THIS TIME. PLEASE TRY AGAIN LATER." Reviews show it averages 112 down 16 up here in the UK. Are you not getting that?
ID: 66781

©2024 cpdn.org