by Tykling
23. mar 2018 09:45 UTC
I have never been a big fan of software using SYSV IPC shared memory and semaphores. I love PostgreSQL, but over the years its use of SYSV IPC has caused various issues for me. I use Zabbix as well, which makes heavy use of shared memory.
I run all my stuff in FreeBSD jails, which used to make things even more complicated, because SYSV IPC stuff wasn't namespaced, so two jails with allow.sysvipc=1 could see and modify each other's shared memory and semaphores - not ideal. In FreeBSD 11 and beyond this is no longer an issue, as I will demonstrate later in this post.
Tuning the amount of shared memory and semaphores available to the system is not trivial. Some of the limits are /boot/loader.conf tunables which can only be changed at reboot. Applications don't always document very well how much they will use. If you have more than one jail using shared memory and/or semaphores on a jailhost, the tuning becomes even more tricky. People tend to just throw larger and larger numbers in /boot/loader.conf until it works. That is not how I like to work with systems. I do. not. blindly copy/paste stuff from websites without clearly understanding what it is.
Until now, SYSV IPC has been the exception to this rule. I have tried a bunch of times to understand what it is and how it works (from a sysadmin perspective), and failed miserably each time. This means that when something does go wrong it isn't exactly clear what I am supposed to do about it. This blogpost is an attempt to demystify the thing for future reference.
Shared Memory, Semaphores and Message Queues are collectively known as SYSV IPC. Shared Memory is used when software wants to share a chunk of memory between processes. Semaphores are used for synchronization between processes. They are often used to check and manage allocation of resources such as shared memory. Message Queues will not be covered in this blogpost.
An application's shared memory usage is limited by the kernel, so finding and understanding those limits seems like a good place to start. After looking into the limits I will look at the current resource consumption and figure out how to calculate how high the limits actually need to be.
ipcs -M can be used to show the currently active limits for shared memory:

    [tsr@sorthat /usr/src]$ ipcs -M
    shminfo:
            shmmax:    536870912    (max shared memory segment size)
            shmmin:            1    (min shared memory segment size)
            shmmni:          192    (max number of shared memory identifiers)
            shmseg:          128    (max shared memory segments per process)
            shmall:      4097152    (max amount of shared memory in pages)
The above values map directly to the following sysctls:

    kern.ipc.shmall: 4097152
    kern.ipc.shmseg: 128
    kern.ipc.shmmni: 192
    kern.ipc.shmmin: 1
    kern.ipc.shmmax: 536870912
Some of these are tunable only in /boot/loader.conf, others can be set with sysctl:

    [tsr@svaneke ~]$ sudo sysctl kern.ipc.shmseg=129
    sysctl: oid 'kern.ipc.shmseg' is a read only tunable
    sysctl: Tunable values are set in /boot/loader.conf
    [tsr@svaneke ~]$ sudo sysctl kern.ipc.shmall=131072
    kern.ipc.shmall: 131073 -> 131072
    [tsr@svaneke ~]$
Nowadays the defaults for shared memory are usually more than enough for a PostgreSQL jail, because a modern PostgreSQL only uses a very small amount of shared memory (just a single segment of 48 bytes per server). Zabbix is a different story: it uses shared memory a lot, so kern.ipc.shmall will need tuning.
kern.ipc.shmall limits the total amount of shared memory, in pages. The pagesize command shows the pagesize of the running system. Combining the two allows us to see how much memory in bytes the system is allowed to use for shared memory:

    [tsr@sorthat ~]$ echo "$(sysctl -n kern.ipc.shmall) * $(pagesize)"
    4097152 * 4096
    [tsr@sorthat ~]$ !! | bc
    echo "$(sysctl -n kern.ipc.shmall) * $(pagesize)" | bc
    16781934592
    [tsr@sorthat ~]$
So, 16781934592 bytes, or almost 17GB. The default value of kern.ipc.shmall is 131072 pages (defined here). That means a default FreeBSD 11 system with 4096 byte pages has 536870912 bytes (512MB) available for shared memory. That is not enough for Zabbix, which is why the system above has kern.ipc.shmall=4097152 in /etc/sysctl.conf.
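The conversion between pages and bytes comes up every time kern.ipc.shmall is tuned, so here is a minimal sketch of two helpers for it. The function names are made up for this example (they are not FreeBSD commands), and the 4096 byte default page size is an assumption; pass the output of pagesize as the second argument to be safe.

```sh
# Hypothetical helpers, not part of FreeBSD.

# Convert a kern.ipc.shmall value (pages) to bytes.
# $1 = number of pages, $2 = page size in bytes (assumed 4096 if omitted)
shm_pages_to_bytes() {
    echo $(( $1 * ${2:-4096} ))
}

# Smallest number of pages that can hold $1 bytes (rounds up).
# $1 = number of bytes, $2 = page size in bytes (assumed 4096 if omitted)
shm_bytes_to_pages() {
    page_bytes=${2:-4096}
    echo $(( ($1 + page_bytes - 1) / page_bytes ))
}
```

On the system above, shm_pages_to_bytes "$(sysctl -n kern.ipc.shmall)" "$(pagesize)" should give the same number as the bc pipeline, and shm_bytes_to_pages turns a byte count into a value suitable for kern.ipc.shmall.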
Calculating exactly how much Zabbix will need is not very well documented. The documentation just says to increase kern.ipc.shmall to 2097152 pages and kern.ipc.shmmax to 134217728 bytes (128MB). The default for kern.ipc.shmmax is already 536870912 bytes (512MB), so there is no need for the latter; it must be old advice, I guess.
kern.ipc.shmseg limits how many shared memory segments each process can have (the default is 128), while kern.ipc.shmmni limits the number of shared memory identifiers on the whole system (192 here). kern.ipc.shmmax limits the maximum size of each segment; the default is 536870912 bytes (512MB). Even though this system has a higher kern.ipc.shmall than the default, it is still true that kern.ipc.shmseg * kern.ipc.shmmax is larger than kern.ipc.shmall converted to bytes, so it follows that the system will not be able to allocate all of them at their max size due to the limits imposed by kern.ipc.shmall:

    [tsr@sorthat ~]$ echo "$(sysctl -n kern.ipc.shmmax) * $(sysctl -n kern.ipc.shmseg)" | bc
    68719476736
    [tsr@sorthat ~]$

So, 68719476736 bytes, or about 68.7GB (64GiB), which is much higher than the total maximum of 16781934592 bytes (about 16.8GB) enforced by kern.ipc.shmall. This is not going to be an issue; it simply means that not _every_ segment can be the _maximum_ size at the _same_ time.
Now that I know the limits imposed by the kernel it is time to look at the current shared memory usage. The ipcs -ma command can show details:

    [tsr@sorthat ~]$ ipcs -ma
    Shared Memory:
    T     ID  KEY MODE        OWNER GROUP CREATOR CGROUP NATTCH     SEGSZ  CPID  LPID    ATIME    DTIME    CTIME
    m 458752    0 --rw-------   770   770     770    770    190        48 21599 21599  8:14:33 14:07:38  8:14:33
    m 458753    0 --rw-------   122   122     122    122    144  16777216 22110 22110  8:15:28 14:07:34  8:15:28
    m 458754    0 --rw-------   122   122     122    122    144   8388608 22110 22110  8:15:28 14:07:34  8:15:28
    m 458755    0 --rw-------   122   122     122    122    144  16777216 22110 22110  8:15:28 14:07:34  8:15:28
    m 458756    0 --rw-------   122   122     122    122    144 456340276 22110 22110  8:15:28 14:07:34  8:15:28
    m 786437    0 --rw-------   122   122     122    122    144  80530636 22110 22110  8:15:28 14:07:34  8:15:28
    m 786438    0 --rw-------   122   122     122    122    144     36252 22110 22110  8:15:28 14:07:34  8:15:28
    m 524295    0 --rw-------   122   122     122    122    144 536870912 22110 22110  8:15:28 14:07:34  8:15:28
    m 524296    0 --rw-------   122   122     122    122     97  16777216 22711 22711  8:15:37 no-entry  8:15:37
    m 524297    0 --rw-------   122   122     122    122     97   4194304 22711 22711  8:15:37 no-entry  8:15:37
    m 524298    0 --rw-------   122   122     122    122     97  57042535 22711 22711  8:15:37 no-entry  8:15:37
    m 524299    0 --rw-------   122   122     122    122     97  10066329 22711 22711  8:15:37 no-entry  8:15:37
    m 589836    0 --rw-------   122   122     122    122     97     24408 22711 22711  8:15:37 no-entry  8:15:37
    [tsr@sorthat ~]$
Using standard unix tools we can sum up all the values in the SEGSZ column to get the total number of bytes of shared memory currently in use on the system:

    [tsr@sorthat ~]$ echo $(ipcs -ma | cut -w -f 10 | grep -v SEGSZ | grep -v "^$" | tr "\n" "+" | sed "s/+$//") | bc
    1203825956
    [tsr@sorthat ~]$
So 1203825956 bytes or just around 1.2GB. Nowhere near the limits we have above.
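The same sum can also be done in a single awk pass, which additionally gives the segment count and a page count to compare against kern.ipc.shmall. This is only a sketch: shm_usage is a made-up name, and the page count assumes each segment is accounted for in whole pages, which I have not verified against the kernel source.

```sh
# Hypothetical helper, not part of FreeBSD.
# Reads `ipcs -ma` output on stdin and prints:
#   <segment count> <total bytes> <total pages>
# $1 = page size in bytes (assumed 4096 if omitted)
shm_usage() {
    awk -v ps="${1:-4096}" '
        # Data rows start with "m"; SEGSZ is the 10th column.
        $1 == "m" {
            n++
            bytes += $10
            # Assumption: every segment occupies whole pages, so round up.
            pages += int(($10 + ps - 1) / ps)
        }
        END { printf "%d %.0f %.0f\n", n, bytes, pages }
    '
}
```

It would be used as ipcs -ma | shm_usage "$(pagesize)". Matching on the leading "m" means header and empty lines are skipped without any grep -v.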
The number of allocated segments is also comfortably below the system-wide limit of 192 set by kern.ipc.shmmni (and below the per-process limit of 128 set by kern.ipc.shmseg):

    [tsr@sorthat ~]$ ipcs -ma | wc -l
          16
    [tsr@sorthat ~]$

(subtract 3 from the number to get the precise count without header and empty lines)
Most of the shared memory segments shown above belong to UID 122, except for the 48 byte one which belongs to UID 770. Since I am running ipcs on the jailhost (as opposed to inside a jail) the UIDs and GIDs cannot be resolved, since the local /etc/passwd does not know about them. It is easy to find out who they might be though:

    [tsr@sorthat ~]$ grep 770 /usr/jails/*/etc/passwd
    /usr/jails/postgres4.sorthat.servers.bornfiber.dk/etc/passwd:postgres:*:770:770:PostgreSQL Daemon:/var/db/postgres:/bin/sh
    [tsr@sorthat ~]$ grep 122 /usr/jails/*/etc/passwd
    /usr/jails/zabbix2.servers.bornfiber.dk/etc/passwd:zabbix:*:122:122:Zabbix NMS:/nonexistent:/bin/sh
    /usr/jails/zabbixproxy1.servers.bornfiber.dk/etc/passwd:zabbix:*:122:122:Zabbix NMS:/nonexistent:/bin/sh
    [tsr@sorthat ~]$
So UID 770 is PostgreSQL and UID 122 is Zabbix. The CPID column also contains the process ID of the process which created the shared memory segment, and the LPID column contains the pid of the process which last did an operation on the segment.
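Since the OWNER column tells the jails apart, it can be handy to break the total down per owner. A minimal sketch, assuming the ipcs -ma column layout shown above (OWNER in column 5, SEGSZ in column 10); shm_by_owner is a made-up name:

```sh
# Hypothetical helper, not part of FreeBSD.
# Reads `ipcs -ma` output on stdin and prints one line per owner:
#   <owner> <total bytes of shared memory>
shm_by_owner() {
    awk '
        $1 == "m" { bytes[$5] += $10 }
        END { for (o in bytes) printf "%s %.0f\n", o, bytes[o] }
    ' | sort    # awk iterates in no particular order, so sort for stable output
}
```

Run as ipcs -ma | shm_by_owner on the jailhost, it prints one line per UID, which makes it obvious which jail is responsible for most of the usage.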
So now I know:
Time to look at Semaphores!
ipcs -S can be used to show the current limits for Semaphores:

    [tsr@sorthat ~]$ ipcs -S
    seminfo:
            semmni:           50    (# of semaphore identifiers)
            semmns:          340    (# of semaphores in system)
            semmnu:          150    (# of undo structures in system)
            semmsl:          340    (max # of semaphores per id)
            semopm:          100    (max # of operations per semop call)
            semume:           50    (max # of undo entries per process)
            semusz:          632    (size in bytes of undo structure)
            semvmx:        32767    (semaphore maximum value)
            semaem:        16384    (adjust on exit max value)
    [tsr@sorthat ~]$
The values shown above are the defaults for FreeBSD 11. They map to the following sysctls:

    [tsr@sorthat ~]$ sysctl kern.ipc | grep sem
    kern.ipc.semaem: 16384
    kern.ipc.semvmx: 32767
    kern.ipc.semusz: 632
    kern.ipc.semume: 50
    kern.ipc.semopm: 100
    kern.ipc.semmsl: 340
    kern.ipc.semmnu: 150
    kern.ipc.semmns: 340
    kern.ipc.semmni: 50
    [tsr@sorthat ~]$
The important ones for PostgreSQL are kern.ipc.semmni (maximum number of semaphore sets) and kern.ipc.semmns (maximum number of semaphores). Note that the PostgreSQL documentation on this says to also set kern.ipc.semmnu=256 in the example for FreeBSD, but it also says 'Various other settings related to "semaphore undo", such as SEMMNU and SEMUME, do not affect PostgreSQL.', so I am not setting kern.ipc.semmnu. These are /boot/loader.conf tunables; they are read-only when using sysctl.
kern.ipc.semmni limits the maximum number of semaphore sets for the system.
Calculating this for PostgreSQL can be done using the following formula from the docs: ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16). On this server max_connections is 200, autovacuum_max_workers is at the default of 3, and max_worker_processes is at the default of 8. This means we have (200 + 3 + 8 + 5) / 16 = 13.5, which we round up to 14. The default setting of 50 on FreeBSD should be plenty as long as only PostgreSQL uses semaphores.
Zabbix appears to use 1 semaphore set per server, and does not mention semaphores in the documentation.
kern.ipc.semmns limits the maximum number of semaphores on the system.
Calculating how many semaphores PostgreSQL needs can be done using the following formula from the docs: ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16) * 17. Note that the rounding up happens before multiplying by 17. Given the values above we end up with ceil((200 + 3 + 8 + 5) / 16) * 17 = 14 * 17 = 238. Again, the default setting of 340 on FreeBSD should be plenty as long as only PostgreSQL uses semaphores.
Zabbix appears to use 14 semaphores per server, and does not mention semaphores in the documentation.
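The two formulas from the PostgreSQL docs are easy to get wrong by hand because the rounding up has to happen before the multiplication, so here is a minimal sketch in plain sh arithmetic. The function names are made up for this example; the arguments are max_connections, autovacuum_max_workers and max_worker_processes, in that order.

```sh
# Hypothetical helpers, not part of PostgreSQL or FreeBSD.

# Semaphore sets needed (compare with kern.ipc.semmni):
#   ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16)
pg_sem_sets() {
    total=$(( $1 + $2 + $3 + 5 ))
    # Integer ceiling division: add (divisor - 1) before dividing.
    echo $(( (total + 15) / 16 ))
}

# Semaphores needed (compare with kern.ipc.semmns): 17 per set.
pg_sems() {
    echo $(( $(pg_sem_sets "$1" "$2" "$3") * 17 ))
}
```

For this server, pg_sem_sets 200 3 8 gives 14 and pg_sems 200 3 8 gives 238.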
The familiar ipcs command can show the current semaphore usage:

    [tsr@sorthat ~]$ ipcs -as
    Semaphores:
    T     ID  KEY MODE        OWNER GROUP CREATOR CGROUP NSEMS    OTIME    CTIME
    s 458752    0 --rw-------   770   770     770    770    17 17:26:50  8:14:33
    s 524289    0 --rw-------   770   770     770    770    17 17:26:16  8:14:33
    s 589826    0 --rw-------   770   770     770    770    17 17:22:25  8:14:33
    s 589827    0 --rw-------   770   770     770    770    17 17:21:58  8:14:33
    s 589828    0 --rw-------   770   770     770    770    17 17:26:05  8:14:33
    s 589829    0 --rw-------   770   770     770    770    17 17:26:51  8:14:33
    s 589830    0 --rw-------   770   770     770    770    17 17:26:23  8:14:33
    s 589831    0 --rw-------   770   770     770    770    17 17:26:11  8:14:33
    s 589832    0 --rw-------   770   770     770    770    17 17:25:59  8:14:33
    s 589833    0 --rw-------   770   770     770    770    17 17:26:26  8:14:33
    s 589834    0 --rw-------   770   770     770    770    17 17:21:00  8:14:33
    s 589835    0 --rw-------   770   770     770    770    17 17:20:19  8:14:33
    s 589836    0 --rw-------   770   770     770    770    17 17:25:44  8:14:33
    s 589837    0 --rw-------   770   770     770    770    17 17:26:44  8:14:33
    s 589838    0 --rw-------   122   122     122    122    14 17:26:54  8:15:28
    s 589839    0 --rw-------   122   122     122    122    14 17:26:54  8:15:37
    [tsr@sorthat ~]$
Each line represents a semaphore set:

    [tsr@sorthat ~]$ ipcs -as | wc -l
          19
    [tsr@sorthat ~]$

(subtract 3 from the number to get the precise count without header and empty lines)
So I am currently using 16 out of the permitted 50 semaphore sets (kern.ipc.semmni).
By adding up the numbers in the NSEMS column we can see the number of semaphores currently in use:

    [tsr@sorthat ~]$ echo "$(ipcs -as | cut -w -f 9 | egrep -v "(^$|Semaphores|NSEMS)" | tr "\n" "+" | sed "s/+$//")" | bc
    266
    [tsr@sorthat ~]$

And I am currently using 266 out of the permitted 340 semaphores (kern.ipc.semmns).
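Both numbers (sets and semaphores) can be had in one go with awk, without having to subtract header lines by hand. A minimal sketch, assuming the ipcs -as column layout shown above (NSEMS in column 9); sem_usage is a made-up name:

```sh
# Hypothetical helper, not part of FreeBSD.
# Reads `ipcs -as` output on stdin and prints:
#   <semaphore sets in use> <semaphores in use>
sem_usage() {
    awk '
        # Data rows start with "s"; NSEMS is the 9th column.
        $1 == "s" { sets++; sems += $9 }
        END { printf "%d %d\n", sets, sems }
    '
}
```

Used as ipcs -as | sem_usage, the first number is what counts against kern.ipc.semmni and the second against kern.ipc.semmns.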
So now I know:
It would make sense to add these metrics to some monitoring, but that is an exercise for a future blogpost.
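Until that blogpost exists, here is a rough sketch of what such a check could look like: compare a usage number with its limit and complain above a threshold. The function name, the output format and the 80 percent default are all assumptions of mine, not something from Zabbix or FreeBSD.

```sh
# Hypothetical helper, not part of FreeBSD or Zabbix.
# $1 = name of the limit, $2 = current usage, $3 = limit (must be > 0),
# $4 = warning threshold in percent (assumed 80 if omitted)
# Prints a status line; returns 1 when usage is at or above the threshold.
ipc_check() {
    name=$1 used=$2 limit=$3 warn=${4:-80}
    pct=$(( used * 100 / limit ))
    if [ "$pct" -ge "$warn" ]; then
        echo "WARNING: $name at ${pct}% ($used of $limit)"
        return 1
    fi
    echo "OK: $name at ${pct}% ($used of $limit)"
}
```

With the numbers from above, ipc_check semmns 266 340 prints "OK: semmns at 78% (266 of 340)", which is closer to the limit than I would like.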
FreeBSD jails all share the same kernel. When something in a jail needs SYSV IPC the jail has to be given permission to use it.
Before FreeBSD 11 SYSV IPC resources were not namespaced, and you could only enable everything with allow.sysvipc=1 or enable nothing at all. The primary problem with this is that you use jails to separate services, in case one of them gets compromised. But imagine a jailhost with two separate jails A and B, which both use SYSV IPC stuff. Jail A gets owned, and is now able to read and modify the SYSV IPC resources for jail B. Clearly not ideal.
The old advice was to run the services in the jails with different UIDs, but that advice only helps as long as your intruder doesn't get root. See below for a view from inside a jail, which can also see the SYSV IPC resources from another jail on the same jailhost. This is from inside a jail with allow.sysvipc=1:

    [tsr@postgres4 ~]$ ipcs
    Message Queues:
    T     ID      KEY MODE        OWNER    GROUP

    Shared Memory:
    T     ID      KEY MODE        OWNER    GROUP
    m  65536        0 --rw------- 122      122
    m  65537        0 --rw------- 122      122
    m  65538        0 --rw------- 122      122
    m  65539        0 --rw------- 122      122
    m  65540        0 --rw------- 122      122
    m  65541        0 --rw------- 122      122
    m  65542        0 --rw------- 122      122
    m  65543        0 --rw------- 122      122
    m  65544        0 --rw------- 122      122
    m  65545        0 --rw------- 122      122
    m  65546        0 --rw------- 122      122
    m  65547        0 --rw------- 122      122
    m 131084  5432001 --rw------- postgres postgres

    Semaphores:
    T     ID      KEY MODE        OWNER    GROUP
    s  65536        0 --rw------- 302      302
    s  65537        0 --rw------- 122      122
    s  65538        0 --rw------- 122      122
    s 131075  5432001 --rw------- postgres postgres
    s 131076  5432002 --rw------- postgres postgres
    s 131077  5432003 --rw------- postgres postgres
    s 131078  5432004 --rw------- postgres postgres
    s 131079  5432005 --rw------- postgres postgres
    s 131080  5432006 --rw------- postgres postgres
    s 131081  5432007 --rw------- postgres postgres
    s 131082  5432008 --rw------- postgres postgres
    s 131083  5432009 --rw------- postgres postgres
    s 131084  5432010 --rw------- postgres postgres
    s 131085  5432011 --rw------- postgres postgres
    s 131086  5432012 --rw------- postgres postgres
    s 131087  5432013 --rw------- postgres postgres
    s 131088  5432014 --rw------- postgres postgres
    [tsr@postgres4 ~]$
The semaphores and shared memory shown with a numeric UID are the ones that do not belong to this jail. The root user in this jail is able to modify or delete these, even though they belong to another jail.
FreeBSD 11 solves this in an elegant way:
In FreeBSD 11 allow.sysvipc=1 is no longer recommended. Instead, three new permissions have been introduced:
- sysvshm: Controls access to shared memory
- sysvsem: Controls access to SYSV semaphores
- sysvmsg: Controls access to SYSV message queues
Each of these can have three values:
- disable: Disables access to this type of resource (default)
- inherit: Makes the jail inherit the global SYSV namespace (the old behaviour, same as allow.sysvipc=1)
- new: Creates a new separate SYSV namespace for this jail. This is what you want.
So for the example above, a PostgreSQL jail which needs shared memory and semaphores, I add sysvshm=new and sysvsem=new instead of allow.sysvipc=1 in FreeBSD 11 and beyond. Seen from the jail it looks the same, except no entries from other jails are visible:

    [tsr@postgres4 ~]$ ipcs
    Message Queues:
    T     ID      KEY MODE        OWNER    GROUP

    Shared Memory:
    T     ID      KEY MODE        OWNER    GROUP
    m 131084  5432001 --rw------- postgres postgres

    Semaphores:
    T     ID      KEY MODE        OWNER    GROUP
    s 131075  5432001 --rw------- postgres postgres
    s 131076  5432002 --rw------- postgres postgres
    s 131077  5432003 --rw------- postgres postgres
    s 131078  5432004 --rw------- postgres postgres
    s 131079  5432005 --rw------- postgres postgres
    s 131080  5432006 --rw------- postgres postgres
    s 131081  5432007 --rw------- postgres postgres
    s 131082  5432008 --rw------- postgres postgres
    s 131083  5432009 --rw------- postgres postgres
    s 131084  5432010 --rw------- postgres postgres
    s 131085  5432011 --rw------- postgres postgres
    s 131086  5432012 --rw------- postgres postgres
    s 131087  5432013 --rw------- postgres postgres
    s 131088  5432014 --rw------- postgres postgres
    [tsr@postgres4 ~]$
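For reference, this is roughly what the relevant part of a jail.conf entry could look like, wrapped in a small sh function that prints it so it can be pasted or redirected. Treat it as a sketch: the function name is made up, the jail name is whatever you pass in, and a real entry needs path, addresses and so on, which are left out here.

```sh
# Hypothetical helper, not part of FreeBSD.
# Prints a jail.conf fragment giving jail $1 its own SYSV shared memory
# and semaphore namespaces. Message queues are left at the default
# (disabled) because nothing in this example needs them.
jail_sysv_stanza() {
    cat <<EOF
$1 {
    sysvshm = new;
    sysvsem = new;
}
EOF
}
```

For example, jail_sysv_stanza postgres4 prints a stanza for a jail named postgres4. A jail that really does need to share the host namespace would use inherit instead of new.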
This is very, very nice (and about time). Going back to my early beginnings with FreeBSD jails I have been wondering when this would get fixed properly. Yay!
Today (March 2018) I was called in on an issue where a PostgreSQL
server was unable to start after a crash because of what I was told was suspected diskspace issues. I was greeted with the familiar message:

    [tsr@postgres4 /usr/home/tsr]$ sudo service postgresql start
    Password:
    pg_ctl: another server might be running; trying to start server anyway
    FATAL:  could not create semaphores: No space left on device
    DETAIL:  Failed system call was semget(5432005, 17, 03600).
    HINT:  This error does *not* mean that you have run out of disk space. It occurs when either the system limit for the maximum number of semaphore sets (SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be exceeded. You need to raise the respective kernel parameter. Alternatively, reduce PostgreSQL's consumption of semaphores by reducing its max_connections parameter.
            The PostgreSQL documentation contains more information about configuring your system for PostgreSQL.
    LOG:  database system is shut down
    pg_ctl: could not start server
    Examine the log output.
    [tsr@postgres4 /usr/home/tsr]$
This has nothing to do with diskspace of course. As the message says, it has to do with semaphore limits. So I checked the current status of SYSV IPC resource usage with ipcs -a:

    [tsr@sorthat ~]$ ipcs -a
    Message Queues:
    T     ID  KEY MODE        OWNER GROUP CREATOR CGROUP CBYTES QNUM QBYTES LSPID LRPID STIME RTIME CTIME

    Shared Memory:
    T     ID  KEY MODE        OWNER GROUP CREATOR CGROUP NATTCH     SEGSZ  CPID  LPID    ATIME    DTIME    CTIME
    m 458752    0 --rw-------   770   770     770    770    190        48 21599 21599  8:14:33 13:47:36  8:14:33
    m 458753    0 --rw-------   122   122     122    122    144  16777216 22110 22110  8:15:28 13:46:59  8:15:28
    m 458754    0 --rw-------   122   122     122    122    144   8388608 22110 22110  8:15:28 13:46:59  8:15:28
    m 458755    0 --rw-------   122   122     122    122    144  16777216 22110 22110  8:15:28 13:46:59  8:15:28
    m 458756    0 --rw-------   122   122     122    122    144 456340276 22110 22110  8:15:28 13:46:59  8:15:28
    m 786437    0 --rw-------   122   122     122    122    144  80530636 22110 22110  8:15:28 13:46:59  8:15:28
    m 786438    0 --rw-------   122   122     122    122    144     36252 22110 22110  8:15:28 13:46:59  8:15:28
    m 524295    0 --rw-------   122   122     122    122    144 536870912 22110 22110  8:15:28 13:46:59  8:15:28
    m 524296    0 --rw-------   122   122     122    122     97  16777216 22711 22711  8:15:37 no-entry  8:15:37
    m 524297    0 --rw-------   122   122     122    122     97   4194304 22711 22711  8:15:37 no-entry  8:15:37
    m 524298    0 --rw-------   122   122     122    122     97  57042535 22711 22711  8:15:37 no-entry  8:15:37
    m 524299    0 --rw-------   122   122     122    122     97  10066329 22711 22711  8:15:37 no-entry  8:15:37
    m 589836    0 --rw-------   122   122     122    122     97     24408 22711 22711  8:15:37 no-entry  8:15:37

    Semaphores:
    T     ID  KEY MODE        OWNER GROUP CREATOR CGROUP NSEMS    OTIME    CTIME
    s 458752    0 --rw-------   770   770     770    770    17 13:47:25  8:14:33
    s 524289    0 --rw-------   770   770     770    770    17 13:39:02  8:14:33
    s 589826    0 --rw-------   770   770     770    770    17 13:37:52  8:14:33
    s 589827    0 --rw-------   770   770     770    770    17 13:45:18  8:14:33
    s 589828    0 --rw-------   770   770     770    770    17 13:47:01  8:14:33
    s 589829    0 --rw-------   770   770     770    770    17 13:47:42  8:14:33
    s 589830    0 --rw-------   770   770     770    770    17 13:46:47  8:14:33
    s 589831    0 --rw-------   770   770     770    770    17 13:47:11  8:14:33
    s 589832    0 --rw-------   770   770     770    770    17 13:46:51  8:14:33
    s 589833    0 --rw-------   770   770     770    770    17 13:45:57  8:14:33
    s 589834    0 --rw-------   770   770     770    770    17 13:46:33  8:14:33
    s 589835    0 --rw-------   770   770     770    770    17 13:34:15  8:14:33
    s 589836    0 --rw-------   770   770     770    770    17 13:47:33  8:14:33
    s 589837    0 --rw-------   770   770     770    770    17 13:47:33  8:14:33
    s 589838    0 --rw-------   122   122     122    122    14 13:47:44  8:15:28
    s 589839    0 --rw-------   122   122     122    122    14 13:47:44  8:15:37
    [tsr@sorthat ~]$
Obviously PostgreSQL wasn't running (since it refused to start), and I had already shut down the Zabbix jails earlier in a frenzy to try to get PostgreSQL to start up. So nothing should have been using SYSV IPC resources at all. Yet there they were, plain as day. Somehow they had not been cleaned up properly and the lingering semaphores were now preventing PostgreSQL from starting.
Since no running jails were using Shared Memory or Semaphores I could use ipcrm -W to clean up everything:

    [tsr@sorthat ~]$ sudo ipcrm -W
    [tsr@sorthat ~]$ ipcs -t
    Message Queues:
    T     ID  KEY MODE        OWNER GROUP STIME RTIME CTIME

    Shared Memory:
    T     ID  KEY MODE        OWNER GROUP ATIME DTIME CTIME

    Semaphores:
    T     ID  KEY MODE        OWNER GROUP OTIME CTIME

    [tsr@sorthat ~]$
This command should be used with care; be very sure you know what you are doing. It should only be used if you are certain nothing else is running which needs the shared memory or semaphores. ipcrm also has switches to delete individual semaphore sets or shared memory segments in cases where that is needed.
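A more careful alternative to wiping everything is to only remove the resources belonging to one owner, using ipcrm -m for shared memory segments and ipcrm -s for semaphore sets. The sketch below only prints the commands instead of running them, so they can be reviewed first. The function name is made up, and it assumes the ipcs column layout shown in this post (ID in column 2, OWNER in column 5).

```sh
# Hypothetical helper, not part of FreeBSD.
# Reads `ipcs -a` output on stdin and prints (does NOT run) one ipcrm
# command per shared memory segment and semaphore set owned by $1.
# $1 = owner exactly as ipcs shows it (numeric UID on the jailhost).
# Message queues are ignored.
ipcrm_commands_for_owner() {
    awk -v owner="$1" '
        $1 == "m" && $5 == owner { print "ipcrm -m " $2 }
        $1 == "s" && $5 == owner { print "ipcrm -s " $2 }
    '
}
```

For example, ipcs -a | ipcrm_commands_for_owner 770 lists the commands for the lingering PostgreSQL resources. Once the list looks right it can be piped to sudo sh. The same warning applies: only do this when the owning service is definitely not running.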
After cleaning up the old semaphores PostgreSQL started up without any problems. After that I started the Zabbix jails again, and then I started writing this blogpost so I never have to go through this again.
I have no idea why PostgreSQL crashed in the first place. I also have no idea why it was unable to clean up the lingering semaphores after the crash. But at least I know how to find and remove any lingering semaphores in case it happens again. I will also increase the semaphore limit kern.ipc.semmns to a large enough value that it can handle at least twice what PostgreSQL needs, so if this happens again it should still be able to start.
I kind of feel like the FreeBSD rc.d init script for PostgreSQL should run ipcrm to clean up any lingering stuff before starting it, but people on #postgres on Freenode seemed to disagree.
PostgreSQL 10 uses Posix Semaphores instead of SYSV IPC semaphores, which will make the problem with semaphores for PostgreSQL go away entirely.