by Tykling
23. mar 2018 09:45 UTC
I have never been a big fan of software using SYSV IPC shared memory and semaphores. I love PostgreSQL but over the years its use of SYSV IPC has caused various issues for me. I use Zabbix as well, which makes heavy use of shared memory.
I run all my stuff in FreeBSD jails, which used to make things even more complicated because SYSV IPC wasn't namespaced: two jails with allow.sysvipc=1 could see and modify each other's shared memory and semaphores. Not ideal. In FreeBSD 11 and beyond this is no longer an issue, as I will demonstrate later in this post.
Tuning the amount of shared memory and semaphores available to the system is not trivial. Some of the limits are /boot/loader.conf tunables which can only be changed at reboot. Applications don't always document very well how much they will use. If you have more than one jail using shared memory and/or semaphores on a jailhost, the tuning becomes even trickier. People tend to just throw larger and larger numbers in /boot/loader.conf until it works. That is not how I like to work with systems. I do. not. blindly copy/paste stuff from websites without clearly understanding what it is.
Until now, SYSV IPC has been the exception to this rule. I have tried a bunch of times to understand what it is and how it works (from a sysadmin perspective), and failed miserably each time. This means that when something does go wrong it isn't exactly clear what I am supposed to do about it. This blogpost is an attempt to demystify the thing for future reference.
Shared Memory, Semaphores and Message Queues are collectively known as SYSV IPC. Shared Memory is used when software wants to share a chunk of memory between processes. Semaphores are counters that processes use to coordinate with each other; they are often used to manage access to shared resources such as shared memory. Message Queues will not be covered in this blogpost.
An application's shared memory usage is limited by the kernel, so finding and understanding those limits seems like a good place to start. After looking into the limits I will look at the current resource consumption and figure out how to calculate how high the limits actually need to be.
ipcs -M can be used to show the currently active limits for shared memory:
[tsr@sorthat /usr/src]$ ipcs -M
shminfo:
shmmax: 536870912 (max shared memory segment size)
shmmin: 1 (min shared memory segment size)
shmmni: 192 (max number of shared memory identifiers)
shmseg: 128 (max shared memory segments per process)
shmall: 4097152 (max amount of shared memory in pages)
The above values map directly to the following sysctls:
kern.ipc.shmall: 4097152
kern.ipc.shmseg: 128
kern.ipc.shmmni: 192
kern.ipc.shmmin: 1
kern.ipc.shmmax: 536870912
Some of these are tunable only in /boot/loader.conf, others can be set with sysctl:
[tsr@svaneke ~]$ sudo sysctl kern.ipc.shmseg=129
sysctl: oid 'kern.ipc.shmseg' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf
[tsr@svaneke ~]$ sudo sysctl kern.ipc.shmall=131072
kern.ipc.shmall: 131073 -> 131072
[tsr@svaneke ~]$
Nowadays the defaults for shared memory are usually more than enough for a PostgreSQL jail, because a modern PostgreSQL only uses a very small amount of SYSV shared memory (just a single segment of 48 bytes per server). Zabbix is a different story: it uses shared memory a lot, so kern.ipc.shmall will need tuning.
kern.ipc.shmall limits the total amount of shared memory, in pages. The pagesize command shows the pagesize of the running system. Combining the two allows us to see how much memory in bytes the system is allowed to use for shared memory:
[tsr@sorthat ~]$ echo "$(sysctl -n kern.ipc.shmall) * $(pagesize)"
4097152 * 4096
[tsr@sorthat ~]$ !! | bc
echo "$(sysctl -n kern.ipc.shmall) * $(pagesize)" | bc
16781934592
[tsr@sorthat ~]$
So, 16781934592 bytes, or almost 17 GB. The default value of kern.ipc.shmall is 131072 pages. That means a default FreeBSD 11 system has 536870912 bytes (512 MiB) available for shared memory. That is not enough for Zabbix, which is why the system above has kern.ipc.shmall=4097152 in /etc/sysctl.conf.
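The page-to-size conversion above can be wrapped in a small helper. This is only a sketch: the function name pages_to_mib is made up for illustration, and it takes the page count and page size as arguments so it runs anywhere.

```sh
#!/bin/sh
# Convert a page count to whole MiB (rounded down).
# Arguments: number of pages, page size in bytes.
pages_to_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# Values taken from the text above (4096-byte pages):
pages_to_mib 131072 4096     # FreeBSD 11 default kern.ipc.shmall, prints 512
pages_to_mib 4097152 4096    # tuned value on this jailhost, prints 16004

# On a live FreeBSD system the inputs would come from:
#   pages_to_mib "$(sysctl -n kern.ipc.shmall)" "$(pagesize)"
```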
Exactly how much Zabbix will need is not very well documented. The documentation just says to increase kern.ipc.shmall to 2097152 pages and kern.ipc.shmmax to 134217728 bytes (128 MiB). The default for kern.ipc.shmmax is already 536870912 bytes (512 MiB), so there is no need for the latter; it must be old advice, I guess.
kern.ipc.shmseg limits how many shared memory segments a single process can have (the default is 128), while kern.ipc.shmmni limits the number of segments on the whole system (192 here). kern.ipc.shmmax limits the maximum size of each segment; the default is 536870912 bytes (512 MiB). Even though this system has a higher kern.ipc.shmall than the default, kern.ipc.shmseg * kern.ipc.shmmax is still larger than kern.ipc.shmall expressed in bytes, so it follows that the system will not be able to allocate that many segments at their max size due to the limit imposed by kern.ipc.shmall:
[tsr@sorthat ~]$ echo "$(sysctl -n kern.ipc.shmmax) * $(sysctl -n kern.ipc.shmseg)" | bc
68719476736
[tsr@sorthat ~]$
So, 68719476736 bytes, or almost 69 GB, which is much higher than the total of 16781934592 bytes (almost 17 GB) permitted by kern.ipc.shmall. This is not going to be an issue; it simply means that _every_ segment cannot be the _maximum_ size at the _same_ time.
Now that I know the limits imposed by the kernel it is time to look at the current shared memory usage. The ipcs -ma command can show details:
[tsr@sorthat ~]$ ipcs -ma
Shared Memory:
T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME
m 458752 0 --rw------- 770 770 770 770 190 48 21599 21599 8:14:33 14:07:38 8:14:33
m 458753 0 --rw------- 122 122 122 122 144 16777216 22110 22110 8:15:28 14:07:34 8:15:28
m 458754 0 --rw------- 122 122 122 122 144 8388608 22110 22110 8:15:28 14:07:34 8:15:28
m 458755 0 --rw------- 122 122 122 122 144 16777216 22110 22110 8:15:28 14:07:34 8:15:28
m 458756 0 --rw------- 122 122 122 122 144 456340276 22110 22110 8:15:28 14:07:34 8:15:28
m 786437 0 --rw------- 122 122 122 122 144 80530636 22110 22110 8:15:28 14:07:34 8:15:28
m 786438 0 --rw------- 122 122 122 122 144 36252 22110 22110 8:15:28 14:07:34 8:15:28
m 524295 0 --rw------- 122 122 122 122 144 536870912 22110 22110 8:15:28 14:07:34 8:15:28
m 524296 0 --rw------- 122 122 122 122 97 16777216 22711 22711 8:15:37 no-entry 8:15:37
m 524297 0 --rw------- 122 122 122 122 97 4194304 22711 22711 8:15:37 no-entry 8:15:37
m 524298 0 --rw------- 122 122 122 122 97 57042535 22711 22711 8:15:37 no-entry 8:15:37
m 524299 0 --rw------- 122 122 122 122 97 10066329 22711 22711 8:15:37 no-entry 8:15:37
m 589836 0 --rw------- 122 122 122 122 97 24408 22711 22711 8:15:37 no-entry 8:15:37
[tsr@sorthat ~]$
Using standard unix tools we can sum up all the values in the SEGSZ column to get the total number of bytes of shared memory currently in use on the system:
[tsr@sorthat ~]$ echo $(ipcs -ma | cut -w -f 10 | grep -v SEGSZ | grep -v "^$" | tr "\n" "+" | sed "s/+$//") | bc
1203825956
[tsr@sorthat ~]$
So 1203825956 bytes or just around 1.2GB. Nowhere near the limits we have above.
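As a variation on the pipeline above, the same listing can be broken down per owner with awk. This is a rough sketch: the function name shm_by_owner is hypothetical, and it assumes the `ipcs -ma` column layout shown above (OWNER in field 5, SEGSZ in field 10). It reads the listing from stdin so it can be tried on pasted output.

```sh
#!/bin/sh
# Sum SEGSZ (field 10) per OWNER (field 5) for shared memory rows,
# which start with "m" in `ipcs -ma` output read from stdin.
# Output columns: owner, number of segments, total bytes.
shm_by_owner() {
    awk '$1 == "m" { segs[$5]++; bytes[$5] += $10 }
         END { for (o in segs) printf "%s %d %.0f\n", o, segs[o], bytes[o] }' |
        sort
}

# Three rows copied from the listing above as sample input.
shm_by_owner <<'EOF'
Shared Memory:
T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME
m 458752 0 --rw------- 770 770 770 770 190 48 21599 21599 8:14:33 14:07:38 8:14:33
m 458753 0 --rw------- 122 122 122 122 144 16777216 22110 22110 8:15:28 14:07:34 8:15:28
m 458754 0 --rw------- 122 122 122 122 144 8388608 22110 22110 8:15:28 14:07:34 8:15:28
EOF
```

On the jailhost itself this would be run as `ipcs -ma | shm_by_owner`.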
The number of allocated segments is also comfortably below the system-wide limit of 192 set by kern.ipc.shmmni (kern.ipc.shmseg, at 128, is the limit per process):
[tsr@sorthat ~]$ ipcs -ma | wc -l
16
[tsr@sorthat ~]$
(subtract 3 from the number to get the precise count without header and empty lines)
Most of the shared memory segments shown above belong to UID 122, except for the 48 byte one which belongs to UID 770. Since I am running ipcs on the jailhost (as opposed to inside a jail) the UIDs and GIDs cannot be resolved, since the local /etc/passwd does not know about them. It is easy to find out what they might be though:
[tsr@sorthat ~]$ grep 770 /usr/jails/*/etc/passwd
/usr/jails/postgres4.sorthat.servers.bornfiber.dk/etc/passwd:postgres:*:770:770:PostgreSQL Daemon:/var/db/postgres:/bin/sh
[tsr@sorthat ~]$ grep 122 /usr/jails/*/etc/passwd
/usr/jails/zabbix2.servers.bornfiber.dk/etc/passwd:zabbix:*:122:122:Zabbix NMS:/nonexistent:/bin/sh
/usr/jails/zabbixproxy1.servers.bornfiber.dk/etc/passwd:zabbix:*:122:122:Zabbix NMS:/nonexistent:/bin/sh
[tsr@sorthat ~]$
So UID 770 is PostgreSQL and UID 122 is Zabbix. The CPID column also contains the process ID of the process which created the shared memory segment, and the LPID column contains the pid of the process which last did an operation on the segment.
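So now I know where to find the shared memory limits, how much shared memory is currently in use, and which jails are using it.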
Time to look at Semaphores!
ipcs -S can be used to show the current limits for Semaphores:
[tsr@sorthat ~]$ ipcs -S
seminfo:
semmni: 50 (# of semaphore identifiers)
semmns: 340 (# of semaphores in system)
semmnu: 150 (# of undo structures in system)
semmsl: 340 (max # of semaphores per id)
semopm: 100 (max # of operations per semop call)
semume: 50 (max # of undo entries per process)
semusz: 632 (size in bytes of undo structure)
semvmx: 32767 (semaphore maximum value)
semaem: 16384 (adjust on exit max value)
[tsr@sorthat ~]$
The values shown above are the defaults for FreeBSD 11. They map to the following sysctls:
[tsr@sorthat ~]$ sysctl kern.ipc | grep sem
kern.ipc.semaem: 16384
kern.ipc.semvmx: 32767
kern.ipc.semusz: 632
kern.ipc.semume: 50
kern.ipc.semopm: 100
kern.ipc.semmsl: 340
kern.ipc.semmnu: 150
kern.ipc.semmns: 340
kern.ipc.semmni: 50
[tsr@sorthat ~]$
The important ones for PostgreSQL are kern.ipc.semmni (maximum number of semaphore sets) and kern.ipc.semmns (maximum number of semaphores). Note that the PostgreSQL documentation on this says to also set kern.ipc.semmnu=256 in the example for FreeBSD, but it also says that 'Various other settings related to "semaphore undo", such as SEMMNU and SEMUME, do not affect PostgreSQL', so I am not setting kern.ipc.semmnu. These are /boot/loader.conf tunables; they are read-only when using sysctl.
kern.ipc.semmni limits the maximum number of semaphore sets for the system.
Calculating this for PostgreSQL can be done using the following formula from the docs: ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16). On this server max_connections is 200, autovacuum_max_workers is at the default of 3, and max_worker_processes is at the default of 8. This means we have (200 + 3 + 8 + 5) / 16 = 13.5 which we round up to 14. The default setting of 50 on FreeBSD should be plenty as long as only PostgreSQL uses semaphores.
Zabbix appears to use 1 semaphore set per server, and does not mention semaphores in the documentation.
kern.ipc.semmns limits the maximum number of semaphores on the system.
Calculating how many semaphores PostgreSQL needs can be done using the following formula from the docs: ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16) * 17. Note that the rounding up happens before multiplying by 17: given the values above we end up with ceil((200 + 3 + 8 + 5) / 16) * 17 = 14 * 17 = 238. Again, the default setting of 340 on FreeBSD should be plenty as long as only PostgreSQL uses semaphores.
Zabbix appears to use 14 semaphores per server, and does not mention semaphores in the documentation.
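The two PostgreSQL formulas quoted above can be sketched in plain shell arithmetic. The function names pg_semmni and pg_semmns are made up for this example; the ceiling is done with the usual integer trick of adding 15 before dividing by 16, and it is applied before multiplying by 17.

```sh
#!/bin/sh
# Semaphore sets PostgreSQL needs: ceil(total / 16), in integer math.
# Arguments: max_connections autovacuum_max_workers max_worker_processes
pg_semmni() {
    total=$(( $1 + $2 + $3 + 5 ))
    echo $(( (total + 15) / 16 ))
}

# Semaphores PostgreSQL needs: 17 per set.
pg_semmns() {
    echo $(( $(pg_semmni "$1" "$2" "$3") * 17 ))
}

# Values from this server: 200 connections, 3 autovacuum workers, 8 workers.
pg_semmni 200 3 8    # prints 14
pg_semmns 200 3 8    # prints 238
```

The result of 238 matches the 14 sets of 17 semaphores that ipcs reports for the PostgreSQL UID on this host.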
The familiar ipcs command can show the current semaphore usage:
[tsr@sorthat ~]$ ipcs -as
Semaphores:
T ID KEY MODE OWNER GROUP CREATOR CGROUP NSEMS OTIME CTIME
s 458752 0 --rw------- 770 770 770 770 17 17:26:50 8:14:33
s 524289 0 --rw------- 770 770 770 770 17 17:26:16 8:14:33
s 589826 0 --rw------- 770 770 770 770 17 17:22:25 8:14:33
s 589827 0 --rw------- 770 770 770 770 17 17:21:58 8:14:33
s 589828 0 --rw------- 770 770 770 770 17 17:26:05 8:14:33
s 589829 0 --rw------- 770 770 770 770 17 17:26:51 8:14:33
s 589830 0 --rw------- 770 770 770 770 17 17:26:23 8:14:33
s 589831 0 --rw------- 770 770 770 770 17 17:26:11 8:14:33
s 589832 0 --rw------- 770 770 770 770 17 17:25:59 8:14:33
s 589833 0 --rw------- 770 770 770 770 17 17:26:26 8:14:33
s 589834 0 --rw------- 770 770 770 770 17 17:21:00 8:14:33
s 589835 0 --rw------- 770 770 770 770 17 17:20:19 8:14:33
s 589836 0 --rw------- 770 770 770 770 17 17:25:44 8:14:33
s 589837 0 --rw------- 770 770 770 770 17 17:26:44 8:14:33
s 589838 0 --rw------- 122 122 122 122 14 17:26:54 8:15:28
s 589839 0 --rw------- 122 122 122 122 14 17:26:54 8:15:37
[tsr@sorthat ~]$
Each line represents a semaphore set:
[tsr@sorthat ~]$ ipcs -as | wc -l
19
[tsr@sorthat ~]$
(subtract 3 from the number to get the precise count without header and empty lines)
So I am currently using 16 out of the permitted 50 semaphore sets (kern.ipc.semmni).
By adding up the numbers in the NSEMS column we can see the number of semaphores currently in use:
[tsr@sorthat ~]$ echo "$(ipcs -as | cut -w -f 9 | egrep -v "(^$|Semaphores|NSEMS)" | tr "\n" "+" | sed "s/+$//")" | bc
266
[tsr@sorthat ~]$
And I am currently using 266 out of the permitted 340 semaphores (kern.ipc.semmns).
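Both numbers can be produced in one pass and printed next to their limits, which might be a starting point for monitoring. A minimal sketch: the function name sem_usage is hypothetical, the limits are passed in as arguments, and it assumes the `ipcs -as` layout shown above (NSEMS in field 9).

```sh
#!/bin/sh
# Read `ipcs -as` output on stdin; print semaphore sets and semaphores
# in use against the limits given as arguments (semmni, semmns).
sem_usage() {
    awk -v semmni="$1" -v semmns="$2" '
        $1 == "s" { sets++; sems += $9 }   # NSEMS is field 9
        END {
            printf "sets: %d/%d\n", sets, semmni
            printf "semaphores: %d/%d\n", sems, semmns
        }'
}

# Two rows copied from the listing above as sample input.
sem_usage 50 340 <<'EOF'
Semaphores:
T ID KEY MODE OWNER GROUP CREATOR CGROUP NSEMS OTIME CTIME
s 458752 0 --rw------- 770 770 770 770 17 17:26:50 8:14:33
s 589838 0 --rw------- 122 122 122 122 14 17:26:54 8:15:28
EOF
```

On the jailhost the call would look like `ipcs -as | sem_usage "$(sysctl -n kern.ipc.semmni)" "$(sysctl -n kern.ipc.semmns)"`.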
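So now I know where to find the semaphore limits, how to calculate what PostgreSQL needs, and how many semaphore sets and semaphores are currently in use.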
It would make sense to add these metrics to some monitoring, but that is an exercise for a future blogpost.
FreeBSD jails all share the same kernel. When something in a jail needs SYSV IPC, the jail has to be given permission to use it.
Before FreeBSD 11 SYSV IPC resources were not namespaced, and you could only enable everything with allow.sysvipc=1 or enable nothing at all. The primary problem with this is that you use jails to separate services, in case one of them gets compromised. But imagine a jailhost with two separate jails A and B, which both use SYSV IPC. Jail A gets owned, and is now able to read and modify the SYSV IPC resources for jail B. Clearly not ideal.
The old advice was to run the services in the jails with different UIDs, but that advice only helps as long as your intruder doesn't get root. See below for a view from inside a jail, which can also see the SYSV IPC resources from another jail on the same jailhost. This is from inside a jail with allow.sysvipc=1:
[tsr@postgres4 ~]$ ipcs
Message Queues:
T ID KEY MODE OWNER GROUP
Shared Memory:
T ID KEY MODE OWNER GROUP
m 65536 0 --rw------- 122 122
m 65537 0 --rw------- 122 122
m 65538 0 --rw------- 122 122
m 65539 0 --rw------- 122 122
m 65540 0 --rw------- 122 122
m 65541 0 --rw------- 122 122
m 65542 0 --rw------- 122 122
m 65543 0 --rw------- 122 122
m 65544 0 --rw------- 122 122
m 65545 0 --rw------- 122 122
m 65546 0 --rw------- 122 122
m 65547 0 --rw------- 122 122
m 131084 5432001 --rw------- postgres postgres
Semaphores:
T ID KEY MODE OWNER GROUP
s 65536 0 --rw------- 302 302
s 65537 0 --rw------- 122 122
s 65538 0 --rw------- 122 122
s 131075 5432001 --rw------- postgres postgres
s 131076 5432002 --rw------- postgres postgres
s 131077 5432003 --rw------- postgres postgres
s 131078 5432004 --rw------- postgres postgres
s 131079 5432005 --rw------- postgres postgres
s 131080 5432006 --rw------- postgres postgres
s 131081 5432007 --rw------- postgres postgres
s 131082 5432008 --rw------- postgres postgres
s 131083 5432009 --rw------- postgres postgres
s 131084 5432010 --rw------- postgres postgres
s 131085 5432011 --rw------- postgres postgres
s 131086 5432012 --rw------- postgres postgres
s 131087 5432013 --rw------- postgres postgres
s 131088 5432014 --rw------- postgres postgres
[tsr@postgres4 ~]$
The semaphores and shared memory shown with a numeric UID are the ones that do not belong to this jail. The root user in this jail is able to modify or delete these, even though they belong to another jail.
FreeBSD 11 solves this in an elegant way:
In FreeBSD 11 allow.sysvipc=1 is no longer recommended; instead, three new permissions have been introduced:
sysvshm: Controls access to shared memory
sysvsem: Controls access to SYSV semaphores
sysvmsg: Controls access to SYSV message queues
Each of these can have three values:
disable: Disables access to this type of resource (default)
inherit: Makes the jail inherit the global SYSV namespace (the old behaviour, same as allow.sysvipc=1)
new: Creates a new separate SYSV namespace for this jail. This is what you want.
So for the example above, a PostgreSQL jail which needs shared memory and semaphores, I add sysvshm=new and sysvsem=new instead of allow.sysvipc=1 in FreeBSD 11 and beyond. Seen from the jail it looks the same, except no entries from other jails are visible:
[tsr@postgres4 ~]$ ipcs
Message Queues:
T ID KEY MODE OWNER GROUP
Shared Memory:
T ID KEY MODE OWNER GROUP
m 131084 5432001 --rw------- postgres postgres
Semaphores:
T ID KEY MODE OWNER GROUP
s 131075 5432001 --rw------- postgres postgres
s 131076 5432002 --rw------- postgres postgres
s 131077 5432003 --rw------- postgres postgres
s 131078 5432004 --rw------- postgres postgres
s 131079 5432005 --rw------- postgres postgres
s 131080 5432006 --rw------- postgres postgres
s 131081 5432007 --rw------- postgres postgres
s 131082 5432008 --rw------- postgres postgres
s 131083 5432009 --rw------- postgres postgres
s 131084 5432010 --rw------- postgres postgres
s 131085 5432011 --rw------- postgres postgres
s 131086 5432012 --rw------- postgres postgres
s 131087 5432013 --rw------- postgres postgres
s 131088 5432014 --rw------- postgres postgres
[tsr@postgres4 ~]$
This is very, very nice (and about time). Going back to my early beginnings with FreeBSD jails I have been wondering when this would get fixed properly. Yay!
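For reference, here is roughly what the sysvshm=new and sysvsem=new settings could look like if the jail is defined in jail.conf syntax. This is an assumption: the post does not say how these jails are configured, and the helper name sysv_jail_stanza is made up. The function only prints the stanza so it can be reviewed before it goes anywhere near a config file.

```sh
#!/bin/sh
# Print a jail.conf-style stanza giving a jail its own SYSV shared memory
# and semaphore namespaces. The jail name is whatever you pass in.
sysv_jail_stanza() {
    cat <<EOF
$1 {
    sysvshm = new;
    sysvsem = new;
}
EOF
}

# Jail name taken from the prompt in the listings above.
sysv_jail_stanza postgres4
```

Message queues are left out on purpose, since sysvmsg defaults to disable and PostgreSQL does not need them here.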
Today (March 2018) I was called in on an issue where a PostgreSQL server was unable to start after a crash because of what I was told was suspected diskspace issues. I was greeted with the familiar message:
[tsr@postgres4 /usr/home/tsr]$ sudo service postgresql start
Password:
pg_ctl: another server might be running; trying to start server anyway
FATAL: could not create semaphores: No space left on device
DETAIL: Failed system call was semget(5432005, 17, 03600).
HINT: This error does *not* mean that you have run out of disk space. It occurs when either the system limit for the maximum number of semaphore sets (SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be exceeded. You need to raise the respective kernel parameter. Alternatively, reduce PostgreSQL's consumption of semaphores by reducing its max_connections parameter.
The PostgreSQL documentation contains more information about configuring your system for PostgreSQL.
LOG: database system is shut down
pg_ctl: could not start server
Examine the log output.
[tsr@postgres4 /usr/home/tsr]$
This has nothing to do with disk space of course; as the message says, it has to do with semaphore limits. So I checked the current status of SYSV IPC resource usage with ipcs -a:
[tsr@sorthat ~]$ ipcs -a
Message Queues:
T ID KEY MODE OWNER GROUP CREATOR CGROUP CBYTES QNUM QBYTES LSPID LRPID STIME RTIME CTIME
Shared Memory:
T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME
m 458752 0 --rw------- 770 770 770 770 190 48 21599 21599 8:14:33 13:47:36 8:14:33
m 458753 0 --rw------- 122 122 122 122 144 16777216 22110 22110 8:15:28 13:46:59 8:15:28
m 458754 0 --rw------- 122 122 122 122 144 8388608 22110 22110 8:15:28 13:46:59 8:15:28
m 458755 0 --rw------- 122 122 122 122 144 16777216 22110 22110 8:15:28 13:46:59 8:15:28
m 458756 0 --rw------- 122 122 122 122 144 456340276 22110 22110 8:15:28 13:46:59 8:15:28
m 786437 0 --rw------- 122 122 122 122 144 80530636 22110 22110 8:15:28 13:46:59 8:15:28
m 786438 0 --rw------- 122 122 122 122 144 36252 22110 22110 8:15:28 13:46:59 8:15:28
m 524295 0 --rw------- 122 122 122 122 144 536870912 22110 22110 8:15:28 13:46:59 8:15:28
m 524296 0 --rw------- 122 122 122 122 97 16777216 22711 22711 8:15:37 no-entry 8:15:37
m 524297 0 --rw------- 122 122 122 122 97 4194304 22711 22711 8:15:37 no-entry 8:15:37
m 524298 0 --rw------- 122 122 122 122 97 57042535 22711 22711 8:15:37 no-entry 8:15:37
m 524299 0 --rw------- 122 122 122 122 97 10066329 22711 22711 8:15:37 no-entry 8:15:37
m 589836 0 --rw------- 122 122 122 122 97 24408 22711 22711 8:15:37 no-entry 8:15:37
Semaphores:
T ID KEY MODE OWNER GROUP CREATOR CGROUP NSEMS OTIME CTIME
s 458752 0 --rw------- 770 770 770 770 17 13:47:25 8:14:33
s 524289 0 --rw------- 770 770 770 770 17 13:39:02 8:14:33
s 589826 0 --rw------- 770 770 770 770 17 13:37:52 8:14:33
s 589827 0 --rw------- 770 770 770 770 17 13:45:18 8:14:33
s 589828 0 --rw------- 770 770 770 770 17 13:47:01 8:14:33
s 589829 0 --rw------- 770 770 770 770 17 13:47:42 8:14:33
s 589830 0 --rw------- 770 770 770 770 17 13:46:47 8:14:33
s 589831 0 --rw------- 770 770 770 770 17 13:47:11 8:14:33
s 589832 0 --rw------- 770 770 770 770 17 13:46:51 8:14:33
s 589833 0 --rw------- 770 770 770 770 17 13:45:57 8:14:33
s 589834 0 --rw------- 770 770 770 770 17 13:46:33 8:14:33
s 589835 0 --rw------- 770 770 770 770 17 13:34:15 8:14:33
s 589836 0 --rw------- 770 770 770 770 17 13:47:33 8:14:33
s 589837 0 --rw------- 770 770 770 770 17 13:47:33 8:14:33
s 589838 0 --rw------- 122 122 122 122 14 13:47:44 8:15:28
s 589839 0 --rw------- 122 122 122 122 14 13:47:44 8:15:37
[tsr@sorthat ~]$
Obviously PostgreSQL isn't running (since it refused to start), and I had already shut down the Zabbix jails earlier in a frenzy to try to get PostgreSQL to start up. So nothing should be using SYSV IPC resources at all. Yet there they were, plain as day. Somehow they had not been cleaned up properly and the lingering semaphores were now preventing PostgreSQL from starting.
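The failed call in the error message, semget(5432005, 17, 03600), fits a quick back-of-the-envelope check. This is only one possible reading: it assumes the 266 lingering semaphores still counted against kern.ipc.semmns=340, and that PostgreSQL creates its sets with sequential keys starting at 5432001, as in the jail listing earlier. The function name sets_that_fit is made up.

```sh
#!/bin/sh
# How many more semaphore sets of a given size fit under semmns?
# Arguments: semmns limit, semaphores already allocated, semaphores per set.
sets_that_fit() {
    echo $(( ($1 - $2) / $3 ))
}

# Limit 340, 266 lingering, 17 semaphores per PostgreSQL set.
sets_that_fit 340 266 17    # prints 4
```

Under those assumptions four new sets fit and the fifth does not, which would explain why the call that failed was the one for key 5432005.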
Since no running jails were using Shared Memory or Semaphores I could use ipcrm -W to clean up everything:
[tsr@sorthat ~]$ sudo ipcrm -W
[tsr@sorthat ~]$ ipcs -t
Message Queues:
T ID KEY MODE OWNER GROUP STIME RTIME CTIME
Shared Memory:
T ID KEY MODE OWNER GROUP ATIME DTIME CTIME
Semaphores:
T ID KEY MODE OWNER GROUP OTIME CTIME
[tsr@sorthat ~]$
This command should be used with care; be very sure you know what you are doing. It should only be used if you are certain nothing else is running which needs the shared memory or semaphores. ipcrm also has switches to delete individual semaphore sets or shared memory segments in cases where that is needed.
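When other jails are still running, a more selective cleanup may be safer than wiping everything. The sketch below only prints the ipcrm commands for one owner instead of running them; the function name ipcrm_cmds_for_owner is hypothetical, and it assumes the `ipcs -a` layout shown above (ID in field 2, OWNER in field 5). It relies on ipcrm's -m and -s switches for removing a shared memory segment or semaphore set by ID.

```sh
#!/bin/sh
# Read `ipcs -a` output on stdin and print (not run) one ipcrm command
# for each shared memory segment and semaphore set owned by the given UID.
ipcrm_cmds_for_owner() {
    awk -v owner="$1" '
        $1 == "m" && $5 == owner { print "ipcrm -m " $2 }
        $1 == "s" && $5 == owner { print "ipcrm -s " $2 }'
}

# Sample rows copied from the listing above; only UID 770 is selected.
ipcrm_cmds_for_owner 770 <<'EOF'
m 458752 0 --rw------- 770 770 770 770 190 48 21599 21599 8:14:33 13:47:36 8:14:33
m 458753 0 --rw------- 122 122 122 122 144 16777216 22110 22110 8:15:28 13:46:59 8:15:28
s 458752 0 --rw------- 770 770 770 770 17 13:47:25 8:14:33
EOF
```

After reading through the printed commands, they could be run by hand or piped to a root shell.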
After cleaning up the old semaphores PostgreSQL started up without any problems. After that I started the Zabbix jails again, and then I started writing this blogpost so I never have to go through this again.
I have no idea why PostgreSQL crashed in the first place. I also have no idea why it was unable to clean up the lingering semaphores after the crash. But at least I know how to find and remove any lingering semaphores in case it happens again. I will also increase the semaphore limit kern.ipc.semmns to a large enough value that it can handle at least twice what PostgreSQL needs, so if this happens again it should still be able to start.
I kind of feel like the FreeBSD rc.d init script for PostgreSQL should run ipcrm to clean up any lingering stuff before starting it, but people on #postgres on Freenode seemed to disagree.
PostgreSQL 10 uses POSIX semaphores instead of SYSV IPC semaphores, which will make the problem with semaphores for PostgreSQL go away entirely.