Fun with FreeBSD Listen Queue Overflow

by Tykling


02. oct 2016 12:22 UTC


I spotted a listen queue overflow error message in /var/log/messages on one of my jailhosts today.

I had no idea what was causing it - this jailhost is busy, so it could be a lot of things. It seemed to happen about every hour, perhaps some scheduled job?

Most of the results you get when searching for this error mention TCP services that can't keep up with the connection rate on their listening port. But since I didn't even know which TCP port was affected, I had some detective work to do.

The error from messages looked like this:

Oct  1 01:17:10 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 02:14:10 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 03:14:11 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 04:13:11 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 05:11:11 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 06:09:11 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 07:09:11 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 08:10:12 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 09:10:12 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 10:09:12 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 11:09:12 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Oct  1 12:10:13 bong kernel: sonewconn: pcb 0xfffff800647962d0: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
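
A quick sanity check confirms that all of these messages reference the same pcb. This is just a hedged sketch assuming the default syslog format shown above, where the pcb address is the 8th whitespace-separated field; adjust the awk field number if your log format differs:

$ grep "Listen queue overflow" /var/log/messages | awk '{ print $8 }' | sort | uniq -c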

Finding the socket from the pcb id

The errors all reference the same pcb, 0xfffff800647962d0, so I need to find the matching socket. I can use netstat(1) for this:

$ netstat -Aan | grep fffff800647962d0
fffff800647962d0 stream      0      0 fffff8021ab66938        0        0        0 public/pickup

Okay, I recognize this immediately - this isn't even a TCP socket, it is a unix socket opened by Postfix. Unfortunately the socket has a relative path, and netstat doesn't output the pid that opened it, so I need to find out which Postfix instance (in which jail) is not processing connections. There are quite a few of them, since a lot of the jails are mailservers:

$ sockstat -l | grep public/pickup
postfix  pickup     94418 6  stream public/pickup
postfix  pickup     93468 6  stream public/pickup
postfix  pickup     93444 6  stream public/pickup
postfix  pickup     93137 6  stream public/pickup
postfix  pickup     92970 6  stream public/pickup
postfix  pickup     92777 6  stream public/pickup
postfix  pickup     92219 6  stream public/pickup
postfix  pickup     92209 6  stream public/pickup
postfix  pickup     92199 6  stream public/pickup
postfix  pickup     92174 6  stream public/pickup
postfix  pickup     92048 6  stream public/pickup
postfix  pickup     91887 6  stream public/pickup
postfix  pickup     91619 6  stream public/pickup
postfix  pickup     91580 6  stream public/pickup
postfix  pickup     91579 6  stream public/pickup
postfix  pickup     91576 6  stream public/pickup
postfix  pickup     91571 6  stream public/pickup
postfix  pickup     91549 6  stream public/pickup
postfix  pickup     91534 6  stream public/pickup
postfix  pickup     91511 6  stream public/pickup
postfix  pickup     91506 6  stream public/pickup
postfix  pickup     91456 6  stream public/pickup
postfix  pickup     91154 6  stream public/pickup
postfix  pickup     90225 6  stream public/pickup
postfix  pickup     89419 6  stream public/pickup
postfix  pickup     88126 6  stream public/pickup
postfix  pickup     46251 6  stream public/pickup
postfix  pickup     44779 6  stream public/pickup
postfix  pickup     44654 6  stream public/pickup
postfix  pickup     40736 6  stream public/pickup
postfix  pickup     40649 6  stream public/pickup
postfix  pickup     40604 6  stream public/pickup
postfix  pickup     37761 6  stream public/pickup
root     master     9692  18 stream public/pickup
root     master     37916 18 stream public/pickup
root     master     41241 18 stream public/pickup
root     master     7695  18 stream public/pickup
root     master     10668 18 stream public/pickup
root     master     47108 18 stream public/pickup
root     master     93509 18 stream public/pickup
root     master     52179 18 stream public/pickup
root     master     20236 18 stream public/pickup
root     master     5484  17 stream public/pickup
root     master     5163  17 stream public/pickup
root     master     65010 18 stream public/pickup
root     master     28457 18 stream public/pickup
root     master     19717 18 stream public/pickup
root     master     12271 18 stream public/pickup
root     master     93695 18 stream public/pickup
root     master     10337 18 stream public/pickup
root     master     7350  18 stream public/pickup
root     master     7080  18 stream public/pickup
root     master     6789  18 stream public/pickup
root     master     6525  18 stream public/pickup
root     master     6206  18 stream public/pickup
root     master     5799  18 stream public/pickup
root     master     5543  18 stream public/pickup
root     master     5281  18 stream public/pickup
root     master     5006  18 stream public/pickup
root     master     4720  30 stream public/pickup
root     master     4410  18 stream public/pickup
root     master     4131  18 stream public/pickup
root     master     3867  18 stream public/pickup
root     master     3550  18 stream public/pickup
root     master     3254  18 stream public/pickup
root     master     2568  17 stream public/pickup
root     master     1945  17 stream public/pickup
root     master     1079  17 stream public/pickup

Finding the culprit from the queue size

Since the queue size in the error messages is always the same - 151 - I might be able to use that number to find the socket. The listen queue sizes for unix and network sockets can be displayed with netstat:

$ netstat -Lan | grep 151
unix  151/0/100      public/pickup

The numbers are described in netstat(1):

If -L is also present, show the size of the various listen queues.  The
first count shows the number of unaccepted connections, the second
count shows the amount of unaccepted incomplete connections,
and the third count is the maximum number of queued connections.

The "Listen queue overflow" error is logged when a queue is more than 150% full. The maximum queue size here is 100, which means 151 pending connections trigger the error.

Actually finding the correct jail

All this still doesn't help me identify which jail this Postfix is running in. Fortunately netstat(1) can also run inside jails, so I did a quick one-liner to find it:
# loop over all jail names (tail +2 skips the header line printed by jls -h)
for jail in $(jls -h name | tail +2); do
    # look for the pcb from the error message inside this jail
    sudo jexec "$jail" netstat -Aan 2>/dev/null | grep -q fffff800647962d0;
    if [ $? -eq 0 ]; then
        echo "found in jail $jail";
    fi;
done
This printed the jail name, I was able to go fix Postfix in that jail, and the errors went away. The problem was a missing main.cf config which left the postfix pickup process unable to.. pick up. Connections just queued up on the socket until the queue hit its limit.
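
After fixing the config it is easy to verify that the queue is draining again - the first number should drop back towards 0 once pickup starts accepting connections. A quick sketch, where the jail name mailjail is just a placeholder for whatever your jail is called:

$ sudo jexec mailjail netstat -Lan | grep public/pickup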

Final Thoughts

This would have been a lot easier if FreeBSD logged the pid or socket info along with the error when it happens. I was fortunate that this was a long-running process, so the pcb stayed the same. If this had been a short-lived process it would have been considerably more difficult to track down. Process accounting combined with logging the pid with the error would be preferable. Alternatively one could jimmy up something to keep an eye on /var/log/messages and run netstat to find the pcb immediately after the error happens, as sketched below. Fortunately that was not necessary for me (this time).
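
Something along these lines could do the job - a rough sketch only, assuming the default syslog line format shown earlier, where the pcb address is the 8th whitespace-separated field (adjust the awk field number if yours differs):

#!/bin/sh
# watch /var/log/messages and look up the pcb as soon as an overflow is logged
tail -F /var/log/messages | while read line; do
    case "$line" in
        *"Listen queue overflow"*)
            # extract the pcb address and strip the trailing colon
            pcb=$(echo "$line" | awk '{ gsub(":", "", $8); print $8 }')
            echo "listen queue overflow on pcb $pcb:"
            # netstat -Aan prints the pcb without the 0x prefix
            netstat -Aan | grep "${pcb#0x}"
            ;;
    esac
done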
