Unix Sockets – Handling Ancillary Data on Partial Reads


So I've read a lot about ancillary data on Unix stream sockets, but one thing missing from all the documentation is what is supposed to happen on a partial read.

Suppose I'm receiving the following messages into a 24-byte buffer:

msg1 [20 bytes]  (no ancillary data)
msg2 [7 bytes]   (2 file descriptors)
msg3 [7 bytes]   (1 file descriptor)
msg4 [10 bytes]  (no ancillary data)
msg5 [7 bytes]   (5 file descriptors)
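
For concreteness, here is roughly how a message such as msg2 (7 bytes plus 2 file descriptors) could be sent. This is only a sketch: send_with_fds and MAX_SEND_FDS are names made up for this illustration, and it assumes the descriptors are passed as a single SCM_RIGHTS control message.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAX_SEND_FDS 8  /* arbitrary limit for this sketch */

/* Send `len` bytes plus `nfds` descriptors in one sendmsg() call.
 * Returns the sendmsg() result: bytes sent, or -1 with errno set. */
ssize_t send_with_fds(int sock, const void *data, size_t len,
                      const int *fds, size_t nfds)
{
    /* The union forces suitable alignment for struct cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(int) * MAX_SEND_FDS)];
        struct cmsghdr align;
    } ctrl;
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    struct msghdr msg;

    assert(nfds <= MAX_SEND_FDS);
    memset(&msg, 0, sizeof msg);
    memset(&ctrl, 0, sizeof ctrl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    if (nfds > 0) {
        struct cmsghdr *cmsg;

        /* Attach one SCM_RIGHTS control message holding all descriptors. */
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
        memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
    }
    return sendmsg(sock, &msg, 0);
}
```

With nfds set to 0 this degenerates into a plain send, which is how msg1 and msg4 would go out.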

On the first call to recvmsg, I get all of msg1 (and part of msg2? Will the OS ever do that?). If I get part of msg2, do I get the ancillary data right away, and do I need to save it for the next read, when I know what the message was actually telling me to do with the data? If I free up the 20 bytes from msg1 and then call recvmsg again, will it ever deliver msg3 and msg4 at the same time? Does the ancillary data from msg3 and msg4 get concatenated in the control message struct?
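
To make the receiving side of the question concrete, this is approximately what each read looks like. It is a sketch under assumptions: recv_with_fds and MAX_RECV_FDS are hypothetical names, and it simply gathers every SCM_RIGHTS descriptor the kernel hands back with one recvmsg call, without deciding which message they belong to.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAX_RECV_FDS 8  /* arbitrary limit for this sketch */

/* Read up to `len` bytes and collect any SCM_RIGHTS descriptors that
 * arrive with them into fds[0 .. *nfds-1]. `fds` must have room for
 * MAX_RECV_FDS entries. Returns the recvmsg() result. */
ssize_t recv_with_fds(int sock, void *data, size_t len,
                      int *fds, size_t *nfds)
{
    union {
        char buf[CMSG_SPACE(sizeof(int) * MAX_RECV_FDS)];
        struct cmsghdr align;
    } ctrl;
    struct iovec iov = { .iov_base = data, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *cmsg;
    ssize_t got;

    *nfds = 0;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    got = recvmsg(sock, &msg, 0);
    if (got < 0)
        return got;

    /* A real program should also check msg.msg_flags for MSG_CTRUNC. */
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            size_t n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);

            assert(*nfds + n <= MAX_RECV_FDS);
            memcpy(fds + *nfds, CMSG_DATA(cmsg), n * sizeof(int));
            *nfds += n;
        }
    }
    return got;
}
```

The open question is then what `got` and `*nfds` look like on each call when `len` is 24 and the five messages above are already queued.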

While I could write test programs to experimentally find this out, I'm looking for documentation about how ancillary data behaves in a streaming context. It seems odd that I can't find anything official on it.

I'm going to add my experimental findings here, which I got from this test program:

https://github.com/nrdvana/daemonproxy/blob/master/src/ancillary_test.c

Linux 3.2.59, 3.17.6

It appears that Linux will append portions of ancillary-bearing messages to the end of other messages as long as no prior ancillary payload needed to be delivered during this call to recvmsg. Once one message's ancillary data is being delivered, it will return a short read rather than starting the next ancillary-data message. So, in the example above, the reads I get are:

recv1: [24 bytes] (msg1 + partial msg2 with msg2's 2 file descriptors)
recv2: [10 bytes] (remainder of msg2 + msg3 with msg3's 1 file descriptor)
recv3: [17 bytes] (msg4 + msg5 with msg5's 5 file descriptors)
recv4: [0 bytes]

BSD 4.4, 10.0

BSD keeps reads aligned to message boundaries more than Linux does: it gives a short read immediately before the start of a message with ancillary data. But it will happily append a non-ancillary-bearing message to the end of an ancillary-bearing message. So for BSD, it looks like if your buffer is larger than the ancillary-bearing message, you get almost packet-like behavior. The reads I get are:

recv1: [20 bytes] (msg1)
recv2: [7 bytes]  (msg2, with msg2's 2 file descriptors)
recv3: [17 bytes] (msg3, and msg4, with msg3's 1 file descriptor)
recv4: [7 bytes]  (msg5 with 5 file descriptors)
recv5: [0 bytes]
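
Given both sets of results, one way to cope is to stop tying descriptors to a particular read: queue every descriptor as it arrives and let the message parser claim them once a complete message has been decoded. The sketch below assumes that descriptors are delivered in the order they were sent (which holds in the reads listed above) and that the application protocol tells the parser how many descriptors each message expects. struct fd_queue, fdq_push and fdq_pop are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define FDQ_CAP 32  /* arbitrary capacity for this sketch */

/* FIFO of descriptors received but not yet claimed by a parsed message. */
struct fd_queue {
    int fds[FDQ_CAP];
    size_t head;   /* index of the oldest entry */
    size_t count;  /* number of queued entries */
};

/* Append `n` descriptors in arrival order. Returns 0, or -1 if full. */
int fdq_push(struct fd_queue *q, const int *fds, size_t n)
{
    size_t i;

    assert(q->count <= FDQ_CAP);
    if (n > FDQ_CAP - q->count)
        return -1;
    for (i = 0; i < n; i++)
        q->fds[(q->head + q->count + i) % FDQ_CAP] = fds[i];
    q->count += n;
    return 0;
}

/* Remove the `n` oldest descriptors into `out`.
 * Returns 0, or -1 if fewer than `n` are queued (nothing is removed). */
int fdq_pop(struct fd_queue *q, int *out, size_t n)
{
    size_t i;

    assert(q->count <= FDQ_CAP);
    if (n > q->count)
        return -1;
    for (i = 0; i < n; i++)
        out[i] = q->fds[(q->head + i) % FDQ_CAP];
    q->head = (q->head + n) % FDQ_CAP;
    q->count -= n;
    return 0;
}
```

After each recvmsg, the descriptors go into the queue and the bytes go into a reassembly buffer. When the parser finds a complete msg2 in the buffer, it pops two descriptors, regardless of whether they arrived with the first or the last byte of that message.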

TODO:

Would still like to know how it happens on older Linux, iOS, Solaris, etc, and how it could be expected to happen in the future.

Best Answer

Ancillary data is received as if it were queued along with the first normal data octet in the segment (if any).

-- POSIX.1-2017

For the rest of your question, things get a bit hairy.

...For the purposes of this section, a datagram is considered to be a data segment that terminates a record, and that includes a source address as a special type of ancillary data.

Data segments are placed into the queue as data is delivered to the socket by the protocol. Normal data segments are placed at the end of the queue as they are delivered. If a new segment contains the same type of data as the preceding segment and includes no ancillary data, and if the preceding segment does not terminate a record, the segments are logically merged into a single segment...

A receive operation shall never return data or ancillary data from more than one segment.

So modern BSD sockets exactly match this extract. This is not surprising :-).
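
The queuing rule quoted above can be checked against the example in the question. This is a minimal model, assuming a stream socket (so no segment terminates a record) and a single type of data; struct segment and queue_segments are names made up for this illustration, not part of any API.

```c
#include <assert.h>
#include <stddef.h>

struct segment {
    size_t len;   /* bytes of normal data */
    size_t nfds;  /* descriptors queued with the first data byte */
};

/* Queue messages as segments, merging a message into the preceding
 * segment when the new message carries no ancillary data. Writes the
 * result to `out` (room for `n` entries) and returns the segment count. */
size_t queue_segments(const struct segment *msgs, size_t n,
                      struct segment *out)
{
    size_t i, count = 0;

    assert(msgs != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (count > 0 && msgs[i].nfds == 0)
            out[count - 1].len += msgs[i].len;  /* logically merged */
        else
            out[count++] = msgs[i];             /* starts a new segment */
    }
    return count;
}
```

Fed the five messages from the question (20, 7, 7, 10 and 7 bytes carrying 0, 2, 1, 0 and 5 descriptors), this yields four segments of 20, 7, 17 and 7 bytes. Since a receive never returns data from more than one segment and each segment fits in the 24-byte buffer, that is the same sequence of reads reported for BSD.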

Remember that the POSIX standard was written after UNIX, and after splits like BSD vs. System V. One of the main goals was to help understand the existing range of behaviour, and to prevent even more splits in existing features.

Linux was implemented without reference to BSD code. It appears to behave differently here.

1. If I read you correctly, it sounds like Linux is additionally merging "segments" when a new segment does include ancillary data, but the previous segment does not.
2. Your point that "Linux will append portions of ancillary-bearing messages to the end of other messages as long as no prior ancillary payload needed to be delivered during this call to recvmsg" does not seem entirely explained by the standard. One possible explanation would involve a race condition. If you read part of a "segment", you will receive the ancillary data. Perhaps Linux interpreted this as meaning the remainder of the segment no longer counts as including ancillary data! So when a new segment is received, it is merged, either as per the standard or as per difference 1 above.

If you want to write a maximally portable program, you should avoid this area altogether. When using ancillary data, it is much more common to use datagram sockets. If you want to work on all the strange platforms that technically aspire to provide something mostly like POSIX, your question seems to be venturing into a dark and untested corner.
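
To illustrate why datagram sockets sidestep the problem: each receive on a datagram socket returns a single message, so data and ancillary data from different messages cannot be mixed in one read. A small sketch, with first_read_size as a hypothetical helper name:

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a connected AF_UNIX pair of the given type, send a 7-byte and
 * then a 10-byte message, and return the size of the first read into a
 * 64-byte buffer (or -1 on error). */
ssize_t first_read_size(int type)
{
    int sv[2];
    char buf[64];
    ssize_t got = -1;

    assert(type == SOCK_DGRAM || type == SOCK_STREAM ||
           type == SOCK_SEQPACKET);
    if (socketpair(AF_UNIX, type, 0, sv) != 0)
        return -1;
    if (send(sv[0], "1234567", 7, 0) == 7 &&
        send(sv[0], "0123456789", 10, 0) == 10)
        got = recv(sv[1], buf, sizeof buf, 0);
    close(sv[0]);
    close(sv[1]);
    return got;
}
```

With SOCK_DGRAM the first read is the 7-byte message on its own, even though the buffer has room for both. With SOCK_STREAM the two messages may be coalesced into one read, which is where the ambiguity in the question starts.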

You could argue Linux still follows several significant principles:

"Ancillary data is received as if it were queued along with the first normal data octet in the segment".
Ancillary data is never "concatenated", as you put it.

However, I am not convinced the Linux behaviour is particularly useful, when you compare it to the BSD behaviour. It seems like the program you describe would need to add a Linux-specific workaround. And I don't know a justification for why Linux would expect you to do that.

It might have looked sensible when the Linux kernel code was written, without ever having been tested or exercised by any program.

Or it might be exercised by some program code which mostly works under this subset, but in principle could have edge-case "bugs" or race conditions.

If you cannot make sense of the Linux behaviour and its intended usage, I think that argues for treating this as a "dark, untested corner" on Linux.

Related Solutions

Linux – Default Unix Socket Buffer Size Values

The compiled-in default is not configurable, but it is different between 32-bit and 64-bit Linux. The value appears to be written so as to allow 256 packets of 256 bytes each, accounting for the different per-packet overhead (structs with 32-bit vs. 64-bit pointers or integers).

On 64-bit Linux 4.14.18: 212992 bytes

On 32-bit Linux 4.4.92: 163840 bytes

The default buffer sizes are the same for both the read and write buffers. The per-packet overhead is a combination of struct sk_buff and struct skb_shared_info, so it depends on the exact size of these structures (rounded up slightly for alignment). E.g. in the 64-bit kernel above, the overhead is 576 bytes per packet.

http://elixir.free-electrons.com/linux/v4.5/source/net/core/sock.c#L265

/* Take into consideration the size of the struct sk_buff overhead in the
 * determination of these values, since that is non-constant across
 * platforms.  This makes socket queueing behavior and performance
 * not depend upon such differences.
 */
#define _SK_MEM_PACKETS     256
#define _SK_MEM_OVERHEAD    SKB_TRUESIZE(256)
#define SK_WMEM_MAX     (_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
#define SK_RMEM_MAX     (_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)

/* Run time adjustable parameters. */
__u32 sysctl_wmem_max __read_mostly = SK_WMEM_MAX;
EXPORT_SYMBOL(sysctl_wmem_max);
__u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
EXPORT_SYMBOL(sysctl_rmem_max);
__u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
__u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
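
The arithmetic behind those numbers can be spelled out. This sketch takes the 576-byte overhead quoted above for the 64-bit kernel as given and treats SKB_TRUESIZE(256) as 256 payload bytes plus that overhead; the function names are made up for this illustration.

```c
#include <assert.h>

enum {
    PAYLOAD_BYTES  = 256,  /* argument of SKB_TRUESIZE() in the excerpt */
    PACKETS        = 256,  /* _SK_MEM_PACKETS */
    OVERHEAD_64BIT = 576   /* per-packet overhead quoted for 64-bit */
};

/* Default buffer size implied by a given per-packet overhead. */
long default_buffer_bytes(long overhead)
{
    return (PAYLOAD_BYTES + overhead) * PACKETS;
}

/* Per-packet overhead implied by an observed default buffer size. */
long implied_overhead(long buffer_bytes)
{
    assert(buffer_bytes % PACKETS == 0);
    return buffer_bytes / PACKETS - PAYLOAD_BYTES;
}
```

default_buffer_bytes(OVERHEAD_64BIT) gives (256 + 576) * 256 = 212992, matching the 64-bit value above. Working backwards, implied_overhead(163840) gives 384 bytes per packet for the 32-bit kernel; that figure is derived from the arithmetic here, not measured.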

Interestingly, if you set a non-default socket buffer size, Linux doubles it to provide for the overheads. This means that if you send smaller packets (e.g. less than the 576 bytes above), you won't be able to fit as many bytes of user data in the buffer, as you had specified for its size.
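
The doubling is easy to observe from user space. A sketch, with effective_sndbuf as a hypothetical helper name; the exact value reported depends on the platform and on system limits such as net.core.wmem_max.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Request a send-buffer size on a fresh AF_UNIX stream socket and
 * return what getsockopt() reports afterwards, or -1 on error. */
int effective_sndbuf(int requested)
{
    int sock, reported = -1;
    socklen_t len = sizeof reported;

    assert(requested > 0);
    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof requested) != 0 ||
        getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &reported, &len) != 0)
        reported = -1;
    close(sock);
    return reported;
}
```

On Linux, where socket(7) documents the doubling, effective_sndbuf(8192) would be expected to report 16384. Other systems may report the requested value unchanged.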

Linux – What does x (execute) permission do on unix sockets

Nothing, as far as I can see.

The Linux man page unix(7) says that the permissions of the directory containing a socket apply normally (i.e. you need +x on /foo to connect to /foo/sock, and +w on /foo to create /foo/sock) and that write permission controls connecting to the socket itself:

On Linux, connecting to a stream socket object requires write permission on that socket; sending a datagram to a datagram socket likewise requires write permission on that socket.
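
If you want to probe this from a program rather than with nc, a connect attempt that reports errno is enough. This is a sketch; try_connect is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Try to connect a stream socket to `path`.
 * Returns 0 on success, or the errno value of the failed call. */
int try_connect(const char *path)
{
    struct sockaddr_un addr;
    int sock, err = 0;

    assert(path != NULL);
    if (strlen(path) >= sizeof addr.sun_path)
        return ENAMETOOLONG;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return errno;
    if (connect(sock, (struct sockaddr *)&addr, sizeof addr) != 0)
        err = errno;
    close(sock);
    return err;
}
```

Going by the man page text, an unprivileged user on Linux would be expected to get EACCES from this after chmod a-w on the socket file, and 0 again once write permission is restored. Root bypasses the check, so run it as a normal user.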

Apparently some other systems behave differently:

POSIX does not make any statement about the effect of the permissions on a socket file, and on some systems (e.g., older BSDs), the socket permissions are ignored. Portable programs should not rely on this feature for security.

unix(4) on FreeBSD describes similar requirements. The Linux man page didn't say if socket access on some systems ignores the directory permissions too.

Removing the x bit from the socket seems to have the effect of giving a different error when trying to execute the socket, but that's not much of a practical difference:

$ ls -l test.sock
srwxr-xr-x 1 user user 0 Jun 28 16:24 test.sock=
$ nc -U ./test.sock
Hello
$ ./test.sock
bash: ./test.sock: No such device or address
$ chmod a-x test.sock
$ nc -U ./test.sock
Hello
$ ./test.sock
bash: ./test.sock: Permission denied

(I did also test that indeed only the w bit seems to matter for accessing the socket on Debian's Linux 4.9.0.)

Perhaps the sockets you meant had all permission bits removed from the user, or you meant the x bit on the directory?