Linux, Netlink, and Go — Part 1: netlink

I’m a big fan of Prometheus. I use it quite a lot at both home and work, and greatly enjoy having insight into what my systems are doing at any given moment. One of the most widely used Prometheus exporters is the node_exporter: a daemon that can extract a wide variety of metrics from UNIX-like machines.

As I was browsing the repository, I noticed an open issue requesting the addition of WiFi metrics to node_exporter. The idea intrigued me, and I realized that I would certainly make use of such a feature on my Linux laptop. I began exploring options for retrieving WiFi device information on Linux.

After a couple of weeks of experimentation (including the legacy ioctl() wireless extensions API), I authored two Go packages which work together to interact with WiFi devices on Linux:

  • netlink: provides low-level access to Linux netlink sockets.
  • wifi: provides access to IEEE 802.11 WiFi device actions and statistics.

This series of posts will describe some of the lessons I learned while implementing these packages in Go, and hopefully provide a nice reference for others who wish to experiment with netlink and/or WiFi devices in their language of choice.

The pseudo-code in this series will use Go’s x/sys/unix package and types from my netlink and wifi packages. The series is split across several posts, starting with this primer on netlink.

What is netlink?

Netlink is a Linux kernel inter-process communication mechanism, enabling communication between a userspace process and the kernel, or between multiple userspace processes. Netlink sockets are the primitive which enables this communication.

This post will provide a primer on netlink sockets, messages, multicast groups, and attributes. In addition, this post will focus on communication between userspace and the kernel, rather than communication between two userspace processes.

Creating netlink sockets

Netlink makes use of the standard BSD sockets API. This should be quite familiar to anyone who has done network programming in C. If you are unfamiliar with BSD sockets, I recommend the excellent Beej’s Guide to Network Programming for a primer on the topic.

It is important to note that netlink communications never traverse beyond the local host. With this in mind, let’s begin diving into how netlink sockets work!

To communicate with netlink, a netlink socket must be opened. This is done using the socket() system call:

fd, err := unix.Socket(
	// Always used when opening netlink sockets.
	unix.AF_NETLINK,
	// SOCK_RAW and SOCK_DGRAM appear to be interchangeable
	// for netlink sockets.
	unix.SOCK_RAW,
	// The netlink family that the socket will communicate
	// with, such as NETLINK_ROUTE or NETLINK_GENERIC.
	family,
)

The family parameter specifies a particular netlink family: essentially, a kernel subsystem which can be communicated with using netlink sockets. These families may offer functionality such as:

  • NETLINK_ROUTE: manipulation of Linux’s network interfaces, routes, IP addresses, etc.
  • NETLINK_GENERIC: a building block for simplified addition of new netlink families, like nl80211, Open vSwitch, etc.

Once the socket is created, bind() must be called to prepare it to send and receive messages.

err := unix.Bind(fd, &unix.SockaddrNetlink{
	// Always used when binding netlink sockets.
	Family: unix.AF_NETLINK,
	// A bitmask of multicast groups to join on bind.
	// Typically set to zero.
	Groups: 0,
	// If you'd like, you can assign a PID for this socket
	// here, but in my experience, it's easier to leave
	// this set to zero and let netlink assign and manage
	// PIDs on its own.
	Pid: 0,
})

At this point, the netlink socket is now ready to send and receive messages to and from the kernel.

Netlink message format

Netlink messages follow a very particular format. All messages must be aligned to a 4 byte boundary. As an example, a 16 byte message must be sent as is, but a 17 byte message must be padded to 20 bytes.
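
As a quick illustration of the alignment rule, here is a minimal sketch of rounding a length up to the next 4 byte boundary. The names nlmsgAlignTo and align are my own for this example and are not taken from any package.

```go
package main

import "fmt"

// nlmsgAlignTo is the alignment boundary for netlink messages, in bytes.
const nlmsgAlignTo = 4

// align rounds n up to the next multiple of nlmsgAlignTo.
func align(n int) int {
	return (n + nlmsgAlignTo - 1) &^ (nlmsgAlignTo - 1)
}

func main() {
	// A 16 byte message is sent as is; a 17 byte message is padded to 20.
	for _, n := range []int{16, 17, 20} {
		fmt.Printf("%d -> %d\n", n, align(n))
	}
}
```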

It is very important to note that, unlike typical network communications, netlink uses the host byte order, or endianness, for encoding and decoding integers, instead of the common network byte order (big endian). As a result, code which must convert between byte and integer representations of data must keep this in mind.
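
To make the byte order point concrete, the sketch below encodes the same integer in host order and in network order. It assumes Go 1.21 or newer for binary.NativeEndian; putUint32 and getUint32 are hypothetical helper names used only for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putUint32 encodes v the way netlink expects it: in host byte order.
func putUint32(v uint32) []byte {
	b := make([]byte, 4)
	binary.NativeEndian.PutUint32(b, v)
	return b
}

// getUint32 decodes a host byte order integer from the first 4 bytes of b.
func getUint32(b []byte) uint32 {
	return binary.NativeEndian.Uint32(b)
}

func main() {
	const v = 20

	host := putUint32(v)

	// For comparison only: the same value in network byte order.
	network := make([]byte, 4)
	binary.BigEndian.PutUint32(network, v)

	// On a little endian machine these two lines differ.
	fmt.Printf("host order:    % x\n", host)
	fmt.Printf("network order: % x\n", network)
	fmt.Println("round trip:", getUint32(host))
}
```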

Netlink message headers make use of the following format (diagram from RFC 3549):

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Flags |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Process ID (PID) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

These fields contain the following information:

  • Length (32 bits): the length of the entire message, including both headers and payload.
  • Type (16 bits): what kind of information the message contains, such as an error, end of multi-part message, etc.
  • Flags (16 bits): bit flags which indicate that a message is a request, a multi-part message, an acknowledgement of a request, etc.
  • Sequence Number (32 bits): a number used to correlate requests and responses; incremented on each request.
  • Process ID (PID) (32 bits): sometimes referred to as port ID; a number used to uniquely identify a particular netlink socket; may or may not be the process’s ID.

Finally, a payload may immediately follow a netlink header. Again, note that the payload must be padded to a 4 byte boundary.

An example netlink message which sends a request to the kernel may resemble the following in Go:

msg := netlink.Message{
	Header: netlink.Header{
		// Length of header, plus payload.
		Length: 16 + 4,
		// Set to zero on requests.
		Type: 0,
		// Indicate that message is a request to the kernel.
		Flags: netlink.HeaderFlagsRequest,
		// Sequence number selected at random.
		Sequence: 1,
		// PID set to process's ID.
		PID: uint32(os.Getpid()),
	},
	// An arbitrary byte payload. May be in a variety of formats.
	Data: []byte{0x01, 0x02, 0x03, 0x04},
}
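
To put such a message on the wire, the header fields are written in host byte order, followed by the payload and any padding. The following is a minimal sketch of that encoding, not the netlink package's actual code: the header type and marshal function are hypothetical, and binary.NativeEndian assumes Go 1.21 or newer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// headerLen is the size of a netlink message header, in bytes.
const headerLen = 16

// header mirrors the fields of the header diagram (hypothetical type).
type header struct {
	Length   uint32
	Type     uint16
	Flags    uint16
	Sequence uint32
	PID      uint32
}

// align rounds n up to a 4 byte boundary.
func align(n int) int { return (n + 3) &^ 3 }

// marshal computes the length field, then encodes the header and
// payload into a buffer padded to a 4 byte boundary.
func marshal(h header, data []byte) []byte {
	// The length covers header and payload, but not trailing padding.
	h.Length = uint32(headerLen + len(data))

	b := make([]byte, align(int(h.Length)))
	ne := binary.NativeEndian
	ne.PutUint32(b[0:4], h.Length)
	ne.PutUint16(b[4:6], h.Type)
	ne.PutUint16(b[6:8], h.Flags)
	ne.PutUint32(b[8:12], h.Sequence)
	ne.PutUint32(b[12:16], h.PID)
	copy(b[headerLen:], data)
	return b
}

func main() {
	const flagRequest = 0x1 // NLM_F_REQUEST

	// A 5 byte payload yields a length of 21, padded to 24 bytes.
	b := marshal(
		header{Flags: flagRequest, Sequence: 1},
		[]byte{0x01, 0x02, 0x03, 0x04, 0x05},
	)
	fmt.Printf("%d bytes on the wire: % x\n", len(b), b)
}
```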

Sending and receiving netlink messages

Now that we are familiar with some of the basics of netlink sockets, we can send and receive data using a socket.

Once a message has been prepared, it can be sent to the kernel using sendto():

// Assume messageBytes produces a netlink request message (like the
// one shown above) with the specified payload.
b := messageBytes([]byte{0x01, 0x02, 0x03, 0x04})
// fd is the netlink socket that was opened and bound earlier.
err := unix.Sendto(fd, b, 0, &unix.SockaddrNetlink{
	// Always used when sending on netlink sockets.
	Family: unix.AF_NETLINK,
})

Read-only requests to netlink typically do not require any special privileges. Operations which modify the state of a subsystem using netlink, or require locking its internal state, typically require elevated privileges. This may mean running the program as root or using CAP_NET_ADMIN to:

  • Send a write request to make changes to a subsystem using netlink.
  • Send a read request with the NLM_F_ATOMIC flag, to receive an atomic snapshot of data from netlink.

Receiving messages from a netlink socket using recvfrom() can be slightly more complicated, depending on a variety of factors. Netlink may reply with:

  • Very small or very large messages.
  • Multi-part messages, broken into multiple pieces.
  • An explicit error number, when header type is “error”.

In addition, the sequence number and PID of each message should be validated as well. When working with raw system calls, it’s up to the socket’s user to handle these cases.

Large messages

To deal with large messages, I’ve employed a technique of allocating a single page of memory, peeking at the buffer (without draining it), and then doubling the size of the buffer if it’s too small to read the entire message. Thanks to Dominik Honnef for his insight on this problem.

Error handling omitted for brevity. Please check your errors.

b := make([]byte, os.Getpagesize())
for {
	// Peek at the buffer to see how many bytes are available.
	n, _, _ := unix.Recvfrom(fd, b, unix.MSG_PEEK)
	// Break when we can read all messages.
	if n < len(b) {
		break
	}
	// Double in size if not enough bytes.
	b = make([]byte, len(b)*2)
}
// Read out all available messages.
n, _, _ := unix.Recvfrom(fd, b, 0)

In theory, a netlink message may be of a size up to ~4GiB (maximum 32-bit unsigned integer), but in practice, messages are much smaller.

Multi-part messages

For certain types of messages, netlink may reply with a “multi-part message”. In this case, each message before the final one will have the “multi” flag set. The final message will have a type of “done”.

When returning multi-part messages, the first recvfrom() will return all messages with the “multi” flag set. Next, recvfrom() must be called again to retrieve the final message with header type “done”. This is very important or else netlink will simply hang on subsequent requests, waiting for the caller to drain the final header type “done” message.

The code for this isn’t as trivial as other examples, but you can take a look at my implementation if you’d like a reference.
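
For a rough idea of what that involves, here is a simplified sketch (not my package's actual code) which splits the bytes from one recvfrom() into messages and decides whether another read is needed. The message type and the parse, encode, and needMore functions are hypothetical; the numeric values are those of NLMSG_DONE and NLM_F_MULTI, and binary.NativeEndian and min assume Go 1.21 or newer.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	headerLen = 16  // size of a netlink message header
	typeDone  = 0x3 // NLMSG_DONE
	flagMulti = 0x2 // NLM_F_MULTI
)

// message is a hypothetical decoded netlink message.
type message struct {
	Type, Flags   uint16
	Sequence, PID uint32
	Data          []byte
}

// align rounds n up to a 4 byte boundary.
func align(n int) int { return (n + 3) &^ 3 }

// encode exists only to build sample input for parse.
func encode(m message) []byte {
	l := headerLen + len(m.Data)
	b := make([]byte, align(l))
	ne := binary.NativeEndian
	ne.PutUint32(b[0:4], uint32(l))
	ne.PutUint16(b[4:6], m.Type)
	ne.PutUint16(b[6:8], m.Flags)
	ne.PutUint32(b[8:12], m.Sequence)
	ne.PutUint32(b[12:16], m.PID)
	copy(b[headerLen:], m.Data)
	return b
}

// parse splits the bytes returned by a single recvfrom() into messages.
func parse(b []byte) ([]message, error) {
	ne := binary.NativeEndian
	var msgs []message
	for len(b) > 0 {
		if len(b) < headerLen {
			return nil, errors.New("netlink: truncated header")
		}
		l := int(ne.Uint32(b[0:4]))
		if l < headerLen || l > len(b) {
			return nil, errors.New("netlink: invalid message length")
		}
		msgs = append(msgs, message{
			Type:     ne.Uint16(b[4:6]),
			Flags:    ne.Uint16(b[6:8]),
			Sequence: ne.Uint32(b[8:12]),
			PID:      ne.Uint32(b[12:16]),
			Data:     b[headerLen:l],
		})
		// Step over the padding that follows the message, if any.
		b = b[min(align(l), len(b)):]
	}
	return msgs, nil
}

// needMore reports whether recvfrom() must be called again because a
// multi-part reply has not yet been terminated by a "done" message.
func needMore(msgs []message) bool {
	if len(msgs) == 0 {
		return false
	}
	last := msgs[len(msgs)-1]
	return last.Flags&flagMulti != 0 && last.Type != typeDone
}

func main() {
	part := encode(message{Type: 16, Flags: flagMulti, Sequence: 1, Data: []byte{0xaa}})
	first, _ := parse(append(part, part...))
	fmt.Println("messages:", len(first), "need more:", needMore(first))

	second, _ := parse(encode(message{Type: typeDone, Flags: flagMulti, Sequence: 1}))
	fmt.Println("messages:", len(second), "need more:", needMore(second))
}
```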

Netlink error numbers

If netlink cannot satisfy a request for whatever reason, it will return an explicit error number in the payload of a message containing header type “error”. These error numbers are the same as Linux’s classic error numbers, such as ENOENT for “no such file or directory”, or EPERM for “operation not permitted”.

If a message’s header type indicates an error, the error number will be encoded as a signed 32 bit integer (note: also uses system endianness) in the first 4 bytes of the message’s payload.

const name = "foo0"
_, err := rtnetlink.InterfaceByName(name)
if err != nil && os.IsNotExist(err) {
	// Error is result of a netlink error number, and can be
	// checked in the usual Go fashion.
	log.Printf("no such device: %q", name)
	return
}
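
Under the hood, turning an “error” message into a Go error might look roughly like the sketch below. Two details go beyond what is described above: the kernel stores the error number negated, and a value of zero is an acknowledgement rather than a failure. The function names are hypothetical, the constant is the value of NLMSG_ERROR, and binary.NativeEndian assumes Go 1.21 or newer.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"syscall"
)

const typeError = 0x2 // NLMSG_ERROR

// errorPayload builds a sample "error" payload for demonstration.
func errorPayload(errno int32) []byte {
	b := make([]byte, 4)
	// The kernel stores the error number negated.
	binary.NativeEndian.PutUint32(b, uint32(-errno))
	return b
}

// decodeErrno reads the signed 32 bit error number from the first
// 4 bytes of data and returns it as a positive number.
func decodeErrno(data []byte) (int32, error) {
	if len(data) < 4 {
		return 0, errors.New("netlink: error payload too short")
	}
	code := int32(binary.NativeEndian.Uint32(data[0:4]))
	if code < 0 {
		code = -code
	}
	return code, nil
}

// checkError returns a Go error for messages of type "error" which
// carry a non-zero error number, and nil otherwise.
func checkError(msgType uint16, data []byte) error {
	if msgType != typeError {
		return nil
	}
	code, err := decodeErrno(data)
	if err != nil {
		return err
	}
	if code == 0 {
		// Zero is an acknowledgement, not a failure.
		return nil
	}
	return syscall.Errno(code)
}

func main() {
	err := checkError(typeError, errorPayload(int32(syscall.ENOENT)))
	fmt.Println(errors.Is(err, syscall.ENOENT), err)
}
```

Because the result is a syscall.Errno, helpers such as os.IsNotExist work on it as in the example above.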

Sequence number and PID validation

To ensure a netlink reply from the kernel is in response to one of our requests, we must also validate the sequence number and PID on each received message. In the majority of cases, these should match exactly what was sent to the kernel with a request. Subsequent requests should increment the sequence number before sending another message to netlink.

PID validation may vary slightly, depending on several conditions.

  • If a message is received in userspace on behalf of a multicast group, it will have a PID of 0, meaning the message originated in the kernel.
  • If a request is sent to the kernel with a PID of 0, netlink will assign a PID for a given socket on the first response. This PID should be used (and validated) in subsequent communications.

Assuming you didn’t specify a PID in bind(), when opening multiple netlink sockets in a single application, the first one will be assigned a PID of the process’s ID. Subsequent ones will have a random number chosen by netlink. In my experience, it is much easier to just let netlink assign all PIDs itself, and make sure you keep track of which numbers it assigns for each socket.
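
The bookkeeping described in this section can be sketched as follows. This is a simplified illustration under my own assumptions, not the netlink package's code: the conn and header types are hypothetical, and letting every PID 0 message through without a sequence check is a choice made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errSequence = errors.New("netlink: mismatched sequence number")
	errPID      = errors.New("netlink: mismatched PID")
)

// header holds the two fields that need validation.
type header struct{ Sequence, PID uint32 }

// conn tracks the state needed to validate replies on one socket.
type conn struct {
	seq uint32 // sequence number of the most recent request
	pid uint32 // PID assigned by netlink; zero until the first reply
}

// nextSequence increments and returns the sequence number to use for
// the next request.
func (c *conn) nextSequence() uint32 {
	c.seq++
	return c.seq
}

// validate checks a reply header against the most recent request.
func (c *conn) validate(reply header) error {
	if reply.PID == 0 {
		// Originated in the kernel, such as a multicast message.
		return nil
	}
	if reply.Sequence != c.seq {
		return errSequence
	}
	if c.pid == 0 {
		// First reply: remember the PID netlink assigned to us.
		c.pid = reply.PID
		return nil
	}
	if reply.PID != c.pid {
		return errPID
	}
	return nil
}

func main() {
	c := &conn{}
	seq := c.nextSequence()

	fmt.Println(c.validate(header{Sequence: seq, PID: 4242}))     // first reply assigns the PID
	fmt.Println(c.validate(header{Sequence: seq, PID: 1}))        // PID mismatch
	fmt.Println(c.validate(header{Sequence: seq + 1, PID: 4242})) // sequence mismatch
}
```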

Multicast groups

In addition to the classic request/response socket paradigm, netlink sockets also provide multicast groups to enable subscribing to certain events as they occur.

A multicast group can be joined using two different methods:

  • Specifying a groups bitmask during bind(). This is considered the “legacy” method.
  • Joining and leaving groups using setsockopt(). This is the preferred, modern method.

Joining and leaving groups using setsockopt() is a matter of swapping a single constant. In Go, this is done using uint32 “group” values.

// Can also specify unix.NETLINK_DROP_MEMBERSHIP to leave
// a group.
const joinLeave = unix.NETLINK_ADD_MEMBERSHIP
// Multicast group ID. Typically assigned using predefined
// constants for various netlink families.
const group = 1
err := unix.SetsockoptInt(
	fd,
	unix.SOL_NETLINK,
	joinLeave,
	group,
)

Once a group is joined, you can listen for messages using recvfrom() as usual. Leaving the group will cause no further messages to be delivered for a given multicast group.
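
For comparison, the legacy bind() method expects a bitmask rather than a group ID: group N corresponds to bit N-1, so only groups 1 through 32 can be expressed that way. A minimal sketch of the conversion, using a hypothetical groupMask helper:

```go
package main

import "fmt"

// groupMask converts multicast group IDs into the bitmask form used
// by the Groups field when joining groups during bind().
func groupMask(groups ...uint32) (uint32, error) {
	var mask uint32
	for _, g := range groups {
		if g < 1 || g > 32 {
			return 0, fmt.Errorf("group %d cannot be expressed in a 32 bit mask", g)
		}
		mask |= 1 << (g - 1)
	}
	return mask, nil
}

func main() {
	mask, err := groupMask(1, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("groups 1 and 3 -> mask %#x\n", mask)

	_, err = groupMask(33)
	fmt.Println("group 33:", err)
}
```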

Netlink attributes

To wrap up our primer on netlink sockets, we will discuss a very common data format for netlink message payloads: attributes.

Netlink attributes are unusual in that they are in LTV (length, type, value) format, instead of the typical TLV (type, length, value). As with every other integer in netlink sockets, the type and length values are also encoded with host endianness. Finally, netlink attributes must also be padded to a 4 byte boundary, just like netlink messages.

Each field contains the following information:

  • Length (16 bits): the length of the entire attribute, including length, type, and value fields. Need not be a multiple of 4 bytes. For example, if length is 17 bytes, the attribute will be padded to 20 bytes, but the 3 bytes of padding should not be interpreted as meaningful.
  • Type (16 bits): the type of an attribute, typically defined as a constant in some netlink family or header.
  • Value (variable bytes): the raw payload of an attribute. May contain nested attributes, which are stored in the same format. Those nested attributes may contain even more nested attributes!
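
The layout above can be sketched as a pair of encode and decode functions. This is an illustrative example rather than my package's code: the attribute type and both function names are hypothetical, and binary.NativeEndian and min assume Go 1.21 or newer.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// attrHeaderLen is the size of the length and type fields, in bytes.
const attrHeaderLen = 4

// attribute is a hypothetical decoded netlink attribute.
type attribute struct {
	Type uint16
	Data []byte
}

// align rounds n up to a 4 byte boundary.
func align(n int) int { return (n + 3) &^ 3 }

// marshalAttributes encodes attrs in LTV format, padding each one.
func marshalAttributes(attrs []attribute) []byte {
	var out []byte
	for _, a := range attrs {
		// The length field covers the header and value, not the padding.
		l := attrHeaderLen + len(a.Data)
		b := make([]byte, align(l))
		binary.NativeEndian.PutUint16(b[0:2], uint16(l))
		binary.NativeEndian.PutUint16(b[2:4], a.Type)
		copy(b[attrHeaderLen:], a.Data)
		out = append(out, b...)
	}
	return out
}

// unmarshalAttributes decodes consecutive attributes from b.
func unmarshalAttributes(b []byte) ([]attribute, error) {
	var attrs []attribute
	for len(b) > 0 {
		if len(b) < attrHeaderLen {
			return nil, errors.New("netlink: truncated attribute header")
		}
		l := int(binary.NativeEndian.Uint16(b[0:2]))
		if l < attrHeaderLen || l > len(b) {
			return nil, errors.New("netlink: invalid attribute length")
		}
		attrs = append(attrs, attribute{
			Type: binary.NativeEndian.Uint16(b[2:4]),
			// Padding bytes are not part of the value.
			Data: b[attrHeaderLen:l],
		})
		b = b[min(align(l), len(b)):]
	}
	return attrs, nil
}

func main() {
	b := marshalAttributes([]attribute{
		{Type: 1, Data: []byte("a")},
		{Type: 2, Data: []byte("wlan")},
	})
	fmt.Printf("%d bytes: % x\n", len(b), b)

	attrs, err := unmarshalAttributes(b)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range attrs {
		fmt.Printf("type %d: %q\n", a.Type, a.Data)
	}
}
```

Nested attributes can be handled by calling unmarshalAttributes again on an attribute's Data.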

There are two special flags which may be present in netlink attributes, though I have yet to encounter them in my work.

  • NLA_F_NESTED: specifies a nested attribute; used as a hint for parsing. Doesn’t always appear to be used, even if nested attributes are present.
  • NLA_F_NET_BYTEORDER: attribute data is stored in network byte order (big endian) instead of host endianness.

Consult the documentation of a given netlink family to determine if either of these flags should be checked.
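
Both flags are carried in the two highest bits of the 16 bit type field, so checking for them is a matter of masking. In the sketch below, the constants mirror the kernel's NLA_F_NESTED, NLA_F_NET_BYTEORDER, and NLA_TYPE_MASK values, while splitType is a hypothetical helper.

```go
package main

import "fmt"

const (
	flagNested       = 1 << 15 // NLA_F_NESTED
	flagNetByteOrder = 1 << 14 // NLA_F_NET_BYTEORDER

	// typeMask clears both flag bits (NLA_TYPE_MASK).
	typeMask = ^uint16(flagNested | flagNetByteOrder)
)

// splitType separates the flag bits from the attribute type proper.
func splitType(t uint16) (typ uint16, nested, netByteOrder bool) {
	return t & typeMask, t&flagNested != 0, t&flagNetByteOrder != 0
}

func main() {
	for _, raw := range []uint16{0x0005, 0x8005, 0x4001} {
		typ, nested, nbo := splitType(raw)
		fmt.Printf("raw %#04x: type=%d nested=%t netByteOrder=%t\n",
			raw, typ, nested, nbo)
	}
}
```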

Summary

Now that we are familiar with using netlink sockets and messages, the next post in the series will build upon this knowledge to dive into generic netlink.

Hope you enjoyed this post! If you have questions or comments, feel free to reach out via the comments, Twitter, or Gophers Slack (username: mdlayher).

Updates

  • 2/22/2017: moved background information about BSD sockets API to the “Creating netlink sockets” section.
  • 2/22/2017: noted need for root or CAP_NET_ADMIN for many netlink write operations, and when using NLM_F_ATOMIC. Thanks, Steven Hartland from the golang-nuts thread.
  • 2/23/2017: noted ability to specify a PID for a socket in bind(). Thanks, Dan Williams from a libnl thread.
  • 2/27/2017: changed pseudocode to use x/sys/unix instead of syscall, since syscall is frozen.
