Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are:

*   Addressing
    RDS uses IPv4 addresses and 16-bit port numbers to identify
    the end point of a connection. All socket operations that involve
    passing addresses between kernel and user space generally
    use a struct sockaddr_in.

    The fact that IPv4 addresses are used does not mean the underlying
    transport has to be IP-based. In fact, RDS over IB uses a
    reliable IB connection; the IP address is used exclusively to
    locate the remote node's GID (by ARPing for the given IP).

    The port space is entirely independent of UDP, TCP or any other
    protocol.

*   Socket interface
    RDS sockets work *mostly* as you would expect from a BSD
    socket. The next section will cover the details. At any rate,
    all I/O is performed through the standard BSD socket API.
    Some additions like zerocopy support are implemented through
    control messages, while other extensions use the getsockopt/
    setsockopt calls.

    Sockets must be bound before you can send or receive data.
    This is needed because binding also selects a transport and
    attaches it to the socket. Once bound, the transport assignment
    does not change. RDS will tolerate IPs moving around (e.g. in
    an active-active HA scenario), but only as long as the address
    doesn't move to a different transport.

*   sysctls
    RDS supports a number of sysctls in /proc/sys/net/rds.

Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
    AF_RDS and PF_RDS are the domain type to be used with socket(2)
    to create RDS sockets. SOL_RDS is the socket-level to be used
    with setsockopt(2) and getsockopt(2) for RDS specific socket
    options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
    This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
    RDS honors the send and receive buffer size socket options.
    You are not allowed to queue more than SO_SNDBUF bytes to
    a socket. A message is queued when sendmsg is called, and
    it leaves the queue when the remote system acknowledges
    its arrival.

    The SO_RCVBUF option controls the maximum receive queue length.
    This is a soft limit rather than a hard limit - RDS will
    continue to accept and queue incoming messages, even if that
    takes the queue length over the limit. However, it will also
    mark the port as "congested" and send a congestion update to
    the source node. The source node is supposed to throttle any
    processes sending to this congested port.
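
    For illustration, a minimal sketch (not taken from the RDS
    sources) of tuning these limits with the standard SO_SNDBUF and
    SO_RCVBUF options; the 1 MB values are arbitrary examples:

#include <sys/socket.h>

/* Sketch: enlarge the send and receive accounting limits on an
 * existing RDS socket fd.  The 1 MB values are arbitrary. */
static int rds_set_bufsizes(int fd)
{
	int sndbuf = 1024 * 1024;	/* max bytes queued for send */
	int rcvbuf = 1024 * 1024;	/* soft limit on the receive queue */

	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}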

  bind(fd, &sockaddr_in, ...)
    This binds the socket to a local IP address and port, and a
    transport, if one has not already been selected via the
    SO_RDS_TRANSPORT socket option.
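
    As a concrete illustration, a minimal sketch of creating and
    binding an RDS socket; the address and port are arbitrary
    examples, and the AF_RDS fallback value matches <linux/socket.h>
    on recent kernels:

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef AF_RDS
#define AF_RDS 21		/* RDS address family, from <linux/socket.h> */
#endif
#ifndef PF_RDS
#define PF_RDS AF_RDS
#endif

int main(void)
{
	struct sockaddr_in sin;
	int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Bind to a local IP and RDS port; this also picks the transport. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = inet_addr("192.168.1.1");	/* example address */
	sin.sin_port = htons(4000);			/* example RDS port */

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		perror("bind");
		return 1;
	}
	return 0;
}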

  sendmsg(fd, ...)
    Sends a message to the indicated recipient. The kernel will
    transparently establish the underlying reliable connection
    if it isn't up yet.

    An attempt to send a message that exceeds SO_SNDBUF will
    return EMSGSIZE.

    An attempt to send a message that would take the total number
    of queued bytes over the SO_SNDBUF threshold will return
    EAGAIN.

    An attempt to send a message to a destination that is marked
    as "congested" will return ENOBUFS.
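
    A sketch of sending one datagram with sendmsg(2); since RDS
    sockets are not connect(2)ed to a single peer, the destination is
    supplied in msg_name (the wrapper name is our own, not an RDS
    API):

#include <sys/types.h>
#include <sys/uio.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch: send one RDS datagram from a bound socket fd to dst.
 * Returns what sendmsg() returns; callers should expect -1 with
 * errno EAGAIN (send queue full) or ENOBUFS (destination congested). */
static ssize_t rds_send(int fd, const struct sockaddr_in *dst,
			const void *buf, size_t len)
{
	struct iovec iov = {
		.iov_base = (void *)buf,
		.iov_len  = len,
	};
	struct msghdr msg = {
		.msg_name    = (void *)dst,	/* recipient of this datagram */
		.msg_namelen = sizeof(*dst),
		.msg_iov     = &iov,
		.msg_iovlen  = 1,
	};

	return sendmsg(fd, &msg, 0);
}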

  recvmsg(fd, ...)
    Receives a message that was queued to this socket. The socket's
    receive queue accounting is adjusted, and if the queue length
    drops below SO_RCVBUF, the port is marked uncongested, and
    a congestion update is sent to all peers.

    Applications can ask the RDS kernel module to receive
    notifications via control messages (for instance, there is a
    notification when a congestion update arrived, or when a RDMA
    operation completes). These notifications are received through
    the msg.msg_control buffer of struct msghdr. The format of the
    messages is described in the manpages.
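
    A minimal receive sketch; the RDS_CMSG_* names come from
    <linux/rds.h>, which notifications actually arrive depends on
    what the application asked for, and the SOL_RDS fallback value
    matches <linux/socket.h>:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/rds.h>

#ifndef SOL_RDS
#define SOL_RDS 276		/* from <linux/socket.h> */
#endif

/* Sketch: receive one RDS datagram plus any queued notifications. */
static ssize_t rds_recv(int fd, void *buf, size_t len)
{
	char cbuf[256];			/* room for RDS control messages */
	struct sockaddr_in from;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_name       = &from,
		.msg_namelen    = sizeof(from),
		.msg_iov        = &iov,
		.msg_iovlen     = 1,
		.msg_control    = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cm;
	ssize_t n = recvmsg(fd, &msg, 0);

	if (n < 0)
		return n;

	/* Walk any RDS notifications piggybacked on this receive. */
	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_RDS &&
		    cm->cmsg_type == RDS_CMSG_CONG_UPDATE)
			printf("congestion update received\n");
	}
	return n;
}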

  poll(fd)
    RDS supports the poll interface to allow the application
    to implement async I/O.

    POLLIN handling is pretty straightforward. When there's an
    incoming message queued to the socket, or a pending notification,
    we signal POLLIN.

    POLLOUT is a little harder. Since you can essentially send
    to any destination, RDS will always signal POLLOUT as long as
    there's room on the send queue (i.e. the number of bytes queued
    is less than the sendbuf size).

    However, the kernel will refuse to accept messages to
    a destination marked congested - in this case you will loop
    forever if you rely on poll to tell you what to do.
    This isn't a trivial problem, but applications can deal with
    it - by using congestion notifications, and by checking for
    ENOBUFS errors returned by sendmsg.
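
    A sketch of a send loop that copes with this POLLOUT caveat: on
    ENOBUFS it also waits for POLLIN (which fires when a congestion
    notification arrives, assuming notifications are enabled) instead
    of spinning on poll; rds_send() is the wrapper sketched under
    sendmsg() above:

#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* rds_send() is the sendmsg() wrapper sketched earlier. */
ssize_t rds_send(int fd, const struct sockaddr_in *dst,
		 const void *buf, size_t len);

/* Sketch: retry a send without busy-looping.  On ENOBUFS (destination
 * congested) POLLOUT alone is not enough, so also wait for POLLIN. */
static ssize_t rds_send_patiently(int fd, const struct sockaddr_in *dst,
				  const void *buf, size_t len)
{
	for (;;) {
		ssize_t n = rds_send(fd, dst, buf, len);

		if (n >= 0 || (errno != EAGAIN && errno != ENOBUFS))
			return n;

		struct pollfd pfd = {
			.fd     = fd,
			.events = (short)(POLLOUT |
					  (errno == ENOBUFS ? POLLIN : 0)),
		};
		poll(&pfd, 1, -1);	/* block instead of spinning */
	}
}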

  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
    This allows the application to discard all messages queued to a
    specific destination on this particular socket.

    This lets the application cancel outstanding messages if
    it detects a timeout. For instance, if it tried to send a message,
    and the remote host is unreachable, RDS will keep trying forever.
    The application may decide it's not worth it, and cancel the
    operation. In this case, it would use RDS_CANCEL_SENT_TO to
    nuke any pending messages.
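
    A sketch of cancelling everything queued to one peer (the helper
    name is ours; RDS_CANCEL_SENT_TO itself is from <linux/rds.h>):

#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/rds.h>

#ifndef SOL_RDS
#define SOL_RDS 276		/* from <linux/socket.h> */
#endif

/* Sketch: drop every message still queued on fd for one destination,
 * e.g. after an application-level timeout. */
static int rds_cancel_to(int fd, const struct sockaddr_in *dst)
{
	return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
			  dst, sizeof(*dst));
}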

  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
    Set or read an integer defining the underlying
    encapsulating transport to be used for RDS packets on the
    socket. When setting the option, the integer argument may be
    one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
    value, RDS_TRANS_NONE will be returned on an unbound socket.
    This socket option may only be set exactly once on the socket,
    prior to binding it via the bind(2) system call. Attempts to
    set SO_RDS_TRANSPORT on a socket for which the transport has
    been previously attached explicitly (by SO_RDS_TRANSPORT) or
    implicitly (via bind(2)) will return an error of EOPNOTSUPP.
    An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
    always return EINVAL.
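
    For example, a sketch that pins a not-yet-bound socket to the TCP
    transport and reads the setting back; the RDS_TRANS_* constants
    are from <linux/rds.h>:

#include <sys/socket.h>
#include <linux/rds.h>

#ifndef SOL_RDS
#define SOL_RDS 276		/* from <linux/socket.h> */
#endif

/* Sketch: select RDS-over-TCP before bind(2).  Fails with EOPNOTSUPP
 * if a transport is already attached, and setting RDS_TRANS_NONE
 * would fail with EINVAL. */
static int rds_use_tcp(int fd)
{
	int trans = RDS_TRANS_TCP;
	socklen_t len = sizeof(trans);

	if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, &trans, len) < 0)
		return -1;

	/* Read it back; an unbound socket with no explicit transport
	 * would report RDS_TRANS_NONE instead. */
	return getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, &trans, &len);
}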

RDMA for RDS
============

see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

see rds(7) manpage

RDS Protocol
============

Message header

  The message header is a 'struct rds_header' (see rds.h). Fields:

    h_sequence:
      per-packet sequence number
    h_ack:
      piggybacked acknowledgment of the last packet received
    h_len:
      length of data, not including the header
    h_sport:
      source port
    h_dport:
      destination port
    h_flags:
      CONG_BITMAP - this is a congestion update bitmap
      ACK_REQUIRED - receiver must ack this packet
      RETRANSMITTED - packet has previously been sent
    h_credit:
      indicates to the other end of the connection that
      it has more credits available (i.e. there is
      more send room)
    h_padding[4]:
      reserved for future use
    h_csum:
      header checksum
    h_exthdr:
      optional data can be passed here. This is currently used for
      passing RDMA-related information.
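
  The field list corresponds to an on-the-wire layout along these
  lines (a sketch modelled on net/rds/rds.h; the exthdr space size is
  an assumption, and multi-byte fields are big-endian):

#include <linux/types.h>

#define RDS_HEADER_EXT_SPACE 16	/* assumed size of the exthdr area */

/* Sketch of the wire header described above; see net/rds/rds.h for
 * the authoritative definition. */
struct rds_header {
	__be64	h_sequence;	/* per-packet sequence number */
	__be64	h_ack;		/* piggybacked ack of last packet received */
	__be32	h_len;		/* payload length, header excluded */
	__be16	h_sport;	/* source port */
	__be16	h_dport;	/* destination port */
	__u8	h_flags;	/* CONG_BITMAP, ACK_REQUIRED, RETRANSMITTED */
	__u8	h_credit;	/* new send credits granted to the peer */
	__u8	h_padding[4];	/* reserved for future use */
	__sum16	h_csum;		/* header checksum */
	__u8	h_exthdr[RDS_HEADER_EXT_SPACE];	/* optional extension headers */
};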

ACK and retransmit handling

  One might think that with reliable IB connections you wouldn't need
  to ack messages that have been received.  The problem is that IB
  hardware generates an ack message before it has DMAed the message
  into memory.  This creates a potential message loss if the HCA is
  disabled for any reason between sending the ack and DMAing and
  processing the message.  This is only a potential issue if another
  HCA is available for fail-over.

  Sending an ack immediately would allow the sender to free the sent
  message from their send queue quickly, but could cause excessive
  traffic to be used for acks. RDS piggybacks acks on sent data
  packets.  Ack-only packets are reduced by only allowing one to be
  in flight at a time, and by the sender only asking for acks when
  its send buffers start to fill up. All retransmissions are also
  acked.

Flow Control

  RDS's IB transport uses a credit-based mechanism to verify that
  there is space in the peer's receive buffers for more data. This
  eliminates the need for hardware retries on the connection.
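
  Illustratively, the sender spends one credit per posted send and
  banks any new credits the peer advertises back in h_credit; the
  structure and function names below are invented for this sketch,
  not the kernel's:

/* Illustrative only: per-connection send-credit accounting in the
 * spirit of the IB transport.  All names here are made up. */
struct conn_credits {
	int avail;		/* sends the peer has buffers for */
};

static int credits_take(struct conn_credits *c)
{
	if (c->avail <= 0)
		return 0;	/* caller must wait for more credits */
	c->avail--;
	return 1;
}

static void credits_grant(struct conn_credits *c, unsigned char h_credit)
{
	c->avail += h_credit;	/* peer advertised fresh receive room */
}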

Congestion

  Messages waiting in the receive queue on the receiving socket
  are accounted against the socket's SO_RCVBUF option value.  Only
  the payload bytes in the message are accounted for.  If the
  number of bytes queued equals or exceeds rcvbuf then the socket
  is congested.  All sends attempted to this socket's address
  should block or return -EWOULDBLOCK.

  Applications are expected to be reasonably tuned such that this
  situation very rarely occurs.  An application that encounters this
  "back-pressure" is considered buggy.

  This is implemented by having each node maintain bitmaps which
  indicate which ports on bound addresses are congested.  As the
  bitmap changes it is sent through all the connections which
  terminate in the local address of the bitmap which changed.

  The bitmaps are allocated as connections are brought up.  This
  avoids allocation in the interrupt handling path which queues
  messages on sockets.  The dense bitmaps let transports send the
  entire bitmap on any bitmap change reasonably efficiently.  This
  is much easier to implement than some finer-grained
  communication of per-port congestion.  The sender does a very
  inexpensive bit test to test if the port it's about to send to
  is congested or not.
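
  A sketch of that bit test: with 16-bit ports the whole map is a
  64 Kbit (8 KB) bitmap per remote address, and the send path simply
  indexes it (the type and helper names are invented for
  illustration):

#include <stdint.h>

#define CONG_MAP_BITS  (1 << 16)	/* one bit per 16-bit port */

/* Illustrative congestion map: one bit per destination port. */
struct cong_map {
	uint64_t bits[CONG_MAP_BITS / 64];
};

/* Cheap test done on every send: is this port congested? */
static int cong_test_bit(const struct cong_map *map, uint16_t port)
{
	return (map->bits[port / 64] >> (port % 64)) & 1;
}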

RDS Transport Layer
===================

As mentioned above, RDS is not IB-specific. Its code is divided
into a general RDS layer and a transport layer.

The general layer handles the socket API, congestion handling,
loopback, stats, usermem pinning, and the connection state machine.

The transport layer handles the details of the transport. The IB
transport, for example, handles all the queue pairs, work requests,
CM event handlers, and other Infiniband details.

RDS Kernel Structures
=====================

struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.
struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.
struct rds_socket
    per-socket information
struct rds_connection
    per-connection information
struct rds_transport
    pointers to transport-specific functions
struct rds_statistics
    non-transport-specific statistics
struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
ERROR states.

The first time an attempt is made by an RDS socket to send data to
a node, a connection is allocated and connected. That connection is
then maintained forever -- if there are transport errors, the
connection will be dropped and re-established.

Dropping a connection while packets are queued will cause queued or
partially-sent datagrams to be retransmitted when the connection is
re-established.
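
These states correspond to an enum along the following lines (a
sketch; the exact spelling and ordering of the kernel's constants in
net/rds/rds.h are assumptions):

/* Sketch of the connection state machine named above. */
enum rds_conn_state {
	RDS_CONN_DOWN,		/* no connection; first send kicks one off */
	RDS_CONN_CONNECTING,	/* transport is establishing the connection */
	RDS_CONN_DISCONNECTING,	/* tearing down, e.g. after an error */
	RDS_CONN_UP,		/* established; sends and recvs flow */
	RDS_CONN_ERROR,		/* transport error; will be re-established */
};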

The send path
=============

rds_sendmsg()
    struct rds_message built from incoming data
    CMSGs parsed (e.g. RDMA ops)
    transport connection alloced and connected if not already
    rds_message placed on send queue
    send worker awoken
rds_send_worker()
    calls rds_send_xmit() until queue is empty
rds_send_xmit()
    transmits congestion map if one is pending
    may set ACK_REQUIRED
    calls transport to send either non-RDMA or RDMA message
    (RDMA ops never retransmitted)
rds_ib_xmit()
    allocs work requests from send ring
    adds any new send credits available to peer (h_credits)
    maps the rds_message's sg list
    piggybacks ack
    populates work requests
    post send to connection's queue pair

The recv path
=============

rds_ib_recv_cq_comp_handler()
    looks at work completions
    unmaps recv buffer from device
    no errors, call rds_ib_process_recv()
    refill recv ring
rds_ib_process_recv()
    validate header checksum
    copy header to rds_ib_incoming struct if start of a new datagram
    add to ibinc's fraglist
    if completed datagram:
        update cong map if datagram was cong update
        call rds_recv_incoming() otherwise
        note if ack is required
rds_recv_incoming()
    drop duplicate packets
    respond to pings
    find the sock associated with this datagram
    add to sock queue
    wake up sock
    do some congestion calculations
rds_recvmsg()
    copy data into user iovec
    handle CMSGs
    return to application

Multipath RDS (mprds)
=====================

Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
(though the concept can be extended to other transports). The classical
implementation of RDS-over-TCP multiplexes multiple PF_RDS sockets
between any 2 endpoints (where endpoint == [IP address, port]) over
a single TCP socket between the 2 IP addresses involved. This has
the limitation that it ends up funneling multiple RDS flows over a
single TCP flow, thus it is
(a) upper-bounded by the single-flow bandwidth, and
(b) subject to head-of-line blocking across all the RDS sockets.

Better throughput (for a fixed small packet size, MTU) can be achieved
by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
connection. RDS sockets will be attached to a path based on some hash
(e.g., of local address and RDS port number) and packets for that RDS
socket will be sent over the attached path using TCP to segment/reassemble
RDS datagrams on that path.

Multipathed RDS is implemented by splitting the struct rds_connection into
a common (to all paths) part, and a per-path struct rds_conn_path. All
I/O workqs and reconnect threads are driven from the rds_conn_path.
Transports such as TCP that are multipath capable may then set up a
TCP socket per rds_conn_path, and this is managed by the transport via
the transport-private cp_transport_data pointer.

Transports announce themselves as multipath capable by setting the
t_mp_capable bit during registration with the rds core module. When the
transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
across multiple paths. The outgoing hash is computed based on the
local address and port that the PF_RDS socket is bound to.
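
A sketch of such a path-selection hash; the mixing function and the
way the bound address and port are combined are illustrative, not the
kernel's exact computation:

#include <stdint.h>

/* Illustrative only: pick one of npaths TCP flows for a PF_RDS socket
 * from its bound address/port, so a given socket always uses the same
 * path. */
static unsigned int rds_pick_path(uint32_t bound_addr, uint16_t bound_port,
				  unsigned int npaths)
{
	uint32_t h = bound_addr ^ (((uint32_t)bound_port << 16) | bound_port);

	h ^= h >> 16;			/* cheap avalanche mixing */
	h *= 0x45d9f3b;
	h ^= h >> 16;

	return h % npaths;		/* index of the rds_conn_path */
}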

Additionally, even if the transport is MP capable, we may be
peering with some node that does not support mprds, or supports
a different number of paths. As a result, the peering nodes need
to agree on the number of paths to be used for the connection.
This is done via a control packet exchange before the first data
packet. The control packet exchange must have completed before
the outgoing hash is computed in rds_sendmsg() when the transport
is multipath capable.

The control packet is an RDS ping packet (i.e., a packet to rds dest
port 0) with the ping packet having an rds extension header option of
type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
number of paths supported by the sender. The "probe" ping packet will
get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
be able to compute the min(sender_paths, rcvr_paths). The pong
sent in response to a probe-ping should contain the rcvr's npaths
when the rcvr is mprds-capable.

If the rcvr is not mprds-capable, the exthdr in the ping will be
ignored.  In this case the pong will not have any exthdrs, so the sender
of the probe-ping can default to single-path mprds.
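
Putting the negotiation together, a hedged sketch of how a receiver
might fold an incoming NPATHS extension header into its path count;
everything here is illustrative (including the local path count)
except the RDS_EXTHDR_NPATHS concept described above:

#include <stdint.h>

#define MY_NPATHS 8	/* paths this node supports; value illustrative */

/* Illustrative only: on receiving a probe ping carrying an NPATHS
 * exthdr, both sides converge on min(sender_paths, rcvr_paths);
 * without the exthdr, fall back to a single path. */
static unsigned int agree_npaths(int got_npaths_exthdr,
				 uint16_t sender_npaths)
{
	if (!got_npaths_exthdr)
		return 1;	/* peer is not mprds-capable */

	return sender_npaths < MY_NPATHS ? sender_npaths : MY_NPATHS;
}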