\documentstyle[12pt,twoside]{article}
\def\TITLE{IPv6 Flow Labels}
\input preamble
\begin{center}
\Large\bf IPv6 Flow Labels in Linux-2.2.
\end{center}


\begin{center}
{ \large Alexey~N.~Kuznetsov } \\
\em Institute for Nuclear Research, Moscow \\
\verb|kuznet@ms2.inr.ac.ru| \\
\rm April 11, 1999
\end{center}

\vspace{5mm}

\tableofcontents

\section{Introduction.}

Every IPv6 packet carries 28 bits of flow information. RFC2460 splits
these bits into two fields: 8 bits of traffic class (or DS field, if you
prefer this term) and 20 bits of flow label. Currently there exists
no well-defined API to manage IPv6 flow information. In this document
I describe an attempt to design such an API for the Linux-2.2 IPv6 stack.

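To make the split concrete, here is a minimal sketch that unpacks a
32-bit flowinfo word into its two fields and packs them back. The helper
names and mask macros are hypothetical, not part of any kernel or libc API.
The masks assume the RFC2460 layout of the first header word (4 bits of
version, 8 bits of traffic class, 20 bits of flow label) and a flowinfo
value kept in network byte order.

\begin{verbatim}
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical helpers; the masks assume the RFC2460 header layout. */
#define FLOWINFO_TCLASS_MASK 0x0FF00000u  /* 8 bits of traffic class */
#define FLOWINFO_LABEL_MASK  0x000FFFFFu  /* 20 bits of flow label   */

/* Both accessors take flowinfo in network byte order
 * and return the field in host byte order. */
static uint8_t flowinfo_tclass(uint32_t flowinfo)
{
        return (uint8_t)((ntohl(flowinfo) & FLOWINFO_TCLASS_MASK) >> 20);
}

static uint32_t flowinfo_label(uint32_t flowinfo)
{
        return ntohl(flowinfo) & FLOWINFO_LABEL_MASK;
}

/* Compose a flowinfo word (network byte order) from both fields;
 * label bits above the low 20 are discarded. */
static uint32_t flowinfo_make(uint8_t tclass, uint32_t label)
{
        return htonl(((uint32_t)tclass << 20) |
                     (label & FLOWINFO_LABEL_MASK));
}
\end{verbatim}

The accessors return host byte order so that the results can be compared
or printed directly, while the composed word stays in network byte order.
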
\vskip 1mm

The API must solve the following tasks:

\begin{enumerate}

\item To allow the user to set traffic class bits.

\item To allow the user to read traffic class bits of received packets.
This feature is not as useful as the first one; however, it will be
necessary e.g.\ to implement ECN [RFC2481] for datagram-oriented services
or to implement the receiver side of SRP or another end-to-end protocol
using traffic class bits.

\item To assign flow labels to packets sent by the user.

\item To get flow labels of received packets. I do not know
of any applications of this feature, but it is possible that a receiver will
want to use flow labels to distinguish sub-flows.

\item To allocate flow labels in a way compliant with RFC2460. Namely:

\begin{itemize}
\item
Flow labels must be uniformly distributed (pseudo-)random numbers,
so that any subset of the 20 bits can be used as a hash key.

\item
Flows with coinciding source address and flow label must have identical
destination address and non-fragmentable extension headers (i.e.\ 
hop-by-hop options and all the headers up to and including the routing header,
if it is present).

\begin{NB}
There is a hole in the specs: some hop-by-hop options can be
defined only on a per-packet basis (e.g.\ the jumbo payload option).
Essentially, it means that such options cannot be present in packets
with flow labels.
\end{NB}
\begin{NB}
NB notes here and below reflect only my personal opinion;
they should be read with a smile or should not be read at all :-).
\end{NB}


\item
Flow labels have a finite lifetime, and the source is not allowed to reuse
a flow label for another flow until the maximal lifetime has expired,
so that intermediate nodes will be able to invalidate flow state before
the label is taken over by another flow.
Flow state, including lifetime, is propagated along the datagram path
by some application-specific methods
(e.g.\ in RSVP PATH messages or in some hop-by-hop option).


\end{itemize}

\end{enumerate}

\section{Sending/receiving flow information.}

\paragraph{Discussion.}
\addcontentsline{toc}{subsection}{Discussion}
It was proposed (Where? I do not remember any explicit statement)
to solve the first four tasks using the
\verb|sin6_flowinfo| field added to \verb|struct| \verb|sockaddr_in6|
(see RFC2553).

\begin{NB}
	This method is difficult to consider reasonable, because it
	puts additional overhead on all the services, although only
	a very small subset of them (none, to be more exact) really use it.
	It contradicts both the spirit and the letter of the IETF. Before
	RFC2553 one justification existed: IPv6 address alignment left a 4-byte
	hole in \verb|sockaddr_in6| in any case. Now it has no justification.
\end{NB}

We have two problems with this method. The first one is common to all OSes:
if \verb|recvmsg()| initializes \verb|sin6_flowinfo| to the flow info
of the received packet, we lose one very important property of the BSD
socket API: the received address can no longer be used for a reply directly
and has to be mangled first, even if we are not interested in flowinfo
subtleties.

\begin{NB}
	RFC2553 adds a new requirement: to clear \verb|sin6_flowinfo|.
	Certainly, it is not a solution but rather an attempt to force
	applications to do unnecessary work. Well, as usual, one mistake
	in design is followed by attempts to patch the hole and more mistakes...
\end{NB}

Another problem is Linux-specific. Historically Linux IPv6 did not
initialize \verb|sin6_flowinfo| at all, so that, if the kernel does not
support flow labels, this field is not zero, but a random number.
Some applications did not take care of it either.

\begin{NB}
Following RFC2553 such applications can be considered broken,
but I still think that they are right: clearing the whole address
before filling in the known fields is a robust but stupid solution.
Uselessly wasting CPU cycles and
memory bandwidth is not a good idea. Such patches are acceptable
as temporary hacks, but not as a standard of the future.
\end{NB}


\paragraph{Implementation.}
\addcontentsline{toc}{subsection}{Implementation}
By default Linux IPv6 does not read the \verb|sin6_flowinfo| field,
assuming that common applications are not obliged to initialize it
and are permitted to consider it as pure alignment padding.
In order to tell the kernel that the application
is aware of this field, it is necessary to set the socket option
\verb|IPV6_FLOWINFO_SEND|.

\begin{verbatim}
  int on = 1;
  setsockopt(sock, SOL_IPV6, IPV6_FLOWINFO_SEND,
             (void*)&on, sizeof(on));
\end{verbatim}

The Linux kernel never fills the \verb|sin6_flowinfo| field when passing
a message to user space, though the kernels which support flow labels
initialize it to zero. If the user wants to get the received flowinfo, he
must set the option \verb|IPV6_FLOWINFO|; after this he will receive
flowinfo as an ancillary data object of type \verb|IPV6_FLOWINFO|
(cf.\ RFC2292).

\begin{verbatim}
  int on = 1;
  setsockopt(sock, SOL_IPV6, IPV6_FLOWINFO, (void*)&on, sizeof(on));
\end{verbatim}

Flowinfo received and latched by a connected TCP socket may also be fetched
with \verb|getsockopt()| \verb|IPV6_PKTOPTIONS| together with
other optional information.

Besides that, in the spirit of RFC2292 the option \verb|IPV6_FLOWINFO|
may be used as an alternative way to send flowinfo with \verb|sendmsg()| or
to latch it with \verb|IPV6_PKTOPTIONS|.

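The \verb|sendmsg()| variant can be sketched as follows. This is only an
illustration under stated assumptions: the helper \verb|attach_flowinfo|
is hypothetical, and the fallback values for \verb|SOL_IPV6| and
\verb|IPV6_FLOWINFO| are my assumption of what the Linux headers define;
prefer the definitions from \verb|<linux/in6.h>| where they are available.
The helper only builds the ancillary data object; it does not send anything.

\begin{verbatim}
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

#ifndef SOL_IPV6
#define SOL_IPV6 41        /* assumption: same value as IPPROTO_IPV6 */
#endif
#ifndef IPV6_FLOWINFO
#define IPV6_FLOWINFO 11   /* assumption: value from <linux/in6.h> */
#endif

/* Hypothetical helper: attach flowinfo (network byte order) to msg as
 * an ancillary data object of level SOL_IPV6 and type IPV6_FLOWINFO.
 * cbuf must be suitably aligned and hold at least
 * CMSG_SPACE(sizeof(uint32_t)) bytes. Returns 0, or -1 if it is
 * too small. */
static int attach_flowinfo(struct msghdr *msg, void *cbuf, size_t cbuflen,
                           uint32_t flowinfo)
{
        struct cmsghdr *c;

        if (cbuflen < CMSG_SPACE(sizeof(flowinfo)))
                return -1;
        memset(cbuf, 0, CMSG_SPACE(sizeof(flowinfo)));
        msg->msg_control = cbuf;
        msg->msg_controllen = CMSG_SPACE(sizeof(flowinfo));

        c = CMSG_FIRSTHDR(msg);
        c->cmsg_level = SOL_IPV6;
        c->cmsg_type = IPV6_FLOWINFO;
        c->cmsg_len = CMSG_LEN(sizeof(flowinfo));
        memcpy(CMSG_DATA(c), &flowinfo, sizeof(flowinfo));
        return 0;
}
\end{verbatim}

The caller is expected to fill \verb|msg_name| and \verb|msg_iov| as usual
and then call \verb|sendmsg()|. A non-zero flow label presumably has to be
leased first, as described in the section on flow label management.
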
\paragraph{Note about IPv6 options and destination address.}
\addcontentsline{toc}{subsection}{IPv6 options and destination address}
If \verb|sin6_flowinfo| contains a non-zero flow label, the
destination address in \verb|sin6_addr| and non-fragmentable
extension headers are ignored. Instead, the kernel uses the values
cached at flow setup (see below). However, for connected sockets
the kernel prefers the values set at connection time.

\paragraph{Example.}
\addcontentsline{toc}{subsection}{Example}
After setting the socket option \verb|IPV6_FLOWINFO|,
the flow label and DS field are received as an ancillary data object
of type \verb|IPV6_FLOWINFO| and level \verb|SOL_IPV6|.
In the cases when it is convenient to use \verb|recvfrom(2)|,
it is possible to replace the library variant with your own one,
along these lines:

\begin{verbatim}
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <linux/in6.h>    /* IPV6_FLOWINFO */

ssize_t recvfrom(int fd, void *buf, size_t len, int flags,
                 struct sockaddr *addr, socklen_t *addrlen)
{
  ssize_t cc;             /* signed, so that the error check works */
  char cbuf[128];
  struct cmsghdr *c;
  struct iovec iov = { buf, len };
  struct msghdr msg = { .msg_name = addr, .msg_namelen = *addrlen,
                        .msg_iov = &iov,  .msg_iovlen = 1,
                        .msg_control = cbuf,
                        .msg_controllen = sizeof(cbuf) };

  cc = recvmsg(fd, &msg, flags);
  if (cc < 0)
    return cc;
  /* Default to zero; overwritten if flowinfo was delivered. */
  ((struct sockaddr_in6*)addr)->sin6_flowinfo = 0;
  *addrlen = msg.msg_namelen;
  for (c=CMSG_FIRSTHDR(&msg); c; c = CMSG_NEXTHDR(&msg, c)) {
    if (c->cmsg_level != SOL_IPV6 ||
        c->cmsg_type != IPV6_FLOWINFO)
      continue;
    ((struct sockaddr_in6*)addr)->sin6_flowinfo = *(__u32*)CMSG_DATA(c);
  }
  return cc;
}
\end{verbatim}



\section{Flow label management.}

\paragraph{Discussion.}
\addcontentsline{toc}{subsection}{Discussion}
The requirements of RFC2460 are pretty tough. In particular, lifetimes
that extend across a reboot require storing allocated labels in stable
storage, so that a full implementation necessarily includes a user space
flow label manager. There are at least three different approaches:

\begin{enumerate}
\item {\bf ``Cooperative''. } We could leave flow label allocation wholly
to user space. When a user needs a label, he requests it from the manager
directly. The approach is valid, but like any ``cooperative'' approach it
suffers from security problems.

\begin{NB}
One idea is to disallow unprivileged users from allocating flow
labels, and instead to pass the socket to the manager via an
\verb|SCM_RIGHTS| control message, so that the manager will allocate
the label and assign it to the socket
itself. Hmm... the idea is interesting.
\end{NB}

\item {\bf ``Indirect''.} The kernel redirects requests to a user level daemon
and does not install the label until the daemon has acknowledged the request.
This approach is the most promising; it is especially pleasant to recognize
the parallel with the IPsec API [RFC2367,Craig]. Actually, it may share
the API with IPsec.

\item {\bf ``Stupid''.} To allocate labels in kernel space. It is the simplest
method, but it suffers from two serious flaws: first,
we cannot lease labels with lifetimes that extend across a reboot; second,
it is sensitive to DoS attacks. The kernel has to remember all the obsolete
labels until their expiration, and a malicious user may quickly eat up all
the flow label space.

\end{enumerate}

Certainly, I chose the most ``stupid'' method. It is the cheapest one
for the implementor (i.e.\ me), and, taking into account that flow labels
still have no serious applications, it is not worthwhile to work on a more
advanced API, especially since eventually we
will get it for free together with IPsec.


\paragraph{Implementation.}
\addcontentsline{toc}{subsection}{Implementation}
The socket option \verb|IPV6_FLOWLABEL_MGR| allows one to
request the flow label manager to allocate a new flow label, to reuse
an already allocated one or to delete an old flow label.
Its argument is \verb|struct| \verb|in6_flowlabel_req|:

\begin{verbatim}
struct in6_flowlabel_req
{
        struct in6_addr flr_dst;
        __u32           flr_label;
        __u8            flr_action;
        __u8            flr_share;
        __u16           flr_flags;
        __u16           flr_expires;
        __u16           flr_linger;
        __u32         __flr_reserved;
        /* Options in format of IPV6_PKTOPTIONS */
};
\end{verbatim}

\begin{itemize}

\item \verb|dst| is the IPv6 destination address associated with the label.

\item \verb|label| is the flow label value in network byte order. If it is
zero, the kernel will allocate a new pseudo-random number. Otherwise, the
kernel will try to lease the flow label requested by the user. In this case,
it is the user's task to provide the necessary flow label randomness.

\item \verb|action| is the requested operation. Currently, only three
operations are defined:

\begin{verbatim}
#define IPV6_FL_A_GET   0   /* Get flow label */
#define IPV6_FL_A_PUT   1   /* Release flow label */
#define IPV6_FL_A_RENEW 2   /* Update expire time */
\end{verbatim}

\item \verb|flags| are optional modifiers. Currently
only \verb|IPV6_FL_A_GET| has modifiers:

\begin{verbatim}
#define IPV6_FL_F_CREATE 1   /* Allowed to create new label */
#define IPV6_FL_F_EXCL   2   /* Fail if label already exists */
\end{verbatim}


\item \verb|share| defines who is allowed to reuse the same flow label.

\begin{verbatim}
#define IPV6_FL_S_NONE    0   /* Not defined */
#define IPV6_FL_S_EXCL    1   /* Label is private */
#define IPV6_FL_S_PROCESS 2   /* May be reused by this process */
#define IPV6_FL_S_USER    3   /* May be reused by this user */
#define IPV6_FL_S_ANY     255 /* Anyone may reuse it */
\end{verbatim}

\item \verb|linger| is a time in seconds. After the last user releases the
flow label, it will not be reused with a different destination and options
for at least this time. If \verb|share| is not \verb|IPV6_FL_S_EXCL|, the
label can still be shared by other sockets. The current implementation does
not allow an unprivileged user to set linger longer than 60 sec.

\item \verb|expires| is a time in seconds. The flow label will be kept for
at least this time, and it will not be destroyed before the user has released
it explicitly or closed all the sockets using it. The current implementation
does not allow an unprivileged user to set a timeout longer than 60 sec.
Privileged applications
MAY set longer lifetimes, but in this case they MUST save allocated
labels in stable storage and restore them after reboot before the first
application allocates a new flow.

\end{itemize}

This structure is followed by optional extension headers associated
with this flow label in the format of \verb|IPV6_PKTOPTIONS|. Only
\verb|IPV6_HOPOPTS|, \verb|IPV6_RTHDR| and, if \verb|IPV6_RTHDR| is present,
\verb|IPV6_DSTOPTS| are allowed.

\paragraph{Example.}
\addcontentsline{toc}{subsection}{Example}
The function \verb|get_flow_label| allocates
a private flow label.

\begin{verbatim}
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <linux/in6.h>    /* struct in6_flowlabel_req, IPV6_FL_* */

int get_flow_label(int fd, struct sockaddr_in6 *dst, __u32 fl)
{
        int on = 1;
        struct in6_flowlabel_req freq;

        memset(&freq, 0, sizeof(freq));
        freq.flr_label = htonl(fl);
        freq.flr_action = IPV6_FL_A_GET;
        freq.flr_flags = IPV6_FL_F_CREATE | IPV6_FL_F_EXCL;
        freq.flr_share = IPV6_FL_S_EXCL;
        memcpy(&freq.flr_dst, &dst->sin6_addr, sizeof(freq.flr_dst));
        if (setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                       &freq, sizeof(freq)) == -1) {
                perror ("can't lease flowlabel");
                return -1;
        }
        /* freq.flr_label holds the leased label, network byte order. */
        dst->sin6_flowinfo |= freq.flr_label;

        if (setsockopt(fd, SOL_IPV6, IPV6_FLOWINFO_SEND,
                       &on, sizeof(on)) == -1) {
                perror ("can't send flowinfo");

                /* Give the label back, we cannot use it. */
                freq.flr_action = IPV6_FL_A_PUT;
                setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                           &freq, sizeof(freq));
                return -1;
        }
        return 0;
}
\end{verbatim}

A somewhat more complicated example using a routing header can be found
in the \verb|ping6| utility (\verb|iputils| package). The Linux rsvpd backend
contains an example of using the operation \verb|IPV6_FL_A_RENEW|.

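For orientation, a renewal request might look like the sketch below. It
is not taken from rsvpd. The helper names \verb|fill_renew_req| and
\verb|renew_flow_label| are hypothetical, and the sketch assumes that
\verb|IPV6_FL_A_RENEW| identifies the label by \verb|flr_label| and takes
the new lifetime from \verb|flr_expires|, which is my reading of the
``Update expire time'' operation above rather than something this document
spells out.

\begin{verbatim}
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/in6.h>    /* struct in6_flowlabel_req, IPV6_FL_A_RENEW */

/* Hypothetical helper: prepare a request that renews the lease of
 * `label` (network byte order) for `expires` seconds. */
static void fill_renew_req(struct in6_flowlabel_req *freq,
                           __u32 label, __u16 expires)
{
        memset(freq, 0, sizeof(*freq));
        freq->flr_label = label;
        freq->flr_action = IPV6_FL_A_RENEW;
        freq->flr_expires = expires;
}

/* Hypothetical helper: returns 0 on success, -1 with errno set
 * on failure, exactly as setsockopt() does. */
static int renew_flow_label(int fd, __u32 label, __u16 expires)
{
        struct in6_flowlabel_req freq;

        fill_renew_req(&freq, label, expires);
        return setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                          &freq, sizeof(freq));
}
\end{verbatim}

Building the request is kept separate from the \verb|setsockopt()| call
so that the request can be inspected without a socket.
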
\paragraph{Listing flow labels.} 
\addcontentsline{toc}{subsection}{Listing flow labels}
The list of currently allocated
flow labels may be read from \verb|/proc/net/ip6_flowlabel|.

\begin{verbatim}
Label S Owner Users Linger Expires Dst                              Opt
A1BE5 1 0     0     6      3       3ffe2400000000010a0020fffe71fb30 0
\end{verbatim}

\begin{itemize}
\item \verb|Label| is the hexadecimal flow label value.
\item \verb|S| is the sharing style.
\item \verb|Owner| is the ID of the creator; it is zero, a pid or a uid,
		depending on the sharing style.
\item \verb|Users| is the number of applications using the label now.
\item \verb|Linger| is the \verb|linger| of this label in seconds.
\item \verb|Expires| is the time until expiration of the label in seconds.
	It may be negative if the label is in use.
\item \verb|Dst| is the IPv6 destination address.
\item \verb|Opt| is the length of the options associated with the label.
	Option data are not accessible.
\end{itemize}


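A line of this listing can be parsed with a single \verb|sscanf()| call.
The sketch below assumes exactly the eight whitespace-separated columns
shown above; other kernel versions may print a different layout. The
record type \verb|struct fl_entry| and the function \verb|parse_fl_line|
are hypothetical names introduced for illustration.

\begin{verbatim}
#include <stdio.h>

/* Hypothetical record type mirroring the columns shown above. */
struct fl_entry {
        unsigned int label;     /* Label, hexadecimal           */
        unsigned int share;     /* S, sharing style             */
        unsigned int owner;     /* Owner: zero, pid or uid      */
        unsigned int users;     /* Users                        */
        unsigned int linger;    /* Linger, seconds              */
        long         expires;   /* Expires, seconds, may be < 0 */
        char         dst[33];   /* Dst, 32 hex digits + NUL     */
        unsigned int optlen;    /* Opt, length of options       */
};

/* Parse one data line. Returns 0 on success, -1 on malformed input.
 * The header line is rejected too, because "Label" is not a
 * hexadecimal number. */
static int parse_fl_line(const char *line, struct fl_entry *e)
{
        int n = sscanf(line, "%x %u %u %u %u %ld %32s %u",
                       &e->label, &e->share, &e->owner, &e->users,
                       &e->linger, &e->expires, e->dst, &e->optlen);
        return n == 8 ? 0 : -1;
}
\end{verbatim}

A caller would read \verb|/proc/net/ip6_flowlabel| line by line with
\verb|fgets()| and skip the lines for which the function returns $-1$.
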
\paragraph{Flow labels and RSVP.} 
\addcontentsline{toc}{subsection}{Flow labels and RSVP}
The RSVP daemon supports IPv6 flow labels
without any modifications to the standard ISI RAPI. The sender must allocate
a flow label, fill in the corresponding sender template and submit it to the
local rsvp daemon. rsvpd will check the label and start to announce it in PATH
messages. The rsvpd on the sender node will renew the flow label, so that it
will not be reused before path state expires and all the intermediate
routers and the receiver purge flow state.

The \verb|rtap| utility is modified to parse flow labels. E.g.\ if the user
has allocated flow label \verb|0xA1234|, he may write:

\begin{verbatim}
RTAP> sender 3ffe:2400::1/FL0xA1234 <Tspec>
\end{verbatim}

The receiver makes a reservation with the command:
\begin{verbatim}
RTAP> reserve ff 3ffe:2400::1/FL0xA1234 <Flowspec>
\end{verbatim}

\end{document}
