This section describes configuration files and module parameters.
LNet network hardware and routing are now configured via module parameters. Parameters should be specified in the /etc/modprobe.d/lustre.conf file, for example:
options lnet networks=tcp0(eth2)
The above option specifies that this node should use the TCP protocol on the eth2 network interface.
Module parameters are read when the module is first loaded. Type-specific LND modules (for instance, ksocklnd) are loaded automatically by the LNet module when LNet starts (typically upon modprobe ptlrpc).
LNet configuration parameters can be viewed under /sys/module/lnet/parameters/, and LND-specific parameters under the name of the corresponding LND, for example /sys/module/ksocklnd/parameters/ for the socklnd (TCP) LND.
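The directories above can be inspected with any tool that reads files. As a minimal sketch, assuming the usual Linux layout of one file per parameter under `/sys/module/<module>/parameters/`, the hypothetical helper `read_module_params` below collects the values into a dictionary. The function name and the `sys_root` argument are illustrative and not part of the Lustre software.

```python
from pathlib import Path


def read_module_params(module: str, sys_root: str = "/sys/module") -> dict:
    """Return {parameter_name: value} for a loaded kernel module.

    Returns an empty dict if the module is not loaded or exposes no parameters.
    """
    param_dir = Path(sys_root) / module / "parameters"
    params = {}
    if not param_dir.is_dir():
        return params
    for entry in sorted(param_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters may not be readable by the current user.
            params[entry.name] = "<unreadable>"
    return params


if __name__ == "__main__":
    # Print LNet parameters, then the socklnd (TCP) LND parameters.
    for module in ("lnet", "ksocklnd"):
        for name, value in read_module_params(module).items():
            print(f"{module}.{name} = {value}")
```

On a node where the modules are not loaded, the script prints nothing.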
For the following parameters, default option settings are shown in parentheses. Changes to parameters marked with a W affect running systems. Unmarked parameters can only be set when LNet loads for the first time. Changes to parameters marked with Wc only take effect when connections are established (existing connections are not affected by these changes).
- With routed or other multi-network configurations, use ip2nets rather than networks, so all nodes can use the same configuration.
- For a routed network, use the same 'routes' configuration everywhere. Nodes specified as routers automatically enable forwarding and any routes that are not relevant to a particular node are ignored. Keep a common configuration to guarantee that all nodes have consistent routing tables.
- A separate lustre.conf file makes distributing the configuration much easier.
- If you set config_on_load=1, LNet starts at modprobe time rather than waiting for the Lustre file system to start. This ensures routers start working at module load time.

      # lctl
      # lctl> net down

- Remember the lctl ping {nid} command; it is a handy way to check your LNet configuration.
This section describes LNet options.
Network topology module parameters determine which networks a node should join, whether it should route between these networks, and how it communicates with non-local networks.
Here is a list of various networks and the supported software stacks:
Network | Software Stack |
---|---|
o2ib | OFED Version 2 |
Note
The Lustre software ignores the loopback interface (lo0), but the Lustre file system uses any IP addresses aliased to the loopback (by default). When in doubt, explicitly specify networks.
ip2nets ("") is a string that lists globally-available networks, each with a set of IP address ranges. LNet determines the locally-available networks from this list by matching the IP address ranges with the local IPs of a node. The purpose of this option is to be able to use the same modules.conf file across a variety of nodes on different networks. The string has the following syntax:
<ip2nets> :== <net-match> [ <comment> ] { <net-sep> <net-match> }
<net-match> :== [ <w> ] <net-spec> <w> <ip-range> { <w> <ip-range> } [ <w> ]
<net-spec> :== <network> [ "(" <iface-list> ")" ]
<network> :== <nettype> [ <number> ]
<nettype> :== "tcp" | "elan" | "o2ib" | ...
<iface-list> :== <interface> [ "," <iface-list> ]
<ip-range> :== <r-expr> "." <r-expr> "." <r-expr> "." <r-expr>
<r-expr> :== <number> | "*" | "[" <r-list> "]"
<r-list> :== <range> [ "," <r-list> ]
<range> :== <number> [ "-" <number> [ "/" <number> ] ]
<comment> :== "#" { <non-net-sep-chars> }
<net-sep> :== ";" | "\n"
<w> :== <whitespace-chars> { <whitespace-chars> }
<net-spec> contains enough information to uniquely identify the network and load an appropriate LND. The LND determines the missing "address-within-network" part of the NID based on the interfaces it can use.
<iface-list> specifies which hardware interface the network can use. If omitted, all interfaces are used. LNDs that do not support the <iface-list> syntax cannot be configured to use particular interfaces and just use what is there. Only a single instance of these LNDs can exist on a node at any time, and <iface-list> must be omitted.
<net-match> entries are scanned in the order declared to see if one of the node's IP addresses matches one of the <ip-range> expressions. If there is a match, <net-spec> specifies the network to instantiate. Note that it is the first match for a particular network that counts. This can be used to simplify the match expression for the general case by placing it after the special cases. For example:
ip2nets="tcp(eth1,eth2) 134.32.1.[4-10/2]; tcp(eth1) *.*.*.*"
4 nodes on the 134.32.1.* network have 2 interfaces (134.32.1.{4,6,8,10}) but all the rest have 1.
ip2nets="o2ib 192.168.0.*; tcp(eth2) 192.168.0.[1,7,4,12]"
This describes an IB cluster on 192.168.0.*. Four of these nodes also have IP interfaces; these four could be used as routers.
Note that match-all expressions (for instance, *.*.*.*) effectively mask all other <net-match> entries specified after them. They should be used with caution.
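The matching rules above can be sketched in a few lines of Python. This is an illustration of the grammar and the first-match rule only, not the LNet implementation: the function names are hypothetical, and the sketch treats the network name literally (it does not know that tcp and tcp0 name the same network).

```python
from typing import Optional


def expand_r_expr(expr: str) -> Optional[set]:
    """Expand one <r-expr>; None means '*' (any octet value matches)."""
    if expr == "*":
        return None
    if expr.startswith("[") and expr.endswith("]"):
        values = set()
        for item in expr[1:-1].split(","):
            rng, _, stride = item.partition("/")
            lo, _, hi = rng.partition("-")
            values.update(range(int(lo), int(hi or lo) + 1, int(stride or 1)))
        return values
    return {int(expr)}


def ip_matches(ip: str, ip_range: str) -> bool:
    """Check a dotted-quad IP against one <ip-range> expression."""
    octets, exprs = ip.split("."), ip_range.split(".")
    if len(octets) != 4 or len(exprs) != 4:
        return False
    for octet, expr in zip(octets, exprs):
        allowed = expand_r_expr(expr)
        if allowed is not None and int(octet) not in allowed:
            return False
    return True


def local_networks(ip2nets: str, local_ips: list) -> list:
    """Return the <net-spec>s a node would instantiate (first match per network)."""
    chosen = {}
    for entry in ip2nets.replace("\n", ";").split(";"):
        entry = entry.split("#", 1)[0].strip()  # drop a trailing comment
        if not entry:
            continue
        net_spec, *ranges = entry.split()
        network = net_spec.split("(", 1)[0]
        if network in chosen:
            continue  # an earlier entry for this network already matched
        if any(ip_matches(ip, r) for ip in local_ips for r in ranges):
            chosen[network] = net_spec
    return list(chosen.values())


spec = "tcp(eth1,eth2) 134.32.1.[4-10/2]; tcp(eth1) *.*.*.*"
print(local_networks(spec, ["134.32.1.6"]))  # a node with two interfaces
print(local_networks(spec, ["134.32.1.5"]))  # falls through to the match-all entry
```

Because the special case is listed first, the node at 134.32.1.6 selects `tcp(eth1,eth2)`, while every other address reaches the match-all entry and selects `tcp(eth1)`.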
Here is a more complicated situation (the routes parameter is explained below). We have:
- Two TCP subnets
- One Elan subnet
- One machine set up as a router, with both TCP and Elan interfaces
- IP over Elan configured, but only IP will be used to label the nodes.
options lnet ip2nets="tcp 198.129.135.* 192.128.88.98; \
elan 198.128.88.98 198.129.135.3" \
routes='tcp 1022@elan # Elan NID of router; \
elan 198.128.88.98@tcp # TCP NID of router'
networks is an alternative to ip2nets that can be used to specify the networks to be instantiated explicitly. The syntax is a simple comma-separated list of <net-spec>s (see above). The default is only used if neither 'ip2nets' nor 'networks' is specified.
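Because an interface list may itself contain commas, a networks value cannot be split naively on commas. The sketch below shows one way to parse it into (network, interfaces) pairs, assuming only the <net-spec> grammar given earlier. `parse_networks` is a hypothetical helper for illustration, not a Lustre API.

```python
import re

# One <net-spec>: a network name, optionally followed by "(iface,iface,...)".
NET_SPEC = re.compile(r"\s*([a-z0-9]+)\s*(?:\(([^)]*)\))?\s*")


def parse_networks(value: str) -> list:
    """Parse a networks string into [(network, [interfaces]), ...]."""
    specs = []
    pos = 0
    while pos < len(value):
        match = NET_SPEC.match(value, pos)
        if match is None:
            raise ValueError(f"bad <net-spec> at offset {pos}: {value[pos:]!r}")
        network, ifaces = match.group(1), match.group(2)
        interfaces = [i.strip() for i in ifaces.split(",")] if ifaces else []
        specs.append((network, interfaces))
        pos = match.end()
        if pos < len(value):
            if value[pos] != ",":
                raise ValueError(f"expected ',' at offset {pos}")
            pos += 1  # skip the separator between <net-spec>s
    return specs


print(parse_networks("o2ib,tcp(eth1,eth2)"))
```

An empty interface list in the result corresponds to the "all interfaces are used" case described for <iface-list>.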
routes is a string that lists networks and the NIDs of routers that forward to them.
It has the following syntax (<w> is one or more whitespace characters):
<routes> :== <route>{ ; <route> }
<route> :== <net>[<w><hopcount>]<w><nid>[:<priority>]{<w><nid>[:<priority>]}
Note: the priority parameter was added in release 2.5.
For example, a node on the network tcp1 that needs to go through a router to reach the Elan network could be configured as follows:
options lnet networks=tcp1 routes="elan 1 192.168.2.2@tcp1"
The hopcount and priority numbers are used to help choose the best path between multiply-routed configurations.
A simple but powerful expansion syntax is provided, both for target networks and router NIDs as follows.
<expansion> :== "[" <entry> { "," <entry> } "]"
<entry> :== <numeric range> | <non-numeric item>
<numeric range> :== <number> [ "-" <number> [ "/" <number> ] ]
The expansion is a list enclosed in square brackets. Numeric items in the list may be a single number, a contiguous range of numbers, or a strided range of numbers. For example, routes="elan 192.168.1.[22-24]@tcp" says that network elan0 is adjacent (hopcount defaults to 1) and is accessible via 3 routers on the tcp0 network (192.168.1.22@tcp, 192.168.1.23@tcp and 192.168.1.24@tcp).
routes="[tcp,o2ib] 2 [8-14/2]@elan" says that 2 networks (tcp0 and o2ib0) are accessible through 4 routers (8@elan, 10@elan, 12@elan and 14@elan). The hopcount of 2 means that traffic to both these networks will traverse 2 routers: first one of the routers specified in this entry, then one more.
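The expansion syntax can be illustrated with a short Python sketch that turns a bracketed string into the list of strings it stands for. The helper names are hypothetical. The handling of several bracket groups in one string (a cross product) is an assumption made for the sketch and is not stated by the text above.

```python
import itertools
import re


def expand_group(body: str) -> list:
    """Expand the inside of one [...] group into a list of strings."""
    items = []
    for entry in body.split(","):
        entry = entry.strip()
        numeric = re.fullmatch(r"(\d+)(?:-(\d+)(?:/(\d+))?)?", entry)
        if numeric:
            lo = int(numeric.group(1))
            hi = int(numeric.group(2) or lo)
            stride = int(numeric.group(3) or 1)
            items.extend(str(n) for n in range(lo, hi + 1, stride))
        else:
            items.append(entry)  # non-numeric item, e.g. a network name
    return items


def expand(text: str) -> list:
    """Expand every [...] group in text into concrete strings."""
    # re.split with one capture group alternates: literal, group body, literal, ...
    parts = re.split(r"\[([^\]]*)\]", text)
    choices = [
        expand_group(part) if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand("192.168.1.[22-24]@tcp"))
print(expand("[tcp,o2ib]"))
print(expand("[8-14/2]@elan"))
```

The three calls reproduce the router and network lists from the two routes examples above.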
Duplicate entries, entries that route to a local network, and entries that specify routers on a non-local network are ignored.
Prior to release 2.5, a conflict between equivalent entries was resolved in favor of the route with the shorter hopcount. The hopcount, if omitted, defaults to 1 (the remote network is adjacent).
Since release 2.5, equivalent entries are resolved in favor of the route with the lowest priority number or shorter hopcount if the priorities are equal. The priority, if omitted, defaults to 0. The hopcount, if omitted, defaults to 1 (the remote network is adjacent).
It is an error to specify routes to the same destination with routers on different local networks.
If the target network string contains no expansions, then the hopcount defaults to 1 and may be omitted (that is, the remote network is adjacent). In practice, this is true for most multi-network configurations. It is an error to specify an inconsistent hop count for a given target network. This is why an explicit hopcount is required if the target network string specifies more than one network.
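The ordering rule for equivalent entries can be written as a sort key. The sketch below only illustrates the stated ordering (lowest priority number first, then shorter hopcount, with the documented defaults). The `Route` class and `best_route` function are hypothetical and do not reflect how LNet stores or balances routes internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    router_nid: str
    hopcount: int = 1  # documented default: the remote network is adjacent
    priority: int = 0  # documented default since release 2.5


def best_route(routes, use_priority=True):
    """Pick the preferred route among equivalent entries.

    use_priority=True  : release 2.5 and later (priority, then hopcount)
    use_priority=False : earlier releases (hopcount only)
    """
    if use_priority:
        return min(routes, key=lambda r: (r.priority, r.hopcount))
    return min(routes, key=lambda r: r.hopcount)


candidates = [
    Route("192.168.1.22@tcp", hopcount=1, priority=1),
    Route("192.168.1.23@tcp", hopcount=2, priority=0),
]
print(best_route(candidates).router_nid)
print(best_route(candidates, use_priority=False).router_nid)
```

With priorities in effect the second router wins despite its longer hopcount. Under the older rule the first router wins on hopcount alone.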
forwarding is a string that can be set to either "enabled" or "disabled" for explicit control of whether this node should act as a router, forwarding communications between all local networks.
A standalone router can be started by simply starting LNet ('modprobe ptlrpc') with appropriate network topology options.
The acceptor is a TCP/IP service that some LNDs use to establish communications. If a local network requires it and it has not been disabled, the acceptor listens on a single port for connection requests that it redirects to the appropriate local network. The acceptor is part of the LNet module and configured by the following options:
Variable | Description |
---|---|
accept (secure) | The type of connections that the acceptor will allow from remote nodes. • secure: Accept connections only from reserved TCP ports (below 1023). This is the default, and prevents userspace processes from trying to connect to the server. • all: Accept connections from any TCP port. This may be needed to allow connections on nonprivileged ports, for example from a client in a virtual machine running in userspace. • none: Do not run the acceptor. This may prevent the client from receiving server RPCs if the TCP connection is lost and the server needs to contact the client for some reason (e.g. LDLM lock callback or size glimpse). |
accept_port (988) | Port number on which the acceptor should listen for connection requests. All nodes in a site configuration that require an acceptor must use the same port. |
accept_backlog (127) | Maximum length that the queue of pending connections may grow to (see listen(2)). |
accept_timeout (5, W) | Maximum time in seconds the acceptor is allowed to block while communicating with a peer. |
accept_proto_version | Version of the acceptor protocol that should be used by outgoing connection requests. It defaults to the most recent acceptor protocol version, but it may be set to the previous version to allow the node to initiate connections with nodes that only understand that version of the acceptor protocol. The acceptor can, with some restrictions, handle either version (that is, it can accept connections from both 'old' and 'new' peers). For the current version of the acceptor protocol (version 1), the acceptor is compatible with old peers if it is only required by a single local network. |
rnet_htable_size is an integer that indicates how many remote networks the internal LNet hash table is configured to handle. rnet_htable_size is used for optimizing the hash table size and does not put a limit on how many remote networks you can have. The default hash table size when this parameter is not specified is 128.
The SOCKLND kernel TCP/IP LND (socklnd) is connection-based and uses the acceptor to establish communications via sockets with its peers.
It supports multiple instances and load balances dynamically over multiple interfaces. If no interfaces are specified by the ip2nets or networks module parameter, all non-loopback IP interfaces are used. The address-within-network is determined by the address of the first IP interface an instance of the socklnd encounters.
Consider a node on the 'edge' of an InfiniBand network, with a low-bandwidth management Ethernet (eth0), IP over IB configured (ipoib0), and a pair of GigE NICs (eth1, eth2) providing off-cluster connectivity. This node should be configured with 'networks=o2ib,tcp(eth1,eth2)' to ensure that the socklnd ignores the management Ethernet and IPoIB.
Variable | Description |
---|---|
timeout (50,W) | Time (in seconds) that communications may be stalled before the LND completes them with failure. |
nconnds (4) | Sets the number of connection daemons. |
min_reconnectms (1000,W) | Minimum connection retry interval (in milliseconds). After a failed connection attempt, this is the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnectms'. |
max_reconnectms (6000,W) | Maximum connection retry interval (in milliseconds). |
eager_ack (0 on linux, 1 on darwin,W) | Boolean that determines whether the socklnd should attempt to flush sends on message boundaries. |
typed_conns (1,Wc) | Boolean that determines whether the socklnd should use different sockets for different types of messages. When clear, all communication with a particular peer takes place on the same socket. Otherwise, separate sockets are used for bulk sends, bulk receives and everything else. |
min_bulk (1024,W) | Determines when a message is considered "bulk". |
tx_buffer_size, rx_buffer_size (8388608,Wc) | Socket buffer sizes. Setting this option to zero (0) allows the system to auto-tune buffer sizes. Warning: Be very careful changing this value as improper sizing can harm performance. |
nagle (0,Wc) | Boolean that determines whether Nagle's algorithm should be enabled. It should never be set in production systems. |
keepalive_idle (30,Wc) | Time (in seconds) that a socket can remain idle before a keepalive probe is sent. Setting this value to zero (0) disables keepalives. |
keepalive_intvl (2,Wc) | Time (in seconds) to repeat unanswered keepalive probes. Setting this value to zero (0) disables keepalives. |
keepalive_count (10,Wc) | Number of unanswered keepalive probes before pronouncing socket (hence peer) death. |
enable_irq_affinity (0,Wc) | Boolean that determines whether to enable IRQ affinity. The default is zero (0). When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system and produces best performance with multiple NICs. Systems with multiple CPUs and a single NIC may see an increase in performance with this parameter disabled. |
zc_min_frag (2048,W) | Determines the minimum message fragment that should be considered for zero-copy sends. Increasing it above the platform's PAGE_SIZE disables all zero-copy sends. This option is not available on all platforms. |
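The interaction between min_reconnectms and max_reconnectms is a capped doubling schedule. As a rough sketch of the behavior described in the table (not the socklnd code itself), the hypothetical function below lists the wait before each successive retry using the default values.

```python
def reconnect_intervals(min_ms: int = 1000, max_ms: int = 6000, attempts: int = 5) -> list:
    """Return the wait (in ms) before each of the first `attempts` retries.

    The wait starts at min_ms and doubles after every failed attempt,
    but never exceeds max_ms.
    """
    intervals = []
    interval = min_ms
    for _ in range(attempts):
        intervals.append(interval)
        interval = min(interval * 2, max_ms)
    return intervals


print(reconnect_intervals())  # [1000, 2000, 4000, 6000, 6000]
```

With the defaults, the cap is reached on the fourth retry, after which every retry waits max_reconnectms.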