.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

When a data path is offloaded to hardware, many hardware-specific details
are not exposed to the user. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to gain visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

Hardware offload is expected to be transparent: the user should not be able
to distinguish between the hardware and the software implementation. As a
result, hardware specifics are hidden. In practice those details can be
significant and should be exposed in a standard way.

This problem becomes even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Because the
hardware and software models differ, some processes cannot be represented
correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot assume that the Forwarding Information Base (FIB) in hardware
looks like the kernel's Level Path Compression trie (LPC-trie).

In many situations, analyzing a system failure based solely on the kernel's
dump is not enough. Combining this data with complementary information about
the underlying hardware makes debugging easier; the same information is also
useful when debugging performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new; it was first used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should reflect the changes made by the standard
configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). A TCAM can be divided into
TCAM regions. Complex TC filters can have multiple rules with different
priorities and different lookup keys. Hardware TCAM regions, on the other
hand, have a predefined lookup key. Offloading the TC filter rules using the
TCAM engine can result in multiple TCAM regions being interconnected in a
chain (which may affect the data path latency). In response to a new TC
filter, new tables should be created describing those regions.

Model
=====

The ``DPIPE`` model introduces several objects:

* headers
* tables
* entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.
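
As a rough illustration of how the three objects relate, the sketch below
models them as plain Python data classes. The class names, attributes and
sample values are hypothetical and chosen for this example only; they are
not the kernel's ``devlink`` structures.

.. code:: python

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple


    @dataclass(frozen=True)
    class Field:
        """A named field inside a header (hypothetical model)."""
        name: str
        bitwidth: int


    @dataclass(frozen=True)
    class Header:
        """Describes a packet format and names the fields within it."""
        name: str
        fields: Tuple[Field, ...]


    @dataclass
    class Entry:
        """One row of a table: match values, action values and a counter."""
        index: int
        match_values: Dict[str, int]
        action_values: Dict[str, int]
        counter: int = 0


    @dataclass
    class Table:
        """Describes one hardware block and holds its current entries."""
        name: str
        size: int
        counters_enabled: bool = False
        entries: List[Entry] = field(default_factory=list)


    # Sample objects; the addresses and sizes are made up for the example.
    ipv4 = Header("ipv4", (Field("dst_addr", 32),))
    host = Table("local_host", size=4096, counters_enabled=True)
    host.entries.append(
        Entry(index=0,
              match_values={"meta.rif_port": 3, "ipv4.dst_addr": 0xC0000201},
              action_values={"ethernet.daddr": 0x001122334455})
    )

Here the table owns its entries, while headers are shared descriptions that
match and action keys refer to by name.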

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions, which can be allocated and freed dynamically.

``devlink-dpipe`` is generally not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

* ``table_get``: Receive a table's description.
* ``headers_get``: Receive a device's supported headers.
* ``entries_get``: Receive a table's current entries.
* ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

* ``matches_dump``: Dump the supported matches.
* ``actions_dump``: Dump the supported actions.
* ``entries_dump``: Dump the actual content of the table.
* ``counters_set_update``: Synchronize hardware with counters enabled or
  disabled.
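
The division of labour between a table's operations and run-time
registration can be sketched as follows. This is a userspace Python model,
not driver code: ``TableOps``, ``LocalHostOps`` and ``TableRegistry`` are
hypothetical names, and only the four method names are taken from the list
above.

.. code:: python

    from abc import ABC, abstractmethod
    from typing import Dict, List


    class TableOps(ABC):
        """Per-table operations, named after the list above (model only)."""

        @abstractmethod
        def matches_dump(self) -> List[str]: ...

        @abstractmethod
        def actions_dump(self) -> List[str]: ...

        @abstractmethod
        def entries_dump(self) -> List[dict]: ...

        @abstractmethod
        def counters_set_update(self, enable: bool) -> None: ...


    class LocalHostOps(TableOps):
        """Example table with one made-up entry."""

        def __init__(self) -> None:
            self.counters_enabled = False
            self._entries = [{"index": 0, "meta.rif_port": 3}]

        def matches_dump(self) -> List[str]:
            return ["meta.rif_port: exact", "ipv4.dst_addr: exact"]

        def actions_dump(self) -> List[str]:
            return ["ethernet.daddr: set"]

        def entries_dump(self) -> List[dict]:
            # Return copies so callers cannot alter the table content.
            return [dict(entry) for entry in self._entries]

        def counters_set_update(self, enable: bool) -> None:
            # A real driver would synchronize the hardware at this point.
            self.counters_enabled = enable


    class TableRegistry:
        """Stand-in for the set of tables a device currently exposes."""

        def __init__(self) -> None:
            self._tables: Dict[str, TableOps] = {}

        def register(self, name: str, ops: TableOps) -> None:
            if name in self._tables:
                raise ValueError(f"table {name!r} is already registered")
            self._tables[name] = ops

        def unregister(self, name: str) -> None:
            del self._tables[name]

        def names(self) -> List[str]:
            return sorted(self._tables)

        def counters_set(self, name: str, enable: bool) -> None:
            self._tables[name].counters_set_update(enable)


    ops = LocalHostOps()
    registry = TableRegistry()
    registry.register("local_host", ops)
    registry.counters_set("local_host", True)

Keeping registration separate from the operations mirrors the dynamic
behavior described earlier: a table can appear and disappear while its
operations stay unchanged.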

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported, because LPM is exactly the kind of process we wish to
describe in full detail. Examples of matches:

* ``field_exact``: Exact match on a specific field.
* ``field_exact_mask``: Exact match on a specific field after masking.
* ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).
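
The three example match types reduce to simple integer comparisons. The
following Python sketch shows one possible reading of them; the function
signatures, the inclusive range bounds and the ``(header, header index,
field)`` key layout are assumptions made for illustration.

.. code:: python

    import ipaddress
    from typing import Dict, Tuple

    # A field is identified by (header id, header index, field id). The
    # header index separates repeated headers of the same type, for example
    # the outer (0) and inner (1) IPv4 header of a tunneled packet.
    FieldRef = Tuple[str, int, str]


    def field_exact(value: int, key: int) -> bool:
        """Exact match on a field."""
        return value == key


    def field_exact_mask(value: int, key: int, mask: int) -> bool:
        """Exact match after masking both the field and the key."""
        return (value & mask) == (key & mask)


    def field_range(value: int, low: int, high: int) -> bool:
        """Range match; inclusive bounds are an assumption of this sketch."""
        return low <= value <= high


    def ip(text: str) -> int:
        """Convert a dotted-quad IPv4 address to an integer."""
        return int(ipaddress.ip_address(text))


    packet: Dict[FieldRef, int] = {
        ("ipv4", 0, "dst_addr"): ip("203.0.113.9"),  # outer (tunnel) header
        ("ipv4", 1, "dst_addr"): ip("192.0.2.55"),   # inner header
        ("udp", 0, "dst_port"): 4789,
    }

    # Does the inner destination fall inside 192.0.2.0/24?
    inner_dst = packet[("ipv4", 1, "dst_addr")]
    in_prefix = field_exact_mask(inner_dst, ip("192.0.2.0"),
                                 ip("255.255.255.0"))

Without the header index, both ``dst_addr`` fields would share the same
identifier and the match could not say which of the two it refers to.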

Action
------

Similar to matches, actions are kept primitive and close to hardware
operation. For example:

* ``field_modify``: Modify the field value.
* ``field_inc``: Increment the field value.
* ``push_header``: Add a header.
* ``pop_header``: Remove a header.
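
To make the four example actions concrete, the sketch below applies them to
a toy packet model in Python. The packet representation (a list of headers,
outermost first), the function signatures and the ``tunnel`` and ``meta``
headers are hypothetical; only the action names come from the list above.

.. code:: python

    from typing import Dict, List, Tuple

    # A packet is modeled as a stack of (header name, {field: value}) pairs,
    # outermost header first.
    Header = Tuple[str, Dict[str, int]]
    Packet = List[Header]


    def field_modify(pkt: Packet, pos: int, name: str, value: int) -> None:
        """Overwrite a field in the header at stack position ``pos``."""
        pkt[pos][1][name] = value


    def field_inc(pkt: Packet, pos: int, name: str) -> None:
        """Increment a field by one (bit-width wrap-around is not modeled)."""
        pkt[pos][1][name] += 1


    def push_header(pkt: Packet, header: Header) -> None:
        """Add a new outermost header."""
        pkt.insert(0, header)


    def pop_header(pkt: Packet) -> Header:
        """Remove and return the outermost header."""
        return pkt.pop(0)


    pkt: Packet = [("ethernet", {"daddr": 0}), ("meta", {"hop_count": 0})]
    field_modify(pkt, 0, "daddr", 0x001122334455)
    field_inc(pkt, 1, "hop_count")
    push_header(pkt, ("tunnel", {"id": 10}))
    outer = pop_header(pkt)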

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content
the interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline, which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }
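
The list-of-hash-tables idea can be sketched in a few lines of Python. This
is an IPv4-only illustration, not the Spectrum implementation: the
``HashLpm`` class, the route results and the addresses are hypothetical, and
the sketch only probes prefix lengths that actually hold routes, starting
from the longest one.

.. code:: python

    import ipaddress
    from typing import Dict, Optional, Tuple


    class HashLpm:
        """LPM as one hash table per prefix length (IPv4 only)."""

        def __init__(self) -> None:
            # prefix length -> {masked network address: lookup result}
            self._tables: Dict[int, Dict[int, str]] = {}

        def add_route(self, prefix: str, result: str) -> None:
            net = ipaddress.ip_network(prefix)
            table = self._tables.setdefault(net.prefixlen, {})
            table[int(net.network_address)] = result

        def lookup(self, addr: str) -> Tuple[Optional[str], int]:
            """Return (result, number of hash tables probed)."""
            value = int(ipaddress.ip_address(addr))
            depth = 0
            # Walk from the longest prefix to the shortest; continue on miss.
            for plen in sorted(self._tables, reverse=True):
                depth += 1
                mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
                hit = self._tables[plen].get(value & mask)
                if hit is not None:
                    return hit, depth
            return None, depth


    lpm = HashLpm()
    lpm.add_route("192.0.2.1/32", "local_host")
    lpm.add_route("192.0.2.0/24", "adjacency:7")
    lpm.add_route("0.0.0.0/0", "adjacency:0")

    host_hit = lpm.lookup("192.0.2.1")       # found in the /32 table
    subnet_hit = lpm.lookup("192.0.2.9")     # misses /32, found in /24
    default_hit = lpm.lookup("198.51.100.1") # falls through to /0

The second element of each result is the search depth, which corresponds to
the latency effect mentioned above: the more tables are probed before a hit,
the longer the lookup takes.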

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table whose key combines the output interface ID with
the destination IP address. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

For remote routes, this table performs equal-cost multipath (ECMP)
selection. The LPM lookup yields the ECMP group size and an index that
serves as a global offset into this table. Concurrently a hash of the packet
is generated. Based on the ECMP group size and the packet's hash a local
offset is generated. Multiple LPM entries can point to the same adjacency
group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }
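
The offset arithmetic can be illustrated with a short Python sketch. The
text above does not say how the local offset is derived from the hash, so
the modulo reduction used here is an assumption, as are the CRC32 flow hash
and the function names.

.. code:: python

    import zlib


    def packet_hash(src: str, dst: str, sport: int, dport: int,
                    proto: int) -> int:
        """Hash a flow's 5-tuple (CRC32 is a stand-in for the ASIC's hash)."""
        key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
        return zlib.crc32(key)


    def adjacency_slot(adj_index: int, adj_group_size: int,
                       pkt_hash: int) -> int:
        """Global offset from the LPM result plus a hash-based local offset."""
        if adj_group_size <= 0:
            raise ValueError("ECMP group size must be positive")
        local_offset = pkt_hash % adj_group_size   # assumed reduction
        return adj_index + local_offset


    # A group of four next-hops starting at adjacency index 100.
    flow = packet_hash("192.0.2.1", "198.51.100.7", 40000, 443, 6)
    slot = adjacency_slot(100, 4, flow)

Because the hash depends only on the flow, every packet of a flow selects
the same adjacency entry, while different flows spread over the group.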

ERIF
----

Once the egress RIF and destination MAC have been resolved by the previous
tables, this table performs several operations, such as the TTL decrement
and the MTU check. The forward/drop decision is then made and the port L3
statistics are updated based on the packet's type (broadcast, unicast,
multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
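
A minimal Python sketch of the decision described above follows. The exact
drop conditions are not spelled out in this document, so the rules used here
(drop when the TTL is 1 or less, or when the packet exceeds the MTU) are
assumptions, and the ``erif`` function and ``stats`` counter are
hypothetical names.

.. code:: python

    from collections import Counter
    from typing import Tuple

    L3_TYPES = ("unicast", "broadcast", "multicast")

    # Per-type L3 statistics, keyed by (packet type, "forward" or "drop").
    stats: Counter = Counter()


    def erif(ttl: int, length: int, mtu: int,
             l3_type: str) -> Tuple[bool, int]:
        """Return (forward, new_ttl) and update the per-type statistics."""
        if l3_type not in L3_TYPES:
            raise ValueError(f"unknown L3 packet type: {l3_type!r}")
        # Assumed rules: the TTL must survive the decrement and the packet
        # must fit into the egress MTU.
        forward = ttl > 1 and length <= mtu
        new_ttl = ttl - 1 if forward else ttl
        stats[(l3_type, "forward" if forward else "drop")] += 1
        return forward, new_ttl


    result = erif(ttl=64, length=1500, mtu=1500, l3_type="unicast")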