
\section{Resilient routing in modern networks}
\label{sec:resilient_routing}
The main resiliency goals of a network provider are (1) producing as few failures as possible and (2) reducing the impact of each remaining failure, given that some failures are unavoidable.

Securing a software-defined network (SDN) poses additional challenges, but also offers additional flexibility compared to a traditional network. In a network without a controller, most routes are either entered manually or computed by a routing protocol such as Open Shortest Path First (OSPF) (\cite{Moy.041998}) or Intermediate System to Intermediate System (IS-IS) (\cite{Callon.121990}). If a routing protocol is used, the routers themselves handle the routing as defined by that protocol. In subsection~\ref{ospf} we explain OSPF, one of the traditional routing protocols, and its implications. We contrast this in subsection~\ref{sdn_advantages}, where we discuss the benefits and drawbacks of an SDN. Having established a basic overview of the architectural differences, we shift our focus to resilience on the data plane and on the control plane in subsection~\ref{resilience_data_plane} and subsection~\ref{resilience_control_plane}, respectively. While fast recovery methods (FRMs) mostly work on the data plane, they sit somewhere between the data and control plane and are often used in addition to methods on either plane. This is why we devote subsection~\ref{FRM} to explaining multiple FRMs and their applications.


\subsection{Traditional routing protocols using the example of OSPF}
\label{ospf}
OSPF is a link-state protocol: each router shares the link-state information it holds with its direct neighbours. Routers are split into pre-assigned areas, with one cluster of routers designated as the \textit{backbone area} and the rest divided into a multitude of areas and area types, summarized as \textit{nonbackbone} areas.
The backbone area serves as the central point of the network; all traffic that moves between other areas must flow through it. Because of this design, the backbone area would be a single point of failure if it consisted of only a single router, since a failure of that router would block all traffic between areas. Using multiple routers in the backbone area counters this issue through redundancy.

Each router shares all its information with its directly adjacent neighbours, provided that the adjacent router belongs to the same pre-defined area. After an introductory period in which neighbours are discovered using the \textit{hello protocol} and routing information is shared, each router in an area holds a database describing the network.
OSPF also uses the \textit{hello protocol} to monitor its neighbours: each router sends regular ``Hello'' messages to its neighbours at a predefined \textit{HelloInterval}, which defaults to \SI{10}{\second}. In addition, there is a \textit{RouterDeadInterval}, which should be a multiple of the \textit{HelloInterval} and is set to \SI{40}{\second} by default. If a router has not received a hello message from a neighbour for the duration of the dead interval, it assumes that neighbour to be down and shares this information.
If a network component, e.g. a link, is detected as faulty, the detecting router shares this information with its adjacent neighbours, which in turn pass the change on to all routers connected to them. This propagates updated routing information through the network and allows for a relatively fast convergence time.

However, communication between routers is still time-consuming, and the convergence of a link-state protocol like OSPF also depends on its fault detection mechanisms. A failing link might be recognized easily and quickly, but a failing router requires the network to first wait for the dead interval. With the default configuration, this means a downtime of \SIrange{30}{40}{\second} before the failure is even detected.
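
The range of \SIrange{30}{40}{\second} can be derived from the two timers. As a simplified sketch, assume that hello messages are sent exactly every $T_{\mathrm{hello}}$ and that transmission delays are negligible; the symbols $T_{\mathrm{hello}}$, $T_{\mathrm{dead}}$ and $T_{\mathrm{detect}}$ are introduced here for illustration only. A neighbour restarts its dead timer with every hello message it receives, so the time between a router failure and its detection depends on how long before the failure the last hello message was sent:
\[
	T_{\mathrm{dead}} - T_{\mathrm{hello}} < T_{\mathrm{detect}} \leq T_{\mathrm{dead}} .
\]
With the default values this gives
\[
	\SI{40}{\second} - \SI{10}{\second} = \SI{30}{\second} < T_{\mathrm{detect}} \leq \SI{40}{\second} .
\]
The upper bound is reached if the router fails immediately after sending a hello message; the lower bound is approached if it fails just before the next hello message would have been due.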

Because routers perform these tasks themselves, part of each router's internal resources, e.g. processing power and internal storage, has to be devoted to OSPF. The network itself also carries additional traffic because the routers need to share information. Without protective measures, overloaded components could therefore cause the network to break, as their processing power would no longer suffice to maintain OSPF.

\subsection{Security benefits and drawbacks of software-defined networks}
\label{sdn_advantages}
In contrast, a software-defined network uses one or more controllers as central management components. They consist of specialized hardware and software and use their own fault detection and failure recovery. Routers in such a network do not need to run complex routing protocol algorithms; these operations are taken over by the controller.

The main purpose of a controller is to create an abstraction layer of the network. This makes it easier for network administrators to collect information about the network and to deploy new router configurations, granting them considerable flexibility when deploying safety measures.

An SDN is, like any network, susceptible to denial-of-service (DOS) attacks. A controller, e.g. an OpenFlow controller, additionally introduces new variants of DOS attacks. One such attack is to send packets that are designed to cause a ``table-miss'', so that each of these packets is forwarded to the controller. In an experiment, \textcite{Alharbi.2017} found that the performance of the ONOS controller started to drop at only 2000 packets per second in an orchestrated attack. An SDN without a working controller will start dropping packets that cause a ``table-miss'', which prevents new flows from being created and therefore limits the network's functionality.

As such, SDNs introduce an additional single point of failure to the network. Controllers should be deployed redundantly and, if possible, each controller cluster should only control a part of the network. This prevents controller failures from impacting the whole network.

On the other hand, controllers allow for the easy implementation of more advanced safety measures. A lot of work has been done on DOS security in SDNs (\cite{Dridi.2016}, \cite{Kuerban.2016}).

\subsection{Resilience on the data plane}
\label{resilience_data_plane}

By definition, most methods providing additional network resilience on the data plane are network agnostic. Operations executed on the data plane can be assumed to be very fast, as the overhead produced by e.g. communication protocols is avoided.
One widely used method to add resilience is to install static routes directly on the components, which re-route traffic in case of failures. This is called local \textit{fast re-routing} (FRR) (\cite{Nelakuditi.2003}). FRR routes make use of link redundancy that already exists in the network.

\begin{figure}
	\centering
	\begin{subfigure}[b]{0.49\textwidth}
		\centering
		\includegraphics[width=\textwidth]{frr_exampleA}
		\caption{Network with FRR before a failure}
		\label{fig:basics_frr_before}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.49\textwidth}
		\centering
		\includegraphics[width=\textwidth]{frr_exampleB}
		\caption{Network with FRR after a failure}
		\label{fig:basics_frr_after}
	\end{subfigure}
	\caption{Example of local fast re-routing (FRR)}
	\label{fig:frr_example}
\end{figure}

FRR mostly uses interface-dependent routing: a router identifies returning packets by their incoming interface and routes them over an alternative path.
Because FRR acts on the data plane, it reacts very quickly and re-routes packets within milliseconds. Due to its agnostic nature, however, it will often create loops and therefore delays and additional traffic in the network.

The process is visualized in \cref{fig:frr_example}. In the example shown, a packet is sent from subnet 1 to subnet 4. The processing on each router is illustrated using router R1. The routing table shows only the entries relevant to this example; additional entries are omitted. It contains the destination \textit{subnet 4} multiple times, once for each interface on which such a packet can be received. In this example, if a packet to subnet 4 is received on interface \textit{r1-eth0}, which connects to subnet 1, the packet is sent over \textit{r1-eth1} to router R2.

If the connection between routers R2 and R4 is disrupted, router R2 redirects the packet back to router R1. Because router R1 has a separate routing table entry for packets to subnet 4 arriving on \textit{r1-eth1}, it now sends the packet out on interface \textit{r1-eth2}. The new route contains a loop, as can be seen in \cref{fig:frr_example}. While the new route might not be optimal, it takes effect as soon as a router has recognized a failure and redirected a packet back to the interface it arrived on. In this example, the detour adds only the small delay of two transmission times. In general, the additional delay depends on the network and the level of redundancy used. Loops have further implications for the network and are among the aspects of failure handling that can be optimized, which is discussed in \cref{FRM}.
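
As a rough sketch of this delay, let $t_{12}$ denote the transmission time over the link between R1 and R2. The symbol is introduced here for illustration, assuming that the link is equally fast in both directions and that processing times on the routers are negligible. Compared to sending the packet out on \textit{r1-eth2} directly, the bounce over R2 adds
\[
	\Delta t = t_{12} + t_{12} = 2\,t_{12} .
\]
If a packet had to traverse $k$ links with transmission times $t_1, \dots, t_k$ before being returned along the same links, the additional delay would, under the same assumptions, grow to
\[
	\Delta t = 2 \sum_{i=1}^{k} t_i .
\]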

The conditions for using FRR are easy to fulfil and most networks will already be compatible with it. A certain level of redundancy is required, as components using FRR need alternative routes to take.

\subsection{Resilience on the control plane}
\label{resilience_control_plane}
In contrast, an SDN controller reacts to failures by collecting information about the network and calculating near-optimal alternative routes. It then rewrites the routing tables on each affected component using protocols like OpenFlow. OSPF also works on the control plane, as the routers collect information about the network.

Operations on the control plane are very thorough, as decisions are made based on an overview of the network. The collection of information and the deployment of solutions, however, are very time-consuming. A failure in a network that only uses control plane mechanisms for failure handling remains unattended during the whole process, potentially creating backlog or reducing availability for a longer period of time. Operations on the control plane mostly have convergence times on the order of seconds (\cite{Liu.2013}).

This is why most modern networks use a combination of mechanisms on the data \emph{and} control plane, e.g. FRR and a global convergence protocol: sub-optimal paths restore availability immediately, while the global convergence protocol provides an optimised routing after some time.


\subsection{Fast Recovery Methods}
\label{FRM}
Fast Recovery Methods are operations that take place on the data plane. A combination of operations on the data \emph{and} control plane inevitably creates a time gap: while the alternative route is established within milliseconds through e.g. FRR, the operation on the control plane, e.g. a controller calculating optimal routes, takes on the order of seconds. During this gap the network uses a sub-optimal route for its traffic.

Loops created by e.g. FRR affect the network during this time frame; they not only potentially delay traffic but also occupy scarce link capacity on looped routes.

As such, FRMs can be perceived as optimizations of data plane mechanisms like FRR. Because FRR is very prominent in networks, we use FRR and FRMs that optimize FRR as our main examples. We selected a few of the existing FRMs and explain them in the following. A more thorough survey of existing FRMs can be found in \cite{Chiesa.2021}.

\subsubsection{Resilient Routing Layers}

One core issue of FRR is that the routes it creates are inherently agnostic. Routers base their routing decision on no information other than the incoming interface and the destination. This limits routing options: each combination of incoming interface and destination network can be mapped to only one outgoing interface. This can cause loops as seen in \cref{fig:frr_example}, because each route has to be ``checked'' and has to return a packet before the next route becomes active.
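
This limitation can be made explicit by writing the lookup as a function; the notation is introduced here for illustration and is not taken from the cited works. Let $I_R$ be the set of interfaces of a router $R$ and $D$ the set of destination networks. The FRR routing table of $R$ then corresponds to a mapping
\[
	f_R \colon I_R \times D \to I_R ,
\]
which assigns at most one outgoing interface to each pair of incoming interface and destination. For router R1 in \cref{fig:frr_example}, the two entries discussed in \cref{resilience_data_plane} read
\[
	f_{\mathrm{R1}}(\mbox{r1-eth0}, \mbox{subnet 4}) = \mbox{r1-eth1}, \qquad
	f_{\mathrm{R1}}(\mbox{r1-eth1}, \mbox{subnet 4}) = \mbox{r1-eth2} .
\]
Since $f_R$ takes no further argument, such as knowledge about failed links elsewhere in the network, a packet arriving on \textit{r1-eth0} is always sent to R2 first, even while the link between R2 and R4 is down.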

Solving the issue of sub-optimal paths on the controller, however, means that the calculation and roll-out of a solution can take several seconds. \textcite{Kvalbein.2005} proposed an alternative solution.

Instead of using a single routing for the network, each router stores multiple routing tables, with each routing table belonging to a so-called routing layer. All routing layers must be able to reach all entry and exit points of the network.

\begin{figure}
	\centering
	\includegraphics[width=9cm]{rlayers_visualization}
	\caption{Resilient routing layers - visualization}
  	\label{fig:rl_visual}
\end{figure}

For example, a link is considered ``safe'' if there is at least one layer in which the link is not included. This can be extended to devices such as routers. Routing layers protecting links are shown in the example in figure~\ref{fig:rl_visual}. While e.g. the link from R1 to R2 is included in routing layer 0, it is not included in routing layer 1. If this link failed, the network could switch all routers to routing layer 1, circumventing the failure. This requires routers to manipulate packets by adding an identifier that determines the routing layer to be used.
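
The notion of a safe link can be stated more formally; the following notation is an assumption made for illustration and may differ from the one used by \textcite{Kvalbein.2005}. Let the network be a graph $G = (V, E)$ with routers $V$ and links $E$, and let layer $\ell \in \{0, \dots, L-1\}$ be the subgraph $G_\ell = (V, E_\ell)$ with $E_\ell \subseteq E$. In every layer, all entry and exit points of the network have to remain reachable. A link $e \in E$ is then safe if
\[
	\exists\, \ell \in \{0, \dots, L-1\} \colon \; e \notin E_\ell .
\]
In the example, the link $e$ between R1 and R2 satisfies $e \in E_0$ and $e \notin E_1$, so layer 1 can be used when this link fails.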

Layers can be defined either manually or by algorithms taking configurable parameters. A more robust, low-relevance section of the network could receive fewer layers, while a high-availability, high-relevance section would receive more layers and therefore higher robustness against failures.

Because each router needs a routing table for each layer, the number of routing tables, and therefore the memory allocated on the routers, grows with the number of layers. If e.g. each router had its own safe layer, the number of routing tables on each router would equal the total number of routers in the network.
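
This growth can be quantified in a simple sketch. The symbols are introduced here for illustration, and the estimate assumes that every routing table holds one entry per destination. With $n$ routers, $L$ layers and $|D|$ destinations, each router stores $L$ routing tables, so the whole network stores
\[
	N_{\mathrm{tables}} = n \cdot L , \qquad N_{\mathrm{entries}} = n \cdot L \cdot |D|
\]
tables and entries, respectively. If every router has its own safe layer, $L = n$ and the number of tables grows quadratically with the size of the network, $N_{\mathrm{tables}} = n^2$.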

The manipulation of packets limits the applicability of this method, as existing parts of the network might not be compatible.


\subsubsection{ShortCut}
The authors of ShortCut (\cite{Shukla.2021}) propose to remove loops created by FRR, and thereby optimize the routes, by having a router edit its own existing flows. In addition to the interface-specific routing performed by FRR, ShortCut uses the incoming interface to identify packets that were returned to the router. By maintaining a priority list of flows for each port, it can then remove invalid entries, e.g. links that failed or routes that returned a packet.


\begin{figure}
  	\centering
	\begin{subfigure}[b]{0.49\textwidth}
		\centering
		\includegraphics[width=\textwidth]{shortcut_example_1}
		\caption{Network with FRR and a failure}
		\label{fig:basics_wo_shortcut}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.49\textwidth}
		\centering
		\includegraphics[width=\textwidth]{shortcut_example_2}
		\caption{Network with FRR and a failure, optimized by ShortCut}
		\label{fig:basics_w_shortcut}
	\end{subfigure}
	\caption{Concept of ShortCut}
	\label{fig:shortcut_example}
\end{figure}

\Cref{fig:basics_wo_shortcut} shows a network with pre-installed FRR routes on router R1 (in this example, an additional route for packets heading to H4 that arrive from router R2) and a failure in the link between routers R2 and R4. According to the routing table, the packet returning from R2 is routed to router R3. This results in a detour, or loop, from R1 to R2 and back, passing R1 twice. In \cref{fig:basics_w_shortcut} the routes on router R1 for packets from H1 have been edited. Because ShortCut recognized that packets to H4 return from R2, the route forwarding to router R2 is removed from the routing table. As entries in a routing table are evaluated from top to bottom, the alternative route on router R1 is now always used for packets to H4, effectively removing the loop.

This saves (1) the additional transmission times to router R2 and back, as well as (2) link capacity on the link between routers R1 and R2.

ShortCut is applicable to most network topologies as well as pre-existing FRR and global convergence stacks.

\subsubsection{Blink}

\subsubsection{Revive}
Revive (\cite{Haque.2018})