some changes

master
Frederik Maaßen 2 years ago
parent a707adc65d
commit 6b1b4c5780
  1. 2
      thesis/content/basics/failure_scenarios.tex
  2. 10
      thesis/content/basics/modern_networks.tex
  3. 29
      thesis/content/basics/resilient_routing.tex
  4. 36
      thesis/content/basics/test_environment.tex
  5. 2
      thesis/content/begin/abstract.tex
  6. 4
      thesis/content/begin/titlepage_english.tex
  7. 36
      thesis/content/conclusion/conclusion.tex
  8. 8
      thesis/content/evaluation/evaluation.tex
  9. 87
      thesis/content/evaluation/failure_path_networks.tex
  10. 62
      thesis/content/evaluation/minimal_network.tex
  11. 6
      thesis/content/implementation/implementation.tex
  12. 2
      thesis/content/implementation/resilient_routing_tables.tex
  13. 9
      thesis/content/implementation/shortcut_implementation.tex
  14. 24
      thesis/content/implementation/test_network.tex
  15. 28
      thesis/content/introduction.tex
  16. 24
      thesis/content/testing/testing.tex
  17. 12
      thesis/content/testing/topologies_and_routing.tex
  18. 752
      thesis/images/tests/failure_path_2_packet_flow/packet_flow_after_wo_sc.eps
  19. 1053
      thesis/images/tests/failure_path_2_packet_flow_udp/packet_flow_udp_after_wo_sc.eps
  20. 1042
      thesis/images/tests/failure_path_2_packet_flow_udp/packet_flow_udp_before_wo_sc.eps
  21. 872
      thesis/images/tests/minimal_packet_flow_udp/packet_flow_udp_concurrent_sc.eps
  22. 25
      thesis/thesis.tex

@ -5,7 +5,7 @@ Computer networks are a complex structure of many different interconnected devic
\subsection{Failures in data centers}
\label{ssec:failures_data_centers}
The reasons for failures are multifaceted and largely depend on the used network components, network structure and security measures. In 2011 a study by \textcite{Gill.2011} analysed failures that occurred in a network of large professional data centers over a year. They found that over their measurement period a mean of 5.2 devices experienced a failure per day, while a mean of 40.8 links experienced a failure per day. Most of these failures had a small, if any, impact on the network, as many security measures like redundancy or dynamic re-routing were already implemented. Still, many devices and links experienced downtimes and a huge amount of packets and data were lost. The downtimes and lost data could cause costs for the provider as it is common to define a service-level agreement (SLA), an agreement between service-provider and consumer, which defines levels of availability and reliability (besides other factors) the provider has to fulfil (\cite{Wustenhoff.2002}). If the agreed levels of availability and reliability are not met, the provider will pay a penalty. A lower downtime or faster convergence can therefore avoid additional costs.
The reasons for failures are multifaceted and largely depend on the used network components, network structure and security measures. In 2011 a study by \textcite{Gill.2011} analysed failures that occurred in a network of large professional data centers over a year. They found that over their measurement period a mean of 5.2 devices experienced a failure per day, while a mean of 40.8 links experienced a failure per day. Most of these failures had a small, if any, impact on the network, as many security measures like redundancy or dynamic re-routing were already implemented. Still, many devices and links experienced downtimes and a huge amount of packets and data were lost. The downtimes and lost data could cause costs for the provider as it is common to define a service-level agreement (SLA), an agreement between service-provider and consumer which defines levels of availability and reliability (besides other factors) the provider has to fulfil (\cite{Wustenhoff.2002}). If the agreed levels of availability and reliability are not met, the provider will pay a penalty. A lower downtime or faster convergence can therefore avoid additional costs.
\subsubsection{Failure causes}
When analysing the causes of the failures, most device failures were caused by maintenance operations on the network, for example if an aggregation switch was shut down temporarily and each connected switch and router was temporarily unreachable.

@ -2,13 +2,13 @@
\label{sec:modern_networks}
In our digital society networks have become an essential infrastructure for countries worldwide. A huge part of the population today is in some form reliant on the availability of networks and associated services, be it on their smartphone or their home internet access. As such, the reliability of a network has a huge impact on the economy and social life.
A study in 2015 (\cite{Montag.2015}) found that their participants used WhatsApp, a messaging service, for around 32 minutes a day, with an overall usage mean of their smartphone of around 162 minutes a day, mostly spent online. Private and commercial users alike cause a huge amount of traffic. The german federal network agency reported a per capita network usage on the terrestrial network of \SI{175}{\giga\byte} per month (\cite{BundesnetzagenturDeutschland.2021}) in Germany. The traffic per capita in 2017 was at around \SI{98}{\giga\byte} per month. Together with the rise of e.g. cloud based solutions for companies and , network requirements are expected to rise exponentially.
A study in 2015 (\cite{Montag.2015}) found that their participants used WhatsApp, a messaging service, for around 32 minutes a day, with an overall usage mean of their smartphone of around 162 minutes a day, mostly spent online. Private and commercial users alike cause a huge amount of traffic. The German federal network agency reported a per capita network usage on the terrestrial network of \SI{175}{\giga\byte} per month (\cite{BundesnetzagenturDeutschland.2021}) in Germany. The traffic per capita in 2017 was at around \SI{98}{\giga\byte} per month. Together with the rise of e.g. cloud-based solutions for companies, network requirements are expected to rise exponentially.
With bigger networks and higher traffic, the work of a network administrator is getting more complex by the day. Modern networks need to be flexible, scalable and reliable. Configuration changes should be applied near instantly, failures should be corrected as fast as possible and new components should only have to be connected to the existing network, with all additional configuration being applied automatically or only requiring a minimum amount of manual work.
With bigger networks and higher traffic the work of a network administrator is getting more complex by the day. Modern networks need to be flexible, scalable and reliable. Configuration changes should be applied near instantly, failures should be corrected as fast as possible and new components should only have to be connected to the existing network, with all additional configuration being applied automatically or only requiring a minimum amount of manual work.
In the past many of the challenges a network faces in modern times were solved by specialized networking devices which offered their own proprietary configuration capabilities, including interfaces, protocols and software. In many cases this would lead to companies using one supplier for the biggest part of their networks, as well as huge amounts of manual labour configuring each device.
While many networks are still configured manually, the current trend strives towards software-defined networks, taking complexity from single devices and instead using a component called controller to handle logical operations (\cite{Rak.2020}), like configuration of new devices, the roll out of configuration changes and monitoring. This allows for a better network management especially in networks with high availability constrains and flexibility requirements e.g. the commercial network of an internet provider.
While many networks are still configured manually, the current trend strives towards SDNs, taking complexity from single devices and instead using a component called controller to handle logical operations (\cite{Rak.2020}), like the configuration of new devices, the roll-out of configuration changes and monitoring. This allows for better network management, especially in networks with high availability constraints and flexibility requirements, e.g. the commercial network of an internet provider.
Network administrators receive a real-time overview of the network, implement new software-controlled features on the fly or create virtual network environments. A cloud provider could create automatic processes using an SDN controller to configure virtual networks for a tenant in seconds.
\begin{figure}
@ -18,7 +18,7 @@ Network administrators receive a real time overview over the network, implement
\label{fig:sdn_network_concept}
\end{figure}
In figure \ref{fig:sdn_network_concept} you can see a typical SDN architecture with the controller as a centralized management component. It is connected to multiple networking devices like routers and switches, and is able to read and write information to these devices. Several protocols can be used to affect routers and switches. Some vendors have their own proprietary protocols, while some protocols follow open specifications like OpenFlow and Open Virtual Switch Database (OVSDB). They allow the controller to communicate with the networking devices via their so called \textit{south-bound api}.
In \cref{fig:sdn_network_concept} you can see a typical SDN architecture with the controller as a centralized management component. It is connected to multiple networking devices like routers and switches, and is able to read information from and write information to these devices. Several protocols can be used to affect routers and switches. Some vendors have their own proprietary protocols, while some protocols follow open specifications like OpenFlow and \textit{Open Virtual Switch Database} (OVSDB). They allow the controller to communicate with the networking devices via their so-called \textit{south-bound API}.
In a network with an OpenFlow controller, routers use flow tables for routing. These contain information provided by the controller about the destination of specific packets, identified by a set of rules. If a packet for which no flow table entry matches enters the router, the "table-miss" flow entry is hit and the packet is sent to the controller. The controller would then use the information attached to the packet to create a new flow entry for similar packets according to the controller's configuration. The controller can then edit the flow tables of the affected routers and e.g. add the newly created flow entry.
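To illustrate the lookup logic described above, the following Python sketch models a flow table with a table-miss entry. It is purely illustrative; the entries, field names and addresses are invented for this example and do not correspond to any concrete OpenFlow implementation.
\begin{verbatim}
# Illustrative model of OpenFlow-style flow table matching; entry contents
# and field names are invented for this example.
import ipaddress

TABLE_MISS = {"match": {}, "priority": 0, "action": "send_to_controller"}

flow_table = [
    {"match": {"ip_dst": "10.0.4.0/24"}, "priority": 10, "action": "output:2"},
    {"match": {"ip_dst": "10.0.2.0/24"}, "priority": 10, "action": "output:1"},
    TABLE_MISS,
]

def matches(entry, packet):
    subnet = entry["match"].get("ip_dst")
    # An empty match set (the table-miss entry) matches every packet.
    return subnet is None or \
        ipaddress.ip_address(packet["ip_dst"]) in ipaddress.ip_network(subnet)

def lookup(packet):
    # Entries are evaluated by descending priority; the table-miss entry
    # is checked last and catches everything that did not match before.
    for entry in sorted(flow_table, key=lambda e: -e["priority"]):
        if matches(entry, packet):
            return entry["action"]

print(lookup({"ip_dst": "10.0.4.7"}))      # output:2
print(lookup({"ip_dst": "192.168.1.1"}))   # send_to_controller
\end{verbatim}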
@ -28,6 +28,6 @@ Alongside these controllers multiple vendors supplied SDN controller platforms t
Lastly, controllers serve as an interface for other applications that can communicate with the controller over its so-called \textit{north-bound API}. This includes applications like firewalls, load balancers and even business systems, allowing for a lot of flexibility when designing an automated network.
This causes networks to be logically split by their available information and level of operation. On the one hand there are operations directly on routers and switches, only using information available to the device itself and possibly its neighbours. Mechanisms working under these conditions are part of the so-called \textit{data plane}.
On the other hand, there are operations on the controllers, managing devices and maintaining an overview over the network, or routing protocols using link-state algorithms to collect data about the network.
On the other hand there are operations on the controllers, managing devices and maintaining an overview over the network, or routing protocols using link-state algorithms to collect data about the network.
These functions and mechanisms are part of the \textit{control plane}.

@ -2,19 +2,19 @@
\label{sec:resilient_routing}
The main resiliency goals of a network provider are (1) producing as few failures as possible and (2) reducing the impact of each remaining failure, given that some failures are unavoidable.
Securing a software-defined network imposes additional challenges, but also additional flexibility over a traditional network. In a network without controller most routings will be either entered manually or a routing protocol like Open Shortest Path First (OSPF) (\cite{Moy.041998}) or Intermediate System to Intermediate System (IS-IS) (\cite{Callon.121990}) would be used. In case a routing protocol is used, the routers themselves handle the routing in accordance to the definition of specified routing protocol. In the subsection \ref{ospf} we explain one of the traditional routing protocols OSPF and its implications. We put this into contrast in subsection \ref{sdn_advantages} where we discuss benefits and drawbacks of a SDN. After we established a basic overview over the differences of architectures, we shift our focus on resilience on the control plane and data plane specifically in subsection \ref{resilience_data_plane} and subsection \ref{resilience_control_plane}. While fast recovery methods (FRMs) mostly work on the data plane they take a role somewhere in between the data and control plane and are often used in addition to methods on the data and control plane. This is why we devote the subsection \ref{FRM} to the explanation of multiple FRMs and their applications.
Securing an SDN imposes additional challenges but also offers additional flexibility over a traditional network. In a network without a controller most routes will either be entered manually or a routing protocol like OSPF (\cite{Moy.041998}) or Intermediate System to Intermediate System (IS-IS) (\cite{Callon.121990}) will be used. In case a routing protocol is used, the routers themselves handle the routing in accordance with the definition of the specified routing protocol. In subsection \ref{ospf} we explain OSPF as an example of a traditional routing protocol and its implications. We put this into contrast in subsection \ref{sdn_advantages} where we discuss benefits and drawbacks of an SDN. After we have established a basic overview of the differences between the architectures, we shift our focus to resilience on the data plane and the control plane in subsection \ref{resilience_data_plane} and subsection \ref{resilience_control_plane}. While fast recovery methods (FRMs) mostly work on the data plane, they take a role somewhere in between the data and control plane and are often used in addition to methods on both planes. This is why we devote subsection \ref{FRM} to the explanation of multiple FRMs and their applications.
\subsection{Traditional routing protocols by taking the example of OSPF}
\label{ospf}
OSPF is a link-state protocol and therefore shares information contained on a router with its direct neighbours. Routers will be split into different pre-assigned areas, with a cluster of routers designated as \textit{backbone area}, and the rest split up into a multitude of areas and area types, summarized as \textit{nonbackbone} areas.
OSPF is a link-state protocol and therefore shares information contained on a router with its direct neighbours. Routers will be split into different pre-assigned areas, with a cluster of routers designated as \textit{backbone area}. The rest is split up into a multitude of areas and area types, summarized as \textit{nonbackbone} areas.
The backbone area is used as a central network point and all traffic from the network that moves between other areas must flow through the backbone area. Because of its design, the backbone area would be a single point of failure if only a single router was used, blocking all traffic between areas in case of a failure. The usage of multiple routers as backbone area serves as additional safety through redundancy, combating this issue.
Each router shares all its information with its directly adjacent neighbours on the condition that the adjacent router belongs to the same pre-defined area. After an introductory period of sharing routing information using the \textit{hello protocol}, each router in an area would hold a database about the network.
OSPF uses the \textit{hello protocol} and each router sends regular "Hello" messages to its neighbours according to a predefined \textit{HelloInterval} which defaults to a value of \SI{10}{\second}. In addition to this, there is a \textit{RouterDeadInterval} which should be a multiple of the \textit{HelloInterval} and is set to \SI{40}{\second} by default. If a router did not receive a hello message for as long as the dead interval is configured, the router will assume the neighbouring router to be down and will share this information.
OSPF uses the \textit{hello protocol} and each router sends regular "Hello" messages to its neighbours according to a predefined \textit{HelloInterval} which defaults to a value of \SI{10}{\second}. In addition to this, there is a \textit{RouterDeadInterval} which should be a multiple of the \textit{HelloInterval} and is set to \SI{40}{\second} by default. If a router did not receive a "Hello" message for as long as the dead interval is configured, the router will assume the neighbouring router to be down and will share this information.
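The following short Python sketch illustrates this detection logic with the default intervals. It is not an OSPF implementation; it only models the book-keeping a router could perform per neighbour.
\begin{verbatim}
# Illustrative sketch of OSPF-style neighbour liveness tracking with the
# default intervals; this is not an OSPF implementation.
import time

HELLO_INTERVAL = 10        # seconds, default HelloInterval
ROUTER_DEAD_INTERVAL = 40  # seconds, default RouterDeadInterval

class Neighbour:
    def __init__(self):
        self.last_hello = time.monotonic()

    def hello_received(self):
        self.last_hello = time.monotonic()

    def is_down(self):
        # The neighbour is only declared down once no "Hello" arrived for
        # the whole RouterDeadInterval, i.e. up to 40 seconds after the
        # actual failure occurred.
        return time.monotonic() - self.last_hello > ROUTER_DEAD_INTERVAL
\end{verbatim}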
In case a network component, e.g. a link, is detected as faulty, the router shares this information with its adjacent neighbours, which will in turn relay the change to all routers connected to them. This causes a propagation of a refreshed version of the routing and allows for a relatively fast convergence time.
But communication between routers is still time-consuming and a convergence of a link state protocol like OSPF will also suffer from its fault detection mechanisms. A failing link might be easily and quickly recognized, but a failing router would require the network to first wait for the dead interval. This would mean a down time of \SIrange{30}{40}{\second} depending on the default configuration until .
But communication between routers is still time-consuming and a convergence of a link state protocol like OSPF will also suffer from its fault detection mechanisms. A failing link might be easily and quickly recognized, but a failing router would require the network to first wait for the \textit{RouterDeadInterval}. This would mean a down time of \SIrange{30}{40}{\second} depending on the configuration.
Because routers perform these tasks themselves, this also implies that the internal resources, e.g. processing power and internal storage of each router, need to be used in part for OSPF. The network itself also receives additional traffic because of the need of the routers to share information. This would cause the network to break if components were overloaded without any protective measures, as the processing power would not suffice to still maintain OSPF.
@ -34,7 +34,7 @@ To their advantage controllers allow for the easy implementation of more advance
\label{resilience_data_plane}
By definition most methods providing additional network resilience on the data plane are network agnostic, and operations executed on the data plane can be assumed to be very fast as overhead produced by e.g. communication protocols is avoided.
One widely used method to add resilience are static routes that are installed directly on the components, re-routing traffic in case of failures. These are called local \textit{fast re-routing} (FRR) (\textcite{Nelakuditi.2003}). FRR routes make use of already existing link redundancy in the network.
One widely used method to add resilience is the installation of static routes directly on the components, re-routing traffic in case of failures. This is called local FRR (\textcite{Nelakuditi.2003}). FRR routes make use of already existing link redundancy in the network.
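As an illustration, such backup routes could be pre-installed on a Linux-based router in Mininet as sketched below. The router, interface names and addresses are hypothetical, and this is not necessarily the FRR variant used later in this work.
\begin{verbatim}
# Sketch of pre-installed local backup routes on a Linux-based Mininet
# router; names and addresses are hypothetical.
def install_frr_routes(router):
    # Primary route towards 10.0.4.0/24 via r1-eth1 (lower metric wins).
    router.cmd("ip route add 10.0.4.0/24 via 10.0.12.2 dev r1-eth1 metric 10")
    # Backup route via r1-eth2; it only becomes effective once the primary
    # route disappears, e.g. because the interface r1-eth1 went down.
    router.cmd("ip route add 10.0.4.0/24 via 10.0.13.2 dev r1-eth2 metric 100")
\end{verbatim}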
\begin{figure}
\centering
@ -67,20 +67,21 @@ The conditions for using FRR are easy to fulfil and most networks will already b
\label{resilience_control_plane}
In contrast, an SDN controller will react to failures by collecting information about the network and calculating near-optimal alternative routes. It would then re-write the routing tables on each affected component with protocols like OpenFlow. OSPF also works on the control plane, as the routers collect information about the network.
Operations on the control plane are very thorough as decisions are made based on an overview of the network. The collection of information and the deployment of solutions however is very time consuming. A failure in a network that only uses control plane mechanisms for failure handling will be unattended to during the whole process, potentially creating backlog or reducing availability for a longer period of time. Operations on the control plane mostly have convergence times in the dimension of seconds. (\cite{Liu.2013})
Operations on the control plane are very thorough as decisions are made based on an overview of the network. The collection of information and the deployment of solutions, however, is very time consuming. A failure in a network that only uses control plane mechanisms for failure handling will be unattended to during the whole process, potentially creating backlog or reducing availability for a longer period of time. Operations on the control plane mostly have convergence times in the dimension of seconds (\cite{Liu.2013}).
This is the reason why most modern networks will use a combination of mechanisms on the data \emph{and} control plane, e.g. FRR and a global convergence protocol, allowing sub-optimal paths to restore availability while the global convergence protocol provides an optimised routing after some time.
\subsection{Fast Recovery Methods}
\label{FRM}
Fast Recovery Methods are operations that take place on the data plane. A combination of operations on the data \emph{and} control plane inevitably create a delay; while the alternative route was already established in a matter of milliseconds through e.g. FRR, the operation on the control plane, e.g. a controller calculating optimal routings, will take any time in the order of seconds. In this time gap the network uses a sub optimal route for its traffic.
FRMs are operations that take place on the data plane. A combination of operations on the data \emph{and} control plane inevitably creates a delay; while the alternative route was already established in a matter of milliseconds through e.g. FRR, the operation on the control plane, e.g. a controller calculating optimal routings, will take time in the order of seconds. In this time gap the network uses a sub-optimal route for its traffic.
Loops created by e.g. FRR will affect the network in this time frame and will not only potentially delay traffic but also reserve scarce link capacity on looped routes.
As such FRMs can be perceived as optimizations of data plane mechanisms like FRR. Because FRR is very prominent in networks, we use FRR and FRMs optimizing FRR as main examples. We chose a few of the existing FRMs and explain them. In \cite{Chiesa.2021} you can find a more thorough survey of some of the existing FRMs.
\subsubsection{Resilient Routing Layers}
\label{rrl}
One core issue of FRR is that routes created by FRR are inherently agnostic. Routers base their routing decisions on no information other than the incoming interface and the destination. This limits routing options; each combination of incoming interface and destination network can have only one outgoing interface mapped. This can cause loops as seen in \cref{fig:frr_example}, because each route has to be "checked" and return a packet for the next route to become active.
@ -91,11 +92,11 @@ Instead of using one routing for the network, each router would instead persist
\begin{figure}
\centering
\includegraphics[width=9cm]{rlayers_visualization}
\caption{Resilient routing layers - visualization}
\caption{Resilient Routing Layers - visualization}
\label{fig:rl_visual}
\end{figure}
E.g. a link would be seen as "safe" if there is at least one layer in which the link is not included. This can be extended for devices such as routers. Routing layers protecting links can be seen in the example in figure \ref{fig:rl_visual}. While e.g. the link from R1 to R2 is included in routing layer 0, it is not included in routing layer 1. If a failure in this link would occur, the network could switch all routers to routing layer 1, circumventing the failure. This requires routers to manipulate packets and add an identifier to the packet, determining the routing layer that should be used.
A link would, for example, be seen as "safe" if there is at least one layer in which the link is not included. This can be extended for devices such as routers. Routing layers protecting links can be seen in the example in \cref{fig:rl_visual}. While e.g. the link from R1 to R2 is included in routing layer 0, it is not included in routing layer 1. If a failure occurred in this link, the network could switch all routers to routing layer 1, circumventing the failure. This requires routers to manipulate packets and add an identifier to the packet, determining the routing layer that should be used.
Layers can be defined either manually or by algorithms taking configurable parameters. A more robust, low relevance networking section could receive a lower number of layers, while a high availability, high relevance networking section would receive more layers and therefore higher robustness against failures.
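The following Python sketch illustrates the layer selection for a failed link; the layer definitions are invented for this example and only loosely follow the figure above.
\begin{verbatim}
# Illustrative layer selection for Resilient Routing Layers; the layer
# definitions are invented and loosely follow the figure above.
ROUTING_LAYERS = {
    0: {("R1", "R2"), ("R2", "R4"), ("R1", "R3")},
    1: {("R1", "R3"), ("R3", "R4"), ("R2", "R4")},
}

def safe_layer(failed_link):
    """Return the id of a layer that does not contain the failed link."""
    link = tuple(sorted(failed_link))
    for layer_id, links in ROUTING_LAYERS.items():
        if link not in {tuple(sorted(l)) for l in links}:
            return layer_id
    return None  # the failed link is not protected by any layer

# A failure of the link R1-R2 lets the network switch to layer 1; packets
# would then be tagged with this layer id.
print(safe_layer(("R1", "R2")))
\end{verbatim}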
@ -105,7 +106,8 @@ The manipulation of packets limits the applicability of this method, as existing
\subsubsection{ShortCut}
The authors of ShortCut (\textcite{Shukla.2021}) propose to remove loops created by FRR and therefore optimize the routes by self-editing existing flows on a router. In addition to the interface specific routing performed by FRR, ShortCut also uses this data to identify packets which were returned to the router. By maintaining a priority list of flows for each port they would then be able to remove invalid entries, e.g. links that failed or routes that returned a packet.
\label{shortcut}
The authors of ShortCut (\cite{Shukla.2021}) propose to remove loops created by FRR and therefore optimize the routes by self-editing existing flows on a router. In addition to the interface specific routing performed by FRR, ShortCut also uses this data to identify packets which were returned to the router. By maintaining a priority list of flows for each port they would then be able to remove invalid entries, e.g. links that failed or routes that returned a packet.
\begin{figure}
@ -130,10 +132,11 @@ The authors of ShortCut (\textcite{Shukla.2021}) propose to remove loops created
In \cref{fig:basics_wo_shortcut} you can see a network with pre-installed FRR routes on router R1, in this example an additional route for packets heading to H4 coming from router R2, and a failure in the link between routers R2 and R4. The returning packet from R2 will be routed, according to the routing table, to router R3. This results in an off-path or loop from R1 to R2, passing R1 twice. In \cref{fig:basics_w_shortcut} the routes on router R1 for packets from H1 are edited. Because ShortCut recognized that packets to H4 return from R2, the route forwarding to router R2 is omitted from the routing table. As entries in a routing table are evaluated from top to bottom, the alternative route on router R1 will now always be used for packets to H4, effectively removing the loop.
This saves (1) the additional transmission times to router R2 and back, as well as (2) link capacity on the link between routers R1 and R2.
ShortCut is applicable to most network topologies as well as pre-existing FRR and global convergence stacks.
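The idea can be summarized in the following strongly simplified Python sketch. It is neither the authors' implementation nor the one used later in this work; the interface names mirror the example above, with r1-eth2 leading towards R2 and r1-eth3 being the FRR alternative.
\begin{verbatim}
# Strongly simplified sketch of the ShortCut idea; interface names are
# examples, r1-eth2 leads towards R2 and r1-eth3 is the FRR alternative.
routes = {
    # destination: outgoing interfaces ordered by priority
    "10.0.4.0/24": ["r1-eth2", "r1-eth3"],
}

def forward(destination, in_interface):
    preferred = routes[destination][0]
    if in_interface == preferred and len(routes[destination]) > 1:
        # The packet returned over the interface it would be sent out on:
        # the path behind it is broken, so the entry is removed and all
        # following packets use the FRR alternative directly.
        routes[destination].pop(0)
    return routes[destination][0]

print(forward("10.0.4.0/24", "r1-eth1"))  # traffic from H1 -> r1-eth2
print(forward("10.0.4.0/24", "r1-eth2"))  # returned packet -> loop removed
print(forward("10.0.4.0/24", "r1-eth1"))  # subsequent traffic -> r1-eth3
\end{verbatim}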
\subsubsection{Blink}
\label{blink}
Blink (\cite{ThomasHolterbach.2019}) interprets TCP flows and uses them as an indicator for failures occurring in the network. For this it uses TCP packets arriving at routers implementing Blink, analysing a sample of these flows. If it recognizes that e.g. a packet was not acknowledged and this occurs multiple times, it signals a failure and re-routes traffic accordingly. This, however, requires the network to fail first, only restoring connectivity after packets were already lost.
\subsubsection{Revive}
Revive (\cite{Haque.2018})
\label{revive}
Revive (\cite{Haque.2018}) combines the usage of local backup routes with a memory management module, distributing traffic according to pre-defined memory thresholds on switches. All backup routes are implemented by installing flows on the routers/switches with an OpenFlow controller. It also uses OpenFlow's Fast Failover Groups (FFG) for detecting failures. The backup routes, however, are prone to create loops and the addition of re-routing using controller mechanisms is relatively slow, although much faster than a global convergence protocol.

@ -6,24 +6,26 @@ To evaluate a network and test new mechanisms and functions, the simulation of n
While using real life components for such a simulation might be the most realistic approach, it quickly becomes infeasible especially in a scientific context, as it not only adds dependencies on different hardware vendors possibly distorting results, but also reduces replicability of said results.
Furthermore it also complicates automation of measurements, increasing the overhead in testing.
Most networks already use some sort of virtualization, be it in form of virtual local area networks (VLANs), partitioning already existing physical networks into "virtual" networks, or the usage of virtual routers running on so called \textit{white boxes}, eliminating the need for specialized hardware.
Most networks already use some sort of virtualization, be it in form of \textit{Virtual Local Area Networks} (VLAN), partitioning already existing physical networks into "virtual" networks, or the usage of virtual routers running on so called \textit{white boxes}, eliminating the need for specialized hardware.
This eliminates some of the bridges between a virtual simulation and a real life network. While the network itself might not exist in the physical world, the software running on these virtual devices is already similar or even the same as in real life networks. In some cases configurations can even be shared between physical and virtual devices, allowing for fast real life confirmation of results.
As such most scientific work in regards to networks is in some way done on virtualized networks.
It becomes possible to implement huge network structures and testing setups on a single computer and allow each reader with a sufficiently powerful home computer to replicate the results.
In the following sections we will first take a look at Mininet (\cite{LantzBobandtheMininetContributors.2021}), a network virtualization toolkit in \cref{basic_mininet}. We then go on to discuss measurement criteria for a computer network in \cref{basics_measuring_performance}. Lastly we discuss methods of introducing failures in Mininet in \cref{introducing_failures}.
\subsection{Mininet}
Mininet (\cite{LantzBobandtheMininetContributors.}) is a tool to create virtual networks running in linux based operating systems, with each component running its own linux kernel on a single system. This is done using network namespaces, which is a feature of the linux kernel allowing for independent network stacks with separate network devices, routes and firewall rules.
\label{basic_mininet}
Mininet (\cite{LantzBobandtheMininetContributors.2021}) is a tool to create virtual networks on Linux-based operating systems, with all components sharing the Linux kernel of a single system. This is done using network namespaces, a feature of the Linux kernel allowing for independent network stacks with separate network devices, routes and firewall rules.
Mininet supplies a verbose management console which allows for the creation of virtual networks by using simple commands. These networks can then be evaluated, hosts can be accessed via a terminal and influenced in their behaviour, e.g. by editing their routing tables or implementing packet filtering on them, and pre-defined functions can be used to evaluate the performance of the network or monitor traffic.
Mininet also comes with a python api which can be used to create automatic scripts that will create host, routers, switches and controllers, connect them via links, set their routing configuration and execute functions on the network.
Mininet also comes with a Python API which can be used to create automatic scripts that create hosts, routers, switches and controllers, connect them via links, set their routing configuration and execute functions on the network.
These scripts can be shared and used in Mininet's provided virtual machine, which is based on the Ubuntu distribution and available for several Ubuntu versions.
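A minimal example of such a script is sketched below. The topology, names and addresses are invented for illustration and do not correspond to the topologies used later in this work; the bandwidth and delay parameters require the \textit{TCLink} class.
\begin{verbatim}
#!/usr/bin/env python
# Minimal Mininet script; names, addresses and limits are examples only.
from mininet.net import Mininet
from mininet.link import TCLink
from mininet.cli import CLI

net = Mininet(link=TCLink)                 # TCLink enables bw/delay limits
h1 = net.addHost('h1', ip='10.0.1.10/24')
h2 = net.addHost('h2', ip='10.0.2.10/24')
r1 = net.addHost('r1')                     # a plain host acting as router
net.addLink(h1, r1, bw=100, delay='5ms')   # h1-eth0 <-> r1-eth0
net.addLink(h2, r1, bw=100, delay='5ms')   # h2-eth0 <-> r1-eth1
net.start()

r1.cmd('sysctl -w net.ipv4.ip_forward=1')  # enable packet forwarding
r1.cmd('ip addr add 10.0.1.1/24 dev r1-eth0')
r1.cmd('ip addr add 10.0.2.1/24 dev r1-eth1')
h1.cmd('ip route add default via 10.0.1.1')
h2.cmd('ip route add default via 10.0.2.1')

net.ping([h1, h2])                         # simple connectivity check
CLI(net)                                   # drop into the Mininet console
net.stop()
\end{verbatim}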
Additionally it supports the OpenFlow protocol and the usage of controllers like POX, as well as the P4 language (add source), which can be used to customize packet handling even more. Written controllers can then even be used in real life setups that replicate the created network in Mininet and retain their programmed functionality.
Additionally it supports the OpenFlow protocol and the usage of controllers like POX, as well as the P4 language (\cite{Bosshart.2014}), which can be used to customize packet handling even more. Written controllers can then even be used in real life setups that replicate the created network in Mininet and retain their programmed functionality.
This allows the quick creation and configuration of test environments, as well as replicability of test scenarios, the creation of failure scenarios and the application in real life.
@ -38,23 +40,23 @@ One important note about the usage of the Mininet VM is the configuration in \te
\subsection{Measuring performance}
\label{basics_measuring_performance}
When evaluating the performance of a network there are several quantitative and qualitative criteria to be considered, depending on the use case and the type of network. We explain the evaluation criteria in section \ref{evaluation_criteria}. In the following sections we then explain possible ways to measure specific criteria. In section \ref{introducing_failures} we lastly explain how failures can be introduced and used in a test environment.
When evaluating the performance of a network there are several quantitative and qualitative criteria to be considered, depending on the use case and the type of network. We explain the evaluation criteria in \cref{evaluation_criteria}. In the following sections we then explain possible ways to measure specific criteria. In \cref{introducing_failures} we lastly explain how failures can be introduced and used in a test environment.
\subsubsection{Evaluation criteria}
\label{evaluation_criteria}
The most basic usage of a network is the pure transfer of data without timing or flow constraints. Here, a faster transmission and therefore a higher bandwidth is the main criterion of evaluation. Protocols like TCP also resend dropped packets, so in these cases the bandwidth can also be used to make assumptions about the transfer error rate.
For traffic with timing constraints e.g. Voice-over-IP (VOIP) the bandwidth is bound by the number of concurrent transmissions and the quality of sound. In modern networks these transfers should only use a fraction of the available bandwidth. Additionally, in contrast to a TCP data transfer, dropping a few packages is only a minor inconvenience and would most likely not be noticed by the user.
For traffic with timing constraints, e.g. \textit{Voice-over-IP} (VOIP), the bandwidth is bound by the number of concurrent transmissions and the quality of sound. In modern networks these transfers should only use a fraction of the available bandwidth. Additionally, in contrast to a TCP data transfer, dropping a few packets is only a minor inconvenience and would most likely not be noticed by the user.
Therefore the most recognizable criteria in VOIP are certainly the delay and the flow of packets. While a delay of a few milliseconds will not be instantly recognizable to the user, it can quickly result in a poor experience, especially if other factors like the usage of a wireless connection to a phone etc. are considered and delays accumulate.
VOIP protocols mostly use a jitter buffer to collect packets in a certain time frame and reorder them according to their sequence numbers. This works very well if all packets were received in the correct order, but this is rarely the case. As such, the evaluation of the packet flow and the order of transmission should be evaluated as well.
VOIP protocols mostly use a jitter buffer to collect packets in a certain time frame and reorder them according to their sequence numbers. This works very well if all packets were received in the correct order and without much delay. However, if a disruptive failure would occur chances are that additional delays and changed routings could cause the jitter buffer to be insufficient and the quality to suffer. As such, the evaluation of the packet flow and the order of transmission should be evaluated as well.
On-demand video streaming mostly uses the UDP protocol and is not timing bound. High bandwidths are certainly required, but the lack of such will only be perceived when the bandwidth becomes unable to sustain a fluid playback of the video content. UDP does not resend dropped packages and a small count of missing packets is not a huge problem, but the number of dropped packets should be kept to a minimum.
On-demand video streaming mostly uses UDP and is not timing bound. High bandwidths are certainly required, but the lack of such will only be perceived when the bandwidth becomes unable to sustain a fluid playback of the video content. UDP does not resend dropped packets and a small count of missing packets is not a huge problem. A higher count of lost packets will however reduce quality and in some cases even cause a disruption of the service.
Up until now we only considered the bandwidth as the speed of data transfer from one device to another, but in a network with multiple data streams the usage of each link has to be considered. If a data transfer would e.g. use a larger route through the network caused by a failure, and that route contains a loop, the links inside this loop are unnecessarily strained and might influence other traffic in the network.
Up until now we only considered the bandwidth as the speed of data transfer from one device to another. In reality networks will forward thousands of data streams and existing links are rarely unused. If a data transfer would e.g. use a larger route through the network caused by a failure and that route contains a loop, the links inside this loop are unnecessarily strained and might influence other traffic in the network.
In summary, a non-specialized network is evaluated by it's bandwidth, link usage, latency, packet loss and packet flow (check if meaning is right).
In summary, a non-specialized network is evaluated by its bandwidth, link usage, latency, packet loss and packet flow.
\subsubsection{Measuring bandwidth}
\label{measuring_bandwidth}
@ -63,12 +65,12 @@ The process of testing includes starting a server on a receiving device and star
After the pre-defined time the transfers will stop and the client instance of \textit{iperf} will shut down after printing out the average transfer rate. The server instance will have to be shut down manually.
By default \textit{iperf} will use tcp to send data, but when using the "-u" flag it will instead transfer data with udp. When using udp \textit{iperf} requires an additional bandwidth parameter, which will specify how much data will be sent over the network. This is done because protocols like TCP use flow control to limit the amount of data sent on the capabilities of the receiving device. A slower device like a mobile phone will e.g. limit the data transfer to not get overwhelmed. Protocols like UDP do not provide any flow control and therefore \textit{iperf} has to limit the used bandwidth itself. Using "-u 0" will cause \textit{iperf} to send as many UDP packets as possible.
By default \textit{iperf} will use TCP to send data, but when using the "-u" flag it will instead transfer data using UDP. When using UDP \textit{iperf} requires an additional bandwidth parameter, which will specify how much data will be sent over the network. This is done because protocols like TCP use flow control to limit the amount of data sent based on the capabilities of the receiving device. A slower device like a mobile phone will e.g. limit the data transfer to not get overwhelmed. Protocols like UDP do not provide any flow control and therefore \textit{iperf} has to limit the used bandwidth itself. Setting the bandwidth parameter to "0" will cause \textit{iperf} to send as many UDP packets as possible.
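In a Mininet script such a measurement could be automated as sketched below, with h1 and h2 as placeholder hosts in an already running network; the options follow the description above and assume \textit{iperf} (version 2) is installed.
\begin{verbatim}
# Sketch of an automated iperf measurement between two Mininet hosts h1 and
# h2 (placeholder names); assumes a running Mininet network.
h2.cmd('iperf -s &')                                 # start the server side
tcp_out = h1.cmd('iperf -c %s -t 10' % h2.IP())      # 10 s TCP transfer
udp_out = h1.cmd('iperf -c %s -u -b 100M -t 10' % h2.IP())  # UDP, 100 Mbit/s
h2.cmd('kill %iperf')                                # stop the server manually
print(tcp_out)
print(udp_out)
\end{verbatim}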
\subsubsection{Measuring latency}
Measuring latency can be done most easily by using the \textit{ping} application included in most operating systems and sending multiple "pings" in a certain interval from a sending device to a receiving device. The sending device will send an ICMP echo request packet to the receiving device over the network and if the packet was received, the receiving device will answer with an ICMP echo reply packet. After each sent packet, the sender will wait for the reply and log the time difference between sending and receiving.
Measuring latency can be done most easily by using the \textit{ping} application included in most operating systems and sending multiple "pings" in a certain interval from a sending device to a receiving device. The sending device will send an \textit{Internet Control Message Protocol} (ICMP) echo request packet to the receiving device over the network and if the packet was received, the receiving device will answer with an ICMP echo reply packet. After each sent packet, the sender will wait for the reply and log the time difference between sending and receiving.
A ping test can be run for a given time during which a failure can be introduced.
A \textit{ping} test can be run for a given time during which a failure can be introduced.
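A possible automation in a Mininet script is sketched below; the host names are placeholders.
\begin{verbatim}
# Sketch of a latency measurement between Mininet hosts h1 and h4
# (placeholder names); 60 echo requests are sent at an interval of 0.5 s,
# leaving time to introduce a failure while the test is running.
output = h1.cmd('ping -c 60 -i 0.5 %s' % h4.IP())
print(output)   # per-packet round trip times and a min/avg/max summary
\end{verbatim}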
\subsubsection{Measuring influence of link usage}
\label{measure_link_usage}
@ -91,19 +93,19 @@ In a virtual network packet loss will not be caused by faulty devices on an othe
We have to keep in mind that some of this packet loss might be caused by the configuration changes which are used to simulate a link failure in the virtual network. As such packet loss will most likely only occur shortly after a failure is introduced, or, in case UDP is used in a bandwidth measurement, if the bandwidth that has to be achieved overloads the packet queues of the routers.
\subsubsection{Monitoring packet flow}
Network mechanisms like ShortCut influence the routing and therefore the flow of packets in the network. To gain an overview over the routes packets take, we can analyse this packet flow by measuring the amount of IP packets passing each router.
Network mechanisms like ShortCut influence the routing and therefore the flow of packets in the network. To gain an overview over the routes packets take, we can analyse this packet flow by measuring the amount of packets of a specific protocol, e.g. TCP or UDP, passing each router.
This can be done with tools like \textit{nftables}, which, besides its firewall functionality of creating rules for packets, is also able to use these rules to deploy counters. These counters can be read in regular intervals to confirm whether e.g. a FRM was successfully triggered and traffic is re-routed, skipping a router.
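The following sketch shows how such a counter could be deployed on a router r1 from a Mininet script; the table and chain names are hypothetical.
\begin{verbatim}
# Sketch of a TCP packet counter deployed with nftables on a router r1;
# table and chain names are hypothetical.
r1.cmd('nft add table ip monitor')
r1.cmd("nft 'add chain ip monitor forward "
       "{ type filter hook forward priority 0; }'")
r1.cmd('nft add rule ip monitor forward ip protocol tcp counter')
# Polling the chain in regular intervals shows how many TCP packets the
# router has forwarded so far.
print(r1.cmd('nft list chain ip monitor forward'))
\end{verbatim}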
\subsection{Introducing failures}
\label{introducing_failures}
In related studies, failures are often split into definitions of \textit{device failures} and \textit{link failures}. Depending on the network and its structure link and device failures have different impacts on the network. If e.g. a top of the rack switch fails the whole rack loses connection to the network. A single link failure might only affect a single component.
In related studies failures are often split into definitions of \textbf{device failures} and \textbf{link failures}. Depending on the network structure link and device failures have different impacts on the network. If e.g. a top of the rack switch fails, the whole rack loses connection to the network. A single link failure might only affect a single component.
Nonetheless, a link failure and a device failure are not so different. While a link failure requires the network and its administrators to take different measures than if it was a device failure, the data plane of the network will react quite similarly.
A failure and measures taken by the network can be split into two phases: In the first phase, a network component, be it a link or device, breaks. This can be either a failure in parts (e.g. specific packets dropped, protocols not working etc.) or a complete failure which constitutes a complete breakdown of functionality. The network uses timeouts and fault detection protocols on layer 1 (physical) and 2 (ethernet) to detect faulty interfaces and links. Most complete failures will be recognized near instantly, while partial failures might take longer to detect because they require e.g. timeouts to run out. These partial failures will most likely be caused by device failures, as link failures are less complex. Even a partially working link will most likely produce faulty packets which will be recognized quickly (do i need a source for that?)
Failures and measures taken by the network can be split into two phases. In the first phase a network component, be it a link or device, breaks. This can be either a partial failure (e.g. specific packets dropped, protocols not working etc.) or a complete failure which constitutes a complete breakdown of functionality. The network uses timeouts and fault detection protocols on layer 1 (physical) and layer 2 (Ethernet) to detect faulty interfaces and links. Most complete failures will be recognized near instantly, while partial failures might take longer to detect because they require e.g. timeouts to run out. These partial failures will most likely be caused by device failures, as link failures are less complex. Even a partially working link will most likely produce faulty packets which will be recognized quickly.
The first phase contains all operations done until a failure was recognized.
When the failure was detected, the second phase starts. This is when the network becomes active and reacts to the failure, e.g. by using FRR to quickly reroute affected traffic flows and inquiring the controller to take further action. This is the phase of the failure in which the type of failure becomes near irrelevant for the operations on the data plane; the router recognizes a failing route (e.g. because the router behind a link is broken) and reacts accordingly. This reaction is very deterministic as only data on the router is used.
When the failure was detected, the second phase starts. This is when the network becomes active and reacts to the failure, e.g. by using FRR to quickly reroute affected traffic flows and inquiring the controller to take further action. This is the phase of the failure in which the type of failure becomes near irrelevant for the operations on the data plane; the router recognizes a failing route, e.g. because the router behind a link is broken, and reacts accordingly. This reaction is very deterministic as only data on the router is used.
The FRMs in this work always presume clean failures that have already been detected. Fault detection in phase 1 is therefore excluded from this work and we always assume that a device or link has failed completely. Because we only focus on the fault mitigation on the data plane we can also assume that a faulty device constitutes a device on which all links are broken. A failure in the only link connecting a router for example is the same as a completely broken router.
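In Mininet such a complete link failure can be introduced during a running test as sketched below; the node names and the waiting time are examples, and the sketch assumes a running Mininet network object.
\begin{verbatim}
# Sketch of introducing a clean link failure during a running measurement;
# node names and the waiting time are examples.
import time

time.sleep(15)                             # let the measurement run first
net.configLinkStatus('r2', 'r4', 'down')   # take both ends of the link down
\end{verbatim}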

@ -1,3 +1,3 @@
% !TeX encoding = UTF-8
\chapter*{\iftoggle{lang_eng}{Abstract}{Kurzfassung}}
In our modern society the internet and provided services play an increasingly greater role in our daily lives.
The failure of links and devices in networks is a common occurrence for any network administrator. As such modern networks implement global convergence protocols and \textit{Fast Re-Routing} (FRR) mechanisms to quickly restore connectivity. FRR mechanisms tend to create sub-optimal paths containing unnecessary loops which increase latency, disturb other data flows or create other adverse effects on the network until the slower convergence protocol rewrites routings on the network. \textit{Fast Recovery Methods} (FRM) like ShortCut (\cite{Shukla.2021}) optimize existing FRR mechanisms by e.g. removing existing loops in routings created by FRR. However, ShortCut has yet to be tested in further detail. In this work we implement three topologies, a FRR mechanism and ShortCut in a virtual network and run performance tests while introducing artificial failures to the network, evaluating the performance of these two methods. Our results show that ShortCut reduces the effects of sub-optimal paths chosen by a FRR mechanism and in some cases even restores pre-failure conditions on the network.

@ -21,9 +21,9 @@
\begin{minipage}{9cm}
\large
\begin{center}
\par{}{\Large Comparison of Fast Recovery Methods in Networks}
\par{}{\Large A Comparison of Fast-Recovery Mechanisms in Networks}
\vspace*{1cm}
\par{}\textbf{Implementation and Evaluation of Fast Recovery Methods in Mininet}
\par{}\textbf{Implementation and Evaluation of Fast Recovery Mechanisms in Mininet}
\vspace*{1cm}
\par{}Frederik Maaßen
% \vspace*{1cm}

@ -1,26 +1,38 @@
\chapter{Conclusion}
In this chapter we conclude our findings in \cref{results}. We furthermore discuss possible extensions of this work and how our results could be made more robust for interpretation in \cref{future}.
\section{Results of this work}
\label{results}
We provide an entry point into the topic of modern networks, resilient routing and FRMs.
To this end we explain the implementation of a test framework using Mininet, incorporate a simple FRR mechanism and provide a simple example for an implementation of ShortCut.
Through testing in our test framework we also show that FRMs can have a meaningful impact on networks, restoring bandwidth, removing additional delays and reducing unnecessary workload for routers. ShortCut has proven to be an effective method for relieving the additional burden imposed on networks after a failure, reducing delays by up to 38\% and additional packet forwarding by up to 120\%.
\section{Future work}
During our work on this thesis we were not able to
\label{future}
During our work on this thesis we discovered more ways to further investigate FRMs and extend our existing test framework.
In addition to this we also discovered more methods to produce more robust results. We discuss possible extensions of our test framework in \cref{conc_framework}. The addition of CPU usage measurements on Mininet hosts discussed in \cref{conc_cpu} could also provide additional insights into our results.
\subsection{Testing framework}
We provided the testing framework we used for performing tests on ShortCut. This framework can be used for many different test runs, but is still a prototype. The structure is far from optimal and most pipelines can be optimized.
\label{conc_framework}
We provide a test framework which can be used to perform tests on Mininet networks automatically. The framework is, however, still a prototype and could be extended for broader usage.
One example is the usage of \textit{iperf3} for performing bandwidth measurements. During our tests we experienced bursts in the log output of the \textit{iperf} server, exceeding both imposed limits on the network, be it the limit for Mininet links or the limit for \textit{iperf} itself, for a short period of time by up to 50\%. We decided to - for now - interpret these bursts as a technicality of \textit{iperf}, but a thorough investigation of this issue would increase the quality of the data produced.
One easy extension would be the addition of tests using UDP. While we already implemented a UDP data transfer for a packet flow measurement, additional implementations using UDP in e.g. bandwidth testing could prove worthwhile as UDP has no congestion control, possibly causing packet queue overflows in routers.
\subsection{Differentiating multiple tools}
A further possible extension is the addition of a selection of measurement tools.
All measurements were done using one specific tool, depending on the type of measurement, namely \textit{iperf} for bandwidth measurements and the production of data streams and \textit{ping} for latency tests. \textit{iperf} could be replaced with a multitude of software packets and some members of the Mininet community have suggested that e.g. \textit{netperf} (\cite{Jones.2015}) would provide more accurate results. This could be evaluated in further detail.
\subsection{Measuring CPU usage of hosts in Mininet}
Tests in this work are done with a limit of \SI{100}{Mbps} imposed on the links, in tests without a limit to the bandwidth values of around \SI{40}{Gbps} were reached. While this should remove any fluctuations that could be caused by additional operations either on the virtual machine or the host system, it does not completely ensure proper distribution of processing power. For this we could run CPU usage measurements while the actual tests are running. This would enable us to further interpret results and possible spikes in delay or bandwidth.
\subsection{Massive testing}
Because of the time constraints of this work we were unable to test in high volumes, even though we experienced some fluctuations in our measurements. To increase the reliability of our results the tests could be run e.g. a hundred times. This could also be integrated into the testing framework with an additional argument specifying in which quantity the test should be run.
\subsection{Adding topologies, FRR variants and FRMs}
In this work we evaluated three pretty similar topologies, as well as a simple implementation of FRR and an implementation of the FRM ShortCut.
We evaluate three rather similar topologies, as well as a simple implementation of FRR and an implementation of the FRM ShortCut.
Depending on the requirements of a network, a fully meshed topology with all routers inter-connected, for example, might be a good starting point for further testing. This should go hand in hand with the implementation of an automatic routing and a more strategic deployment of FRR and FRMs.
As described in \cref{FRM} there are also many different FRMs which could be implemented in Mininet and tested using our test framework.
As described in \cref{FRM} there are also many different FRMs which could be implemented in Mininet and tested using our test framework.
To further professionalize the implementation of FRMs they could be implemented using P4, overwriting the default behaviour of Mininet routers. This would provide a version of e.g. ShortCut that could be deployed in a real network in a short amount of time.
\subsection{Measuring CPU usage of hosts in Mininet}
\label{conc_cpu}
Tests in this work are done with a limit of \SI{100}{Mbps} imposed on the links; in tests without a bandwidth limit, values of around \SI{40}{Gbps} were reached. While this should remove any fluctuations that could be caused by additional operations either on the virtual machine or the host system, it does not completely ensure proper distribution of processing power. For this we could run CPU usage measurements while the actual tests are running. This would enable us to further interpret results and possible spikes in delay or bandwidth.

@ -1,18 +1,16 @@
\chapter{Evaluation}
\label{evaluation}
In this chapter we evaluate tests that were run using our test framework in Mininet. The tests were performed as described in \ref{cp:testing} with a bandwidth limit on each link of \SI{100}{Mbps}. When testing with delays on the network we noticed that the performance dropped rapidly. This is why we only use an additional delay of \SI{5}{\milli\second} per link in our latency tests - other tests do not use a delay.
In this chapter we evaluate tests that were run using our test framework in Mininet. The tests were performed as described in \ref{cp:testing} with a bandwidth limit on each link of \SI{100}{Mbps}. We use an additional delay of \SI{5}{\milli\second} per link in our latency tests - other tests do not use a delay.
The evaluations are sorted by topology. For each topology we measured the bandwidth, bandwidth with a concurrent data flow, latency, TCP packet flow and UDP packet flow. We execute each test once with FRR active in the corresponding section \textit{With FRR} and once with FRR and our implementation of ShortCut active in the corresponding section \textit{With FRR and ShortCut}.
We start with our minimal network in section \ref{eva_minimal_network}, followed by the evaluation of two networks with longer "failure paths", measuring the influence of additional nodes in looped paths in section \ref{eva_failure_path_network}. Lastly we discuss our results in \cref{discussion}.
\input{content/evaluation/minimal_network}
\input{content/evaluation/failure_path_networks}
\section{Discussion}
\label{discussion}
In this section we discuss the results of the previous measurements. We proceed by comparing the results of different measurement types using the three topologies. For each measurement type we collect the implications of a failure for the network and whether ShortCut is able to enhance results. We start with the bandwidth in \cref{discussion_bandwidth}, continuing to the bandwidth with a second data flow in \cref{discussion_bandwidth_link_usage}. After that we talk about our latency measurements in \cref{discussion_latency}, followed by our packet flow measurements using TCP and UDP in \cref{discussion_packet_flow}.

@ -6,15 +6,15 @@
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{testing_failure_path_1}
\caption{Failure path with 2 hops}
\label{fig:evaluation_failure_path_1_network}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{testing_failure_path_2}
\caption{Failure path with 3 hops}
\label{fig:evaluation_failure_path_2_network}
\end{subfigure}
\caption{Networks with longer failure paths}
\label{fig:evaluation_failure_path_networks}
@ -24,32 +24,35 @@ In this section we evaluate the results for our two failure path networks seen i
Most tests, however, did not produce significantly different results from the minimal network evaluated in \cref{eva_minimal_network}, which is why we will focus on differences between the two topology classes.
In \cref{evaluation_failure_path_bandwidth} we evaluate the influence of longer failure paths on achieved bandwidths in the network. Furthermore we investigate the influence of two concurrent data flows on the looped path in \cref{evaluation_failure_path_bandwidth_link_usage}. The impact of a longer looped path on the latency in the networks is evaluated in \cref{evaluation_failure_path_latency}. Lastly we inspect packet flow in our failure path networks using a TCP transfer in \cref{evaluation_failure_path_tcp_packet_flow} and a UDP transfer in \cref{evaluation_failure_path_udp_packet_flow}.
\subsection{Bandwidth}
\label{evaluation_failure_path_bandwidth}
When measuring the bandwidth of our networks with longer failure paths the results were similar to those of the bandwidth measurement in the minimal network, described in \cref{evaluation_minimal_bandwidth}.
The addition of hops to the failure path did not have an effect on the bandwidth. It has to be noted however that the packets sent through the looped path will potentially block other data flows, reducing the overall bandwidth of both data flows. This is evaluated in \cref{evaluation_failure_path_bandwidth_link_usage}.
\subsection{Bandwidth with concurrent data flow}
\label{evaluation_failure_path_bandwidth_link_usage}
We started two concurrent data flows using \textit{iperf}. In case of a failure, these two data flows would influence each other.
We first evaluate the tests run using only an FRR mechanism in \textit{With FRR}. We then evaluate the tests run with an FRR mechanism and an implementation of ShortCut in \textit{With FRR and ShortCut}.
\subsubsection{With FRR}
\begin{figure}
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_bandwidth_link_usage/bandwidth_link_usage_before_wo_sc}
\caption{Bandwidth before a failure}
\label{fig:evaluation_failure_path_1_bandwidth_link_usage_wo_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_bandwidth_link_usage/bandwidth_link_usage_after_wo_sc}
\caption{Bandwidth after a failure}
\label{fig:evaluation_failure_path_1_bandwidth_link_usage_wo_sc_b}
\end{subfigure}
\caption{Bandwidth with concurrent data transfer on H3 to H1}
\label{fig:evaluation_failure_path_1_bandwidth_link_usage_wo_sc}
@ -79,96 +82,94 @@ Our measurement with a failure occurring concurrent to our data transfers howeve
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_bandwidth_link_usage/bandwidth_link_usage_after_sc}
\caption{Bandwidth after failure - 1st network}
\label{fig:evaluation_failure_path_1_bandwidth_link_usage_sc_b}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_2_bandwidth_link_usage/bandwidth_link_usage_after_sc}
\caption{Bandwidth after failure - 2nd network}
\label{fig:evaluation_failure_path_2_bandwidth_link_usage_sc_b}
\end{subfigure}
\caption{Bandwidth with concurrent data transfer using ShortCut}
\label{fig:evaluation_failure_path_1_bandwidth_link_usage_sc}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=10cm]{tests/failure_path_1_bandwidth_link_usage/bandwidth_link_usage_concurrent_sc}
\caption{Bandwidth H1 to H4 with concurrent data transfer on H2 to H1 - failure occurring after 15 seconds using ShortCut}
\label{fig:evaluation_failure_path_1_bandwidth_link_usage_concurrent_sc}
\end{figure}
Using ShortCut allows both topologies to restore the full bandwidth for both transmissions, as the looped path is cut, causing the data transfers to not interfere with each other.
\subsection{Latency}
\label{evaluation_failure_path_latency}
We measured the latency between host H1 and H6 for our first failure path network and between host H1 and H8 for our second failure path network.
We first evaluate the tests run using only an FRR mechanism in \textit{With FRR}. We then evaluate the tests run with an FRR mechanism and an implementation of ShortCut in \textit{With FRR and ShortCut}.
\subsubsection{With FRR}
\begin{figure}
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_latency/latency_before_wo_sc}
\caption{Latency before a failure - 1st network}
\label{fig:evaluation_failure_path_1_latency_wo_sc_a}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_latency/latency_after_wo_sc}
\caption{Latency after a failure - 1st network}
\label{fig:evaluation_failure_path_1_latency_wo_sc_b}
\end{subfigure}
\vskip\baselineskip
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_2_latency/latency_before_wo_sc}
\caption{Latency before a failure - 2nd network}
\label{fig:evaluation_failure_path_2_latency_wo_sc_a}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_2_latency/latency_after_wo_sc}
\caption{Latency after a failure - 2nd network}
\label{fig:evaluation_failure_path_2_latency_wo_sc_b}
\end{subfigure}
\caption{Latency measured with \textit{ping} on both failure path networks}
\label{fig:evaluation_failure_path_1_latency_wo_sc}
\end{figure}
The additional hops in our failure path networks add, as expected, latency to the measurements, as each Mininet link in our tests adds a delay of \SI{5}{\milli\second}. When compared to our minimal network evaluated in \cref{evaluation_minimal_latency}, the first failure path network adds exactly \SI{5}{\milli\second} for its additional hop, as shown in \cref{fig:evaluation_failure_path_1_latency_wo_sc_a}. The second failure path network adds an extra \SI{5}{\milli\second} on top, which can be seen in \cref{fig:evaluation_failure_path_2_latency_wo_sc_a}.
When introducing a failure, our first failure path network adds around \SI{20}{\milli\second} of latency, as can be seen in \cref{fig:evaluation_failure_path_1_latency_wo_sc_b}. \cref{fig:evaluation_failure_path_2_latency_wo_sc_b} shows that the second failure path network adds an additional \SI{10}{\milli\second} to the latency in case of a failure, adding \SI{30}{\milli\second} in total. This is caused by the additional links on the longer path. Because only ICMP echo requests and not replies use the looped path, as packets returning from either host H6 or H8 are not forwarded to router R2 when arriving on router R1, the additional latency on the network will always be \SI{10}{\milli\second} for each link contained on the looped path. Each link on the looped path is passed twice by each ICMP echo request.
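This reasoning can be summarised in a short formula: if $n$ is the number of links on the looped path and $d$ the delay per link, the additional round-trip latency caused by the loop is approximately
\[
\Delta t \approx 2 \cdot n \cdot d .
\]
With $d = \SI{5}{\milli\second}$ this gives \SI{20}{\milli\second} for the two loop links of the first failure path network and \SI{30}{\milli\second} for the three loop links of the second failure path network, matching the measured values.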
\subsubsection{With FRR and ShortCut}
\label{failure_path_1_latency_with_frr_and_shortcut}
\begin{figure}
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_latency/latency_before_sc}
\caption{Latency before a failure on 1st failure path network}
\label{fig:evaluation_failure_path_1_latency_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_latency/latency_after_sc}
\caption{Latency after a failure on 1st failure path network}
\label{fig:evaluation_failure_path_1_latency_sc_b}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_2_latency/latency_before_sc}
\caption{Latency before a failure on 2nd failure path network}
\label{fig:evaluation_failure_path_2_latency_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_2_latency/latency_after_sc}
\caption{Latency after a failure on 2nd failure path network}
\label{fig:evaluation_failure_path_2_latency_sc_b}
\end{subfigure}
\vskip\baselineskip
\caption{Latency measured with \textit{ping} on both failure path networks using ShortCut}
@ -179,8 +180,10 @@ Similar to our results when measuring the minimal topology in \cref{evaluation_m
\subsection{Packet flow - TCP}
\label{evaluation_failure_path_tcp_packet_flow}
We measure the amount of TCP packets forwarded in our failure path networks. For this we attach \textit{nftables} counters to four routers in each of these topologies.
We first evaluate the tests run using only an FRR mechanism in \textit{With FRR}. We then evaluate the tests run with an FRR mechanism and an implementation of ShortCut in \textit{With FRR and ShortCut}.
\subsubsection{With FRR}
\label{failure_path_1_packet_flow_with_frr}
\begin{figure}
@ -188,29 +191,37 @@ We measure the amount of TCP packets forwarded in our failure path networks. For
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_1_packet_flow/packet_flow_after_wo_sc}
\caption{TCP Packets on routers after a failure - 1st network}
\label{fig:evaluation_failure_path_1_packet_flow_wo_sc_b}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/failure_path_2_packet_flow/packet_flow_after_wo_sc}
\caption{TCP Packets on routers after a failure - 2nd network}
\label{fig:evaluation_failure_path_2_packet_flow_wo_sc_b}
\end{subfigure}
\caption{TCP Packets on all routers measured with \textit{nftables} counters}
\label{fig:evaluation_failure_path_packet_flow_wo_sc}
\end{figure}
Similar to our observations in \cref{evaluation_minimal_tcp_packet_flow} only TCP packets bearing data are passed into the loop, as returning ACKs from the \textit{iperf} server are forwarded to the host H1 directly. Router R1 however forwards an additional 50\% of packets, forwarding packets containing data twice on their way to the \textit{iperf} server and forwarding returning ACKs once back to the host H1. The router R3 in our first failure path network and the router R4 in our second failure path network both only forward each TCP packet bearing data once. Both of these results are shown in \cref{fig:evaluation_failure_path_packet_flow_wo_sc}.
The implications for the network are, however, considerably more severe the more routers are contained on the looped path. Because each router that is not the start or end point of our looped path forwards each packet in the loop twice, the amount of packets forwarded relative to the total amount of packets forwarded in a healthy network steadily increases with the length of the loop. For our first failure path network this means an additional 50\% of packets forwarded, assuming that routers R4 and R6 forward the same amount of packets as router R5. In our second failure path network the amount of packets forwarded even increases by 60\%, assuming that router R3 forwards the same amount of packets as router R2 and that routers R5, R6 and R8 forward the same amount of packets as router R7.
\subsubsection{With FRR and ShortCut}
ShortCut was able to cut off the loop and therefore reduce the amount of packets forwarded to the original amount. Because this is similar to the behaviour in \cref{evaluation_minimal_tcp_packet_flow} we omitted additional graphs from our results.
\subsection{Packet flow - UDP}
We used the same test function as in \cref{evaluation_failure_path_tcp_packet_flow} but with the "udp" flag to run a packet flow measurement with UDP.
We first evaluate the tests run using only an FRR mechanism in \textit{With FRR}. We then evaluate the tests run with an FRR mechanism and an implementation of ShortCut in \textit{With FRR and ShortCut}.
\label{evaluation_failure_path_udp_packet_flow}
\subsubsection{With FRR}
\begin{figure}
\centering
@ -231,7 +242,9 @@ ShortCut was able to cut off the loop and therefore reduce the amount of packets
\label{fig:evaluation_failure_path_packet_flow_udp_wo_sc}
\end{figure}
Falling in line with our results in our TCP packet flow measurement in \cref{evaluation_failure_path_tcp_packet_flow}, a longer failure path adds to the overall workload of the network. This is shown in \cref{fig:evaluation_failure_path_packet_flow_udp_wo_sc}. Because UDP does not send ACKs back to the sender, all packets produced by the UDP data transfer are passed through the looped path.
Because router R1 and all routers up until the endpoint of the loop will forward the UDP packets twice, the amount of packets forwarded on the first failure path network increases by 100\%. When the failure path is extended as is the case in our second failure path network, the amount of packets forwarded on the network even increases by 120\%.
\subsubsection{With FRR and ShortCut}

@ -2,14 +2,18 @@
\label{eva_minimal_network}
\begin{figure}
\centering
\includegraphics[width=12cm]{testing_4r3h}
\caption{Minimal network}
\label{fig:evaluation_minimal_network}
\end{figure}
In this section we will evaluate the tests performed on our minimal topology, starting with an evaluation of the bandwidth measurements in \cref{evaluation_minimal_bandwidth}. We continue to evaluate the tests measuring the influence of an additional data flow on the looped path in \cref{evaluation_minimal_bandwidth_link_usage}. In \cref{evaluation_minimal_latency} we then evaluate our latency tests. Lastly we evaluate our packet flow measurements using a TCP transfer in \cref{evaluation_minimal_tcp_packet_flow} and a UDP transfer in \cref{evaluation_minimal_udp_packet_flow}.
\subsection{Bandwidth}
\label{evaluation_minimal_bandwidth}
We performed multiple tests of influences to the bandwidth with occurring failures. These were run using \textit{iperf} and a logging interval of \SI{0.5}{\second}. All data was collected from the output of the \textit{iperf} server.
We first evaluate the tests run using only an FRR mechanism in \textit{With FRR}. We then evaluate the tests run with an FRR mechanism and an implementation of ShortCut in \textit{With FRR and ShortCut}.
\subsubsection{With FRR}
\begin{figure}
@ -17,15 +21,15 @@ We performed multiple tests of influences to the bandwidth with occurring failur
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth/bandwidth_before_wo_sc}
\caption{Bandwidth before a failure}
\label{fig:evaluation_minimal_bandwidth_wo_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth/bandwidth_after_wo_sc}
\caption{Bandwidth after a failure}
\label{fig:evaluation_minimal_bandwidth_wo_sc_b}
\end{subfigure}
\caption{Bandwidth measured with \textit{iperf} from H1 to H4}
\label{fig:evaluation_minimal_bandwidth_wo_sc}
@ -43,7 +47,7 @@ We performed a TCP bandwidth test on the minimal network, a topology with 4 rout
In \cref{fig:evaluation_minimal_bandwidth_concurrent_wo_sc}, however, we introduced the failure while the bandwidth test was running. The test was run for \SI{30}{\second} and the failure was introduced after around \SI{15}{\second}, which caused no visible performance drop. However, in some executions of this test the performance dropped when introducing the failure and the log output of the sending client reported the need to resend up to 100 packets. Because this behaviour occurs only sporadically, we assume this to be a timing issue.
When the connection between routers is cut, our test framework uses the Mininet Python API to deactivate the corresponding interfaces on both affected routers. This is done in sequence. In this example the interface on router R2 was deactivated first and the interface on router R4 was deactivated second. We implemented this behaviour after observing the default behaviour of the Mininet network. If the connection between e.g. router R2 and router R4 was only cut by deactivating the interface on router R4, router R2 would not recognize the failure and would lose all packets sent to the link. Because we deactivate the interfaces in sequence and the Mininet Python API introduces delay to the operation, the interface on R2 will be deactivated while the interface on R4 will continue receiving packets already on the link and will continue sending packets to the deactivated interface on R2 for a short period of time. All packets sent to R2 in this time period will be lost. But because the \textit{iperf} server itself does not send any actual data, but only \textit{Acknowledgements} (ACK) for already received data, only ACKs are lost in the process.
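A minimal sketch of this sequential deactivation through the Mininet Python API is shown below; the node and interface names are examples, the real framework derives them from the affected link:
\begin{verbatim}
from mininet.net import Mininet

def fail_link(net: Mininet) -> None:
    # Example: cut the link between R2 and R4; interface names depend on
    # the topology definition.
    r2, r4 = net.get("r2"), net.get("r4")
    # First deactivate the interface on R2 ...
    r2.cmd("ip link set dev r2-eth2 down")
    # ... then the interface on R4. In the short window between the two
    # calls, packets that R4 still sends towards R2 are lost.
    r4.cmd("ip link set dev r4-eth1 down")
\end{verbatim}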
TCP (\cite{InformationSciencesInstituteUniversityofSouthernCalifornia.1981}) however does not necessarily resend lost ACKs, and the client does not necessarily resend all packets for which it did not receive an ACK. Data for which the ACKs were lost could still be implicitly acknowledged by the server if they e.g. belonged to the same window as following packets and the ACKs for these packets were received by the client. This could cause a situation in which the server already received data, but the client only receives a notification of the success of the transfer with a delay.
@ -60,15 +64,15 @@ In our further tests we observed that the bandwidth alone does not change heavil
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth/bandwidth_before_sc}
\caption{Bandwidth before a failure}
\label{fig:evaluation_minimal_bandwidth_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth/bandwidth_after_sc}
\caption{Bandwidth after a failure}
\label{fig:evaluation_minimal_bandwidth_sc_b}
\end{subfigure}
\caption{Bandwidth measured with \textit{iperf} from H1 to H4 using ShortCut}
\label{fig:evaluation_minimal_bandwidth_sc}
@ -85,24 +89,26 @@ In our further tests we observed that the bandwidth alone does not change heavil
As can be seen in \cref{fig:evaluation_minimal_bandwidth_sc} and \cref{fig:evaluation_minimal_bandwidth_concurrent_sc}, using ShortCut had no further influence on the achieved throughput. This is to be expected, as longer or shorter paths will only influence throughput if e.g. a link with a lower bandwidth is contained in an additional path.
\subsection{Bandwidth with concurrent data flow}
\label{evaluation_minimal_bandwidth_link_usage}
In this test we evaluated the bandwidth between hosts H1 and H4 with a concurrent data transfer between hosts H2 and H1. Both transfers were run with a limitation of \SI{100}{Mbps}, which constitutes the maximum allowed bandwidth in this test.
We first evaluate the tests run using only an FRR mechanism in \textit{With FRR}. We then evaluate the tests run with an FRR mechanism and an implementation of ShortCut in \textit{With FRR and ShortCut}.
\subsubsection{With FRR}
\begin{figure}
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth_link_usage/bandwidth_link_usage_before_wo_sc}
\caption{Bandwidth before a failure}
\label{fig:evaluation_minimal_bandwidth_link_usage_wo_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth_link_usage/bandwidth_link_usage_after_wo_sc}
\caption{Bandwidth after a failure}
\label{fig:evaluation_minimal_bandwidth_link_usage_wo_sc_b}
\end{subfigure}
\caption{Bandwidth with concurrent data transfer on H2 to H1}
\label{fig:evaluation_minimal_bandwidth_link_usage_wo_sc}
@ -116,11 +122,10 @@ In this test we evaluated the bandwidth between hosts H1 and H4 with a concurren
\end{figure}
Before a failure, as can be seen in \cref{fig:evaluation_minimal_bandwidth_link_usage_wo_sc_a}, the throughput is at around \SI{100}{Mbps} which is our current maximum. While the additional transfer between hosts H2 and H1 does in fact use some of the links that are also used in our \textit{iperf} test, namely the link between routers R1 to R2 and host H1 to R1, it does so in a different direction. While the data itself is sent from host H1 to H4 over router R2, only the TCP ACKs are sent on the route back. Data from host H2 to H1 is sent from routers R2 to R1 and therefore only the returning ACKs use the link in the same direction, not impacting the achieved throughput.
If a failure is introduced however, traffic from host H1 loops over router R2 using up bandwidth on a link that is also passed by the additional data flow. Therefore we experience a huge performance drop to around \SIrange{20}{30}{Mbps}, while the additional data flow drops in performance to around \SI{80}{Mbps} as can be seen in \cref{fig:evaluation_minimal_bandwidth_link_usage_wo_sc_b}. From a network perspective, this results in a combined loss of 50\% throughput. While the amount of traffic sent through the network before the failure amounted to \SI{200}{Mbps}, it now dropped down to a combined \SI{100}{Mbps}.
During the bandwidth measurement in \cref{fig:evaluation_minimal_bandwidth_link_usage_wo_sc_b} there are small drops in performance
\subsubsection{With FRR and ShortCut}
@ -129,15 +134,15 @@ During the bandwidth measurement in \cref{fig:evaluation_minimal_bandwidth_link_
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth_link_usage/bandwidth_link_usage_before_sc}
\caption{Bandwidth before a failure}
\label{fig:evaluation_minimal_bandwidth_link_usage_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_bandwidth_link_usage/bandwidth_link_usage_after_sc}
\caption{Bandwidth after a failure}
\label{fig:evaluation_minimal_bandwidth_link_usage_sc_b}
\end{subfigure}
\caption{Bandwidth with concurrent data transfer on H2 to H1 using ShortCut}
\label{fig:evaluation_minimal_bandwidth_link_usage_sc}
@ -146,7 +151,7 @@ During the bandwidth measurement in \cref{fig:evaluation_minimal_bandwidth_link_
\begin{figure}
\centering
\includegraphics[width=10cm]{tests/minimal_bandwidth_link_usage/bandwidth_link_usage_concurrent_sc}
\caption{Bandwidth H1 to H4 with concurrent data transfer on H2 to H1 - failure occurring after 15 seconds using ShortCut}
\label{fig:evaluation_minimal_bandwidth_link_usage_concurrent_sc}
\end{figure}
When activating our implementation of ShortCut no significant change of values can be observed. This is due to the removal of the looped path, effectively allowing both data transfers to run on full bandwidth. This completely restores the original traffic throughput achieved by both data transfers of \SI{200}{Mbps}.
@ -162,14 +167,14 @@ In the following sections we evaluate the latency measurements run on the minima
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_latency/latency_before_failure_wo_sc}
\caption{Latency before a failure}
\label{fig:evaluation_minimal_latency_wo_sc_a}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_latency/latency_after_failure_wo_sc}
\caption{Latency after a failure}
\label{fig:evaluation_minimal_latency_wo_sc_b}
\end{subfigure}
\caption{Latency measured with ping}
\label{fig:evaluation_minimal_latency_wo_sc}
@ -184,7 +189,7 @@ In the following sections we evaluate the latency measurements run on the minima
As each link adds \SI{5}{\milli\second} of delay and \textit{ping} logs the difference in time between sending a packet and receiving an answer, the approximate delay would be the number of links passed \textit{N} multiplied by the delay per link. In our test network there are 6 links between hosts H1 and H4. Because these links are passed twice, one time to host H4 and one time back to host H1, this results in an approximate delay of \SI{60}{\milli\second}.
The test run confirmed these assumptions. As can be seen in \cref{fig:evaluation_minimal_latency_wo_sc_a} a ping on the network without failure took an average of around \SI{65}{\milli\second} with slight variations. The additional \SI{5}{\milli\second} are most likely caused in the routing process on the routers.
When introducing a failure however, additional links are passed on the way from host H1 to H4. Instead of 6 links passed per direction, the network now sends the packets on a sub-optimal path which adds 2 passed links from router R1 to R2 and back. These are only passed when sending packets to host H4, packets returning from H4 will not take the sub-optimal path. This would, in theory, add around \SI{10}{\milli\second} of delay to our original results.
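As a short worked example of this estimate:
\[
t_{\mathrm{RTT}} \approx 2 \cdot 6 \cdot \SI{5}{\milli\second} = \SI{60}{\milli\second}
\]
before a failure, while the two additional link traversals from router R1 to R2 and back add another $2 \cdot \SI{5}{\milli\second} = \SI{10}{\milli\second}$, giving an expected round-trip time of roughly \SI{70}{\milli\second} after a failure.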
@ -199,15 +204,15 @@ When the failure is introduced concurrent to a running test, the latency spikes
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_latency/latency_before_failure_sc}
\caption{Latency before a failure}
\label{fig:evaluation_minimal_latency_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_latency/latency_after_failure_sc}
\caption{Latency after a failure}
\label{fig:evaluation_minimal_latency_sc_b}
\end{subfigure}
\caption{Latency measured with ping using ShortCut}
\label{fig:evaluation_minimal_latency_sc}
@ -229,7 +234,7 @@ The spike in latency which can be seen in \cref{fig:evaluation_minimal_latency_c
\subsection{Packet flow - TCP}
\label{evaluation_minimal_tcp_packet_flow}
To show the amount of TCP packets being forwarded on each router, we measured the packet flow on all routers of this topology. This is done by counting TCP packets with \textit{nftables} while a concurrent data transfer is started from host H1 to H4. The results include the amount of packets forwarded on each router per second. This was done with an intermediate and concurrent failure for a network with FRR in \textit{With FRR}, as well as a network with an additional implementation of ShortCut in \textit{With FRR and ShortCut}.
\subsubsection{With FRR}
\label{minimal_packet_flow_with_frr}
@ -238,15 +243,15 @@ To show the amount of TCP packets being forwarded on each router, we measured th
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_packet_flow/packet_flow_before_wo_sc}
\caption{TCP packet flow before a failure}
\label{fig:evaluation_minimal_packet_flow_wo_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_packet_flow/packet_flow_after_wo_sc}
\caption{TCP packet flow after a failure}
\label{fig:evaluation_minimal_packet_flow_wo_sc_b}
\end{subfigure}
\caption{TCP Packets on all routers measured with \textit{nftables} counters}
\label{fig:evaluation_minimal_packet_flow_wo_sc}
@ -262,7 +267,7 @@ The results in the network before a failure are as to be expected and can be see
After a failure all four routers receive packets as can be seen in \cref{fig:evaluation_minimal_packet_flow_wo_sc_b}, but router R1 now receives most packets with an average of around 1500 packets while routers R3 and R4 receive roughly the same amount of packets as before the failure at an average of around 1000 packets. Router R2 receives the least packets with an average of around 500 packets.
This is most likely caused by the looped path and the implications for packet travel this has. Router R1 receives all packets that are sent to host H4 from H1 twice, once sending them to router R2 and the second time when receiving the packets back from router R2 to send them to router R3. But while all packets sent from host H1 pass router R1 twice, ACKs sent back by the \textit{iperf} server on H4 will only pass R1 once, as R1 would not send packets with H1 as destination to R2. Router R2 on the other hand only receives packets sent to H4 but none of the ACKs sent back. This is why, when compared to the average packet count of all routers in \cref{fig:evaluation_minimal_packet_flow_wo_sc_a}, R2 receives roughly half of all packets a router would normally receive as TCP specifies that for each received packet TCP will send an ACK as answer. This also explains why router R1 forwards an average of around 1500 packets per second, forwarding data packets with around 500 packets per second twice and forwarding acknowledgement packets once with also 500 packets per second, producing an additional 50\% load on the router.
Aside from the changed path and therefore the inclusion of router R3 in this path, routers R3 and R4 are unaffected by the failure, forwarding each packet once.
@ -282,15 +287,15 @@ Reconfiguration of routers in Mininet does not reset the \textit{nftables} count
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_packet_flow/packet_flow_before_sc}
\caption{TCP packet flow before a failure}
\label{fig:evaluation_minimal_packet_flow_sc_a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{tests/minimal_packet_flow/packet_flow_after_sc}
\caption{TCP packet flow after a failure}
\label{fig:evaluation_minimal_packet_flow_sc_b}
\end{subfigure}
\caption{TCP Packets on all routers measured with \textit{nftables} counters using ShortCut}
\label{fig:evaluation_minimal_packet_flow_sc}
@ -310,6 +315,8 @@ When running the TCP packet flow measurements with an implementation of ShortCut
\label{evaluation_minimal_udp_packet_flow}
We repeated the packet flow test in \cref{evaluation_minimal_tcp_packet_flow} using UDP to inspect the differences caused by the two protocols.
We first evaluate the tests run using only an FRR mechanism in \textit{With FRR}. We then evaluate the tests run with an FRR mechanism and an implementation of ShortCut in \textit{With FRR and ShortCut}.
\subsubsection{With FRR}
\begin{figure}
\centering
@ -339,7 +346,7 @@ We repeated the packet flow test in \cref{evaluation_minimal_tcp_packet_flow} us
When running the packet flow test measuring UDP packets the amount of packets was much higher compared to TCP packets. \textit{iperf} uses different packet sizes for each protocol, sending TCP packets with a size of \SI{128}{\kilo\byte} and UDP packets with only a size of \SI{8}{\kilo\byte} (\cite{Dugan.2016}). The same amount of data transmitted should therefore produce a packet count roughly 16 times higher when using UDP compared to TCP. TCP however, as can be seen in \cref{fig:evaluation_minimal_packet_flow_wo_sc_a}, causes the routers to log around 1000 packets per second when running a bandwidth measurement limited by the overall bandwidth limit on the network of \SI{100}{\mega\bit\per\second}. A naive assumption would be that UDP should send 16000 packets per second over the network, but that does not match our test results seen in \cref{fig:evaluation_minimal_packet_flow_udp_wo_sc_a}, where only around 7800 packets per second are logged on the routers.
The reason for this is also the key difference between TCP and UDP: TCP uses ACKs to confirm the transmission of packets. These are packets returning from the \textit{iperf} server to the client. For each received data packet, the server will send back an ACK. If no ACK is sent for a packet, the client will resend the missing packet. This causes the network to transmit twice the amount of packets, one half containing the actual data and one half only containing ACKs.
UDP however will just blindly send packets on their way and does not evaluate whether they actually reached their destination. Therefore all UDP packets contain data and no additional packets for confirmation or congestion control etc. are sent over the network.
Because UDP does not send ACKs the results observed after a failure in \cref{fig:evaluation_minimal_packet_flow_udp_wo_sc_b} are very telling, with routers R2, R3 and R4 all forwarding the same amount of packets and router R1 forwarding exactly double the amount of packets.
@ -373,5 +380,4 @@ When using ShortCut in a UDP packet flow measurement, the negative consequences
\label{fig:evaluation_minimal_packet_flow_udp_concurrent_sc}
\end{figure}

@ -1,9 +1,9 @@
\chapter{Implementation}
In the following chapter we implement an exemplary network in Mininet, including routing between hosts and routers.
We also implement fast re-routing as well as ShortCut. In \cref{sec:test_network} we explain the test framework that we built for performing tests. In \cref{implementation_rrt} we then explain how we implemented FRR in the test framework.
Lastly we talk about our implementation of ShortCut in \cref{implementation_shortcut}.
All implementations, this thesis and the test results can be accessed on our Git repository (\cite{Maaen.052022}).
\input{content/implementation/test_network}

@ -6,6 +6,6 @@ Depending on the target address and the interface of entry we can determine whet
In our simple network this would be the case in a scenario where a packet is going from H1 to H3, but the link between R2 and R4 would be unavailable. R2 then returns the packet to R1. The information that a packet to H3 should not normally be received on the ethernet device linked to R2 can be used to re-route this packet by using additional routing tables, referenced by ip policy rules.
The function \textit{ip rule} is part of \textit{iproute2} (\cite{AlexeyKuznetsov.2011}) and allows for the addition of policy rules that decide for each packet coming into the router which routing table to use, depending on the destination, source, incoming interface or outgoing interface. It also provides additional options and configurations, but in this context only using the destination and the incoming interface will suffice.
By specifying additional routing tables for each router and each input, and adding alternative routes, we effectively implement FRR behaviour.
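The following sketch illustrates how such a rule and its alternative routing table could be installed on router R1 through the Mininet Python API; the interface names, the table number and the addresses are purely illustrative and do not correspond to the exact configuration used by the framework:
\begin{verbatim}
from mininet.net import Mininet

def install_frr_rules(net: Mininet) -> None:
    r1 = net.get("r1")
    # A packet for H3's subnet that arrives on the interface towards R2
    # indicates a failure further along the default path, so a policy rule
    # diverts it to the alternative routing table 100 ...
    r1.cmd("ip rule add iif r1-eth2 to 10.0.3.0/24 lookup 100")
    # ... which holds the detour route towards R3.
    r1.cmd("ip route add 10.0.3.0/24 via 10.0.13.2 dev r1-eth3 table 100")
\end{verbatim}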

@ -5,13 +5,16 @@ It should then be able to manipulate routing table entries in case certain condi
The routing table manipulation has to take effect as fast as possible. Furthermore the performance impact of the ShortCut implementation has to be evaluated.
We begin by discussing possible options to detect failures, including the identification of returning packets, in \cref{identifying_packets}. After selecting the identification method using \textit{nftables}, we go on to explain the implementation in \cref{impl_nftables}.
\subsection{Identifying failures}
\label{identifying_packets}
To determine which route should be deleted from the routing table, ShortCut has to gather knowledge about the packets forwarded by the router. The already implemented FRR explained in \cref{implementation_rrt} adds routing tables which will be used depending on the interface the packet was received on. These alternative routes are also added to the default routing table with a lower priority metric, in case the link directly connected to the router would fail.
To identify a packet that is returning, and therefore a failure, we already use the incoming interface when implementing FRR. If we were, however, able to execute a function when such a packet is received, we would also be able to delete the old, invalid routing table entry. For this there are several approaches which could be used.
The programming language P4 (\cite{Bosshart.2014}) can be used to write router logic and compile it so that a Mininet router's behaviour could be changed completely. This would also allow us to execute additional functionality in certain cases, e.g. if a specific ip route table entry is hit, but would require a manual implementation of core router functionalities like ARP handling.
In the context of this work this is not feasible.
@ -21,7 +24,7 @@ Another possible solution is the usage of low-complexity controllers, but to be
A far more easily implemented solution is packet filtering, which allows us to use most of the pre-existing functionality and logic while still being able to react to certain packets. In most Linux distributions this functionality is already supplied by the \textit{nftables} package, which is mostly used for firewalls and therefore has the ability to specify rules for packets which will then be passed to a multitude of targets, e.g. log files or a so-called "netfilter queue".
\subsection{Implementation using nftables}
\label{impl_nftables}
Netfilter tables (\textit{nftables}) is a packet filtering tool for implementing firewalls in the Linux kernel and uses tables to store chains, which are sets of rules for incoming packets hooked to different parts of the network stack. These hook points are provided by the Netfilter kernel interface.
It is certainly possible to write software using these hooks directly, but in the scope of this work we will use \textit{nftables} to expose packets using \textit{nftables} rule definitions and their possible targets.
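As an illustration, rules of this kind could be installed on a router from the test framework roughly as follows; the table and chain names, the interface, the address and the queue number are assumptions and not the exact rules used in our implementation:
\begin{verbatim}
from mininet.net import Mininet

def install_shortcut_filter(net: Mininet) -> None:
    r1 = net.get("r1")
    # Create a dedicated nftables table and a chain hooked into prerouting.
    r1.cmd("nft add table ip shortcut")
    r1.cmd("nft 'add chain ip shortcut prerouting "
           "{ type filter hook prerouting priority -150; }'")
    # Returning packets (destination H3, arriving on the interface towards
    # R2) are counted and handed to a netfilter queue, where a userspace
    # process can remove the now invalid route from the routing table.
    r1.cmd("nft 'add rule ip shortcut prerouting "
           "iifname \"r1-eth2\" ip daddr 10.0.3.0/24 counter queue num 0'")
\end{verbatim}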

@ -1,22 +1,22 @@
\section{Implementation of a testing framework}
\label{sec:test_network}
We describe the setup of a custom test framework which we will continue to use for the tests on Mininet networks. The framework is a collection of pre-configured networks, tests, routings and scripts, which can be used as a base for automatic tests.
The framework is implemented in Python 3.8.
To perform tests for multiple topologies and failure scenarios, a structured framework should be implemented.
The core component is a main script called \textit{mininet\_controller}, responsible for performing all operations on the Mininet network using the Mininet Python API.
Topologies can be created and added to the framework by creating a python module in the package "topologies", inheriting from the "CustomTopo" class described in \textit{CustomTopo} in \cref{base_components}. They then have to be referenced in the "topos" dictionary in the \textit{mininet\_controller} with the name of the topology as key. The corresponding value is a dictionary which in turn contains the class, the module name, i.e. the file name without extension containing the class, and a description of the topology as items.
Each topology implements a \textit{build} function which is a predefined function for Mininet topologies and contains the setup of links, hosts and switches. When creating a Mininet network an object of the topology can be passed to Mininet, which will call the \textit{build} function to create network components.
If routers should be used in a topology they can be added as nodes. This is done by calling the \textit{addNode} function and passing it the "cls" parameter with our custom router class as value. The router class is described in \textit{LinuxRouter} in \cref{base_components}.
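A condensed sketch of such a topology module and its registration is given below; the class name, module path, addresses and dictionary keys are illustrative and follow the description above rather than the exact framework code:
\begin{verbatim}
# Assumed module paths for the framework's own base classes.
from topologies.custom_topo import CustomTopo
from topologies.linux_router import LinuxRouter

class TwoRouterTopo(CustomTopo):
    def build(self, **kwargs):
        # Routers are added as nodes using the custom router class ...
        r1 = self.addNode("r1", cls=LinuxRouter, ip="10.0.1.1/24")
        r2 = self.addNode("r2", cls=LinuxRouter, ip="10.0.2.1/24")
        # ... hosts and links as in any Mininet topology.
        h1 = self.addHost("h1", ip="10.0.1.100/24",
                          defaultRoute="via 10.0.1.1")
        self.addLink(h1, r1)
        self.addLink(r1, r2)

# Registration in the mininet_controller, with the keys described above.
topos = {
    "two_router": {
        "class": TwoRouterTopo,
        "module": "two_router_topo",
        "description": "Minimal example topology with two routers",
    },
}
\end{verbatim}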
Each topology also provides its own routing, IP policies for FRR and tests, as each of these have to be manually adjusted for each topology. We use python dictionaries to store these configurations. The configuration of routings is described in \cref{configuration_routing}, the configuration for IP policies is described in \cref{configuration_ip_policy} and the configuration of tests is described in \cref{configuration_test}.
The test framework already provides measurement commands which can be used in test configurations. These are described in \cref{implementation_commands}. Results from these measurements can be used to plot graphs using our plotting commands described in \cref{plotting}.
Lastly we also provide users with a parameter-based CLI, which we describe in \cref{command_line_interface}.
\subsection{Base components}
\label{base_components}
@ -55,7 +55,7 @@ A rule consists of a table it should be applied to, an incoming port and a list
\label{configuration_test}
The test configuration is a Python dictionary with test names as keys. Each test is defined by a Python dictionary containing information about the test.
A test has two phases, the pre-execution phase and the execution phase. In the pre-execution phase a list of commands pre-defined by the Mininet controller can be used to prepare the network for a test. These commands can be specified under the key \textit{pre\_execution}. This was implemented because a network that was just created might have additional delays on the first execution of functions caused by e.g. \textit{Address Resolution Protocol} (ARP) handling.
The execution phase contains the actual testing. A command can be executed and its results will be written to the Mininet console.
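An exemplary test entry could therefore look roughly like the following; the command names and parameters are placeholders based on the description in this section, not the exact configuration syntax of the framework:
\begin{verbatim}
tests = {
    "bandwidth_h1_h4": {
        # Pre-execution phase: warm up the network so that e.g. ARP
        # resolution does not distort the first measured values.
        "pre_execution": [
            ("ping", {"hosts": ["h1", "h4"], "count": 1}),
        ],
        # Execution phase: the actual measurement command.
        "execution": [
            ("measure_bandwidth", {
                "hosts": ["h1", "h4"],
                "ips": ["10.0.1.100", "10.0.4.100"],
                "length": 30,
                "interval": 0.5,
            }),
        ],
    },
}
\end{verbatim}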
@ -73,7 +73,7 @@ This function will use a connection which is by definition a list with 2 element
\subsubsection{measure\_bandwidth}
\label{measure_bandwidth}
The function \textit{measure\_bandwidth} is used to measure the bandwidth between two Mininet hosts in a Mininet network using \textit{iperf}, logging and parsing results after the measurement. Results are then sent to one of the plotting functions described in \cref{plotting}.
This function will use a 2-element list of hosts, a 2-element list of IPs, a length parameter that defines how long the test will run in seconds, an interval parameter defining the interval between each log entry of \textit{iperf}, a unique test name for naming a created graph or log entry, a graph title in case a graph should be created, a flag that defines whether \textit{iperf} should use TCP or UDP as transfer protocol and a bandwidth to limit the transfer rate of \textit{iperf}.
The command starts an \textit{iperf} server. While experimenting we sometimes experienced unexpected behaviour causing tests to fail. There seemed to be an issue with the timing of the \textit{iperf} server and client commands which were executed on the corresponding devices. Because the \textit{iperf} server and client were started detached and the python script executed both commands directly one after another, the client seemed to try to connect to the server while the server was still in its startup process, denying the connection. This is why we added an additional delay between server and client command execution.
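A simplified sketch of this command using the parameters described above is shown below; the exact framework implementation differs, and the log file location is an example:
\begin{verbatim}
import time

def measure_bandwidth(net, hosts, ips, length=30, interval=0.5,
                      test_name="bandwidth", use_udp=False,
                      bandwidth="100M"):
    client, server = net.get(hosts[0]), net.get(hosts[1])
    proto = "-u" if use_udp else ""
    # Start the iperf server detached and give it time to come up before
    # the client connects (see the timing issue described above).
    server.cmd(f"iperf -s {proto} -i {interval} "
               f"> /tmp/{test_name}_server.log 2>&1 &")
    time.sleep(2)
    # The client sends data to the server for the configured duration; the
    # bandwidth limit is mainly relevant for UDP transfers.
    client.cmd(f"iperf -c {ips[1]} {proto} -b {bandwidth} "
               f"-t {length} -i {interval}")
\end{verbatim}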
@ -82,7 +82,7 @@ The command starts an \textit{iperf} server. While experimenting we sometimes ex
The function \textit{measure\_link\_usage\_bandwidth} is used to start two separate \textit{iperf} measurements between two host-pairs and will log and parse results of both measurements. The results are then passed to the multi-plotting function described in \cref{plotting}.
The second \textit{iperf} measurement will use the port 5202 instead of the default port 5201 in case two servers are started on the same device.
The function reuses the \textit{measure\_bandwidth} function described in section \ref{measure_bandwidth}. We call the measurement done by the \textit{measure\_bandwidth} function the "main" measurement, the additional measurement used to evaluate the influence of another file transfer on the network the "additional" measurement.
The function reuses the \textit{measure\_bandwidth} function described in \cref{measure_bandwidth}. We call the measurement done by the \textit{measure\_bandwidth} function the "main" measurement, the additional measurement used to evaluate the influence of another file transfer on the network the "additional" measurement.
The measurements are configurable by providing an interval in which \textit{iperf} will log results, as well as a length parameter to specify how long an \textit{iperf} measurement should be run.
While the main measurement runs for exactly the specified time, the additional measurement runs slightly longer to work around timing issues.
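A possible shape of this function is sketched below, reusing the \textit{measure\_bandwidth} sketch from above; the port handling follows the description (default port 5201, additional measurement on port 5202), everything else is an assumption.
\begin{lstlisting}[language=Python, caption={Sketch of two concurrent bandwidth measurements (assumed signature)}]
import threading

def measure_link_usage_bandwidth(main_hosts, main_ips,
                                 add_hosts, add_ips,
                                 length=30, interval=1):
    def additional_measurement():
        client, server = add_hosts
        server.cmd("iperf3 -s -p 5202 -D")   # second server on port 5202
        # run slightly longer than the main measurement
        client.cmd(f"iperf3 -c {add_ips[1]} -p 5202 "
                   f"-t {length + 2} -i {interval}")

    thread = threading.Thread(target=additional_measurement)
    thread.start()
    result = measure_bandwidth(main_hosts, main_ips,
                               length=length, interval=interval)
    thread.join()
    return result
\end{lstlisting}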
@ -112,7 +112,7 @@ There are python libraries for reading \textit{nftables} entries. Because an imp
After starting a bandwidth test using \textit{iperf} on the client device, the packet counters are started, spawning a python thread for each measurement target. In each of these threads the bash command for displaying counters is used to read the current count. The output is captured in python, parsed and saved to a log file named after the logged device, with the current time of execution in a format suitable for \textit{gnuplot}.
After stopping the \textit{iperf} server the created log files are passed as a dictionary with the corresponding label for the data to the multi-plotting function explained in section \ref{plotting}.
After stopping the \textit{iperf} server the created log files are passed as a dictionary with the corresponding label for the data to the multi-plotting function explained in \cref{plotting}.
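A compact sketch of such a counter thread is given below; the \textit{nftables} table and counter names are assumptions, while the logging format (elapsed time and packet count per line) follows the description above.
\begin{lstlisting}[language=Python, caption={Sketch of a packet counter logger using nftables (table and counter names assumed)}]
import re
import time

def log_packet_counter(router, logfile, duration=30, interval=1):
    # Periodically read a named nftables counter on one router and write
    # "elapsed_time packet_count" lines in a gnuplot-friendly format.
    start = time.time()
    with open(logfile, "w") as log:
        while time.time() - start < duration:
            out = router.cmd("nft list counter inet filter packet_count")
            match = re.search(r"packets (\d+)", out)
            if match:
                log.write(f"{time.time() - start:.1f} {match.group(1)}\n")
            time.sleep(interval)
\end{lstlisting}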
\subsection{Plotting functions}
@ -127,12 +127,12 @@ This data is then passed to \textit{gnuplot}, which will produce an eps file con
\label{fig:example_plotting}
\end{figure}
In addition to plotting a single line in a graph we also implemented a function to plot multiple data files in \textit{gnuplot} automatically. The function uses a greyscale as line colors and different dash styles for differentiating plots. It adds a defined label to each dataset. This can be used to e.g. plot the packet flow of multiple devices. A plot created with this method will look something like can be seen in figure \ref{fig:example_multiplotting}.
In addition to plotting a single line in a graph we also implemented a function to plot multiple data files in \textit{gnuplot} automatically. The function uses a greyscale as line colors and different dash styles to differentiate the plots. It adds a defined label to each dataset. This can be used to e.g. plot the packet flow of multiple devices. A plot created with this method will look similar to the one shown in \cref{fig:example_multiplotting}.
\begin{figure}
\centering
\includegraphics[width=9cm]{packet_flow_intermediate_wo_sc}
\caption{Exemplary latency test run with automatic plotting}
\includegraphics[width=9cm]{tests/minimal_packet_flow/packet_flow_after_wo_sc}
\caption{Exemplary packet flow test run with automatic plotting of multiple graphs in a figure}
\label{fig:example_multiplotting}
\end{figure}
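The following sketch shows how such a multi-plot could be generated by writing a \textit{gnuplot} script and piping it to the \textit{gnuplot} binary; the greyscale values and dash styles are illustrative, and the two-column log file layout (time and value) is an assumption.
\begin{lstlisting}[language=Python, caption={Sketch of automatic multi-plotting with gnuplot (file layout assumed)}]
import subprocess

def plot_multiple(datafiles, output="multiplot.eps", title="Packet flow"):
    # datafiles maps a label to a two-column log file (time, value)
    greys = ["#000000", "#555555", "#888888", "#aaaaaa"]
    plots = []
    for i, (label, path) in enumerate(datafiles.items()):
        plots.append(f"'{path}' using 1:2 with lines "
                     f"lc rgb '{greys[i % len(greys)]}' "
                     f"dashtype {i + 1} title '{label}'")
    script = ("set terminal postscript eps enhanced\n"
              f"set output '{output}'\n"
              f"set title '{title}'\n"
              "plot " + ", ".join(plots) + "\n")
    subprocess.run(["gnuplot"], input=script.encode(), check=True)
\end{lstlisting}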
@ -1,23 +1,23 @@
\chapter{Introduction}
\label{introduction}
\section{Motivation}
In recent years, especially during the COVID-19 pandemic, network usage has risen exponentially. In Germany alone the per capita data usage on the terrestrial network has risen from \SI{98}{\giga\byte} per month in 2017 to \SI{175}{\giga\byte} in 2020 (\cite{BundesnetzagenturDeutschland.2021}).
\section{Motivation}
A large part of the population suddenly had to spend additional time in their homes, which has contributed to this rise in data usage. But this development is not limited to the pandemic. Data usage has been constantly rising due to the popularity of streaming services, increased internet usage in daily life and the rising popularity of cloud-based services.
Because of the increased usage, failing networks cause increasingly severe social and economic costs. This is why the reliability of networks is as important as ever.
Failures in networks will always occur, be it through hardware failures, errors in software or human error. In addition to this, the maintenance of networks will also regularly reduce a network's performance or cause the whole network to be unavailable.
Network administrators use a multitude of ways to increase performance, reduce the impact of failures on the network and achieve the highest possible availability and reliability. Two of these methods include the usage of global convergence protocols like Open Shortest Path First (OSPF) (\cite{Moy.041998}) or similar methods, either on the routers themselves or on a controller in a software defined network (SDN), and the usage of Fast Re-Routing (FRR) (\cite{Chiesa.2021}) approaches.
Network administrators use a multitude of ways to increase performance, reduce the impact of failures on the network and achieve the highest possible availability and reliability. Two of these methods include the usage of global convergence protocols like \textit{Open Shortest Path First} (OSPF) (\cite{Moy.041998}) or similar methods, either on the routers themselves or on a controller in a \textit{Software Defined Network} (SDN), and the usage of FRR (\cite{Chiesa.2021}) approaches.
The key difference between the two is the time they take to become active. Because FRR mechanisms only use data available on the device, they tend to take effect almost immediately. Global convergence protocols however are slow, sometimes even taking seconds to converge (\cite{Liu.2013}). This is due to them collecting information about the network by communicating with multiple devices, recomputing routes for all affected parts of the network and deploying these flows on routers and switches.
Most FRR approaches will however create sub-optimal paths which may already be in use or contain loops, effectively reducing the performance of the network.
FRMs like ShortCut (\cite{Shukla.2021}), Resilient Routing Layers (RRL) (\cite{Kvalbein.2005}), Revive (\cite{Haque.2018}) and Blink (\cite{ThomasHolterbach.2019}) try to alleviate this issue by removing longer paths from the routings only using data available on the device, bridging the gap between FRR and the global convergence protocol.
FRMs like ShortCut (\cite{Shukla.2021}), \textit{Resilient Routing Layers} (RRL) (\cite{Kvalbein.2005}), Revive (\cite{Haque.2018}) and Blink (\cite{ThomasHolterbach.2019}) try to alleviate this issue by removing longer paths from the routings only using data available on the device, bridging the gap between FRR and the global convergence protocol.
\section{State of the art}
@ -29,26 +29,24 @@ ShortCut uses information about the incoming packet to determine whether or not
Revive installs backup routes pro-actively using an optimized algorithm and controllers, but is prone to loops created by alternative paths as failures are not propagated to routers.
Blink WRITE THIS
Blink uses \textit{Transmission Control Protocol} (TCP) mechanisms to detect failures in TCP flows on the network, but only works reactively, requiring the network to have already failed before taking effect.
All of these FRMs are described in further detail in \cref{blink}.
Older FRMs have already been evaluated thoroughly and even though they do work in theory they either have yet to see widespread implementation or face limitations in their applicability, be it by requiring a high amount of resources or by using e.g. packet manipulation, excluding networks which by structure are incompatible to such mechanisms.
Even though some FRMs were already released and discussed more than a decade ago, they have yet to see widespread implementation as they either face limitations in their applicability to networks, e.g. because they require packet manipulation, or in their resource usage.
\section{Contribution}
We provide an introduction to the topic of modern networks, failure scenarios and resilient routing.
In this context we use Mininet (\cite{LantzBobandtheMininetContributors.2021}), a tool to create virtual networks, to implement multiple topologies with routings. We then implement a simple FRR mechanism, re-routing returning packets to an alternative path.
In this context we use Mininet (\cite{LantzBobandtheMininetContributors.}), a tool to create virtual networks, to implement multiple topologies with routings. We then implement a simple FRR mechanism, re-routing returning packets to an alternative path.
We implement ShortCut using \textit{nftables} (\cite{Ayuso.2019}) and python 3.8 (\cite{vanRossum.2009}) that can be installed on any linux based router, and which is used for testing.
To test ShortCut we provide a prototype implementation using \textit{nftables} (\cite{Ayuso.2019}) and python 3.8 (\cite{vanRossum.2009}) that can be installed on any Linux-based router.
We build a testing framework that can be used to automatically create Mininet topologies, formulate tests in python dictionaries using existing measurement functions, set network wide bandwidth limits or delays and to run automatic sets of tests.
The framework can be called using an argument based \textit{Command Line Interface} (CLI).
We developed a testing framework that can be used to automatically create Mininet topologies, formulate tests in python dictionaries using existing measurement functions, set network wide bandwidth limits or delays and to run automatic sets of tests.
The framework can be called using an argument based \textit{command line interface} (CLI).
Using this framework we test several topologies with increasingly longer looped paths using FRR with and without ShortCut. We discover that ShortCut is able to reduce latency in our topologies by up to 30\%, reduces the amount of packets forwarded on the network by up to 38\% for TCP transfers and up to 55\% for \textit{User Datagram Protocol} (UDP) transfers and is able to remove bottlenecks created by concurrent data flows on the looped paths, restoring the original functionality of the network.
Using this framework we test several topologies using FRR with and without ShortCut and discuss the results, showing the possible applications and benefits of the FRM ShortCut.
The test framework, this thesis and all test results can be accessed on our Git repository (\cite{Maaen.052022}).
@ -1,7 +1,7 @@
\chapter{Testing}
\label{cp:testing}
In this chapter we define and perform the tests for our evaluation. For this we first explain created topologies and their routings in section \ref{topologies_and_routing}, followed by an explanation of our measurements which are taken of every topology in section \ref{testing_measurements}. We then go on to explain the addition of failures, FRRs and FRMs to the tests in section \ref{testing_failures}. Lastly we explain the process of testing in our test framework in \ref{testing_performing}.
In this chapter we define and perform the tests for our evaluation. For this we first explain the created topologies and their routings in \cref{topologies_and_routing}, followed by an explanation of our measurements, which are taken for every topology, in \cref{testing_measurements}. We then go on to explain the addition of failures, FRRs and FRMs to the tests in \cref{testing_failures}.
\input{content/testing/topologies_and_routing}
@ -9,19 +9,23 @@ In this chapter we define and perform the tests for our evaluation. For this we
\label{testing_measurements}
To evaluate the performance of a network we established a list of criteria in \cref{basics_measuring_performance}. Our measurements should reflect these criteria, which is why we implemented corresponding measurement functions in our test framework as described in \cref{implementation_commands}.
In the following sections we describe the implemented performance tests in detail. An evaluation of the results of the tests described here will be given in \cref{evaluation}.
In the following sections we describe the implemented performance tests in detail, beginning with the bandwidth test in \cref{testing_bandwidth}. We go on to explain the process of our latency measurements in \cref{testing_latency}. A test evaluating the influence of two concurrent data transfers on looped paths is explained in \cref{testing_bandwidth_link_usage}.
To evaluate the distribution of packets on the network while running a data transfer either with TCP or UDP we implement a test described in \cref{testing_packet_flow}.
An evaluation of the results of the tests described here will be given in \cref{evaluation}.
\subsection{Bandwidth}
\label{testing_bandwidth}
The test measuring bandwidth is one of the most basic tests in our testing framework. It uses the "measure\_bandwidth" function described in \textit{measure\_bandwidth} in \cref{implementation_commands}.
As each virtual network that is created in Mininet starts with empty caches on all devices, we start a short \textit{iperf} test on the Mininet network prior to each bandwidth measurement which will cause most network handling like packets sent by the \textit{address resolution protocol} (ARP) to be finished before the actual test starts. This reduces the impact these protocols and mechanisms would otherwise have on the first seconds of the bandwidth measurement.
As each virtual network that is created in Mininet starts with empty caches on all devices, we start a short \textit{iperf} test on the Mininet network prior to each bandwidth measurement, which causes most network handling, such as packets sent by ARP, to be finished before the actual test starts. This reduces the impact these protocols and mechanisms would otherwise have on the first seconds of the bandwidth measurement.
The bandwidth tests are run using \textit{iperf}. Each test is performed for \SI{30}{\second} with a log interval of \SI{0.5}{\second}.
For all of our bandwidth measurements, H1 is used as the \textit{iperf} client. The server is chosen depending on the topology at hand which is H4 for the minimal topology described in \cref{testing_minimum_network} and shown in \cref{fig:testing_4r4h}, H6 for the first network used to evaluate longer failure paths described in \cref{testing_failure_path} and shown in \cref{fig:testing_failure_path_1}, and H8 for the second network used to evaluate longer failure paths described in \cref{testing_failure_path} and shown in \cref{fig:testing_failure_path_2}.
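As an illustration only, a single run on the minimal topology then boils down to commands of the following form on the Mininet hosts; the host variables and the server address are assumed from the topology and naming scheme, not taken from the framework.
\begin{lstlisting}[language=Python, caption={Illustrative bandwidth run on the minimal topology (host names and address assumed)}]
def run_bandwidth_test(h1, h4, server_ip="10.4.0.101"):
    # h1, h4: Mininet host objects; server_ip is assumed from the naming scheme
    h4.cmd("iperf3 -s -D")                     # iperf server on H4
    h1.cmd(f"iperf3 -c {server_ip} -t 2")      # warm-up to fill ARP caches
    # actual measurement: 30 s, one log entry every 0.5 s
    return h1.cmd(f"iperf3 -c {server_ip} -t 30 -i 0.5")
\end{lstlisting}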
\subsection{Latency}
\label{testing_latency}
Latency tests are run using the "measure\_latency" function described in \textit{measure\_latency} in \cref{implementation_commands}. Prior to every measurement a separate \textit{ping} test is run between the hosts used for the measurement. This is done to fill the caches of the routers, e.g. the ARP cache.
After this initialization the test is run for \SI{30}{\second} with a log interval of \SI{0.5}{\second}.
@ -29,18 +33,18 @@ After this initialization the test is run for \SI{30}{\second} with a log interv
All latency tests were run with H1 as the sender. For each topology the receiving device changes, using H4 for the minimal topology described in \cref{testing_minimum_network} and shown in \cref{fig:testing_4r4h}, H6 for the first network used to evaluate longer failure paths described in \cref{testing_failure_path} and shown in \cref{fig:testing_failure_path_1}, and H8 for the second network used to evaluate longer failure paths described in \cref{testing_failure_path} and shown in \cref{fig:testing_failure_path_2}.
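For illustration, such a latency run corresponds to \textit{ping} invocations of the following form; the host variable and target address are assumed, and 60 probes at \SI{0.5}{\second} spacing approximate the \SI{30}{\second} measurement.
\begin{lstlisting}[language=Python, caption={Illustrative latency measurement (host name and address assumed)}]
def run_latency_test(h1, target_ip="10.4.0.101"):
    # h1: Mininet host object used as sender; target_ip is assumed
    h1.cmd(f"ping -c 3 {target_ip}")           # warm-up to fill ARP caches
    # 60 probes, 0.5 s apart, i.e. roughly a 30 s measurement
    return h1.cmd(f"ping -c 60 -i 0.5 {target_ip}")
\end{lstlisting}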
\subsection{Bandwidth link usage}
\label{testing_bandwidth_link_usage}
This test uses two concurrent \textit{iperf} measurements to measure the influence on the distribution of bandwidth between two data flows in the network while the actual routing of the packets is changed due to the introduction of a failure.
It uses the function "measure\_link\_usage\_bandwidth" described in \textit{measure\_link\_usage\_bandwidth} in \cref{implementation_commands}.
Before each test run a separate \textit{iperf} measurement is run to fill caches on the routers, e.g. the ARP cache.
The test is run for \SI{30}{\second} with a log interval of \SI{1}{\second}.
For the first running measurement host H1 is used as \textit{iperf} client in all topologies, while the server changes depending on the topology, similar to the configuration in \cref{testing_bandwidth}
The test is run for \SI{30}{\second} with a log interval of \SI{1}{\second}. For the first measurement, host H1 is used as the \textit{iperf} client in all topologies, while the server changes depending on the topology, similar to the configuration in \cref{testing_bandwidth}.
The second measurement however always uses host H1 as an \textit{iperf} server, shifting the client depending on the topology. The client is always the host attached to the router on the top path nearest to the failure point, which is host H2 in the minimal topology described in \cref{testing_minimum_network}, host H3 in the first failure path network described in \cref{testing_failure_path} and host H4 in the second failure path network, which is also described in \cref{testing_failure_path}.
\subsection{Packet flow}
\label{testing_packet_flow}
The test uses the "measure\_packet\_flow" function described in \textit{measure\_packet\_flow} in \cref{implementation_commands}.
For each topology we choose four routers for which the packet counters should be implemented. This will cover all routes packets could take as some routers are connected in series, forwarding each packet they receive.
@ -55,18 +59,12 @@ The fourth and final point of interest is a router on the alternative path, whic
We execute a basic \textit{iperf} bandwidth measurement prior to our packet flow measurement to fill caches.
All packet flow measurements are run for \SI{30}{\second} with a log interval of \SI{1}{\second}.
All packet flow measurements are run for \SI{30}{\second} with a log interval of \SI{1}{\second}. We add an additional test using the same configuration but setting the flag in the \textit{measure\_packet\_flow} function to "udp".
\section{Failures, FRR and FRMs}
\label{testing_failures}
Each test is run for each topology in two versions, one with an intermediate failure and one with a concurrent failure. Tests using an intermediate failure define two commands for measurement, one before a failure and one after a failure. The test will execute the first measurement, introduce the failure by running the "connection\_shutdown" command described in \textit{connection\_shutdown} in \cref{implementation_commands} and then run the second measurement automatically.
In case of a concurrent failure the measurement is started and the failure is introduced during the run-time of the measurement.
Each of these tests is also run once only using FRR and once using FRR and our implementation of ShortCut.
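Put together, an intermediate-failure test might be declared like the following sketch; the structure mirrors the test configuration described earlier, but all key, command and router names shown here are assumptions.
\begin{lstlisting}[language=Python, caption={Sketch of an intermediate-failure test definition (names assumed)}]
tests = {
    "packet_flow_intermediate": {
        "pre_execution": [("measure_bandwidth", {"length": 5})],
        "execution": [
            # measurement before the failure
            ("measure_packet_flow", {"length": 30}),
            # introduce the failure between two routers
            ("connection_shutdown", {"connection": ["r1", "r2"]}),
            # measurement after the failure
            ("measure_packet_flow", {"length": 30}),
        ],
    },
}
\end{lstlisting}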
\section{Performing tests}
\label{testing_performing}
Tests are performed using the command line interface (CLI) of the test framework described in \cref{command_line_interface}. Each test will plot its results automatically.
@ -3,7 +3,7 @@
We define a set of topologies, which we later use in our tests. This includes a minimal network explained in section \ref{testing_minimum_network} and two failure path networks in section \ref{testing_failure_path}.
We define a set of topologies, which we later use in our tests. This includes a minimal network explained in \cref{testing_minimum_network} and two failure path networks in \cref{testing_failure_path}.
\subsection{Minimum sized network}
\label{testing_minimum_network}
@ -26,14 +26,14 @@ Suppose there are three routers A, B and C. They are connected to each other. If
FRMs and other protocols that focus on the reliability of networks were created to restore the original functionality of the network. Most networks will already use an optimized layout and configuration. As such, a suboptimal route will not be chosen unless deemed necessary. Hence it is unlikely that in a real-life network the alternative route chosen after a failure is faster than the old route. Because of these concerns we decided to add a fourth router to the topology.
In addition to this we added a third and fourth host. This is optional, but when testing the added strain on the network caused by a second data flow passing a loop created by FRR, we want to add the option to use a real data flow instead of a simulated data flow as described in section \ref{measure_link_usage}.
In addition to this we added a third and fourth host. This is optional, but when testing the added strain on the network caused by a second data flow passing a loop created by FRR, we want to add the option to use a real data flow instead of a simulated data flow as described in \cref{measure_link_usage}.
The whole topology can be seen in figure \ref{fig:testing_4r4h}.
The whole topology can be seen in \cref{fig:testing_4r4h}.
\subsubsection{Routing}
We combined this small network with a simple routing. All routers except for router R3 create three subnets: one host subnet and two subnets connected to the other routers. In case of router R3 only two subnets are needed as no host is connected.
The address of a subnet was chosen according to the router number. The subnet for hosts on router R1 is 10.1.0.0 with a subnet mask of 24 or 255.255.255.0, the gateway address is 10.1.0.1.
The address of a subnet was chosen according to the router number. The subnet for hosts connected to e.g. router R1 is 10.1.0.0 with a subnet mask of 24 or 255.255.255.0, the gateway address is 10.1.0.1.
\begin{figure}
\centering
@ -42,7 +42,7 @@ The address of a subnet was chosen according to the router number. The subnet fo
\label{fig:ip_naming_norm}
\end{figure}
The subnets between routers were given IP addresses in a similar fashion as shown in figure \ref{fig:ip_naming_norm} (b). The ip address for the subnet between router R1 and R2 therefore is 10.112.0.0. When choosing addresses we wanted to be able to identify the router and the interface connected to such a subnet only by its ip. As only routers are part of the subnets between routers, and we have a very limited amount of routers, we added the router number, as well as the interface number to the ip address. We always attached all hosts to the first interface and counted fro
The subnets between routers were given IP addresses in a similar fashion as shown in \cref{fig:ip_naming_norm}. The IP address for the subnet between routers R1 and R2 therefore is 10.112.0.0. When choosing addresses we wanted to be able to identify the router and the interface connected to such a subnet only by its IP. As only routers are part of the subnets between routers, and we have a very limited amount of routers, we added the router number, as well as the interface number to the IP address. We always attached all hosts to the first interface and counted fro
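The host-subnet part of this scheme can be expressed compactly; the sketch below only covers the host subnets (10.\textit{n}.0.0/24 with gateway 10.\textit{n}.0.1) and leaves the router-to-router subnets out, as their exact numbering depends on the interface convention.
\begin{lstlisting}[language=Python, caption={Sketch of the host-subnet naming scheme}]
def host_subnet(router_number):
    # Host subnet and gateway for router R<n>: 10.<n>.0.0/24, gateway 10.<n>.0.1
    return f"10.{router_number}.0.0/24", f"10.{router_number}.0.1"

# host_subnet(1) -> ("10.1.0.0/24", "10.1.0.1")
\end{lstlisting}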
\subsection{Failure path topologies}
@ -64,5 +64,5 @@ FRMs advantages include the removal of longer paths. In theory, these should red
\end{figure}
\subsubsection{Routing}
The routing is an extension of the routing in the minimal network. The naming convention shown in figure \ref{fig:ip_naming_norm} was used in both longer failure paths networks as well.
The routing is an extension of the routing in the minimal network. The naming convention shown in \cref{fig:ip_naming_norm} was used in both longer failure paths networks as well.
@ -158,15 +158,13 @@
% List of figures
\listoffigures
\addcontentsline{toc}{chapter}{Table of figures}
\listoftables
% List of tables
\addcontentsline{toc}{chapter}{List of tables}
\listof{algorithm}{List of algorithms}
% List of algorithms
\addcontentsline{toc}{chapter}{List of algorithms}
% List of source code
\listof{lstlisting}{List of source code}
\addcontentsline{toc}{chapter}{List of source code}
%Quelltextverzeichnis..
\printbibliography
@ -176,24 +174,21 @@
% ##########################################################################
\appendix
\input{content/end/appendix1}
\cleardoublepage
% #############################
% Aufgabenstellung
% #############################
\markboth{}{AUFGABENSTELLUNG}
\addcontentsline{toc}{chapter}{Aufgabenstellung}
\includepdf{documents/Aufgabenstellung}
\cleardoublepage
%\markboth{}{AUFGABENSTELLUNG}
%\addcontentsline{toc}{chapter}{Task}
%\includepdf{documents/Aufgabenstellung}
%\cleardoublepage
% #############################
% Erklärung
% #############################
\markboth{}{ERKLÄRUNG}
\addcontentsline{toc}{chapter}{Erklärung}
\addcontentsline{toc}{chapter}{Affidavit}
\includepdf{documents/Eidesstattliche_Versicherung}
\cleardoublepage
% ##########################################################################
% THAT'S IT!