Improving Resilience When We Don’t Have Steady State

Nate Evans and Mike Thompson from the COAR team recently spent some time at the African Institute for Mathematical Sciences (AIMS) teaching computer networking and cybersecurity to a cohort of 40 students. During our time at AIMS, we encountered many situations that made us think about how our research in resilience applies in a place like Senegal, where we can’t make any assumptions about steady state.

Senegal is a country on the westernmost tip of Africa. It has an extremely small footprint for math and science education and has endemic problems with infrastructure reliability, particularly with the power grid. While we were teaching, the power would frequently go out, sometimes two or three times during a single class period. The outages sometimes affected the whole school and sometimes just an individual building. Though most of the school’s important equipment is on some type of uninterruptible power supply (UPS), it turns out the batteries in these devices don’t last long under such conditions.

In addition to the power issues, AIMS-Senegal suffers from a lack of quality Internet access. When we arrived, their Internet was essentially a collection of consumer-grade DSL connections, one for each building, with a largely integrated wireless setup.

This network layout suffered from several issues:

  • If any one line has problems, a whole building or area of the campus loses access.

  • If there is a lot of activity on one network, free bandwidth on the other, unused networks is wasted.

  • Machines on different networks cannot talk to each other.

We decided to make some changes to the network and see if we could improve the resilience of their connection with only the tools at hand.  Our approach was to take several of the DSL lines and aggregate them together at one point, using the server to multiplex connections.  This, it turns out, is not a trivial task.

First, we explored interface bonding, which offers several options for failover and load balancing. We were able to configure bonding on the two DHCP interfaces, but it was not really designed for what we were doing.

Interface bonding relies on various strategies to demultiplex traffic down different paths based on traffic characteristics and resilience features. However, it was really designed for a scenario where we have a shared backplane with better speed than any individual port. We are then able to use several ports on a switch as a single network connection, essentially doubling or tripling our speeds (and so on).

This, unfortunately, is not our situation at AIMS-Senegal. While it’s true that our DSL modems all come from the same provider (Orange Telecom, more on that later) and are more than likely on the same switch somewhere on the backend, that doesn’t actually allow us to use them this way, because they have different IP addresses, logical gateways, and so on.

Our solution? We configured a Shorewall firewall on the Linux server to demultiplex connections from the AIMS network across our multiple uplinks in a round-robin fashion through the server.
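To make the round-robin idea concrete, here is a minimal Python sketch of the selection logic. It is not our Shorewall configuration, only a conceptual model: the class name, the uplink names (dsl0, dsl1, dsl2), and the connection ids are all hypothetical. It assumes each new connection is pinned to one uplink for its lifetime and that a failed uplink is simply skipped.

```python
class RoundRobinBalancer:
    """Pin each new connection to the next healthy uplink, in rotation."""

    def __init__(self, uplinks):
        self.uplinks = list(uplinks)
        self.healthy = set(self.uplinks)
        self._next = 0          # index of the uplink to try next
        self.assigned = {}      # connection id -> uplink it is pinned to

    def mark_down(self, uplink):
        """Record a failed uplink and forget connections pinned to it."""
        self.healthy.discard(uplink)
        self.assigned = {c: u for c, u in self.assigned.items() if u != uplink}

    def mark_up(self, uplink):
        """Return a known uplink to the rotation."""
        if uplink in self.uplinks:
            self.healthy.add(uplink)

    def route(self, conn_id):
        """Return the uplink for a connection, assigning one if needed."""
        if conn_id in self.assigned:
            return self.assigned[conn_id]
        if not self.healthy:
            raise RuntimeError("no healthy uplinks")
        # Walk the rotation at most once; at least one uplink is healthy.
        for _ in range(len(self.uplinks)):
            uplink = self.uplinks[self._next]
            self._next = (self._next + 1) % len(self.uplinks)
            if uplink in self.healthy:
                self.assigned[conn_id] = uplink
                return uplink
        raise RuntimeError("no healthy uplinks")


balancer = RoundRobinBalancer(["dsl0", "dsl1", "dsl2"])
for i in range(4):
    print(f"conn{i}", "->", balancer.route(f"conn{i}"))
```

Pinning matters because each DSL line has its own address and gateway: replies for a connection have to come back over the same line the request left on.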

Though this doesn’t actually improve our speed, it feels like it does (especially when we have 40+ students using the computer lab at the same time) because load is distributed over our various upstream connections.
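A rough back-of-the-envelope calculation shows why. The sketch below assumes an idealized fair split of each line among the clients pinned to it, and the 2 Mbps line rate is a made-up figure for illustration, not a measurement from AIMS.

```python
def per_client_share(line_mbps, n_lines, n_clients):
    """Approximate bandwidth per active client, in Mbps.

    Each connection is pinned to one line, so a single client can never
    exceed one line's rate, no matter how many lines exist.
    """
    # Ceiling division: the busiest line carries this many clients.
    clients_per_line = max(1, -(-n_clients // n_lines))
    return min(line_mbps, line_mbps / clients_per_line)


LINE_MBPS = 2.0  # hypothetical DSL rate, for illustration only

print(per_client_share(LINE_MBPS, n_lines=1, n_clients=1))   # lone user, one line
print(per_client_share(LINE_MBPS, n_lines=3, n_clients=1))   # lone user, three lines
print(per_client_share(LINE_MBPS, n_lines=1, n_clients=40))  # full lab, one line
print(per_client_share(LINE_MBPS, n_lines=3, n_clients=40))  # full lab, three lines
```

Under these assumptions a lone user sees no change from adding lines, while each student in a full lab gets roughly three times the share with three lines.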

This solves some of our problems. We can now share connections on the server and for the lab and classrooms; this is especially helpful for the lab when it is under heavy activity. We can now run a proxy server to help cache packages and frequently accessed sites (as well as filter content). We have connection sharing and failover for failures of individual lines and power connections through parts of the campus. We have enough flexibility that we can still keep separate networks for the staff and the library. We protect ourselves from certain failures:

  • Of power in individual buildings
  • Of lines running between buildings
  • Of configuration or switching at the upstream provider

However, we introduced some new problems and definitely left some old problems unsolved. The server is now a more significant single point of failure. The complexity of the mux/demux can lead to problems with some TCP services that rely heavily on middleboxes (middleboxes tend to mangle parts of the IP header, which can cause the loss of the tagging the firewall uses to know which connection to send traffic back to). The complexity of the configuration can make long-term maintenance and upgrades more difficult. All of our uplinks are through Orange Telecom, so though we are protected from certain failures in their switching infrastructure, we’re still dependent on their overall infrastructure staying up. We still have no easy way to share resources between other parts of the campus, such as the library or the residences.

It’s not a magic bullet, but it begins to improve the situation and, we hope, gives some more ideas to the IT folks at AIMS-Senegal about how to continue to tackle such problems.

How do we merge this with our own research and try to let our experience inform our future strategies on how we look at resilience, particularly when there is no steady-state?  The AIMS-Senegal network situation actually closely resembles the situation we find ourselves in with regard to cybersecurity defense.  We always have to assume the next failure is coming, that we don’t know when it will happen or how long it will last.  This situation has largely informed our approach to proactive defenses, particularly our Moving Target Defense technologies.  In fact, our patent-pending Stream Splitting Moving Target Defense (SS-MTD) implements a type of multipath TCP at the application layer in a configuration that looks vaguely similar to our new network configuration at AIMS-Senegal.

The key here is getting back to resilience. Or, thought about another way, how do we react to failure? In the resilience world we often talk about bathtub curves, where a service moves from a steady state into a failed state and then gradually (depending on the needs and requirements of the service in question) returns to its original steady state. We then make various measurements and try to improve the resilience of a service by reducing the amount of degradation or by reducing the time window in which the service is degraded.
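As an illustration of the kinds of measurements involved, the sketch below computes three simple quantities from a sampled bathtub curve: how deep the service dropped, how long it stayed degraded, and the total service lost (the area of the "tub"). The function name, the metric names, and the sample values are assumptions chosen for the example, not a standard definition.

```python
def resilience_metrics(samples, steady=1.0, dt=1.0):
    """Summarize a bathtub curve.

    samples: service level at evenly spaced times (1.0 = steady state)
    steady:  the steady-state service level
    dt:      time between samples
    """
    drops = [max(0.0, steady - s) for s in samples]
    return {
        "max_degradation": max(drops, default=0.0),           # depth of the tub
        "degraded_time": sum(dt for d in drops if d > 0.0),   # width of the tub
        "lost_service": sum(drops) * dt,                      # area of the tub
    }


# Hypothetical curve: steady, sharp failure, gradual recovery.
curve = [1.0, 1.0, 0.25, 0.25, 0.5, 0.75, 1.0]
print(resilience_metrics(curve))
```

Improving resilience in the usual sense means shrinking one or more of these numbers: a shallower drop, a shorter outage, or a smaller total loss.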

Without a traditional steady state, it’s more difficult to talk about resilience in our usual terminology. Improving the diversity of infrastructure providers should dramatically increase possible resilience. Currently at AIMS, their only real option for Internet is Orange Telecom, but that could change with additional infrastructure deployment in Africa. Adding diversity to connection types in addition to providers would be even better (cable, cellular, satellite, etc.). Efforts to deploy non-traditional Internet connections in the region by companies like Facebook and Google will hopefully help to make this easier. Though Internet resilience is a lower priority than other infrastructure, like power, it is still mission critical to an educational institution like AIMS.

Utilizing strategies from moving target resilience work allows us to view failures as a normal operating condition and react accordingly. Overlapping bathtub curves can still lead to 100% uptime.  This type of thinking and approach to problem solving hopefully can allow AIMS to contribute to a model for resilience without steady state assumptions as a goal for Senegal and other areas in Africa and abroad that deal with similar problems.
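The point about overlapping curves can be shown with a small sketch. It assumes the service counts as up whenever at least one uplink is up, and the timelines are invented for illustration. It also assumes the outages do not coincide, which is exactly what a shared provider puts at risk.

```python
def uptime(timeline):
    """Fraction of time steps in which a single link was up."""
    return sum(timeline) / len(timeline)


def combined_uptime(timelines):
    """Fraction of time steps in which at least one link was up.

    timelines: equal-length lists of booleans (True = link up).
    """
    steps = list(zip(*timelines))
    if not steps:
        raise ValueError("need at least one non-empty timeline")
    return sum(1 for step in steps if any(step)) / len(steps)


# Hypothetical outage patterns: each link fails, but never at the same time.
link_a = [True, True, False, False, True, True]
link_b = [True, True, True, True, False, True]

print(uptime(link_a), uptime(link_b))     # each link is below 100%
print(combined_uptime([link_a, link_b]))  # together they cover every step

# A shared failure (for example, at the common provider) breaks the overlap.
link_c = [True, True, False, True, True, True]
print(combined_uptime([link_a, link_c]))  # below 100% again
```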

This post was written by: Mike Thompson