Designing Experiments to Avoid Internet Measurement Pitfalls
Ana Custura, University of Aberdeen
IRTF MAPRG, IETF-116

The Internet is heterogeneous, spanning many different types of networks. Wide-scale measurements (e.g. to a large number of targets or from a large number of vantage points) are very useful for understanding it and for guiding protocol standardisation. Using IPv6 Extension Headers as an example, we will discuss lessons learned from building wide-scale active Internet measurements and identify mistakes that have at some point or another ruined our measurement campaigns. I will give concrete examples from experiments measuring IPv6 Hop-by-Hop and Destination Options Extension Headers over a variety of paths during the past 8 years, to help others avoid the same pitfalls in the future.

The University of Aberdeen does Internet measurements to support protocol standardisation. This work has been presented in various IETF groups over the years and has helped shape standards, particularly around transport.

Measurements are super useful for protocol standardisation. They can tell those of us who work with standards whether a proposed standard has any barriers to deployment, or whether an existing standard is actually used. They also help us find various bugs that might exist in standards, and if you find a bug, that's the first step towards fixing it, right?

Measurements are useful, but in order to perform them in a useful way, you have to target a lot of the Internet. The Internet is absolutely huge, and that's the challenge here. It comprises billions of parts and there is lots of diversity within it: mobile networks, satellite networks, etc. To be effective, measurements need to target as many diverse parts as possible.

First you have to design a measurement campaign, and you have choices. You can choose to generate packets, throw them at the Internet and see what comes back; that's an active measurement. Or you can choose to observe traffic that already exists; that is a passive measurement. You can then instrument your test so that you control one endpoint or both endpoints of your measurement, or maybe you're doing something from an in-network device. Next, you have to consider which metric to measure and at which level of aggregation. You can measure performance metrics, or something like a functional metric, such as connectivity. You can choose all of these and design a great measurement campaign. But there are, of course, some pitfalls, and that is what this talk is about.

As an example, I'm going to use IPv6 Extension Headers. The measurements I'm going to talk about are mostly active measurements of functional connectivity.

Here is a brief overview of IPv6 Extension Headers. IPv6 was designed from the start to be extensible, and Extension Headers are what should enable new functionality. They had a bit of a rocky start: some router architectures do not necessarily support processing packets with some Extension Headers in hardware. For this historical reason, some networks drop packets with Extension Headers. Across the years, many different groups within the IETF have tried to measure Extension Headers. Extension Header measurement is pretty hard: if you look at the table, you can see what appear to be conflicting results. However, these results are not really conflicting; they essentially work together to tell a story.
This is a measurement where you are very likely to mess up in one way or another, because the brokenness can exist in many different places:
* Some devices may not support Extension Headers to begin with.
* Some may not have the capacity to look deep enough into a packet to parse what is in or after the Extension Header.
* Some need to access upper-layer protocol information and cannot get to it, because an Extension Header is in the way.
* And finally, either by configuration or by misconfiguration, a network may deliberately filter packets like these.

How do you actually measure Extension Headers in a way that makes sense? Because they are not very widely used, you have to generate traffic with Extension Headers at a vantage point. You send it across the Internet, it gets to the destination, and maybe you get some feedback from your destination that your packet has actually arrived. That is how you complete an end-to-end test (a minimal sketch of such a test is shown at the end of this section). You can measure any property end to end and work out whether it actually works or not. However, what an end-to-end test does not tell you is where the problem occurred if something has gone wrong. For example, if a packet was dropped, the test won't tell you where it was dropped: for that you need to do path measurements.

Here are some examples of where you can mess up your measurement. There are three categories:
* where we measure from: the vantage points,
* what we measure to: the destinations,
* how we measure: the methodology.

Vantage points. The pitfalls around vantage points relate to a lack of diversity. If you are going to measure from a cloud provider, you had better make sure that your cloud provider is transparent to what you want to measure. I can't remember the number of times I've started to measure something from a cloud provider such as Digital Ocean, only to find out that the provider was messing with the protocol I was trying to test and all of my results were garbage. If you can, try multiple cloud providers and mix in active measurement platforms, which will give you connectivity to thousands and thousands of different vantage points. These also have the advantage that some of the vantage points will be in niche networks. This will help you avoid the sampling bias pitfall.

The examples in this table show the percentage of servers that reply to a packet sent to them with an Extension Header. The set of destinations for each of these measurements was not changed; the only thing that differs is the vantage point I ran the measurement from. As you can see, when I ran my measurements from Digital Ocean and Linode, I received literally no answers from my servers: these two providers drop packets with Extension Headers, or the packets get lost in transit upstream of the cloud provider. These are still good, valid measurement points; it's just that if you want to run a wide-scale measurement campaign to lots of destinations, this is not the place to do it.

There is a division between the core and the edge of the Internet. In general, the core of the Internet is a lot more transparent to the traversal of different protocols. That is because there are fewer devices in the core that are likely to mess with your packets, so fewer middleboxes. Also, the more specialised the network, e.g. mobile and satellite, the more weirdness you are likely to get.
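As a rough illustration of the end-to-end test described above, here is a minimal sketch in Python using Scapy. It is not the exact tooling used in these campaigns; the target address, port, padding contents and timeout are placeholders, and it needs raw-socket (root) privileges. It sends a TCP SYN twice, once without and once with a Destination Options Extension Header, and reports whether a SYN/ACK came back in each case.

```python
# Minimal sketch of an end-to-end Extension Header test (not the exact
# tooling used in the talk). Sends a TCP SYN to a target, once without and
# once with an 8-byte Destination Options Extension Header carrying only
# padding, and reports whether a SYN/ACK comes back. Needs root privileges.
from scapy.all import IPv6, TCP, IPv6ExtHdrDestOpt, PadN, sr1

TARGET = "2001:db8::1"   # placeholder destination address
PORT = 80                # placeholder destination port

def syn_answered(insert_eh: bool) -> bool:
    ip = IPv6(dst=TARGET)
    if insert_eh:
        # 2 bytes of header plus a 6-byte PadN option = one 8-byte EH
        ip = ip / IPv6ExtHdrDestOpt(options=[PadN(optdata=b"\x00" * 4)])
    probe = ip / TCP(sport=12345, dport=PORT, flags="S")
    reply = sr1(probe, timeout=3, verbose=0)
    if reply is None or not reply.haslayer(TCP):
        return False
    return (int(reply[TCP].flags) & 0x12) == 0x12   # SYN and ACK both set

if __name__ == "__main__":
    print("baseline SYN answered:        ", syn_answered(insert_eh=False))
    print("SYN with DestOpt EH answered: ", syn_answered(insert_eh=True))
```

If the baseline probe is answered but the probe with the Extension Header is not, the end-to-end test tells you that something on the path or at the destination dropped it, but, as noted above, not where.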
Back to the point about the core and the edge: it's always good to have a mix of both, and it leads you to think about the devices that your packets travel over.

Destinations. The story here is more or less the same: you have to choose diverse destinations if you want to understand how things work, because the results may look different for different types of servers. One thing that comes up again and again is the top 1 million domain lists. Such a list has 1 million domains in it. You can resolve the list and then filter it so that you only keep unique IP addresses, and then run your measurements to those unique addresses (a sketch of this step follows below). This gives you web, mail and DNS server targets that you can use. The problem with this kind of list is that it's not diverse.

Apart from top lists, you can also use crowdsourcing for your measurements. This is great because you end up with a lot more clients: a client starts the measurement, contacts your server or your vantage point, and then you run the measurement back to it. Crowdsourced measurements are great, but they are harder to reproduce. That's something to keep in mind, because it can affect your ability to reproduce your own results, and the same goes for other people who might want to reproduce them.

The top 1 million domain lists are not very diverse. This table has two columns, and the two columns contain the exact same data, except it is aggregated per host in the first column and per AS in the second one. For years I had been presenting the per-host results, but they would have been much more convincing if I had shown the per-AS split, because many different hosts are concentrated in a few networks that drop Extension Headers. The general point is that, for all the measurements you do, you have to provide the split (a small aggregation sketch also follows below). This data is from RFC 7872 and from some crowdsourced measurements from APNIC. These are all essentially examples to show that if you choose different types of destinations, you can understand the picture a bit better. If you think about it, some of the servers that you target might be behind specialised infrastructure. Web servers might be behind a CDN, a proxy, a load balancer and so on. If you are measuring towards clients in edge networks, then you might have edge-network-specific middleboxes in the way, and so on. The infrastructure may look different for different server types; that is a key point.

Methodology. To avoid pitfalls, it is always good to combine measurement approaches. You can combine passive and active measurements, and that will help you understand things better. It is always useful to compare your methodology and results to what others have done. I would recommend to any researcher to first try to reproduce results that already exist, because this tends to get out of the way all of those silly issues that you might find when you set up your experiment. And open-source your data, because this allows other people to validate it and to build upon it. You should also always measure multiple upper-layer protocols, because sometimes the choice of upper-layer protocol will influence traversal at the network level. And do not necessarily assume you understand how the Internet works: there is lots of load balancing and there are lots of weird middleboxes that do weird things.
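Here is a rough sketch of the target-list step mentioned above: resolving names from a top-list CSV and keeping only unique addresses. The file name, the "rank,domain" layout and the cap on the number of names resolved are assumptions for illustration.

```python
# Sketch of building a target list from a top-1M domain list: resolve each
# name and keep only unique IPv6 addresses. File name, "rank,domain" layout
# and the resolution cap are placeholders.
import csv
import socket

def unique_targets(csv_path: str, limit: int = 1000) -> set[str]:
    targets: set[str] = set()
    with open(csv_path, newline="") as fh:
        for i, row in enumerate(csv.reader(fh)):
            if i >= limit:
                break
            domain = row[-1]     # assumes a rank,domain layout
            try:
                infos = socket.getaddrinfo(domain, None, socket.AF_INET6)
            except socket.gaierror:
                continue         # unresolvable name, skip it
            targets.update(info[4][0] for info in infos)
    return targets

if __name__ == "__main__":
    addrs = unique_targets("top-1m.csv")   # hypothetical local file
    print(f"{len(addrs)} unique IPv6 addresses to probe")
```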
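And a sketch of reporting the same results at two levels of aggregation, per host and per AS, as discussed above. The `ip_to_asn` lookup is a placeholder for whatever IP-to-AS mapping you use, and counting an AS as "answering" if any of its hosts replied is just one possible choice.

```python
# Sketch of presenting the same results per host and per AS. `results`
# maps an address to whether a probe with an Extension Header was answered;
# ip_to_asn() is a placeholder for a real IP-to-AS lookup.
from collections import defaultdict

def ip_to_asn(addr: str) -> int:
    raise NotImplementedError("plug in an IP-to-AS lookup here")

def per_host_rate(results: dict[str, bool]) -> float:
    # fraction of individual hosts that answered
    return sum(results.values()) / len(results)

def per_as_rate(results: dict[str, bool]) -> float:
    # fraction of ASes in which at least one host answered
    by_as: dict[int, list[bool]] = defaultdict(list)
    for addr, answered in results.items():
        by_as[ip_to_asn(addr)].append(answered)
    return sum(any(v) for v in by_as.values()) / len(by_as)
```

If many of the answering (or non-answering) hosts sit in the same few networks, the two numbers can look very different, which is the point of showing the split.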
The final two examples are around protocol differences. This is just a slide to show that I have measured loads of devices in edge networks, and it looked like there was a split between UDP and TCP. This is an opportunity to understand why the DSCP measurements were different: DSCP traversal is different for TCP and for UDP, because lots of edge devices mess with the DSCP. You can then start thinking about whether there is a link between those devices and traversal.

I saved the best example for last: load balancing. It exists in the Internet and you can measure it. The main tool you can measure it with is Paris Traceroute, and I'm sure many of you are already familiar with this tool. It attempts to find load balancers between a source and a destination by running multiple traceroutes and varying protocol header fields between measurements. I measured between a source and a destination and found four different paths, so at least two load balancers in this network, and that's great. But then I repeated the measurement, except that instead of sending regular packets without Extension Headers, I sent packets with a Destination Options Extension Header. The load balancing is lost and I no longer detect four paths. That is probably because the load balancer in this particular example was using byte offsets to place packets onto different paths, and that entropy is lost because of the Extension Header (a small sketch of this kind of probing follows below).

To sum up: use multiple approaches for the same measurement, use multiple vantage points and destinations, and never expect that anything on the Internet really works the way you think it should.
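Finally, a sketch of the idea behind the load-balancing example above. This is not Paris Traceroute itself; it simply sends probes with a fixed hop limit while varying the source port (the flow identifier), records which routers answer with ICMPv6 Time Exceeded, and then repeats the same thing with a Destination Options header added. The destination, hop limit and port range are placeholders, and it needs root privileges.

```python
# Sketch of flow-varying probing to expose per-flow load balancing, with
# and without a Destination Options Extension Header. Not Paris Traceroute
# itself; destination, hop limit and port range are placeholders.
from scapy.all import (IPv6, UDP, IPv6ExtHdrDestOpt, PadN,
                       ICMPv6TimeExceeded, sr1)

TARGET = "2001:db8::1"   # placeholder destination
HOP_LIMIT = 8            # hop at which a load balancer is suspected

def next_hops(insert_eh: bool, flows: int = 16) -> set[str]:
    seen: set[str] = set()
    for sport in range(33000, 33000 + flows):   # vary the flow identifier
        ip = IPv6(dst=TARGET, hlim=HOP_LIMIT)
        if insert_eh:
            ip = ip / IPv6ExtHdrDestOpt(options=[PadN(optdata=b"\x00" * 4)])
        probe = ip / UDP(sport=sport, dport=33434)
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is not None and reply.haslayer(ICMPv6TimeExceeded):
            seen.add(reply[IPv6].src)   # router that sent Time Exceeded
    return seen

if __name__ == "__main__":
    print("routers seen without EH:", next_hops(insert_eh=False))
    print("routers seen with EH:   ", next_hops(insert_eh=True))
```

If the first set contains several routers and the second collapses to one (or to nothing), that is consistent with a load balancer that can no longer find the fields it hashes on once an Extension Header is in the way.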