Notebook

Introducing Flows¶

In this notebook, we'll use the QoF flow meter, Python ipfix module, and Pandas to explore the characteristics of flow data.

Flow data analysis is in general somewhat difficult to experiment with, as it involves passive observation of network traffic data, which carries with it network end-user privacy risk and may additionally come with stringent regulatory and legal requirements. For this tutorial, we'll be using publicly available WIDE MAWI traces, which, being collected from a transpacific backbone link, are not really representative of the types of traffic in access, enterprise, or academic networks; and, having their IP addresses anonymized without preserving structure, are not really useful for looking at the structure of the networks on either side of the backbone. We'll put up with these inconveniences for the sake of having real data to play with, though.

If you have access to your own network traces, you can run these through QoF:

qof --yaml qof-config.yaml --in my-trace-file.gz --out my-ipfix-file.ipfix

then point this notebook at the resulting IPFIX file. (Note, of course, that commentary in this notebook is based on the set of MAWI trace data used in the course, and will probably not match the results in your own data.)

The QoF command we used to create the trace used in this notebook is shown below:

[brian@magpie ~]$ qof --verbose --yaml qof-simple-uniflow.yaml --in mawi-0330-30min.pcap.gz \
                    | gzip > mawi-0330-30min-uniflow.ipfix.gz
[2014-06-23 15:57:45] qof 0.9.0 ("Albula") starting
[2014-06-23 15:59:09] Processed 66397621 packets into 7634742 flows:
[2014-06-23 15:59:09]   Mean flow rate 94245.55/s.
[2014-06-23 15:59:09]   Mean packet rate 819632.16/s.
[2014-06-23 15:59:09]   Virtual bandwidth 5044.6538 Mbps.
[2014-06-23 15:59:09]   Maximum flow table size 159240.
[2014-06-23 15:59:09]   579 flush events.
[2014-06-23 15:59:09]   4453490 asymmetric/unidirectional flows detected (58.33%)
[2014-06-23 15:59:09] Assembled 33813 fragments into 16810 packets:
[2014-06-23 15:59:09]   Expired 26 incomplete fragmented packets. (0.00%)
[2014-06-23 15:59:09]   Maximum fragment table size 23.
[2014-06-23 15:59:09] Rejected 65071 packets during decode: (0.10%)
[2014-06-23 15:59:09]   65071 due to incomplete headers: (0.10%)
[2014-06-23 15:59:09]     52931 incomplete IPv6 extension headers. (0.08%)
[2014-06-23 15:59:09]     12140 incomplete transport headers. (0.02%)
[2014-06-23 15:59:09]     (Use a larger snaplen to reduce incomplete headers.)
[2014-06-23 15:59:09] qof terminating

This notebook uses the Pandas data analysis framework to explore a collection of flow data. So first, run the following code to set up the environment:

In [ ]:

import ipfix
import panfix
import gzip
import bz2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

plt.rcParams['figure.figsize'] = (12, 6)

ipfix.ie.use_iana_default()
ipfix.types.use_integer_ipv4()

Now we can import some flows into a dataframe. This might take a while.

In [ ]:

# Set the name of the IPFIX file to work on here.
ipfix_filename = "../../mawi-0330-30min-uniflow.ipfix.gz"
# change to gzip.open, bz2.open or open, as appropriate
ipfix_file_fn = gzip.open
# Change to None for no limit, or set a limit to reduce memory requirements
ipfix_max_flows = 1000000
# Skip the first N flows when limiting (prevents short flow bias)
ipfix_skip_flows = 2000000

df = panfix.dataframe_from_ipfix(ipfix_filename, (
                                 "flowStartMilliseconds",  "flowEndMilliseconds", 
                                 "sourceIPv4Address",      "sourceTransportPort",
                                 "destinationIPv4Address", "destinationTransportPort", 
                                 "protocolIdentifier",     "flowEndReason",
                                 "octetDeltaCount",        "packetDeltaCount"), 
                                 count=ipfix_max_flows, skip=ipfix_skip_flows,
                                 open_fn=ipfix_file_fn)
print("Loaded "+str(len(df))+" flows.")

Let's fix up the dataframe a bit:

In [ ]:

df = panfix.coerce_timestamps(df) # fixup timestamp datatypes
df = panfix.derive_duration(df)   # calculate duration of each flow 
df["flowDeltaCount"] = 1          # one flow per flow (useful for aggregating by flows)

Let's have a look at a time series of flow counts:

In [ ]:

flows_ts = df["flowDeltaCount"].copy()
flows_ts.index = df["flowEndMilliseconds"]
flows_ts.resample(how="sum", rule="1s").plot()

To explain what's going on here, we first have understand two fundamental configuration parameters of flow meters that are important to intepreting results. First is the passive timeout (or idle timeout). Any flow for which no packets are seen during this interval will be expired and exported. Second is the active timeout which can also be thought of as the maximum duration of the flow record. The passive timeout ensures that flows without a "natural" end (i.e., all non-TCP flows) are eventually flushed and exported. The active timeout sets a maximum delay between the time the first packet of the flow is seen, and the time the first record representing packets in that flow are exported.

Exported flows are ordered neither by start nor end time. Naturally terminated (i.e., through TCP FIN or RST) and actively timed-out flows are available for export immediately after the last packet, while passively timed-out flows will have export times up to one passive timeout in the past.

QoF's passive timeout is set to 30 seconds by default, and its active timeout to 300 seconds.

Given this knowledge, we can see three distinct intervals in the series of flow end times.

We see only passively timed-out flows for the first passive timeout interval (the 30 seconds between about 04:08:40 and 04:09:10).
The capture starts at (export time) 04:09:10. From here we see the steady state: flows terminated for all reasons. Here we see a flow export rate between 3900 and 4900 flows per second.
The capture ends at (export time) 04:13:10. For the last passive timeout interval, no passively timed out flows can be seen, only those timed out actively or naturally.

These intervals are also reflected in the time series of start times:

In [ ]:

flows_ts.index = df["flowStartMilliseconds"]
flows_ts.resample(how="sum", rule="1s").plot()

Layer 4 protocol analysis¶

Now, let's have a look at the protocols in use in the trace file. To do this, we'll group by protocols and sum flow counts:

In [ ]:

df.groupby("protocolIdentifier")['flowDeltaCount'].aggregate("sum").plot(kind="bar")

This is probably easier to read if we get names for the protocols for the system protocols database. First, we need to load the protocol names into a dataframe.

In [ ]:

proto_df = pd.read_table("../data/protocols", header=None, index_col=1, usecols=(0,1), 
                         names=["protocolName", "protocolNumber"])["protocolName"]

Now we can create a new data frame containing the aggregate (which will be indexed by protocol identifier), join it to the protocol numbers table, reindex the data frame by protocol name, then plot with names:

In [ ]:

proto_flows = pd.DataFrame(df.groupby("protocolIdentifier")['flowDeltaCount'].aggregate("sum"))
proto_flows = proto_flows.join(proto_df)
proto_flows.index = proto_flows["protocolName"]
proto_flows["flowDeltaCount"].plot(kind="bar")

Here we see TCP, ICMP, and UDP dominating flow counts, with some ipv6 in ipv4 encapsulation, and negligble amounts of ipcencap (ipv4 in ipv4 encapsulation), GRE (tunneling), ESP (IPsec), and PIM (multicast encapsulation). Now let's look at the same breakdown by bytes:

In [ ]:

proto_bytes = pd.DataFrame(df.groupby("protocolIdentifier")['octetDeltaCount'].aggregate("sum"))
proto_bytes = proto_bytes.join(proto_df)
proto_bytes.index = proto_bytes["protocolName"]
proto_bytes["octetDeltaCount"].plot(kind="bar")

Here we see a very different picture: by bytes, almost all the traffic is TCP. We can also calculate this proportion numerically, by summing the number of bytes in TCP flows and dividing by the number of bytes in all flows:

In [ ]:

sum(df[df["protocolIdentifier"] == 6]["octetDeltaCount"]) / sum(df["octetDeltaCount"])

Counting Ports¶

Let's have a look at port numbers by transport protocol. First, let's select UDP and TCP flows separately.

In [ ]:

udp_df = df[df["protocolIdentifier"] == 17]
tcp_df = df[df["protocolIdentifier"] == 6]

Since we're looking at uniflows, and are interested in services that actually responded, we'll look at source addresses. We'll use the value_counts() shortcut for counting top N ports by flows count. First for UDP:

In [ ]:

udp_df["sourceTransportPort"].value_counts()[:10].plot(kind="bar")

Hm. Here we see DNS (53) and NTP (123), along with quite a lot of flows on a high port (58534). We can dig into those to try and figure out what's going on there:

In [ ]:

udp_58534_df = udp_df[udp_df["sourceTransportPort"] == 58534]
udp_58534_df["sourceIPv4Address"].value_counts()

Almost all of these flows come from a single source, which probably indicates scanning. Let's turn our attention to TCP now:

In [ ]:

tcp_df["sourceTransportPort"].value_counts()[:10].plot(kind="bar")

Here, Port 80 (HTTP) and 443 (HTTP over TLS) dominate, as expected. There's a little Port 22 (SSH) as well.

Elephants and Mice¶

We can have a more in-depth view of the relationship of flow rates and durations by plotting these on two-dimensional histogram. Here we'll have duration on the X axis and data rate (in nominal bits per second) on the Y axis. The weight of each bin will show the count of flows, the count of packets in flows, or the count of bytes in flows, falling into that bin. First execute the following to define the function we'll use here:

In [ ]:

def plot_rate_duration_uniflow(df, by, filename=None):
    plt.figure(figsize=(9,7))
    plt.hexbin(x = df["durationSeconds"],
           y = (df["octetDeltaCount"] * 8) / (df["durationSeconds"] + 0.001), 
           C = df[by],
           reduce_C_function = np.sum,
           yscale='log',
           bins='log',
           cmap = plt.cm.binary)
    cb = plt.colorbar()
    cb.set_label("log10("+by+")")
    plt.xlabel("duration (s)")
    plt.ylabel("data rate (bps)")
    if filename:
        plt.savefig(filename)

Now let's look at the shape of all flows by flow count:

In [ ]:

plot_rate_duration_uniflow(df, by="flowDeltaCount")

The flow counts are dominated by very short flows with rates between about 10kbps and 10Mbps. There are a few long-duration low rate flows, and even fewer long duration high rate flows. Given a flow, there is a very high chance it is very short (a "mouse").

In [ ]:

plot_rate_duration_uniflow(df, by="packetDeltaCount")

Looking at packet counts tells a different story: here, the packet counts are dominated by longer flows in the 1Mbps - 10Mbps range ("elephants"), many of which are maximum duration (recalling that the QoF active timeout is 300 seconds; flows longer than that will be represented by multiple records).

In [ ]:

plot_rate_duration_uniflow(df, by="octetDeltaCount")

Viewed in terms of octets, the reign of the elephants is even more pronounced.

We can also look at this in terms of the size of the packets. Bulk transfer applications tend to have larger packets (closer to MTU, usually around 1500 bytes as set by Ethernet), while machine-to-machine applications use smaller packets, and various constant bit-rate protocols (usually media). We don't have per-packet size information in flow data, but we can approximate it by dividing the number of bytes in the flow by the number of packets.

In [ ]:

(df["octetDeltaCount"] / df["packetDeltaCount"]).hist(range=(0,1600),bins=150, weights=df["packetDeltaCount"])

The distribution is quite bimodal, with small packets clustered around 40 bpp, and large packets clustered around 1500. We can also see some dependency between bytes per packet and application (approximated by port) by plotting these in two dimensions:

In [ ]:

def plot_port_bpp_uniflow(df, by="packetDeltaCount", portrange=(0,65535), filename=None):
    plt.figure(figsize=(9,7))
    plt.hexbin(x = df["octetDeltaCount"] / df["packetDeltaCount"],
           y = df["sourceTransportPort"], 
           C = df[by],
           reduce_C_function = np.sum,
           bins='log',
           cmap = plt.cm.binary,
           extent=(0,1500,portrange[0],portrange[1])),
    cb = plt.colorbar()
    cb.set_label("log10("+by+")")
    plt.xlabel("mean octets/packet")
    plt.ylabel("port")
    if filename:
        plt.savefig(filename)

In [ ]:

plot_port_bpp_uniflow(tcp_df)

In [ ]:

plot_port_bpp_uniflow(tcp_df, portrange=(0,512))

In [ ]:

plot_port_bpp_uniflow(udp_df)

In [ ]:

plot_port_bpp_uniflow(udp_df, portrange=(0,512))