Welcome to cytoflow
! I'm glad you're here. The following is a heavily commented workflow for importing a few tubes of cytometry data and doing some (very) basic analysis. The goal is to give you not only a taste of using the library for interactive work, but also some insight into the rationale for the way it's designed the way it is and the way it differs from existing projects.
cytoflow
's goal is to enable quantitative, reproducible cytometry analysis. Reproducibility between cytometry experiments is poor, due in part to differences in analysis between operators; but if all your analysis is in a Jupyter
notebook (like this one!), then sharing and reuse of workflows is much easier.
Let's look at a very basic experiment, containing only two tubes. These two tubes contain cells that are expressing fluorescent proteins as the read-out of some cell state. We'll assume that these two tubes are identical, except that one has been induced with 1 µM of the small molecule inducer Doxycycline (aka 'Dox') and the other tube has been induced with 10 µM Dox.
Start by setting up Jupyter's plotting interface, then import the cytoflow
module.
# in some buggy versions of the Jupyter notebook, this needs to be in its OWN CELL.
%matplotlib inline
import cytoflow as flow
# if your figures are too big or too small, you can scale them by changing matplotlib's DPI
import matplotlib
matplotlib.rc('figure', dpi = 160)
The central data structure in cytoflow
is the Experiment
, which is basically a pandas.DataFrame
containing the events; its associated metadata (experimental conditions and the like); and some methods to manipulate them.
You usually create an Experiment
using an instance of the ImportOp
class. Start by defining two tubes, including their experimental conditions (ie, how much Dox is in each); then give those tubes, and a dict
specifying the experimental conditions' names and types, to ImportOp
. Call the apply()
function to get back the Experiment
with all the data in it.
tube1 = flow.Tube(file = 'data/RFP_Well_A3.fcs',
conditions = {'Dox' : 10.0})
tube2 = flow.Tube(file='data/CFP_Well_A4.fcs',
conditions = {'Dox' : 1.0})
import_op = flow.ImportOp(conditions = {'Dox' : 'float'},
tubes = [tube1, tube2])
ex = import_op.apply()
Once you have an Experiment
instance, this is the last time you should ever think about tubes or wells. Rather, think of your experiment as a very large set of single-cell measurements, each of which has some metadata associated with it; in this case, how much Dox the cell was exposed to. cytoflow
helps you focus your analysis on how those measurements change as the experimental conditions change, without worrying about what cells were in what tube.
Let's have a quick look at one of the fluorescence channels, Y2-A
. Instantiate a HistogramView
and tell it which channel we're looking at, then call plot
and pass it the experiment containing the data. Remember, this is not a single tube, but rather all the data in the Experiment
.
hist = flow.HistogramView()
hist.channel = 'Y2-A'
hist.plot(ex)
Hmmm. This plot is hard to interpret because most of the data is clustered around 0, in the linear range of the detector's response. Let's re-plot using a different scale. My favorite is logicle
, which has a linear response around 0 and a log range elsewhere. We specify the plot scale by setting the scale
attribute of HistogramView
to logicle
; other options are log
and linear
.
The cell below also demonstrates a different way to parameterize the HistogramView
instance, by passing parameters to the constructor instead of setting the values of the instance's attributes after we make it. Either way is correct.
hist = flow.HistogramView(channel = 'Y2-A',
scale = 'logicle')
hist.plot(ex)
Ah, much better. There is clearly a population of cells around 0 and a population of cells around about 5000. But! This is the entire Experiment
-- is is one of the populations from the low-Dox tube and the other population from the high-Dox tube?
Let's see if the histogram is different for the two different concentrations of inducer. CytoFlow
's plotting takes inspiration from Trellis plots (eg the lattice package in R): to split the data into subsets, you tell the plotting module which metadata "facet" you want to plot in the X, Y and color (hue) axes.
This time, we tell HistogramView
to make a separate plot for each different value of Dox and stack them on top of eachother by saying yfacet = 'Dox'
; if we wanted the plots side-by-side, we would have said xfacet = 'Dox'
.
Also note that this time, we don't keep around the HistogramView
object; if we don't need to re-use it, we can just call plot
right after the constructor.
flow.HistogramView(channel = 'Y2-A',
scale = 'logicle',
yfacet = 'Dox').plot(ex)
Indeed, the two tubes have dramatically different histograms. We could also plot them on top of eachother with different colors, by using huefacet
instead of yfacet
.
flow.HistogramView(channel = "Y2-A",
scale = "logicle",
huefacet = 'Dox').plot(ex)
So, there's a clear difference between these tubes: one has a substantial population above about ~200 in the Y2-A, and the other doesn't. What's the proportion of "high" cells? Let's gate out the "high" population using a ThresholdOp
.
thresh = flow.ThresholdOp(name = "T",
channel = "Y2-A",
threshold = 200)
ex2 = thresh.apply(ex)
Two important things to note here: first, an Operation
(such as ThresholdOp
) does not operate in place; instead, you create a ThresholdOp
object, then call apply()
to run it on an Experiment
. The apply()
function returns a new Experiment
.
Second, a gate does not remove events; it simply adds additional metadata to the events already there. You can see this if we look at the underlying pandas.DataFrame
for ex
and ex2
:
print(ex.data.head())
print("----")
print(ex2.data.head())
B1-A B1-H B1-W Dox FSC-A FSC-H \ 0 -127.094002 257.718353 -63040.300781 10.0 459.962982 437.354553 1 -70.234840 255.798340 -34034.042969 10.0 -267.174652 365.354553 2 -96.471756 313.398346 -41931.687500 10.0 -201.582336 501.354553 3 18.831570 277.669250 8514.489258 10.0 291.259888 447.029755 4 100.882095 255.756256 51291.074219 10.0 -397.168579 354.565308 FSC-W HDR-T SSC-A SSC-H SSC-W \ 0 137847.578125 2.018511 840.091370 747.917847 147225.328125 1 -95849.679688 27.451754 3476.902344 3163.917969 144038.046875 2 -52700.828125 32.043865 480.270691 507.917877 123937.437500 3 85399.273438 79.327492 8026.275879 6741.838867 156043.484375 4 -146821.125000 79.731194 7453.750488 609.884277 262143.968750 V2-A V2-H V2-W Y2-A Y2-H \ 0 41.593452 240.153854 22701.017578 109.946274 153.630051 1 103.437637 336.153870 40332.058594 5554.108398 4273.629883 2 -271.375580 256.153870 -138860.828125 81.835281 121.630051 3 -26.212378 207.677841 -16543.453125 -54.120304 98.122017 4 44.559933 216.036865 27035.013672 -10.542595 127.326027 Y2-W 0 93802.468750 1 170344.203125 2 88188.023438 3 -72294.242188 4 -10852.761719 ---- B1-A B1-H B1-W Dox FSC-A FSC-H \ 0 -127.094002 257.718353 -63040.300781 10.0 459.962982 437.354553 1 -70.234840 255.798340 -34034.042969 10.0 -267.174652 365.354553 2 -96.471756 313.398346 -41931.687500 10.0 -201.582336 501.354553 3 18.831570 277.669250 8514.489258 10.0 291.259888 447.029755 4 100.882095 255.756256 51291.074219 10.0 -397.168579 354.565308 FSC-W HDR-T SSC-A SSC-H SSC-W \ 0 137847.578125 2.018511 840.091370 747.917847 147225.328125 1 -95849.679688 27.451754 3476.902344 3163.917969 144038.046875 2 -52700.828125 32.043865 480.270691 507.917877 123937.437500 3 85399.273438 79.327492 8026.275879 6741.838867 156043.484375 4 -146821.125000 79.731194 7453.750488 609.884277 262143.968750 V2-A V2-H V2-W Y2-A Y2-H \ 0 41.593452 240.153854 22701.017578 109.946274 153.630051 1 103.437637 336.153870 40332.058594 5554.108398 4273.629883 2 -271.375580 256.153870 -138860.828125 81.835281 121.630051 3 -26.212378 207.677841 -16543.453125 -54.120304 98.122017 4 44.559933 216.036865 27035.013672 -10.542595 127.326027 Y2-W T 0 93802.468750 False 1 170344.203125 True 2 88188.023438 False 3 -72294.242188 False 4 -10852.761719 False
In ex2, the ThresholdOp
added a new column, named T
, which has the same name as the ThresholdOp
's name
attribute. It is True
if the Y2-A
value is greater than 200, and False
otherwise. The T
column is now a piece metadata just like the Dox concentration that we can use to facet the plots. For example, the following plots different Dox
concentrations in separate plots stacked on top of eachother (yfacet = "Dox"
), and on each of these plots uses separate colors for the "low" and "high" populations (huefacet = 'T'
):
flow.HistogramView(channel = "Y2-A",
scale = "logicle",
yfacet = "Dox",
huefacet = 'T').plot(ex2)
It's also important to note that we can still access the underlying pandas.Dataframe
: either by looking at the Experiment.data
attribute, or by referencing a column directly:
ex2["T"].head(10)
0 False 1 True 2 False 3 False 4 False 5 False 6 True 7 False 8 True 9 False Name: T, dtype: bool
Because the data is all stored in a single pandas.Dataframe
, we can use the pandas
API, and indeed the rest of the SciPy stack, to ask sophisticated questions of the underlying data. For example, "how many events of each Dox concentration were above the threshold?"
ex2.data.groupby(['Dox', 'T']).size()
Dox T 1.0 False 9946 True 54 10.0 False 5561 True 4439 dtype: int64
cytoflow
can answer the same question for us using one of its statistics views. In cytoflow
, a statistic is a number that summarizes a population; one of the key features
of cytoflow
is that it makes it easy to see how these summary statistics change as your experimental conditions change.
Several operations add statistics to an experiment; one of the most straightforward ones is ChannelStatisticOp
. It groups the experiment's data by the conditions specified in by
, then applies function
to channel
for each group.
In this case, we'll divide up the data into the subgroups T == True & Dox == 1
; T == True & Dox == 10
; T == False & Dox == 1
; and T == False & Dox == 10
. Then, we'll apply the function len
to the Y2-A
channel.
ex3 = flow.ChannelStatisticOp(name = "ByDox",
channel = "Y2-A",
by = ['T', 'Dox'],
function = len).apply(ex2)
Now we can look at the new statistic: Experiment.statistics
is a dictionary whose keys are tuples and whose values are the computed statistics (stored as MultiIndex
ed pandas.Series
). The first element in the tuple is the name of the operation that added it, and the second is defined by that operation. In this case, it's len
, the name of the function.
ex3.statistics[("ByDox", "len")]
T Dox False 1.0 9946 10.0 5561 True 1.0 54 10.0 4439 Name: ByDox : len, dtype: int64
We can also plot statistics using one of the various statistics views, such as BarChartView
.
flow.BarChartView(statistic = ("ByDox", "len"),
variable = "Dox",
huefacet = 'T').plot(ex3)
Statistics are important enough that they get an entire notebook of examples; please see Statistics.ipynb
for a more in-depth exploration.
I hope this makes the semantics of the cytoflow
package clear. This was a pretty simplistic toy analysis; for more sophisticated examples, see the other accompanying Jupyter notebooks.