Simple API level usage of the XFEL Data Exploration Toolbox

n.b. you will need cctbx.xfel installed to run this, and will need to run it with the command libtbx.ipython.

API documentation can be found at http://cci.lbl.gov/cctbx_docs/xfel/xfel.clustering.html#cluster-cluster. Note that this is in active development, so it may be updated frequently or contain new methods or classes that are not yet finished.

This tutorial will demonstrate the API level usage of the XFEL data exploration toolkit, using a test data set with only 49 images, for simplicity. On my local machine, this is at:

In [19]:

TESTDATA = ['/users/oli/Dropbox/Stanford_Postdoc/CODING/cctbx_testing/PolG_test_data']

In [2]:

import logging
import numpy as np
import matplotlib.pyplot as plt
import brewer2mpl
# Set up logging
reload(logging)  # work-around for IPython
FORMAT = '%(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
# pretty colors
cols = brewer2mpl.get_map('BrBG', 'Diverging', 3).mpl_colors

Let's start by creating a cluster object, using the from_directories class method.

In [3]:

from xfel.clustering.cluster import Cluster
t_clus = Cluster.from_directories(TESTDATA)

Next, we'll do some hierarchical clustering on this to get a sense of the unit cells and point groups that make up our data. sub_clusters will be a list of clusters, and we're ignoring the other return value (the axes object we just plotted onto).

In [5]:

sub_clusters, _ = t_clus.ab_cluster(labels=False, write_file_lists=False)

Hierarchical clustering of unit cells
Using Andrews-Bernstein distance from Andrews & Bernstein J Appl Cryst 47:346 (2014)
Distances have been calculated
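To give a feel for what hierarchical clustering with a distance cutoff does, here is a minimal, self-contained sketch. It is not the cctbx implementation: it uses plain Euclidean distance on invented (a, b, c) cell lengths as a stand-in for the Andrews-Bernstein distance, and the names euclidean, single_linkage and toy_cells are hypothetical. It relies on the fact that cutting a single-linkage dendrogram at a threshold gives the same groups as linking every pair of points closer than that threshold.

```python
from __future__ import print_function
import math


def euclidean(u, v):
    """Plain Euclidean distance; a stand-in for the Andrews-Bernstein metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_linkage(points, threshold, dist=euclidean):
    """Group points so that any pair within `threshold` ends up in one cluster."""
    parent = list(range(len(points)))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    # Smallest cluster first, so the largest clusters sit at the end of the list
    return sorted(groups.values(), key=len)


# Hypothetical (a, b, c) cell lengths in Angstrom, invented for illustration
toy_cells = [
    (227.0, 227.5, 286.0),
    (228.0, 227.0, 287.5),
    (226.5, 228.0, 286.8),
    (224.0, 286.0, 400.0),
    (225.5, 287.0, 398.0),
    (484.0, 600.0, 689.0),
]
toy_clusters = single_linkage(toy_cells, threshold=5.0)
print([len(c) for c in toy_clusters])  # prints [1, 2, 3]
```

With real data the choice of metric matters a great deal, since unit cells can be described in several equivalent settings; that is the problem the Andrews-Bernstein distance addresses and this toy metric ignores.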

So let's look at the composition of the sub_clusters list we got. Since this acts on a group of clusters, we will import a tool from the cluster_groups module.

In [6]:

from xfel.clustering.cluster_groups import unit_cell_info
pretty_str = unit_cell_info(sub_clusters)
print pretty_str

12 clusters.
C_id  Num in cluster Med_a       Med_b       Med_c       Med_alpha    Med_beta     Med_gamma   
cluster_10       2        230.7(1.1 ) 290.2(3.7 ) 784.3(1.4 ) 90.00 (0.00) 90.00 (0.00) 90.00 (0.00)
2 in P222.
cluster_11       14       224.6(4.6 ) 286.5(1.8 ) 400.0(7.0 ) 90.00 (0.00) 90.00 (0.00) 90.00 (0.00)
10 in C222, 4 in P222.
cluster_12       24       227.4(1.6 ) 227.4(1.6 ) 286.8(2.4 ) 90.00 (0.00) 90.00 (0.00) 120.00(0.00)
24 in P3.
Standard deviations are in brackets.
9 singletons:

 Point group   a           b           c           alpha        beta         gamma       
C2             225.7       270.5       1187.4     90.0         90.0         94.0        
C222           231.2       286.7       1131.1     90.0         90.0         90.0        
P2             225.9       229.4       287.2      90.0         90.0         119.4       
C2             228.4       299.2       392.5      93.1         90.0         90.0        
C2             23.7        330.1       358.3      97.9         90.0         90.0        
C2             226.3       290.9       399.7      91.3         90.0         90.0        
P222           226.8       396.7       572.0      90.0         90.0         90.0        
C2             223.1       486.0       682.3      92.7         90.0         90.0        
P1             484.1       600.6       689.2      99.9         91.1         104.5

We see the two biggest clusters that were in red and green in the plot above. Let's call these clu_a and clu_b. These are just the last two elements of the list, since it is sorted by cluster size.

In [9]:

clu_b, clu_a = sub_clusters[-2:]
print "clu_a size: {}\nclu_b size: {}".format(len(clu_a.members), len(clu_b.members))

clu_a size: 24
clu_b size: 14
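The unpacking order used here is easy to get backwards, so here is a small sketch of the same idiom on plain lists. The names toy_groups, second_largest and largest are hypothetical, and each "cluster" is just a list of made-up image names.

```python
from __future__ import print_function

# Hypothetical stand-ins: each "cluster" is simply a list of image names
toy_groups = [["img1"], ["img2", "img3", "img4"], ["img5", "img6"]]

# Sort ascending by size, so the largest group is the last element
toy_sorted = sorted(toy_groups, key=len)

# Slicing the last two elements yields (second largest, largest), in that order
second_largest, largest = toy_sorted[-2:]
print(len(largest), len(second_largest))
```

This is why the larger cluster is bound to the second name on the left-hand side.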

Let's now pretend we are interested only in the smaller cluster, clu_b. Let us examine the intensity distribution for this, again ignoring the returned axes object:

In [17]:

_ = clu_b.all_frames_intensity_stats()

That's not looking great: a very high B value, and some rise in the intensities at higher resolution. These are in fact pretty poor quality data, so that's consistent. Now let's look at the point group composition and see if either of these looks better separately:

In [13]:

clu_b.pg_composition

Out[13]:

{'C222': 10, 'P222': 4}

In [24]:

clu_bC = clu_b.point_group_filter('C222')
clu_bP = clu_b.point_group_filter('P222') 
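Conceptually, the composition is a tally of point group labels and the filter is a selection on that label. The sketch below illustrates this with plain tuples; it is not how the Cluster class is implemented, and toy_members, composition and filter_by_point_group are hypothetical names with invented image labels.

```python
from __future__ import print_function
from collections import Counter

# Hypothetical (image name, point group) pairs, sized to match the counts above
toy_members = ([("img%02d" % i, "C222") for i in range(10)] +
               [("img%02d" % i, "P222") for i in range(10, 14)])


def composition(members):
    """Count how many members carry each point group label."""
    return dict(Counter(point_group for _, point_group in members))


def filter_by_point_group(members, point_group):
    """Keep only the members whose label matches `point_group`."""
    return [m for m in members if m[1] == point_group]


print(composition(toy_members))
toy_C = filter_by_point_group(toy_members, "C222")
toy_P = filter_by_point_group(toy_members, "P222")
print(len(toy_C), len(toy_P))
```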

In [27]:

_ = clu_bC.all_frames_intensity_stats()

Still looking a bit wonky at high resolution.

In [29]:

_ = clu_bP.all_frames_intensity_stats()

This looks a little more sensible. Now, let's look at the info string:

In [34]:

print clu_bP.info

Made from files in ['/users/oli/Dropbox/Stanford_Postdoc/CODING/cctbx_testing/PolG_test_data']
############################## Next filter ##############################
Made using ab_cluster with t=10000, distance method, and single linkage
14 of 49 images passedon to this cluster
############################## Next filter ##############################
Cluster filtered by for point group P222.
4 of 14 images passedon to this cluster
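The info string above grows by one entry per processing step. A minimal sketch of that pattern follows, assuming nothing about the real Cluster internals: ToyCluster, its filtered method and the example filters are all hypothetical, and the members are just integers.

```python
from __future__ import print_function


class ToyCluster(object):
    """Hypothetical stand-in for a cluster: members plus a provenance string."""

    SEPARATOR = "\n" + "#" * 30 + " Next filter " + "#" * 30 + "\n"

    def __init__(self, members, info):
        self.members = list(members)
        self.info = info

    def filtered(self, keep, description):
        """Return a new ToyCluster and record what was done in its info string."""
        kept = [m for m in self.members if keep(m)]
        note = "{}\n{} of {} images passed on to this cluster".format(
            description, len(kept), len(self.members))
        # The parent is left untouched; the child carries the full history
        return ToyCluster(kept, self.info + self.SEPARATOR + note)


toy_all = ToyCluster(range(49), "Made from files in ['some_directory']")
toy_even = toy_all.filtered(lambda m: m % 2 == 0, "Kept even-numbered images.")
toy_small = toy_even.filtered(lambda m: m < 10, "Kept images numbered below 10.")
print(toy_small.info)
```

Returning a new object from each filter, rather than modifying the parent in place, means every intermediate cluster remains available together with its own history.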

Note how the info string keeps track of the cluster's provenance, so we know where it came from. Let's now write these images of interest out, so that we can, for example, look at them individually using cctbx.image_viewer:

In [31]:

clu_bP.dump_file_list(out_file_name='temp.lst')

In [32]:

cat temp.lst

/users/oli/Dropbox/Stanford_Postdoc/CODING/cctbx_testing/PolG_test_data/int-s00-2013-11-10T17:53Z14.813_00000.pickle
/users/oli/Dropbox/Stanford_Postdoc/CODING/cctbx_testing/PolG_test_data/int-s03-2013-11-10T17:51Z07.396_00000.pickle
/users/oli/Dropbox/Stanford_Postdoc/CODING/cctbx_testing/PolG_test_data/int-s03-2013-11-10T17:54Z36.359_00000.pickle
/users/oli/Dropbox/Stanford_Postdoc/CODING/cctbx_testing/PolG_test_data/int-s04-2013-11-10T18:28Z45.434_00000.pickle

Marvelous!
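Since a list file like temp.lst is plain text with one path per line, it is easy to read back in later scripts. The sketch below writes and re-reads such a file using only the standard library; write_file_list and read_file_list are hypothetical helpers rather than part of cctbx, and the paths are invented.

```python
from __future__ import print_function
import os
import tempfile


def write_file_list(paths, out_file_name):
    """Write one path per line."""
    with open(out_file_name, "w") as handle:
        for path in paths:
            handle.write(path + "\n")


def read_file_list(in_file_name):
    """Read the paths back, skipping blank lines."""
    with open(in_file_name) as handle:
        return [line.strip() for line in handle if line.strip()]


# Invented paths, written to a scratch directory so nothing real is overwritten
toy_paths = ["/hypothetical/data/int-a_00000.pickle",
             "/hypothetical/data/int-b_00000.pickle"]
toy_list = os.path.join(tempfile.mkdtemp(), "toy.lst")
write_file_list(toy_paths, toy_list)
print(read_file_list(toy_list))
```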