This notebook introduces the fundamental objects of MDAnalysis: the Universe, AtomGroup, and Atom objects. It also introduces the selection language, and the basics of reading and slicing trajectories.

1. Fundamental objects¶

`Universes` and `AtomGroups`¶

"If you wish to make an apple pie from scratch, you must first invent the Universe."

~ Carl Sagan

One of the fundamental objects in the MDAnalysis data model is the Universe object. This contains all of a simulations' topology information at the least, but usually also includes trajectory information as well. A Universe can be thought of as an interface to the raw details of a simulation.

First, we need to import MDAnalysis, giving us access to all the components in its namespace:

In [8]:

import MDAnalysis as mda

In order to do anything, we do need some actual molecular dynamics data to work with. You should have downloaded a set of datafiles for this workshop; I placed mine in the following location relative to this notebook:

In [9]:

datapath = 'sandbox/'

To make a Universe, we need at the very least a topology file. We'll use one for a simulation system that includes adenylate kinase with the water removed:

In [10]:

import os
top = os.path.join(datapath, 'equilibrium/adk4AKE.psf')

In [11]:

u = mda.Universe(top)

In [12]:

print u

<Universe with 3341 atoms and 3365 bonds>

We now have a Universe object. Since the PSF file we loaded contained both atom identities and bond information, the Universe is able to access these details.

We can access all atoms in the Universe through the Universe.atoms attribute. This gives access to many details, including basic topology details for each atom.

In [13]:

u.atoms.indices

Out[13]:

array([   0,    1,    2, ..., 3338, 3339, 3340])

In [14]:

u.atoms.names

Out[14]:

array(['N', 'HT1', 'HT2', ..., 'CA', 'HA1', 'HA2'], 
      dtype='|S4')

In [15]:

u.atoms.resnames

Out[15]:

array(['MET', 'MET', 'MET', ..., 'GLY', 'GLY', 'GLY'], 
      dtype='|S3')

In [16]:

u.atoms.resids

Out[16]:

array([  1,   1,   1, ..., 214, 214, 214])

In [17]:

u.atoms.charges

Out[17]:

array([-0.3 ,  0.33,  0.33, ..., -0.02,  0.09,  0.09])

In [18]:

u.atoms.masses

Out[18]:

array([ 14.007,   1.008,   1.008, ...,  12.011,   1.008,   1.008])

In [19]:

u.atoms.types

Out[19]:

array(['NH3', 'HC', 'HC', ..., 'CT2', 'HB', 'HB'], 
      dtype='|S4')

These are all properties of an AtomGroup, of which Universe.atoms is one such instance.

In [20]:

type(u.atoms)

Out[20]:

MDAnalysis.core.AtomGroup.AtomGroup

Many of the methods of an AtomGroup return numpy arrays of the same length as the AtomGroup itself; that is, each element corresponds to each atom in the AtomGroup, in order. That means that if we want only, say, the masses of the 3rd through the 11th atom in our topology, we could simply slice like so:

In [21]:

u.atoms.masses[2:11]

Out[21]:

array([  1.008,   1.008,  12.011,   1.008,  12.011,   1.008,   1.008,
        12.011,   1.008])

Note: AtomGroup indexes are 0-based, even if the topology file gives the atoms 1-based indices. This is for consistency with the rest of the Python ecosystem, including numpy.

But AtomGroups themselves are actually sliceable, so we could instead do:

In [22]:

u.atoms[2:11].masses

Out[22]:

array([  1.008,   1.008,  12.011,   1.008,  12.011,   1.008,   1.008,
        12.011,   1.008])

to achieve the same effect. Slicing of an AtomGroup returns an AtomGroup containing the atoms in the slice, so this gives a basic way of accessing a subset of the atoms present in the topology.

Fancy indexing also works. We can, for example, get an AtomGroup containing the 3rd, 22nd, 3rd, and second-to-last atom in the Universe with something like:

In [23]:

ag = u.atoms[[2, 21, 2, -2]]

In [24]:

ag.indices

Out[24]:

array([   2,   21,    2, 3339])

We can also use expressions that return boolean arrays to select out atoms of interest. We can, for exmple, get an AtomGroup of all atoms in the system with mass less than 10 u with:

In [25]:

u.atoms[u.atoms.masses < 10]

Out[25]:

<AtomGroup with 1685 atoms>

We can string together boolean expressions to obtain even more granular selections, for example getting all atoms with mass less than 10 u AND within an arginine residue:

In [26]:

u.atoms[(u.atoms.masses < 10) & (u.atoms.resnames == 'ARG')]

Out[26]:

<AtomGroup with 169 atoms>

All topology properties of an AtomGroup can be harnessed in this way to make complex subselections.

Working with individual atoms¶

We can also work with individual atoms, though in practice this is rare, and in the future will be relatively inefficient when operating on multiple atoms compared to operating on a whole AtomGroup. At the moment, however, these operations are comparable in speed.

In [27]:

a = u.atoms[0]

In [28]:

Out[28]:

<Atom 1: N of type NH3 of resname MET, resid 1 and segid ADK>

In [29]:

print "name:", a.name
print "resid:", a.resid
print "resname:", a.resname

name: N
resid: 1
resname: MET

Challenge: count the number of glycine residues in the protein¶

In [30]:

import numpy as np

In [31]:

r = u.atoms[u.atoms.resnames == 'GLY'].resids

In [32]:

len(np.unique(r))

Out[32]:

`ResidueGroups` and `SegmentGroups`¶

The Universe also gives higher-order topology objects, including ResidueGroups and SegmentGroups. We can access all residues in the Universe with:

In [76]:

u.residues

Out[76]:

<ResidueGroup [<Residue MET, 1>, <Residue ARG, 2>, <Residue ILE, 3>, <Residue ILE, 4>, <Residue LEU, 5>, <Residue LEU, 6>, <Residue GLY, 7>, <Residue ALA, 8>, <Residue PRO, 9>, <Residue GLY, 10>, <Residue ALA, 11>, <Residue GLY, 12>, <Residue LYS, 13>, <Residue GLY, 14>, <Residue THR, 15>, <Residue GLN, 16>, <Residue ALA, 17>, <Residue GLN, 18>, <Residue PHE, 19>, <Residue ILE, 20>, <Residue MET, 21>, <Residue GLU, 22>, <Residue LYS, 23>, <Residue TYR, 24>, <Residue GLY, 25>, <Residue ILE, 26>, <Residue PRO, 27>, <Residue GLN, 28>, <Residue ILE, 29>, <Residue SER, 30>, <Residue THR, 31>, <Residue GLY, 32>, <Residue ASP, 33>, <Residue MET, 34>, <Residue LEU, 35>, <Residue ARG, 36>, <Residue ALA, 37>, <Residue ALA, 38>, <Residue VAL, 39>, <Residue LYS, 40>, <Residue SER, 41>, <Residue GLY, 42>, <Residue SER, 43>, <Residue GLU, 44>, <Residue LEU, 45>, <Residue GLY, 46>, <Residue LYS, 47>, <Residue GLN, 48>, <Residue ALA, 49>, <Residue LYS, 50>, <Residue ASP, 51>, <Residue ILE, 52>, <Residue MET, 53>, <Residue ASP, 54>, <Residue ALA, 55>, <Residue GLY, 56>, <Residue LYS, 57>, <Residue LEU, 58>, <Residue VAL, 59>, <Residue THR, 60>, <Residue ASP, 61>, <Residue GLU, 62>, <Residue LEU, 63>, <Residue VAL, 64>, <Residue ILE, 65>, <Residue ALA, 66>, <Residue LEU, 67>, <Residue VAL, 68>, <Residue LYS, 69>, <Residue GLU, 70>, <Residue ARG, 71>, <Residue ILE, 72>, <Residue ALA, 73>, <Residue GLN, 74>, <Residue GLU, 75>, <Residue ASP, 76>, <Residue CYS, 77>, <Residue ARG, 78>, <Residue ASN, 79>, <Residue GLY, 80>, <Residue PHE, 81>, <Residue LEU, 82>, <Residue LEU, 83>, <Residue ASP, 84>, <Residue GLY, 85>, <Residue PHE, 86>, <Residue PRO, 87>, <Residue ARG, 88>, <Residue THR, 89>, <Residue ILE, 90>, <Residue PRO, 91>, <Residue GLN, 92>, <Residue ALA, 93>, <Residue ASP, 94>, <Residue ALA, 95>, <Residue MET, 96>, <Residue LYS, 97>, <Residue GLU, 98>, <Residue ALA, 99>, <Residue GLY, 100>, <Residue ILE, 101>, <Residue ASN, 102>, <Residue VAL, 103>, <Residue ASP, 104>, <Residue TYR, 105>, <Residue VAL, 106>, <Residue LEU, 107>, <Residue GLU, 108>, <Residue PHE, 109>, <Residue ASP, 110>, <Residue VAL, 111>, <Residue PRO, 112>, <Residue ASP, 113>, <Residue GLU, 114>, <Residue LEU, 115>, <Residue ILE, 116>, <Residue VAL, 117>, <Residue ASP, 118>, <Residue ARG, 119>, <Residue ILE, 120>, <Residue VAL, 121>, <Residue GLY, 122>, <Residue ARG, 123>, <Residue ARG, 124>, <Residue VAL, 125>, <Residue HSE, 126>, <Residue ALA, 127>, <Residue PRO, 128>, <Residue SER, 129>, <Residue GLY, 130>, <Residue ARG, 131>, <Residue VAL, 132>, <Residue TYR, 133>, <Residue HSE, 134>, <Residue VAL, 135>, <Residue LYS, 136>, <Residue PHE, 137>, <Residue ASN, 138>, <Residue PRO, 139>, <Residue PRO, 140>, <Residue LYS, 141>, <Residue VAL, 142>, <Residue GLU, 143>, <Residue GLY, 144>, <Residue LYS, 145>, <Residue ASP, 146>, <Residue ASP, 147>, <Residue VAL, 148>, <Residue THR, 149>, <Residue GLY, 150>, <Residue GLU, 151>, <Residue GLU, 152>, <Residue LEU, 153>, <Residue THR, 154>, <Residue THR, 155>, <Residue ARG, 156>, <Residue LYS, 157>, <Residue ASP, 158>, <Residue ASP, 159>, <Residue GLN, 160>, <Residue GLU, 161>, <Residue GLU, 162>, <Residue THR, 163>, <Residue VAL, 164>, <Residue ARG, 165>, <Residue LYS, 166>, <Residue ARG, 167>, <Residue LEU, 168>, <Residue VAL, 169>, <Residue GLU, 170>, <Residue TYR, 171>, <Residue HSE, 172>, <Residue GLN, 173>, <Residue MET, 174>, <Residue THR, 175>, <Residue ALA, 176>, <Residue PRO, 177>, <Residue LEU, 178>, <Residue ILE, 179>, <Residue GLY, 180>, <Residue TYR, 181>, <Residue TYR, 182>, <Residue SER, 183>, <Residue LYS, 184>, <Residue GLU, 185>, <Residue ALA, 186>, <Residue GLU, 187>, <Residue ALA, 188>, <Residue GLY, 189>, <Residue ASN, 190>, <Residue THR, 191>, <Residue LYS, 192>, <Residue TYR, 193>, <Residue ALA, 194>, <Residue LYS, 195>, <Residue VAL, 196>, <Residue ASP, 197>, <Residue GLY, 198>, <Residue THR, 199>, <Residue LYS, 200>, <Residue PRO, 201>, <Residue VAL, 202>, <Residue ALA, 203>, <Residue GLU, 204>, <Residue VAL, 205>, <Residue ARG, 206>, <Residue ALA, 207>, <Residue ASP, 208>, <Residue LEU, 209>, <Residue GLU, 210>, <Residue LYS, 211>, <Residue ILE, 212>, <Residue LEU, 213>, <Residue GLY, 214>]>

And all segments with:

In [34]:

u.segments

Out[34]:

<SegmentGroup [<Segment ADK>]>

ResidueGroups and SegmentGroups also behave similarly to AtomGroups, with many of their methods returning numpy arrays with each element corresponding to a single residue or segment, respectively.

In [35]:

u.residues.resnames

Out[35]:

array(['MET', 'ARG', 'ILE', 'ILE', 'LEU', 'LEU', 'GLY', 'ALA', 'PRO',
       'GLY', 'ALA', 'GLY', 'LYS', 'GLY', 'THR', 'GLN', 'ALA', 'GLN',
       'PHE', 'ILE', 'MET', 'GLU', 'LYS', 'TYR', 'GLY', 'ILE', 'PRO',
       'GLN', 'ILE', 'SER', 'THR', 'GLY', 'ASP', 'MET', 'LEU', 'ARG',
       'ALA', 'ALA', 'VAL', 'LYS', 'SER', 'GLY', 'SER', 'GLU', 'LEU',
       'GLY', 'LYS', 'GLN', 'ALA', 'LYS', 'ASP', 'ILE', 'MET', 'ASP',
       'ALA', 'GLY', 'LYS', 'LEU', 'VAL', 'THR', 'ASP', 'GLU', 'LEU',
       'VAL', 'ILE', 'ALA', 'LEU', 'VAL', 'LYS', 'GLU', 'ARG', 'ILE',
       'ALA', 'GLN', 'GLU', 'ASP', 'CYS', 'ARG', 'ASN', 'GLY', 'PHE',
       'LEU', 'LEU', 'ASP', 'GLY', 'PHE', 'PRO', 'ARG', 'THR', 'ILE',
       'PRO', 'GLN', 'ALA', 'ASP', 'ALA', 'MET', 'LYS', 'GLU', 'ALA',
       'GLY', 'ILE', 'ASN', 'VAL', 'ASP', 'TYR', 'VAL', 'LEU', 'GLU',
       'PHE', 'ASP', 'VAL', 'PRO', 'ASP', 'GLU', 'LEU', 'ILE', 'VAL',
       'ASP', 'ARG', 'ILE', 'VAL', 'GLY', 'ARG', 'ARG', 'VAL', 'HSE',
       'ALA', 'PRO', 'SER', 'GLY', 'ARG', 'VAL', 'TYR', 'HSE', 'VAL',
       'LYS', 'PHE', 'ASN', 'PRO', 'PRO', 'LYS', 'VAL', 'GLU', 'GLY',
       'LYS', 'ASP', 'ASP', 'VAL', 'THR', 'GLY', 'GLU', 'GLU', 'LEU',
       'THR', 'THR', 'ARG', 'LYS', 'ASP', 'ASP', 'GLN', 'GLU', 'GLU',
       'THR', 'VAL', 'ARG', 'LYS', 'ARG', 'LEU', 'VAL', 'GLU', 'TYR',
       'HSE', 'GLN', 'MET', 'THR', 'ALA', 'PRO', 'LEU', 'ILE', 'GLY',
       'TYR', 'TYR', 'SER', 'LYS', 'GLU', 'ALA', 'GLU', 'ALA', 'GLY',
       'ASN', 'THR', 'LYS', 'TYR', 'ALA', 'LYS', 'VAL', 'ASP', 'GLY',
       'THR', 'LYS', 'PRO', 'VAL', 'ALA', 'GLU', 'VAL', 'ARG', 'ALA',
       'ASP', 'LEU', 'GLU', 'LYS', 'ILE', 'LEU', 'GLY'], 
      dtype='|S3')

In [36]:

u.segments.segids

Out[36]:

array(['ADK'], 
      dtype='|S3')

We can more easily get the number of glycine residues using a ResidueGroup.

In [37]:

u.atoms[u.atoms.resnames == 'GLY'].residues.n_residues

Out[37]:

or perhaps more Pythonic:

In [38]:

len(u.atoms[u.atoms.resnames == 'GLY'].residues)

Out[38]:

Using coordinates¶

AtomGroups are more interesting with coordinate data attached to them. Our topology file has none of this, so using

In [39]:

u.atoms.positions

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-878f7dd8f697> in <module>()
----> 1 u.atoms.positions

/home/alter/.local/lib/python2.7/site-packages/MDAnalysis/core/AtomGroup.pyc in __getattr__(self, name)
   1041         except SelectionError:
   1042             raise AttributeError("'{0}' object has no attribute '{1}'".format(
-> 1043                     self.__class__.__name__, name))
   1044 
   1045     def _get_named_atom(self, name):

AttributeError: 'AtomGroup' object has no attribute 'positions'

yields an AttributeError. Let's load some coordinates by creating a new Universe, but with a topology + a trajectory.

In [41]:

traj = os.path.join(datapath, 'equilibrium', '1ake_007-nowater-core-dt240ps.dcd')

In [42]:

u = mda.Universe(top, traj)

Now this will work.

In [43]:

u.atoms.positions

Out[43]:

array([[  2.87066364,  10.60445595,   9.75028801],
       [  3.10920501,  11.32470894,   9.0389967 ],
       [  3.74609542,  10.24300766,  10.17991066],
       ..., 
       [ -7.49935293,  10.89219856,  12.33476448],
       [ -6.59658432,  10.83427048,  12.92472839],
       [ -8.34826946,  10.59926796,  12.93470669]], dtype=float32)

Challenge: calculate the center of geometry of the C$_\alpha$ atoms¶

In [62]:

u.atoms[u.atoms.names == 'CA'].positions.mean(axis=0)

Out[62]:

array([ 1.04244471,  2.23443413, -5.1523819 ], dtype=float32)

Alternatively, we have a builtin for this:

In [63]:

u.atoms[u.atoms.names == 'CA'].centroid()

Out[63]:

array([ 1.04244471,  2.23443413, -5.1523819 ], dtype=float32)

Bonds, angles, and dihedrals¶

We can also get at connectivity information between atoms, such as bonds, angles, and dihedrals.

In [125]:

u.bonds

Out[125]:

<TopologyGroup containing 3365 Bonds>

In [126]:

u.angles

Out[126]:

<TopologyGroup containing 6123 Angles>

In [127]:

u.dihedrals

Out[127]:

<TopologyGroup containing 8921 Dihedrals>

In [131]:

u.bonds[3]

Out[131]:

<Bond between: Atom 1 (N of MET 1 None) and Atom 5 (CA of MET1 None), length 1.49 A>

Want the actual value?

In [138]:

u.bonds[3].value()

Out[138]:

1.4905602

In [48]:

import matplotlib.pyplot as plt
%matplotlib inline

In [51]:

plt.hist(u.bonds.values(), bins=30)

Out[51]:

(array([   23.,   295.,    57.,   287.,  1022.,     0.,     0.,    31.,
          105.,   126.,    41.,    71.,   114.,    88.,    81.,   101.,
          129.,   186.,   255.,   221.,   102.,    15.,     2.,     0.,
            0.,     0.,     4.,     2.,     3.,     4.]),
 array([ 0.95999745,  0.99054587,  1.02109429,  1.05164271,  1.08219113,
         1.11273955,  1.14328797,  1.17383639,  1.20438481,  1.23493323,
         1.26548165,  1.29603007,  1.32657849,  1.35712691,  1.38767533,
         1.41822375,  1.44877217,  1.47932059,  1.50986901,  1.54041743,
         1.57096585,  1.60151427,  1.63206269,  1.66261111,  1.69315953,
         1.72370795,  1.75425637,  1.78480479,  1.81535321,  1.84590163,
         1.87645005]),
 <a list of 30 Patch objects>)

These work the same way as AtomGroups. They're sliceable, and indexing them works too to give individual bonds, angles, dihedrals.

2. Selecting atoms with selection strings¶

We've already seen that complex selections can be performed on AtomGroups through boolean array indexing. However, MDAnalysis also features a CHARMM-style atom selection mechanism that is often more convenient.

In [55]:

u.select_atoms?

Although boolean selections work well enough for selecting out atoms from AtomGroups, the selection language makes more complex selections possible with probably less effort. For example, we can selections of the

In [71]:

acidic = u.select_atoms("(resname GLU or resname ASP)")
acidic

Out[71]:

<AtomGroup with 474 atoms>

In [70]:

acidic_o = acidic.select_atoms('(name OD* or name OE*)')
acidic_o

Out[70]:

<AtomGroup with 70 atoms>

In [73]:

basic_n = u.select_atoms( "((resname LYS or resname ARG) and (name NZ or name NH*))")
basic_n

Out[73]:

<AtomGroup with 44 atoms>

If we want only residues that are involved in salt bridges, we can use our AtomGroups as part of additional selections. We can also use the around selection operator to specify only atoms within 4 angstroms. At the end we can get the full residues back.

In [56]:

acidic_res = u.select_atoms("group acidic and around 4 group basic", acidic=acidic_o, basic=basic_n).residues
basic_res = u.select_atoms("group basic and around 4 group acidic", acidic=acidic_o, basic=basic_n).residues

We can write these selections out into a variety of formats; in particular we can output as a VMD selection.

In [75]:

# addition of AtomGroups yields an AtomGroup
(acidic_res + basic_res).write('saltbridge.vmd', name='saltbridge')

Then load into VMD

source saltbridge.vmd

and use the new named selection "saltbridge" in Representations.

Example: domain movement of adenylate kinase (AdK)¶

AdK has three domains: CORE, NMP, LID. The NMP and LID move relative to the CORE. Movements can be defined by two angles between the two domains, $\theta_{\text{NMP}}$ and $\theta_{\text{LID}}$.

These angles are defined as:

CORE residues 1-29, 60-121, 160-214 (gray)
NMP residues 30-59 (blue)
LID residues 122-159 (yellow)

In [92]:

from IPython.core.display import Image

Image("MDAnalysisTutorial/doc/sphinx/figs/angle_defs.png")

Out[92]:

AdK undergoes a conformational transition during which the NMP and LID domain move relative to the CORE domain. The transition can be characterized by the angles between the domains:

In [93]:

Image("MDAnalysisTutorial/doc/sphinx/figs/6b_angle_def_open.png", width=500)

Out[93]:

For simplicity, define $\theta_{\text{NMP}}$ and $\theta_{\text{LID}}$ between the centers of mass of backbone atoms of groups of the following atoms A, B, C:

Definition of $\theta_{\text{NMP}}$

A: 115-125, B: 90-100, C: 35-55

Definition of $\theta_{\text{LID}}$

A: 179-185, B: 115-125, C: 125-153

The angle between vectors $\vec{BA}$ and $\vec{BC}$ is \begin{equation*} \theta = \arccos\left(\frac{\vec{BA}\cdot\vec{BC}}{|\vec{BA}||\vec{BC}|}\right) \end{equation*}

With numpy:

np.arccos(np.cdot(BA, BC))/(np.linalg.norm(BA) * np.linalg.norm(BC))

Write two functions theta_NMP() and theta_LID() that calculate these angles, given the universe.

Hint:

In [90]:

A = u.select_atoms("resid 115:125 and backbone").center_of_geometry()

In [91]:

import numpy as np
from numpy.linalg import norm

def theta_NMP(u):
    """Calculate the NMP-CORE angle for E. coli AdK in degrees"""
    C = u.select_atoms("resid 115:125 and backbone").center_of_geometry()
    B = u.select_atoms("resid 90:100 and backbone").center_of_geometry()
    A = u.select_atoms("resid 35:55 and backbone").center_of_geometry()
    BA = A - B
    BC = C - B
    theta = np.arccos(np.dot(BA, BC)/(norm(BA)*norm(BC)))
    return np.rad2deg(theta)

def theta_LID(u):
    """Calculate the LID-CORE angle for E. coli AdK in degrees"""
    C = u.select_atoms("resid 179:185 and backbone").center_of_geometry()
    B = u.select_atoms("resid 115:125 and backbone").center_of_geometry()
    A = u.select_atoms("resid 125:153 and backbone").center_of_geometry()
    BA = A - B
    BC = C - B
    theta = np.arccos(np.dot(BA, BC)/(norm(BA)*norm(BC)))
    return np.rad2deg(theta)

And now Oliver.