Python

Python is my favorite programming language, mainly because of its readability, but also because the syntax allows you to express concepts in fewer lines of code than would be possible with other languages. Python features a fully dynamic type system and is therefore perfect as a scripting language, but it is also used in a wide range of non-scripting contexts.

The Python syntax is extremely well documented on other web pages and the purpose of this site is not to compete with these. Here I will focus on scripts and code snippets useful within the context of modeling. In most cases this page will serve as a personal reference on how to use python in different ways as I tend to forget as time goes by.

Content:

PDB files in general
Atom distances

PDB files in general

To read and manipulate PDB files with python we need to know some general rules about the format of these files. Typically one is only interested in the structure itself, that is the lines with the atomic coordinates, and not the PDB header with the crystallographic information etc. These lines start with 'ATOM' or 'HETATM' (or 'TER') and follow the general format:

0000000000111111111122222222223333333333444444444455555555556666666666777777777 
0123456789012345678901234567890123456789012345678901234567890123456789012345678

ATOM    377  CA  ARG E  66      -1.910  71.379 192.241  1.00 27.20           C  
ATOM    378  C   ARG E  66      -1.718  72.280 193.441  1.00 29.68           C  
ATOM    379  O   ARG E  66      -1.892  73.500 193.361  1.00 33.25           O  
ATOM    380  CB  ARG E  66      -0.948  71.737 191.087  1.00 29.81           C  
HETATM  731  Ca2 Ca2 E 700      -1.171  70.878 189.840  1.00 22.08           Ca

Each line in this example contains 79 letters including spaces (0-78).

Letters 0:5: ATOM/HETATM/TER
Letters 6:10: Atom number
Letters 13:16: PDB atom type
Letters 17:19: Residue name
Letter 21: Chain
Letters 22:25: Residue number
Letters 30:37: X coordinate
Letters 38:45: Y coordinate
Letters 46:53: Z coordinate
Letters 56:59: Occupancy
Letters 61:65: B factor
Letters 77:78: Atom

When the format is clarified, python can easily be used to extract information and manipulate the data in the PDB file as you wish.

Using python to read PDB files

The easiest way to work with files in genral in python, is to read them as lists. That is, each line is going to be a list element with its own index. This is done by opening the file and read ('r') in each line with the .readlines() function:

>>> pdbfile = open('1bzx.pdb','r').readlines()

Now 'pdbfile' is a list, with n entries, where n is the number of lines in 1bzx.pdb. You can check the length of a list by typing:

>>> len(pdbfile)

which for the pdb file 1bzx.pdb will return 2690. However, there is 2219 atoms in the structure, so the rest of the lines are from the pdb header and the atom connect part at the end of the file. To create a new list with only the structure we need to extract each line containing the atomic coordinates, that is the lines starting with 'ATOM' or 'HETATM'. First we create a empty list (I call it protein here), and then we add all lines from 'pdbfile' with the atomic coordinates to this file:

>>> protein = []
>>> for line in pdbfile:
...   atomLine = line.split()[0]
...   if 'ATOM' in atomLine or 'HETATM' in atomLine:
...      protein.append(line)

The length of this list is 2219, which is the number of atoms present in the original '1bzx.pdb' file. Now you can read and manipulate atom by atom by entering the corresponding index. For example, if you type:

>>> protein[0]
'ATOM      1  N   ILE E  16      -1.219  88.613 187.589  1.00 21.35           N  \n'

the first list element (the first atom with coordinates) is returned as a string. This is not really usefull yet. However, if we now combine our knowledge of the PDB file format with some python functions, we can go further than just reading lines. One of the most usefull functions is the 'split' function. If you type:

>>> protein[0].split()
['ATOM', '1', 'N', 'ILE', 'E', '16', '-1.219', '88.613', '187.589', '1.00', '21.35', 'N']

a new list containing 12 elements is returned. This function has split the previous line on white spaces where for example elemente 7-9 are the x,y and z coordinates. To extract only the coordinates, you can type:

>>> protein[0].split()[6:9]
['-1.219', '88.613', '187.589']

However, whereas the split function is easy to use and requires less typing, it may in some cases not do what you want it to. This is because PDB files often has some lines where not all entries are separated by spaces. Look at this example:

ATOM    356  CG1AVAL E  63     -10.280  70.440 188.215  0.50 29.25           C

Here the 'split()' will not separate the alternative pdb atom type CG1A from the residue name VAL. It will return 'CG1AVAL'. On a general basis it is therefor safer to not use the split functions, but rather to specify the letter indices defined above.

To extract parts of the pdb file based on letter indicies, try to type the following:

>>> protein[0][17:20]
'ILE'

which returns the residue name, that is ILE in this case, which is letters 17-19 (the last number in python is omitted). This way of extracting the data requires a bit more typing, but it will ensure that you get what you expect every time.

Atom distances

This is an example on how to use python to do simple calculations based on the coordinates in the PDB files. The goal is to create a simple function to calculate the distance between any two pairs of atoms in a PDB file.

r_ij = ( (x_j - x_i)² + (y_j - y_i)² + (z_j - z_i)²)^1/2

To make it simple, I will make the function so that it takes two atom numbers, and returns the distance between them. All we need to do is to search through the PDB file for the selected atom numbers, and then get the corresponding coordinates to calculate the distance. Since the distance involves using the square root function, we also need to import math (or numpy). If we also want to use the code from the command line and take inputs from there, we need to import sys.

#!/usr/bin/env python

import sys
import math

To define a python function we use 'def':

def distance(filename, atomnr1, atomnr2):

The function 'distance' now expects 3 inputs, the name of the PDB file and the two atom numbers. The next step is to open the PDB file and find the correct coordinates for the selected atom pair.

   pdbfile = open(filename,'r').readlines()

   for line in pdbfile:
      try:
         if int(line.split()[1]) == int(atomnr1):
            x1 = float(line.split()[6])
            y1 = float(line.split()[7])
            z1 = float(line.split()[8])
         elif int(line.split()[1]) == int(atomnr2):
            x2 = float(line.split()[6])
            y2 = float(line.split()[7])
            z2 = float(line.split()[8])
      except:
         continue

Now that we got the coordinates, all that is left is to calculate the distance and return it to the user:

   distance = math.sqrt( (x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2 )

   return distance

Save the script as 'atomdist.py'.

The function can now be imported in python and used directly as it is. Open python type:

>>> import atomdist

Now atomdist contains the function 'distance', so to use it you have to call it by typing:

>>> atomdist.distance('pdbfile',atomnr1,atomnr2)

Personally I find the code easier to use if I can use it directly from the command line without using python interactively. Just add the following code after the code definition:

try:
   pdbfile = str(sys.argv[1])
   atomnr1 = sys.argv[2]
   atomnr2 = sys.argv[3]

   r12 = distance(pdbfile,atomnr1,atomnr2)

   print str(r12)
   
except:
   print 'Please input pdb file, and two atom numbers'

From the command line you can now for example type:

python atomdist.py 1bzx.py 230 389

which will return the distance between atom 230 and 389 in the PDB file with entry code 1BZX.pdb.

Python

PDB files in general

Using python to read PDB files

PDB format white spaces

PDB format letter indices

Atom distances

NAVIGATION

INFO

LINKS