The Python syntax is extremely well documented on other web pages and the purpose of this site is not to compete with these. Here I will focus on scripts and code snippets useful within the context of modeling. In most cases this page will serve as a personal reference on how to use python in different ways as I tend to forget as time goes by.
Content:
0000000000111111111122222222223333333333444444444455555555556666666666777777777 0123456789012345678901234567890123456789012345678901234567890123456789012345678
ATOM 377 CA ARG E 66 -1.910 71.379 192.241 1.00 27.20 C ATOM 378 C ARG E 66 -1.718 72.280 193.441 1.00 29.68 C ATOM 379 O ARG E 66 -1.892 73.500 193.361 1.00 33.25 O ATOM 380 CB ARG E 66 -0.948 71.737 191.087 1.00 29.81 C HETATM 731 Ca2 Ca2 E 700 -1.171 70.878 189.840 1.00 22.08 CaEach line in this example contains 79 letters including spaces (0-78).
The easiest way to work with files in genral in python, is to read them as lists. That is, each line is going to be a list element with its own index. This is done by opening the file and read ('r') in each line with the .readlines() function:
>>> pdbfile = open('1bzx.pdb','r').readlines()
Now 'pdbfile' is a list, with n entries, where n is the number of lines in 1bzx.pdb. You can check the length of a list by typing:>>> len(pdbfile)
which for the pdb file 1bzx.pdb will return 2690. However, there is 2219 atoms in the structure, so the rest of the lines are from the pdb header and the atom connect part at the end of the file. To create a new list with only the structure we need to extract each line containing the atomic coordinates, that is the lines starting with 'ATOM' or 'HETATM'. First we create a empty list (I call it protein here), and then we add all lines from 'pdbfile' with the atomic coordinates to this file:>>> protein = [] >>> for line in pdbfile: ... atomLine = line.split()[0] ... if 'ATOM' in atomLine or 'HETATM' in atomLine: ... protein.append(line)The length of this list is 2219, which is the number of atoms present in the original '1bzx.pdb' file. Now you can read and manipulate atom by atom by entering the corresponding index. For example, if you type:
>>> protein[0] 'ATOM 1 N ILE E 16 -1.219 88.613 187.589 1.00 21.35 N \n'the first list element (the first atom with coordinates) is returned as a string. This is not really usefull yet. However, if we now combine our knowledge of the PDB file format with some python functions, we can go further than just reading lines. One of the most usefull functions is the 'split' function. If you type:
>>> protein[0].split() ['ATOM', '1', 'N', 'ILE', 'E', '16', '-1.219', '88.613', '187.589', '1.00', '21.35', 'N']a new list containing 12 elements is returned. This function has split the previous line on white spaces where for example elemente 7-9 are the x,y and z coordinates. To extract only the coordinates, you can type:
>>> protein[0].split()[6:9] ['-1.219', '88.613', '187.589']However, whereas the split function is easy to use and requires less typing, it may in some cases not do what you want it to. This is because PDB files often has some lines where not all entries are separated by spaces. Look at this example:
ATOM 356 CG1AVAL E 63 -10.280 70.440 188.215 0.50 29.25 CHere the 'split()' will not separate the alternative pdb atom type CG1A from the residue name VAL. It will return 'CG1AVAL'. On a general basis it is therefor safer to not use the split functions, but rather to specify the letter indices defined above.
To extract parts of the pdb file based on letter indicies, try to type the following:
>>> protein[0][17:20] 'ILE'which returns the residue name, that is ILE in this case, which is letters 17-19 (the last number in python is omitted). This way of extracting the data requires a bit more typing, but it will ensure that you get what you expect every time.
atomDef = line.split()[0]
atomnumber = int(line.split()[1])
pdbAtom = line.split()[2]
residueName = line.split()[3]
chain = line.split()[4]
residueNumber = int(line.split()[5])
x = float(line.split()[6])
y = float(line.split()[7])
z = float(line.split()[8])
occupancy = float(line.split()[9])
bfactor = float(line.split()[10])
atom = line.split()[11]
atomDef = line[0:6]
atomnumber = int(line[6:11])
pdbAtom = line[13:17]
residueName = int(line[17:20])
chain = line[21:22]
residueNumber = int(line[22:26])
x = float(line[30:38])
y = float(line[38:46])
z = float(line[46:54])
occupancy = float(line[56:60])
bfactor = float(line[61:66])
atom = line[77:79]
rij = ( (xj - xi)2 + (yj - yi)2 + (zj - zi)2)1/2
To make it simple, I will make the function so that it takes two atom numbers, and returns the distance between them. All we need to do is to search through the PDB file for the selected atom numbers, and then get the corresponding coordinates to calculate the distance. Since the distance involves using the square root function, we also need to import math (or numpy). If we also want to use the code from the command line and take inputs from there, we need to import sys.#!/usr/bin/env python import sys import math
To define a python function we use 'def':
def distance(filename, atomnr1, atomnr2):The function 'distance' now expects 3 inputs, the name of the PDB file and the two atom numbers. The next step is to open the PDB file and find the correct coordinates for the selected atom pair.
pdbfile = open(filename,'r').readlines() for line in pdbfile: try: if int(line.split()[1]) == int(atomnr1): x1 = float(line.split()[6]) y1 = float(line.split()[7]) z1 = float(line.split()[8]) elif int(line.split()[1]) == int(atomnr2): x2 = float(line.split()[6]) y2 = float(line.split()[7]) z2 = float(line.split()[8]) except: continueNow that we got the coordinates, all that is left is to calculate the distance and return it to the user:
distance = math.sqrt( (x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2 ) return distanceSave the script as 'atomdist.py'.
The function can now be imported in python and used directly as it is. Open python type:
>>> import atomdist
Now atomdist contains the function 'distance', so to use it you have to call it by typing:>>> atomdist.distance('pdbfile',atomnr1,atomnr2)
Personally I find the code easier to use if I can use it directly from the command line without using python interactively. Just add the following code after the code definition:
try: pdbfile = str(sys.argv[1]) atomnr1 = sys.argv[2] atomnr2 = sys.argv[3] r12 = distance(pdbfile,atomnr1,atomnr2) print str(r12) except: print 'Please input pdb file, and two atom numbers'From the command line you can now for example type:
python atomdist.py 1bzx.py 230 389
which will return the distance between atom 230 and 389 in the PDB file with entry code 1BZX.pdb.