Protein Structures
The csb.bio.structure module defines some of the most fundamental abstractions in the library: Structure, Chain, Residue and Atom. Instances of these classes can exist on their own, but usually they are part of a Composite aggregation. The root node in this Composite is a Structure (or an Ensemble). Structures are composed of Chains, and each Chain is a collection of Residues. The leaf node is Atom.
All of these objects implement the base AbstractEntity interface. Therefore, every node in the Composite can be transformed:
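>>> import numpy
>>> r = numpy.eye(3) # 3x3 rotation matrix (identity, for illustration)
>>> t = numpy.zeros(3) # translation vector (zero, for illustration)
>>> entity.transform(r, t)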
and it knows its immediate children:
>>> entity.items
<iterator> # over all immediate child entities
If you want to traverse the complete Composite tree, starting at an arbitrary level and going down to the lowest level, use one of the CompositeEntityIterators, or just call AbstractEntity.components():
>>> entity.components()
<iterator> # over all descendants, of any type, at any level
>>> entity.components(klass=Residue)
<iterator> # over all Residue descendants
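The difference between items and components() can be illustrated with a small, self-contained sketch. The Entity class below is a hypothetical stand-in written for this example, not CSB's AbstractEntity, and the depth-first traversal order is an assumption:

```python
class Entity:
    """Hypothetical Composite node; a stand-in, not CSB's AbstractEntity."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    @property
    def items(self):
        # Iterator over the immediate children only
        return iter(self._children)

    def components(self, klass=None):
        # Depth-first iterator over all descendants, optionally
        # restricted to instances of the given class
        for child in self._children:
            if klass is None or isinstance(child, klass):
                yield child
            yield from child.components(klass)


class Structure(Entity): pass
class Chain(Entity): pass
class Residue(Entity): pass
class Atom(Entity): pass


structure = Structure("demo", [
    Chain("A", [
        Residue("ALA1", [Atom("N"), Atom("CA")]),
        Residue("GLY2", [Atom("N"), Atom("CA")]),
    ]),
])

children = [e.name for e in structure.items]
residues = [e.name for e in structure.components(klass=Residue)]
everything = [e.name for e in structure.components()]
```

Here children contains only the chain, residues contains the two residues regardless of how deep they sit in the tree, and everything lists every descendant node.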
Some of the inner objects in this hierarchy behave just like dictionaries (but are not):
>>> structure.chains['A'] # access chain A by ID
<Chain A: Protein>
>>> structure['A'] # the same
<Chain A: Protein>
>>> residue.atoms['CA'] # access an atom by its name
<Atom: CA>
>>> residue['CA'] # the same
<Atom: CA>
Others behave like list collections:
>>> chain.residues[10] # 1-based access to the residues in the chain
<ProteinResidue [10]: PRO 10>
>>> chain[10] # 0-based, list-like access
<ProteinResidue [11]: GLY 11>
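The two indexing conventions above are easy to confuse, so here is a minimal sketch of how a 1-based collection can sit next to 0-based, list-like access. OneBasedList and this Chain class are hypothetical names used only for illustration; they are not CSB's ChainResiduesCollection or Chain:

```python
class OneBasedList:
    """Hypothetical list-like collection indexed by rank, starting at 1."""

    def __init__(self, items=()):
        self._items = list(items)

    def append(self, item):
        self._items.append(item)
        return len(self._items)  # rank of the newly added item

    def __getitem__(self, rank):
        if not 1 <= rank <= len(self._items):
            raise IndexError("rank out of range: {0}".format(rank))
        return self._items[rank - 1]

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


class Chain:
    """Hypothetical chain exposing both access styles."""

    def __init__(self, residues):
        self.residues = OneBasedList(residues)

    def __getitem__(self, index):
        # 0-based, like a regular Python list
        return list(self.residues)[index]


chain = Chain(["MET", "ALA", "GLY"])
first_by_rank = chain.residues[1]   # 1-based
first_by_index = chain[0]           # 0-based
```

Both expressions refer to the same residue; rank 0 is rejected by the 1-based collection instead of silently wrapping around.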
Step-wise building of Ensembles, Chains and Residues is supported through a number of append methods, for example:
>>> residue = ProteinResidue(401, ProteinAlphabet.ALA)
>>> s.chains['A'].residues.append(residue)
See EnsembleModelsCollection, StructureChainsTable, ChainResiduesCollection and ResidueAtomsTable in our API docs for more details.
Some other objects in this module of potential interest are the self-explanatory SecondaryStructure and TorsionAngles.
PDB I/O
CSB comes with a number of PDB structure parsers, format builders and database providers, all defined in the csb.bio.io.wwpdb package. The most basic usage is:
>>> parser = StructureParser('structure.pdb')
>>> parser.parse_structure()
<Structure> # a Structure object (model)
or, if the file contains an NMR ensemble:
>>> parser.parse_models()
<Ensemble> # an Ensemble object (collection of alternative Structures)
This module introduces a family of PDB file parsers. The common interface of all parsers is defined in AbstractStructureParser. This class has several implementations:
- RegularStructureParser - handles normal PDB files with SEQRES fields
- LegacyStructureParser - reads structures from legacy or malformed PDB files that lack SEQRES records (initializes all residues from the ATOM records instead)
- PDBHeaderParser - reads only the headers of PDB files and produces structures without coordinates. Useful for reading metadata (e.g. accession numbers or just plain SEQRES sequences) with minimal overhead
Unless you have a special reason not to, you should use the StructureParser factory, which returns an appropriate AbstractStructureParser implementation, depending on the input PDB file. If the input file looks like a regular PDB file, the factory returns a RegularStructureParser; otherwise it instantiates a LegacyStructureParser. StructureParser is in fact an alias for AbstractStructureParser.create_parser.
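The factory idea can be sketched in a few lines. Everything below is hypothetical: the class names, the create_parser function and the detection rule (presence of at least one SEQRES record) are assumptions made for illustration, not CSB's actual implementation:

```python
import os
import tempfile


class BaseParser:
    """Hypothetical common base class for the sketch."""

    def __init__(self, path):
        self.path = path


class RegularParser(BaseParser):
    """Hypothetical parser for files that contain SEQRES records."""


class LegacyParser(BaseParser):
    """Hypothetical fallback for files without SEQRES records."""


def create_parser(path):
    # Assumed criterion: a file counts as "regular" if it has a SEQRES record
    with open(path) as stream:
        has_seqres = any(line.startswith("SEQRES") for line in stream)
    return RegularParser(path) if has_seqres else LegacyParser(path)


def write_temp(text):
    # Helper for the demo: write text to a temporary .pdb file
    handle, path = tempfile.mkstemp(suffix=".pdb")
    with os.fdopen(handle, "w") as stream:
        stream.write(text)
    return path


regular_path = write_temp("HEADER    DEMO\nSEQRES   1 A    2  ALA GLY\n")
legacy_path = write_temp("ATOM      1  N   ALA A   1\n")

regular = create_parser(regular_path)
legacy = create_parser(legacy_path)

os.remove(regular_path)
os.remove(legacy_path)
```

Callers only ever talk to create_parser, so the choice of the concrete class stays in one place and can change without touching client code.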
Writing your own, customized PDB parser is easy. Suppose that you are trying to parse a PDB-like file which misuses the charge column to store custom info. This will certainly crash AbstractStructureParser (and rightly so), but you can create your own parser as a workaround. All you need to do is override the virtual _read_charge hook method:
class CustomParser(RegularStructureParser):

    def _read_charge(self, line):
        try:
            return super(CustomParser, self)._read_charge(line)
        except StructureFormatError:
            return None
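The same hook-override pattern can be tried out without CSB installed. The sketch below is a simplified, hypothetical model: StrictParser, LenientParser, FormatError and parse_atom are names invented for this example, and the charge parsing only assumes the standard PDB layout, where the charge occupies columns 79-80 of an ATOM record (e.g. "2+" or "1-"):

```python
class FormatError(ValueError):
    """Hypothetical error type, standing in for StructureFormatError."""


class StrictParser:
    """Hypothetical parser with a replaceable charge-reading hook."""

    def _read_charge(self, line):
        # Columns 79-80 (0-based slice 78:80) hold the charge, e.g. "2+"
        field = line[78:80].strip()
        if not field:
            return None
        if len(field) == 2 and field[0] in "0123456789" and field[1] in "+-":
            value = int(field[0])
            return value if field[1] == "+" else -value
        raise FormatError("Malformed charge field: {0!r}".format(field))

    def parse_atom(self, line):
        # The overall flow is fixed here; only the hook varies per subclass
        return {"name": line[12:16].strip(), "charge": self._read_charge(line)}


class LenientParser(StrictParser):

    def _read_charge(self, line):
        try:
            return super()._read_charge(line)
        except FormatError:
            return None  # ignore custom data stored in the charge column


valid_line = "ATOM      1  CA  ALA A   1".ljust(78) + "2+"
custom_line = "ATOM      1  CA  ALA A   1".ljust(78) + "XY"

strict_result = StrictParser().parse_atom(valid_line)
lenient_result = LenientParser().parse_atom(custom_line)
```

The strict parser raises FormatError on custom_line, while the lenient subclass reports the charge as None and keeps going.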
Another important abstraction in this module is StructureProvider. It has several implementations which can be used to retrieve PDB Structures from various sources: file system directories, remote URLs, etc. You can easily create your own provider as well. See StructureProvider for details.
Finally, this module gives you some FileBuilders, used for text serialization of Structures and Ensembles:
>>> builder = PDBFileBuilder(stream)
>>> builder.add_header(structure)
>>> builder.add_structure(structure)
where stream is any Python stream, e.g. an open file or sys.stdout.