- Frankfurt Institute for Advanced Studies (1)
Objective identification of residue ranges for the superposition of protein structures
Donata K. Kirchner
Background: The automation of objectively selecting amino acid residue ranges for structure superpositions is important for meaningful and consistent protein structure analyses. So far there is no widely used standard for choosing these residue ranges for experimentally determined protein structures, and the manual selection of residue ranges or the use of suboptimal criteria remains commonplace.
Results: We present an automated and objective method for finding amino acid residue ranges for the superposition and analysis of protein structures, in particular for structure bundles resulting from NMR structure calculations. The method is implemented in an algorithm, CYRANGE, that yields, without protein-specific parameter adjustment, appropriate residue ranges in most commonly occurring situations, including low-precision structure bundles, multi-domain proteins, symmetric multimers, and protein complexes. Residue ranges are chosen to comprise as many residues of a protein domain as possible, up to the point where adding further residues would lead to a steep rise in the RMSD value. Residue ranges are determined by first clustering residues into domains based on the distance variance matrix and then, for each domain, refining the initial choice of residues by excluding residues one by one until the relative decrease of the RMSD value becomes insignificant. A penalty for the opening of gaps favours contiguous residue ranges in order to obtain a result that is as simple as possible, but not simpler. Results are given for a set of 37 proteins and compared with those of commonly used protein structure validation packages. We also provide residue ranges for 6351 NMR structures in the Protein Data Bank.
Conclusions: The CYRANGE method is capable of automatically determining residue ranges for the superposition of protein structure bundles for a large variety of protein structures. The method correctly identifies ordered regions.
Global structure superpositions based on the CYRANGE residue ranges allow a clear presentation of the structure, and unnecessary small gaps within the selected ranges are absent. In the majority of cases, the residue ranges from CYRANGE contain fewer gaps and cover considerably larger parts of the sequence than those from other methods without significantly increasing the RMSD values. CYRANGE thus provides an objective and automatic method for standardizing the choice of residue ranges for the superposition of protein structures.
Additional files
Additional file 1: Dependence of Q on the order parameter rank. The quantity Q_i is plotted against the order parameter rank i for 9 different protein structure bundles.
Additional file 2: Dependence of P on the clustering stage. The quantity P_i is plotted against the clustering stage i for 9 different protein structure bundles.
Additional file 3: Dependence of CYRANGE results on the minimal cluster size parameter my. The sequence coverage (red) and RMSD (blue) of the residue ranges determined by CYRANGE were plotted as a function of my for 9 different protein structure bundles. The dotted vertical line indicates the default value, my = 8. Where CYRANGE found two domains, the RMSD values of the individual domains are shown in light and dark blue.
Additional file 4: Dependence of CYRANGE results on the domain boundary extension parameter m. See Additional File 3 for details.
Additional file 5: Dependence of CYRANGE results on the minimal gap width g. See Additional File 3 for details.
Additional file 6: Dependence of CYRANGE results on the relative RMSD decrease parameter delta. See Additional File 3 for details.
Additional file 7: Dependence of CYRANGE results on the absolute RMSD decrease parameter delta abs. See Additional File 3 for details.
Additional file 8: Dependence of CYRANGE results on the gap penalty parameter gamma. See Additional File 3 for details.
Additional file 9: Correlation between the sequence coverage from CYRANGE, FindCore and PSVS, and the GDT total score, GDT_TS. Each data point represents a protein shown in Figures 3 and 4. The coverage is the percentage of amino acid residues included in the residue ranges found by the different methods. The GDT_TS value is defined by GDT_TS = (P_1 + P_2 + P_4 + P_8)/4, where P_d is the fraction of residues that can be superimposed under a distance cutoff of d Å.
Additional file 10: Correlation between the RMSD value for the residue ranges from CYRANGE, FindCore and PSVS, and the GDT total score, GDT_TS. Each data point represents one protein domain. See Additional File 9 for details.
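The GDT_TS definition given for Additional file 9 can be made concrete with a minimal sketch. It assumes that the per-residue distances (in Å) come from a single, already fixed superposition; full GDT implementations instead search over superpositions to maximize each fraction, which is not attempted here. The function names `fraction_within` and `gdt_ts` and the inclusive comparison `<=` are illustrative choices, not taken from the original text.

```python
def fraction_within(distances, cutoff):
    """Fraction of residues whose deviation is at most `cutoff` (in Å).

    Assumption: `distances` holds one deviation per residue, measured
    after a superposition that has already been chosen elsewhere.
    """
    if not distances:
        raise ValueError("need at least one per-residue distance")
    return sum(1 for d in distances if d <= cutoff) / len(distances)


def gdt_ts(distances):
    """GDT_TS = (P_1 + P_2 + P_4 + P_8) / 4, returned as a fraction in [0, 1]."""
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    return sum(fraction_within(distances, c) for c in cutoffs) / len(cutoffs)


if __name__ == "__main__":
    # Hypothetical deviations for five residues, in Å.
    example = [0.5, 1.5, 3.0, 6.0, 10.0]
    print(f"GDT_TS = {gdt_ts(example):.3f}")
```

Because each cutoff contributes with equal weight, a residue that deviates by less than 1 Å counts in all four fractions, whereas one beyond 8 Å counts in none.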
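The first step named in the Results, clustering residues into domains on the basis of the distance variance matrix, starts from a matrix that can be sketched as follows. This is an illustration under stated assumptions, not the CYRANGE implementation: it takes one representative atom per residue (for example Cα) and uses the population variance of each inter-residue distance across the conformers of the bundle. The function name `distance_variance_matrix` and the input layout are hypothetical.

```python
import math
from statistics import pvariance


def distance_variance_matrix(conformers):
    """Variance of each inter-residue distance across a structure bundle.

    conformers: list of conformers; each conformer is a list of (x, y, z)
    coordinates, one per residue (assumed here: one atom per residue).
    Returns a symmetric n_res x n_res list of lists with zero diagonal.
    """
    if not conformers:
        raise ValueError("need at least one conformer")
    n_res = len(conformers[0])
    if any(len(conf) != n_res for conf in conformers):
        raise ValueError("all conformers must have the same number of residues")

    matrix = [[0.0] * n_res for _ in range(n_res)]
    for i in range(n_res):
        for j in range(i + 1, n_res):
            # Distance between residues i and j in every conformer.
            dists = [math.dist(conf[i], conf[j]) for conf in conformers]
            matrix[i][j] = matrix[j][i] = pvariance(dists)
    return matrix


if __name__ == "__main__":
    # Hypothetical two-conformer bundle: residues 0-2 are rigid, residue 3 moves.
    conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    conf_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    for row in distance_variance_matrix([conf_a, conf_b]):
        print(row)
```

Residue pairs that keep a fixed distance across the bundle get entries near zero, so blocks of low variance are natural candidates for rigid domains. The clustering itself and the subsequent residue exclusion with a gap penalty are not shown.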