Structural genomics is the assignment of three-dimensional structures to proteomes and the investigation of their biological implications. Protein structure is an important indicator of function, particularly where the structure of a new protein is homologous to one already known. Two levels of assignment are employed in structural genomics, one being experimental large-scale determination of protein structures using NMR or X-ray crystallography, and the other computational structure prediction through detection of homologies with proteins of known structure.
With numerous genome sequences already available and the majority of human genes represented in the EST database, it is becoming increasingly likely that a family to which a new protein belongs is represented already in the databases.
Computational methods involve pairwise or multiple sequence comparisons, fold recognition, predictions of secondary structures based on statistical rules derived from structures, and modelling. Mycoplasma genitalium, with only 479 proteins, is a focus for computational investigations in structural genomics. As with the other functional genomics technologies, structure prediction is dependent on databases and appropriate search programs.
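As a concrete illustration of the pairwise comparison mentioned above, the following is a minimal sketch of a global (Needleman-Wunsch) alignment score. The function name and the match, mismatch and gap values are illustrative assumptions; practical searches normally use a substitution matrix such as BLOSUM62 and affine gap penalties rather than these flat scores.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are placeholders chosen for readability only.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACDE", "ACE"))
```

Only two rows of the dynamic-programming table are kept, which is enough when the score alone is needed; recovering the alignment itself would require storing the full table or traceback pointers.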
Predictions can be made by searching databases of complete protein domains (CATH, ProDom, SCOP), collections of structural or functional sequence motifs (BLOCKS, PRINTS) or libraries of conserved sequence patterns associated with specific functions (PROSITE). For example, the SCOP (Structural Classification of Proteins) database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, with links to PDB entries, sequences, references, images and interactive display systems. Generally, the difficulty in obtaining the best results from a database search is one of signal to noise, that is, of separating genuine relationships from chance matches.
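Searching with a conserved sequence pattern of the PROSITE kind can be sketched by translating the pattern into a regular expression. The code below is a simplified illustration, not the PROSITE scanning software: the function names are hypothetical, the example sequence is invented, and only the basic pattern elements are handled (x for any residue, [..] for alternatives, {..} for exclusions, repeat counts in parentheses, and the < and > terminal anchors).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the match at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat, e.g. (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                  # 'x' stands for any residue
            core = "."
        elif core.startswith("{"):       # {P} stands for any residue except P
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex += prefix + core + repeat + suffix
    return regex


def find_motifs(pattern, sequence):
    """Return (0-based start, matched text) for every occurrence, overlaps included."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


if __name__ == "__main__":
    # N-{P}-[ST]-{P} is the N-glycosylation site pattern; the sequence is made up.
    print(find_motifs("N-{P}-[ST]-{P}.", "MKNGTAANPSA"))
```

Short patterns such as this one match frequently by chance, which is one face of the signal-to-noise problem: a hit indicates a candidate site, not a confirmed function.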
To compensate for the noise, the sensitivity of the search method can be increased, for example by using PSI-BLAST, which combines the popular BLAST algorithm with profile analysis. Integrating pre-processing methods into the search scheme also considerably improves signal to noise. For multidomain proteins, searching with the entire sequence is much less sensitive than searching with segments that are located between known domains. Thus scanning databases of known domains is an important complement to standard database searches. In the absence of recognisable sequence similarity, threading approaches (fold assignment by checking a sequence for compatibility with known three-dimensional structures, e.g. using ProFIT) may reveal additional insights, as has been successfully demonstrated for leptin.
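The idea of searching with the segments located between known domains can be sketched as a small pre-processing step. The function below is a hypothetical helper, not part of any search package: it assumes domain hits are supplied as 0-based, half-open (start, end) coordinates, and the minimum segment length of 30 residues is an arbitrary illustrative cut-off.

```python
def unassigned_segments(seq_len, domains, min_len=30):
    """Return (start, end) regions of a sequence not covered by any known domain.

    Coordinates are 0-based and half-open. Segments shorter than min_len
    are dropped because they are unlikely to make useful search queries.
    """
    segments = []
    cursor = 0  # first residue not yet covered by a domain
    for start, end in sorted(domains):
        if start - cursor >= min_len:
            segments.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping domain hits
    if seq_len - cursor >= min_len:
        segments.append((cursor, seq_len))
    return segments


if __name__ == "__main__":
    # Invented domain coordinates for a 500-residue protein.
    hits = [(100, 200), (150, 260), (400, 450)]
    for start, end in unassigned_segments(500, hits):
        print(start, end)
```

Each returned region could then be cut from the full sequence and submitted as a separate query, so that strong matches to the already known domains do not mask weaker similarities elsewhere in the protein.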
Model building can be carried out where a sequence from a genome shows a sufficiently good match (>30% identity) to a sequence in the PDB; a public repository of predicted models of protein structures, cross-referenced with sequence databases and the PDB, has been initiated jointly by the PDB and the Swiss Institute of Bioinformatics.
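The >30% identity criterion can be checked directly on a pairwise alignment. The sketch below assumes the two sequences have already been aligned, with "-" marking gaps, and computes identity over the columns in which both sequences have a residue. That denominator is one of several conventions in use, so the percentage can differ from the figure reported by a given alignment program; the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)


def suitable_for_modelling(aligned_a, aligned_b, threshold=30.0):
    """Apply the >30% identity rule of thumb; the threshold can be overridden."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    # Invented alignment: five ungapped columns, four of them identical.
    print(percent_identity("ACDE-G", "ACQEFG"))
    print(suitable_for_modelling("ACDE-G", "ACQEFG"))
```

Identity computed over a very short aligned region is not meaningful on its own, so in practice the alignment length would be taken into account alongside the percentage.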
A number of bioinformatics methods devoted to the analysis of protein-protein interactions have been developed in recent years. Various docking procedures have been described that try to predict whether two proteins are able to interact and what the structure of the resulting complex would be. Protein docking methods can be applied when the three-dimensional structures of the proteins are known or when a good structural model can be built.