Detecting elusive biologically significant structural differences with serial crystallography - Project Summary/Abstract: Issues underlying human health depend on understanding proteins in different
conformational states (perturbed either by therapeutic compounds or by changes in their environment). The
high brilliance of modern synchrotron and XFEL facilities makes it possible to collect many samples of each conformational state
from a specimen containing proteins in multiple conformational states, yielding thousands of data points that, if
correctly clustered, can provide snapshots of the protein in each of its states. By gaining the cooperation of the
major developers of clustering software, we will combine the strengths of existing tools with new algorithms to
address the urgent problem of reorganizing mixed data from proteins in multiple states into separate data sets, each from
proteins in a single state. Working independently, the software developers who are collaborating on this project
have developed paradigm-changing clustering software. Each of these algorithms works well in specific cases,
but none is sufficient to solve all the clustering problems we now face. Serial crystallography is a powerful
technique in which diffraction patterns from many crystals of the same substance are studied to understand
the possible three-dimensional structure or structures of the substance. It is an essential technique that was made
possible by brilliant new X-ray free electron laser (XFEL) light sources and has since become important
at synchrotrons as well. The data may be organized either as stills (usually at XFELs) or narrow wedges (serial
crystallography at synchrotrons, SXS). In either case, the stills and wedges must be carefully organized into highly
homogeneous clusters of data that can be merged for processing.
There are several alternative approaches to discovering appropriate clusters, based, for example, on comparisons
of crystallographic cell parameters or, alternatively, on comparisons of diffraction reflection
intensities. In many cases, if the data quality and the correct clustering criteria are known in advance, these existing tools
are adequate, especially when their only task is to sort good images from bad ones. However, when one tries to
separate polymorphs, or to follow sequential states in a dynamic system, one requires more effective clustering
algorithms; no single clustering criterion is sufficient. Clustering based on cell parameters is effective at the early
stages of clustering when dealing with partial data sets. One might also investigate other criteria, such as differences
between Wilson plots, to measure the similarity of data. When the original data are sufficiently complete (> 75% today for similar
applications), or when one wants to achieve higher levels of completeness, one can cluster on the correlation of intensities.
It may also be necessary to adjust the weighting of criteria by resolution range. This project is exploring multi-stage
sequential clustering, developing optimal tools that will move from one clustering criterion to another, leading to
merged sets of sufficiently complete reflection-intensity data. These merged sets will provide the information most sensitive to the
phenomena being investigated, and the tools will operate within an integrated software framework.
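
The multi-stage idea described above can be illustrated with a minimal sketch, assuming a deliberately simplified setting. It is not the project's software or any existing tool's algorithm. Stage one groups data sets by cell parameters, and stage two splits each group by correlation of intensities on shared reflections. All names (cell_distance, intensity_distance, single_linkage, two_stage_cluster), the naive Euclidean cell metric, the single-linkage grouping, and the thresholds are hypothetical choices made for illustration only.

```python
from math import sqrt


def cell_distance(c1, c2):
    """Naive Euclidean distance between (a, b, c, alpha, beta, gamma) tuples.

    Assumption: mixing lengths and angles this way is a placeholder for a
    proper cell-comparison metric.
    """
    return sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def intensity_distance(d1, d2, min_common=3):
    """1 - CC over the reflections (hkl keys) present in both data sets."""
    common = sorted(set(d1) & set(d2))
    if len(common) < min_common:
        return 1.0  # too little overlap to compare; treat as dissimilar
    return 1.0 - pearson([d1[h] for h in common], [d2[h] for h in common])


def single_linkage(items, dist, threshold):
    """Group item indices connected by chains of pairwise distances < threshold."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if dist(items[i], items[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def two_stage_cluster(datasets, cell_threshold, intensity_threshold):
    """Stage 1: cluster on cell parameters. Stage 2: split on intensity CC.

    datasets: list of dicts with keys 'cell' (6-tuple) and
    'intensities' (dict mapping hkl tuples to intensities).
    Returns a sorted list of clusters, each a list of data set indices.
    """
    final = []
    cells = [d["cell"] for d in datasets]
    for group in single_linkage(cells, cell_distance, cell_threshold):
        members = [datasets[i]["intensities"] for i in group]
        for sub in single_linkage(members, intensity_distance, intensity_threshold):
            final.append([group[k] for k in sub])
    return sorted(final)
```

In this sketch, data sets that share a cell but whose intensities correlate poorly end up in different final clusters, which is the situation the text describes for polymorphs or sequential states. A realistic pipeline would replace the placeholder metrics with established ones and would likely weight criteria by resolution range.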