Project summary
There is a crisis of reproducibility and replicability of scientific results. This crisis is an increasing source of
concern both in the scientific and popular press. The crisis is so acute that the United States Congress is currently
investigating reproducibility of the scientific process. At the heart of this crisis is a collection of problems including
small sample sizes, underpowered studies, undertrained data analysts, and an inability to directly leverage prior
results in the statistical analysis of smaller, hypothesis-driven experiments using high-throughput technologies.
Advances in technology have dramatically reduced the cost and difficulty of collecting high-throughput molecular
data. Large collections of raw data are increasingly publicly available but are usually incorporated into individual
analyses by NIGMS and other investigators on an ad-hoc basis. Meanwhile, the other costs of running a designed,
hypothesis-driven study have not decreased at the same pace as the technology has advanced. It is still expensive to
identify, recruit, collect, and follow up samples even if the high-throughput measurements themselves are cheap.
Despite the incredible amount of available public data, it is still common practice to perform statistical inference
in these hypothesis-driven experiments study-by-study, only indirectly including previous data, estimates, and
results. As a result, findings from these studies may be highly variable, unreliable, or unreplicable. Our group has focused
on developing statistical methods, data resources, software, and training that allow researchers to borrow
strength empirically from public repositories, large-scale data generation projects, and crowd-sourced data to
improve inference in individual, hypothesis-driven studies. We propose to build on our work in developing
statistical data sources, methods, software and training that facilitate and speed the work of our biological and
medical collaborators. The result will be a research community that can take advantage of public data already
collected at great cost to the NIH to improve power, reduce required sample sizes, and improve replication in
many new hypothesis-driven molecular studies of development and disorder.