Design considerations for genetic linkage and association studies


This chapter describes the main issues that genetic epidemiologists usually consider in the design of linkage and association studies. For linkage, we briefly consider the situation of rare, highly penetrant alleles showing a disease pattern consistent with Mendelian inheritance, investigated through parametric methods in large pedigrees or with autozygosity mapping in inbred families. We then turn our focus to the most common design, affected sibling pairs, which is of more relevance for common, complex diseases. Theoretical and more practical power and sample size calculations are provided as a function of the strength of the genetic effect being investigated. We also discuss the impact of other determinants of statistical power, such as disease heterogeneity and pedigree and genotyping errors, as well as the effect of the type and density of genetic markers. Linkage studies should be as large as possible to have sufficient power in relation to the expected genetic effect size. Segregation analysis, a formal statistical technique to describe the underlying genetic susceptibility, may, for instance, assist in estimating the relevant parameters to apply. However, segregation analyses estimate the total genetic component rather than a single-locus effect. Locus heterogeneity should therefore be considered both when power is estimated and at the analysis stage, i.e. by assuming a smaller locus effect than the total genetic component from segregation studies. Disease heterogeneity should be minimised by considering subtypes if they are well defined, or otherwise by collecting known sources of heterogeneity and adjusting for them as covariates; the power will depend upon the relationship between the disease subtype and the underlying genotypes. Ultimately, identifying susceptibility alleles of modest effect (e.g. RR≤1.5) requires a number of families that seems unfeasible in a single study.
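As a rough illustration of how the required number of affected sibling pairs grows as the genetic effect weakens, the sketch below uses the classical mean identity-by-descent (IBD) sharing test. It is not taken from the chapter. It assumes a single locus with no dominance variance, a fully informative marker at the disease locus, and a locus-specific sibling recurrence risk ratio (called `lambda_s` here, a parameter introduced for illustration). The one-sided significance level of 1e-4, roughly equivalent to a LOD score of 3, is also an assumption.

```python
import math
from statistics import NormalDist


def asp_ibd_probs(lambda_s):
    """IBD sharing probabilities (0, 1, 2 alleles) for an affected sib pair.

    Assumes a single locus without dominance variance, so that the
    offspring and sibling recurrence risk ratios are equal.
    """
    if lambda_s < 1:
        raise ValueError("lambda_s must be >= 1")
    z0 = 0.25 / lambda_s
    z1 = 0.5
    z2 = 0.5 - 0.25 / lambda_s
    return z0, z1, z2


def asp_pairs_needed(lambda_s, alpha=1e-4, power=0.8):
    """Approximate number of sib pairs for the mean IBD sharing test.

    Normal approximation; assumes fully informative markers
    (an optimistic assumption, so real studies need more pairs).
    """
    if lambda_s <= 1:
        raise ValueError("lambda_s must be > 1 for a detectable effect")
    z0, z1, z2 = asp_ibd_probs(lambda_s)
    # Proportion of alleles shared IBD per pair takes values 0, 0.5, 1.
    mean_alt = 0.5 * z1 + z2
    var_alt = 0.25 * z1 + z2 - mean_alt ** 2
    mean_null, var_null = 0.5, 0.125  # values under no linkage
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided test
    z_power = NormalDist().inv_cdf(power)
    n = ((z_alpha * math.sqrt(var_null) + z_power * math.sqrt(var_alt))
         / (mean_alt - mean_null)) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    for lam in (3.0, 2.0, 1.5, 1.2, 1.1):
        print(f"lambda_s={lam}: about {asp_pairs_needed(lam)} sib pairs")
```

Because the excess sharing shrinks towards zero as `lambda_s` approaches 1, the required number of pairs rises steeply for weak loci, and it would rise further once marker informativeness, locus heterogeneity, and genotyping errors are taken into account.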
Meta-analysis and data pooling between different research groups can provide a sizeable study, but both approaches require an even higher level of vigilance about locus and disease heterogeneity when data come from different populations. All necessary steps should be taken at the study design stage to minimise pedigree and genotyping errors, as they are, for the most part, due to human factors. A two-stage design is more cost-effective than a one-stage design when using short tandem repeats (STRs). However, dense single-nucleotide polymorphism (SNP) arrays offer a more robust alternative, and due to their lower cost per unit, the total cost of studies using SNPs may in the future become comparable to that of studies using STRs in one or two stages.
For association studies, we consider the popular case–control design for dichotomous phenotypes, and we provide power and sample size calculations for one-stage and multistage designs. For candidate genes, guidelines are given on the prioritisation of genetic variants, and for genome-wide association studies (GWAS), the issue of choosing an appropriate SNP array is discussed. A warning is issued regarding the danger of designing an underpowered replication study following an initial GWAS. The risk of finding spurious associations due to population stratification, cryptic relatedness, and differential bias is underlined. GWAS have high power to detect common variants of high or moderate effect. For weaker effects (e.g. relative risk<1.2), the power is greatly reduced, particularly for recessive loci. While sample sizes of 10,000 or 20,000 cases are not beyond reach for most common diseases, only meta-analyses and data pooling can attain a study size of this magnitude for many other diseases. It is acknowledged that detecting the effects of rare alleles (i.e. frequency<5%) is not feasible in GWAS, and it is expected that novel methods and technology, such as next-generation resequencing, will fill this gap.
At the current stage, the choice of which GWAS SNP array to use does not influence the power in populations of European ancestry. A multistage design reduces the study cost but has less power than the standard one-stage design. If one opts for a multistage design, the power can be improved by jointly analysing the data from the different stages for the SNPs they share. The estimates of a locus's contribution to disease risk obtained from genome-wide scans are often biased, and relying on them might result in an underpowered replication study. Population structure has so far caused fewer spurious associations than initially feared, thanks to systematic ethnicity matching and the application of standard quality control measures. Differential bias could be a more serious threat and must be minimised by strictly controlling all aspects of DNA acquisition, storage, and processing.
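The loss of power for weak effects in a one-stage case–control study can be illustrated with a simple normal-approximation calculation for an allelic (two-proportion) test. This is a minimal sketch under stated assumptions rather than the chapter's own calculation: a multiplicative (per-allele) relative risk, Hardy-Weinberg equilibrium in the population, a disease rare enough that controls carry the population allele frequency, the causal variant genotyped directly, and a two-sided genome-wide threshold of 5e-8. The function names are hypothetical.

```python
from statistics import NormalDist


def case_allele_freq(p, rr):
    """Risk-allele frequency in cases under a multiplicative model.

    p  : risk-allele frequency in the population
    rr : per-allele relative risk (assumed multiplicative)
    """
    return rr * p / (rr * p + (1 - p))


def allelic_test_power(n_cases, n_controls, p, rr, alpha=5e-8):
    """Approximate power of a two-sided allelic test (normal approximation).

    Assumes controls have the population allele frequency p
    (rare-disease approximation) and 2 alleles per individual.
    """
    p_case = case_allele_freq(p, rr)
    p_ctrl = p
    # Standard error of the difference in allele frequencies.
    se = (p_case * (1 - p_case) / (2 * n_cases)
          + p_ctrl * (1 - p_ctrl) / (2 * n_controls)) ** 0.5
    ncp = (p_case - p_ctrl) / se
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the threshold in either tail.
    return (1 - nd.cdf(z_crit - ncp)) + nd.cdf(-z_crit - ncp)


if __name__ == "__main__":
    for rr in (1.5, 1.3, 1.2, 1.1):
        for n in (2000, 10000, 20000):
            pw = allelic_test_power(n, n, p=0.2, rr=rr)
            print(f"RR={rr}, {n} cases and {n} controls: power={pw:.3f}")
```

Varying `rr`, `p`, and the sample sizes shows the qualitative pattern described above. Recessive models, imperfect tagging of the causal variant by array SNPs, and multistage designs are not covered by this sketch and would each lower the power.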

Statistical Human Genetics 2012; 237-262