This impressive undertaking brings new understanding to the functional aspects of the genome and can probably be considered the most significant genomic discovery step since the sequencing of the whole human genome in 2000. The ENCODE project assigned biochemical function to about 80% of the genome, and in particular to elements outside of the well-studied protein-coding regions.
The findings of the ENCODE consortium – comprising over 400 researchers working in 32 laboratories across the world – indicate that a much greater proportion of the genome is biologically active than had previously been thought, and should effectively dispel the notion of ‘junk’ DNA.
The results of functional analyses on 147 different cell types demonstrate that at least 80% of the genome performs a specific function – mostly regulating the activity of the 2% that comprises protein-encoding genes. The work identified some 4 million regulatory elements in total, many of which are located far away on the genome from the gene they control.
This is arguably the most significant step forward in our understanding of how the human genome works since the releases of the initial draft sequence in 2000 and the final draft in 2003. At that time it was deeply surprising to many to find that humans possessed only around 20,000 genes occupying less than 2% of the genome, leading some to label the other 98% as ‘junk’ DNA.
Evidence from functional and genome-wide association studies over the years has made this an increasingly defunct term as it became apparent that a large proportion of the single base mutations that cause disease fell between gene coding regions, but this new comprehensive analysis should put an end to the idea once and for all.
Dr Ewan Birney of the European Bioinformatics Institute (EBI) in Cambridge, who coordinated the data analysis, said: “This will give researchers a whole new world to explore and ultimately, it’s hoped, will lead to new treatments.” He also pointed out that the job was still far from done, and that deep characterisation is probably only around 10% complete. It is quite possible that much of the remaining 20% of the genome has a functional role that has yet to be identified.
The mapping provides new insights into gene organisation and most of all, mechanisms of regulation. A central goal in biology – understanding the enormous diversity of gene expression in different cell types under various physiological conditions – can be considered partly achieved.
The project yielded invaluable information on the human transcriptional regulatory network with systematic analyses of transcription factors, chromatin structure and regulatory modifications. All these findings shine new light on our concept of the gene.
Some of the newly identified elements correspond to sequence variants linked to human disease, and can therefore guide interpretation of these variations. Genome-wide association studies have previously identified many noncoding variants associated with common diseases and traits. Such variants systematically perturb transcription, alter chromatin states, and form regulatory networks. ENCODE’s results point to the involvement of regulatory DNA variation in common human disease and provide pathogenic insights into diverse disorders.
The publication of such a detailed analysis of the functionalities of the human genome has understandably generated much enthusiasm among scientists and the general public alike. Confirmation that a far larger portion of our genome is biologically active than previously thought has been an exciting discovery, and researchers hope the findings will lead to a deeper understanding of numerous diseases.
It is however important to remember, and for the scientific community to clearly acknowledge, that despite these fantastic results it may be many years before patients see any benefits from the project. Better understanding of the functional complexity of the human genome will undeniably lead to improved control of disease and to better treatments, but the road to clinical implications and applications is still long and difficult.
Ku et al. Mod Pathol. 2012;25(8):1055-1068
Recent advances in genotyping and sequencing technologies have provided powerful tools with which to explore the genetic basis of both Mendelian (monogenic) and sporadic (polygenic) diseases. Several hundred genome-wide association studies have so far been performed to explore the genetics of various polygenic or complex diseases including those cancers with a genetic predisposition. Exome sequencing has also proven very successful in elucidating the etiology of a range of hitherto poorly understood Mendelian disorders caused by high-penetrance mutations. Despite such progress, the genetic etiology of several familial cancers, such as familial colorectal cancer type X, has remained elusive. Familial colorectal cancer type X and Lynch syndrome are similar in terms of their fulfilling certain clinical criteria, but the former group is not characterized by germline mutations in DNA mismatch-repair genes. On the other hand, the genetics of sporadic colorectal cancer have been investigated by genome-wide association studies, leading to the identification of multiple new susceptibility loci. In addition, there is increasing evidence to suggest that familial and sporadic cancers exhibit similarities in terms of their genetic etiologies. In this review, we have summarized our current knowledge of familial colorectal cancer type X, discussed current approaches to probing its genetic etiology through the application of new sequencing technologies and the recruitment of the results of colorectal cancer genome-wide association studies, and explored the challenges that remain to be overcome given the uncertainty of the current genetic model (ie, monogenic vs polygenic) of familial colorectal cancer type X.
Recent developments in high-throughput sequence capture methods and next-generation sequencing technologies have made exome sequencing a viable approach to the identification of pathological mutations, both from a technical standpoint and in terms of being cost-effective.[1–4] The advent of exome sequencing has already contributed significantly toward the identification of new causal mutations (and genes) for a number of previously unresolved Mendelian disorders such as Kabuki syndrome, Miller syndrome, Sensenbrenner syndrome, and Fowler syndrome to name just a few. Further, exome sequencing has proven to be an effective tool to interrogate the genetic basis of Mendelian disorders in samples derived from both families and unrelated individuals.[5–8] Since the inception of the idea of using exome sequencing as both a discovery and a diagnostic tool for Mendelian disorders, this field has advanced very considerably. Accompanied and aided by other technical advances such as the development of computational and statistical approaches to interrogate the myriad variants identified by exome sequencing,[12, 13] including algorithms to detect copy number variants using exome sequencing data, and the idea (and practical demonstration) of using single-nucleotide polymorphism genotypes extracted from exome sequencing data to perform accurate genetic linkage mapping to reduce the ‘search space’ for genetic variants, exome sequencing has emerged as a mature analytical approach.
Although major progress has been made in understanding the genetic basis of Mendelian disorders over the past 3 years using exome sequencing, so far only limited studies have interrogated familial forms of cancer, ie, familial pancreatic cancer and hereditary pheochromocytoma (a rare neural crest cell tumor). By harnessing the latest technological advances, Jones et al identified a germline truncating mutation in PALB2 through exome sequencing a single patient with familial pancreatic cancer. That this patient might have a familial form of pancreatic cancer was suggested by the fact that his sister had also developed the disease. In similar manner, mutations in MAX, the MYC-associated factor X gene, were also identified through sequencing the exomes of three unrelated individuals with hereditary pheochromocytoma.
Since 2005, >100 genome-wide association studies have been performed to interrogate the genetic basis of various sporadic or polygenic forms of cancer (such as colorectal, prostate, breast, and lung) for which numerous statistically robust and novel single-nucleotide polymorphisms or genetic loci have been identified.[18, 19] In addition to their polygenic nature, these cancers are multifactorial, involving a complex interaction of multiple genetic and environmental factors. By contrast, little progress has so far been achieved in the context of ‘familial’ cancers (ie, cancers displaying a very evident family history with clustering of multiple affected family members). More specifically, familial forms of cancer typically occur in more individuals in a given family than would be expected by chance alone. Familial cancers are often characterized by their occurrence at a comparatively early age, thereby indicating the potential presence of a gene mutation that increases the risk of cancer. However, familial clustering of cases may also be a sign of a shared environment or lifestyle, or alternatively chance alone. By contrast, sporadic cancers lack any obvious family history of the disease.
The slow progress of research into familial cancer has been illustrated, for example, in hereditary diffuse gastric cancer. CDH1 was the first causal gene identified for this cancer in 1998, and it remains the only known gene underlying hereditary diffuse gastric cancer. However, germline mutations in this gene account for only a proportion of hereditary diffuse gastric cancer cases, suggesting that an as-yet-to-be identified gene(s) is likely to be responsible for the remaining cases unexplained by CDH1. Similarly, BRCA1 and BRCA2 are the only high-penetrance genes for familial breast cancer, although numerous novel single-nucleotide polymorphisms and genetic loci conferring low-to-moderate risk or effect size (odds ratio <1.5) have been identified by genome-wide association studies of polygenic breast cancer.[22, 23] Some of these common alleles have been reported to modify risk in BRCA1 and BRCA2 mutations carriers. However, so far the results from genome-wide association studies have limited value for individual risk prediction, as compared with the high-penetrance inherited mutations in causal genes for familial breast cancer which can prompt drastic clinical intervention such as mastectomy. An analysis to evaluate the potential for individualized disease risk stratification based on common single-nucleotide polymorphisms identified by genome-wide association studies in breast cancer came to the conclusion that the clinical utility of single, common, low-penetrance genes for breast cancer risk prediction is currently quite limited.
In the context of familial colorectal cancer, the genetic causes of familial adenomatous polyposis and Lynch syndrome have been well documented; in most instances, they are accounted for by germline mutations in the APC gene and DNA mismatch-repair genes (ie, MSH2, MLH1, MSH6, and PMS2), respectively. For example, ~90% of familial adenomatous polyposis cases are caused by germline mutations in the APC gene. The majority of these mutations introduce a premature stop codon resulting in a truncated protein. Similarly, the MSH2 and MLH1 genes harbor >90% of the germline mutations found in Lynch syndrome patients.[27, 28] By contrast, the genetic etiology of familial colorectal cancer type X remains largely unknown. It is widely anticipated that new insights generated from studies on familial colorectal cancer type X will lead to the molecular characterization of a novel form of familial colorectal cancer which will necessitate the reclassification of subsets of families with a strong history of colorectal cancer.
The nature of the disease determines the study design required to unravel the causal mutations or risk-predisposing variants for familial colorectal cancer type X. However, there is little evidence to show whether familial colorectal cancer type X is a monogenic or polygenic disease or whether it is somewhere in between. The evidence suggesting that familial colorectal cancer type X is a monogenic disease comes mainly from the fulfillment of Amsterdam Criteria. The Amsterdam Criteria state that at least three relatives must have colorectal cancer. However, the familial aggregation, with multiple affected family members in one family, could also be due to shared non-genetic factors, which would not therefore necessarily be compatible with the monogenic model. Such environmental factors would be expected to interact with multiple genetic risk factors causing colorectal cancer, a multifactorial disease model proposed for polygenic disease. This therefore raises the question as to whether the Amsterdam Criteria are sufficient to support a monogenic basis for familial colorectal cancer type X. Furthermore, some of the clinical features of familial colorectal cancer type X implied that it could have a polygenic basis. This uncertainty in the nature of the disease for familial colorectal cancer type X presents substantial challenges in terms of deciding upon an optimal approach to interrogate its genetic basis.
The targeted sequencing of causal genes, already applied in the context of other familial cancers (such as CDH1 (hereditary diffuse gastric cancer), BRCA1 and BRCA2 (familial breast cancer), and the genes underlying hereditary pheochromocytoma), appears to be a worthwhile approach to identify deleterious germline mutations for familial colorectal cancer type X. The rationale is that germline mutations in these genes could underlie different familial cancers, as for example in the case of the PALB2 germline mutations that have been found in both familial pancreatic and breast cancers.[16, 63] Another notable example is provided by the germline mutations in the BRCA2 gene that not only increase the risk of breast and ovarian cancer, but also pancreatic cancer. This targeted sequencing approach has been greatly aided by high-throughput enrichment methods and next-generation sequencing technologies to selectively enrich for regions of interest. Hundreds of genes can be sequenced efficiently, leveraging these technological advances compared with traditional PCR-based Sanger sequencing. The efficiency of this approach has been exemplified in a targeted sequencing study of germline mutations in 21 tumor suppressor genes for 360 women with inherited ovarian, peritoneal, or fallopian tube carcinoma. This study harnessed the power of the Sure-Select enrichment system and the Illumina sequencing platform to sequence these genes; 24% of the patients were found to carry germline loss-of-function mutations in 12 genes, six of which had not previously been implicated in inherited ovarian carcinoma. Although this targeted approach has limited discovery value, as these genes had already been implicated in causing familial cancers, it could still have some novelty value by identifying germline mutations in known genes for cancers, which have not yet been linked to these genes.
This targeted approach can be expanded to include the entire set of exons in all genes in the human genome. Exome sequencing on its own or coupled with linkage analysis has already unravelled multiple new causal mutations and genes for Mendelian disorders.[7, 8] Furthermore, these discoveries were made by exome sequencing fewer than 10 patient samples in most of the studies reported. As such, it is also widely anticipated that exome sequencing will represent a powerful tool to reveal the genetic causes of familial colorectal cancer type X by identifying rare and deleterious or high-penetrance mutations within gene coding regions. However, the appropriate selection of cases will have a key role in determining the success or otherwise of exome sequencing in this context. In addition to fulfilling the Amsterdam Criteria, and excluding germline mutations in mismatch-repair genes, selecting cases with a very early onset of disease, severe clinico-pathological manifestations or the ‘extreme’ familial colorectal cancer type X phenotypes is expected to enrich for the ‘monogenic’ component and hence enhance our chances of identifying high-penetrance mutations. Recurrent mutations (similar mutations in different samples) or genes harboring several different deleterious mutations (which include single-nucleotide variants and small indels) across multiple samples can then be prioritized for further studies using a larger sample of cases.
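The prioritization step described above can be illustrated with a minimal sketch. It assumes the deleterious variant calls have already been filtered and are available as (sample, gene, variant) tuples; the function name `prioritize_genes`, the `min_samples` cut-off, and the gene and variant labels are all hypothetical and serve only to show the counting logic, not any published pipeline.

```python
from collections import defaultdict


def prioritize_genes(variant_calls, min_samples=2):
    """Rank genes by how many distinct samples carry a deleterious variant.

    variant_calls: iterable of (sample_id, gene, variant) tuples (hypothetical format).
    Returns a list of (gene, n_samples, n_distinct_variants), best candidates first.
    """
    samples_by_gene = defaultdict(set)
    variants_by_gene = defaultdict(set)
    for sample_id, gene, variant in variant_calls:
        samples_by_gene[gene].add(sample_id)
        variants_by_gene[gene].add(variant)

    ranked = [
        (gene, len(samples), len(variants_by_gene[gene]))
        for gene, samples in samples_by_gene.items()
        if len(samples) >= min_samples  # keep genes hit in several cases
    ]
    # Sort by sample count, then by number of distinct variants, then by name.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


# Invented example data: GENE_A has a recurrent mutation plus a second variant.
calls = [
    ("case01", "GENE_A", "c.100C>T"),
    ("case02", "GENE_A", "c.100C>T"),
    ("case03", "GENE_A", "c.250delG"),
    ("case01", "GENE_B", "c.75G>A"),
    ("case04", "GENE_C", "c.12A>G"),
    ("case05", "GENE_C", "c.900_901insT"),
]
print(prioritize_genes(calls))
```

In this toy input, GENE_A is ranked first because it is mutated in three cases, GENE_C follows with two, and GENE_B is dropped because it is seen in only one case. A real analysis would also need to account for gene length and background mutation rate, which this sketch ignores.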
On the other hand, if we assume that familial colorectal cancer type X has a polygenic component, then genome-wide association studies would represent the ideal approach to identify common single-nucleotide polymorphisms associated with this disease. Further, whole-genome genotyping arrays would also allow copy number variants to be investigated to a certain extent for their associations with familial colorectal cancer type X within a single genome-wide association study. High-density genotyping arrays have been used to identify copy number variants in a cohort of 41 colorectal cancer patients who were below 40 years of age at diagnosis and/or who exhibited an overt family history. Multiple copy number variants, encompassing genes such as CDH18, GREM1, and BCR, were identified in six patients as well as two deletions encompassing two microRNA genes, hsa-mir-491/KIAA1797 and hsa-mir-646/AK309218. Interestingly, these copy number variants had not previously been reported in relation to colorectal cancer predisposition, nor had they been encountered in large control cohorts. This illustrates the potential power of copy number variant investigation to identify novel causal or susceptibility genes or genetic loci for both familial and sporadic colorectal cancers. Through another interesting observation, multiple genomic aberrations including copy number gains and losses in different chromosomes have also been detected in 30 mismatch repair-proficient familial colorectal cancers. In particular, the frequency of 20q gain is remarkably increased when compared with sporadic colorectal cancer, suggesting that the 20q gain is involved in the genetic etiology of these mismatch repair-proficient familial colorectal cancers. The finding that most of these genomic aberrations were also observed in sporadic colorectal cancer further suggests that familial and sporadic colorectal cancers could share genetic predisposition to a certain extent.
It is however noteworthy that genome-wide association studies represent an indirect association study design, based on linkage disequilibrium, to detect the disease-causing variants, as compared with direct sequencing. To achieve the required statistical power and significance threshold to detect common single-nucleotide polymorphisms conferring small effect sizes (odds ratio <1.5), several thousands of cases and controls are required for the initial genome-wide genotyping and subsequent replication studies. Although the cost of genotyping arrays is steadily becoming much cheaper, a hefty investment is still required to analyze thousands of samples. In addition to this cost, collecting the adequate sample size of patients to embark on a genome-wide association study is a considerable challenge if this is to be achieved without an international consortium (because of the rarity of familial colorectal cancer type X as compared with sporadic colorectal cancer cases). The polygenic basis of familial colorectal cancer type X is still a speculative issue. Bearing in mind this uncertainty, an alternative is to leverage the results from genome-wide association studies of colorectal cancer by genotyping the robust single-nucleotide polymorphism associations in a familial colorectal cancer type X cohort. This approach might be more feasible in terms of cost-effectiveness and sample size (without the need of a stringent significance threshold to account for several hundred thousand single-nucleotide polymorphisms). The penalty of multiple testing imposed in genome-wide association studies should increase the attractiveness of this approach in the context of testing single-nucleotide polymorphisms identified by genome-wide association studies for familial colorectal cancer type X. 
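The effect of the multiple-testing penalty can be made concrete with a short sketch using the Bonferroni correction, one common (and conservative) way of setting a significance threshold. The figures of 500,000 tested polymorphisms and 20 candidate polymorphisms are illustrative assumptions, not values taken from any particular study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumed numbers for illustration only.
genome_wide = bonferroni_threshold(0.05, 500_000)  # genome-wide scan
candidate = bonferroni_threshold(0.05, 20)         # follow-up of known loci

print(f"genome-wide threshold: {genome_wide:.1e}")
print(f"candidate-SNP threshold: {candidate:.1e}")
```

Under these assumptions the candidate approach needs to clear a threshold several orders of magnitude less stringent than a genome-wide scan, which is why it can be attempted with a much smaller cohort.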
One may speculate that if familial colorectal cancer type X has a polygenic component, some of these polymorphisms should also be associated with familial colorectal cancer type X, which would then warrant a comprehensive genome-wide association study for familial colorectal cancer type X in the future. This speculation appears reasonable because common shared single-nucleotide polymorphisms or genetic loci have been found in several different cancers. There have been several examples of the practical utility of genome-wide association study results in the context of familial cancers. These studies have provided evidence to suggest that low-penetrance variants may explain the increased cancer risk in familial colorectal cancer[106–108] and in familial testicular germ cell tumors.
Finally, the genes or genetic loci implicated in colorectal cancer by genome-wide association studies can be captured and sequenced. This targeted sequencing approach is very cost-effective as up to 96 samples can be multiplexed through barcoding for massively parallel sequencing. This targeted sequencing approach will interrogate both rare variants and common single-nucleotide polymorphisms in the loci identified by genome-wide association studies. The promise of this approach in unravelling rare variants in loci implicated by genome-wide association studies has already been demonstrated.[51,53–55] For example, deep resequencing of such loci has identified independent rare variants associated with inflammatory bowel disease.
The genetic and clinical differences between Lynch syndrome and familial colorectal cancer type X have been well documented. However, the genetic etiologies of familial colorectal cancer type X remain to be determined. There is also a paucity of evidence to indicate one way or the other whether familial colorectal cancer type X is a monogenic or a polygenic disease. On the other hand, the genetics of sporadic/polygenic colorectal cancer have been comprehensively investigated by >10 genome-wide association studies over the past few years. One striking observation is the sharing of common single-nucleotide polymorphisms or genetic loci across different cancers. It is therefore reasonable to speculate that if familial colorectal cancer type X has a polygenic basis, some of the single-nucleotide polymorphisms identified by genome-wide association studies as conferring risk of colorectal cancer might be expected to show associations with familial colorectal cancer type X as well. Given the expense and logistic challenges involved in collecting a large number of familial colorectal cancer type X cases to embark on a genome-wide association study, together with the uncertainty of the disease model, we believe that the genotyping of genome-wide association study-identified single-nucleotide polymorphisms in familial colorectal cancer type X would be a more feasible first approach to explore the genetic etiology of this disease. However, given the low incidence of familial colorectal cancer type X (ie, only ~2–3% of colorectal cancer families meet Amsterdam Criteria and about half of these are Lynch syndrome cases), collecting an adequately large sample is challenging, especially for studying the association of single-nucleotide polymorphisms with modest effect sizes. Thus, national or international consortia involving many centers are likely to be needed to recruit large numbers of patients.
Alternatively, the genes or loci identified by genome-wide association studies could be investigated using a targeted sequencing approach to unravel rare variants of larger effect size.
One of the limitations of genome-wide association studies is that they are based upon an indirect association study design, which is reliant on linkage disequilibrium to identify the disease functional variants. As a result, the surrogate markers (ie, the associated single-nucleotide polymorphisms) identified by genome-wide association studies generally lack functional significance. Furthermore, to enhance the statistical power, genome-wide association studies have tended to lump all colorectal cancers in the disease group, even though it is well recognized that colorectal cancers are inherently heterogeneous. These challenges have led to the notion and conceptualization of ‘molecular pathological epidemiology’, which is a relatively new field of epidemiology based upon the molecular classification of cancer. It is a multidisciplinary field involving the investigation of the interrelationship between exogenous and endogenous (eg, genetic) factors, tumoral molecular signatures, and tumor progression. Further, integrating genome-wide association studies with molecular pathological epidemiology allows examination of the relationship between susceptibility alleles identified by genome-wide association studies and specific molecular alterations/subtypes, which can help to elucidate the function of these alleles and provide insights into whether the detected susceptibility alleles are truly causal. Although there are challenges, molecular pathological epidemiology has unique strengths, and can provide insights into the pathogenic process.
In addition, exome sequencing of multiple ‘well-selected’ cases could be performed, assuming a monogenic basis in which high-penetrance mutations are predicted to underlie the genetic etiology of familial colorectal cancer type X. Exome sequencing of families with multiple affected individuals also represents a promising study design. This family-based design has the advantage that it allows for the genetically heterogeneous nature of familial colorectal cancer type X. Comparing unrelated individuals or probands from different families to identify ‘common/shared’ putative pathological variants or genes harboring putative pathological variants might not be a successful strategy for genetically heterogeneous diseases. However, it still depends on the degree of genetic heterogeneity (ie, allelic heterogeneity versus locus heterogeneity) characterizing the disease and this remains unknown. Although the family design is robust with respect to genetic heterogeneity (comparing affected and unaffected members in a family), one must recognize that it could also be problematic because the penetrance of disease mutations for familial colorectal cancer type X is likely to be lower than that for Lynch syndrome.
Moving forward, it is arguable that whole-genome sequencing should probably be considered instead of exome sequencing, as the cost differential between the two approaches (given a small patient sample size) would not be substantial, and because the former approach will generate genetic data for the entire genome rather than just 1–2% as for exome sequencing. However, one should select the study design that best fits the hypothesis where rare deleterious mutations in coding regions underlie the genetic etiology of a Mendelian disorder or familial cancer. So far, all the discoveries made by whole-genome sequencing could also have been achieved using exome sequencing for Mendelian disorders. Furthermore, the genetic variants in most of the non-coding regions revealed by whole-genome sequencing remain ‘uninterpretable’ biologically. In taking a practical (rather than theoretical) point of view, whole-genome sequencing still presents a very substantial technical challenge as well as a challenge in terms of analyzing and interpreting the sequence data generated.
The disease models underpinning multiple familial cancers such as familial nasopharyngeal carcinoma, familial testicular germ cell tumor, familial chronic lymphocytic leukemia, and familial colorectal cancer (familial colorectal cancer type X) remain contentious as the high-penetrance mutations are yet to be identified. By contrast, multiple low-penetrance variants that confer an effect size of odds ratio <1.5 have been revealed through genome-wide association studies for the sporadic cases of these cancers; interestingly, some of these single-nucleotide polymorphisms have also been found to be associated with the familial cases (nasopharyngeal carcinoma, testicular germ cell tumor,[116, 117] chronic lymphocytic leukemia, and colorectal cancer). In the context of familial colorectal cancer type X, we believe that the disease model and its genetic basis are likely to become more apparent when the approaches that we have outlined and discussed are applied in practice. This should facilitate the iterative interrogation of the genetics of familial colorectal cancer type X and other familial cancers of similar nature before embarking on either a comprehensive genome-wide association study or a whole-genome sequencing approach.
Although inherited susceptibility is responsible for 30% of all CRC (Lichtenstein, Holm et al. 2000), high-penetrance mutations in APC, the mismatch repair (MMR) genes, MUTYH, SMAD4, BMPR1A and STK11 account for <5% of cases (Aaltonen et al. 2007). The nature of the residual inherited susceptibility to CRC is at present undefined, but a model in which high-risk alleles account for all of the excess inherited risk seems improbable. It is likely that the remaining CRC inherited risk is largely accounted for by common, low penetrance alleles. These alleles may either predispose directly to colorectal tumourigenesis or may have an additive effect on predisposition. Candidate alleles studied include variants on known tumour suppressor genes, oncogenes, DNA repair genes, folate metabolising genes, and others.
The APC I1307K variant is present in about 6% of Ashkenazi Jews, but is much rarer in those of other ethnic groups. I1307K creates an A8 tract (eight consecutive adenine residues) which appears to be somatically unstable, leading to frameshift mutations (Laken et al. 1997). The tumour risk associated with I1307K has been controversial, but most recent reports suggest that it has a relatively small effect (perhaps only 1.5-fold risk of colorectal cancer), suggesting that the A8 tract is only modestly hypermutable (Gryfe et al. 1999).
A number of other low-penetrance alleles have been found with varying degrees of evidence and importance (table 1.1). The ability to identify these genes and to understand their interactions with other relevant environmental and genetic factors remains important however. It will help to stratify an individual patient’s risk for entry into surveillance programs and to reveal causative factors, allowing more effective prevention strategies.
To date a number of genome-wide association studies have been performed in breast (Easton et al. 2007; Stacey et al. 2007; Stacey et al. 2008), lung (Amos et al. 2008), prostate (Gudmundsson et al. 2007; Gudmundsson et al. 2007; Eeles et al. 2008; Gudmundsson et al. 2008), melanoma (Gudbjartsson et al. 2008) as well as colorectal cancer (Broderick et al. 2007; Tomlinson, Webb et al. 2007; Jaeger, Webb et al. 2008; Tomlinson et al. 2008). Most of these studies have been published over the last 2 years. The odds ratios for the loci identified range from 1.1 to 1.75, the majority having an odds ratio <1.5 (Easton and Eeles 2008). There has been a certain amount of replication between these studies, particularly for the locus 8q24 which has been associated with risk of breast, prostate and colorectal cancer in separate studies. However, results so far suggest that these loci account for a small proportion of the overall risk.
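For readers less familiar with the effect-size measure quoted above, the following sketch shows how an allelic odds ratio and an approximate 95% confidence interval (Woolf's log-based method) can be computed from a 2×2 table of allele counts. The counts are invented for illustration and do not come from any of the cited studies.

```python
import math


def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds ratio and approximate 95% CI from a 2x2 table of allele counts."""
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) under Woolf's approximation.
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 1,000 cases and 1,000 controls, two alleles each.
or_, lo, hi = allelic_odds_ratio(
    case_risk=600, case_other=1400, control_risk=500, control_other=1500
)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With these assumed counts the odds ratio is about 1.29, which falls in the modest range typical of the loci described above. The interval is only an approximation and becomes unreliable when any cell count is small.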
It is difficult to speculate on the true function of these risk alleles. There appears to be very little epistasis between the 28 loci identified in these 5 cancer types. None of these loci is involved in DNA repair, defects in which frequently underlie higher-penetrance susceptibility. This may explain why so many case-control studies have failed to yield significant results consistently, as the underlying hypothesis may have been inaccurate. One might speculate that many of the associations are driven through effects on gene expression, particularly as many lie in gene-poor regions.
Most GWAS have not been powered to detect the effects of polymorphisms with minor allele frequencies (MAFs) <0.05; such variants are therefore sometimes included in the rare variant class. More often, rare variants are considered to be subpolymorphic (MAF <0.01), with very rare or ‘private’ variants having MAF <0.001. Clearly much of the distinction between ‘common disease-common variant’ and ‘rare variant’ models is arbitrary. Nevertheless, it is probably worth defining them, however arbitrarily, in order to illustrate important differences between common and rare variant models in terms of gene discovery and possible clinical relevance. For example, rare variants are likely to have more biological impact than common variants, having arisen more recently in evolutionary terms (Bodmer and Bonilla 2008).
Rare variants will not be detectable by population association studies based on the use of linked polymorphic markers, even with very large case/control cohort studies. This is because of their low allelic frequency and individually small contributions to the overall inherited susceptibility to a disease. These variants are less common than those studied in association studies (i.e. minor allele frequency (MAF) <0.05) but not as rare as obvious mutations (i.e. they have MAF >0.01), although such mutations may also be identified. Finding rare variants requires nomination of candidate genes likely to have a role in disease aetiology, which are then directly screened for sequence variants which may affect protein function. This is known as the ‘common-disease/rare-variant’ hypothesis (Pritchard 2001).
So far, few rare variants have been identified in colorectal cancer, partly because candidate genes are not easily identified and partly because only a few studies have been performed. In one such study, the APC variants I1307K and E1317Q, and variants in AXIN1, CTNNB1, and the mismatch repair genes hMLH1 and hMSH2, were more common in 124 multiple adenoma cases than in controls (Fearnhead et al. 2004). However, studies of other candidate genes have produced results of low or no significance (Dallosso et al. 2008; Zogopoulos et al. 2008).
Labelling APC I1307K a rare variant may not be accurate, as the frequency of the polymorphism in the Ashkenazi population where it is present is 6%, thus potentially suitable for large association studies. This distinction underlines the arbitrary nature of how such polymorphisms are labelled as rare or common variants.
Although the population attributable risk (PAR) of common variants may be relatively high, the relative influence of these variants is low, with reported odds ratios below 2 and peaking at approximately 1.2 (Easton and Eeles 2008). Most rare variants have odds ratios a little higher than 2 but not above 5, with a mean of 3.7 in observations thus far (Bodmer and Bonilla 2008). Their individual contributions are small, and they do not give rise to familial concentrations of cases. As techniques improve to interrogate genetic sequence in an inexpensive, high-throughput and efficient manner, this method of identifying variants is likely to generate a higher yield of significant results in the near future.
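To illustrate how PAR depends on both frequency and effect size, the following minimal sketch applies Levin's formula, PAR = p(RR - 1) / (1 + p(RR - 1)), treating the odds ratio as an approximation to the relative risk. The odds ratios of 1.2 and 3.7 are those quoted above; the carrier frequencies (0.30 and 0.005) are illustrative assumptions and are not taken from the cited studies.

```python
def population_attributable_risk(carrier_freq: float, relative_risk: float) -> float:
    """Levin's formula: proportion of cases attributable to carrying the variant."""
    excess = carrier_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical carrier frequencies, chosen only for illustration.
common_par = population_attributable_risk(carrier_freq=0.30, relative_risk=1.2)
rare_par = population_attributable_risk(carrier_freq=0.005, relative_risk=3.7)

print(f"common variant (OR 1.2): PAR = {common_par:.3f}")
print(f"rare variant   (OR 3.7): PAR = {rare_par:.3f}")
```

Under these assumed frequencies the common variant carries the larger PAR despite its much smaller odds ratio, because PAR scales with how many people carry the allele as well as with its effect.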
A candidate gene approach demonstrated rare novel low-penetrance breast cancer predisposition loci in three genes, PALB2, BRIP1, and RAD51C (Seal et al 2006; Rahman et al 2007; Meindl et al 2010). This discovery was assisted by the identification of breast cancer cases in Fanconi Anaemia pedigrees. In general, however, it is not a simple task to prioritize candidates for rare variant studies. In the short term, it is likely that discovery efforts will be focused largely on sequencing candidate genes. Nevertheless, it is becoming feasible to sequence entire genomes to discover variants, owing to the decreased costs and increased efficiency of such methods. In a proof of principle study, complete exomic sequencing of a patient with familial pancreatic cancer identified a germline truncating mutation in PALB2 which appeared responsible for this individual’s predisposition to the disease (Jones et al 2009), although mutations in this gene are thought to be rare events in familial pancreatic cancer (Tischkowitz et al 2010).
The above-mentioned rare variant loci for breast cancer in PALB2, BRIP1, and RAD51C were present in 10, 8 and 2 cases and 0, 1 and 0 controls respectively. Owing to lack of power, rare variants are difficult to validate by frequency alone in an association-type study. If we assume that a single variant or a set of related variants (for example, in the same gene) occurs at a general population frequency of 0.01–0.001, as many as 1000 unselected cases or controls will be required to detect more than one variant in a discovery screen with a probability of about 0.7 (Bodmer & Tomlinson 2010).
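The kind of calculation behind such a figure can be sketched with a simple binomial model. The version below assumes that the quoted frequency is a per-allele frequency, that each individual contributes two independently sampled chromosomes, and that "more than one variant" means at least two variant alleles observed. These are assumptions made here for illustration; the cited authors may have used different ones, so the output need not match the quoted probability exactly.

```python
from math import comb


def prob_more_than(k: int, n_individuals: int, allele_freq: float) -> float:
    """P(more than k variant alleles among 2 * n_individuals chromosomes)."""
    n_chrom = 2 * n_individuals  # assumes diploid, independently sampled alleles
    p_at_most_k = sum(
        comb(n_chrom, i) * allele_freq**i * (1.0 - allele_freq) ** (n_chrom - i)
        for i in range(k + 1)
    )
    return 1.0 - p_at_most_k


# Detection probability in 1000 individuals across the quoted frequency range.
for freq in (0.01, 0.005, 0.001):
    print(f"allele frequency {freq}: P(>1 variant) = {prob_more_than(1, 1000, freq):.3f}")
```

The probability falls steeply towards the lower end of the frequency range, which is why screens of this size are only marginally powered for the rarest variants.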
Nevertheless, in principle the more common a variant is in the population the less its biological impact, thus allowing it to be passed on through generations without affecting reproductive ability. Rare variants are likely to reveal more about the pathophysiology of the disease process than common variants, as they are likely to have functional significance, as opposed to common variants which are probably in linkage disequilibrium with the causative mutations.
However, it is more problematic to design useful studies of rare variants, as random variation identified cannot be readily assumed to be of functional significance. For example, over 1500 variants of uncertain significance (VUSs) have been identified in BRCA1 using a sequencing-based approach in breast cancer cases. The difficulty with rare variant discovery, particularly with whole exomic sequence analysis, will be to sort out the candidate functional variation from an almost overwhelming background of functionally irrelevant variation. The choice of targets will, in general, require some a priori assessment of functional effects. In silico biometric approaches have been developed with increasing predictive ability, although in vitro demonstration is generally preferable for determining functional effects, for example simple effects on expression or protein truncation.
Studying a cohort of affected cases and subsequently examining a control set for the variants identified can cause ascertainment bias. It would therefore be preferable to search for them in affected individuals and controls with equal rigour, and to use a statistical framework to determine whether variants are truly more common in the affected. These studies are likely to require extremely large and/or enriched data sets in order to identify and verify significant rare variants. Nevertheless, it is becoming increasingly cost- and time-effective to perform even whole genome sequencing to determine genetic predisposition to both common and rare disease.
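One simple form such a statistical framework could take, when cases and controls have been screened with equal rigour, is a one-sided Fisher's exact test on carrier counts. The sketch below is only an illustration of that idea and is not the method used in any of the studies cited; the cohort sizes and carrier counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(case_carriers: int, case_total: int,
                         control_carriers: int, control_total: int) -> float:
    """One-sided Fisher's exact p-value for carriers being enriched among cases."""
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denominator = comb(total, carriers)
    p_value = 0.0
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    for k in range(case_carriers, min(carriers, case_total) + 1):
        p_value += comb(case_total, k) * comb(control_total, carriers - k) / denominator
    return p_value


# Hypothetical screen: 10 carriers in 1000 cases, none in 1000 controls.
p = fisher_exact_greater(case_carriers=10, case_total=1000,
                         control_carriers=0, control_total=1000)
print(f"one-sided p = {p:.2e}")
```

Because the test conditions on the total number of carriers, a handful of carriers confined to the case group can reach nominal significance in a cohort of this assumed size, whereas two or three carriers would not.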
A copy number polymorphism (CNP) in MTUS1 was found to be associated with breast cancer predisposition (Frank et al. 2007), but not colorectal cancer (Monahan et al 2008). Recently, multiple studies have discovered an abundance of germline copy number variation (CNV) of DNA segments ranging from small to large chromosomal segments (e.g. Down syndrome results from trisomy 21), probably encompassing over 12% of the human genome (Redon et al. 2006). These include deletions, insertions, duplications and complex multi-site variants. The extent and role of these CNPs are increasingly understood with the development of new techniques which allow us to identify such variation (Lupski 2007).
Many new CNPs have been identified from studies using whole genome SNP chips (Redon et al. 2006). However, the extent of linkage disequilibrium between SNPs and CNPs is unclear. The biological impact of these types of variation, for example on gene expression, is strikingly different. Expression profiles from SNPs and CNPs had little overlap (Stranger et al. 2007). Multiplex ligation-dependent probe amplification (MLPA) has revealed complex whole exon duplications and deletions in APC which lead to the classic FAP phenotype (Schouten et al. 2002; McCart et al. 2006; Pagenstecher et al. 2007). High-penetrance conditions such as FAP are rare whatever the type of mutation may be, e.g. point mutations or exon CNV. In theory, complex disease might be more susceptible to subtle, lower-penetrance forms of variation which alter whole gene copy number without disabling gene function. In addition, the impact of individual CNPs may be even subtler, with disease phenotype being caused by combinations of low-penetrance alleles.
Identification of significant CNPs has thus far been hampered by the cost of performing such studies and the lack of techniques available. Genome-wide association studies using SNPs are better at identifying deletion copy number variation than duplication (Locke et al. 2006). The new generation arrays (e.g. the Affymetrix 5.0 and 6.0, and Illumina 1 M) are being designed to offer the potential to simultaneously interrogate SNPs and CNPs in a single experiment. However, it may be that more comprehensive genome-wide CNP maps, with the level of detail for CNPs that the HapMap project provided for SNPs, are first required before such genome-wide CNP arrays are truly useful.
Much as SNPs can be either common or rare variants, so can CNPs. Using an array comparative genomic hybridisation (aCGH) platform, a large study concluded that common CNVs are well tagged on existing SNP platforms and probably contribute little to disease predisposition (Craddock et al 2010). However, this study was limited by the selection of CNVs and did not examine the impact of rare CNVs. While genome-wide association using common CNPs may be a useful method to elucidate predisposition caused by such CNPs, this technique is not useful for rare variants. The true importance of these variants in human disease is as yet undetermined.
When a Mendelian cancer predisposition gene is first identified, much of the evidence of its linkage to the phenotype derives from the finding of several different variants in that gene that
Conversely, the finding of a statistical association of low-penetrance alleles with disease in association studies does not necessarily prove that the underlying variant has a biological consequence such as causing low-penetrance predisposition. The likely disease-causing locus (with which the polymorphism is in linkage disequilibrium) has rarely been identified. The IGF1 microsatellite and the TSER TYMS polymorphisms may be in linkage disequilibrium with a sequence variant which alters gene expression (Monahan et al 2009). In a number of recent genome-wide and candidate gene association studies, the downstream effect of such variation on RNA and protein function is largely unknown. Identification of a germline mutation in linkage disequilibrium with predisposition alleles has remained elusive, and it is felt that allele-specific expression may be an important aetiological factor in colorectal cancer predisposition, particularly as many observed significant variants are not close to any known coding regions (Houlston et al. 2008; Valle et al. 2008). A SNP in SMAD7, whilst strongly associated with colorectal cancer risk, was not found to alter expression of the gene despite lying in the 3’UTR region of the gene (Broderick et al. 2007). This study may have been limited by the effects of tissue-specific expression, as it was performed on lymphoblastoid cell lines derived from cases. In contrast, the colorectal cancer associated locus 8q24 lies in a gene desert but contains regulatory elements of MYC, and this region preferentially binds TCF4, the primary target of the canonical Wnt signalling pathway (Tuupanen et al 2009; Pomerantz et al 2009).
Whilst association studies may not easily reveal germline mutations, quantitative and qualitative gene expression studies may be a useful direction for future studies.
Understanding proteomics may be used to yield information as to epistasis between genes as protein-protein interactions are amongst the most important determinants of interaction between genes. However, in variants identified to date there appears to be very little epistasis (Houlston et al. 2008). There have been some significant advances in the understanding of diseases such as Crohn’s disease (Parkes et al. 2007) and Coeliac disease (van Heel et al. 2007) due to the results of non-hypothesis driven association studies. A number of low-penetrance loci have been linked to specific biological pathways with likely biological relevance in these conditions. Five of the 10 SNPs identified by GWAS of colorectal cancer are in close LD with genes of the TGF/BMP signalling pathway including SMAD7, BMP2 and BMP4. In the next few years research is likely to reveal further advances in our understanding of the role of both common and rare low penetrance alleles in colorectal cancer by analysing the associated effects on expression and protein function, and by the identification of disease causing mutations.
Recently published data analysis from the CAPP2 study demonstrates significant modification of colorectal cancer risk in Lynch Syndrome patients by aspirin (Burn et al 2011). Thus even highly penetrant syndromes may be modifiable by the environment. A priori, environmental agents are even more likely to modify lower-penetrance genetic risk factors. An association of smoking-related cancers with polymorphisms at the cancer susceptibility locus 8q24 (identified by genome-wide association) has been suggested (Park et al. 2008). When the odds ratios for predisposition alleles are well below 1.5, there is a possibility of interaction (or bias) through an unmeasured environmental factor, as in the context of lung cancer risk and association with 15q, which contains the nicotinic acetylcholine receptor (Chanock and Hunter 2008). Furthermore, the role of gene-environment interactions remains poorly defined, and a reductionist approach to understanding the aetiology of colorectal neoplasia means that few such studies exist. Naturally, common low-penetrance susceptibility alleles will individually contribute little to overall risk, and it is likely that environmental ‘modification’ by smoking, exercise, body habitus, diet, etc. will provide a more complete explanation of what drives normal colonic crypts along the pathway to cancer. Indeed, the odds ratios for environmental risk factors are comparable to those of many low-penetrance alleles.
It is likely that combining data from genetic and environmental studies will provide clinicians with an increasingly powerful tool to understand an individual patient’s risk and tailor an appropriate management plan, whether this be colonoscopic screening, genetic testing, or lifestyle modification. It has been proposed that this data may be used in future association studies in a two-step process whereby patients are first screened for epidemiological risk factors before entering the genotyping analysis (Murcray et al. 2009).
In 1997, the ColoRectal tumour Gene Identification (CoRGI) Study Consortium was formed to ascertain and collect biological samples and data from families segregating colorectal cancer, in order to identify novel predisposition genes. This study, led by Prof Ian Tomlinson, has largely been undertaken in this laboratory by colleagues. Families and individuals are being collected with the following entry criteria;
Families were collected from centres throughout England, Scotland and Ireland.
CORGI 1 – Linkage Analysis: A genome-wide linkage analysis has been performed on 69 families with a history of bowel cancer and/or polyps using the GeneChip Mapping 10K Xba 142 arrays containing 10,204 SNP markers (Kemp et al. 2006). Families in this study had at least 2 individuals (except parent/child) affected. A maximum non-parametric linkage statistic of 3.40 (P=0.0003) was identified at chromosomal region 3q21–q24. The Galway family is the largest pedigree, with over 29 informative meioses, and a decision was taken for it to be studied separately (Chapters 3 and 4).
CORGI 1b: A second similar set of 34 families has been collected. Linkage analysis was performed by colleagues which confirmed linkage at 3q22 (Papaemmanuil, Carvajal-Carmona et al. 2008).
CORGI 1c: Approximately 100 families where siblings are affected are being collected for sib-pair analysis.
CORGI 2 – Genome Wide Association (GWA): CORGI 2 is a GWA study using an Illumina SNP platform on cases with the same entry criteria as CORGI 1 but without a family history. Colleagues initially genotyped 550,163 tag SNPs in 940 individuals with familial colorectal neoplasia and 965 controls using the Illumina Infinium platform (Tomlinson, Webb et al. 2007). In CORGI 2b, approximately 42,000 candidate SNPs with the most significant association in CORGI 2 are being re-tested in a group of ~3000 colorectal cancer patients. Several loci which contain SNPs associated with colorectal cancer susceptibility (at 8q23, 10p14, 11q24, 15q13.3 and 18q21) have recently been identified by colleagues in this cohort (Broderick, Carvajal-Carmona et al. 2007; Tomlinson, Webb et al. 2007; Jaeger, Webb et al. 2008; Tenesa et al. 2008; Tomlinson, Webb et al. 2008). However, no mutations with proven functional relevance have yet been identified at these loci.
CORGI 3 – Candidate gene screening: Patients in the CORGI 2 cohort are being screened for sequence abnormalities in functionally important genes such as those involved in DNA repair, the Wnt pathway, or other genes involved in the aetiology of colorectal neoplasia. Colleagues are also screening the patients included in CORGI 1 and CORGI 2 for gene mutations at the loci identified by linkage or association respectively. Candidate genes EPHB1 and MBD4 have been screened for mutations at 3q21-24 in the CORGI 1 family set but none were found (Kemp, Carvajal-Carmona et al. 2006).
Because of the evidence from the adenoma-to-carcinoma sequence model (Morson 1968; Fearon and Vogelstein 1990), the National Polyp Study (Winawer et al. 1993) and other prospective studies (Dove-Edwin et al. 2005; Dove-Edwin et al. 2006), we know that if polyps are removed during colonoscopy, cancer may be prevented. Thus colorectal cancer is one of the most preventable of all cancers, and some early evidence is emerging that colonoscopic screening may reduce colorectal cancer related mortality (Baxter et al. 2009). However, national colonoscopic screening programs are expensive, stretch the capacity of already busy services, and therefore do not reach the whole population they target. In addition to lifestyle modification advice to reduce environmental risk factors, it may be possible to identify two groups of patients with inherited risk by understanding the underlying molecular aetiology.
(Copyright, Dr Kevin Monahan)