Oliver Thewalt

    Oliver Thewalt

    Theoretical Physics | Quantum Biology | Dark Matter Research Cluster

    HLA typing from RNA-Seq sequence reads

    HLA typing from RNA-Seq sequence reads
    By Oliver Thewalt

    0/5 stars (0 votes)


    We present a method, seq2HLA, for obtaining an individual's human leukocyte antigen (HLA) class I and II type and expression using standard next generation sequencing RNA-Seq data. RNA-Seq reads are mapped against a reference database of HLA alleles, and HLA type, confidence score and locus-specific expression level are determined. We successfully applied seq2HLA to 50 individuals included in the HapMap project, yielding 100% specificity and 94% sensitivity at a P-value of 0.1 for two-digit HLA types. We determined HLA type and expression for previously un-typed Illumina Body Map tissues and a cohort of Korean patients with lung cancer. Because the algorithm uses standard RNA-Seq reads and requires no change to laboratory protocols, it can be used for both existing datasets and future studies, thus adding a new dimension for HLA typing and biomarker studies.


    The major histocompatibility complex (MHC) molecules display peptide antigens that are derived from intracellular (class I) and extracellular (class II) proteins on the surface of vertebrate nucleated cells. The human MHC, called the human leukocyte antigen (HLA), is highly polymorphic and comprises three major gene loci for class I (A, B, C) (Figure 1) and three major gene loci for class II (DP, DQ, DR), which are expressed co-dominantly. Each cell expresses three maternal and three paternal HLA class I and three maternal and three paternal class II alpha and beta alleles. Determining the sequence of these molecules, HLA typing, is essential for clinical work (for example, organ transplantation), immune system research, and biomarker and drug development. Current HLA typing techniques use labor- and time-intensive methods, such as sequence-specific oligonucleotide probe (SSOP) hybridization [1], PCR amplification with sequence-specific primers [2], Sanger sequencing [3] and sero-typing [4].


    Next generation sequencing (NGS) is a novel platform that enables rapid generation of billions of short nucleic acid sequence reads. Several studies described the use of NGS in high-throughput HLA genotyping using genomic DNA (for examples, see [5, 6]). Recently, Lank et al. described a method using RNA for high-throughput MHC class I genotyping [7, 8], which was applied to assess genotype- and allele-specific expression of MHC class I in human and macaque leukocyte subsets [9]. This method involves the reverse transcription of RNA into cDNA, amplification using highly specific MHC I primers and subsequent bi-directional cDNA amplicon sequencing using Roche/454 GS FLX pyrosequencing, and is able to unambiguously resolve MHC alleles with high accuracy. However, all of these techniques use specialized NGS protocols including primer design to amplify only MHC class I alleles and amplicon sequencing with long reads (≥150 nucleotides) using Roche/454 GS FLX or Illumina GAIIx.

    By contrast, gene expression profiling in patient samples using 'whole transcriptome' sequencing (RNA-Seq profiling) typically uses much shorter reads. The adoption of the RNA-Seq platform has been rapid: clinical and research laboratories worldwide have deposited over 14,600 'RNA-Seq' sample profiles into public repositories such as the National Center for Biotechnology Information Sequence Read Archive, including 4,304 human RNA-Seq samples as of 8 October 2012. As opposed to previous methods for determining gene expression, the RNA-Seq platform not only generates expression profiles but the data also contain nucleotide sequence information. Given the large number of RNA-Seq profiles in the public domain and our efforts to develop individualized T cell-mediated cancer vaccines, which require the knowledge of a patient HLA type and HLA expression to prioritize target epitopes [10, 11, 12], we sought to develop an algorithm to utilize the sequence content of RNA-Seq reads to determine both HLA type and expression.

    To this end, we developed an in-silico method, 'seq2HLA', written in python and R, which takes standard RNA-Seq sequence reads in fastq format [13] as input, uses a bowtie index [14] comprising known HLA alleles, and outputs the most likely HLA class I and class II types, a P-value for each call, and the expression of each class. As a proof of concept, we applied the method to the recently released 37-nucleotide paired-end RNA-Seq data of 50 Utah residents with ancestry from northern and western Europe (CEU) [15] who are part of the HapMap project and have been previously typed using the PCR-SSOP method [16]. Our method, seq2HLA, achieves 100% specificity and 93.5% sensitivity at a P-value of 0.1. Sensitivity versus specificity curves show that the method works best using paired-end reads with length at least 37 nucleotides, that allowing one mismatch in the mapping is optimal, and that calls are more sensitive to read length than the number of reads.

    Materials and methods

    We used three publically available NGS RNA-Seq datasets. The first dataset comprises RNA-Seq reads from 60 lymphoblastoid cell lines derived from the CEU HapMap individuals of European descent, including 10 million 37-nucleotide paired-end reads per sample (Accession Number [ERA002336]) [15]. The HLA genotypes of 270 HapMap individuals were previously determined by PCR-SSOP [16] and 51 overlap with the RNA-Seq samples. The Illumina Human Body Map 2.0 project (Accession Number [ERA022994]), the second dataset, comprises reads from 16 different tissues from 15 Caucasian- and one African-American donors, with over seven million 50-nucleotide paired-end reads per sample. The third dataset contains 77 lung RNA-Seq profiles from a lung cancer study in un-typed Korean patients and comprises 100 nucleotide paired-end reads per sample (Accession Number [ERP001058]) [17]. The datasets are provided for download in the European Nucleotide Archive at the European Bioinformatics Institute.

    HLA reference sequences
    We downloaded 1,635 HLA-A, 2,247 HLA-B and 1,248 HLA-C (October 2011) and 342 HLA-DQA, 1,976 HLA-DQB and 1,023 HLA-DRB (March 2012) nucleotide sequences from the international ImMunoGeneTics/HLA database at the European Bioinformatics Institute (ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/. As exons 2 and 3 (class I) and exon 2 (class II) encode for the peptide-binding site and contain most of the polymorphisms (Figure 1), we extracted these sequences as reference sequences for alignments. In order to include reads mapping to the exon boundaries of the sequence encoding for peptide binding and presentation, we added exon 1 (73 nucleotides long) and 75 nucleotides of exon 4 to the class I reference sequences and 75 nucleotides of exons 1 and 3 to the class II reference sequences.

    Mapping of RNA-Seq reads
    We mapped the RNA-Seq reads in fastq format [13] against the reference sequences using bowtie [14]. We optimized alignment parameters (below), including the number of allowed mismatches (-v) and the number of reported mappings (-m or -a).

    Variation and grouping within HLA sequences
    To quantify and evaluate the polymorphisms of HLA class I alleles, we computed the variability of sequences inter- and intra-allele groups (Figure S1 in Additional file 1). 'HLA groups' were defined at the two-digit resolution (for example, A*01), consisting of all sequences starting with the same two digits. We computed the Hamming distance of each sequence (exons 2 and 3) with all other sequences, in terms of edit distance, and calculated the inter- and intra-group means.

    Furthermore, we calculated the variability at each position across exons 2 and 3 using Shannon's entropy and the nucleotide sequence alignments from all HLA class I alleles (Figure 1). We calculated the variability at each position in terms of information content using the binary logarithm formulation, taking into account the observed frequency of each base (A, T, C and G) across alleles. We plotted 2 - information content, such that 0 represents minimum variability (only one nucleotide observed) and 2 represents maximum variability (all four nucleotides equally likely).

    Read length and HLA typing
    To evaluate the trade-offs between read length 'f' and HLA typing, we computed the number of unique length oligonucleotides in each HLA sequence. For each locus, we created all possible f-mers of each HLA exon 2 and 3 sequence. These reads were aligned using bowtie to the respective reference sequences (class I or II) using the mapping parameter -m1 (report only unique mappings). To account for the sequencing errors, the parameter -v (allowed mismatches) was varied to allow for zero, one and two mismatches.

    Data quality control
    Of the previously typed CEU HapMap individuals, 51 overlapped with the 60 lymphoblastoid cell lines sequenced by Montgomery et al. [15]. After comparison of the HLA types determined by us and others with the established pedigrees for the CEU samples, we excluded sample 1382_1 from further analysis because of HLA inheritance inconsistencies. On examining the RNA-Seq seq2HLA calls, an individual (daughter NA12891) had an HLA type for which the HLA-A, -B and -C alleles matched the individual's father (grandfather NA12891) but the remaining alleles did not match the individual's mother (annotated as sample 1382_1, grandmother NA12892) (Figure S2 in Additional file 1). Further investigation showed that seq2HLA-determined HLA types from the RNA-Seq reads were entirely different from the PCR-SSOP results, the only result in the 51 sample dataset with a complete discrepancy; the grandmother-annotated RNA-Seq reads showed expression of Y-chromosome genes, such as EIF1AY, a male-specific gene; and results of seq2HLA using RNA-Seq reads from the grandmother from a different experiment [18] matched the PCR-SSOP results and agreed with the pedigree. Thus, the available data strongly suggest that sample 1382_1 from Montgomery et al. is not the individual NA12892, and we removed this individual from our analysis, leaving 50 CEU HapMap individuals (in the following named as Montgomery test samples). Further, given the uniqueness of an individual's HLA type, this example demonstrates how seq2HLA can be used for sample annotation quality control, such as in 'tumor versus normal' experiments where both samples should have the same HLA type.

    Results and discussion

    Although HLA alleles are highly polymorphic, all sequences across the A, B and C loci differed by less than 70 nucleotides. Exons 2 and 3 encode the peptide-binding groove of HLA class I molecules (Figure 1) and contain the majority of the variation but are nevertheless 87% identical (Figure S1 in Additional file 1). Furthermore, there are few sequences that are unique to a given allele (Table S1 in Additional file 2). The number of reads mapping to a single HLA allele depends on read length: only 49% of major alleles, that is, those alleles from A, B or C locus, comprise a unique tag 37-mer and 67% have a unique tag 100-mer. Allowing one mismatch to account for sequencing errors, only 0.8% and 3.2% of major alleles contain unique 37- and 100-mers, respectively, demonstrating the difficulty in identifying a unique fingerprint for each allele with short NGS reads. Moreover, using RNA rather than DNA has an additional complication and benefit: results reflect not only HLA type but also expression levels. Given these challenges, we nevertheless sought to make an algorithm to determine HLA types and expression from standard RNA-Seq data.

    Creating the seq2HLA workflow
    Seq2HLA works by identifying the HLA alleles associated with the greatest number of NGS RNA-Seq reads. The read-to-HLA mapping and counting is done in two stages (Figure 2). In the first stage, all reads from an RNA-Seq experiment in fastq format are mapped to the reference HLA sequences using bowtie_ENREF_4. For each locus, a 'first round' winning group is chosen. Let R1L denote the set of allelic sequences with respective read counts at locus L∈(A,B,C) in iteration 1. The top-scoring group for locus L is defined as the group that contains the allele with the highest number of mapped reads:

    Read More: HLA typing from RNA-Seq sequence reads