View on GitHub

BitSeq

Transcript isoform level expression and differential expression estimation for RNA-seq

Overview

BitSeq requires reads aligned to a transcriptome reference. This document explains how to create such reference from various sources and how to align reads using variour aligners.

Constructing transcriptome reference

The transcriptome reference should contain sequences for all RNA species expected to be found in the sample.

Reference from GENCODE

  1. Follow either of these links to GENCODE human releases or GENCODE mouse releases and select the desired release
  2. Under “Fasta files”, download “Protein-coding transcript sequences in FASTA format” and “Long non-coding RNAs in FASTA format”

pre-mRNA sequences from GENCODE

In addition to the steps above

  1. Under “GTF files”, download “Main file, gene annotation on three levels”
  2. Under “Fasta files”, download “Genome sequence fasta file”
  3. Run the BitSeq tool
gtftool -t gencode.[XXX].annotation.gtf.gz -g GRC[YYY].genome.fa.gz genes > gencode.pre_mrnas.fa

where you should replace [XXX] and [YYY] with the appropriate versions for the files you downloaded.

Reference from Ensembl

  1. Go to Ensembl website and navigate to your favourite species
  2. Under “Gene annotation”, click “Download genes, cDNAs, ncRNA, proteins (FASTA)”
  3. Download the file “….cdna.all.fa.gz” from directory “cdna” and gunzip it
  4. Download the file “….ncrna.fa.gz” from directory “ncrna” and gunzip it
  5. For human reference, it may be desirable to remove alternative haplotypes and other alternative locus entries using
./fastagrep.sh -v 'GRCh38:[^1-9XMY]' Homo_sapiens.GRCh38.cdna.all.fa > Homo_sapiens.GRCh38.cdna.filtered.fa
./fastagrep.sh -v 'GRCh38:[^1-9XMY]' Homo_sapiens.GRCh38.ncrna.all.fa > Homo_sapiens.GRCh38.ncrna.filtered.fa

pre-mRNA sequences from Ensembl

In addition to the steps above

  1. Find the GTF file by replacing “fasta” in the FTP address path by “gtf”, i.e. “ftp://ftp.ensembl.org/pub/release-76/fasta/homo_sapiens/” -> “ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/” and download the “…gtf.gz” file
  2. For human reference, it may be desirable to remove alternative haplotypes and other alternative locus entries using

    gunzip -dc Homo_sapiens.GRCh38.76.gtf.gz | grep -e '^[1-9XMY#]' | gzip -c > Homo_sapiens.GRCh38.76.filtered.gtf.gz
    
  3. Returning to the Ensembl species page, under “Genome assembly”, click “Download DNA sequence (FASTA)”
  4. Download the file “….dna.primary_assembly.fa.gz”
  5. Run the BitSeq tool
gtftool -t Homo_sapiens.GRCh38.76.filtered.gtf.gz -g Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --outputFormat=ensembl genes > ensembl.pre_mrnas.fa

where you should replace [XXX] and [YYY] with the appropriate versions for the files you downloaded.

Reference from UCSC genome browser

  1. Follow this link to: UCSC Table Browser which should set you the important fields.
    • (important fields: group:”Genes and Gene Prediction Tracks”, track:”Ensembl Genes”, table:”ensGene”, output format:”Sequence”, output file:”ensemblGenes.fasta”, file type returned:”gzip compressed”)
  2. Click “get output”
  3. Select “genomic”, click “submit”
  4. Select “5’ UTR Exons”, “CDS Exons”, “3’ UTR Exons”, “One FASTA record per gene”, click “get sequence”

Read alignment

Alignment with Bowtie 2

First create an index for your reference file ref.fa (only needs to be done once)

$ bowtie2-build -f ref.fa ref

Now, map the paired-end reads (data1-1_1.fastq and data1-1_2.fastq)

$ bowtie2 -q -k 100 --threads 4 --no-mixed --no-discordant -x ref -1 data1-1_1.fastq -2 data1-1_2.fastq -S data1-1.sam 

Alignment with Bowtie 1

Optionally, the user can also use Bowtie 1, although it is not recommended for long length reads. First create an index for your transcripts (only needs to be done once)

$ bowtie-build -f -o 2 -t 12 --ntoa my_transcriptome.fa,my_pre_mrnas.fa my_transcripts
$ bowtie -q -v 3 --trim3 0 --trim5 0 --all -m 100 --threads 4 --sam ensemblGenes -1 data1-1_1.fastq -2 data1-1_2.fastq data1-1.sam 
# alternatively for saving alignments in BAM format use samtools
$ bowtie -q -v 3 --trim3 0 --trim5 0 --all -m 100 --threads 4 --sam ensemblGenes -1 data1-1_1.fastq -2 data1-1_2.fastq | samtools view -hSb - -o data1-1.bam