Overview of BitSeq usage
BitSeq analysis consists of two stages:
- Stage 1 for expression estimation; and
- Stage 2 for differential expression estimation (optional).
The 1st stage is explained in detail below. In case that differential expression analysis is also required see BitSeq stage 2.
Stage 1:
We want to analyse expression of human transcripts from pair-end sample data1-1_1.fastq, data1-1_2.fastq. First we need to download reference in form of transcript sequence and align the reads using bowtie. We will use hg19 ensebml transcripts:
- Follow this link to: UCSC Table Browser which should set you the important fields.
- (important fields: group:”Genes and Gene Prediction Tracks”, track:”Ensembl Genes”, table:”ensGene”, output format:”Sequence”, output file:”ensemblGenes.fasta”, file type returned:”gzip compressed”)
- Click “get output”
- Select “genomic”, click “submit”
- Select “5’ UTR Exons”, “CDS Exons”, “3’ UTR Exons”, “One FASTA record per gene”, click “get sequence”
Alignment
Now create index and align:
$ bowtie2-build -f ensemblGenes.fa ensemblGenes
$ bowtie2 -q -k 100 --threads 4 --no-mixed --no-discordant -x ensemblGenes -1 data1-1_1.fastq -2 data1-1_2.fastq -S data1-1.sam
# for saving alignments in BAM format use samtools
$ bowtie2 -q -k 100 --threads 4 --no-mixed --no-discordant -x ensemblGenes -1 data1-1_1.fastq -2 data1-1_2.fastq | samtools view -hSb - -o data1-1.bam
The user can also use bowtie1 instead of bowtie2. For more details on this step see the alignment page.
- samtools is a standalone software, options: include header, input is SAM, output is BAM; input is from
stdin
; output into file
We will also need a list of transcript names with their lengths which is expected in form:
- first line contains: # M <number of transcripts>
- following lines: <gene name> <transcript name> <transcript length>
We will call it ensemblGenes.tr, this list can be created while pre-processing the alignments with option --trInfoFile
(without gene names which are missing in the SAM header).
Pre-processing
Now we need to pre-compute probabilities for each alignment, we expect that BitSeq is compiled and located in the folder $BitSeq
:
$ $BitSeq/parseAlignment data1-1.sam -o data1-1.prob --trSeqFile ensemblGenes.fasta --trInfoFile ensemblGenes.tr --uniform --verbose
- argument: the alignment file
- options: output file, file with reference sequence, transcript information file to be created, assume uniform read distribution, verbose output
- note: when computing non-uniform read distribution (without option
--uniform
) option--procN
allows parallelisation. We advice not using more than default 3 CPUs as this tends to decrease the overall performance.
Expression estimation
There are two approaches for estimating transcript expression: the first one uses MCMC sampling and the second one uses a Variational Bayes (VB) algorithm. MCMC sampling is more suitable for downstream analysis (Stage 2), while the significanlty faster VB algorithm is strongly recommended when the main aim of the analysis is expression estimation. See the next section for MCMC sampling, while for the VB algorithm see this link. An extensive simulation study was conducted in order to benchmark the majority of available transcript expression estimation methods, including BitSeqVB and BitSeqMCMC. The full code is available in order to reproduce our results.
Sampling
Now we run the sampler, which is the main part of the analysis and takes the most time.
Details: The sampler first produces MCMC_burnIn
samples which are discarded, following by MCMC_samplesN
samples, which are used to compute convergence statistics. These statistics are used to estimate X where X is the number of samples needed to produce MCMC_samplesSave
samples which will be finally recorded. The X is the major running time factor and can be decreased either by decreasing MCMC_samplesSave
or by increasing the number of chains by MCMC_chainsN
parameter, which can run in parallel (assuming enough CPUs).
$ $BitSeq/estimateExpression data1-1.prob -o data1-1 --outType RPKM -p parameters1.txt -t ensemblGenes.tr -P 2
- argument: the alignment probability file from pre-processing
- options: output prefix, output type, parameters file, transcript info file, number of threads to use (more than number of chains specified in
parameters1.txt
does not help, default number of chains is 4) - parameters file: contains basic parameters for the sampler, these can be also set by command line arguments
--MCMC_*
. This file is re-opened during the sampling and the parameters are updated. This allows us to change number of maximum number of samples or target scale reduction for running instance of the program.
The sampler produces two files, one is data1-1.thetaMeans, which contains mean thetas from every sampler and overall. The other file name depends on prefix and output type selected, in our case: data1-1.rpkm containing MCMC samples from all the chains.
The resulting file will contain M lines, one for each transcript. With each line containing MCMC_samplesSave
expression samples. We can summarize these by computing mean (and variance).
$ $BitSeq/getVariance -o data1-1.mean data1-1.rpkm
Other programs:
Short description of other useful programs provided with BitSeq, for more details run the programs with --help
option.
- convertSamples
- converts MCMC between different formats, also useful for normalisation
- extractSamples
- extracts expression samples of selected transcripts; useful when interested in just few transcripts
- extractTranscriptInfo.py
- extracts transcript information from Fasta file into
.tr
file; useful for extracting transcript-gene grouping if the Fasta file is in one of the few recognized formats - getCounts.py
- converts multiple .thetaMeans files into one file containing estimated read counts for each transcript and each file
- getFoldChange
- compute mean fold change between two files of samples (can be either expression from
estimateExpression
or condition mean expression fromestimateDE --samples
) - getGeneExpression
- converts transcript expression samples into gene expression; it needs proper transcript information file (
.tr
) with transcript-gene grouping - getWithinGeneExpression
- converts transcript expression samples into relative within-gene expression; it needs proper transcript information file (
.tr
) with transcript-gene grouping - getPPLR
- compute PPLR between two files of samples (can be either expression from
estimateExpression
or condition mean expression fromestimateDE --samples
) - transposeLargeFile
- transposes one or more expression sample files from (N rows M columns) to (M rows N columns) and merges them into single file
getGeneExpression
example: assume that the file geneFile.txt
contains gene names (one per transcript - at the same order with ensemblGenes.tr
file names).
$ $BitSeq/getGeneExpression -o data1-1_Genes.rpkm -t ensemblGenes.tr -G geneFile.txt data1-1.rpkm
$ $BitSeq/getVariance -o data1-1_Genes data1-1_Genes.rpkm
The mean gene-level RPKM values are written to the 1st column of data1-1_Genes
.