A pipeline to calculate SHAPE/DMS reactity score
icSHAPE-pipe is a complete pipeline for processing RNA structure sequencing data. It includes multiple modules such as data preprocessing, mapping, reactivity score calculation and quality control. icSHAPE-pipe can currently process reverse transcription termination type of RNA structure sequencing data, including icSHAPE, DMS-Seq, PARS and other experimental techniques. Structural sequencing data with modified mutation types will be introduced in the future.
icSHAPE-pipe is a package written in C++ and Python and calls a variety of bioinformatics tools such as bowtie2, STAR, and cufflinks. Versions above v2.0.0 support Python 2.7 and Python 3. There are three programs: sam2tab, calc_sliding_shape, countRT written in C++, have been statically compiled into executable files under the Linux platform, so no additional compilation is required under Linux.
The icSHAPE-pipe package needs to install the following program in advance:
If you are using Mac OS X system, you may need to compile three excutable files writen with C++. You can do as follows:
To use icSHAPE-pipe, just add the bin directory to your PATH variable.
export PATH=[icSHAPE-pipe/bin]:$PATH
This pipeline can either map the reads to the genome or map to the transcript. For the former, the user needs to provide a GTF annotation file. The main input and output files have the following.
Input files
Output files
icSHAPE-pipe puts each function in a different module, running icSHAPE-pipe [module-name] [options]. This section describes the function, input and output of each module separately.
The preparation phase is mainly to establish an index of the genome sequence file and to parse the GTF file.
icSHAPE-pipe starbuild -i reference.fa -o outdir --gtf reference.gtf
starbuild will use STAR to create STAR index in the directory.
icSHAPE-pipe parseGTF -g reference.gtf -o outFilePrefix -s ensembl/ncbi --genome reference.fa
parseGTF will parse the GTF file, transfer GTF file to an easy-to-read *.genomeCoor.bed file and extract the transcript sequence.The fastq preprocessing module mainly performs some processing on the reads before going to the genome, except for the poor quality of the sequencing and the part derived from the high expression of RNA.
icSHAPE-pipe readcollapse -U input.fq -o output.fq --simplify
readcollapse removes identical reads from the original sequencing data.
icSHAPE-pipe trim -i input.fq -o output.fq -l 13 -a adaptor.fa
trim is mainly to remove the adaptor sequence at the end and the beginning barcode of the sequence.
icSHAPE-pipe cleanFq -i input.fq -o unmap.fq -x rRNA_index --sam map.sam
parseGTF maps the reads to the reference genome of rRNA or small RNA, the mapped reads are saved in the sam file, and the unmapped reads are saved in the unmap.fq file.The pre-processed reads now be mapped to the genome. The RT stops for each base are counted. The icSHAPE scores are calculated.
icSHAPE-pipe mapGenome -i input.fq -o outputPrefix -x STAR_index
mapGenome map the processed reads to reference genome. The STAR_index is generated with icSHAPE-pipe starbuild
icSHAPE-pipe calcFPKM -i input.fq -o output.fq -G reference.gtf
calcFPKM calls cufflinks to calculate the expression of each isoform as a condition for subsequent filtering.
icSHAPE-pipe sam2tab -in input.sam -out output.tab
sam2tab converts the sam file into a .tab file that records the start, end, and start coordinates of each read's reference, strand, and map. This file is the input to the next step.
icSHAPE-pipe trainParameter -d knownstructure.dot -s rRNA_size_file -D D1.tab,D2.tab -N N1.tab,N2.tab -o report.pdf
In order to get an optimal parameter, trainParameter attempts to calculate the reactivity without using the window size, subtract factor parameters and compares the calculated values to the standard structure.
icSHAPE-pipe calcSHAPE -D D1.tab,D2.tab -N N1.tab,N2.tab -size chrNameLength.txt -ijf sjdbList.fromGTF.out.tab -out output.gTab -genome reference.fa -bases A,T,C,G
calcSHAPE uses a window of size 200 with a step size of 5 to calculate the reactivity score for each base. In order to cross the junction for the window, you need to provide a junction file, which can be from the sjdbList.fromGTF.out.tab file in the STAR index directory.
icSHAPE-pipe calcSHAPENoCont -N N1.tab,N2.tab -size chrNameLength.txt -ijf sjdbList.fromGTF.out.tab -out output.gTab -genome reference.fa -bases A,T,C,G
calcSHAPENoCont is the same with calcSHAPE. It calculate scores without DMSO control samplesThe coordinate system transfer is mainly to convert the score of the genomic coordinates into the score of the transcript coordinates for better analysis.
icSHAPE-pipe genSHAPEToTransSHAPE -i input.gTab [-g *.genomeCoor.bed | -s ref_size_file] -r D1/isoforms.fpkm_tracking,D2/isoforms.fpkm_tracking -o output.shape
genSHAPEToTransSHAPE converts the Shape in the gTab file into a .shape file of transcript coordinates. Various filtering methods ensure the quality of the data. More parameters can be obtained by typing in icSHAPE-pipe genSHAPEToTransSHAPE.
icSHAPE-pipe genRTBDToTransRTBD -i input.gTab [-g *.genomeCoor.bed | -s ref_size_file] -c col1,col2... -o output.txt
genRTBDToTransRTBD converts certain columns in the gTab file into transcript-based files
icSHAPE-pipe genSHAPEToBedGraph -i input.gTab -o output.bedGraph
genSHAPEToBedGraph turns the activity score and variation into a bedGraph file. This file can be uploaded to the UCSC genome browser for viewing.This module analyzes the data quality of each step and guarantees the reliability of the data.
icSHAPE-pipe readDistributionStatistic -@ raw.fq,process1.fq,process2.fq... -@ raw.fq,process1.fq,process2.fq... -@ raw.fq,process1.fq,process2.fq... --sampletag sample1,sample2,sample3 --processtag raw,process1,process2... -o outFolder
readDistributionStatistic counts the number of reads remaining in each step of one or more of the original fastq files after a series of processing, and finally generates a report. Multiple -@ are allowed in the parameter and each of which is a series of processing of the original fastq file. Example
icSHAPE-pipe samStatistics -i input.bam -o outFolder -g *.genomeCoor.bed --fast
samStatistics analyzes the ratio of various types of genes are mapped in the sam/bam file.Example
icSHAPE-pipe transSHAPEStatistics -i input.shape -g *.genomeCoor.bed -o outFolder
transSHAPEStatistics analyzes the distribution of scores in .shape files. Example
icSHAPE-pipe countRT -in D1.tab,D2.tab,N1.tab,N2.tab... -size chrNameLength.txt -ijf sjdbList.fromGTF.out.tab -out output.gTab
countRT counts RT and BD for all input files
icSHAPE-pipe plotGenomeRTRepCor -i input.gTab -o outFolder
plotGenomeRTRepCor counts RT and BD for all input files Example
icSHAPE-pipe plotGenomeSHAPEdist -i input.gTab -o outFolder
plotGenomeSHAPEdist draws the distribution of the reactivity score in the gTab file. Example
icSHAPE-pipe evaluateSHAPE -i input.shape -s structure.dot -o report.pdf
evaluateSHAPE calculates the consistency between the reacticity score and the secondary structure