icSHAPE-pipe

A pipeline to calculate SHAPE/DMS reactity score

Site Map
Latest Release
Secondary structures

Introduction

icSHAPE-pipe is a complete pipeline for processing RNA structure sequencing data. It includes multiple modules such as data preprocessing, mapping, reactivity score calculation and quality control. icSHAPE-pipe can currently process reverse transcription termination type of RNA structure sequencing data, including icSHAPE, DMS-Seq, PARS and other experimental techniques. Structural sequencing data with modified mutation types will be introduced in the future.

Obtaining icSHAPE-pipe

icSHAPE-pipe is a package written in C++ and Python and calls a variety of bioinformatics tools such as bowtie2, STAR, and cufflinks. Versions above v2.0.0 support Python 2.7 and Python 3. There are three programs: sam2tab, calc_sliding_shape, countRT written in C++, have been statically compiled into executable files under the Linux platform, so no additional compilation is required under Linux.

Requirements

The icSHAPE-pipe package needs to install the following program in advance:

Building from source

If you are using Mac OS X system, you may need to compile three excutable files writen with C++. You can do as follows:

  1. Compile PsBL library.
  2. Add the path of PsBL lib to LIBRARY_PATH variable and CPLUS_INCLUDE_PATH variable
  3. Compile sliding_shape folder in icSHAPE-pipe
  4. Copy three excutable files: sam2tab, calc_sliding_shape and countRT to bin/Functions/

Adding to PATH

To use icSHAPE-pipe, just add the bin directory to your PATH variable.

export PATH=[icSHAPE-pipe/bin]:$PATH

Input & output files

This pipeline can either map the reads to the genome or map to the transcript. For the former, the user needs to provide a GTF annotation file. The main input and output files have the following.

Input files

Output files

The process flow can be summarized by the following diagram:

Modules

icSHAPE-pipe puts each function in a different module, running icSHAPE-pipe [module-name] [options]. This section describes the function, input and output of each module separately.

[Prepare]

The preparation phase is mainly to establish an index of the genome sequence file and to parse the GTF file.

starbuild

icSHAPE-pipe starbuild -i reference.fa -o outdir --gtf reference.gtf

starbuild will use STAR to create STAR index in the directory.

parseGTF

icSHAPE-pipe parseGTF -g reference.gtf -o outFilePrefix -s ensembl/ncbi --genome reference.fa

parseGTF will parse the GTF file, transfer GTF file to an easy-to-read *.genomeCoor.bed file and extract the transcript sequence.

parseGTF currently only supports parsing GTF files from Ensembl or Gencode and GFF3 files from RefSeq

[Fastq-processing]

The fastq preprocessing module mainly performs some processing on the reads before going to the genome, except for the poor quality of the sequencing and the part derived from the high expression of RNA.

readcollapse

icSHAPE-pipe readcollapse -U input.fq -o output.fq --simplify

readcollapse removes identical reads from the original sequencing data.

trim

icSHAPE-pipe trim -i input.fq -o output.fq -l 13 -a adaptor.fa

trim is mainly to remove the adaptor sequence at the end and the beginning barcode of the sequence.

cleanFq

icSHAPE-pipe cleanFq -i input.fq -o unmap.fq -x rRNA_index --sam map.sam

parseGTF maps the reads to the reference genome of rRNA or small RNA, the mapped reads are saved in the sam file, and the unmapped reads are saved in the unmap.fq file.

[Mapping and calculate score]

The pre-processed reads now be mapped to the genome. The RT stops for each base are counted. The icSHAPE scores are calculated.

mapGenome

icSHAPE-pipe mapGenome -i input.fq -o outputPrefix -x STAR_index

mapGenome map the processed reads to reference genome. The STAR_index is generated with icSHAPE-pipe starbuild

calcFPKM

icSHAPE-pipe calcFPKM -i input.fq -o output.fq -G reference.gtf

calcFPKM calls cufflinks to calculate the expression of each isoform as a condition for subsequent filtering.

sam2tab

icSHAPE-pipe sam2tab -in input.sam -out output.tab

sam2tab converts the sam file into a .tab file that records the start, end, and start coordinates of each read's reference, strand, and map. This file is the input to the next step.

trainParameter

icSHAPE-pipe trainParameter -d knownstructure.dot -s rRNA_size_file -D D1.tab,D2.tab -N N1.tab,N2.tab -o report.pdf

In order to get an optimal parameter, trainParameter attempts to calculate the reactivity without using the window size, subtract factor parameters and compares the calculated values to the standard structure.

calcSHAPE

icSHAPE-pipe calcSHAPE -D D1.tab,D2.tab -N N1.tab,N2.tab -size chrNameLength.txt -ijf sjdbList.fromGTF.out.tab -out output.gTab -genome reference.fa -bases A,T,C,G

calcSHAPE uses a window of size 200 with a step size of 5 to calculate the reactivity score for each base. In order to cross the junction for the window, you need to provide a junction file, which can be from the sjdbList.fromGTF.out.tab file in the STAR index directory.

calcSHAPENoCont

icSHAPE-pipe calcSHAPENoCont -N N1.tab,N2.tab -size chrNameLength.txt -ijf sjdbList.fromGTF.out.tab -out output.gTab -genome reference.fa -bases A,T,C,G

calcSHAPENoCont is the same with calcSHAPE. It calculate scores without DMSO control samples

[Coordination system convert]

The coordinate system transfer is mainly to convert the score of the genomic coordinates into the score of the transcript coordinates for better analysis.

genSHAPEToTransSHAPE

icSHAPE-pipe genSHAPEToTransSHAPE -i input.gTab [-g *.genomeCoor.bed | -s ref_size_file] -r D1/isoforms.fpkm_tracking,D2/isoforms.fpkm_tracking -o output.shape

genSHAPEToTransSHAPE converts the Shape in the gTab file into a .shape file of transcript coordinates. Various filtering methods ensure the quality of the data. More parameters can be obtained by typing in icSHAPE-pipe genSHAPEToTransSHAPE.

genRTBDToTransRTBD

icSHAPE-pipe genRTBDToTransRTBD -i input.gTab [-g *.genomeCoor.bed | -s ref_size_file] -c col1,col2... -o output.txt

genRTBDToTransRTBD converts certain columns in the gTab file into transcript-based files

genSHAPEToBedGraph

icSHAPE-pipe genSHAPEToBedGraph -i input.gTab -o output.bedGraph

genSHAPEToBedGraph turns the activity score and variation into a bedGraph file. This file can be uploaded to the UCSC genome browser for viewing.

[Quanlity control]

This module analyzes the data quality of each step and guarantees the reliability of the data.

readDistributionStatistic HTML

icSHAPE-pipe readDistributionStatistic -@ raw.fq,process1.fq,process2.fq... -@ raw.fq,process1.fq,process2.fq... -@ raw.fq,process1.fq,process2.fq... --sampletag sample1,sample2,sample3 --processtag raw,process1,process2... -o outFolder

readDistributionStatistic counts the number of reads remaining in each step of one or more of the original fastq files after a series of processing, and finally generates a report. Multiple -@ are allowed in the parameter and each of which is a series of processing of the original fastq file. Example

samStatistics HTML

icSHAPE-pipe samStatistics -i input.bam -o outFolder -g *.genomeCoor.bed --fast

samStatistics analyzes the ratio of various types of genes are mapped in the sam/bam file.Example

transSHAPEStatistics HTML

icSHAPE-pipe transSHAPEStatistics -i input.shape -g *.genomeCoor.bed -o outFolder

transSHAPEStatistics analyzes the distribution of scores in .shape files. Example

countRT

icSHAPE-pipe countRT -in D1.tab,D2.tab,N1.tab,N2.tab... -size chrNameLength.txt -ijf sjdbList.fromGTF.out.tab -out output.gTab

countRT counts RT and BD for all input files

plotGenomeRTRepCor HTML

icSHAPE-pipe plotGenomeRTRepCor -i input.gTab -o outFolder

plotGenomeRTRepCor counts RT and BD for all input files Example

plotGenomeSHAPEdist HTML

icSHAPE-pipe plotGenomeSHAPEdist -i input.gTab -o outFolder

plotGenomeSHAPEdist draws the distribution of the reactivity score in the gTab file. Example

evaluateSHAPE PDF

icSHAPE-pipe evaluateSHAPE -i input.shape -s structure.dot -o report.pdf

evaluateSHAPE calculates the consistency between the reacticity score and the secondary structure