[Software Usage 1] A Guide to Using cutadapt

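This article covers:

  1. What an adapter is (a brief introduction to sequencing)
  2. Removing adapters with cutadapt (installation, basic usage, reading the report)
  3. Predicting adapters with minion (and what to watch out for)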

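1. What is an adapter?

[Figure] The four steps of obtaining sequencing data

Before looking at adapters, here is a brief overview of how sequencing data is obtained. As the figure above shows, there are four main steps:

  1. Library preparation
  2. PCR amplification
  3. Sequencing
  4. Alignment and analysis

Adapter removal is one small preprocessing task within step 4, alignment and analysis.

So what is an adapter?

During library preparation (step 1), adapters are attached to the ends of the sample DNA fragments.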

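[Figure] Adapter ligation, method 1. Image from FIMM; will be removed on request.
[Figure] Adapter ligation, method 2. Image from Berry Genomics; will be removed on request.
[Figure] Adapter ligation. Image from the blackberryaurora website; will be removed on request.

Note: there is more than one way to attach adapters.

There are a number of different ways to prepare samples, all preparation methods add adapter to the end of DNA fragment. (Illumina)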

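Typically, a DNA fragment carries two flow cell adapters, one at each end, plus an index.

During PCR amplification (step 2), these adapters hybridize to the complementary adapters anchored on the flow cell. The sample DNA fragments are thereby captured on the flow cell surface, where bridge PCR amplifies each of them into a DNA cluster.

How do we know which sample a read came from?

The solution is to tag the DNA fragments with a label (barcode/index), much like the barcodes that distinguish products on a supermarket shelf.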

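[Figure] Image from BioFrontiers; will be removed on request.
In Solexa multiplexed sequencing, an index is used to distinguish samples. After the regular sequencing read is complete, 7 additional cycles are run to sequence the index; by identifying the index, 12 different samples can be distinguished within a single lane. (Source: Common Terminology in Solexa and HiSeq Sequencing | Public Library of Bioinformatics)
[Figure] Image from https://hackteria.org/wiki/HiSeq2000_-_Next_Level_Hacking; will be removed on request.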
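
To make the index idea concrete, here is a minimal Python sketch of demultiplexing: sorting reads into samples by their index. The index sequences, sample names, and the `demultiplex` function are all made up for illustration, and only exact index matches are accepted; real demultiplexing software usually tolerates mismatches.

```python
from collections import defaultdict

# Hypothetical index-to-sample table; real index sequences come from the library kit.
INDEX_TO_SAMPLE = {
    "ATCACG": "sample_A",
    "CGATGT": "sample_B",
}


def demultiplex(reads, index_to_sample):
    """Group (index, sequence) pairs by sample name.

    Reads whose index is not in the table are collected under 'undetermined'.
    Only exact index matches are accepted in this sketch.
    """
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Toy reads: (index read, insert read)
reads = [
    ("ATCACG", "GGCTTAAC"),
    ("CGATGT", "TTGACCGA"),
    ("ATCACG", "ACGTACGT"),
    ("NNNNNN", "CCCCGGGG"),
]
groups = demultiplex(reads, INDEX_TO_SAMPLE)
for sample, seqs in sorted(groups.items()):
    print(sample, len(seqs))
```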

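Why amplify into DNA clusters?

Because only a cluster emits a fluorescent signal bright enough for the CCD camera to resolve.

Notes:

  • A flow cell has 8 lanes, and each lane physically separates the samples loaded on it.
  • One run (a single sequencing reaction on the instrument) can load up to 8 lanes at once and yields roughly 4G-75G of sequencing throughput.
  • Each lane holds 2 columns of tiles, 120 tiles in total. Each tile carries a large number of cluster binding sites.
[Figure] Flow cell. Image from Illumina, for study purposes only; will be removed on request.
[Figure] Flow cell for the HiSeq 2000. Image from hackteria.org.

Finally, to get at the data we actually want, the adapters and indexes that were artificially attached to both ends of the DNA fragments have to be removed.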

A note on the difference between single-end and paired-end sequencing:

What is the difference between Single-End and Paired-End reads?
Single-End Read: When the sequencing process only occurs in 1 direction (utilizing Read Primer 1).
Paired-End Read: If two separate read cycles occur in both directions (utilizing both Read Primer 1 and 2). This kind of read will provide data about both sides of the fragment of interest (Blue). If the fragment size is consistent you will also be able to predict that both the forward and reverse reads will be a known distance from each other. This data can be used to help the software map the reads more accurately.
Source: http://tucf-genomics.tufts.edu/home/faq
[Figure] Image from Frequently Asked Questions. TUCF Genomics

2. How do you remove adapters with cutadapt?

The cutadapt paper: DOI:10.14806/ej.17.1.200.

Abstract:

When small RNA is sequenced on current sequencing machines, the resulting reads are usually longer than the RNA and therefore contain parts of the 3' adapter. That adapter must be found and removed error-tolerantly from each read before read mapping. Previous solutions are either hard to use or do not offer required features, in particular support for color space data. As an easy to use alternative, we developed the command-line tool cutadapt, which supports 454, Illumina and SOLiD (color space) data, offers two adapter trimming algorithms, and has other useful features. Cutadapt, including its MIT-licensed source code, is available for download at http://code.google.com/p/cutadapt/

2.1 Installing cutadapt:

cutadapt can be installed with conda or pip; for more options, see the official cutadapt documentation.

# Using pip
pip3 install --user --upgrade cutadapt

# Using Anaconda
conda install -c bioconda cutadapt

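2.2 Basic usage examples:

# Remove a 3' adapter
#  -a  3' adapter sequence
#  -o  output file (o for output)
#  -j  number of CPU cores to use
cutadapt -j 10 -a AACCGGTT -o output.fastq input.fastq

# Compressed files can be processed as well.
# Supported formats: gzip (.gz), bzip2 (.bz2) and xz (.xz).
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz

# Remove a 5' adapter
#  -g  5' adapter sequence
cutadapt -g ADAPTER -o output.fastq.gz input.fastq.gz

# Remove a poly-A tail
#  Instead of writing ten A in a row (AAAAAAAAAA), write A{10}.
#  A{100} stands for an adapter of 100 A's. Because a 3' adapter may match only partially at the end of a read, shorter poly-A tails are trimmed too.
cutadapt -a "A{100}" -o output.fastq input.fastq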
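
The curly-brace shorthand is easy to picture with a few lines of Python. This is only a sketch of the notation, not cutadapt's own parser; `expand_repeats` is a name invented here.

```python
import re


def expand_repeats(pattern):
    """Expand 'X{n}' into n copies of X, e.g. 'A{3}' -> 'AAA'."""
    return re.sub(
        r"([A-Za-z])\{(\d+)\}",
        lambda m: m.group(1) * int(m.group(2)),
        pattern,
    )


print(expand_repeats("A{10}"))
print(len(expand_repeats("A{100}")))
```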

# Quality trimming
#  The -q (or --quality-cutoff) parameter can be used to trim low-quality ends from reads before adapter removal.
#  By default, only the 3' end of each read is quality-trimmed.
cutadapt -q 10 -o output.fastq input.fastq
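
How does -q decide where to cut? The cutadapt documentation describes a BWA-style procedure: subtract the cutoff from every quality value, compute the sum from each position to the end of the read, and cut where that sum is smallest. The sketch below follows that description; the function name is my own, and the real implementation may differ in details such as early termination.

```python
def quality_trim_index(qualities, cutoff):
    """Return how many leading bases to keep after 3' quality trimming.

    Subtract the cutoff from every quality, sum from each position to the
    end of the read, and cut at the position where that sum is smallest.
    If no suffix has a negative sum, nothing is trimmed.
    """
    keep = len(qualities)  # default: keep the whole read
    running = 0
    smallest = 0  # sum of the empty suffix
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < smallest:
            smallest = running
            keep = i
    return keep


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(quals, cutoff=10)
print(quals[:keep])
```

Note that the 11 near the end does not rescue the tail: one decent base surrounded by poor ones is still trimmed, because the decision is based on sums rather than on individual values.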

# Search for several 3' adapters; for each read, only the best-matching one is removed
cutadapt -a TGAGACACGCA -a AGGCACACAGGG -o output.fastq input.fastq
[Figure] cutadapt can remove several types of adapters. Image from the cutadapt user guide.
Keep in mind that Cutadapt removes the adapter that it finds and also the sequence following it, so even if the actual adapter sequence that is used in a protocol is longer than that (and possibly contains a variable index), it is sufficient to specify a prefix of the sequence(s).
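
The behaviour just quoted (cut at the adapter and drop everything after it) can be sketched in Python. This is a deliberately simplified model that accepts exact matches only, whereas cutadapt itself tolerates errors; `trim_3prime_adapter` is a made-up name, and the `min_overlap=3` default mirrors cutadapt's default minimum overlap.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter and everything after it (exact matches only).

    First look for the full adapter anywhere in the read. Failing that,
    look for a prefix of the adapter (at least min_overlap bases long)
    that runs off the end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    return read


adapter = "AACCGGTT"
print(trim_3prime_adapter("ACGTACGTAACCGGTTGGGG", adapter))  # adapter in the middle
print(trim_3prime_adapter("ACGTACGTAACCG", adapter))         # adapter cut off by read end
print(trim_3prime_adapter("ACGTACGTAACCGGTTGGGG", "AACC"))   # a prefix of the adapter suffices
```

The third call shows why a prefix is enough: since everything after the match is discarded anyway, the rest of the adapter never needs to be spelled out.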

A cutadapt report, by example:

It begins with:

Sequence: ACGTACGTACGTTAGCTAGC; Length: 20; Trimmed: 2402 times.

The middle part reads:

No. of allowed errors:
0-7 bp: 0; 8-15 bp: 1; 16-20 bp: 2
The adapter, as was shown above, has a length of 20 characters. We are using a custom error rate of 0.12. What this implies is shown above: Matches up to a length of 7 bp are allowed to have no errors. Matches of lengths 8-15 bp are allowed to have 1 error and matches of length 16 or more can have 2 errors. See also the section about error-tolerant matching.
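
The "No. of allowed errors" line can be reproduced with a small calculation. The sketch below assumes the allowance is the match length times the maximum error rate, rounded down, which is how I understand the cutadapt documentation. Under that assumption the thresholds shown (8 bp and 16 bp) correspond to a rate of 1/8 = 0.125, so that is the value used here rather than the 0.12 stated in the quoted text.

```python
import math


def allowed_errors(match_length, max_error_rate):
    """Assumed rule: errors allowed = floor(match length x error rate)."""
    return math.floor(match_length * max_error_rate)


def error_table(adapter_length, max_error_rate):
    """Group match lengths 0..adapter_length into (first, last, errors) ranges."""
    ranges = []
    start = 0
    for length in range(1, adapter_length + 1):
        if allowed_errors(length, max_error_rate) != allowed_errors(start, max_error_rate):
            ranges.append((start, length - 1, allowed_errors(start, max_error_rate)))
            start = length
    ranges.append((start, adapter_length, allowed_errors(start, max_error_rate)))
    return ranges


for first, last, errors in error_table(20, 0.125):
    print(f"{first}-{last} bp: {errors}")
```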

The final part:

Finally, a table is output that gives more detailed information about the lengths of the removed sequences. The following is only an excerpt; some rows are left out:
Overview of removed sequences
length  count   expect  max.err error counts
3       140     156.2   0       140
4       57      39.1    0       57
5       50      9.8     0       50
6       35      2.4     0       35
7       13      0.3     0       1 12
8       31      0.1     1       0 31
...
100     397     0.0     3       358 36 3
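
One way to read the expect column: it estimates how often a match of that length would turn up purely by chance. A simple model, assumed here and not taken from the cutadapt source, is the number of reads times 0.25 to the power of the match length. The read count of 10,000 is a guess of mine (the excerpt does not state it), chosen because it reproduces the first rows; rows where errors are allowed are not covered by this model.

```python
def expected_by_chance(num_reads, length):
    """Expected number of random, error-free matches of the given length."""
    return num_reads * 0.25 ** length


NUM_READS = 10_000  # assumption: the read count is not given in the excerpt

# (length, value in the 'expect' column of the excerpt)
for length, reported in [(3, 156.2), (4, 39.1), (5, 9.8), (6, 2.4)]:
    print(length, round(expected_by_chance(NUM_READS, length), 2), reported)
```

When count is far above expect, as in the length-100 row (397 observed versus 0.0 expected), the matches are very unlikely to be random hits.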

For more on interpreting the report, see the "How to read the report" section of the cutadapt User Guide.

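2.3 Removing adapters from paired-end data:

Assume the input is in reads.1.fastq and reads.2.fastq and that ADAPTER_FWD should be trimmed from the forward reads (first file) and ADAPTER_REV from the reverse reads (second file).
#  -a  3' adapter to remove from the first read of each pair
#  -A  3' adapter to remove from the second read of each pair
#  -o  output file for the trimmed reads from reads.1.fastq
#  -p  short for --paired-output; output file for the trimmed reads from reads.2.fastq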

cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq
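
If you run this for many samples, it can help to assemble the command in a script. Below is a minimal Python sketch that builds the same command line; `paired_end_command` is a helper invented here, and it uses only the -a, -A, -o and -p options shown above. It prints the command rather than running it.

```python
import shlex


def paired_end_command(adapter_fwd, adapter_rev, in1, in2, out1, out2):
    """Build the cutadapt argument list for paired-end adapter trimming."""
    return [
        "cutadapt",
        "-a", adapter_fwd,
        "-A", adapter_rev,
        "-o", out1,
        "-p", out2,
        in1, in2,
    ]


cmd = paired_end_command(
    "ADAPTER_FWD", "ADAPTER_REV",
    "reads.1.fastq", "reads.2.fastq",
    "out.1.fastq", "out.2.fastq",
)
print(shlex.join(cmd))
# To actually run it (requires cutadapt to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```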

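ADAPTER_REV is the adapter sequence as it appears in the second read, so it is not necessarily the reverse complement of ADAPTER_FWD; check the documentation of your library prep kit. When you do need the reverse complement of a sequence, an online tool will give it to you quickly, with no need to write code yourself.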

[Figure] Search results
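
For anyone who would rather not depend on an online tool, a reverse complement takes only a few lines of Python. This sketch handles A, C, G, T and N in either case and leaves any other character unchanged; the example sequence is one of the adapters used earlier in this article.

```python
# Translation table for complementing bases; other characters pass through.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("TGAGACACGCA"))
```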

Everything above removes adapters whose sequence is already known.

3. What if we don't know the adapter in our data?

This is where minion comes in. Reference: Methods. 2013 Sep 1;63(1):41-9.

minion, developed at the European Bioinformatics Institute (EBI), can be used to predict adapter sequences.

To install minion, see the EBI Kraken website.

Linux download: http://wwwdev.ebi.ac.uk/enright-dev/kraken/reaper/binaries/reaper-13-100/linux/
Mac download: http://wwwdev.ebi.ac.uk/enright-dev/kraken/reaper/binaries/reaper-13-100/mac/

Usage is very simple:

minion search-adapter -i SRR.fastq

Example of the adapter prediction output:

criterion=sequence-density
sequence-density=52.19
sequence-density-rank=1
fanout-score=31.57
fanout-score-rank=1
prefix-density=54.75
prefix-fanout=30.1
sequence=TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACACACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA


criterion=fanout-score
sequence-density=52.19
sequence-density-rank=1
fanout-score=31.57
fanout-score-rank=1
prefix-density=54.75
prefix-fanout=30.1
sequence=TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACACACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA

In most cases, take the first prediction (criterion=sequence-density) as the adapter. Why?

From the official minion documentation:

The two criteria are unfortunately necessary due to the varying characteristics of 3' adapter sequence in different experimental protocols. The first criterion is frequency of occurrence; the second criterion incorporates a fan-out measure that captures the typical characteristic of 3' adapter sequence of being attached to a multitude of different prefixes. When inferring adapters (i.e. without using -adapter) two candidate sequences will be shown. The second will start with the line criterion=fanout-score. The second should only be considered if the first candidate is clearly a biological sequence. This can be established by using one of the BLAST interfaces provided for example by NCBI and ENSEMBL.

In other words, consider the second sequence (criterion=fanout-score) only when the first candidate turns out to be a genuine biological sequence; the BLAST services at NCBI or ENSEMBL can be used to check.
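
If you want to pull the candidates out of minion's output in a script, the key=value layout is easy to parse. The sketch below assumes the report looks like the example above (blocks of key=value lines separated by blank lines); the function names are my own, and the sample text is shortened to a few of the fields shown earlier.

```python
def parse_minion_report(text):
    """Split a minion report into one dict per candidate block."""
    candidates = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current block.
            if current:
                candidates.append(current)
                current = {}
            continue
        key, _, value = line.partition("=")
        current[key] = value
    if current:
        candidates.append(current)
    return candidates


def pick_candidate(candidates, criterion="sequence-density"):
    """Return the sequence of the block with the given criterion, or None."""
    for candidate in candidates:
        if candidate.get("criterion") == criterion:
            return candidate["sequence"]
    return None


report = """\
criterion=sequence-density
sequence-density=52.19
sequence=TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACACACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA

criterion=fanout-score
fanout-score=31.57
sequence=TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACACACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA
"""
candidates = parse_minion_report(report)
print(pick_candidate(candidates))
```

Passing criterion="fanout-score" to pick_candidate returns the second candidate instead, for the case where the first one turns out to be a biological sequence.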

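Note!

Always look up the adapter that minion reports (for example with Google or Baidu) to confirm that such an adapter actually exists.

minion's output is only a prediction and is not guaranteed to be the real adapter.

After that, remove the adapter with cutadapt or another tool, as described above.

Downstream analysis will be covered in a future post.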


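References:

  1. User guide – cutadapt 1.18 documentation
  2. Removing adapters from paired-end sequencing data with cutadapt
  3. An introduction to barcodes and indexes in next-generation sequencing
  4. Common Terminology in Solexa and HiSeq Sequencing | Public Library of Bioinformatics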
  5. Frequently Asked Questions. TUCF Genomics
  6. https://www.fimm.fi/en/services/technology-centre/sequencing/next-generation-sequencing/dna-library-preparation
  7. https://hackteria.org/wiki/HiSeq2000_-_Next_Level_Hacking

Original article: [Software Usage 1] A Guide to Using cutadapt. All rights reserved.
If you repost it, please credit the source: https://www.itxiaozhan.cn/20226364.html
