Editing bioinformatics data with sed

阿新 • Published: 2018-12-15

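sed: the stream editor

How sed works and its main characteristics

A line editor (in contrast to a full-screen editor such as vi)
Processes input line by line: each line is read into a temporary buffer called the pattern space
Does not modify the original file by default; commands act only on the copy in the pattern space. Use -i to edit the file in place
By default the pattern space is printed once at the end of each cycle. Lines selected by an address are first processed by the command and are then printed to standard output as usual, unless the d command deleted them or -n turned automatic printing off
Supports basic and extended regular expressions

In short: sed reads one line into the pattern space, applies the commands whose address matches, prints the pattern space to standard output, clears it, and moves on to the next line. The pattern space is where almost all of sed's work happens.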
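The cycle described above can be observed on a throwaway file. The following is a minimal sketch, assuming GNU sed; the scratch file and the variable names are illustrative and not part of the original post.

```shell
# Create a small scratch file (hypothetical sample data).
tmp=$(mktemp)
printf 'chr1\nchr2\nchr3\n' > "$tmp"

# Default: every line is printed at the end of its cycle; p prints line 2 a second time.
with_autoprint=$(sed '2p' "$tmp")

# With -n, automatic printing is off, so only the explicit p produces output.
without_autoprint=$(sed -n '2p' "$tmp")

# Without -i the input file itself is left untouched.
sed 's/chr/Chr/' "$tmp" > /dev/null
unchanged=$(cat "$tmp")

printf '%s\n---\n%s\n---\n%s\n' "$with_autoprint" "$without_autoprint" "$unchanged"
rm -f "$tmp"
```

The first result contains chr2 twice, the second contains only chr2, and the file still reads chr1, chr2, chr3 afterwards.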

The sed command

NAME

sed - stream editor for filtering and transforming text

SYNOPSIS

sed [OPTION]… {script-only-if-no-other-script} [input-file]…
That is: sed, then options, then an address immediately followed by an editing command, then the input files.

sed OPTIONS 'AddressCommand' FILE ...

Options

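-n : quiet mode; turns off automatic printing of the pattern space, so only lines printed explicitly (for example with p) reach standard output
-e : adds an editing command; repeat -e to run several commands in one pass
-f FILE : reads the editing commands from FILE, one command per line
-r : enables extended regular expressions (GNU sed; -E is the equivalent, more portable spelling)
-i : writes the changes back to the original file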
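As an illustration of -e, -f and -i, here is a small sketch on a scratch file. The file contents and variable names are made up for this example, and -i.bak (edit in place while keeping a backup copy) is used so the original data stays recoverable.

```shell
tmp=$(mktemp)
script=$(mktemp)
printf 'AB\texon\nAB\tCDS\n' > "$tmp"

# -e: two editing commands applied in one pass.
two_edits=$(sed -e 's/AB/chrAB/' -e 's/exon/EXON/' "$tmp")

# -f: the same commands read from a script file, one per line.
printf 's/AB/chrAB/\ns/exon/EXON/\n' > "$script"
from_file=$(sed -f "$script" "$tmp")

# -i.bak: edit the file in place and keep the original as FILE.bak.
sed -i.bak 's/AB/chrAB/' "$tmp"
edited=$(cat "$tmp")
backup=$(cat "$tmp.bak")

printf '%s\n' "$two_edits"
rm -f "$tmp" "$tmp.bak" "$script"
```

Keeping a backup suffix is a cautious habit when editing annotation files in place, since a wrong pattern would otherwise overwrite the only copy.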

Address

startline,endline  e.g. 1,100
/RegExp/  e.g. /^root/
/pattern1/,/pattern2/  from the first line matched by pattern1 through the next line matched by pattern2, inclusive
lineNumber  a specific line; $ means the last line
startline,+N  startline and the N lines after it

Command

For example:

sed -n '5,8p' passwd

sed '1,2d' passwd

sed '3,$d' passwd
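Since a passwd file differs from machine to machine, the address forms listed above can also be tried on the output of seq. This is only a sketch with GNU sed assumed (the start,+N form is a GNU extension); the variable names are illustrative.

```shell
# Line range: lines 5 through 8.
range=$(seq 10 | sed -n '5,8p' | paste -sd ' ' -)

# Single address: $ is the last line.
last=$(seq 10 | sed -n '$p')

# Regex range: from the first line containing 3 through the next line containing 6.
between=$(seq 10 | sed -n '/3/,/6/p' | paste -sd ' ' -)

# A line plus the N lines after it.
plus=$(seq 10 | sed -n '2,+2p' | paste -sd ' ' -)

echo "range: $range | last: $last | between: $between | plus: $plus"
```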

Common ways to specify an address

Sample genome annotation file (GTF):

!cat ./data/demo.gtf

##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

Empty address: the command applies to every line

%%bash
sed 's/AB/ab/' ./data/demo.gtf

##gff-version 3
ab	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

Single address

%%bash
# A specific line ($ is the last line)
sed -n '$p' ./data/demo.gtf

echo '----------------------'
# /pattern/ : every line matched by the pattern
awk '{print $0}' ./data/demo.gtf |sed -n '/.*exon/p'

AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
----------------------
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";

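Address ranges

%%bash
# Delete lines 2 to 5 and print the rest
sed '2,5d' ./data/demo.gtf

echo '----------------------'
# Delete line 2 and the 5 lines after it
sed '2,+5d' ./data/demo.gtf

echo '----------------------'
# Delete from line 1 through the first line matched by the pattern
sed '1,/.*start_codon/d' ./data/demo.gtf

echo '----------------------'

# Delete from the first line matched by pattern 1 through the next line matched by pattern 2
sed '/.*CDS/,/.*start_codon/d' ./data/demo.gtf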

##gff-version 3
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
----------------------
##gff-version 3
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
----------------------
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
----------------------
##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

Step addresses (first~step)

%%bash

# 1~2: every odd-numbered line
sed -n '1~2p' ./data/demo.gtf

echo '---------------------------------'
# 2~2: every even-numbered line
sed -n '2~2p' ./data/demo.gtf

##gff-version 3
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
---------------------------------
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
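Step addresses are handy for record-oriented formats with a fixed number of lines per record. The sketch below applies them to a hypothetical two-read FASTQ file (4 lines per record), which is not part of the original post; first~step is a GNU sed extension.

```shell
# Hypothetical FASTQ data: header, sequence, separator, qualities.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$fq"

# 2~4: line 2 and every 4th line after it, i.e. the sequence lines.
seqs=$(sed -n '2~4p' "$fq")

# 1~4 selects the header lines; s/^@/>/ turns them into FASTA headers and p prints them.
fasta=$(sed -n '1~4s/^@/>/p; 2~4p' "$fq")

printf '%s\n' "$fasta"
rm -f "$fq"
```

This relies on every record occupying exactly four lines; FASTQ files with wrapped sequences would need a real parser.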

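sed editing commands

d : delete the pattern space
p : print the pattern space
a \line : append line after the matched line; use \n to append several lines
P : print only the first line of the pattern space (up to the first newline)
i \line : insert line before the matched line; use \n to insert several lines
c \line : replace the matched line with line
w /PATH : write the matched lines to the given file
r /PATH : read the file at PATH and output its contents after the matched line; often used to merge files
q : quit sed; typically used to stop after printing the first few lines
y : transliterate characters one for one, e.g. y/abc/ABC/ (unlike s///, which replaces a matched string as a whole)
= : print the line number of the matched line, on its own line before the line itself; add -n to show only the numbers
! : negate the address; placed after the address and before the command
s/pattern/string/ : search and replace; any character can serve as the delimiter, commonly s@pattern@string@ or s#pattern#string#
n : print the pattern space (unless -n is set), then replace it with the next input line
N : append the next input line to the pattern space, separated by \n
{} : group several commands so that they run under the same address
h : copy the pattern space to the hold space, overwriting it
H : append the pattern space to the hold space
g : copy the hold space to the pattern space, overwriting it
G : append the hold space to the pattern space
x : exchange the pattern space and the hold space; the hold space starts out empty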

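d : delete the pattern space

%%bash
# Delete from line 2 to the end, keeping only line 1
sed '2,$d' ./data/demo.gtf

echo
# Delete the odd-numbered lines, keeping only the even-numbered ones
sed '1~2d' ./data/demo.gtf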

##gff-version 3

AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

p : print the pattern space

%%bash
# Extract column 3 of demo.gtf and print the values that end in stop_codon
awk '{print $3}' ./data/demo.gtf   |sed -n '/stop_codon$/p'

stop_codon

a \line : append line after the matched line; use \n to append several lines

%%bash
# Find the line containing start_codon and append two lines, line1 and line2, after it
sed '/.*start_codon/ a\line1\nline2' ./data/demo.gtf

# Append the line 'hello world' after line 5 of demo.gtf
echo
sed  '5a\hello world' ./data/demo.gtf

##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
line1
line2
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
hello world
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

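P : print only the first line of the pattern space

%%bash
# Prints 1 and 3
# N appends the next line to the pattern space, separated by \n,
# so each cycle holds two lines and P prints only the first of them.
# The unpaired line 5 is not printed because -n is set.
seq 5 | sed -n 'N;P' 

# Prints 1, 3 and 5
echo 
seq 6 | sed -n 'N;P'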

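i \line : insert line before the matched line; use \n to insert several lines

%%bash
# Find the lines starting with # and insert two lines, line1 and line2, before each
sed '/^#/ i\line1\nline2' ./data/demo.gtf

# Insert the line 'hello world' before line 5 of demo.gtf
echo
sed  '5i\hello world' ./data/demo.gtf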

line1
line2
##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
hello world
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

c \line : replace the matched line with line

%%bash
# Replace line 1 of demo.gtf with 'hello world'
sed '1c\hello world' ./data/demo.gtf |head -2

# Replace every line containing exon with the text newline
echo
sed '/.*exon/c \newline' ./data/demo.gtf

hello world
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";

##gff-version 3
newline
newline
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
newline
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
newline
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
newline
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

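w /PATH : write the matched lines to the given file

%%bash
# Write the lines that do not start with # to w.txt in the current directory
sed -n '/^#/!w ./w.txt' ./data/demo.gtf
cat w.txt

# Write the lines that start with # to w1.txt in the current directory
echo
sed -n '/^#/w ./w1.txt' ./data/demo.gtf
cat w1.txt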

AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

##gff-version 3

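r /PATH : read the file at PATH and output its contents after the matched line; often used to merge files

%%bash
# Insert the contents of w1.txt after every line starting with #
sed '/^#/r ./w1.txt' ./data/demo.gtf

# Insert the contents of w1.txt after line 2 and after line 3
echo
sed '2,3r ./w1.txt' ./data/demo.gtf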

##gff-version 3
##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
##gff-version 3
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
##gff-version 3
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

q : quit sed; typically used to stop after printing the first few lines

%%bash
# Print only the first 3 lines of the file; same output as sed -n '1,3p' FILE
sed '3q' ./data/demo.gtf
echo
sed -n '1,3p' ./data/demo.gtf

##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";

##gff-version 3
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";

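y : transliterate characters one for one (unlike s///, which replaces a matched string as a whole)

%%bash
echo "abcdef" | sed 'y/abcdef/123456/'
echo "fedcba" | sed 'y/abcdef/123456/'
echo "abcd" | sed 'y/abcd/ABCD/'

echo 
# On lines 1 to 5, map every A to a and every B to b
sed '1,5y/AB/ab/' ./data/demo.gtf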

123456
654321
ABCD

##gff-version 3
ab	Twinscan	exon	150	200	.	+	.	gene_id "ab.0"; transcript_id "ab.0.1";
ab	Twinscan	exon	300	401	.	+	.	gene_id "ab.0"; transcript_id "ab.0.1";
ab	Twinscan	CDS	380	401	.	+	0	gene_id "ab.0"; transcript_id "ab.0.1";
ab	Twinscan	exon	501	650	.	+	.	gene_id "ab.0"; transcript_id "ab.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
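Because y maps characters position by position, it fits tasks such as complementing a DNA sequence. The sketch below uses a made-up sequence and illustrative variable names, and contrasts y with s on the same input.

```shell
dna="ATGCCGTA"

# y maps each character in the first list to the character at the same position in the second.
complement=$(echo "$dna" | sed 'y/ACGT/TGCA/')

# s replaces a matched string as a unit, so only the literal text "AT" is affected here.
substituted=$(echo "$dna" | sed 's/AT/ta/g')

echo "$complement $substituted"
```

Note that this gives the complement only; a reverse complement would additionally need the sequence reversed.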

= : print the line number of the matched line, on its own line before the line itself; add -n to show only the numbers

%%bash
# Print the line number of the line containing stop_codon
sed  -n '/.*stop_codon/=' ./data/demo.gtf
sed  -n '/.*stop_codon/p' ./data/demo.gtf

linenum=`sed  -n '/.*stop_codon/=' ./data/demo.gtf`
echo "$linenum>>>`sed  -n '/.*stop_codon/p'  ./data/demo.gtf`"

# Print the number of the last line, i.e. the total number of lines in the file
echo
sed -n '$=' ./data/demo.gtf

# Print the line number of every non-empty line
echo
sed -n '/./=' ./data/demo.gtf

11
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
11>>>AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

11

1
2
3
4
5
6
7
8
9
10
11

! : negate the address; placed after the address and before the command

%%bash
# Show the lines that start with ##g (two equivalent ways)
sed '/^##g/!d' ./data/demo.gtf
echo
sed -n '/^##g/p' ./data/demo.gtf

##gff-version 3

##gff-version 3

s/pattern/string/ : search and replace; any character can serve as the delimiter, commonly s@pattern@string@ or s#pattern#string#

%%bash
echo 'With y: map A to a and B to b on lines 1 to 5'
sed '1,5y/AB/ab/' ./data/demo.gtf

echo 'Difference between y and s'
sed '1,5s/AB/ab/' ./data/demo.gtf

echo 'Without a flag, s replaces only the first match on each line'
sed '1,5s/B/b/' ./data/demo.gtf

echo 'The g flag replaces every match'
sed '1,5s/B/b/g' ./data/demo.gtf

echo 'Replace B with b on lines 1 to 5 and print only the lines that changed'
sed -n '1,5s/B/b/gp' ./data/demo.gtf

With y: map A to a and B to b on lines 1 to 5
##gff-version 3
ab	Twinscan	exon	150	200	.	+	.	gene_id "ab.0"; transcript_id "ab.0.1";
ab	Twinscan	exon	300	401	.	+	.	gene_id "ab.0"; transcript_id "ab.0.1";
ab	Twinscan	CDS	380	401	.	+	0	gene_id "ab.0"; transcript_id "ab.0.1";
ab	Twinscan	exon	501	650	.	+	.	gene_id "ab.0"; transcript_id "ab.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
Difference between y and s
##gff-version 3
ab	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
ab	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
Without a flag, s replaces only the first match on each line
##gff-version 3
Ab	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
Ab	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
Ab	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
Ab	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
The g flag replaces every match
##gff-version 3
Ab	Twinscan	exon	150	200	.	+	.	gene_id "Ab.0"; transcript_id "Ab.0.1";
Ab	Twinscan	exon	300	401	.	+	.	gene_id "Ab.0"; transcript_id "Ab.0.1";
Ab	Twinscan	CDS	380	401	.	+	0	gene_id "Ab.0"; transcript_id "Ab.0.1";
Ab	Twinscan	exon	501	650	.	+	.	gene_id "Ab.0"; transcript_id "Ab.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
Replace B with b on lines 1 to 5 and print only the lines that changed
Ab	Twinscan	exon	150	200	.	+	.	gene_id "Ab.0"; transcript_id "Ab.0.1";
Ab	Twinscan	exon	300	401	.	+	.	gene_id "Ab.0"; transcript_id "Ab.0.1";
Ab	Twinscan	CDS	380	401	.	+	0	gene_id "Ab.0"; transcript_id "Ab.0.1";
Ab	Twinscan	exon	501	650	.	+	.	gene_id "Ab.0"; transcript_id "Ab.0.1";
%%bash
echo 'Delete the last character of every line; .$ matches the last character of a line'
sed 's/.$//' ./data/demo.gtf
echo 'Replace AB with "AB script"; & stands for the text matched by the pattern'
sed 's/AB/& script/' ./data/demo.gtf
echo 'Replace each line starting with # by "# gtf"'
#sed 's/^\(#\).*/\1 gtf/' ./data/demo.gtf
#sed -r 's/^(#).*/\1 gtf/' ./data/demo.gtf
#sed -r 's#^(xxx).*#\1 gtf#' ./data/demo.gtf
sed -r 's@^(#).*@\1 gtf@' ./data/demo.gtf

Delete the last character of every line; .$ matches the last character of a line
##gff-version 
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1"
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1"
Replace AB with "AB script"; & stands for the text matched by the pattern
##gff-version 3
AB script	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB script	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
Replace each line starting with # by "# gtf"
# gtf
AB	Twinscan	exon	150	200	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	300	401	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	380	401	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	501	650	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	501	650	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	700	800	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	CDS	700	707	.	+	2	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	exon	900	1000	.	+	.	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	start_codon	380	382	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";
AB	Twinscan	stop_codon	708	710	.	+	0	gene_id "AB.0"; transcript_id "AB.0.1";

%%bash
# Run this in a terminal first:
#history > hist
cat hist|tail -2

echo "Remove all whitespace"
sed 's#[[:space:]]\+##g' hist|tail -2

echo "Remove leading whitespace"
sed 's#^[[:space:]]\+##g' hist|tail -2

 3007  2018-10-21 13:33:58 less hist 
 3008  2018-10-21 13:34:56 history > hist
Remove all whitespace
30072018-10-2113:33:58lesshist
30082018-10-2113:34:56history>hist
Remove leading whitespace
3007  2018-10-21 13:33:58 less hist 
3008  2018-10-21 13:34:56 history > hist

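%%bash

echo "To replace only from the Nth match onward, combine a number with g, e.g. 3g (GNU sed)"
echo "sksksksksksk" | sed 's/sk/SK/3g'

To replace only from the Nth match onward, combine a number with g, e.g. 3g (GNU sed)
skskSKSKSKSK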
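For GTF data, a typical use of s is pulling a value out of the attribute column with a capture group. The following sketch builds one demo.gtf-style line with printf so it runs on its own; the /data/ and /backup/ paths and the variable names are purely illustrative.

```shell
line=$(printf 'AB\tTwinscan\texon\t150\t200\t.\t+\t.\tgene_id "AB.0"; transcript_id "AB.0.1";')

# -E enables extended regular expressions; \1 is the text captured by the first (...) group.
gene_id=$(printf '%s\n' "$line" | sed -E 's/.*gene_id "([^"]+)".*/\1/')

# A different delimiter avoids escaping when the pattern itself contains slashes.
path=$(echo '/data/demo.gtf' | sed 's#/data/#/backup/#')

# -n together with the p flag prints only the lines where a substitution happened.
only_changed=$(printf 'exon\nCDS\n' | sed -n 's/exon/EXON/p')

echo "$gene_id $path $only_changed"
```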

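n : print the pattern space (unless -n is set), then replace it with the next input line

%%bash
# The first three commands handle the input two lines at a time
echo "Read a line, print it automatically, read the next line over it, then delete that line with d"
seq 11 | sed  'n;d'

echo "Print the line with p, let n print it again automatically and read the next line, then print that line automatically"
seq 11 | sed   'p;n'

echo "With -n: print the line with p, then let n read the next line over it; nothing is printed automatically"
seq 11 | sed -n 'p;n'


echo "Read a line, print it automatically, read the next line over it, then print that line automatically too"
seq 11 | sed 'n'

Read a line, print it automatically, read the next line over it, then delete that line with d
1
3
5
7
9
11
Print the line with p, let n print it again automatically and read the next line, then print that line automatically
1
1
2
3
3
4
5
5
6
7
7
8
9
9
10
11
11
With -n: print the line with p, then let n read the next line over it; nothing is printed automatically
1
3
5
7
9
11
Read a line, print it automatically, read the next line over it, then print that line automatically too
1
2
3
4
5
6
7
8
9
10
11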

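N : append the next input line to the pattern space, separated by \n

%%bash

echo "Read two lines (separated by \n) into the pattern space, then delete it, so nothing is printed"
echo "The last line has no partner; GNU sed prints it when N finds no more input"
seq 11 | sed 'N;d'

echo "Prints nothing"
seq 10 | sed 'N;d' 

echo "Read two lines (separated by \n) into the pattern space, print them with p, then print them again automatically"
seq 10 | sed 'N;p' 

# seq 11 | sed 'N;p'

Read two lines (separated by \n) into the pattern space, then delete it, so nothing is printed
The last line has no partner; GNU sed prints it when N finds no more input
11
Prints nothing
Read two lines (separated by \n) into the pattern space, print them with p, then print them again automatically
1
2
1
2
3
4
3
4
5
6
5
6
7
8
7
8
9
10
9
10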
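A common practical use of N is joining consecutive lines. The sketch below joins pairs of lines with a comma and shows the $!N idiom, which skips N on the last line instead of relying on what sed does when N runs out of input. GNU sed is assumed and the variable names are illustrative.

```shell
# N appends the next line; s then replaces the embedded newline with a comma.
pairs=$(seq 4 | sed 'N;s/\n/,/')

# With an odd number of lines, GNU sed prints the unpaired last line when N finds no more input.
odd=$(seq 3 | sed 'N;s/\n/,/')

# $!N runs N on every line except the last, so the result does not depend on that behaviour.
odd_guarded=$(seq 3 | sed '$!N;s/\n/,/')

printf '%s\n' "$pairs"
```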

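{} : group several commands so that they run under the same address

%%bash
echo "Find the line starting with rstudio, append the next line, and print the pattern space."
sed -n '/^rstudio/{N;p}' passwd

Find the line starting with rstudio, append the next line, and print the pattern space.
rstudio-server:x:992:990::/home/rstudio-server:/bin/bash
sunchengquan:x:1000:1000::/home/sunchengquan:/bin/bash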

Advanced sed: worked examples

!seq 4 | sed 'n;d'

1
3

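Explanation: sed reads 1 into the pattern space. The n command prints it (automatic printing is on) and then reads the next line, 2, over it. d deletes the pattern space (2), so nothing more is printed in that cycle. The same happens with 3 and 4, so the output is 1\n3.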

!seq 4 | sed 'n'

!seq 4 | sed 'n;p'

!seq 5 | sed 'n;d'

1
3
5

!seq 4 | sed 'N;d'

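N effectively reads two lines at once, separated by \n. More precisely: a line is read into the pattern space, then N appends the next line to it, so the pattern space holds two lines separated by \n.

Explanation: sed reads 1, then N appends the next line, giving 1\n2, which d deletes. The next cycle reads 3 (a newly read line always replaces the previous contents), N turns it into 3\n4, and that is deleted as well. Nothing is printed.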

!seq 7 | sed 'N;d'

Understanding n and N

%%bash
seq 4 | sed 'n'
echo 'n'
seq 4 | sed -n 'n'
echo 'N'
seq 4 | sed -n 'N'

Neither of the last two commands prints any input lines: -n turns off automatic printing of the pattern space.

%%bash
seq 4 | sed -n 'n;p'

2
4

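Explanation: 1 is read into the pattern space; n would normally print it, but -n suppresses that. n then reads the next line, 2, over it, and p prints 2. Automatic printing at the end of the cycle is suppressed as well. The same happens with 3 and 4.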

%%bash
seq 4 | sed -n 'N;p'
# seq 4 | sed  'N;p'

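Explanation: N builds 1\n2 in the pattern space, p prints it, and automatic printing is suppressed. The next cycle does the same with 3\n4, so each of the lines 1 to 4 is printed once.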

%%bash
seq 5 | sed -n 'N;p'

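N first builds 1\n2 and p prints it; then N builds 3\n4 and p prints it; then 5 is read, but there is no line 6, so N fails. GNU sed would print the pattern space at that point, but -n suppresses it, and sed exits without running p. Line 5 is therefore not printed.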

%%bash
seq 5 | sed  'n;p'

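1 is read into the pattern space and printed by n, which then reads 2 over it; p prints 2, and automatic printing prints 2 again. The same happens with 3 and 4. On line 5, n finds no further input, so GNU sed prints 5 and exits. The output is 1, 2, 2, 3, 4, 4, 5.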

%%bash
seq 5 | sed  -n 'n;p'

2
4

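1 is read into the pattern space and not printed because of -n; n reads 2 over it, p prints 2, and automatic printing is suppressed. The same gives 4. On line 5, n finds no further input and sed exits, so the output is 2 and 4.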

Understanding x

%%bash
# seq 11 | sed -n 'x;p' == seq 11 | sed 'x'

seq 11 | sed 'x'

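Explanation: in the first cycle x swaps 1 in the pattern space with the empty hold space, so an empty line is printed. Every later cycle prints the previous line and stores the current one. 11 is never printed, because it is still in the hold space when the input ends.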

%%bash
seq 4 | sed '/3/{x;p;x}'

seq 4 | sed -n '/3/{x;p;x}'

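Explanation: when a line matches 3, x swaps the two spaces, so the pattern space is empty and the hold space holds 3. p prints the empty line, and the second x swaps the contents back, so the pattern space is 3 again. In the first command 3 is then printed automatically, which is why an empty line appears before 3. With -n, only the empty line printed by p is output.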

%%bash
seq 4 | sed '/3/{x;p;x;d}'

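Explanation: after the second x swaps 3 back into the pattern space, d deletes it, so 3 disappears and only the extra empty line remains.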
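The same x;p;x pattern becomes useful once the hold space carries the previous line. The sketch below prints the line that precedes a match; the four feature names are sample input chosen for this illustration, and GNU sed is assumed.

```shell
# If the line matches: swap in the previous line (kept in the hold space), print it, swap back.
# The trailing h then saves the current line so it is the "previous line" in the next cycle.
before=$(printf 'exon\nCDS\nstart_codon\nstop_codon\n' | sed -n '/start_codon/{x;p;x};h')

echo "$before"
```

If the very first line matches, the hold space is still empty, so an empty line is printed in that case.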

Understanding h and H

!seq 4 | sed 'h;x'

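Explanation: h copies the pattern space to the hold space, x swaps the two (now identical) spaces, and the pattern space is printed, so the output equals the input.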

!seq 4 | sed 'x;h'

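Explanation: x swaps the two spaces first, so the pattern space becomes empty and the hold space holds 1. h then copies the empty pattern space over the hold space, and the empty pattern space is printed. The hold space is therefore empty again at the start of every cycle, and every output line is empty.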

Understanding G and g

!seq 4 | sed '/3/g'

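Explanation: on the line matching 3, g overwrites the pattern space with the (empty) hold space, so an empty line is printed in place of 3.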

%%bash
seq 3 | sed  '1!G'

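Explanation: G runs on every line except the first and appends the hold space to the pattern space. The hold space is empty, so lines 2 and 3 are each followed by an empty line.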

!seq 3 | sed '1!G;h;$!d'

3
2
1

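Explanation: on line 1, G is skipped; h copies the pattern space to the hold space (now 1), and d deletes the pattern space. On line 2, G appends the hold space, giving 2\n1; h copies that to the hold space, and d deletes the pattern space. On line 3, G gives 3\n2\n1; h copies it again, and because this is the last line, $!d does not delete it, so 3\n2\n1 is printed. The result is the input in reverse order.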
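The line-reversal trick can be written in two equivalent ways, as sketched below. The second variant is an alternative formulation (not taken from the original post) that turns off automatic printing and prints only on the last line.

```shell
# 1!G : on every line but the first, append the hold space to the pattern space
# h   : copy the pattern space (all lines so far, newest first) to the hold space
# $!d : delete the pattern space on every line but the last, where it is printed
reversed=$(seq 3 | sed '1!G;h;$!d')

# Same accumulation, but with -n and an explicit p on the last line.
reversed_n=$(seq 3 | sed -n '1!G;h;$p')

printf '%s\n' "$reversed"
```

Both variants keep the whole input in the hold space, so for very large files a dedicated tool such as tac is the more economical choice.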

%%bash

echo 'Add an empty line after every line'
seq 3 | sed 'G'

Add an empty line after every line
1

2

3

%%bash

echo 'Delete every line except the last, leaving only the last line'
seq 3 |sed '$!d'

Delete every line except the last, leaving only the last line
3