与 Deseq2 斗智斗勇

本文总阅读量

今天要做一个差异表达分析,理所当然的用上 Deseq2 包。

此次的分析方式与之前有差别,需要先进行 fpkm 标准化后再进行差异表达分析,于是读入 counts 矩阵:

rm(list = ls())

counts <- read.csv(
    'rna-seq.counts',
    header = TRUE,
    sep = '\t',
    # row.names = "Geneid",
    comment.char = '#',
    check.names = FALSE
)

然后使用一个 for 循环进行 fpkm 计算:

for (clm in colnames(counts)[6:ncol(counts)]) {
    col_fpkm <- paste0(clm, "_FPKM")     # 新列的名称,加上"_FPKM"后缀
    total <- sum(counts[, clm])          # 计算每个样本的总读数
    counts[col_fpkm] <- (counts[, clm] * 10^6) / (counts[, "Length"] / 1000)  # 使用相应样本的长度值计算 FPKM 并添加 FPKM 列
}

计算之后,删去矩阵里没用的数据和原来的 counts 矩阵:

counts = counts[,-c(2:19)]

看了一下数据,有很多的 0,为了防止意外,全部给它加一个不怎么影响结果的数值 1,然后再写入文件:

numeric_mask <- sapply(counts, is.numeric)
counts[numeric_mask] <- lapply(counts[numeric_mask], function(x) ifelse(x < 1, x + 1, x))

write.table(counts, file = 'fpkm_output', sep = '\t', row.names = FALSE)

拿到 fpkm 矩阵后开始进行差异表达分析,还是使用之前的代码:

# 数据示例:
# "Geneid"	"Normal-1-1_FPKM"	"Normal-1-2_FPKM"	"Normal-1-3_FPKM"	"HS-1-1_FPKM"	"HS-1-2_FPKM"	"HS-1-3_FPKM"	"Normal-2-1_FPKM"	"Normal-2-2_FPKM"	"Normal-2-3_FPKM"	"HS-2-1_FPKM"	"HS-2-2_FPKM"	"HS-2-3_FPKM"
# "Os01g0100100"	254855195.911414	242248722.316865	273253833.049404	587052810.902896	594548551.959114	539011925.042589	186030664.39523	145826235.093697	125383304.940375	103918228.279387	110391822.827939	100851788.756388
# "Os01g0100200"	1	887311.446317657	887311.446317657	1774622.89263531	3549245.78527063	5323868.67790595	1	887311.446317657	1	1	1	1
# "Os01g0100300"	1	1	1	1	1	1	1	1	1	1	1	1
# "Os01g0100400"	48588616.3813049	32855159.648311	51365108.7459509	136973623.322536	130032392.410921	124479407.681629	75890791.3003239	51827857.473392	59694585.8398889	9254974.54881999	16658954.187876	15270708.005553
# "Os01g0100466"	1	1	1	1	1	1	1	1	1	1	1	1
# "Os01g0100500"	610261026.10261	571107110.711071	669216921.692169	919441944.19442	886138613.861386	851485148.514851	698019801.980198	548154815.481548	536903690.369037	458595859.585959	527452745.274527	504500450.045004
# "Os01g0100600"	188775510.204082	179591836.734694	202040816.326531	398469387.755102	369387755.102041	339795918.367347	218367346.938776	152551020.408163	142346938.77551	1.75e+08	186224489.795918	137755102.040816
# 使用 fpkm 标准化之后的数据

# setwd('./pub_share/')
rm(list = ls())  
Sys.setenv(LANGUAGE = "en")

library(DESeq2)

fpkm = read.csv(
    'fpkm_output', 
    header = T,  
    sep = '\t', 
    row.names = "Geneid", 
    comment.char = '#', 
    check.name = F
)

samples = data.frame(
    sampleID = c("Normal-1-1_FPKM", "Normal-1-2_FPKM", "Normal-1-3_FPKM", "HS-1-1_FPKM", "HS-1-2_FPKM", "HS-1-3_FPKM", "Normal-2-1_FPKM", "Normal-2-2_FPKM", "Normal-2-3_FPKM", "HS-2-1_FPKM", "HS-2-2_FPKM", "HS-2-3_FPKM"), 
    sample = c("sample1", "sample1", "sample1", "sample2", "sample2", "sample2", "sample3", "sample3", "sample3", "sample4", "sample4", "sample4")
)

rownames(samples) = samples$sampleID

samples$sample = factor(samples$sample, levels = c('sample1', 'sample2', 'sample3', 'sample4'))

dds = DESeqDataSetFromMatrix(countData = fpkm, colData = samples, design = ~sample)

dds_count <- DESeq(dds, fitType = 'mean', minReplicatesForReplace = 7, parallel = FALSE)

sampl1_vs_sample2 <- results(dds_count, contrast = c('sample', 'sample1', 'sample2'))

sampl3_vs_sample4 <- results(dds_count, contrast = c('sample', 'sample3', 'sample4'))

result1 <- data.frame(sampl1_vs_sample2, stringsAsFactors = FALSE, check.names = FALSE)
result2 <- data.frame(sampl3_vs_sample4, stringsAsFactors = FALSE, check.names = FALSE)

write.table(result1, 'sampl1_vs_sample2.DESeq2.txt', col.names = NA, sep = '\t', quote = FALSE)
write.table(result2, 'sampl3_vs_sample3.DESeq4.txt', col.names = NA, sep = '\t', quote = FALSE)

报错说是 Deseq2 处理的数据得是整数,于是加上:

fpkm <- round(fpkm)

将数值四舍五入为整数。

这里突然报错说有 NA 空值converting counts to integer mode Error in validObject(.Object) : invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix In addition: Warning message: In mde(x) : NAs introduced by coercion to integer range,那就奇怪了,我不是全部加了 1 吗?于是这里判断空值,给空值加 1,再加一个判断,如果加 1 后还是空值那就直接删掉该行:

numeric_mask <- sapply(fpkm, is.numeric)
fpkm[numeric_mask] <- lapply(fpkm[numeric_mask], function(x) ifelse(is.numeric(x) & x < 1, x + 100, x))

missing_values <- sum(is.na(fpkm))
if (missing_values > 0) {
  fpkm <- na.omit(fpkm)
}

还是会报同样的错误 … 于是 Google 了一下,发现原来是某些比较大的数值仍然会保留小数点,例如:

as.character(x)
[1] "8903.00000000001" "946.999999999999"
[3] "9113.99999999999" "9451.00000000001"
[5] "9156.99999999999" "875.999999999999"

直接给我无语住了,于是根据网上的资料加上:

fpkm = apply(fpkm, 2, as.integer)

成功跑出差异表达分析,查看分析结果的时候突然发现,原本第一列的基因 ID 居然全部变成了数字??

排查了半天才发现,哦,原来fpkm = apply(fpkm, 2, as.integer)会直接把 ID 也给转化数字 … 于是把这一句改成:

fpkm[-1, ] <- apply(fpkm[-1, ], 2, as.integer)

问题就解决了~真的费劲~