content.json

{"meta":{"title":"日拱一卒","subtitle":null,"description":"I like the world and I like my friends","author":null,"url":"http://yuwq.pw","root":"/"},"pages":[{"title":"","date":"2019-08-05T09:39:30.000Z","updated":"2019-09-29T02:53:42.918Z","comments":true,"path":"about/index.html","permalink":"http://yuwq.pw/about/index.html","excerpt":"","text":"博主九月份博士二年级。建立这个博客的目的在于记录我平时的学习情况，方便对知识的梳理，也为了构建一个完整的学习知识体系。 博客的主要内容以生物信息学的基础材料为主，也会放入一些教程。内容是博客的灵魂，精品在于长期的积累。后期博客的主要内容我计划分为四部分:生物数据挖掘，编程及算法基础，文献分享和生活记录。"},{"title":"文章分类","date":"2019-08-05T19:42:25.000Z","updated":"2019-09-29T02:53:42.985Z","comments":true,"path":"categories/index.html","permalink":"http://yuwq.pw/categories/index.html","excerpt":"","text":""},{"title":"标签","date":"2019-08-05T19:42:25.000Z","updated":"2019-09-29T02:53:43.139Z","comments":false,"path":"tags/index.html","permalink":"http://yuwq.pw/tags/index.html","excerpt":"","text":""}],"posts":[{"title":"reshape数据整理","slug":"reshape数据整理","date":"2019-10-08T01:25:33.000Z","updated":"2019-10-08T01:26:52.543Z","comments":true,"path":"post/reshape数据整理/","link":"","permalink":"http://yuwq.pw/post/reshape数据整理/","excerpt":"","text":"1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677dir.create(\"data\")download.file(url = \"http://jaredlander.com/data/US_Foreign_Aid.zip\", destfile = \"data/ForeignAid.zip\")unzip(\"data/ForeignAid.zip\", exdir = \"data\")require(stringr)theFiles &lt;- dir(\"data/\", pattern = \"\\\\.csv\")for (a in theFiles)&#123; nameToUse &lt;- str_sub(string = a, start = 12, end = 18) temp &lt;- read.table(file = file.path(\"data\", a), header = TRUE, sep = \",\", stringsAsFactors = FALSE) assign(x = nameToUse, value = temp)&#125;## 合并Aid90s00s &lt;- merge(x = Aid_90s, y = Aid_00s, by.x = c(\"Country.Name\", \"Program.Name\"), by.y = c(\"Country.Name\", \"Program.Name\"))head(Aid90s00s)## plyr 中的joinrequire(plyr)Aid90s00sJoin &lt;- join(x = Aid_90s, y = 
Aid_00s, by = c(\"Country.Name\", \"Program.Name\"))head(Aid90s00sJoin)frameNames &lt;- str_sub(string = theFiles, start = 12, end = 18)frameList &lt;- vector(\"list\", length(frameNames))names(frameList) &lt;- frameNamesfor (a in frameNames)&#123; frameList[[a]] &lt;- eval(parse(text = a))&#125;head(frameList[[1]])head(frameList[[\"Aid_00s\"]])### 把列表中所有的元素合并allAid &lt;- Reduce(function(...) &#123; join(..., by = c(\"Country.Name\", \"Program.Name\"))&#125;, frameList)require(useful)corner(allAid, c=15)bottomleft(allAid, c = 15)## 合并表 (data.table键访问内存方式)require(data.table)dt90 &lt;- data.table(Aid_90s, key = c(\"Country.Name\", \"Program.Name\"))dt00 &lt;- data.table(Aid_00s, key = c(\"Country.Name\", \"Program.Name\"))dt0090 &lt;- dt90[dt00] # dt90在左边，dt00在右边## reshape2# melt函数 （列-&gt;行）head(Aid_00s)require(reshape2)melt00 &lt;- melt(Aid_00s, id.vars = c(\"Country.Name\", \"Program.Name\"), variable.name = \"Year\", value.name = \"Dollars\")tail(melt00, 10)require(scales)melt00$Year &lt;- as.numeric(str_sub(melt00$Year, start = 3, end = 6))meltAgg &lt;- aggregate(Dollars ~ Program.Name + Year, data = melt00, sum, na.rm=TRUE)ggplot(meltAgg, aes(x=Year, y=Dollars)) + geom_line(aes(group=Program.Name)) + facet_wrap(~ Program.Name) + scale_x_continuous(breaks = seq(from=2000, to=2009, by=2)) + theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 0)) + scale_y_continuous(labels = multiple_format(extra=dollar, multiple=\"B\"))# dcast函数（行-&gt;列）cast00 &lt;- dcast(melt00, Country.Name + Program.Name ~ Year, value.var = \"Dollars\")head(cast00)","categories":[],"tags":[]},{"title":"R中常见的小问题","slug":"R中常见的小问题","date":"2019-09-29T08:51:33.000Z","updated":"2019-10-05T14:39:50.724Z","comments":true,"path":"post/R中常见的小问题/","link":"","permalink":"http://yuwq.pw/post/R中常见的小问题/","excerpt":"来源：http://blog.genesino.com/collections/R_tips/记录无法归类的小问题及解决方法factor12factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = 
NA)","text":"来源：http://blog.genesino.com/collections/R_tips/记录无法归类的小问题及解决方法factor12factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA) 参数注释： x：是向量，通常是由少量唯一值的字符向量 levels：水平，字符类型，用于设置x可能包含的唯一值，默认值是x的所有唯一值。如果x不是字符向量，那么使用as.character(x)把x转换为字符向量，然后获取x向量的水平。x向量的取值跟levels有关。 labels：是水平的标签，字符类型，用于对水平添加标签，相当于对因子水平重命名； exclude：排除的字符 ordered：逻辑值，用于指定水平是否有序； nmax：水平的上限数量 例如，因子sex的值是向量c(‘f’,’m’,’f’,’f’,’m’)，因子水平是c(‘f’,’m’)： 1234&gt; sex &lt;- factor(c('f','m','f','f','m'),levels=c('f','m'))&gt; sex[1] f m f f mLevels: f m data.frame中因子转数值型如果一个data.frame中的元素是factor，想转化成numeric，比如d[1,1]是factor正确做法是：先as.character(x)，再as.numeric(x) 如果直接as.numeric，就不是以前的数字。 as.data.frame()转换as.data.frame()有一个参数stringsAsFactors。如果stringAsFactor=F，就不会把字符转换为factor。看起来是数字变成了character，原来是character的还是character。 do.call123dat &lt;- list(matrix(1:25, ncol = 5), matrix(4:28, ncol = 5), matrix(21:45, ncol=5))dat_cbind &lt;- do.call(cbind,dat)dat_rbind &lt;- do.call(rbind,dat) 解释：第一行产生了一个包含三个矩阵的list，第二行将这三个list按照列合并成一个矩阵，第三行将这三个list按照行合并成矩阵。所以结合示例do.call`函数的功能理解为，对dat对象执行cbind操作。 Tips：do.call函数，应用在大规模数据时，速度实在令人发指。推荐使用dplyr包的bind_rows函数 查看R函数代码 循环读取文件 123456## 方法一 list.files list.files(pattern = \"*.transcript.SAM\")## 方法二 Sys.globrequire(data.table)datafiles&lt;-lapply(Sys.glob(\"*.transcript.SAM\"),fread) 如果是R的内部函数，直接输入函数名字，即可查看函数的代码 123456789101112131415161718192021222324&gt; colMeansfunction (x, na.rm = FALSE, dims = 1L) &#123; if (is.data.frame(x)) x &lt;- as.matrix(x) if (!is.array(x) || length(dn &lt;- dim(x)) &lt; 2L) stop(\"'x' must be an array of at least two dimensions\") if (dims &lt; 1L || dims &gt; length(dn) - 1L) stop(\"invalid 'dims'\") n &lt;- prod(dn[id &lt;- seq_len(dims)]) dn &lt;- dn[-id] z &lt;- if (is.complex(x)) .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + (0+1i) * .Internal(colMeans(Im(x), n, prod(dn), na.rm)) else .Internal(colMeans(x, n, prod(dn), na.rm)) if (length(dn) &gt; 1L) &#123; dim(z) &lt;- dn dimnames(z) &lt;- 
dimnames(x)[-id] &#125; else names(z) &lt;- dimnames(x)[[dims + 1L]] z&#125;&lt;bytecode: 0x2122250&gt;&lt;environment: namespace:base&gt; 如果是S4函数，则需要使用getMethod(function_name, package_name) 123456789101112&gt; showMethods('MeanVarPlot')Function: MeanVarPlot (package Seurat)object=\"seurat\"&gt; getMethod(\"MeanVarPlot\", \"seurat\")Method Definition:function (object, fxn.x = expMean, fxn.y = logVarDivMean, do.plot = TRUE, set.var.genes = TRUE, do.text = TRUE, x.low.cutoff = 0.1, x.high.cutoff = 8, y.cutoff = 1, y.high.cutoff = Inf, cex.use = 0.5, cex.text.use = 0.5, do.spike = FALSE, pch.use = 16, col.use = \"black\", spike.col.use = \"red\", plot.both = FALSE) sapply usage 12345678910111213141516171819202122232425262728293031323334353637&gt; a &lt;- as.data.frame(matrix(rnorm(30), ncol=3))&gt; aV1 V2 V31 1.1678261 0.535765512 -0.00027893832 1.4408018 0.006156163 -0.89262044613 -0.7577270 -0.252982299 0.76330471534 -0.6555118 -0.940734927 0.55866414985 1.6814423 0.536600480 0.09658088796 -1.5529560 -1.491656309 -0.14048982167 -0.2791699 -0.405854634 -0.68914479798 -0.5111633 1.071639283 0.44928345149 -0.0406343 0.243810629 0.909292486810 -1.4827207 -0.333623245 -0.2155860373&gt; a$Group = c(rep('A',5), rep('B',5))&gt; aV1 V2 V3 Group1 1.1678261 0.535765512 -0.0002789383 A2 1.4408018 0.006156163 -0.8926204461 A3 -0.7577270 -0.252982299 0.7633047153 A4 -0.6555118 -0.940734927 0.5586641498 A5 1.6814423 0.536600480 0.0965808879 A6 -1.5529560 -1.491656309 -0.1404898216 B7 -0.2791699 -0.405854634 -0.6891447979 B8 -0.5111633 1.071639283 0.4492834514 B9 -0.0406343 0.243810629 0.9092924868 B10 -1.4827207 -0.333623245 -0.2155860373 B&gt; my_function &lt;- function(x) &#123; + A &lt;- x[a$Group==\"A\"] + B &lt;- x[a$Group==\"B\"] + t.test(A, B)$p.value + &#125;&gt; sapply(X=a[,!(names(a) %in% c(\"Group\"))], FUN=my_function, simplify = T)V1 V2 V3 0.0675659 0.7597376 0.9180267 &gt; t.test(a$V1[1:5],a$V1[6:10])$p.value[1] 0.0675659 Automatically install packages if not exist 
12345usePackage &lt;- function(p) &#123; if (!is.element(p, installed.packages()[,1])) install.packages(p, dep = TRUE) require(p, character.only = TRUE)&#125; read.table读入的行少于实际行Sometimes when you found the lines reading by read.table smaller than real line number, please check it you have &quot;&quot; or &#39;&#39; in your file. 12### Always set no quotea &lt;- read.table(file, sep=\"\\t\", header=T, row.names=1, quote=\"\") 通过字符读入文件Read in data from string rather than files 12345string=\"a\\tb\\tcd\\te\\tfg\\th\\ti\"data &lt;- read.table(text=string, sep=\"\\t\") 使用Aggregate进行分组计算Aggregate by one column of dataframe. 12345678910111213141516171819202122&gt; ID &lt;- c(\"a\", \"b\", \"c\", \"b\", \"c\", \"d\", \"e\")&gt; A &lt;- c(1:7)&gt; B &lt;- c(3:9)&gt; C &lt;- c(9:3)&gt; test &lt;- data.frame(ID, A, B, C)&gt; testID A B C1 a 1 3 92 b 2 4 83 c 3 5 74 b 4 6 65 c 5 7 56 d 6 8 47 e 7 9 3&gt; a = aggregate(test[2:4], by=test[1], FUN=mean)&gt; aID A B C1 a 1 3 92 b 3 5 73 c 4 6 64 d 6 8 45 e 7 9 3 Aggregate by an external variable 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748&gt; a &lt;- \"ID;GrpA.1;AA.2;AA.3;AB.3;BB.4;BC.1;CB.1;BB.2;BC.2;C\"&gt; &gt; b &lt;- \"ID;A.1;A.2;A.3;B.1;B.2;B.3;B.4;C.1;C.2a;1;2;3;8;2;3;4;4;2b;1;3;3;1;3;3;4;1;2c;2;4;3;1;5;3;6;2;2\"&gt; &gt; sampFile &lt;- read.table(text=a, sep=';', row.names=1, header=T)&gt; &gt; mat &lt;- read.table(text=b, sep=';', header=T,row.names=1)&gt; mat_t &lt;- t(mat)&gt; mat_ta b cA.1 1 1 2A.2 2 3 4A.3 3 3 3B.1 8 1 1B.2 2 3 5B.3 3 3 3B.4 4 4 6C.1 4 1 2C.2 2 2 2&gt; &gt; Grp &lt;- sampFile[match(rownames(mat_t), rownames(sampFile)),1]&gt;&gt; #The variable given to `by` in `aggregate` must be a list &gt; mat_mean &lt;- aggregate(mat_t, by=list(Grp=Grp), FUN=mean)&gt; &gt; mat_mean_grp &lt;- mat_mean$Grp&gt; &gt; mat_mean_final &lt;- do.call(rbind, mat_mean)[-1,]&gt; &gt; colnames(mat_mean_final) &lt;- mat_mean_grp&gt; &gt; mat_mean_finalA B Ca 2.000000 4.25 3.0b 2.333333 
2.75 1.5c 3.000000 3.75 2.0 条件填充数据表 12345678910111213141516171819202122232425262728293031323334&gt; A &lt;- c(1:9)&gt; B &lt;- c(11:5,13,14)&gt; C &lt;- c(9:1)&gt; test &lt;- data.frame(A, B, C)&gt; test&gt; A B C&gt; 1 1 11 9&gt; 2 2 10 8&gt; 3 3 9 7&gt; 4 4 8 6&gt; 5 5 7 5&gt; 6 6 6 4&gt; 7 7 5 3&gt; 8 8 13 2&gt; 9 9 14 1&gt; lod_generate3 &lt;- function(x)&#123;&gt; + for(i in 2:length(x))&#123;&gt; + if(x[i]&lt;x[1])&gt; + x[i] &lt;- round(runif(1,min=0,max=x[1]))&gt; + &#125;&gt; + x&gt; + &#125;&gt; apply(test, 2, lod_generate3)&gt; A B C&gt;&gt; [1,] 1 11 9&gt; [2,] 2 9 6&gt; [3,] 3 9 7&gt; [4,] 4 2 8&gt; [5,] 5 9 6&gt; [6,] 6 10 2&gt; [7,] 7 7 7&gt; [8,] 8 13 2&gt; [9,] 9 14 5 每一列的数除以该列的总和123456789101112131415161718192021222324252627&gt; a &lt;- data.frame('a'=c(1,2,3,4),'b'=c(1,2,3,4),'d'=2:5)&gt; aa b d1 1 1 22 2 2 33 3 3 44 4 4 5&gt; colSums(a)a b d 10 10 14 &gt; a/colSums(a) #Wronga b d1 0.1000000 0.1000000 0.14285712 0.2000000 0.1428571 0.30000003 0.2142857 0.3000000 0.40000004 0.4000000 0.4000000 0.3571429&gt; t(a)/colSums(a) #Half-right[,1] [,2] [,3] [,4]a 0.1000000 0.2000000 0.3000000 0.4000000b 0.1000000 0.2000000 0.3000000 0.4000000d 0.1428571 0.2142857 0.2857143 0.3571429&gt; t(t(a)/colSums(a)) #Righta b d[1,] 0.1 0.1 0.1428571[2,] 0.2 0.2 0.2142857[3,] 0.3 0.3 0.2857143[4,] 0.4 0.4 0.3571429 取出共同的列1234567891011121314151617181920212223242526&gt; a &lt;- data.frame('a'=1:5,'b'=2:6,'c'=round(runif(5,min=0, max=2)),'d'=sample(1:10,5))&gt; aa b c d1 1 2 2 62 2 3 2 83 3 4 1 14 4 5 1 25 5 6 2 4&gt; b = data.frame('b'=round(rnorm(5, mean=50, sd=10)),'e'=rep(1,5),'d'=round(runif(5,min=0, max=10)),'c'=sample(1:10,5, replace=T))&gt; bb e d c1 33 1 5 72 62 1 8 63 26 1 8 14 63 1 8 65 43 1 9 4&gt; ?match(x, y)# Select elements existed in x for each in y and ordered as in x# Remove elements only existed in y&gt; b[,na.omit(match(colnames(a),colnames(b)))]b c d1 33 7 52 62 6 83 26 1 84 63 6 85 43 4 9 pairwise.t.test for a 
matrix123456789101112131415161718192021222324252627282930313233343536373839&gt; data = data.frame(Group=c(rep('a',20),rep('b',20),rep('c',20)), A=runif(60, min=0, max=60), B=c(sample(1:10,20,replace=T), sample(20:30,20,replace=T), c(sample(1:30,20, replace=T))))&gt; dataGroup A B1 a 8.3522445 62 a 22.9813777 43 a 11.5574241 84 a 57.5316085 65 a 20.2775717 2. . . .. . . .21 b 36.8333789 2522 b 23.5413342 2423 b 41.6235628 2624 b 27.5968927 2525 b 48.6045175 20. . . .. . . .58 c 51.0684425 3059 c 4.0294234 2760 c 22.6168908 27&gt; my_function &lt;- function(x) &#123;+ pvalue_m = pairwise.t.test(x, data$Group, pool.sd = F)$p.value+ pvalue_m &lt;- as.data.frame(pvalue_m)+ pvalue_m$id &lt;- rownames(pvalue_m)+ pvalue_m &lt;- melt(pvalue_m, id.vars=c('id'))+ name_combine = paste(pvalue_m$id, pvalue_m$variable,sep='.vs.')+ pvalue_m &lt;- as.data.frame(pvalue_m$value)+ rownames(pvalue_m) &lt;- name_combine+ pvalue_m+ #colnames(pvalue_m)[colnames(pvalue_m)==\"value\"] = name_col+ #x+ &#125;&gt; p.value &lt;- apply(X=data[,-1], 2,FUN=my_function)&gt; p.value &lt;- do.call(cbind, p.value)&gt; colnames(p.value) &lt;- colnames(data[,-1])&gt; t(p.value)b.vs.a c.vs.a b.vs.b c.vs.bA 8.670387e-01 0.454526764 NA 0.4545267642B 3.111305e-22 0.008677007 NA 0.0001068359 t.test &amp; pairwise.t.test refThe problem is not in the p-value correction, but in the (declaration of the) variance assumptions. You have used var.equal=T in your t.test calls and pooled.sd=FALSE in your paired.t.test calls. However, the argument for paired.t.test is pool.sd, not pooled.sd. 
Changing this gives p-values equivalent to the individual calls to t.test 12pairwise.t.test(df$freq, df$class, p.adjust.method=\"none\", paired=FALSE, pool.sd=FALSE) Several ggplot pic together 12345678910111213141516171819202122data &lt;- c(1:6,6:1,6:1,1:6, (6:1)/10,(1:6)/10,(1:6)/10,(6:1)/10,2:7,7:2,6:1,1:6, 6:1,1:6,3:8,7:2)data &lt;- as.data.frame(matrix(data, ncol=12, byrow=T))data$type &lt;- c(rep(\"Gene Expression\",2), rep(\"DNA methylation\",2), rep(\"H3K4me3\",2), rep(\"H3K27me3\",2))colnames(data) &lt;- c(\"Zygote\",\"2_cell\",\"4_cell\",\"8_cell\",\"Morula\",\"ICM\",\"ESC\",\"4 week PGC\",\"7 week PGC\",\"10 week PGC\",\"17 week PGC\", \"OOcyte\", \"type\")data$ID &lt;- rep(c(\"gene1\",\"gene2\"),4)library(reshape2)library(ggplot2)data_m &lt;- melt(data, id.vars=c(\"type\",\"ID\"))data_m$type &lt;- factor(data_m$type, levels=c(\"Gene Expression\", \"DNA methylation\", \"H3K4me3\",\"H3K27me3\"))library(gridExtra)out &lt;- by(data=data_m, INDICES=data_m$type, FUN=function(m) &#123; m &lt;- droplevels(m) p &lt;- ggplot(m, aes(x=variable,y=ID)) + xlab(NULL) + labs(title=levels(m$type)) + theme_bw() + theme(panel.grid.major = element_blank()) + theme(legend.key=element_blank()) + theme(axis.text.x=element_text(angle=45,hjust=1, vjust=1)) + theme(legend.position=\"right\") + geom_tile(aes(fill=value)) + scale_fill_gradient(low = \"white\", high = \"red\") &#125;)do.call(grid.arrange,c(out, ncol=1)) 123456789101112grid_plot = function(m, hline)&#123; ID = unique(m$Metabolites) coords = hline[[ID]]$coord text = hline[[ID]]$text p &lt;- ggplot(m, aes(x=Samples, y=Concentration, color=Year, group=Year)) p &lt;- p + geom_line(size=1, alpha=0.6) + labs(title=ID) + theme(legend.position = \"right\") + expand_limits(y=0)+ theme(axis.text.x=element_text(angle=45,hjust=1, vjust=1)) + geom_hline(yintercept = coords, linetype=\"dotted\", size=0.5) + annotate(\"text\", y=coords, x=0, label=text, vjust=0, hjust=0)&#125; hline = list(H1=list(coord=c(5000), text=c(5000)), 
Glu=list(coord=c(50), text=c(50)), Arg..Arg.Orn.=list(coord=c(0.5), text=c(0.5))) out &lt;- by(data=ctrl.m, INDICES=ctrl.m$Metabolites, FUN=grid_plot,hline)do.call(grid.arrange,c(out, ncol=1)) 查看R包的版本 installed.packages()[c(&quot;SC3&quot;), c(&quot;Package&quot;, &quot;Version&quot;)] 移除安装包 remove.packages(c(&#39;package_name&#39;)) 去加载已经加载的包 detach(&quot;package:package_name&quot;) 判断一个变量是否存在 12345if(exists(\"debug\"))&#123; debug=FALSE&#125; else &#123; debug=TRUE&#125; stop and warn 12warning(\"output a message after a function finishes\")stop(\"stops the execution of the function and outputs an error message\") Extract all numeric columns 1new_df &lt;- df[sapply(df, is.numeric)] r-studio usages 12rstudio-server start/stop/restartps -u user | grep 'rsession' # Kill this process when rstuido-server becomes unresponsive merge dataframes 12library(data.table)merge(a, b, all.x=T) 12345678910111213141516171819202122232425262728293031323334353637383940414243444546a &lt;- \"ID;GrpA.1;AA.2;AA.3;AB.3;BB.4;BC.1;CB.1;BB.2;BC.2;C\"b &lt;- \"ID;A.1;A.2;A.3;B.1;B.2;B.3;B.4;C.1;C.2a;1;2;3;8;2;3;4;4;2b;1;3;3;1;3;3;4;1;2c;2;4;3;1;5;3;6;2;2\"sampFile &lt;- read.table(text=a, sep=';', row.names=1, header=T)mat &lt;- read.table(text=b, sep=';', header=T,row.names=1)mat_t &lt;- t(mat)mat_t&gt; c = merge(sampFile, mat_t, by=0)&gt; #c = merge(sampFile, mat_t, by=\"row.names\") #Both work&gt; cRow.names Grp a b c1 A.1 A 1 1 12 A.2 A 2 2 23 A.3 A 3 3 34 B.1 B 1 1 15 B.2 B 2 2 26 B.3 B 3 3 37 B.4 B 4 4 48 C.1 C 1 1 19 C.2 C 2 2 2&gt; c = dataframe(c[,-1], row.names=c[,1])&gt; cGrp a b cA.1 A 1 1 1A.2 A 2 2 2A.3 A 3 3 3B.1 B 1 1 1B.2 B 2 2 2B.3 B 3 3 3B.4 B 4 4 4C.1 C 1 1 1C.2 C 2 2 2 strsplit 12345sample &lt;- c(\"a_samp1_1\", \"a_samp1_2\", \"a_samp1_3\", \"a_samp2_1\", \"a_samp2_2\", \"a_samp2_3\")# 把样品名字按 &lt;_&gt; 分割，取出其第二部分作为样品的组名# lapply(X, FUC) 对列表或向量中每个元素执行FUC操作，FUNC为自定义或R自带的函数## One better way to generate groupgroup &lt;- unlist(lapply(strsplit(sample, \"_\" ), function(x) 
x[2])) Multiple rows or columns legend 12gg+guides(fill=guide_legend(nrow=2, byrow=TRUE))gg+guides(fill=guide_legend(ncol=2)) 条件替换数据表 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364&gt; a &lt;- data.frame(a=1:4,b=1:4,c=1:4)&gt; aa b c1 1 1 12 2 2 23 3 3 34 4 4 4&gt; a[a$b&lt;3,\"b\"] &lt;- 3&gt; aa b c1 1 3 12 2 3 23 3 3 34 4 4 4&gt; a &lt;- within(a, a[a&lt;4] &lt;- 2)&gt; aa b c1 2 3 12 2 3 23 2 3 34 4 4 4&gt; a = matrix(1:20, nrow=4)&gt; a[,1] [,2] [,3] [,4] [,5][1,] 1 5 9 13 17[2,] 2 6 10 14 18[3,] 3 7 11 15 19[4,] 4 8 12 16 20&gt; a &lt;- as.data.frame(a)&gt; a$a = letters[1:4]&gt; aV1 V2 V3 V4 V5 a1 1 5 9 13 17 a2 2 6 10 14 18 b3 3 7 11 15 19 c4 4 8 12 16 20 d&gt; a[,1:4][a[,1:4]&gt;4] &lt;- 0&gt; aV1 V2 V3 V4 V5 a1 1 0 0 0 17 a2 2 0 0 0 18 b3 3 0 0 0 19 c4 4 0 0 0 20 d&gt; a[,-6][a[,-6]&gt;4] &lt;- 0&gt; aV1 V2 V3 V4 V5 a1 1 0 0 0 0 a2 2 0 0 0 0 b3 3 0 0 0 0 c4 4 0 0 0 0 d&gt; a[,-6][a[,-6]!=0] &lt;- 1&gt; aV1 V2 V3 V4 V5 a1 1 0 0 0 0 a2 1 0 0 0 0 b3 1 0 0 0 0 c4 1 0 0 0 0 d&gt; a[c(\"V1\",\"V2\")][a[c(\"V1\",\"V2\")]==0] &lt;- 2&gt; aV1 V2 V3 V4 V5 a1 1 2 0 0 0 a2 1 2 0 0 0 b3 1 2 0 0 0 c4 1 2 0 0 0 d 移除特定的行 12345678910111213141516171819&gt; a[,1] [,2] [,3] [,4][1,] 0.8248820 -1.3022177 0.6119348 -0.04987367[2,] -1.0353643 0.7053093 -0.4677782 0.53749134[3,] 0.3773115 0.6229525 1.4935924 1.50909417[4,] 1.3755883 -0.2864933 -0.3077768 -0.12330547[5,] 0.1286202 -0.9517153 -0.7522629 -0.13442884&gt; a[apply(a,1,function(x) &#123;mad(x)&gt;0.5&#125;),][,1] [,2] [,3] [,4][1,] 0.8248820 -1.3022177 0.6119348 -0.04987367[2,] -1.0353643 0.7053093 -0.4677782 0.53749134[3,] 0.3773115 0.6229525 1.4935924 1.50909417[4,] 0.1286202 -0.9517153 -0.7522629 -0.13442884&gt; a[apply(a,1,function(x) &#123;any(x&lt;0)&#125;),][,1] [,2] [,3] [,4][1,] 0.8248820 -1.3022177 0.6119348 -0.04987367[2,] -1.0353643 0.7053093 -0.4677782 0.53749134[3,] 1.3755883 -0.2864933 -0.3077768 -0.12330547[4,] 0.1286202 -0.9517153 
-0.7522629 -0.13442884 colorRampPalette: generate color vectors from given colors colfunc <- colorRampPalette(c(\"black\", \"white\")); colfunc(10) [1] \"#000000\" \"#1C1C1C\" \"#383838\" \"#555555\" \"#717171\" \"#8D8D8D\" \"#AAAAAA\" [8] \"#C6C6C6\" \"#E2E2E2\" \"#FFFFFF\" Trace through columns ref apply(cities, 2, FUN=function(x) HoltWinters(x=x, gamma=FALSE)); apply(cities, 2, HoltWinters, gamma=FALSE) Extract one column from a data.frame and keep the result a data.frame: df[, 1, drop=FALSE] Batch effects ref In a literal sense, getting a matrix of batch-corrected counts is not possible: once the batch effects have been removed, the values are no longer counts. To batch correct, it is necessary to first transform the counts to a pseudo-continuous scale; then you can use batch correction methods developed for microarrays. This is how we usually do it. First, put the counts in a DGEList object: library(edgeR); y <- DGEList(counts=counts) Filter non-expressed genes: A <- aveLogCPM(y); y2 <- y[A>1,] Then normalize and compute log2 counts-per-million with an offset: y2 <- calcNormFactors(y2); logCPM <- cpm(y2, log=TRUE, prior.count=5) Then remove the batch effect from the log-CPM matrix: logCPMc <- removeBatchEffect(logCPM, batch) Here batch is a vector or factor taking a different value for each batch group. You can input two batch vectors. Now you can cluster the samples, for example by: plotMDS(logCPMc) Variations on this would be to use rpkm() instead of cpm(), or to give removeBatchEffect() a design matrix of known groups that are not batch effects. 
Convert numbers to dates > library(xlsx) > Days <- read.xlsx2(\"Y.xlsx\", sheetIndex = 1, header=T, stringsAsFactors=F) > head(Days) SampleID X.Datum.1.BE Date.of.1st.PD Date.of.death.last.follow.up 1 181_29 40294 40969 41562 2 182_26 40281 40483 41192 3 183_27 40287 40923 41562 4 184_32 40297 41014 41562 5 185_38 40323 40430 40585 6 186_40 40324 40378 41563 > Days$X.Datum.1.BE = as.Date(as.numeric(Days$X.Datum.1.BE), origin = \"1899-12-30\") > Days$Date.of.1st.PD = as.Date(as.numeric(Days$Date.of.1st.PD), origin = \"1899-12-30\") > head(Days) SampleID X.Datum.1.BE Date.of.1st.PD Date.of.death.last.follow.up 1 181_29 2010-04-26 2012-03-01 2013-10-15 2 182_26 2010-04-13 2010-11-01 2012-10-10 3 183_27 2010-04-19 2012-01-15 2013-10-15 4 184_32 2010-04-29 2012-04-15 2013-10-15 5 185_38 2010-05-25 2010-09-09 2011-02-11 6 186_40 2010-05-26 2010-07-19 2013-10-16 RStudio: set the dynamic library path and other environment variables Sys.getenv() (lists all environment variables); Sys.getenv('LD_LIBRARY_PATH'); Sys.setenv(LD_LIBRARY_PATH=paste(\"/my_lib_dir\", Sys.getenv('LD_LIBRARY_PATH'), sep=\":\")) Maximal number of DLLs reached: in /miniconda2/envs/r/lib/R/etc/Renviron set R_MAX_NUM_DLLS=1000 Remove one value from a vector a[a != 4] Do not convert numbers to scientific notation options(scipen=999) Rmarkdown to markdown rmarkdown::render(\"05.biotools.Rmd\", output_format = \"md_document\", output_file = \"test.md\") curl does not work git clone github_package; R CMD build github_package; R CMD INSTALL github_package; or, from within R: install.packages(\"github_url\", repos=NULL, type=\"source\")","categories":[],"tags":[]},{"title":"要掌握的技能[持续更新中]","slug":"要掌握的技能","date":"2019-09-29T06:06:01.000Z","updated":"2019-10-02T09:05:54.641Z","comments":true,"path":"post/要掌握的技能/","link":"","permalink":"http://yuwq.pw/post/要掌握的技能/","excerpt":"我自己找的内容计算机基础：awk： 生物信息 awk 简明教程和基本用法 生物信息 awk 用法进阶 python: 30段极简Python代码 MySQL： 1000行MySQL学习笔记 绘图用：python绘图： Matplotlib 可视化的 50 个图表 R绘图： 
ggplot2(一)‖基本概念 ggplot2高效实用指南 (可视化脚本、工具、套路、配色) 图片的组合与拼接 ggplot2字体设置 主成分分析图层、置信区间 深入绘制热图","text":"我自己找的内容计算机基础：awk： 生物信息 awk 简明教程和基本用法 生物信息 awk 用法进阶 python: 30段极简Python代码 MySQL： 1000行MySQL学习笔记 绘图用：python绘图： Matplotlib 可视化的 50 个图表 R绘图： ggplot2(一)‖基本概念 ggplot2高效实用指南 (可视化脚本、工具、套路、配色) 图片的组合与拼接 ggplot2字体设置 主成分分析图层、置信区间 深入绘制热图 色彩搭配： 图表色彩运用原理 文章图排版 文章用图的修改和排版 文章用图的修改和排版(2) 文献挖掘 pubmed年度趋势（做ppt的时候需要） 转自公众号生信宝典的内容：生信宝典文章集锦 http://blog.genesino.com/2100/01/shengxinbaodian/ 以下是详细内容，Jump to… 程序学习心得 生物信息之程序学习 如何优雅的提问 Linux 学习 Linux学习-文件和目录 Linux学习-文件操作 Linux文件内容操作 Linux学习-环境变量和可执行属性 Linux学习 - 管道、标准输入输出 Linux学习 - 命令运行监测和软件安装 Linux学习-常见错误和快捷操作 Linux学习-文件列太多，很难识别想要的信息在哪列；别焦急，看这里。 Linux学习-文件排序和FASTA文件操作 用了Docker，妈妈再也不担心我的软件安装了 - 基础篇 Linux服务器数据定期同步和备份方式 R统计和作图 R语言学习 - 入门环境Rstudio R语言学习 - 入门环境Rstudio R语言学习 - 热图绘制 (heatmap) R语言学习 - 基础概念和矩阵操作 R语言学习 - 热图简化 R语言学习 - 热图美化 R语言学习 - 线图绘制 R语言学习 - 线图一步法 R语言学习 - 箱线图（小提琴图、抖动图、区域散点图） R语言学习 - 箱线图一步法 R语言学习 - 火山图 R语言学习 - 富集分析泡泡图 （文末有彩蛋） R语言学习 - 散点图绘制 一文看懂PCA主成分分析 富集分析DotPlot，可以服 R语言学习 - 韦恩图 R语言学习 - 柱状图 NGS基础 NGS基础 - FASTQ格式解释和质量评估 NGS基础 - 高通量测序原理 NGS基础 - 参考基因组和基因注释文件 NGS基础 - GTF/GFF文件格式解读和转换 本地安装UCSC基因组浏览器 测序数据可视化 (一) 测序文章数据上传找哪里 39个转录组分析工具，120种组合评估(转录组分析工具哪家强-导读版) 39个转录组分析工具，120种组合评估(转录组分析工具大比拼 （完整翻译版）) Python学习 Python学习极简教程 （一） Python学习教程（二） Python学习教程（三） Python学习教程 （四） Python学习教程（五） Python学习教程 （六） Pandas，让Python像R一样处理数据，但快 NGS软件 Rfam 12.0+本地使用 （最新版教程） 轻松绘制各种Venn图 ETE构建、绘制进化树 psRobot：植物小RNA分析系统 生信软件系列 - NCBI使用 Cytoscape网络图 Cytoscape教程1 Cytoscape之操作界面介绍 新出炉的Cytoscape视频教程 分子对接 来一场蛋白和小分子的风花雪月 不是原配也可以-对接非原生配体 简单可视化-送你一双发现美的眼睛 你需要知道的那些前奏 生信宝典之傻瓜式 生信宝典之傻瓜式 (一) 如何提取指定位置的基因组序列 生信宝典之傻瓜式 (二) 如何快速查找指定基因的调控网络 生信宝典之傻瓜式 (三) 我的基因在哪里发光 - 如何查找基因在发表研究中的表达 生信人写程序 生信人写程序1. Perl语言模板及配置 生信人写程序2. 
Editplus添加Perl, Shell, R, markdown模板和语法高亮","categories":[],"tags":[]},{"title":"TCGA 生存信息","slug":"TCGA-生存信息","date":"2019-08-22T07:50:24.000Z","updated":"2019-09-29T09:15:12.657Z","comments":true,"path":"post/TCGA-生存信息/","link":"","permalink":"http://yuwq.pw/post/TCGA-生存信息/","excerpt":"overall survival (OS) 总生存期 recurrence free survival (RFS) 无复发生存期","text":"overall survival (OS) 总生存期 recurrence free survival (RFS) 无复发生存期 TCGA STAD的表型基础信息统计以胃癌的生存信息为例，一共有6列。 sample：样本名称 X_EVENT： 1表示死亡，0表示审查中，null表示没有数据 X_PATIENT： X_TIME_TO_EVENT： X_OS_IND： X_OS：如果有死亡，相当于days to death；如果仍然存活，时间点为days_to_last_known_alive,和days_to_last_followup中较大值。 基本的生存信息统计： 506个样本，来自于418个病例，其中170人已经死亡，剩余248人存活状态 1234567891011121314151617181920212223242526## 从UCSC xena下载生存和表型数据survival &lt;- read.table(\"../phenotype/TCGA-STAD.survival.tsv.gz\",header = T, sep = \"\\t\", quote = \"\", fill = T)phe=read.table('../phenotype/TCGA-STAD.GDC_phenotype.tsv.gz',header = T,sep = '\\t',quote = \"\",fill = T)dim(phe)## [1] 544 137dim(survival)## [1] 506 6length(unique(phe.filter$submitter_id.samples))## [1] 506 phe.filter &lt;- phe[phe$submitter_id.samples %in% survival$sample,]## phe.filter中样本是一一对应的，两者都含有506个样本。## phe多出来的一些样本是新提取的肿瘤来源的数据。## 为什么要重新提取？可能是经过治疗后出现了新的肿瘤，重新采样## （new_tumor_event_after_initial_treatment）dim(phe.filter) ## [1] 506 137length(unique(survival$sample))## 506length(unique(survival$X_PATIENT))## 418tmp &lt;- unique(survival[,c(2,3)])table(tmp$X_EVENT)## 0 1 ## 248 170 survival[survival$X_EVENT==survival$X_OS_IND,] UCSC Cancer Browser team curate the overall survival (OS) and recurrence free survival (RFS) information from the TCGA clinical and phenotypic data. Overall Survival (OS) The event call is derived from “vital status” parameter. The time_to_event is in days, equals to days_to_death if patient deceased; in the case of a patient is still living, the time variable is the maximum(days_to_last_known_alive, days_to_last_followup). 
This pair of clinical parameters is called _EVENT and _TIME_TO_EVENT on the cancer browser. Recurrence Free Survival (RFS) The event call is derived from the “new_tumor_event_after_initial_treatment” parameter. The time_to_event is in days and equals max(days_to_new_tumor_event_after_initial_treatment, days_to_tumor_recurrence) if there is an event; if there is no event, the time variable is the overall survival time. This pair of clinical parameters is called _RFS and _RFS_IND on the cancer browser. KM plot If there is OS data, the browser KM plot will display by default. Users can use the KM plot advanced option to select other clinical variables, such as _RFS and _RFS_IND, for the KM plot. Example: TCGA bladder cancer recurrence free survival KM plot https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/#?bookmark=090a0ed5cf614b35d1002c0ddeb044aa and then click the KM button Download OS or RFS data for survival statistical analysis You can download the OS and OS_IND and the RFS and RFS_IND pairs of data through clinical download, as well as categorical clinical variables (e.g. PAM50 subtype), for survival analysis. The downloaded data is a text file in a format that can easily be used by R (e.g. survdiff in the survival package) to derive a p value. Please select the “entire clinical cohort” option when downloading clinical data for survival analysis. 
Reference: https://groups.google.com/forum/#!topic/ucsc-cancer-genomics-browser/YvKnWZSsw1Q","categories":[],"tags":[]},{"title":"Creating a Python virtual environment","slug":"python-venv","date":"2019-08-18T09:02:30.000Z","updated":"2019-09-29T08:50:30.684Z","comments":true,"path":"post/python-venv/","link":"","permalink":"http://yuwq.pw/post/python-venv/","excerpt":"virtualenv creates an isolated Python runtime environment, so that packages of different versions do not interfere with one another.","text":"virtualenv creates an isolated Python runtime environment, so that packages of different versions do not interfere with one another. Install virtualenv with pip: pip install virtualenv Create an isolated Python environment: mkdir project_A; cd project_A; virtualenv venv (environments are isolated from the system site-packages by default, so the old --no-site-packages flag is unnecessary and is rejected by virtualenv 20 and later) Enter the virtual environment: source venv/bin/activate Install third-party packages inside the virtual environment: pip install numpy Leave the virtual environment: deactivate","categories":[{"name":"python","slug":"python","permalink":"http://yuwq.pw/categories/python/"}],"tags":[{"name":"virtualenv","slug":"virtualenv","permalink":"http://yuwq.pw/tags/virtualenv/"}]},{"title":"download_TCGA_from_GDC","slug":"肿瘤 download-TCGA-from-GDC","date":"2019-08-08T09:04:08.000Z","updated":"2019-09-29T09:15:01.595Z","comments":true,"path":"post/肿瘤 download-TCGA-from-GDC/","link":"","permalink":"http://yuwq.pw/post/肿瘤 download-TCGA-from-GDC/","excerpt":"","text":"","categories":[],"tags":[]},{"title":"Setting repository mirrors and installing R packages","slug":"R-packages-installation","date":"2019-08-08T01:18:38.000Z","updated":"2019-09-29T08:48:42.337Z","comments":true,"path":"post/R-packages-installation/","link":"","permalink":"http://yuwq.pw/post/R-packages-installation/","excerpt":"Using the survival-analysis package survminer as an example, this post walks through the usual workflow for installing R packages.","text":"Using the survival-analysis package survminer as an example, this post walks through the usual workflow for installing R packages. Set repo to accelerate downloading For CRAN: options(repos=structure(c(CRAN=\"https://mirrors.tuna.tsinghua.edu.cn/CRAN/\"))) For Bioconductor: options(BioC_mirror=\"http://mirrors.ustc.edu.cn/bioc/\") Installation and loading Install from CRAN as follows: install.packages(\"survminer\") Install from 
Bioconductor (survminer is not on Bioconductor, so an arbitrary other package serves as the example here): if (!requireNamespace(\"BiocManager\", quietly = TRUE)) install.packages(\"BiocManager\"); if (!requireNamespace(\"ivygapSE\", quietly = TRUE)) BiocManager::install(\"ivygapSE\") Or, install the latest version from GitHub: if (!require(devtools)) install.packages(\"devtools\"); devtools::install_github(\"kassambara/survminer\", build_vignettes = FALSE) Load survminer: library(\"survminer\") Other package-related operations: show the current mirror with getOption(\"repos\"); show R_HOME with R.home() (the mirror lists are R_HOME/doc/CRAN_mirrors.csv and R_HOME/doc/BioC_mirrors.csv); show the package library paths with .libPaths(); list installed packages with installed.packages(); show a package's version with packageVersion(\"package_name\"); update a package with update.packages(oldPkgs = \"package_name\"); load a package with library(\"package_name\") or require(\"package_name\"); list the loaded packages with (.packages()); detach a loaded package from the session with detach(\"package:package_name\", unload = TRUE); permanently delete an installed package with remove.packages(\"package_name\", lib = file.path(\"path/to/library\")). Install packages automatically: usePackage <- function(p) { if (!is.element(p, installed.packages()[,1])) { options(repos=structure(c(CRAN=\"https://mirrors.tuna.tsinghua.edu.cn/CRAN/\"))); install.packages(p, dependencies = TRUE) }; require(p, character.only = TRUE) }; usePackage(\"ggplot2\") Reference: https://www.jianshu.com/p/9e503a4e3563?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source=recommendation","categories":[{"name":"R语言","slug":"R语言","permalink":"http://yuwq.pw/categories/R语言/"}],"tags":[{"name":"R","slug":"R","permalink":"http://yuwq.pw/tags/R/"},{"name":"installation","slug":"installation","permalink":"http://yuwq.pw/tags/installation/"}]},{"title":"git基础代码","slug":"git基础","date":"2019-08-06T03:29:09.000Z","updated":"2019-09-29T08:50:02.297Z","comments":true,"path":"post/git基础/","link":"","permalink":"http://yuwq.pw/post/git基础/","excerpt":"Git基础查看文件版本库中的文件状态：1git status文件处于未修改状态修改文件，再次查看版本库中文件的状态对比文件修改前和修改后的变化：1git diff根据提示，我们能看到，文本的修改为添加了一行，即文字开头有加号的地方。如果删除一行，文字开头为减号。 
提交修改内容与提交新文件12git add 文件名git commit -m \"add a line\"","text":"Git基础查看文件版本库中的文件状态：1git status文件处于未修改状态修改文件，再次查看版本库中文件的状态对比文件修改前和修改后的变化：1git diff根据提示，我们能看到，文本的修改为添加了一行，即文字开头有加号的地方。如果删除一行，文字开头为减号。 提交修改内容与提交新文件12git add 文件名git commit -m \"add a line\" 再次查看提交后，再次查看版本库当前状态，会看到文件没有需要提交的更改。 撤销操作有时提交完成后发现漏掉几个文件没有添加，或者信息写错了，这时候就需要对操作进行撤销。 12345git commit --amend### 例子git commit -m 'xxxxx'git add filegit commit --amend 撤销对文件的修改 如果不想保留对某一个文件的修改，则可以使用git checkout -- filename来撤销对文件的修改。但是这个命令只是拷贝了另一个文件来覆盖它，文件的所有更改都将消失。 来源：https://blog.csdn.net/yidu_fanchen/article/details/78663359 git 高级内容 https://laozhu.me/post/git-submodule-tutorial/ 阮一峰的博客 https://www.bookstack.cn/read/git-tutorial/README.md 从mac拷贝到windows10上进入目录后，需要执行以下命令，否则会出错： 1git config --global user.email \"your@example.com\"","categories":[],"tags":[{"name":"git","slug":"git","permalink":"http://yuwq.pw/tags/git/"}]},{"title":"两种方法构建signatures（一）","slug":"肿瘤 mutational signatures的构建","date":"2019-08-06T00:01:23.000Z","updated":"2019-09-29T09:14:52.302Z","comments":true,"path":"post/肿瘤 mutational signatures的构建/","link":"","permalink":"http://yuwq.pw/post/肿瘤 mutational signatures的构建/","excerpt":"","text":"MutationalPatterns中进行因式分解用到了R包NMF。确定因式分解的秩至关重要，要避免过度分解和分解不全的情况。nmfEstimateRank函数帮助确定最佳的秩。 Function nmfEstimateRank helps in choosing an optimal rank by implementing simple approaches proposed in the literature.","categories":[{"name":"Cancer genome","slug":"Cancer-genome","permalink":"http://yuwq.pw/categories/Cancer-genome/"},{"name":"Mutational Signatures","slug":"Cancer-genome/Mutational-Signatures","permalink":"http://yuwq.pw/categories/Cancer-genome/Mutational-Signatures/"}],"tags":[{"name":"Mutational Signatures","slug":"Mutational-Signatures","permalink":"http://yuwq.pw/tags/Mutational-Signatures/"},{"name":"tumor","slug":"tumor","permalink":"http://yuwq.pw/tags/tumor/"},{"name":"NMF","slug":"NMF","permalink":"http://yuwq.pw/tags/NMF/"}]}]}