content.json

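{"pages":[{"title":"About Me","text":"Currently working at ByteDance. For referral inquiries and application status checks, add me on WeChat: badaoyishan, or send your resume directly to zhangzhongzheng@bytedance.com","link":"/about/index.html"}],"posts":[{"title":"Community Detection Algorithm - Clique Percolation","text":"Introduction The k-clique percolation method (CPM) [1] was the first algorithm able to discover overlapping communities, where a node can belong to several communities at the same time. Overlapping communities are very common in social networks, because each person has many kinds of social relationships. Algorithm A maximal clique in a network is a clique (every pair of its nodes is connected by an edge) that is not contained in any other clique. The idea of CPM is simple. First, it finds all maximal cliques of size at least k in the network. Then it builds a clique graph in which each maximal clique is a node, and two cliques c1 and c2 are connected by an edge if they share min(|c1|,|c2|)-1 nodes. Finally, each connected component of the clique graph corresponds to one community of nodes, and these communities may overlap. For the code, see the link in the post. Parallelization The maximal clique mining step can be rewritten in MapReduce form; see [3] for details. For the code, see the link in the post. References [1] Uncovering the overlapping community structure of complex networks in nature and society [2] The worst-case time complexity for generating all maximal cliques and computational experiments [3] Efficient Dense Structure Mining using MapReduce","link":"/2021/02/06/algorithm/cpm/"},{"title":"kmeans","text":"Introduction k-means is probably the simplest clustering algorithm. Its objective is to minimize the sum of squared distances from every data point to its nearest cluster center. k-means is a centroid-based method: it assumes clusters are spherical and is solved with an EM-style procedure. Its drawbacks are that it is sensitive to noise, cannot find clusters of arbitrary shape, and gives unstable results. Objective: $J = \\sum_{j=1}^{k}\\sum_{x \\in C_j}\\lVert x - \\mu_j \\rVert^2$, where $C_j$ is the j-th cluster and $\\mu_j$ is its center. Algorithm 1. Randomly pick k data points as the initial cluster centers. 2. Assign every other data point to the cluster whose center is nearest. 3. Update each cluster center to the mean of all data points in that cluster. 4. Repeat steps 2 and 3 until the algorithm converges. k-means++ Because k-means depends heavily on the initial centers, and random selection can easily produce unevenly distributed centers, k-means++ controls how the initial centers are generated so that they are well spread out. Specifically, it picks the initial centers one by one, so that each new center tends to be as far as possible from the centers already chosen. The initialization works as follows: 1. Randomly pick one center. 2. For every other data point x, compute the distance to its nearest existing center, denoted D(x). 3. Pick the next center from the data points at random, where each point x is chosen with probability $D(x)^2 / \\sum_{x'} D(x')^2$. 4. Repeat steps 2 and 3 until k centers have been chosen. References k-means++: The Advantages of Careful Seeding","link":"/2022/01/31/algorithm/kmeans/"},{"title":"Community Detection Algorithm - Louvain","text":"Introduction The Louvain algorithm [1] is based on multi-level optimization of modularity [2]. It is fast and accurate, and [3] considers it one of the best-performing community detection algorithms. Modularity was originally used to measure the quality of the result of a community detection algorithm, since it captures how tightly knit the detected communities are. Because it captures this, it can also serve as an objective function: a node is moved into the community of one of its neighbors whenever doing so increases the modularity of the current community structure. Modularity is defined as: $Q = \\frac{1}{2m}\\sum_{i,j}\\left[A_{ij} - \\frac{k_i k_j}{2m}\\right]\\delta(c_i,c_j)$ where m is the number of edges in the network, A is the adjacency matrix, $k_i$ is the degree of node i, $c_i$ is the community of node i, and $\\delta(c_i,c_j)$ is 1 if $c_i$ and $c_j$ are the same and 0 otherwise. If the current node is the only member of its community, there is a trick that speeds up computing the change in modularity when it joins another community C, and Louvain owes part of its efficiency to it. Derived from the definition above, the change is: $\\Delta Q = \\frac{k_{i,in}}{m} - \\frac{\\Sigma_{tot} k_i}{2m^2}$ where $k_{i,in}$ is the number of edges from node i to nodes in C and $\\Sigma_{tot}$ is the sum of the degrees of the nodes in C. 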
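The Louvain algorithm consists of two phases. In phase one, it repeatedly iterates over the nodes of the network and tries to move each single node into the community that yields the largest modularity gain, until no node changes any more. In phase two, it takes the result of phase one and merges each small community into a super-node to rebuild the network; the weight of the edge between two super-nodes is the sum of the weights of the edges between the original nodes they contain. The two phases are repeated until the algorithm stabilizes. Its execution flow is shown in the figure. Implementation GraphX is a graph processing framework on Spark. On top of RDD it wraps VertexRDD and EdgeRDD, and these two wrapped RDDs together form the graph structure; see the official website for details. For the GraphX and Python implementations, see the links in the post. References [1]. Fast unfolding of communities in large networks [2]. Finding community structure in very large networks [3]. Community detection algorithms: A comparative analysis","link":"/2021/02/05/algorithm/louvain/"},{"title":"Community Detection Algorithm - Label Propagation","text":"Introduction The idea of the basic label propagation algorithm (LPA) [1] is very simple: put every node in the same community as the majority of its neighbors. The procedure is as follows. Initialization: every node carries a unique label. Then update each node's label to the label held by most of its neighbors, choosing at random when there is a tie. Iterate until no node's label changes. LPA is simple and fast, running in near-linear time: 5 iterations are enough to stabilize the labels of 95% of the nodes. Its drawback is that the result is unstable, so repeated runs may give different results. Because the basic label propagation algorithm sometimes forms overly large (“monster”) communities, [2] proposes attenuating labels as they hop. Initially every label has weight 1.0; when a node's label is updated, it is set to the label with the largest weight among its neighbors' labels, and that weight is reduced by a fraction. References [1]. Near linear time algorithm to detect community structures in large-scale networks [2]. Towards Real-Time Community Detection in Large Networks","link":"/2021/02/05/algorithm/lpa/"},{"title":"Hive Quick Tutorial - Data Analysis","text":"Hive is a SQL execution engine on top of HDFS that translates SQL statements into MapReduce jobs running on Hadoop. Because you just write SQL, using Hive for data analysis has almost no extra learning cost, but it processes data in batches and can be slow. This post introduces how to use Hive through a few cases. Sample data ** Randomly generate a batch of order data (order_id, seller_id, price, tag, order_date) ** from random import randint from datetime import date, timedelta for i in range(1000): order_id = 'order_%s' % i seller_id = 'seller_%s' % randint(0, 300) price = randint(0, 100000) / 100.0 tag = randint(0, 1) order_date = date.today() - timedelta(days=randint(0, 30)) print(order_id, seller_id, price, tag, order_date) ** Load the data into Hive ** hive&gt; create table test_order_sample(order_id string, seller_id string, price double, tag int, order_date string) row format delimited fields terminated by ' '; hive&gt; load data local inpath '/data/order_sample' into table test_order_sample; Case 1 ** For each day of the past week, compute the number of successfully paid orders, the GMV, and the average order value ** hive&gt; select 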
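order_date,count(*),round(sum(price),2),round(avg(price),2) from test_order_sample where tag=1 and order_date&gt;=date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),7) group by order_date order by order_date desc; Case 2 ** For each day of the past week, compute the number of orders, the GMV, and the average order value separately for successful and failed payments ** select order_date,sum(if(tag=0,1,0)),sum(if(tag=0,price,0)),avg(if(tag=0,price,null)),sum(if(tag=1,1,0)),sum(if(tag=1,price,0)),avg(if(tag=1,price,null)) from test_order_sample where order_date&gt;=date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),7) group by order_date order by order_date desc; Combine aggregate functions with if conditions instead of joining two SQL queries. Note that the where clause must not filter on tag, and avg needs null rather than 0 for non-matching rows, since avg skips nulls while zeros would drag the average down. Case 3 ** Select the seller IDs, together with their orders, whose GMV over the past week is &gt;1000 and whose order count is &gt;2 ** hive&gt; select seller_id,collect_set(order_id) from test_order_sample where tag=1 and order_date&gt;=date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),7) group by seller_id having count(*)&gt;2 and sum(price)&gt;1000; Common UDFs Aggregation functions collect_set(c_1) After a group by, you can normally select only the group keys and aggregate statistics, but collect_set lets you select any non-key column as a set. For example, after grouping by seller ID you may want to see the IDs of the buyers who ordered from that seller: select collect_set(buyer_id) from t group by seller_id. collect_list(c_1) Like collect_set, but elements may repeat. explode(c_1) The explode function flattens array-typed data. For example, if every row holds a set of seller_ids, explode turns it into one seller_id per row. However, explode cannot be used directly together with group by. For example, to filter some sellers by certain conditions and then look at the buyers of those shops: select explode(b.buyer_ids) from (select collect_set(buyer_id) as buyer_ids from t group by seller_id) b; Time functions unix_timestamp() Returns the current time as a Unix timestamp. from_unixtime(timestamp, format) Converts a Unix timestamp into a human-readable format, e.g. select from_unixtime(unix_timestamp(), 'yyyy-MM-dd'); date_sub(string startdate, int days) Returns the date a given number of days before startdate. Others nvl(v1, v2) nvl handles null values. When a field is null, arithmetic between it and other fields still yields null. In that case, use this function to give a possibly-null field a default value, namely v2. instr(str1, 'xxx') Returns the 1-based position of the first occurrence of 'xxx' in str1; it returns 0 if 'xxx' does not occur and null if str1 is null. size(a1) Returns the size of array a1. union all Merges the results of two queries; the number of columns in both results must match.","link":"/2021/02/17/bigdata/hive1/"},{"title":"Fixing a hexo theme bug","text":"Problem While using the snippet theme, the pagination on the home page behaved abnormally, looking like this: on the first page an extra button appears at the end; on the last page an extra button appears at the front. Solution step 1. Inspect the page source to locate the faulty block; the problematic part is the pagination step 2. 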
Look through the theme code to find which part generates the faulty markup; checking index.ejs shows that it is what renders the pagination step 3. Inspect the pagination code &lt;% if (page.total &gt; 1){ %&gt; &lt;nav class=\"pagination\" role=\"navigation\"&gt; &lt;div id=\"page-nav\"&gt; &lt;%- paginator({ prev_text: \"&lt;i class='fa fa-angle-left'&gt;&lt;/i&gt;\", next_text: \"&lt;i class='fa fa-angle-right'&gt;&lt;/i&gt;\" }) %&gt; &lt;/div&gt; &lt;/nav&gt; &lt;% } %&gt; step 4. Try changing prev_text and next_text to empty strings step 5. The result is as expected. Done","link":"/2021/02/16/hexo/hexo%E4%B8%BB%E9%A2%98bug%E4%BF%AE%E5%A4%8D/"},{"title":"hexo tips","text":"Various hexo features. Create an about page: hexo new page \"about\" Add multiple tags: JSON-like syntax, e.g. tags: [tag1,tag2]","link":"/2021/02/01/hexo/hexo%E5%B0%8F%E6%8A%80%E5%B7%A7/"},{"title":"【LeetCode 142】Linked List Cycle II","text":"Problem Given a linked list, return the node where the cycle begins. If there is no cycle, return null. Note: Do not modify the linked list. Approach Proof: Suppose p1 (slow, one step at a time) and p2 (fast, two steps at a time) meet after time t, by which point p1 has gone around the cycle n1 times and p2 n2 times. Let l be the cycle length, x the distance from the head to the cycle entry, y the distance from the cycle entry to the meeting point, and z the distance from the meeting point onward to the cycle entry. Then 2t - t = (n2 - n1) * l =&gt; t = (n2 - n1) * l. Also t + z = x + (n1 + 1) * l. Substituting t = (n2 - n1) * l gives z + (n2 - n1) * l = x + (n1 + 1) * l, so x = z + (n2 - 2*n1 - 1) * l. Therefore, if p1 keeps walking from the meeting point while another pointer (p3 in the code) starts from the head, both moving one step at a time, they next meet at the cycle entry. Code # Definition for singly-linked list. # class ListNode(object): # def __init__(self, x): # self.val = x # self.next = None class Solution(object): def detectCycle(self, head): \"\"\" :type head: ListNode :rtype: ListNode \"\"\" p1 = head p2 = head hasCycle = False while p2 is not None: if p2.next == p1: hasCycle = True p1 = p1.next break elif p2.next is None: return None p2 = p2.next.next p1 = p1.next if hasCycle: p3 = head while p1 != p3: p1 = p1.next p3 = p3.next return p3 else: return None","link":"/2021/02/01/leetcode/142/"},{"title":"【LeetCode 394】Decode String","text":"Problem Given an encoded string, return its decoded string. The encoding rule is: k[encoded_string], where the encoded_string inside the square brackets is being repeated exactly k times. 
Note that k is guaranteed to be a positive integer. You may assume that the input string is always valid; No extra white spaces, square brackets are well-formed, etc. Furthermore, you may assume that the original data does not contain any digits and that digits are only for those repeat numbers, k. For example, there won’t be input like 3a or 2[4]. Examples: s = “3[a]2[bc]”, return “aaabcbc”. s = “3[a2[c]]”, return “accaccacc”. s = “2[abc]3[cd]ef”, return “abcabccdcdcdef”. Code class Solution(object): def decodeString(self, s): \"\"\" :type s: str :rtype: str \"\"\" stack = [] i = 0 while i &lt; len(s): if s[i] == ']': ll = [] t = stack.pop() while True: if t.isdigit(): stack.append(''.join(ll * int(t))) break elif t == '[': pass else: ll.insert(0, t) t = stack.pop() elif s[i].isdigit(): ll = [] while s[i].isdigit(): ll.append(s[i]) i += 1 stack.append(''.join(ll)) continue else: stack.append(s[i]) i += 1 return ''.join(stack)","link":"/2021/02/01/leetcode/394/"},{"title":"【LeetCode 7】Reverse Integer","text":"Problem Reverse digits of an integer. Example1: x = 123, return 321 Example2: x = -123, return -321 Code public int reverse(int x) { long sum = 0; while (x != 0) { sum = sum*10 + x%10; if (sum&gt;Integer.MAX_VALUE || sum&lt;Integer.MIN_VALUE) return 0; x /= 10; } return (int)sum; }","link":"/2021/02/02/leetcode/7/"},{"title":"WeChat Mini Program - Dynamic Styles","text":"Sometimes a component's style needs to be controlled dynamically based on some state in the current context; dynamic styles can be used for this. Adjust the style depending on whether the user has already favorited the item: &lt;button class=\"operBtn {{has_favorite==0?'not_favorite':'favorite'}}\"&gt;Favorite&lt;/button&gt; .not_favorite { width: 100%; height: 100%; color: #000000; } .favorite { width: 100%; height: 100%; color: #07c160; }","link":"/2021/03/07/miniprogram/%E5%8A%A8%E6%80%81%E6%A0%B7%E5%BC%8F/"},{"title":"WeChat Mini Program - Passing Parameters with Events","text":"For example, when building a favorites feature, the id of the tapped item has to be passed to the handling logic when the user taps the favorite button. To do this, add an attribute named in the form data-xxx to the view-layer component. &lt;button bindtap=\"favorite\" data-id=\"{{item.id}}\"&gt;Favorite&lt;/button&gt; favorite: function (e) { 
console.log(e.currentTarget.dataset.id) } Official event documentation","link":"/2021/03/07/miniprogram/%E4%BA%8B%E4%BB%B6%E4%BC%A0%E5%8F%82/"},{"title":"WeChat Mini Program - Feedback Button","text":"I originally planned to build a feedback module for the mini program, to collect user experience information for later improvement, but a quick search showed that the WeChat open ecosystem already has this built in, so I used it directly. &lt;button open-type=\"feedback\"&gt;Feedback&lt;/button&gt; Result: user feedback can be looked up on the admin console page. Official documentation","link":"/2021/03/07/miniprogram/%E5%8F%8D%E9%A6%88%E6%8C%89%E9%92%AE/"}],"tags":[{"name":"Graph Algorithms","slug":"图算法","link":"/tags/%E5%9B%BE%E7%AE%97%E6%B3%95/"},{"name":"Clustering Algorithms","slug":"聚类算法","link":"/tags/%E8%81%9A%E7%B1%BB%E7%AE%97%E6%B3%95/"},{"name":"hive","slug":"hive","link":"/tags/hive/"},{"name":"hexo","slug":"hexo","link":"/tags/hexo/"},{"name":"LeetCode","slug":"LeetCode","link":"/tags/LeetCode/"},{"name":"WeChat Mini Program","slug":"微信小程序","link":"/tags/%E5%BE%AE%E4%BF%A1%E5%B0%8F%E7%A8%8B%E5%BA%8F/"},{"name":"Mini Program","slug":"小程序","link":"/tags/%E5%B0%8F%E7%A8%8B%E5%BA%8F/"}],"categories":[]}