Python Github用户数据分析4 flann临近算法应用

作者: Phodal Huang 2014年4月16日 21:31

osrc最有意思的一部分莫过于flann，当然说的也是系统后台的设计的一个很关键及有意思的部分。

Python Github

邻近算法是在这个分析过程中一个很有意思的东西。

邻近算法，或者说K最近邻(kNN，k-NearestNeighbor)分类算法可以说是整个数据挖掘分类技术中最简单的方法了。所谓K最近邻，就是k个最近的邻居的意思，说的是每个样本都可以用她最接近的k个邻居来代表。

换句话说，我们需要一些样本来当作我们的分析资料，这里东西用到的就是我们之前的。

 [227.0, {'1': '51', '0': '41', '3': '17', '2': '34', '5': '28', '4': '22', '6': '34'}, [('PushEvent', 154.0), ('CreateEvent', 41.0), ('WatchEvent', 18.0), ('GollumEvent', 8.0), ('MemberEvent', 3.0), ('ForkEvent', 2.0), ('ReleaseEvent', 1.0)], 0, 0, 0, 11, [('CSS', 74.0), ('JavaScript', 60.0), ('Ruby', 12.0), ('TeX', 6.0), ('Python', 6.0), ('Java', 5.0), ('C++', 5.0), ('Assembly', 5.0), ('C', 3.0), ('Emacs Lisp', 2.0), ('Arduino', 2.0)]]

在代码中是构建了一个points.h5的文件来分析每个用户的points，之后再记录到hdf5文件中。

[ 0.00438596  0.18061674  0.2246696   0.14977974  0.07488987  0.0969163
  0.12334802  0.14977974  0.          0.18061674  0.          0.          0.
  0.00881057  0.          0.          0.03524229  0.          0.
  0.01321586  0.          0.          0.          0.6784141   0.
  0.07929515  0.00440529  1.          1.          1.          0.08333333
  0.26431718  0.02202643  0.05286344  0.02643172  0.          0.01321586
  0.02202643  0.          0.          0.          0.          0.          0.
  0.          0.          0.00881057  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.00881057]

这里分析到用户的大部分行为，再找到与其行为相近的用户，主要的行为有下面这些:

每星期的情况
事件的类型
贡献的数量，连接以及语言
最多的语言

osrc中用于解析的代码

def parse_vector(results):
    points = np.zeros(nvector)
    total = int(results[0])

    points[0] = 1.0 / (total + 1)

    # Week means.
    for k, v in results[1].iteritems():
        points[1 + int(k)] = float(v) / total

    # Event types.
    n = 8
    for k, v in results[2]:
        points[n + evttypes.index(k)] = float(v) / total

    # Number of contributions, connections and languages.
    n += nevts
    points[n] = 1.0 / (float(results[3]) + 1)
    points[n + 1] = 1.0 / (float(results[4]) + 1)
    points[n + 2] = 1.0 / (float(results[5]) + 1)
    points[n + 3] = 1.0 / (float(results[6]) + 1)

    # Top languages.
    n += 4
    for k, v in results[7]:
        if k in langs:
            points[n + langs.index(k)] = float(v) / total
        else:
            # Unknown language.
            points[-1] = float(v) / total

    return points

这样也就返回我们需要的点数，然后我们可以用get_points来获取这些

def get_points(usernames):
    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    pipe = r.pipeline()

    results = get_vector(usernames)
    points = np.zeros([len(usernames), nvector])
    points = parse_vector(results)
    return points

就会得到我们的相应的数据，接着找找和自己邻近的，看看结果。

[ 0.01298701  0.19736842  0.          0.30263158  0.21052632  0.19736842
  0.          0.09210526  0.          0.22368421  0.01315789  0.          0.
  0.          0.          0.          0.01315789  0.          0.
  0.01315789  0.          0.          0.          0.73684211  0.          0.
  0.          1.          1.          1.          0.2         0.42105263
  0.09210526  0.          0.          0.          0.          0.23684211
  0.          0.          0.03947368  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.        ]

真看不出来两者有什么相似的地方。。。。

或许您还需要下面的文章:

关于我

Github: @phodal 微博:@phodal 知乎:@phodal

微信公众号(Phodal)

围观我的Github Idea墙, 也许，你会遇到心仪的项目

QQ技术交流群: 321689806

Feeds

RSS / Atom

关于作者

Phodal Huang

Engineer, Consultant, Writer, Designer

ThoughtWorks 技术专家

工程师 / 咨询师 / 作家 / 设计学徒

开源深度爱好者

出版有《前端架构：从入门到微前端》、《自己动手设计物联网》、《全栈应用开发：精益实践》

联系我: h@phodal.com

微信公众号: 最新技术分享

Github: @phodal
微博:@phodal
知乎:@phodal
SegmentFault:@phodal

Blog

Blog