Python GitHub User Data Analysis 2.4: Combining Python and Redis

Posted by: Phodal Huang April 14, 2014, 9:26 p.m.

In the previous post we installed the Python Redis client; now it finally gets put to use.

Python and Redis

Running from Python

Starting redis-server

This step is required so that we can store and read data.

$ redis-server

We can use redis-cli to enter interactive command mode:

$ redis-cli

Python Redis

In Python we can do the following:

>>> import redis
>>> r = redis.StrictRedis(host='localhost', port=6379, db=0)
>>> r.set('foo', 'bar')
True
>>> r.get('foo')
'bar'

Of course, that is only a simple example. With our actual database, looking up a particular user's data goes like this:

$ python
Python 2.7.6 (default, Apr 12 2014, 22:23:28)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> pool = redis.ConnectionPool(host='localhost', port=6379, db=1)
>>> r = redis.Redis(connection_pool=pool)
>>> pipe = r.pipeline()

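Because we used db=1 when setting up the database, the connection differs from the example above. Next we define a small helper function, taken from osrc, to save some typing: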

>>> def format_key(key):
...     return "{0}:{1}".format("od", key)
... 
>>>
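To make the key layout concrete, here is a minimal sketch of the names this helper produces. It reuses format_key as defined above; the user name "dfm" and the key suffixes come from the queries and storage code in this post, while the variable names lang_key and total_key are only for illustration.

```python
def format_key(key):
    # Prefix every key with the "od" namespace, as in the helper above.
    return "{0}:{1}".format("od", key)


# Per-user language key, built the same way as in the queries in this post.
lang_key = format_key("user:{0}:lang".format("dfm"))
# Global counter key.
total_key = format_key("total")

print(lang_key)   # od:user:dfm:lang
print(total_key)  # od:total
```

Keeping the prefix in one function means the namespace can be changed in a single place.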

Then we can query our own data:

>>> pipe.zcard(format_key("user:{0}:lang".format("gmszone")))
<redis.client.Pipeline object at 0x10c7f6810>
>>> pipe.execute()
[0] 
>>> pipe.zcard(format_key("user:{0}:lang".format("dfm")))
<redis.client.Pipeline object at 0x10c7f6810>
>>> pipe.execute()
[1]
>>>
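The session above shows two things: each call on the pipeline returns the pipeline object itself, and the counts only appear in the list returned by execute(). The following is a rough in-memory stand-in that mimics just that buffering behaviour, so it can be run without a Redis server. ToyPipeline and its store dictionary are hypothetical names, not part of the Python Redis client, and the sample data is made up; a real pipeline sends the queued commands to the server instead of reading a local dictionary.

```python
class ToyPipeline(object):
    """Hypothetical stand-in: queues commands, runs them on execute()."""

    def __init__(self, store):
        self.store = store      # dict mapping key -> {member: score}
        self.queued = []        # keys whose ZCARD is waiting for execute()

    def zcard(self, name):
        # Queue the command and return self, which is what the REPL prints.
        self.queued.append(name)
        return self

    def execute(self):
        # Run everything queued so far, then clear the queue.
        results = [len(self.store.get(name, {})) for name in self.queued]
        self.queued = []
        return results


# Made-up sample data: one language recorded for "dfm", none for "gmszone".
store = {"od:user:dfm:lang": {"Python": 3.0}}
pipe = ToyPipeline(store)

returned = pipe.zcard("od:user:gmszone:lang")
pipe.zcard("od:user:dfm:lang")
results = pipe.execute()

print(returned is pipe)  # True
print(results)           # [0, 1]
```

Queuing several commands before a single execute() is what lets the storage code later in this post send many increments in one batch.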

Well, it seems I did not commit any code on March 1 and 2. Let's try db=0.

Storing GitHub Data in Redis with Python


import redis
r = redis.StrictRedis(host='localhost', port=6379, db=1)
def _format(key):
    return "{0}:{1}".format("od", key)
pipe = r.pipeline()
pipe.incr(_format("total"), 1)
pipe.execute()

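With this we can store data in a simple way.

It is a painful process, though, because there is a great deal of data.

On Mac OS you can browse the data with rdm, but with this much data that may not be practical.

So let's put up with the pain and store the data the same way osrc does.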


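import glob
import gzip
import json
import os
import re
from datetime import date


# Relies on r and _format from the previous snippet.
def build_db_with_redis():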
    year = 2014
    month = 3
    pipe = r.pipeline()

    for day in range(2, 4):
        date_re = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]+)\.json\.gz")

        fn_template = os.path.join("march",
                                   "{year}-{month:02d}-{day:02d}-{n}.json.gz")
        kwargs = {"year": year, "month": month, "day": day, "n": "*"}
        filenames = glob.glob(fn_template.format(**kwargs))

        for filename in filenames:
            userinfo = []
            year, month, day, hour = map(int, date_re.findall(filename)[0])
            weekday = date(year=year, month=month, day=day).strftime("%w")

            with gzip.GzipFile(filename) as f:
                events = [line.decode("utf-8", errors="ignore") for line in f]
                count = len(events)

                for n, line in enumerate(events):

                    event = json.loads(line)

                    actor = event["actor"]
                    attrs = event.get("actor_attributes", {})
                    if actor is None or attrs.get("type") != "User":
                        # This was probably an anonymous event (like a gist event)
                        # or an organization event.
                        continue

                    key = actor.lower()
                    evttype = event["type"]
                    nevents = 1
                    contribution = evttype in ["IssuesEvent", "PullRequestEvent", "PushEvent"]

                    pipe.incr(_format("total"), nevents)
                    pipe.hincrby(_format("day"), weekday, nevents)
                    pipe.hincrby(_format("hour"), hour, nevents)
                    pipe.zincrby(_format("user"), key, nevents)
                    pipe.zincrby(_format("event"), evttype, nevents)

                    # Event histograms.
                    pipe.hincrby(_format("event:{0}:day".format(evttype)), weekday,
                                 nevents)
                    pipe.hincrby(_format("event:{0}:hour".format(evttype)), hour,
                                 nevents)

                    # User schedule histograms.
                    pipe.hincrby(_format("user:{0}:day".format(key)), weekday, nevents)
                    pipe.hincrby(_format("user:{0}:hour".format(key)), hour, nevents)

                    # User event type histogram.
                    pipe.zincrby(_format("user:{0}:event".format(key)), evttype,
                                 nevents)
                    pipe.hincrby(_format("user:{0}:event:{1}:day".format(key,
                                                                         evttype)),
                                 weekday, nevents)
                    pipe.hincrby(_format("user:{0}:event:{1}:hour".format(key,
                                                                          evttype)),
                                 hour, nevents)

                    # Parse the name and owner of the affected repository.
                    repo = event.get("repository", {})
                    owner, name, org = (repo.get("owner"), repo.get("name"),
                                        repo.get("organization"))
                    if owner and name:
                        repo_name = "{0}/{1}".format(owner, name)
                        pipe.zincrby(_format("repo"), repo_name, nevents)

                        # Save the social graph.
                        pipe.zincrby(_format("social:user:{0}".format(key)),
                                     repo_name, nevents)
                        pipe.zincrby(_format("social:repo:{0}".format(repo_name)),
                                     key, nevents)

                        # Do we know what the language of the repository is?
                        language = repo.get("language")
                        if language:
                            # Which are the most popular languages?
                            pipe.zincrby(_format("lang"), language, nevents)

                            # Total number of pushes.
                            if evttype == "PushEvent":
                                pipe.zincrby(_format("pushes:lang"), language, nevents)

                            pipe.zincrby(_format("user:{0}:lang".format(key)),
                                         language, nevents)

                            # Who are the most important users of a language?
                            if contribution:
                                pipe.zincrby(_format("lang:{0}:user".format(language)),
                                             key, nevents)

                pipe.execute()
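The file name handling in build_db_with_redis can be tried on its own. This sketch, assuming an archive named according to the template above (the specific path march/2014-03-02-15.json.gz is a made-up example), extracts the date parts and computes the weekday string that is used as the hash field. parse_archive_name is a hypothetical helper name.

```python
import re
from datetime import date

# Same pattern as in build_db_with_redis, with both dots escaped.
date_re = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]+)\.json\.gz")


def parse_archive_name(filename):
    """Return (year, month, day, hour, weekday) for an archive file name."""
    year, month, day, hour = map(int, date_re.findall(filename)[0])
    # "%w" gives the weekday as a string, "0" for Sunday through "6".
    weekday = date(year=year, month=month, day=day).strftime("%w")
    return year, month, day, hour, weekday


print(parse_archive_name("march/2014-03-02-15.json.gz"))
# (2014, 3, 2, 15, '0')
```

March 2, 2014 was a Sunday, which is why the weekday field comes out as '0'.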

Then, in the next chapter, we will be able to retrieve our own data:

[221.0, {'1': '50', '0': '41', '3': '13', '2': '33', '5': '28', '4': '22', '6': '34'}, [('PushEvent', 152.0), ('CreateEvent', 39.0), ('WatchEvent', 16.0), ('GollumEvent', 8.0), ('MemberEvent', 3.0), ('ForkEvent', 2.0), ('ReleaseEvent', 1.0)], 0, 0, 0, 11, [('CSS', 73.0), ('JavaScript', 60.0), ('Ruby', 12.0), ('TeX', 6.0), ('Python', 5.0), ('Java', 5.0), ('C++', 5.0), ('Assembly', 5.0), ('Emacs Lisp', 2.0), ('Arduino', 2.0), ('C', 1.0)]]

Python Redis Methods

A brief list of the methods used above:

incr(name, amount=1)

Increments the value of key by amount. If no key exists, the value will be initialized as amount. In effect it is a self-incrementing counter.

hincrby(name, key, amount=1)

Increment the value of key in hash name by amount

zincrby(name, value, amount=1)

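Increment the score of value in sorted set name by amount. This is the argument order of redis-py 2.x; from redis-py 3.0 onward the call is zincrby(name, amount, value).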
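For intuition, the three increments behave like counters at different nesting levels. The sketch below imitates them with plain Python dictionaries so that it runs without a server. It is only an approximation of the semantics (Redis stores and returns values differently), and the names strings, hashes and zsets, as well as the three sample events, are hypothetical.

```python
from collections import defaultdict

strings = defaultdict(int)                        # key -> integer
hashes = defaultdict(lambda: defaultdict(int))    # key -> field -> integer
zsets = defaultdict(lambda: defaultdict(float))   # key -> member -> score


def incr(name, amount=1):
    # A missing key starts at 0, so the first call leaves it at amount.
    strings[name] += amount
    return strings[name]


def hincrby(name, key, amount=1):
    # Increment one field inside the hash stored at name.
    hashes[name][key] += amount
    return hashes[name][key]


def zincrby(name, value, amount=1):
    # Increment the score of one member of the sorted set stored at name.
    zsets[name][value] += amount
    return zsets[name][value]


# Two push events on weekday "0" and one watch event on weekday "1".
for evttype, weekday in [("PushEvent", "0"), ("PushEvent", "0"),
                         ("WatchEvent", "1")]:
    incr("od:total")
    hincrby("od:day", weekday)
    zincrby("od:event", evttype)

print(strings["od:total"])       # 3
print(dict(hashes["od:day"]))    # {'0': 2, '1': 1}
print(dict(zsets["od:event"]))   # {'PushEvent': 2.0, 'WatchEvent': 1.0}
```

This mirrors how the storage code keeps a single total, per-weekday histograms in hashes, and ranked counts in sorted sets.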

Note that the queued commands are only actually run when execute is called:

>>> pipe.execute()

And the data below:

[
    221.0,
    {
        '1': '50',
        '0': '41',
        '3': '13',
        '2': '33',
        '5': '28',
        '4': '22',
        '6': '34'
    },
    [
        ('PushEvent', 152.0),
        ('CreateEvent', 39.0),
        ('WatchEvent', 16.0),
        ('GollumEvent', 8.0),
        ('MemberEvent', 3.0),
        ('ForkEvent', 2.0),
        ('ReleaseEvent', 1.0)
    ],
    0, 0, 0, 11,
    [
        ('CSS', 73.0),
        ('JavaScript', 60.0),
        ('Ruby', 12.0),
        ('TeX', 6.0),
        ('Python', 5.0),
        ('Java', 5.0),
        ('C++', 5.0),
        ('Assembly', 5.0),
        ('Emacs Lisp', 2.0),
        ('Arduino', 2.0),
        ('C', 1.0)
    ]
]

is exactly the GitHub data we need for analyzing a user's activity.
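As a quick sanity check on that output, the weekday hash can be turned into an ordered table. This sketch copies the first two elements of the result above; it assumes that the first element is the overall total and that the hash keys are the "%w" weekday numbers written by the storage code (0 is Sunday). The variable names are only for illustration.

```python
total = 221.0
by_day = {'1': '50', '0': '41', '3': '13', '2': '33',
          '5': '28', '4': '22', '6': '34'}

names = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# Hash values come back as strings, so convert before doing arithmetic.
counts = [(names[int(day)], int(n)) for day, n in sorted(by_day.items())]

for name, n in counts:
    print("{0}: {1}".format(name, n))

# The per-day counts add up to the overall total.
print(sum(n for _, n in counts) == total)  # True
```

Under that reading of the keys, Monday is the busiest day with 50 events and Wednesday the quietest with 13.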
