Python Github用户数据分析2.2 用SQLite3存储数据

作者: Phodal Huang 2014年4月13日 10:08

如果我们每次都要花同样的时间去做一件事，去扫那些数据的话，那么这是最好的打发时间的方法。

python SQLite3 查询数据

我们创建了一个名为userdata.db的数据库文件，然后创建了一个表，里面有owner,language,eventtype,name url

def init_db():
    conn = sqlite3.connect('userdata.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE userinfo (owner text, language text, eventtype text, name text, url text)''')

接着我们就可以查询数据，这里从结果讲起。


def get_count(username):
    count = 0
    userinfo = []
    condition = 'select * from userinfo where owener = \'' + str(username) + '\''
    for zero in c.execute(condition):
        count += 1
        userinfo.append(zero)

    return count, userinfo

当我查询gmszone的时候，也就是我自己就会有如下的结果


(u'gmszone', u'ForkEvent', u'RESUME', u'TeX', u'https://github.com/gmszone/RESUME')
(u'gmszone', u'WatchEvent', u'iot-dashboard', u'JavaScript', u'https://github.com/gmszone/iot-dashboard')
(u'gmszone', u'PushEvent', u'wechat-wordpress', u'Ruby', u'https://github.com/gmszone/wechat-wordpress')
(u'gmszone', u'WatchEvent', u'iot', u'JavaScript', u'https://github.com/gmszone/iot')
(u'gmszone', u'CreateEvent', u'iot-doc', u'None', u'https://github.com/gmszone/iot-doc')
(u'gmszone', u'CreateEvent', u'iot-doc', u'None', u'https://github.com/gmszone/iot-doc')
(u'gmszone', u'PushEvent', u'iot-doc', u'TeX', u'https://github.com/gmszone/iot-doc')
(u'gmszone', u'PushEvent', u'iot-doc', u'TeX', u'https://github.com/gmszone/iot-doc')
(u'gmszone', u'PushEvent', u'iot-doc', u'TeX', u'https://github.com/gmszone/iot-doc')
109

一共有109个事件，有Watch,Create,Push,Fork还有其他的，项目主要有iot,RESUME,iot-dashboard,wechat-wordpress, 接着就是语言了，Tex,Javascript,Ruby,接着就是项目的url了。

值得注意的是。


-rw-r--r--   1 fdhuang staff 905M Apr 12 14:59 userdata.db

这个数据库文件有905M，不过查询结果相当让人满意，至少相对于原来的结果来说。

Python SQLite3

Python自带了对SQLite3的支持，然而我们还需要安装SQLite3

  brew install sqlite3

或者是

 sudo port install sqlite3

或者是Ubuntu的

 sudo apt-get install sqlite3

openSUSE自然就是

 sudo zypper install sqlite3

不过，用yast2也很不错，不是么。。

Pythont Github Sqlite3数据导入

需要注意的是这里是需要python2.7，起源于对gzip的上下文管理器的支持问题


def handle_gzip_file(filename):
    userinfo = []
    with gzip.GzipFile(filename) as f:
        events = [line.decode("utf-8", errors="ignore") for line in f]

        for n, line in enumerate(events):
            try:
                event = json.loads(line)
            except:

                continue

            actor = event["actor"]
            attrs = event.get("actor_attributes", {})
            if actor is None or attrs.get("type") != "User":
                continue

            key = actor.lower()

            repo = event.get("repository", {})
            info = str(repo.get("owner")), str(repo.get("language")), str(event["type"]), str(repo.get("name")), str(
                repo.get("url"))
            userinfo.append(info)

    return userinfo

def build_db_with_gzip():
    init_db()
    conn = sqlite3.connect('userdata.db')
    c = conn.cursor()

    year = 2014
    month = 3

    for day in range(1,31):
        date_re = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]+)\.json.gz")

        fn_template = os.path.join("march",
                                   "{year}-{month:02d}-{day:02d}-{n}.json.gz")
        kwargs = {"year": year, "month": month, "day": day, "n": "*"}
        filenames = glob.glob(fn_template.format(**kwargs))

        for filename in filenames:
            c.executemany('INSERT INTO userinfo VALUES (?,?,?,?,?)', handle_gzip_file(filename))

    conn.commit()
    c.close()

executemany可以插入多条数据，对于我们的数据来说，一小时的文件大概有五六千个会符合我们上面的安装，也就是有actor又有type才是我们需要记录的数据，我们只需要统计用户的那些事件，而非全部的事件。

python 遍历文件

我们需要去遍历文件，然后找到合适的部分，这里只是要找2014-03-01到2014-03-31的全部事件，而光这些数据的gz文件就有1.26G，同上面那些解压为json文件显得不合适，只能用遍历来处理。

这里参考了osrc项目中的写法，或者说直接复制过来。

首先是正规匹配

 date_re = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]+)\.json.gz")

不过主要的还是在于glob.glob

glob是python自己带的一个文件操作相关模块，用它可以查找符合自己目的的文件，就类似于Windows下的文件搜索，支持通配符操作。

这里也就用上了gzip.GzipFile又一个不错的东西。

最后代码可以见

github.com/gmszone/ml

更好的方案？

redis

或许您还需要下面的文章:

关于我

Github: @phodal 微博:@phodal 知乎:@phodal

微信公众号(Phodal)

围观我的Github Idea墙, 也许，你会遇到心仪的项目

QQ技术交流群: 321689806

Feeds

RSS / Atom

关于作者

Phodal Huang

Engineer, Consultant, Writer, Designer

ThoughtWorks 技术专家

工程师 / 咨询师 / 作家 / 设计学徒

开源深度爱好者

出版有《前端架构：从入门到微前端》、《自己动手设计物联网》、《全栈应用开发：精益实践》

联系我: h@phodal.com

微信公众号: 最新技术分享

Github: @phodal
微博:@phodal
知乎:@phodal
SegmentFault:@phodal

Blog

Blog