datasets2sqlite

These scripts convert several large datasets from their native formats to SQLite. Each script is designed to have minimal dependencies, so it can be copied and run independently of the others.

Each script prints usage help when run with too few arguments and can read the data in its compressed form (see the --gzip and --bz2 flags in the examples below).
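As a rough illustration of reading compressed input without unpacking it first, here is a minimal sketch using only the Python standard library. The function name `open_dataset` and its `compression` argument are hypothetical and are not taken from the scripts in this repository, which may handle the --gzip and --bz2 flags differently.

```python
import bz2
import gzip


def open_dataset(path, compression=None):
    """Open a dataset file as text, decompressing on the fly if requested.

    `compression` is a hypothetical parameter: "gzip", "bz2", or None.
    """
    if compression == "gzip":
        return gzip.open(path, "rt", encoding="utf-8")
    if compression == "bz2":
        return bz2.open(path, "rt", encoding="utf-8")
    # No compression requested: fall back to a plain text file.
    return open(path, "r", encoding="utf-8")
```

Because all three branches return a file-like object that yields text lines, the rest of a conversion script can iterate over the input line by line without knowing whether it was compressed.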

Datasets

  • Amazon Reviews

    • Source: http://jmcauley.ucsd.edu/data/amazon/
    • python json2sqlite.py --gzip aggressive_dedup.json.gz amazon.sqlite reviews
    • python amazon_metadata2sqlite.py --gzip metadata.json.gz amazon.sqlite
  • Wikipedia Metadata

    • Source: https://snap.stanford.edu/data/wiki-meta.html (NOT the complete wikipedia history)
    • python wikimeta2sqlite.py --bz2 enwiki-20080103.main.bz2 wikipedia_2008.sqlite main
    • python wikimeta2sqlite.py --bz2 enwiki-20080103.users.bz2 wikipedia_2008.sqlite users
    • etc.
  • Memetracker data

    • Source: https://snap.stanford.edu/data/memetracker9.html
    • python meme2sqlite.py --gzip quotes_2008-08.txt.gz memetracker2.sqlite meme
    • python meme2sqlite.py --gzip quotes_2008-09.txt.gz memetracker2.sqlite meme
    • python meme2sqlite.py --gzip quotes_2008-10.txt.gz memetracker2.sqlite meme
    • etc.
  • Reddit data

    • Source: https://archive.org/details/2015_reddit_comments_corpus
    • Starting with 2015-04, comments contain one extra field, removal_reason, so the headers must be supplied explicitly.
    • python json2sqlite.py --bz2 RC_2015-01.bz2 --headers reddit_headers.txt reddit.sqlite comments
    • python json2sqlite.py --bz2 RC_2015-02.bz2 --headers reddit_headers.txt reddit.sqlite comments
    • python json2sqlite.py --bz2 RC_2015-03.bz2 --headers reddit_headers.txt reddit.sqlite comments
    • etc.
  • StackExchange data
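The Reddit note above (a field that appears only from 2015-04 onwards) is the kind of case where a fixed column list matters when several files are appended to one table. The sketch below shows the general idea, assuming one JSON object per input line. The names `load_json_lines` and `to_sql_value` are hypothetical, and the format of reddit_headers.txt is not described here, so the sketch takes the headers as a plain Python list. It is not the implementation of json2sqlite.py.

```python
import json
import sqlite3


def to_sql_value(value):
    """Serialize nested JSON values so that sqlite3 can bind them."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value


def load_json_lines(lines, conn, table, headers):
    """Insert one row per JSON line, using a fixed list of column names.

    Fields missing from a record are stored as NULL, so files with and
    without an optional field can share one table.
    """
    columns = ", ".join('"{}"'.format(h) for h in headers)
    placeholders = ", ".join("?" for _ in headers)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
    insert = 'INSERT INTO "{}" ({}) VALUES ({})'.format(
        table, columns, placeholders
    )
    rows = (
        tuple(to_sql_value(record.get(h)) for h in headers)
        for record in (json.loads(line) for line in lines if line.strip())
    )
    conn.executemany(insert, rows)
    conn.commit()
```

Calling `load_json_lines` once per input file with the same `headers` list appends to the same table. Records from files that lack removal_reason simply get NULL in that column.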

Acknowledgements

I use code from rgrp/csv2sqlite for guessing types. The code for converting the StackExchange dataset is taken (with minor changes) from testlnord/sedumpy.
