
hn-data-dumps's People

Contributors

ashish01


hn-data-dumps's Issues

Resumable?

I followed your readme and it seems like it's working:

  0%|          | 11648/32557237 [00:19<16:36:06, 544.55it/s]

The problem is that if the script gets interrupted (e.g., I close my laptop lid or lose network), it seems to start over from the beginning.

Would it be hard to make this resumable?
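One possible approach, sketched here under assumptions rather than taken from the repository: before downloading, ask the SQLite database for the highest id already stored and continue from the next one. The table name hn_items and its id column appear elsewhere on this page; the helper names next_item_id and ids_to_fetch are hypothetical. The sketch also assumes rows are committed in ascending id order, which may not hold if items are fetched concurrently.

```python
import sqlite3


def next_item_id(conn: sqlite3.Connection, table: str = "hn_items") -> int:
    """Return the first item id that has not been stored yet (hypothetical helper)."""
    # MAX(id) is NULL on an empty table, so COALESCE falls back to 0.
    (max_id,) = conn.execute(
        f"SELECT COALESCE(MAX(id), 0) FROM {table}"
    ).fetchone()
    return max_id + 1


def ids_to_fetch(conn: sqlite3.Connection, last_id: int, table: str = "hn_items") -> range:
    """Return the ids still to download, up to and including last_id."""
    return range(next_item_id(conn, table), last_id + 1)
```

If downloads can finish out of order, resuming from MAX(id) + 1 could skip gaps below the maximum; in that case the set of stored ids would need to be compared against the full id range instead.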

zstd compression

Hey there,

Thanks for publishing your code on this - I've found it invaluable already in analysing stories and comments.

I've been experimenting with this code and can see that the latest changes zstd-compress each row. I've tested undoing those changes and using ZFS compression at the storage volume level instead.

Here are my results in case you were interested:

Row compression using zstd.compress in Python:

$ du -sh hn2.db3 
12G     hn2.db3

$ du -sh hn2.db3 --apparent-size
12G     hn2.db3

$ sqlite3 hn2.db3 'SELECT * from hn_items ORDER BY id LIMIT 1'
1|(�/� �-

ZFS

$ du -sh hn2-nozstd.db3
6.5G    hn2-nozstd.db3

$ du -sh hn2-nozstd.db3 --apparent-size
17G     hn2-nozstd.db3

$ sqlite3 hn2-nozstd.db3 'SELECT * from hn_items ORDER BY id LIMIT 1'
1|{"by":"pg","descendants":15,"id":1,"kids":[15,234509,487171,82729],"score":57,"time":1160418111,"title":"Y Combinator","type":"story","url":"http://ycombinator.com"}

I'm looking at testing the SQLite extensions sqlite-zstd or sqlite_zstd_zfs to see whether an in-database transparent compression method might provide similar or better compression ratios without the need for an underlying filesystem with compression support.

Will report back any further results.
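For anyone repeating the comparison from Python, the two du figures above correspond roughly to two fields of os.stat: st_size (apparent size) and st_blocks (allocated space). This is a minimal sketch, assuming a POSIX system where st_blocks is counted in 512-byte units; the helper names file_sizes and ratio are made up for illustration.

```python
import os


def file_sizes(path: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    apparent = st.st_size
    # Assumption: st_blocks uses 512-byte units (true on Linux; the field is
    # not available on Windows).
    allocated = st.st_blocks * 512
    return apparent, allocated


def ratio(apparent: float, allocated: float) -> float:
    """Apparent size divided by allocated size (higher means better compression)."""
    return apparent / allocated if allocated else float("inf")


# Figures reported above, in GiB as printed by du:
# ZFS volume compression: 17G apparent, 6.5G on disk.
print(round(ratio(17, 6.5), 1))
# Per-row zstd: 12G on disk, compared against the 17G uncompressed database.
print(round(ratio(17, 12), 1))
```

By this measure the ZFS result is about 2.6x and the per-row zstd result about 1.4x relative to the 17G uncompressed file.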

Curious about missing item types, properties, and scraping script

First of all, thanks for the work! I was looking for an up-to-date data dump of Hacker News and this seems like a really promising start!

Looking through the data dump and comparing it to the examples in the Hacker News API readme, it seems that only story and ask items are available; comment, job, poll, and pollopt (part of a poll) items don't seem to be part of the data dump.

Looking at the properties of the example story, it seems that only a few properties are available; descendants, kids, score, and others are missing.

sqlite> SELECT * FROM hn_stories WHERE ID = 8863;
8863|{"by":"dhouston","time":1175714200,"title":"My YC app: Dropbox - Throw away your USB drive","url":"http://www.getdropbox.com/u/2/screencast.html"}

Example from https://hacker-news.firebaseio.com/v0/item/8863.json?print=pretty:

{
  "by" : "dhouston",
  "descendants" : 71,
  "id" : 8863,
  "kids" : [ 8952, 9224, 8917, 8884, 8887, 8943, 8869, 8958, 9005, 9671, 8940, 9067, 8908, 9055, 8865, 8881, 8872, 8873, 8955, 10403, 8903, 8928, 9125, 8998, 8901, 8902, 8907, 8894, 8878, 8870, 8980, 8934, 8876 ],
  "score" : 111,
  "time" : 1175714200,
  "title" : "My YC app: Dropbox - Throw away your USB drive",
  "type" : "story",
  "url" : "http://www.getdropbox.com/u/2/screencast.html"
}

I'm curious whether you also scraped the other item types and the other properties of, e.g., a story, and whether it would be possible to include those in the data dump or to share the script used for scraping.
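The difference between the two records above can be listed mechanically. The sketch below compares the JSON keys of the dump row with those of the API response; the helper name missing_keys is hypothetical, and the kids array is shortened here for brevity.

```python
import json


def missing_keys(dump_row_json: str, api_item_json: str) -> list[str]:
    """Return the keys present in the API item but absent from the dump row."""
    dump_keys = set(json.loads(dump_row_json))
    api_keys = set(json.loads(api_item_json))
    return sorted(api_keys - dump_keys)


# Row as stored in hn_stories for id 8863.
DUMP_ROW = (
    '{"by":"dhouston","time":1175714200,'
    '"title":"My YC app: Dropbox - Throw away your USB drive",'
    '"url":"http://www.getdropbox.com/u/2/screencast.html"}'
)

# API response for the same item (kids list truncated for this example).
API_ITEM = (
    '{"by":"dhouston","descendants":71,"id":8863,"kids":[8952,9224],'
    '"score":111,"time":1175714200,'
    '"title":"My YC app: Dropbox - Throw away your USB drive",'
    '"type":"story","url":"http://www.getdropbox.com/u/2/screencast.html"}'
)

print(missing_keys(DUMP_ROW, API_ITEM))
# → ['descendants', 'id', 'kids', 'score', 'type']
```

Note that id is stored as the table's first column rather than inside the JSON, so only descendants, kids, score, and type are actually absent from the dump.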

missing requirements.txt

Thanks for this very handy script!

The requirements file seems to be missing. It looks like it should include something like:

aiohttp
tqdm
pyarrow
fastparquet
pandas
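Until a requirements file exists, a quick way to see which of these packages are not yet installed is to probe for them with the standard library. This is only a sketch: the list simply repeats the names guessed above, it assumes each package's import name matches its listed name, and missing_modules is a hypothetical helper.

```python
import importlib.util

# Package names as suggested above; not confirmed by the repository.
REQUIRED = ["aiohttp", "tqdm", "pyarrow", "fastparquet", "pandas"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", " ".join(missing))
    else:
        print("All listed packages are importable.")
```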
