ashish01 / hn-data-dumps
License: MIT License
I followed your readme and it seems like it's working:
0%| | 11648/32557237 [00:19<16:36:06, 544.55it/s]
The problem is that if the script gets interrupted (e.g., I close my laptop lid or lose network), it seems to start over from the beginning.
Would it be hard to make this resumable?
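One possible way to resume, sketched under assumptions: ask the SQLite file for the highest id already stored and continue from the next one. The table name hn_items is taken from the queries elsewhere on this page; the helper names, the item_json column, and the idea that rows are committed in ascending id order are assumptions for illustration, not how the script is known to work. If items are fetched concurrently and written out of order, a gap check would be needed instead of a simple MAX(id).

```python
import sqlite3


def next_item_id(conn: sqlite3.Connection, table: str = "hn_items") -> int:
    """Return the first id that still needs fetching (1 if the table is empty).

    Assumes rows are committed in ascending id order (hypothetical; the real
    script may write out of order, in which case gaps must be checked too).
    """
    row = conn.execute(f"SELECT MAX(id) FROM {table}").fetchone()
    return (row[0] or 0) + 1


def remaining_ids(conn: sqlite3.Connection, max_item_id: int,
                  table: str = "hn_items") -> range:
    """Ids still to download, given the newest item id reported upstream."""
    return range(next_item_id(conn, table), max_item_id + 1)


# Small demonstration against an in-memory database.
# The column name item_json is a placeholder, not the real schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hn_items (id INTEGER PRIMARY KEY, item_json BLOB)")
demo.executemany("INSERT INTO hn_items VALUES (?, ?)",
                 [(1, b"{}"), (2, b"{}"), (3, b"{}")])
demo.commit()

todo = list(remaining_ids(demo, max_item_id=6))
print(todo)  # ids 4 through 6 are still missing
```

On restart the download loop would iterate over remaining_ids(...) rather than starting at 1, so an interrupted run only loses whatever was not yet committed.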
Hey there,
Thanks for publishing your code on this - I've found it invaluable already in analysing stories and comments.
I've been experimenting with this code and can see that the latest changes zstd-compress each row. I've tested undoing those changes and using ZFS compression at the storage volume level instead.
Here are my results in case you were interested:
$ du -sh hn2.db3
12G hn2.db3
$ du -sh hn2.db3 --apparent-size
12G hn2.db3
$ sqlite3 hn2.db3 'SELECT * from hn_items ORDER BY id LIMIT 1'
1|(�/� �-
$ du -sh hn2-nozstd.db3
6.5G hn2-nozstd.db3
$ du -sh hn2-nozstd.db3 --apparent-size
17G hn2-nozstd.db3
$ sqlite3 hn2-nozstd.db3 'SELECT * from hn_items ORDER BY id LIMIT 1'
1|{"by":"pg","descendants":15,"id":1,"kids":[15,234509,487171,82729],"score":57,"time":1160418111,"title":"Y Combinator","type":"story","url":"http://ycombinator.com"}
I'm looking at testing the SQLite extensions sqlite-zstd or sqlite_zstd_vfs to see if an in-database transparent compression method might provide similar or better compression ratios without the need for an underlying filesystem with compression support.
Will report back any further results.
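To put the du figures above side by side, here is a rough ratio calculation. It treats the 17G apparent size of hn2-nozstd.db3 as the uncompressed baseline and uses the rounded values du printed, so the results are approximate. The constant and function names are made up for this sketch.

```python
def compression_ratio(uncompressed_gib: float, stored_gib: float) -> float:
    """Uncompressed size divided by the space actually used."""
    return uncompressed_gib / stored_gib


# Rounded sizes as reported by `du -sh` in the transcript above.
UNCOMPRESSED_GIB = 17.0  # hn2-nozstd.db3, --apparent-size
ROW_ZSTD_GIB = 12.0      # hn2.db3, each row compressed with zstd
ZFS_GIB = 6.5            # hn2-nozstd.db3, on-disk size on the ZFS volume

row_zstd_ratio = compression_ratio(UNCOMPRESSED_GIB, ROW_ZSTD_GIB)
zfs_ratio = compression_ratio(UNCOMPRESSED_GIB, ZFS_GIB)

print(f"per-row zstd: {row_zstd_ratio:.2f}x")  # about 1.42x
print(f"ZFS volume:   {zfs_ratio:.2f}x")       # about 2.62x
```

By these numbers the volume-level compression stores the same data in roughly half the space of the per-row approach, which would fit with a block-level compressor seeing many similar rows at once, though the transcript alone does not establish the cause.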
First of all thanks for the work! I was looking for an up to date data dump of Hacker News and this seems like a really promising start!
Looking through the data dump and comparing it to the examples in the Hacker News API readme, it seems that only story and ask items are available; comment, job, poll, and pollopt (part of a poll) items don't seem to be part of the data dump.
Looking at the properties of the example story, only a few of them are available: descendants, kids, score, and others are missing.
sqlite> SELECT * FROM hn_stories WHERE ID = 8863;
8863|{"by":"dhouston","time":1175714200,"title":"My YC app: Dropbox - Throw away your USB drive","url":"http://www.getdropbox.com/u/2/screencast.html"}
Example from https://hacker-news.firebaseio.com/v0/item/8863.json?print=pretty:
{
"by" : "dhouston",
"descendants" : 71,
"id" : 8863,
"kids" : [ 8952, 9224, 8917, 8884, 8887, 8943, 8869, 8958, 9005, 9671, 8940, 9067, 8908, 9055, 8865, 8881, 8872, 8873, 8955, 10403, 8903, 8928, 9125, 8998, 8901, 8902, 8907, 8894, 8878, 8870, 8980, 8934, 8876 ],
"score" : 111,
"time" : 1175714200,
"title" : "My YC app: Dropbox - Throw away your USB drive",
"type" : "story",
"url" : "http://www.getdropbox.com/u/2/screencast.html"
}
I'm curious whether you also scraped the other item types and the remaining properties of, e.g., a story, and whether it would be possible to include those in the data dump or to share the script used for scraping.
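For reference, the difference between the two records above can be checked mechanically. This is only a sketch: both JSON strings are copied from this comment (with the kids list shortened to its first three entries), and missing_keys is a helper name invented here.

```python
import json

# Row as stored in the dump (from the hn_stories query above).
dump_row = json.loads(
    '{"by":"dhouston","time":1175714200,'
    '"title":"My YC app: Dropbox - Throw away your USB drive",'
    '"url":"http://www.getdropbox.com/u/2/screencast.html"}'
)

# Same item as returned by the API (kids list shortened for brevity).
api_item = json.loads(
    '{"by":"dhouston","descendants":71,"id":8863,"kids":[8952,9224,8917],'
    '"score":111,"time":1175714200,'
    '"title":"My YC app: Dropbox - Throw away your USB drive",'
    '"type":"story","url":"http://www.getdropbox.com/u/2/screencast.html"}'
)


def missing_keys(reference: dict, candidate: dict) -> list[str]:
    """Keys present in reference but absent from candidate, sorted."""
    return sorted(set(reference) - set(candidate))


print(missing_keys(api_item, dump_row))
# ['descendants', 'id', 'kids', 'score', 'type']
```

The id is still available as the first column of the table, so the properties actually lost from the dump are descendants, kids, score, and type.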
Thanks for this very handy script!
The requirements file seems to be missing. It looks like it should include something like:
aiohttp
tqdm
pyarrow
fastparquet
pandas