Comments (5)
from 2017.
I'm a day late, but I posted the source code and a new version of the output. For the new version, I decided to ignore punctuation tokens when calculating the vectors for each novel. The result has fewer commas and somewhat more interesting variation, IMO!
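One way to do that punctuation filtering (a minimal sketch; the helper name `strip_punct_tokens` and the all-punctuation test are my assumptions, not necessarily how the posted source does it):

```python
import string

PUNCT = set(string.punctuation)

def strip_punct_tokens(tokens):
    """Drop tokens made entirely of punctuation (',', '--', '...', etc.)
    before computing a novel's embedding array."""
    return [t for t in tokens if not all(ch in PUNCT for ch in t)]
```

Tokens with any alphanumeric content (including contractions like `it's`) survive the filter; pure punctuation tokens are removed before they can pull the average toward the comma vector.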
from 2017.
Some progress! I present: The Average Novel.
I'm still working with Project Gutenberg files from the April 2010 DVD ISO (downloadable here)
and Leonard Richardson's 47000_metadata.json. Steps:
(1) Fetch every text in PG that was labelled as fiction, parse it into sentences, and use gensim's Word2Vec module to calculate 100-dimensional word embeddings from the resulting sentences.
(2) Create an array of word embeddings for every text (by looking up each word in the embedding) and normalize the length of these arrays to 50,000 (leaving ~11k arrays of dimensionality (50000, 100)).
(3) Sum the arrays for every length-normalized text and divide by the number of texts.
(4) For each vector in the resulting array, find the word with the closest embedding.
I guess I secretly hoped that this technique would reveal, average-face-like, the Narrative Ur-text underlying all storytelling. But the result is pretty much what I actually expected: all of the structural variation gets lost in the wash. (The `Produced` and `Proofreaders` tokens at the top are obviously remnants of PG credits and boilerplate that weren't caught by the filtering tools I'm using; the `,` token just happens to have been the vector most central to the average, which I guess kinda makes sense given how Word2Vec works. Not sure what all those pachyderms are doing in there, though.)
I'm planning to continue experimenting with this technique, but wanted to share this progress in case further experiments extend past the deadline.
from 2017.
going to post the source code for this soon, stay tuned!