daylightingsociety / socmap
Social Mapping Framework for Twitter
Home Page: https://socmap.daylightingsociety.org/
License: BSD 3-Clause "New" or "Revised" License
Right now we fail (with a pretty poor error message) if the workdir, mapdir, or tweetdir don't exist. Let's create those directories automatically, instead.
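A minimal sketch of the fix (the function name and the idea of calling it once at startup are assumptions, not existing code):

```python
import os

def ensureDirectories(*dirs):
    """Create each working directory if it doesn't already exist."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)  # no error if the directory is already there

# e.g. at startup:
# ensureDirectories(options.workdir, options.mapdir, options.tweetdir)
```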
Once we run SocMap data collection we have all the Tweet data stored locally. Given the list of seed users, we should easily be able to reparse those tweets and build separate maps only based on retweets or mentions. Usage would look something like:
tools/buildMentionMap.py userlist.txt 3 maps/layer2Mentions.gml
tools/buildRTMap.py userlist.txt 3 maps/layer2RTs.gml
This should be very easy if we reuse existing functionality from acquire.py and analyze.py, though we may need to add an extra optional parameter to the mapping functions to override the default filename.
This proposal is related to #29, but does not depend on it: rather than only collecting tweets within a certain date range (which is sometimes desirable), we can start with a full data sample and limit ourselves to the date range retroactively. This means adding:

- A max age (in epoch time) argument to loadTweetsFromFile, limiting which tweets to load
- A max age argument to getUserReferences, passed through to the loadTweetsFromFile max age
- A buildAgeMap tool script that builds an RT/Mention network, but takes an argument of the oldest date to build the network from
- Updates to buildRTMap and buildMentionMap so they can also use this functionality

This shouldn't be too complicated an addition, but it's a more sprawling change than #29. No changes to analyze.py or socmap.py are needed.
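A sketch of the max-age filter, assuming each user's tweets are stored as one JSON list per file; the "timestamp" key here is a placeholder for whatever epoch-time field the real on-disk format uses:

```python
import json

def loadTweetsFromFile(filename, maxAge=None):
    """Load tweets from disk, optionally dropping any older than maxAge.

    maxAge is an epoch timestamp: tweets older than it are filtered out.
    """
    with open(filename) as f:
        tweets = json.load(f)
    if maxAge is None:
        return tweets
    return [t for t in tweets if t["timestamp"] >= maxAge]
```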
Two options here:

- Parse the tweet manually, using a regular expression to extract usernames from the text
- See if the tweet JSON has an array of mentioned users, and extract that field instead

Let's aim for the second option. If we're lucky it's in the form of Twitter IDs instead of usernames, and we can just use those everywhere.
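We are in luck: the v1.1 tweet JSON carries an entities.user_mentions array, and each entry includes both the stable numeric id_str and the screen_name. A sketch of extracting IDs from that field (the helper name is hypothetical):

```python
def getMentionedIDs(tweet):
    """Pull mentioned user IDs straight from the tweet's entities block,
    rather than regexing the text. Returns an empty list if the field
    is missing (e.g. on malformed or truncated tweet data)."""
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    return [m["id_str"] for m in mentions]
```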
We occasionally run into a particularly popular user that retweets tens or hundreds of thousands of users, which dominates the social graph space and makes data collection take an inordinate amount of time. This may be undesired behavior. We should add a flag so the user can specify a per-user limit, and we halt data collection after finding, for example, 300 unique retweeted users for a single account history.
We currently have an option -M, --maxreferences that restricts the maximum number of edges leaving a node, to avoid the celebrity problem. However, we currently accomplish this by reading through a user's tweets reverse-chronologically until we've found enough mentions and retweets. This means -M 30 will get the 30 most recent retweets or mentions for each user. This is not always desirable; what if we want the strongest links between users instead of the most recent?

We should provide an option like -C, --common that changes behavior to read all mentions and retweets per user, sort by occurrence, and use the top X most frequent connections rather than the most recent activity.
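The sort-by-occurrence step is a natural fit for collections.Counter. A sketch (function name hypothetical), where `references` is every username a user mentioned or retweeted, with repeats:

```python
from collections import Counter

def topConnections(references, limit):
    """Return the `limit` most frequently referenced accounts,
    rather than the most recently referenced ones."""
    return [user for user, count in Counter(references).most_common(limit)]
```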
Since this is a very long-running program, it is not unlikely that execution will be halted partway through. Where are good points to checkpoint progress, and how can we detect checkpoints and recover from them automatically when the process is resumed?
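One cheap checkpoint we already get for free: each user's tweet file on disk. A sketch of resume-by-skipping, assuming per-user files named `<username>.json` (or `.json.gz` when compressed) in the tweet directory; the exact layout is an assumption:

```python
import os

def alreadyCollected(tweetdir, username, compress=False):
    """Treat an existing tweet file as a checkpoint: if we already have
    this user's tweets on disk, skip re-downloading them on resume."""
    filename = username + (".json.gz" if compress else ".json")
    return os.path.exists(os.path.join(tweetdir, filename))
```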
At present we collect the most recent ~2000 tweets from each user, per Twitter API limits. This can give an uneven approximation of a community, since it does not represent "recent" interactions in a true time sense, and will include the most recent tweets from a user even if they haven't tweeted in months.
Propose adding a --maxage / -A flag to the script for specifying an integer (or float?) number of days. No tweets older than this threshold will be collected.
Implementation will require adding an optional oldestDate argument to acquireTweets, which can take an epoch timestamp. If any tweets older than that timestamp are detected, break out of the for-loop currently at:
Lines 101 to 113 in c8e9f40
This will not require any changes to analyze.py, auxiliary tools, or any new interactions with tweepy.
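Since tweepy cursors yield tweets newest-first, the early break is enough. A sketch of the loop body, with `tweets` standing in for the cursor and each item assumed to expose created_at as a datetime (as tweepy status objects do):

```python
def collectRecentTweets(tweets, oldestDate=None):
    """Walk tweets newest-to-oldest and stop at the first one older than
    oldestDate (an epoch timestamp)."""
    kept = []
    for tweet in tweets:
        if oldestDate is not None and tweet.created_at.timestamp() < oldestDate:
            break  # everything after this one is older still, so stop early
        kept.append(tweet)
    return kept
```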
If the user begins a log path with a slash, such as -L /tmp/download.txt, then we should parse that as "put the log file in /tmp/download.txt" and not as "put the log file in work//tmp/download.txt". This means parsing the file name for the log file, and treating it specially if it begins with a slash.
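os.path already does the special-casing for us. A sketch (helper name hypothetical):

```python
import os

def resolveLogPath(logpath, workdir):
    """Absolute paths (leading slash) are used as-is; relative paths are
    placed under the work directory, as before."""
    if os.path.isabs(logpath):
        return logpath
    return os.path.join(workdir, logpath)
```

Note that `os.path.join(workdir, "/tmp/download.txt")` would also return "/tmp/download.txt" on POSIX, but the explicit check makes the intent obvious.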
Retweets currently display the screen name of the user that's been retweeted. Since screen names are not unique and are easily changed, we need to swap this out for either username (hard to change), or better yet, user ID (never changes).
We're probably just extracting the wrong field from the tweet JSON.
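The right field is there in the v1.1 tweet JSON: retweets carry a retweeted_status object whose user block includes the permanent id_str. A sketch (helper name hypothetical):

```python
def getRetweetedUserID(tweet):
    """For a retweet, read the stable numeric ID of the original author
    from retweeted_status, instead of the mutable screen_name."""
    rt = tweet.get("retweeted_status")
    if rt is None:
        return None  # not a retweet
    return rt["user"]["id_str"]
```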
Right now the documentation states:
-r, --removetweets Remove tweets after extracting metadata
That flag does nothing. We should either implement nuking the tweets as we download them (and maybe leave an empty file in place to mark that we've downloaded those tweets?) or remove the command line option if we don't intend to implement the feature.
Does this feature have merit? Are there enough cases where users would want the Tweet maps but not the Tweets, and are so space-constrained that it matters?
I'm leaning towards "No one cares, pull the feature", but it should also be relatively easy to implement, so what does everyone else think?
This depends on #1 and #2. Once we have the list of everyone mentioned or retweeted by our seed nodes we need to create the following files:

- A dictionary of all usernames or Twitter IDs to investigate from layer0's mentions
- A dictionary of all usernames or Twitter IDs to investigate from layer0's retweets

These should be pickled dictionaries, of the form (layer0username -> layer1username). They should be put in the workdir folder (default is "./work") and named layer0mentionedUsers.dict and layer0retweetedUsers.dict, respectively.
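A sketch of the save step (function name hypothetical; the dictionary shape follows the description above):

```python
import os
import pickle

def saveReferenceDict(workdir, name, references):
    """Pickle a {layer0username: layer1usernames} dictionary into the
    workdir, e.g. name='layer0mentionedUsers.dict'."""
    with open(os.path.join(workdir, name), "wb") as f:
        pickle.dump(references, f)
```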
Tweets are huge JSON blobs with lots of repetition that compress quite nicely with gzip. There's already a command-line argument for enabling gzip, so compression is included as a boolean in the options dictionary.

If compression is enabled we should add ".gz" to the end of all tweet filenames, and compress the JSON blobs before saving them to disk. Similarly, when reading tweets we should expect a ".gz" at the end, and decompress the files before loading the JSON blobs.
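A sketch of both directions, using gzip's text mode so the json module sees plain strings either way (function names hypothetical):

```python
import gzip
import json

def saveTweetBlob(filename, tweets, compress=False):
    """Write the tweet JSON blob, gzipped (with a .gz suffix) if enabled."""
    blob = json.dumps(tweets)
    if compress:
        with gzip.open(filename + ".gz", "wt") as f:
            f.write(blob)
    else:
        with open(filename, "w") as f:
            f.write(blob)

def loadTweetBlob(filename):
    """Read a tweet blob back, decompressing when the file ends in .gz."""
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "rt") as f:
        return json.loads(f.read())
```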
Right now we put a (src, dst) edge on the map if src mentions or retweets the destination. We should include additional information on the edge itself, namely how many times the source mentioned or retweeted the destination.
This allows us to differentiate between users that mentioned each other once from users that have long, regular conversations back and forth.
This will likely require changing several places where we extract usernames from tweets, so we track the number of occurrences (probably with a dictionary) rather than the set solution we use now.
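The dictionary mentioned above is essentially a Counter. A sketch of the swap (function name hypothetical), where `referenced` is the list of users referenced by one account, with repeats:

```python
from collections import Counter

def countReferences(referenced):
    """Replace the current set of referenced users with a per-user count,
    so each (src, dst) edge can carry the occurrence count as a weight."""
    return Counter(referenced)

# When building the graph, the count becomes an edge attribute, e.g.:
#   net.add_edge(src, dst, weight=counts[dst])
```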
Data sets get very large, very quickly. We can produce GML files containing millions of users, which are impossible to load in graph visualization tools like Gephi and Cytoscape.
Let's add tools that allow pruning users with fewer than an arbitrary number of tweets, or fewer than an arbitrary number of in- or out-edges.
This should help researchers remove the less active accounts from their data set, and reduce the graph to a more manageable size for their analysis tools.
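A dependency-free sketch of the pruning logic (function name and data shapes are assumptions; the real tool would operate on a networkx graph loaded from GML, using net.degree() and net.remove_node() the same way):

```python
def pruneGraph(nodes, edges, minTweets=0, minDegree=0):
    """Drop nodes with too few tweets or too few combined in/out edges.

    nodes: {username: tweetCount}; edges: list of (src, dst) pairs.
    Returns the surviving nodes and the edges between them.
    """
    degree = {}
    for src, dst in edges:
        degree[src] = degree.get(src, 0) + 1
        degree[dst] = degree.get(dst, 0) + 1
    keep = {u for u, tweets in nodes.items()
            if tweets >= minTweets and degree.get(u, 0) >= minDegree}
    prunedNodes = {u: nodes[u] for u in keep}
    prunedEdges = [(s, d) for s, d in edges if s in keep and d in keep]
    return prunedNodes, prunedEdges
```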
Current behavior:
06/21/18 18:28:26 Info: Beginning data collection for layer 0
Traceback (most recent call last):
  File "./socmap.py", line 103, in <module>
    acquire.getLayers(api, options.layers, options, layer0)
  File "/Users/milo/src/SocMap/acquire.py", line 156, in getLayers
    saveUserList(options.workdir, "layer" + str(layer) + "startingUsers", set(userlist))
  File "/Users/milo/src/SocMap/acquire.py", line 129, in saveUserList
    blob = json.dumps(dictionary)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py", line 180, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: {'USERNAME OF ACCOUNT WITH NO RETWEETS'} is not JSON serializable
Looks like there's an errant line in acquire.py that reads:

    saveUserList(options.workdir, "layer" + str(layer) + "startingUsers", set(userlist))

My guess is we tried to save off a list of the users we were working with, but saveUserList expects a dictionary of src->destination. Whoops.
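The underlying crash is simply that json.dumps cannot serialize Python sets. A minimal demonstration of the failure and one easy fix (convert to a sorted list first, for stable output):

```python
import json

# json.dumps raises TypeError on sets, which is exactly the crash above.
# Converting the set to a list first sidesteps it:
userlist = {"alice", "bob"}
blob = json.dumps(sorted(userlist))
```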
Some researchers will only care about mentions or retweets in a network. Including both will generate twice as large a network as necessary, so let's add command line flags for ignoring either retweets or mentions. These options should be mutually exclusive.
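argparse can enforce the mutual exclusion for us. A sketch, noting that the flag names here are placeholders (the project appears to use optparse-style short flags elsewhere, so the real spelling may differ):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--no-retweets", action="store_true",
                   help="Build the map from mentions only")
group.add_argument("--no-mentions", action="store_true",
                   help="Build the map from retweets only")
```

Passing both flags makes argparse print an error and exit, which is the behavior we want.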
I encountered an error when networkx tried to update the tweet counts of users in a saved network before saving.
Traceback (most recent call last):
  File "./socmap.py", line 113, in <module>
    acquire.getLayers(api, options.layers, options, layer0)
  File "/Users/user/SocMap/acquire.py", line 194, in getLayers
    analyze.saveNetwork(options.mapdir, layer, tweetCounts, nextLayerRTs, nextLayerMentions)
  File "/Users/user/SocMap/analyze.py", line 76, in saveNetwork
    net.node[username]["tweets"] = baseUsers[username]
AttributeError: 'DiGraph' object has no attribute 'node'
I think the attribute is named nodes, not node: NetworkX deprecated Graph.node in the 2.x series and removed it in 2.4, so net.nodes[username]["tweets"] = ... should work on current versions.
Pip is the Python package manager, equivalent to Ruby's gem or Perl's CPAN. We've got enough of a pip framework set up that you can install all of socmap's dependencies with pip install . in the socmap folder.

What would it take to configure socmap as a complete Python package so users can install it with pip install socmap and don't have to deal with GitHub and installation from source? I've never gone through this process with Python, and don't know whether it's a good fit for a multi-file project like socmap.
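The main missing pieces are package metadata and an entry point. A setup.py sketch, where the version, module list, dependency list, and the assumption that socmap.py exposes a main() function are all placeholders to be checked against the real code:

```python
# setup.py sketch -- metadata and names here are placeholders
from setuptools import setup

setup(
    name="socmap",
    version="0.1.0",
    py_modules=["socmap", "acquire", "analyze"],
    install_requires=["tweepy", "networkx"],
    entry_points={
        # assumes socmap.py wraps its top-level code in a main() function
        "console_scripts": ["socmap=socmap:main"],
    },
)
```

With that in place, publishing is a matter of uploading the built distribution to PyPI so pip install socmap can find it.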
NetworkX produces GML files that include a numeric label on each node. These may be valid per the GML specification, but they are not readable by Cytoscape. Cytoscape is great, and we want to support opening our files in that tool. Therefore, we need to patch the GML files and remove the numeric labels. Here's an example function that does the job:
import re

def patch_gml(filename):
    # Read the whole GML file, strip every "label <value>" entry,
    # and write the result back in place.
    with open(filename, "r+") as f:
        content = f.read()
        newcontent = re.sub(r"label (\S+)\s+", "", content)
        f.seek(0, 0)
        f.truncate()
        f.write(newcontent)
Once we begin working with very large GML files we may need something more efficient than "Put entire GML file in a string, apply regex, save back out to file", but the idea is the same.
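A streaming variant along those lines, assuming (as networkx's GML writer does) that each label sits on its own line; it writes through a temp file in the same directory so the swap at the end is atomic:

```python
import os
import re
import tempfile

def patch_gml_streaming(filename):
    """Line-by-line variant for very large GML files: stream through a
    temp file instead of holding the whole graph in memory."""
    label = re.compile(r'^\s*label \S+\s*$')
    dirname = os.path.dirname(os.path.abspath(filename))
    fd, tmppath = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as out, open(filename) as f:
        for line in f:
            if not label.match(line):
                out.write(line)
    os.replace(tmppath, filename)  # atomically swap in the patched file
```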
Hi there, initial startup of the program fails.
./socmap.py -a auth.txt -u userlist.txt -l 2
09/25/22 13:24:13 Info: Beginning data collection for layer 0
Traceback (most recent call last):
  File "/home/snafu/Tools/SocMap/acquire.py", line 26, in limit_handled
    yield cursor.next()
  File "/home/snafu/.local/lib/python3.8/site-packages/tweepy/cursor.py", line 286, in next
    self.current_page = next(self.page_iterator)
  File "/home/snafu/.local/lib/python3.8/site-packages/tweepy/cursor.py", line 86, in __next__
    return self.next()
  File "/home/snafu/.local/lib/python3.8/site-packages/tweepy/cursor.py", line 190, in next
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./socmap.py", line 114, in <module>
    acquire.getLayers(api, options.layers, options, layer0)
  File "/home/snafu/Tools/SocMap/acquire.py", line 167, in getLayers
    getUserTweets(api, username, options.tweetdir, options.numtweets, options.compress)
  File "/home/snafu/Tools/SocMap/acquire.py", line 101, in getUserTweets
    for tweet in limit_handled(api, cursor.items()):
  File "/home/snafu/Tools/SocMap/acquire.py", line 27, in limit_handled
    except tweepy.error.TweepError as e:
AttributeError: module 'tweepy' has no attribute 'error'
Any suggestions?
We need to rewrite getLayers() as a loop. As pseudocode it should look like:

def getLayers(...):
    for layer in (0 .. numLayers):
        get user list
        get tweets from users on user list
        extract mentions and retweets from users on current list
        save dictionaries of who's retweeting whom and who's mentioning whom
        save map of progress so far
We should include the number of tweets we collected from a user in their node on the graph. This makes it easy to remove deleted / banned accounts (which will have 0 tweets) from a graph, as well as prune relatively inactive users, if the researcher wants to.
This should be as simple as including a 'tweets' field when creating the node. We'll probably have to add a getNumberOfTweets(username) function to analyze.py, and preferably we can learn the answer without fully expanding the JSON back to tweet objects again, since that would waste time.
Say we have two data sets, such as Anonymous and Telecomix. We should be able to merge these into a single "Hacktivism" data set by looking for shared nodes (with the same name) and producing a third combined map joined at these points. Shared nodes would inherit attributes from the lowest common denominator.
Let's build a merging tool, with usage like:
$ mergeMaps.py anonymous.gml telecomix.gml combined_hactivism.gml
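A dependency-free sketch of the merge rule for node attributes (function name and data shapes are assumptions; for the graphs themselves, networkx's nx.compose() performs the node/edge union, though it lets the second graph's attributes win rather than intersecting them):

```python
def mergeMaps(mapA, mapB):
    """Merge two {node: attrs} maps, keeping shared nodes once.

    For nodes present in both maps, keep only the attributes the two
    copies agree on -- the 'lowest common denominator'."""
    merged = {node: dict(attrs) for node, attrs in mapA.items()}
    for node, attrs in mapB.items():
        if node in merged:
            merged[node] = {k: v for k, v in merged[node].items()
                            if attrs.get(k) == v}
        else:
            merged[node] = dict(attrs)
    return merged
```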
We need much more extensive logging so the user can determine how much progress is being made, and what's gone wrong when something breaks. We'll use this issue to track how logging should be implemented, and where logging is needed most.
Similar to #3. Once we have dictionaries of which seed users are mentioning whom, and which seed users are retweeting whom, we need to save the results as a GML graph to disk. Specifically we want a directed graph of seed user -> retweeted/mentioned layer 1 user, saved as layer1.gml. This should be saved in the mapdir ("./map" by default). Each node representing a user should contain the following information:
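Whatever attributes end up on each node, the shape of the output can be sketched with a minimal hand-rolled GML writer (the real code would build a networkx DiGraph and use write_gml; the function name here is hypothetical):

```python
import os

def saveLayerMap(mapdir, layer, edges):
    """Write directed seed->referenced edges as a minimal GML file,
    e.g. map/layer1.gml. edges is a list of (src, dst) username pairs."""
    nodes = sorted({u for edge in edges for u in edge})
    index = {u: i for i, u in enumerate(nodes)}
    lines = ["graph [", "  directed 1"]
    for u in nodes:
        lines += ["  node [", "    id %d" % index[u],
                  '    label "%s"' % u, "  ]"]
    for src, dst in edges:
        lines += ["  edge [", "    source %d" % index[src],
                  "    target %d" % index[dst], "  ]"]
    lines.append("]")
    path = os.path.join(mapdir, "layer%d.gml" % layer)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path
```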
We need to agree on a license before SpeakFree is officially FOSS!
I'm leaning towards BSD-3-clause, since it's simple, permissive, and protects us from misattribution if someone uses this tool for unsavory and unintended purposes. Here's a discussion space for the age-old debate between BSD/MIT/GPL, if anyone wants to weigh in.