socmap's Issues

Add tool for making retweet and mention maps

Once we run SocMap data collection, we have all the tweet data stored locally. Given the list of seed users, we should easily be able to reparse those tweets and build separate maps based only on retweets or mentions. Usage would look something like:

tools/buildMentionMap.py userlist.txt 3 maps/layer2Mentions.gml

tools/buildRTMap.py userlist.txt 3 maps/layer2RTs.gml

This should be very easy if we reuse existing functionality from acquire.py and analyze.py, though we may need to add an extra optional parameter to the mapping functions to override the default filename.
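
For illustration, a minimal sketch of what tools/buildRTMap.py could look like, reusing loadTweetsFromFile from the main codebase. The exact signatures and the Retweet attribute names are assumptions here, and layer handling is omitted for brevity:

#!/usr/bin/env python3
# Sketch of tools/buildRTMap.py; function names borrowed from
# acquire.py are assumptions, and layer traversal is omitted.
import sys
import networkx as nx
import acquire

def buildRTMap(userlistFile, outFile):
    net = nx.DiGraph()
    with open(userlistFile) as f:
        users = [line.strip().lower() for line in f if line.strip()]
    for username in users:
        for tweet in acquire.loadTweetsFromFile(username):  # assumed helper
            if hasattr(tweet, "orig_author"):  # retweets only
                net.add_edge(username, tweet.orig_author)
    nx.write_gml(net, outFile)

if __name__ == "__main__":
    buildRTMap(sys.argv[1], sys.argv[2])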

Add an option for generating maps using only tweets within time threshold

This proposal is related to #29, but does not depend on it: Rather than only collecting tweets within a certain date range (which is sometimes desirable), we can start with a full data sample and limit ourselves to the date range retroactively. This means adding:

  • An optional max-age parameter (in epoch time) on loadTweetsFromFile
  • An optional max-age parameter on getUserReferences, passed through to loadTweetsFromFile
  • A new buildAgeMap tool script that builds an RT/mention network, taking an argument for the oldest date to build the network from
  • Optional max-age arguments to buildRTMap and buildMentionMap so they can also use this functionality

This shouldn't be too complicated an addition, but it's a more sprawling change than #29. No changes to analyze.py or socmap.py should be needed.
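
A minimal sketch of the loadTweetsFromFile change; the existing signature and the helper it calls are assumptions here:

# Sketch: optional maxAge (epoch time) filter on tweet loading.
# loadTweetsFromFile's existing signature is an assumption.
def loadTweetsFromFile(username, tweetdir, maxAge=None):
    tweets = readTweetBlobs(username, tweetdir)  # hypothetical helper
    if maxAge is None:
        return tweets
    # Keep only tweets at least as new as the threshold
    return [t for t in tweets if t.date.timestamp() >= maxAge]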

Implement getMentionsFromText()

Two options here:

  1. Parse the tweet manually, using a regular expression to extract usernames from the text

  2. See if the tweet JSON has an array of mentioned users, and extract that field instead

Let's aim for the second option. If we're lucky, the field uses Twitter IDs instead of usernames, and we can just use those everywhere.
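
For reference, a sketch of both options; the standard v1.1 tweet JSON does carry an entities.user_mentions array, so option 2 mostly reduces to reading that field:

import re

def getMentionsFromText(text):
    # Option 1: regex over the raw text; Twitter handles are 1-15
    # word characters after an "@"
    return [name.lower() for name in re.findall(r"@(\w{1,15})", text)]

def getMentionsFromTweet(tweet):
    # Option 2: read entities.user_mentions from the tweet JSON, which
    # carries the stable numeric ID alongside the screen name
    return [m["id_str"] for m in tweet.entities.get("user_mentions", [])]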

Add configurable cap of per-user links

We occasionally run into a particularly popular user that has retweeted tens or hundreds of thousands of distinct users, which dominates the social graph and makes data collection take an inordinate amount of time. This may be undesired behavior. We should add a flag so the user can specify a per-user limit, halting data collection after finding, for example, 300 unique retweeted users in a single account's history.
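
A sketch of how the cap might sit in the collection loop; the flag name and the surrounding variables are assumptions:

# Sketch: stop once one account has yielded options.maxlinks unique
# referenced users. "maxlinks" is a hypothetical flag name.
referenced = set()
for tweet in tweets:
    referenced.update(tweet.mentions)  # attribute name assumed from acquire.py
    if options.maxlinks and len(referenced) >= options.maxlinks:
        break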

Add an option for getting most common edges instead of most recent

We currently have an option -M, --maxreferences that restricts the maximum number of edges leaving a node, to avoid the celebrity problem. However, we currently accomplish this by reading through their tweets reverse-chronologically until we've found enough mentions and retweets. This means -M 30 will get the 30 most recent retweets or mentions for each user. This is not always desirable; what if we want the strongest links between users instead of the most recent?

We should provide an option like -C, --common that changes behavior to read all mentions and retweets per user, sort by occurrence, and use the top X most occurring connections rather than most recent activity.
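
collections.Counter does the heavy lifting for a -C, --common mode; a sketch:

from collections import Counter

def topConnections(references, limit):
    # references: every username this account retweeted or mentioned,
    # with repetition; return the `limit` most frequent connections
    # instead of the most recent ones
    return [user for user, count in Counter(references).most_common(limit)]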

Add recovery

Since this is a very long-running program, it is not unlikely that execution will be halted partway through. Where are good points to checkpoint progress, and how can we detect checkpoints and recover from them automatically when the process is resumed?

Add an option for limiting the age of collected tweets

At present we collect the most recent ~2000 tweets from each user, per Twitter API limits. This can give an uneven approximation of a community, since it does not represent "recent" interactions in a true time sense, and will include the most recent tweets from a user even if they haven't tweeted in months.

Propose adding an -A, --maxage flag to the script for specifying an integer (or float?) number of days. No tweets older than this threshold will be collected.

Implementation will require adding an optional oldestDate argument to acquireTweets, which can take an epoch timestamp. If any tweets older than that timestamp are detected, break out of the for-loop currently at:

SocMap/acquire.py

Lines 101 to 113 in c8e9f40

for tweet in limit_handled(api, cursor.items()):
    mentions = getMentionsFromText(tweet.text)
    date = tweet.created_at
    text = tweet.text
    source = tweet.user.screen_name.lower()
    if( hasattr(tweet, "retweeted_status") ):
        orig_author = tweet.retweeted_status.user.screen_name.lower()
        rt_count = tweet.retweeted_status.retweet_count
        rt = Retweet(source, text, date, mentions, orig_author, rt_count)
        tweets.append(rt)
    else:
        tw = Tweet(source, text, date, mentions)
        tweets.append(tw)

This will not require any changes to analyze.py or the auxiliary tools, or any new interactions with tweepy.
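
A sketch of the proposed break, assuming oldestDate is an epoch timestamp. Since the timeline is returned newest-first, everything after the first too-old tweet is older still:

# Sketch: inside the loop quoted above, stop paging once a tweet is
# older than oldestDate
for tweet in limit_handled(api, cursor.items()):
    if oldestDate is not None and tweet.created_at.timestamp() < oldestDate:
        break
    ...  # existing tweet handling continues unchanged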

Allow absolute log paths

If the user begins a log path with a slash, such as -L /tmp/download.txt, then we should parse that as "put the log file in /tmp/download.txt" and not as "put the log file in work//tmp/download.txt".

This means parsing the log file name and treating it specially if it begins with a slash.
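
os.path can handle this directly; a sketch (the helper name is illustrative):

import os

def logPath(workdir, logfile):
    # Absolute paths are used as-is; relative paths land in workdir.
    # (os.path.join already discards the first argument when the
    # second is absolute, so the join alone would also work.)
    if os.path.isabs(logfile):
        return logfile
    return os.path.join(workdir, logfile)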

Switch retweets from screen name to username

Retweets currently record the screen name of the user that's been retweeted. Since screen names are easily changed (and a freed screen name can be claimed by another account), we need to swap this out for something stable, ideally the numeric user ID, which never changes.

We're probably just extracting the wrong field from the tweet JSON.
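
If so, the fix may be a single field: the v1.1 tweet JSON exposes the stable numeric ID right next to the screen name.

# Sketch: prefer the stable numeric ID over the mutable screen name
orig_author = tweet.retweeted_status.user.id_str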

Either implement 'Remove Tweets' or pull the command line option

Right now the documentation states:

-r, --removetweets    Remove tweets after extracting metadata

That flag does nothing. We should either implement nuking the tweets as we download them (and maybe leave an empty file in place to mark that we've downloaded those tweets?) or remove the command line option if we don't intend to implement the feature.

Does this feature have merit? Are there enough cases where users would want the Tweet maps but not the Tweets, and are so space-constrained that it matters?

I'm leaning towards "No one cares, pull the feature", but it should also be relatively easy to implement, so what does everyone else think?

Save the user lists from layer0 to disk

This depends on #1 and #2. Once we have the list of everyone mentioned or retweeted by our seed nodes we need to create the following files:

  1. A dictionary of all usernames or Twitter IDs to investigate from layer0's mentions

  2. A dictionary of all usernames or Twitter IDs to investigate from layer0's retweets

These should be pickled dictionaries of the form (layer0username -> layer1username). They should be put in the workdir folder (default is "./work") and named layer0mentionedUsers.dict and layer0retweetedUsers.dict, respectively.
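
A sketch of the save step with pickle (the helper name is illustrative):

import os
import pickle

def saveUserDict(workdir, name, mapping):
    # mapping: {layer0username: layer1username}, written to
    # workdir/<name>.dict
    with open(os.path.join(workdir, name + ".dict"), "wb") as f:
        pickle.dump(mapping, f)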

Add Tweet Compression

Tweets are huge JSON blobs with lots of repetition that compress quite nicely with gzip. There's already a command-line argument for enabling gzip, so compression is included as a boolean in the options dictionary.

If compression is enabled we should append ".gz" to all tweet filenames and compress the JSON blobs before saving them to disk. Similarly, when reading tweets we should expect the ".gz" suffix and decompress the blobs before loading the JSON.
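
The gzip module makes this nearly transparent; a sketch (function names are illustrative):

import gzip

def saveTweetBlob(path, blob, compress=False):
    # With compression on, append ".gz" and gzip the JSON text
    if compress:
        with gzip.open(path + ".gz", "wt") as f:
            f.write(blob)
    else:
        with open(path, "w") as f:
            f.write(blob)

def loadTweetBlob(path, compress=False):
    if compress:
        with gzip.open(path + ".gz", "rt") as f:
            return f.read()
    with open(path) as f:
        return f.read()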

Add link counts

Right now we put a (src, dst) edge on the map if src mentions or retweets the destination. We should include additional information on the edge itself, namely:

  • mentionCount (int)
  • retweetCount (int)

This allows us to differentiate users that mentioned each other once from users that have long, regular back-and-forth conversations.

This will likely require changing several places where we extract usernames from tweets, so that we track the number of occurrences (probably with a dictionary) rather than using the set-based approach we have now.
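
A sketch of the counting change, with the count stored as an edge attribute in networkx (the function name is hypothetical):

from collections import Counter
import networkx as nx

def addMentionEdges(net, src, mentionedUsers):
    # mentionedUsers: every username src mentioned, with repetition;
    # a Counter replaces the current set so the count survives onto
    # the edge
    for dst, count in Counter(mentionedUsers).items():
        net.add_edge(src, dst, mentionCount=count)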

Add tools for pruning based on tweet and link counts

Data sets get very large, very quickly. We can produce GML files containing millions of users, which are impossible to load in graph visualization tools like Gephi and Cytoscape.

Let's add tools that allow pruning users with fewer than an arbitrary number of tweets, or fewer than an arbitrary number of in- or out-edges.

This should help researchers remove the less active accounts from their data set, and reduce the graph to a more manageable size for their analysis tools.
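
A sketch of such a pruning tool over a saved GML file; the "tweets" node attribute matches the one proposed elsewhere in these issues:

import networkx as nx

def pruneGraph(infile, outfile, minTweets=0, minDegree=0):
    net = nx.read_gml(infile)
    # Collect the victims first so we don't mutate while iterating
    drop = [n for n, data in net.nodes(data=True)
            if data.get("tweets", 0) < minTweets
            or net.degree(n) < minDegree]
    net.remove_nodes_from(drop)
    nx.write_gml(net, outfile)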

Crashes during data collection

Current behavior:

06/21/18 18:28:26 Info: Beginning data collection for layer 0
Traceback (most recent call last):
  File "./socmap.py", line 103, in <module>
    acquire.getLayers(api, options.layers, options, layer0)
  File "/Users/milo/src/SocMap/acquire.py", line 156, in getLayers
    saveUserList(options.workdir, "layer" + str(layer) + "startingUsers", set(userlist))
  File "/Users/milo/src/SocMap/acquire.py", line 129, in saveUserList
    blob = json.dumps(dictionary)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py", line 180, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: {'USERNAME OF ACCOUNT WITH NO RETWEETS'} is not JSON serializable

Looks like there's an errant line in acquire.py that reads:

saveUserList(options.workdir, "layer" + str(layer) + "startingUsers", set(userlist))

My guess is we tried to save off a list of the users we were working with, but saveUserList expects a dictionary of src -> destination. Whoops.
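
One possible fix, sketched: make saveUserList tolerate any collection by normalizing before serializing. The rest of the function body is an assumption here.

import json

# Sketch: sets aren't JSON serializable, so normalize to a sorted
# list before dumping. writeBlob is a hypothetical save helper.
def saveUserList(workdir, name, users):
    blob = json.dumps(sorted(users))
    writeBlob(workdir, name, blob)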

Add options for ignoring mentions or retweets

Some researchers will only care about mentions or retweets in a network. Including both will generate a network twice as large as necessary, so let's add command-line flags for ignoring either retweets or mentions. These options should be mutually exclusive.
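
argparse supports the mutual exclusion directly; a sketch, with placeholder flag names:

import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--ignore-retweets", action="store_true",
                   help="Build the network from mentions only")
group.add_argument("--ignore-mentions", action="store_true",
                   help="Build the network from retweets only")
options = parser.parse_args()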

Networkx wrong attribute name

I encountered an error while networkx was updating the tweet counts of users in the network just before saving.

Traceback (most recent call last):
  File "./socmap.py", line 113, in <module>
    acquire.getLayers(api, options.layers, options, layer0)
  File "/Users/user/SocMap/acquire.py", line 194, in getLayers
    analyze.saveNetwork(options.mapdir, layer, tweetCounts, nextLayerRTs, nextLayerMentions)
  File "/Users/user/SocMap/analyze.py", line 76, in saveNetwork
    net.node[username]["tweets"] = baseUsers[username]
AttributeError: 'DiGraph' object has no attribute 'node'

I think the attribute is named nodes, not node (i.e. net.nodes[username]["tweets"] = ...); the G.node alias was removed in NetworkX 2.4.

Set up as a pip package

Pip is the Python package manager, equivalent to Ruby's gem or Perl's CPAN. We've got enough of a pip framework set up that you can install all of socmap's dependencies with pip install . in the socmap folder.

What would it take to configure socmap as a complete Python package, so users can install it with pip install socmap and don't have to deal with GitHub and installation from source? I've never gone through this process with Python, and don't know whether it's a good fit for a multi-file project like socmap.
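
For what it's worth, multi-file projects package fine. A minimal setup.py sketch, with placeholder metadata and assuming the code is arranged as an importable package exposing a main() callable:

# Minimal setup.py sketch; metadata values are placeholders.
from setuptools import setup, find_packages

setup(
    name="socmap",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["tweepy", "networkx"],
    entry_points={
        # assumes socmap exposes a main() callable
        "console_scripts": ["socmap = socmap:main"],
    },
)

From there, publishing is a matter of building a source/wheel distribution and uploading it to PyPI with twine.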

Patch NetworkX GML files

NetworkX produces GML files that include a numeric label on each node. These may be valid per the GML specification, but they are not readable by Cytoscape. Cytoscape is bomb, and we want to support opening our files in that tool. Therefore, we need to patch the GML files and remove the numeric labels. Here's an example function that does the job:

import re

def patch_gml(filename):
    # Strip the "label <n>" attributes NetworkX emits, rewriting the
    # file in place so Cytoscape can open it
    with open(filename, "r+") as f:
        content = f.read()
        newcontent = re.sub(r"label (\S+)\s+", "", content)
        f.seek(0, 0)
        f.truncate()
        f.write(newcontent)

Once we begin working with very large GML files we may need something more efficient than "put the entire GML file in a string, apply a regex, save it back out to the file", but the idea is the same.

Fails to run

Hi there, initial startup of the program fails.

./socmap.py -a auth.txt -u userlist.txt -l 2

09/25/22 13:24:13 Info: Beginning data collection for layer 0
Traceback (most recent call last):
  File "/home/snafu/Tools/SocMap/acquire.py", line 26, in limit_handled
    yield cursor.next()
  File "/home/snafu/.local/lib/python3.8/site-packages/tweepy/cursor.py", line 286, in next
    self.current_page = next(self.page_iterator)
  File "/home/snafu/.local/lib/python3.8/site-packages/tweepy/cursor.py", line 86, in __next__
    return self.next()
  File "/home/snafu/.local/lib/python3.8/site-packages/tweepy/cursor.py", line 190, in next
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./socmap.py", line 114, in <module>
    acquire.getLayers(api, options.layers, options, layer0)
  File "/home/snafu/Tools/SocMap/acquire.py", line 167, in getLayers
    getUserTweets(api, username, options.tweetdir, options.numtweets, options.compress)
  File "/home/snafu/Tools/SocMap/acquire.py", line 101, in getUserTweets
    for tweet in limit_handled(api, cursor.items()):
  File "/home/snafu/Tools/SocMap/acquire.py", line 27, in limit_handled
    except tweepy.error.TweepError as e:
AttributeError: module 'tweepy' has no attribute 'error'

Any suggestions?

Multi-layer getLayers()

We need to rewrite getLayers() as a loop. As pseudocode, it should look like:

def getLayers(...):
    for layer in range(numLayers + 1):
        # get the user list for this layer
        # get tweets from the users on that list
        # extract mentions and retweets from those tweets
        # save dictionaries of who's retweeting whom and who's mentioning whom
        # save a map of the progress so far

Include number of tweets in graph node

We should include the number of tweets we collected from a user in their node on the graph. This makes it easy to remove deleted / banned accounts (which will have 0 tweets) from a graph, as well as prune relatively inactive users, if the researcher wants to.

This should be as simple as including a 'tweets' field when creating the node. We'll probably have to add a getNumberOfTweets(username) function to analyze.py, and preferably we can learn the answer without fully expanding the JSON back into tweet objects again, since that wastes time.
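
Assuming each user's tweets are stored as one JSON array per file (the path scheme here is a guess), a sketch that skips rebuilding Tweet objects:

import json
import os

def getNumberOfTweets(username, tweetdir):
    # Sketch: json.load is still needed, but we only take the array
    # length rather than reconstructing Tweet objects
    path = os.path.join(tweetdir, username + ".json")
    if not os.path.exists(path):
        return 0  # deleted/banned accounts yield no tweet file
    with open(path) as f:
        return len(json.load(f))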

Add tool for merging maps

Say we have two data sets, such as Anonymous and Telecomix. We should be able to merge these into a single "Hacktivism" data set by looking for shared nodes (with the same name) and producing a third, combined map joined at those points.

Shared nodes would inherit attributes from the lowest common denominator. For example:

  • layer = min(N1.layer, N2.layer)
  • retweeted = (N1.retweeted or N2.retweeted)
  • mentioned = (N1.mentioned or N2.mentioned)

Let's build a merging tool, with usage like:

$ mergeMaps.py anonymous.gml telecomix.gml combined_hactivism.gml
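
networkx's compose() does most of the merge; a sketch, with the inheritance rules above re-applied to shared nodes afterwards (compose otherwise keeps the second graph's attributes):

import sys
import networkx as nx

def mergeMaps(file1, file2, outfile):
    g1, g2 = nx.read_gml(file1), nx.read_gml(file2)
    merged = nx.compose(g1, g2)  # nodes with the same name are joined
    for n in set(g1) & set(g2):  # re-apply the inheritance rules
        merged.nodes[n]["layer"] = min(g1.nodes[n]["layer"], g2.nodes[n]["layer"])
        merged.nodes[n]["retweeted"] = g1.nodes[n]["retweeted"] or g2.nodes[n]["retweeted"]
        merged.nodes[n]["mentioned"] = g1.nodes[n]["mentioned"] or g2.nodes[n]["mentioned"]
    nx.write_gml(merged, outfile)

if __name__ == "__main__":
    mergeMaps(sys.argv[1], sys.argv[2], sys.argv[3])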

Add Logging

We need much more extensive logging so the user can determine how much progress is being made, and what's gone wrong when something breaks. We'll use this issue to track how logging should be implemented, and where logging is needed most.

Save the user maps from layer0 to disk

Similar to #3. Once we have dictionaries of which seed users are mentioning whom, and which seed users are retweeting whom, we need to save the results as a GML graph on disk. Specifically, we want a directed graph with an edge from each seed user to every layer 1 user they retweeted or mentioned, saved as layer1.gml.

This should be saved in the mapdir ("./map" by default). Each node representing a user should contain the following information (a build-and-save sketch follows the list):

  • username
  • screen name
  • twitter ID
  • layer
  • retweeted (true if the user is included because they were retweeted, false otherwise)
  • mentioned (true if the user is included because they were mentioned, false otherwise)
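
A sketch of building and saving that graph with networkx; the username/screen name/ID attributes, and the seed-node attributes, are omitted for brevity:

import os
import networkx as nx

def saveLayer1Map(mapdir, retweets, mentions):
    # retweets, mentions: {seed_user: iterable of layer-1 usernames}
    net = nx.DiGraph()
    for src, targets in retweets.items():
        for dst in targets:
            net.add_node(dst, layer=1, retweeted=True, mentioned=False)
            net.add_edge(src, dst)
    for src, targets in mentions.items():
        for dst in targets:
            if dst not in net:
                net.add_node(dst, layer=1, retweeted=False, mentioned=False)
            net.nodes[dst]["mentioned"] = True
            net.add_edge(src, dst)
    nx.write_gml(net, os.path.join(mapdir, "layer1.gml"))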

Choose a License

We need to agree on a license before SpeakFree is officially FOSS!

I'm leaning towards BSD-3-clause, since it's simple, permissive, and protects us from misattribution if someone uses this tool for unsavory and unintended purposes. Here's a discussion space for the age-old debate between BSD/MIT/GPL, if anyone wants to weigh in.
