tweetopo's Introduction

A simple spider for Twitter interpersonal topology.

preview

three_seed_pagerank_heatmap

What's this?

This project is designed to analyse the relationships among the friends of seed users, producing a heatmap of the topology distribution and identifying the core people in a circle of relationships.

As shown above, the network graph consists of 1200+ friends of 3 seed users.

Of the three seed users, two are closely related to each other, while the third is unrelated to either. As a result, their friends clearly divide into two almost separate communities in the picture.

A node represents a person, and an edge represents two people who follow each other. Nodes and edges are colored as a heatmap: colors from warm to cool represent node rank values from high to low. The warmer the color, the more important the node; the cooler the color, the less relevant the node.

Strong (warm, red) ------> Weak (cool, blue)

hot_to_cold_map
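
The warm-to-cool mapping can be illustrated with a minimal sketch. It assumes rank values are already scaled to the range 0 to 1 and uses a plain linear blend from blue to red; the function names `rank_to_rgb` and `to_hex` are hypothetical, and the project's actual colormap is not described here.

```python
def rank_to_rgb(rank):
    """Map a rank in [0, 1] to an (r, g, b) tuple: 0 -> blue, 1 -> red."""
    # Clamp so out-of-range ranks still get a valid color.
    t = max(0.0, min(1.0, rank))
    # Simple linear blend between blue (cool) and red (warm); a real
    # heatmap colormap would usually pass through green and yellow too.
    return (t, 0.0, 1.0 - t)


def to_hex(rgb):
    """Format an (r, g, b) tuple of floats in [0, 1] as '#rrggbb'."""
    return "#" + "".join("{:02x}".format(round(c * 255)) for c in rgb)


# Hypothetical rank values for three nodes.
ranks = {"alice": 1.0, "bob": 0.5, "carol": 0.0}
colors = {name: to_hex(rank_to_rgb(r)) for name, r in ranks.items()}
```

Here the highest-ranked node comes out pure red and the lowest pure blue, matching the "Strong (warm, red) to Weak (cool, blue)" scale.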

Zoom in on the two communities separately to analyse them. As shown below, this is the relationship distribution heatmap of the unrelated seed user's friends. The user follows many accounts, but those accounts have almost no connections with each other, and the clustering rank values are very low overall. The distribution is therefore scattered, a sign that these friends are not in a "circle of friends".

sparse_community

Zooming in on the other side, as shown below, this distribution consists of the friends of the two related seed users. Not only are the two users closely related to each other, but most of their friends are also closely connected. They are clearly in a "circle of friends", and they are the target of our research.

intensive_community

In the picture above, there are many blue nodes (a sign of low relevance) around the border of the clustered core; these are not the target of our research.

To find the "core" of the community cluster, I tried seven or more measurement algorithms and kept the three that gave good results in the experiment: Degree Centrality, PageRank, and Clustering Coefficient. Each measure keeps the nodes whose rank value is greater than a designated threshold (0 to 1).

To show the effect, let's take all the friends of a single target user, plot their distribution, and filter out the core.

As shown below, this is the distribution heatmap of all the friends:
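
To make the three retained measures concrete, here is a minimal pure-Python sketch on a toy undirected graph. It is not the project's implementation: the adjacency-dict layout, the function names, and the PageRank parameters (`damping=0.85`, 100 iterations) are assumptions chosen for illustration.

```python
def degree_centrality(adj):
    """Fraction of the other nodes each node is connected to."""
    n = len(adj)
    if n <= 1:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def clustering_coefficient(adj):
    """Fraction of each node's neighbor pairs that are themselves linked."""
    result = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            result[v] = 0.0
            continue
        nbr_list = list(nbrs)
        links = sum(
            1
            for i, a in enumerate(nbr_list)
            for b in nbr_list[i + 1:]
            if b in adj[a]
        )
        result[v] = 2.0 * links / (k * (k - 1))
    return result


def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[v] for v in adj if not adj[v])
        new_rank = {}
        for v in adj:
            incoming = sum(rank[u] / len(adj[u]) for u in adj[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Toy mutual-follow graph: a, b, c form a triangle; d hangs off c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
degree = degree_centrality(adj)
clustering = clustering_coefficient(adj)
pr = pagerank(adj)
```

In this toy graph, `c` has the highest degree centrality and PageRank because it touches every other node, while `a` and `b` have the highest clustering coefficient because their two neighbors are linked to each other.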

one_seed_pagerank_heatmap

Filtering the core nodes whose rank value is greater than 0.3 through PageRank gives the result shown below:

pagerank_filter_less_0.3
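
The threshold step can be sketched as follows. Because raw PageRank values sum to 1 and are tiny on a large graph, a 0-to-1 threshold only makes sense after some rescaling; this sketch assumes scores are divided by the maximum score, which is a guess rather than the project's documented method. The names `normalize` and `filter_core` are hypothetical.

```python
def normalize(scores):
    """Scale scores so the largest becomes 1.0 (assumed normalization)."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {node: 0.0 for node in scores}
    return {node: value / top for node, value in scores.items()}


def filter_core(scores, threshold):
    """Keep nodes whose normalized rank is strictly greater than threshold."""
    return {n: r for n, r in normalize(scores).items() if r > threshold}


# Hypothetical raw scores for four nodes.
raw = {"a": 0.40, "b": 0.30, "c": 0.20, "d": 0.10}
core = filter_core(raw, 0.3)
```

With these made-up scores, node `d` normalizes to 0.25 and is dropped, while the other three nodes remain as the core.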

This mostly achieves the basic goal of the project.

Usage

It requires Python 3.5 or above.

Install

First, install the required Python packages:

# recommend to use venv
$ python -m venv venv
$ source venv/bin/activate
# or on Windows
# > venv\Scripts\activate.bat
$ pip install -r requirements.txt

Config

Then edit the config file with your own information, and set up the rules for which users to focus on.

$ cp tweetconf.json.example tweetconf.json
$ vim tweetconf.json
$ cp rules.json.example rules.json
$ vim rules.json

In the config JSON file, you need to set the Twitter tokens, the Mongo connection, and the seed users.

Twitter tokens and seed user names are lists, so you can set multiple items for each. The tokens are used by the spider's multithreading, and the seed names decide whom we crawl and analyse. Person info and relations are stored in Mongo.
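
As a rough illustration of how a list of tokens can drive multithreaded crawling, the sketch below starts one worker thread per token and lets the workers share a queue of seed names. The config keys (`tokens`, `seeds`), the token fields, and the `crawl` placeholder are all hypothetical; the real tweetconf.json layout and spider code may differ, and no Twitter API calls are made here.

```python
import json
import queue
import threading

# Hypothetical config layout; the real tweetconf.json keys may differ.
CONFIG_TEXT = """
{
  "tokens": [{"name": "token-a"}, {"name": "token-b"}],
  "seeds": ["seed_one", "seed_two", "seed_three"]
}
"""


def crawl(token, seed):
    """Placeholder for a real crawl; returns which token handled which seed."""
    return (token["name"], seed)


def worker(token, tasks, results):
    """Pull seed names off the shared queue until it is empty."""
    while True:
        try:
            seed = tasks.get_nowait()
        except queue.Empty:
            return
        # list.append is thread-safe in CPython.
        results.append(crawl(token, seed))


config = json.loads(CONFIG_TEXT)
tasks = queue.Queue()
for seed in config["seeds"]:
    tasks.put(seed)

results = []
threads = [
    threading.Thread(target=worker, args=(token, tasks, results))
    for token in config["tokens"]
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

One thread per token keeps each token's rate limit separate, which is the usual reason for configuring several tokens.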

Unit test

Before running the project, you can run the unit tests first. They use nose, a unit testing framework for Python.

$ packages="conffor, database, logsetting, netgraph, twitter"
$ python -m nose -w . -vs --with-coverage --cover-package="$packages"

Note: the database tests need MongoDB and a config in the test package named db_unit_test.json; the Twitter tests need the key and token configured in tweetconf.json.

Entry

File structure:

tweetopo/           # root directory
│
├─ docs/            # doc and image
│
├─ config/          # configuration files
│  │
│  ├─ rules.json       # focus filter rules
│  │
│  └─ tweetconf.json   # config file
│
├─ build/           # build scripts
│
├─ lib/             # lib package
│  │
│  ├─ twitter/      # twitter spider with tweepy
│  │
│  ├─ netgraph/     # graph struct process and data visualization
│  │
│  ├─ database/     # package for database operate
│  │
│  ├─ conffor/      # package for config and csv operate
│  │
│  ├─ logsetting/   # package for log system setting
│  │
│  └─ uitls/        # util code snippets
│
├─ src/             # operate script
│  │
│  ├─ scripts/      # split scripts for operation
│  │
│  └─ main.py       # entry for run scripts
│
├─ test/            # unit test
│
├─ README.md
│
└─ requirements.txt

main.py is the entry point for the Twitter spider that fetches data, and for the database operations that store and export data.

analyse_topology.py is the entry point for the CSV file operations that cache data lists, and for the graph analysis that visualizes the data and draws the result pictures.

Workflow

tweetopo_workflow

  1. Crawl and store Twitter user relation data, starting from the seed users.
  2. Export the db relations of the seed users' friends to relations.json.
  3. Calculate each user's mutual friends and cache them in mutual_friends.csv.
  4. Load the mutual friends file as edges to create the graph structure.
  5. Filter out low-rank nodes and cache the hub nodes in hub_persons.csv.
  6. Draw the network graph, PDF, and CDF.
  7. Select second-out friends that are repeated multiple times into secondouts.csv.
  8. Crawl and store Twitter user detail data for the hub persons and secondouts uid list.
  9. Export db persons to hub_persons.json for the hub persons and secondouts uid list.
  10. Filter the focus users that hit the selection rules into focus_hub.csv.
  11. Merge the hub list and detail data into hub_details.csv.
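
Steps 3, 4, and 7 above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the `following` mapping, the CSV column names, and the reading of "second-out" as an account outside the crawled set that several crawled users follow are all guesses.

```python
import csv
import io
from collections import Counter

# Hypothetical relation data: uid -> set of uids that user follows.
following = {
    1: {2, 3, 9},
    2: {1, 3, 9},
    3: {1, 9},
    4: {1},
}


def mutual_edges(following):
    """Return sorted (a, b) pairs, a < b, where a and b follow each other."""
    edges = set()
    for user, followed in following.items():
        for other in followed:
            if user < other and user in following.get(other, set()):
                edges.add((user, other))
    return sorted(edges)


def repeated_secondouts(following, min_count=2):
    """Count accounts outside the crawled set followed by several members."""
    counts = Counter(
        uid
        for followed in following.values()
        for uid in followed
        if uid not in following
    )
    return {uid: c for uid, c in counts.items() if c >= min_count}


edges = mutual_edges(following)

# Write edges as CSV text; a real run would target a file such as
# mutual_friends.csv, whose actual column layout is not given here.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["source", "target"])
writer.writerows(edges)

secondouts = repeated_secondouts(following)
```

Only pairs that follow each other become edges, so one-way follows such as user 4 following user 1 are dropped, which matches the edge definition given earlier in this README.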

The resulting hub_details.csv records each person's uid, name, three measure ranks, location, description, and other account details.

License

GPL
