The cis192finalproject from fredwangster

README
======
------------------
Code Organization:
------------------
1) analyzer.py 
    A class that must be initialized before use.
    - accepts: filename, url_name
    - filename is the source file listing all the urls reachable from the homepage. crawler.py generates this file, so the crawler must run first.
    - url_name is the root url of the site we want to look at, for example, "http://www.amazon.com"
    - iterating through the list of urls, analyzer finds all the ads on each page by matching against an ad_site_list we have hardcoded into the script
    - scores each site based on how many ads we find x links deep into the site, the popularity of the site, and how far its ad % (ads as a % of content) is from the optimal ad %
    - There is a limit of 2000 queries per day using 2 API keys from Compete
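
As a rough illustration of the ad-matching step above, here is a minimal sketch. It is not the code in analyzer.py: it uses the Python 3 standard library instead of BeautifulSoup so that it stays self-contained, and the AD_SITE_LIST entries, the AdCounter class and the count_ads function are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stand-in for the ad_site_list hardcoded in analyzer.py.
AD_SITE_LIST = ["doubleclick.net", "googlesyndication.com"]


class AdCounter(HTMLParser):
    """Counts src/href attributes whose host is a known ad host."""

    def __init__(self, ad_sites):
        super().__init__()
        self.ad_sites = ad_sites
        self.ad_count = 0
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            self.link_count += 1
            try:
                host = urlparse(value).hostname or ""
            except ValueError:
                continue  # malformed url, cannot be matched against the list
            # Match the ad host itself or any of its subdomains.
            if any(host == site or host.endswith("." + site)
                   for site in self.ad_sites):
                self.ad_count += 1


def count_ads(html, ad_sites=AD_SITE_LIST):
    """Return (ad_count, link_count) for one page of html."""
    parser = AdCounter(ad_sites)
    parser.feed(html)
    return parser.ad_count, parser.link_count
```

The two counts could then feed into the ad % part of the score. How the real script combines ad %, link depth and popularity is not shown here.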
    
2) crawler.py 
    - accepts: root, crawl_depth
    - root is the root url of the site we want to look at, for example, "http://www.amazon.com"
    - crawl_depth is how many links deep the crawler goes. For example, crawl_depth = 1 means our crawler will return all the links on the main homepage, while crawl_depth = 2 means our crawler will also follow the links in the first level, and return all the links from level 2 + level 1 + root
    
    - writes to file "./urls_inputs/root_name.txt"
    - returns out_name, root_name
    - out_name is the path of the file it writes to, i.e. "./urls_inputs/root_name.txt"
    - root_name is the stripped version of the url, e.g. www.amazon.com
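
The crawl_depth behaviour and the two return values described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual crawler.py: get_links is a hypothetical callable standing in for fetching and parsing a page, the function names are invented, and including the root url itself in the result is a guess.

```python
from collections import deque
from urllib.parse import urlparse


def root_name_of(root):
    """Strip scheme and path: "http://www.amazon.com" -> "www.amazon.com"."""
    return urlparse(root).netloc


def output_names(root):
    """Return (out_name, root_name) in the layout the README describes."""
    root_name = root_name_of(root)
    return "./urls_inputs/%s.txt" % root_name, root_name


def crawl(root, crawl_depth, get_links):
    """Breadth-first crawl that follows links at most crawl_depth levels deep.

    get_links(url) is a hypothetical callable returning the links on a page.
    """
    found = [root]              # discovery order: root, level 1, level 2, ...
    visited = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= crawl_depth:
            continue            # links on this page would be too deep
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                found.append(link)
                queue.append((link, depth + 1))
    return found
```

Passing get_links in as an argument keeps the depth logic testable without network access, for example by backing it with a small dictionary of fake pages.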
    
3) main.pyw 
    - our main GUI for the program
    - calls crawler.py for each site in a list (default: top_sites.txt) and generates one output file per site. Each output file is the list of urls for that site, which is the input analyzer.py needs
    
    - calls analyzer.py on each file generated by crawler.py to score each site
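
Leaving the GUI aside, the crawl-then-analyze flow described above might look roughly like the sketch below. The names run_pipeline, crawl_site and score_site are hypothetical placeholders for the calls into crawler.py and analyzer.py, and the one-url-per-line layout of the sites file is an assumption.

```python
def run_pipeline(sites_file, crawl_site, score_site, crawl_depth=1):
    """Crawl every site listed in sites_file, then score each crawl output.

    crawl_site(root, crawl_depth) is assumed to return (out_name, root_name),
    and score_site(out_name, root) is assumed to return a score.
    """
    with open(sites_file) as handle:
        # Assumes one root url per line; blank lines are skipped.
        roots = [line.strip() for line in handle if line.strip()]

    scores = {}
    for root in roots:
        # The crawler must run first, since its output file feeds the analyzer.
        out_name, root_name = crawl_site(root, crawl_depth)
        scores[root_name] = score_site(out_name, root)
    return scores
```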
    
4) top_sites.txt
    - a list of top sites that we generated from Compete and Alexa rankings
    - main.pyw runs our scripts on each site in this list
    
------------------
Modules:
------------------
urllib2
BeautifulSoup (bs4)
urlparse
json
PyQt4.QtCore
PyQt4.QtGui

------------------
PennKeys:
------------------
Andrew Staniforth: staan
Connie Wu: wuconnie
Fred Wang: shuchun
