Code Monkey home page Code Monkey logo

google-keywords's Introduction

Get Google Keywords

Preparation

  • Install python packages lxml, html5lib, nltk (Linux users need to apt-get python-dev, python-lxml as well.)

      $ pip install lxml html5lib nltk
    
  • From nltk.download() select and install corpus/stopwords

      $ python
      >>> import nltk
      >>> nltk.download()
    

Configure settings

vi settings.py
  • TARGET: Either 'news' or 'blog'
  • QUERYLIST: List of queries
  • HTMLPATH: Path to save html files. (Paths should end with a slash)
  • KEYWORDPATH = 'data/keywords/'
  • NCRAWLPAGES: Number of search pages to crawl from Google
  • DELIMS: Delimiters for parsing words in HTML page
  • TODAY: Date for analysis

Run

In order to get search results for data mining, run

python main.py data mining

or set QUERYLIST=['data', 'mining'] in settings.py, and run

python main.py

Results

If HTMLPATH='data/html/' and KEYWORDPATH='data/keywords/ in settings.py, the search results and keywords are stored in the 'data' folder as below.

data/
    ├── html/
    │   ├── data_mining/
    │   └── data_mining-20120907.json
    └── keywords/
        └── keywords-data_mining.json
  • data/html/data_mining/: This folder contains the raw HTML files. File names are marked with a timestamp.

  • data/html/data_mining-20120907.json: This file contains th url, desc(description), crawled_time, title extracted from the raw HTML files. Below is an example.

      [
        {
          "url": "http://smartdatacollective.com/timoelliott/101486/analytics-world-news-big-data-cool-3d-analytics", 
          "desc": "Themos Kalafatis has worked as a consultant for , Text Mining, Information Extraction and Data Quality for over a decade. More \u00bb ", 
          "crawled_time": "20120907_192648",
          "page_no": 1,
          "title": "Scary Big Data, Cool 3D Analytics and More"
        },
        ...
      ]
    
  • data/keywords/keywords-data_mining.json: This file contains the most frequent keywords. An example is shown below.

      ["data", 23],
      ["mining", 19],
      ["analytics", 3],
      ["app", 3],
      ["big", 2],
      ["mayo", 2],
      ["companies", 2],
      ["3d", 2],
      ["ehr", 2],
      ["datamining", 2],
      ["partner", 2],
      ["nlp", 1],
      ["desktops", 1],
      ["office", 1],
      ["advisory", 1]
      ...
    

Authors

2012 LG-SNU Smart TV Project Team (Created Sep. 2012)

google-keywords's People

Contributors

e9t avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.