
Automatic-Website-Categorizers

These programs automatically assign categories to websites based on signals such as page titles and descriptions.

Title Searcher

(Powered by Yandex.Translate: http://translate.yandex.com/ )

Install GoogleScraper:

pip3 install GoogleScraper

(GoogleScraper requires Python3.)

In your command prompt (or terminal), enter the following:

GoogleScraper -m http -s "bing" --keyword-file websites.txt > output.txt    

(You will need to have a file called websites.txt with a list of websites that you want to search.)
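For reference, websites.txt is a plain list of site addresses, one per line. The entries below are hypothetical examples, not sites from this project:

```text
example-news.com
example-blog.org
example-shop.net
```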
Then run titles_searcher.py on the output.txt file:

python2.7 titles_searcher.py output.txt output_from_extract.txt

(The first argument after 'titles_searcher.py' is the name of the input file to process, which is the output from the GoogleScraper run above. The second argument is the name of the output file to create. lists.py must be in the same directory as titles_searcher.py.)
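The matching step can be pictured as a keyword lookup against per-category word lists. The sketch below is a minimal illustration only, not the actual titles_searcher.py logic; the categorize_title function and the sample keyword lists are assumptions standing in for whatever lists.py really contains:

```python
# Minimal sketch of keyword-based title categorization.
# The keyword lists here are hypothetical; the real ones live in lists.py.
KEYWORD_LISTS = {
    "news": ["news", "headline", "breaking"],
    "shopping": ["shop", "store", "sale"],
}

def categorize_title(title, keyword_lists=KEYWORD_LISTS):
    """Return the first category whose keywords appear in the title, else 'unknown'."""
    lowered = title.lower()
    for category, keywords in keyword_lists.items():
        if any(word in lowered for word in keywords):
            return category
    return "unknown"

print(categorize_title("Breaking News: Markets Rally"))  # news
```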

Common Crawl Title Searcher

(Powered by Yandex.Translate: http://translate.yandex.com/ )

Open command prompt (or terminal) and enter the following:

python2.7 extract_data.py

(This will find the titles of webpages in Common Crawl.) Open a command prompt and enter the following:

python2.7 titles_searcher_common.py output.txt

(The argument after 'titles_searcher_common.py' should be the name of the output file that you want to create.)

Description Searcher

(Powered by Yandex.Translate: http://translate.yandex.com/ )

Download the prerequisites:

pip2 install BeautifulSoup4
pip2 install requests

Create a text file called websites.txt listing all of the websites that you want to search. If you use a different filename, update common.py so that it references the correct file. Then run description_searcher.py:

python2.7 description_searcher.py

This outputs one result per line in the form: website address, followed by its category.
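The heart of this step is reading each page's meta description. A minimal sketch of that extraction using BeautifulSoup4 (installed above); the extract_description helper is an illustration, not the actual description_searcher.py code:

```python
from bs4 import BeautifulSoup

def extract_description(html):
    """Return the content of the page's <meta name="description"> tag, if any."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    if tag and tag.get("content"):
        return tag["content"]
    return None

sample = '<html><head><meta name="description" content="Daily world news"></head></html>'
print(extract_description(sample))  # Daily world news
```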

Aggregator Identifier

Create a text file called websites.txt with all of the websites that you want to search, then run aggregators.py on it. You will need BeautifulSoup and GoogleScraper installed, as described above for the Title Searcher and Description Searcher programs. The program outputs two lists: the sites identified as aggregators, and the sites that were not.
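One common heuristic for spotting aggregators is counting how many distinct external domains a page links to, since aggregators typically link out far more than ordinary sites. The sketch below illustrates that general idea only and is not the actual aggregators.py logic; the function name and the threshold of 10 domains are assumptions:

```python
from bs4 import BeautifulSoup
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def looks_like_aggregator(html, threshold=10):
    """Guess whether a page aggregates content by counting distinct linked domains."""
    soup = BeautifulSoup(html, "html.parser")
    domains = set()
    for link in soup.find_all("a", href=True):
        netloc = urlparse(link["href"]).netloc
        if netloc:
            domains.add(netloc)
    return len(domains) >= threshold

# A page linking to 12 distinct domains trips the heuristic.
links = "".join('<a href="http://site%d.example/">x</a>' % i for i in range(12))
print(looks_like_aggregator("<html><body>%s</body></html>" % links))  # True
```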

Twitter Handle/Facebook Page Extractor

Create a text file called websites.txt with all of the websites that you want to search, then run twitterandfacebook.py on it. This program will output the Twitter handles and Facebook pages found for the searched websites. Some minor manual editing of the output may be required afterwards.
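The extraction amounts to scanning a page's anchors for twitter.com and facebook.com links. A minimal sketch under that assumption; the extract_social_links helper is illustrative, not the actual twitterandfacebook.py code:

```python
from bs4 import BeautifulSoup

def extract_social_links(html):
    """Collect twitter.com and facebook.com links found in a page's anchors."""
    soup = BeautifulSoup(html, "html.parser")
    twitter, facebook = [], []
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if "twitter.com/" in href:
            twitter.append(href)
        elif "facebook.com/" in href:
            facebook.append(href)
    return twitter, facebook

sample = ('<a href="https://twitter.com/example">t</a>'
          '<a href="https://facebook.com/example">f</a>')
print(extract_social_links(sample))
```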
