
ScrapeBotR

Orchestrate Instances and Retrieve Data from a ScrapeBot Database

ScrapeBotR (with "R") allows you to easily retrieve (large amounts of) data from a ScrapeBot installation. The package provides easy-to-use functions to read and export instances, recipes, runs, log information, and data. It thereby plugs neatly into the tidyverse, as it makes heavy use of tibbles.

The ScrapeBot (without "R") is a tool for so-called "agent-based testing" to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.

Installation

Install the most recent development version using devtools:

devtools::install_github('MarHai/ScrapeBotR')
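
If devtools itself is not installed yet, it is available from CRAN:

install.packages('devtools')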

Usage

Import the installed version ...

library(ScrapeBotR)

... and start using it by defining your ScrapeBot database. Credentials to access your database need to be stored in an INI file somewhere in your computer's home directory (i.e., under ~/, which usually translates to /home/my_user under *nix or C:\Users\my_user\Documents under Windows). You can either create this file by hand or use ScrapeBotR's helper function to create it:

write_scrapebot_credentials(
  host = 'database_host',
  user = 'database_username',
  password = 'database_password'
)

Alternatively, you can create the INI file manually. Ideally, the file is located directly within your home directory and named .scrapebot.ini (the leading . hides it in most file browsers). The INI file is essentially just a raw-text file with a so-called section name and some key-value pairs, written without spaces around the equals sign. Any unnecessary settings can be omitted (e.g., the port number). Here is what the INI file could look like:

[a name for me to remember]
host=localhost
port=3307
user=my_personal_user
password=abcd3.45d:cba!
database=scrapebot
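
If you prefer to script this manual route, a minimal base-R sketch (re-using the placeholder values from above) could look like this:

# Write the example INI file into the home directory.
ini_lines <- c(
  '[a name for me to remember]',
  'host=localhost',
  'port=3307',
  'user=my_personal_user',
  'password=abcd3.45d:cba!',
  'database=scrapebot'
)
writeLines(ini_lines, file.path(path.expand('~'), '.scrapebot.ini'))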

Once you have that out of the way, try connecting to your database, using the section name again (this is because you can have multiple sections referring to multiple ScrapeBot installations):

connection <- connect_scrapebot('a name for me to remember')

If this doesn't yield an error, you are good to go. And you could start, for example (a short end-to-end sketch follows this list), by ...

  • listing the available recipes through get_recipes()
  • listing the available instances through get_instances()
  • getting information about specific runs through get_runs()
  • collecting data via get_run_data()
  • bulk-downloading and compressing screenshots from S3 via collect_screenshots_from_s3()
  • ...
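
Here is a minimal end-to-end sketch of such a session. Note that the exact arguments and the uid column name are assumptions for illustration; consult each function's help page for the authoritative signatures.

library(ScrapeBotR)

# Connect using the INI section defined above.
connection <- connect_scrapebot('a name for me to remember')

# List what is available (both return tibbles).
recipes <- get_recipes(connection)
instances <- get_instances(connection)

# Fetch runs for one recipe ('uid' as the ID column is an assumption).
runs <- get_runs(connection, recipes$uid[1])

# Collect the data those runs produced.
run_data <- get_run_data(connection, runs$uid)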

Since version 0.5.0, you can also orchestrate servers on Amazon Web Services (AWS). For this, you first need an AWS account, to which any incurred costs will be charged. Next, generate an IAM user within your AWS account and create an API key for it. You also need an SSH key pair (in PEM format). Afterwards, analogous to the ScrapeBot database setup above, use the respective R functions to write your credentials into an INI file and to connect to your AWS account:

write_aws_credentials(
  access_key_id = 'aws_access_key', 
  secret_access_key = 'aws_access_secret', 
  ssh_private_pem_file = 'path_to_ssh_private_pem_file', 
  ssh_public_pem_file = 'path_to_ssh_public_pem_file'
)
aws_connection <- connect_aws()

If this does not yield an error either, you could, as sketched after this list, ...

  • start an AWS RDS instance as the ScrapeBot database through aws_launch_database()
  • launch an AWS S3 bucket to store screenshots through aws_launch_storage()
  • run an EC2 instance as a ScrapeBot instance through aws_launch_instance()
  • store the connection object for later using aws_save_connection()
  • restore (load) the connection object some days/weeks/months/studies later through aws_load_connection()
  • terminate all AWS instances through the respective aws_terminate_* functions
  • ...
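
The following hedged sketch shows what a full orchestration round trip could look like. Whether the aws_launch_* functions return an updated connection object, and the exact arguments of aws_save_connection() and aws_load_connection(), are assumptions here; check the per-function documentation.

aws_connection <- connect_aws()

# Spin up the backbone: database, screenshot storage, one ScrapeBot instance
# (assuming each call returns the updated connection object).
aws_connection <- aws_launch_database(aws_connection)
aws_connection <- aws_launch_storage(aws_connection)
aws_connection <- aws_launch_instance(aws_connection)

# Persist the connection object to revisit the setup later
# (the file name is a placeholder).
aws_save_connection(aws_connection, 'my_study.RData')

# ... days/weeks/months later ...
aws_connection <- aws_load_connection('my_study.RData')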

Detailed documentation is available for every function inside R.
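
For example, from the R console:

?connect_scrapebot
help(package = 'ScrapeBotR')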

Citation

Haim, Mario (2021). ScrapeBotR. An R package to orchestrate ScrapeBot for agent-based testing. Available at https://github.com/MarHai/ScrapeBotR/.
