ETF Data Scraper

This is a Python 3-based daily scraper that collects data on actively listed ETFs using the Alpha Vantage and Yahoo Finance APIs (the latter via the yfinance package).
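
As a rough illustration of the Yahoo Finance side of the collection, here is a minimal sketch of pulling one ETF's daily price history with yfinance; the ticker and period are placeholders for illustration, not the project's actual inputs:

import yfinance as yf

# Hypothetical example: fetch one day of price history for a single ETF
# (the ticker "SPY" and period "1d" are illustrative placeholders)
etf = yf.Ticker("SPY")
history = etf.history(period="1d")
print(history[["Open", "High", "Low", "Close", "Volume"]])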

The infrastructure of the scraper includes:

  • Amazon EventBridge: Triggers a Lambda function every weekday at 5:00 PM EST / 4:00 PM CST, after the market closes.
  • AWS Lambda: Starts an AWS Fargate task, which runs the containerized application code (a minimal sketch of such a handler follows this list).
  • AWS Fargate: Executes the application code to collect and process ETF data, then stores the data in an S3 bucket as either a Parquet file or a CSV file.
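
As a minimal sketch of what the Lambda handler in this chain could look like (the cluster, task definition, and subnet values below are placeholders, not the project's actual configuration):

import boto3

ecs_client = boto3.client("ecs")

def lambda_handler(event, context):
    # Launch the containerized scraper as a one-off Fargate task;
    # the cluster, task definition, and subnet are placeholder values
    response = ecs_client.run_task(
        cluster="etf-scraper-cluster",
        taskDefinition="etf-scraper-task",
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],
                "assignPublicIp": "ENABLED",
            }
        },
    )
    return {"tasks_started": len(response["tasks"])}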

For a detailed walkthrough of the project, check out the following blog post: ETF Data Scraping with AWS Lambda, AWS Fargate, and Alpha Vantage & Yahoo Finance APIs.

Project Setup

Fork and Clone the Repository

Fork the repository and clone the fork to your local machine:

# HTTPS
$ git clone https://github.com/YOUR_GITHUB_USERNAME/etf-data-scraper.git
# SSH
$ git clone git@github.com:YOUR_GITHUB_USERNAME/etf-data-scraper.git

Set Up with poetry

Install poetry using the official installer for your operating system and make sure poetry is added to your PATH; Poetry's Official Documentation covers the specific steps for each operating system.

There are three primary methods to set up and use poetry for this project:

Method 1: Using poetry

From inside the cloned repository, configure poetry to create the virtual environment inside the project's root directory; the --local flag applies this setting only to the current project:

$ cd path_to_cloned_repository
$ poetry config virtualenvs.in-project true --local
$ poetry install

Method 2: Using pyenv and poetry Together

With pyenv, ensure that the Python version required by the project (3.11 by default) is installed:

# List the available Python 3.10 through 3.12 versions
$ pyenv install --list | grep " 3\.\(10\|11\|12\)\."
# Install Python 3.11.8
$ pyenv install 3.11.8
# Activate Python 3.11.8 for the current project
$ pyenv local 3.11.8
# Use currently activated Python version to create the virtual environment
$ poetry config virtualenvs.prefer-active-python true --local
$ poetry install

Method 3: Using conda and poetry Together

  1. Create a new conda environment named etf_data_scraper with Python 3.11:
$ yes | conda create --name etf_data_scraper python=3.11
  2. Install the project dependencies (ensure that the conda environment is activated):
$ cd path_to_cloned_repository
$ conda activate etf_data_scraper
$ poetry install

Create Environment Variables

To test run the code locally, create a .env file in the root directory with the following environment variables:

API_KEY=your_alpha_vantage_api_key
S3_BUCKET=your_s3_bucket_name
IPO_DATE=threshold_for_etf_ipo_date
MAX_ETFS=maximum_number_of_etfs_to_scrape
PARQUET=True
# Set to 'dev' to run the scraper in dev mode, ensure that this is removed before uploading to S3
ENV=dev

Set ENV to dev in the .env file to run the scraper in dev mode when running the entrypoint main.py locally. Remove this variable from .env before uploading the file to S3 for production.
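
With a virtual environment set up and the .env file in place, the scraper can then be run locally as follows (assuming the entrypoint main.py sits at the project root):

$ poetry run python main.py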

Details on these environment variables can be found in the Modules subsection of the blog post.
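
As a rough sketch of how these variables can be consumed (assuming the python-dotenv package, which may differ from the project's actual approach):

import os

from dotenv import load_dotenv

# Load variables from .env into the process environment
load_dotenv()

api_key = os.environ["API_KEY"]
s3_bucket = os.environ["S3_BUCKET"]
# Environment variables are strings, so boolean flags like PARQUET must be parsed
use_parquet = os.getenv("PARQUET", "True").lower() == "true"
is_dev = os.getenv("ENV") == "dev"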

Workflow Secrets

The GitHub Actions workflows require the following repository secrets (they can be set from the command line, as shown after this list):

  • AWS_GITHUB_ACTIONS_ROLE_ARN: The ARN of the IAM role that GitHub Actions assumes to deploy to AWS.

  • AWS_REGION: The AWS region where the resources are deployed.

  • ECR_REPOSITORY: The name of the ECR repository where the Docker image is stored.

  • S3_BUCKET: The name of the S3 bucket where the ETF data is stored.

  • LAMBDA_FUNCTION: The name of the Lambda function that triggers the Fargate task.
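
These secrets can be added under the repository settings on GitHub or with the GitHub CLI; for example (all values below are placeholders):

$ gh secret set AWS_GITHUB_ACTIONS_ROLE_ARN --body "arn:aws:iam::123456789012:role/github-actions-role"
$ gh secret set AWS_REGION --body "us-east-1"
$ gh secret set ECR_REPOSITORY --body "etf-data-scraper"
$ gh secret set S3_BUCKET --body "your_s3_bucket_name"
$ gh secret set LAMBDA_FUNCTION --body "your_lambda_function_name"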

AWS CLI for Programmatic Deployment

To deploy the resources programmatically via the boto3 SDK or the command line instead of the AWS console, install the AWS CLI on the local machine and configure it with the necessary credentials, following the instructions in the AWS CLI Documentation.

A simple starting point, though it may violate the principle of least privilege, is to create an IAM user with programmatic access that can assume an IAM role with the AdministratorAccess policy attached.
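
For example, once the IAM user's access keys have been generated, the CLI can be configured and verified as follows (the profile name is illustrative):

$ aws configure --profile etf-scraper
AWS Access Key ID [None]: <your_access_key_id>
AWS Secret Access Key [None]: <your_secret_access_key>
Default region name [None]: us-east-1
Default output format [None]: json
# Verify that the credentials work
$ aws sts get-caller-identity --profile etf-scraper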
