file-downloader

PURPOSE + CAVEATS

provides a base script to quickly download the files linked from a webpage.
offers the following simple, interactive command-line features:
- lets the user choose whether to manually verify each file before it is saved/downloaded, while the script runs
- also offers the option of just "blindly" downloading all the links that the webpage contains
- lets the user quit in the middle of the webscrape
users must (currently) be at least familiar with html, file paths, and (hopefully minimally) python in order to customize the script to their particular website's html
currently only works for websites that store their files under the same domain as the main webpage (a TODO for me to change)
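
To make the three interactive features concrete, here is a minimal sketch of how such a prompt loop could look. The function name choose_links, the verify flag, and the y/n/q answers are assumptions for illustration and may not match what the script actually uses.

```python
def choose_links(links, verify, ask=input):
    """Return the links to download, optionally asking about each one.

    Hypothetical helper: `verify` and the y/n/q answers are assumptions.
    """
    chosen = []
    for link in links:
        if not verify:
            chosen.append(link)  # "blind" mode: keep every link
            continue
        answer = ask(f"download {link}? [y/n/q] ").strip().lower()
        if answer == "q":
            break  # quit mid-scrape, keeping what was chosen so far
        if answer == "y":
            chosen.append(link)
    return chosen
```

Passing the prompt function in as ask keeps the loop easy to try out without typing answers by hand.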

INSTRUCTIONS: SETTING UP THE SCRIPT

NOTE: users can refer to sample.py for an example of how to fill out the base script.
Certain variables/conditions might need to change or might be unneeded, depending on how the webpage is set up.
Before using, make sure to follow these instructions to customize the script for your purposes!

TODO: URL
- whichever URL you're on when you inspect the html should be the URL you assign to url
- in sample.py, you can see I have the main cs170 website URL assigned to url
- note: make sure that your URL ends with /, if your file URLs are nested under the main webpage
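
Why the trailing / matters can be seen in a small sketch. The example.edu address and the link are made up, and whether the script joins URLs with plain concatenation or with urljoin is an assumption; both go wrong without the slash, just in different ways.

```python
from urllib.parse import urljoin

link = "assets/disc/disc01.pdf"  # hypothetical relative link found on the page

# With the trailing slash, the link is nested under the main page.
with_slash = urljoin("https://example.edu/cs170/", link)
# -> https://example.edu/cs170/assets/disc/disc01.pdf

# Without it, urljoin replaces the last path segment ("cs170").
without_slash = urljoin("https://example.edu/cs170", link)
# -> https://example.edu/assets/disc/disc01.pdf

# Plain string concatenation glues the two pieces together instead.
concatenated = "https://example.edu/cs170" + link
# -> https://example.edu/cs170assets/disc/disc01.pdf

print(with_slash, without_slash, concatenated, sep="\n")
```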
TODO: EXTENSION
- for the cs170 website, the links to the files weren't actually stored in the html file for the main webpage
- instead, the links were written in a separate html file, which was nested under the main webpage's html
- hence, I included an extension variable, which is added onto the original URL so the script scrapes the right page
- if your links are located in the main html, you can leave the extension variable as is
- another note: in line 49, where we construct the URL our script will use to grab the files, it's important to observe where your files are actually stored. for cs170, although the link references were written in the calendar.html file, the files themselves were in an assets folder, which had to be accessed through the main url (without calendar.html). adjust this construction according to your situation!
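
As a rough sketch of the cs170 situation described above: the page that gets scraped and the files that get downloaded are built from the same url, but only the scraped page uses extension. The example.edu address, the file name, and the helper build_file_url are placeholders, not the script's actual code.

```python
url = "https://example.edu/cs170/"  # hypothetical main page, ends with /
extension = "calendar.html"         # page that actually lists the links

# The page the script scrapes for links: main url + extension.
page_to_scrape = url + extension

def build_file_url(link):
    """Build the download URL from the main url only (no calendar.html)."""
    return url + link

print(page_to_scrape)
print(build_file_url("assets/disc/disc01.pdf"))
```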
TODO: TAG
- while inspecting your DOM/page source, determine which html tag best indicates/narrows down where the code for the links is located
- if it's a regular HTML tag, you can just put the tag type (e.g. 'a' or 'p'). however, if it's a specific class or id you want to reference, assign it to the tag variable as follows: 'class_=results', or 'id=results'. you can google for more formats/information on how BeautifulSoup handles find() and find_all()
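
One way to picture how the two tag formats map onto BeautifulSoup is the sketch below. The helper tag_to_find_all_args is hypothetical and is not taken from the script; it only shows that a plain tag becomes a positional argument to find_all(), while 'class_=results' or 'id=results' becomes a keyword argument.

```python
def tag_to_find_all_args(tag):
    """Split a tag setting into (positional args, keyword args) for find_all().

    Hypothetical helper; the script may handle the tag variable differently.
    """
    if "=" in tag:
        key, value = tag.split("=", 1)
        return (), {key.strip(): value.strip()}
    return (tag,), {}

# With BeautifulSoup installed and a parsed page in `soup`, this could be used as:
#   args, kwargs = tag_to_find_all_args("class_=results")
#   matches = soup.find_all(*args, **kwargs)  # same as soup.find_all(class_="results")
print(tag_to_find_all_args("a"))
print(tag_to_find_all_args("class_=results"))
```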
TODO: KEYWORD
- for cs170, the links that were found included a lot of random links that didn't have to do with the files I was interested in.
- to take care of that, I added a keyword variable, which let me determine whether or not a retrieved link was related to the types of files I wanted to download
- if you have a similar situation, make sure the keyword you assign is the starting portion of the URL, so it filters out unrelated links
- additionally for cs170, the links given were relative paths (e.g. assets/disc/...) that had to be added to the end of the main url, vs. being full links themselves
- as each website may be implemented differently, this is something you should be aware of as well, and you might have to adjust how the URL is constructed and how links are pre-emptively checked for validity (if you want to check them)
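
The keyword filter can be sketched as a simple starts-with check. The hrefs list, the keyword value "assets/", and the helper keep_link are made up for illustration; the script's own check may differ.

```python
def keep_link(href, keyword):
    """True if the link starts with the keyword (missing hrefs are skipped)."""
    return bool(href) and href.startswith(keyword)

# Hypothetical hrefs as they might come back from the page, including noise.
hrefs = [
    "assets/disc/disc01.pdf",
    "#top",
    None,  # an <a> tag without an href
    "https://other.example/page",
    "assets/hw/hw01.pdf",
]
keyword = "assets/"  # assumed starting portion of the links of interest

wanted = [h for h in hrefs if keep_link(h, keyword)]
print(wanted)
```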
TODO: PATH
- the path variable tells the script (which uses wget) where to save the files
- to find the exact, full path you want your files to be saved to, if you're on mac, you can follow these steps:
  - STEP 1: if you're familiar with the terminal, you can navigate directly to the folder you want in your terminal and skip to STEP 6. I still recommend following STEP 6 and onwards, just to double-check that you have the correct full path. otherwise, if you're not that comfortable with the terminal, move ahead to STEP 2
  - STEP 2: navigate to the folder you want to save your files to in your Finder
  - STEP 3: hit command + spacebar to open up spotlight search
  - STEP 4: type terminal into the search, and open up a new terminal window.
  - STEP 5: go to your finder, and drag your folder onto the terminal app icon that's located in your dock (not onto the terminal window itself)
  - STEP 6: now that you have the correct path loaded in your terminal, the prompt should look something like this: MacBook:folder-name UserName$
  - STEP 7: now enter the following command: pwd|pbcopy. this copies the current path to your clipboard. alternatively, you can run pwd and manually copy the printed path with cmd+c
  - STEP 8: paste the path into the path variable in our file. make sure that there are no extraneous line breaks, which may occur if you chose to use the pbcopy shortcut.
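
If you want a quick sanity check on the pasted path, something like the sketch below could help. The helper clean_path and the example path are hypothetical; the idea is only to strip the trailing line break that pwd|pbcopy tends to leave behind and to confirm the folder exists before downloading into it.

```python
import os

def clean_path(raw):
    """Strip stray whitespace/line breaks and expand a leading ~."""
    return os.path.expanduser(raw.strip())

# Hypothetical pasted value, including the line break copied by pbcopy.
path = clean_path("/Users/UserName/folder-name\n")

if not os.path.isdir(path):
    print(f"warning: {path} is not an existing folder")
```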

INSTRUCTIONS: RUNNING THE SCRIPT

To run the script and use the interactive features I built into it, navigate in your terminal to the folder where you've saved the flexible-downloads.py file.
Enter the following command: python3 flexible-downloads.py, and follow the user prompts!

danielhogg / file-downloader
