- provides a base script to easily/quickly download the files linked from a webpage.
- offers the following simple, interactive command-line features:
  - lets the user choose whether or not to manually verify which files to save/download while the script runs
  - also offers the option of just "blindly" downloading all the links that the webpage contains
  - lets the user quit in the middle of the webscrape
- users must (currently) be at least familiar with html, file paths, and (hopefully minimally) python in order to customize the script to their particular website's html
- currently only works for websites that have their files stored under the same domain as the main webpage (a TODO for me to change)
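
As a rough sketch of how the three interactive features above could fit together (this is not the script's actual code; `choose_links`, its parameters, and the `y/n/q` prompt are hypothetical names made up for illustration):

```python
def choose_links(links, verify=True, ask=input):
    """Return the links the user wants to download.

    verify=False keeps every link ("blind" mode). In verify mode the user
    answers y / n / q for each link, and q stops the scrape early.
    `ask` defaults to input() and can be swapped out for testing.
    """
    if not verify:
        return list(links)
    chosen = []
    for link in links:
        answer = ask(f"Download {link}? [y/n/q] ").strip().lower()
        if answer == "q":
            break  # quit mid-scrape, keeping what was already accepted
        if answer == "y":
            chosen.append(link)
    return chosen


# Scripted answers stand in for a person typing at the prompt.
answers = iter(["y", "n", "q"])
picked = choose_links(["a.pdf", "b.pdf", "c.pdf", "d.pdf"],
                      ask=lambda prompt: next(answers))
print(picked)  # ['a.pdf']
```

Passing `verify=False` skips the prompts entirely and returns every link.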
NOTE: users can refer to `sample.py` for an example of how to fill out the base script.
Certain variables/conditions might change or are unneeded, depending on how the webpage is set up.
Before using, make sure to follow these instructions to customize the script for your purposes!
- TODO: URL
  - whichever URL you're on to inspect the html should be the URL you assign to `url`
  - in `sample.py`, you can see I have the main cs170 website URL assigned to `url`
  - note: make sure that your URL ends with `/` if your file URL is nested under the main webpage
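
To see why the trailing `/` matters, here's a small sketch. The `example.edu` URLs are made up, and `urljoin` is used only to illustrate the point; the base script itself may build its URLs differently (for example with plain string concatenation).

```python
from urllib.parse import urljoin

with_slash = "https://example.edu/cs170/"     # hypothetical main page
without_slash = "https://example.edu/cs170"   # same page, no trailing slash

# With the slash, the nested file stays under cs170/.
print(urljoin(with_slash, "calendar.html"))
# https://example.edu/cs170/calendar.html

# Without it, the last path segment gets replaced instead.
print(urljoin(without_slash, "calendar.html"))
# https://example.edu/calendar.html

# Plain concatenation without the slash glues the names together.
print(without_slash + "calendar.html")
# https://example.edu/cs170calendar.html
```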
- TODO: EXTENSION
  - for the cs170 website, the links to the files weren't actually stored in the html file for the main webpage
  - instead, the links were written in a separate html file, which was nested under the main webpage's html
  - hence, I included an `extension` variable, to add onto the original URL to appropriately scrape the links
  - if your links are located in the main html, you can leave the `extension` variable as is
  - another note: in line 49, where we're constructing the URL our script will use to grab the files, it's important to observe where your file links are stored. for cs170, although the link references were written in the `calendar.html` file, the links themselves were in an `assets` folder, which had to be accessed through the main `url` (doesn't involve `calendar.html`). adjust this construction according to your situation!
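
A minimal sketch of the two different URLs described above, assuming the script builds them by simple concatenation. The domain, the `disc01.pdf` file name, and the `page_url` / `file_url` / `href` names are all hypothetical:

```python
url = "https://example.edu/cs170/"   # hypothetical main webpage (note the trailing /)
extension = "calendar.html"          # nested page that actually lists the links
href = "assets/disc/disc01.pdf"      # hypothetical link found on that page

# The page that gets fetched and parsed for links.
page_url = url + extension

# The file that gets downloaded: built from url, NOT from page_url,
# because the assets folder sits next to calendar.html, not inside it.
file_url = url + href

print(page_url)  # https://example.edu/cs170/calendar.html
print(file_url)  # https://example.edu/cs170/assets/disc/disc01.pdf
```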
- TODO: TAG
  - while inspecting your DOM/page source, determine which html tag best indicates/narrows down where the code for the links is located
  - if it's a regular HTML tag, you can just put the tag type (e.g. `'a'` or `'p'`). however, if it's a specific class or id you want to reference, assign it to the `tag` variable as the following: `'class_=results'`, or `'id=results'`. you can google for more formats/information on how BeautifulSoup handles `find()` and `find_all()`
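
How the base script turns the `tag` string into a BeautifulSoup call isn't shown here, so the following is only one possible way to do it. `tag_to_find_args` and `find_matches` are hypothetical helpers; `find_all()` itself does accept a tag name as well as `class_=` and `id=` keyword filters.

```python
def tag_to_find_args(tag):
    """Split a tag setting into (name, keyword arguments) for find_all()."""
    if "=" in tag:
        key, value = tag.split("=", 1)
        return None, {key.strip(): value.strip()}
    return tag, {}


def find_matches(html, tag):
    """Return every element in html that matches the tag setting."""
    # Imported here so the helper above works without bs4 installed.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    name, kwargs = tag_to_find_args(tag)
    return soup.find_all(name, **kwargs)


print(tag_to_find_args("a"))               # ('a', {})
print(tag_to_find_args("class_=results"))  # (None, {'class_': 'results'})
print(tag_to_find_args("id=results"))      # (None, {'id': 'results'})
```

With `bs4` installed, `find_matches(html, "class_=results")` would then be equivalent to calling `soup.find_all(class_="results")`.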
- TODO: KEYWORD
  - for cs170, the links that were found included a lot of random links that didn't have to do with the files I was interested in.
  - to take care of that, I had a `keyword` variable that allowed me to determine whether or not the link retrieved was related to the types of files I wanted to download
  - for you, if you have a similar situation, make sure the `keyword` you assign is the starting portion of the URL that would help filter out unrelated links
  - additionally for cs170, the links given were relative paths (e.g. `assets/disc/...`) that would be added to the end of the main `url`, vs. being full links themselves
  - as each website may be implemented differently, this is something you should be aware of as well, and you might have to adjust how the URL is constructed and (if you want to) pre-emptively checked for validity
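
The keyword filtering described above could look roughly like this. It's a sketch, not the script's real code: `build_file_urls`, the sample links, and the `example.edu` domain are made up, and it assumes the links are relative paths that get appended to `url`.

```python
def build_file_urls(url, hrefs, keyword):
    """Keep only hrefs that start with keyword, and append them to url."""
    return [url + href for href in hrefs
            if href and href.startswith(keyword)]


hrefs = [
    "assets/disc/disc01.pdf",   # wanted
    "https://piazza.com/",      # unrelated full link
    "assets/hw/hw01.pdf",       # wanted
    None,                       # an <a> tag with no href at all
]
for file_url in build_file_urls("https://example.edu/cs170/", hrefs, "assets/"):
    print(file_url)
# https://example.edu/cs170/assets/disc/disc01.pdf
# https://example.edu/cs170/assets/hw/hw01.pdf
```

If your site uses full links instead of relative ones, the `url + href` part is what you'd change.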
- TODO: PATH
  - the `path` variable is so the script (which uses `wget`) knows where to save the files
  - to find the exact, full path you want your files to be saved to, if you're on mac, you can follow these steps:
    - STEP 1: if you're familiar with the terminal, you can directly navigate to the folder you want in your terminal and skip to STEP 6. I recommend still following STEP 6 and onwards, just to double-check that you have the correct full path. otherwise, if you're not that comfortable with the terminal, you can move ahead to STEP 2
    - STEP 2: navigate to the folder you want to save your files to in your `Finder`
    - STEP 3: hit `command + spacebar` to open up spotlight search
    - STEP 4: type `terminal` into the search, and open up a new terminal window.
    - STEP 5: go to your finder, and drag your folder onto the `terminal` app icon that's located in your dock! not the terminal window itself
    - STEP 6: now that you have the correct path loaded in your terminal, the prompt should look something like this: `Macbook:folder-name UserName$`
    - STEP 7: now enter the following command: `pwd|pbcopy`. what this does is copy the current path into your clipboard. you can alternatively run `pwd`, and manually copy the path yourself with `cmd+c`
    - STEP 8: paste the path into the `path` variable in our file. make sure that there are no extraneous line breaks, which may occur if you chose to use the `pbcopy` shortcut.
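
If you want to guard against the stray line break mentioned in STEP 8, a small check like the one below could be run on the pasted value. `clean_path` is a hypothetical helper, not something the base script is known to include.

```python
import os


def clean_path(raw):
    """Strip stray whitespace/newlines and check that the folder exists."""
    path = os.path.expanduser(raw.strip())
    if not os.path.isdir(path):
        raise ValueError(f"not an existing folder: {path!r}")
    return path


# Simulate a path pasted from `pwd|pbcopy`, trailing newline included.
pasted = os.getcwd() + "\n"
print(clean_path(pasted) == os.getcwd())  # True
```

Failing early like this is easier to debug than finding out mid-download that the files have nowhere to go.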
To run the script and use the interactive features I built into it, just navigate in your terminal to the folder that you've saved the `flexible-downloads.py` file in.
Enter the following command: `python3 flexible-downloads.py`, and follow the user prompts!