file-downloader

PURPOSE + CAVEATS

  • provides a base script for quickly and easily downloading linked files from a webpage.
  • offers the following simple, interactive command-line features:
    • lets the user choose whether to manually verify each file before it is saved/downloaded
    • alternatively, offers the option of just "blindly" downloading every link that the webpage contains
    • lets the user quit in the middle of the webscrape
  • users must (currently) be at least familiar with html, file paths, and (hopefully minimally) python in order to customize the script for their particular website's html
  • currently only works for websites that store their files under the same domain as the main webpage (a TODO for me to change)
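
A minimal sketch of how the three interactive features above could fit together. This is not the code from the base script: the function name choose_downloads, the verify flag, and the y/n/q answers are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def choose_downloads(links: Iterable[str], verify: bool,
                     ask: Callable[[str], str] = input) -> List[str]:
    """Return the links the user wants to download (hypothetical helper).

    verify=False: accept every link without asking ("blind" mode).
    verify=True: ask about each link; 'y' keeps it, 'q' stops early,
    and any other answer skips it.
    """
    chosen = []
    for link in links:
        if not verify:
            chosen.append(link)
            continue
        answer = ask(f"Download {link}? [y/n/q] ").strip().lower()
        if answer == "q":
            break  # quit mid-scrape, keeping what was already chosen
        if answer == "y":
            chosen.append(link)
    return chosen
```

Passing the prompt function in as ask (defaulting to input) is just a convenience so the loop can be tried out without typing answers by hand.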

INSTRUCTIONS: SETTING UP THE SCRIPT

NOTE: users can refer to sample.py for an example of how to fill out the base script.
Certain variables/conditions might need to change, or might not be needed at all, depending on how the webpage is set up.
Before using, make sure to follow these instructions to customize the script for your purposes!

  • TODO: URL
    • assign to url whichever URL you are on when you inspect the html
    • in sample.py, you can see I have the main cs170 website URL assigned to url
    • note: make sure that your URL ends with /, if your file URL is nested under the main webpage
  • TODO: EXTENSION
    • for the cs170 website, the links to the files weren't actually stored in the html file for the main webpage
    • instead, the links were written in a separate html file, which was nested under the main webpage's html
    • hence, I included an extension variable, which is added onto the original URL so the script scrapes the right page for links
    • if your links are located in the main html, you can leave the extension variable as is
    • another note: in line 49, where we're constructing the URL our script will use to grab the files, it's important to observe where your file links are stored. for cs170, although the link references were written in the calendar.html file, the files themselves were in an assets folder, which had to be accessed through the main url (without calendar.html). adjust this construction according to your situation!
  • TODO: TAG
    • while inspecting your DOM/page source, determine which html tag best indicates/narrows down where the code for the links is located
    • if it's a regular HTML tag, you can just put the tag type (e.g. 'a' or 'p'). however, if it's a specific class or id you want to reference, assign it to the tag variable as follows: 'class_=results', or 'id=results'. you can google for more formats/information on how BeautifulSoup handles find() and find_all()
  • TODO: KEYWORD
    • for cs170, the links that were found included a lot of unrelated links that had nothing to do with the files I was interested in.
    • to take care of that, I added a keyword variable, which lets the script determine whether a retrieved link is related to the types of files I wanted to download
    • if you have a similar situation, make sure the keyword you assign is the starting portion of the link, so that it filters out unrelated links
    • additionally, for cs170, the links given were relative paths (e.g. assets/disc/...) that had to be added to the end of the main url, rather than being full links themselves
    • as each website may be implemented differently, this is something you should be aware of as well, and you might have to adjust how the URL is constructed and (if you want to) how it is pre-emptively checked for validity
  • TODO: PATH
    • the path variable is so the script (which uses wget) knows where to save the files
    • to find the exact, full path you want your files to be saved to, if you're on mac, you can follow these steps:
      • STEP 1: if you're familiar with the terminal, you can directly navigate to the folder you want in your terminal and skip to STEP 6. I still recommend following STEP 6 and onwards, just to double-check that you have the correct full path. otherwise, if you're not that comfortable with the terminal, you can move ahead to STEP 2
      • STEP 2: navigate to the folder you want to save your files to in your Finder
      • STEP 3: hit command + spacebar to open up spotlight search
      • STEP 4: type in terminal into the search, and open up a new terminal window.
      • STEP 5: go to your Finder, and drag your folder onto the terminal app icon that's located in your dock (not onto the terminal window itself)!
      • STEP 6: now that you have the correct path loaded in your terminal, the prompt should look something like this: MacBook:folder-name UserName$
      • STEP 7: now enter the following command: pwd | pbcopy. this copies the current path into your clipboard. alternatively, you can enter pwd and manually copy the printed path yourself with cmd+c
      • STEP 8: paste the path into the path variable in our file. make sure that there are no extraneous line breaks, which may occur if you chose to use the pbcopy shortcut.
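
To make the url / tag / keyword / path settings above more concrete, here is a rough sketch of how they could interact. It is not the base script itself: the helper names (LinkCollector, file_urls, clean_path) and the example.edu URLs are made up for illustration. It also swaps in Python's built-in html.parser in place of BeautifulSoup, so that it runs without installing anything, and it only handles plain tag names (not the class/id forms).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every element with the given tag name."""

    def __init__(self, tag="a"):
        super().__init__()
        self.tag = tag
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            href = dict(attrs).get("href")
            if href:  # ignore tags that have no href attribute
                self.hrefs.append(href)


def file_urls(html, url, keyword, tag="a"):
    """Keep only links that start with keyword, then join them onto url."""
    collector = LinkCollector(tag)
    collector.feed(html)
    return [urljoin(url, href) for href in collector.hrefs
            if href.startswith(keyword)]


def clean_path(raw):
    """Drop stray whitespace/line breaks, e.g. from a `pwd | pbcopy` paste."""
    return raw.strip()


page = ('<a href="assets/disc/d1.pdf">d1</a>'
        '<a href="https://other.example/x">unrelated</a>')
print(file_urls(page, "https://example.edu/cs170/", "assets/"))
```

Note how the trailing / on url matters here: urljoin treats "https://example.edu/cs170" (no trailing slash) as a file rather than a folder, so the relative link would end up under https://example.edu/ instead of under cs170/.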

INSTRUCTIONS: RUNNING THE SCRIPT

To run the script and use the interactive features I built into it, just navigate in your terminal to the folder where you've saved the flexible-downloads.py file.
Then enter the following command: python3 flexible-downloads.py, and follow the user prompts!
