ed-cooper / lecture-hoarder

Automated tool to download University of Manchester lecture podcasts

License: GNU General Public License v3.0

Python 100.00%

uom university-of-manchester

lecture-hoarder's Issues

Check file access permissions

Currently we assume we can read from folders, create new files, etc.

If this is not the case, an exception occurs with a traceback displayed to the user:

Traceback (most recent call last):
  File "run.py", line 199, in <module>
    os.makedirs(course_dir, exist_ok=True)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/media/edward/Mass Storage'

Although the error clearly identifies the problem, we should run proper checks up front so that the message shown to the user is friendlier.
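
A minimal sketch of such a check, assuming the output directory is created with os.makedirs as in the traceback above. The function name check_output_dir and the message wording are hypothetical, not taken from the project.

```python
import os


def check_output_dir(path):
    """Return None if `path` exists (or can be created) and is writable,
    otherwise a short human-readable message instead of a traceback."""
    try:
        os.makedirs(path, exist_ok=True)
    except PermissionError:
        return f"Cannot create '{path}': permission denied."
    except OSError as err:
        # Covers other failures, e.g. a parent that is a regular file.
        return f"Cannot create '{path}': {err.strerror}."
    # The directory may already exist but still be read-only for us.
    if not os.access(path, os.W_OK | os.X_OK):
        return f"Cannot write to '{path}': permission denied."
    return None
```

The caller could print the returned message and exit early, before any downloads are queued.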

Check for duplicate but out of order podcasts

Occasionally, lecturers may add podcasts that came before ones already available, causing the order of podcasts to change.

Additionally, podcasts may be deleted.

Currently, detection of duplicates requires an exact name match, meaning that in the above situations we download all the subsequent podcasts again, causing multiple podcasts with the same number to appear.

Add use at own risk warning

This project uses an unstable interface with the University's servers, which hasn't been formally approved.

I therefore feel there should be a disclaimer in the README that is also displayed each time the project is run.

Add contributing guidelines

Filter podcast names

Currently, we filter the names of courses to remove illegal characters - e.g. COMP10120 - First Year Team Project 2018/19 becomes COMP10120 - First Year Team Project 201819

The same also needs to happen for podcast names (which come from podcast_li.a.string)
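
A rough sketch of a shared filter that could be applied to both course and podcast names. The exact set of illegal characters is an assumption (the characters Windows rejects in file names), and filter_name is a hypothetical name.

```python
import re

# Assumed set of illegal characters: those rejected by Windows file systems.
# "/" is also illegal in file names on POSIX systems.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def filter_name(name):
    """Remove characters that cannot appear in a file name."""
    return _ILLEGAL_CHARS.sub("", name).strip()
```

With this, filter_name("COMP10120 - First Year Team Project 2018/19") gives "COMP10120 - First Year Team Project 201819", matching the course example above.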

Validate every usage of BeautifulSoup in UomPodcastProvider

Every time the .find method or equivalent is used, we should validate that the HTML element was actually found and raise a PodcastProviderError otherwise.

Currently these errors are mostly not handled and will result in confusing random exceptions.
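
One possible shape for this, sketched as a small helper. BeautifulSoup's find returns None when nothing matches, so the helper only has to check for None. The name require is hypothetical, and the exception class here is a stand-in for the project's own PodcastProviderError.

```python
class PodcastProviderError(Exception):
    """Stand-in for the project's exception of the same name."""


def require(element, description):
    """Return `element` unchanged, or raise if a find-style lookup gave None."""
    if element is None:
        raise PodcastProviderError(f"Could not find {description} in the page")
    return element
```

A lookup such as podcast_li.a.string could then become require(podcast_li.a, "podcast link").string, so a missing element produces a clear error rather than an AttributeError on None.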

Add course filtering

Allow users to choose which modules/courses they want to download.

Probably should be in the config, but maybe also a CLI argument to override it.
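
A sketch of how the override could work, assuming courses are selected by name prefix. The --courses flag, the config_courses parameter and the prefix matching are all assumptions for illustration.

```python
import argparse


def select_courses(available, config_courses=None, argv=None):
    """Filter `available` course names by prefix.

    A --courses argument on the command line overrides the config value.
    With no filter from either source, every course is kept.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--courses", nargs="+", default=None)
    args = parser.parse_args(argv)  # argv=None means sys.argv[1:]

    wanted = args.courses if args.courses is not None else config_courses
    if not wanted:
        return list(available)
    return [name for name in available
            if any(name.startswith(prefix) for prefix in wanted)]
```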

Recommend setup by venv

Packages have become outdated over time.

Update README to recommend venvs for setup, so that older packages can be isolated from the main python install.

Realtime progress updates

It is possible to adjust the code to download the files in chunks:

Source: https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests

The current code also calculates the total download size from the Content-Length headers.

If we can find a non-blocking way to wait for tasks to be completed, it is therefore possible to create a live progress update for downloads.

(Feedback can be reported via the queue array)
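
A sketch of the chunked write with a progress callback. With requests, the chunks could come from response.iter_content(chunk_size=...) on a request made with stream=True. The function name and the report callback are hypothetical, and how the callback feeds the queue is left open.

```python
def download_with_progress(chunks, total_size, out_file, report):
    """Write each chunk to `out_file`, calling report(fraction) after each.

    `chunks` is any iterable of bytes objects and `total_size` is the
    expected size in bytes (for example from Content-Length).
    Returns the number of bytes written.
    """
    written = 0
    for chunk in chunks:
        out_file.write(chunk)
        written += len(chunk)
        if total_size:
            # Clamp in case the server sends more than it advertised.
            report(min(written / total_size, 1.0))
    return written
```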

Broken on Windows

Abstract web requests

All web request logic is currently handled by __main__.py.

This clutters the file, making it hard to understand and maintain.

A new interface should be created for handling web requests.

In addition, it should be generic, so that dependency injection can be used to assist #3.

Change get_podcast_downloader return type

This return type is the only thing preventing the entire PodcastProvider interface being entirely independent of the web.

The return type will need to support asynchronous downloading and contain the total download size (as given by int(http_download_response.headers['Content-Length'])).

This will probably require the creation of a new class for the return type.
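
One possible shape for that class, as a sketch only. The names PodcastDownload and open_chunks are hypothetical. The idea is that the provider hands back the size plus a callable that lazily yields the data, so the caller never touches an HTTP response object.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class PodcastDownload:
    """Web-independent description of a download (hypothetical)."""

    # Total size in bytes, e.g. taken from the Content-Length header.
    total_size: int
    # Called by the downloader when it is ready; yields the file in chunks,
    # so the transfer can run on a worker thread.
    open_chunks: Callable[[], Iterator[bytes]]
```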

Video page format change

The DOM of the site has changed. Videos no longer download.

Download automatic subtitles

Podcasts now have subtitle files with automatic captions available for download.

Supporting this would be useful.

Add automated testing

Will allow us to track regressions.

Only download podcasts from the current year

Currently we download all available podcasts, but typically users only want podcasts from the current year.

We now extract the course series and use it for categorisation (see #19), which can be used to bootstrap the implementation of this.

This should be supported by a setting to allow all podcasts to be downloaded.

Abstract into model

Currently we have an undocumented dictionary format for podcast downloads, containing the following properties:

name
podcast_link
download_path
status
error
progress
total_size
completion_time

For future development, we should develop a dedicated class for podcasts, as well as making status an enum type.

We should also think about breaking up functionality - ideally run.py should only care about general data flow through the program, rather than implementation details such as output formatting, extraction of page data, etc.
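
A sketch of what the dedicated class might look like, keeping the property names of the existing dictionary. Only the "error" status value appears in the current code. The other enum members, the defaults and the field types are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PodcastStatus(Enum):
    # Only "error" is known from the existing code; the rest are assumed.
    WAITING = "waiting"
    DOWNLOADING = "downloading"
    SUCCESS = "success"
    ERROR = "error"


@dataclass
class Podcast:
    name: str
    podcast_link: str
    download_path: str
    status: PodcastStatus = PodcastStatus.WAITING
    error: Optional[str] = None
    progress: int = 0
    total_size: Optional[int] = None
    completion_time: Optional[float] = None
```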

Runtime login

It is quite a risk having raw passwords on disk. On each run the program should ask you for your username and password.
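
A minimal sketch using the standard library's getpass, which reads the password without echoing it. The function name and the injectable ask/ask_secret parameters are hypothetical, added so the prompt can be exercised without a terminal.

```python
import getpass


def prompt_credentials(ask=input, ask_secret=getpass.getpass):
    """Ask for the username and password at runtime; nothing is stored."""
    username = ask("Username: ").strip()
    password = ask_secret("Password: ")
    return username, password
```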

Deprecate login_service_url and video_service_base_url settings

Now that web handling has been abstracted, putting properties specific to the UomPodcastProvider into the Profile doesn't seem reasonable.

Additionally, the initial reason for them to be in the settings file (see #3) is no longer the case.

Instead, they should be moved to attributes in the UomPodcastProvider class, where they can still be changed by an overriding class, if necessary.

Login broken by switch to Duo 2FA

YAML Config

Python is not an appropriate config format; YAML (or another format) should be used instead.

The config file should also be generated automatically. It is also advisable to place the config inside the user's home directory and to restrict read permission to that user only.
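
A sketch of the permissions part only, leaving the YAML serialisation aside. The function name is hypothetical. Note that on Windows os.chmod only controls the read-only flag, so the restriction is effective on POSIX systems only.

```python
import os


def write_private_file(path, text):
    """Write `text` to `path`, readable and writable by the owner only."""
    # Create the file with mode 0o600 from the start, so there is no window
    # in which other users could read it.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    # Also tighten the mode of a file that already existed.
    os.chmod(path, 0o600)
```

The path could be built with os.path.expanduser("~") joined with whatever config file name is chosen.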

Categorise lectures into years

The year for each lecture is given by the first numeric character in the course name (e.g. the 1 in COMP10120).

Having a migration handler would also be useful
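
The rule above can be sketched as follows. The function name course_year and the None return for names without digits are assumptions.

```python
def course_year(course_name):
    """Return the first numeric character of the name as an int, or None."""
    for char in course_name:
        if char in "0123456789":
            return int(char)
    return None
```

For example, course_year("COMP10120 - First Year Team Project 2018/19") returns 1.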

Clipping for long podcast names

Long podcast names cause a single download to spread over multiple lines.

This leads to corruption when it comes to trying to overwrite the download status.

We should use the known terminal width to clip podcast names to a suitable length, and add an ellipsis to show that some text is hidden.
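
A sketch of the clipping, using shutil.get_terminal_size for the width. The function name is hypothetical. In practice the available width would be the terminal width minus whatever the status text on the same line needs.

```python
import shutil


def clip_name(name, width=None):
    """Clip `name` to at most `width` characters, ending in an ellipsis."""
    if width is None:
        width = shutil.get_terminal_size().columns
    if len(name) <= width:
        return name
    if width < 1:
        return ""
    # Reserve one column for the ellipsis character.
    return name[: width - 1] + "…"
```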

    # Check status code valid
    if True:  # get_video_service_podcast_page.status_code != 200:
        podcast["completion_time"] = time.time()
        podcast["error"] = "Could not get podcast webpage for " + podcast["name"] + \
                           " - Service responded with status code" + get_video_service_podcast_page.status_code
        podcast["status"] = "error"
        return

All the real errors I have experienced so far have resulted in exceptions occurring, so I'm not too concerned about fixing this immediately.

In addition, any errors that occur can almost always be remedied by running the program again.
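
One reading of the snippet above is that the message is built with + between strings and status_code. If that attribute is an integer, as it is on requests responses, the concatenation raises a TypeError instead of recording the error. A sketch of a message builder that avoids this (the function name is hypothetical):

```python
def status_error_message(podcast_name, status_code):
    """Build the error text without string + int concatenation."""
    return (
        f"Could not get podcast webpage for {podcast_name}"
        f" - Service responded with status code {status_code}"
    )
```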