ed-cooper / lecture-hoarder
Automated tool to download University of Manchester lecture podcasts
License: GNU General Public License v3.0
Currently we assume we can read from folders, create new files, etc.
If this is not the case, an exception occurs with a traceback displayed to the user:
Traceback (most recent call last):
File "run.py", line 199, in <module>
os.makedirs(course_dir, exist_ok=True)
File "/usr/lib/python3.7/os.py", line 211, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib/python3.7/os.py", line 211, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib/python3.7/os.py", line 221, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/media/edward/Mass Storage'
Although the problem is clearly identified in the error, we should aim to make the message more user-friendly through proper checks.
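One way to turn the traceback above into a readable message is to wrap the directory creation and report failures as plain text. This is a minimal sketch, not the project's code: the helper name `ensure_writable_dir` and the wording of the messages are assumptions.

```python
import os


def ensure_writable_dir(path):
    """Create `path` if needed.

    Returns None on success, or a short human-readable error message.
    (Hypothetical helper; not part of the existing codebase.)
    """
    try:
        os.makedirs(path, exist_ok=True)
    except PermissionError:
        return f"Cannot create '{path}': permission denied."
    except OSError as exc:
        # Covers other failures, e.g. a parent that is a regular file.
        return f"Cannot create '{path}': {exc.strerror}."
    # The directory may already exist but still be read-only for this user.
    if not os.access(path, os.W_OK):
        return f"'{path}' exists but is not writable."
    return None
```

The caller could print the returned message and exit with a non-zero status instead of letting the exception propagate.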
Occasionally, lecturers may add podcasts that came before ones already available, causing the order of podcasts to change.
Additionally, podcasts may be deleted.
Currently, detection of duplicates requires an exact name match, meaning that in the above situations we download all the subsequent podcasts again, causing multiple podcasts with the same number to appear.
This project uses an unstable interface to the University's servers, which hasn't been formally approved.
I therefore feel there should be a disclaimer in the README, which is also displayed each time the project is run.
Currently, we filter the names of courses to remove illegal characters - e.g. COMP10120 - First Year Team Project 2018/19
becomes COMP10120 - First Year Team Project 201819
The same also needs to happen for podcast names (which come from podcast_li.a.string).
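A small shared helper could apply the same filtering to both course and podcast names. The sketch below is an assumption about what counts as illegal (the characters Windows rejects, plus control characters); the project's actual filter may use a different set, and `sanitise_name` is a hypothetical name.

```python
import re

# Assumed set of illegal characters: < > : " / \ | ? * and control characters.
ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitise_name(name):
    """Remove characters that are unsafe in file and directory names."""
    return ILLEGAL_CHARS.sub("", name).strip()


print(sanitise_name("COMP10120 - First Year Team Project 2018/19"))
# prints: COMP10120 - First Year Team Project 201819
```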
Every time the .find method or equivalent is used, we should validate that the HTML item was actually found and raise a PodcastProviderError otherwise.
Currently these errors are mostly not handled and will result in confusing random exceptions.
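The check could be centralised in one small function so that each lookup stays a single line. This is a sketch under assumptions: `require` is a hypothetical helper, the exception class here is a stand-in for the project's own PodcastProviderError, and it relies on the lookup returning None when nothing is found (as BeautifulSoup's `find` does).

```python
class PodcastProviderError(Exception):
    """Stand-in for the project's exception of the same name."""


def require(element, description):
    """Return `element`, or raise a clear error if the lookup found nothing."""
    if element is None:
        raise PodcastProviderError(f"Could not find {description} in the page")
    return element


# Hypothetical usage with a parsed list item:
#   link = require(podcast_li.find("a"), "podcast link")
```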
Allow users to choose which modules/courses they want to download.
Probably should be in the config, but maybe also a CLI argument to override it.
Packages have become outdated over time.
Update README to recommend venvs for setup, so that older packages can be isolated from the main python install.
It is possible to adjust the code to download the files in chunks:
Source: https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests
The current code also calculates the total download size from the content-length headers
If we can find a non-blocking way to wait for tasks to be completed, it is therefore possible to create a live progress update for downloads.
(Feedback can be reported via the queue array)
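The chunked write and the progress reporting can be sketched independently of the HTTP library. In the sketch below, `save_in_chunks` and its `on_progress` callback are hypothetical names; with the requests library, the chunks would typically come from `response.iter_content(chunk_size=...)` on a response opened with `stream=True`, and the total from the Content-Length header.

```python
def save_in_chunks(chunks, out_file, total_size=None, on_progress=None):
    """Write an iterable of byte chunks to `out_file`, reporting progress.

    `on_progress(bytes_written, total_size)` is called after each chunk.
    Returns the number of bytes written.
    """
    written = 0
    for chunk in chunks:
        if not chunk:
            # Skip empty chunks, which some streams use as keep-alives.
            continue
        out_file.write(chunk)
        written += len(chunk)
        if on_progress is not None:
            on_progress(written, total_size)
    return written
```

The callback could push `(written, total_size)` onto the existing queue so the main thread can redraw the status line without blocking on the download.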
All web request logic is currently handled by __main__.py.
This clutters the file, making it hard to understand and maintain.
A new interface should be created for handling web requests.
In addition, it should be generic, so that dependency injection can be used to assist #3.
This return type is the only thing preventing the PodcastProvider interface from being entirely independent of the web.
The return type will need to support asynchronous downloading and contain the total download size (as given by int(http_download_response.headers['Content-Length'])).
This will probably require the creation of a new class for the return type.
The DOM of the site has changed, so videos no longer download.
Podcasts now have subtitle files with automatic captions available for download.
Supporting this would be useful.
Will allow us to track regressions.
Currently we download all available podcasts, but typically users only want podcasts from the current year.
We now extract the course series and use it for categorisation (see #19), which can be used to bootstrap the implementation for this.
This should be supported by a setting to allow all podcasts to be downloaded.
Currently we have an undocumented dictionary format for podcast downloads, containing the following properties:
For future development, we should develop a dedicated class for podcasts, as well as making status an enum type.
We should also think about breaking up functionality - ideally run.py should only care about general data flow through the program, rather than implementation details such as output formatting, extraction of page data, etc.
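A dedicated class with an enum status might look like the sketch below. The property list is not reproduced above, so the fields are assumptions taken from the keys used in the error-reporting snippet elsewhere on this page ("name", "status", "error", "completion_time"); every status value other than "error" is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # Only "error" appears in the existing code; the rest are assumed.
    WAITING = "waiting"
    DOWNLOADING = "downloading"
    SUCCESS = "success"
    ERROR = "error"


@dataclass
class Podcast:
    name: str
    status: Status = Status.WAITING
    error: Optional[str] = None
    completion_time: Optional[float] = None

    def fail(self, message, when):
        """Record a failure in one place instead of setting three keys."""
        self.status = Status.ERROR
        self.error = message
        self.completion_time = when
```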
It is quite a risk having raw passwords on disk. On each run the program should ask you for your username and password.
Now that web handling has been abstracted, putting properties specific to the UomPodcastProvider into the Profile doesn't seem reasonable.
Additionally, the initial reason for them to be in the settings file (see #3) no longer applies.
Instead, they should be moved to attributes in the UomPodcastProvider class, where they can still be changed by an overriding class, if necessary.
Login currently fails as the new 2FA system is not supported.
Python is not an appropriate config format; YAML (or another data format) should be used instead.
The config file should also be automatically generated. It is also advisable that the config is placed inside the user's home directory and has read permission restricted to only that user.
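Generating the file with owner-only permissions can be done at creation time, so there is no window in which it is world-readable. This is a sketch for POSIX systems; `create_private_config` and the example path are hypothetical, and the contents are passed in as text so no particular YAML library is assumed.

```python
import os
from pathlib import Path


def create_private_config(path, contents):
    """Create a new config file readable and writable only by its owner.

    Raises FileExistsError rather than overwriting an existing config.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode 0o600 is applied when the file is created; O_EXCL refuses to
    # reuse an existing file that may have looser permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as config_file:
        config_file.write(contents)
    return path


# Hypothetical usage:
#   create_private_config(Path.home() / ".lecture-hoarder" / "settings.yaml",
#                         "base_dir: ~/Documents/Lectures\n")
```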
The year for each lecture is given by the first numeric character in the course name
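Taken literally, that rule is a short scan over the course name. The sketch below uses a hypothetical function name and returns None when the name contains no digit, which is an assumed fallback.

```python
def course_year(course_name):
    """Return the first ASCII digit in `course_name` as an int, else None."""
    for char in course_name:
        if char in "0123456789":
            return int(char)
    return None


print(course_year("COMP10120 - First Year Team Project"))
# prints: 1
```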
Having a migration handler would also be useful
Long podcast names cause a single download to spread over multiple lines.
This leads to corruption when it comes to trying to overwrite the download status.
We should use the known terminal width to clip podcast names to a suitable length, and add an ellipsis to show that some text is hidden.
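A clipping helper along these lines would keep each status on one line. The function name and the `reserved` parameter (columns kept free for the status text) are assumptions; `shutil.get_terminal_size` is used only when no width is passed in.

```python
import shutil


def clip_name(name, reserved=0, width=None):
    """Clip `name` so it fits in the terminal, ending with '...' if cut."""
    if width is None:
        width = shutil.get_terminal_size().columns
    available = max(width - reserved, 1)
    if len(name) <= available:
        return name
    if available <= 3:
        # Too narrow for an ellipsis; just truncate.
        return name[:available]
    return name[:available - 3] + "..."


print(clip_name("abcdefghij", width=8))
# prints: abcde...
```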
Similar concept to the abstraction of web requests (#21)
Allows the program to be tested without side effects - potentially useful for a dry run option
Releases should have a .deb file produced that will install the program into the /bin or /usr/bin location.
Initially, we should aim for the settings file to be specified with -s or --settings-file.
Future options could include displaying the license, a dry run, and a manual override for the settings file.
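With the standard library's argparse, the initial option could be declared as below. Only the -s and --settings-file flags come from the text; the program name, help text, and the None default (meaning "use built-in defaults") are assumptions.

```python
import argparse


def build_parser():
    """Build the command-line parser (sketch; only -s is from the issue)."""
    parser = argparse.ArgumentParser(
        prog="lecture-hoarder",
        description="Download University of Manchester lecture podcasts.",
    )
    parser.add_argument(
        "-s", "--settings-file",
        metavar="PATH",
        default=None,
        help="path to the settings file (assumed default: none)",
    )
    return parser


args = build_parser().parse_args(["-s", "settings.yaml"])
print(args.settings_file)
# prints: settings.yaml
```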
The codebase now contains sensible default values for all settings.
The program should be able to run without any settings file.
The base directory for the repo has become polluted with various files.
Is it time to move the source code into its own directory?
When testing error reporting, I found that simulating an error often led to unexpected results.
Example code: (line 104)
# Check status code valid
if True:  # get_video_service_podcast_page.status_code != 200:
    podcast["completion_time"] = time.time()
    podcast["error"] = "Could not get podcast webpage for " + podcast["name"] + \
        " - Service responded with status code" + get_video_service_podcast_page.status_code
    podcast["status"] = "error"
    return
All the real errors I have experienced so far have resulted in exceptions occurring, so I'm not too concerned about fixing this immediately.
In addition, any errors that occur can almost always be remedied by running the program again.