dynamicgenetics / spotify-rehydrator

A simple package for developing a full dataset of track features from self-requested Spotify data 🌱

Home Page: https://spotify-rehydrator.readthedocs.io

License: GNU General Public License v3.0

Python 100.00%

spotify rehydrate data download music

spotify-rehydrator's Introduction

Recreate a full dataset of audio features of songs downloaded through Spotify's download my data facility.

This requires the files named StreamingHistory{n}.json, where {n} is the file number. Numbering starts at 0 and goes up to however many files were retrieved.
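As a rough illustration of the naming scheme above, the input files could be collected in numeric order as follows. This is a minimal sketch, not the package's own loader: find_history_files and HISTORY_PATTERN are hypothetical names introduced here.

```python
import re
from pathlib import Path

# Matches StreamingHistory0.json, StreamingHistory1.json, ...
HISTORY_PATTERN = re.compile(r"^StreamingHistory(\d+)\.json$")


def find_history_files(folder):
    """Return the StreamingHistory{n}.json paths in `folder`, ordered by n.

    Hypothetical helper for illustration; the package may read files differently.
    """
    numbered = []
    for path in Path(folder).iterdir():
        match = HISTORY_PATTERN.match(path.name)
        if match and path.is_file():
            numbered.append((int(match.group(1)), path))
    # Sort numerically so that file 10 comes after file 9, not after file 1.
    return [path for _, path in sorted(numbered)]
```

Sorting on the parsed number rather than the file name keeps the order correct once there are ten or more files.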

Quick start

Extended documentation is available on ReadTheDocs. First, install the package using pip. The following example rehydrates a folder of JSON files:

# main.py
from spotifyrehydrator import Rehydrator
import os
import pathlib

if __name__ == "__main__":
    Rehydrator(
        os.path.join(pathlib.Path(__file__).parent.absolute(), "input"),
        os.path.join(pathlib.Path(__file__).parent.absolute(), "output"),
        client_id=os.getenv("SPOTIFY_CLIENT_ID"),
        client_secret=os.getenv("SPOTIFY_CLIENT_SECRET"),
    ).run(return_all=True)

run() takes boolean arguments for audio_features and artist info, or return_all, which retrieves both. These determine how much information is retrieved to make up the full dataset that is saved into the output folder.

How it works

The files for each person are read from the specified input folder.
The name and artist provided are searched with the Spotify API. The first result is taken to be the track, and the track ID is recorded.
Additional information is requested from other endpoints if audio_features, artist info or return_all were set to True.
The matched track ID and audio features are saved as one tab delimited .tsv file per person into the specified output folder.

Good to know

Not all tracks can be retrieved from the API. In our experience about 5% of tracks cannot be found. These will have a value of NONE in the output files.
There is not a guaranteed match between the first returned item in a search and the track you want. Comparing msPlayed with the track length is a good way to test this since msPlayed should not exceed the track length.
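The msPlayed check described above could look something like the sketch below. It assumes each row is a dict holding msPlayed from the streaming history and duration_ms from the matched Spotify track; whether the output files carry a column with that name is an assumption, and flag_suspect_matches and tolerance_ms are hypothetical.

```python
def flag_suspect_matches(rows, tolerance_ms=1000):
    """Return the rows whose play time exceeds the matched track's length.

    Assumes each row has "msPlayed" and a "duration_ms" key (None when the
    track was not matched). Both the helper and the column name for the
    track length are illustrative assumptions.
    """
    suspects = []
    for row in rows:
        duration = row.get("duration_ms")
        if duration is None:
            continue  # unmatched track, nothing to compare against
        # A small tolerance avoids flagging rows over rounding differences.
        if row["msPlayed"] > duration + tolerance_ms:
            suspects.append(row)
    return suspects
```

Rows returned by this check are candidates for a wrong match rather than proof of one, so they are worth inspecting by hand.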

P.S. Thanks to Pixel perfect for the title icon. 🙂

spotify-rehydrator's Issues

Use API batch functions for get features

https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features-for-several-tracks

Can get up to 100 at a time.
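One way to prepare for the batch endpoint is to split the list of track IDs into groups of at most 100 before making any requests. The sketch below only does the chunking; batched is a hypothetical helper and the API call itself is left out.

```python
def batched(track_ids, size=100):
    """Yield consecutive slices of `track_ids` holding at most `size` items.

    Hypothetical helper; 100 is the per-request limit quoted above.
    """
    for start in range(0, len(track_ids), size):
        yield track_ids[start:start + size]
```

Each yielded slice would then be sent as a single request, so 250 tracks need three calls rather than 250.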

Functions need separating.

Both get_track_id and get_track_features are now very long and would benefit from being split into separate functions for:

Retrieving data to be processed
Processing the data
Saving it out

Maybe a good time to consider a Data Class?

Add refresh token for when the token runs out.

This returns a 401 error.

A quick Google search turns up some possible solutions in the Spotipy issues / questions section.

Introduce a matching procedure for songs

Rank API search results based on a distance measure.
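A possible starting point, sketched with the standard library only: score each search result by string similarity to the requested track and artist, then sort. The issue does not name a measure, so difflib.SequenceMatcher (a similarity ratio, where higher means closer) is swapped in here as an assumption. The candidate dicts with "name" and "artist" keys are a simplified, hypothetical shape, not the raw API response.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def rank_candidates(track_name, artist_name, candidates):
    """Sort candidates from most to least similar to the requested track.

    `candidates` is assumed to be a list of dicts with "name" and "artist"
    keys; this flattened shape is hypothetical.
    """
    def score(candidate):
        # Average the two similarities so both fields count equally.
        return (
            similarity(track_name, candidate["name"])
            + similarity(artist_name, candidate["artist"])
        ) / 2

    return sorted(candidates, key=score, reverse=True)
```

Taking the first item of the ranked list would replace taking the first search result. A minimum score could also be required before accepting a match.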

Build in recovery from rate limiting

https://developer.spotify.com/documentation/web-api/#rate-limiting

Error status 429
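Recovery could follow the pattern below: when a request reports status 429, wait for the number of seconds given in the Retry-After header and try again. This is a sketch independent of any HTTP library. call_with_retry is a hypothetical helper, and request stands in for whatever function performs the API call, assumed to return a (status_code, headers, body) tuple.

```python
import time


def call_with_retry(request, max_retries=3, sleep=time.sleep):
    """Call `request()` and retry while it reports HTTP 429.

    `request` is a hypothetical zero-argument callable returning
    (status_code, headers, body). `sleep` is injectable for testing.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = request()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Wait for the number of seconds the server asks for; fall back to
        # one second if the header is missing (the fallback is an assumption).
        wait_seconds = int(headers.get("Retry-After", 1))
        sleep(wait_seconds)
    raise RuntimeError("still rate limited after %d retries" % max_retries)
```

Capping the number of retries keeps a long run from hanging indefinitely if the rate limit never clears.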

Convert to package for easier reuse

This will require re-organising the repo a little, and changing the default arguments.

Other things that could be achieved through this option include:

Choosing whether to get TrackIDs, Features, or something else? (Lyrics)
Re-organise into package structure
setup.py
Register with PyPI
Change input/output file options to arguments.

Reintroduce existing data checks

Before running check if the output file exists.
If it does then skip that person and log as info.
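The check described above might be sketched like this. It assumes one output file per person named {person}.tsv in the output folder; that naming scheme and the people_to_process helper are both hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def people_to_process(people, output_folder):
    """Return the people who do not yet have an output file.

    Assumes output files are named "{person}.tsv"; this is an illustrative
    assumption about the naming scheme.
    """
    remaining = []
    for person in people:
        output_file = Path(output_folder) / f"{person}.tsv"
        if output_file.exists():
            # Log at info level and skip, as the issue suggests.
            logger.info("Skipping %s: %s already exists", person, output_file)
            continue
        remaining.append(person)
    return remaining
```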

Change print statements to logs

Add useful error messages

Dataset doesn't contain artistName or trackName in Tracks

Complete test set

Check the joins match as expected etc and any data manipulation is tested.
Will need to mock the API calls.

Improve logging functionality

At the moment you lose the information about the files if you have to restart

Batch the process for each unique participant

Output file values for empty tracks

Instead of having 'None' as the value in the output csv it would be better to have a null placeholder in each column. Otherwise reading the file back in can cause problems.
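A sketch of what that could look like with the standard csv module: every missing value is written as an empty field, so each row keeps the same number of columns. The column list is illustrative only (trackID and danceability are assumed names), and write_tsv is a hypothetical helper.

```python
import csv

# Illustrative columns; the real output has more and may name them differently.
COLUMNS = ["trackName", "artistName", "trackID", "danceability"]


def write_tsv(rows, handle):
    """Write `rows` as tab-delimited text, leaving missing values empty."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Replace None or absent keys with an empty field in every column.
        writer.writerow(
            {key: ("" if row.get(key) is None else row[key]) for key in COLUMNS}
        )
```

An unmatched track then appears as a row with its name and artist followed by empty fields, rather than a literal 'None' string.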

Update README with clear instructions

Understand why some tracks aren't returned by the API

About 5% of tracks aren't returned by the API. Quite a few of them seem to have special characters so that could be the reason. It would be good to understand if there's anything we can change about the search queries to increase chances of them returning successfully.
