jamesowers / midi_degradation_toolkit
A toolkit for generating datasets of MIDI files which have been degraded to be 'un-musical'.
License: MIT License
In particular, split_range_sample was originally written to sample uniformly-distributed floats. We have since decided that rounding to ints is better.
I have converted the split_range_sample method itself, but not yet all of the degradations. I will do this as I go through writing tests for them.
So far, only time_shift has been updated to reflect this.
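For reference, a minimal sketch of what uniform int sampling over a union of ranges could look like (the real split_range_sample signature may differ; this is an illustration, not the mdtk implementation):

```python
import numpy as np

def split_range_sample(ranges, rng=np.random):
    """Sample an int uniformly from a union of [low, high) ranges.

    Sketch of the int-based behaviour; `ranges` is a list of
    (low, high) tuples.
    """
    sizes = [high - low for low, high in ranges]
    total = sum(sizes)
    # Weight the range choice by its size so the overall draw is
    # uniform across the whole union.
    idx = rng.choice(len(ranges), p=[s / total for s in sizes])
    low, high = ranges[idx]
    return rng.randint(low, high)

print(split_range_sample([(0, 10), (50, 60)]))  # e.g. 7 or 53
```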
Hopefully this won't be too hard. Essentially I'd like the default setting to install the bare minimum, and an `[all]`-type extra to install the full optional dependencies. Don't know the standard for that, but need to sort it for ease.
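For what it's worth, the standard mechanism for this is setuptools extras. A sketch (the dependency lists here are made up):

```python
# setup.py (sketch; the dependency lists here are made up)
from setuptools import setup, find_packages

setup(
    name='mdtk',
    packages=find_packages(),
    install_requires=['numpy', 'pandas'],         # bare minimum
    extras_require={
        'all': ['torch', 'tqdm', 'pretty_midi'],  # full optional deps
    },
)
```

Then `pip install mdtk` gives the minimal install and `pip install "mdtk[all]"` pulls in the optional dependencies.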
3 decimal places seems fine, which would be microseconds.
We will ignore them in our model input and output, but how should we handle them in the degradations?
For example, can join_notes join across tracks (this would be "ignoring")?
This is just a placeholder to think about improvements to make.
Also, I recently performed a lit review on a paper with a similar format to ours: "here's some new data and new tasks". My main criticism was that there were no comparisons to models from the literature. Are we sure there is nothing we can implement from the literature? We should anticipate this criticism and think about what models from the literature we could implement for comparison.
Docs, models, trainers, eval, paper.
(Paper is done already)
I'd go for having them both in that "midi" package and renaming it perhaps? (Currently csv_to_df is data_structures.read_note_csv)
Essentially, mdtk.midi (renamed) would be for file I/O and conversion, while mdtk.data_structures would be about doing things with dataframes.
Related to #20 (and #46, in a way)
As discussed on Skype, here are a few examples of how we want it to work (o = onset, . = nothing, - = sustain):
Example 1: Don't cut on offsets.
Input:
....o--------
....o----....
Output:
....o--------
Example 2: Cut on onsets.
Input:
....o--------
......o------
Output:
....o-o------
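For concreteness, a minimal sketch matching these examples, assuming the df holds a single (track, pitch) group with 'onset' and 'dur' columns and a unique index (not the actual mdtk implementation):

```python
def fix_overlapping_notes(df):
    """Sketch: notes sharing an onset merge into the longest one
    (Example 1); a later onset cuts any note still sounding
    (Example 2); an offset alone never cuts anything."""
    df = df.sort_values('onset')
    # Example 1: keep only the longest note at each onset.
    df = df.loc[df.groupby('onset')['dur'].idxmax()]
    # Example 2: truncate any note that runs past the next onset.
    offsets = df['onset'] + df['dur']
    next_onsets = df['onset'].shift(-1)
    cut = next_onsets.notna() & (offsets > next_onsets)
    df.loc[cut, 'dur'] = next_onsets[cut] - df.loc[cut, 'onset']
    return df
```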
The line `df = df.groupby(['track', 'pitch']).apply(fix_overlapping_notes)` increases computation time by at least 100x. This is mad. For now, I'm just going to bypass this by adding a flag to skip the check (we decided not to enforce this), but I think we will learn something if we try to profile the code and understand why it's so slow.
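One way to start: profile the call in isolation (sketch; assumes `df` and `fix_overlapping_notes` are defined at module level):

```python
import cProfile
import pstats

# Profile just the slow line to see where the time actually goes.
cProfile.run(
    "df.groupby(['track', 'pitch']).apply(fix_overlapping_notes)",
    'overlap.prof')
pstats.Stats('overlap.prof').sort_stats('cumulative').print_stats(20)
```

A likely culprit is that groupby.apply constructs a new DataFrame per (track, pitch) group in Python, so thousands of tiny groups cost far more than the work done inside each one.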
Decide what to use.
The Composition object potentially provides lots of useful functionality for examining the excerpts, but also adds potential overhead and complication if we don't use any of that functionality.
A lower-level solution, like directly using a dataframe, array, or dict, might be a better option for some use cases.
Maybe try to speed this up at some point
Add a parameter, max_notes, to do so.
There are 2 options:
This will allow join_notes to always reverse split_notes. Currently, split_notes is the only degradation that cannot be reversed by another (or the same) degradation.
Something called "align_to_grid" or similar. This would make notes more difficult to detect.
Not a priority before release 1, but (as noted in #60), data_structures.fix_overlaps
is quite a slow function at the moment.
Whilst it's not possible to operate in place when adding new data to dataframes (a copy must occur), there are various degradations that could work in place - this could be much more efficient if the user doesn't want to retain the original dataframe.
Investigate speedups obtained and whether this is worth it.
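A sketch of what the option might look like (hypothetical signature):

```python
def pitch_shift(excerpt, inplace=False):
    """Sketch of an inplace option: mutate the caller's dataframe
    instead of copying when they don't need the original."""
    df = excerpt.note_df if inplace else excerpt.note_df.copy()
    # ... modify df's existing rows directly, e.g. via df.loc ...
    return df
```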
The train_task script will presume that the datasets are full and correct if the file exists. We should add a catch to the formatter creation scripts so that, if they terminate early, any unfinished files are deleted.
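A minimal sketch of such a catch (the helper name is made up):

```python
import os

def build_with_cleanup(out_path, build_fn):
    """Sketch: delete a partially written file if formatter creation
    is interrupted or errors out, so train_task never sees it."""
    try:
        build_fn(out_path)
    except BaseException:  # also catches KeyboardInterrupt
        if os.path.exists(out_path):
            os.remove(out_path)
        raise
```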
For example, min_duration > max_duration.
Currently, examples like the above will warn with "No valid notes found." (or similar) and return None, because no note can be shifted to produce a duration in the given range. That's probably fine, but we could also explicitly check the parameter settings to give a more explicit warning about what is happening.
We currently quantize onset time and duration. It would be good to have (at least) an option to instead quantize onset and offset (note_on and note_off) directly. Rounding issues can currently cause the duration to be off by 1.
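A quick illustration of the off-by-one:

```python
# Quantizing onset and dur independently doesn't commute with
# quantizing the offset directly.
onset, dur = 10.6, 10.6            # true offset = 21.2
print(round(onset) + round(dur))   # 22: onset and dur rounded separately
print(round(onset + dur))          # 21: offset rounded directly
```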
In case people use a subset of them. Not sure if we've hard-coded this anywhere, but we shouldn't.
Validate the parameters, including range and type.
Less obvious:
Also (for example), in add_note, min_duration.
At the moment, if you run the script without --command, there's no way to create the command csvs without running the whole shebang again. We should think about how best to structure the script to make stuff like this easier.
Possibilities include:
An eval function for each task which takes as input a data point (or a set of data points), and a label (or a set of labels), and returns the metric.
A --file flag to read labels from a file in the given eval script.
Ideally, these would be independent of our formats where possible. For example, the helpfulness script takes in 3 dfs and outputs a score. Others should be similarly easy to use, independent of format.
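As a sketch of the shape this could take (the task and function names here are assumptions), an eval function would just map plain arrays to a number:

```python
import numpy as np

def evaluate_error_detection(labels, predictions):
    """Sketch: plain arrays in, one metric out, no mdtk formats
    required."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    return float((labels == predictions).mean())  # accuracy

print(evaluate_error_detection([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```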
Enable the testing of our models, as well as evaluation metrics (F-measure, etc.) for the various tasks:
Currently we don't know how good our baselines are against super dumb baselines:
We should probably evaluate these for ourselves at least before releasing our baselines (which will look a bit silly if they don't win!). Any other dumb ones to propose?
The overwrite check only works if the zipfile contains a top-level directory named the same as the zip file, and the returned path is potentially useless if this isn't the case.
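A sketch of a more robust approach, deriving the top-level names from the archive itself rather than guessing (function name assumed):

```python
import os
import zipfile

def extract_zip(zip_path, dest):
    """Sketch: read the archive's actual top-level entries for both
    the overwrite check and the returned path."""
    with zipfile.ZipFile(zip_path) as zf:
        top_level = {name.split('/')[0] for name in zf.namelist()}
        # Only skip extraction if everything is already present.
        if not all(os.path.exists(os.path.join(dest, t))
                   for t in top_level):
            zf.extractall(dest)
    # Return the real path(s) rather than a guessed one.
    return [os.path.join(dest, t) for t in sorted(top_level)]
```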
In degradations, there's a common pattern like:

```python
for note_index in range(excerpt.note_df.shape[0]):
    ...
```

i.e. using a for loop over a list of integers to edit a dataframe. This is bad for two reasons: it's slow, and .loc selects by label rather than position, so if the index isn't the consecutive integers 0 to N-1 (say it's [1, 3, 2]), asking .loc for 1, 2, then 3 will return you ilocs 0, 2, then 1. There's likely a better way to do this. If possible, use a vectorised solution. For example, with relation to time_shift(), which has this code:
```python
for note_index in range(excerpt.note_df.shape[0]):
    onset = excerpt.note_df.loc[note_index, 'onset']
    offset = onset + excerpt.note_df.loc[note_index, 'dur']
    # Early-shift bounds (decrease onset)
    earliest_earlier_onset = max(onset - max_shift + 1, 0)
    latest_earlier_onset = max(onset - min_shift + 1,
                               earliest_earlier_onset)
    latest_earlier_onset = min(latest_earlier_onset, onset)
    ...
```
you could instead do this process in a vectorised fashion - you're just making a boolean array, ultimately. Remove the loop entirely, set onset = excerpt.note_df.onset and offset = onset + excerpt.note_df.dur, and use pandas Series .apply() methods to apply max() and min() to every element.
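For what it's worth, numpy's elementwise maximum/minimum avoid even the .apply() calls, since they broadcast over a whole Series at C speed. A sketch of the vectorised bounds, assuming scalar min_shift and max_shift as in the original:

```python
import numpy as np

# Vectorised equivalent of the per-note loop above; excerpt,
# min_shift, and max_shift are as in time_shift().
onset = excerpt.note_df['onset']
offset = onset + excerpt.note_df['dur']

earliest_earlier_onset = np.maximum(onset - max_shift + 1, 0)
latest_earlier_onset = np.maximum(onset - min_shift + 1,
                                  earliest_earlier_onset)
latest_earlier_onset = np.minimum(latest_earlier_onset, onset)
```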
Eval on the test set for one model I trained took nearly 30 minutes.
Running ./make_dataset.py, got this error:
```
Making target data:   0%| | 5/22522 [00:00<1:17:36, 4.84it/s]
Traceback (most recent call last):
  File "./make_dataset.py", line 457, in <module>
    degraded = deg_fun(excerpt, **deg_fun_kwargs)
  File "/Users/kungfujam/git/midi_degradation_toolkit/mdtk/degradations.py", line 37, in seeded_func
    return func(*args, **kwargs)
  File "/Users/kungfujam/git/midi_degradation_toolkit/mdtk/degradations.py", line 980, in join_notes
    degraded.loc[nexts[-1]]['dur'] -
IndexError: list index out of range
```
By the sounds of things, nexts is empty prematurely or something.
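Until the root cause is found, a guard along these lines (the wrapper is hypothetical; the real fix belongs inside join_notes) would at least fail gracefully:

```python
import warnings

def join_notes_safely(degraded, nexts):
    """Sketch of the guard only: bail out cleanly when there is
    nothing to join, instead of indexing nexts[-1] when empty."""
    if len(nexts) == 0:
        warnings.warn('join_notes: no valid notes to join.')
        return None
    return degraded.loc[nexts[-1], 'dur']  # the line that crashed
```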
My first versions will check basic validity (offset > onset), but not whether a newly degraded note overlaps with any existing notes.
This is likely important.
Easier to start off developing for Unix and test on a Windows machine later. The line to edit in setup.py will be:
"Operating System :: Unix"
-> "Operating System :: OS Independent"
Generate/check docs for mdtk use (including the readme).
We should do some data quality analysis of the data we are going to release. I'm thinking a notebook (which also doubles as an intro to what data are available for use) which reviews the data by:
Essentially I want to check that the data are not rubbish, and we can hear where the degradations are!
I guess it's ok if the package requires torch, but users that have issues installing pytorch for whatever reason would ideally still be able to run make_dataset.py. I don't think it is actually required.
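One option is to import torch lazily so that only the model code paths need it. A sketch:

```python
# Sketch: keep torch out of mdtk's top-level imports so that
# make_dataset.py runs without it; only model code requires it.
try:
    import torch
except ImportError:
    torch = None

def require_torch():
    """Call at the top of model/trainer entry points only
    (hypothetical helper)."""
    if torch is None:
        raise ImportError('pytorch is needed for the models, but not '
                          'for make_dataset.py')
```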
This would introduce all sorts of complications if we shift the degraded excerpts to 0, including:
Still, it would be easy for a model to discover that any excerpt not beginning at 0 is degraded.
One possible solution would be to add some random amount of space (say, between 0 and 100 ms) to the start of each excerpt upon the creation of the excerpt, i.e. here: https://github.com/JamesOwers/midi_degradation_toolkit/blob/master/make_dataset.py#L458
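That fix is only a couple of lines. A sketch, with variable names assumed from the make_dataset.py context:

```python
import numpy as np

# Pad the clean and degraded excerpts by the same random offset so a
# non-zero start time no longer implies "degraded".
pad_ms = np.random.randint(0, 101)  # 0-100 ms inclusive
excerpt.note_df['onset'] += pad_ms
degraded.note_df['onset'] += pad_ms
```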
This includes no degradations (necessary), as well as multiple degradations (would be nice).
We should allow a custom window size. But then, should we always cut on onsets? Or frames with no notes? Or...?
e.g., dur==0, overlapping notes, etc. Assert and fail? Or warn?
The best thing to do would be to switch from the warnings library to the logging library, but this could take a while. At minimum, make all the tqdm stuff shorter (long paths in desc are killing it and sometimes creating multi-line output), and probably suppress warnings by default, only switching them on with a flag in the script.
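A minimal sketch of the switch (the verbose flag is hypothetical):

```python
import logging

# Route existing warnings through logging and silence them by
# default; `verbose` would come from a script flag.
verbose = False
logging.captureWarnings(True)
logging.basicConfig(level=logging.WARNING if verbose else logging.ERROR,
                    format='%(levelname)s: %(message)s')
```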
This supersedes #3
I use [0-88) at the moment. It seems we use something more like [0-127) (or something). Anyway, we should set sensible defaults for some of these as global vars, since we may use similar params in different places.
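Something like the following, with names and values purely illustrative:

```python
# Hypothetical module-level defaults, e.g. in mdtk/degradations.py:
MIN_PITCH_DEFAULT = 0
MAX_PITCH_DEFAULT = 127  # MIDI pitches are 0-127 inclusive

def pitch_shift(excerpt, min_pitch=MIN_PITCH_DEFAULT,
                max_pitch=MAX_PITCH_DEFAULT):
    ...
```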
```
/Users/kungfujam/git/midi_degradation_toolkit/mdtk/downloaders.py:76: UserWarning: WARNING: /Users/kungfujam/.mdtk_cache already exists, writing files within here only if they do not already exist.
  category=UserWarning)
```
Note that the trailing `category=UserWarning)` isn't part of the warning.
For example, ensure that C4 = 60 (maybe?) and set up a default tempo for datasets which have a metrical basis.
Current thoughts are:
piano-midi is a must.
PPD should go in as well, but we need a decision on monophonic vs polyphonic.
Maestro would be nice, but probably best to leave it for future work.
Simple example: removing a note from an excerpt with no notes. But there are others.
Or, add tests which pass other formats to the degradations. For example, there are currently no tests with dataframes whose indices aren't consecutive starting from 0, or with unsorted dataframes, or ones with columns out of order. Some of these we might want to allow; some we might be okay erroring on.
We could also add a function which, given some dataframe, enforces formatting, like the sketch below:
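A minimal version (the canonical column order here is an assumption):

```python
def enforce_note_df_format(df):
    """Sketch of the proposed enforcement function."""
    df = df[['onset', 'track', 'pitch', 'dur']].copy()  # column order
    df = df.sort_values(['onset', 'track', 'pitch'])    # sorted
    return df.reset_index(drop=True)                    # index 0..N-1
```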
Currently, the degradations are closed under 1, 2, 3, and 5 (if given correctly, will return correctly), but not necessarily 4 (the resulting df may not be sorted, even if the input one was).
Giving one a df which has the wrong columns would likely error.
Just a warning, for now. In case it becomes important.
For people hoping to use this for AMT, it would be useful to have some code which, given a transcription and a ground truth, will output the proportion of errors which could be assigned to each degradation.
This would allow users to create a dataset for their specific use case.
We could also have it output recommended parameter values.
Or, ideally, output a json file directly which would be readable by the make_dataset script.
Some difficulties:
Some types of randomness aren't guaranteed to be OS independent.
Also, we may want to use a numpy RandomState rather than seeds, but probably not.
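A sketch of the RandomState approach for comparison:

```python
import numpy as np

# One RandomState passed around, rather than re-seeding global state
# in each degradation. Reproducible, but OS-independence is still not
# guaranteed for every distribution.
rng = np.random.RandomState(42)
note_index = rng.randint(0, 100)
shift = rng.choice([-50, 50])
```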