jamesowers / midi_degradation_toolkit
A toolkit for generating datasets of MIDI files which have been degraded to be 'un-musical'.
License: MIT License
In particular, split_range_sample was originally written to sample uniformly-distributed floats. We have since decided that rounding to ints is better.
I have converted the split_range_sample method itself, but not yet all of the degradations. I will do this as I go through writing tests for them.
So far, only time_shift has been updated to reflect this.
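For reference, a minimal sketch of what uniform int sampling over a union of ranges could look like (the real split_range_sample signature may differ; this is an illustration, not the mdtk implementation):

```python
import numpy as np

def split_range_sample(ranges, rng=np.random):
    """Sample an int uniformly from a union of [low, high) ranges.

    Sketch of the int-based behaviour; `ranges` is a list of
    (low, high) tuples.
    """
    sizes = [high - low for low, high in ranges]
    total = sum(sizes)
    # Weight the range choice by its size so the overall draw is
    # uniform across the whole union.
    idx = rng.choice(len(ranges), p=[s / total for s in sizes])
    low, high = ranges[idx]
    return rng.randint(low, high)

print(split_range_sample([(0, 10), (50, 60)]))  # e.g. 7 or 53
```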
Hopefully this won't be too hard. Essentially I'd like the default setting to install the bare minimum, and an `[all]`-type extra to install the full optional dependencies. Don't know the standard for that, but need to sort it for ease.
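For what it's worth, the standard mechanism for this is setuptools extras. A sketch (the dependency lists here are made up):

```python
# setup.py (sketch; the dependency lists here are made up)
from setuptools import setup, find_packages

setup(
    name='mdtk',
    packages=find_packages(),
    install_requires=['numpy', 'pandas'],         # bare minimum
    extras_require={
        'all': ['torch', 'tqdm', 'pretty_midi'],  # full optional deps
    },
)
```

Then `pip install mdtk` gives the minimal install and `pip install "mdtk[all]"` pulls in the optional dependencies.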
3 decimal places seems fine, which would be microseconds.
We will ignore them in our model input and output, but how should we handle them in the degradations?
For example, can join_notes join across tracks (this would be "ignoring")?
This is just a placeholder to think about improvements to make.
Also, I recently performed a lit review on a paper with a similar format to ours: "here's some new data and new tasks". My main criticism was that there were no comparisons to models from the literature. Are we sure there is nothing we can implement from the literature? We should anticipate this criticism and think about what models from the literature we could implement for comparison.
Docs, models, trainers, eval, paper.
(Paper is done already)
I'd go for having them both in that "midi" package and renaming it perhaps? (Currently csv_to_df is data_structures.read_note_csv)
Essentially, mdtk.midi (renamed) would be for file I/O and conversion, while mdtk.data_structures would be about doing things with dataframes.
Related to #20 (and #46, in a way)
As discussed on Skype, here are a few examples of how we want it to work (o = onset, . = nothing, - = sustain):
Example 1: Don't cut on offsets.
Input:
....o--------
....o----....
Output:
....o--------
Example 2: Cut on onsets.
Input:
....o--------
......o------
Output:
....o-o------
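For concreteness, a minimal sketch matching these examples, assuming the df holds a single (track, pitch) group with 'onset' and 'dur' columns and a unique index (not the actual mdtk implementation):

```python
def fix_overlapping_notes(df):
    """Sketch: notes sharing an onset merge into the longest one
    (Example 1); a later onset cuts any note still sounding
    (Example 2); an offset alone never cuts anything."""
    df = df.sort_values('onset')
    # Example 1: keep only the longest note at each onset.
    df = df.loc[df.groupby('onset')['dur'].idxmax()]
    # Example 2: truncate any note that runs past the next onset.
    offsets = df['onset'] + df['dur']
    next_onsets = df['onset'].shift(-1)
    cut = next_onsets.notna() & (offsets > next_onsets)
    df.loc[cut, 'dur'] = next_onsets[cut] - df.loc[cut, 'onset']
    return df
```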
The line `df = df.groupby(['track', 'pitch']).apply(fix_overlapping_notes)` increases computation time by at least 100x. This is mad. For now, I'm just going to bypass this by adding a flag to skip the check (we decided not to enforce this), but I think we will learn something if we try to profile the code and understand why it's so slow.
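One way to start: profile the call in isolation (sketch; assumes `df` and `fix_overlapping_notes` are defined at module level):

```python
import cProfile
import pstats

# Profile just the slow line to see where the time actually goes.
cProfile.run(
    "df.groupby(['track', 'pitch']).apply(fix_overlapping_notes)",
    'overlap.prof')
pstats.Stats('overlap.prof').sort_stats('cumulative').print_stats(20)
```

A likely culprit is that groupby.apply constructs a new DataFrame per (track, pitch) group in Python, so thousands of tiny groups cost far more than the work done inside each one.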
Decide what to use.
The Composition object potentially provides lots of useful functionality for examining the excerpts, but also adds potential overhead and complication if we don't use any of that functionality.
A lower-level solution, like directly using a dataframe, array, or dict, might be a better option for some use cases.
Maybe try to speed this up at some point
Add a parameter, max_notes, to do so.
There are 2 options:
This will allow join_notes to always reverse split_notes. Currently, split_notes is the only degradation that cannot be reversed by another (or the same) degradation.
Something called "align_to_grid" or similar. This would make notes more difficult to detect.
Not a priority before release 1, but (as noted in #60), data_structures.fix_overlaps
is quite a slow function at the moment.
Whilst it's not possible to operate in place when adding new data to dataframes (a copy must occur), there are various degradations that could work in place - this could be much more efficient if the user doesn't want to retain the original dataframe.
Investigate speedups obtained and whether this is worth it.
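A sketch of what the option might look like (hypothetical signature):

```python
def pitch_shift(excerpt, inplace=False):
    """Sketch of an inplace option: mutate the caller's dataframe
    instead of copying when they don't need the original."""
    df = excerpt.note_df if inplace else excerpt.note_df.copy()
    # ... modify df's existing rows directly, e.g. via df.loc ...
    return df
```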
The train_task script will presume that the datasets are full and correct if the file exists. We should add a catch to the formatter creation scripts so that, if they terminate early, any unfinished files are deleted.
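A minimal sketch of such a catch (the helper name is made up):

```python
import os

def build_with_cleanup(out_path, build_fn):
    """Sketch: delete a partially written file if formatter creation
    is interrupted or errors out, so train_task never sees it."""
    try:
        build_fn(out_path)
    except BaseException:  # also catches KeyboardInterrupt
        if os.path.exists(out_path):
            os.remove(out_path)
        raise
```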
For example, min_duration > max_duration.
Currently, examples like the above will warn with "No valid notes found." (or similar) and return None, because no note can be shifted to produce a duration in the given range. That's probably fine, but we could also explicitly check the parameter settings to give a more explicit warning about what is happening.
We currently quantize onset time and duration. It would be good to have (at least) an option to instead quantize onset and offset (note_on and note_off) directly. Rounding issues can currently cause the duration to be off by 1.
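A quick illustration of the off-by-one:

```python
# Quantizing onset and dur independently doesn't commute with
# quantizing the offset directly.
onset, dur = 10.6, 10.6            # true offset = 21.2
print(round(onset) + round(dur))   # 22: onset and dur rounded separately
print(round(onset + dur))          # 21: offset rounded directly
```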
In case people use a subset of them. Not sure if we've hard-coded this anywhere, but we shouldn't.
Validate the parameters, including range and type.
Less obvious:
Also (for example), in add_note, min_duration.
At the moment, if you run the script without --command, there's no way to create the command csvs without running the whole shebang again. We should think about how best to structure the script to make stuff like this easier.
Possibilities include:
An eval function for each task which takes as input a data point (or a set of data points), and a label (or a set of labels), and returns the metric.
A --file flag to read labels from a file in the given eval script.
Ideally, these would be independent of our formats where possible. For example, the helpfulness script takes in 3 dfs and outputs a score. Others should be similarly easy to use, independent of format.
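As a sketch of the shape this could take (the task and function names here are assumptions), an eval function would just map plain arrays to a number:

```python
import numpy as np

def evaluate_error_detection(labels, predictions):
    """Sketch: plain arrays in, one metric out, no mdtk formats
    required."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    return float((labels == predictions).mean())  # accuracy

print(evaluate_error_detection([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```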
Enable the testing of our models, as well as evaluation metrics (F-measure, etc.) for the various tasks:
Currently we don't know how good our baselines are against super dumb baselines:
We should probably evaluate these for ourselves at least before releasing our baselines (which will look a bit silly if they don't win!). Any other dumb ones to propose?
The overwrite check only works if the zipfile contains a top-level directory named the same as the zip file, and the returned path is potentially useless if this isn't the case.
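A sketch of a more robust approach, deriving the top-level names from the archive itself rather than guessing (function name assumed):

```python
import os
import zipfile

def extract_zip(zip_path, dest):
    """Sketch: read the archive's actual top-level entries for both
    the overwrite check and the returned path."""
    with zipfile.ZipFile(zip_path) as zf:
        top_level = {name.split('/')[0] for name in zf.namelist()}
        # Only skip extraction if everything is already present.
        if not all(os.path.exists(os.path.join(dest, t))
                   for t in top_level):
            zf.extractall(dest)
    # Return the real path(s) rather than a guessed one.
    return [os.path.join(dest, t) for t in sorted(top_level)]
```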
In degradations, there's a common pattern like:

```python
for note_index in range(excerpt.note_df.shape[0]):
    ...
```

i.e. using a for loop over a list of integers to edit a dataframe. This is bad for two reasons: it's slow, and .loc selects by label rather than position, so if the index isn't the consecutive integers 0 to N-1 (say it's [1, 3, 2]), asking .loc for 1, 2, then 3 will return you ilocs 0, 2, then 1. There's likely a better way to do this. If possible, use a vectorised solution. For example, with relation to time_shift(), which has this code:
```python
for note_index in range(excerpt.note_df.shape[0]):
    onset = excerpt.note_df.loc[note_index, 'onset']
    offset = onset + excerpt.note_df.loc[note_index, 'dur']
    # Early-shift bounds (decrease onset)
    earliest_earlier_onset = max(onset - max_shift + 1, 0)
    latest_earlier_onset = max(onset - min_shift + 1,
                               earliest_earlier_onset)
    latest_earlier_onset = min(latest_earlier_onset, onset)
    ...
```
you could instead do this process in a vectorised fashion - you're just making a boolean array, ultimately. Remove the loop entirely, set onset = excerpt.note_df.onset and offset = onset + excerpt.note_df.dur, and use pandas Series .apply() methods to apply max() and min() to every element.
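For what it's worth, numpy's elementwise maximum/minimum avoid even the .apply() calls, since they broadcast over a whole Series at C speed. A sketch of the vectorised bounds, assuming scalar min_shift and max_shift as in the original:

```python
import numpy as np

# Vectorised equivalent of the per-note loop above; excerpt,
# min_shift, and max_shift are as in time_shift().
onset = excerpt.note_df['onset']
offset = onset + excerpt.note_df['dur']

earliest_earlier_onset = np.maximum(onset - max_shift + 1, 0)
latest_earlier_onset = np.maximum(onset - min_shift + 1,
                                  earliest_earlier_onset)
latest_earlier_onset = np.minimum(latest_earlier_onset, onset)
```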
Eval on the test set for one model I trained took nearly 30 minutes.
Running ./make_dataset.py, got this error:
```
Making target data:   0%| | 5/22522 [00:00<1:17:36, 4.84it/s]
Traceback (most recent call last):
  File "./make_dataset.py", line 457, in <module>
    degraded = deg_fun(excerpt, **deg_fun_kwargs)
  File "/Users/kungfujam/git/midi_degradation_toolkit/mdtk/degradations.py", line 37, in seeded_func
    return func(*args, **kwargs)
  File "/Users/kungfujam/git/midi_degradation_toolkit/mdtk/degradations.py", line 980, in join_notes
    degraded.loc[nexts[-1]]['dur'] -
IndexError: list index out of range
```
By the sounds of things, nexts is empty prematurely or something.
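Until the root cause is found, a guard along these lines (the wrapper is hypothetical; the real fix belongs inside join_notes) would at least fail gracefully:

```python
import warnings

def join_notes_safely(degraded, nexts):
    """Sketch of the guard only: bail out cleanly when there is
    nothing to join, instead of indexing nexts[-1] when empty."""
    if len(nexts) == 0:
        warnings.warn('join_notes: no valid notes to join.')
        return None
    return degraded.loc[nexts[-1], 'dur']  # the line that crashed
```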
My first versions will check basic validity (offset > onset), but not whether a newly degraded note overlaps with any existing notes.
This is likely important.
Easier to start off developing for Unix and test on a Windows machine later. The line to edit in setup.py will be:
"Operating System :: Unix"
-> "Operating System :: OS Independent"
Generate/check docs for mdtk use (including the readme).
We should do some data quality analysis of the data we are going to release. I'm thinking a notebook (which also doubles as an intro to what data are available for use) which reviews the data by:
Essentially I want to check that the data are not rubbish, and we can hear where the degradations are!
I guess it's ok if the package requires torch, but users that have issues installing pytorch for whatever reason would ideally still be able to run make_dataset.py. I don't think it is actually required.
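One option is to import torch lazily so that only the model code paths need it. A sketch:

```python
# Sketch: keep torch out of mdtk's top-level imports so that
# make_dataset.py runs without it; only model code requires it.
try:
    import torch
except ImportError:
    torch = None

def require_torch():
    """Call at the top of model/trainer entry points only
    (hypothetical helper)."""
    if torch is None:
        raise ImportError('pytorch is needed for the models, but not '
                          'for make_dataset.py')
```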
This would introduce all sorts of complications if we shift the degraded excerpts to 0, including:
Still, it would be easy for a model to discover that any excerpt not beginning at 0 is degraded.
One possible solution would be to add some random amount of space (say, between 0 and 100 ms) to the start of each excerpt upon the creation of the excerpt, i.e. here: https://github.com/JamesOwers/midi_degradation_toolkit/blob/master/make_dataset.py#L458
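That fix is only a couple of lines. A sketch, with variable names assumed from the make_dataset.py context:

```python
import numpy as np

# Pad the clean and degraded excerpts by the same random offset so a
# non-zero start time no longer implies "degraded".
pad_ms = np.random.randint(0, 101)  # 0-100 ms inclusive
excerpt.note_df['onset'] += pad_ms
degraded.note_df['onset'] += pad_ms
```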
This includes no degradations (necessary), as well as multiple degradations (would be nice).
We should allow a custom window size. But then, should we always cut on onsets? Or frames with no notes? Or...?
e.g., dur==0, overlapping notes, etc. Assert and fail? Or warn?
The best thing to do would be to switch from the warnings library to the logging library, but this could take a while. At minimum, make all the tqdm stuff shorter (long paths in desc are killing it and sometimes creating multi-line output), and probably suppress warnings by default, only switching them on with a flag in the script.
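A minimal sketch of the switch (the verbose flag is hypothetical):

```python
import logging

# Route existing warnings through logging and silence them by
# default; `verbose` would come from a script flag.
verbose = False
logging.captureWarnings(True)
logging.basicConfig(level=logging.WARNING if verbose else logging.ERROR,
                    format='%(levelname)s: %(message)s')
```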
This supersedes #3
I use [0-88) at the moment. It seems we use something more like [0-127) (or something). Anyway, we should set sensible defaults for some of these as global vars, since we may use similar params in different places.
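Something like the following, with names and values purely illustrative:

```python
# Hypothetical module-level defaults, e.g. in mdtk/degradations.py:
MIN_PITCH_DEFAULT = 0
MAX_PITCH_DEFAULT = 127  # MIDI pitches are 0-127 inclusive

def pitch_shift(excerpt, min_pitch=MIN_PITCH_DEFAULT,
                max_pitch=MAX_PITCH_DEFAULT):
    ...
```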
```
/Users/kungfujam/git/midi_degradation_toolkit/mdtk/downloaders.py:76: UserWarning: WARNING: /Users/kungfujam/.mdtk_cache already exists, writing files within here only if they do not already exist.
  category=UserWarning)
```
Note that the trailing `category=UserWarning)` isn't part of the warning.
For example, ensure that C4 = 60 (maybe?) and set up a default tempo for datasets which have a metrical basis.
Current thoughts are:
piano-midi is a must.
PPD should go in as well, but we need a decision on monophonic vs polyphonic.
Maestro would be nice, but probably best to leave it for future work.
Simple example: removing a note from an excerpt with no notes. But there are others.
Or, add tests which pass other formats to the degradations. For example, there are currently no tests with dataframes whose indices aren't consecutive starting from 0, or with unsorted dataframes, or ones with columns out of order. Some of these we might want to allow; some we might be okay erroring on.
We could also add a function which, given some dataframe, enforces formatting, like the sketch below:
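A minimal version (the canonical column order here is an assumption):

```python
def enforce_note_df_format(df):
    """Sketch of the proposed enforcement function."""
    df = df[['onset', 'track', 'pitch', 'dur']].copy()  # column order
    df = df.sort_values(['onset', 'track', 'pitch'])    # sorted
    return df.reset_index(drop=True)                    # index 0..N-1
```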
Currently, the degradations are closed under 1, 2, 3, and 5 (if given correctly, will return correctly), but not necessarily 4 (the resulting df may not be sorted, even if the input one was).
Giving one a df which has the wrong columns would likely error.
Just a warning, for now. In case it becomes important.
For people hoping to use this for AMT, it would be useful to have some code which, given a transcription and a ground truth, will output the proportion of errors which could be assigned to each degradation.
This would allow users to create a dataset for their specific use case.
We could also have it output recommended parameter values.
Or, ideally, output a json file directly which would be readable by the make_dataset script.
Some difficulties:
Some types of randomness aren't guaranteed to be OS independent.
Also, we may want to use a numpy RandomState rather than seeds, but probably not.
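A sketch of the RandomState approach for comparison:

```python
import numpy as np

# One RandomState passed around, rather than re-seeding global state
# in each degradation. Reproducible, but OS-independence is still not
# guaranteed for every distribution.
rng = np.random.RandomState(42)
note_index = rng.randint(0, 100)
shift = rng.choice([-50, 50])
```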