Ran into a couple of rough edges while trying to run on CUDA:
The model wasn't moved to the GPU when CUDA was specified, so it failed immediately.
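A minimal sketch of the fix, assuming the usual PyTorch pattern (the `model`, `device`, and input names here are illustrative stand-ins, not the repo's actual code):

```python
import torch

# Pick the requested device, falling back to CPU when CUDA is absent.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2)  # stand-in for the real model
model = model.to(device)       # the missing call: parameters now live on `device`

x = torch.randn(3, 4, device=device)  # inputs must be on the same device
out = model(x)
```

Without the `.to(device)` call, a CUDA input hitting CPU-resident weights raises a device-mismatch RuntimeError on the first forward pass, which matches the immediate failure described above.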
The compute_auc_ap utility function didn't move tensors back to the CPU where needed.
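The usual failure mode here is calling `.numpy()` on a CUDA tensor, which raises. A hedged sketch of what `compute_auc_ap` would need before handing tensors to NumPy/scikit-learn (the `to_numpy` helper is hypothetical, not from the repo):

```python
import torch

def to_numpy(t):
    # Detach from the autograd graph and copy to host memory first;
    # Tensor.numpy() raises a TypeError on CUDA tensors.
    return t.detach().cpu().numpy()

scores = torch.tensor([0.1, 0.9, 0.4])
labels = torch.tensor([0, 1, 0])
scores_np, labels_np = to_numpy(scores), to_numpy(labels)
```

The resulting arrays can then be passed to metric functions that expect NumPy inputs.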
Even after fixing those two, it still doesn't seem to fully use the GPU; I'm not sure why.
There are several places in the code that need changing to move everything over. It would be nice to either auto-detect the GPU or expose a single configuration variable or widget to make switching easier.
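One possible shape for that single switch, assuming a notebook/script setting (the `DEVICE` name is a suggestion, not existing code):

```python
import torch

# One module-level switch for the whole script: every .to(...) call reads
# this, so moving between CPU and GPU becomes a one-line change, and the
# auto-detect default does the right thing on machines without CUDA.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

Model construction, batch loading, and metric code would all reference `DEVICE` instead of hard-coding `"cuda"` or `"cpu"` in scattered places.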
The evaluate function also doesn't move the inputs over to the model's device.
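A sketch of what an evaluation loop that does move its batches might look like; this is a generic pattern, and the signature of the repo's actual `evaluate` function is assumed, not known:

```python
import torch

def evaluate(model, loader, device):
    # Move each batch to the model's device before the forward pass;
    # skipping this raises a device-mismatch RuntimeError under CUDA.
    model.eval()
    outputs = []
    with torch.no_grad():
        for inputs, targets in loader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs.append(model(inputs))
    return torch.cat(outputs)
```

On CPU the extra `.to(device)` calls are no-ops, so the same loop works unchanged in both configurations.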