@SungFeng-Huang I have been experimenting with your PyTorch code for the first iteration of GAN training, using either the oracle bounds or the provided "gas" bounds. I was not able to reach the FER numbers reported in the paper, even in the "matching" setting between audio and text. The oracle bounds gave more reasonable results, but the FER was still in the low 30% range, with a final PER of 30.5%. With the provided unsupervised "gas" bounds, the FER was much higher than in the paper: I got an FER of 70-85%, and the best PER I could reach was 65%.
I compared the gas bounds against the oracle bounds, and the R-value looked reasonable (81.77) with a 2-frame tolerance window (standard 25 ms window / 10 ms shift).
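For reference, here is a minimal sketch of how such an R-value can be computed. It assumes the usual definition from the segmentation literature (hit rate and over-segmentation combined into a single score), boundaries given as frame indices, and a greedy one-to-one matching within the tolerance. The function names `count_hits` and `r_value` are my own and are not taken from the repository, so the scoring script used for the number above may differ in details.

```python
import math


def count_hits(ref, hyp, tol=2):
    """Count reference boundaries matched by a hypothesis boundary.

    Greedy one-to-one matching on sorted boundary lists (frame indices),
    accepting a match when the distance is at most `tol` frames.
    """
    ref, hyp = sorted(ref), sorted(hyp)
    hits, j = 0, 0
    for r in ref:
        # Skip hypothesis boundaries too far to the left of this reference.
        while j < len(hyp) and hyp[j] < r - tol:
            j += 1
        if j < len(hyp) and abs(hyp[j] - r) <= tol:
            hits += 1
            j += 1  # each hypothesis boundary is used at most once
    return hits


def r_value(n_ref, n_hyp, n_hit):
    """R-value on a 0-100 scale from boundary counts."""
    hit_rate = 100.0 * n_hit / n_ref              # HR in percent
    over_seg = 100.0 * (n_hyp / n_ref - 1.0)      # OS in percent
    r1 = math.sqrt((100.0 - hit_rate) ** 2 + over_seg ** 2)
    r2 = (-over_seg + hit_rate - 100.0) / math.sqrt(2.0)
    return 100.0 * (1.0 - (abs(r1) + abs(r2)) / 200.0)


# Toy example with made-up boundaries (not TIMIT data).
ref = [10, 20, 30, 40]
hyp = [11, 19, 33, 40]
hits = count_hits(ref, hyp, tol=2)
score = r_value(len(ref), len(hyp), hits)
```

In practice the counts would be accumulated over all utterances before calling `r_value`, rather than averaging per-utterance scores.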
I noticed that you also mentioned this issue in the commit comments. Did you ever figure out the cause?
How did you extract the GAS phone boundaries from the data set?
I found this repository: https://github.com/allyoushawn/timit_gas
I modified their decoder code to write the boundaries to a pickle file. The code ran, but the PER was much worse than the one obtained with the original uns_bnd files in ./data/timit_gas.
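One quick sanity check is to compare simple statistics of the regenerated boundaries against the provided ones, for example to spot a mismatch in units (frames vs. seconds) or in the number of boundaries per utterance. The sketch below is only illustrative: it assumes each pickle holds a list of per-utterance boundary lists, which may not match the actual layout of the uns_bnd files, and `load_bounds` and `summarize` are hypothetical helper names.

```python
import pickle
import statistics


def load_bounds(path):
    """Load a pickled boundary object from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize(bounds):
    """Summarize boundaries.

    Assumed layout: a list with one entry per utterance, each entry
    being a list of boundary positions.
    """
    counts = [len(b) for b in bounds]
    last = [max(b) for b in bounds if len(b) > 0]
    return {
        "utterances": len(bounds),
        "mean_boundaries": statistics.mean(counts),
        "mean_last_boundary": statistics.mean(last),
    }


# Toy data standing in for a loaded pickle (not real TIMIT boundaries).
example = [[0, 5, 10], [0, 20]]
stats = summarize(example)
```

Running `summarize` on both the original and the regenerated file and comparing the two dictionaries should show whether the files differ in scale or in boundary density.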