Would you mind publishing some FPGA resource usage and Fmax numbers, along with the configuration and part number used? I tried synthesizing in Vivado for various 7 Series and UltraScale parts, but could never close timing above about 320 MHz on UltraScale (NFFT=12, DATA_WIDTH=24, TWDL_WIDTH=16, truncation mode, XSERIES set to the correct value).
I also tried your other implementation, intfft_spdf (posting issues seems to be disabled on that repository), but none of the RAM blocks are inferred as BRAM; they all end up implemented as LUTRAM. Looking at the RTL, it looks like you are using two read ports and one write port, which from what I know would only be supported in UltraScale. However, targeting UltraScale/UltraScale+ didn't help, and I was still seeing 11k+ LUT utilization for a 4096-point FFT. Which device part number are you targeting, and are any special constraints needed?
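For reference, here is a minimal sketch of the kind of memory description I would expect Vivado to map to block RAM: one write port, one read port, and a registered (synchronous) read, since an asynchronous read generally forces distributed RAM. The entity name `sdp_ram`, its generics, and the default widths are hypothetical and not taken from intfft_spdf; the `ram_style` attribute is just a hint to the tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical simple dual-port RAM (one write port, one read port).
-- Names and default widths are illustrative, not from intfft_spdf.
entity sdp_ram is
    generic (
        DATA_WIDTH : integer := 48;
        ADDR_WIDTH : integer := 11
    );
    port (
        clk   : in  std_logic;
        we    : in  std_logic;
        waddr : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
        wdata : in  std_logic_vector(DATA_WIDTH-1 downto 0);
        raddr : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
        rdata : out std_logic_vector(DATA_WIDTH-1 downto 0)
    );
end entity;

architecture rtl of sdp_ram is
    type ram_t is array (0 to 2**ADDR_WIDTH-1)
        of std_logic_vector(DATA_WIDTH-1 downto 0);
    signal mem : ram_t := (others => (others => '0'));

    -- Ask Vivado to prefer block RAM for this array.
    attribute ram_style : string;
    attribute ram_style of mem : signal is "block";
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                mem(to_integer(unsigned(waddr))) <= wdata;
            end if;
            -- Registered read: the output updates one clock after raddr.
            rdata <= mem(to_integer(unsigned(raddr)));
        end if;
    end process;
end architecture;
```

If the memories in intfft_spdf differ from this pattern (for example, a second read port or an unregistered read), that might explain the LUTRAM mapping I am seeing, but I have not confirmed which case applies.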