There is Mozilla’s DeepSpeech (GitHub - mozilla/DeepSpeech: an open-source embedded (offline, on-device) speech-to-text engine that can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers), which is pretty good, and you’re right that any modern GPU handles the TensorFlow training easily.
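For what it’s worth, inference with DeepSpeech is only a few lines via its Python package. A minimal sketch, assuming `pip install deepspeech` and the released 0.9.x model/scorer files (the file paths below are placeholders):

```python
# Sketch: offline transcription with DeepSpeech's Python API.
# Model/scorer paths are placeholders for the released 0.9.x files.
import struct
import wave


def wav_to_int16(path):
    """Read a 16-bit mono WAV file into a tuple of int16 samples
    (DeepSpeech expects 16 kHz, 16-bit mono PCM)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    return struct.unpack("<%dh" % (len(frames) // 2), frames)


def transcribe(wav_path,
               model_path="deepspeech-0.9.3-models.pbmm",
               scorer_path="deepspeech-0.9.3-models.scorer"):
    # Imported lazily so the helper above works without the library installed.
    import numpy as np
    import deepspeech  # pip install deepspeech

    model = deepspeech.Model(model_path)
    model.enableExternalScorer(scorer_path)  # language model improves accuracy
    audio = np.array(wav_to_int16(wav_path), dtype=np.int16)
    return model.stt(audio)  # plain-text transcript
```

On a CPU that’s roughly real-time for short clips; a GPU build mainly matters for training, as noted above.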
I guess this project could also have leaned more on the local Windows speech services - tools like VoiceAttack are essentially layers over the win32 training and on-device recognition APIs. Since people have to run SRS anyway, and the DCS client is Windows-only, that would have been an alternative way to go. The Azure voices used for speech output are very good, though.
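On the output side, the Azure voices are reachable through the Speech service’s REST endpoint without any SDK. A hedged sketch (the region, key, and voice name are placeholders; `en-US-JennyNeural` is just one of the available neural voices):

```python
# Sketch: synthesizing speech via the Azure Speech REST TTS endpoint.
# Region, subscription key, and voice name below are placeholders.
import urllib.request


def build_ssml(text, voice="en-US-JennyNeural"):
    """Build the minimal SSML body the TTS endpoint expects."""
    return (
        '<speak version="1.0" xml:lang="en-US">'
        f'<voice name="{voice}">{text}</voice>'
        "</speak>"
    )


def synthesize(text, region, key, voice="en-US-JennyNeural"):
    """POST SSML to the regional TTS endpoint; returns WAV bytes."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    req = urllib.request.Request(
        url,
        data=build_ssml(text, voice).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm",
            "User-Agent": "tts-sketch",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The local-Windows alternative would swap this for the built-in SAPI voices, trading quality for zero cloud dependency.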