Other typical details:
Complex elaborations of the basic ideas
HMM states ← triphones ← words
each triphone → 3-5 states + connection pattern
phone sequence from pronouncing dictionary
clustering for estimation
Acoustic features
Vocal tract length normalization, speaker clustering
Output pdf for each state as mixture of Gaussians
Language model as N-gram model over words
recency/topic effects
Empirical weighting of language vs. acoustic models
etc. etc.
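
A minimal sketch of two of the items above: the words → triphones → HMM states expansion, and the empirical weighting of language vs. acoustic scores. Everything named here is an assumption for illustration: the toy dictionary `PRON_DICT`, the `l-p+r` triphone notation, the `sil` boundary context, the choice of 3 states per triphone, and the `lm_weight` and `word_penalty` values (in practice such weights are tuned on held-out data).

```python
# Hypothetical toy pronouncing dictionary (word -> phone sequence).
PRON_DICT = {
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

# The notes give 3-5 states per triphone; 3 is assumed here.
STATES_PER_TRIPHONE = 3


def word_to_triphones(word, left="sil", right="sil"):
    """Expand a word into context-dependent triphones, written l-p+r.

    The word-boundary contexts default to silence, which is an assumption.
    """
    padded = [left] + PRON_DICT[word] + [right]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def triphone_to_states(triphone, n_states=STATES_PER_TRIPHONE):
    """Name the HMM states of one triphone (connection pattern not modeled)."""
    return [f"{triphone}[{k}]" for k in range(n_states)]


def word_to_states(word):
    """Full expansion: word -> triphones -> HMM state names."""
    return [s for t in word_to_triphones(word) for s in triphone_to_states(t)]


def combined_score(acoustic_logprob, lm_logprob,
                   lm_weight=10.0, word_penalty=0.0, n_words=1):
    """Log-domain score with an empirical language-model weight.

    score = log P(X|W) + lm_weight * log P(W) + word_penalty * n_words
    """
    return acoustic_logprob + lm_weight * lm_logprob + word_penalty * n_words


if __name__ == "__main__":
    print(word_to_triphones("cat"))
    print(word_to_states("cat"))
    print(combined_score(-100.0, -2.0))
```

Clustering for estimation would then tie states across triphones that share similar contexts, so that rare triphones borrow parameters from better-observed ones; the sketch omits this step.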