> if I interpret what you're saying to mean equalizing
> the class priors in the data.
Yes, that's the idea in principle. In practice, I actually somewhat
over-sample the more frequent classes for performance reasons.
> I assume this means that you want to randomly select individual frames
> across utterances, rather than patches within utterances? Or does
> this disrupt the caching too much?
I sample individual frames independently at random, and caching doesn't
seem to be affected. I had considered sampling at the level of
contiguous phones, but I never tried it.
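
For illustration, here is a rough sketch of that kind of frame-level
sampling. This is not my actual code: the function name and the `alpha`
exponent are made up for the example. Setting alpha=0 equalizes the
class priors, alpha=1 keeps the original priors, and values in between
over-sample the frequent classes somewhat. Frames are drawn with
replacement, which is also just an assumption of the sketch.

```python
import random
from collections import defaultdict


def sample_frames(labels, n_samples, alpha=0.0, seed=0):
    """Draw frame indices independently, with tempered class priors.

    labels    -- per-frame class labels, all utterances concatenated
    n_samples -- number of frames to draw (with replacement)
    alpha     -- hypothetical knob: 0.0 = equal priors, 1.0 = original priors
    """
    rng = random.Random(seed)

    # Group frame indices by class label.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    classes = sorted(by_class)
    # Class sampling weight is count**alpha (unnormalized).
    weights = [len(by_class[c]) ** alpha for c in classes]

    # Pick a class for each sample, then a frame uniformly within that class.
    chosen = rng.choices(classes, weights=weights, k=n_samples)
    return [rng.choice(by_class[c]) for c in chosen]
```

With 900 frames of one class and 100 of another, alpha=0 gives roughly
a 50/50 split in the sample, while alpha=1 gives roughly 90/10.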
> Do you notice any degradation/gain in accuracy?
The frame-level classification accuracy drops considerably, because the
classifier no longer learns the true class priors of the training set.
You might be able to improve frame-level phone classification by
incorporating the prior probabilities at classification time.
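
One standard way to do that (I have not tried it here, so treat it as a
sketch) is to rescale each posterior by the ratio of the true prior to
the prior the classifier saw during training, then renormalize. The
function and argument names below are hypothetical.

```python
def reweight_posteriors(posteriors, train_priors, true_priors):
    """Adjust class posteriors for a mismatch in class priors.

    posteriors   -- classifier outputs P(class | frame), one per class
    train_priors -- class priors of the (re-sampled) training data
    true_priors  -- class priors of the full training set
    """
    # Bayes' rule: the posterior is proportional to likelihood * prior,
    # so swap the sampled prior for the true one.
    scaled = [p * t / s
              for p, s, t in zip(posteriors, train_priors, true_priors)]
    total = sum(scaled)
    return [v / total for v in scaled]
```

For example, an output of [0.5, 0.5] from a classifier trained on equal
priors becomes [0.9, 0.1] if the true priors are [0.9, 0.1].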
However, we don't care about classification accuracy for Tandem feature
extraction -- we use the MLP to generate acoustic features for ASR. In
terms of WER, our preliminary experiments showed that the sampling
approach is almost as good as full training -- but that wasn't an
optimal comparison. There's work to be done to improve this (for
example, it helps to re-sample after each training epoch), but in my
opinion the sampling approach will turn out to be as good or better.
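
To make the re-sampling point concrete, here is a minimal sketch that
draws a fresh subset of frames for every epoch instead of reusing one
fixed subset. The names are hypothetical, and it samples uniformly
without replacement just to keep the example short; the class
balancing is left out.

```python
import random


def epoch_samples(n_frames, n_per_epoch, n_epochs, seed=0):
    """Yield a freshly drawn list of frame indices for each epoch."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        # A new random subset every epoch, so over several epochs the
        # network sees far more distinct frames than in any one epoch.
        yield rng.sample(range(n_frames), n_per_epoch)
```

Each epoch then trains on `n_per_epoch` frames, but the subsets differ
from one epoch to the next.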
-arlo