Why is word error rate not a good metric for training a speech recognizer for the speech translation task?