Use wav2vecu2 to convert the audio to phonemes and take the phonemes that fall within the answer timespan as the answer phonemes. Then train on the NMSQA task with the context phonemes as input and the answer phonemes as output.
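The answer-phoneme step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the phoneme output has already been aligned to time as (label, start_sec, end_sec) tuples, and it assumes a phoneme belongs to the answer when its midpoint lies inside the answer timespan. The name `phonemes_in_span` and the data layout are hypothetical.

```python
from typing import List, Tuple

# (label, start_sec, end_sec); this layout is an assumption for illustration.
Phoneme = Tuple[str, float, float]


def phonemes_in_span(phonemes: List[Phoneme], start: float, end: float) -> List[str]:
    """Return the labels of phonemes whose midpoint lies in [start, end]."""
    selected = []
    for label, p_start, p_end in phonemes:
        midpoint = (p_start + p_end) / 2.0
        # Midpoint membership is an assumed rule; the note does not specify one.
        if start <= midpoint <= end:
            selected.append(label)
    return selected


# Made-up context phonemes and a made-up answer timespan.
context = [("HH", 0.00, 0.08), ("AH", 0.08, 0.20), ("L", 0.20, 0.30), ("OW", 0.30, 0.50)]
answer = phonemes_in_span(context, 0.10, 0.35)
print(answer)
```

The context labels would then form the model input and the `answer` labels the target sequence.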

Results:
Avg AOS: 0.6586811819639634
Avg FF1: 0.697483897268189
Exact Match: 0.2042313923568093
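For reference, the two overlap metrics can be sketched as below. This assumes the usual NMSQA-style definitions over time spans: AOS (Audio Overlapping Score) as intersection over union of the predicted and gold spans, and FF1 (Frame-level F1) as the harmonic mean of span precision and recall. The note does not say how these scores were computed for phoneme outputs, so the function names and the (start, end) span representation are illustrative only.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_sec, end_sec), an assumed representation


def overlap(pred: Span, gold: Span) -> float:
    """Length of the intersection of two spans (0.0 if they are disjoint)."""
    return max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))


def aos(pred: Span, gold: Span) -> float:
    """Intersection over union of the predicted and gold spans."""
    inter = overlap(pred, gold)
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def ff1(pred: Span, gold: Span) -> float:
    """Harmonic mean of precision and recall measured on span overlap."""
    inter = overlap(pred, gold)
    if inter == 0.0:
        return 0.0
    precision = inter / (pred[1] - pred[0])
    recall = inter / (gold[1] - gold[0])
    return 2 * precision * recall / (precision + recall)


# Made-up spans: prediction covers 1-3 s, gold answer covers 2-4 s.
print(aos((1.0, 3.0), (2.0, 4.0)), ff1((1.0, 3.0), (2.0, 4.0)))
```

Under these definitions a partially overlapping prediction still earns AOS and FF1 credit, whereas Exact Match only counts predictions identical to the gold answer.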