We study a multi-cell frequency-selective fading uplink channel from K user terminals (UTs) to B base stations (BSs). The BSs, assumed to be oblivious of the applied encoding scheme, compress and forward their observations to a central station (CS) via capacity-limited backhaul links. The CS jointly decodes the messages from all UTs. Since we assume no prior channel state information, the channel must be estimated during its coherence time. Based on a lower bound on the ergodic mutual information, we determine the optimal fraction of the coherence time used for channel training. We then study how the backhaul capacity affects the optimal training length. Our analysis is based on large random matrix theory but is shown by simulations to be tight even for small system dimensions.