A multi-modal person representation contains information about what a person looks like and what a person sounds like. However, little is known about how children form these face-voice mappings. Here, we explored the possibility that two cognitive tools that guide word learning, a one-to-one mapping bias and fast mapping, also guide children's learning about faces and voices. We taught 4- and 5-year-olds mappings between three individual faces and voices, then presented them with new faces and voices. In Experiment 1, we found that children rapidly learned face-voice mappings from just a few exposures and, furthermore, spontaneously mapped novel faces to novel voices using a one-to-one mapping bias (i.e., that each face can produce only one voice). In Experiment 2, we found that children's face-voice representations are abstract, generalizing to novel tokens of a person. In Experiment 3, we found that children retained in memory the face-voice mappings that they had generated via inference (...