We consider a hierarchical two-layer model of natural signals in which both layers are learned from the data. Estimation is accomplished by score matching, a recently proposed estimation principle for energy-based models. If the first-layer outputs are squared and the second-layer weights are constrained to be non-negative, the model learns, from natural images, responses similar to those of complex cells in primary visual cortex. The second layer pools a small number of features with similar orientation and frequency but differing spatial phase. For speech data, we obtain analogous results. The model unifies previous extensions of independent component analysis (ICA), such as subspace and topographic models, and provides new evidence that localized, oriented, phase-invariant features reflect the statistical properties of natural image patches.
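
To illustrate why squaring the first-layer outputs and pooling them with non-negative weights can yield phase-invariant responses, the following is a minimal sketch and not the estimation procedure described here. Everything in it is an assumption made for illustration: the filters in `W` are hand-built 1-D sinusoids rather than learned features, the pooling weights in `V` are fixed by hand, no score matching is performed, and the names `two_layer_response`, `W`, and `V` are hypothetical.

```python
import numpy as np

def two_layer_response(x, W, V):
    """Square the first-layer outputs, then pool them with non-negative weights V."""
    if np.any(V < 0):
        raise ValueError("second-layer weights must be non-negative")
    s = W @ x            # first layer: linear filter outputs
    return V @ (s ** 2)  # second layer: weighted sum of squared outputs

# Hypothetical 1-D "patch": 64 samples spanning 4 full cycles of a sinusoid.
n, cycles = 64, 4
t = np.arange(n)
omega = 2 * np.pi * cycles / n

# Two hand-built filters with the same frequency, 90 degrees apart in phase.
W = np.stack([np.cos(omega * t), np.sin(omega * t)])
V = np.array([[1.0, 1.0]])  # one pooling unit with equal, non-negative weights

# Present the same sinusoid at several spatial phases.
phases = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)
signals = [np.cos(omega * t + p) for p in phases]

first_layer = np.array([W @ x for x in signals])
pooled = np.array([two_layer_response(x, W, V)[0] for x in signals])

print(first_layer[:, 0].round(2))  # single filter output: changes with phase
print(pooled.round(2))             # pooled output: the same for every phase
```

In this toy setting each individual filter output depends strongly on the phase of the input, while the pooled output does not, because the squared outputs of the two phase-shifted filters sum to a constant. The model considered here would learn both layers from data instead of fixing them by hand.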