In this paper, we present a VLSI architecture for the separable 2-D Discrete Wavelet Transform (DWT). Based on the 1-D DWT Recursive Pyramid Algorithm (RPA), a complete 2-D DWT output scheduling scheme is derived. The I/O between the DWT core and the memory that stores the intermediate results is simplified by a "circular coefficients arrangement". The concept of storing the "partial accumulation sum" of the convolution operation in the column direction is first proposed in this paper. For the computation of an N×N 2-D DWT with filter length L, our architecture takes N^2 clock cycles and requires 2NL words of memory, 4L multipliers, and 4L-2 adders. The number of multipliers and adders can be further reduced to 2L and 2L-1, respectively, by sharing them across the positive and negative clock edges. The architecture is suitable for VLSI implementation and for various real-time video/image applications.
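As a rough illustration of the resource figures quoted above, the following sketch evaluates them for a given image size and filter length. It only restates the closed-form counts from the abstract and does not model the architecture itself. The names `DwtCost`, `dwt_cost`, and `share_clock_edges` are hypothetical, and the values N = 512 and L = 4 are assumed purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DwtCost:
    """Resource counts for an N x N 2-D DWT, as stated in the abstract."""
    clock_cycles: int
    memory_words: int
    multipliers: int
    adders: int


def dwt_cost(n: int, filter_len: int, share_clock_edges: bool = False) -> DwtCost:
    """Evaluate the abstract's formulas for image size n and filter length L.

    share_clock_edges=True selects the reduced configuration in which
    multipliers and adders are shared across both clock edges.
    """
    if n <= 0 or filter_len <= 0:
        raise ValueError("n and filter_len must be positive")

    if share_clock_edges:
        multipliers = 2 * filter_len       # 2L
        adders = 2 * filter_len - 1        # 2L-1
    else:
        multipliers = 4 * filter_len       # 4L
        adders = 4 * filter_len - 2        # 4L-2

    return DwtCost(
        clock_cycles=n * n,                # N^2
        memory_words=2 * n * filter_len,   # 2NL
        multipliers=multipliers,
        adders=adders,
    )


if __name__ == "__main__":
    # Assumed example values: a 512 x 512 image and a 4-tap filter.
    print(dwt_cost(512, 4))
    print(dwt_cost(512, 4, share_clock_edges=True))
```

For the assumed 512×512 image and 4-tap filter, the formulas give 262144 clock cycles, 4096 words of memory, and 16 multipliers with 14 adders, or 8 multipliers with 7 adders when both clock edges are used.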