Magnetic Random Access Memory (MRAM) is considered a promising future memory technology because of its low leakage power, high density, and fast read speed. The heterogeneous integration enabled by 3D integration technology makes it cost-efficient to stack MRAM on top of conventional Chip Multiprocessors (CMPs). However, one disadvantage of MRAM is the long latency and high energy consumption of its write operations. In this paper, we first present a cache model for stacking an MRAM-based L2 cache on top of CMPs and compare it against its SRAM counterpart in terms of area, performance, and energy consumption. Our simulation results show that a naive implementation of MRAM stacking can harm chip performance and fail to take full advantage of MRAM because of the aforementioned long latency and high energy of writes. We then propose two architectural techniques, a read-preemptive write buffer and an SRAM-MRAM hybrid L2 cache, which can mitigate the penalty due to l...