We have fabricated a Chip Multiprocessor prototype code-named Merlot to proof our novel speculative multithreading architecture. On Merlot, multiple threads provide wider issue window beyond ordinal instruction level parallel (ILP) processors like superscalar or VLIW. With the architecture, we estimate 3.0 times speedup against single processing elements (PE) on speech recognition code and IDCT code with four PEs. Merlot integrates on-chip devices, PCI interface, and SDRAM interfaces. We have encountered design issues of chip multiprocessor and SoC design. We have successfully run parallelized mpeg3 decoder on the first silicon with several software workarounds, thanks to functional verification environment including system modeling on RTL. However, bugs found in later stage of design have required larger manpower or delay of project. In this paper, we also discuss the methodology to improve functional verification coverage, and expect the solution in formal approaches. Categories and...