In the era of multicores, many applications that demand substantial compute power and data crunching (often called throughput computing applications) can now run on desktop PCs. However, to achieve the best possible performance, applications must be written to exploit both parallelism and cache locality. In this paper, we propose one such approach for x86-based architectures. Our approach uses cache-oblivious techniques to divide a large problem into smaller subproblems, which are mapped to different cores or threads. We then use the compiler to exploit SIMD parallelism within each subproblem. Finally, we use autotuning to pick the best parameter values throughout the optimization process. We have implemented our approach with the Intel® Compiler and the newly developed Intel® Software Autotuning Tool. Experimental results collected on a dual-socket quad-core Nehalem show that our approach achieves an average speedup of almost 20x over the best serial cases for an i...
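To make the decomposition concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a cache-oblivious recursive subdivision, using matrix transpose as the example problem. The recursion halves the longer dimension until a subproblem fits a small tile; the simple loop nest in the base case is the kind of kernel a vectorizing compiler can turn into SIMD code, and the two independent recursive calls are natural units to map onto different cores or threads. The tile size `BASE` is an illustrative constant standing in for a parameter that autotuning would select.

```c
#include <stddef.h>

/* Illustrative cutoff; in practice this would be an autotuned parameter. */
#define BASE 16

/* Transpose the block a[r0..r1)[c0..c1) of an n-by-n row-major matrix
 * into b, by cache-oblivious recursive subdivision. */
static void transpose_rec(const double *a, double *b, size_t n,
                          size_t r0, size_t r1, size_t c0, size_t c1) {
    size_t rows = r1 - r0, cols = c1 - c0;
    if (rows <= BASE && cols <= BASE) {
        /* Base case: a small tile that fits in cache; this plain loop
         * nest is amenable to compiler auto-vectorization (SIMD). */
        for (size_t i = r0; i < r1; i++)
            for (size_t j = c0; j < c1; j++)
                b[j * n + i] = a[i * n + j];
    } else if (rows >= cols) {
        /* Split the longer axis; the halves are independent subproblems
         * that could run on different cores (e.g., as OpenMP tasks). */
        size_t rm = r0 + rows / 2;
        transpose_rec(a, b, n, r0, rm, c0, c1);
        transpose_rec(a, b, n, rm, r1, c0, c1);
    } else {
        size_t cm = c0 + cols / 2;
        transpose_rec(a, b, n, r0, r1, c0, cm);
        transpose_rec(a, b, n, r0, r1, cm, c1);
    }
}

void transpose(const double *a, double *b, size_t n) {
    transpose_rec(a, b, n, 0, n, 0, n);
}
```

Because the recursion adapts to any cache size without knowing it, the same code achieves locality at every level of the memory hierarchy; only the base-case cutoff (and, in a parallel version, the task-spawning threshold) remains as a knob for the autotuner.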