One way to exploit Thread Level Parallelism (TLP) is to use architectures that implement novel multithreaded execution models, like Scheduled DataFlow (SDF). This latter model promises an elegant decoupled and non-blocking execution of threads. Here we extend that model in order to be used in future scalable CMP systems where wire delay imposes to partition the design. In this paper we describe our approach and experiment with different distributed schedulers, different number of clusters and processors per cluster to show good scalability of our architecture. We describe our approach and present initial results on system scalability and performance. We suggest design choices to improve the scalability of the basic design.