This paper proposes a DiffServ-over-MPLS Traffic Engineering (TE) architecture and describes the implementation of its functional blocks on the Intel IXP2400 network processor using the Intel IXA SDK 4.1 framework. We propose a fast and scalable 6-tuple range-match classifier, which allows traffic policing procedures to operate at the per-flow level, and a scalable low-jitter Deficit Round Robin (DRR) scheduler that can provide bandwidth guarantees at the LSP level. The proposed DiffServ-over-MPLS TE functional blocks have been implemented on the Intel IXDP2400 platform for up to 4,096 flows mapped to L-LSPs, and can handle an aggregate traffic rate of 2.4 Gbps.
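
For readers unfamiliar with the scheduling discipline named above, the following is a minimal sketch of textbook Deficit Round Robin, in which each queue stands for one LSP and its quantum (in bytes) sets its share of the link. It is an illustration only: it is not the IXP2400 microengine implementation, it does not reproduce the low-jitter modifications proposed here, and the class and method names (`DrrScheduler`, `enqueue`, `serve_next`, `drain`) are hypothetical.

```python
from collections import deque


class DrrScheduler:
    """Textbook Deficit Round Robin over a fixed set of queues (hypothetical names)."""

    def __init__(self, quanta):
        # quanta maps a queue id (e.g. an LSP id) to its quantum in bytes (> 0).
        self.quanta = dict(quanta)
        self.queues = {qid: deque() for qid in self.quanta}
        self.deficit = {qid: 0 for qid in self.quanta}
        self.active = deque()  # round-robin list of non-empty queues

    def enqueue(self, qid, size):
        """Append a packet of `size` bytes to queue `qid`."""
        if not self.queues[qid]:
            # The queue becomes backlogged: join the tail of the active list.
            self.active.append(qid)
        self.queues[qid].append(size)

    def serve_next(self):
        """Visit the queue at the head of the active list; return the packets sent."""
        if not self.active:
            return []
        qid = self.active.popleft()
        queue = self.queues[qid]
        self.deficit[qid] += self.quanta[qid]
        sent = []
        # Send head-of-line packets while the accumulated deficit covers them.
        while queue and queue[0] <= self.deficit[qid]:
            size = queue.popleft()
            self.deficit[qid] -= size
            sent.append((qid, size))
        if queue:
            # Still backlogged: keep the leftover deficit and rejoin the tail.
            self.active.append(qid)
        else:
            # An idle queue may not bank credit for later use.
            self.deficit[qid] = 0
        return sent

    def drain(self):
        """Serve until every queue is empty; return the full transmission order."""
        order = []
        while self.active:
            order.extend(self.serve_next())
        return order


if __name__ == "__main__":
    sched = DrrScheduler({"lsp_a": 500, "lsp_b": 1000})
    for size in (400, 400, 400):
        sched.enqueue("lsp_a", size)
    for size in (900, 900):
        sched.enqueue("lsp_b", size)
    print(sched.drain())
```

Because a queue's unused deficit carries over only while it stays backlogged, long-run throughput per queue is proportional to its quantum, which is the property that makes bandwidth guarantees per LSP possible. Each visit costs time proportional to the number of packets sent, independent of the number of queues.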