The inability to read text in the environment has a major impact on the quality of life of visually impaired people. One of the most anticipated assistive devices is a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as synthesized speech or braille. Developing such a device requires not only text detection but also text tracking across video frames: homogeneous text regions must be grouped so that the same text does not trigger multiple, redundant speech syntheses or braille conversions. We have developed a prototype system equipped with a head-mounted video camera. Text regions are extracted from the video frames using a revised DCT feature, and particle filtering is employed for fast and robust text tracking. In a test on 1,000 video frames of a hallway containing eight signboards, the system reduced the number of candidate text images to 0.98%.
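
To make the detection step concrete, the following is a minimal sketch of a DCT block-texture test in Python using OpenCV. The paper's "revised DCT feature" is not specified here, so the plain AC-energy score, the 8x8 block size, and the threshold are all illustrative assumptions rather than the actual method.

```python
import numpy as np
import cv2

def dct_text_candidates(gray, block=8, thresh=4.0):
    """Flag blocks whose high-frequency DCT energy suggests text-like
    texture. Block size and threshold are illustrative, not tuned values."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block  # crop to a whole number of blocks
    mask = np.zeros((h // block, w // block), dtype=bool)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            patch = gray[by:by + block, bx:bx + block].astype(np.float32)
            coeffs = cv2.dct(patch)
            # Drop the DC term; sum the AC magnitudes as a texture score.
            score = np.abs(coeffs).sum() - abs(coeffs[0, 0])
            mask[by // block, bx // block] = score > thresh * block * block
    return mask
```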
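
The tracking step could likewise be sketched as a bootstrap particle filter over text-region centers. The Gaussian random-walk motion model and the likelihood hook below are assumptions, since the abstract does not detail the state space or measurement model; the likelihood would plausibly be derived from the DCT feature response at each hypothesized position.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, motion_std=5.0):
    """One predict-weight-resample cycle over (x, y) text-region centers.
    `likelihood(p)` is any measurement model, e.g. a texture score at p."""
    n = len(particles)
    # Predict: diffuse particles with a random-walk motion model (assumed).
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Weight: score each hypothesis against the current frame.
    weights = np.array([likelihood(p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    # Resample: multinomial resampling proportional to weight.
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

# Usage: track one text region with 200 particles seeded at a detection.
particles = np.tile([320.0, 240.0], (200, 1))
weights = np.full(200, 1.0 / 200)
# for frame in video:
#     particles, weights = particle_filter_step(
#         particles, weights, lambda p: score_at(frame, p))
#     estimate = particles.mean(axis=0)
```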