In recent years, the technological advances in mapping genes have made it increasingly easy to store and use a wide variety of biological data. Such data are usually in the form of very long strings for which it is difficult to determine the most relevant features for a classification task. For example, a typical DNA string may be millions of characters long, and there may be thousands of such strings in a database. In many cases, the classification behavior of the data may be hidden in the compositional behavior of certain segments of the string which cannot be easily determined a priori. Another problem which complicates the classification task is that in some cases the classification behavior is reflected in global behavior of the string, whereas in others it is reflected in local patterns. Given the enormous variation in the behavior of the strings over different data sets, it is useful to develop an approach which is sensitive to both the global and local behavior of the strings for the pur...
Charu C. Aggarwal