The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle t action yt results in perception xt and reward rt, where all quantities in general may depend on the complete history. The perception xt and reward rt are sampled from the (reactive) environmental probability distribution