Clustering and retrieval of web pages predominantly rely on analyzing either the content of individual web pages or the link structure between them. Some literature also suggests using the structure of web pages, notably the structure of their DOM trees. However, little work considers the visual structure of web pages for clustering. In this paper, we (i) motivate visual structure-based web page clustering and retrieval for a number of applications, (ii) formalize a visual box model-based representation of web pages that supports new metrics of visual similarity, and (iii) report on our ongoing work on evaluating human perception of the visual similarity of web pages and applying the learned visual similarity features to web page clustering and retrieval.

Categories and Subject Descriptors: H.3.3: Information Search and Retrieval

General Terms: Design, Theory, Human Factors