Any attempt at automated software analysis or modification must be preceded by a comprehension step, i.e. parsing. This task, while often considered straightforward, can be very challenging, depending on the source code in question. The files that make up web applications are an example of such difficult-to-parse artifacts, for two reasons. First, these files routinely contain several programming languages at once, sometimes with widely varying syntaxes, intermingled at the statement level. Second, the code routinely contains syntax errors. Understanding such files calls for a robust parser that can handle multiple languages simultaneously. An approach to creating such a parser, based on the concept of island grammars, is presented here. Island grammars have been used in the past for robust parsing and lightweight analysis of software. Some of their features make them uniquely suited to parsing multiple languages simultaneously.
Nikita Synytskyy, James R. Cordy, Thomas R. Dean