Parsing VBScript

[2003.02.12] - This page is still in the form of rudimentary notes, and may remain so indefinitely.

Note that when talking about the language generically here, I will quite often speak of VB.  Although VB and VBScript are very different in role, they are very similar syntactically (which has significant bearing on this discussion). If you care about this sort of thing, you can jump to my page on "What's a Language?".

This page is primarily theoretical material; to jump to links to code, go to my Script Parsing Code page.

Figuring out how to parse code is a worthwhile activity, but is not for everyone.  Developing a parser for an idiosyncratic language is somewhat like mountain-climbing - a possibly grueling task with regular setbacks which both tests your skills and enhances them.  When you get to the top of this particular mountain, however, there is an entire country of new capabilities on the other side.

If you are most interested in the capabilities - code analysis, testing, migration, transformation, and translation - I suggest you look at products from a company such as Semantic Designs.  They handle the mountain-climbing for you, and even more importantly from a functional standpoint, their tools are designed as guides through the country on the other side.

A Rationale of Sorts

Why would anyone want to parse VBScript?

After all, the script engine parses the code on its own when it compiles it prior to execution, and you can even use the Script Control for parsing of a sort.

Reasons vary, but one to take note of is that ALL other major scripting languages have tools for parsing their own code.  In fact, there are entire families of parsers and tokenizers for them.

Some of the need for that is mitigated by the fact that nobody writes their own compiler for VBScript, but other needs - varying from prettification to risk analysis to codebase assessment - still exist.

So why hasn't anyone written a VBScript parser?

There are a few commercial parsers for VB which work for this, but they are rare indeed.  I can identify the following factors that probably contribute to this.

Limited Relative Need

VB/VBScript's human readability tends to reduce the need for special parsers somewhat compared to other languages; also, huge development projects (particularly cross-platform ones) are not as common with script or VB as with the traditional C and Java families.  The needs of small developers are there, but the large-scale projects which drive development of such tools are less common in the VB family world.

Knowledge Constraints

There are limitations on knowledge which affect this.

First, VB is historically proprietary, and has never had formal grammars published by the OEM (until VB.NET, that is).  This makes it a little time-consuming to even start the process.

Second, the set of people who are familiar with parsers/compilers and the set of people who use VB heavily have a very small intersection.  Knowledge of lex/yacc/bison tends to come in CS programs as one learns to program in C; and ever after, there is a bent to program IN C as opposed to using VB.  The daily VB users, though, even the extremely literate ones, have usually not had that exposure.  This means that such a project is most likely to be started by a C++ programmer - but that same individual will have to fight all sorts of instincts about language grammar while trying to build a parser.

Issues in Parsing VBScript

Instead of looking at the big picture of formal grammars, let's start with getting a handle on some of the smaller tasks in manipulating VBScript code. Following are some rough notes about the process I used for this. 

Small Tasks (A Justification)

Well, one value is the ability to colorize code for display in HTML.  This requires some capabilities for at least tokenizing the code.

Associated with that, being able to format code would be nice; you could take free-form text and correctly set indentation based on code structures.

We could also support preprocessing.  This would allow features similar to JScript's @-commands, allowing conditional compilation or on-the-fly inclusion.

Preprocessing would also allow abbreviated/custom edit-time script composition.  For example, if one is tokenizing script, a statement like

a++

could easily be expanded to

a = a + 1

Possibly trivial in appearance, but in a large source code base a handful of syntax customizations could be extremely useful.
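
As a rough sketch of that kind of expansion (the function name ExpandIncrement is made up for illustration, and this assumes the increment sits alone on its own line and that the script runs under Windows Script Host so WScript.Echo is available):

```vbscript
' Hypothetical helper: rewrites a standalone "name++" line as "name = name + 1".
' It does not look inside strings or comments; it only handles whole lines.
Function ExpandIncrement(sCode)
  Dim rx
  Set rx = New RegExp
  rx.Global = True
  rx.Multiline = True
  ' ^ and $ anchor to line boundaries; [ \t]* keeps the match on a single line.
  ' $1 is the leading indentation, $2 is the variable name.
  rx.Pattern = "^([ \t]*)([A-Za-z]\w*)\+\+[ \t]*$"
  ExpandIncrement = rx.Replace(sCode, "$1$2 = $2 + 1")
End Function

WScript.Echo ExpandIncrement("a++")   ' a = a + 1
```

Anything more ambitious (for example an increment in the middle of a larger statement) would need real tokenizing rather than a single pattern.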

Another thing it would be nice to do is be able to "mine" scripts - automatically extract functions and subroutines.  If we can correctly parse these structures, we can process a huge script codebase and attempt to consolidate it into a library.

Building on that, we could theoretically reduce procedures to equivalency.  If the non-commented text is compared, and the non-keyword terms are turned into arbitrary tokens, we could look for identical constructs - or ones which are close.  This could suggest ways to further simplify code structures and significantly reduce bloat.
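
A minimal sketch of the "arbitrary tokens" idea follows.  Everything in it is an assumption for illustration: the name NormalizeIdentifiers, the id1/id2 placeholder scheme, and the tiny keyword list.  It also assumes comments and string literals have already been removed, since it would happily rename words inside them.

```vbscript
' Hypothetical helper: replaces every non-keyword word with a placeholder
' (id1, id2, ...) numbered in order of first appearance.
' keywords is a space-delimited list of words to leave alone.
Function NormalizeIdentifiers(sCode, keywords)
  Dim rx, m, seen, out, pos, word
  Set seen = CreateObject("Scripting.Dictionary")
  seen.CompareMode = vbTextCompare        ' VBScript names are case-insensitive
  Set rx = New RegExp
  rx.Global = True
  rx.Pattern = "[A-Za-z]\w*"
  out = ""
  pos = 1                                 ' 1-based position of text not yet copied
  For Each m In rx.Execute(sCode)
    ' copy the text between the previous word and this one unchanged
    out = out & Mid(sCode, pos, m.FirstIndex + 1 - pos)
    word = m.Value
    If InStr(1, " " & keywords & " ", " " & word & " ", vbTextCompare) > 0 Then
      out = out & LCase(word)             ' keyword: keep it, normalize case
    Else
      If Not seen.Exists(word) Then seen.Add word, "id" & (seen.Count + 1)
      out = out & seen(word)
    End If
    pos = m.FirstIndex + 1 + m.Length
  Next
  NormalizeIdentifiers = out & Mid(sCode, pos)
End Function

Dim kw
kw = "if then else end dim function sub"
WScript.Echo NormalizeIdentifiers("total = total + price", kw)
WScript.Echo NormalizeIdentifiers("sum = sum + cost", kw)
```

Both calls should produce the same text (id1 = id1 + id2), which is the signal that the two statements are structurally identical.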

A further development would be optimization - making suggestions based on structures which are seen in code. For example, a parser could look through all of the loop structures and identify those which simply do conditional tests with no exit, identifying potential CPU hogs/endless loops.

The Problems

The most significant problem is a lack of prior work in this area; but there are specifics which need to be handled correctly.

Commenting

The largest single issue here is inconsistency between VB and VBScript.  VB allows multiline comments - a statement starting with ' as its first non-whitespace token and ending with a _ causes the next line to be considered as a comment.  VBScript does NOT do this; the following two lines:

    ' this is _
    a Comment

will throw an error.

Strings

The syntax for this is workable, but very different from the standard parsed languages; " is its own escape character.
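
To make that concrete, here is a small sketch.  The pattern is my own assumption about how one might recognize a literal (it is not taken from any official grammar), and WScript.Echo assumes Windows Script Host.

```vbscript
Dim s, rx, m, source

' A doubled quote inside a literal stands for one quote character.
s = "He said ""hi"""
WScript.Echo s                      ' He said "hi"

' Pattern text:  "(?:[^"\r\n]|"")*"
' i.e. an opening quote, then any run of non-quote characters or
' doubled quotes, then the closing quote.
Set rx = New RegExp
rx.Global = True
rx.Pattern = """(?:[^""\r\n]|"""")*"""

' The source text being scanned is:  x = "a ""b"" c" & y
source = "x = ""a """"b"""" c"" & y"
For Each m In rx.Execute(source)
  WScript.Echo m.Value              ' "a ""b"" c"
Next
```

Note that the pattern knows nothing about comments, so a quote character inside a comment would confuse it; a real tokenizer has to deal with both at once.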

Statement demarcation

Statements are typically delimited with newlines. "Typically".  The actual usage produces a small array of issues.

  1. Since blank lines are also used for whitespace, a parser must know to dump any blank lines (which includes lines with tab and space characters in them).
  2. The '_' works as a continuation character for non-comment lines; these need to be joined to make a "complete" statement.
  3. Multiple statements can be concatenated onto one line with a : marker.  We need to split these up into statements.
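
Item 2 can be roughed out with a single pattern.  This is only a sketch: JoinContinuations is a hypothetical name, and the pattern deliberately ignores two problems, namely that a comment line ending in " _" is NOT continued in VBScript, and that splitting on ':' (item 3) cannot be done safely until strings have been identified.

```vbscript
' Hypothetical helper: joins "statement _ <newline> rest" into one line.
Function JoinContinuations(sCode)
  Dim rx
  Set rx = New RegExp
  rx.Global = True
  ' whitespace, underscore, optional trailing blanks, one line ending,
  ' then the indentation of the continued line
  rx.Pattern = "[ \t]+_[ \t]*(?:\r\n|\r|\n)[ \t]*"
  JoinContinuations = rx.Replace(sCode, " ")
End Function

Dim src
src = "Const A = 1, _" & vbCrLf & "  B = 2" & vbCrLf & "x = A"
WScript.Echo JoinContinuations(src)
```

The first two lines of src should come back as the single statement "Const A = 1, B = 2", with the third line untouched.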

Multipart Keywords

This is a big problem for a lexing tool. If script is tokenized, the statement

    On Error Resume Next

consists of 4 keywords which need to be checked for validity.  In reality though, it is a single instruction: "Turn on error control".  If...Then has similar problems.
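
One possible way around this, sketched below under my own assumptions, is to fuse known multiword phrases into single tokens before tokenizing.  The name FuseKeywords, the underscore-joined token spelling, and the short phrase list are all invented for illustration, and the code does not skip strings or comments.

```vbscript
' Hypothetical helper: turns known multiword phrases into single tokens,
' e.g. "On Error Resume Next" becomes "On_Error_Resume_Next".
Function FuseKeywords(sCode)
  Dim rx, phrase, result
  Set rx = New RegExp
  rx.Global = True
  rx.IgnoreCase = True
  result = sCode                     ' work on a copy; arguments are ByRef by default
  For Each phrase In Array("On Error Resume Next", "On Error GoTo 0", _
                           "End If", "End Sub", "End Function")
    ' allow any run of spaces/tabs between the words of the phrase
    rx.Pattern = "\b" & Replace(phrase, " ", "[ \t]+") & "\b"
    result = rx.Replace(result, Replace(phrase, " ", "_"))
  Next
  FuseKeywords = result
End Function

WScript.Echo FuseKeywords("on error   resume next")   ' On_Error_Resume_Next
```

A lexer that runs after this step sees one token per instruction, at the cost of having to maintain the phrase list by hand.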

Developing a Solution

I'm really bad with big-picture, all-at-once solutions to problems; I always prefer "small" tools.  As a result, I step through the different tasks and try to componentize them as I go.

Getting the code: one big chunk

The first issue is really external to script parsing: getting the data fed in. You're best off doing this in one step; having complete structures to work with makes manipulation easier, and even 10,000 lines of statements averaging 70 characters weigh in at less than 700 KiB of data.  I use a ReadFile function wrapper to grab an entire script file:

  Function ReadFile(FilePath)
    'Given the path to a file, will return entire contents
    ' works with either ANSI or Unicode
    Dim FSO, CurrentFile
    Const ForReading = 1, _
      TristateUseDefault = -2, _
      DoNotCreateFile = False
    Set FSO = createobject("Scripting.FileSystemObject")
    If FSO.FileExists(FilePath) Then
      If FSO.GetFile(FilePath).Size>0 Then
        Set CurrentFile = FSO.OpenTextFile(FilePath, _
          ForReading, _
          DoNotCreateFile, _
          TristateUseDefault)
        ReadFile = CurrentFile.ReadAll: CurrentFile.Close
      End If
    End If
  End Function

Given that we have a bunch of code in a variable, what do we do with it?

The next step is to look at ways of parsing.  Let's look at the possible transformations we want to apply.  There are several obvious ones, but one thing that should have no effect on data content is to scrub the lines: remove leading/trailing whitespace, merge blank lines, and make line markers uniform.

Chomping out line-boundary whitespace

That sounds messy, but reduces to a very simple regular expression; to see it, you have to understand a few of the special characters usable in regular expressions.  Certain symbols such as '$' are special characters; '\' is an "extra-special" symbol, since it escapes the character following it.

Let's start with '$'.  The $ is technically an anchor: it matches the position immediately before the end of a string, and if a regex is multiline, it also matches the position prior to a carriage return (\r) or linefeed (\n). OK, I sneaked \r and \n into this discussion, but you don't need them - just remember that \r matches vbCr and \n matches vbLf (and by the way, vbCrLf or vbNewLine is simply \r\n).

Let's look at a combination '\s+'.  The '\' escapes the s into a special token, meaning "any whitespace character including space, tab, form-feed, etc".  The '+' is a quantifier - it means "match the thing before this 1 or more times".  And one other piece of information you should have about regular expressions in script is that they are greedy - they match the largest string possible.  Thus, in a string consisting of  some text followed by a space, a carriage return, and a tab, the prior expression will match the three whitespace symbols together.
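
The greedy behavior is easy to check with a few lines (a sketch assuming Windows Script Host for WScript.Echo; the test string is "text" followed by a space, a carriage return, and a tab):

```vbscript
Dim rx, matches
Set rx = New RegExp
rx.Pattern = "\s+"                  ' Global is False by default: first match only

Set matches = rx.Execute("text" & " " & vbCr & vbTab)
WScript.Echo matches.Count          ' 1
WScript.Echo matches(0).FirstIndex  ' 4 (zero-based, right after "text")
WScript.Echo matches(0).Length      ' 3 (space + vbCr + vbTab in one match)
```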

We can use '\s*' to match whitespace 0 or more times - a risky thing to do by itself because 0 is everywhere, but if we combine it with an anchor:

'\s*$\s*'

We get a regular expression that matches the position before a line ending, AND all the whitespace before and after the line ending.

Thus, if we replace every such match with a simple vbCr, we have almost completely cleaned our code of whitespace.

Since we will be doing a LOT of regular expression swaps, it makes it convenient if we have a drop-in function that just says "in string sData, replace all occurrences of pattern oldPatrn with string newText"; it's pretty easy to put that together from the WSH documentation:

  Function RxReplace(sData, oldPatrn, newText)
    Dim rx
    Set rx = New RegExp
    rx.IgnoreCase = True
    rx.Global = True
    rx.Multiline = True
    rx.Pattern = oldPatrn   ' Set pattern.
    RxReplace = rx.Replace(sData, newText)
  End Function

This isn't perfectly efficient, but it's a start.   Given the above two functions, here's a simple script that does our cleanup for us.

sData = ReadFile("C:\data\Projects\_test\astro\earth.vbs")

' kill in-line leading/trailing WS and
' change line separators to vbCr
sData = RxReplace(sData, "\s*$\s*", vbCr)

There are varying patterns we can use depending on our goal, and we have one absolute requirement at this stage: we cannot afford to do anything to internal whitespace in statements, or we risk damaging any strings we have.  More about this later (see "What's in a String?" below).  By anchoring ourselves to line starts and terminations, we handle that problem easily.
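
To see what the pattern does and does not touch, here is a small self-contained check on an in-memory sample rather than a file.  The sample text and the "<CR>" marker are just illustration, and WScript.Echo assumes Windows Script Host.

```vbscript
Dim rx, sample, cleaned

' Two statements with messy edges, a blank line between them,
' and a string literal containing three internal spaces.
sample = "  a = 1  " & vbCrLf & vbCrLf & _
         vbTab & "s = ""x   y""" & vbCrLf & "b = 2"

Set rx = New RegExp
rx.Global = True
rx.Multiline = True
rx.Pattern = "\s*$\s*"
cleaned = rx.Replace(sample, vbCr)

' Show each vbCr as a visible marker so the line structure can be inspected
WScript.Echo Replace(cleaned, vbCr, "<CR>")
```

The spaces inside "x   y" should survive, because no line ending sits among them for $ to anchor on.  The leading spaces on the very first line should survive too, since there is no line ending in front of them; that is one reason the cleanup is only "almost" complete.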

What's in a String?

Replacing Anchors

Removing excess whitespace

Ignoring what happens inside strings for the moment, we don't need any repeated whitespace