Monday, September 26, 2011

Incremental Linting with FlexeLint

Tools like FlexeLint are great for finding and preventing bugs and the bad practices that are often associated with them and little argument can be made for not using a good static analysis tool from the inception of a new project.  Many times however, it is desired to start reaping the benefits of such a tool with an existing code base.  The initial scan of even a carefully configured FlexeLint installation can generate a daunting number of messages that simply are not practical to address before continuing development.  Because of this, such tools are often not used on established code bases.  Being able to establish a baseline snapshot of a code base and receive diagnostics only for new or changed code would make using FlexeLint significantly more manageable for such projects and allow previously existing issues to be addressed in a more practical timeline.  Such a method can also be used to easily track issues introduced between different versions.

This post discusses a method to run FlexeLint so as to receive messages only on code that differs from a specific baseline.  The examples use FlexeLint but the ideas are relatively simple and can be easily modified to work with other tools as well.

Let's start with a very simple example of what we want to accomplish.  Below is a function that contains several issues:

#include <stdlib.h>
#include <string.h>

char * dup_string(char *s) {
    char *copy = malloc(strlen(s+1));
    strcpy(copy, s);
    return copy;
}

Here are the messages produced by my FlexeLint configuration:
FlexeLint for C/C++ (Unix) Vers. 9.00h, Copyright Gimpel Software 1985-2011

--- Module:   dupstring.c (C)
dupstring.c: 5: Warning 432: Suspicious argument to malloc
dupstring.c: 5: Warning 613: Possible use of null pointer 's' in left argument to operator 'ptr+int'
dupstring.c: 6: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 1 [Reference: file dupstring.c: line 5]
dupstring.c: 5: Info 831: Reference cited in prior message
dupstring.c: 6: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 2
dupstring.c: 8: Info 818: Pointer parameter 's' (line 4) could be declared as pointing to const - C Coding Standards Rule 6.2
dupstring.c: 4: Info 830: Location cited in prior message
dupstring.c: 8: Note 953: Variable 'copy' (line 5) could be declared as const
dupstring.c: 5: Info 830: Location cited in prior message

--- Global Wrap-up

: 0: Info 714: Symbol 'dup_string(char *)' (line 4, file dupstring.c) not referenced
dupstring.c: 4: Info 830: Location cited in prior message

Now suppose we make some changes:
#include <stdlib.h>
#include <string.h>

char * dup_string(char *s) {
    char *copy = malloc(strlen(s+1));
    int i;
    strcpy(copy, s + 1);
    i++;
    return copy;
    return i;
}

The goal is to get output that corresponds to only the new and changed lines:
FlexeLint for C/C++ (Unix) Vers. 9.00h, Copyright Gimpel Software 1985-2011

--- Module:   dupstring.c (C)
dupstring.c: 7: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 1 [Reference: file new/dupstring.c: line 5]
dupstring.c: 5: Info 831: Reference cited in prior message
dupstring.c: 7: Warning 613: Possible use of null pointer 's' in left argument to operator 'ptr+int'
dupstring.c: 8: Warning 530: Symbol 'i' (line 6) not initialized
dupstring.c: 6: Info 830: Location cited in prior message
dupstring.c: 9: Warning 438: Last value assigned to variable 'i' (defined at line 6) not used
dupstring.c: 6: Info 830: Location cited in prior message
dupstring.c: 10: Warning 527: Unreachable code at token 'return'
dupstring.c: 10: Error 64: Type mismatch (return) (char * = int)

Note that although line 5 has not changed, line 7, for which this message is referencing, has changed so this message should be displayed as well.
There are several potential methods for achieving the desired effect which are considered below.

Line Filter method

Perhaps the most naive method is to run the original codebase through lint, obtain the line numbers and message numbers produced, and then filter the same messages that reference the same line numbers in subsequent lintings on updated versions of the codebase on a per-source file basis.  This quickly falls apart though as line numbers quickly shift when new code is added or removed.  Adding a line to the top of a file would cause the line numbers of the rest of the file to change and all of the original messages would be generated again.  This is obviously not a feasible method and won't be discussed further.


Annotation method

Another idea is to run the original codebase through lint, extract the messages produced, and append a single-line suppression comment to each line that produces a message.

The method has the following drawbacks:
  • It may not be acceptable to make a potentially significant amount of annotations to the code base to suppress messages.
  • If the linting configuration changes, updating the code base to accommodate the changes can be messy.
  • Single-line suppressions do not work for all messages.  For example, they are ineffective for suppressing wrap-up messages.

The advantages to this method are:
  • The messages suppressed become documented within the code and can easily be examined and addressed as time permits.
  • As long as the annotations are performed when the linting configuration for the project has been finalized, nothing needs to be done during subsequent lintings to maintain the suppressions.
  • Implementing such a method is very straight-forward and can be accomplished with a simple script.

This may be a reasonable method for smaller codebases with fewer messages to worry about but is probably not a good general solution.


Contextual method

The next method to consider is the contextual method.  This scheme is identical to the Line Filter method but instead of using line numbers, the lines of code themselves are used.  For each message produced by lint, the source line and message number are saved.  The results of future lintings are filtered based on the saved results and messages that were originally produced for identical lines of code are withheld.

The advantages of this method are:
  • No changes to the code base is required
  • The baseline can be easily changed
  • The saved results can easily be recreated based on different linting configurations

The main disadvantage is that newly added source code lines that are identical to lines with previously suppressed messages could wind up being suppressed, such as with copy-pasted code.  This is particularly problematic for lines containing little or no useful context which could cause a significant suppression of new messages.

Some other products use a method similar to this where a hash of the message details (typically some combination of file name, function name, symbol name, and message number) are used to create an identifier each issue which can be used to provide baseline suppressions.   The hash method can result in collisions where similar issues with the same combination of information produce the same hash value. Including source code into the hash can help improve the uniqueness of the hash but usually requires more than one line of source code to be useful and this is not easily accomplished with FlexeLint.  There is, however, significant potential for developing a solution around this method.


Diff method

The last method considered here is using a diff between the current version of the codebase and a baseline version to build a list of the files and line numbers that have been added or changed.  A filter can then be applied to the output of linting the current version to only emit the messages that relate to new or changed code.

The advantages to this method are:
  • No change to the codebase are required
  • The baseline can be easily changed
  • Unlike the other methods, the baseline code base does not need to be linted

The disadvantages are:
  • A diff against the established baseline must be performed for every updated version of the codebase that is to be linted
  • Diff algorithms aren't perfect and significant changes to files can cause large chunks of relocated code to be incorrectly marked as changed

Since the shortcomings of this method are probably more tolerable than the other methods in a general case and it is not too complex to implement, I have selected to implement a solution using this method.


Implementation

The basic idea is pretty simple:
  • Run a diff against an established baseline and a newer version, extract the location of the new/changed code
  • Run lint over the new code base
  • Filter the lint output to only display messages related to the new/changed code

There are a few complications to consider though:
  • Some messages produce 830 and 831 messages, these need to be handled properly and be grouped together with their parent message.
  • When grouping messages with their 830/831 messages, the locations of each of the references should also be considered since the base message could specify a location that hasn't changed but the reference could specify a location that has.
  • Verbosity messages and specific walk announcements, etc. can be interspersed throughout messages and should be handled properly.

Because of this, something more intelligent than a simple line-by-line suppression mechanism is required.

The strategy used here is to consider all lines from the end of a previous message until the new message (including any following 830/831 messages) as a single block which is displayed if any of the location information corresponds to new code. Regular expressions are used to specify patterns to match verbose messages (such as --- Module:, /// Start of Pass 2 ///, etc) which are always displayed. Below is the python script (also available on github) that does this.

#!/usr/bin/python

import re

def process_diff_list(fh):
    '''Obtain the filenames and locations that represent new/changed code from
    the output of the diff command and return a dict of sets describing them.'''

    diff_pattern = re.compile(r'^[,\d]+[ac](?P<start>\d+)(,(?P<end>\d+))?')
    file_pattern = re.compile(r'^diff .* (?P<file>\S+)$')
    cur_file = None
    diff_list = {}

    for line in fh:
        m = file_pattern.search(line)
        if m:
            cur_file = m.group('file')
            continue

        m = diff_pattern.search(line)
        if not m:
            continue

        if cur_file not in diff_list:
            diff_list[cur_file] = set()

        start = int(m.group('start'))
        end = start
        if m.group('end'):
            end = int(m.group('end'))

        for i in xrange(start, end + 1):
            diff_list[cur_file].add(i)

    return diff_list


def process_lint_output(fh, diff_list):
    '''Process the output from linting, filter messages not related to
    new/changed code, print the ones that are.'''

    messages = get_messages(fh)
    for locations, message_list in messages:
        if not locations:
            print ''.join(message_list),
            continue

        for filename, lineno in locations:
            if filename in diff_list and lineno in diff_list[filename]:
                print ''.join(message_list),
                break


def get_messages(fh):
    '''Process the next message group and return it along with all of the
    file locations it references'''

    # Pattern for lint messages, only need filename, line, and message number
    message_pattern = re.compile(r'^(?P<filename>[^:]*):\s*(?P<lineno>\d+):'
            r'\s*(Warning|Error|Info|Note)\s*(?P<msgno>\d+)')

    # The pattern that specifies verbose messages that are not specific
    # to a particular message and should never be suppressed
    verbose_pattern = re.compile(r'^(---|///|===|\s{4})')

    # List of tuples representing locations related to the current message
    locations = []

    # List of lines representing the current message
    buffer = []

    # Flag indicating if the contents of the current buffer contains a
    # recorgnized lint message with file information
    in_message = False

    for line in fh:
        # If line is a verbose line, emit buffer and line
        m = verbose_pattern.search(line)
        if m:
            if buffer:
                yield locations, buffer
            yield [], line
            locations = []
            buffer = []
            in_message = False
            continue

        else:
            m = message_pattern.search(line)
            if m:
                filename = m.group('filename')
                lineno = int(m.group('lineno'))
                msgno = m.group('msgno')

                if in_message and msgno not in ('830', '831'):
                    # A new message has been encountered, emit current
                    # message and queue this one
                    yield locations, buffer
                    locations = []
                    buffer = []
                    locations.append((filename, lineno))
                    buffer.append(line)
                    continue
                else:
                    # New message or 830/831 to be grouped with current one
                    in_message = True
                    locations.append((filename, lineno))
                    buffer.append(line)
                    continue

            else:
                if in_message:
                    # Message that doesn't match pattern starts a new message
                    # group, emit current message and queue new one
                    yield locations, buffer
                    locations = []
                    buffer = []
                    buffer.append(line)
                    in_message = False
                    continue
                else:
                    # If we haven't seen a recognized message yet, just add
                    # this line to the buffer
                    buffer.append(line)
                    continue

    # Don't forget to send any leftover data
    yield locations, buffer

 
if __name__ == '__main__':
    import sys
    import subprocess

    if len(sys.argv) < 4:
        print "Usage: lint_diff.py orig_dir new_dir [lint options] files"
        sys.exit(1)

    orig_dir = sys.argv[1]
    new_dir = sys.argv[2]
    lint_args = sys.argv[3:]

    diff_fh = subprocess.Popen(('diff', '-N', orig_dir, new_dir), stdout=subprocess.PIPE)
    diff_list = process_diff_list(diff_fh.stdout)

    lint_fh = subprocess.Popen((('lint',) + tuple(lint_args)), stdout=subprocess.PIPE)
    process_lint_output(lint_fh.stdout, diff_list)

I use the "normal" mode for diff because it is easier to extract the new and changed line information, the -N option is needed to properly handle new files.  The process_diff_list function handles the diff output and returns a dict of filenames each containing a dict of line numbers representing the new/changed code.

The process_lint_output function calls the generator function get_messages to iterate over the groups of messages and obtain the file location information for each one which is compared to the diff_list to determine whether or not to output it.  If no location information is available, the message is displayed by default.

The script is called with the name of the baseline directory, the directory containing the modified code, and any lint arguments including the files to process. For example, using the dupstring.c example from earlier, if the original version is in orig/ and the new version is in new/, we can run our incremental lint like so:

lint_diff.py orig/ new/ new/dupstring.c

and obtain incremental results:
FlexeLint for C/C++ (Unix) Vers. 9.00h, Copyright Gimpel Software 1985-2011

--- Module:   dupstring.c (C)
new/dupstring.c: 7: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 1 [Reference: file new/dupstring.c: line 5]
new/dupstring.c: 5: Info 831: Reference cited in prior message
new/dupstring.c: 7: Warning 613: Possible use of null pointer 's' in left argument to operator 'ptr+int'
new/dupstring.c: 8: Warning 530: Symbol 'i' (line 6) not initialized
new/dupstring.c: 6: Info 830: Location cited in prior message
new/dupstring.c: 9: Warning 438: Last value assigned to variable 'i' (defined at line 6) not used
new/dupstring.c: 6: Info 830: Location cited in prior message
new/dupstring.c: 10: Warning 527: Unreachable code at token 'return'
new/dupstring.c: 10: Error 64: Type mismatch (return) (char * = int)

The diff process could be broken into a separate piece that writes the intermediate information to a file if desired. The solution could also be integrated to perform diffs from a SCM system.

This is a solution that works well in practice but it isn't perfect and some tweaks may need to be made for particular purposes:
  • The message_pattern assumes a single-line message format that starts with "%f: %l: %t %n", the script would need to be modified to handle alternative formats.
  • Since thread messages do not currently contain file or line information, they cannot be treated differently from the baseline and either all of them will be displayed or none of them will (which is the case in this implementation).
  • The output of the -summary option will necessarily include information related to messages from the baseline code.

4 comments:

  1. how to support incremental lint?

    I want to lint as FlexeLint does.
    and then parse the lint output and filter with diff result.

    like this:
    $ lint file.c > result.txt
    $ lint_filter orig_dir new_dir result.txt > diff_result.txt

    ReplyDelete
  2. maybe lint_diff.py is also usefull.

    but i can't run it:
    [test]$ python lint_diff.py old new new/test
    .c
    Traceback (most recent call last):
    File "lint_diff.py", line 146, in
    lint_fh = subprocess.Popen((('lint',) + tuple(lint_args)), stdout=subprocess
    .PIPE)
    File "/usr/lib/python2.6/subprocess.py", line 633, in __init__
    errread, errwrite)
    File "/usr/lib/python2.6/subprocess.py", line 1139, in _execute_child
    raise child_exception
    OSError: [Errno 2] No such file or directory
    =============
    lint is:
    ../LINT-NT.EXE -"format=%f: %l: %t %n" $*

    I can run lint:
    [test]$ ./lint new/test.c
    PC-lint for C/C++ (NT) Vers. 9.00h, Copyright Gimpel Software 1985-2011

    --- Module: new\test.c (C)


    _
    #include
    new\test.c: 1: Error 322

    ReplyDelete
  3. If "lint" is a batch script, try replacing 'lint' with 'lint.bat' at the end of the python script. If that doesn't work, also try specifying the full path for the batch file.

    ReplyDelete
  4. I wrote a lint-diff by shell.

    https://github.com/sinojelly/lint-diff

    This tool diffs the old and new files, only show warnings on the lines have difference.
    It not impact the lint result, just filter the warnings by lint-diff.xsl.


    You can use this lint-diff tool these ways:
    1. show warnings relative to a workspace of old code.
    a) modify variables in envsetup.sh
    b) run lint-diff-wraper.sh

    2. show warnings relative to the specified git commit
    a) use lint-diff.sh, you shold take a look at it first.

    ReplyDelete