Automated Program Repair (APR) is a research field that has recently gained attention due to its advances in proposing methods to fix buggy programs without human intervention. Search-Based Program Repair methods have difficulties to traverse the search space, mainly, because it is challenging and costly to evaluate each variant. Therefore, aiming to improve each program's variant evaluation through providing more information to the fitness function, we propose the combination of two techniques, Doc2vec and LSTM, to capture high-level differences among variants and to capture the dependence between source code statements in the fault localization region. The results performed with the IntroClass benchmark show that our approach captures differences between variants according to the level of changes they received, and the resulting information is useful to balance the search between the exploration and exploitation steps. Besides, the proposal might be promising to filter program variants that are adequate to the suspicious portion of the code.

The Approach

Fig. 1 - Our proposal of using code naturalness for Automated Program Repair.


The proposal considers that correct source codes (written in the same programming language as the buggy code) are available in order to perform an automated program repair. Assuming that language models are trained to capture naturalness from scriptures, we rely on that to also assume that they are capable of capturing naturalness from correct source code. The corpus used to train the models are composed of tokens obtained through feature extraction from the correct source codes.

One can use information from the naturalness model in a traditional search-based program repair workflow (dashed lines). As Figure 1 highlights with the white circle, we propose to use those models to help as a static analysis factor to a fitness function. The model receives a tokenized source code, transformed with the same process used in training, and it returns information to compound the fitness along with the dynamic factor, which is typically based on a test suite and requires executing the program.

Doc2vec is employed to encode methods or functions in source codes in the same way it treats paragraphs in documents; thus, our technique obtains a measure of similarity between programs based on their vector representation. Given this information, one might assume that a fix is not so distant from the original bugged code. Besides, one can use this vector representation to investigate the impact of mutation operators, which is typically hard to do using the patch or AST representations.

LSTM is used to capture data from and generate information for a specific area of the code. Therefore, we train our LSTM model (implemented with the TensorFlow framework) on the corpus of correct programs and then use it to synthesize code patch considering a part of the buggy program as the input sequence. The model receives a sequence of tokens from the suspicious region and generates the sequence it considers more ``natural'' for that part of the code. The experiment section presents how such an output sequence can be used to evaluate program variants.

Experiment's stuff

  • Implementation

    Our Python implementation for both LSTM and Doc2vec models. Including the training and operation phases.

  • Introclass Codes

    Source code corpora used to train both LSTM and Doc2vec models, input/output token sequences and variants.

  • Raw results

    Here you can find the results obtained to acc and prec using the LSTM model and similarity from Doc2vec model.


Summary Results

Fig. 2 - Similarity between original buggy versions and their variants


Similarity, on average, grouped by the number of mutations the variants received and the problem they were implemented for. For some cases, such as checksum, grade, and smallest, the higher the number of mutations, the lower the similarity. This behavior occurs if we assume that applying more perturbations increases the entropy between the original and a variant from it. However, for other problems, fewer mutations produced a lower similarity.

Table 1 - Average of Acc and Prec achieved by LSTM model in all 70 selected versions.

Config Acc Prec Config Acc Prec Config Acc Prec
5in_10out 0.25 0.29 10in_10out 0.37 0.35 15in_10out 0.35 0.33
5in_20out 0.35 0.28 10in_20out 0.44 0.26 15in_20out 0.43 0.26
5in_30out 0.37 0.24 10in_30out 0.44 0.22 15in_30out 0.45 0.20

Table 1 presents the average of Acc and Prec for each configuration considering all versions and problems. Configurations have the format Xin_Yout, where X and Y indicate the number of tokens given as input and number of tokens expected in the output, respectively. As Acc and Prec may be conflicting, the results show, in some sense, which configurations present the better trade-off.

Naturalness models can provide useful information to be employed by a technique that needs to explore a program's search space. From the Doc2vec model, it is possible to get information to control the exploration and exploitation, and from the LSTM it is possible to create a filter on variants that are more suitable to the context of the region pointed by the fault localization.