@msrvida/python-program-analysis

A Typescript library for parsing Python 3 and doing basic program analysis, like forming control-flow graphs and def-use chains.

Usage no npm install needed!

<script type="module">
  import msrvidaPythonProgramAnalysis from 'https://cdn.skypack.dev/@msrvida/python-program-analysis';
</script>

README

python-program-analysis

A Typescript library for parsing Python 3 and doing basic program analysis, like forming control-flow graphs and def-use chains.

Parsing

To parse Python 3 code, pass a string containing the code to the parse method.

const code = [
    'x, y = 0, 0',
    'while x < 10:',
    '   y += x * 2',
    '   x += 1',
    'print(y)'
];
const tree = parse(code.join('\n')); 

This method returns a tree of SyntaxNode objects, discriminated with a type field. The library also provides a function walk for pre-order tree traversal. With no arguments, it returns a list of the syntax nodes in the tree.

walk(tree).map(node => node.type)
// produces ["module", "assign", "literal", "literal", "name", "name", "while", "binop", …]

Optionally, walk takes a visitor object with methods onEnterNode (for pre-order traversal) and onExitNode (for post-order traversal).

Syntax nodes can be turned back into code with the printNode function, which produces a string. There is no guarantee of round-tripping. That is printNode(parse(code)) could be syntactically different than code, but will be semantically the same. For example, there may be extra parentheses around expressions, when compared with the original code. The printNode function is primarily for debugging.

Control flow

A control flow graph organizes a parse tree into a graph where the nodes are "basic blocks" (sequences of statements that run together) and the edges reflect the order of block execution.

const cfg = new ControlFlowGraph(tree);

cfg.blocks is an array of the blocks in the control flow graph, with cfg.entry pointing to the entry block and cfg.exit pointing to the exit block. The control flow graph for the parse tree above looks like this.

control flow graph

Each block has a list of its statements.

printNode(cfg.blocks[0].statements[0])

prints x, y = 0, 0.

The methods cfg.getSuccessors and cfg.getPredecessors allow the edges to be followed forward or backward.

const cond = cfg.getSuccessors(cfg.entry)[0];
printNode(cond)

prints x < 10.

Data flow

The library also provides basic def-use program analysis, namely, tracking where the values assigned to variables are read. For example, the 0 assigned to x in the entry block is read in the conditional x < 10, in the assignments y = x * 2 and x += 1.

const analyzer = new DataflowAnalyzer();
const flows = analyzer.analyze(cfg).flows;
for (let flow of flows.items) 
    console.log(printNode(flow.fromNode) + 
        " -----> " + printNode(flow.toNode))

prints

x, y = 0, 0 -----> x < 10
x, y = 0, 0 -----> print(y)
x, y = 0, 0 -----> y = x * 2
x += 1 -----> x < 10
y = x * 2 -----> print(y)
x += 1 -----> y = x * 2

Program Slicing

Program slicing removes lines from a program that are unnecessary to see the effect of a chosen line of code. For example, if we only care about the print statement in this program:

sum = 0
diff_sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]
    diff_sum += A[i] - B[i]
print(sum)

then we can simplify the code to this:

sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]
print(sum)

The function call slice(P,loc) takes a program P (a parse tree) and a program location loc and returns the program locations that are necessary for loc. For example, to do the slicing example above, we call slice(ast, {first_line: 6, first_column: 0, last_line, 6: last_column: 10}) which returns a LocationSet whose members have first_line values of 1, 3, 4, and 6 (but not 2 or 5).

API Specs

When deciding whether an API call needs to appear in a program slice, the slicing algorithm needs to know whether the call has a side-effect on the variables that are passed to it (including the self parameter for method calls). The call f(x) has a side-effect on x if f updates a field (x.m = y), updates an element (x[i] = y), updates a global variable, or transitively calls another function that has a side effect. Rather than analyzing the code of a called function (which may not even be available), we rely on having specifications, recorded in JSON files. Here is the specification pandas.json (with some lines omitted):

{
  "pandas": {
    "functions": [
      "array",
      "bdate_range",
      ...
      { "name": "read_clipboard", "returns": "DataFrame" },
      { "name": "read_csv", "returns": "DataFrame" },
      { "name": "read_excel", "returns": "DataFrame" },
      ...
    ],
    "types": {
      "DataFrame": {
        "methods": [
          "abs",
          "add",
          "add_prefix",
          ...
          { "name": "pop", "updates": [0] },
          ...
        ]
      }
    }
  }
}

A module's spec provides a list of the module's functions, types, and submodules. A type spec provides a list of the type's methods. In the function/method list, if a function/method appears just as a name (for example, "array" or "abs"), then it has no side-effects and doesn't return any objects with specifications. Otherwise, the function/method appears as a dictionary with its name in a name field and any of the following:

  • updates is an array of strings that lists the parameters that experience side-effects. 0 refers the self parameter, a number k >= 1 refers to the k th parameter, and any non-numeric string is the name of an updated global variable.
  • reads is an array of strings with the global variables that the method reads. (If a slice includes a call to a function that reads a global, then it must also include any calls that update that global.)
  • returns is the type of object the call returns, which is only necessary if that type has a spec.

The specs allow slice to analyze code like the following:

import pandas as pd
d = pd.read_csv("some_path")
d.pop("Column")
d.memory_usage()
d.count()            // ← slice on this line

Looking at the spec above, we can see that read_csv return a DataFrame object, so d is a DataFrame. The call to pop on d has a side-effect on the self parameter (d), because DataFrame's pop spec has an updates of [0]. Therefore, this call to pop must appear in the slice. On the other hand, the call to memory_usage has no side-effects, so it can be left out. So, the final slice includes lines 1, 2, 3, and 5, but not 4.

If there is no spec for an API call, then the data-flow and slicing algorithms conservatively assume that any passed parameter could experience a side-effect.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.