Draft:Context-free language reachability

Context-free language reachability is an algorithmic problem with applications in static program analysis. Given a graph with edge labels from some alphabet and a context-free grammar over that alphabet, the problem is to determine whether there exists a path through the graph such that the concatenation of the labels along the path is a string accepted by the grammar.

Variations

edit

In addition to the decision problem formulation given above, there are several related function problem formulations of CFL-reachability. For brevity, define an L-path to be a path with edge labels in the language of the grammar. Then:[1]

  • The all-pairs variant is to determine all pairs of nodes such that there exists an L-path between them.
  • The single-source variant is to determine all nodes that are reachable from a given source node via an L-path.
  • The single-target variant is to determine all nodes that are the sources of L-paths that end at a given target node.
  • The single-source/single-target variant is to determine whether there is an L-path between two given nodes.

Algorithms

edit

There is a relatively simple dynamic programming algorithm for solving all-pairs CFL-reachability. The algorithm requires a normalized grammar, where each production has at most two symbols (terminals or nonterminals) on the right-hand side. The runtime of this algorithm  , where   is the number of terminals and nonterminals in the normalized grammar (which is linear with respect to the original grammar), and   is the number of nodes in the graph.[2] The algorithm works by repeatedly adding summary edges to the graph: given a production  , if there exists an edge between some nodes x and y labeled with B and an edge between y and z labeled C, then the algorithm adds a new edge labeled A between x and z. This process is repeated until saturation, i.e., until no more summary edges may be added.

Applications to program analysis

edit

Several problems in program analysis can be formulated as CFL-reachability problems, including:[3]

Alias analysis

edit

Consider an imperative language with pointers, like a simplified C. The program expression graph (PEG) for a program in such a language has a node for each expression in the program, and two kinds of edges:

  • A pointer dereference edge labeled d from each pointer dereference expression *e to the corresponding expression e
  • An assignment edge labeled a from r to l for each assignment l = r

For each d- and a-edge, there are also corresponding edges in the opposite direction, labeled ~d and ~a, respectively.

The CFL-reachability problem over the PEG and the following grammar encodes the may-alias problem:[6]

M ::= ~d V d
V ::= ~F M? F
F ::= (a M?)*
~F ::= (M? ~a)*

The nonterminal M signifies that two memory locations may alias, i.e., they point to the same value. Nonterminal V signifies that two values may alias, i.e., they hold pointers that may alias. F signifies data-flows, which are sequences of assignments interleaved with memory aliases. ~F is the inverse production of F.

The following grammar is equivalent:

M ::= ~d V d
V ::= (M? ~a)* M? (a M?)*
edit

Every CFL-reachability problem can be encoded as a Datalog program.[7]

References

edit
  1. ^ (Reps 1998)
  2. ^ (Melski & Reps 2000)
  3. ^ (Reps 1998, p. 2)
  4. ^ He, Dongjie; Lu, Jingbo; Xue, Jingling (2024). "A CFL-Reachability Formulation of Callsite-Sensitive Pointer Analysis with Built-In On-The-Fly Call Graph Construction". 38th European Conference on Object-Oriented Programming (ECOOP 2024). Schloss Dagstuhl – Leibniz-Zentrum für Informatik: 18:1–18:29. doi:10.4230/LIPIcs.ECOOP.2024.18.
  5. ^ Lu, Jingbo; He, Dongjie; Xue, Jingling (2021-07-23). "Eagle: CFL-Reachability-Based Precision-Preserving Acceleration of Object-Sensitive Pointer Analysis with Partial Context Sensitivity". ACM Trans. Softw. Eng. Methodol. 30 (4): 46:1–46:46. doi:10.1145/3450492. ISSN 1049-331X.
  6. ^ Zheng, Xin; Rugina, Radu (2008-01-07). "Demand-driven alias analysis for C". Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages. POPL '08. New York, NY, USA: Association for Computing Machinery. pp. 197–208. doi:10.1145/1328438.1328464. hdl:1813/8222. ISBN 978-1-59593-689-9.
  7. ^ (Reps 1998, p. 6)