# Static Program Analysis学习笔记

**IR(Intermediate Representation)**

**AST vs IR(Why IR?)**

AST:

high-level and closed to grammar structure

usually language dependent

suitable for fast type checking

lack of control flow information

IR:

low-level and closed to machine code

usually language independent

compact and uniform

contains control flow information

usually considered as the basis for static analysis

**Jimple Basic Concepts:**

The four main method calls of JVM:

invokespecial: call constructor, call superclass methods, call private methods

invokevirutal: instance methods call (virtual dispatch)

invokeinterface: cannot optimization, checking interface implementation

invokestatic: call static methods

method signature: `class name:return type method_name(parameter1 type,paramter 2 type...)`

**3AC(3-Address Code) and SSA**

3AC:

There is at most one operator on the right side of an instruction

In 3AC, special control statements (such as GOTO and IF statements) are used to describe the program's control flow

SSA:

Every variable has exactlly one definitiontion

in SSA, the program's control flow is represented using basic blocks and phi functions. The phi function is defined at the entry of a basic block and is used to merge different variable values into one value to handle the basic block's control flow

**Basic Blocks(BB)**

maximal sequences of consecutive three-address instructions with the properties that It can be entered only at the beginning and it can be exited only at the end

**Control Flow Graph (CFG)**

The nodes of CFG are basic blocks

Build CFG:

**Data Flow Analysis**

**overview**

How Data Flows on CFG?

Flows throw the Nodes(BBs/statements) and Edges(control flows) of CFG(program)

How application-specific Data <- Abstraction

**may analysis**(for most static analyses):

outputs information that may be true (over-approximation)

**must analysis**:

outputs information that must be true (under-approximation)

Over- and under-approximations are both for safety of analysis

**Applications**

**Reaching Definitions Analysis**

A definition d at program point p reaches a point q if there is a path from p to q such that d is "not killed" along the path

Safe-approximation

Algorithm:

Example:

**Live Variables Analysis**

Live variables analysis tells whether the value of variable v at program point p could be used along some path in CFG starting at p.If so, v is live at p; otherwise, v is dead at p.

Algorithm:

Example:

**Available Expressions Analysis**

An expression x *op* y is available at program point p if all paths from the entry to p must pass through the evaluation of x *op* y, and after the last evaluation of x *op* y, there is no redefinition of x or y

Algorithm:

Example:

**Analysis Comparison**

**Foundations**

**Partial Order**

We define poset as a pari (P, ⊑) where ⊑ is a binary relation that defines a partial ordering over P and ⊑ has the following properties:

Reflexivity(自反性)：∀x ∈ P, x ⊑ x

Antisymmetry(反对称性): ∀x, y ∈ P, x ⊑ y ∧ y ⊑ x ⟹ x = y

Transitivity(传递性): ∀x, y, z ∈ P, x ⊑ y ∧ y ⊑ z ⟹ x ⊑ z

**Partial** means for a pair of set elements in P,they could be **incomparable**; in other words, not necessary that every pair of set elements must satisfy the ordeing ⊑

**Upper and Lower Bounds**

Given a poset (P, ⊑) and its subset S that S ⊆ P, we say that

u ∈ P is an *upper bound* of S, if ∀x ∈ S, x ⊑ u. Similarly,

l ∈ P is an *lower bound* of S, if ∀x ∈ S, l ⊑ x.

We define the **least upper bound**(**lub** or **join**) of S, written ⊔S, if for every upper bound of S, say u, ⊔S ⊑ u.

Similarly, We define the **greatest lower bound** (**glb** or **meet**) of S, written ⊓S, is for every lower bound of S,say l, l ⊑ ⊓S.

Some Properties:

Not every poset has lub or glb

But if a poset has

*lub*or*glb*, it will be unique

**Lattice**

Given a poset (P, ⊑), ∀a, b ∈ P, if a ⊔ b and a ⊓ b exist, then (P, ⊑) is called a lattice

(A poset is a lattice if **every pair** of its elements has a least upper bound and a greatest lower bound)###

**Complete Lattice**

All *subsets* of a lattice have a least upper bound and a greatest lower bound

Every *finite* lattice is a complete lattice. But complete lattice is not necessarily *finite*

**Product Lattice**

Given lattices L1 = (P 1 , ⊑1 ), L2 = (P 2 , ⊑2 ), ..., Ln = (P n , ⊑n ), if for all i,(P i , ⊑i ) has ⊔i (least upper bound) and ⊓i (greatest lower bound), then we can have a product lattice Ln = (P, ⊑) that is defined by: • P = P 1 × ... × P n • (x 1 , ..., xn ) ⊑ (y 1 , ..., yn ) ⟺ (x 1 ⊑ y1 ) ∧ ... ∧ (xn ⊑ yn ) • (x 1 , ..., xn ) ⊔ (y 1 , ..., yn ) = (x 1 ⊔1 y1 , ..., xn ⊔n yn ) • (x 1 , ..., xn ) ⊓ (y 1 , ..., yn ) = (x 1 ⊓1 y1 , ..., xn ⊓n yn )

A product lattice is a lattice

If a product lattice L is a product of complete lattices,then L is also complete

**Fixed-Point Theorem**

Given a complete lattice(L,⊑) if

f: L -> L is monotonic (单调)

L is finite (有界)

then the least fixed point of f can be found by iterating f(⊥), f(f(⊥)), ..., f (... (⊥)...) until a fixed point is reached

and the greatest fixed point of f can be found by iterating f(T), f(f(T)), ......, f(...f (T )...) until a fixed point is reached