From ce9f4e23ae1bb20cf6a1c017f3d15de1c66e97bf Mon Sep 17 00:00:00 2001
From: Arseny Kapoulkine <arseny.kapoulkine@gmail.com>
Date: Tue, 21 Jun 2022 13:14:30 -0700
Subject: [PATCH] Update performance.md (#553)

Document function inlining and loop unrolling.
---
 docs/_pages/performance.md | 10 ++++++++++
 1 file changed, 10 insertions(+)
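
For illustration only (this note sits outside the diff and is not part of the documented text): a minimal Luau sketch of code that fits the constraints the new section describes. The names `square` and `sum` are hypothetical. Whether the compiler actually inlines the call or unrolls the loop depends on its heuristics, so treat this as a candidate shape rather than a guarantee. It assumes the script is run through the Luau CLI with the `-O2` switch mentioned in the section.

```lua
-- A local function with a small body: a candidate for inlining at -O2.
local function square(x)
    return x * x
end

-- Both loop bounds are compile-time constants: a candidate for unrolling.
local sum = 0
for i = 1, 4 do
    -- If the loop is unrolled and the call inlined, `i` becomes a constant
    -- in each copy of the body, which may let the compiler fold `i * i`.
    sum += square(i)
end

print(sum) -- 1 + 4 + 9 + 16 = 30
```

The program prints the same result at any optimization level. Per the section, the observable difference at level 2 is limited to debugging facilities such as `debug.traceback` and `lua_getlocal`.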

diff --git a/docs/_pages/performance.md b/docs/_pages/performance.md
index 10e2341..34b24b0 100644
--- a/docs/_pages/performance.md
+++ b/docs/_pages/performance.md
@@ -185,3 +185,13 @@ While large tables can be a problem for incremental GC in general since currentl
 The incremental garbage collector in Luau runs three phases for each cycle: mark, atomic and sweep. Mark incrementally traverses all live objects, atomic finishes various operations that need to happen without mutator intervention (see previous section), and sweep traverses all objects in the heap, reclaiming memory used by dead objects and performing minor fixup for live objects. While objects allocated during the mark phase are traversed in the same cycle and thus may get reclaimed, objects allocated during the sweep phase are considered live. Because of this, the faster the sweep phase completes, the less garbage will accumulate; and, of course, the less time sweeping takes the less overhead there is from this phase of garbage collection on the process.
 
 Since sweeping traverses the whole heap, we maximize the efficiency of this traversal by allocating garbage-collected objects of the same size in 16 KB pages, and traversing each page at a time, which is otherwise known as a paged sweeper. This ensures good locality of reference as consecutively swept objects are contiugous in memory, and allows us to spend no memory for each object on sweep-related data or allocation metadata, since paged sweeper doesn't need to be able to free objects without knowing which page they are in. Compared to linked list based sweeping that Lua/LuaJIT implement, paged sweeper is 2-3x faster, and saves 16 bytes per object on 64-bit platforms.
+
+## Function inlining and loop unrolling
+
+By default, the bytecode compiler performs a series of optimizations that make the code run faster while preserving both execution semantics and debuggability. For example, a function call is compiled as a function call, which may be observable via `debug.traceback`; a loop is compiled as a loop, which may be observable via `lua_getlocal`. To improve performance in cases where these restrictions can be relaxed, the bytecode compiler implements additional optimizations, namely function inlining and loop unrolling, when optimization level 2 is enabled (which requires passing the `-O2` switch when using the Luau CLI).
+
+Only loops with bounds known at compile time, such as `for i=1,4 do`, can be unrolled. The loop body must be simple enough for the optimization to be profitable; the compiler uses heuristics to estimate the performance benefit and automatically decides whether to unroll the loop.
+
+Only local functions (defined either as `local function foo` or `local foo = function`) can be inlined. The function body must be simple enough for the optimization to be profitable; the compiler uses heuristics to estimate the performance benefit and automatically decides, for each call to the function, whether to inline it. Additionally, recursive invocations of a function can't be inlined at this time, and inlining is completely disabled for modules that use the `getfenv`/`setfenv` functions.
+
+In both cases, in addition to removing the overhead associated with function calls or loop iteration, these optimizations can enable further optimizations, such as constant folding of expressions that depend on the loop iteration variable or on constant function arguments, or the use of more efficient instructions for certain expressions when their inputs are constants.