Cranelift: update regalloc2 to 0.15.0 to permit more VRegs. #12611
cfallin merged 2 commits into bytecodealliance:main
Conversation
This pulls in bytecodealliance/regalloc2#257 to permit more VRegs to be used in a single function body, addressing #12229 and our followup discussions about supporting function body sizes up to the standard Wasm implementation limits.

In addition to the RA2 upgrade, this also includes a bit more explicit limit-checking on the Cranelift side: note that we don't directly use `regalloc2::VReg` but instead further bitpack it into `Reg`, which is logically a sum type of `VReg`, `PReg` and `SpillSlot` (the last one needed to represent stack allocation locations on defs, e.g. on callsites with many returns). `PReg`s are packed into the beginning of the `VReg` index space, but `SpillSlot`s are distinguished by stealing the upper bit of a `u32`. This was previously not a problem given the smaller `VReg` index space, but now we need to check explicitly; hence `Reg::from_virtual_reg_checked` and its use in the lowering vreg allocator. Because the `VReg` encoding packs the class into the bottom two bits and the index into the upper 30, but we steal one bit at the top, the true limit for VReg count is actually 2^29, or 512M.

Fixes #12229.
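For readers not familiar with the packing, here is a minimal sketch of the bit layout described above; `PackedReg`, `SPILLSLOT_BIT`, and `MAX_VREG_INDEX` are illustrative names rather than Cranelift's actual definitions, and the detail of `PReg`s occupying the bottom of the index space is omitted:

```rust
/// Register classes, mirroring the two-bit class field at the bottom
/// of a regalloc2 `VReg`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RegClass {
    Int = 0,
    Float = 1,
    Vector = 2,
}

/// A `Reg`-like sum type packed into a single `u32`:
///
///   bit 31       : set for spill slots, clear for registers
///   bits 2..=30  : register index (29 bits, hence the 2^29 limit)
///   bits 0..=1   : register class
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedReg(u32);

const SPILLSLOT_BIT: u32 = 1 << 31;
const MAX_VREG_INDEX: u32 = 1 << 29;

impl PackedReg {
    /// Checked constructor in the spirit of `Reg::from_virtual_reg_checked`:
    /// refuse indices that would collide with the stolen spill-slot bit.
    fn from_virtual_reg_checked(index: u32, class: RegClass) -> Option<Self> {
        if index >= MAX_VREG_INDEX {
            return None;
        }
        Some(PackedReg((index << 2) | class as u32))
    }

    /// Spill slots reuse the same `u32`, distinguished by the top bit.
    fn from_spillslot(slot: u32) -> Self {
        debug_assert!(slot & SPILLSLOT_BIT == 0);
        PackedReg(SPILLSLOT_BIT | slot)
    }

    fn is_spillslot(self) -> bool {
        self.0 & SPILLSLOT_BIT != 0
    }

    /// Class bits; only meaningful when this is not a spill slot.
    fn class(self) -> RegClass {
        match self.0 & 0b11 {
            0 => RegClass::Int,
            1 => RegClass::Float,
            _ => RegClass::Vector,
        }
    }

    /// Virtual-register index; only meaningful when this is not a spill slot.
    fn vreg_index(self) -> u32 {
        (self.0 & !SPILLSLOT_BIT) >> 2
    }
}
```

In this sketch, any index of 2^29 or above makes the checked constructor return `None`, which is the case the new explicit check in the lowering vreg allocator has to handle.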
The failing test (
I think that test already takes forever to compile/run in debug mode on all platforms, so jettisoning that test doesn't seem unreasonable to me, and agreed that in theory it should not be possible to write. Maybe it's sufficient to have a local script to run or some online corpus of "must compile big modules" or something like that? Either that or we could create a dedicated test job for "compile wasmtime in release on one fast platform and compile big modules" to make sure everything succeeds.
This updates the `ValueDataPacked` scheme from the old
```
(enum tag) (CLIF type) (value 1) (value 2)
/// | tag:2 | type:14 | x:24 | y:24 |
```
encoding in a `u64` to a new
```
/// | tag:2 | type:14 | x:32 | y:32 |
```
encoding, with a `packed` tag attribute to ensure the struct fits in
10 bytes. This permits the full range of `Value` (a `u32` entity
index) to be encoded, removing the remaining major limit on function
body size after the work in #12611 to address #12229.
Curiously, this appears to be a *speedup* in compile time of 3-5% on
bz2 and 3% on spidermonkey-json (Sightglass, 50 data points each). My
best guess as to why is that putting the value fields in their own
`u32`s allows for quick access without shifts/masks, which is actually
better than the unaligned accesses (caused by 10-byte size) -- which
have no penalty on modern mainstream CPUs -- and 25% size inflation of
the value-definitions array.
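For concreteness, here is a minimal sketch of what a packed 10-byte layout along these lines can look like; the struct and accessor names are mine, and placing the tag in the top two bits of the 16-bit word is an assumption, so this is not the actual `ValueDataPacked` in cranelift-codegen:

```rust
/// | tag:2 | type:14 | x:32 | y:32 |  -- 80 bits total.
///
/// `repr(packed)` drops the padding that `u32` alignment would
/// otherwise force, so the struct is exactly 10 bytes.
#[repr(packed)]
#[derive(Clone, Copy)]
struct ValueDataPackedSketch {
    /// Two-bit enum tag and 14-bit CLIF type, packed into one `u16`.
    tag_and_ty: u16,
    /// First 32-bit payload, wide enough for a full `Value` entity index.
    x: u32,
    /// Second 32-bit payload.
    y: u32,
}

impl ValueDataPackedSketch {
    fn new(tag: u16, ty: u16, x: u32, y: u32) -> Self {
        debug_assert!(tag < (1 << 2) && ty < (1 << 14));
        ValueDataPackedSketch {
            tag_and_ty: (tag << 14) | ty,
            x,
            y,
        }
    }

    fn tag(self) -> u16 {
        self.tag_and_ty >> 14
    }

    fn ty(self) -> u16 {
        self.tag_and_ty & 0x3fff
    }

    /// The payloads are plain `u32` fields, read without shifts or masks.
    fn x(self) -> u32 {
        self.x
    }

    fn y(self) -> u32 {
        self.y
    }
}

// The point of `packed`: 2 + 4 + 4 bytes with no padding.
const _: () = assert!(core::mem::size_of::<ValueDataPackedSketch>() == 10);
```

The `x`/`y` accessors being plain field reads is the "no shifts/masks" property credited above for the compile-time win, traded against unaligned loads and the 8-to-10-byte (25%) growth of each entry.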
Sure thing -- I'll push the "corpus of big inputs to test" as followup (after #12613 lands as well); removed the