@b-pass b-pass commented Dec 26, 2025

Description

Change both the Python->C++ and the C++->Python function calls to use the faster "vector call" calling protocol.

This makes function calls faster, at the minor expense of having to handle the arguments slightly differently.

Suggested changelog entry:

  • Improved performance of function calls between Python and C++ by switching to the "vectorcall" calling protocol.
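For readers unfamiliar with the protocol: the gist of vectorcall (PEP 590, available since Python 3.8) is that the caller passes a contiguous array of arguments plus a count, instead of allocating a fresh argument tuple for every call. A minimal sketch of that idea in plain C++ (illustrative names only, no Python headers — not pybind11's actual code):

```cpp
#include <cassert>
#include <cstddef>

using Arg = int; // stand-in for PyObject*

// Callee in the "vectorcall" style: reads arguments straight from a
// contiguous array, no per-call container to unpack.
int sum_vectorcall(const Arg *args, std::size_t nargs) {
    int total = 0;
    for (std::size_t i = 0; i < nargs; ++i) {
        total += args[i];
    }
    return total;
}

int caller() {
    // The caller packs arguments into its own stack buffer; no heap
    // allocation per call, unlike building a tuple each time.
    Arg stack_args[3] = {1, 2, 3};
    return sum_vectorcall(stack_args, 3);
}
```

The real protocol adds details (a kwnames tuple for keyword arguments, the PY_VECTORCALL_ARGUMENTS_OFFSET trick), but the performance win comes from this caller-owned flat array.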

@b-pass b-pass marked this pull request as draft December 26, 2025 13:25

b-pass commented Dec 26, 2025

Small benchmark results for short function calls: this PR provides a modest improvement over current master... perf

@b-pass b-pass requested a review from oremanj December 26, 2025 16:52
@b-pass b-pass marked this pull request as ready for review December 26, 2025 16:52

rwgk commented Dec 27, 2025

@b-pass I merged master, which had no conflicts.

My idea was that maybe Cursor can figure out the root cause for these two CI failures:

2025-12-26T23:29:10.4917170Z  E       ModuleNotFoundError: No module named 'mod_per_interpreter_gil'
...
2025-12-26T23:29:12.4779770Z  E       ModuleNotFoundError: No module named 'mod_shared_interpreter_gil'

I'll wait for this CI run to finish, to see if these errors appear again.

When testing locally with Python 3.14.2, default and freethreaded, I saw this error (with both builds):

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

EDIT: Please see #5948 (comment) for the resolution.

I'm attaching the full build & test logs, JIC.

I switched back to master with everything else being equal, and this error didn't appear. So it's definitely somehow connected to this PR.

pybind11_gcc_v3.14_df793163d58_default_tests_log_2025-12-26+193154.txt

pybind11_gcc_v3.14_df793163d58_freethreaded_log_2025-12-26+193608.txt

The `mod_per_interpreter_gil`, `mod_shared_interpreter_gil`, and
`mod_per_interpreter_gil_with_singleton` modules were being built
but not installed into the wheel when using scikit-build-core
(SKBUILD=true). This caused iOS (and potentially Android) CIBW
tests to fail with ModuleNotFoundError.

Root cause analysis:
- The main test targets have install() commands (line 531)
- The PYBIND11_MULTIPLE_INTERPRETERS_TEST_MODULES were missing
  equivalent install() commands
- For regular CMake builds, this wasn't a problem because
  LIBRARY_OUTPUT_DIRECTORY places the modules next to pybind11_tests
- For wheel builds, only targets with explicit install() commands
  are included in the wheel

This issue was latent until commit fee2527 changed the test imports
from `pytest.importorskip()` (graceful skip) to direct `import`
statements (hard failure), which exposed the missing modules.

Failing tests:
- test_multiple_interpreters.py::test_independent_subinterpreters
- test_multiple_interpreters.py::test_dependent_subinterpreters

Error: ModuleNotFoundError: No module named 'mod_per_interpreter_gil'

rwgk commented Dec 27, 2025

Quick update: I think Cursor found the root cause for the CIBW / iOS failures, and a fix.

I'm still looking into why we're not seeing the m.array_reshape2(b) errors here.

Add numpy==2.4.0 requirement for Python 3.14 (both default and
free-threaded builds). NumPy 2.4.0 is the first version to provide
official PyPI wheels for Python 3.14:

- numpy-2.4.0-cp314-cp314-manylinux_2_27_x86_64...whl (default)
- numpy-2.4.0-cp314-cp314t-manylinux_2_27_x86_64...whl (free-threaded)

Previously, CI was skipping all numpy-dependent tests for Python 3.14
because PIP_ONLY_BINARY was set and no wheels were available:

  SKIPPED [...] test_numpy_array.py:8: could not import 'numpy':
  No module named 'numpy'

With this change, the full numpy test suite will run on Python 3.14,
providing better test coverage for the newest Python version.

Note: Using exact pin (==2.4.0) rather than compatible release (~=2.4.0)
to ensure reproducible CI results with the first known-working version.
@rwgk rwgk requested a review from henryiii as a code owner December 27, 2025 05:00

rwgk commented Dec 27, 2025

I pushed commit ca72d6c here because we only have very limited GHA resources. Depending on how it goes, it might be best to merge that commit in a separate PR. Let's see.

Ideally commit 6ed6d5a should also be in a separate PR, but same issue with the GHA resources.


rwgk commented Dec 27, 2025

NumPy 2.3.5 vs 2.4.0 Observations

The Issue

When testing this PR locally with Python 3.14 + NumPy 2.3.5, I observed a test failure:

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

test_numpy_array.py::test_array_resize FAILED

This occurred when calling array_reshape2 on a transposed view of a resized 64-element array (which should have worked since 64 = 8²).

What We Know For Sure

  1. The error appeared with NumPy 2.3.5 (released 2025-11-16) on Python 3.14
  2. The error is gone with NumPy 2.4.0 (released 2025-12-20) on Python 3.14
  3. This is NOT a bug in the vectorcall PR — the same test passes on master with NumPy 2.3.5, but that's because CI was skipping numpy tests for Python 3.14 entirely (no numpy entry in tests/requirements.txt for 3.14)
  4. NumPy 2.4.0 wheels for Python 3.14 are now available on PyPI (both default cp314 and free-threaded cp314t)

What's Undetermined

The exact bug fix in NumPy between 2.3.5 and 2.4.0 that resolved this issue. I searched the NumPy 2.4.0 release notes but couldn't find a specific fix that clearly matches our observation. The release notes mention "many expired deprecations and bug fixes" without listing all of them individually.

Resolution

Added numpy==2.4.0 to tests/requirements.txt for Python 3.14+ (commit ca72d6c). This:

  • Enables numpy tests on Python 3.14 in CI (previously skipped)
  • Avoids the NumPy 2.3.5 bug

Add `-v` to the pytest command in tests/pyproject.toml to help
diagnose hanging tests in CIBW jobs (particularly iOS).

This will show each test name as it runs, making it easier to
identify which specific test is hanging.
b-pass and others added 5 commits December 27, 2025 15:50
…ed since 3.8.

This means getting rid of simple_collector; we can do the same with a constexpr if in the unpacking_collector.
This makes more sense and saves a sort, and the small_vector implementation means it will actually take less space than a vector of size_t elements.

The most common case is that all kwargs are used.
Pyodide runs in a WebAssembly sandbox without POSIX signals, so
`signal.setitimer` is not available. This causes pytest-timeout to
crash with `AttributeError: module 'signal' has no attribute 'setitimer'`
when timeout > 0.

Override the test-command for Pyodide to keep timeout=0 (disabled).

rwgk commented Dec 28, 2025

Review of Changes Since Commit 849df0b

1. 849df0b - argument_vector cleanups

Addressed @Skylion007's review comments:

  • [T002]: Changed emplace_back to use std::forward<Arg>(x) for proper forwarding
  • [T003]: Made push_back delegate to emplace_back
  • [T001]: Removed the sort() method entirely (no longer needed, see below)

2. b517942 - Unified to vectorcall for all Python versions

This addressed the review question about why unpacking_vectorcall_collector was limited to Python 3.12+. The old simple_collector and unpacking_collector were removed entirely, and everything now uses vectorcall since it's been available since Python 3.8.

Net result: -221 lines, +55 lines — a major simplification!

3. b4ca366 - used_kwargs as bool vector

Changed from small_vector<size_t> (indices of used kwargs, required sorting) to small_vector<bool> (flag per kwarg position). This:

  • Eliminates the need for sorting
  • Uses bit-packing in the specialized small_vector<bool> template
  • More memory efficient and simpler logic
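The difference between the two bookkeeping schemes can be sketched as follows (illustrative names using std::vector, not the PR's small_vector):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Old scheme: record the index of each used kwarg, then sort so that
// membership checks can be done in order.
bool was_used_indices(std::vector<std::size_t> used, std::size_t i) {
    std::sort(used.begin(), used.end()); // the extra O(n log n) step
    return std::binary_search(used.begin(), used.end(), i);
}

// New scheme: one flag per kwarg position; no sorting, O(1) lookup.
// A specialized bool vector (like std::vector<bool>) bit-packs the
// flags, so it is also smaller than a vector of size_t indices.
bool was_used_flags(const std::vector<bool> &used, std::size_t i) {
    return i < used.size() && used[i];
}
```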

4. c2e65f1 and 1526a85 - Signedness fixes

Minor fixes for clang warnings about signed/unsigned conversions.
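Such warnings typically arise when mixing a signed argument count with an unsigned size(); a hedged illustration of the usual shape of the fix (hypothetical function, not the PR's actual code):

```cpp
#include <cassert>
#include <vector>

// Subtracting an unsigned size() from a signed count triggers
// -Wsign-conversion; an explicit static_cast (after asserting the
// value is non-negative) silences it without changing behavior.
long trimmed_count(long nargs, const std::vector<int> &names) {
    assert(nargs >= 0);
    return nargs - static_cast<long>(names.size());
}
```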


All changes look good. The code is cleaner and more efficient. The review comments from @Skylion007 were all addressed.

@rwgk rwgk left a comment

the old simple_collector and unpacking_collector were removed entirely

Awesome to see the simplification! Looks even better to me now.

@Skylion007

The push_back impl isn't quite right. push_back has an rvalue overload too, so it needs either both a const-ref and an rvalue overload (with a corresponding std::move), or to take by value and always move.
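The conventional overload pair this comment asks for can be sketched like this (a toy container for illustration, not the PR's argument_vector):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

template <typename T>
class tiny_vector {
public:
    template <typename... Args>
    void emplace_back(Args &&...args) {
        m_data.emplace_back(std::forward<Args>(args)...);
    }
    // Both overloads delegate to emplace_back, mirroring the review fix:
    void push_back(const T &x) { emplace_back(x); }       // lvalues: copy
    void push_back(T &&x) { emplace_back(std::move(x)); } // rvalues: move
    std::size_t size() const { return m_data.size(); }
    const T &operator[](std::size_t i) const { return m_data[i]; }

private:
    std::vector<T> m_data;
};
```

With only the const-ref overload, a temporary passed to push_back would be silently copied instead of moved.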

@Skylion007 Skylion007 left a comment
Doesn't this need a static assert that the heapvector (and maybe iarray?) is noexcept move constructible, technically? Otherwise, you need some weird edge case handling for OOM like you do with std::vector.
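The guard being requested might look like this (sketch with hypothetical names): if elements can throw on move, growth paths need std::vector-style strong-exception-safety fallbacks (copying instead of moving), so a static_assert documents and enforces the assumption up front instead.

```cpp
#include <type_traits>
#include <vector>

template <typename T>
struct small_vector_sketch {
    static_assert(std::is_nothrow_move_constructible<T>::value,
                  "elements must be noexcept move constructible; "
                  "otherwise growth needs std::vector-like fallback logic");
    std::vector<T> storage; // stand-in for the real inline/heap storage
};
```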

@Skylion007

Random design question, why does the collector store pyobject pointers directly vs py::handle?

It's a good bit faster at the cost of this one scary reinterpret_cast.
At 6, the struct is 144 bytes (not 128 bytes as the comment said).

b-pass commented Dec 28, 2025

The push_back impl isn't quite right.

Fixed.

Needs a static assert the heapvector? And maybe iarray?

Not sure what you mean... elaborate?

Random design question, why does the collector store pyobject pointers directly vs py::handle?

It was to avoid having to reinterpret_cast the C++ type to the C type in the vectorcall call... But I've consolidated both the PyObject* vector and the temp object vector into one and cast it in this latest version.

@Skylion007

Random design question, why does the collector store pyobject pointers directly vs py::handle?

It was to avoid having to reinterpret_cast the C++ type to the C type in the Vectorcall call.... But I've consolidated both the PyObject* vector and the temp object vector into one and casted it in this latest version.

Avoiding the reinterpret_cast is a valid reason, just curious.

@Skylion007 Skylion007 left a comment
I'm a bit confused why a bunch of list and tuple objects are being stored as generic objects. We don't do anything crazy to them, and they all inherit from py::object. The only difference is the implicit boolean checks, maybe, but those can easily be worked around by unwrapping the ptr any time and checking against nullptr.

if (PyList_Append(names_list.ptr(), name.release().ptr()) < 0) {
    throw error_already_set();
}
m_args.emplace_back(a.value);
Suggested change
m_args.emplace_back(a.value);
m_args.emplace_back(std::move(a.value));


b-pass commented Dec 28, 2025

I'm a bit confused why a bunch of list and tuple objects are being stored as generic objects. We don't do anything crazy to them, and they all inherit from py::object. The only difference is the implicit boolean checks, maybe, but those can easily be worked around by unwrapping the ptr any time and checking against nullptr.

tuple and list don't have a constructor which allows them to be constructed as nullptr, whereas object's default constructor is a nullptr. Typing these properly would make an extra allocation, and IMO we can't afford the extra allocation here (this is probably the hottest code path in all of pybind11). I also don't think it would be a good idea (or at least it needs a lot of debate) to change pybind11's tuple/list for this case.

It's just a bit inconvenient in this class ... I can certainly add some explanation in a comment about it.

EDIT: Turns out there is a way: `auto null_list = reinterpret_steal<list>(handle());`

b-pass and others added 4 commits December 28, 2025 18:04
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
They can be null if you "steal" a null handle.
…icit ownership

Introduce `ref_small_vector` to manage PyObject* references in `unpacking_collector`,
replacing the previous `small_vector<object>` approach.

Primary goals:

1. **Maintainability**: The previous implementation relied on
   `sizeof(object) == sizeof(PyObject*)` and used a reinterpret_cast to
   pass the object array to vectorcall. This coupling to py::object's
   internal layout could break if someone adds a debug field or other
   member to py::handle/py::object in the future.

2. **Readability**: The new `push_back_steal()` vs `push_back_borrow()`
   API makes reference counting intent explicit at each call site,
   rather than relying on implicit py::object semantics.

3. **Intuitive code**: Storing `PyObject*` directly and passing it to
   `_PyObject_Vectorcall` without casts is straightforward and matches
   what the C API expects. No "scary" reinterpret_cast needed.

Additional benefits:
- `PyObject*` is trivially copyable, simplifying vector operations
- Batch decref in destructor (tight loop vs N individual object destructors)
- Self-documenting ownership semantics

Design consideration: We considered folding the ref-counting functionality
directly into `small_vector` via template specialization for `PyObject*`.
We decided against this because:
- It would give `small_vector<PyObject*, N>` a different interface than the
  generic `small_vector<T, N>` (steal/borrow vs push_back)
- Someone might want a non-ref-counting `small_vector<PyObject*, N>`
- The specialization behavior could surprise users expecting uniform semantics

A separate `ref_small_vector` type makes the ref-counting behavior explicit
and self-documenting, while keeping `small_vector` generic and predictable.
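The steal/borrow distinction described above can be modeled with an explicit refcount in place of PyObject (a toy sketch with hypothetical names, not the PR's ref_small_vector):

```cpp
#include <vector>

// Minimal stand-in for a reference-counted Python object.
struct Ref {
    int count = 1;
};

void incref(Ref *r) { if (r != nullptr) { ++r->count; } }
void decref(Ref *r) { if (r != nullptr) { --r->count; } }

class ref_vector_sketch {
public:
    ~ref_vector_sketch() {
        // Batch decref in the destructor: one tight loop instead of N
        // individual object destructors.
        for (Ref *r : m_ptrs) {
            decref(r);
        }
    }
    // Steal: take over an existing reference, no incref now.
    void push_back_steal(Ref *r) { m_ptrs.push_back(r); }
    // Borrow: incref now, so destruction treats all entries uniformly.
    void push_back_borrow(Ref *r) {
        incref(r);
        m_ptrs.push_back(r);
    }

private:
    std::vector<Ref *> m_ptrs;
};
```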

rwgk commented Dec 29, 2025

@b-pass commit d828e7d is meant to be a suggestion, although already fully developed. Please let me know what you think.


b-pass commented Dec 29, 2025

Please let me know what you think.

I like it. It's less code than I thought it would be.

if (m_names) {
    nargs -= m_names.size();
}
PyObject *result = _PyObject_Vectorcall(

Slight nit: this could have been a py::handle.

/// destructor, and avoids the need for reinterpret_cast when passing to vectorcall.
template <std::size_t InlineSize>
class ref_small_vector {
public:
Ugh, I don't like this, but I don't see a better way of doing it due to layout compatibility issues with py::object. Every other method of doing this requires a copy.

@Skylion007 Skylion007 left a comment
Not sure if you thought of this, but if you make push_back only take in objects, you can handle the reinterpret steal and reinterpret borrow using only move and release mechanics.

}

/// Add a pointer, taking ownership (no incref, will decref on destruction)
void push_back_steal(PyObject *p) { m_ptrs.push_back(p); }
Suggested change
void push_back_steal(PyObject *p) { m_ptrs.push_back(p); }
void push_back_steal(object &&o) { m_ptrs.push_back(o.release().ptr()); }

#endif
}
args_list.append(std::move(o));
m_args.push_back_steal(h.ptr()); // cast returns a new reference
Suggested change
m_args.push_back_steal(h.ptr()); // cast returns a new reference
m_args.push_back_steal(std::move(o)); // cast returns a new reference

Then it can keep the good old object semantics.

These can be normal objects that you just release. The borrow semantics are a bit more awkward unless you use move-by-value implicit construction. This might actually be cleaner.

I explored what this change would look like. Here's the diff (Cursor-generated):

diff --git a/include/pybind11/cast.h b/include/pybind11/cast.h
--- a/include/pybind11/cast.h
+++ b/include/pybind11/cast.h
@@ -2279,8 +2279,9 @@ private:
     // normal argument, possibly needing conversion
     template <typename T>
     void process(list & /*names_list*/, T &&x) {
-        handle h = detail::make_caster<T>::cast(std::forward<T>(x), policy, {});
-        if (!h) {
+        auto o = reinterpret_steal<object>(
+            detail::make_caster<T>::cast(std::forward<T>(x), policy, {}));
+        if (!o) {
 #if !defined(PYBIND11_DETAILED_ERROR_MESSAGES)
             throw cast_error_unable_to_convert_call_arg(std::to_string(m_args.size() - 1));
 #else
@@ -2288,7 +2289,7 @@ private:
                                                         type_id<T>());
 #endif
         }
-        m_args.push_back_steal(h.ptr()); // cast returns a new reference
+        m_args.push_back_steal(std::move(o));
     }

     // * unpacking
@@ -2297,7 +2298,7 @@ private:
             return;
         }
         for (auto a : ap) {
-            m_args.push_back_borrow(a.ptr());
+            m_args.push_back_borrow(a);
         }
     }

@@ -2328,7 +2329,7 @@ private:
 #endif
         }
         names_list.append(std::move(name));
-        m_args.push_back_borrow(a.value.ptr());
+        m_args.push_back_borrow(a.value);
     }

     // ** unpacking
@@ -2347,7 +2348,7 @@ private:
 #endif
             }
             names_list.append(std::move(name));
-            m_args.push_back_borrow(k.second.ptr());
+            m_args.push_back_borrow(k.second);
         }
     }

diff --git a/include/pybind11/detail/argument_vector.h b/include/pybind11/detail/argument_vector.h
--- a/include/pybind11/detail/argument_vector.h
+++ b/include/pybind11/detail/argument_vector.h
@@ -389,13 +389,13 @@ public:
         return *this;
     }

-    /// Add a pointer, taking ownership (no incref, will decref on destruction)
-    void push_back_steal(PyObject *p) { m_ptrs.push_back(p); }
+    /// Add an object, taking ownership (will decref on destruction)
+    void push_back_steal(object &&o) { m_ptrs.push_back(o.release().ptr()); }

-    /// Add a pointer, borrowing (increfs now, will decref on destruction)
-    void push_back_borrow(PyObject *p) {
-        Py_XINCREF(p);
-        m_ptrs.push_back(p);
+    /// Add a handle, borrowing (increfs now, will decref on destruction)
+    void push_back_borrow(handle h) {
+        Py_XINCREF(h.ptr());
+        m_ptrs.push_back(h.ptr());
     }

     /// Add a null pointer (for PY_VECTORCALL_ARGUMENTS_OFFSET slot)

The change compiles and local tests pass. However, I'd prefer to keep the current implementation for these reasons:

  1. No reduction in code size — The change is net +1 line. The reinterpret_steal<object>(...) wrapper and std::move() add verbosity that offsets the minor savings from dropping .ptr() at the borrow call sites.

  2. The current code is explicit and straightforward — ref_small_vector is a purpose-built container for managing PyObject* ownership. Having it accept raw pointers makes the ownership transfer explicit at the call site. When you see push_back_steal(h.ptr()), it's immediately clear you're handing off a raw pointer to a container that will manage its lifetime.

  3. The abstraction doesn't simplify anything — We still need to deal with the underlying PyObject* for the vectorcall API. Wrapping the interface in object&&/handle just moves the .ptr() and .release() calls inside the method, without reducing overall complexity.

I think in this case the added abstraction layer doesn't provide tangible benefits. The current approach is clear, direct, and easy to follow.

void process(list & /*names_list*/, T &&x) {
    handle h = detail::make_caster<T>::cast(std::forward<T>(x), policy, {});
Converting this to a handle here makes me a bit uncomfortable. You could keep it in an object and just use the move-from ctor; the only downside is you'll have one extra call to Py_XDECREF on a nullptr.

@Skylion007

I like how you have it done now; I just wonder if it could be cleaned up any more with .release().ptr() semantics.


}
m_kwargs[a.name] = std::move(a.value);
names_list.append(std::move(name));
m_args.push_back_borrow(a.value.ptr());
@Skylion007 Skylion007 Dec 30, 2025
Yeah, both of these could just be a push_back(&&o) with release-ptr syntax, and then you don't have to duplicate the logic; while the implementation stores raw pointers, the input is still py::object. The one downside is the extra Py_XDECREF call on the nullptr stored in the py::object, but I doubt that null check will be a bottleneck.

Suggested change
m_args.push_back_borrow(a.value.ptr());
m_args.push_back_steal(reinterpret_borrow<object>(a.value));
