The C-API#
The Python community has created an extensive set of modules written in C/C++, which can be imported from any Python code. By using CPython's C-API, these extension modules can access the interpreter and interact with managed objects from C/C++.
Many core built-in modules are written in C as extensions. Cython - a popular way of speeding up Python code - relies on the C-API as well, so C-API support in Skybison Runtime is an essential feature.
NOTE: This document was written during early project planning stages. Parts of it may be out of date.
Reimplementation strategy#
What needs to be reimplemented? All functions that access object internals
directly - functions that dereference a PyXyzObject object and read out a
member or take the address of a member. For example, PyLong_FromLong directly
accesses PyLongObject's member ob_digit. Given that PyLongObject is not
a valid struct in the Runtime, this function will have to be rewritten.
The implementation of C-API functions is in Python/ and Objects/
directories in the CPython repo. We plan to reuse a subset of this code. Since
C-API is fairly stable, our current plan is to just copy the implementation of
safe functions (those that do not touch object internals) into our sources.
One goal of the C-API reimplementation is to be able to reuse all native code
in the Modules/ directory (built-in modules) with no source-level changes
(with rare exceptions of misbehaving code).
History & Issues#
C-API has been developed over the course of decades. The biggest issue is that it's not separated from CPython runtime and it exposes many implementation details.
When you create an extension, you include "Python.h", which transitively
includes basically all CPython headers. Public C-API functions typically start
with Py (like PyLong_FromLong), while private functions start with _Py
(like _Py_NewReference). But this is only a naming convention; extensions can
technically use the private functions as well. This convention is also not
followed in 100% of cases - _PyObject_New can be found in the docs as a public
C-API function, for example.
We will unfortunately not be able to support all public C-APIs - in some
cases even the interface relies on some implementation details. This is a known
problem and we will need to migrate code using affected APIs (like the
PyTypeObject APIs) to newer APIs (sometimes known as the Limited API) that
enforce data-hiding. See
compatible-c-extensions.md.
PyPy did not support the C-API in the beginning. It is now provided by the "CPyExt" subsystem. Check out this great article on the PyPy blog: Why emulating CPython C API is so Hard.
Here is a comprehensive summary of some other problems with the current C-API design and some proposals on how to fix them: Design a new better C API for Python.
Recently, HPy has made excellent progress toward a better C-API.
PyObject* vs RawObject#
CPython represents all objects as PyObject* - pointers to a common base
structure that has some basic fields like type and reference count. When an
object is created, CPython just allocates an object of the right size on the
heap - and the address never changes during the object lifetime.
This is the first and most important problem we need to overcome. In Skybison Runtime, some objects are not allocated on the heap at all (immediate objects), and even heap-allocated objects can change location during their lifetime because of a compacting GC.
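To make the idea of an immediate object concrete, here is a minimal sketch in C. The encoding is an assumption made for illustration only (low bit 0 for a small integer, low bit 1 for a heap reference), and the helper names are hypothetical; the Runtime's actual RawObject layout may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding, for illustration only: a RawObject is one machine
 * word. Low bit 0 means "immediate small integer", low bit 1 means
 * "reference to a heap object". */
typedef uintptr_t RawObject;

static bool raw_is_small_int(RawObject obj) { return (obj & 1u) == 0; }

/* Store the integer in the word itself; nothing is allocated. */
static RawObject raw_from_small_int(intptr_t value) {
  return (RawObject)((uintptr_t)value << 1);
}

/* The low bit is 0, so dividing by 2 recovers the value exactly. */
static intptr_t raw_to_small_int(RawObject obj) {
  assert(raw_is_small_int(obj));
  return (intptr_t)obj / 2;
}

/* Tag a heap address (assumed to be at least 2-byte aligned). */
static RawObject raw_from_heap_address(void* address) {
  return (RawObject)((uintptr_t)address | 1u);
}

/* Strip the tag. The result is only valid until the GC moves the object. */
static void* raw_to_heap_address(RawObject obj) {
  assert(!raw_is_small_int(obj));
  return (void*)(obj & ~(uintptr_t)1);
}
```

Under this assumed encoding, a small integer has no address at all, and the address inside a heap reference can become stale after a compaction, so neither can be handed to an extension as a stable pointer.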
All C-API functions and extensions work with PyObject* and can make the
assumption that the object doesn't move. To support this, we allocate a special
object on the heap for every object that needs to cross the boundary between
Runtime and an extension module:
struct PyObject {
  RawObject reference;  // corresponding object
  long ob_refcnt;
};
In Runtime code, we wrap this object in a special class, ApiHandle, which
provides some convenience functions:
Converting PyObject to RawObject is simple - we follow the reference
field at the top of the structure. Converting RawObject to PyObject is not as
straightforward. Runtime keeps a special table for this purpose
(Runtime::apiHandles) - it maps RawObjects to ApiHandles.
This table is also important for the GC. If an extension keeps a reference to an object, we can't delete the object even if there are no references to it in the runtime. For this reason, the garbage collector treats all objects in this table (with non-zero reference count) as roots.
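The two conversions and the root scan can be sketched as follows. This is a simplified model, not the Runtime's implementation: RawObject is modeled as a plain integer word, the table is a fixed-size array with linear search, and the names handle_table, handle_for_object, handle_as_object and collect_handle_roots are hypothetical. Incrementing the count on every lookup is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t RawObject; /* stand-in for the runtime's object word */

/* Layout taken from the struct shown above. */
typedef struct PyObject {
  RawObject reference; /* corresponding managed object */
  long ob_refcnt;
} PyObject;

/* Hypothetical table: a fixed-size array searched linearly. */
enum { kMaxHandles = 64 };
static PyObject* handle_table[kMaxHandles];
static size_t handle_count = 0;

/* PyObject* -> RawObject: follow the reference field. */
static RawObject handle_as_object(const PyObject* handle) {
  return handle->reference;
}

/* RawObject -> PyObject*: reuse the existing handle or allocate a new one.
 * The handle is malloc'ed, so its address stays fixed even if the managed
 * object moves. Returns NULL when the table is full or allocation fails. */
static PyObject* handle_for_object(RawObject obj) {
  for (size_t i = 0; i < handle_count; i++) {
    if (handle_table[i]->reference == obj) {
      handle_table[i]->ob_refcnt++; /* sketch assumption: new reference */
      return handle_table[i];
    }
  }
  if (handle_count == kMaxHandles) return NULL;
  PyObject* handle = malloc(sizeof *handle);
  if (handle == NULL) return NULL;
  handle->reference = obj;
  handle->ob_refcnt = 1;
  handle_table[handle_count++] = handle;
  return handle;
}

/* Root scan: copy every object whose handle has a non-zero reference count
 * into out (at most capacity entries). Returns the number of roots copied. */
static size_t collect_handle_roots(RawObject* out, size_t capacity) {
  size_t found = 0;
  for (size_t i = 0; i < handle_count; i++) {
    if (handle_table[i]->ob_refcnt > 0 && found < capacity) {
      out[found++] = handle_table[i]->reference;
    }
  }
  return found;
}
```

Because the lookup is keyed on the RawObject, asking twice for the same object yields the same handle address, which is what lets an extension compare PyObject* values for identity.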
How does this work for immediate types? PyObject is allocated on the heap as usual
and it contains the full RawObject. It's also added to the ApiHandles
dictionary, which means that if you call PyUnicode_FromString("short") two
times, you will get the same PyObject* both times (unlike with CPython).
Reference counting and borrowed references#
CPython uses a combination of reference counting and generational GC for memory
management. As shown above, every PyObject has an ob_refcnt field. All
functions - including C/C++ extensions using C-API - need to update the
reference count properly. This is typically done via the Py_INCREF and
Py_DECREF macros.
To simplify reference count management, reference ownership is used in the
C-API documentation. "Owning a reference" means being responsible for calling
Py_DECREF on it when it's no longer needed. See CPython
docs for details.
- Functions returning PyObject* return new references by default - this means the caller is the owner of the reference and has to call Py_DECREF or transfer the ownership. The other option is returning a borrowed reference - see PyList_GetItem for example.
- Functions taking PyObject* as an argument don't steal ownership of the reference by default. A few functions do - like PyList_SetItem - in which case the caller is no longer responsible for decrementing the reference count when it's no longer needed.
If a function returns a borrowed reference, or if it steals a reference to an
argument, it's always clearly specified in the docs. CPython also provides a
file with details on how different functions affect reference count:
Doc/data/refcounts.dat (but note that it's not completely up to date).
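The three ownership cases (new, borrowed, and stolen references) can be illustrated with a toy model. The types and functions below are hypothetical stand-ins written for this sketch, not the CPython definitions; list_set_item and list_get_item only mimic the documented ownership behavior of PyList_SetItem and PyList_GetItem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins, NOT the CPython definitions. */
typedef struct Obj { long refcnt; } Obj;
typedef struct List { Obj* items[8]; } List;

/* Returns a new reference: the caller owns it. */
static Obj* obj_new(void) {
  Obj* obj = malloc(sizeof *obj);
  if (obj != NULL) obj->refcnt = 1;
  return obj;
}

static void obj_incref(Obj* obj) { obj->refcnt++; }

static void obj_decref(Obj* obj) {
  if (--obj->refcnt == 0) free(obj);
}

/* Steals the caller's reference to item (no incref); releases the old item. */
static void list_set_item(List* list, size_t index, Obj* item) {
  Obj* old = list->items[index];
  list->items[index] = item;
  if (old != NULL) obj_decref(old);
}

/* Returns a borrowed reference (no incref): the list remains the owner. */
static Obj* list_get_item(const List* list, size_t index) {
  return list->items[index];
}

/* Drops every reference the list owns. */
static void list_clear(List* list) {
  for (size_t i = 0; i < 8; i++) {
    if (list->items[i] != NULL) obj_decref(list->items[i]);
    list->items[i] = NULL;
  }
}

/* Walks through the three cases; returns the count seen just before the
 * last release, or -1 if allocation failed. */
static long ownership_demo(void) {
  List list = {{NULL}};
  Obj* item = obj_new();                   /* new reference: we own it */
  if (item == NULL) return -1;
  list_set_item(&list, 0, item);           /* stolen: the list owns it now */
  Obj* borrowed = list_get_item(&list, 0); /* borrowed: must not decref */
  obj_incref(borrowed);                    /* promote to an owned reference */
  list_clear(&list);                       /* list drops its reference */
  long count = borrowed->refcnt;           /* still alive: we hold one */
  obj_decref(borrowed);                    /* release ours; object is freed */
  return count;
}
```

The incref before list_clear is the important step: a borrowed reference is only valid while its owner keeps the object alive, so code that needs the object longer has to take its own reference first.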