Developing PyArrow#
Coding Style#
We follow a PEP8-like coding style similar to that of the pandas project. To check for and fix
style issues, use the pre-commit command:
$ pre-commit run --show-diff-on-failure --color=always --all-files python
Unit Testing#
We are using pytest to develop our unit test suite. After building the project you can run its unit tests like so:
$ pushd arrow/python
$ python -m pytest pyarrow
$ popd
Package requirements to run the unit tests are found in
requirements-test.txt and can be installed if needed with pip install -r
requirements-test.txt.
If you get import errors for pyarrow._lib or another PyArrow module when
trying to run the tests, run python -m pytest arrow/python/pyarrow and check
if the editable version of pyarrow was installed correctly.
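One way to see which copy of pyarrow Python would import is to ask the import system where it resolves the package. The helper below is a minimal sketch using only the standard library; the name `module_origin` is hypothetical and not part of PyArrow. For an editable install, the reported path would typically point into your source checkout rather than into site-packages.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None.

    Hypothetical helper for illustration; it does not import the module,
    it only asks the import system where the module would come from.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return None if spec is None else spec.origin


# Prints None when pyarrow cannot be found on the import path.
print("pyarrow resolves to:", module_origin("pyarrow"))
```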
The project has a number of custom command line options for its test suite. Some tests are disabled by default, for example. To see all the options, run
$ python -m pytest pyarrow --help
and look for the “custom options” section.
Note
There are a few low-level tests written directly in C++. These tests are
implemented in pyarrow/src/arrow/python/python_test.cc,
but they are also wrapped in a pytest-based
test module
that runs automatically as part of the PyArrow test suite.
Test Groups#
We have many tests that are grouped together using pytest marks. Some of these
are disabled by default. To enable a test group, pass --$GROUP_NAME,
e.g. --parquet. To disable a test group, prepend disable-, for example
--disable-parquet. To run only the unit tests for a
particular group, prepend only- instead, for example --only-parquet.
The test groups currently include:
- dataset: Apache Arrow Dataset tests
- flight: Flight RPC tests
- gandiva: tests for Gandiva expression compiler (uses LLVM)
- hdfs: tests that use libhdfs to access the Hadoop filesystem
- hypothesis: tests that use the hypothesis module for generating random test cases. Note that --hypothesis doesn’t work due to a quirk with pytest, so you have to pass --enable-hypothesis
- large_memory: tests requiring a large amount of system RAM
- orc: Apache ORC tests
- parquet: Apache Parquet tests
- s3: tests for Amazon S3
- tensorflow: tests that involve TensorFlow
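The flag semantics described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not PyArrow's actual conftest.py: the function names `resolve_flags` and `should_run` are hypothetical, the default values are placeholders, and the model assumes that an --enable- prefix is accepted for every group (the text above only confirms it for hypothesis). It also ignores the pytest quirk that makes --hypothesis unusable.

```python
def resolve_flags(defaults, argv):
    """Apply --GROUP, --enable-GROUP, --disable-GROUP and --only-GROUP.

    defaults maps each group name to whether it is enabled by default.
    Returns (enabled, only), where only is a group name or None.
    """
    enabled = dict(defaults)
    only = None
    for arg in argv:
        if not arg.startswith("--"):
            continue
        name = arg[2:]
        if name.startswith("disable-") and name[len("disable-"):] in enabled:
            enabled[name[len("disable-"):]] = False
        elif name.startswith("only-") and name[len("only-"):] in enabled:
            only = name[len("only-"):]
        elif name.startswith("enable-") and name[len("enable-"):] in enabled:
            enabled[name[len("enable-"):]] = True
        elif name in enabled:
            enabled[name] = True
    return enabled, only


def should_run(test_groups, enabled, only=None):
    """Decide whether a test marked with test_groups would run."""
    if only is not None:
        # Assumption: --only-GROUP selects just the tests in that group.
        return only in test_groups
    # Otherwise every group the test belongs to must be enabled.
    return all(enabled.get(group, False) for group in test_groups)


# Placeholder defaults, not PyArrow's real ones.
defaults = {"parquet": True, "large_memory": False}
enabled, only = resolve_flags(defaults, ["--disable-parquet", "--large_memory"])
print(enabled, only)
```

In this model a test without any group mark always runs unless an --only- flag is given, because `all()` over an empty set of groups is true.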
Type Checking#
PyArrow provides type stubs (*.pyi files) for static type checking. These
stubs are located in the pyarrow-stubs/ directory and are automatically
included in the distributed wheel packages.
Running Type Checkers#
We support multiple type checkers. Their configurations are in
pyproject.toml.
mypy
To run mypy on the PyArrow codebase:
$ cd arrow/python
$ mypy
The mypy configuration is in the [tool.mypy] section of pyproject.toml.
pyright
To run pyright:
$ cd arrow/python
$ pyright
The pyright configuration is in the [tool.pyright] section of pyproject.toml.
ty
To run ty (note: currently only partially configured):
$ cd arrow/python
$ ty check
Maintaining Type Stubs#
Type stubs for PyArrow are maintained in the pyarrow-stubs/
directory. These stubs mirror the structure of the main pyarrow/ package.
When adding or modifying public APIs:
1. Update the corresponding .pyi stub file in pyarrow-stubs/ to reflect the new or changed function/class signatures.
2. Include type annotations where possible. For Cython modules or dynamically generated APIs such as compute kernels, add the corresponding stub in pyarrow-stubs/.
3. Run type checkers to ensure the stubs are correct and complete.
The stub files are automatically copied into the built wheel during the build process and will be included when users install PyArrow, enabling type checking in downstream projects and for users’ IDEs.
Note: The py.typed marker file in the pyarrow/ directory indicates to type
checkers that PyArrow supports type checking according to PEP 561.
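As a rough illustration of what a stub entry looks like, the snippet below shows the contents of a hypothetical .pyi file. The file name, the function `scale`, and the class `Accumulator` are invented for this example and do not exist in PyArrow. A stub declares signatures only: bodies and default values are replaced with `...`.

```python
# Contents of a hypothetical pyarrow-stubs/_example.pyi
# (all names here are illustrative, not real PyArrow APIs).
from typing import Sequence


def scale(values: Sequence[float], factor: float = ...) -> list[float]: ...


class Accumulator:
    total: float

    def __init__(self, start: float = ...) -> None: ...
    def add(self, value: float) -> "Accumulator": ...
```

Stub syntax is a subset of ordinary Python, so a file like this parses with the regular interpreter even though type checkers are its intended consumers.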
Doctest#
We are using doctest to check that docstring examples are up-to-date and correct. You can also do that locally by running:
$ pushd arrow/python
$ python -m pytest --doctest-modules
$ python -m pytest --doctest-modules path/to/module.py # checking single file
$ popd
for .py files or
$ pushd arrow/python
$ python -m pytest --doctest-cython
$ python -m pytest --doctest-cython path/to/module.pyx # checking single file
$ popd
for .pyx and .pxi files. In this case you will also need to
install the pytest-cython plugin.
Testing Documentation Examples#
Documentation examples in .rst files under docs/source/python/ use
doctest syntax and can be tested locally using:
$ pushd arrow/python
$ pytest --doctest-glob="*.rst" docs/source/python/file.rst # checking single file
$ pytest --doctest-glob="*.rst" docs/source/python # checking entire directory
$ popd
The examples use standard doctest syntax with >>> for Python prompts and
... for continuation lines. The conftest.py fixture automatically
handles temporary directory setup for examples that create files.
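To make the syntax concrete, here is a small sketch that checks an .rst-style snippet with the standard library doctest module directly. The snippet text and the helper name `check_snippet` are made up for illustration, and the example deliberately avoids pyarrow so it runs anywhere. This is not how the PyArrow test suite is wired up, which goes through pytest as shown above.

```python
import doctest

# A made-up documentation fragment using >>> prompts and ... continuations.
RST_SNIPPET = """
Summing values
--------------

>>> values = [1, 2, 3]
>>> total = sum(
...     values
... )
>>> total
6
"""


def check_snippet(text):
    """Run the doctest examples found in text and return the results."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {}, "snippet", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = check_snippet(RST_SNIPPET)
print(results)
```

The returned object carries `failed` and `attempted` counts; a statement that spans several lines still counts as a single example.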
Debugging#
Debug build#
Since PyArrow depends on the Arrow C++ libraries, debugging can
frequently involve crossing between Python and C++ shared libraries.
For the best experience, make sure you’ve built both Arrow C++
(-DCMAKE_BUILD_TYPE=Debug) and PyArrow (export PYARROW_BUILD_TYPE=debug)
in debug mode.
Using gdb on Linux#
To debug the C++ libraries with gdb while running the Python unit tests, first start pytest with gdb:
$ gdb --args python -m pytest pyarrow/tests/test_to_run.py -k $TEST_TO_MATCH
To set a breakpoint, use the same gdb syntax that you would when debugging a C++ program, for example:
(gdb) b src/arrow/python/arrow_to_pandas.cc:1874
No source file named src/arrow/python/arrow_to_pandas.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/arrow/python/arrow_to_pandas.cc:1874) pending.
See also
Similarly, use lldb when debugging on macOS.
Benchmarking#
For running the benchmarks, see Benchmarks.