Contributing Overview#

Local git conventions#

If you are tracking the Arrow source repository locally, here is a checklist for using git:

  • Work off of your personal fork of apache/arrow and submit pull requests “upstream”.

  • Keep your fork’s main branch synced with upstream/main.

  • Develop on branches, rather than your own “main” branch.

  • It does not matter what you call your branch. Some people like to use the GitHub issue number as branch name, others use descriptive names.

  • Sync your branch with upstream/main regularly, as many commits are merged to main every day.

  • It is recommended to use git rebase rather than git merge.

  • If the rebase produces conflicts and your branch has multiple commits, you can simplify conflict resolution by squashing your local commits into a single commit. Preserving the commit history matters less here, because your feature branch is squashed automatically when it is merged upstream.

    How to squash local commits?

    Abort the rebase with:

    $ git rebase --abort
    

    Then squash the local commits interactively by running:

    $ git rebase --interactive ORIG_HEAD~n
    

    Here, n is the number of commits on your local branch. After the squash, you can try the rebase again, and this time conflict resolution should be relatively straightforward.

    Once you have an updated local copy, you can push to your remote repo. Because your remote repo still holds the old history, you need to force push. In most cases, prefer --force-with-lease:

    $ git push --force-with-lease origin branch
    

    The option --force-with-lease will fail if the remote has commits that are not available locally, for example if additional commits have been made by a colleague. By using --force-with-lease instead of --force, you ensure those commits are not overwritten and can fetch those changes if desired.

    Setting rebase as the default

    If you set the following in your repo’s .git/config, the --rebase option can be omitted from the git pull command, as it is implied by default.

    [pull]
          rebase = true
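
As a rough illustration of the checklist above, the following sketch runs the whole flow in a throwaway sandbox: it syncs main with upstream, rebases a feature branch, and pushes with --force-with-lease. The repositories upstream.git and origin.git, the branch name GH-0000-demo, and the demo identity are placeholders invented for this example; in a real checkout, upstream would point at apache/arrow and origin at your fork.

```shell
#!/bin/sh
# Sandbox demo: every repository below is a throwaway under a temp directory.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)

# Stand-ins for apache/arrow ("upstream") and your personal fork ("origin").
git init --quiet --bare "$sandbox/upstream.git"
git init --quiet --bare "$sandbox/origin.git"

# Your local working copy, with both remotes configured.
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/main
git remote add upstream "$sandbox/upstream.git"
git remote add origin "$sandbox/origin.git"

echo base > README
git add README
git commit --quiet -m "Initial commit"
git push --quiet upstream main
git push --quiet origin main

# Develop on a branch rather than on main, and publish it to the fork.
git checkout --quiet -b GH-0000-demo
echo feature > feature.txt
git add feature.txt
git commit --quiet -m "GH-0000: add feature"
git push --quiet origin GH-0000-demo

# Simulate someone else merging a change into upstream/main.
git clone --quiet --branch main "$sandbox/upstream.git" "$sandbox/other"
echo "upstream change" >> "$sandbox/other/README"
git -C "$sandbox/other" commit --quiet -am "Upstream change"
git -C "$sandbox/other" push --quiet origin main

# 1. Keep the fork's main branch synced with upstream/main.
git fetch --quiet upstream
git checkout --quiet main
git merge --quiet --ff-only upstream/main
git push --quiet origin main

# 2. Rebase the feature branch onto the updated upstream/main.
git checkout --quiet GH-0000-demo
git rebase --quiet upstream/main

# 3. The rebase rewrote history, so the fork needs a (safe) force push.
git push --quiet --force-with-lease origin GH-0000-demo

# Optional: write the [pull] rebase = true setting from the command line.
git config pull.rebase true
```

The last command writes the same [pull] setting shown above into the repository's .git/config.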
    

Pull request and review#

When contributing a patch, use the following checklist for the Apache Arrow workflow:

  • Submit the patch as a GitHub pull request against the main branch.

  • So that your pull request syncs with the GitHub issue, prefix your pull request title with the GitHub issue ID (e.g. GH-14866: [C++] Remove internal GroupBy implementation).

  • Give the pull request a clear, brief description: when the pull request is merged, this will be retained in the extended commit message.

  • Make sure that your code passes the unit tests. You can find instructions on how to run the unit tests for each Arrow component in its respective README file.
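
The title convention in the checklist above can be checked locally before opening a pull request. This is a minimal sketch: the function name check_pr_title is hypothetical, and the pattern is only inferred from the single example title given above, so it may be stricter or looser than what the project actually accepts.

```shell
# Hypothetical helper: succeeds if a title looks like
# "GH-<issue number>: [Component] Summary".
# The pattern is an assumption based on the one example in this guide.
check_pr_title() {
  printf '%s\n' "$1" | grep -Eq '^GH-[0-9]+: (\[[^]]+\])+ .+'
}

if check_pr_title "GH-14866: [C++] Remove internal GroupBy implementation"; then
  echo "title looks fine"
fi
```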

Core developers and others with a stake in the part of the project your change affects will review, request changes, and hopefully indicate their approval in the end. To make the review process smooth for everyone, try to:

  • Break your work into small, single-purpose patches if possible.

    It’s much harder to merge in a large change with a lot of disjoint features, and particularly if you’re new to the project, smaller changes are much easier for maintainers to accept.

  • Add new unit tests for your code.

  • Follow the style guides for the part(s) of the project you’re modifying.

    Some languages (C++ and Python, for example) run a lint check in continuous integration. For all languages, see their respective developer documentation and READMEs for style guidance.

  • Try to make it look as if the codebase has a single author, and emulate any conventions you see, whether or not they are officially documented or checked.

When tests are passing and the pull request has been approved by the interested parties, a committer will merge the pull request. This is done with a command-line utility that does a squash merge.

Details on squash merge

A pull request is merged with a squash merge so that all of your commits are registered as a single commit on the main branch. This simplifies the connection between GitHub issues and commits, makes it easier to bisect history to identify where changes were introduced, and makes it possible to cherry-pick individual patches onto a maintenance branch.

In the GitHub interface, your pull request will appear as “merged”. In the commit message of that commit, the merge tool adds the pull request description, a link back to the pull request, and attribution to the contributor and any co-authors.
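
To see what a squash merge does to history, the effect can be reproduced with plain git in a throwaway repository. This sketch is not the Arrow merge tool itself; it uses git merge --squash as a stand-in, and the branch name, file names, and commit messages are made up for the demo.

```shell
#!/bin/sh
# Sandbox demo of a squash merge using plain git (not the Arrow merge tool).
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init --quiet .
git symbolic-ref HEAD refs/heads/main

echo base > README
git add README
git commit --quiet -m "Initial commit"

# A feature branch with two separate work-in-progress commits.
git checkout --quiet -b feature
echo one > a.txt
git add a.txt
git commit --quiet -m "WIP: first step"
echo two > b.txt
git add b.txt
git commit --quiet -m "WIP: second step"

# Squash merge: stage the combined changes, then record them as one commit.
git checkout --quiet main
git merge --quiet --squash feature
git commit --quiet -m "GH-0000: [Demo] Add a.txt and b.txt"

# main now has the initial commit plus a single commit for the whole branch.
git log --oneline main
```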

AI-generated code#

We recognise that AI coding assistants are now a regular part of many developers’ workflows and can improve productivity. Thoughtful use of these tools can be beneficial, but AI-generated PRs can sometimes lead to undesirable additional maintainer burden. PRs that appear to be fully generated by AI with little to no engagement from the author may be closed without further review.

Human-generated mistakes tend to be easier to spot and reason about, and code review is intended to be a collaborative learning experience that benefits both submitter and reviewer. When a PR appears to have been generated without much engagement from the submitter, reviewers with access to AI tools could more efficiently generate the code directly, and since the submitter is not likely to learn from the review process, their time is more productively spent researching and reporting on the issue.

We are not opposed to the use of AI tools in generating PRs, but recommend the following:

  • Only submit a PR if you are able to debug and own the changes yourself - review all generated code to understand every detail

  • Match the style and conventions used in the rest of the codebase, including PR titles and descriptions

  • Be upfront about AI usage and summarise what was AI-generated

  • If there are parts you don’t fully understand, leave comments on your own PR explaining what steps you took to verify correctness

  • Watch for AI’s tendency to generate overly verbose comments, unnecessary test cases, and incorrect fixes

  • Break down large PRs into smaller ones to make review easier

PR authors are also responsible for disclosing any copyrighted materials in submitted contributions. See the ASF’s guidance on AI-generated code for further information on licensing considerations.

Experimental repositories#

Apache Arrow has an explicit policy on developing experimental repositories, in the context of rules for revolutionaries.

The main motivation for this policy is to offer a lightweight mechanism to conduct experimental work, with the necessary creative freedom, within the ASF and the Apache Arrow governance model. This policy allows committers to work in new repositories, which offer many important tools for managing that work (e.g. GitHub issues, “watch”, and “GitHub stars” to measure overall interest).

Process#

  • A committer may initiate experimental work by creating a separate git repository within Apache Arrow (e.g. via selfserve) and announcing it on the mailing list, together with its goals and a link to the newly created repository.

  • The committer must initiate an email thread with the sole purpose of presenting updates to the community about the status of the repo.

  • There must not be official releases from the repository.

  • Any decision to make the experimental repo official in any way, whether by merging or migrating, must be discussed and voted on on the mailing list.

  • The committer is responsible for managing the repository’s issues, documentation, and CI, including licensing checks.

  • The committer decides when the repository is archived.

Repository management#

  • The repository must be under apache/

  • The repository’s name must be prefixed by arrow-experimental-

  • The committer has full permissions over the repository (within what is possible in the ASF)

  • Push / merge permissions must only be granted to Apache Arrow committers
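
The two naming rules above are simple enough to check mechanically. A minimal sketch, assuming the repository is given in owner/name form; the function name and the example repository names are hypothetical.

```shell
# Hypothetical helper: succeeds only for names of the form
# apache/arrow-experimental-<something>.
is_valid_experimental_repo() {
  case "$1" in
    apache/arrow-experimental-?*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_valid_experimental_repo "apache/arrow-experimental-foo"; then
  echo "name follows the policy"
fi
```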

Development process#

  • The repository must follow the ASF requirements about 3rd party code.

  • The committer decides how to manage issues, PRs, etc.

Divergences#

  • If any of the “must” requirements above is not met and the committer takes no corrective measure upon request, the PMC should take ownership and decide what to do.

Guidance for specific features#

From time to time the community has discussions on specific types of features and improvements that they expect to support. This section outlines decisions that have been made in this regard.

Endianness#

The Arrow format allows setting endianness. Due to the popularity of little-endian architectures, most implementations assume little endian by default. There has been some effort to support big-endian platforms as well. Based on a mailing-list discussion, the requirements for a new platform are:

  1. A robust (non-flaky, returning results in a reasonable time) Continuous Integration setup.

  2. Benchmarks for performance critical parts of the code to demonstrate no regression.

Furthermore, for big-endian support, there are two levels that an implementation can support:

  1. Native endianness (all Arrow communication happens with processes of the same endianness). This includes ancillary functionality such as reading and writing various file formats, such as Parquet.

  2. Cross endian support (implementations will do byte reordering when appropriate for IPC and Flight messages).

The decision on what level to support is based on maintainers’ preferences for complexity and technical risk. In general all implementations should be open to native endianness support (provided the CI and performance requirements are met). Cross endianness support is a question for individual maintainers.
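
When working out which case applies to a particular machine, the host byte order can be inspected from the shell. This is a minimal sketch that uses only the standard printf, od, and tr utilities; the function name host_endianness is made up here and is not part of any Arrow tooling.

```shell
# Hypothetical helper: print "little" or "big" for the host byte order.
host_endianness() {
  # Emit the bytes 0x01 0x00 and let od read them back as one 16-bit
  # integer in host byte order.
  word=$(printf '\001\000' | od -An -tx2 | tr -d ' \t\n')
  case "$word" in
    0001) echo little ;;
    0100) echo big ;;
    *) echo unknown; return 1 ;;
  esac
}

host_endianness
```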

The current implementations aiming for cross endian support are:

  1. C++

Implementations that do not intend to implement cross endian support:

  1. Java

For other libraries, gather consensus through a mailing-list discussion before submitting PRs.