Security Considerations
Important
This document describes the security model for using the Arrow C++ APIs. To better understand this document, we recommend that you first read the overall security model for the Arrow project.
Parameter mismatch
Many Arrow C++ APIs report errors using arrow::Status and
arrow::Result. Such APIs can be assumed to detect common errors in the
provided arguments. However, there are also often implicit pre-conditions that
have to be upheld; these can usually be deduced from the semantics of an API
as described by its documentation.
See also
Arrow C++ Conventions
Pointer validity
Pointers are always assumed to be valid and point to memory of the size required by the API. In particular, it is forbidden to pass a null pointer except where the API documentation explicitly says otherwise.
Type restrictions
Some APIs are specified to operate on specific Arrow data types and may not verify that their arguments conform to the expected data types. Passing the wrong kind of data as input may lead to undefined behavior.
Data validity
Arrow data, for example passed as arrow::Array or arrow::Table,
is always assumed to be valid. If your program may
encounter invalid data, it must explicitly check its validity by calling one of
the following validation APIs.
Structural validity
The Validate methods exposed on various Arrow C++ classes perform relatively
inexpensive checks that the data is structurally valid. This involves
checking the number of buffers, child arrays, and other similar conditions.
These checks are typically constant-time with respect to the number of rows in the data, but linear in the number of descendant fields. They can be good enough to detect potential bugs in your own code. However, they are not enough to detect all classes of invalid data, and they won’t protect against all kinds of malicious payloads.
Full validity
The ValidateFull methods exposed by the same classes perform the same validity
checks as the Validate methods, but they also check the data extensively for
any non-conformance to the Arrow spec. In particular, they check all the offsets
of variable-length data types, which is of fundamental importance when ingesting
untrusted data from sources such as the IPC format (otherwise the variable-length
offsets could point outside of the corresponding data buffer). They also check
for invalid values, such as invalid UTF-8 strings or decimal values out of range
for the advertised precision.
“Safe” and “unsafe” APIs
Some APIs are exposed in both “safe” and “unsafe” variants. The naming convention
for such pairs varies: sometimes the former has a Safe suffix (for example
SliceSafe vs. Slice), sometimes the latter has an Unsafe prefix or
suffix (for example Append vs. UnsafeAppend).
In all cases, the “unsafe” API is intended as a more efficient API that eschews some of the checks that the “safe” API performs. It is then up to the caller to ensure that the preconditions are met; otherwise, undefined behavior may ensue.
The API documentation usually spells out the differences between “safe” and “unsafe” variants, but these typically fall into two categories:
- structural checks, such as passing the right Arrow data type or numbers of buffers;
- allocation size checks, such as having preallocated enough data for the given input arguments (this is typical of the array builders and buffer builders).
Ingesting untrusted data
As an exception to the above (see Data validity), some APIs support ingesting untrusted, potentially malicious data. These are:
- the IPC reader APIs
- the Parquet reader APIs
- the CSV reader APIs
You must not assume that they will always return valid Arrow data. The reason for not validating data automatically is that validation can be expensive but unnecessary when reading from trusted data sources.
Instead, when using these APIs with potentially invalid data (such as data coming from an untrusted source), you must follow these steps:
1. Check any error returned by the API, as with any other API.
2. If the API returned successfully, validate the returned Arrow data in full (see “Full validity” above).