Copyright © 2017 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This specification extends the Media Capture and Streams specification [ GETUSERMEDIA ] to allow a depth-only stream or combined depth+color stream to be requested from the web platform using APIs familiar to web authors.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This extension specification defines a new media type and constrainable property per the Extensibility guidelines of the Media Capture and Streams specification [ GETUSERMEDIA ]. Horizontal reviews and feedback from early implementations of this specification are encouraged.
This document was published by the Device and Sensors Working Group and the Web Real-Time Communications Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-media-capture@w3.org ( subscribe , archives ). All comments are welcome.
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by groups operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures (Device and Sensors Working Group) and a public list of any patent disclosures (Web Real-Time Communications Working Group) made in connection with the deliverables of each group; these pages also include instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 March 2017 W3C Process Document .
Depth cameras are increasingly being integrated into devices such as phones, tablets, and laptops. Depth cameras provide a depth map , which conveys the distance information between points on an object's surface and the camera. With depth information, web content and applications can be enhanced by, for example, the use of hand gestures as an input mechanism, or by creating 3D models of real-world objects that can interact and integrate with the web platform. Concrete applications of this technology include more immersive gaming experiences, more accessible 3D video conferences, and augmented reality, to name a few.
To bring depth capability to the web platform, this specification extends the MediaStream interface [ GETUSERMEDIA ] to enable it to also contain depth-based MediaStreamTracks. A depth-based MediaStreamTrack, referred to as a depth stream track, represents an abstraction of a stream of frames that can each be converted to objects which contain an array of pixel data, where each pixel represents the distance between the camera and the objects in the scene for that point in the array. A MediaStream object that contains one or more depth stream tracks is referred to as a depth-only stream or depth+color stream.
Depth cameras usually produce 16-bit depth values per pixel, so this specification defines a 16-bit grayscale representation of a depth map .
This specification attempts to address the Use Cases and Requirements for accessing a depth stream from a depth camera. See also the Examples section for concrete usage examples.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key word MUST is to be interpreted as described in [ RFC2119 ].
This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [ WEBIDL ], as this specification uses that specification and terminology.
The MediaStreamTrack and MediaStream interfaces this specification extends are defined in [ GETUSERMEDIA ].
The Constraints, MediaTrackSettings, MediaTrackConstraints, MediaTrackSupportedConstraints, MediaTrackCapabilities, and MediaTrackConstraintSet dictionaries this specification extends are defined in [ GETUSERMEDIA ].
The getUserMedia() and getSettings() methods and the NavigatorUserMediaSuccessCallback callback are defined in [ GETUSERMEDIA ].
The concepts muted, disabled, and overconstrained as applied to MediaStreamTrack are defined in [ GETUSERMEDIA ].
The terms source and consumer are defined in [ GETUSERMEDIA ].
The MediaDeviceKind enumeration is defined in [ GETUSERMEDIA ].
The video element and ImageData (and its data attribute and Canvas Pixel ArrayBuffer), VideoTrack, HTMLMediaElement (and its srcObject attribute), HTMLVideoElement interfaces and the CanvasImageSource enum are defined in [ HTML ].
The terms media data , media provider object , assigned media provider object , and the concept potentially playing are defined in [ HTML ].
The term permission and the permission name "camera" are defined in [ PERMISSIONS ].
The DataView, Uint8ClampedArray, and Uint16Array buffer source types are defined in [ WEBIDL ].
The meaning of a dictionary member being present or not present is defined in [ WEBIDL ].
The term depth+color stream means a MediaStream object that contains one or more MediaStreamTrack objects whose videoKind of Settings is "depth" (depth stream track) and one or more MediaStreamTrack objects whose videoKind of Settings is "color" (color stream track).
The term depth-only stream means a MediaStream object that contains one or more MediaStreamTrack objects whose videoKind of Settings is "depth" (depth stream track) only.
The term color-only stream means a MediaStream object that contains one or more MediaStreamTrack objects whose videoKind of Settings is "color" (color stream track) only, and optionally of kind "audio".
The term depth stream track means a MediaStreamTrack object whose videoKind of Settings is "depth". It represents a media stream track whose source is a depth camera.
The term color stream track means a MediaStreamTrack object whose videoKind of Settings is "color". It represents a media stream track whose source is a color camera.
A depth map is an abstract representation of a frame of a depth stream track. A depth map is an image that contains information relating to the distance of the surfaces of scene objects from a viewpoint. A depth map consists of pixels referred to as depth map values. The invalid depth map value is 0; it indicates that the user agent is unable to acquire depth information for the given pixel for any reason.
A depth map has an associated near value which is a double. It represents the minimum range in meters.
A depth map has an associated far value which is a double. It represents the maximum range in meters.
A depth map has an associated horizontal focal length which is a double. It represents the horizontal focal length of the depth camera, in pixels.
A depth map has an associated vertical focal length which is a double. It represents the vertical focal length of the depth camera, in pixels.
A depth map has an associated principal point , specified by principal point x and principal point y coordinates which are double. It is a concept defined in the pinhole camera model; a projection of perspective center to the image plane.
A depth map has an associated transformation from depth to video, which is a transformation matrix represented by a Transformation dictionary. It is used to translate a position in the depth camera's 3D coordinate system to the 3D coordinate system of the RGB video stream's camera (identified by videoDeviceId). After projecting depth 2D pixel coordinates to 3D space, we use this matrix to transform depth camera 3D space coordinates to the RGB video camera's 3D space.
Both depth and color cameras usually introduce significant distortion caused by the camera and lens. While in some cases the effects are not noticeable, these distortions cause errors in image analysis. To map depth map pixel values to corresponding color video track pixels, we use two DistortionCoefficients dictionaries: deprojection distortion coefficients and projection distortion coefficients.
Deprojection distortion coefficients are used to compensate for camera distortion when deprojecting 2D pixel coordinates to 3D space coordinates. Projection distortion coefficients are used in the opposite case, when projecting camera 3D space points to pixels. A single track does not have both sets of coefficients specified. The most common scenario is that the depth track has deprojection distortion coefficients or that the color video track has projection distortion coefficients. For the details, see the algorithm to map depth pixels to color pixels.
The data type of a depth map is 16-bit unsigned integer. The algorithm to convert the depth map value to grayscale , given a depth map value d , is as follows:
The rules to convert using range linear are as given in the following formula:
\[ d_n = \frac{d - near}{far - near} \]
\[ d_{16bit} = \lfloor d_n \cdot 65535 \rfloor \]
The depth measurement d (in meter units) is recovered by solving the rules to convert using range linear for d as follows:
\[ d_n = \frac{d_{16bit}}{65535} \]
\[ d = (d_n \cdot (far - near)) + near \]
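The following non-normative JavaScript sketch illustrates both conversions; the near and far parameters are assumed to hold the depth map's near value and far value in meters.

// Convert a depth measurement d (in meters) to its 16-bit representation
// using the range linear rules, and recover an approximation of d from it.
function depthToUint16(d, near, far) {
  const dn = (d - near) / (far - near);
  return Math.floor(dn * 65535);
}
function uint16ToDepth(d16bit, near, far) {
  const dn = d16bit / 65535;
  return dn * (far - near) + near;
}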
MediaTrackSupportedConstraints dictionary
partial dictionary MediaTrackSupportedConstraints {
boolean videoKind = true;
boolean depthNear = true;
boolean depthFar = true;
boolean focalLengthX = true;
boolean focalLengthY = true;
boolean principalPointX = true;
boolean principalPointY = true;
boolean deprojectionDistortionCoefficients = false;
boolean projectionDistortionCoefficients = false;
boolean depthToVideoTransform = false;
};
MediaTrackCapabilities dictionary
partial dictionary MediaTrackCapabilities {
DOMString videoKind;
(double or DoubleRange) depthNear;
(double or DoubleRange) depthFar;
(double or DoubleRange) focalLengthX;
(double or DoubleRange) focalLengthY;
(double or DoubleRange) principalPointX;
(double or DoubleRange) principalPointY;
boolean deprojectionDistortionCoefficients;
boolean projectionDistortionCoefficients;
boolean depthToVideoTransform;
};
MediaTrackConstraintSet dictionary
partial dictionary MediaTrackConstraintSet {
ConstrainDOMString videoKind;
ConstrainDouble depthNear;
ConstrainDouble depthFar;
ConstrainDouble focalLengthX;
ConstrainDouble focalLengthY;
ConstrainDouble principalPointX;
ConstrainDouble principalPointY;
ConstrainBoolean deprojectionDistortionCoefficients;
ConstrainBoolean projectionDistortionCoefficients;
ConstrainBoolean depthToVideoTransform;
};
MediaTrackSettings dictionary
partial dictionary MediaTrackSettings {
  DOMString videoKind;
  double depthNear;
  double depthFar;
  double focalLengthX;
  double focalLengthY;
  double principalPointX;
  double principalPointY;
  DistortionCoefficients deprojectionDistortionCoefficients;
  DistortionCoefficients projectionDistortionCoefficients;
  Transformation depthToVideoTransform;
};

dictionary DistortionCoefficients {
  double k1;
  double k2;
  double p1;
  double p2;
  double k3;
};

dictionary Transformation {
  Float32Array transformationMatrix;
  DOMString videoDeviceId;
};
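As a non-normative illustration, the extended settings could be read from a depth stream track roughly as follows; depthTrack is an assumed variable holding a MediaStreamTrack whose videoKind is "depth".

// Read the extended settings of a depth stream track (illustrative sketch).
const settings = depthTrack.getSettings();
console.log(settings.videoKind);                                  // "depth"
console.log(settings.depthNear, settings.depthFar);               // range, in meters
console.log(settings.focalLengthX, settings.focalLengthY);        // focal lengths, in pixels
console.log(settings.principalPointX, settings.principalPointY);  // principal point, in pixels
console.log(settings.depthToVideoTransform);                      // { transformationMatrix, videoDeviceId }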
The DistortionCoefficients dictionary has the k1, k2, p1, p2 and k3 dictionary members that represent the deprojection distortion coefficients or projection distortion coefficients. k1, k2 and k3 are radial distortion coefficients while p1 and p2 are tangential distortion coefficients. Radial distortion coefficients and tangential distortion coefficients are used when there is a need to deproject a depth value to 3D space or to project a 3D value to 2D video frame coordinates. See the algorithm to map depth pixels to color pixels and the Brown-Conrady distortion model implementation in the 3D point cloud rendering example GLSL shader.
The Transformation dictionary has the transformationMatrix dictionary member that is a 16-element array that defines the transformation matrix from the depth map camera's 3D coordinate system to the video track camera's 3D coordinate system.
The first four elements of the array correspond to the first matrix row, followed by the four elements of the second matrix row, and so on. It is in a format suitable for use with WebGL's uniformMatrix4fv.
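For example, the matrix can be passed to a mat4 uniform roughly as in the following non-normative sketch, which assumes a WebGL context gl, a linked program, and settings obtained from the depth track's getSettings(); the uniform name matches the u_depth_to_color uniform used in the 3D point cloud rendering example.

// Pass the 16-element transformation matrix to a GLSL mat4 uniform (sketch).
const location = gl.getUniformLocation(program, "u_depth_to_color");
gl.uniformMatrix4fv(location, false,
                    settings.depthToVideoTransform.transformationMatrix);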
The videoDeviceId dictionary member represents the deviceId of the video camera that the depth stream must be synchronized with. The value of videoDeviceId can be used as a deviceId constraint in [ GETUSERMEDIA ] to get the corresponding video and audio streams.
The following constrainable properties are defined to apply only to video MediaStreamTrack objects:
| Property Name | Values | Notes |
|---|---|---|
| MediaTrackSupportedConstraints.videoKind, MediaTrackCapabilities.videoKind, MediaTrackConstraintSet.videoKind, MediaTrackSettings.videoKind | ConstrainDOMString | This string should be one of the members of VideoKindEnum. The members describe the kind of video that the camera can capture. Note that getConstraints may not return exactly the same string for strings not in this enum. This preserves the possibility of using a future version of WebIDL enum for this property. |
enum VideoKindEnum {
"color",
"depth"
};
| Enumeration description | |
|---|---|
| color | The source is capturing color images. |
| depth | The source is capturing depth maps. |
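As a non-normative example, a page could request a depth-only stream using the videoKind constraint along the following lines (error handling omitted; the code is assumed to run inside an async function).

// Request a depth stream track using the videoKind constraint (sketch).
const stream = await navigator.mediaDevices.getUserMedia({
  video: { videoKind: { exact: "depth" } }
});
const [depthTrack] = stream.getVideoTracks();
console.log(depthTrack.getSettings().videoKind);  // "depth"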
The MediaStream consumer for the depth-only stream and depth+color stream is the video element [ HTML ]. If a MediaStreamTrack whose videoKind of Settings is "depth" is muted or disabled, it MUST render frames as if all the pixels were 0.
This section is non-normative.
A color stream track and a depth stream track can be combined into one depth+color stream. The rendering of the two tracks is intended to be synchronized, the resolution of the two tracks is intended to be the same, and the coordination of the two tracks is intended to be calibrated. These are not hard requirements, since it might not be possible to synchronize tracks from different sources.
This approach is simple to use but comes with the following caveats: it might not be supported by the implementation, and the resolutions of the two tracks are intended to be the same, which can require downsampling and degrade quality. The alternative approach is for a web developer to implement the algorithm to map depth pixels to color pixels. See the 3D point cloud rendering example code.
The following constrainable properties are defined to apply only to depth stream track s:
| Property Name | Values | Notes |
|---|---|---|
| MediaTrackSupportedConstraints.depthNear, MediaTrackCapabilities.depthNear, MediaTrackConstraintSet.depthNear, MediaTrackSettings.depthNear | ConstrainDouble | The near value, in meters. |
| MediaTrackSupportedConstraints.depthFar, MediaTrackCapabilities.depthFar, MediaTrackConstraintSet.depthFar, MediaTrackSettings.depthFar | ConstrainDouble | The far value, in meters. |
| MediaTrackSupportedConstraints.focalLengthX, MediaTrackCapabilities.focalLengthX, MediaTrackConstraintSet.focalLengthX, MediaTrackSettings.focalLengthX | ConstrainDouble | The horizontal focal length, in pixels. |
| MediaTrackSupportedConstraints.focalLengthY, MediaTrackCapabilities.focalLengthY, MediaTrackConstraintSet.focalLengthY, MediaTrackSettings.focalLengthY | ConstrainDouble | The vertical focal length, in pixels. |
| MediaTrackSupportedConstraints.principalPointX, MediaTrackCapabilities.principalPointX, MediaTrackConstraintSet.principalPointX, MediaTrackSettings.principalPointX | ConstrainDouble | The principal point x coordinate, in pixels. |
| MediaTrackSupportedConstraints.principalPointY, MediaTrackCapabilities.principalPointY, MediaTrackConstraintSet.principalPointY, MediaTrackSettings.principalPointY | ConstrainDouble | The principal point y coordinate, in pixels. |
| MediaTrackSupportedConstraints.deprojectionDistortionCoefficients, MediaTrackCapabilities.deprojectionDistortionCoefficients, MediaTrackConstraintSet.deprojectionDistortionCoefficients, MediaTrackSettings.deprojectionDistortionCoefficients | ConstrainDOMDictionary | The depth map's deprojection distortion coefficients, used when deprojecting from 2D pixel coordinates to 3D space. |
| MediaTrackSupportedConstraints.projectionDistortionCoefficients, MediaTrackCapabilities.projectionDistortionCoefficients, MediaTrackConstraintSet.projectionDistortionCoefficients, MediaTrackSettings.projectionDistortionCoefficients | ConstrainDOMDictionary | The depth map's projection distortion coefficients, used when projecting from 3D space to 2D pixel coordinates. |
| MediaTrackSupportedConstraints.depthToVideoTransform, MediaTrackCapabilities.depthToVideoTransform, MediaTrackConstraintSet.depthToVideoTransform, MediaTrackSettings.depthToVideoTransform | ConstrainDOMDictionary | The depth map camera's transformation from depth to video camera 3D coordinate system. |
The depthNear and depthFar constrainable properties, when set, allow the implementation to pick the best depth camera mode optimized for the range [near, far] and help minimize the error introduced by the lossy conversion from the depth value d to a quantized d16bit and back to an approximation of the depth value d.
If the depthFar property's value is less than the depthNear property's value, the depth stream track is overconstrained.
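For example, an application tracking close-range hand gestures might constrain the range as in the following non-normative sketch (the 0.2 m and 1.5 m values are illustrative; a range with depthFar less than depthNear would make the track overconstrained).

// Ask the implementation to optimize the depth camera mode for a short range.
await depthTrack.applyConstraints({
  depthNear: 0.2,  // meters
  depthFar: 1.5    // meters
});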
If the near value , far value , horizontal focal length or vertical focal length is fixed due to a hardware or software limitation, the corresponding constrainable property's value MUST be set to the value reported by the underlying implementation. (For example, the focal lengths of the lens may be fixed, or the underlying platform may not expose the focal length information.)
WebGLRenderingContext interface
This section is non-normative.
There are several use cases that are a good fit to be, at least partially, implemented on the GPU, such as motion recognition, pattern recognition, background removal, and 3D point cloud rendering.
This section explains which APIs can be used for some of these mentioned use-cases; the concrete examples are provided in the Examples section.
A video element whose source is a MediaStream object containing a depth stream track may be uploaded to a WebGL texture of format RGBA or RED and type FLOAT. See the specification [ WEBGL ] and the upload to float texture example code.
For each pixel of this WebGL texture, the R component represents the normalized 16-bit value following the formula: \( d_{float} = \frac{d_{16bit}}{65535.0} \)
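A minimal upload sketch follows; it assumes a WebGL 2 context gl and a video element whose srcObject is a depth-only stream (the normative details are in the upload to float texture example).

// Upload the current depth video frame to a single-channel float texture.
const texture = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, texture);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
// Each texel's R component holds d_float = d_16bit / 65535.0.
gl.texImage2D(gl.TEXTURE_2D, 0, gl.R32F, gl.RED, gl.FLOAT, video);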
This section is non-normative.
Here we list some of the possible approaches.
The performance of the synchronous readPixels from float texture example in current implementations suffices for some of the use cases, because there is no rendering to the float texture bound to a named framebuffer.
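A sketch of such a synchronous read-back follows, assuming the float texture is attached as the color attachment of a framebuffer fb and that RGBA/FLOAT reads are supported by the implementation (width and height are assumed variables).

// Read the float depth texture back to the CPU (sketch).
gl.bindFramebuffer(gl.FRAMEBUFFER, fb);
const pixels = new Float32Array(width * height * 4);
gl.readPixels(0, 0, width, height, gl.RGBA, gl.FLOAT, pixels);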
This section is non-normative.
The algorithms presented in this section explain how a web developer can map depth and color pixels. Concrete example on how to do the mapping is provided in example vertex shader used for 3D point cloud rendering .
When rendering, we want to position a color value from the color video frame at the corresponding depth map value, or at the 3D point in space defined by the depth map value. We use deprojection distortion coefficients to compensate for camera distortion when deprojecting 2D pixel coordinates to 3D space coordinates, and projection distortion coefficients in the opposite case, when projecting camera 3D space points to pixels.
The algorithm to map depth pixels to color pixels is as follows:
The algorithm to deproject a depth map value to a point in the depth camera's 3D coordinate system is as follows:
Let dx and dy be 2D coordinates, in pixels, of a pixel in depth map .
Let dz be depth map value of the same pixel in the depth map .
Let fx and fy be depth map 's horizontal focal length and vertical focal length respectively.
Let cx and cy be depth map 's principal point 2D coordinates.
Let 3D coordinates (Xd, Yd, Zd) be the output of this step - a 3D point in depth camera's 3D coordinate system.
\[ p_x = \frac{d_x - c_x}{f_x} \]
\[ p_y = \frac{d_y - c_y}{f_y} \]
If deprojection distortion coefficients are not present, 3D coordinates (Xd, Yd, Zd) in depth camera space are calculated as:
\[ X_d = d_z \cdot p_x \]
\[ Y_d = d_z \cdot p_y \]
\[ Z_d = d_z \]
If deprojection distortion coefficients are present, 3D coordinates (Xd, Yd, Zd) in depth camera space are calculated as:
\[ r_2 = p_x^2 + p_y^2 \]
\[ r = 1 + k_1 \cdot r_2 + k_2 \cdot r_2^2 + k_3 \cdot r_2^3 \]
\[ X_d = d_z \cdot \left(p_x \cdot r + 2 \cdot p_1 \cdot p_x \cdot p_y + p_2 \cdot (r_2 + 2 \cdot p_x^2)\right) \]
\[ Y_d = d_z \cdot \left(p_y \cdot r + 2 \cdot p_2 \cdot p_x \cdot p_y + p_1 \cdot (r_2 + 2 \cdot p_y^2)\right) \]
\[ Z_d = d_z \]
See the depth_deproject function in the 3D point cloud rendering example.
The result of the deproject depth map value to point in depth camera step, the 3D point (Xd, Yd, Zd), is in the depth camera's 3D coordinate system. To transform the coordinates of the same point in space to the color camera's 3D coordinate system, we multiply the transformation from depth to video matrix by the (Xd, Yd, Zd) 3D point vector.
Let (Xc, Yc, Zc) be the output of this step: the 3D coordinates of the depth map value projected into color camera 3D space.
Let M be the transformation matrix defined in the depth map's depthToVideoTransform field.
To multiply a 4×4 matrix by a 3-element vector, we extend the 3D vector by one element to a 4-dimensional vector. After multiplication, we use the vector's x, y and z coordinates as the result.
\[ \begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = \left( M \times \begin{pmatrix} X_d \\ Y_d \\ Z_d \\ 1 \end{pmatrix} \right).xyz \]
In the 3D point cloud rendering example, this is done by: vec4 color_point = u_depth_to_color * vec4(depth_point, 1.0);
To project from color camera 3D coordinates to 2D pixel coordinates we use the corresponding color track's MediaTrackSettings. We get the color track using the depth map's Transformation.videoDeviceId; it represents the deviceId of the target color video device that should be used as a constraint in a [ GETUSERMEDIA ] call to get the corresponding color video stream track. After that, we use the color track's getSettings() to access its MediaTrackSettings.
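A non-normative sketch of this step, assuming depthTrack is a depth stream track and the code runs inside an async function:

// Use the depth track's videoDeviceId to obtain the matching color track,
// then read its intrinsics for the projection step below.
const { videoDeviceId } = depthTrack.getSettings().depthToVideoTransform;
const colorStream = await navigator.mediaDevices.getUserMedia({
  video: { deviceId: { exact: videoDeviceId } }
});
const colorSettings = colorStream.getVideoTracks()[0].getSettings();
// colorSettings.focalLengthX/Y and principalPointX/Y give fxc, fyc, cxc, cyc.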
Let \( f_{xc} \) and \( f_{yc} \) be the color track's horizontal focal length and vertical focal length respectively.
Let \( c_{xc} \) and \( c_{yc} \) be the color track's principal point 2D coordinates.
The result of this step is the 2D coordinate of the pixel in the color video frame (x, y).
If projection distortion coefficients are present, the position of the pixel in the color frame image (x, y) is calculated as:
\[ r_{2c} = (X_c)^2 + (Y_c)^2 \]
\[ r = 1 + k_1 \cdot r_{2c} + k_2 \cdot r_{2c}^2 + k_3 \cdot r_{2c}^3 \]
\[ p_{xc} = r \cdot \frac{X_c}{Z_c} \]
\[ p_{yc} = r \cdot \frac{Y_c}{Z_c} \]
\[ x = \left(p_{xc} + 2 \cdot p_1 \cdot p_{xc} \cdot p_{yc} + p_2 \cdot (r_{2c} + 2 \cdot p_{xc}^2)\right) \cdot f_{xc} + c_{xc} \]
\[ y = \left(p_{yc} + 2 \cdot p_2 \cdot p_{xc} \cdot p_{yc} + p_1 \cdot (r_{2c} + 2 \cdot p_{yc}^2)\right) \cdot f_{yc} + c_{yc} \]
If projection distortion coefficients are not present, the position of the pixel in the color frame image (x, y) is calculated as:
\[ p_{xc} = \frac{X_c}{Z_c} \]
\[ p_{yc} = \frac{Y_c}{Z_c} \]
\[ x = p_{xc} \cdot f_{xc} + c_{xc} \]
\[ y = p_{yc} \cdot f_{yc} + c_{yc} \]
See the color_project function in the 3D point cloud rendering example.
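For reference, the following non-normative JavaScript sketch chains the steps above for the common case where no distortion coefficients are present; depthSettings and colorSettings are assumed to be the two tracks' MediaTrackSettings, and M the 16-element row-major transformationMatrix described earlier.

// Map a depth pixel (dx, dy) with depth map value dz (meters) to a color pixel.
function mapDepthPixelToColorPixel(dx, dy, dz, depthSettings, colorSettings, M) {
  // Deproject to a 3D point in the depth camera's coordinate system.
  const px = (dx - depthSettings.principalPointX) / depthSettings.focalLengthX;
  const py = (dy - depthSettings.principalPointY) / depthSettings.focalLengthY;
  const Xd = dz * px, Yd = dz * py, Zd = dz;
  // Transform into the color camera's coordinate system: M * [Xd, Yd, Zd, 1].
  const Xc = M[0] * Xd + M[1] * Yd + M[2]  * Zd + M[3];
  const Yc = M[4] * Xd + M[5] * Yd + M[6]  * Zd + M[7];
  const Zc = M[8] * Xd + M[9] * Yd + M[10] * Zd + M[11];
  // Project onto the color frame.
  const x = (Xc / Zc) * colorSettings.focalLengthX + colorSettings.principalPointX;
  const y = (Yc / Zc) * colorSettings.focalLengthY + colorSettings.principalPointY;
  return { x, y };
}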
This section is non-normative.
Use the following to check whether readPixels to gl.RED or gl.RGBA float is supported:

gl.getParameter(gl.IMPLEMENTATION_COLOR_READ_FORMAT);
This section is non-normative.
The privacy and security considerations discussed in [ GETUSERMEDIA ] apply to this extension specification.
Thanks to everyone who contributed to the Use Cases and Requirements , sent feedback and comments. Special thanks to Ningxin Hu for experimental implementations, as well as to the Project Tango for their experiments.