Journey of a Point: From World to Screen Space

Introduction

This short tutorial aims to equip you with the bare-bones computer graphics and linear algebra knowledge needed to solve the 4th assignment. It should also help you make sense of additional OpenGL resources you may read (OpenGL is, after all, a 3D API, while we're working in 2D).

OpenGL Conventions

There are many conventions in computer graphics, such as left- vs right-handed coordinate systems (some with y pointing up, some with z), row vs column vectors, etc. Before we try to understand geometric transformations, it is important to establish a fixed set of conventions so we all "speak the same language". To make things easier, assignment 4 strictly uses OpenGL conventions rather than inventing new ones, meaning:

OpenGL uses a right-handed coordinate system for world and eye/camera space, with the origin sitting in the center of your screen, x pointing to the right, y pointing up and negative z pointing into your screen (with positive z coming out of the screen towards you). For clip space and normalized device coordinates (NDC) it switches to a left-handed coordinate system (the z-axis flips).
OpenGL uses column vectors, meaning matrices are multiplied with vectors in the order matrix * vector, not the other way around (remember that you need to use math.multiply() for matrices and vectors). The second implication of using column vectors is that transformations are applied right to left, or from the inside out. To first translate/move a point and then rotate it, you'd calculate (ROT * (TRANSL * p)). The parentheses are added for clarity and are not otherwise needed, since matrix multiplication is associative (but not commutative).
OpenGL's NDC range from [-1,-1,-1] to [1,1,1] (DirectX uses [-1,-1,0] to [1,1,1] on the other hand).
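
The right-to-left ordering of column-vector transformations can be illustrated with a minimal sketch. It uses plain arrays and two small helpers (`matMul`, `matVec`) written only for this example, so it does not depend on any library; the 2D homogeneous layout [x, y, w] and the helper names `translation` and `rotation` are assumptions made for illustration.

```javascript
// Multiply two 3x3 matrices given as arrays of rows.
function matMul(a, b) {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0))
  );
}

// Multiply a 3x3 matrix with a column vector [x, y, w].
function matVec(m, v) {
  return m.map((row) => row.reduce((sum, x, k) => sum + x * v[k], 0));
}

// Hypothetical helpers building homogeneous 2D transformation matrices.
function translation(tx, ty) {
  return [[1, 0, tx], [0, 1, ty], [0, 0, 1]];
}

function rotation(angle) {
  const c = Math.cos(angle);
  const s = Math.sin(angle);
  return [[c, -s, 0], [s, c, 0], [0, 0, 1]]; // standard 2D rotation matrix
}

const p = [1, 0, 1];
const TRANSL = translation(2, 0);
const ROT = rotation(Math.PI / 2);

// Translate first, then rotate: the rightmost matrix acts first.
const stepwise = matVec(ROT, matVec(TRANSL, p));

// Associativity: combining the matrices first gives the same point.
const combined = matVec(matMul(ROT, TRANSL), p);

// Not commutative: swapping the order rotates first, then translates.
const swapped = matVec(matMul(TRANSL, ROT), p);

console.log(stepwise, combined, swapped);
```

Up to floating-point rounding, `stepwise` and `combined` both come out as [0, 3, 1], while `swapped` is [2, 1, 1], a different point.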

3D⇒2D conversion

OpenGL is a 3D API while we're working in 2D, so we must also define the corresponding conversion. Our 2D world is a 3D world viewed from the top, looking straight down. In other words, x points to the right, negative z points up, and y comes out of the screen towards us. However, all y-coordinates are 0 since our world is entirely flat. Since it doesn't make sense to use 3D vectors (all with their y-component equal to 0) to describe a 2D world, we simply drop the y and rename z⇒y. In other words:

To convert from OpenGL to our world we do [x, y, z, w]⇒[x, z, w].
To convert from our world to OpenGL we do [x, y, w]⇒[x, 0, y, w].
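
The two conversion rules above can be written as a pair of small functions. This is only a sketch; the function names `worldToOpenGL` and `openGLToWorld` are made up for illustration.

```javascript
// Our 2D world [x, y, w] -> OpenGL [x, 0, y, w]:
// our y becomes OpenGL's z, and OpenGL's y is 0 because the world is flat.
function worldToOpenGL([x, y, w]) {
  return [x, 0, y, w];
}

// OpenGL [x, y, z, w] -> our 2D world [x, z, w]:
// OpenGL's y is dropped and z is renamed to y.
function openGLToWorld([x, _y, z, w]) {
  return [x, z, w];
}

const worldPoint = [3, -4, 1];
const glPoint = worldToOpenGL(worldPoint); // [3, 0, -4, 1]
const roundTrip = openGLToWorld(glPoint);  // [3, -4, 1]

console.log(glPoint, roundTrip);
```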

World Space

We start in world space where all objects' positions are defined in absolute coordinates inside the "world coordinate system".

[Figure: world space]

Even the camera/observer/sensor has an absolute position (and orientation, and other properties such as field of view).

Eye Space

The first transformation we do is from world to eye/camera space by multiplying our coordinates with the worldToEye matrix.

[Figure: eye space]

This has the effect of moving all objects together so that the camera ends up sitting at the origin, looking down negative y (negative z in OpenGL). Why would you do this? To simplify certain calculations. Imagine you'd like to calculate the depth (which is not the same as the distance) of a grid cell's center as viewed from the camera.
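
To make the idea concrete, here is a minimal sketch of how a worldToEye matrix could be built and used. The camera description (a position plus a heading angle, with heading 0 meaning the camera already looks down negative y) and the names `makeWorldToEye` and `depthInEyeSpace` are assumptions for illustration; the matrix used in the assignment may be constructed differently. Once a point is in eye space, its depth is simply its negated y component.

```javascript
// Multiply two 3x3 matrices given as arrays of rows.
function matMul(a, b) {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0))
  );
}

// Multiply a 3x3 matrix with a column vector [x, y, w].
function matVec(m, v) {
  return m.map((row) => row.reduce((sum, x, k) => sum + x * v[k], 0));
}

// Hypothetical construction: undo the camera's translation first,
// then undo its rotation (transformations apply right to left).
function makeWorldToEye(cx, cy, heading) {
  const c = Math.cos(heading);
  const s = Math.sin(heading);
  const undoRotation = [[c, s, 0], [-s, c, 0], [0, 0, 1]]; // transpose of the rotation
  const undoTranslation = [[1, 0, -cx], [0, 1, -cy], [0, 0, 1]];
  return matMul(undoRotation, undoTranslation);
}

// The camera looks down negative y in eye space, so depth is -y.
function depthInEyeSpace(worldToEye, worldPoint) {
  return -matVec(worldToEye, worldPoint)[1];
}

// Assumed example values: camera at (10, 20), unrotated; cell center at (10, 5).
const worldToEye = makeWorldToEye(10, 20, 0);
const cell = [10, 5, 1];
const eyeCell = matVec(worldToEye, cell);        // [0, -15, 1]
const depth = depthInEyeSpace(worldToEye, cell); // 15

console.log(eyeCell, depth);
```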

[Figure: world vs eye space depth calc]

If we wanted to calculate the depth in world space (top part of the image), we would first have to mathematically define two orthogonal lines, one going through the observer, the other through the grid cell. We could then calculate their intersection point and from it the depth using the Pythagorean theorem. If, however, we first transform our grid cell into eye space (bottom part of the image), calculating its depth becomes as simple as accessing its y component (and inverting its sign). Calculating our grid cells' depths is also the only reason we transform anything into eye space in assignment 4. If we didn't require the depth, we could go from world space directly to ...

Clip Space

Just like eye space simplifies certain calculations (and prepares objects to be projected later on), clip space simplifies clipping (the act of removing points which are outside the camera's frustum and thus not visible). Let's start with the following scene in eye space and see how it gets transformed.

[Figure: eye to clip 1]

The "near" and "far" planes are two arbitrarily defined distances: nothing closer than the near plane or farther than the far plane ought to be visible (our simulator uses 1 px and 300 px). We go from eye space to clip space by multiplying our point in eye space with eyeToClip.

[Figure: eye to clip 2]

Multiple things happen in clip space, in order of importance:

Objects get stretched (by different amounts along different axes, depending on the aspect ratio) so they look the same when viewed through a camera with a FOV of 90° as they did before when viewed through our original camera. (If our camera already had a FOV of 90° nothing happens.)
The w component is set to -y (-z in the case of OpenGL), i.e. [x, y, w] in eye space becomes [x', y', -y] in clip space (described in depth in this tutorial).¹
The y-axis (z-axis in OpenGL) is flipped. We just transitioned from a right-handed to a left-handed coordinate system.
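
The three effects listed above can be seen in a small sketch of an eyeToClip matrix. The matrix below is an assumption: it is a 2D analog of the classic OpenGL perspective matrix, reduced to the [x, y, w] layout used here, and is not necessarily the exact matrix the simulator uses. In 2D only x is stretched, so no aspect ratio appears. The name `makeEyeToClip` and the sample values are made up for illustration.

```javascript
// Multiply a 3x3 matrix with a column vector [x, y, w].
function matVec(m, v) {
  return m.map((row) => row.reduce((sum, x, k) => sum + x * v[k], 0));
}

// Hypothetical 2D analog of the OpenGL perspective matrix.
function makeEyeToClip(fovRadians, nearPlane, farPlane) {
  const f = 1 / Math.tan(fovRadians / 2); // stretch factor; about 1 for a 90 degree FOV
  return [
    // Row 1: stretch x so the scene fits a 90 degree FOV camera.
    [f, 0, 0],
    // Row 2: remap (and flip) y so near maps to -w and far maps to +w.
    [0, (farPlane + nearPlane) / (nearPlane - farPlane),
        (2 * farPlane * nearPlane) / (nearPlane - farPlane)],
    // Row 3: set w to -y, i.e. to the depth of the point.
    [0, -1, 0],
  ];
}

// Assumed example: 60 degree FOV, near plane 1 px, far plane 300 px.
const eyeToClip = makeEyeToClip(Math.PI / 3, 1, 300);
const eyePoint = [10, -50, 1]; // 50 px in front of the camera
const clipPoint = matVec(eyeToClip, eyePoint);

console.log(clipPoint); // the w component equals the depth, 50
```

With a FOV of 90° the stretch factor is (up to rounding) 1, so x passes through unchanged, matching the remark that nothing happens in that case.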

Why would we want our camera to have a FOV of 90° (while maintaining the appearance of our world)? Because it makes clipping trivial: viewed through a 90°-FOV camera, a point is visible as long as it is no further from the y-axis than it is from the camera or, formally, -w <= x(,y,z) <= w.¹
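
As a sketch, the clipping test reduces to a handful of comparisons on a point that is already in clip space. The function name `isInsideClipVolume` is made up for illustration, and the bounds are taken as inclusive on both sides, as in the OpenGL clip volume.

```javascript
// A clip-space point [x, y, w] is visible if every coordinate
// lies within [-w, w]. Points behind the camera have a negative w
// and therefore always fail the test.
function isInsideClipVolume([x, y, w]) {
  return -w <= x && x <= w && -w <= y && y <= w;
}

console.log(isInsideClipVolume([10, 20, 50])); // true: well inside
console.log(isInsideClipVolume([60, 20, 50])); // false: too far to the side
console.log(isInsideClipVolume([0, 10, -10])); // false: behind the camera
```

Doing this comparison before the perspective divide also avoids dividing by a zero or negative w.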

NDC

Going from clip space to NDC requires a manual step, the perspective divide. It is a non-linear transformation and thus cannot be expressed as a matrix multiplication.

[Figure: eye to clip 2]

We divide all our coordinates x, y, z by the "homogeneous coordinate" w. This has the effect of turning our frustum from a truncated pyramid into a cube with corners (-1,-1(,-1)) and (1,1(,1)) and making distant things smaller, as if drawn with perspective (hence the name). NDC always range from -1 to 1, even if our screen has a non-square aspect ratio (all points have been stretched accordingly in the conversion to clip space to account for this). What purpose do NDC serve? They let us calculate the pixel positions of our points via px = resolution * (ndc+1) / 2.
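
The perspective divide and the pixel formula can be sketched in a few lines. The function names `clipToNdc` and `ndcToPixel`, the sample point and the resolution of 800 px are assumptions chosen for illustration.

```javascript
// Perspective divide: divide every coordinate by the homogeneous coordinate w.
function clipToNdc([x, y, w]) {
  return [x / w, y / w];
}

// Map one NDC coordinate from [-1, 1] to [0, resolution]:
// px = resolution * (ndc + 1) / 2
function ndcToPixel(ndc, resolution) {
  return (resolution * (ndc + 1)) / 2;
}

const clipPoint = [10, 20, 50];
const ndcPoint = clipToNdc(clipPoint);        // [0.2, 0.4]
const pixelX = ndcToPixel(ndcPoint[0], 800);  // about 480

console.log(ndcPoint, pixelX);
```

An NDC value of -1 lands on pixel 0 and an NDC value of 1 lands on the full resolution, so the whole visible range maps onto the screen.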
