# Robotic Manipulation

Perception, Planning, and Control

Russ Tedrake

How to cite these notes, use annotations, and give feedback.

Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Fall 2023 semester.

# Basic Pick and Place

Your challenge: command the robot to pick up the brick and place it in a desired position/orientation.

The stage is set. You have your robot. I have a little red foam brick. I'm going to put it on the table in front of your robot, and your goal is to move it to a desired position/orientation on the table. I want to defer the perception problem for one more chapter, and will let you assume that you have access to a perfect measurement of the current position/orientation of the brick. Even without perception, completing this task requires us to build up a basic toolkit for geometry and kinematics; it's a natural place to start.

First, we will establish some terminology and notation for kinematics. This is one area where careful notation can yield dividends, and sloppy notation will inevitably lead to confusion and bugs. The Drake developers have gone to great length to establish and document a consistent multibody notation, which we call "Monogram Notation". The documentation even includes some of the motivation/philosophy behind that notation. I'll use the monogram notation throughout this text.

If you'd like a more extensive background on kinematics than what I provide here, my favorite reference is still Craig05. For free online resources, Chapters 2 and 3 of the 1994 book by Murray et al. (now free online)Murray94 are also excellent, as are the first seven chapters of Modern Robotics by Lynch and ParkLynch17 (they also have excellent accompanying videos). Unfortunately, with three different references you'll get three (slightly) different notations; ours is most similar to Craig05. The monogram notation is developed in some detail in Mitiguy17.

Please don't get overwhelmed by how much background material there is to know! I am personally of the opinion that a clear understanding of just a few basic ideas should make you very effective here. The details will come later, if you need them.

# Monogram Notation

The following concepts are disarmingly subtle. I've seen incredibly smart people assume they knew them and then perpetually stumble over notation. I did it for years myself. Take a minute to read this carefully!

Perhaps the most fundamental concept in geometry is the concept of a point. Points occupy a position in space, and they can have names, e.g. point $A$, $C$, or more descriptive names like $B_{cm}$ for the center of mass of body $B$. We'll denote the position of the point by using a position vector $p^A$; that's $p$ for position, and not for point, because other geometric quantities can also have a position.

But let's be more careful. Position is actually a relative quantity. Really, we should only ever write the position of two points relative to each other. We'll use e.g. $^Ap^C$ to denote the position of $C$ measured from $A$. The left superscript looks mighty strange, but we'll see that it pays off once we start transforming points.

Every time we describe the (relative) position as a vector of numbers, we need to be explicit about the frame we are using, specifically the "expressed-in" frame. All of our frames are defined by orthogonal unit vectors that follow the "right-hand rule". We'll give a frame a name, too, like $F$. If I want to write the position of point $C$ measured from point $A$, expressed in frame $F$, I will write $^Ap^C_F$. If I ever want to get just a single component of that vector, e.g. the $x$ component, then I'll use $^Ap^C_{F_x}$. In some sense, the "expressed-in" frame is an implementation detail; it is only required once we want to represent the multibody quantity as a vector (e.g. in the computer).

That is seriously heavy notation. I don't love it myself, but it's the most durable I've got, and we'll have shorthand for when the context is clear. There are a few very special frames. We use $W$ to denote the world frame. We think about the world frame in Drake using vehicle coordinates (positive $x$ to the front, positive $y$ to the left, and positive $z$ is up). The other particularly special frames are the body frames: every body in the multibody physics engine has a unique frame attached to it. We'll typically use $B_i$ to denote the frame for body $i$.

Frames have a position, too -- it coincides with the frame origin. So it is perfectly valid to write $^Wp^A_W$ to denote the position of point $A$ measured from the origin of the world frame, expressed in the world frame. Here is where the shorthand comes in. If the position of a quantity is measured from a frame, and expressed in the same frame, then we can safely omit the subscript. $^Fp^A \equiv {^Fp^A_F}$. Furthermore, if the "measured from" field is omitted, then we assume that the point is measured from $W$, so $p^A \equiv {}^Wp^A_W$.

Frames also have an orientation. We'll use $R$ to denote a rotation, and follow the same notation, writing $^BR^A$ to denote the orientation of frame $A$ measured from frame $B$. Unlike vectors, pure rotations do not have an additional "expressed in" frame.

A frame $F$ can be specified completely by a position and rotation measured from another frame. Taken together, we call the position and orientation a spatial pose, or just pose. A spatial transform, or just transform, is the "verb form" of pose. In Drake we use RigidTransform to represent a pose/transform, and denote it with the letter $X$. $^BX^A$ is the pose of frame $A$ measured from frame $B$. When we talk about the pose of an object $O$, without mentioning a reference frame explicitly, we mean $^WX^O$ where $O$ is the body frame of the object. We do not use the "expressed in" frame subscript for pose; we always want the pose expressed in the reference frame.

The Drake documentation also discusses how to use this notation in code. In short, $^Bp^A_C$ is written p_BA_C, ${}^BR^A$ as R_BA, and ${}^BX^A$ as X_BA. It works, I promise.
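As a tiny, made-up numeric illustration (plain NumPy, not Drake's API), the convention makes the frame bookkeeping visible in the variable names themselves:

```python
import numpy as np

# p_BA_F reads: position of point A, measured from B, expressed in frame F.
# The numeric values below are made up purely to illustrate the convention.
R_GF = np.array([[0., -1., 0.],    # {}^GR^F: re-expresses frame-F vectors in frame G
                 [1.,  0., 0.],
                 [0.,  0., 1.]])
p_BA_F = np.array([1., 0., 0.])    # {}^Bp^A_F

# Changing only the expressed-in frame is a pure rotation; in the names, the
# adjacent F's "cancel": R_GF @ p_BA_F -> p_BA_G.
p_BA_G = R_GF @ p_BA_F
```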

# Pick and place via spatial transforms

Now that we have the notation, we can formulate our approach to the basic pick and place problem. Let us call our object $O$ and our gripper $G$. Our idealized perception sensor tells us $^WX^O$. Let's create a frame $O_d$ to describe the "desired" pose of the object, $^WX^{O_d}$. Pick and place manipulation, then, is simply trying to make $X^O = X^{O_d}$.

Add a figure here (after Terry's PR lands).
To accomplish this, we will assume that the object doesn't move relative to the world ($^WX^O$ is constant) when the gripper is open, and the object doesn't move relative to the gripper ($^GX^O$ is constant) when the gripper is closed. Then we can:
• move the gripper in the world, $X^G$, to an appropriate pose measured from the object: $^OX^{G_{grasp}}$.
• close the gripper.
• move the gripper+object to the desired pose, $X^O = X^{O_d}$.
• open the gripper, and retract the hand.
There is just one more important detail: to approach the object without colliding with it, we will insert a "pregrasp pose", $^OX^{G_{pregrasp}}$, above the object as an intermediate step. We'll use the same transform to retract away from the object when we set it down.
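The steps above can be sketched with transform composition alone. Here is a minimal NumPy version using 4$\times$4 homogeneous matrices as a stand-in for Drake's RigidTransform; all of the poses are made up for illustration:

```python
import numpy as np

def xform(R, p):
    """Pack rotation R and translation p into a 4x4 homogeneous transform."""
    X = np.eye(4)
    X[:3, :3] = R
    X[:3, 3] = p
    return X

# Made-up poses for illustration; in practice X_WO comes from perception.
X_WO = xform(np.eye(3), [0.5, 0.0, 0.0])           # current object pose in world
X_WOd = xform(np.eye(3), [0.5, 0.3, 0.0])          # desired object pose in world
X_OGgrasp = xform(np.eye(3), [0.0, 0.0, 0.1])      # grasp pose relative to the object
X_OGpregrasp = xform(np.eye(3), [0.0, 0.0, 0.3])   # pregrasp pose, higher above

# Gripper pose targets, in order, by composing transforms:
X_WGpregrasp = X_WO @ X_OGpregrasp  # 1. approach above the object
X_WGgrasp = X_WO @ X_OGgrasp        # 2. descend, then close the gripper
# 3. While grasped, X_GO = inv(X_OGgrasp) is constant, so placing the object
#    at X_WOd requires the gripper pose:
X_WGplace = X_WOd @ X_OGgrasp       # 4. then open the gripper and retract
```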

Clearly, programming this strategy requires good tools for working with these transforms, and for relating the pose of the gripper to the joint angles of the robot.

# Spatial Algebra

Here is where we start to see the pay-off from our heavy notation, as we define the rules of converting positions, rotations, poses, etc. between different frames. Without the notation, this invariably involves me with my right hand in the air making the "right-hand rule", and my head twisting around in space. With the notation, it's a simple matter of lining up the symbols properly, and we're more likely to get the right answer!

Here are the basic rules of algebra for our spatial quantities:
• Positions expressed in the same frame can be added when their reference and target symbols match: \begin{equation}{}^Ap^B_F + {}^Bp^C_F = {}^Ap^C_F.\end{equation} Addition is commutative, and the additive inverse is well defined: \begin{equation}{}^Ap^B_F = - {}^Bp^A_F.\end{equation} Those should be pretty intuitive; make sure you confirm them for yourself.
• Multiplication by a rotation can be used to change the "expressed in" frame: \begin{equation}{}^Ap^B_G = {}^GR^F {}^Ap^B_F.\end{equation} You might be surprised that a rotation alone is enough to change the expressed-in frame, but it's true. The position of the expressed-in frame does not affect the relative position between two points.
• Rotations can be multiplied when their reference and target symbols match: \begin{equation}{}^AR^B \: {}^BR^C = {}^AR^C.\end{equation} The inverse operation is also simply defined: \begin{equation}\left[{}^AR^B\right]^{-1} = {}^BR^A.\end{equation} When the rotation is represented as a rotation matrix, this is literally the matrix inverse, and since rotation matrices are orthonormal, we also have $R^{-1}=R^T.$
• Transforms bundle this up into a single, convenient notation when positions are measured from a frame (and the same frame they are expressed in): \begin{equation}{}^Gp^A = {}^GX^F {}^Fp^A = {}^Gp^F + {}^Fp^A_G = {}^Gp^F + {}^GR^F {}^Fp^A.\end{equation}
• Transforms compose: \begin{equation}{}^AX^B {}^BX^C = {}^AX^C,\end{equation} and have an inverse \begin{equation}\left[{}^AX^B\right]^{-1} = {}^BX^A.\end{equation} Please note that for transforms, we generally do not have that $X^{-1}$ is $X^T,$ though it still has a simple form.

In practice, transforms are implemented using homogeneous coordinates, but for now I'm happy to leave that as an implementation detail.
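A minimal sketch of that implementation detail, using 4$\times$4 homogeneous matrices (an illustration, not Drake's RigidTransform implementation):

```python
import numpy as np

def compose(X_AB, X_BC):
    # {}^AX^B {}^BX^C = {}^AX^C is just matrix multiplication.
    return X_AB @ X_BC

def invert(X_AB):
    # [{}^AX^B]^{-1} = {}^BX^A: not the matrix transpose, but still simple.
    R, p = X_AB[:3, :3], X_AB[:3, 3]
    X_BA = np.eye(4)
    X_BA[:3, :3] = R.T
    X_BA[:3, 3] = -R.T @ p
    return X_BA

# Example with a made-up transform: rotation about z plus a translation.
theta = 0.7
X_AB = np.eye(4)
X_AB[:3, :3] = [[np.cos(theta), -np.sin(theta), 0.],
                [np.sin(theta),  np.cos(theta), 0.],
                [0., 0., 1.]]
X_AB[:3, 3] = [1., 2., 3.]
```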

# From camera frame to world frame

Imagine that I have a depth camera mounted in a fixed pose in my workspace. Let's call the camera frame $C$ and denote its pose in the world with ${}^WX^C$.

A depth camera returns points in the camera frame. Therefore, we'll write the position of point $P_i$ as ${}^Cp^{P_i}$. If we want to convert the point into the world frame, we simply have $$p^{P_i} = X^C {}^Cp^{P_i}.$$

This is a work-horse operation for us. We often aim to merge points from multiple cameras (typically in the world frame), and always need to somehow relate the frames of the camera with the frames of the robot. The inverse transform, ${}^CX^W$, which projects world coordinates into the camera frame, is often called the camera "extrinsics".
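A NumPy sketch of this work-horse operation (the camera mount pose and the points are made up; in Drake this is a one-line RigidTransform multiply):

```python
import numpy as np

X_WC = np.eye(4)
X_WC[:3, 3] = [0.0, 0.0, 1.5]          # camera mounted 1.5 m above the world origin
p_CP = np.array([[0.1, -0.2, 0.8],     # {}^Cp^{P_i}: one point per row
                 [0.0,  0.0, 1.0]])

# p^{P_i} = X^C {}^Cp^{P_i}: rotate each point, then translate.
p_WP = (X_WC[:3, :3] @ p_CP.T).T + X_WC[:3, 3]
```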

# Representations for 3D rotation

In the spatial algebra above, I've written the rules for rotations using an abstract notion of rotation. But in order to implement this algebra in code, we need to decide how we are going to represent those rotations with a small number of real values. There are many possible representations of 3D rotations, they are each good for different operations, and unfortunately, there is no one representation to rule them all. (This is one of the many reasons why everything is better in 2D!) Common representations include the following.

In Drake, we provide all of these representations, and make it easy to convert back and forth between them.

A 3$\times$3 rotation matrix is an orthonormal matrix whose columns are the $x$-, $y$-, and $z$-axes of one frame expressed in another. Specifically, the first column of the rotation ${}^GR^F$ is the $x$-axis of frame $F$ expressed in frame $G$, the second column is the $y$-axis, and the third is the $z$-axis.

Euler angles specify a 3D rotation by a series of rotations around the $x$, $y$, and $z$ axes. The order of these rotations matters, and many different orderings can be used to describe any 3D rotation. This is why we use RollPitchYaw in the code (preferring it over the more general term "Euler angle") and document it carefully. Roll is a rotation around the $x$-axis, pitch is a rotation around the $y$-axis, and yaw is a rotation around the $z$-axis; this combination is also known as "extrinsic X-Y-Z" Euler angles.

Axis angles describe a 3D rotation by a scalar rotation around an arbitrary vector axis using three numbers: the direction of the vector is the axis, and the magnitude of the vector is the angle. You can think of unit quaternions as a form of axis angles that have been carefully normalized to be unit length and have magical properties. My favorite careful description of quaternions is probably chapter 1 of Stillwell08.

Why all of the representations? Shouldn't "roll-pitch-yaw" be enough? Unfortunately, no. The limitation is perhaps most easily seen by looking at the coordinate changes from roll-pitch-yaw to/from a rotation matrix. Any roll-pitch-yaw can be converted to a rotation matrix, but the inverse map has a singularity: when the pitch angle is $\frac{\pi}{2}$, roll and yaw become indistinguishable. This is described very nicely, along with its physical manifestation as "gimbal lock", in this video. Similarly, the direction of the vector in the axis-angle representation is not uniquely defined when the rotation angle is zero. These singularities become problematic when we start taking derivatives of rotations, for instance when we write the equations of motion. It is now well understood Stillwell08 that it requires at least four numbers to properly represent the group of 3D rotations; the unit quaternion is the most common four-element representation.
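The gimbal-lock singularity is easy to exhibit numerically. The sketch below builds the extrinsic X-Y-Z rotation matrix from roll-pitch-yaw (plain NumPy, not Drake's RollPitchYaw class) and shows two distinct roll/yaw pairs collapsing to the same rotation at pitch $\frac{\pi}{2}$:

```python
import numpy as np

def rotx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1., 0., 0.], [0., c, -s], [0., s, c]])

def roty(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0., s], [0., 1., 0.], [-s, 0., c]])

def rotz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

def rpy_to_R(roll, pitch, yaw):
    # Extrinsic X-Y-Z: rotate about world x, then world y, then world z.
    return rotz(yaw) @ roty(pitch) @ rotx(roll)

# At pitch = pi/2, only the difference (roll - yaw) matters, so two distinct
# (roll, yaw) pairs with the same difference give the same rotation matrix.
R1 = rpy_to_R(0.3, np.pi / 2, 0.0)
R2 = rpy_to_R(0.5, np.pi / 2, 0.2)   # same roll - yaw = 0.3: gimbal lock
```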

# Forward kinematics

The spatial algebra gets us pretty close to what we need for our pick and place algorithm. But remember that the interface we have with the robot reports measured joint positions, and expects commands in the form of joint positions. So our remaining task is to convert between joint angles and cartesian frames. We'll do this in steps; the first is to go from joint positions to cartesian frames, which is known as forward kinematics.

Throughout this text, we will refer to the joint positions of the robot (also known as "configuration" of the robot) using a vector $q$. If the configuration of the scene includes objects in the environment as well as the robot, we would use $q$ for the entire configuration vector, and use e.g. $q_{robot}$ for the subset of the vector corresponding to the robot's joint positions. Therefore, the goal of forward kinematics is to produce a map: \begin{equation}X^G = f_{kin}^G(q).\end{equation} Moreover, we'd like to have forward kinematics available for any frame we have defined in the scene. Our spatial notation and spatial algebra makes this computation relatively straight-forward.

# The kinematic tree

In order to facilitate kinematics and related multibody computations, the MultibodyPlant organizes all of the bodies in the world into a tree topology. Every body (except the world body) has a parent, which it is connected to via either a Joint or a "floating base".

# Inspecting the kinematic tree

Drake provides some visualization support for inspecting the kinematic tree data structure. The kinematic tree for an iiwa is more of a vine than a tree (it's a serial manipulator), but the trees for the dexterous hands are more interesting. I've added our brick to the example, too, so that you can see that a "free" body is just another branch off the world root node.

Insert topology visualization here (once it is better)

Every Joint and "floating base" has some number of position variables associated with it -- a subset of the configuration vector $q$ -- and knows how to compute the configuration dependent transform across the joint from the child joint frame $J_C$ to the parent joint frame $J_P$: ${}^{J_P}X^{J_C}(q)$. Additionally, the kinematic tree defines the (fixed) transforms from the joint frame to the child body frame, ${}^CX^{J_C}$, and from the joint frame to the parent frame, ${}^PX^{J_P}$. Altogether, we can compute the configuration transform between any one body and its parent, $${}^PX^C(q) = {}^PX^{J_P} {}^{J_P}X^{J_C}(q) {}^{J_C}X^C.$$
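For a single revolute joint, that composition might look like the following sketch (the joint axis and the fixed offsets are made up for illustration; MultibodyPlant computes this internally):

```python
import numpy as np

def xform(R, p):
    X = np.eye(4)
    X[:3, :3] = R
    X[:3, 3] = p
    return X

def rotz(q):
    c, s = np.cos(q), np.sin(q)
    return np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

# Fixed offsets (made up for illustration):
X_PJp = xform(np.eye(3), [0., 0., 0.2])  # parent body frame -> parent joint frame
X_JcC = xform(np.eye(3), [0., 0., 0.1])  # child joint frame -> child body frame

def X_PC(q):
    # {}^PX^C(q) = {}^PX^{J_P} {}^{J_P}X^{J_C}(q) {}^{J_C}X^C, with the
    # configuration-dependent middle term a rotation about the joint's z-axis.
    X_JpJc = xform(rotz(q), [0., 0., 0.])
    return X_PJp @ X_JpJc @ X_JcC
```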

Examples or links to specifying the kinematic tree in URDF, SDF, etc...

You might be tempted to think that every time you add a joint to the MultibodyPlant, you are adding a degree of freedom. But it actually works the other way around. Every time you add a body to the plant, you are adding many degrees of freedom. But you can then add joints to remove those degrees of freedom; joints are constraints. "Welding" the robot's base to the world frame removes all of the floating degrees of freedom of the base. Adding a rotational joint between a child body and a parent body removes all but one degree of freedom, etc.

# Forward kinematics for pick and place

In order to compute the pose of the gripper in the world, $X^G$, we simply query the parent of the gripper frame in the kinematic tree, and recursively compose the transforms until we get to the world frame.

Kinematic frames on the iiwa (left) and the WSG (right). For each frame, the positive $x$ axis is in red, the positive $y$ axis is in green, and the positive $z$ axis is in blue. It's (hopefully) easy to remember: XYZ $\Leftrightarrow$ RGB.

# Forward kinematics for the gripper frame

Let's evaluate the pose of the gripper in the world frame: $X^G$. We know that it will be a function of configuration of the robot, which is just a part of the total state of the MultibodyPlant (and so is stored in the Context). The following example shows you how it works.

The key lines are

```
gripper = plant.GetBodyByName("body", wsg)
pose = plant.EvalBodyPoseInWorld(context, gripper)
```
Behind the scenes, the MultibodyPlant is doing all of the spatial algebra we described above to return the pose (and also some clever caching, because much of the computation can be reused when you evaluate the pose of another frame on the same robot).
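Conceptually, the recursion behind this call is just a running product of per-body transforms. Here is a toy NumPy version for a made-up 3-joint planar chain (not the iiwa's actual geometry):

```python
import numpy as np

def joint_transform(q_i, link_length):
    # Child frame: offset by link_length along the parent's x-axis, with its
    # axes rotated by q_i about z (a toy planar joint).
    c, s = np.cos(q_i), np.sin(q_i)
    X = np.eye(4)
    X[:3, :3] = [[c, -s, 0.], [s, c, 0.], [0., 0., 1.]]
    X[0, 3] = link_length
    return X

def fkin_G(q, lengths=(0.5, 0.4, 0.3)):
    # Walk from the world down the chain, composing transforms as we go.
    X_WG = np.eye(4)
    for q_i, length in zip(q, lengths):
        X_WG = X_WG @ joint_transform(q_i, length)
    return X_WG
```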

# Forward kinematics of "floating-base" objects

Consider the special case of having a MultibodyPlant with exactly one body added, and no joints. The kinematic tree is simply the world frame and the body frame, connected by the "floating base". What does the forward kinematics function $$X^B = f_{kin}^B(q)$$ look like in that case? If $q$ is already representing the floating-base configuration, is $f^B_{kin}$ just the identity function?

This gets into the subtle points of how we represent transforms, and how we represent 3D rotations in particular. Although we use rotation matrices in our RigidTransform class, in order to make the spatial algebra efficient, we actually use unit quaternions in the configuration vector $q,$ and in the Context in order to have a compact representation.

As a result, for this example, the software implementation of the function $f_{kin}^B$ is precisely the function that converts the position $\times$ unit quaternion representation into the pose (position $\times$ rotation matrix) representation.
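A sketch of that conversion (the $[q_w, q_x, q_y, q_z, x, y, z]$ ordering here is an assumption for illustration, and this is not Drake's code; consult Drake's documentation for the actual layout):

```python
import numpy as np

def quat_to_R(quat):
    # Standard unit quaternion [w, x, y, z] -> rotation matrix formula.
    w, x, y, z = quat
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def fkin_B(q):
    # f_kin for the lone floating body: unit quaternion + position -> pose.
    X_WB = np.eye(4)
    X_WB[:3, :3] = quat_to_R(q[:4])
    X_WB[:3, 3] = q[4:]
    return X_WB
```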

# Differential kinematics (Jacobians)

The forward kinematics machinery gives us the ability to compute the pose of the gripper and the pose of the object, both in the world frame. But if our goal is to move the gripper to the object, then we should understand how changes in the joint angles relate to changes in the gripper pose. This is traditionally referred to as "differential kinematics".

At first blush, this is straightforward. The change in pose is related to a change in joint positions by the (partial) derivative of the forward kinematics: \begin{equation}dX^B = \pd{f_{kin}^B(q)}{q} dq = J^B(q)dq. \label{eq:jacobian}\end{equation} Partial derivatives of a function are referred to as "Jacobians" in many fields; in robotics it's rare to refer to derivatives of the kinematics as anything else.

All of the subtlety, again, comes in because of the multiple representations that we have for 3D rotations (rotation matrix, unit quaternions, ...). While there is no one best representation for 3D rotations, it is possible to have one canonical representation for differential rotations. Without any concern for singularities nor loss of generality, we can represent the rate of change in pose using a six-component vector for spatial velocity: \begin{equation}{}^AV^B_C = \begin{bmatrix} {}^A\omega^B_C \\ {}^A\text{v}^B_C \end{bmatrix}.\end{equation} ${}^AV^B_C$ is the spatial velocity (also known as a "twist") of frame $B$ measured in frame $A$ expressed in frame $C$, ${}^A\omega^B_C \in \Re^3$ is the angular velocity (of frame $B$ measured in $A$ expressed in frame $C$), and ${}^A\text{v}^B_C \in \Re^3$ is the translational velocity (with the same shorthands as for positions). The angular velocity is a 3D vector (with $\omega_x$, $\omega_y$, $\omega_z$ components); the magnitude of this vector represents the angular speed and the direction represents the (instantaneous) axis of rotation. It's tempting to think of it as the time derivative of roll, pitch, and yaw, but that's not true; the two are related only through a nonlinear (configuration-dependent) change of coordinates. Spatial velocities fit nicely into our spatial algebra:
• Angular velocities add (when the frames match): \begin{equation} {}^A\omega^B_F + {}^B\omega^C_F = {}^A\omega^C_F,\end{equation} (this deserves to be verified) and have the additive inverse ${}^A\omega^C_F = -{}^C\omega^A_F$.
• Rotations can be used to change between the "expressed-in" frames: \begin{equation} {}^A\text{v}^B_G = {}^GR^F {}^A\text{v}^B_F, \qquad {}^A\omega^B_G = {}^GR^F {}^A\omega^B_F.\end{equation}
• Translational velocities compose across frames with: \begin{equation}{}^A\text{v}^C_F = {}^A\text{v}^B_F + {}^B\text{v}^C_F + {}^A\omega^B_F \times {}^Bp^C_F.\end{equation}
This can be derived in a few steps (click the triangle to expand)

Differentiating $${}^Ap^C = {}^AX^B {}^Bp^C = {}^Ap^B + {}^AR^B {}^Bp^C,$$ yields \begin{align} {}^A\text{v}^C =& {}^A\text{v}^B + {}^A\dot{R}^B {}^Bp^C + {}^AR^B {}^B\text{v}^C \nonumber \\ =& {}^A\text{v}^B_A + {}^A\dot{R}^B {}^BR^A {}^Bp^C_A + {}^B\text{v}^C_A.\end{align} Allow me to write $\dot{R}R^{-1}$ for ${}^A\dot{R}^B {}^BR^A$ (dropping the frames for a moment). It turns out that $\dot{R}R^{-1}$ is always a skew-symmetric matrix. To see this, differentiate $RR^T = I$ to get $$\dot{R}R^T + R\dot{R}^T = 0 \Rightarrow \dot{R}R^T = - R\dot{R}^T \Rightarrow \dot{R} R^T = - (\dot{R} R^T)^T,$$ which is the definition of a skew-symmetric matrix. Any 3$\times$3 skew-symmetric matrix can be parameterized by three numbers (we'll use the three-element vector $\omega$), and can be written as a cross product, so $\dot{R}R^Tp = \omega \times p.$

Multiply the right and left sides by ${}^FR^A$ to change the expressed-in frame, and we have our result.

• This reveals that the additive inverse for translational velocities is not obtained by switching the reference and measured-in frames; it is slightly more complicated: \begin{equation}-{}^A\text{v}^B_F = {}^B\text{v}^A_F + {}^A\omega^B_F \times {}^Bp^A_F.\end{equation}
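The skew-symmetry claim in the derivation is easy to check numerically (a NumPy sketch with a made-up constant angular velocity and a finite-difference derivative):

```python
import numpy as np

omega = np.array([0.3, -0.2, 0.5])   # made-up constant angular velocity

def skew(w):
    # The skew-symmetric matrix such that skew(w) @ p == np.cross(w, p).
    return np.array([[0., -w[2], w[1]],
                     [w[2], 0., -w[0]],
                     [-w[1], w[0], 0.]])

def R_of_t(t):
    # Rotation by angle |omega| t about the fixed axis omega (Rodrigues' formula).
    angle = np.linalg.norm(omega) * t
    A = skew(omega / np.linalg.norm(omega))
    return np.eye(3) + np.sin(angle) * A + (1 - np.cos(angle)) * (A @ A)

t, h = 0.4, 1e-6
Rdot = (R_of_t(t + h) - R_of_t(t - h)) / (2 * h)  # central difference
S = Rdot @ R_of_t(t).T   # should be skew-symmetric, and equal to skew(omega)
```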
If you're familiar with "screw theory" (as used in, e.g. Murray94 and Lynch17), click the triangle to see how those conventions are related.

Screw theory (as used in, e.g. Murray94 and Lynch17) often uses a particular form of our spatial velocity referred to as "spatial velocity in the space frame" / "spatial twists" that can be useful in a number of computations. This quantity is ${}^AV^{B_A}$, where $B_A$ is the frame rigidly attached on body $B$ (so ${}^BV^{B_A} = 0$) that is instantaneously in the same pose as $A$ (so ${}^Ap^{B_A} = 0$, ${}^AR^{B_A} = I$) Lynch17. These conditions reduce to \begin{equation}{}^AV^{B_A} = \begin{bmatrix} {}^A\omega^{B_A} \\ {}^A\text{v}^{B_A} \end{bmatrix} = \begin{bmatrix} {}^A\omega^{B_A} + {}^{B_A}\omega^B_{A} \\ {}^A\text{v}^B + {}^B\text{v}^{B_A}_A + {}^A\omega^B \times {}^Bp^{B_A}_A \end{bmatrix} = \begin{bmatrix} {}^A \omega^{B} \\ {}^A\text{v}^B - {}^A\omega^B \times {}^Ap^{B} \end{bmatrix}.\end{equation}

Why is it that we can get away with just three components for angular velocity, but not for rotations? Using the magnitude of an axis-angle representation to denote an angle is degenerate because our representation of angles should be periodic in $2\pi$, but using the magnitude of the angular velocity is ok because angular velocities take values from $(-\infty, \infty)$ without wrapping.

There is one more velocity to be aware of: I'll use $v$ to denote the generalized velocity vector of the plant, which is closely related to the time-derivative of $q$ (see below). While a spatial velocity $^AV^B$ has six components, and a translational or angular velocity, $^B\text{v}^C$ or $^B\omega^C$, has three, the generalized velocity vector is whatever size it needs to be to encode the time derivatives of the configuration variables, $q$. For the iiwa welded to the world frame, that means it has seven components. I've tried to be careful to typeset each of these v's differently throughout the notes. Almost always the distinction is also clear from the context.

# Don't assume $\dot{q} \equiv v$

The unit quaternion representation is four components, but these must form a "unit vector" of length 1. Rotation matrices are 9 components, but they must form an orthonormal matrix with $\det(R)=1$. It's pretty great that for the time derivative of rotation, we can use an unconstrained three component vector, what we've called the angular velocity vector, $\omega$. And you really should use it; getting rid of that constraint makes both the math and the numerics better.

But there is one minor nuisance that this causes. We tend to want to think of the generalized velocity as the time derivative of the generalized positions. This works when we have only our iiwa in the model, and it is welded to the world frame, because all of the joints are revolute joints with a single degree of freedom each. But we cannot assume this in general; in particular, it breaks down when floating-base rotations are part of the generalized positions. As evidence, here is a simple example that loads exactly one rigid body into the MultibodyPlant, and then prints its Context.

The output looks like this:

```
Context
--------
Time: 0
States:
13 continuous states
1 0 0 0 0 0 0 0 0 0 0 0 0

plant.num_positions() = 7
plant.num_velocities() = 6
```
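The 13 states are the 7 positions (a unit quaternion, whence the leading 1, followed by the $x, y, z$ position) and the 6 velocities (angular plus translational). The map from the angular velocity to the quaternion time derivative can be sketched with the standard formula (an illustration, not Drake's internal code; a world-frame angular velocity convention is assumed here):

```python
import numpy as np

def quat_mult(a, b):
    # Hamilton product of quaternions stored as [w, x, y, z].
    aw, av = a[0], a[1:]
    bw, bv = b[0], b[1:]
    return np.concatenate([[aw * bw - av @ bv],
                           aw * bv + bw * av + np.cross(av, bv)])

def quat_dot(quat, omega_W):
    # d(quat)/dt = 0.5 * [0, omega_W] (x) quat, for omega expressed in world frame:
    # four q-dot components from only three velocity components.
    return 0.5 * quat_mult(np.concatenate([[0.], omega_W]), quat)

# Spinning about the world z-axis, starting from the identity orientation:
qd = quat_dot(np.array([1., 0., 0., 0.]), np.array([0., 0., 1.]))
```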