Perception, Planning, and Control
© Russ Tedrake, 2020-2024
Last modified .
How to cite these notes, use annotations, and give feedback.
Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Fall 2024 semester.
Previous Chapter | Table of contents | Next Chapter |
In this chapter we'll consider the simplest version of the bin picking problem: the robot has a bin full of random objects and simply needs to move those objects from one bin to the other. We'll be agnostic about what those objects are and about where they end up in the other bin, but we would like our solution to achieve a reasonable level of performance for a very wide variety of objects. This turns out to be a pretty convenient way to create a training ground for robot learning -- we can set the robot up to move objects back and forth between bins all day and night, and intermittently add and remove objects from the bin to improve diversity. Of course, it is even easier in simulation!
Bin picking has potentially very important applications in industries such as logistics, and there are significantly more refined versions of this problem. For example, we might need to pick only objects from a specific class, and/or place the objects in known position (e.g. for "packing"). But let's start with the basic case.
If our goal is to test a diversity of bin picking situations, then the
first task is to figure out how to generate diverse simulations. How
should we populate the bin full of objects? So far we've set up each
simulation by carefully setting the initial positions (in the
Context
) for each of the objects, but that approach won't
scale.
In the real world, we would probably just dump the random objects into
the bin. That's a decent strategy for simulation, too. We can roughly
expect our simulation to faithfully implement multibody physics as long
as our initial conditions (and time step) are reasonable; but the physics
isn't well defined if we initialize the Context
with
multiple objects occupying the same physical space. The simplest and
most common way to avoid this is to generate a random number of objects
in random poses, with their vertical positions staggered so that they
trivially start out of penetration.
If you look for them, you can find animations of large numbers of falling objects in the demo reels for most advanced multibody simulators. (These demos can potentially be a bit misleading; it's relatively easy to make a simulation which lots of objects falling that looks relatively good from a distance, but if you zoom in and look closely you'll often find many anomalies.) For our purposes the falling dynamics themselves are not the focus. We just want the state of the objects when they are done falling as initial conditions for our manipulation system.
Here is the 2D case. I've added many instances of our favorite red foam brick to the plant. Note that it's possible to write highly optimized simulators that take advantage of physics being restricted to 2D; that's not what I've done here. Rather, I've added a planar joint connecting each brick to the world, and run our full 3D simulator. The planar joint has three degrees of freedom. I've oriented them here to be $x$, $z$, and $\theta$ to constrain the objects to the $xz$ plane.
I've set the initial positions for each object in the
Context
to be uniformly distributed over the horizontal
position, uniformly rotated, and staggered every 0.1m in their initial
vertical position. We only have to simulate for a little more than a
second for the bricks to come to rest and give us our intended "initial
conditions".
It's not really any different to do this with any random objects --
here is what it looks like when I run the same code, but replace the
brick with a random draw from a few objects from the YCB dataset
The same technique also works in 3D. Setting uniformly random
orientations in 3D requires a little more thought, but Drake supplies
the method UniformlyRandomRotationMatrix
(and also one for
quaternions and roll-pitch-yaw) to do that work for us.
Please appreciate that bins are a particularly simple case for
generating random scenarios. If you wanted to generate random kitchen
environments, for example, then you won't be as happy with a solution
that drops refrigerators, toasters, and stools from uniformly random
i.i.d. poses. In those cases, authoring reasonable distributions gets
much more interesting
Although the examples above are conceptually very simple, there is actually a lot going on in the physics engine to make all of this work. To appreciate a few of the details, let's explore the easiest case -- the (static) mechanics once all of the objects have come to rest.
I won't dive into a full discussion of multibody dynamics nor multibody simulation, though I do have more notes available here. What is important to understand is that the familiar $f=ma$ takes a particular form when we write it down in terms of the generalized positions, $q$, and velocities, $v$: $$M(q)\dot{v} + C(q,v)v = \tau_g(q) + \sum_i J^T_i(q)F^{c_i}.$$ We already understand the generalized positions, $q$, and velocities, $v$. The left side of the equation is just a generalization of "mass times acceleration", with the mass matrix, $M$, and the Coriolis terms $C$. The right hand side is the sum of the (generalized) forces, with $\tau_g(q)$ capturing the terms due to gravity, and $F^{c_i}$ is the spatial force due to the $i$th contact. $J_i(q)$ is the $i$th "contact Jacobian" -- it is the Jacobian that maps from the generalized velocities to the spatial velocity of the $i$th contact frame.
Our interest here is in finding (stable) steady-state solutions to these differential equations that can serve as good initial conditions for our simulation. At steady-state we have $v=\dot{v}=0$, and conveniently all of the terms on the left-hand side of the equations are zero. This leaves us with just the force-balance equations $$\tau_g(q) = - \sum_i J^T_i(q) F^{c_i}.$$
Before we proceed, let's be a little more careful about our force
notation, and define its spatial algebra. Often we think of forces as a
three-element vector (with components for $x$, $y$, and $z$) applied to a
rigid body at some point. More generally, we will define a six-component
vector for spatial
force, using the monogram notation:
\begin{equation}F^{B_p}_{{\text{name}},C} = \begin{bmatrix}
\tau^{B_p}_{\text{name},C} \\ f^{B_p}_{\text{name},C} \end{bmatrix} \quad
\text{ or, if you prefer } \quad \left[F^{B_p}_{\text{name}}\right]_C =
\begin{bmatrix} \left[\tau^{B_p}_{\text{name}}\right]_C \\
\left[f^{B_p}_{\text{name}}\right]_C \end{bmatrix}.\end{equation}
$F^{B_p}_{\text{name},C}$ is the named spatial force applied to a point,
or frame origin, $B_p$, expressed in frame $C$. The form with the
parentheses is preferred in Drake, but is a bit too verbose for my taste
here in the notes. The name is optional, and the expressed in frame, if
unspecified, is the world frame. For forces in particular, it is
recommended that we include the body, $B$, that the force is being
applied to in the symbol for the point $B_p$, especially since we will
often have equal and opposite forces. In code, we write
Fname_Bp_C
.
Like spatial velocity, spatial forces have a rotational component and a translational component; $\tau^{B_p}_C \in \Re^3$ is the torque (on body $B$ applied at point $p$ expressed in frame $C$), and $f^{B_p}_C \in \Re^3$ is the translational or Cartesian force. A spatial force is also commonly referred to as a wrench. If you find it strange to think of forces as having a rotational component, then think of it this way: the world might only impart Cartesian forces at the points of contact, but we often want to summarize the combined effect of many Cartesian forces applied to different points into one common frame. To do this, we represent the equivalent effect of each Cartesian force at the point of application as a force and a torque applied at a different point on the body.
Spatial forces fit neatly into our spatial algebra:
Now that we have our notation, we next need to understand where the contact forces come from.
Geometry engines for robotics, like SceneGraph
in
Drake has a very useful ModelVisualizer
tool that
publishes by the illustration and collision geometry roles to Meshcat
(see, for example, the Drake ). This is very useful for designing new models, but also for
understanding the contact geometry of existing models.
We used this tool before, when we were visualizing the various robot
models. Go ahead and try it again; navigate in the meshcat controls to
Scene > drake > proximity
and check the box to enable
visualizing the geometries with the proximity role.
SceneGraph also implements the concept of a collision filter. It can be important to specify that, for instance, the iiwa geometry at link 5 cannot collide with the geometry in links 4 or 6. Specifying that some collisions should be ignored not only speeds up the computation, but it also facilitates the use of simplified collision geometry for links. It would be extremely hard to approximate link 4 and 5 accurately with spheres, and cylinders if I had to make sure that those spheres and cylinders did not overlap in any feasible joint angle. The default collision filter settings should work fine for most applications, but you can tweak them if you like.
So where do the contact forces, $f^{c_i}$, come from? There is
potentially an equal and opposite contact (spatial) force for every
pair of collision geometries that are not filtered out by the
collision filter. In SceneGraph
, the GetCollisionCandidates
method returns them all. We'll take to calling the two bodies in a collision pair "body A" and "body B".
In our mathematical models, two bodies would never occupy the same region of space, but in a simulation using numerical integration, this can rarely be avoided completely. Most simulators summarize the contact forces between two bodies that are in collision -- either touching or overlapping -- as Cartesian forces at one or more contact points. It's easy to underestimate this step!
First, let's appreciate the sheer number of cases that we have to get right in the code; it's unfortunately extremely common for open-source tools to have bugs in here. But I think the simple gui I've created below makes it pretty apparent that asking to summarize the forces between two bodies in deep penetration with a single Cartesian force applied at a point is fraught with peril. As you move the objects, you will find many discontinuities; this is a common reason why you sometimes see rigid body simulations "explode". It might seem natural to try to use multiple contact points instead of a single contact point to summarize the forces, and some simulators do, but it is very hard to write an algorithm that only depends on the current configurations of the geometry which selects contact points consistently from one time step to the next.
One of the innovations in Drake which makes it particularly good at simulating complex contact situations is that we have moved away from point contact models. When we enable hydroelastic contact in Drake, the spatial forces between two penetrating bodies are computed by taking an integral over a "hydroelastic surface" which generalizes the notion of a contact patch
I've created a simple GUI that allows you to pick any two primitive geometry types and inspect the hydroelastic contact information that is computed when those object are put into penetration.
We still need to decide the magnitude and direction of these spatial forces, and to do this we need to specify the contact frame in which the spatial force is to be applied. For instance, we might use $C_B$ to denote a contact frame on body $B$, with the forces applied at the origin of the frame.
Although I strongly advocate for using the hydroelastic contact model in your simulations, for the purposes of exposition, let's think about the point contact model here. If the collision engine produces a number of contact point locations, how should we define the frames?
Our convention will be to align the positive $z$ axis with the "contact normal", with positive forces resisting penetration. Defining this normal also requires care. For instance, what is the normal at the corner of a box? Taking the normal as a (sub-)gradient of the signed-distance function between two bodies provides a reliable definition that will extend into generalized coordinates. The $x$ and $y$ axes of the contact frame are any orthogonal basis for the tangential coordinates. You can find additional figures and explanations here.
Let's work through these details on a simple example -- our foam brick sitting on the ground. The ground is welded to the world, so has no degrees of freedom; we can ignore the forces applied to the ground and focus on the forces applied to the brick.
Where are the contact forces and contact frames? If we used only box-on-box geometry, then the choice of contact force location is not unique during this "face-on-face" contact; this is even more true in 3D (where three forces would be sufficient). Let's assume for a moment that our contact engine determines two points of contact -- one at each corner of the box. (In order to help the contact engine make robust choices when we are using point contact models, we sometimes actually annotate the boxes with small sphere collision geometries in the eight corners; this is not needed when using hydroelastic contact.) I've labeled the frames $C_1$ and $C_2$ in the sketch below.
In order to achieve static equilibrium, we require that all of the forces and torques on the brick be in balance. We have three forces to consider: $F_{g}, F_1,$ and $F_2$, named using shorthand for gravity, contact 1, and contact 2, respectively. The gravity vector is most naturally described in the world frame, applied to the body at the center of mass: $$F_{g, W}^B = \begin{bmatrix} 0, 0, 0, 0, 0, -mg \end{bmatrix}^T.$$ The contact forces are most naturally described in the contact frames; for instance we know that $$f_{1, C_1,z}^{B_{c_1}} \ge 0,$$ because the force cannot pull on the ground. To write the force balance equation, we need to use our spatial algebra to express all of the spatial forces in the same frame, and apply them at the same point.
For instance, if we are to transform contact force 1, we can first change the frame: $$F_{1,B}^{B_{c_1}} = {}^B R^{C_1} F_{1,C_{1}}^{B_{c_1}}.$$ Then we can change the point at which it's applied: $$f_{1,B}^B = f_{1,B}^{B_{c_1}}, \qquad \tau_{1,B}^B = \tau_{1,B}^{B_{C_1}} + {}^Bp_B^{B_{C_1}} \times f_{1,B}^{B_{c_1}}.$$ We can now solve for the force balance: $$F_{1,B}^B + F_{2,B}^B + F_{g,B}^B = 0,$$ which is six equations in 3D (or only three in 2D). The translational components tell me that $$f_{1,B}^B + f_{2,B}^B = \begin{bmatrix}0 \\ 0 \\ mg\end{bmatrix},$$ but it's the torque component in $y$ that tells me that $f_{1,B_z}$ and $f_{2,B_z}$ must be equal if the center of mass is equally distance from the two contact points.
We haven't said anything about the horizontal forces yet, except that they must balance. Let's develop that next.
Now the rules governing contact forces can begin to take shape. First and foremost, we demand that there is no force at a distance. Using $\phi_i(q)$ to denote the distance between two bodies in configuration $q$, we have $$\phi(q) > 0 \Rightarrow f^{c_i} = 0.$$ Second, we demand that the normal force only resists penetration; bodies are never pulled into contact: $$f^{c_i}_{C_z} \ge 0.$$ In rigid contact models, we solve for the smallest normal force that enforces the non-penetration constraint (this is known as the principle of least constraint). In soft contact models, we define the force to be a function of the penetration depth and velocity.
Forces in the tangential directions are due to friction. The most commonly used model of friction is Coulomb friction, which states that $$\sqrt{{f^{c_i}_{C_x}}^2 + {f^{c_i}_{C_y}}^2} \le \mu f^{c_i}_{C_z},$$ with $\mu$ a non-negative scalar coefficient of friction. Typically we define both $\mu_{static}$, which is applied when the tangential velocity is zero, and $\mu_{dynamic}$, applied when the tangential velocity is non-zero. In the Coulomb friction model, the tangential contact force is the force within this friction cone which produces maximum dissipation.
Taken together, the geometry of these constraints forms a cone of admissable contact forces. It is famously known as the "friction cone", and we will refer to it often.
It's a bit strange for us to write that the forces are in some set. Surely the world will pick just one force to apply? It can't apply all forces in a set. The friction cone specifies the range of possible forces; under the Coulomb friction model we say that the one force that will be picked is the force inside this set that will successfully resist relative motion in the contact x-y frame. If no force inside the friction cone can completely resist the motion, then we have sliding, and we say that the force that the world will apply is the force inside the friction cone of maximal dissipation. For the conic friction cone, this will be pointed in the direction opposite of the sliding velocity. So even though the world will indeed "pick" one force from the friction cone to apply, it can still be meaningful to reason about the set of possible forces that could be applied because those denote the set of possible opposing forces that friction can perfectly resist. For instance, a brick under gravity will not move if we can exactly oppose the force of gravity with a force inside the friction cone.
If we take our last example, but tilt the table to an angle relative to gravity, then the horizontal forces start becoming important. Before going through the equations, let's check your intuition. Will the magnitude of the forces on the two corners stay the same in this configuration? Or will there be more contact force on the lower corner?
In the illustration above, you'll notice that the contact frames have rotated so that the $z$ axis is aligned with the contact normals. I've sketched two possible friction cones (dark green and lighter green), corresponding to two different coefficients of friction. We can tell immediately by inspection that the smaller value of $\mu$ (corresponding to the darker green) cannot produce contact forces that will completely counteract gravity (the friction cone does not contain the vertical line). In this case, the box will slide and no static equilibrium exists in this configuration.
If we increase the coefficient of (static) friction to the range corresponding to the lighter green, then we can find contact forces that produce an equilibrium. Indeed, for this problem, we need some amount of friction to even have an equilibrium (we'll explore this in the exercises). We also need for the vertical projection of the center of mass onto the ramp to land between the two contact points, otherwise the brick will tumble over the bottom edge. We can see this by writing our same force/torque balance equations. We can write them applied at and expressed in body frame, B, assuming the center of mass in the center of the brick and the brick has length $l$ and height $h$:\begin{gather*} f^{B}_{1, B_x} + f^{B}_{2, B_x} = -mg\sin\gamma \\ f^{B}_{1,B_z} + f^{B}_{2,B_z} = mg\cos\gamma \\ -h f^{B}_{1,B_x} + l f^{B}_{1,B_z} = h f^{B}_{2,B_x} + l f^{B}_{2, B_z} \\ f^{B}_{1, B_z} \ge 0, \quad f^{B}_{2, B_z} \ge 0 \\ |f^{B}_{1, B_x}| \le \mu f^{B}_{1, B_z}, \quad |f^{B}_{2, B_x}| \le \mu f^{B}_{2, B_z} \end{gather*}
So, are the magnitude of the contact forces the same or different? Substituting the first equation into the third reveals $$f^{B}_{2, B_z} = f^{B}_{1, B_z} + \frac{mgh}{l}\sin\gamma.$$
Rather than dropping objects from a random height, perhaps we can
initialize our simulations using optimization to find the initial
conditions that are already in static equilibrium. In StaticEquilbriumProblem
collects all of the constraints we enumerated above into an optimization
problem: \begin{align*} \find_q \quad \subjto \quad& \tau_g(q) = - \sum_i
J^T_i(q) f^{c_i} & \\ & f^{c_i}_{C_z} \ge 0 & \forall i, \\ &
|f^{c_i}_{C_{x,y}}|_2 \le \mu f^{c_i}_{C_z} & \forall i, \\ & \phi_i(q)
\ge 0 & \forall i, \\ & \phi(q) = 0 \textbf{ or } f^{c_i}_{C_z} = 0
&\forall i, \\ & \text{joint limits}.\end{align*} This is a nonlinear
optimization problem: it includes the nonconvex non-penetration
constraints we discussed in the last chapter. The second-to-last
constraints (a logical or) is particularly interesting; constraints of
the form $x \ge 0, y \ge 0, x =0 \textbf{ or } y = 0$ are known as
complementarity constraints, and are often written as $x \ge 0, y \ge 0,
xy = 0$. We can make the problem easier for the nonlinear optimization
solver by relaxing the equality to $0 \le \phi(q) f^{c_i}_{C_z} \le
\text{tol}$, which provides a proper gradient for the optimizer to follow
at the cost of allowing some force at a distance.
It's easy to add additional costs and constraints; for initial conditions we might use an objective that keeps $q$ close to an initial configuration-space sample.
So how well does it work?
Static equilibrium problems provide a nice view into a few of the complexities of contact simulation. But things get a lot more interesting when the robot and/or objects in the scene start moving!
More coming soon!
What makes a good grasp? This topic has been studied extensively for
decades in robotics, with an original focus on thinking of a (potentially
dexterous) hand interacting with a known object.
If the goal of a grasp is to stabilize an object in the hand, then one definition of a "good grasp" is be one that is able to resist disturbances described by an "adversarial" wrench applied to the body.
Above, we introduced the friction cone as the range of possible forces that friction is able to produce in order to resist motion. For the purposes of grasp planning, by applying the additive inverse to the forces in the friction cone, we can obtain all of the "adversarial" forces that can be resisted at the point of contact. And to understand the total ability of all contact forces (applied at multiple points of contact) to resist motion of the body, we want to somehow add all of these forces together. Fortunately, the spatial algebra for spatial forces can be readily extended from operating on spatial force vectors to operating on entire sets of spatial forces.
Because our sets of interest here are convex cones, I will use the relatively standard choice of $\mathcal{K}$ for the six-dimensional wrench cone. Specifically, we have $\mathcal{K}^{B_p}_{\text{name}, C}$ for the cone corresponding to potential spatial forces for $F^{B_p}_{\text{name}, C}$. For instance, our Coulomb friction cone for the point contact model (which, as we've defined it, has no torsional friction) in the contact frame could be: \begin{equation}\mathcal{K}^C_C = \left\{ \begin{bmatrix} 0 \\ 0 \\ 0 \\ f^{C}_{C_x} \\ f^C_{C_y} \\ f^C_{C_z} \end{bmatrix} : \sqrt{ \left(f^{C}_{C_x}\right)^2 + \left(f^{C}_{C_y}\right)^2 } \le \mu f^C_{C_z} \right\}.\end{equation}
The spatial algebra for spatial forces can be applied directly to the wrench cones:
I've made a simple interactive visualization for you to play with to help your intuition about these wrench cones. I have a box that fixed in space and only subjected to contact forces (no gravity is illustrated here); I've drawn the friction cone at the point of contact and at the center of the body. Note that I've intentionally disabled hydroelastic contact for this example -- the forces are computed using only a single contact point between each collision pair -- to simplify the interpretation.
There is one major caveat: the wrench cone lines in $\Re^6$, but I can only draw cones in $\Re^3$. So I've made the (standard) choice to draw the projection of the 6d cone into 3d space with two cones: one for the translational components (in green for the contact point and again in red for the body frame) and another for the rotational components (in blue for the body frame). This can be slightly misleading because one cannot actually select independently from both sets.
Here is the contact wrench cone for a single contact point, visualized for two different contact locations:
I hope that you immediately notice that the rotational component of the wrench cone is low dimensional -- due to the cross product, all vectors in that space must be orthogonal to the vector ${}^Bp^C$. Of course it's way better to run the notebook yourself and get the 3d interactive visualization.
Here is the contact wrench cone for a two contact points on that are directly opposite from each other:
Notice that while each of the rotational cones are low dimensional, they span different spaces. Together (as understood by the Minkowski sum of these two sets) they can resist all pure torques applied at the body frame. This intuition is largely correct, but this is also where the projection of the 6d cone onto two 3d cones becomes a bit misleading. There are some wrenches that cannot be resisted by this grasp. Specifically, if I were to visualize the wrench cone at the point directly between the two contact points, we would see that the wrench cones do not include torques applied directly along the axis between the two contacts. The two contact points alone, without torsional friction, are unable to resist torques about that axis.
In practice, the gripper model that we use in our simulation workflow uses hydroelastic contact, which simulates a contact patch and can produce torsional friction -- so we can resist moments around this axis. The exact amount one gets will be proportional to how hard the gripper is squeezing the object.
Now we can compute the cone of possible wrenches that any set of
contacts can apply on a body -- the contact wrench cone -- by putting all
of the individual contact wrench cones into a common frame and summing
them together. A classic metric for grasping would say that a good grasp
is one where the contact wrench cone is large (can resist many
adversarial wrench disturbances). If the contact wrench cone is all of
$\Re^6$, then we say the contacts have a achieved force
closure
It's worth mentioning that the elegant (linear) spatial algebra of the wrench cones also makes these quantities very suitable for use in optimization (e.g.
The beauty of this wrench analysis originally inspires a very model-based analysis of grasping, where one could try to optimize the contact locations in order to maximize the contact wrench cone. But our goals for this chapter are to assume very little about the object that we are grasping, so we'll (for now) avoid optimizing over the surface of an object model for the best grasp location. Nevertheless, our model-based grasp analysis gives us a few very good heuristics for grasping even unknown objects.
In particular, a good heuristic for a two fingered gripper to have a large contact wrench cone is to find colinear "antipodal" grasp points. Antipodal here means that the normal vectors of the contact (the $z$ axis of the contact frames) are pointing in exactly opposite directions. And "colinear" means that they are on the same line -- the line between the two contact points. As you can see in the two-fingered contact wrench visualization above, this is a reasonably strong heuristic for having a large total contact wrench cone. As we will see next, we can apply this heuristic even without knowing much of anything about the objects.
Rather than looking into the bin of objects, trying to recognize
specific objects and estimate their pose, let's consider an approach where
we simply look for graspable areas directly on the (unsegmented) point
cloud. These very geometric approaches to grasp selection can work very
well in practice, and can also be used in simulation to train a
deep-learning-based grasp selection system that can work very well in the
real world and even deal with partial views and occlusions
To get a good view into the bin, we're going to set up multiple RGB-D cameras. I've used three per bin in all of my examples here. And those cameras don't only see the objects in the bin; they also see the bins, the other cameras, and anything else in the viewable workspace. So we have a little work to do to merge the point clouds from multiple cameras into a single point cloud that only includes the area of interest.
First, we can crop the point cloud to discard any points that are from outside the area of interest (which we'll define as an axis-aligned bounding box immediately above the known location of the bin).
As we will discuss in some detail below, many of our grasp selection strategies will benefit from estimating the "normals" of the point cloud (a unit vector that estimates the normal direction relative to the surface of the underlying geometry). It is actually better to estimate the normals on the individual point clouds, making use of the camera location that generated those points, than to estimate the normal after the point cloud has been merged.
For sensors mounted on the real world, merging point clouds requires high-quality camera calibration and must deal with the messy depth returns. All of the tools from the last chapter are relevant, as the tasks of merging the point clouds is another instance of the point-cloud-registration problem. For the perfect depth measurements we can get out of simulation, given known camera locations, we can skip this step and simply concatenate the list of points in the point clouds together.
Finally, the resulting raw point clouds likely include many more points then we actually need for our grasp planning. One of the standard approaches for down-sampling the point clouds is using a voxel grid -- regularly sized cubes tiled in 3D. We then summarize the original point cloud with exactly one point per voxel (see, for instance Open3D's note on voxelization). Since point clouds typically only occupy a small percentage of the voxels in 3D space, we use sparse data structures to store the voxel grid. In noisy point clouds, this voxelization step is also a useful form of filtering.
I've produced a scene with three cameras looking at our favorite YCB
mustard bottle. I've taken the individual point clouds (already
converted to the world frame by the
DepthImageToPointCloud
system), cropped the point clouds
(to get rid of the geometry from the other cameras), estimated their
normals, merged the point clouds, then down-sampled the point clouds.
The order is important!
I've pushed all of the point clouds to meshcat, but with many of them set to not be visible by default. Use the drop-down menu to turn them on and off, and make sure you understand basically what is happening on each of the steps. For this one, I can also give you the meshcat output directly, if you don't want to run the code.
The grasp selection strategy that we will develop below will be based on the local geometry (normal direction and curvature) of the scene. Understanding how to estimate those quantities from point clouds is an excellent exercise in point cloud processing, and is representative of other similar point cloud algorithms.
Let's think about the problem of fitting a plane, in a least-squares
sense, to a set of points
What is really interesting is that the second and third
eigenvalues/eigenvectors also tell us something about the local geometry.
Because $W$ is symmetric, it has orthogonal eigenvectors, and these
eigenvectors form a (local) basis for the point cloud. The smallest
eigenvalue pointed along the normal, and the largest eigenvalue
corresponds to the direction of least curvature (the squared dot product
with this vector is the largest). This information can be very useful for
finding and planning grasps.
In order to approximate the local curvature of a mesh represented by a point cloud, we can use our fast nearest neighbor queries to find a handful of local points, and use this plane fitting algorithm on just those points. When doing normal estimation directly on a depth image, people often forgo the nearest-neighbor query entirely; simply using the approximation that neighboring points in pixel coordinates are often nearest neighbors in the point cloud. We can repeat that entire procedure for every point in the point cloud.
I remember when working with point clouds started to become a bigger part of my life, I thought that surely doing anything moderately computational like this on every point in some dense point cloud would be incompatible with online perception. But I was wrong! Even years ago, operations like this one were often used inside real-time perception loops. (And they pale in comparison to the number of FLOPs we spend these days evaluating large neural networks).
I've coded up the basic least-squares surface estimation algorithm, with the query point in green, the nearest neighbors in blue, and the local least squares estimation drawn with our RGB$\Leftrightarrow$XYZ frame graphic. You should definitely slide it around and see if you can understand how the axes line up with the normal and local curvature.
You might wonder where you can read more about algorithms of this type.
I don't have a great reference for you. But Radu Rusu was the main author
of the point cloud library
Now that we have processed our point cloud, we have everything we need to start planning grasps. I'm going to break that discussion down into two steps. In this section we'll come up with a cost function that scores grasp candidates. In the following section, we'll discuss some very simple ideas for trying to find grasps candidates that have a low cost.
Following our discussion of "model-based" grasp selection above, once we pick up an object -- or whatever happens to be between our fingers when we squeeze -- then we will expect the contact forces between our fingers to have to resist at least the gravitational wrench (the spatial force due to gravity) of the object. The closing force provided by our gripper is in the gripper's $x$-axis, but if we want to be able to pick up the object without it slipping from our hands, then we need forces inside the friction cones of our contacts to be able to resist the gravitational wrench. Since we don't know what that wrench will be (and are somewhat constrained by the geometry of our fingers), a reasonable strategy is to look at the colinear antipodal points on the surface of the point cloud which also align with $x$-axis of the gripper. In a real point cloud, we are unlikely to find perfect antipodal pairs, but finding areas with normals pointing in nearly opposite directions is a good strategy for grasping!
In practice, the contact between our fingers and the object(s) will be better described by a patch contact than by a point contact (due to the deflection of the rubber fingertips and any deflection of the objects being picked). So it makes sense to look for patches of points with agreeable normals. There are many ways one could write this, I've done it here by transforming the processed point cloud of the scene into the candidate frame of the gripper, and cropped away all of the points except the ones that are inside a bounding box between the finger tips (I've marked them in red in MeshCat). The first term in my grasping cost function is just reward for all of the points in the point cloud, based on how aligned their normal is to the $x$-axis of the gripper: $$\text{cost} = -\sum_i (n^i_{G_x})^2,$$ where $n^i_{G_x}$ is the $x$ component of the $i$th point in the cropped point cloud expressed in the gripper frame.
There are other considerations for what might make a good grasp, too. For our kinematically limited robot reaching into a bin, we might favor grasps that put the hand in favorable orientation for the arm. In the grasp metric I've implemented in the code, I added a cost for the hand deviating from vertical. I can reward the dot product of the vector world $-z$ vector, $[0, 0, -1]$ with the $y$-axis in gripper frame rotated into world frame with : $$\text{cost} \mathrel{{+}{=}} -\alpha \begin{bmatrix} 0 & 0 &-1\end{bmatrix}R^G \begin{bmatrix}0 \\ 1 \\ 0\end{bmatrix} = \alpha R^G_{3,2},$$ where $\alpha$ is relative cost weight, and $R^G_{3,2}$ is the scalar element of the rotation matrix in the third row and second column.
Finally, we need to consider collisions between the candidate grasp and both the bins and with the point cloud. I simply return infinite cost when the gripper is in collision. I've implemented all of those terms in the notebook, and given you a sliders to move the hand around and see how the cost changes.
We've defined a cost function that, given a point cloud from the scene and a model of the environment (e.g. the location of the bins), can score a candidate grasp pose, $X^G$. So now we would like to solve the optimization problem: find $X^G$ that minimizes the cost subject to the collision constraints.
Unfortunately, this is an extremely difficult optimization problem, with a highly nonconvex objective and constraints. Moreover, the cost terms corresponding to the antipodal points is zero for most $X^G$ -- since most random $X^G$ will not have any points between the fingers. As a result, instead of using the typical mathematical programming solvers, most approaches in the literature resort to a randomized sampling-based algorithm. And we do have strong heuristics for picking reasonable samples.
One heuristic, used for instance in
Another heuristic is to find antipodal point pairs in the point cloud, and then sample grasp candidates that would align the fingers with those antipodal pairs. Many 3D geometry libraries support "ray casting" operations at least for a voxel representation of a point cloud; so a reasonable approach to finding antipodal pairs is to simply choose a point at random in the point cloud, then ray cast into the point cloud in the opposite direction of the normal. If the normal of the voxel found by ray casting is sufficiently antipodal, and if the distance to that voxel is smaller than the gripper width, then we've found a reasonable antipodal point pair.
As an alternative to ray casting, I've implemented an even simpler heuristic in my code example: I simply choose a point at random, and start sampling grasps in orientation (starting from vertical) that align the $x$-axis of the gripper with the normal of that point. Then I mostly just rely on the antipodal term in my scoring function to allow me to find good grasps.
I do implement one more heuristic -- once I've found the points in the point cloud that are between the finger tips, then I move the hand out along the gripper $x$-axis so that those points are centered in the gripper frame. This helps prevent us knocking over objects as we close our hands to grasp them.
But that's it! It's a very simple strategy. I sample a handful of candidate grasps and just draw the top few with the lowest cost. If you run it a bunch of times, I think you will find it's actually quite reasonable. Every time it runs, it is simulating the objects falling from the sky; the actual grasp evaluation is actually quite fast.
If you play around with the grasp scoring I've implemented above a little bit, you will find deficiencies. Some of them are addressed easily (albeit heuristically) by adding a few more terms to the cost. For instance, I didn't check collisions of the pre-grasp configuration, but this could be added easily.
There are other cases where grasping alone is not sufficient as a strategy. Imagine that you place an object right in one of the corners of the bin. It might not be possible to get the hand around both sides of the object without being in collision with either the object or the side. The strategy above will never choose to try to grab the very corner of a box (because it always tried to align the sample point normal with the gripper $x$), and it's not clear that it should. This is probably especially true for our relatively large gripper. In the setup we used at TRI, we implemented an additional simple "pushing" heuristic that would be used if there were point clouds in the sink, but no viable grasp candidate could be found. Instead of grasping, we would drive the hand down and nudge that part of the point cloud towards the middle of the bin. This can actually help a lot!
There are other deficiencies to our simple approach that would be very hard to address with a purely geometric approach. Most of them come down to the fact that our system so far has no notion of "objects". For instance, it's not uncommon to see this strategy result in "double picks" if two objects are close together in the bin. For heavy objects, it might be important to pick up the object close to the center of mass, to improve our chances of resisting the gravitational wrench while staying in our friction cones. But our strategy here might pick up a heavy hammer by squeezing just the very far end of the handle.
Interestingly, I don't think the problem is necessarily that the point cloud information is insufficient, even without the color information. I could show you similar point clouds and you wouldn't make the same mistake. These are the types of examples where learning a grasp metric could potentially help. We don't need to achieve artificial general intelligence to solve this one; just experience knowing that when we tried to grasp in someplace before we failed would be enough to improve our grasp heuristics significantly.
Most of us are used to writing code in a procedural programming paradigm; code executes from the top of the program through function calls, branchs (like if/then statements), for/while loops, etc. It's tempting to try to write our high-level robot programs like that, too. For the clutter task, I could write "while there is an object to pick up in the first bin, pick it up and move it to the second bin; ...". And that is certainly not wrong! But to be a component in a robotics framework, this high-level program must somehow communicate with the rest of the system at runtime, including having interfaces to the low-level controllers that are running at some regular frequency (say 100Hz or even 1kHz). So how do we author the high-level behaviors (often called "task level") in a way that integrates seamlessly with the high-rate low-level controls?
There are a few important programming paradigms for authoring the task-level behavior. Broadly speaking, I would categorize them into three bins:
Task-level planners are based primarily on search. There is a long
history of AI planning as search. STRIPS
In late 2022, of course, a new approach to task "planning" exploded
onto the scene with the rise of large language models (LLMs). It quickly
became clear that the most powerful language models like GPT4 are capable
of producing highly nontrivial plans from very limited natural language
specifications. Careful prompting can be used to guide the LLM to produce
actions that the robot is is able to perform, and it seems that LLMs are
particularly good and writing python code. Using Python, it's even
possible to prompt the LLM to generate a policy rather
than just a plan
An informal consensus is that language models make simple planning
problems extraordinarily easy. Importantly, while a search-based planner
might exploit missing requirements / constraints in your specification,
the LLMs output very reasonable plans, using "common sense" to fill in
the gaps. There are still limits to what today's LLMs can achieve in the
way of long-horizon task planning with multi-step reasoning, but clever
prompt engineering can lead to impressive results
For the purposes of clutter clearing in the systems framework, taking the second approach: writing the task-level behavior "policy" as a simple state machine will work nicely.
First, we bundle up our grasp-selection algorithm into a system that reads the point clouds from 3 cameras, does the point cloud processing (including normal estimation), and the random sampling, and puts the grasp selection onto the output port:
Note that we will instantiate this system twice -- once for the bin located on the positive X axis and again for the system on the negative y axis. Thanks to the "pull-architecture" in Drake's systems framework, this relatively expensive sampling procedure only gets called when the downstream system evaluates the output port.
The work-horse system in this example is the Planner
system
which implements the basic state machine, calls the planning functions,
stores the resulting plans in its Context
, and evaluates the
plans on the output port:
This system needs enough input / output ports to get the relevant information into the planner, and to command the downstream controllers.
For this problem, we'll use the Coulomb friction model, where $|f_t| \le \mu f_n$. In other words, the friction force is large enough to resist any movement up until the force required would exceed $\mu f_n$, in which case $|f_t| = \mu f_n$.
Consider a box with mass $m$ sitting on a ramp at angle $\theta$, with coefficient of box $\mu$ in between the sphere and the ramp:
For a given angle $\theta$, what is the minimum coefficient of friction required for the box to not slip down the plane? Use $g$ for acceleration due to gravity.
Now consider a flat ground plane with three solid (uniform density) spheres sitting on it, with radius $r$ and mass $m$. Assume they have the same coefficient of friction $\mu$ between each other as with the ground.
For each of the following configurations: could the spheres be in static equilibrium for some $\mu\in[0,1]$, $m > 0$, $r > 0$? Explain why or why not. Remember that both torques and forces need to be balanced for all bodies to be in equilibrium.
To help you solve these problems, we have to help you build intuition and test your answers. It lets you specify the configuration of the spheres and then uses the StaticEquilbriumProblem class to solve for static equilibrium. Use this notebook to help visualize and think about the system, but for each of the configurations, you should have a physical explanation for your answer. (An example of such a physical explanation would be a free body diagram of the forces and torques on each sphere, and equations for how they can or cannot sum to zero. This is essentially what StaticEquilbriumProblem checks for.)
Finally, a few conceptual questions on the StaticEquilbriumProblem:
For this exercise, you will investigate a slightly different approach to normal vector estimation. In particular, we can exploit the spatial structure that is already in a depth image to avoid computing nearest neighbors. You will work exclusively in . You will be asked to complete the following steps:
So far, we have used sampling-based methods to find antipodal grasps - but can we have find one analytically if we knew the equation of the object shape? For this exercise, you will analyze and implement a correct-by-construction method to find antipodal points using symbolic differentiation and MathematicalProgram. You will work exclusively in . You will be asked to complete the following steps:
In the chapter notebook, we generated grasp candidates using the
antipodal heuristic. In this exercise, we will investigate an
alternative method for generating grasp candidates based on local
curvature, similar to the one proposed in
Let's reintroduce the terminology of behavior trees. Behavior trees
provide a structure for switching between different tasks in a way that
is both modular and reactive. Behavior trees are a directed rooted tree
where the internal nodes manage the control flow and the leaf nodes are
actions to be executed or conditions to be evaluated. For example, an
action may be to "pick ball". This action can succeed or fail. A
condition may be "Is my hand empty?" which can be true (thus the
condition succeeds) or can be false (the condition fails).
More specifically, each node is periodically checked. In turn it will
check its children, and report status as success, failure or running
depending on its childrens' statuses. There are different categories
of control flow nodes, but we'll consider two categories of control
flow nodes:
Let's apply our understanding of behavior trees in the context of a simple task where a robot's goal is to: find a ball, pick the ball up and place the ball in a bin. We can describe task with the high-level behavior tree:
Confirm to yourself that this small behavior tree captures the task! We can go one level deeper and expand out our "Pick Ball" behavior:
The pick ball behavior can be considered as such: Let's start with the
left branch. If the ball is already close, the condition "Ball Close?"
returns true. If the ball is not close, we execute an action to "Approach
ball", thus making the ball close. Thus either the ball is already close
(i.e. the condition "Ball Close?" is true) or we approach the ball (i.e.
the action "Approach ball" is successfully executed) to make the ball
close. Given that the ball is now close, we consider the right branch of
the pick action. If the ball is already grasped, the condition "Ball
Grasped?" returns true. if the ball is not grasped, we execute the action,
"Grasp Ball".
We can expand our our behavior tree even further to give our full
behavior:
Here we've added the behavior that if the robot cannot complete the task, it can ask a kind human for help. Take a few minutes to walk through the behavior tree and convince yourself it encodes the desired task behavior.
From this simple task we can see that while finite state machines can often be easy to understand and intuitive, they can quickly get quite messy!
We claimed behavior trees are modular. Their modularity comes from the fact that we can write reusable snippets of code that define actions and conditions and then compose them using behavior trees. For example, we can imagine that the "Select Grasp" action from part (c) is implemented using the antipodal grasping mechanism we have built up throughout this chapter. And the "Transport grasped object(s) to above Bin 2" action could be implemented using the pick and place tools we developed in Chapter 3.
Source: The scenario and figures for parts (a) and (b) of this exercise
are inspired by material in
Previous Chapter | Table of contents | Next Chapter |