4. Finding and fixing bugs
This doc assumes you've already read about how the code is organized. You should refer back to that doc as necessary.
The main place bugs show up is the getXXJacobian() methods on dart::neural::BackpropSnapshot. These methods are tricky to test and debug, because they are the result of a ton of linear algebra over large, complicated matrices, so we've built out a lot of support infrastructure to make it easier to find, diagnose, and fix bugs.
To start with, for every getXXJacobian method on dart::neural::BackpropSnapshot there's a corresponding finiteDifferenceXXJacobian method:
- BackpropSnapshot::getVelVelJacobian() -> BackpropSnapshot::finiteDifferenceVelVelJacobian()
- BackpropSnapshot::getVelPosJacobian() -> BackpropSnapshot::finiteDifferenceVelPosJacobian()
- BackpropSnapshot::getPosVelJacobian() -> BackpropSnapshot::finiteDifferencePosVelJacobian()
- BackpropSnapshot::getPosPosJacobian() -> BackpropSnapshot::finiteDifferencePosPosJacobian()
- BackpropSnapshot::getForceVelJacobian() -> BackpropSnapshot::finiteDifferenceForceVelJacobian()
In general, the analytical method should ALWAYS yield results within 1e-8 of finite differencing! Throughout development there have been several times when I convinced myself that a 1e-6 error here and there was fine. It has ALWAYS been a lurking bug, and it usually screwed up learning. So the hard-won lesson is this: be SUSPICIOUS of any and all discrepancies. If you can't explain it and it's larger than 1e-8, it's a problem, no matter how big and complex your robot is.
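To make the 1e-8 rule concrete, here is a minimal, self-contained sketch of the kind of comparison involved. It does not use the library: toyStep, toyStepJacobian, finiteDifferenceJacobian, and maxAbsDiff are hypothetical names invented for illustration, and central differencing with a 1e-6 step is an assumption, not necessarily what the finiteDifferenceXXJacobian methods do internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major: J[i][j] = d out_i / d in_j

// Central-difference approximation of the Jacobian of f at x (assumed scheme).
Mat finiteDifferenceJacobian(const std::function<Vec(const Vec&)>& f,
                             const Vec& x, double eps = 1e-6) {
  const std::size_t rows = f(x).size();
  Mat J(rows, Vec(x.size(), 0.0));
  for (std::size_t j = 0; j < x.size(); ++j) {
    Vec plus = x, minus = x;
    plus[j] += eps;
    minus[j] -= eps;
    const Vec fPlus = f(plus), fMinus = f(minus);
    for (std::size_t i = 0; i < rows; ++i)
      J[i][j] = (fPlus[i] - fMinus[i]) / (2.0 * eps);
  }
  return J;
}

// Largest absolute entrywise difference between two same-sized matrices.
double maxAbsDiff(const Mat& a, const Mat& b) {
  assert(a.size() == b.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    assert(a[i].size() == b[i].size());
    for (std::size_t j = 0; j < a[i].size(); ++j)
      worst = std::max(worst, std::abs(a[i][j] - b[i][j]));
  }
  return worst;
}

// Hypothetical toy timestep on x = (pos, vel): semi-implicit Euler pendulum.
Vec toyStep(const Vec& x) {
  const double dt = 0.01, g = 9.81;
  const double nextVel = x[1] - dt * g * std::sin(x[0]);
  const double nextPos = x[0] + dt * nextVel;
  return {nextPos, nextVel};
}

// Hand-derived Jacobian of toyStep with respect to (pos, vel).
Mat toyStepJacobian(const Vec& x) {
  const double dt = 0.01, g = 9.81;
  const double dVelDPos = -dt * g * std::cos(x[0]);
  return {{1.0 + dt * dVelDPos, dt}, {dVelDPos, 1.0}};
}
```

A check such as maxAbsDiff(toyStepJacobian(x), finiteDifferenceJacobian(toyStep, x)) < 1e-8 passes for this smooth toy system, while adding 1e-6 to any single entry of the analytical result makes it fail. That is exactly the size of error the paragraph above warns against tolerating.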
We have several utilities to make it easier to automatically detect discrepancies in the wild. The first is a flag on the World object, which you turn on by calling World::setSlowDebugResultsAgainstFD(true). With this flag set, every call to any of the BackpropSnapshot::getXXJacobian methods (on any BackpropSnapshot generated from this World) also computes the corresponding BackpropSnapshot::finiteDifferenceXXJacobian internally. If the two Jacobians don't match precisely, we print replication instructions (position, velocity, force, and LCP info) as convenient copy-pastable C++ code that you can use to create a unit test to replicate the issue. We then immediately crash the program.
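The pattern behind such a flag can be sketched in a few lines. This is an illustration only, not the library's implementation: ToySnapshot, its fields, and injectBug are hypothetical, the 1e-8 threshold is borrowed from the rule of thumb above, and the sketch throws an exception where the real flag crashes the program.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical one-degree-of-freedom snapshot; not the real BackpropSnapshot.
struct ToySnapshot {
  double pos = 0.0;
  double vel = 0.0;
  bool slowDebugAgainstFD = false;  // stands in for the World-level flag
  bool injectBug = false;           // deliberately corrupts the analytical result

  // Damped spring step: v' = v + dt * (-k * p - c * v).
  static double nextVel(double p, double v) {
    const double dt = 0.01, k = 4.0, c = 0.5;
    return v + dt * (-k * p - c * v);
  }

  // d v' / d v by central differences.
  double finiteDifferenceVelVel() const {
    const double eps = 1e-6;
    return (nextVel(pos, vel + eps) - nextVel(pos, vel - eps)) / (2.0 * eps);
  }

  // d v' / d v in closed form, cross-checked against finite differences
  // whenever the debug flag is on.
  double getVelVel() const {
    const double dt = 0.01, c = 0.5;
    const double analytical = 1.0 - dt * c + (injectBug ? 1e-6 : 0.0);
    if (slowDebugAgainstFD) {
      const double fd = finiteDifferenceVelVel();
      if (std::abs(analytical - fd) > 1e-8) {
        std::ostringstream msg;
        msg.precision(17);  // enough digits to reproduce the state exactly
        msg << "VelVel mismatch: analytical=" << analytical << " fd=" << fd
            << "\nReplicate with:\n  snapshot.pos = " << pos
            << ";\n  snapshot.vel = " << vel << ";\n";
        throw std::runtime_error(msg.str());  // the real flag crashes instead
      }
    }
    return analytical;
  }
};
```

Printing the state at full precision matters here: a mismatch that only appears in one configuration is much easier to chase when the failing inputs can be pasted straight into a test.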
You can mine for bugs by setting up trajectory optimization problems in C++ tests (see unittests/comprehensive/test_CatapultTrajectory.cpp for an example), then calling World::setSlowDebugResultsAgainstFD(true) and running the test. The optimizer will put your world through its paces as it works on finding the optimal trajectory through it, and will call all of your getXXJacobian methods thousands of times in lots of configurations. This will often find you a bug. You can then copy-paste the replication instructions into a fresh unit test (with the same world setup), and begin analysis.
So you've got a spot where the Jacobians don't match, and you've captured it in a unit test? Excellent! Now we're going to drill down and figure out what's wrong.
Each getXXJacobian() method computes its analytical Jacobian by combining many much smaller matrices, some of which are also Jacobians of less important quantities (like how velocity at this timestep v_t affects the contact force found by the LCP at this timestep f_t). Many of those smaller Jacobians are themselves composed of smaller matrices and Jacobians. In order to diagnose what's going wrong and fix it, we need to get to the lowest possible level of the computation where finite differencing and the analytical method disagree. We've got a huge suite of unit tests to help automate this process.
The first place to start is with #include "GradientTestUtils.hpp" in your test file, if you're working in a fresh test file. That will include tons of useful utilities. The main ones to start with are:
- EXPECT_TRUE(verifyAnalyticalJacobians(world)): This will attempt to find the lowest point where any gradients or Jacobians that come from contact forces diverge. It checks all the way down to the gradients of the contact point and contact normal of each individual contact in your scene with respect to every degree of freedom in your world. If anything doesn't match, you'll see it.
- EXPECT_TRUE(verifyVelGradients(world, world->getVelocities())): This is the next thing to run after verifyAnalyticalJacobians(world). This will attempt to find the lowest point where any Jacobians affecting the next timestep's velocity (v_t+1) diverge.
Both of those methods call lots of lower level methods. If you jump to their source you'll see it's just a huge && combination of lower level tests. Feel free to use those individually in your tests too, to help isolate the problem.
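As a rough picture of what a && combination of lower level tests buys you, here is a toy sketch. Everything in it (inner, outer, Derivatives, verifyInner, verifyOuter, verifyComposed, verifyAll) is hypothetical and unrelated to the real helpers in GradientTestUtils.hpp. The lowest-level pieces are checked first, and because && short-circuits, the first mismatch printed is the deepest one.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <iostream>

using Fn = std::function<double(double)>;

// Central-difference derivative of a scalar function.
double centralDiff(const Fn& f, double x, double eps = 1e-6) {
  return (f(x + eps) - f(x - eps)) / (2.0 * eps);
}

// Reports and returns false when the two values differ by more than tol.
bool matches(const char* label, double analytical, double fd,
             double tol = 1e-8) {
  if (std::abs(analytical - fd) <= tol) return true;
  std::cout << label << " mismatch: analytical=" << analytical
            << " fd=" << fd << "\n";
  return false;
}

// Hypothetical two-stage computation: result = outer(inner(x)).
double inner(double x) { return std::sin(x); }
double outer(double y) { return y * y; }

// Hand-written derivatives of each stage; swap one out to simulate a bug.
struct Derivatives {
  Fn dInner = [](double x) { return std::cos(x); };
  Fn dOuter = [](double y) { return 2.0 * y; };
};

bool verifyInner(const Derivatives& d, double x) {
  return matches("inner", d.dInner(x), centralDiff(inner, x));
}

bool verifyOuter(const Derivatives& d, double x) {
  const double y = inner(x);
  return matches("outer", d.dOuter(y), centralDiff(outer, y));
}

bool verifyComposed(const Derivatives& d, double x) {
  const double analytical = d.dOuter(inner(x)) * d.dInner(x);  // chain rule
  const double fd = centralDiff([](double t) { return outer(inner(t)); }, x);
  return matches("composed", analytical, fd);
}

// Lowest-level checks first, so the first reported failure is the deepest.
bool verifyAll(const Derivatives& d, double x) {
  return verifyInner(d, x) && verifyOuter(d, x) && verifyComposed(d, x);
}
```

If dOuter is off by 1e-6, verifyAll fails, verifyInner alone still passes, and verifyOuter alone fails, which points at the stage that needs attention.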
If you've figured out which part of the computation is going wrong, now you've gotta fix it! A lot of this is just down to reading the code, stepping through in the debugger, and thinking about why it's failing.
There's one more useful utility to talk about, though. We have a special Jacobian on BackpropSnapshot called scratch to help you isolate custom little pieces of code. You can edit BackpropSnapshot::scratch() to return any vector you're interested in, and BackpropSnapshot::getScratchAnalytical() to return what you think the Jacobian of your scratch() vector should be. Then you can use EXPECT_TRUE(verifyScratch(world)) in your unit test to check whether your analytical Jacobian of your scratch method is right. This can be a great clue as you drill down to figure out where a 1e-6 error is being introduced into your Jacobian computation. I've also found it helpful, when building up analytical Jacobians of complex methods, to take it piece by piece and add complexity to scratch() slowly, checking results each time I add something.
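The scratch workflow can be imitated outside the library with a small stand-in. In the sketch below, ToySnapshot, its state vector, and jacobiansMatch are hypothetical; only the method names scratch(), getScratchAnalytical(), and verifyScratch() echo the real ones, and their signatures here are assumptions. In particular, differentiating with respect to a plain state vector is a simplification chosen for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // J[i][j] = d scratch_i / d state_j

// Hypothetical stand-in; the real BackpropSnapshot is far richer.
struct ToySnapshot {
  Vec state;  // assumed input that scratch() is differentiated against

  // The small piece of a larger computation currently under suspicion.
  Vec scratch(const Vec& s) const {
    return {s[0] * s[1], std::sin(s[0]) * s[1]};
  }

  // Hand-derived Jacobian of scratch() at the current state.
  Mat getScratchAnalytical() const {
    const double a = state[0], b = state[1];
    return {{b, a}, {std::cos(a) * b, std::sin(a)}};
  }

  // Central-difference Jacobian of scratch() at the current state.
  Mat finiteDifferenceScratch(double eps = 1e-6) const {
    const std::size_t rows = scratch(state).size();
    Mat J(rows, Vec(state.size(), 0.0));
    for (std::size_t j = 0; j < state.size(); ++j) {
      Vec plus = state, minus = state;
      plus[j] += eps;
      minus[j] -= eps;
      const Vec fPlus = scratch(plus), fMinus = scratch(minus);
      for (std::size_t i = 0; i < rows; ++i)
        J[i][j] = (fPlus[i] - fMinus[i]) / (2.0 * eps);
    }
    return J;
  }
};

// True when every entry of the two Jacobians agrees to within tol.
bool jacobiansMatch(const Mat& a, const Mat& b, double tol = 1e-8) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i].size() != b[i].size()) return false;
    for (std::size_t j = 0; j < a[i].size(); ++j)
      if (std::abs(a[i][j] - b[i][j]) > tol) return false;
  }
  return true;
}

bool verifyScratch(const ToySnapshot& snap) {
  return jacobiansMatch(snap.getScratchAnalytical(),
                        snap.finiteDifferenceScratch());
}
```

Working piece by piece then means starting scratch() with one term, confirming verifyScratch passes, adding the next term to both scratch() and getScratchAnalytical(), and checking again. The first addition that breaks the check is where the error lives.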