Saturday, March 13, 2010

Memory Corruption Makes Me Sad

Nothing makes programmers cower in fear more than memory corruption. These bugs are almost always A) fatal, B) ridiculously difficult to track down and C) hard to reproduce consistently. The combination of these three things can make you start thinking of your memory in surprisingly literal ways:

(and literal in more ways than one, since your memory is "trashed" hahahaha, oh that was bad...)

This particular blog posting is about my own struggle with a bug that I knew about for no less than four months, and for the last two weeks I worked on it 24/7. I fixed it, but it was very difficult to isolate.

Memory corruption happens when "the contents of a memory location are unintentionally modified due to programming errors. When the corrupted memory contents are used later in the computer program, it leads either to program crash or to strange and bizarre program behavior." What makes memory corruption particularly insidious is that your program doesn't crash or behave strangely at the time of modification--it happens later, when something attempts to use the memory that was sullied.

Causes of memory corruption include use of uninitialized memory, use of unowned memory, buffer overflows, faulty heap memory management and multithreading problems. In all cases, the defining characteristic is that where the program crashes isn't necessarily related to what went wrong.

Here's the best strategy I've found for dealing with memory corruption:
  1. If you're really, really lucky, you might be able to catch the problem with something like DevPartner or Valgrind, but typically these only show the problem as it's blowing up. In the case of my bug, DevPartner ended up being beneficial only because it made my error more repeatable. Also, have a look at Application Verifier.
  2. Get a static code analysis tool. If you're using VS.NET 2008, try using their Code Analysis tool. It is surprisingly effective at locating buffer overruns and errant pointer access. These tools may or may not help you, but they certainly can't hurt. If you are lucky, this may be all you need to find your problem. If it doesn't work, then you get to go on to the brute force methods (lucky you!)
  3. Go through all classes and make sure every member variable is initialized correctly. Value types should be set to sensible values, pointer types should be initialized to NULL.
  4. Search for all new/delete/malloc/free statements. Ensure that all pointer values that are allocated begin their life as NULL. Ensure that the memory being released is immediately NULL'd. Ensure you have no new/free or malloc/delete mismatches. Ensure you do not free/delete memory twice. Ensure that any memory allocated by a third party library is also destroyed by that library (i.e. do not call "CrazyLibraryMemAlloc" and use free/delete to clean up unless you are positive that is the correct thing to do). Make sure your destructors and cleanup methods release all memory and NULL everything. Make sure you're using delete[] if the type was allocated with new[]. In essence, everything should begin its life as NULL and end its life as NULL. This is probably the single best thing you can do to isolate memory tomfoolery.
  5. Review every memset, memcpy and mem-whatever in the program (ditto for Win32 variants like CopyMemory). If you are using raw buffer pointers (e.g. void*, int*, etc.), consider wrapping them in something like QByteArray. Review any and all string handling code (in particular, strcpy and the likes). If you have any raw pointer string types, consider replacing them with a decent string class like CString or Qt's QString.
  6. Are you using threads? Does the crash happen in a shared object? If so, this strongly implies your locking strategy (or lack thereof) is the problem. Even if you have locks, be absolutely certain they are working as you expect.
  7. Determine if you are experiencing corruption in the same location, or if it's more random. If it's random corruption, then it is more likely to be a buffer overflow. If it's localized corruption (i.e. let's say the crash always happens in a shared queue, or in the same place in code), then it is more likely that code touching the shared item is invalid. If it crashes in the exact same place always, then you are in luck--you should be able to watch that location in a debugger and break on any read/write. Whether or not you have crashes in the same place is a huge, huge clue about your problem. Track this information religiously.
  8. One method for determining if you have local/random corruption is to declare "no man's land" buffers directly above and below the item being corrupted. Like, nice, big 10k buffers that are initialized to "0xDEADBEEFDEADBEEFDEADBEEF..." When your program crashes, inspect those buffers. If those buffers contain invalid data, then it is not localized corruption. If they aren't corrupted, but the data structure they wrap is, then that strongly implies something that touches the sandwiched object is where the problem may lie.
  9. It is not likely to be the C-runtime, third party code, obscure linking issues, etc. Think about it: you're not the only person using these libraries and tools. They are generally more thoroughly vetted because of Linus's Law. Is it possible? Yeah, sure, anything's possible. But is it likely? Not really.
  10. Unless evidence strongly implies otherwise, assume the issue is in your code. This is good, because it means it's something you can potentially fix. Otherwise, you may have to start a support case with whoever owns the code. If it's an open source project, you might get a quick response (or possibly no response at all...). If it's somebody like MSFT, it is going to take weeks at a minimum. Only as a last resort should you assume it's somewhere else, and be certain you have Real Information™ to back up your theory.
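
To make items 3 and 4 a bit more concrete, here is a minimal C++ sketch of the "begin as NULL, end as NULL" habit. FrameBuffer and SafeDeleteArray are made-up names for illustration only; they are not from the application discussed in this post, and real code would need to adapt the idea to its own allocation scheme.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: release an array allocated with new[] and
// immediately NULL the pointer so a second call is harmless.
template <typename T>
void SafeDeleteArray(T*& p) {
    delete[] p;   // delete[] on a NULL pointer is a no-op
    p = NULL;
}

// Hypothetical class: every member starts with a sensible value,
// and every pointer starts and ends its life as NULL.
class FrameBuffer {
public:
    FrameBuffer() : data_(NULL), size_(0) {}
    ~FrameBuffer() { Release(); }

    void Allocate(std::size_t size) {
        Release();                          // never leak a previous allocation
        assert(data_ == NULL);              // Release() must leave us NULL
        data_ = new unsigned char[size]();  // new[] pairs with delete[]; bytes zeroed
        size_ = size;
    }

    void Release() {
        SafeDeleteArray(data_);
        size_ = 0;
    }

    bool IsAllocated() const { return data_ != NULL; }
    std::size_t Size() const { return size_; }

private:
    // Declared but not defined: copying would lead to a double delete.
    FrameBuffer(const FrameBuffer&);
    FrameBuffer& operator=(const FrameBuffer&);

    unsigned char* data_;
    std::size_t size_;
};
```

Calling Release() twice in a row is safe here precisely because the pointer is NULL'd after the first delete[].
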
It may take a couple of people a day or five to go through the program and make all these changes depending on how big the application is, but it's generally the only real way to isolate problems. And it also gets you in the habit of being religiously fanatical about default values, pointer checking and correct deletion of objects, which is good.
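
The "no man's land" buffers from item 8 could be sketched roughly like this. Guarded, GuardByteAt and kGuardSize are hypothetical names, and the sketch assumes you can wrap the suspect object in a struct; putting the guards and the object in one struct matters because members of a struct are laid out in declaration order, whereas separate variables may be placed anywhere.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kGuardSize = 10 * 1024;  // "nice, big 10k buffers"

// Byte i of the repeating DE AD BE EF fill pattern.
inline unsigned char GuardByteAt(std::size_t i) {
    static const unsigned char kPattern[4] = {0xDE, 0xAD, 0xBE, 0xEF};
    return kPattern[i % 4];
}

// True if the guard buffer still holds the untouched fill pattern.
inline bool GuardIntact(const unsigned char* guard) {
    for (std::size_t i = 0; i < kGuardSize; ++i) {
        if (guard[i] != GuardByteAt(i)) return false;
    }
    return true;
}

// Hypothetical wrapper: sandwiches 'value' between two guard buffers.
template <typename T>
struct Guarded {
    unsigned char before[kGuardSize];
    T value;
    unsigned char after[kGuardSize];

    Guarded() : value() {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            before[i] = GuardByteAt(i);
            after[i] = GuardByteAt(i);
        }
    }

    bool BeforeIntact() const { return GuardIntact(before); }
    bool AfterIntact() const { return GuardIntact(after); }

    // Can be called periodically in debug builds to catch a stomp early.
    void AssertIntact() const { assert(BeforeIntact() && AfterIntact()); }
};
```

If the guards are damaged after a crash, something is overrunning from outside (not localized). If the guards are intact but value is garbage, suspect the code that legitimately touches value.
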

For me, the issue ended up being extremely subtle (it evaded two separate code reviews by my peers), and after finding it, painfully obvious and somewhat embarrassing. I was having localized corruption around a shared queue that two threads accessed. The culprit was invalid locking code I'd written. Once the queue became corrupted, it would fail somewhere in the bowels of whatever queue object I'd been using (I tried CAtlList, QList, etc., but it didn't matter because none of them are thread-safe).
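
The faulty locking code isn't shown in this post, so the following is not a reconstruction of it. It is only a minimal sketch of one way to wrap a non-thread-safe container so that every access goes through the same lock. LockedQueue is a hypothetical name, and std::mutex (C++11) stands in for whatever locking primitive the original code used.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical wrapper: the underlying std::queue is not thread-safe,
// so it is private and only reachable through methods that take the lock.
template <typename T>
class LockedQueue {
public:
    void Push(const T& item) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(item);
    }

    // The emptiness check and the pop happen under ONE lock. Checking
    // empty() and then popping under separate locks leaves a window in
    // which another thread can drain the queue in between.
    bool TryPop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so const methods can lock
    std::queue<T> items_;
};
```

The design choice worth noting is that callers never see the raw container, so there is no way to touch it without holding the lock.
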

Which brings me back to item #10 in the above list: it's always your fault. It was my fault. It can be very tempting to assume otherwise, but generally I don't find this to be the case. So keep an open mind, think analytically, write down what you know and what you don't know, and you'll be done with the bug sooner than you know it!

Monday, March 01, 2010

DMOs Considered Harmful

Lately I've been working with DMO filters. Typically I've written regular DirectShow filters derived from the base classes, but after reading this article I decided that maybe it was time to start writing DMOs instead of transform filters. The advantages seemed attractive, and being able to write a filter that would work in Media Foundation was tempting.

After writing two of them, I can now say this was a bad idea. There are serious performance implications, major limitations and many of the proposed advantages are simply not true.

Let's start with the deal-breaker for me: you may run into serious performance issues when using a DMO filter in DirectShow because you cannot set the number or size of output buffers. When using the DMO Wrapper Filter (think of this filter as the "translator" between the DMO model and DShow filters--you could write your own DMO wrapper if you wanted), the ALLOCATOR_PROPERTIES on the output pin will always look something like:
You will always get one big buffer for the output allocator. There is no way to change this that I know of. The article on MSDN mentions "...DMOs have no control over the number of buffers or memory allocators," but what it fails to tell you is that this can have serious consequences for high-performance playback, and particularly so for variable rate playback.

The same filter implemented as a typical transform filter allows me to specify the size and count of the output buffers:
Here, I made the DShow filter use 20 output buffers. This filter, with the exact same decoder, would do up to 8x on my machine without missing a beat. I could not reliably do better than 2x with the DMO filter before the filter graph manager reverted my SetRate() call. In case it isn't obvious, you give up a lot of control by using a DMO in DirectShow. And for some scenarios, it's pretty clear you sacrifice a lot of performance as well.

But it gets worse. Let's say your H264 encoder pukes everywhere. Vomits ALL OVER. You, being the responsible programmer you are, would like to communicate this failure to some higher power so it can come in with scrubbies and bleach and all that good stuff and clean up the technicolor yawn that is your filter innards.

Normally you'd call IMediaEventSink->Notify() and be done with it--the event gets handled asynchronously by the filter graph manager so you need not worry about threading issues (woe is the DShow coder who fires a synchronous event from their streaming thread), and everything can be dealt with in a common event handler. But someone, in their infinite wisdom, did not provide a standard eventing method for a DMO filter. Which means: your events fall on the floor.

There are a few options to deal with this. You can do what normal DirectShow filters do: hold a weak reference to IMediaEventSink and send events through that. But this requires a custom interface, and suddenly your filter is no longer that nice, clean abstraction that works in Media Foundation and DirectShow. You could create your own eventing interface, but it would need to be asynchronous (since the DMO is being called on the streaming thread) so this isn't exactly trivial. These options are not appealing.

These are the two major grievances I have with DMOs. Minor grievances include:
  • The documentation mentions issues with IDMOQualityControl, which is the DMO version of IQualityControl. Purportedly, the DMO interface is "not ideal for quality control due to the limitations of the interface itself." But nowhere are these "limitations" outlined. It'd be great if MSDN would make it clear what they are.
  • The claim that "DMOs require less methods to implement, and they provide a documented API" is total nonsense. My DMO implementation was about 300 lines and included no fewer than 14 methods I had to implement (note the section at the bottom outlining required methods). For CTransformFilter? 200 lines of code and 5 methods. Also, note that CTransformFilter is completely documented.
  • Want to manually instantiate a filter? Good luck. Have fun. You basically have to know all sorts of complicated junk about apartment threading and COM and ATL to get this to work. It's possible, but it's a lot more work and not as well documented as manually instantiating a regular DirectShow filter.
I should be fair: some of these problems are really issues in the DMO Wrapper Filter and there's certainly nothing stopping someone from writing their own wrapper filter. But some of the issues, such as the lack of eventing, are not so easy to deal with. Regardless, the big question is: why bother? Why not just write a regular transform filter and not deal with any of these problems?

In retrospect, I'd be more inclined to devote my efforts to some cross-platform rendering solution, like GStreamer.