Sunday, November 18, 2012

Rewriting DirectShow?

DirectShow is a pretty tenacious interface, having been in use for approaching two decades at this point.  The official party line from Redmond is "use Media Foundation," but that's never seemed particularly practical given that Media Foundation is absent from older versions of Windows.

I've always (perhaps foolishly) thought it'd be fun to write my own media framework.  So I spent some time figuring out what I would hope to improve over DirectShow.  I came up with the following list of things:

  1. I dislike most things about IMediaSample
    1. I dislike the way it communicates a media type--through AM_MEDIA_TYPE.  AM_MEDIA_TYPE is awkward to work with, easy to misuse, and littered with legacy baggage nobody really cares about anymore.  It's easy to leak memory with it and difficult to understand (see the sketch after this list).
    2. I dislike that media samples don't have a direct way to communicate their media type.  You might counter with IMediaSample::GetMediaType(), but that method provides a media type only when it differs from the previous sample's.
  2. Filters consist of separate "pin" and "filter" interfaces despite being one object; this distinction has always seemed nonsensical to me.  A pin really just describes the kinds of input or output a filter exposes, plus a means of "connecting" to it; the filter is...presumably everything else.  The separation makes filters messy to write and odd to communicate with, and I see no clear reason for it.  It also creates coupling problems (for example, see CSource and CSourceStream in the DShow base classes, particularly how a pin adds itself to a filter and manages the reference count).
  3. Support for B-frames is marginal at best.  DirectShow gives a sample only a single timestamp--not separate timestamps for presentation and decode time--which makes handling certain types of media difficult.
  4. I dislike how duration is represented--fixed 100 ns units.  A better way is to give each media stream its own timescale--this avoids nearly all rounding error and most floating-point math, except at endpoints and render time.
    1. When you think about it, specifying a fixed duration unit is really the same as specifying a timescale, only less flexible.  A duration based on 100 ns units is the same thing as a timescale of ten million (there are ten million 100 ns units in one second).
    2. Example: we're processing 8 kHz audio.  The natural timescale here is 8000.  When buffers come along, the "duration" of a buffer is simply the number of samples it contains.  If each incoming buffer holds 1024 samples, its duration is 1024--that is, 1024/8000 seconds.  (A code sketch after this list works through the arithmetic.)
    3. Example: we're processing H.264.  Many applications use a timescale of 90000, although accuracy equal to DShow's 100 ns units can be had by setting the timescale to ten million.
    4. That last example points at another benefit: compatibility with other media frameworks--like DShow--is easy.  Set the timescale to ten million and, voila, no translation is necessary.
  5. I dislike DirectShow's sample pool concept.  It was probably a great design choice in the early 90s, when processing media seriously taxed the hardware, but these days it's irrelevant.
  6. Automatic graph building is a cool idea, but it's never worked well for me in production code because random filters may get inserted into the graph.  It's neat for quick-and-dirty apps that do very little; it's pretty much pointless for a serious production application.
  7. Surprisingly, many of the filters I'd want built in are absent.  There isn't, for example, a comprehensive filter for colorspace conversion or audio resampling; many of the "bread and butter" filters you often desperately want just aren't there.
  8. Live video streaming support is horribly complicated to implement, and it's questionable whether most downstream filters even honor that part of the API.  Getting A/V sync to work well with a live stream gives me fits.
  9. In general, it's a very complicated API--likely because, at the time, it was tackling some very complicated problems.
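
To make point 1 concrete, here's a minimal sketch (mine, not canonical SDK code) of what every downstream consumer has to get right just to track format changes.  DeleteMediaType comes from the DirectShow base classes (mtype.h), not the core SDK headers:

    #include <streams.h>   // DirectShow base classes: DeleteMediaType lives here

    HRESULT HandleSample(IMediaSample *pSample)
    {
        // GetMediaType returns S_OK only when the format has *changed*;
        // otherwise it returns S_FALSE and leaves the pointer NULL, so the
        // caller has to remember the previous type itself.
        AM_MEDIA_TYPE *pmt = NULL;
        HRESULT hr = pSample->GetMediaType(&pmt);
        if (hr == S_OK && pmt != NULL)
        {
            // ...react to the new format (pmt->majortype, pmt->pbFormat)...

            // The struct and its pbFormat block are CoTaskMemAlloc'd;
            // skipping DeleteMediaType leaks both.
            DeleteMediaType(pmt);
        }
        return SUCCEEDED(hr) ? S_OK : hr;
    }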
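
And here's the timescale idea from point 4 in code.  MediaBuffer and its fields are hypothetical--this is a sketch of the proposed design, not an existing API:

    #include <cstdint>
    #include <cstdio>

    struct MediaBuffer
    {
        int64_t  duration;   // in timescale units (for PCM audio: sample count)
        uint32_t timescale;  // units per second for the owning stream
    };

    // Seconds only materialize at the edges (rendering, UI, seeking);
    // everything in between stays in exact integer timescale units.
    double DurationInSeconds(const MediaBuffer &buf)
    {
        return static_cast<double>(buf.duration) / buf.timescale;
    }

    int main()
    {
        MediaBuffer audio = { 1024, 8000 };         // 8 kHz audio, 1024 samples: 0.128 s
        MediaBuffer dshow = { 1000000, 10000000 };  // timescale of ten million == 100 ns units: 0.1 s
        printf("%f %f\n", DurationInSeconds(audio), DurationInSeconds(dshow));
        return 0;
    }
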
On the other hand, there are things I really like about DirectShow:
  1. It can be extended to handle nearly any audio or video type, metadata types, etc.  It is extremely flexible.
  2. Because it is so modular, a major change to your application often means replacing just a filter or two, so changes end up being quite tidy.  For example, I have a playback application using MP4 files containing H.264 and AAC.  Someday, if the business requirements change, I only have to swap out one filter and make sure it implements IMediaSeeking, and everything should work exactly as it did before.
  3. IFileSourceFilter is a great idea.  I don't like the naming, but having a standard interface to configure source filters makes it much easier for client writers to swap out filters.
  4. IMediaSeeking is a great interface for advanced seek operations.  It's great to have this work well at a high level without being overly concerned with the guts of the underlying filter graph (see the sketch after this list).
  5. Some of the utility filters, like the infinite tee, null renderer, sample grabber, etc. are always super handy.
  6. The event-driven framework--while overly complicated--is really useful, since processing audio and video (particularly in real time) is a very event-oriented process.
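
As an illustration of point 4 above, here's roughly all the client code a seek requires--the graph manager's IMediaSeeking distributes the call to whichever filters can honor it (a sketch, with error handling trimmed):

    #include <dshow.h>

    HRESULT SeekToTenSeconds(IGraphBuilder *pGraph)
    {
        IMediaSeeking *pSeek = NULL;
        HRESULT hr = pGraph->QueryInterface(IID_IMediaSeeking, (void **)&pSeek);
        if (FAILED(hr))
            return hr;

        // Media time is in 100 ns units, so ten seconds is 10 * 10,000,000.
        LONGLONG pos = 10 * 10000000LL;
        hr = pSeek->SetPositions(&pos, AM_SEEKING_AbsolutePositioning,
                                 NULL, AM_SEEKING_NoPositioning);
        pSeek->Release();
        return hr;
    }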
In the coming days, I'm going to take a shot at implementing a very basic media framework in C#.  People have said managed code isn't suitable for media processing; I plan to test that hypothesis, because I've seen no real evidence to support it.

I'll update my progress here as I go.