Friday, May 28, 2010

HD Video Standard, Cont.

Last year, there was brief discussion about a standard for high definition video; you can see my previous comments here.

One of the big questions about a high definition standard is how to make sure the quality of video encoding is acceptable. Sure, video can be 1080P, but if it's over-compressed garbage produced with a sub-par encoder then it's really difficult to claim that it is "high definition." For example, here are two screenshots from blu-ray disks. I wish I were kidding about this, but it really underscores the importance of having an objective metric to ensure that some reasonable encoder quality is met.

Unfortunately, people can even screw this up. Consider today's post from the WebM blog about VP8 quality. First, it uses peak signal to noise ratio (PSNR), which is certainly a venerable metric, but almost everybody and their mother agrees that it correlates poorly with human perception of image quality. One need not look further than this paper, which outlines why this technique (particularly in the field of image processing) should be relegated to the dust-bin. Even worse is programmers can "cheat" when optimizing for PSNR, and the result is generally a blurry mess--they chose settings that result in higher PSNR, but this doesn't infer higher quality. (With regard to VP8, "rumor has it that there was some philosophical disagreement about how to optimize [encoder] tuning, and the tune for PSNR camp (at On2) mostly won out. Apparently around the time of VP6, On2 went the full-retard route and optimized purely for PSNR, completely ignoring visual considerations. This explains quite well why VP7 looked so blurry and ugly.")

A much more basic problem is how the results are presented: line graphs. Here's an example:

The vertical axis is PSNR score--higher is better. The bottom axis is bitrate. So it's pretty clear what this graph shows: it compares the quality between the VP8 encoder with and without a given feature at equivalent bitrates. However, there's also no mention of how each data point was calculated (there's even less mention of H264, but I'll leave that for another post). Video is a sequence of images, so PSNR and other image processing metrics must be run on each frame. Unlike a single image, this results in a problem: how do we take all of those scores and give an overall indication of the quality of the video encode? How do we know that the upper clip doesn't have a lot more volatility or substantial lapses in quality somewhere?

At first I assumed the WebM people were simply using average PSNR, which would be computed something like:

average = {SUM log(SNR)} / {# of frames}

I was wrong; they're using "global PSNR," which is computed as:

global = log ({SUM SNR} / {# of frames} )

But this makes no difference--global is still a mean. The reason to use global is because an encoder can recreate a pure black frame perfectly, which gives a PSNR score for that frame of infinity, which would skew average PSNR. Global PSNR avoids that problem by literally looking at the whole video as one big frame (this is wrong in so many ways; see aforementioned link on the deficiencies of PSNR, but I digress).

Problem is, means can be heavily affected by outliers, so there may be a few bad frames dragging down the score or a few great frames dragging up the score--or, there may even be a few horrible frames and a lot of great frames, in which case it all averages out, but it's also clear your video has a serious consistency problem. It's very clear that this "single value" representing video quality is suspect at best, and junk engineering at worst.

This "single value" really deprives us of understanding the overall quality of the video stream. So, without further ado, let me present some alternatives. In my previous post, I mapped SSIM scores on a per-frame basis:

Each line consists of a given bitrate. Vertical axis is SSIM scores, while the horizontal axis is frame number. You can see how drastically SSIM scores within a single clip can vary. This method also works really well with multiple encoders at a comparable bitrate, because you can clearly see where one encoder outperformed another.

Another more simple and compact way of visualizing video quality data is a box and whisker plot. Here's box plots of SSIM scores from a clip with 165 frames; each box plot corresponds with different encoder settings (all of them are the same bitrate);

This box plot conveys a lot of information about the video quality. The median value (middle of the box plot) is a general indication of the overall quality of the clip. The upper and lower quartiles (the area directly above and below the median value) encapsulate 50% of all samples, while the whiskers generally extend out some reasonable amount (this varies, see wiki article for details). Outliers are marked with circles.

What does this all mean? Consider the above box plot--the median values really don't vary that much. But what does vary is (a) the size of the quartiles and whiskers and (b) the number of outliers. Outliers, in this instance, correspond with frames that look like crap. Tighter box plots indicate more consistent quality, and higher box plots indicate higher overall quality. Using this technique, one can easily compare encoder quality and get an idea of how they really stack up against one another. From this box plot, it's pretty clear "highProfile" is the obvious winner. If you were just looking at averages or medians, it is hardly so clear. Certainly you have zero indication of any outliers, where the quality of most frames falls, etc.

Means are basically retarded. They deprive you of understanding your data well.

In any event, what really irks me about this WebM debate isn't that there's a legitimate competitor to H264--competition is always a good thing, and I welcome another codec into the fold. What bothers me is that we would let bogus metrics and faulty data presentation influence the debate. At this point, I see no indication that VP8 is even remotely on par with a good implementation of H264--perhaps it can be sold on other merits, but quality simply isn't one of them. This could change as the VP8 implementation improves, but even the standard itself is roughly equivalent to baseline profile of H264.

Putting that whole debate aside and returning to the notion of a high definition video standard, using these methods--and in particular, a box plot--one can establish that a video at least meets some baseline requirement with regard to encoder quality. The blu-ray shots above are a pretty clear indication that this industry needs such a standard. More importantly, consumers deserve it. Regardless of the codec, it's not right for people to shell out cash for some poorly-encoded, overly-quantized, PSNR-optimized mess.

Sunday, May 09, 2010

Object Tracking for Scientists

My significant other works in zoology. She needed a program to analyze about a thousand hours of video containing salamanders. Since image and video processing is my primary focus, I offered to take a stab at analyzing her video.

The goal was to determine how much distance the object (in this case, a cute little salamander known as Plethodon Shermani) traveled over the course of the video. There were definitely a few takeaway lessons from this exercise:
  1. It turns out object tracking is more difficult that I would have originally imagined, and
  2. You can make your life a lot easier during the analysis phase by making sure your video is as clean as possible. Our video was relatively clean despite some confounding factors inherent in working with salamanders, but in retrospect there were a few things that would have made analysis much easier.
I'd like to discuss both points in detail. I'm going to start with the process I used to track the object first, since that will explain many of the crucial aspects of #2.

Object Tracking for Dummies
Rather than write a bunch of code from scratch, I started with AForge.NET, which is a nice open source library with a number of image processing primitives. Specifically, I massage my input data until I can use some of the blob detection features of the library. Here is the general method I chose to perform object tracking:
  1. Convert the image to gray scale--all color information is discarded. This is because (a) color information is not necessary and (b) the subsequent processing happens much quicker on an 8 bits/pixel image--color is 24 bits/pixel.
  2. Because the salamander is darker than the background, I invert the image. Light colors become dark, and dark colors become light.
  3. Run the image through some noise reduction filter, such as a median or mean filter. Median works better, but is rather expensive.
  4. Here, I then apply a threshold filter. This filter converts all pixels to either black or white, depending on whether they're below or above a specified threshold. Since the salamander is lighter than the general background, it falls out of the woodwork here.
  5. After that, I apply an opening filter to remove any small objects from the image.
  6. At this point, hopefully all that is left is the salamander. Here, I use the blob counter class.
At this point, there are a few tricks to make sure the animal is tracked correctly. First, I track the position of the salamander with a list of points. I can use this list to (a) draw out the path of the salamander and (b) compute total distance traveled. Total distance traveled is simply the cumulative sum of the Euclidean distance between successive entries in the list. The final unit is measured in pixels. Doing some metric unit is much more difficult unless you have a known mapping between pixels and actual width.

Should there be more than one blob, I do some basic filtering on blob size--I find the salamander typically has a fairly consistent area. Furthermore, I utilized principles of temporal and spatial locality--in the event of multiples blobs, I choose the blob closest to the previous known location of the salamander. Given that salamanders don't move very quick, this is a surprisingly safe bet.

Lastly, I do some mild filtering on the list of points--I only record changes in distance greater than a few pixels to weed out any residual noise in the image.

There are some definite drawbacks to this method. Most importantly, it relies on the salamander being the darkest object in the image, which I've found to be (mostly) true. It also requires knowing the salamander's initial location, which can be tricky. Lastly, the "threshold" setting tends to vary due to variance in lighting conditions at the time of recording, so I attached that to a slider bar that's configurable in real time. There's also a big discrepancy between the sizes of salamanders, so the target min/max area is configurable.

Here's a picture of the final result:

The left screen is the original video. The middle screen is what the video looks like after the whole invert->medium->threshold->open->detect blob method. I flag the resulting object with a nice red bounding box, and the size of the object. The right screen draws out the path the salamander took in the box. Total distance traveled is listed in the bottom status bar.

Worth note is that white bar on the left side of the middle image--that's the result of using a dark cardboard divider in the video. In retrospect, this divider should have been as white as possible. This is a nice lead into our next topic...

Getting Clean Data
One thing that's important is making your video data as easy to analyze as possible. Here are a few considerations:
  1. Low light conditions will result in salt and pepper noise in the video stream. This can be removed (somewhat) using a median filter, but the filter is computationally expensive. There are other cheaper filters, but they come with various performance tradeoffs. So, high-light conditions are vastly preferable, if possible.
  2. Seek to have the highest contrast possible between the object(s) you need to track and the general background. This means all background in the video should be the opposite color of the thing you're tracking. For example, if you need to put a piece of cardboard somewhere, and you're tracking a black mouse, make your cardboard white.
  3. Minimize glare as much as possible. Museum quality glass with low glare treatment (like the type used in picture frames) will help substantially; it is vastly preferable over the crappy plastic lids many trials use. Also consider a polarizing lens to further reduce any glare.
  4. Make your lighting as consistent as possible. Having half the field of view be bright and half be dark--or the existence of shadows due to point sources of light--is difficult to manage.
  5. If possible, mark objects with visually distinct queues. For example, if you can mark an object with a red circle, that's going to make your life a lot easier in the analysis stage.
  6. Maintain positions throughout the experiment as much as possible. If you can, bolt the camera in place, and make sure the tray or dish (or whatever) is situated in the exact same location.
  7. Make sure all cameras are the same distance from the object for the entire duration of the experiment. Otherwise, it can be very difficult to track distance traveled. Also, wide angle lenses will make it difficult to track distance traveled since the image is so distorted around the edges.
  8. Higher resolution is "nice to have," but not nearly as important as consistent lighting, low glare, high contrast between object/background and consistent placement.
  9. I would take consistent lighting performance from a camera--and excellent low-light capabilities--over higher resolution or frame rate.
Your mantra: high contrast, high contrast, high contrast. If your object is black, the ideal video should look like a blob of black amid a snowy field. Obviously this may be hard; you might need visual markers in your video, you may have edges or barriers setup that are dark(er), etc. Try to minimize this as much as possible.

Salamanders were tricky for another reason: they like dark places. We used low-light conditions, which results in particularly difficult lighting conditions. To compensate for this, high quality glass is a big help. Also, the median filter helps, as does having a camera with good low-light performance.