Sunday, August 25, 2024

Yes, We Have No Bananas

HC199, RA#2

Note: video in question can be found online at http://video.google.com/videoplay?docid=-5479410612081345878

Introduction

This analysis will focus on a video clip of a television program called The Way of the Master, staring Kirk Cameron and Ray Comfort. The specific episode is entitled "The Beauty of a Broken Spirit – Atheism," and the relevant portion begins at 03:00 and runs until about 05:15 (Cameron and Comfort).

In the segment, Comfort makes an argument that a banana is "an atheist's worst nightmare" because of its clever design. Comfort notes that a banana is pre-wrapped, fits perfectly in the human hand and mouth, has a "pop top" for easy opening, desirable color and flavor, and is biodegradable. Based on this, Comfort concludes that the banana has been intelligently designed, and therefore there must be an intelligent designer. This intelligent designer must be God, which in turn validates the existence of God.

The program's stated objective is to show viewers how to "effectively share their faith" with non-believers, making its stated purpose informative. Cameron and Comfort attempt to arm their viewers with arguments to convince atheists that God exists, and religious faith is preferable to atheism. With respect to the segment in question, the underlying purpose is not to inform but to put forth a persuasive argument for Intelligent Design.

This analysis will first attempt to determine who composes the primary and secondary audiences of The Way of the Master. This will greatly affect whether or not the video segment in question is informative, persuasive, or both. Because the program's fundamental purpose is to persuade atheists of the merits of Intelligent Design (by way of proxy: viewers will presumably be the ones making the argument), Comfort's argument will be evaluated using the Toulmin model.

Audience

The Way of the Master (TWOTM) is broadcast to more than a hundred countries via satellite. Trinity Broadcasting Network, JCtv, Faith TV, Family Net, Total Living Network, Cornerstone Television, Christian Television Network, and numerous other Christian and family oriented television stations air Comfort and Cameron's program (The Way of the Master).

An examination of one of these stations—Trinity Broadcasting Network (TBN)—says much about target audiences of TWOTM. TBN is a 24 hour channel carried by 5,000 stations, 33 satellites, and cable systems estimated to be in the thousands. TBN has also benefited from must-carry rules that put it in 95% of American households (Becker). According to TBN, they are "…the world's largest religious network and America's most watched faith channel. TBN provides 24 hours of commercial-free inspirational programming that appeal to people in a wide variety of Protestant, Catholic, and Messianic Jewish denominations." TBN claims their programming is watched by over 5 million viewers in the United States, and reaches an estimated 95 million U.S. households. TBN humorously claims their programs have "High Secular Appeal" (Trinity Broadcasting Network).

It is clear that the typical viewer of TBN is not secular; all of their programming has an overarching religious theme. A Fruitcake Christmas ("Join Hermie and his pals for yet another Christian adventure!"), Jesus, Divine or Da Vinci ("Is Jesus the Divine Son of God as recorded in the New Testament? Or is Jesus a mortal man as identified in the popular book, The Da Vinci Code?"), Tribulation ("You know it's faith, when it's all you have left."), and other programs listed on TBN's website are painfully unappealing to secularists. Then, of course, there's TWOTM, which claims to be "…a ministry of Kirk Cameron and Ray Comfort, whose purpose is to teach Christians how to share the gospel effectively, biblically…the way Jesus did" (Trinity Broadcasting Network).

So Cameron and Comfort's primary audience consists of people who are religious, and more specifically, Christians. TBN has several different networks targeted at different market segments; TWOTM airs on Mondays (4-4:30 PM PST) and Wednesdays (3:30-4:00 AM PST) on a network targeted towards adults and the elderly. This means the program airs in the coveted 5-7 PM time slot for Mountain, Central and Eastern time zones. The morning program starts between 4:30 and 6:30 AM, which targets adults in Mountain, Central and Eastern time zones. From this schedule, it is clear this program is targeted towards adults and the elderly. It also ignores the entire Pacific time zone (4 PM is too early in the day, and nobody is watching television at 3:30 AM on a Wednesday), which means the primary audience consists of people living in Mountain, Central and Eastern time zones.

The secondary audience consists of intellectuals, academics and secularists. This seems like a strange secondary audience, but the video strikes a resoundingly dissonant chord with this audience. For example, the clip was shown in a talk by Oregon State Professor Stevan Arnold entitled "Intelligent Design and Evolutionary Biology: When Worlds Collide" to a full audience of students and academics (Arnold). The video's fantastically unscientific methods make it an irresistible target for anyone who opposes Intelligent Design or has a vested interest in evolutionary theory.

Analysis ("Yes, We Have No Bananas")

TWOTM claims to provide viewers with ways they can share their beliefs with atheists and non-believers, which makes its stated purpose to inform viewers. For the video clip in question, the underlying purpose is to arm viewers with an argument in favor of Intelligent Design. These viewers would then use this argument against atheists, implying that the actual purpose of the video segment is to indirectly inform atheists of the merits of Intelligent Design and the fallacy of evolutionary theory.

The Toulmin model can be used to analyze Comfort's argument in favor of Intelligent Design:

Claim: Because the banana is perfect for human consumption, there must be an intelligent designer.

Support: Banana fits perfectly into human hand, comes pre-wrapped, has desirable color and flavor, has a "pop-top" much like a soda can (which also had an intelligent designer), and is biodegradable.

Warrant: The Intelligent Designer must be God.

Comfort's support is not debatable. Bananas are fit for human consumption. And, his claim is not debatable either: commercial bananas were "intelligently designed." However, his error is in the warrant: God is not the Intelligent Designer of Comfort's banana. Arnold notes in his presentation that "…the banana is a domesticated plant, and its fruit has partly arrived at its current configuration under the agency of deliberate selection by humans. Consequently, the banana is actually a poor choice as an exemplar for the design argument" (Arnold). Bananas have been specifically bred and selected over thousands of years—not by God, but by man—to arrive in their current configuration. As a consequence of this, they have no seeds, consistent/pleasing color and flavor, and have been carefully selected for commercial viability (they must survive a lengthy import, age well, etc.). If Comfort had been eating the banana God designed, he would have cracked his tooth on a seed.

Bananas are an interesting choice for a completely different reason as well: commercial bananas lack genetic diversity because they are clones. One of the side effects of making commercial bananas seedless is they can no longer sexually reproduce. Because of this, commercial banana strains have no ability to adapt to their environment. Entire strains of banana can be made commercially obsolete by a single aggressive fungi or bacterial disease. This happened in the 1960s; Americans used to consume a commercial variety of banana called the Gros Michel (also a clone) but a fungus decimated the entire population. The banana industry was forced to pick another strain—this one called the Cavendish—which is now being threatened by a similar fungus and faces commercial disaster (Koeppel).

Because all commercial bananas are clones, they cannot participate in one of the primary mechanisms by which evolutionary change occurs: natural selection. Favorable hereditary traits (e.g. resistance to some aggressive pathogen) are not passed on because no sexual reproduction occurs. This makes it almost impossible for commercial bananas to adapt to new threats.

A banana is not an atheist's worst nightmare; to the contrary, it is a superb argument in favor of modern evolutionary theory. First, without natural selection, the banana is defenseless against quickly adapting pathogens. Second, in producing a strain of banana that is wholly susceptible to disease, we have created the opportunity for pathogens to exploit this vulnerability via the same evolutionary mechanisms that have been crippled in the banana. In this case, the "intelligent designer" turns out to be us, although the title is a misnomer; no "intelligent" designer would create an organism that is completely incapable of defending itself against a potential threat vector.


Works Cited

Arnold, Stevan J. "Intelligent Design and Evolutionary Biology: When Worlds Collide." 2 May 2005. 17 May 2008 <http://oregonstate.edu/~arnoldst/index_files/Worlds%20Collide.ppt>.

Becker, Anne. TV's Religious Revival: Broadcasting & Cable. 25 4 2005. 17 May 2008 <http://www.broadcastingcable.com/article/CA527238.html?display=News&referral=SUPP>.

Conner, Steve. National Geographic News. 27 July 2001. 17 May 2008 <http://news.nationalgeographic.com/news/2001/07/0726_wirebanana.html>.

Koeppel, Dan. Can This Fruit Be Saved? 22 June 2005. 17 May 2008 <http://www.popsci.com/scitech/article/2005-06/can-fruit-be-saved>.

The Way of the Master - The Beauty of a Broken Spirit--Atheism. Perf. Kirk Cameron and Ray Comfort. 2005.

The Way of the Master. The Way of the Master. 17 May 2008 <http://www.wayofthemaster.com/>.

Trinity Broadcasting Network. TBN - Trinity Broadcasting Network. 17 May 2008 <http://tbn.org/>.

Wednesday, December 05, 2012

Thank You, Gordon Moore

In October of 2007, I wrote about hardware assisted brute force attacks.  Namely, that it was an error to assume today's computing power was what we'd have access to in the future.  For example, if you know some computing operation is going to take ten years, one option is to wait five years, at which point (assuming Moore's law) your operation will only take 2.5 years.  You save 2.5 years.

Well, today I stumbled across this:
In a test, the [GPU cluster] was able to churn through 348 billion NTLM password hashes per second. That renders even the most secure password vulnerable to compute-intensive brute force and wordlist (or dictionary) attacks...this means we could rip through any 8 character password (95^8 combinations) in 5.5 hours.
The point is: if you wait five years, the hardware you get is exponentially faster than what was available at the time.  In 2018, are we going to be down to minutes?  And what about combining this approach with rainbow tables?

Sunday, November 18, 2012

Rewriting directshow?

Directshow is a pretty tenacious interface, having been used for approaching two decades at this point.  The official party line from Redmond is "use MediaFoundation," but it's never seemed particularly useful given that MediaFoundation is absent in older versions of Windows.

I've always (perhaps foolishly) thought it'd be fun to write my own media framework.  So I spent some time figuring out what I would hope to improve on over Directshow.  I came up with the following list of things:

  1. I dislike most things about IMediaSample
    1. I dislike the way it communicates a media type--through AM_MEDIA_TYPE.  AM_MEDIA_TYPE is weird to work with, easy to use incorrectly, and littered with buckets of legacy junk that nobody really cares about anymore.  It is easy to leak memory with it.  It is difficult to understand.
    2. I dislike that media samples themselves don't have a method for communicating their media type.  You might counter with IMediaSample::GetMediaType(), but that method provides the media type only if it differs from the previous media type.
  2. Filters consist of separate "pin" and "filter" interfaces despite being one object; this distinction has always been nonsensical to me.  A pin is really the types of input or output a filter contains, as well as a means of "connecting" to the filter.  The filter is...presumably everything else.  The separation between the objects makes writing filters messy, and communicating with them odd.  And I don't see a clear reason for the distinction.  It also creates coupling problems (for example, see CSource and CSourceStream in the DShow base classes, particularly how a pin adds to a filter and deals with reference count)
  3. Support for B-frames is marginal at best.  DirectShow has only a timestamp--not separate timestamps for presentation and decode time--which makes handling certain types of media difficult.
  4. I dislike how duration is represented--100 NS units.  A better way to represent duration is to give individual media streams a timescale--this avoids nearly all rounding error and most floating point math, except at endpoints and render time.
    1. When you think about it, specifying the duration of units is really the same as specifying a timescale, except less flexible.  A duration based on 100 NS units is the same thing as having a timescale of ten million (there are ten million 100 NS units in one second).
    2. Example: we're processing 8 khz audio.  The best timescale to use here is 8000.  When buffers come along, the "duration" of the buffer is simply the number of samples contained within.  Let's say incoming buffers contain 1024 samples in it; our duration is set to 1024 as well.  Total sample duration is 1024/8000 seconds.
    3. Example: we're processing H.264.  Many applications use a timescale of 90000, although accuracy up to DShow's 100 NS units can be achieved by setting the timescale to ten million.
    4. This last example is worth mentioning--compatibility with other media frameworks--like DShow--is easy.  Set timescale to ten million, voila: no translation necessary.
  5. I dislike DirectShow's sample pool concept.  This was probably a great design choice in the early 90s, when processing media was a pretty serious task, but these days....it's irrelevant.
  6. Automatic graph building is a cool idea, but its never worked well for me in production code because random filters may be inserted into the graph.  It's neat for quick-and-dirty apps that do very little.  It's pretty much pointless for a serious production application.
  7. Surprisingly, many of the filters I'd want built-in are often absent.  For example, there really isn't a comprehensive filter for dealing with colorspace conversions, resampling audio, etc.  Many of the "bread and butter" filters you often desperately want aren't available.
  8. Live video streaming support is horribly complicated to implement, and it's questionable if most down-stream filters are even following suit with the API.  Getting AV sync to work well with a live stream literally gives me fits.  
  9. In general, a very complicated API.  Likely because at the time it was dealing with some very complicated subjects.
On the other hand, there are things I really like about DirectShow:
  1. It can be extended to handle nearly any audio or video type, metadata types, etc.  It is extremely flexible.
  2. Because it is so modular, often a major change to your application only involves replacing a filter or two here and there, so changes end up being quite tidy.  For example, I have a playback application using MP4 files containing H.264 and AAC.  Someday, if the business requirement changes, I only have to swap out one filter and make sure it implements IMediaSeeking....and everything should literally work as it previously did.
  3. IFileSourceFilter is a great idea.  I don't like the naming, but having a standard interface to configure source filters makes it much easier for client writers to swap out filters.
  4. IMediaSeeking is a great interface for advanced seek operations.  It's great to have this work well at a high level without having to be overly concerned with the guts of the underlying filter graph.
  5. Some of the utility filters, like the infinite tee, null renderer, sample grabber, etc. are always super handy.
  6. The event-driven framework--while overly complicated--is really useful given that processing audio and video (particularly in real-time) is a very "event" oriented process.
In the coming days, I'm going to take a shot at implementing a very basic media framework in C#.  People have said managed code isn't suitable for media processing; I plan to test that hypothesis, because I see no real evidence showing otherwise.

I'll update my progress here as I go.

Monday, October 29, 2012

State Machines Made Simple

I recently stumbled across a bug in some code where our embedded device would sometimes fail to take the correct behavior on disconnecting from our back-end servers.   This was actually the third fairly major bug in this particular code. When I dumped the code to a state machine using gliffy, it became clear we had a problem with state.  It looked something like:

Okay, okay....it wasn't that bad.  But it was very, very far from good.  Here's a (vastly) simplified overview of the flow of the code:

bool someState;
bool anotherState;
bool yetAnotherState;
while( someState )
{
    // some initial work...
}

while( !anotherState )
{
    while( ... )
    {
        // some initial work again...
    }
    // Wait here until something happens 

    // ruh roh, a disconnect, loop around...
    if( yetAnotherState)
    {
        // special blah here...
    }
}

...full version of the "driver" method was ~200 lines of code, not counting methods that loosely corresponded with the state transitions.  State was largely tracked with a motely assortment of boolean values.  The nested while loops were difficult to understand, as was the initial loop that seemingly duplicated the first inner loop.  Transitions between state were littered between the driver method and the various state methods, in part because there really was no clear home for them.

If you can clearly communicate your process with a state machine, it can be translated using a pretty typical design pattern that A) makes it very easy to understand, B) very easy to modify in the future and C) very reliable.  For instance, consider this basic (and yes, slightly silly) state machine--I was doing laundry this evening, so this was what immediately came to mind:


I obviously make no claims that this method of doing laundry is correct (I will claim my wife has drilled me into adding bleach to the water before putting the clothes in, for what it's worth).  But that's really not the point.  The point is this can translate quickly, cleanly and reliably to working code.  Start by defining all your states:

enum WashingState
{
    Start,
    AddDetergent,
    StartWash,
    AddBleach,
    FillWithClothes,
    WaitUntilComplete,
    OhNoesGoToStore
}

Then, in code, do the following:

while( not done )
{
    switch( current state )
    {
    case Start:
    case AddDetergent:
        current state = have detergent ? StartWash : OhNoesGoToStore;
        break;
    case StartWash:
        // start the wash
        current state = AddBleach;
        break;
    case AddBleach:
        current state = have bleach ? FillWithClothes : OhNoesGoToStore;
        break;
    case FillWithClothes:
        // Fill with clothes...
        current state = WaitUntilComplete;
        break;
    case WaitUntilComplete:
        // wait until done
        current state = Start;
        break;
    case OhNoesGoToStore:
        // return....nothing left to do.
        return;
    }
}

In a nutshell: your enum values correspond with states.  The switch statement handles transitioning between states, and the while loop causes the switch statement to repeat continuously, until told not to.

In some code it may be more appropriate not to have a long-running process, in which case having single methods that allow atomic transition of state is probably a better way to go--but even then, the pattern really remains the same--individual methods correspond with your states, and those methods allow atomic transitions from one state to the next.

But what I enjoy about the "while( not done ) { switch( state ) {} }" pattern--particularly when it's being applied to something like a long-running thread--is it's neat, well contained, easy to understand, and easy to extend and/or modify in the future.

A few passing comments before I close this out:
  • Ignore articles like this one that claim "there is no way to enforce state transition rules" in switch-based state machines.  This is clearly a claim desperately in search of a weird over-engineered solution; it is trivial to enforce state transition rules.
  • Avoid allowing outside forces to influence state, particularly if this is a dedicated thread.  If you must, the best option is to perform this transition immediately following the switch statement, before looping around the while, to "trump" whatever choice made by the case label.  Not pretty, but neither is life sometimes.
  • Keep It Simple Stupid. There are all sorts of fancy methods out there--hierarchical state machines, crazy inheritance, template wizardry (there is some relationship between Dr. Dobbs and a total inability to embrace simplicity), etc.  It is super awesome you can make C++ jump through hoops, but please, spend thirty seconds thinking about the last time somebody handed you a plate of completely incomprehensible code.
  • Consistency matters.  Not for you, probably, but for the next person to maintain your code.  Try to be consistent about where state transitions happen, locking (if necessary), and structure of the code.

Tuesday, February 21, 2012

Thunderstone, Or An Abbreviated Guide to Comp Climbing Treachery

Lately in my spare time I've been playing Thunderstone with my wife and an acquaintance.  It's a fantasy card game where players basically build up a deck of cards during the game in an attempt to vanquish monsters in a dungeon.  There's a lot more to it, of course, but the strategy and creativity of the game is really wonderful.

Anyway, our acquaintance is an avid gamer who's clearly spent an inordinate amount of time scheming about strategy, tactics and technique for winning at these games; he is a cunning adversary, and that's what makes it fun.  So it should have come to no surprise when he was dumbfounded at the complete lack of strategy some climbers demonstrated at a local bouldering competition.

The comp was fairly typical.  It has a qualifier phase where climbers could pick and choose problems to try from fifty to seventy five options of varying difficulty.  However, only the top three problems would be considered for a climber's final score during qualifiers.  Based on this final score, three climbers would proceed to finals.  The finals format was also fairly common: climbers were shown three problems, and given five minutes to attempt each.  Between attempts they were given five minutes of rest.

My friend inquired about these rules, and I explained them, as well as some of the tactics I generally employ (note these don't hold true for all comp formats).  In no specific order, during qualifiers:
  1. Warmup.  This gives me a chance to get ready for harder problems.
  2. The goal of qualifiers is to do the minimal amount of work to get to finals.  Burning yourself out in qualifiers is a sure-fire way to fail on finals problems, which are typically more difficult.
  3. Never get on a hard problem before someone else.  Best case scenario: your competitor is unable to finish the problem, in which case you've earned free beta ("beta" is info about how to do a climb) and you possibly don't even need to attempt it anyway.  Worst case scenario: they send and you still get free beta, hopefully enabling you to send it quicker and easier than them.
  4. Religiously track who's climbed what.  Even if your competition does do some problem more difficult than you, who gives a shit?  You need to get to finals, not thug it out with some rock-jock during qualifiers over a fake prize.
You'd think these rules would be obvious, but my friend seemed dumbfounded how the third rule could ever be feasible; he aptly noted that if everybody waited for someone else to get on a hard problem, then nobody would ever attempt anything difficult.  And this is true, assuming everyone there has considered strategy.

But the reality is they haven't.  Most comp climbers spend roughly five seconds considering the best approach to maximize success before throwing themselves at some stupid-hard problem.  For example, at this comp, I waited for my competition to solve a rather cryptic problem for me.  My two finals competitors did eventually climb the problem--but it took them a handful of costly attempts, whereas I was able to recycle their efforts unlocking this problem for a single-attempt ascent.  I believe this is a win not only from an efficiency standpoint (one try is way better than multiple tries), but for your competition it can be sobering from a psychological standpoint.

Even worse, both of them spent time and effort on problems that were irrelevant to getting into finals.  Had either of them taken a moment to survey the field, they would have realized they were virtually guaranteed a finals seat since the only other climber doing problems in their range was, well, me.  Instead, they both attempted a problem neither of them ultimately sent.  But had either of them actually succeeded at this problem, it wouldn't have changed the fact that they were still headed to finals, which is all that really matters during qualifiers.

In finals, the strategy is different.  You cannot leverage someone's beta in this format since you don't get to watch them climb.  But there's always strategy:
  1. With five minutes of time, realistically, you'll get three to four good attempts.  
  2. Your first attempt will also (likely) be your strongest.  Thirty seconds between attempts on a very hard climb simply isn't enough time to recover.  The first attempt is often worth the most since you get bonus points for climbing something on your first try (i.e. flashing).  Not in all comps, but most.
  3. This means you should have a damn good plan of how you're going to do stuff before you pull on the wall.  I generally take at least a minute surveying the problem before I pull on.
  4. Know when to cut your losses.  This can be hard to gauge, and also slightly heartbreaking to give up.  But if you honestly don't see yourself sending a problem, save energy for the next one.
Again, neither of my competitors seemed to realize any of this.  They spent seconds glancing at the problem before attempting, hurting their chances at first-attempt success.  They made too many attempts.  They did not save energy for future problems.

Combine their errors in the qualifiers with a lack of strategy in finals, and you end up with a commanding win by me.  Which, of course, I like.  But I have to wonder: did age and treachery simply beat youth and talent?  Is this really a legitimate part of "climbing," especially when many climbers would attribute this sort of strategic whoring (on plastic, no less) to pure foolishness?

It's not clear.  But I have to admit I really love strategy.  I love finding that chink in the armor, that little advantage that gives me some edge, my ace in the hole.  It's what I love about Thunderstone.  It's what I love about real stone.

Wednesday, December 21, 2011

What do we actually know about software development?

I recently saw this video on Vimeo; to say it blew my mind would be an understatement:


In it, Greg Wilson makes a compelling argument that we have a responsibility to seriously question the efficacy of the software engineering practices we employ. For example, do we know code reviews will really lead to better code, or are we just trusting our gut? Is pair programming effective?  Does refactoring really matter at all?   Is design paradigm "blah" effective, or is it really an untested, unproven method with no empirical basis whatsoever?  How much of the things we do are dogma, and how much have been scientifically proven to deliver results?

Painful admission: I'd never seriously considered any of this.  Many of these fads--such as unit testing or code reviews or agile or INSERT FAD HERE--are so beaten into developers that we never stop to question the basic foundation.  So as a personal exercise, I decided to seriously dig into one of them to see what I learned: test driven development. I picked this just as some arbitrary starting point, really, based on some reddit comments.

Test driven development is "...a software development process that relies on the repetition of a very short development cycle: first the developer writes a failing automated test case that defines a desired improvement or new function, then produces code to pass that test and finally refactors the new code to acceptable standards."  In other words, I start developing by writing tests that fail.  I produce code that passes these tests.  This idea seems like a good hypothesis, so I found two heavily cited scholarly articles that supported TDD and scrutinized their findings.

The first, Realizing quality improvements through test driven development: results and experiences of four industrial teams, is an empirical study cited by 40 authors according to google scholar.  The study tracked four different teams at two different companies and compared "defect density" with a "comparable team in the organization not using TDD."  Obviously "comparable team" is ambiguous, so in their "Threats to Validity" section, they mention:
The projects developed using TDD might have been easier to develop, as there can never be an accurate equal comparison between two projects except in a controlled case study. In our case studies, we alleviated this concern to some degree by the fact that these systems were compared within the same organization (with the same higher-level manager and sub-culture). Therefore, the complexity of the TDD and non-TDD projects are comparable.
First, I do not buy that "the complexity of the TDD and non-TDD projects are comparable" because "these systems were compared within the same organization (with the same higher-level manager and sub-culture)." I know this to be true within my own organization; I work on what I'd describe as a very complicated product (computer vision, lots of networking, embedded development, lots of compression, low-latency streaming, etc.), while I have coworkers in the same organization (with the same higher-level manager) who literally write software with a small subset of the requirements of our group; the difference in complexity is gigantic. By not providing specifics of the comparable projects, it's hard to take the findings seriously.  It's also clear--from total lines of code and team size comparison--that it is unlikely these are "comparable projects."  They also don't list the experience breakdown of team members for the comparable projects.

Second, none of these results are statistically significant, and the authors acknowledge that.  It's an empirical study--not an experiment--so at best they can only infer correlations between TDD and "defect density."

Third, they include figures on "Increase in time taken to code the feature due to TDD," and use management estimates as a source of data.  Seriously?  Given what we know about estimating things, is this really a valid method?

Lastly, how do they conclude that TDD was solely responsible for the improvement?  Did the teams have other differences, such as doing code reviews?  Were programmers working together, or were they geographically dispersed?  How did management affect the various projects?  What other factors could have influenced their results?  They allude to some of this obliquely in their "threats" section;  none of it stops them from recommending best practices for TDD in their discussion section.

The second paper, published in 2008--Assessing Test-Driven Development at IBM--has ~134 citations in google scholar. I found this to be a more interesting comparison: two software teams implementing the same specification, one team using a "Legacy" method for software development and the other (admittedly future) team using TDD.   It's still not an experiment since it wasn't controlled, but the final deliverable is more comparable: the same "point of sale" specification by IBM.

Their findings are summarized in two graphs; the top is legacy, and the bottom is the newer project using TDD:



At first glance, these look promising; we see that the TDD project had a much lower "defect rate per thousand lines of code (KLOC)" than the legacy project.  We also see the TDD project had better forecasting of their defect rates.

But on closer inspection, I have to seriously question these results.  In terms of absolute line count, the Legacy solution appears to be about ~11.5 KLOC (80 total defects / (7.0 defects/KLOC)) verses ~67 KLOC for the TDD version (247 total defects / ( 3.7 defects/KLOC)). In other words, from an absolute standpoint there were one-third as many defects on the legacy system and one-sixth the total line count. So I'm struggling to understand how the TDD team ended up with a massive pile of code, and what that cost them in terms of schedule/productivity/maintainability/performance, and how they justify "six times as much code and three times the defect count compared to legacy which purportedly does the same thing!" as being a positive result. I'm open to the possibility I'm misinterpreting these graphs, but if I'm not, the authors deserve a scientific keel-hauling.

There's no comparison of development time. No mention of how successful either product actually was in production. No comparison of productivity between the two teams, only a reference to the productivity of the TDD team.  No acknowledgement that other factors besides TDD might have been responsible for their findings, since again this is not a controlled experiment.  And none of this prevents them from presenting best-practices for TDD in their results section, as though two misleading graphs is a good proxy for a well-executed experiment and statistical significance.   Frankly, this source was enormously disappointing to me.  The discrepancies in KLOC and absolute defect rates are significant enough that I'm struggling to understand how this paper was A) published and B) cited by 134 other authors.

In my opinion, neither of these papers establish that TDD is an effective practice.  Of course, neither of them preclude it from being effective either, so the jury's still out.  And it's entirely possible I've made an error interpreting these findings; I graciously welcome your peer review, dear reader.  In any event, I think Greg Wilson's point stands strong: what are our standards of truth? Why do we do what we think works as opposed to what we've empirically shown to work?


Thursday, July 14, 2011

Who Really Wants a Digital Home?

Let me start by saying I'm not a technophobe. I work in video surveillance, I write code in multiple languages, I cross-compile C++ code, I browse various open source projects in my spare time, I like python's structured use of white space, etc. I'm a nerd, and an unabashed one at that.

But let me be the first to say: I do not want a "digital home." Yep, that's right. I don't want a digital home.  

Let me back up and put some context behind this. In maybe 2005ish, I was asked to head up to Seattle to interface our surveillance product with the products of a home automation/digital home company. Their office was in a drab commercial area of Seattle that presumably fed off the table scraps of larger tech companies in the area; it was one of the dirtiest offices I'd ever seen. Their key product was a little wall-mounted device with an LCD touch screen that would control lighting, HVAC, audio, etc., and they wanted me to make it work with our surveillance system.

I was given a tour by a stocky fellow with a thick eastern European (?) accent who oversaw their engineering. He proudly demonstrated the usage of the wall-mounted device. The whole demo went something like so:

HIM: "So, if you want to turn on the lights, you just do this..."

(turns towards the device, stares directly at it, flicks through a menu or two, hits a button, and a bunch of lights come on)

ME: "Huh....neat."

HIM: "If you want music, you can do this..."

(turns back towards the device, stares directly at it, punching keys, and after some completely boggling transitions through menus we hear music)

Me: "hrrrm..."

HIM: "If you want to control HVAC..."

(turns back towards the device, stares at device, browses unintelligible menus...)

...you get the idea.

While I was watching this, it occurred to me how completely wrong this product was on so many basic levels.  I imagined myself installing this system in my home, trying to explain it to my wife, having her be frustrated learning how to use it, having her be irritated that turning on the lights was suddenly a pain in the ass, me doing tech support when it's busted so my wife can turn on lights, her making me put the wall switch back, and so on.  It's a cool idea...if you're a nerd.

For the rest of the non-nerd world?  Here's how they want to turn on the lights:


...the simplicity is stunning.  The reliability?  Undeniable.  The lack of training required?  Breathtaking.  The familiarity of the interface, the ability to "feel" around a corner and flip on a light (instead of having to stare at a tiny screen on the wall), the speed at which you can turn on the lights, the lovely tactile response of a solid mechanical switch?  All very refined.

I think much of this nuance is often lost on my fellow techies--after all, we don't fear new technology.  To the contrary, we often bask in it, marveling at the possibilities it may bring.  But the truth of the matter is these companies are attempting to replace technology that has been used and refined over the last 100 years.  When my wife wants to turn on a light, she doesn't want to deal with weird technology or navigate menus or stare at a tiny screen on the wall--she wants to turn on the lights.  When a guest is using my restroom, I don't want to explain how to turn on the light by the toilet.  Lord knows I don't want to do IT work at my own home--as it is, I hate servicing people's computers.  Replacing a light switch with a touch screen sounds like a great idea, but in reality it's as misguided and ridiculous as replacing the steering wheel of a car with an iPhone app.

All of this comes full circle to the idea of a "digital home."  What does that even mean, a digital home?  Does it mean replacing technology willy-nilly with stuff we think is cool, or actually attempting to improve the ergonomics, efficiency and useability of our homes?

Companies hawking these products invariably see their role as purveyors of a better solution to "antiquated" technology that's been in use for the last few decades.  But until their products complement--not replace--the ergonomics and efficiency of our home in a reliable manner, they'll ultimately fail to have significant market traction.  This is why the term "digital home" itself is hopelessly misguided; the term itself really implies something nobody wants.  We're pretty happy with our fuddy-duddy analog homes, thank you very much.

I believe companies that truly leave their mark in this market won't replace conventional technology and interfaces like the light switch.  They'll be known for complementing the technology with smart solutions that solve problems people actually have.  For example,
  • Can we find ways of remotely controlling lighting?
  • Can we find ways to turn off lights when they're not necessary, thus saving energy?
  • Can we find ways to reduce the amount of copper wiring a home needs, thus driving down construction costs?  Or make it easier to arbitrarily add outlets and lighting to a home or business without tearing down walls or hiring electricians?
  • Can we find ways to delight people, such as automatically turning on porch or back lights when they come home, or automatically turning lights on and off as we go room to room?  Or letting me turn out lights I forgot while I was on vacation?  Or turn out lights from my iPhone because I'm in bed and I want want to pass through the house one last time?
The key difference is: the goal is to compliment our homes--not replace them with silly digital interfaces.

Tuesday, May 10, 2011

Unroll that loop, homeboy

A month or so ago I was wrestling with a pretty thorny algorithmic problem at work.  Eventually I worked out a solution that worked great on our Windows emulator, so I figured my job was done...was I ever wrong.

Once I got it running on our embedded processor (a lowly ARM926EJ-S at ~270 MHz), I quickly found my algorithm had some serious performance issues.  So, I fired up oProfile and did some profiling so I could see where I was eating cycles.  There were a couple hotspots that I went after, but one was a loop that looked like so:

UINT64 currentRow;

...
for(int row = 0; row < NUMBER_OF_ROWS; row++)
{
  currentRow = rows[row];
  for(int column = 0; column < NUMBER_OF_COLUMNS; column++)
  {
    if( (currentRow >> column ) & 1 )  // Check if this bit is set
      DoSomeWorkOnThisPointBaby(col,row);
  }
}

Nested loops...yuck.  This code ran for every column and row, shifting the current row over to check and see if the bit was set. To make matters worse, the row was represented with a 64 bit value, which costs extra to shift.  Even worse, this wasn't work I could figure out a way of avoiding entirely--enumerating over all the values was really a core part of the algorithm.

So, I decided to see if I could at least speed up the routine by optimizing. First, I unrolled the loop.  As wiki states, the goal of loop unrolling is to increase a program's speed by reducing (or eliminating) instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration.

So instead of the inner loop iterating through all of the columns, it gets replaced with a gigantic block of if-statements:

UINT64 currentRow;

...

if( currentRow & 0x1 ) //first column
  DoSomeWorkOnThisPointBaby(col,row);
if( currentRow & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(col,row);

...

if( currentRow & 0x8000000000000000 ) // 64th column!
  DoSomeWorkOnThisPointBaby(col,row);

This removes all of the overhead of the for loop--no more incrementing the counter, and no more testing to see if we're at the end of the loop.  It also avoids an expensive shift operation to test each column value.

But we can do even better!  One issue is currentRow is a 64 bit value, but we're on a 32 bit processor.  Each 64 bit AND takes multiple instructions, whereas a normal 32 bit AND is a single instruction on most platforms.

So, to speed it up even more, you can break the 64 bit value into its respective halves and test them individually--so instead of 64 64-bit AND operations, it now performs 2 shifts and 64 32-bit AND operations. Another extra bonus is you only need to code 32 constants into your code now:

UINT64 currentRow;

...
UINT32 val = (UINT32) (currentRow & 0x00000000FFFFFFFF);
if( val & 0x1 ) //first column
  DoSomeWorkOnThisPointBaby(col,row);
if( val & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(col,row);

...

val = currentRow >> 32;
if( val & 0x1 ) //first column
  DoSomeWorkOnThisPointBaby(col,row);
if( val & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(col,row);

...

I'm sure I could have written the code in assembler for further improvement, but my code would no longer be portable.  Furthermore, in C these improvements will apply to almost every target platform.  Admittedly my solution will be slightly slower than it could be on a 64-bit platform where a 64-bit shift operation would be single-instruction, but not enough to matter.

The real question, though, is what sort of performance gains did I reap? After measuring again in oProfile, I was spending one-fourth the time in the function that I previously did, and the work definitely helped me get the algorithm to fit on a lowly 270 mhz processor.

Friday, October 15, 2010

A closer look at JPEG performance

In my previous post, I compared JPEG to the newcomers, WebP and JPEG XR. For the most part, JPEG held its own pretty well. WebP was better in a few respects, worse in a few...overall decent. JPEG XR was utterly horrible.

But this got me thinking: how optimized was my JPEG encoder? I first used MSFT's .NET JPEG implementation, but I had to wonder just how good that implementation actually was. For example,
  1. What quantization tables does it use?
  2. Are its quantization tables computed on a per-image basis, or is it using the IJG reference tables?
  3. How are the tables being scaled, and are they being capped at 8 bits (i.e. no value larger than 256 in your quantization table) or allowed to go to 16?
  4. For the discrete cosine transform, is it using integer or floating point math?
  5. Is it using an optimized entropy encode?
Maybe our problem isn't JPEG. Maybe our problem is bad JPEG encoders. For example, libjpeg certainly isn't terrible, but there are some flaws that are hard to overlook, such as hard-coded quantization tables. The linear scaling of the quantization matrix is equally deficient, and this was acknowledged by the maintainers of the project ("The current method for scaling the quantization tables is known not to be very good at low Q values," taken from the libjpeg README in February of 1996! How's that for old?). The library itself still defaults to non-optimal settings (probably a perfectly legitimate design decision when you weren't sure if you had an FPU...these days, it's a much safer bet that you do). In summary, there's some serious room for improvement.

So do we really need WebP? Or do we really just need to give a venerable old method some much needed love?

It is clear to me that there are various methods that can improve current JPEG performance. Using libjpeg's optimize routine, followed by a run through jpegcrush, WebP is no longer ahead. Jpegcrush itself "...abuses combinations of multiple Huffman tables and scan combinations to improve compression more than merely optimizing the Huffman tables naively," according to Dark Shikari. There are several papers about generating image specific quantization tables that minimize error for a given bitrate that look promising. So in my spare time, I'll probably investigate some of this stuff further.

Sunday, October 10, 2010

Comparison: WebP, JPEG and JPEG XR

Google recently released a new image format called WebP, which is based on the intra-frame compression from the WebM (VP8) video codec. I'd seen a few comparisons of WebP floating about, but I wasn't really happy with any of them. Google has their own analysis which I critiqued here (executive summary: bad comparison metric--PSNR--and questionable methods).

Dark Shikari's comparison was limited by testing a single quality setting and a single image. There's also this comparison and this comparison, but neither of them are that rigorous. The first is, again, a single image at one arbitrary compression level, although it does clearly show WebP producing superior results. The second is at least multiple images, but the methods are a little fuzzy and there's no variation in compression level. It's also a bloodbath (WebP wins), so I'm a bit wary of its conclusions.

So, I coded up a quick and dirty app to compress an image to WebP, convert it back to PNG and then compute the MS-SSIM metric between the original image and the final PNG. I used Chris Lomont's MS-SSIM implementation. I also did the same process for JPEG, using the built-in JPEG compression in .NET. Then, for the hell of it, I decided it'd also be fun to add JPEG XR into the mix, since it's probably the only other format that has a snow-cone's chance in hell of become viable at some future date (possibly when IE9 hits the market). I eventually got JPEG XR working, although it was a pain in the butt due to a horrible API (see below).

The input image is encoded for every possible quality setting, so the final output is a group of SSIM scores and file sizes. I also computed db for the SSIM scores the same way x264 computes it, since SSIM output is not linear (i.e. 0.99 is about twice as good as 0.98, not 1% better):

SSIM_DB = -10.0 * log10(1.0 - SSIM)

For starters, I decided to test on the image Dark Shikari used to test with, since it's a clean source and I wanted to see if I could replicate his findings. Here's the results of the parkrun test:


Quick explanation: higher SSIM (dB) is better, smaller file size is better. So in the perfect world, you'd have codecs outputting in the upper left-corner of the graph. But when images are heavily compressed the quality drops, so you end up with graphs like the above: high compression equals bad quality, low compression equals large file size.

In parkrun, WebP seems to perform decently at very low bitrates, but around ~15 dB JPEG starts to outperform it. Even at low bitrates, it never has a substantial advantage over JPEG. This does seem to confirm Dark Shikari's findings: WebP really doesn't perform well here. When I examined the actual output, it also followed this trend quite clearly. JPEG XR performs so poorly that it is almost not worth mention.

Next, I did a test on an existing JPEG image because I wanted to see the effect of recompressing JPEG artifacting. In Google's WebP analysis, they do this very thing, and I had qualms about JPEG's block-splitting algorithm being fed into itself repeatedly. Here's the image I used:


Results:


Again, absolutely dismal JPEG XR performance. Notice the weirdness in the JPEG line--I believe this is due to recompressing JPEG. I got similarly odd results when I used other JPEG source material, which does support my hypothesis that recycling JPEG compressed material in a comparison will skew the results due to JPEG's block splitting algorithm. I'll try to post a few more with JPEG source material if I can.

As far as performance goes, WebP does a little better here--it's fairly competitive to about ~23 dB, at which point JPEG overtakes it. At lower bitrates it is resoundingly better than JPEG. In fact, it musters some halfway decent looking files around ~15 dB/34 KB (example), while JPEG at that same file size looks horrible (example). However, to really do a good job matching the original, including the grainy artifacts, JPEG eventually outperforms WebP. So for this test, I think the "winner" is really decided by what your needs are. If you want small and legible, WebP is preferable. If you're looking for small(ish) and higher fidelity, JPEG is a much better choice.

Next test is an anime image:


Results:


WebP completely dominates this particular comparison. If you've ever wondered what "13 dB better" looks like, compare JPEG output to WebP output at ~17 KB file sizes. JPEG doesn't reach a comparable quality to the ~17 KB WebP file until it's around ~34 KB in size. By the time JPEG surpasses WebP around 26 dB, the extra bits are largely irrelevant to overall quality. JPEG XR unsurprisingly fails.

Since a lot of web graphics aren't still-captures from movies, I decided a fun test would be re-encoding the Google logo, which is a 25 KB PNG consisting mostly of white space:

This is a better graphic for PNG compression, but I wanted to see what I could get with lossy compression. Results:

WebP clearly dominates, which confirms another hypothesis: WebP is much better at dealing with gradients and low-detail portions than JPEG. JPEG XR is, again, comically bad.

Incidentally, Google could net some decent savings by using WebP to compress their logo (even at 30 dB, which is damn good looking, it's less than 10 KB). And I think this may be a place WebP could really shine: boring, tiny web graphics where photographic fidelity is less of a concern than minimizing file size. Typically PNG does a good job with these sorts of graphics, but it's pretty clear there are gains to be had by using WebP.

My conclusions after all this:
  1. JPEG XR performs so poorly that my only possible explanation is I must be using the API incorrectly. Let's hope this is the case; otherwise, it's an embarrassingly bad result for Microsoft.
  2. WebP is consistently better looking at low bitrates than JPEG, even when I visually inspect the results.
  3. WebP does particularly well with smooth gradients and low-detail areas, whereas JPEG tends to visually suffer with harsh banding and blocking artifacts.
  4. WebP tends to underperform on very detailed images that lack gradients and low-detail areas, like the parkrun sample.
  5. JPEG tends to surpass the quality of WebP at medium to high bitrates; if/where this threshold occurs largely depends on the image itself.
  6. WebP, in general, is okay, but I don't feel like the improvements are enough. I'd expect a next-gen format to outperform JPEG across the board--not just at very low compression levels or in specific images.
A few more asides:
  • I'd like to test with a broader spectrum of images. I also need better source material that has not been tainted by any compression.
  • No windows binary for WebP....seriously? Does Google really expect widespread adoption while snubbing 90% of non-nerd world? I worked around this by using a codeplex project that ported the WebP command line tools to Windows. I like Linux, but seriously--this needs to be fixed if they want market penetration.
  • If you ask the WebP encoder to go too low, it tends to crash. Well...maybe. The aforementioned codeplex project blows up, but I'm assuming it's due to some error in the underlying WebP encoder. That said, by the time it blows up, it's generating images so ugly that it's no longer relevant.
  • JPEG/JPEG-XR go so low that your images look like bad expressionist art. It's a little silly.
  • I absolutely loathe the new classes in .NET required to work with JPEG XR. It is a complete disaster, to put it lightly, especially if all your code is using System.Drawing.Bitmap, and suddenly it all has to get converted to System.Windows.Media.Imaging.BitmapFrame. Why, MSFT, why? If you want this format to succeed, stop screwing around and make it work with System.Drawing.Bitmap like all the other classes already do.
  • WmpBitmapEncoder.QualityLevel appears to do absolutely nothing, so I ended up using ImageQualityLevel, which is a floating point type. Annoying, not sure why QualityLevel doesn't work.
  • The JPEG XR API leaks memory like nuts when used heavily from .NET; not even GC.Collect() statements reclaim the memory. Not sure why, didn't dwell on it too much given that JPEG XR seems to not be worth anyone's time.
  • People who think H.264 intra-frame encoding is a viable image format forget it's licensed by MPEG-LA. So while it would be unsurprising if x264 did a better job with still image compression than WebP and JPEG, this is ignoring the serious legal wrangling that still needs to occur. This is not to say WebM/WebP are without legal considerations (MPEG-LA could certainly attempt to put together a patent pool for either), but at least it's within the realm of reason.
Let me know if you're interested in my code and I'll post it. It's crude (very "git'er done"), but I'm more than happy to share.

EDIT: Just to make sure .NET's JPEG encoder didn't blow, I also tried a few sanity checks using libjpeg instead. I also manually tested in GIMP on Ubuntu. The results are almost identical. If anyone knows of a "good" JPEG encoder that can outdo the anime samples I posted, please let me know of the encoder and post a screenshot (of course, make sure your screenshot is ~17 KB in size).

Friday, October 01, 2010

Not again...

Google decided to use the intra-frame coding algorithm of WebM to create a new image format for the web called WebP. And, of course, they make bold claims like "...an average 39% reduction in file size." There's nothing wrong with bold claims, but after reading the details of how they arrived at that conclusion, there are a few problems.

First, Google should stop using PSNR as a metric for image quality. PSNR does not correlate well with mean opinion scores of image quality. Furthermore, this error is particularly egregious because the WebM appears to be PSNR optimized and it isn't clear that the JPEG/JPEG2000 encoders were. If the JPEG encoder was optimized for SSIM then it may very well score lower on a PSNR test despite being visually superior. This is the no child left behind approach to encoding: Google is engineering encoders to pass the test instead of encoders that actually produce visually stunning results. A much better choice for an objective metric is MS-SSIM. It's not perfect, but statistically, it is more meaningful than PSNR.

The most meaningful objective, of course, is actually collecting mean opinion scores, but that involves actual humans, tallying lots of data and generally "making science." In the absence of this, though, why not use the most meaningful objective image quality metric available?

Second, if Google is going to try and woo us with pretty graphs, they should get their axis labels right (see Figure 1).

Third, Google basically took existing JPEG images (90% of all samples), decoded them, and re-encoded with different encoders to determine which encoder worked best? This method is totally bogus. Once an image is encoded to JPEG, it's going to have block-based artifacting on a grid with squares 8 by 8 due to JPEG's block splitting. For example, this grid is clearly visible in a lot of JPEG images:


You can clearly see the 8 by 8 blocking, particularly around the woman's chest (stop staring, perv!).

Here's another example from the wikipedia article on JPEG, blown up to better illustrate the blocking artifacts:


Again, you can clearly see the 8x8 grid. This could seriously skew results for the second JPEG compression, particularly if the encode settings are not identical to the first compression (see question 10 here). The data being used in the comparison has already been skewed by a first JPEG compression; thus, the JPEG results (i.e. the most important comparison) are potentially flawed, and it's not clear to me if this would be an advantage or disadvantage for JPEG in the comparison.

Luckily, this can be fixed--Google should do the comparison on source material with loss-less formats like PNG or BMP, which hopefully will not contain existing JPEG artifacting.

Friday, May 28, 2010

HD Video Standard, Cont.

Last year, there was brief discussion about a standard for high definition video; you can see my previous comments here.

One of the big questions about a high definition standard is how to make sure the quality of video encoding is acceptable. Sure, video can be 1080P, but if it's over-compressed garbage produced with a sub-par encoder then it's really difficult to claim that it is "high definition." For example, here are two screenshots from blu-ray disks. I wish I were kidding about this, but it really underscores the importance of having an objective metric to ensure that some reasonable encoder quality is met.

Unfortunately, people can even screw this up. Consider today's post from the WebM blog about VP8 quality. First, it uses peak signal to noise ratio (PSNR), which is certainly a venerable metric, but almost everybody and their mother agrees that it correlates poorly with human perception of image quality. One need not look further than this paper, which outlines why this technique (particularly in the field of image processing) should be relegated to the dust-bin. Even worse is programmers can "cheat" when optimizing for PSNR, and the result is generally a blurry mess--they chose settings that result in higher PSNR, but this doesn't infer higher quality. (With regard to VP8, "rumor has it that there was some philosophical disagreement about how to optimize [encoder] tuning, and the tune for PSNR camp (at On2) mostly won out. Apparently around the time of VP6, On2 went the full-retard route and optimized purely for PSNR, completely ignoring visual considerations. This explains quite well why VP7 looked so blurry and ugly.")

A much more basic problem is how the results are presented: line graphs. Here's an example:



The vertical axis is PSNR score--higher is better. The bottom axis is bitrate. So it's pretty clear what this graph shows: it compares the quality between the VP8 encoder with and without a given feature at equivalent bitrates. However, there's also no mention of how each data point was calculated (there's even less mention of H264, but I'll leave that for another post). Video is a sequence of images, so PSNR and other image processing metrics must be run on each frame. Unlike a single image, this results in a problem: how do we take all of those scores and give an overall indication of the quality of the video encode? How do we know that the upper clip doesn't have a lot more volatility or substantial lapses in quality somewhere?

At first I assumed the WebM people were simply using average PSNR, which would be computed something like:

average = {SUM log(SNR)} / {# of frames}

I was wrong; they're using "global PSNR," which is computed as:

global = log ({SUM SNR} / {# of frames} )

But this makes no difference--global is still a mean. The reason to use global is because an encoder can recreate a pure black frame perfectly, which gives a PSNR score for that frame of infinity, which would skew average PSNR. Global PSNR avoids that problem by literally looking at the whole video as one big frame (this is wrong in so many ways; see aforementioned link on the deficiencies of PSNR, but I digress).

Problem is, means can be heavily affected by outliers, so there may be a few bad frames dragging down the score or a few great frames dragging up the score--or, there may even be a few horrible frames and a lot of great frames, in which case it all averages out, but it's also clear your video has a serious consistency problem. It's very clear that this "single value" representing video quality is suspect at best, and junk engineering at worst.

This "single value" really deprives us of understanding the overall quality of the video stream. So, without further ado, let me present some alternatives. In my previous post, I mapped SSIM scores on a per-frame basis:



Each line consists of a given bitrate. Vertical axis is SSIM scores, while the horizontal axis is frame number. You can see how drastically SSIM scores within a single clip can vary. This method also works really well with multiple encoders at a comparable bitrate, because you can clearly see where one encoder outperformed another.

Another more simple and compact way of visualizing video quality data is a box and whisker plot. Here's box plots of SSIM scores from a clip with 165 frames; each box plot corresponds with different encoder settings (all of them are the same bitrate);



This box plot conveys a lot of information about the video quality. The median value (middle of the box plot) is a general indication of the overall quality of the clip. The upper and lower quartiles (the area directly above and below the median value) encapsulate 50% of all samples, while the whiskers generally extend out some reasonable amount (this varies, see wiki article for details). Outliers are marked with circles.

What does this all mean? Consider the above box plot--the median values really don't vary that much. But what does vary is (a) the size of the quartiles and whiskers and (b) the number of outliers. Outliers, in this instance, correspond with frames that look like crap. Tighter box plots indicate more consistent quality, and higher box plots indicate higher overall quality. Using this technique, one can easily compare encoder quality and get an idea of how they really stack up against one another. From this box plot, it's pretty clear "highProfile" is the obvious winner. If you were just looking at averages or medians, it is hardly so clear. Certainly you have zero indication of any outliers, where the quality of most frames falls, etc.

Means are basically retarded. They deprive you of understanding your data well.

In any event, what really irks me about this WebM debate isn't that there's a legitimate competitor to H264--competition is always a good thing, and I welcome another codec into the fold. What bothers me is that we would let bogus metrics and faulty data presentation influence the debate. At this point, I see no indication that VP8 is even remotely on par with a good implementation of H264--perhaps it can be sold on other merits, but quality simply isn't one of them. This could change as the VP8 implementation improves, but even the standard itself is roughly equivalent to baseline profile of H264.

Putting that whole debate aside and returning to the notion of a high definition video standard, using these methods--and in particular, a box plot--one can establish that a video at least meets some baseline requirement with regard to encoder quality. The blu-ray shots above are a pretty clear indication that this industry needs such a standard. More importantly, consumers deserve it. Regardless of the codec, it's not right for people to shell out cash for some poorly-encoded, overly-quantized, PSNR-optimized mess.

Sunday, May 09, 2010

Object Tracking for Scientists

My significant other works in zoology. She needed a program to analyze about a thousand hours of video containing salamanders. Since image and video processing is my primary focus, I offered to take a stab at analyzing her video.

The goal was to determine how much distance the object (in this case, a cute little salamander known as Plethodon Shermani) traveled over the course of the video. There were definitely a few takeaway lessons from this exercise:
  1. It turns out object tracking is more difficult that I would have originally imagined, and
  2. You can make your life a lot easier during the analysis phase by making sure your video is as clean as possible. Our video was relatively clean despite some confounding factors inherent in working with salamanders, but in retrospect there were a few things that would have made analysis much easier.
I'd like to discuss both points in detail. I'm going to start with the process I used to track the object first, since that will explain many of the crucial aspects of #2.

Object Tracking for Dummies
Rather than write a bunch of code from scratch, I started with AForge.NET, which is a nice open source library with a number of image processing primitives. Specifically, I massage my input data until I can use some of the blob detection features of the library. Here is the general method I chose to perform object tracking:
  1. Convert the image to gray scale--all color information is discarded. This is because (a) color information is not necessary and (b) the subsequent processing happens much quicker on an 8 bits/pixel image--color is 24 bits/pixel.
  2. Because the salamander is darker than the background, I invert the image. Light colors become dark, and dark colors become light.
  3. Run the image through some noise reduction filter, such as a median or mean filter. Median works better, but is rather expensive.
  4. Here, I then apply a threshold filter. This filter converts all pixels to either black or white, depending on whether they're below or above a specified threshold. Since the salamander is lighter than the general background, it falls out of the woodwork here.
  5. After that, I apply an opening filter to remove any small objects from the image.
  6. At this point, hopefully all that is left is the salamander. Here, I use the blob counter class.
At this point, there are a few tricks to make sure the animal is tracked correctly. First, I track the position of the salamander with a list of points. I can use this list to (a) draw out the path of the salamander and (b) compute total distance traveled. Total distance traveled is simply the cumulative sum of the Euclidean distance between successive entries in the list. The final unit is measured in pixels. Doing some metric unit is much more difficult unless you have a known mapping between pixels and actual width.

Should there be more than one blob, I do some basic filtering on blob size--I find the salamander typically has a fairly consistent area. Furthermore, I utilized principles of temporal and spatial locality--in the event of multiples blobs, I choose the blob closest to the previous known location of the salamander. Given that salamanders don't move very quick, this is a surprisingly safe bet.

Lastly, I do some mild filtering on the list of points--I only record changes in distance greater than a few pixels to weed out any residual noise in the image.

There are some definite drawbacks to this method. Most importantly, it relies on the salamander being the darkest object in the image, which I've found to be (mostly) true. It also requires knowing the salamander's initial location, which can be tricky. Lastly, the "threshold" setting tends to vary due to variance in lighting conditions at the time of recording, so I attached that to a slider bar that's configurable in real time. There's also a big discrepancy between the sizes of salamanders, so the target min/max area is configurable.

Here's a picture of the final result:


The left screen is the original video. The middle screen is what the video looks like after the whole invert->medium->threshold->open->detect blob method. I flag the resulting object with a nice red bounding box, and the size of the object. The right screen draws out the path the salamander took in the box. Total distance traveled is listed in the bottom status bar.

Worth note is that white bar on the left side of the middle image--that's the result of using a dark cardboard divider in the video. In retrospect, this divider should have been as white as possible. This is a nice lead into our next topic...

Getting Clean Data
One thing that's important is making your video data as easy to analyze as possible. Here are a few considerations:
  1. Low light conditions will result in salt and pepper noise in the video stream. This can be removed (somewhat) using a median filter, but the filter is computationally expensive. There are other cheaper filters, but they come with various performance tradeoffs. So, high-light conditions are vastly preferable, if possible.
  2. Seek to have the highest contrast possible between the object(s) you need to track and the general background. This means all background in the video should be the opposite color of the thing you're tracking. For example, if you need to put a piece of cardboard somewhere, and you're tracking a black mouse, make your cardboard white.
  3. Minimize glare as much as possible. Museum quality glass with low glare treatment (like the type used in picture frames) will help substantially; it is vastly preferable over the crappy plastic lids many trials use. Also consider a polarizing lens to further reduce any glare.
  4. Make your lighting as consistent as possible. Having half the field of view be bright and half be dark--or the existence of shadows due to point sources of light--is difficult to manage.
  5. If possible, mark objects with visually distinct queues. For example, if you can mark an object with a red circle, that's going to make your life a lot easier in the analysis stage.
  6. Maintain positions throughout the experiment as much as possible. If you can, bolt the camera in place, and make sure the tray or dish (or whatever) is situated in the exact same location.
  7. Make sure all cameras are the same distance from the object for the entire duration of the experiment. Otherwise, it can be very difficult to track distance traveled. Also, wide angle lenses will make it difficult to track distance traveled since the image is so distorted around the edges.
  8. Higher resolution is "nice to have," but not nearly as important as consistent lighting, low glare, high contrast between object/background and consistent placement.
  9. I would take consistent lighting performance from a camera--and excellent low-light capabilities--over higher resolution or frame rate.
Your mantra: high contrast, high contrast, high contrast. If your object is black, the ideal video should look like a blob of black amid a snowy field. Obviously this may be hard; you might need visual markers in your video, you may have edges or barriers setup that are dark(er), etc. Try to minimize this as much as possible.

Salamanders were tricky for another reason: they like dark places. We used low-light conditions, which results in particularly difficult lighting conditions. To compensate for this, high quality glass is a big help. Also, the median filter helps, as does having a camera with good low-light performance.

Saturday, March 13, 2010

Memory Corruption Makes Me Sad

Nothing makes programmers cower in fear more than memory corruption. These bugs are almost always A) fatal and B) ridiculously difficult to track down and C) hard to reproduce consistently. The combination of these three things can make you start thinking of your memory in surprisingly literal ways:

(and literal in more ways than one, since your memory is "trashed" hahahaha, oh that was bad...)


This particular blog posting is about my own struggle with a bug that I knew about for no less than four months, and for the last two weeks I worked on it 24/7. I fixed it, but it was very difficult to isolate.

Memory corruption happens when "the contents of a memory location are unintentionally modified due to programming errors. When the corrupted memory contents are used later in the computer program, it leads either to program crash or to strange and bizarre program behavior." What makes memory corruption particularly insidious is your program doesn't crash or behave strangely at the time of modification--it happens later, when something attempts to use the memory that was sullied.

Causes of memory corruption include use of uninitialized memory, use of unowned memory, buffer overflows, faulty heap memory management and multithreading problems. In all cases, the defining characteristic is where the program crashes isn't necessarily related to what went wrong.

Here's the best strategy I've found for dealing with memory corruption:
  1. If you're really, really lucky, you might be able to catch the problem with something like DevPartner or Valgrind, but typically these only show the problem as it's blowing up. In the case of my bug, DevPartner ended up being beneficial only because it made my error more repeatable. Also, have a look at Application Verifier.
  2. Get a static code analysis tool. If you're using VS.NET 2008, try using their Code Analysis tool. It is surprisingly effective at locating buffer overruns and errant pointer access. These tools may or may not help you, but they certainly can't hurt. If you are lucky, this may be all you need to find your problem. If it doesn't work, then you get to go on to the brute force methods (lucky you!)
  3. Go through all classes and make sure every member variable is initialized correctly. Value types should be set to sensible values, pointer types should be initialized to NULL.
  4. Search for all new/delete/malloc/free statements.  Ensure that all pointer values that are allocated begin their life as NULL. Ensure that the memory being released is immediately NULL'd. Ensure you have no new/free or malloc/delete mismatches. Ensure you do not free/delete memory twice. Ensure that any memory allocated by a third party library is also destroyed by that library (i.e. do not call "CrazyLibraryMemAlloc" and use free/delete to clean up unless you are positive that is the correct thing to do). Make sure your destructors and cleanup methods release all memory and NULL everything. Make sure you're using delete[] if the type was allocated with new []. In essence, everything should begin its life as NULL and end its life as NULL. This is probably the single best thing you can do to isolate memory tom-foolery.
  5. Review every memset, memcpy and mem-whatever in the program (ditto for Win32 variants like CopyMemory). If you are using raw buffer pointers (e.g. void*, int*, etc.), consider wrapping them in something like QByteArray. Review any and all string handling code (in particular, strcpy and the likes). If you have any raw pointer string types, consider replacing them with a decent string class like CString or Qt's QString.
  6. Are you using threads? Does the crash happen in a shared object? If so, this strongly implies your locking strategy (or lack thereof--even if you have locks, be absolutely certain they are working as you expect).
  7. Determine if you are experiencing corruption in the same location, or if it's more random. If it's random corruption, then it is more likely to be a buffer overflow. If it's localized corruption (i.e. let's say the crash always happens in a shared queue, or in the same place in code), then it is more likely that code touching the shared item is invalid. If it crashes in the exact same place always, then you are in luck--you should be able to watch that location in a debugger and break on any read/write. Whether or not you have crashes in the same place is a huge, huge clue about your problem. Track this information religiously.
  8. One method for determining if you have local/random corruption is to declare "no man's land" buffers directly above and below the item being corrupted. Like, nice, big 10k buffers that are initialized to "0xDEADBEEFDEADBEEFDEADBEEF..." When your program crashes, inspect those buffers. If those buffers contain invalid data, then it is not localized corruption. If they aren't corrupted, but the data structure they wrap is, then that strongly implies something that touches the sandwiched object is where the problem may lie.
  9. It is not likely to be the C-runtime, third party code, obscure linking issues, etc. Think about it: you're not the only person using these libraries and tools. They are generally more thoroughly vetted because of Linus' law. Is it possible? Yeah, sure, anything's possible. But is it likely? Not really.
  10. Unless evidence strongly implies otherwise, assume the issue is in your code. This is good, because it means it's something you can potentially fix. Otherwise, you may have to start a support case with whomever owns the code. If it's an open source project, you might get a quick response (or possibly no response at all...). If it's somebody like MSFT, it is going to take weeks at a minimum. Only as a last resort should you assume it's somewhere else, and be certain you have Real Information™ to backup your theory.
It may take a couple of people a day or five to go through the program and make all these changes depending on how big the application is, but it's generally the only real way to isolate problems. And it also gets you in the habit of being religiously fanatical about default values, pointer checking and correct deletion of objects, which is good.

For me, the issue ended up being extremely subtle (it evaded two separate code reviews by my peers), and after finding it, painfully obvious and somewhat embarrassing. I was having localized corruption around a shared queue that two threads accessed. The culprit was invalid locking code I'd written. Once the queue became corrupted, it would fail somewhere in the bowels of whatever queue object I'd been using (I tried CAtlList, QList, etc, but it didn't matter because none of them are thread-safe).

Which brings me back to item #10 in the above list: it's always your fault. It was my fault. It can be very tempting to assume otherwise, but generally I don't find this to be the case. So keep an open mind, think analytically, write down what you know and what you don't know, and you'll be done with the bug sooner than you know it!