Friday, October 26, 2007

Hardware Assisted Brute Force Attacks

Jeff Atwood recently blogged about a hardware assisted brute force password cracking product developed by Elcomsoft.

Using Elcomsoft's figures from a pdf they released, Atwood concludes that we can attempt this many passwords a second:

Dual Core CPU 10,000,000
GPU 200,000,000

...and using Elcomsoft's relatively weak 8-character alpha-based password (a-A, z-Z), there are this many passwords:

52^8 = 53,459,728,531,456

Some more basic math gives us:

53,459,728,531,456 / 10,000,000 pps / 60 / 60 / 24 = 61.9 days
53,459,728,531,456 / 200,000,000 pps / 60 / 60 / 24 = 3.1 days

To his credit, Atwood correctly assaults Elcomsoft's most questionable assumptions: allowing people only 8 characters, and using an extremely limited character set. Adding additional characters, it becomes clear that the exponent plays a rather sizable role. Here's the amount of time for twelve characters:

52^12 / 200,000,000 pps / 60 / 60 / 24 = 22,620,197 days

We can all agree that ~22 million days (about ~61,000 years) is an awful long time. But this is where Atwood makes an error:

For those of you keeping score at home, with a 12 character password this hardware assisted brute-force attack would take 61,973 years. Even if we increased the brute force attack rate by a factor of a thousand, it would still take 62 years.

The error here is pretty basic: over the next 62 years, what sort of new computing techniques will we develop? How much faster will GPUs run five years from now? Ten years from now? What did computing resemble 62 years ago?

I'll give you a hint:

This is ENIAC. It was the "first large-scale, electronic, digital computer capable of being reprogrammed to solve a full range of computing problems." In one second, it could perform:
  • 5,000 add or subtract operations
  • 357 multiply operations with two ten digit operands
  • 35 division or square root operations
...all of this, at a svelte 30 tons, while consuming 150 kW. This is state of the art in computing, roughly ~61 years ago.

Considering the operations being done to hash a password are substantially more complicated than simple arithmetic, how many passwords could ENIAC hash in a second? To be honest, I think it's more likely that this unit would be expressed in number of seconds per password rather than the number of passwords per second, but I'll be the first to admit that I'm not familiar enough with the NTLM hashing algorithm to say with certainty.

For the sake of argument (and not having to discuss a computing architecture that dates me by a good 35 years) let's assume that ENIAC can do one hash a second. This means our modern CPU is ten million times faster than ENIAC, or somewhere between 2^23 and 2^24 times the speed of ENIAC. The GPU is 2^27 to 2^28 times the speed of ENIAC. These are both commodity products, and ENIAC was state of the art, so a more fair comparison would be with something like Blue Gene, but I digress.

Using the above assumptions, this means that the number of hashes we can do should double every 2.25 years. This also means that it'd take 22.5 years for our current computing power to increase 2^10 (or 1024) times. Let's reread Jeff's statement:

Even if we increased the brute force attack rate by a factor of a thousand, it would still take 62 years. this means that 22.5 years from now, a computer would still need another 62 years to crack the 12-character password, for a grand total of 84.5 years. But, if we increase our computing power by a million (2^20, or 45 years from now--it's no big deal, we just push the date out), it will only take .o62 years (roughly 22 days), or 45.062 years from today. If we increase our computing power by 2^27, it ends up being roughly 4.25 hours to crack a single 12 character password.

Obviously all of this is total speculation, with wild figures and estimates on my behalf that are obviously open to interpretation, huge questions about where computing will be 40 years from now, etc. But the point I'm trying to make is: it's a fallacy to assume today's computing power is what you'll have access to in a year. It's a gigantic fallacy to assume today's computing power is what you'll have access to in 20 years (what were you doing in 1987?). Discussing what we'll have at our disposal in 62 years is mind-boggling.

When we're computing the amount of time it takes to do something, it makes no sense to hold computing power constant, especially when we start talking about any amount of time in excess of a year. If we think it's going to take ten years to perform some computation, one viable option is to wait five years (at which point in time your computing power has quadrupled), kick off that batch job, and take a 2.5 year coffee break. It still took you 7.5 years, but that was 2.5 years saved.

Sunday, October 07, 2007

The Sorry State of Online Translation Tools

As of late, I've been working on a set of automated tools that query free online translation services to perform translations for some of our target languages. For example, all of our string data is in en-US, so we might request a fr translation from babelfish.

Yes, if you were paying attention, I am doing the unspeakable: I've developed an application to automate translation using AltaVista's translation services. Is it theft or abuse of their free services? Yeah, probably--but seriously: cry me a river. (If you're reading this, BabelFish, take note: your programmers are stupid. Yes, I said it. Stupid. You have the potential to offer a for-pay service, but your API is too deficient for it to be of any real use without jumping through ridiculous hoops.)

Anyways, this task has yielded some extremely interesting observations. AltaVista (BabelFish) doesn't expose any web services for doing translation, although I believe they may have in the past. So, what we were left with was good old HTTP requests. We submit a request through their web form, and parse the translated resource out of the response. Needless to say, this is problematic. We've glued together some extremely flexible, object-oriented C# classes to manage this.

The page for translation looks something like:

...okay, great. So it managed to do this basic translation, pretty easily. But maybe I don't want "fox" translated. AltaVista instructs users to wrap "fox" in "x" characters, which results in something like:

...seems obvious enough. But wait--notice that the no-translate specifiers (x) are left in the output. I have to manually remove the x's from the output. Slop, baby, slop. How about this:

Call me crazy, but I don't remember "l'examplesx de xextra" being in the French lexicon. And something tells me they aren't going to recall it either. And when it comes to language, the French may be the last people in the world you want to piss off. But I digess.

So one thing is pretty obvious: the characters that BabelFish has chosen to designate sections of text as "do not translate" are insufficient; this should be totally obvious to even the most stupid programmer, since attempting to parse out x's from English text is going to be problematic, given that character is used (somewhat infrequently, I must admit, but why use a letter in the language itself?). And why not go with the (more) obvious (but still incredibly stupid) choice of "z", which is encountered half as often as "x" in everyday language? Perhaps their justification was that there aren't words in the English language that start and end with x, but there are phrases like "xextra examplesx", which seems to make BabelFish's parser throw a temper tantrum.

This character selection snafu probably also explains why they're left in the output; good luck removing something as unspecific as "x" from the output. Because of this, our application has a post-processing step where we search for our wrapped phrases, and then replace them with the originals before the x's were inserted.

All of this begs the question: why not use something that is obviously not an element of the English language, or any other linguistic language, for that matter? Personally, I would propose something like:

<NoTranslate lexcat="">section to not translate</NoTranslate>

...which would have made the aforementioned sentence:

I need <NoTranslate>extra examples</NoTranslate> of this leaping fox.

...or maybe:

I need extra examples of this leaping <NoTranslate lexcat=noun>fox</NoTranslate>.

..."lexcat" is an idea I had (no idea how feasible or practical it is) to give the translator some idea of what is contained in the parenthesis to perform a more concise translation. The context would be for the source language, which would (hopefully) aid in translation to the target language. I suppose other attributes could be added based on the source language (maybe gender would be useful for French, for example). Anyways, with this way of marking text, it is clearly obvious what part you don't want translated. The odds of someone encountering <NoTranslate></NoTranslate> in everyday parlance are pretty damn low.

Unfortunately, this is hardly the beginning of the crap translation service offered. Consider multi-line translations, such as:

Here's the first line to be translated.

This is a leaping fox.

...translates to:

Voici la première ligne à traduire. C'est un renard de saut.

...the problem here is pretty obvious: the translator didn't preserve line breaks. The output should have had a break between the two sentences. This is almost an even bigger issue for automating translation--one option we've kicked around is inserting some xLINEBREAKx word where there's a \r\n; without modifying the input, the location of line breaks is lost in the output. Yet another pre/post processing step to add.

The really sad thing is we've looked for better translation services, and this is actually one of the better ones. Google's service is (surprisingly) even more deficient; rather than using x's as no mark, you use a single "." at the front of a word or phrase you don't want translated. Gee, who uses periods these days? I think I'll go and remove all of the periods from my writing from now on.