Thursday, October 05, 2006

Welcome...And .NET Multithreading Woes


Yes, this is my first blog posting. I decided to start blogging both out of a sense of bordom, as well as a desire to expound upon a few topics for my own personal entertainment. Anyways, I don't need to tire anyone with more details than that, so....on to the first post.

Multithreading can be nasty business. Actually, to be totally precise, multithreading can be ugly, vicious, maloderous, heinously evil business. Learning to write multithreaded code (and write it well) can take years of practice and experimentation, and requires an intimate knowledge of how the operating system schedules threads.

A month or two ago, a coworker of mine narrowed in on a strange hang in our application where a BeginInvoke .NET call would take an extraordinary amount of time and consume 100% of available CPU. The issue was only visible on single-processor machines, and when attempting to start threads at a lower priority.

After investigating the issue, we stumbled across this forum post, which outlined what our problem was quite clearly. In .NET 1.1, when starting a new thread, the .NET runtime did not attempt to wait for the thread to start before returning. It's worth noting that most threading routines work like this; a call is made to kick off a new thread, but when the call returns, the new thread may or may not be up and running, since the whole thing obviously needs to be asynchronous. A common means of "waiting" for the thread to start is to use a WaitForSingleObject call, or some other similar eventing derivative.

Well, someone clearly thought this behavior to be undesireable. In .NET 2.0, someone added the following code to the CLR to make sure the newly spawned thread was running before returning to the calling application:

while (!pNewThread->HasThreadState(Thread::TS_FailStarted) &&

More or less, the above code attempts to "switch" (in quotes to indicate this is actually bogus) to the new thread while the new thread has not failed to begin, and the state is still unstarted. This implementation had a serious flaw when dealing with threads of a lower priority, in addition to several other major blemishes.

To really get an idea of the severity of this bug, let's take a look at the documentation for __SwitchToThread():

Causes the calling thread to yield execution to another thread that is ready to run on the current processor. The operating system selects the next thread to be executed., to recap, the person writing the above code though to themselves "gee, I'll just switch over to that thread I just kicked off. Y'know, hand over the CPU...", except it doesn't work like that: what they really did was tell the scheduler in the OS to stop giving the current thread CPU time, and give some other thread CPU time. The rub here is the "other" thread may or may not be the thread we just started. If the thread we just started is a low priority thread, it is very likely that there is some other normal priority thread that needs CPU time.

Now, imagine running multiple instances of the above code, both of which are running at normal priority, but trying to launch threads of lower priority. Parent Thread A calls __SwitchToThread(), which tells the OS to execute the next waiting thread. The OS sees that there's another thread that needs CPU time, and promptly schedules Parent Thread B, which promptly calls __SwitchToThread, and goes back to Parent Thread A! The net result is that the code will ping-pong between parent threads waiting for their respective child threads to begin, all while consuming 100% of the CPU and incurring a huge amount of context switching. In the instance of two or more parent threads waiting for child threads, this quickly loads the system and locks out the child threads from receiving any CPU time.

All of this thrashing will continue until something called the "balance set manager" comes in and coughs up some cycles for the starving threads. Everyone loves a thrashing OS early in the morning. I prefer mine with crumpets and tea.

The first error here is that polling is bad, m'kay?! The code above basically amounts to the proverbial "...are we there yet? are we there yet? are we there yet?...(x200,000,000 iterations)...are we there yet?" without ever actually doing any real work. In this instance, the polling has the added side effect of causing the very thing being waited upon to never begin!

A terrible, awful, evil "fix" for this code could be:

while (!pNewThread->HasThreadState(Thread::TS_FailStarted) &&pNewThread->HasThreadState(Thread::TS_Unstarted))

...Sleep will make the thread relinquish its CPU time back to the OS, which would allow the pathetic underling threads to get a piece of the action. However, this is still polling! A better solution to the problem would be to have the code do a WaitForSingleObject/WaitForMultipleObject, and have the spawned thread signal when it's up and running.

Error number two is not understanding how the scheduler works, and totally misunderstanding how __SwitchToThread() functions. The author clearly didn't understand how the scheduler works in the OS, which is absolutely key if multithreaded code is on the table. They may have tested their code too, but another facet of this issue is that dual-core/dual-CPU machines will not reproduce the issue, because a second core can service the newly spawned threads. The move towards multithreaded code thus has seriously implications for not only multi-processor computers, but for single processor machines as well. A developer writing code now needs to be sure to test against a single CPU machine if their primary machine is multi-core.

No comments: