Not sure if I posted what it was before, but this is what happens:
1. All the PHP pages are parsed and cached using WinCache.
2. A lot of system data is also cached.
When a page is requested, instead of needing to load it from disk / db / etc, the cache (memory) is used.
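To make that concrete, here is a rough sketch of the "check memory first, only hit disk on a miss" idea. It is plain Python rather than PHP, and it is not how WinCache works internally; `PageCache`, `get_or_load` and `load_page_from_disk` are made-up names purely for illustration.

```python
class PageCache:
    """Hypothetical in-memory stand-in for the page / data cache."""

    def __init__(self):
        self._store = {}
        self.slow_loads = 0  # how many times we had to go to disk / db

    def get_or_load(self, key, loader):
        # Fast path: the item is already in memory.
        if key in self._store:
            return self._store[key]
        # Slow path: load it once, then keep it for later requests.
        value = loader(key)
        self.slow_loads += 1
        self._store[key] = value
        return value


def load_page_from_disk(key):
    # Placeholder for the expensive work (reading and parsing a file, a db query, ...).
    return "<html>" + key + "</html>"


cache = PageCache()
first = cache.get_or_load("index.php", load_page_from_disk)   # goes to "disk"
second = cache.get_or_load("index.php", load_page_from_disk)  # served from memory
```

The second request never touches the loader, which is where the speed-up comes from.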
This is great and speeds things up - until... the caching engine crashes, which unfortunately has the side effect of putting a "lock" on the cache.
Everything then freezes - every page request spawns a new PHP process (as all the current ones are busy waiting) - and it builds up and up until the original lock is dropped - at which point hundreds of PHP processes all try to access it, lock it, use it, unlock it ...
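The pile-up can be reproduced in miniature. This is only a toy model under some loud assumptions: Python threads stand in for PHP processes, a `threading.Lock` stands in for the cache lock, and `simulate` is a made-up helper. It says nothing about WinCache's real locking code.

```python
import threading
import time


def simulate(n_requests):
    """Hold a lock, let requests queue up behind it, then release it.

    Returns (requests served while the lock was stuck, requests served in total).
    """
    cache_lock = threading.Lock()    # stand-in for the lock on the cache
    counter_lock = threading.Lock()  # protects the counters below
    counts = {"arrived": 0, "served": 0}

    def handle_request():
        with counter_lock:
            counts["arrived"] += 1
        with cache_lock:             # blocks for as long as the lock is stuck
            with counter_lock:
                counts["served"] += 1

    cache_lock.acquire()             # the crashed holder never lets go
    workers = [threading.Thread(target=handle_request) for _ in range(n_requests)]
    for w in workers:
        w.start()

    # Wait until every request has arrived and is queued behind the lock.
    while True:
        with counter_lock:
            if counts["arrived"] == n_requests:
                break
        time.sleep(0.01)

    with counter_lock:
        served_while_stuck = counts["served"]

    cache_lock.release()             # the original lock is finally dropped
    for w in workers:
        w.join()                     # everyone now takes the lock one at a time
    return served_while_stuck, counts["served"]
```

Nothing is served while the lock is held, and once it is released every waiting request still has to take its turn with the lock, which matches the rush described above.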
Now the kicker is - if 1 process manages to get the lock every minute, the tracking system will never know an error has happened (as it will be able to get through) - which is why sometimes it goes on for a while.
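A small sketch of why an occasional success hides the problem. I am assuming here that the tracking system treats "some request got through" as healthy; that is a guess at its logic, not its actual code, and `Probe`, `looks_healthy_naive`, `looks_healthy_strict` and the 5 second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    ok: bool        # did the request get a response at all?
    seconds: float  # how long it took


def looks_healthy_naive(probes):
    # Assumed current behaviour: one success in the window is enough.
    return any(p.ok for p in probes)


def looks_healthy_strict(probes, max_seconds=5.0):
    # Alternative: every probe must succeed and come back quickly.
    # An empty window is treated as unhealthy rather than silently passing.
    return bool(probes) and all(p.ok and p.seconds <= max_seconds for p in probes)


# One request squeezes through after nearly a minute; the others time out.
window = [Probe(False, 60.0), Probe(True, 58.0), Probe(False, 60.0)]
naive_result = looks_healthy_naive(window)    # True: the slowdown goes unnoticed
strict_result = looks_healthy_strict(window)  # False: flagged as a problem
```

A check along the lines of the strict version would flag the slow crawl before things lock up completely, at the cost of more false alarms during ordinary load spikes.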
Once things totally lock up, the tracking system will notice and fire a restart on the IIS application pool (this is when you start to see the 500 errors) - which is good, UNLESS the number of people using the site is too great, which will cause a lock again and ... repeat, repeat, repeat... lol
It would normally be a case of just turning the cache off - however, too many things now rely on it for that to be an option.

So we basically just have to keep hoping MS finally release an update (or XCache resolve some issues they have and we can move over to that!)