Sorry about the goofy title; I'm in grave need of levity right now after some indexing troubles we had this past week and the ensuing recovery effort. We're currently in the midst of repairing most of the affected data, but I wanted to share what's going on with it.
Technorati's spiders were shut down for several hours on Thursday and at various intervals since then while we investigated a number of anomalies appearing in our data; essentially, a small percentage of recently created blogs were having their data scrambled. An example of this appears in this blog post. The spidering outages gave us time to investigate, diagnose and make corrections that prevented further data corruption. We started running some corrective measures on Friday but found over the weekend that they were only partially effective. Technorati handles a large volume of data every day; isolating and devising remedies for issues that affect only a small percentage of the data flow is tricky. However, we think we're recovering now, and the backlog of data processing is getting worked through.
Just to peek into the works a little bit, many distributed data systems rely on centrally dispensing identifiers for data elements, and Technorati has such a beast. What we found were cases of blogs new to our system (added within the last 3 weeks) losing their identifiers and those identifiers getting re-associated with other new blogs. No blogs that existed in our system before Dec. 18th (the vast majority) were impacted at all. The visible outward manifestations were mingled posts from blogs sharing an ID (a mashup the authors naturally were unhappy with) and mis-associated blog claims ("And you may tell yourself, this is not my beautiful blog").
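For the curious, here is a minimal sketch of how one might spot that kind of collision: group blogs by identifier and flag any identifier claimed by more than one blog. This is only an illustration and not our actual tooling; the function name `find_shared_ids`, the record layout and the example URLs and IDs are all made up for the example.

```python
from collections import defaultdict


def find_shared_ids(assignments):
    """Return {identifier: sorted blog URLs} for identifiers held by more than one blog.

    `assignments` is an iterable of (blog_url, identifier) pairs. This layout
    is hypothetical and chosen only for illustration.
    """
    blogs_by_id = defaultdict(set)
    for blog_url, blog_id in assignments:
        blogs_by_id[blog_id].add(blog_url)
    # An identifier is healthy when exactly one blog holds it.
    return {
        blog_id: sorted(urls)
        for blog_id, urls in blogs_by_id.items()
        if len(urls) > 1
    }


# Made-up sample data: two new blogs ended up with the same identifier.
assignments = [
    ("http://example.com/old-blog", 1001),
    ("http://example.com/new-blog-a", 2001),
    ("http://example.com/new-blog-b", 2001),
    ("http://example.com/new-blog-c", 2002),
]

shared = find_shared_ids(assignments)
for blog_id, urls in shared.items():
    print(f"identifier {blog_id} is shared by: {', '.join(urls)}")
```

A check along these lines only tells you that something has gone wrong, not which blog is the rightful owner of the identifier; sorting that out takes the kind of timeline reconstruction described in this post.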
This was an unprecedented case for us; while it had been occurring in about 8% of those blogs (created on or after December 18) for about 2 days (beginning on Tuesday, January 8th), we had until that time never encountered this phenomenon. We launched an intensive investigation, reconstructing operational timelines and correlating facts. What we found was that this stemmed from a failure of the primary system for identifier dispensing, another failure in the secondary system that took its place, and then a corrupted data set mistakenly taking over from that one. Ouch! The first two blows appeared to be handled routinely, but the third time was cursed; the propagation of corrupted data went undetected for about 48 hours, from Tuesday when it started to Thursday when we pulled the emergency brakes on the spiders.
So we're recovering now. Most of the data is being restored to its previous state, and we have had a number of internal postmortem discussions about earlier fault detection and recovery. If your blog was created in our system within the prior three weeks (since December 18th) and you're seeing aberrant data associated with it, or it's no longer there (try http://technorati.com/blogs/YOUR_BLOG_URL to check), please visit the support request page. A selection for 'The January 8th System Outage' will be available this month while we shake out any remaining issues that aren't covered by the remedial action now under way.


