In our last release we had a big problem. Our code was in the UAT phase, two weeks before the release date. We ran the scheduled performance (load) test and... bad news: the results were much worse than those from our previous release! The performance test failed. Well, we obviously couldn't release the code that way, so it was time to play Clue!
You remember Clue, right? It's a game where you have a bunch of suspects, a bunch of rooms and a bunch of weapons -- and a murder. Your job is to figure out, through process of elimination, who committed the murder, where they committed it, and how they did it. You can use the exact same process to fix performance issues in Sitecore.
In Clue, when you start out you don't know anything. As far as you know, the crime could have been committed by anyone. That's the same way you feel when you find a performance issue. Your mind starts running through all the possible things that could have changed since the last successful performance test. Is there something wrong with the database? Did someone introduce a bug into the configurations? Was it one of the new modules we developed (it was probably Ted's, not mine!)? Did the test itself change from last time? Is there some networking issue? Did the maintenance job fail to run? Is the problem in some test content? And on and on your brain will go.
Your first move might be to start investigating those things one by one, using gut instinct, until you find the culprit. But that would be the WRONG way to do it. You'll take too long, and if you do find it, it'll be by sheer luck. The trick to troubleshooting performance issues is to NOT go willy-nilly down every rabbit hole your fevered brain can imagine, but instead to take a disciplined, logical and methodical approach to eliminating the suspects.
So let's begin the game!
STEP 1 - Start Eliminating
For this particular exercise, I really wanted my team to solve the problem (instead of me doing it myself), so I got them all together in a room and provided some guidance on how to start "playing the game". We white-boarded out the following:
What we know:
1. We didn't notice any performance issues in our QA testing.
2. The last release's performance metrics were acceptable, but this time they are about 2x worse.
3. The performance is worse on all the Sitecore sites, and all pages.
4. Other teams control the server environment, and the testing software and methodology. They don't think anything significant has changed since last release.
What we can deduce:
1. Since QA testing was okay, it seems our problem only happens when the system is under load.
2. Since the load test was good last time, something has changed that is affecting performance.
3. Since all sites and pages are affected, it's some global problem.
4. The network and testing teams don't think anything has changed on their side, but they don't know for sure.
Questions:
1. Is this problem specific to the UAT environment?
2. Is this problem specific to our release code?
3. How can we prove 1 or 2?
Plan:
1. We will try to determine if the problem has to do with the server or network environment or the testing methodology, OR if the problem is in our code.
We figured that the best way to test this was to restore the previous release's code and databases to the UAT environment, and run another load test. If the load test passed, that meant the server environment and the test itself were okay, and the problem was in our release. If the load test failed, then the problem was in the environment or the test rather than in our code, and we could gleefully make it the network team's problem.
We ran this test, and the load test passed. So just like that we had a HUGE number of new clues to work with. We had effectively eliminated the hardware, the database servers, the testing software, the testing methodology, the network environment, the IIS servers, the search servers, etc. Something in our release was to blame.
In one stroke we cut the number of suspects in our game of Clue roughly in half!
STEP 2 - Keep Eliminating
So now that we knew the problem was in our release, it was back to the white board to figure out our next move.
We could start poking around all the changes that were made in this release (of which there were hundreds), and looking at lines of code to see what the problem could be. But that would be the WRONG way to do it. It would take too long, and maybe we'd find it that way but it would be through sheer chance - that's no way to play Clue!
So, back to the ol' white board.
What we know:
1. Developers changed something in this release that made performance under load worse.
2. We added new code, changed configurations, made content changes to the Master and Web DBs, added scripts, and made some customizations to MongoDB and Coveo search.
What we can deduce:
1. The problem affects all sites and pages, so we can eliminate those changes that are not global in scope.
Questions:
1. How can we figure out if the problem is in the code, or in the database content?
Plan:
1. We will try to determine if the problem has to do with our custom code, or if it's related to content in the database.
The trick to all this is to eliminate as many suspects as possible with the fewest actions. So for this stage of the game, we decided to take all of our custom code out of play (by disabling our custom configuration files) and run the performance test once again. If the test passed, that would mean the problem was definitely in our custom code. If it failed, then something in the Sitecore databases was responsible.
After disabling our custom configurations we ran the test, and it PASSED. So now, once again, we have more clues to work with - the problem has to be one of our code customizations. We can remove more suspects from the game!
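If you want to flip a whole folder of custom configs off and back on between test runs, a small script saves a lot of clicking. Here is a minimal sketch in Python. It assumes that only files ending in .config get loaded, so renaming a file to .config.disabled takes it out of play. The function names and the folder you pass in are made up for illustration. Point it at a folder that holds only your own custom configs, because the enable step restores every .disabled file it finds there.

```python
from pathlib import Path
from typing import List

# Assumption: only files whose names end in ".config" are loaded,
# so appending this suffix takes a file out of play.
DISABLED_SUFFIX = ".disabled"


def disable_configs(folder: Path) -> List[Path]:
    """Rename every *.config file under `folder` to *.config.disabled."""
    renamed = []
    # sorted() collects all matches first, so renaming does not disturb the walk.
    for path in sorted(folder.rglob("*.config")):
        if not path.is_file():
            continue
        target = path.with_name(path.name + DISABLED_SUFFIX)
        path.rename(target)
        renamed.append(target)
    return renamed


def enable_configs(folder: Path) -> List[Path]:
    """Undo disable_configs by stripping the .disabled suffix again."""
    restored = []
    for path in sorted(folder.rglob("*.config" + DISABLED_SUFFIX)):
        if not path.is_file():
            continue
        target = path.with_name(path.name[: -len(DISABLED_SUFFIX)])
        path.rename(target)
        restored.append(target)
    return restored
```

Run disable_configs before the load test and enable_configs afterwards, and keep the returned lists so you have a record of exactly which files were switched off for that run.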
STEP 3 - Eliminate Even More
Now we have a major breakthrough! Through a scientific method we have figured out generally where the issue lies, and how to reproduce it (by enabling certain config files). At this point in the game it's pretty clear what we need to do - we need to figure out which of our customizations is causing the issue.
We have 92 custom configuration files in our solution (yeah, I know, lots of customizations!). So now we can easily halve the number of suspects by disabling 46 of those config files and testing again. If the problem still exists, the culprit is among the 46 files that are still enabled, so disable 23 of those. If the problem goes away, the culprit is among the files you just disabled, so re-enable half of them and test again. Keep going like this and, after about seven rounds, you will have narrowed the suspects down to ONE, and VOILA, we have our murderer - it's Colonel Mustard in the Hall with the Candlestick!
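The halving game is really just a binary search, and it can be sketched in a few lines of Python. This is only an illustration under some assumptions: there is exactly one guilty config file, and test_passes_without is a hypothetical callback that disables the given files, runs the load test, and returns True if the test passes. In real life that callback is you, a deployment, and a long wait.

```python
from typing import Callable, List, Sequence, Tuple


def find_culprit(
    configs: Sequence[str],
    test_passes_without: Callable[[List[str]], bool],
) -> Tuple[str, int]:
    """Bisect `configs` down to the one file that makes the load test fail.

    Returns the culprit and the number of load tests that were run.
    Assumes exactly one culprit (an assumption, not a guarantee).
    """
    suspects = list(configs)
    rounds = 0
    while len(suspects) > 1:
        half = len(suspects) // 2
        disabled, still_enabled = suspects[:half], suspects[half:]
        rounds += 1
        if test_passes_without(disabled):
            # Turning these off fixed it, so the culprit is one of them.
            suspects = disabled
        else:
            # Still failing, so the culprit is among the files left on.
            suspects = still_enabled
    return suspects[0], rounds


# Simulated run with 92 placeholder file names and a pretend culprit.
configs = [f"Custom.{i:02d}.config" for i in range(92)]
guilty = "Custom.57.config"
found, rounds = find_culprit(configs, lambda disabled: guilty in disabled)
print(found, rounds)
```

With 92 files this never needs more than seven test runs, compared with up to 91 if you checked the files one at a time. If two files were jointly responsible, this simple version would not be enough and you would have to bisect more carefully.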
STEP 4 - Fix It
NOW you can finally dig into the code and solve the root problem. In our case, it happened to be one simple line of code that was instantiating a new class over and over again instead of calling it statically. That kind of thing does not necessarily affect performance until the system is under load. I was really proud of the team for getting to the bottom of a complex issue and getting our release back on track.
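I'm not reproducing our actual line of code here, so take the following as a hypothetical Python illustration of the general pattern rather than our real fix. ExpensiveHelper and both handler functions are invented names. The slow handler builds a fresh object on every request, while the fast one reuses a single shared instance.

```python
class ExpensiveHelper:
    """Hypothetical stand-in for a class that is costly to construct."""

    instances_created = 0  # counts constructions, for demonstration only

    def __init__(self) -> None:
        ExpensiveHelper.instances_created += 1
        # Simulated costly setup, e.g. building a lookup table.
        self.lookup = {i: str(i) for i in range(10_000)}

    def render(self, key: int) -> str:
        return self.lookup[key]


def handle_request_slow(key: int) -> str:
    # Anti-pattern: pays the construction cost on every single request.
    return ExpensiveHelper().render(key)


# Built once when the module loads, then shared by every request.
_SHARED_HELPER = ExpensiveHelper()


def handle_request_fast(key: int) -> str:
    # Reuses the shared instance, so no per-request construction.
    return _SHARED_HELPER.render(key)
```

A single request through either handler feels the same, which is why this sort of thing slips through QA. Multiply the setup cost by thousands of requests and it adds up. Sharing one instance is only safe when the object holds no per-request state, or is otherwise safe to use from several requests at once.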
So, to wrap up, that is how you can apply the concepts you learned from the game of Clue to your technical work. This scientific method helps you focus on looking for clues where they actually exist, not just guessing until you find them. It helps you rapidly narrow down the issue in a logical way. And it can be fun to approach problems in the form of a game.
And it doesn't just work with Sitecore performance issues. I use the game of Clue all the time to solve many different kinds of complex troubleshooting problems, even at home to find lost keys or plan where to go on my next vacation.
Good luck to you on your gaming endeavors!