MTGO daily data is awesome. If you like Magic and like numbers/statistics/math in any way, it's hard to look at the mothership's dataset and not nerd out. It's updated daily! It has full decklists! It's presented in a consistent format! It has average game win percentage for decks! Although there are certainly ways Wizards could improve the MTGO reporting (give us individual matchup results! Please?), it's overall one of the best Modern datasets online. It's also the foundation of the Top Decks MTGO tab, and a great resource for players who want to understand the MTGO metagame.
There are endless possibilities for this kind of dataset. We can use it to find cool tech and new decklists. We can analyze attendance for MTGO queues and extrapolate that to format health. We can break down player retention between events. We can describe the metagame, describe an "average" list for a deck, and we can answer questions about color, archetype, and card diversity. But this piece isn't just about descriptive statistics. It's about winning. Specifically, about finding the MTGO deck that wins the most.
In this article, I will use some data analysis techniques to identify the "best" MTGO deck. By "best", I mean both a) the deck with the highest win rate in events, and b) the deck with the highest win rate in individual matches and games. By applying some statistical tools to the MTGO data, my plan is to isolate the top performing decks on MTGO. Or should I say (spoiler alert!) to isolate the "Platinum" standard in online Modern...
Mandatory statistics disclaimer
Before we start with any statistical discussion, especially one online, I want to give a quick disclaimer on the approach. At best, statistics is just one tool of many in our analytic arsenal. And at worst, it's damned lies and guesswork. There are countless ways to analyze a dataset and only so much space in this article. There are also countless ways to expand a statistical analysis or to sharpen it. More to the point in this case, there are many limitations to the dataset itself. It's only 4-0/3-1 decks. It's published only once every day. It tends to be the biggest event of the day. None of these factors renders the entire dataset useless, but they are things we need to understand as potential limitations. So when you are reading through, resist the temptation to think "umm N is too small", "where's the multiple regression??", or "giff beta coefficientz pliz". This is just one statistician's take on the data and I'm always happy to see other approaches in the comments.
Step 1: Game Win Percentage
Our first step is to look at all decks between 1/28/2015 and 3/17/2015. This is a wider date range than we use for metagame descriptions on the site (almost 2 months vs. just 1), but we need as wide a range as possible to try and minimize the effect of a small N. With just 48 events in that range, comprising about 1600 decks, we are only looking at a tiny fraction of the total number of dailies and decks that happened between 1/28 and 3/17. So the more datapoints we can add in, the better our analysis might be. Why not go back before 1/28 to increase N further? Because the metagame was fundamentally different before the January banlist changes went into effect. As a result, decks would be winning and losing in different matchups, which means that the pre-1/28 and post-1/28 data would look different from the start.
Now that we have our date range, let's narrow it further. I only want to consider decks that were Tier 1 or Tier 2. This isn't because those decks are "better" than other decks, or because untiered decks are necessarily bad. Instead, it's because the tiered decks meet a baseline number of dailies (14 in this case). That gives us enough datapoints to work with. 5C Humans is a cool deck, but with so few appearances, it's hard to analyze its win rates with any statistical significance.
Here are the decks under consideration. Decks are presented with the number of finishes in the date range, as well as the average game win percentage (GWP) for that deck. All decks are ordered by GWP from highest to lowest. Note that these decks make up 1203 of the 1559 decks total, or about 77% of the reported decks in the date range. So even though it's just Tier 1 and Tier 2 decks, we are still looking at most of the reported MTGO daily metagame.
|Deck|# of finishes (1/28 - 3/17)|Average GWP|
|Mono U Tron|25|72.89%|
The first time I saw the GWP numbers, I had no idea what a "good" or "bad" GWP was. So when we look at this table, we should be asking some questions. What is an average GWP? What is significantly higher? What about lower? Let's construct a quick confidence interval based on that average GWP to see what the range of "expected" GWPs might be:
Average GWP: 68.65%
95% Confidence Interval: 65.54% - 71.76%
For those who need a refresher on your confidence intervals, the number above means that we can be 95% sure that the "true" average GWP (assuming we had millions of datapoints) would fall somewhere between 65.54% and 71.76%. Anything outside of that is likely an abnormally high or abnormally low GWP. But any GWP within that range could just be "average". For instance, Affinity has an average GWP of about 70%. BW Tokens is only 67%. Because both of those GWPs fall within our confidence interval, we can't say that one deck has a higher/lower GWP than the other. Even a fair coin will flip an uneven number of heads and tails in a few trials, so maybe the difference between the Affinity and BW Tokens GWP is just because we don't have enough datapoints.
That said, when a GWP is outside of the range, we need to take a close look at that deck. So let's revisit the table and turn our attention to the two decks outside of the range. First, the deck on the bottom: It's a sad day for UWR Control. The good old red, white, and blue brings up the rear with a solidly below average GWP of 65.43%. UW Control is right down there with UWR Control, with a GWP that just ekes over the bottommost cutoff. Interesting, but the failures of UWx control on MTGO are a topic for another day and another article. Now let's look at the top. We see a few decks that are very close to that uppermost end of the confidence interval, but only one deck that is actually over it. And it's not just over it; it's a full percentage point over it. That deck? Mono U Tron (!!).
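For anyone who wants to try this at home, here is a minimal Python sketch of that kind of interval. The GWP values in it are made-up placeholders rather than the real table, the function name `mean_confidence_interval` is my own invention, and it uses a normal approximation (a critical value of about 1.96), which can come out slightly narrower than a t-based interval on a sample this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, low, high) for the mean, using a normal approximation."""
    n = len(values)
    avg = mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_err = stdev(values) / sqrt(n)
    # Two-sided critical value, roughly 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return avg, avg - z * std_err, avg + z * std_err


# HYPOTHETICAL per-deck GWPs (in percent), for illustration only
sample_gwps = [72.9, 71.0, 70.2, 69.5, 68.8, 68.0, 67.1, 66.4, 65.4]

avg, low, high = mean_confidence_interval(sample_gwps)
print(f"Average GWP: {avg:.2f}%")
print(f"95% Confidence Interval: {low:.2f}% - {high:.2f}%")

# GWPs outside the interval are the ones worth a closer look
outliers = [g for g in sample_gwps if g < low or g > high]
print(outliers)
```

Swap in the real per-deck GWPs and the same function gives the average and the two cutoffs in one go.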
Step 2: 4-0 vs. 3-1 finishes
Let's step back for a moment. Before I/we get too excited about Mono U Tron being the best MTGO deck, we need to look at our data from a few other angles.
Let's take the same table from above and add two columns. First, the number of 4-0 finishes. Having a good GWP is one thing, but we want that GWP to convert into 4-0 wins if possible, not just 3-1 finishes. Second, the percentage of all finishes that were 4-0. We need to consider this "4-0 rate" instead of just the raw "4-0 number" because some decks are going to have more appearances overall. 20 finishes at 4-0 isn't that impressive if a deck has 300 finishes total. But it's much more impressive if that deck had only 40 finishes.
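As a quick sketch of why the rate matters more than the raw count, here are the two scenarios from the paragraph above in Python. The helper name `four_oh_rate` is just something I made up for illustration.

```python
def four_oh_rate(four_oh_finishes, total_finishes):
    """Share of a deck's reported finishes that were 4-0."""
    if total_finishes <= 0:
        raise ValueError("total_finishes must be positive")
    return four_oh_finishes / total_finishes


# Same 20 finishes at 4-0, very different stories
print(f"{four_oh_rate(20, 300):.1%}")  # 6.7%
print(f"{four_oh_rate(20, 40):.1%}")   # 50.0%
```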
Here's the modified table, now sorted on the 4-0 finish rate.
|Deck|# of finishes (1/28 - 3/17)|Average GWP|# of 4-0 finishes|4-0 finish rate|
|Mono U Tron|25|72.89%|8|32%|
For the most part, the top section of that table is the same as the table we saw earlier. Our top 5 here are, in order, Amulet Bloom, Mono U Tron (!!), Grixis Delver, Jund, and Infect. Our top 5 in the GWP table are Mono U Tron, Infect, Storm, Amulet Bloom, and Affinity. The common decks are Amulet Bloom, Mono U Tron, and Infect. If we extended the range to the top 10 decks on each list, we would add Burn and Jund to the mix.
Discerning readers might find this overlap obvious. After all, there should be a correlation between GWP and 4-0 finish rate, so of course there's some overlap! Right? Although there is a correlation between the two, it's a lot weaker than we might think. Using Pearson's correlation coefficient (R), a measure of linear relationship, we see an R of .56 between the GWP variable and the 4-0 rate variable. To give a bit of context, R ranges between -1 and 1: values near 0 mean no linear relationship, and values near 1 or -1 mean a strong positive or negative one. Giving some social science examples to illustrate those values, something like "neighborhood poverty" and "neighborhood crime rates" might have an R over .8 or .9. But something like "gender" and "SAT score" would probably have an R closer to .1 or .2. So our .56 R is considerable, but not overwhelming. That is also to say, there are other factors at play in GWP and win rate. We couldn't perfectly predict one by knowing the other.
So let's add one more layer before turning to Mono U Tron. Let's combine those two variables, GWP and 4-0 rate, into a single score. We'll just multiply them together to do this. This lets us sort the decks on a variable that combines both GWP across individual games and 4-0 win rate across matches.
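If you want to compute that kind of coefficient yourself, here is a small Python sketch of Pearson's R written out by hand. The function name `pearson_r` and the two lists of numbers are hypothetical placeholders, not the real deck data, so the R it prints will not match the .56 above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("R is undefined when one variable never changes")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL values: one GWP (percent) and one 4-0 rate per deck
gwps = [72.9, 70.0, 69.1, 67.0, 65.4]
four_oh_rates = [0.32, 0.25, 0.30, 0.22, 0.20]

r = pearson_r(gwps, four_oh_rates)
print(f"R = {r:.2f}")
```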
|Deck|# of finishes (1/28 - 3/17)|Average GWP|# of 4-0 finishes|4-0 finish rate|Score|
|Mono U Tron|25|72.89%|8|32%|0.233|
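To make the multiplication concrete, here is a tiny Python sketch that reproduces Mono U Tron's score from its GWP and 4-0 rate. The function name `combined_score` is my own label for illustration.

```python
def combined_score(gwp_percent, four_oh_rate_percent):
    """Multiply GWP by 4-0 rate, after converting both percents to fractions."""
    return (gwp_percent / 100) * (four_oh_rate_percent / 100)


# Mono U Tron's row from the table: 72.89% GWP, 32% 4-0 finish rate
score = combined_score(72.89, 32)
print(round(score, 3))  # 0.233
```

Because both inputs are fractions between 0 and 1, the score rewards decks that are strong on both measures and punishes decks that are weak on either one.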
Mono U Tron, back on top! Also up there are Amulet Bloom, Grixis Delver, Jund, and Infect, four other decks that have high GWPs and also very respectable 4-0 finish rates. Those are interesting results, but Mono U Tron is the most interesting to me. This is a deck that many players consider to be resoundingly Tier 2. So what is it doing on top of decks like Infect and Amulet?
Looking closely at the data, we might notice that the top five decks (even the top eight decks) all have relatively few appearances when compared with the big dogs like Burn, Abzan, and UR Twin. This could suggest that decks with fewer showings are more likely to have higher GWPs and 4-0 rates, which would definitely throw a wrench in our analysis. After all, if that were true, we might suspect that any deck with only a handful of appearances could get a high GWP/4-0 rate score, which wouldn't necessarily mean that Mono U Tron, Amulet Bloom, etc. were any better than the rest. It would just mean they had fewer appearances that got undue weight.
But when we control for the number of appearances in the overall dataset, we find that this variable has almost no impact on GWP, 4-0 rate, or overall score. For you statistics folk, the Multiple R value is a paltry .14. That suggests a very weak relationship between a deck's number of overall appearances and its "Score" in the table. This makes sense when we look over the decks. There are plenty of decks that have only a handful of showings (RG Tron, Ad Nauseam, Storm, etc.) that still have much lower "Scores" than decks like Mono U Tron and Grixis Delver. Same number of appearances, worse GWPs and 4-0 rates. This suggests that there really is something about those top decks that gives them an edge, and that their performance isn't just an artifact of a smaller N.
NOTE: IMPORTANT LIMITATION
Before we go any further, I want to highlight one major limitation of this dataset that some of you may already have noticed. We have no idea how many people played any of these decks in the 2-2 or worse brackets. Dailies can have over 100 players, and our reported dailies just showcase the top 40 or so. This means we have no idea how many people played their deck in the event as a whole, which means we don't know what the conversion rate is from 0-0 to 4-0/3-1. Does this invalidate the whole dataset? Not entirely, because the effect would be more or less equal across all decks. But it is something we need to consider when thinking through the data.