Having finished the appetizer, it's time for the main course: the data from my Hypergenesis test. This is the hard, quantitative data, and I've run statistics on it to determine the validity of the test. For the stats people out there, I run multiple significance tests, but will report only the z-test here. There's never been disagreement between the tests, and I believe that more people will remember the z-test from high school than any of the others. Also, the Excel readout is cleaner.
Below are the results from my experiment. It is entirely possible that repetition will yield different results. This project models the effect that the banned card would have on the metagame as it stood when the experiment began. My result does not seek to be definitive, but rather to provide a starting point for discussions on whether the card should be unbanned.
Meaning of Significance
When I refer to statistical significance, I really mean probability; specifically, the probability that the differences between a set of results are the result of the trial, and not of normal variance. Statistical tests are used to evaluate whether normal variance is behind the result, or if the experiment caused a noticeable change in result. This is expressed in confidence levels determined by the p-value from the statistical test. In other words, statistical testing determines how confident researchers are that their results came from the test and not from chance. The starting assumption is typically "no change," which is the null hypothesis (H0).
If a test yields p > .10, the test is not significant, as we are less than 90% certain that the result isn't variance. If p < .10, then the result is significant at the 90% level. This is considered weakly significant and insufficiently conclusive by most academic standards; however, it can be acceptable when the n-value of the data set is low. While significant results are possible with as few as 30 entries, it takes huge disparities to produce them, so sometimes 90% confidence is all that is achievable.
p < .05 is the 95% confidence level, which is considered a significant result. It means that we are 95% certain that any variation in the data is the result of the experiment. Therefore, this is the threshold for accepting that the experiment is valid and models the real effect of the treatment on reality. Should p < .01, the result is significant at the 99% level, which is as close to certainty as possible. When looking at the results, check the p-value to see if the data is significant.
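For readers who like to see the cutoffs in one place, here is a minimal sketch of the thresholds described above. The function name `significance_label` and the label wording are my own illustration, not anything taken from a statistics package.

```python
def significance_label(p_value):
    """Map a p-value to the labels used in this article (illustrative helper)."""
    if p_value < 0.01:
        return "significant at the 99% level"
    if p_value < 0.05:
        return "significant at the 95% level"
    if p_value < 0.10:
        return "weakly significant (90% level)"
    return "not significant"


# A few made-up p-values, purely to show which bucket each lands in.
for p in (0.20, 0.08, 0.03, 0.004):
    print(f"p = {p:.3f}: {significance_label(p)}")
```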
Significance is highly dependent on the n-value of the data: in this case, how many matches were recorded. The lower the n, the less likely it is that the result will be significant irrespective of the magnitude of the change. With an n of 30, a 10% change will be much less significant than that same change with n=1000. This is why the individual results frequently aren't significant, even when the overall result is very significant.
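To make the effect of n concrete, here is a rough sketch of a pooled two-proportion z-test applied to the same 10-point gap (40% vs. 50% win rate) at n=30 and n=1000. The win counts are hypothetical, chosen only to match that gap, and the pooled two-sided form is an assumption on my part; the Excel readout may be set up differently.

```python
import math


def two_proportion_z(wins_a, wins_b, n):
    """Return (z, two-sided p) for two win counts over n matches each."""
    rate_a, rate_b = wins_a / n, wins_b / n
    pooled = (wins_a + wins_b) / (2 * n)            # shared win rate under "no change"
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (rate_b - rate_a) / std_err
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal tail area, both sides
    return z, p_two_sided


# Hypothetical counts: 40% vs. 50% win rates at two different sample sizes.
z_small, p_small = two_proportion_z(12, 15, 30)
z_large, p_large = two_proportion_z(400, 500, 1000)

print(f"n=30:   z = {z_small:.2f}, p = {p_small:.4f}")
print(f"n=1000: z = {z_large:.2f}, p = {p_large:.6f}")
```

The gap is identical in both cases; only the standard error shrinks as n grows, which is what drives the p-value down.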
Overall Matchup Data
As a reminder and for those who’ve never seen one of these tests before, I played 500 total matches: 50 matches with each experiment deck against each gauntlet deck. I switched decks each match to level out any effect skill gains had on the data. Familiarity and matchup knowledge naturally increase with games played, and since I would be better with both decks by the end, the data could end up skewed. Alternating decks ensures that the increase happens at the same time for both decks. Play/draw alternated each match, so both decks spent the same time on the draw and play. The deck lists for both the gauntlet and test decks can be found here.
- Total Neoform Match Wins: 77 (30.8%)
- Total Hypergenesis Match Wins: 122 (48.8%)
The data shows that Hypergenesis won significantly more often than Neoform. The p-value is so tiny that it is functionally certain that any variation is the result of the test and not natural variance. In other words, Hypergenesis did better than Neoform by a large enough degree that I can be certain the result is valid.
Honestly, I absolutely expected that Hypergenesis would do better than Neoform. It's been a pretty consistent refrain of mine for years at this point, but Neoform is not and has never been a good deck. It's pretty busted if it works, but very easily disrupted. The test (as far as I was concerned) was not to see if Hypergenesis is a better deck, but by how much. Players tend to grumble about this style of gameplay, but so long as it's inconsistent, it's no problem. Given the fact that Hypergenesis did 18 percentage points better than Neoform and in light of the cascade debacle, I think it's safe to conclude that Hypergenesis's data is instructive.
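Anyone who wants to check the overall result can do so from the two totals: 77 and 122 wins, out of 250 matches each. The sketch below assumes a pooled two-proportion z-test with a one-tailed alternative (Hypergenesis wins more often); the tail choice and the function name are my assumptions for illustration, not a transcript of the spreadsheet.

```python
import math


def one_tailed_z_test(wins_a, wins_b, n):
    """Pooled two-proportion z-test; returns (z, one-tailed p) for 'B beats A'."""
    pooled = (wins_a + wins_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (wins_b / n - wins_a / n) / std_err
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal curve
    return z, p_one_tailed


# Totals from the list above: Neoform 77, Hypergenesis 122, 250 matches each.
z, p = one_tailed_z_test(77, 122, 250)
print(f"z = {z:.2f}, one-tailed p = {p:.6f}")
```

Doubling the one-tailed value gives the two-tailed figure, which stays far below .01 as well.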
The hard data that a test seeks isn't always the total story. Often it's the surprises along the way that make a test. Sometimes I know what I want to look for; other times, things only appear in exploratory testing. This time, I intended to watch for turn 1 wins. Those are obviously the most problematic aspects of broken combo decks, and since both decks have turn 1 kills, knowing which is more likely to win on turn 1 is instructive for their place in the metagame. I intended to count both actual wins and opponent concessions as wins. The latter was more relevant for Hypergenesis than Neoform, as Hypergenesis's wins were often unsolvable boards rather than kills.
Actually following through and recording that data was a problem. Because I... *cough*... (mumbles) didn't. No excuses, I straight up forgot to record all the turn 1 wins. There were a number of sessions where it just slipped my mind. In fact, the only numbers that I'm sure are accurate come from the DnT testing. Which is less than ideal, but better than nothing.
| Deck | Turn 1 Game Wins vs DnT | % of Game Wins vs DnT | Average Win Turn |
Hypergenesis won more games on turn 1, but they represent a lower percentage of the total game wins. This makes sense as Neoform is easily disrupted and relies on that fast kill. And always has. Plus, Hypergenesis won more games, so it would have more turn 1 kills.
However, I was surprised that Hypergenesis's average win turn is higher than Neoform's. It's very clear from the data, but I wasn't expecting that result, which challenges some of my assumptions about both decks. Hypergenesis's win distribution is bowl-shaped: turn 1 had the highest number of wins, turn 3 the lowest, and turn 4 spiked back up to just below turn 1. Meanwhile, half of Neoform's game wins came on turn 2, there were no turn 3 wins, and a few came on turn 4. Neither deck won after turn 4. It suggests that Neoform is more glass-cannon than expected, but perhaps not as broken.
The other thing I watched for was fizzling. It's known that Neoform has a fizzle rate, but I've never seen it quantified. It's also important to define fizzling, and for me it was any time that a deck successfully started comboing, but failed to compile a winning sequence with no input from the opponent. Getting something countered or removed mid-combo and failing is not a "fizzle;" that's just getting disrupted. Failing to finish the combo because of poor draws is. And this never happened to Hypergenesis. If it played a cascade spell, it cast Hypergenesis. That didn't always translate into a win, but that was thanks to the opponent's actions, not deck failure.
The same could not be said of Neoform. I recorded a fizzle rate of about 3%. These mostly happened due to drawing too few Nourishing Shoals to draw the whole deck or even get more than two Griselbrand activations.
Frequently, Neoform subsequently lost, though not always. Every so often, this was a loss because it took Summoner's Pact to get going. The most memorable fizzle was when I got down to 7 cards in library and 8 life, but couldn't win because I had no blue mana floating and all my Simian Spirit Guides and my last two Manamorphoses were in those 7 cards. I'd used the Wild Cantor to get going, so there was no way to get the mana and turn it blue for Laboratory Maniac without decking. My opponent untapped, Pathed Griselbrand, and won the game.
Deck By Deck
Given that the overall data is statistically significant, the deck-by-deck results may be surprising. Regardless of the overall results, historically, the individual decks haven't always yielded significant results. This is because of the lower number of data points. I only have 50 matches to work with per deck rather than 250 for the overall results, so the threshold for significance increases. So if you see something odd in the data, blame the low n.
The other thing to note is that, unlike other tests, my play didn't change based on my opponent's deck. I always had to mulligan aggressively because there's little opportunity for sculpting either deck. I also always just went for the combo at first opportunity, particularly game 1. They're glass cannon combos without much or any interaction game 1, so there's nothing to gain by waiting. In games 2-3, I would only hold off on comboing if I had Ricochet Trap or Veil of Summer in hand against 4-Color, so that I could protect against counters. This meant that this test went a lot faster than any previous one. And was easier on me because I didn't have to think much.
In the order I finished the matches:
Death and Taxes
Death and Taxes does not interact turn 1 except via Path to Exile. However, each subsequent turn, the number of disruptive spells increases. Thalia is obviously rough for both test decks, but Archon of Emeria was game against Hypergenesis game 1. Both decks could subsequently be Strip Mined into submission. As a result, games didn't go very long and neither combo deck won after turn 4.
- Total Neoform Match Wins: 12 (24%)
- Total Hypergenesis Match Wins: 21 (42%)
A big part of this result was that Leonin Arbiter was relevant disruption against Neoform and not Hypergenesis. My opponent planned ahead with the Burrenton Forge-Tenders against my Anger of the Gods. We discussed at length whether against Neoform it was better to Path the Griselbrand immediately or wait for Laboratory Maniac. I wasn't running Pact of Negation maindeck; my opponent didn't know that, only that it wasn't always played maindeck anymore. Taking the latter course 100% wins the game against my deck, but is risky otherwise.
4-Color Omnath
Something I didn't realize until this test is that Hypergenesis's text is different than Eureka's. The latter says all permanents, but Hypergenesis excludes planeswalkers. This actually takes it back to Eureka's original functionality, but it's still intriguing that Wizards deliberately made that change right before planeswalkers came out.
- Total Neoform Match Wins: 17 (34%)
- Total Hypergenesis Match Wins: 23 (46%)
This was the only deck where either deck won later in the game, and the reason is that they could afford to. 4-Color Omnath wins rapidly, but not quickly. Once it actually produces threats, it puts the game away in short order, but that may take a while. Thus, a single failure didn't spell the end for either deck. Fighting counter walls was hard, but not impossible, post-board. Hypergenesis could, and I sometimes did, overwhelm counter walls even late-game thanks to Trap. Occasionally, planeswalkers spared Omnath immediate death by bouncing a non-hasty Emrakul, but it was rarely enough.
However, longer games also gave Neoform more time to draw both Griselbrands, which could be lethal unless they managed to discard and then Noxious Revival one back and immediately combo off. Teferi, Time Raveler was game over for Hypergenesis, but there only being two copies meant it didn't happen too much. 4-Color getting to Supreme Verdict after sideboard helped a lot, but with only one, it didn't much tip the scale in its favor.
As testing got going, my Scourge pilot got increasingly annoyed. Neoform does very poorly against discard, but Hypergenesis can overcome it thanks to cascade redundancy. Plus, both decks ran sets of Leyline of Sanctity in the sideboard. He frequently wished he was still on Grixis Death's Shadow to have counters as a backup. We tried running Blood Moon, and it was better than the cards we cut, but still wasn't very effective against either deck.
- Total Neoform Match Wins: 16 (32%)
- Total Hypergenesis Match Wins: 25 (50%)
The difference here is Neoform's game 1 weakness to Thoughtseize. Both decks improve a lot after board while Scourge's options are limited. However, both need to cheese game 1 to beat hate games 2 and 3, and that being so much easier for Hypergenesis was decisive. Take my Violent Outburst? I've got 11 more ways to cascade. Take my fatty? Tons more, and you can't kill any of them. Also, Chancellor of the Annex was especially good here thanks to Scourge's low land count. Mishra's Bauble is a work-around, but doesn't always line up correctly.
Amulet Titan
Amulet's game 1 against combo is a straight race. And unfortunately, it's slower than most combo. There was some hope after board because this deck ran 3 Mystical Dispute, but that's narrow against Neoform and pretty poor against Hypergenesis. The biggest hope against Hypergenesis was to keep Primeval Titan, Dryad of the Ilysian Grove, and five lands so that Hypergenesis immediately turned on Valakut, the Molten Pinnacle.
- Total Neoform Match Wins: 14 (28%)
- Total Hypergenesis Match Wins: 27 (54%)
This result is strongly significant, p<.01. This is in fact the most strongly significant individual result.
Dispute did a lot of work against the fast Neoforms, bumping up Amulet's win percentage. However, I also recorded more fizzles here than in other matchups. I think that this result is actually more attributable to variance than it appears. Not enough that it would have pushed it out of significance or changed the overall conclusion, but enough to alter the stats.
Oops, All Spells
Oops was a lot like Amulet in that game 1, it was a straight-up race. The difference is that, under very rare circumstances, Oops can kill on turn 1 too. Thus it could keep pace with the combos. Casting Hypergenesis against a single-creature combo deck may seem like a liability, but the creatures in Oops lose to the Hypergenesis ones, so it couldn't usually attack for the win. And that's not counting the times that Urabrask the Hidden was disruptive.
- Total Neoform Match Wins: 18 (36%)
- Total Hypergenesis Match Wins: 26 (52%)
This result is weakly significant at p<.10. It just missed the 95% level, likely one positive result away. If this were an academic paper, this is what I'd be writing my Further Research section about.
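The per-deck comparisons can be rechecked from the win counts listed in each section. This is a sketch rather than my actual spreadsheet: it assumes a pooled two-proportion z-test, reports both one-tailed and two-tailed p-values since the tail choice isn't stated anywhere above, and takes the deck labels from the surrounding prose.

```python
import math

# (Neoform wins, Hypergenesis wins) out of 50 matches each, from the lists above.
MATCHES_PER_DECK = 50
results = {
    "Death and Taxes": (12, 21),
    "4-Color Omnath": (17, 23),
    "Scourge": (16, 25),
    "Amulet": (14, 27),
    "Oops, All Spells": (18, 26),
}


def z_and_p_values(wins_a, wins_b, n):
    """Pooled two-proportion z-test; returns (z, one-tailed p, two-tailed p)."""
    pooled = (wins_a + wins_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (wins_b - wins_a) / n / std_err
    one_tailed = 0.5 * math.erfc(z / math.sqrt(2))   # upper tail only
    two_tailed = math.erfc(abs(z) / math.sqrt(2))    # both tails
    return z, one_tailed, two_tailed


stats = {deck: z_and_p_values(neo, hyper, MATCHES_PER_DECK)
         for deck, (neo, hyper) in results.items()}

for deck, (z, p_one, p_two) in stats.items():
    print(f"{deck:18s} z = {z:5.2f}  one-tailed p = {p_one:.4f}  two-tailed p = {p_two:.4f}")
```

Under these assumptions, the Oops result sits between .05 and .10 on the one-tailed reading and just above .10 on the two-tailed one, while Amulet falls below .01 either way.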
My combo decks didn't sideboard against Oops. Neither had any graveyard hate, and even then, why bother if we're racing? Oops removed the useless maindeck Leylines for Thoughtseizes, but those are only effective against Neoform, so the general tone of the matchup never changed. I've since wondered how things would have been different if Oops was also running the Belcher option like many do now, but that just wasn't a thing in November.
Half the Story
And that's the hard data. However, it's not the full story of what I found during the test. And it also doesn't address the effect of banning Simian Spirit Guide. For all that and my conclusions, tune in next week.