• Julien

UFC 232 - AI perspective. Building an MMA Fight Simulator Using Simple Algorithms


It's not Life Sciences, but it involves living things.



***UPDATE***

THE LITTLE ALGORITHM WENT 5/5 ON FIGHTS AND 4/5 ON METHOD (FINISH VS DECISION) FOR THE UFC 232 CARD!


NOT BAD FOR A QUICK ONE!

The Problem


Nothing better than the Holidays to work on little ML projects at the notebook level. No need for conventions, no unit testing, no API, no prod server… Just plain messing around with what is out there and solving a problem in a few hours for fun.

I’m an avid MMA fan and grew up watching Pride and the early days of the UFC. Being from Montréal, I watched Georges Saint-Pierre rise all the way from the UCC (yes, that’s not a spelling mistake) to the highest of ranks in the UFC and establish himself as a GOAT contender. All his fights were an excuse for me to gather friends at my place, place bets and stress out all the way to GSP’s fight. I did my share of MMA math back in the day and soon realized that there is no such thing. (Remember Serra-GSP 1 :\ )

I was 10 years old at UFC 1: The Beginning and 25 years later, I’m still as passionate about the sport and amazed at how wrong I can be when predicting fight outcomes. For some reason, I’ve never tried building an AI model to predict fights. Well, that day has come. Leading up to UFC 232, I just can’t decide who to pick to win. Let’s dig in and see if we can come up with something useful.



The Data


Every project starts with this. You have to find and gather high-quality data. No matter how good you are at building models and neural networks, if the data you are working with is poor, so will be your results. Field knowledge is important too and engineering high-quality features is key to a good model. If you know nothing about the subject, you will not reach the data’s full potential.


For this small project, I googled around and found some data, although not up to date, on fights and fighters. Reddit was full of resources and some good datasets were found on GitHub. Unfortunately, most of the datasets were missing a lot of information I would’ve liked to find. The old saying that styles make fights is so true, yet not represented enough in the public datasets.


After 45 minutes or so, I had an okay dataset to work with. The only thing I added manually to about half of the data was the fighter’s main martial art background. Why 50%? Because I got lazy and just stopped.



The Models


I built 2 different models: one to predict the winner and one to predict whether the fight will go the distance or not. Since I wanted to go quick, I ran one of ml+’s proprietary libraries that goes through all the ML algorithms and finds the one with the most potential for that particular problem. Ensemble algorithms were the strongest for our MMA math. I picked XGBoost for its speed of inference and good overall performance on the dataset.


XGBoost 5 fold CV accuracy on picking the winner = 73.4%

XGBoost 5 fold CV accuracy on predicting the finish = 61.4%
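
For readers who want to reproduce this kind of score, here is a minimal sketch of 5-fold cross-validated accuracy. It assumes scikit-learn is installed and uses its GradientBoostingClassifier as a stand-in for XGBoost, on synthetic data from make_classification. None of this is the actual fight dataset or the ml+ library, so the printed number will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the fight dataset (hypothetical: 500 fights, 20 features)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Gradient-boosted trees, used here in place of XGBoost
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)

# 5-fold CV: train on 4 folds, score accuracy on the held-out fold, repeat 5 times
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
cv_accuracy = scores.mean()
print("5-fold CV accuracy: %.1f%%" % (cv_accuracy * 100))
```

The reported figure is simply the mean of the five held-out fold accuracies.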


Not so bad. The trick now is to use this estimator to predict the outcome of simulated fights. The code below shows the very simple strategy I used to build various fight scenarios between 2 opponents. I simply introduced randomness into the fighters’ attributes that could vary from night to night depending on multiple factors: stress, fatigue, illness, home advantage, etc.
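
To make the "any given night" idea concrete, here is a minimal sketch of that perturbation on a toy fighter row. The column names and values are made up for illustration, and the noise scale of 3.5 mirrors the RANDOMNESS setting used in the full code. Note that the scale is expressed in each attribute's own units, so the same value shakes a small-valued attribute much harder than a large-valued one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical fighter row: names and values are illustrative only
fighter = pd.DataFrame({"strikes_per_round": [12.0],
                        "control_time_per_round": [45.0],
                        "reach": [74.0]})
modifiable = ["strikes_per_round", "control_time_per_round"]  # can vary night to night
RANDOMNESS = 3.5  # standard deviation of the noise, in each attribute's own units

simulated = fighter.copy()
# Draw each modifiable value from a normal distribution centred on the recorded value
simulated[modifiable] = rng.normal(loc=fighter[modifiable].values, scale=RANDOMNESS)

print(simulated)
```

Attributes left out of the modifiable list, such as reach here, stay fixed across simulations.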


I then simulated 1500 fights per corner assignment and averaged them to get a weighted result. The dataset I worked with was about one year old, so the Conor-Khabib showdown hadn’t occurred yet. I ran it for fun and the results are interesting.


import numpy as np
import pandas as pd
from tqdm import tqdm
from joblib import load

#load stuff
model = load('winner.joblib')
model2 = load('outcome.joblib')
modifiable = load('modifiable.joblib') #list of characteristics that could change any given night
le = load('encoder.joblib')
dat = pd.read_excel("for_predictions.xlsx")
records = pd.read_csv("records.csv")

#Being Lazy, should've done this in the EDA
dat.dropna(axis=0, subset=["Sub_Avg_fighter"], inplace=True) #fighter missing this column lack a lot of data
dat.drop(["winperc", "win_factot"], axis=1, inplace=True)

#For readability
dat1 = dat.rename(columns={"Name":"Name1"}).copy()
dat2 = dat.rename(columns={"Name":"Name2"}).copy()

def mma_math_ftw(FIGHTER_BLUE, FIGHTER_RED, model, model2, RANDOMNESS, dat1, dat2,modifiable, SIMNUM, WEIGHT_CLASS):
    
    temp = dat1.loc[dat1.Name1 == FIGHTER_BLUE].reset_index(drop=True).add_suffix('_x')
    temp2 = dat2.loc[dat2.Name2 == FIGHTER_RED].reset_index(drop=True).add_suffix('_y')
    temp3 = pd.concat([temp, temp2], axis=1).drop(["Name1_x", "Name2_y"], axis=1)
    perc_data = len(temp3.dropna(axis=1).columns)/len(temp3.columns)

    if perc_data >= .9:
        print("\nInformation on fighters is complete, no missing data")
        
    elif perc_data <= 0.5:
        print("\nMissing information on fighters, predictions might not be as accurate as they should!")
        
    else:
        print("\nSome infos are missing, careful when betting your mortgage!")
            
    
    compiled_winner = []
    compiled_finish = []
    
    for i in tqdm(range(SIMNUM)):
        
        dat = temp3.copy()
        
        #any given saturday RANDOMNESS
        dat[modifiable] = np.random.normal(dat[modifiable].values, scale=RANDOMNESS)
        
        # feature eng - super high level metrics. 
        dat["hurt_diff"] = dat["hurt_factor_y"] - dat["hurt_factor_x"]
        dat["damage_diff"] = dat["delivered_damage_y"] - dat["delivered_damage_x"]
        dat["reach_height_diff"] = dat["reach_height_y"] - dat["reach_height_x"]
        dat["xp_diff"] = dat["total_fights_y"] - dat["total_fights_x"]
        dat["td_perc_diff"] = dat["td_perc_x"] - dat["td_perc_y"]
        dat['grappling_mastery_per_round_diff'] = (dat['grappling_mastery_per_round_x']
                                                   - dat['grappling_mastery_per_round_y'])

        dat['tot_strikes_landed_per_round_diff'] = (dat['tot_strikes_landed_per_round_x']
                                                    - dat['tot_strikes_landed_per_round_y'])

        dat['tot_sig_strikes_landed_per_round_diff'] = (dat['tot_sig_strikes_landed_per_round_x']
                                                        - dat['tot_sig_strikes_landed_per_round_y'])

        dat['perc_sig_strikes_per_round_diff'] = (dat['perc_sig_strikes_per_round_x']
                                                  - dat['perc_sig_strikes_per_round_y'])

        dat['tot_control_time_per_round_diff'] = (dat['tot_control_time_per_round_x']
                                                  - dat['tot_control_time_per_round_y'])

        dat['tot_neutral_time_per_round_diff'] = (dat['tot_neutral_time_per_round_x']
                                                  - dat['tot_neutral_time_per_round_y'])

        dat['control_perc_per_round_diff'] = (dat['control_perc_per_round_x']
                                              - dat['control_perc_per_round_y'])
        
        
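        # NOTE: 'tot_control_time_per_round_x' and 'hurt_factor_x' appear twice in this list,
        # so their _y counterparts stay in the feature set. Left as-is on the assumption that
        # the saved models were trained on exactly the columns this code produces.
        dat.drop(['control_perc_per_round_x', 'control_perc_per_round_y',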
              'tot_neutral_time_per_round_x', 'tot_neutral_time_per_round_y',
              'tot_control_time_per_round_x','tot_control_time_per_round_x',
              'perc_sig_strikes_per_round_x', 'perc_sig_strikes_per_round_y',
              'tot_sig_strikes_landed_per_round_x', 'tot_sig_strikes_landed_per_round_y',
              'tot_strikes_landed_per_round_x', 'tot_strikes_landed_per_round_y',
              'grappling_mastery_per_round_x', 'grappling_mastery_per_round_y',
              'td_perc_x', 'td_perc_y',
              'total_fights_x', 'total_fights_y',
              'reach_height_x', 'reach_height_y',
              'delivered_damage_x', 'delivered_damage_y',
              'hurt_factor_x', 'hurt_factor_x'], axis=1, inplace=True)
        
        dat['WEIGHT_CLASS'] = le.transform([WEIGHT_CLASS])
        dat = dat.fillna(-1) #lazy but it's the Holidays. 
        
        #Predict ze winning dude
        win = model.predict(dat.values)
        compiled_winner.append(win)
        
        #Outcome - Will it go to distance, Joe??
        finish_him = model2.predict(dat.values)
        compiled_finish.append(finish_him)
      
        del dat
        
    b = np.mean(compiled_winner)*100
    r = 100 - b
    o = 100 - (np.mean(compiled_finish)*100)
    
    del compiled_finish, compiled_winner, temp, temp2, temp3, perc_data
    
    return b,r,o #coincidence?!?!
    


# # Let's predict some fights!


SIMNUM = 1500
RANDOMNESS = 3.5 #yes, 3.5 std dev. randomness. It is a lot. But anything can happen on fight night
WEIGHT_CLASS = 'Lightweight'



#corner variability - making 2 simulations based on red or blue corners
FIGHTER_BLUE = 'Conor McGregor'
FIGHTER_RED = 'Khabib Nurmagomedov'
b1,r1,o1 = mma_math_ftw(FIGHTER_BLUE, FIGHTER_RED, model, model2, RANDOMNESS, dat1, dat2, modifiable, SIMNUM, WEIGHT_CLASS)

print("\nChanging Corners...")
FIGHTER_BLUE2 = 'Khabib Nurmagomedov'
FIGHTER_RED2 = 'Conor McGregor'
b2,r2,o2 = mma_math_ftw(FIGHTER_BLUE2, FIGHTER_RED2, model, model2, RANDOMNESS, dat1, dat2, modifiable, SIMNUM, WEIGHT_CLASS)


# Prediction
print("\n================================================")
print("%s: %.2f perc chances of winning" % (FIGHTER_RED, (r1+b2)/2))
print("%s: %.2f perc chances of winning" % (FIGHTER_BLUE, (b1+r2)/2))
print("Chances that it will not go to distance: %.2f" % ((o1+o2)/2))
print("================================================")

0%|          | 3/1500 [00:00<01:08, 21.83it/s]
Information on fighters is complete, no missing data

100%|██████████| 1500/1500 [00:59<00:00, 25.35it/s]
  0%|          | 3/1500 [00:00<00:58, 25.46it/s]
Changing Corners...

Information on fighters is complete, no missing data

100%|██████████| 1500/1500 [00:59<00:00, 25.40it/s]
================================================
Khabib Nurmagomedov: 80.67 perc chances of winning
Conor McGregor: 19.33 perc chances of winning
Chances that it will not go to distance: 73.37
================================================


Khabib beats Conor 8 times out of 10, with a finish more than 7 times out of 10. Nice.
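
The corner-averaging step can be sketched on its own. The function name and the input percentages below are hypothetical, chosen only to show the arithmetic: each fighter's final number is the mean of his blue-corner result in one run and his red-corner result in the other.

```python
def average_corners(b1, r1, b2, r2):
    """Combine two simulation runs where the fighters swapped corners.

    Run 1: fighter A in blue (b1), fighter B in red (r1).
    Run 2: fighter B in blue (b2), fighter A in red (r2).
    Returns (win % for A, win % for B).
    """
    fighter_a = (b1 + r2) / 2
    fighter_b = (r1 + b2) / 2
    return fighter_a, fighter_b

# Hypothetical run outputs, in percent (not the actual simulation results)
a, b = average_corners(b1=25.0, r1=75.0, b2=85.0, r2=15.0)
print("Fighter A: %.2f, Fighter B: %.2f" % (a, b))
```

Because the red percentage is 100 minus the blue percentage in each run, the two averaged numbers always sum to 100.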



UFC 232 Card - The Predictions


The same exercise was applied to the UFC 232 card. I’m still torn on the Jones-Gus fight. Yikes.





Based on the AI, we can expect a highly contested main event ending in dramatic fashion. With Jones favored at -285, betting on Gus is a very attractive option. The co-main is very interesting, suggesting a finish likely to be delivered by Nunes. Chiesa should win against Condit (so sad, I love Condit, AI must be wrong) and we can expect Anderson and Volkanovski to get the upper hand against their opponents, most likely going the distance.
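
To see why the underdog looks attractive, it helps to turn the moneyline into an implied probability. The helper below is a small sketch using the standard conversion for American odds. The function name is mine, and the calculation ignores the bookmaker's margin.

```python
def implied_probability(american_odds):
    """Convert American moneyline odds to the implied win probability."""
    if american_odds < 0:
        # Favorite: risk |odds| to win 100
        return -american_odds / (-american_odds + 100)
    # Underdog: risk 100 to win `odds`
    return 100 / (american_odds + 100)

jones = implied_probability(-285)
print("Implied probability at -285: %.1f%%" % (jones * 100))  # prints 74.0%
```

At -285 the market prices Jones at roughly 74%, so a model that sees the fight as much closer than that would find value on the other side.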


Here you have it! A simple way of using an estimator to predict the outcome of simulated matches by introducing randomness into the fighters’ attributes. I would be very curious to see how far this idea can be pushed with high-quality data.


Looking forward to fight night and applying some 2.0 MMA math.
