Detailed analysis

At a basic level, the dataset comprised five tables: individual player game statistics, game-level results, venues, players, and teams.

If you want to see exactly what makes up the data, the tables below describe each one; otherwise, feel free to skip ahead to "Building the Dataframe and Features".


Table 1: Player statistics for every individual game between 2010 and Round 9, 2025 (table name: player_game_stats)

| Column | Data Type | Definition |
| --- | --- | --- |
| id | integer | Unique record for player game stat entry |
| player_id | integer | Player identifier |
| game_id | integer | Game identifier |
| team_id | integer | Team the player was playing for |
| opponent_team_id | integer | Team the player was playing against |
| kicks | integer | Kicks |
| marks | integer | Marks |
| handballs | integer | Handballs |
| disposals | integer | Kicks + handballs |
| goals | integer | Goals scored |
| behinds | integer | Behinds scored |
| hit_outs | integer | Hit outs |
| tackles | integer | Tackles |
| rebounds | integer | Moving the ball from the defensive 50 into the midfield |
| inside_50 | integer | Moving the ball from the midfield into the forward 50; excludes multiple entries within the same chain |
| clearances | integer | First effective disposal in a chain that clears the stoppage area, or an ineffective/clanger kick that clears stoppage |
| clangers | integer | Error made by a player resulting in a negative result for their side |
| frees_for | integer | Frees awarded to player |
| frees_against | integer | Frees conceded by player |
| brownlow_votes | integer | Brownlow Medal votes, awarded by umpires |
| contested_possessions | integer | Possession won when the ball is in dispute (looseball-gets, hardball-gets, contested marks, etc.) |
| uncontested_possessions | integer | Possessions gained under no physical pressure |
| contested_marks | integer | A mark taken under physical pressure of an opponent or in a pack |
| marks_inside_50 | integer | Marks taken inside forward 50 |
| one_percenters | integer | Defensive effort (spoil, knock-on, smother or shepherd) |
| bounces | integer | Bounces |
| goal_assists | integer | Creating a goal by getting the ball to a teammate by any means (disposal, knock-on, ground kick, hitout, etc.) |
| time_on_ground_pct | numeric | Time on ground during a game divided by total game time (%) |

For more information on the definitions above, or how they are recorded, visit the AFL website. The data was sourced from afltables.com, footywire.com, and the AFL dataset on Kaggle. For the purposes of this analysis I am going to assume everything was recorded correctly. To build confidence in that assumption, I randomly spot-checked roughly five games from each season; all of the underlying data can be found on my GitHub.

Table 2: Game-level data between 2010 and Round 9, 2025 (table name: games)

| Column | Data Type | Definition |
| --- | --- | --- |
| round | text | Round label (e.g. R3, QF) |
| game_date | date | Date game was played (YYYY-MM-DD) |
| venue_id | integer | Ground the game was played at (venue ID) |
| home_team_id | integer | Home team ID |
| away_team_id | integer | Away team ID |
| season_year | integer | Season year |
| afltables_game_id | integer | Game ID from AFL Tables source |
| home_score_str | character varying | Home team score as a string (e.g. "10.9.69") |
| away_score_str | character varying | Away team score as a string (e.g. "4.17.41") |
| home_score_goals | integer | Home team goals |
| home_score_behinds | integer | Home team behinds |
| home_score_total | integer | Home team total score |
| away_score_goals | integer | Away team goals |
| away_score_behinds | integer | Away team behinds |
| away_score_total | integer | Away team total score |
| margin | integer | Score margin as an absolute value |
| home_result | character | Result for home team (W = win, L = loss, D = draw) |
| away_result | character | Result for away team (W = win, L = loss, D = draw) |

Table 3: Every venue a game occurred in (table name: venues)

| Column | Data Type | Definition |
| --- | --- | --- |
| id | integer | Unique venue ID |
| name | text | Venue name |
| city | text | City or town where the venue is located |
| state | text | State the venue is located in (or country, if outside Australia) |
| location | text | Latitude & longitude of the stadium |

Table 4: Every player who played a game during this time (table name: players)

| Column | Data Type | Definition |
| --- | --- | --- |
| id | integer | Unique player ID |
| full_name | text | Player's full name |
| birth_date | date | Date of birth (YYYY-MM-DD) |
| height_cm | integer | Player's height in centimeters |
| weight_kg | integer | Player's weight in kilograms |
| afltables_id | integer | Player ID from AFL Tables source |
| position | character varying | Player position(s), e.g. 'Midfield', 'Defender' |

Table 5: Every team (table name: teams)

| Column | Data Type | Definition |
| --- | --- | --- |
| id | integer | Unique team ID |
| name | text | Team name |
| short_code | text | Team abbreviation or short code |
| home_ground | integer | Venue ID for team's primary home ground |
| state | text | State where team is based |

All of these tables are available on my GitHub page.

Building the Dataframe and Features

With the data prepared, I began by building an initial dataframe containing all games:


from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_pass}@{db_host}:{db_port}/{db_name}")
df_games = pd.read_sql("SELECT * FROM games", engine)
  

I also built a helper function to sort rounds, after noticing that sorting the round labels as plain strings placed R10 immediately after R1 rather than after R2 through R9:


import re

# Helper function to sort rounds numerically, with finals last
def get_round_order(row):
    # Match regular rounds: 'R1', 'R23', etc.
    match = re.match(r'R(\d+)$', row['round'])
    if match:
        return int(match.group(1))
    # Finals rounds mapping (AFL convention)
    finals_order = {
        'REF': 100,  # Elimination Final (week 1)
        'RQF': 101,  # Qualifying Final (week 1)
        'RSF': 102,  # Semi Final (week 2)
        'RPF': 103,  # Preliminary Final (week 3)
        'RGF': 104,  # Grand Final (week 4)
    }
    # Look up the full label: stripping the leading character would turn
    # 'RQF' into 'QF' and never match the keys above
    return finals_order.get(row['round'], 999)  # Unknown rounds sort last
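
As a quick illustration of the problem, with some hypothetical round labels:

# A plain string sort puts R10 between R1 and R2
print(sorted(['R1', 'R10', 'R2']))  # ['R1', 'R10', 'R2']

# The helper expects a row with a 'round' key, so plain dicts work too
rows = [{'round': r} for r in ['R1', 'R10', 'R2']]
print(sorted(rows, key=get_round_order))  # R1, R2, R10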
  

The first variable I wanted to look at was player movement. To calculate this I initially only looked at changes to the line-up across one round; I then built out changes across two and three rounds as well. After analysing each variant, I ended up keeping only the two-round version: the three measures were so highly correlated (multicollinearity) that including more than one would make it difficult to interpret the effect of each individually.


import numpy as np

query_movement = """
SELECT 
    pgs.game_id, 
    g.season_year, 
    g.round, 
    pgs.team_id, 
    pgs.player_id
FROM player_game_stats pgs
JOIN games g ON pgs.game_id = g.id
ORDER BY g.season_year, g.round, pgs.team_id
"""
df_players = pd.read_sql(query_movement, engine)

players_per_team_round = (
    df_players
    .groupby(['season_year', 'team_id', 'round'])
    .agg({'player_id': lambda x: set(x)})
    .reset_index()
    .sort_values(['season_year', 'team_id', 'round'])
)

# Use the helper function to sort the rounds
players_per_team_round['round_order'] = players_per_team_round.apply(get_round_order, axis=1)
players_per_team_round = players_per_team_round.sort_values(
    ['season_year', 'team_id', 'round_order']
).reset_index(drop=True)

# With the dataframe sorted, grab the set of players from the previous round,
# plus the previous 2 and 3 rounds, for each team
players_per_team_round['prev_players_1'] = (
    players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(1)
)
players_per_team_round['prev_players_2'] = (
    players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(2)
)
players_per_team_round['prev_players_3'] = (
    players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(3)
)

# Functions to calculate number of changes vs previous 1, 2, 3 rounds
def num_changes_vs_prev(row):
    if pd.isna(row['prev_players_1']):
        return np.nan
    return len(row['player_id'].symmetric_difference(row['prev_players_1']))

def num_changes_vs_prev2(row):
    if pd.isna(row['prev_players_1']) or pd.isna(row['prev_players_2']):
        return np.nan
    prev_union = row['prev_players_1'].union(row['prev_players_2'])
    return len(row['player_id'].symmetric_difference(prev_union))

def num_changes_vs_prev3(row):
    if pd.isna(row['prev_players_1']) or pd.isna(row['prev_players_2']) or pd.isna(row['prev_players_3']):
        return np.nan
    prev_union = row['prev_players_1'].union(row['prev_players_2']).union(row['prev_players_3'])
    return len(row['player_id'].symmetric_difference(prev_union))

players_per_team_round['num_changes_1'] = players_per_team_round.apply(num_changes_vs_prev, axis=1)
players_per_team_round['num_changes_2'] = players_per_team_round.apply(num_changes_vs_prev2, axis=1)
players_per_team_round['num_changes_3'] = players_per_team_round.apply(num_changes_vs_prev3, axis=1)
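
The multicollinearity mentioned above is easy to verify. A quick look at the pairwise correlations between the three windows (an illustrative check; output not shown here):

# Pairwise correlations between the 1-, 2- and 3-round change counts;
# values close to 1 confirm the multicollinearity described above
change_cols = ['num_changes_1', 'num_changes_2', 'num_changes_3']
print(players_per_team_round[change_cols].corr())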
  

I also wanted to look at different variables relating to team form, as well as whether having more or less rest between games made a difference. To do this I built out a dataframe with each team's average scores over the previous five games, along with the number of days between games:


# calculate days between games as well as rolling averages of previous 5 games
df_games['game_date'] = pd.to_datetime(df_games['game_date'])

df_home = df_games[['id', 'season_year', 'round', 'game_date', 'home_team_id', 'home_score_total', 'away_score_total']].copy()
df_home = df_home.rename(columns={
    'home_team_id': 'team_id',
    'home_score_total': 'score_scored',
    'away_score_total': 'score_conceded'
})

df_away = df_games[['id', 'season_year', 'round', 'game_date', 'away_team_id', 'away_score_total', 'home_score_total']].copy()
df_away = df_away.rename(columns={
    'away_team_id': 'team_id',
    'away_score_total': 'score_scored',
    'home_score_total': 'score_conceded'
})

df_team_games = pd.concat([df_home, df_away], ignore_index=True)
df_team_games = df_team_games.sort_values(['team_id', 'game_date']).reset_index(drop=True)

# calculate average score and margin based on rolling average of previous 5 games, minimum 1 game and only including games in season
df_team_games['avg_scored_prev5'] = (
    df_team_games.groupby(['team_id', 'season_year'])['score_scored']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

df_team_games['avg_conceded_prev5'] = (
    df_team_games.groupby(['team_id', 'season_year'])['score_conceded']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

df_team_games['margin'] = df_team_games['score_scored'] - df_team_games['score_conceded']

df_team_games['avg_margin_prev5'] = (
    df_team_games.groupby(['team_id', 'season_year'])['margin']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

# calculate days since previous game, only including games in season
df_team_games['days_since_prev'] = (
    df_team_games.groupby(['team_id', 'season_year'])['game_date']
    .diff()
    .dt.days
)
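
The shift(1) in the rolling averages matters: it excludes the current game from its own average, so each row only uses information available before that game was played. A toy example with made-up scores:

# shift(1) pushes each score down one row, so the rolling mean for game N
# only covers games 1 to N-1
s = pd.Series([100, 80, 120, 90, 110, 70])
print(s.shift(1).rolling(5, min_periods=1).mean())
# NaN, 100.0, 90.0, 100.0, 97.5, 100.0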
  

Lastly, I also took a look at the average age of players within a game:


# Calculate the average team age
query_age = """
SELECT 
    pgs.game_id, 
    pgs.team_id, 
    pgs.player_id, 
    p.birth_date, 
    g.game_date
FROM player_game_stats pgs
JOIN games g ON pgs.game_id = g.id
JOIN players p ON pgs.player_id = p.id
"""
df_players_age = pd.read_sql(query_age, engine)

df_players_age['age'] = (
    (pd.to_datetime(df_players_age['game_date']) - pd.to_datetime(df_players_age['birth_date'])).dt.days / 365.25
)

avg_age_per_team_game = (
    df_players_age
    .groupby(['game_id', 'team_id'])['age']
    .mean()
    .reset_index()
    .rename(columns={'age': 'avg_age', 'game_id': 'id'}) 
)
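
Dividing the day count by 365.25 gives a decimal age that accounts for leap years. A quick sanity check with made-up dates:

# A player born 10 May 2000 playing on 10 May 2025 should be ~25.0 years old
age = (pd.Timestamp('2025-05-10') - pd.Timestamp('2000-05-10')).days / 365.25
print(round(age, 2))  # 25.0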
  

Once I had all of my data together, I merged the dataframes above to create a single dataframe with all the features:


# merge everything together into single dataframe
team_game_features = (
    pd.merge(
        df_team_games,
        players_per_team_round[['season_year', 'team_id', 'round', 'num_changes_1', 'num_changes_2', 'num_changes_3']],
        on=['season_year', 'team_id', 'round'],
        how='left'
    )
    .merge(
        avg_age_per_team_game,
        on=['id', 'team_id'],
        how='left'
    )
)
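
Since both joins are left joins keyed on team and round (or team and game), the merged frame should still have exactly one row per team per game. A quick sanity check, as a minimal sketch:

# Left joins should preserve the row count: one row per team per game
assert len(team_game_features) == len(df_team_games)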
  

In order to better analyse team-level data I also needed to split each game into the home team and the away team. This allows me to look at each team's performance individually, regardless of whether they played at home or away.


# Get win/loss results

df_home_result = df_games[['id', 'season_year', 'round', 'home_team_id', 'home_result']].copy()
df_home_result = df_home_result.rename(columns={
    'id': 'game_id',
    'home_team_id': 'team_id',
    'home_result': 'result'
})
df_home_result['home_or_away'] = 'home'

df_away_result = df_games[['id', 'season_year', 'round', 'away_team_id', 'away_result']].copy()
df_away_result = df_away_result.rename(columns={
    'id': 'game_id',
    'away_team_id': 'team_id',
    'away_result': 'result'
})
df_away_result['home_or_away'] = 'away'

df_results = pd.concat([df_home_result, df_away_result], ignore_index=True)
  

Now I was almost ready to start looking at the data. First, though, I needed to merge the results dataframe into my features dataframe, and then add a new column so that I could account for home-ground advantage.


# Merge results with what we want to analyse on
team_game_features = team_game_features.merge(
    df_results[['game_id', 'team_id', 'result', 'home_or_away']],
    left_on=['id', 'team_id'],
    right_on=['game_id', 'team_id'],
    how='left'
).drop(columns=['game_id'])

team_game_features['target_win'] = team_game_features['result'].map({'W': 1, 'L': 0, 'D': np.nan})

# Add home ground advantage
team_game_features['is_home'] = (team_game_features['home_or_away'] == 'home').astype(int)

# Prepare features and target
features = [
    'avg_scored_prev5', 
    'avg_conceded_prev5', 
    'days_since_prev', 
    'num_changes_2',
    'avg_age',
    'is_home'
]
analysis_df = team_game_features.dropna(subset=['target_win'] + features)
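
At this point each game contributes two rows (one per team), so after dropping draws the wins and losses should be roughly balanced. A quick check (illustrative):

# Each game appears twice (a winner and a loser), so the classes should be ~50/50
print(analysis_df['target_win'].value_counts(normalize=True))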
  

Now that everything was set up, I was able to create my model and take a look at how well it could predict a winner. Beyond that, I also wanted to understand which variables were useful and which weren't.

I used logistic regression to do this, as it is well suited to binary outcomes like win/loss. Logistic regression assigns a "coefficient" to each variable (from the features list we just built), which quantifies how much that variable contributes to the odds of a team winning a game. I standardised the features first so that the coefficients are on comparable scales: a coefficient near zero means the variable adds little, while a larger absolute value (positive or negative) means a stronger association with winning or losing.


# Create the model
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = analysis_df[features]
y = analysis_df['target_win']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)
# Create coefficients
coefs = pd.Series(model.coef_[0], index=features)
print(coefs.sort_values())
Output:
avg_conceded_prev5 -0.229053
num_changes_2 -0.083847
days_since_prev -0.030959
avg_age 0.270478
avg_scored_prev5 0.272462
is_home 0.312109
dtype: float64

As you can see, the coefficients point in the directions we would expect: recent scoring form, team age and playing at home all push the odds of winning up, while conceding more in recent games, making more line-up changes and having a longer break between games all push them down slightly.
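
To make a coefficient more tangible, it can be converted to an odds ratio. A quick worked example using the is_home coefficient from the output above:

import numpy as np

# exp(coefficient) gives the multiplicative change in win odds per one
# standard deviation increase in the scaled feature
print(np.exp(0.312109))  # ~1.37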

I also wanted to understand the model's accuracy, i.e., what percentage of games the model actually predicted correctly:


# Check accuracy of model
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_scaled)
acc = accuracy_score(y, y_pred)
print(f"Accuracy: {acc:.3f} ({acc*100:.1f}%)")
Output:
Accuracy: 0.617 (61.7%)

The accuracy of the model was 61.7%, basically 5–6 out of 9 tips correct each round.

Lastly, I took a look at which variables were driving the predictions. For this I used the Area Under the ROC Curve (AUC), a more nuanced measure of performance than raw accuracy: it reflects how well the model ranks winners ahead of losers across all decision thresholds (0.5 is no better than chance, 1.0 is perfect). I ran the model multiple times, each time removing one feature, and observed how much the AUC dropped without it:


# Check how the AUC changes when each feature is dropped
from sklearn.metrics import roc_auc_score

baseline_auc = roc_auc_score(y, model.predict_proba(X_scaled)[:, 1])
print("Baseline AUC:", baseline_auc)

for i, feature in enumerate(features):
    X_reduced = np.delete(X_scaled, i, axis=1)
    model_temp = LogisticRegression(max_iter=1000)
    model_temp.fit(X_reduced, y)
    auc = roc_auc_score(y, model_temp.predict_proba(X_reduced)[:, 1])
    print(f"AUC without {feature}: {auc:.3f} (Drop: {baseline_auc - auc:.3f})")
Output:
Baseline AUC: 0.6717989292662536
AUC without avg_scored_prev5: 0.658 (Drop: 0.014)
AUC without avg_conceded_prev5: 0.665 (Drop: 0.007)
AUC without days_since_prev: 0.672 (Drop: 0.000)
AUC without num_changes_2: 0.671 (Drop: 0.001)
AUC without avg_age: 0.659 (Drop: 0.013)
AUC without is_home: 0.650 (Drop: 0.022)

Again, this doesn't tell us much: none of the variables dramatically changes the model's performance when removed, suggesting that each has only a modest influence on the outcome on its own.

I also tested whether including interactions between variables (e.g., combining age and home advantage) using polynomial features would improve the predictive accuracy.


# Let's look at whether combinations of factors can predict a winner
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_poly = poly.fit_transform(X_scaled)
feature_names = poly.get_feature_names_out(features)
model_poly = LogisticRegression(max_iter=1000)
model_poly.fit(X_poly, y)

coefs_poly = pd.Series(model_poly.coef_[0], index=feature_names)
# Show the biggest absolute coefficients (most "important" effects)
print(coefs_poly.abs().sort_values(ascending=False).head(20))
  
Output:
is_home 0.313096
avg_scored_prev5 0.280495
avg_age 0.268837
avg_conceded_prev5 0.209431
num_changes_2 0.084217
avg_scored_prev5 is_home 0.064832
avg_scored_prev5 avg_conceded_prev5 0.049802
num_changes_2 is_home 0.039212
days_since_prev is_home 0.037942
days_since_prev avg_age 0.029295
avg_conceded_prev5 is_home 0.027139
num_changes_2 avg_age 0.024612
days_since_prev 0.021512
avg_scored_prev5 days_since_prev 0.019434
avg_conceded_prev5 num_changes_2 0.016678
avg_conceded_prev5 days_since_prev 0.016134
days_since_prev num_changes_2 0.015158
avg_scored_prev5 num_changes_2 0.013681
avg_conceded_prev5 avg_age 0.012748
avg_age is_home 0.012529
dtype: float64

What the above table shows is that, after adding the interaction terms, the variables with the largest impact do not change: the main effects still dominate, and no combination of variables carries much weight on its own. Checking the accuracy of the interaction model in the same way we did above confirms this:


# Evaluate predictive accuracy
y_pred_poly = model_poly.predict(X_poly)
y_prob_poly = model_poly.predict_proba(X_poly)[:,1]
print("Poly Accuracy:", accuracy_score(y, y_pred_poly))
print("Poly AUC:", roc_auc_score(y, y_prob_poly))
  
Output:
Poly Accuracy: 0.6233535065859737
Poly AUC: 0.6746580720024911

What does this all mean? Well, combining variables together didn't really move the dial: accuracy only rose slightly, from 61.7% to 62.3%.

Conclusion

Just picking the home team is correct about 56% of the time, so if you want a model that does slightly better than simply backing whichever team is playing at home, this is the model for you.
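
For reference, that home-team baseline can be computed directly from the analysis dataframe. A minimal sketch:

# Baseline: predict a win for every home team and a loss for every away team
baseline_acc = (analysis_df['target_win'] == analysis_df['is_home']).mean()
print(f"Home-team baseline accuracy: {baseline_acc:.3f}")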