Detailed analysis
At a basic level, the dataset included:
- All matches that occurred from 2010 to round 9, 2025 inclusive
- All players who played in any of these games (games they played outside of this time period were not included)
- All the venues that any of these games were played at
If you want to see what makes up the data, then feel free to examine the tables below.
You can skip the detailed data tables if you want:
Show/hide data tables (click to expand/collapse)
Table 1: Every individual game between 2010 and round 9, 2025 (table name: player_game_stats)
Column | Data Type | Definition |
---|---|---|
id | integer | Unique record for player game stat entry |
player_id | integer | Player identifier |
game_id | integer | Game identifier |
team_id | integer | Team the player was playing for |
opponent_team_id | integer | Team the player was playing against |
kicks | integer | Kicks |
marks | integer | Marks |
handballs | integer | Handballs |
disposals | integer | Kicks + handballs |
goals | integer | Goals scored |
behinds | integer | Behinds scored |
hit_outs | integer | Hit outs |
tackles | integer | Tackles |
rebounds | integer | Moving the ball from the defensive 50 into the midfield |
inside_50 | integer | Moving the ball from the midfield into the forward 50; excludes multiple entries within same chain |
clearances | integer | First effective disposal in a chain that clears the stoppage area, or an ineffective/clanger kick that clears stoppage |
clangers | integer | Error made by a player resulting in a negative result for their side |
frees_for | integer | Frees awarded to player |
frees_against | integer | Frees conceded by player |
brownlow_votes | integer | Brownlow Medal votes, awarded by umpires |
contested_ possessions | integer | Possession won when the ball is in dispute (looseball-gets, hardball-gets, contested marks, etc.) |
uncontested_possessions | integer | Possessions gained under no physical pressure |
contested_marks | integer | A mark under physical pressure of an opponent or in a pack |
marks_inside_50 | integer | Marks taken inside forward 50 |
one_percenters | integer | Defensive effort (spoil, knock-on, smother or shepherd) |
bounces | integer | Bounces |
goal_assists | integer | Creating a goal by getting the ball to a teammate by any means (disposal, knock-on, ground kick, hitout, etc.) |
time_on_ground_pct | numeric | Time on ground during a game divided by total game time (%) |
For more information on the definitions of above, or how they are recorded, visit the AFL site. The data was sourced from afltables.com, footywire.com, and the AFL dataset on Kaggle. For the purpose of this analysis I am going to assume everything was correctly recorded and that my data is correct. I conducted random spot checking of ~5 games each season to ensure confidence, and all information can be found in my github.
Table 2: Game level data between 2010 and round 9, 2025 (table name: games)
Column | Data Type | Definition |
---|---|---|
round | text | Round label (e.g R3, QF) |
game_date | date | Date game was played (YYYY-MM-DD) |
venue_id | integer | Ground the game was played at (venue ID) |
home_team_id | integer | Home team ID |
away_team_id | integer | Away team ID |
season_year | integer | Season year |
afltables_game_id | integer | Game ID from AFL Tables source |
home_score_str | character varying | Home team score as a string (e.g "10.9.69") |
away_score_str | character varying | Away team score as a string (e.g "4.17.41") |
home_score_goals | integer | Home team goals |
home_score_behinds | integer | Home team behinds |
home_score_total | integer | Home team total score |
away_score_goals | integer | Away team goals |
away_score_behinds | integer | Away team behinds |
away_score_total | integer | Away team total score |
margin | integer | Score margin as an absolute value |
home_result | character | Result for home team (W = win, L = loss, D = draw) |
away_result | character | Result for away team (W = win, L = loss, D = draw) |
Table 3: Every venue a game occurred in (table name: venues)
Column | Data Type | Definition |
---|---|---|
id | integer | Unique venue ID |
name | text | Venue name |
city | text | City or town where the venue is located |
state | text | State the venue is located (or country if outside of Australia) |
location | text | Latitude & longitude of the stadium |
Table 4: Every player who played a game during this time (table name: players)
Column | Data Type | Definition |
---|---|---|
id | integer | Unique player ID |
full_name | text | Player's full name |
birth_date | date | Date of birth (YYYY-MM-DD) |
height_cm | integer | Player's height in centimeters |
weight_kg | integer | Player's weight in kilograms |
afltables_id | integer | Player ID from AFL Tables source |
position | character varying | Player position(s), e.g 'Midfield', 'Defender' |
Table 5: Every team (table name: teams)
Column | Data Type | Definition |
---|---|---|
id | integer | Unique team ID |
name | text | Team name |
short_code | text | Team abbreviation or short code |
home_ground | integer | Venue ID for team's primary home ground |
state | text | State where team is based |
All of these tables are available on my GitHub page.
Building the Dataframe and Features
With the data prepared, I began by building an initial dataframe containing all games:
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_pass}@{db_host}:{db_port}/{db_name}")
df_games = pd.read_sql("SELECT * FROM games", engine)
I also built a helper function to sort rounds (this was done after noticing that the format I had the rounds in meant that sorting them meant R10 following R1 instead of R2):
# Helper function to sort rounds
def get_round_order(row):
# Match regular rounds: 'R1', 'R23', etc.
match = re.match(r'R(\d+)$', row['round'])
if match:
return int(match.group(1))
# Finals rounds mapping (AFL convention)
finals_order = {
'REF': 100, # Elimination Final (week 1)
'RQF': 101, # Qualifying Final (week 1)
'RSF': 102, # Semi Final (week 2)
'RPF': 103, # Preliminary Final (week 3)
'RGF': 104, # Grand Final (week 4)
}
code = row['round'][1:]
return finals_order.get(code, 999) # Unknown finals get 999
The first variable I wanted to look at is player movement. In order to calculate this I initially only looked at changes across 1 round. I then built out changes of 2 and 3 rounds, and after analysis on each of the variables I ended up only using changes over 2 rounds. This was because these variables were so highly correlated (multicollinearity) that including more than one would make it difficult to interpret the effect of each one individually.
query_movement ="""
SELECT
pgs.game_id,
g.season_year,
g.round,
pgs.team_id,
pgs.player_id
FROM player_game_stats pgs
JOIN games g ON pgs.game_id = g.id
ORDER BY g.season_year, g.round, pgs.team_id
"""
df_players = pd.read_sql(query_movement, engine)
players_per_team_round = (
df_players
.groupby(['season_year', 'team_id', 'round'])
.agg({'player_id': lambda x: set(x)})
.reset_index()
.sort_values(['season_year', 'team_id', 'round'])
)
# Use the helper function to sort the rounds
players_per_team_round['round_order'] = players_per_team_round.apply(get_round_order, axis=1)
players_per_team_round = players_per_team_round.sort_values(
['season_year', 'team_id', 'round_order']
).reset_index(drop=True)
# Once we have the dataframe, now we need to sort it and calculate how many players played together in previous round, plus previous 2 & 3 rounds
players_per_team_round['prev_players_1'] = (
players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(1)
)
players_per_team_round['prev_players_2'] = (
players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(2)
)
players_per_team_round['prev_players_3'] = (
players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(3)
)
# Functions to calculate number of changes vs previous 1, 2, 3 rounds
def num_changes_vs_prev(row):
if pd.isna(row['prev_players_1']):
return np.nan
return len(row['player_id'].symmetric_difference(row['prev_players_1']))
def num_changes_vs_prev2(row):
if pd.isna(row['prev_players_1']) or pd.isna(row['prev_players_2']):
return np.nan
prev_union = row['prev_players_1'].union(row['prev_players_2'])
return len(row['player_id'].symmetric_difference(prev_union))
def num_changes_vs_prev3(row):
if pd.isna(row['prev_players_1']) or pd.isna(row['prev_players_2']) or pd.isna(row['prev_players_3']):
return np.nan
prev_union = row['prev_players_1'].union(row['prev_players_2']).union(row['prev_players_3'])
return len(row['player_id'].symmetric_difference(prev_union))
players_per_team_round['num_changes_1'] = players_per_team_round.apply(num_changes_vs_prev, axis=1)
players_per_team_round['num_changes_2'] = players_per_team_round.apply(num_changes_vs_prev2, axis=1)
players_per_team_round['num_changes_3'] = players_per_team_round.apply(num_changes_vs_prev3, axis=1)
I also wanted to look at different variables relating to team form, as well as whether or not having more, or less, rest between games made a difference. In order to do this I built out a dataframe to look at average scores over a 5 game period, as well as the amount of time between games:
# calculate days between games as well as rolling averages of previous 5 games
df_games['game_date'] = pd.to_datetime(df_games['game_date'])
df_home = df_games[['id', 'season_year', 'round', 'game_date', 'home_team_id', 'home_score_total', 'away_score_total']].copy()
df_home = df_home.rename(columns={
'home_team_id': 'team_id',
'home_score_total': 'score_scored',
'away_score_total': 'score_conceded'
})
df_away = df_games[['id', 'season_year', 'round', 'game_date', 'away_team_id', 'away_score_total', 'home_score_total']].copy()
df_away = df_away.rename(columns={
'away_team_id': 'team_id',
'away_score_total': 'score_scored',
'home_score_total': 'score_conceded'
})
df_team_games = pd.concat([df_home, df_away], ignore_index=True)
df_team_games = df_team_games.sort_values(['team_id', 'game_date']).reset_index(drop=True)
# calculate average score and margin based on rolling average of previous 5 games, minimum 1 game and only including games in season
df_team_games['avg_scored_prev5'] = (
df_team_games.groupby(['team_id', 'season_year'])['score_scored']
.transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)
df_team_games['avg_conceded_prev5'] = (
df_team_games.groupby(['team_id', 'season_year'])['score_conceded']
.transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)
df_team_games['margin'] = df_team_games['score_scored'] - df_team_games['score_conceded']
df_team_games['avg_margin_prev5'] = (
df_team_games.groupby(['team_id', 'season_year'])['margin']
.transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)
# calculate days since previous game, only including games in season
df_team_games['days_since_prev'] = (
df_team_games.groupby(['team_id', 'season_year'])['game_date']
.diff()
.dt.days
)
Lastly, I also took a look at the average age of players within a game:
# Calculate the average team age
query_age = """
SELECT
pgs.game_id,
pgs.team_id,
pgs.player_id,
p.birth_date,
g.game_date
FROM player_game_stats pgs
JOIN games g ON pgs.game_id = g.id
JOIN players p ON pgs.player_id = p.id
"""
df_players_age = pd.read_sql(query_age, engine)
df_players_age['age'] = (
(pd.to_datetime(df_players_age['game_date']) - pd.to_datetime(df_players_age['birth_date'])).dt.days / 365.25
)
avg_age_per_team_game = (
df_players_age
.groupby(['game_id', 'team_id'])['age']
.mean()
.reset_index()
.rename(columns={'age': 'avg_age', 'game_id': 'id'})
)
Once I had all of my data together, I merged the dataframes above to create a single dataframe with all the features:
# merge everything together into single dataframe
team_game_features = (
pd.merge(
df_team_games,
players_per_team_round[['season_year', 'team_id', 'round', 'num_changes_1', 'num_changes_2', 'num_changes_3']],
on=['season_year', 'team_id', 'round'],
how='left'
)
.merge(
avg_age_per_team_game,
on=['id', 'team_id'],
how='left'
)
)
In order to better analyse team level data I also needed to split out each game into the home team, and the away team. This allows me to look at each team’s performance individually, regardless of whether they played at home or away.
# Get win/loss results
df_home_result = df_games[['id', 'season_year', 'round', 'home_team_id', 'home_result']]
df_home_result = df_home_result.rename(columns={
'id': 'game_id',
'home_team_id': 'team_id',
'home_result': 'result'
})
df_home_result['home_or_away'] = 'home'
df_away_result = df_games[['id', 'season_year', 'round', 'away_team_id', 'away_result']]
df_away_result = df_away_result.rename(columns={
'id': 'game_id',
'away_team_id': 'team_id',
'away_result': 'result'
})
df_away_result['home_or_away'] = 'away'
df_results = pd.concat([df_home_result, df_away_result], ignore_index=True)
Now I was almost ready to start looking at the data, but first I needed to merge the results dataframe into my features dataframe, then add in a new column so that I could understand home ground advantage.
# Merge results with what we want to analyse on
team_game_features = team_game_features.merge(
df_results[['game_id', 'team_id', 'result', 'home_or_away']],
left_on=['id', 'team_id'],
right_on=['game_id', 'team_id'],
how='left'
).drop(columns=['game_id'])
team_game_features['target_win'] = team_game_features['result'].map({'W': 1, 'L': 0, 'D': np.nan})
# Add home ground advantage
team_game_features['is_home'] = (team_game_features['home_or_away'] == 'home').astype(int)
# Prepare features and target
features = [
'avg_scored_prev5',
'avg_conceded_prev5',
'days_since_prev',
'num_changes_2',
'avg_age',
'is_home'
]
analysis_df = team_game_features.dropna(subset=['target_win'] + features)
Now that everything is set up, I was able to create my model and take a look at how well it could predict a winner. Not only this, I also wanted to understand which variables were useful, and which weren't.
I used logistic regression to do this, as it is useful for binary outcomes like win/loss. Logistic regression assigns a “coefficient” to each variable (from the features dataframe we just built), which quantifies how much that variable contributes to the odds of a team winning a game. I used scaling so that all the coefficients would be on a scale from -1 to 1. What I want to see is numbers closer to the extremes: i.e., a coefficient of 1 or -1 would mean a perfect correlation.
# Create the model
X = analysis_df[features]
y = analysis_df['target_win']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)
# Create coefficients
coefs = pd.Series(model.coef_[0], index=features)
print(coefs.sort_values())
avg_conceded_prev5 -0.229053
num_changes_2 -0.083847
days_since_prev -0.030959
avg_age 0.270478
avg_scored_prev5 0.272462
is_home 0.312109
dtype: float64
As you can see, the model showed slight positive and negative relationships where we would expect them:
- Playing at home is an advantage
- Previous 5 games performance is relevant
- Age profile of the team is relevant
- Team stability (at least over the previous 2 rounds) was slightly negative
- Time between games didn't seem to matter at all
I also wanted to understand the model's accuracy—i.e., what % of games did the model actually predict correctly. To do this, I used a metric called the Area Under the ROC Curve (AUC). The AUC is a more nuanced way of understanding model performance:
- An AUC of 0.5 means the model is no better than random guessing.
- An AUC of 1.0 is a perfect model.
- Most useful models fall somewhere in between.
# Check accuracy of model
y_pred = model.predict(X_scaled)
acc = accuracy_score(y, y_pred)
print(f"Accuracy: {acc:.3f} ({acc*100:.1f}%)")
Accuracy: 0.617 (61.7%)
The accuracy of the model was 61.7%, basically 5–6 out of 9 tips correct each round.
Lastly, I took a look at which variables were driving the predictions. To do this, I ran the model multiple times, each time removing one feature, and observed how much the AUC dropped without it:
# Check accuracy of model (AUC for each feature dropped)
baseline_auc = roc_auc_score(y, model.predict_proba(X_scaled)[:, 1])
print("Baseline AUC:", baseline_auc)
for i, feature in enumerate(features):
X_reduced = np.delete(X_scaled, i, axis=1)
model_temp = LogisticRegression(max_iter=1000)
model_temp.fit(X_reduced, y)
auc = roc_auc_score(y, model_temp.predict_proba(X_reduced)[:, 1])
print(f"AUC without {feature}: {auc:.3f} (Drop: {baseline_auc - auc:.3f})")
Baseline AUC: 0.6717989292662536
AUC without avg_scored_prev5: 0.658 (Drop: 0.014)
AUC without avg_conceded_prev5: 0.665 (Drop: 0.007)
AUC without days_since_prev: 0.672 (Drop: 0.000)
AUC without num_changes_2: 0.671 (Drop: 0.001)
AUC without avg_age: 0.659 (Drop: 0.013)
AUC without is_home: 0.650 (Drop: 0.022)
Again, this doesn't tell us much—none of the variables dramatically change the model's accuracy if removed, suggesting that each has only a modest influence on predicting the outcome on its own.
I also tested whether including interactions between variables (e.g., combining age and home advantage) using polynomial features would improve the predictive accuracy.
# Let's look at whether multiple factors together can predict a winner
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_poly = poly.fit_transform(X_scaled)
feature_names = poly.get_feature_names_out(features)
model_poly = LogisticRegression(max_iter=1000)
model_poly.fit(X_poly, y)
coefs_poly = pd.Series(model_poly.coef_[0], index=feature_names)
# Show the biggest absolute coefficients (most "important" effects)
print(coefs_poly.abs().sort_values(ascending=False).head(20))
is_home 0.313096
avg_scored_prev5 0.280495
avg_age 0.268837
avg_conceded_prev5 0.209431
num_changes_2 0.084217
avg_scored_prev5 is_home 0.064832
avg_scored_prev5 avg_conceded_prev5 0.049802
num_changes_2 is_home 0.039212
days_since_prev is_home 0.037942
days_since_prev avg_age 0.029295
avg_conceded_prev5 is_home 0.027139
num_changes_2 avg_age 0.024612
days_since_prev 0.021512
avg_scored_prev5 days_since_prev 0.019434
avg_conceded_prev5 num_changes_2 0.016678
avg_conceded_prev5 days_since_prev 0.016134
days_since_prev num_changes_2 0.015158
avg_scored_prev5 num_changes_2 0.013681
avg_conceded_prev5 avg_age 0.012748
avg_age is_home 0.012529
dtype: float64
What above table shows is is that after performing polynomial regression, there variables that have the largest impact do not change. I.e combining variables doesn't really change the model's performance. Checking the accuracy of the polynomial model in the same way we did above confirms this:
# Evaluate predictive accuracy
y_pred_poly = model_poly.predict(X_poly)
y_prob_poly = model_poly.predict_proba(X_poly)[:,1]
print("Poly Accuracy:", accuracy_score(y, y_pred_poly))
print("Poly AUC:", roc_auc_score(y, y_prob_poly))
Poly Accuracy: 0.6233535065859737
Poly AUC: 0.6746580720024911
What does this all mean? Well, combining variables together didn't really move the dial too much (a small increase in accuracy from 61.7% to 62.3%)
Conclusion
Just picking the home team is correct 56% of the time, so if you want a model that is slightly better than just picking the team that is playing at home, this is the model for you.