Detailed analysis

At a basic level, the dataset included:

All matches that occurred from 2010 to round 9, 2025 inclusive
All players who played in any of these games (games they played outside of this time period were not included)
All the venues that any of these games were played at

If you want to see what makes up the data, then feel free to examine the tables below.

You can skip the detailed data tables if you want:

Show/hide data tables (click to expand/collapse)

Table 1: Every individual game between 2010 and round 9, 2025 (table name: player_game_stats)

Column	Data Type	Definition
id	integer	Unique record for player game stat entry
player_id	integer	Player identifier
game_id	integer	Game identifier
team_id	integer	Team the player was playing for
opponent_team_id	integer	Team the player was playing against
kicks	integer	Kicks
marks	integer	Marks
handballs	integer	Handballs
disposals	integer	Kicks + handballs
goals	integer	Goals scored
behinds	integer	Behinds scored
hit_outs	integer	Hit outs
tackles	integer	Tackles
rebounds	integer	Moving the ball from the defensive 50 into the midfield
inside_50	integer	Moving the ball from the midfield into the forward 50; excludes multiple entries within same chain
clearances	integer	First effective disposal in a chain that clears the stoppage area, or an ineffective/clanger kick that clears stoppage
clangers	integer	Error made by a player resulting in a negative result for their side
frees_for	integer	Frees awarded to player
frees_against	integer	Frees conceded by player
brownlow_votes	integer	Brownlow Medal votes, awarded by umpires
contested_ possessions	integer	Possession won when the ball is in dispute (looseball-gets, hardball-gets, contested marks, etc.)
uncontested_possessions	integer	Possessions gained under no physical pressure
contested_marks	integer	A mark under physical pressure of an opponent or in a pack
marks_inside_50	integer	Marks taken inside forward 50
one_percenters	integer	Defensive effort (spoil, knock-on, smother or shepherd)
bounces	integer	Bounces
goal_assists	integer	Creating a goal by getting the ball to a teammate by any means (disposal, knock-on, ground kick, hitout, etc.)
time_on_ground_pct	numeric	Time on ground during a game divided by total game time (%)

For more information on the definitions of above, or how they are recorded, visit the AFL site. The data was sourced from afltables.com, footywire.com, and the AFL dataset on Kaggle. For the purpose of this analysis I am going to assume everything was correctly recorded and that my data is correct. I conducted random spot checking of ~5 games each season to ensure confidence, and all information can be found in my github.

Table 2: Game level data between 2010 and round 9, 2025 (table name: games)

Column	Data Type	Definition
round	text	Round label (e.g R3, QF)
game_date	date	Date game was played (YYYY-MM-DD)
venue_id	integer	Ground the game was played at (venue ID)
home_team_id	integer	Home team ID
away_team_id	integer	Away team ID
season_year	integer	Season year
afltables_game_id	integer	Game ID from AFL Tables source
home_score_str	character varying	Home team score as a string (e.g "10.9.69")
away_score_str	character varying	Away team score as a string (e.g "4.17.41")
home_score_goals	integer	Home team goals
home_score_behinds	integer	Home team behinds
home_score_total	integer	Home team total score
away_score_goals	integer	Away team goals
away_score_behinds	integer	Away team behinds
away_score_total	integer	Away team total score
margin	integer	Score margin as an absolute value
home_result	character	Result for home team (W = win, L = loss, D = draw)
away_result	character	Result for away team (W = win, L = loss, D = draw)

Table 3: Every venue a game occurred in (table name: venues)

Column	Data Type	Definition
id	integer	Unique venue ID
name	text	Venue name
city	text	City or town where the venue is located
state	text	State the venue is located (or country if outside of Australia)
location	text	Latitude & longitude of the stadium

Table 4: Every player who played a game during this time (table name: players)

Column	Data Type	Definition
id	integer	Unique player ID
full_name	text	Player's full name
birth_date	date	Date of birth (YYYY-MM-DD)
height_cm	integer	Player's height in centimeters
weight_kg	integer	Player's weight in kilograms
afltables_id	integer	Player ID from AFL Tables source
position	character varying	Player position(s), e.g 'Midfield', 'Defender'

Table 5: Every team (table name: teams)

Column	Data Type	Definition
id	integer	Unique team ID
name	text	Team name
short_code	text	Team abbreviation or short code
home_ground	integer	Venue ID for team's primary home ground
state	text	State where team is based

All of these tables are available on my GitHub page.

Building the Dataframe and Features

With the data prepared, I began by building an initial dataframe containing all games:


engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_pass}@{db_host}:{db_port}/{db_name}")
df_games = pd.read_sql("SELECT * FROM games", engine)

I also built a helper function to sort rounds (this was done after noticing that the format I had the rounds in meant that sorting them meant R10 following R1 instead of R2):


# Helper function to sort rounds
def get_round_order(row):
    # Match regular rounds: 'R1', 'R23', etc.
    match = re.match(r'R(\d+)$', row['round'])
    if match:
        return int(match.group(1))
    # Finals rounds mapping (AFL convention)
    finals_order = {
        'REF': 100,  # Elimination Final (week 1)
        'RQF': 101,  # Qualifying Final (week 1)
        'RSF': 102,  # Semi Final (week 2)
        'RPF': 103,  # Preliminary Final (week 3)
        'RGF': 104,  # Grand Final (week 4)
    }
    code = row['round'][1:] 
    return finals_order.get(code, 999)  # Unknown finals get 999

The first variable I wanted to look at is player movement. In order to calculate this I initially only looked at changes across 1 round. I then built out changes of 2 and 3 rounds, and after analysis on each of the variables I ended up only using changes over 2 rounds. This was because these variables were so highly correlated (multicollinearity) that including more than one would make it difficult to interpret the effect of each one individually.


query_movement ="""
SELECT 
    pgs.game_id, 
    g.season_year, 
    g.round, 
    pgs.team_id, 
    pgs.player_id
FROM player_game_stats pgs
JOIN games g ON pgs.game_id = g.id
ORDER BY g.season_year, g.round, pgs.team_id
"""
df_players = pd.read_sql(query_movement, engine)

players_per_team_round = (
    df_players
    .groupby(['season_year', 'team_id', 'round'])
    .agg({'player_id': lambda x: set(x)})
    .reset_index()
    .sort_values(['season_year', 'team_id', 'round'])
)

# Use the helper function to sort the rounds
players_per_team_round['round_order'] = players_per_team_round.apply(get_round_order, axis=1)
players_per_team_round = players_per_team_round.sort_values(
    ['season_year', 'team_id', 'round_order']
).reset_index(drop=True)

# Once we have the dataframe, now we need to sort it and calculate how many players played together in previous round, plus previous 2 & 3 rounds
players_per_team_round['prev_players_1'] = (
    players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(1)
)
players_per_team_round['prev_players_2'] = (
    players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(2)
)
players_per_team_round['prev_players_3'] = (
    players_per_team_round.groupby(['season_year', 'team_id'])['player_id'].shift(3)
)

# Functions to calculate number of changes vs previous 1, 2, 3 rounds
def num_changes_vs_prev(row):
    if pd.isna(row['prev_players_1']):
        return np.nan
    return len(row['player_id'].symmetric_difference(row['prev_players_1']))

def num_changes_vs_prev2(row):
    if pd.isna(row['prev_players_1']) or pd.isna(row['prev_players_2']):
        return np.nan
    prev_union = row['prev_players_1'].union(row['prev_players_2'])
    return len(row['player_id'].symmetric_difference(prev_union))

def num_changes_vs_prev3(row):
    if pd.isna(row['prev_players_1']) or pd.isna(row['prev_players_2']) or pd.isna(row['prev_players_3']):
        return np.nan
    prev_union = row['prev_players_1'].union(row['prev_players_2']).union(row['prev_players_3'])
    return len(row['player_id'].symmetric_difference(prev_union))

players_per_team_round['num_changes_1'] = players_per_team_round.apply(num_changes_vs_prev, axis=1)
players_per_team_round['num_changes_2'] = players_per_team_round.apply(num_changes_vs_prev2, axis=1)
players_per_team_round['num_changes_3'] = players_per_team_round.apply(num_changes_vs_prev3, axis=1)

I also wanted to look at different variables relating to team form, as well as whether or not having more, or less, rest between games made a difference. In order to do this I built out a dataframe to look at average scores over a 5 game period, as well as the amount of time between games:


# calculate days between games as well as rolling averages of previous 5 games
df_games['game_date'] = pd.to_datetime(df_games['game_date'])

df_home = df_games[['id', 'season_year', 'round', 'game_date', 'home_team_id', 'home_score_total', 'away_score_total']].copy()
df_home = df_home.rename(columns={
    'home_team_id': 'team_id',
    'home_score_total': 'score_scored',
    'away_score_total': 'score_conceded'
})

df_away = df_games[['id', 'season_year', 'round', 'game_date', 'away_team_id', 'away_score_total', 'home_score_total']].copy()
df_away = df_away.rename(columns={
    'away_team_id': 'team_id',
    'away_score_total': 'score_scored',
    'home_score_total': 'score_conceded'
})

df_team_games = pd.concat([df_home, df_away], ignore_index=True)
df_team_games = df_team_games.sort_values(['team_id', 'game_date']).reset_index(drop=True)

# calculate average score and margin based on rolling average of previous 5 games, minimum 1 game and only including games in season
df_team_games['avg_scored_prev5'] = (
    df_team_games.groupby(['team_id', 'season_year'])['score_scored']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

df_team_games['avg_conceded_prev5'] = (
    df_team_games.groupby(['team_id', 'season_year'])['score_conceded']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

df_team_games['margin'] = df_team_games['score_scored'] - df_team_games['score_conceded']

df_team_games['avg_margin_prev5'] = (
    df_team_games.groupby(['team_id', 'season_year'])['margin']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

# calculate days since previous game, only including games in season
df_team_games['days_since_prev'] = (
    df_team_games.groupby(['team_id', 'season_year'])['game_date']
    .diff()
    .dt.days
)

Lastly, I also took a look at the average age of players within a game:


# Calculate the average team age
query_age = """
SELECT 
    pgs.game_id, 
    pgs.team_id, 
    pgs.player_id, 
    p.birth_date, 
    g.game_date
FROM player_game_stats pgs
JOIN games g ON pgs.game_id = g.id
JOIN players p ON pgs.player_id = p.id
"""
df_players_age = pd.read_sql(query_age, engine)

df_players_age['age'] = (
    (pd.to_datetime(df_players_age['game_date']) - pd.to_datetime(df_players_age['birth_date'])).dt.days / 365.25
)

avg_age_per_team_game = (
    df_players_age
    .groupby(['game_id', 'team_id'])['age']
    .mean()
    .reset_index()
    .rename(columns={'age': 'avg_age', 'game_id': 'id'}) 
)

Once I had all of my data together, I merged the dataframes above to create a single dataframe with all the features:


# merge everything together into single dataframe
team_game_features = (
    pd.merge(
        df_team_games,
        players_per_team_round[['season_year', 'team_id', 'round', 'num_changes_1', 'num_changes_2', 'num_changes_3']],
        on=['season_year', 'team_id', 'round'],
        how='left'
    )
    .merge(
        avg_age_per_team_game,
        on=['id', 'team_id'],
        how='left'
    )
)

In order to better analyse team level data I also needed to split out each game into the home team, and the away team. This allows me to look at each team’s performance individually, regardless of whether they played at home or away.


# Get win/loss results

df_home_result = df_games[['id', 'season_year', 'round', 'home_team_id', 'home_result']]
df_home_result = df_home_result.rename(columns={
    'id': 'game_id',
    'home_team_id': 'team_id',
    'home_result': 'result'
})
df_home_result['home_or_away'] = 'home'

df_away_result = df_games[['id', 'season_year', 'round', 'away_team_id', 'away_result']]
df_away_result = df_away_result.rename(columns={
    'id': 'game_id',
    'away_team_id': 'team_id',
    'away_result': 'result'
})
df_away_result['home_or_away'] = 'away'

df_results = pd.concat([df_home_result, df_away_result], ignore_index=True)

Now I was almost ready to start looking at the data, but first I needed to merge the results dataframe into my features dataframe, then add in a new column so that I could understand home ground advantage.


# Merge results with what we want to analyse on
team_game_features = team_game_features.merge(
    df_results[['game_id', 'team_id', 'result', 'home_or_away']],
    left_on=['id', 'team_id'],
    right_on=['game_id', 'team_id'],
    how='left'
).drop(columns=['game_id'])

team_game_features['target_win'] = team_game_features['result'].map({'W': 1, 'L': 0, 'D': np.nan})

# Add home ground advantage
team_game_features['is_home'] = (team_game_features['home_or_away'] == 'home').astype(int)

# Prepare features and target
features = [
    'avg_scored_prev5', 
    'avg_conceded_prev5', 
    'days_since_prev', 
    'num_changes_2',
    'avg_age',
    'is_home'
]
analysis_df = team_game_features.dropna(subset=['target_win'] + features)

Now that everything is set up, I was able to create my model and take a look at how well it could predict a winner. Not only this, I also wanted to understand which variables were useful, and which weren't.

I used logistic regression to do this, as it is useful for binary outcomes like win/loss. Logistic regression assigns a “coefficient” to each variable (from the features dataframe we just built), which quantifies how much that variable contributes to the odds of a team winning a game. I used scaling so that all the coefficients would be on a scale from -1 to 1. What I want to see is numbers closer to the extremes: i.e., a coefficient of 1 or -1 would mean a perfect correlation.


# Create the model
X = analysis_df[features]
y = analysis_df['target_win']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)
# Create coefficients
coefs = pd.Series(model.coef_[0], index=features)
print(coefs.sort_values())

Output:
avg_conceded_prev5 -0.229053
num_changes_2 -0.083847
days_since_prev -0.030959
avg_age 0.270478
avg_scored_prev5 0.272462
is_home 0.312109
dtype: float64

As you can see, the model showed slight positive and negative relationships where we would expect them:

Playing at home is an advantage
Previous 5 games performance is relevant
Age profile of the team is relevant
Team stability (at least over the previous 2 rounds) was slightly negative
Time between games didn't seem to matter at all

I also wanted to understand the model's accuracy—i.e., what % of games did the model actually predict correctly. To do this, I used a metric called the Area Under the ROC Curve (AUC). The AUC is a more nuanced way of understanding model performance:

An AUC of 0.5 means the model is no better than random guessing.
An AUC of 1.0 is a perfect model.
Most useful models fall somewhere in between.


# Check accuracy of model
y_pred = model.predict(X_scaled)
acc = accuracy_score(y, y_pred)
print(f"Accuracy: {acc:.3f} ({acc*100:.1f}%)")

Output:
Accuracy: 0.617 (61.7%)

The accuracy of the model was 61.7%, basically 5–6 out of 9 tips correct each round.

Lastly, I took a look at which variables were driving the predictions. To do this, I ran the model multiple times, each time removing one feature, and observed how much the AUC dropped without it:


# Check accuracy of model (AUC for each feature dropped)
baseline_auc = roc_auc_score(y, model.predict_proba(X_scaled)[:, 1])
print("Baseline AUC:", baseline_auc)

for i, feature in enumerate(features):
    X_reduced = np.delete(X_scaled, i, axis=1)
    model_temp = LogisticRegression(max_iter=1000)
    model_temp.fit(X_reduced, y)
    auc = roc_auc_score(y, model_temp.predict_proba(X_reduced)[:, 1])
    print(f"AUC without {feature}: {auc:.3f} (Drop: {baseline_auc - auc:.3f})")

Output:
Baseline AUC: 0.6717989292662536
AUC without avg_scored_prev5: 0.658 (Drop: 0.014)
AUC without avg_conceded_prev5: 0.665 (Drop: 0.007)
AUC without days_since_prev: 0.672 (Drop: 0.000)
AUC without num_changes_2: 0.671 (Drop: 0.001)
AUC without avg_age: 0.659 (Drop: 0.013)
AUC without is_home: 0.650 (Drop: 0.022)

Again, this doesn't tell us much—none of the variables dramatically change the model's accuracy if removed, suggesting that each has only a modest influence on predicting the outcome on its own.

I also tested whether including interactions between variables (e.g., combining age and home advantage) using polynomial features would improve the predictive accuracy.


# Let's look at whether multiple factors together can predict a winner
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_poly = poly.fit_transform(X_scaled)
feature_names = poly.get_feature_names_out(features)
model_poly = LogisticRegression(max_iter=1000)
model_poly.fit(X_poly, y)

coefs_poly = pd.Series(model_poly.coef_[0], index=feature_names)
# Show the biggest absolute coefficients (most "important" effects)
print(coefs_poly.abs().sort_values(ascending=False).head(20))

Output:
is_home 0.313096
avg_scored_prev5 0.280495
avg_age 0.268837
avg_conceded_prev5 0.209431
num_changes_2 0.084217
avg_scored_prev5 is_home 0.064832
avg_scored_prev5 avg_conceded_prev5 0.049802
num_changes_2 is_home 0.039212
days_since_prev is_home 0.037942
days_since_prev avg_age 0.029295
avg_conceded_prev5 is_home 0.027139
num_changes_2 avg_age 0.024612
days_since_prev 0.021512
avg_scored_prev5 days_since_prev 0.019434
avg_conceded_prev5 num_changes_2 0.016678
avg_conceded_prev5 days_since_prev 0.016134
days_since_prev num_changes_2 0.015158
avg_scored_prev5 num_changes_2 0.013681
avg_conceded_prev5 avg_age 0.012748
avg_age is_home 0.012529
dtype: float64

What above table shows is is that after performing polynomial regression, there variables that have the largest impact do not change. I.e combining variables doesn't really change the model's performance. Checking the accuracy of the polynomial model in the same way we did above confirms this:


# Evaluate predictive accuracy
y_pred_poly = model_poly.predict(X_poly)
y_prob_poly = model_poly.predict_proba(X_poly)[:,1]
print("Poly Accuracy:", accuracy_score(y, y_pred_poly))
print("Poly AUC:", roc_auc_score(y, y_prob_poly))

Output:
Poly Accuracy: 0.6233535065859737
Poly AUC: 0.6746580720024911

What does this all mean? Well, combining variables together didn't really move the dial too much (a small increase in accuracy from 61.7% to 62.3%)

Conclusion

Just picking the home team is correct 56% of the time, so if you want a model that is slightly better than just picking the team that is playing at home, this is the model for you.