Kaggle Competition Guide: How to Rank in the Top 10% Every Time
Kaggle competition guide — the systematic approach to finishing in the top 10%, from EDA and baseline models to ensembling and post-competition learning, used by Kaggle Masters.
My first Kaggle competition ended with a rank of 847 out of 1,200. My second competition ended at 312. My fifth competition ended in the top 100. By my tenth, I'd developed a systematic approach that consistently landed me in the top 10%.
The difference wasn't raw skill or more compute. It was methodology. Top competitors on Kaggle don't just try more things; they try the right things in the right order, evaluate each change rigorously, and combine their best models systematically.
This guide codifies the approach used by Kaggle Masters — from EDA through ensembling — in a reproducible workflow you can apply to any competition.
Competition Selection
Your First Competition
Don't start with active competitions:
Learning competitions (no time pressure, extensive resources):
- Titanic: Machine Learning from Disaster
→ Binary classification basics, feature engineering
- House Prices: Advanced Regression Techniques
→ Regression, handling many features, missing values
- Digit Recognizer (MNIST)
→ First neural network / CNN
Choosing Active Competitions
Selection criteria:
✓ Data type you have skills for (tabular/image/text/audio)
✓ Evaluation metric you understand (AUC, RMSE, F1, MAP)
✓ Dataset size you can handle (RAM/GPU constraints)
✓ Active discussion forum (shows community engagement)
✗ Avoid: prize competitions as first entries (too competitive)
✗ Avoid: proprietary domain data with no public knowledge
The Competition Workflow
Phase 1: Understanding the Problem (Day 1-2)
Before writing any modeling code, load the data and check its basic shape:
# Competition setup template
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load and inspect
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')
print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"Target distribution:\n{train['target'].value_counts()}")
missing = train.isnull().sum()
print(f"\nMissing values in train:\n{missing[missing > 0]}")

# 2. Check for train/test distribution differences (a crude mean-shift check)
# Compare only columns present in both frames; 'target' exists in train only.
shared_numeric = [c for c in train.select_dtypes(include=[np.number]).columns
                  if c in test.columns]
for col in shared_numeric:
    train_mean = train[col].mean()
    test_mean = test[col].mean()
    # Relative difference; abs() in the denominator keeps it valid for negative means
    if abs(train_mean - test_mean) / (abs(train_mean) + 1e-8) > 0.1:
        print(f"Distribution shift in {col}: train={train_mean:.3f}, test={test_mean:.3f}")
Key questions to answer before modeling:
- What is the evaluation metric and how does it penalize different errors?
- Is the test data from the same time period as training? (temporal leakage)
- What features exist and what do they mean? (read the competition description)
- What external data is allowed?
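To make the first question concrete, here is a small sketch of how two common regression metrics react to the same total error. The numbers are made up for illustration and the helper names `rmse` and `mae` are ours, not part of any competition kit.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large misses dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only (not from any competition)
y_true = [10.0, 10.0, 10.0, 10.0]
even_misses = [12.0, 8.0, 12.0, 8.0]     # four errors of size 2
one_big_miss = [10.0, 10.0, 10.0, 18.0]  # one error of size 8

for name, preds in [("even misses", even_misses), ("one big miss", one_big_miss)]:
    print(f"{name}: MAE={mae(y_true, preds):.2f}, RMSE={rmse(y_true, preds):.2f}")
```

Both prediction sets have an MAE of 2.0, but the RMSE is 2.0 for the even misses and 4.0 for the single large miss. Under RMSE, reducing the worst errors matters more than shaving small ones.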
Phase 2: Exploratory Data Analysis
# Target distribution
plt.figure(figsize=(10, 4))
train['target'].hist(bins=50)
plt.title('Target Distribution')
plt.show()

# Feature correlation with target
numeric_cols = train.select_dtypes(include=[np.number]).columns.tolist()
correlations = train[numeric_cols].corr()['target'].sort_values(ascending=False)
print("Top 10 correlated features:")
print(correlations.head(11))  # 11 because target itself is included

# Categorical features
for col in train.select_dtypes(include=['object']).columns:
    print(f"\n{col}:")
    print(train.groupby(col)['target'].agg(['mean', 'count']))

# Distribution plots by target class
for col in correlations.head(6).index[1:]:  # Skip target
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    train[col].hist(ax=axes[0], bins=30, alpha=0.7)
    axes[0].set_title(f'{col} Distribution')
    # Plot one density curve per class, labeled so the legend is readable
    for label, values in train.groupby('target')[col]:
        values.plot(kind='kde', ax=axes[1], label=str(label))
    axes[1].set_title(f'{col} by Target')
    axes[1].legend()
    plt.show()
Phase 3: Baseline Model
Establish a baseline before any feature engineering:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np

# Simple preprocessing
def basic_preprocess(df):
    df = df.copy()
    # Encode categoricals
    for col in df.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].fillna('missing'))
    # Fill numeric missing values with each column's median
    df = df.fillna(df.median())
    return df

X = basic_preprocess(train.drop(['id', 'target'], axis=1))
y = train['target']

# 5-fold CV baseline
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# LightGBM baseline (usually good starting point for tabular)
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'n_estimators': 500,
    'learning_rate': 0.05,
    'num_leaves': 31,
    'random_state': 42,
    'verbose': -1
}

baseline_scores = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    model = lgb.LGBMClassifier(**lgb_params)
    model.fit(X_tr, y_tr,
              eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50, verbose=False)])
    # Score with AUC to match 'metric' above; model.score() would report accuracy
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    baseline_scores.append(score)
    print(f"Fold {fold+1}: {score:.4f}")

print(f"\nBaseline CV AUC: {np.mean(baseline_scores):.4f} ± {np.std(baseline_scores):.4f}")
Document this score. Every change you make gets compared to this baseline.
Phase 4: Feature Engineering
Feature engineering typically provides the biggest gains:
from sklearn.preprocessing import PolynomialFeatures

def engineer_features(df, train_df=None):
    # feature_A, feature_B, feature_C and group_id are placeholder column names.
    # train_df is unused in this sketch.
    df = df.copy()
    # 1. Interaction features between top correlated variables
    df['feature_A_times_B'] = df['feature_A'] * df['feature_B']
    df['feature_A_div_B'] = df['feature_A'] / (df['feature_B'] + 1e-8)
    # 2. Aggregation features (if there are group IDs)
    if 'group_id' in df.columns:
        group_stats = df.groupby('group_id')['feature_A'].agg(['mean', 'std', 'max', 'min'])
        group_stats.columns = [f'group_{c}_feature_A' for c in group_stats.columns]
        df = df.merge(group_stats, left_on='group_id', right_index=True, how='left')
    # 3. Target encoding for high-cardinality categoricals
    #    (use cross-validation to avoid leakage)
    # 4. Pairwise interaction terms for top numeric features
    #    (PolynomialFeatures rejects NaN, so fill missing values first)
    top_features = ['feature_A', 'feature_B', 'feature_C']
    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    poly_values = poly.fit_transform(df[top_features])
    poly_names = poly.get_feature_names_out(top_features)
    # The first len(top_features) outputs repeat the inputs; add only the new terms
    for i in range(len(top_features), len(poly_names)):
        df[f'poly_{poly_names[i]}'] = poly_values[:, i]
    return df
# Test each feature addition
# If CV improves: keep the feature
# If CV stays same or drops: remove the feature
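Step 3 above is left as a comment, so here is a minimal sketch of out-of-fold target encoding. It assumes a single categorical column and a numeric or binary target; the function name `oof_target_encode` and the toy column names are hypothetical. Each row is encoded using target means from the other folds only, which is the cross-validation safeguard the comment refers to.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, cat_col, target_col, n_splits=5, seed=42):
    """Return an out-of-fold mean-target encoding of cat_col.

    Each row is encoded with target means computed on the other folds only,
    so a row's own target never contributes to its encoded value.
    """
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(df)) % n_splits  # random, near-equal folds
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        val_mask = fold_ids == fold
        fit_part = df[~val_mask]
        category_means = fit_part.groupby(cat_col)[target_col].mean()
        # Categories unseen in the fitting folds fall back to those folds' overall mean
        fallback = fit_part[target_col].mean()
        encoded[val_mask] = (
            df.loc[val_mask, cat_col].map(category_means).fillna(fallback).to_numpy()
        )
    return encoded

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "city": ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"],
    "target": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})
toy["city_te"] = oof_target_encode(toy, "city", "target", n_splits=5)
print(toy)
```

For the test set, a common choice is to map categories to means computed on the full training set, since test rows have no target to leak.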
Phase 5: Hyperparameter Tuning
After feature engineering, tune the best-performing model:
import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': 1,  # bagging_fraction has no effect unless bagging_freq > 0
        'objective': 'binary',
        'metric': 'auc',
        'verbose': -1,
        'random_state': 42
    }
    model = lgb.LGBMClassifier(**params)
    # Reuse the shuffled StratifiedKFold from the baseline so scores are comparable
    cv_scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
    return cv_scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, timeout=3600)  # stops at 50 trials or 1 hour
print(f"Best CV AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
Phase 6: Ensembling
Combining models almost always outperforms any single model:
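import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
import lightgbm as lgb
import xgboost as xgb

# Preprocess train and test together so categorical codes agree in both
n_train = len(train)
full = pd.concat([train.drop(['id', 'target'], axis=1), test.drop(['id'], axis=1)])
full = basic_preprocess(full)
X, X_test = full.iloc[:n_train], full.iloc[n_train:]
y = train['target']

# LightGBM: tuned values from the Optuna study override the baseline settings.
# XGBoost: placeholder values (an assumption); tune them the same way.
best_lgb_params = {**lgb_params, **study.best_params}
best_xgb_params = {'n_estimators': 500, 'learning_rate': 0.05,
                   'max_depth': 6, 'random_state': 42}

# Train multiple base models with out-of-fold predictions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# OOF (Out-of-Fold) predictions for stacking
oof_lgb = np.zeros(len(X))
oof_xgb = np.zeros(len(X))
test_preds_lgb = np.zeros(len(X_test))
test_preds_xgb = np.zeros(len(X_test))

for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    print(f"Fold {fold + 1}/5")
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

    # LightGBM
    lgb_model = lgb.LGBMClassifier(**best_lgb_params)
    lgb_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                  callbacks=[lgb.early_stopping(50, verbose=False)])
    oof_lgb[val_idx] = lgb_model.predict_proba(X_val)[:, 1]
    test_preds_lgb += lgb_model.predict_proba(X_test)[:, 1] / 5

    # XGBoost (2.0 removed early_stopping_rounds from fit(); pass it to the constructor)
    xgb_model = xgb.XGBClassifier(**best_xgb_params, eval_metric='auc',
                                  early_stopping_rounds=50, verbosity=0)
    xgb_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    oof_xgb[val_idx] = xgb_model.predict_proba(X_val)[:, 1]
    test_preds_xgb += xgb_model.predict_proba(X_test)[:, 1] / 5

# Evaluate individual models
print(f"LGB OOF AUC: {roc_auc_score(y, oof_lgb):.4f}")
print(f"XGB OOF AUC: {roc_auc_score(y, oof_xgb):.4f}")

# Simple average ensemble
oof_blend = (oof_lgb + oof_xgb) / 2
test_blend = (test_preds_lgb + test_preds_xgb) / 2
print(f"Blend OOF AUC: {roc_auc_score(y, oof_blend):.4f}")

# Stacking: train meta-model on OOF predictions
meta_features = np.column_stack([oof_lgb, oof_xgb])
meta_model = Ridge()
# Score the meta-model on held-out folds; fitting and scoring on the same rows is optimistic
oof_stacked = cross_val_predict(meta_model, meta_features, y, cv=cv)
print(f"Stacked OOF AUC: {roc_auc_score(y, oof_stacked):.4f}")
# Fit on all OOF rows, then apply to the fold-averaged test predictions
meta_model.fit(meta_features, y)
test_stacked = meta_model.predict(np.column_stack([test_preds_lgb, test_preds_xgb]))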
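Between a plain average and a full stacking model sits a weighted blend. The sketch below grid-searches a single weight on out-of-fold predictions; the function name `best_blend_weight` is hypothetical and the arrays are synthetic stand-ins for real OOF predictions, so the printed numbers carry no meaning beyond showing the mechanics.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_blend_weight(y_true, oof_a, oof_b, n_steps=21):
    """Grid-search w in w*oof_a + (1-w)*oof_b to maximize OOF AUC."""
    best_w, best_auc = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, n_steps):  # includes w=0 and w=1 exactly
        auc = roc_auc_score(y_true, w * oof_a + (1 - w) * oof_b)
        if auc > best_auc:
            best_w, best_auc = float(w), auc
    return best_w, best_auc

# Synthetic stand-ins for the OOF predictions of two models
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
oof_a = y_demo * 0.3 + rng.normal(0.5, 0.25, size=500)
oof_b = y_demo * 0.3 + rng.normal(0.5, 0.25, size=500)

w, auc = best_blend_weight(y_demo, oof_a, oof_b)
print(f"best weight for model A: {w:.2f}, blended OOF AUC: {auc:.4f}")
```

Because the grid includes both endpoints, the blended OOF score can never be lower than either model alone. The weight is itself fit to the training data, though, so small gains may not carry over to the private test set.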
Competition-Specific Tips
For Tabular Competitions
- LightGBM is usually the strongest baseline
- Add CatBoost and XGBoost for diversity in ensembles
- Time-series competitions: never let future data leak into training folds
- Feature importance plots reveal what the model found useful, which often suggests new features
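For the time-series point above, one way to keep future rows out of training folds is an expanding-window split. This is a minimal sketch assuming the rows are already sorted by time; `expanding_window_splits` is a hypothetical helper, and scikit-learn's `TimeSeriesSplit` offers similar behavior.

```python
def expanding_window_splits(n_samples, n_splits=3):
    """Yield (train_idx, val_idx) index lists for rows sorted by time.

    Every validation block comes strictly after all of its training rows.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold_size * k
        # The last validation block absorbs any leftover rows
        val_end = n_samples if k == n_splits else fold_size * (k + 1)
        yield list(range(train_end)), list(range(train_end, val_end))

for train_idx, val_idx in expanding_window_splits(10, n_splits=3):
    print(f"train rows 0..{train_idx[-1]}, validate rows {val_idx[0]}..{val_idx[-1]}")
```

A shuffled k-fold on the same data would mix later rows into the training folds, which tends to inflate CV scores relative to the leaderboard.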
For Image Competitions
- Transfer learning from pretrained models (EfficientNet, ViT)
- Test-time augmentation (TTA) — predict on augmented versions and average
- Multi-scale training and inference
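Test-time augmentation can be sketched in a few lines. The version below averages predictions over the original batch and its horizontal flip; it assumes images are stored as an (N, H, W, C) NumPy array, and `toy_predict` is a hypothetical stand-in for a real model's predict function.

```python
import numpy as np

def predict_with_tta(predict_fn, images):
    """Average predictions over the original batch and its horizontal flip.

    predict_fn: any callable mapping an (N, H, W, C) array to (N, n_classes) scores.
    """
    views = [images, images[:, :, ::-1, :]]  # original, left-right flip
    return np.mean([predict_fn(view) for view in views], axis=0)

def toy_predict(batch):
    """Stand-in model: scores are the mean brightness of the left and right halves."""
    width = batch.shape[2]
    left = batch[:, :, : width // 2, :].mean(axis=(1, 2, 3))
    right = batch[:, :, width // 2 :, :].mean(axis=(1, 2, 3))
    return np.stack([left, right], axis=1)

rng = np.random.default_rng(0)
batch = rng.random((4, 8, 8, 3))
tta_scores = predict_with_tta(toy_predict, batch)
print(tta_scores.shape)  # (4, 2)
```

Only use augmentations that leave the label unchanged. A horizontal flip is usually safe for natural photos but not for tasks where left and right carry meaning, such as reading text.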
For NLP Competitions
- Pretrained language models (BERT, RoBERTa, DeBERTa) are the standard
- External data often allowed — check the rules
- Pseudo-labeling on test data can help
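The selection step of pseudo-labeling might look like the sketch below, shown for a binary task. The thresholds 0.05 and 0.95 are assumptions to be tuned per competition, and `select_pseudo_labels` is a hypothetical helper name.

```python
import numpy as np

def select_pseudo_labels(test_probs, low=0.05, high=0.95):
    """Pick test rows whose predicted probability is confidently near 0 or 1.

    Returns (row_indices, hard_labels) for the selected rows.
    """
    test_probs = np.asarray(test_probs)
    confident = (test_probs <= low) | (test_probs >= high)
    idx = np.flatnonzero(confident)
    labels = (test_probs[idx] >= high).astype(int)
    return idx, labels

# Illustrative probabilities, not real model output
probs = np.array([0.01, 0.40, 0.97, 0.55, 0.99, 0.03])
idx, labels = select_pseudo_labels(probs)
print(idx, labels)  # [0 2 4 5] [0 1 1 0]
```

The selected rows are then appended to the training set with their hard labels and the model is retrained. Validation should still use only the originally labeled rows, otherwise the CV score measures agreement with the model's own guesses.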
The Mental Model of Top Competitors
Top competitors think in experiments, not guesses:
1. Hypothesis: "Adding this feature should improve CV by X"
2. Test: Run CV, measure delta
3. Keep or discard: Based on CV delta, not intuition
4. Document: Keep a log of every experiment
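The four steps above can be captured in a very small log. This is a sketch assuming a higher CV score is better; the class name `ExperimentLog`, the `min_delta` threshold, and the example scores are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentLog:
    """Minimal experiment log; assumes a higher CV score is better."""
    baseline_cv: float
    min_delta: float = 0.0005  # assumed noise floor; derive yours from fold-to-fold spread
    rows: list = field(default_factory=list)

    def current_cv(self):
        """CV score of the latest kept change, or the baseline if none was kept."""
        kept = [row["cv"] for row in self.rows if row["decision"] == "keep"]
        return kept[-1] if kept else self.baseline_cv

    def record(self, name, hypothesis, cv_score):
        """Log one experiment and decide keep/discard from the CV delta alone."""
        delta = cv_score - self.current_cv()
        decision = "keep" if delta > self.min_delta else "discard"
        self.rows.append({"name": name, "hypothesis": hypothesis,
                          "cv": cv_score, "delta": round(delta, 5),
                          "decision": decision})
        return decision

# Made-up scores, for illustration only
log = ExperimentLog(baseline_cv=0.8000)
decisions = [
    log.record("ratio_A_B", "A/B ratio adds signal", 0.8030),
    log.record("poly_terms", "pairwise interactions help", 0.8031),
    log.record("group_stats", "group-level means add context", 0.8060),
]
for row in log.rows:
    print(row)
```

The second experiment is discarded because its gain over the current best (0.0001) is below the noise threshold, even though its raw score is higher than the baseline.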
They don't:
- Submit before validating locally
- Chase the public LB (often leads to overfitting to the public set)
- Add random features hoping something sticks
- Copy solutions without understanding them
Leaderboard Shake-Up
Many Kaggle veterans have experienced finishing well on the public leaderboard, only to drop significantly on the private leaderboard. Prevention:
- Trust your CV over public LB
- Check if your CV strategy correctly simulates the private test set
- Don't burn through your daily submissions (the limit is set per competition; 5 is common). Use them judiciously to validate specific questions
- Pick 2 final submissions: one optimized for LB, one for CV stability
Conclusion
Kaggle competitions are one of the best ways to rapidly improve practical ML skills. The competition format forces rigorous evaluation, and the post-competition discussions from top finishers are some of the highest-quality ML education available anywhere.
The path to consistent top-10% finishes: understand the problem deeply, establish a rigorous CV baseline, engineer features systematically, tune carefully, and ensemble diverse models. Each competition teaches techniques you take to the next one.
For the ML skills competitions build on, see our machine learning beginners guide, feature engineering guide, and overfitting guide.
Frequently Asked Questions
What is Kaggle and how do competitions work?
Kaggle hosts ML competitions where participants build models on provided datasets for prize money and ranking. Competitions run 1-3 months. Participants submit predictions on test data; Kaggle evaluates using the competition metric. The community forums and post-competition solutions are as valuable as the rankings.
How do I choose which Kaggle competition to enter?
Match to your data type skills (tabular/image/text). Start with learning competitions (Titanic, House Prices) with no time pressure. Choose active competitions based on domain familiarity and learning objectives. Avoid massive prize competitions as your first entry.
What techniques separate top competitors from average ones?
Thorough EDA, systematic feature engineering, rigorous cross-validation strategy, model ensembling (blending and stacking), efficient compute management, and reading competition discussions for domain insights. Feature engineering and CV strategy are the highest-leverage skills.
What is model stacking?
Stacking means training a meta-model on the out-of-fold predictions of multiple base models. The meta-model learns how to combine the base model outputs. It often outperforms simple averaging when base models are diverse, though the gain is not guaranteed and should be checked on held-out folds. Key: base models should use different algorithms or significantly different feature sets.
How important is the cross-validation strategy?
Extremely — it's your only reliable signal about what improvements work. If CV doesn't correlate with the leaderboard, you'll optimize in the wrong direction. Choose CV strategy based on data type: stratified k-fold for classification, time-series split for temporal data, group k-fold when samples from the same entity should stay together.
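As a sketch of that choice, the helper below maps each of the three data situations to a scikit-learn splitter. The function name `make_splitter` and the `kind` labels are hypothetical; the splitter classes themselves are part of `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

def make_splitter(kind, n_splits=5, seed=42):
    """Return a CV splitter for the given data situation."""
    if kind == "classification":
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    if kind == "temporal":
        return TimeSeriesSplit(n_splits=n_splits)  # rows must be sorted by time
    if kind == "grouped":
        return GroupKFold(n_splits=n_splits)  # pass groups= when calling split()
    raise ValueError(f"Unknown kind: {kind!r}")

# Toy data: 12 rows, 4 entities with 3 rows each
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)
groups = np.repeat(np.arange(4), 3)

splitter = make_splitter("grouped", n_splits=2)
for train_idx, val_idx in splitter.split(X_demo, y_demo, groups=groups):
    print("train groups:", sorted(set(groups[train_idx])),
          "val groups:", sorted(set(groups[val_idx])))
```

With the grouped splitter, no entity appears on both sides of a split, which is what keeps the CV score honest when the test set contains unseen entities.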
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.