How do I choose which Kaggle competition to enter?

Choose competitions based on: data type match (tabular if you know scikit-learn, image if you know PyTorch CV, text if you know NLP), learning objectives (enter competitions to learn specific techniques, not just rank), team viability (solo vs. team competitions — team competitions allow pooling skills), and competition age (older competitions have more public kernels to learn from). Avoid: playground competitions if you want serious learning; massive prize competitions as a first entry (they're highly competitive). Best first competition: Titanic (classification basics) or House Prices (regression basics) — both have extensive learning resources and no active competition pressure.

What techniques separate top Kaggle competitors from average ones?

The key separators: (1) Thorough EDA — top competitors understand every feature deeply before modeling; (2) Feature engineering — consistently the highest-impact activity; (3) Cross-validation strategy — choosing the right CV that correlates with the private leaderboard (critical and often underestimated); (4) Model ensembling — combining multiple models almost always outperforms any single model; (5) Stacking — training meta-models on base model predictions; (6) Computing carefully — GPU usage, efficient data loading, memory management for large datasets; (7) Reading discussions — Kaggle forums contain gold from competitors who have domain knowledge you lack.

What is model stacking in Kaggle competitions?

Stacking is an ensembling technique where you train a second-level 'meta-model' on the predictions of first-level 'base models'. The meta-model learns how to best combine the base models' outputs. Example: base models might be XGBoost, LightGBM, CatBoost, and a neural network. For each training example, you get predictions from all base models using cross-validation (to avoid leakage). The meta-model (often Logistic Regression or another XGBoost) trains on these predictions to make the final prediction. Stacking often outperforms simple averaging, but the gain is usually small, so confirm it with CV before relying on it. The key: base models should be diverse (different algorithms or significantly different features), because stacking correlated models provides minimal benefit.
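
As a compact illustration of the idea, here is a minimal sketch using scikit-learn's StackingClassifier, which builds the cross-validated base-model predictions internally. The synthetic dataset and the choice of a random forest and gradient boosting as base models are assumptions made to keep the example self-contained; they stand in for the XGBoost/LightGBM/CatBoost models mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (an assumption, not competition data)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)

# Two different algorithms as base models
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# cv=5 means the meta-model is trained on out-of-fold base predictions
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)

# Outer CV gives an honest estimate of the whole stacked pipeline
stack_auc = cross_val_score(stack, X, y, cv=3, scoring="roc_auc").mean()
print(f"Stacked CV AUC: {stack_auc:.4f}")
```

Comparing this score with each base model's own CV score shows whether stacking is actually adding anything on a given dataset.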

How important is the cross-validation strategy in Kaggle?

Extremely important — possibly the most underestimated aspect of competition ML. Your CV score is your only reliable signal about what improvements actually help. If your CV doesn't correlate with the public leaderboard, you'll be optimizing in the wrong direction. Common CV strategies: k-fold for standard tabular data; stratified k-fold for imbalanced classification; time-series split for temporal data (never mix future data into training folds); group k-fold when samples from the same entity should stay together. A competition rule of thumb: if an improvement shows in CV but not public LB, trust CV. If it shows in LB but not CV, be suspicious — you might be overfitting to the public test set.
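
The splitters named above all exist in scikit-learn. The following sketch runs three of them on synthetic data; the array sizes, the positive rate, and the `groups` entity IDs are illustrative assumptions. It checks the property each splitter is meant to guarantee.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 0.2).astype(int)   # imbalanced binary target
groups = np.repeat(np.arange(20), 5)    # hypothetical entity IDs, 5 rows each

# Stratified k-fold keeps the positive rate similar in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_pos_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]

# Group k-fold never puts rows from one entity on both sides of a split
gkf = GroupKFold(n_splits=5)
group_overlaps = [
    len(set(groups[tr_idx]) & set(groups[val_idx]))
    for tr_idx, val_idx in gkf.split(X, y, groups=groups)
]

# Time-series split (rows assumed sorted by time): validation always follows training
tss = TimeSeriesSplit(n_splits=5)
time_ordered = [tr_idx.max() < val_idx.min() for tr_idx, val_idx in tss.split(X)]

print("Positive rate per stratified fold:", np.round(fold_pos_rates, 2))
print("Shared groups per GroupKFold split:", group_overlaps)
print("Train strictly before validation:", time_ordered)
```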


Kaggle Competition Guide: How to Rank in the Top 10% Every Time

⚡ Quick Answer

Kaggle competition guide — the systematic approach to finishing in the top 10%, from EDA and baseline models to ensembling and post-competition learning, used by Kaggle Masters.

AiTechWorlds Team May 27, 2026 8 min read

#kaggle-competition-guide #kaggle-strategy #competitive-ml #machine-learning


My first Kaggle competition ended with a rank of 847 out of 1,200. My second competition ended at 312. My fifth competition ended in the top 100. By my tenth, I'd developed a systematic approach that consistently landed me in the top 10%.

The difference wasn't raw skill or more compute. It was methodology. Top competitors in Kaggle don't just try more things — they try the right things in the right order, evaluate each change rigorously, and combine their best models systematically.

This guide codifies the approach used by Kaggle Masters — from EDA through ensembling — in a reproducible workflow you can apply to any competition.

Competition Selection

Your First Competition

Don't start with active competitions:

Learning competitions (no time pressure, extensive resources):
- Titanic: Machine Learning from Disaster
  → Binary classification basics, feature engineering
  
- House Prices: Advanced Regression Techniques
  → Regression, handling many features, missing values

- Digit Recognizer (MNIST)
  → First neural network / CNN

Choosing Active Competitions

Selection criteria:
✓ Data type you have skills for (tabular/image/text/audio)
✓ Evaluation metric you understand (AUC, RMSE, F1, MAP)
✓ Dataset size you can handle (RAM/GPU constraints)
✓ Active discussion forum (shows community engagement)
✗ Avoid: prize competitions as first entries (too competitive)
✗ Avoid: proprietary domain data with no public knowledge

The Competition Workflow

Phase 1: Understanding the Problem (Day 1-2)

Before any modeling, load the data and run a first inspection:

# Competition setup template
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load and inspect
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"Target distribution:\n{train['target'].value_counts()}")
print(f"\nMissing values in train:\n{train.isnull().sum()[train.isnull().sum() > 0]}")

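# 2. Check for train/test distribution differences (covariate shift signals)
shared_numeric = [c for c in train.select_dtypes(include=[np.number]).columns
                  if c in test.columns]  # skips the target, which test lacks
for col in shared_numeric:
    train_mean = train[col].mean()
    test_mean = test[col].mean()
    # Relative difference in means; abs() in the denominator handles negative means
    if abs(train_mean - test_mean) / (abs(train_mean) + 1e-8) > 0.1:
        print(f"Distribution shift in {col}: train={train_mean:.3f}, test={test_mean:.3f}")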
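
Comparing means only catches coarse shifts. A complementary technique, not part of the template above, is adversarial validation: train a classifier to tell train rows from test rows and look at its AUC. The sketch below uses synthetic data and a hypothetical helper `adversarial_auc`; an AUC near 0.5 suggests the two sets look alike, while a high AUC suggests a shift.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def adversarial_auc(train_df, test_df, seed=0):
    """CV AUC of a classifier trained to distinguish train rows from test rows."""
    combined = pd.concat([train_df, test_df], ignore_index=True)
    is_test = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))].astype(int)
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, combined, is_test, cv=folds, scoring="roc_auc").mean()

# Synthetic stand-ins for train/test features (assumptions for illustration)
rng = np.random.default_rng(0)
cols = ["f0", "f1", "f2"]
train_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=cols)
test_same = pd.DataFrame(rng.normal(size=(1000, 3)), columns=cols)
test_shifted = test_same.copy()
test_shifted["f0"] += 2.0  # simulate one shifted feature

auc_same = adversarial_auc(train_demo, test_same)
auc_shifted = adversarial_auc(train_demo, test_shifted)
print(f"Same distribution AUC:    {auc_same:.3f}")
print(f"Shifted distribution AUC: {auc_shifted:.3f}")
```

On real competition data the features would need the same preprocessing as for modeling, and ID-like columns should be dropped first since they trivially separate the two sets.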

Key questions to answer before modeling:

- What is the evaluation metric and how does it penalize different errors?
- Is the test data from the same time period as training? (temporal leakage)
- What features exist and what do they mean? (read the competition description)
- What external data is allowed?
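
To make the first question concrete, the sketch below compares MAE and RMSE on two made-up prediction sets with the same total absolute error: one spreads the error evenly, the other concentrates it in a single large miss. The numbers are illustrative only.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
pred_even = y_true + 2.0                                      # every prediction off by 2
pred_outlier = y_true + np.array([0.0, 0.0, 0.0, 0.0, 10.0])  # one prediction off by 10

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

print(f"Even errors:  MAE={mae(y_true, pred_even):.3f}  RMSE={rmse(y_true, pred_even):.3f}")
print(f"One big miss: MAE={mae(y_true, pred_outlier):.3f}  RMSE={rmse(y_true, pred_outlier):.3f}")
```

MAE rates both prediction sets the same, while RMSE penalizes the single large miss much more heavily. Under an RMSE metric, avoiding a few big errors matters more than shaving small ones.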

Phase 2: Exploratory Data Analysis

# Target distribution
plt.figure(figsize=(10, 4))
train['target'].hist(bins=50)
plt.title('Target Distribution')
plt.show()

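# Feature correlation with target (ranked by absolute value, sign preserved,
# so strong negative correlations are not pushed to the bottom)
numeric_cols = train.select_dtypes(include=[np.number]).columns.tolist()
target_corr = train[numeric_cols].corr()['target']
correlations = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print("Top 10 correlated features:")
print(correlations.head(11))  # 11 because target itself is included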

# Categorical features
for col in train.select_dtypes(include=['object']).columns:
    print(f"\n{col}:")
    print(train.groupby(col)['target'].agg(['mean', 'count']))

# Distribution plots by target class
for col in correlations.head(6).index[1:]:  # Skip target
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    train[col].hist(ax=axes[0], bins=30, alpha=0.7)
    axes[0].set_title(f'{col} Distribution')
    
    train.groupby('target')[col].plot(kind='kde', ax=axes[1])
    axes[1].set_title(f'{col} by Target')
    axes[1].legend()
    plt.show()

Phase 3: Baseline Model

Establish a baseline before any feature engineering:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np
import pandas as pd

# Simple preprocessing, applied to train and test together so encodings match
def basic_preprocess(train_df, test_df):
    train_df, test_df = train_df.copy(), test_df.copy()
    # Encode categoricals with one shared mapping per column
    for col in train_df.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        le.fit(pd.concat([train_df[col], test_df[col]]).fillna('missing'))
        train_df[col] = le.transform(train_df[col].fillna('missing'))
        test_df[col] = le.transform(test_df[col].fillna('missing'))
    # Fill numeric missing values with the training medians
    medians = train_df.median()
    return train_df.fillna(medians), test_df.fillna(medians)

X, X_test = basic_preprocess(train.drop(['id', 'target'], axis=1),
                             test.drop(['id'], axis=1))
y = train['target']

# 5-fold CV baseline
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# LightGBM baseline (usually good starting point for tabular)
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'n_estimators': 500,
    'learning_rate': 0.05,
    'num_leaves': 31,
    'random_state': 42,
    'verbose': -1
}

baseline_scores = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    model = lgb.LGBMClassifier(**lgb_params)
    model.fit(X_tr, y_tr,
              eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50, verbose=False)])
    
    # Score with AUC to match the metric used for tuning and ensembling;
    # model.score() would report accuracy instead
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    baseline_scores.append(score)
    print(f"Fold {fold+1}: {score:.4f}")

print(f"\nBaseline CV AUC: {np.mean(baseline_scores):.4f} ± {np.std(baseline_scores):.4f}")

Document this score. Every change you make gets compared to this baseline.

Phase 4: Feature Engineering

Feature engineering typically provides the biggest gains:

from sklearn.preprocessing import PolynomialFeatures

# 'feature_A', 'feature_B', 'feature_C', and 'group_id' are placeholder column names
def engineer_features(df):
    df = df.copy()
    
    # 1. Ratio between top correlated variables (products are added in step 4)
    df['feature_A_div_B'] = df['feature_A'] / (df['feature_B'] + 1e-8)
    
    # 2. Aggregation features (if there are group IDs)
    if 'group_id' in df.columns:
        group_stats = df.groupby('group_id')['feature_A'].agg(['mean', 'std', 'max', 'min'])
        group_stats.columns = [f'group_{c}_feature_A' for c in group_stats.columns]
        df = df.merge(group_stats, on='group_id', how='left')
    
    # 3. Target encoding for high-cardinality categoricals
    # (use cross-validation to avoid leakage)
    
    # 4. Pairwise interaction features for top numeric features
    # (these columns must not contain missing values)
    top_features = ['feature_A', 'feature_B', 'feature_C']
    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    poly_values = poly.fit_transform(df[top_features])
    poly_names = poly.get_feature_names_out(top_features)
    # Add only the new product columns; the originals are already in df
    for i, name in enumerate(poly_names):
        if name not in top_features:
            df[name.replace(' ', '_x_')] = poly_values[:, i]
    
    return df

# Test each feature addition
# If CV improves: keep the feature
# If CV stays same or drops: remove the feature
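
Step 3 in the function above is left as a comment. One way it could look is out-of-fold target encoding, where each row's encoding is computed only from rows in the other folds, so a row never sees its own label. The sketch below assumes a toy DataFrame; the helper name `oof_target_encode` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(train_df, col, target, n_splits=5, seed=42):
    """Encode `col` with target means computed only from the other folds."""
    encoded = pd.Series(np.nan, index=train_df.index, dtype=float)
    global_mean = train_df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train_df):
        # Category means from the fitting folds only
        fold_means = train_df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train_df.iloc[enc_idx][col].map(fold_means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean
    return encoded.fillna(global_mean)

demo = pd.DataFrame({
    "city": ["a", "a", "b", "b", "c", "c", "a", "b", "c", "a"],
    "target": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
})
demo["city_te"] = oof_target_encode(demo, "city", "target")
print(demo)
```

For the test set, the usual choice is to encode with category means computed on the full training data. Smoothing toward the global mean for rare categories is a common refinement that this sketch omits.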

Phase 5: Hyperparameter Tuning

After feature engineering, tune the best-performing model:

import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': 1,  # bagging_fraction has no effect unless bagging_freq > 0
        'objective': 'binary',
        'metric': 'auc',
        'verbose': -1,
        'random_state': 42
    }
    
    model = lgb.LGBMClassifier(**params)
    # Reuse the shuffled StratifiedKFold from the baseline so scores are comparable
    cv_scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
    return cv_scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, timeout=3600)  # 1 hour budget

print(f"Best CV AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Fixed settings are not part of study.best_params, so add them back
best_lgb_params = {**study.best_params, 'bagging_freq': 1, 'objective': 'binary',
                   'metric': 'auc', 'verbose': -1, 'random_state': 42}

Phase 6: Ensembling

Combining models almost always outperforms any single model:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
import lightgbm as lgb
import xgboost as xgb

# Assumes X, y, and X_test (test features preprocessed the same way as X) exist,
# along with tuned parameter dicts best_lgb_params and best_xgb_params.

# Train multiple base models with out-of-fold predictions
n_splits = 5
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# OOF (Out-of-Fold) predictions for stacking
oof_lgb = np.zeros(len(X))
oof_xgb = np.zeros(len(X))
test_preds_lgb = np.zeros(len(X_test))
test_preds_xgb = np.zeros(len(X_test))

for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    print(f"Fold {fold + 1}/{n_splits}")
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    # LightGBM
    lgb_model = lgb.LGBMClassifier(**best_lgb_params)
    lgb_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                  callbacks=[lgb.early_stopping(50, verbose=False)])
    oof_lgb[val_idx] = lgb_model.predict_proba(X_val)[:, 1]
    test_preds_lgb += lgb_model.predict_proba(X_test)[:, 1] / n_splits
    
    # XGBoost (early_stopping_rounds goes to the constructor; fit() no longer
    # accepts it in XGBoost 2.x)
    xgb_model = xgb.XGBClassifier(**best_xgb_params, eval_metric='auc',
                                  early_stopping_rounds=50, verbosity=0)
    xgb_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    oof_xgb[val_idx] = xgb_model.predict_proba(X_val)[:, 1]
    test_preds_xgb += xgb_model.predict_proba(X_test)[:, 1] / n_splits

# Evaluate individual models
print(f"LGB OOF AUC: {roc_auc_score(y, oof_lgb):.4f}")
print(f"XGB OOF AUC: {roc_auc_score(y, oof_xgb):.4f}")

# Simple average ensemble
oof_blend = (oof_lgb + oof_xgb) / 2
test_blend = (test_preds_lgb + test_preds_xgb) / 2
print(f"Blend OOF AUC: {roc_auc_score(y, oof_blend):.4f}")

# Stacking: train meta-model on OOF predictions
meta_features = np.column_stack([oof_lgb, oof_xgb])
meta_model = Ridge()
# Score the meta-model out-of-fold as well; predicting on the rows it was
# trained on would inflate the AUC
oof_stacked = cross_val_predict(meta_model, meta_features, y, cv=cv)
print(f"Stacked OOF AUC: {roc_auc_score(y, oof_stacked):.4f}")

# Refit on all OOF predictions and apply to the averaged test predictions
meta_model.fit(meta_features, y)
test_stacked = meta_model.predict(np.column_stack([test_preds_lgb, test_preds_xgb]))
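
The simple average above weights both models equally. A possible refinement is to search the blend weight on the out-of-fold predictions. The sketch below uses synthetic stand-ins (`y_demo`, `oof_a`, `oof_b` are hypothetical arrays, not the OOF arrays from the workflow) so that it runs on its own.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
# Synthetic OOF predictions: the true label plus independent noise per model
oof_a = y_demo + rng.normal(scale=1.0, size=500)
oof_b = y_demo + rng.normal(scale=1.2, size=500)

# Coarse grid over the weight given to model A (0 = only B, 1 = only A)
weights = np.linspace(0, 1, 21)
aucs = [roc_auc_score(y_demo, w * oof_a + (1 - w) * oof_b) for w in weights]
best_w = weights[int(np.argmax(aucs))]

print(f"Model A AUC: {roc_auc_score(y_demo, oof_a):.4f}")
print(f"Model B AUC: {roc_auc_score(y_demo, oof_b):.4f}")
print(f"Best weight for A: {best_w:.2f}, blended AUC: {max(aucs):.4f}")
```

Because the weight is chosen on the same OOF predictions it is scored on, the blended score is slightly optimistic. Keeping the grid coarse limits that effect.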

Competition-Specific Tips

For Tabular Competitions

- LightGBM is usually the strongest baseline
- Add CatBoost and XGBoost for diversity in ensembles
- Time-series competitions: never let future data leak into training folds
- Feature importance plots reveal what the model found useful and often suggest new features

For Image Competitions

- Transfer learning from pretrained models (EfficientNet, ViT)
- Test-time augmentation (TTA): predict on augmented versions and average
- Multi-scale training and inference
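
Test-time augmentation can be sketched without a deep learning framework. In the example below, `toy_predict` is a hypothetical stand-in for a trained model, and the only augmentation is a horizontal flip; a real pipeline would call the actual model and typically use more views.

```python
import numpy as np

def predict_with_tta(predict_fn, images):
    """Average predictions over the original batch and its horizontal flip."""
    views = [images, images[:, :, ::-1]]   # batch shape assumed to be (N, H, W)
    return np.mean([predict_fn(v) for v in views], axis=0)

# Hypothetical stand-in for a trained model: scores an image by the mean
# brightness of its left half, so it is sensitive to horizontal flips
def toy_predict(batch):
    half = batch.shape[2] // 2
    return batch[:, :, :half].mean(axis=(1, 2))

rng = np.random.default_rng(0)
images = rng.random((4, 8, 8))

plain_scores = toy_predict(images)
tta_scores = predict_with_tta(toy_predict, images)
print("Without TTA:", np.round(plain_scores, 3))
print("With TTA:   ", np.round(tta_scores, 3))
```

Averaging over the flip removes the toy model's left/right bias, which is the effect TTA aims for with real models: predictions that depend less on incidental orientation or framing.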

For NLP Competitions

- Pretrained language models (BERT, RoBERTa, DeBERTa) are the standard
- External data is often allowed; check the rules
- Pseudo-labeling on test data can help
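
Pseudo-labeling follows the same three steps regardless of model type. The sketch below uses a logistic regression on synthetic tabular data instead of a language model so that it stays self-contained; the 0.95/0.05 confidence thresholds are assumptions to tune, and whether the technique is allowed depends on the competition rules.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_all, y_all = make_classification(n_samples=600, n_features=8, n_informative=4,
                                   random_state=0)
X_lab, y_lab = X_all[:200], y_all[:200]   # labeled training data
X_unlab = X_all[200:]                     # stands in for the unlabeled test set

# 1. Fit on labeled data and score the unlabeled rows
model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
proba = model.predict_proba(X_unlab)[:, 1]

# 2. Keep only confident predictions and turn them into hard labels
confident = (proba > 0.95) | (proba < 0.05)
X_pseudo = X_unlab[confident]
y_pseudo = (proba[confident] > 0.5).astype(int)

# 3. Retrain on labeled + pseudo-labeled rows
X_aug = np.vstack([X_lab, X_pseudo])
y_aug = np.concatenate([y_lab, y_pseudo])
model_pl = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

print(f"Added {int(confident.sum())} pseudo-labeled rows to {len(X_lab)} labeled rows")
```

As with any other change, the retrained model should be kept only if CV on the original labeled rows improves.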

The Mental Model of Top Competitors

Top competitors think in experiments, not guesses:

1. Hypothesis: "Adding this feature should improve CV by X"
2. Test: Run CV, measure delta
3. Keep or discard: Based on CV delta, not intuition
4. Document: Keep a log of every experiment

They don't:
- Submit before validating locally
- Chase the public LB (often leads to overfitting to the public set)
- Add random features hoping something sticks
- Copy solutions without understanding them
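
The hypothesis/test/decide/document loop above can be kept in a very small log. The sketch below assumes a higher-is-better metric; the `Experiment` class, the experiment names, and the scores are all hypothetical. It writes to an in-memory buffer, which could be swapped for a CSV file on disk.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    hypothesis: str
    cv_score: float
    baseline_cv: float

    @property
    def delta(self) -> float:
        return self.cv_score - self.baseline_cv

    @property
    def decision(self) -> str:
        # Keep only changes that improve CV; a tie counts as "discard"
        return "keep" if self.delta > 0 else "discard"

def write_log(experiments, handle):
    """Write one CSV row per experiment, including the CV delta and decision."""
    writer = csv.writer(handle)
    writer.writerow(["name", "hypothesis", "cv_score", "delta", "decision"])
    for exp in experiments:
        writer.writerow([exp.name, exp.hypothesis, f"{exp.cv_score:.4f}",
                         f"{exp.delta:+.4f}", exp.decision])

# Illustrative entries with made-up scores
log = [
    Experiment("ratio_A_B", "A/B ratio adds signal", cv_score=0.8120, baseline_cv=0.8100),
    Experiment("poly_top3", "pairwise products help", cv_score=0.8095, baseline_cv=0.8100),
]
buffer = io.StringIO()
write_log(log, buffer)
print(buffer.getvalue())
```

In practice the delta should be judged against fold-to-fold noise (the standard deviation reported with the baseline) and not against zero alone.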

Leaderboard Shake-Up

Many Kaggle veterans have experienced finishing well on the public leaderboard, only to drop significantly on the private leaderboard. Prevention:

- Trust your CV over public LB
- Check if your CV strategy correctly simulates the private test set
- Don't spend every daily submission (many competitions allow 5 per day); use each one to answer a specific question
- Pick 2 final submissions: one optimized for LB, one for CV stability

Conclusion

Kaggle competitions are one of the best ways to rapidly improve practical ML skills. The competition format forces rigorous evaluation, and the post-competition discussions from top finishers are some of the highest-quality ML education available anywhere.

The path to consistent top-10% finishes: understand the problem deeply, establish a rigorous CV baseline, engineer features systematically, tune carefully, and ensemble diverse models. Each competition teaches techniques you take to the next one.

For the ML skills competitions build on, see our machine learning beginners guide, feature engineering guide, and overfitting guide.

Frequently Asked Questions

What is Kaggle and how do its competitions work?

Kaggle is the world's largest data science community, hosting machine learning competitions where participants build models on provided datasets and compete for prize money and ranking points. Competitions typically run for 1-3 months. Organizers provide training data with labels, test data without labels, and an evaluation metric. Participants submit predictions on the test data, and Kaggle scores them using the metric. The private leaderboard (the final ranking) is revealed after the competition closes. Prizes range from $5,000 to $1,000,000+. For most participants, the more valuable rewards are the learning, the discussion forums where top solutions are shared after the competition, and the Kaggle ranking system (Novice → Contributor → Expert → Master → Grandmaster).
