AiTechWorlds
Imagine a doctor seeing a patient for the first time. They do not glance at the patient, guess "appendicitis," and reach for a scalpel. They follow a process. They ask about symptoms and medical history — that is data collection. They note which symptoms seem most relevant — that is feature selection. They compare the symptom pattern to diseases they have seen before — that is model inference. They order tests to confirm their confidence level — that is evaluation.
Machine learning follows the same disciplined process. Skip a step and you risk the equivalent of misdiagnosis. Follow it carefully and your model is far more likely to be reliable, explainable, and ready for production.
Before writing a single line of code, answer two questions:
A clear problem definition saves hours of wasted effort. "Predict customer churn" is vague. "Predict whether a subscription customer will cancel within the next 30 days, given their last 90 days of usage data" is actionable.
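One way to test whether a problem definition is actionable is to check that it can be written down as a label. The sketch below does that for the 30-day churn example. The table, the column names (`customer_id`, `cancelled_at`), and the snapshot date are all hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical snapshot date: the day on which the prediction is made.
snapshot_date = pd.Timestamp("2024-06-01")

# Hypothetical customer table; a missing cancelled_at means "still subscribed".
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "cancelled_at": pd.to_datetime(["2024-06-15", None, "2024-08-20"]),
})

# Features may only use the 90 days before the snapshot.
usage_window_start = snapshot_date - pd.Timedelta(days=90)

# The label looks 30 days past the snapshot.
horizon_end = snapshot_date + pd.Timedelta(days=30)
customers["churned_within_30d"] = (
    (customers["cancelled_at"] > snapshot_date)
    & (customers["cancelled_at"] <= horizon_end)
)

print(customers)
```

Only the first customer is labeled as churned: the second never cancelled, and the third cancelled after the 30-day horizon. The vague version of the problem ("predict customer churn") gives no rule for deciding either case.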
Real-world data is messy. Values are missing, entries are duplicated, timestamps are in different formats, and outliers exist from data entry errors. This step transforms raw data into a usable state. Data scientists commonly report that it takes the majority of a project's time; figures of 60–80% are often quoted.
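A minimal cleaning pass with pandas might look like the sketch below. The data, the column names, and the "spend under 1000" plausibility rule are made up for illustration; real rules depend on the domain.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicated row, missing values,
# dates stored as text, and one implausible outlier.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "signup_date": ["2024-01-05", "2024-02-05", "2024-02-05",
                    "2024-03-10", "2024-04-01"],
    "monthly_spend": [20.0, np.nan, np.nan, 25.0, 9999.0],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Parse text dates into real datetime values.
clean["signup_date"] = pd.to_datetime(clean["signup_date"])

# 3. Drop implausible values (assumed rule: spend must be under 1000),
#    keeping rows where spend is missing so they can be filled next.
plausible = clean["monthly_spend"].isna() | (clean["monthly_spend"] < 1000)
clean = clean[plausible].copy()

# 4. Fill missing spend with the median of the remaining values.
clean["monthly_spend"] = clean["monthly_spend"].fillna(
    clean["monthly_spend"].median()
)
clean = clean.reset_index(drop=True)

print(clean)
```

The order matters here: removing the outlier before computing the median keeps the bad value from influencing the fill value.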
Before modeling, understand your data. Plot distributions. Check correlations. Look for class imbalance. Find which features have the most variance. EDA prevents surprises later and often reveals the most impactful features before any model is trained.
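The checks listed above can be sketched without any plotting, using the Iris dataset that ships with scikit-learn. Variable names such as `class_counts` and `feature_variance` are illustrative; histograms and pair plots would be the natural next step.

```python
from sklearn.datasets import load_iris

# Load Iris as a pandas DataFrame: 4 feature columns plus "target".
df = load_iris(as_frame=True).frame

# Distributions: count, mean, std, min, quartiles, max per column.
summary = df.describe()

# Class imbalance: how many rows belong to each class?
class_counts = df["target"].value_counts()

# Correlation of each feature with the target, strongest first.
correlations = (
    df.corr()["target"].drop("target").sort_values(ascending=False)
)

# Variance of each feature, largest first.
feature_variance = df.drop(columns="target").var().sort_values(ascending=False)

print(summary)
print(class_counts)
print(correlations)
print(feature_variance)
```

On Iris the classes are perfectly balanced and the petal measurements stand out, which hints at the most useful features before any model is trained.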
Raw columns are rarely the best input for a model. Feature engineering creates new, more informative variables from existing ones. Extracting "day of week" from a timestamp, computing "price per square foot" from price and area, or log-transforming a skewed column are all feature engineering.
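The three transformations mentioned above can be sketched in pandas as follows. The listings table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical property listings.
listings = pd.DataFrame({
    "listed_at": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-01-10"]),
    "price": [300_000, 450_000, 1_200_000],
    "area_sqft": [1_500, 1_800, 3_000],
})

# Day of week from a timestamp (Monday=0 ... Sunday=6).
listings["day_of_week"] = listings["listed_at"].dt.dayofweek

# Ratio feature: price per square foot.
listings["price_per_sqft"] = listings["price"] / listings["area_sqft"]

# Log transform to compress a right-skewed column.
# log1p computes log(1 + x), which stays defined at x = 0.
listings["log_price"] = np.log1p(listings["price"])

print(listings)
```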
Select an algorithm appropriate for your problem type and data size. Train it on your training set. This is usually the shortest step — often just a few lines of code.
Check performance on data the model has never seen. Use metrics appropriate to your problem (accuracy, F1 score, RMSE, AUC). If performance is insufficient, tune hyperparameters, try different algorithms, or go back and engineer better features.
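To see why the choice of metric matters, here is a small sketch with hand-written labels and scores (not output from a real model). The classification data is imbalanced, with 8 negatives and 2 positives. RMSE is computed as the square root of `mean_squared_error`, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    roc_auc_score,
)

# Made-up classification example: 8 negatives, 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.35, 0.6, 0.9, 0.4])
y_pred = (y_score >= 0.5).astype(int)  # threshold scores at 0.5

accuracy = accuracy_score(y_true, y_pred)  # 8 of 10 labels are correct
f1 = f1_score(y_true, y_pred)              # only 1 of 2 positives is found
auc = roc_auc_score(y_true, y_score)       # ranking quality, threshold-free

# Made-up regression example.
y_true_reg = np.array([3.0, 5.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 9.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

print(f"Accuracy: {accuracy:.2f}")
print(f"F1 score: {f1:.2f}")
print(f"AUC:      {auc:.2f}")
print(f"RMSE:     {rmse:.2f}")
```

Accuracy comes out at 0.80 while F1 is only 0.50, because accuracy rewards the easy negatives. On imbalanced problems, F1 or AUC is usually the more honest number.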
A model that lives only in a Jupyter notebook helps no one. Deploy it as an API, a batch job, or an embedded component. Then monitor it — real-world data drifts over time, and a model accurate today may degrade in six months.
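As a rough sketch of both halves of this step, the code below saves a trained model with `joblib` (installed alongside scikit-learn), reloads it the way a serving process would, and runs a deliberately simple drift check. The `drifted_features` helper and its threshold of one training standard deviation are hypothetical choices for illustration, not a standard API; production monitoring typically uses more careful statistical tests.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# "Deploy": persist the model to disk, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_model.joblib")
    joblib.dump(model, model_path)
    served_model = joblib.load(model_path)

same_predictions = bool((served_model.predict(X) == model.predict(X)).all())


def drifted_features(train, live, threshold=1.0):
    """Hypothetical drift check: return the indices of features whose
    live mean moved more than `threshold` training standard deviations."""
    shift = np.abs(live.mean(axis=0) - train.mean(axis=0)) / train.std(axis=0)
    return np.flatnonzero(shift > threshold)


# "Monitor": simulate drift by shifting petal length (column 2) by 3 cm.
live = X.copy()
live[:, 2] += 3.0

print("Reloaded model matches:", same_predictions)
print("Drifted feature indices:", drifted_features(X, live))
```

When a check like this starts flagging features, that is the signal to collect fresh data and retrain.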
| Library | Role | Install |
|---|---|---|
| pandas | Load, clean, and manipulate tabular data | `pip install pandas` |
| NumPy | Fast numerical arrays and math operations | `pip install numpy` |
| scikit-learn | ML algorithms, preprocessing, evaluation | `pip install scikit-learn` |
| matplotlib | Plotting and visualization | `pip install matplotlib` |
| seaborn | Statistical visualizations built on matplotlib | `pip install seaborn` |
These five libraries cover the large majority of classical ML work. Install them once:
```bash
pip install pandas numpy scikit-learn matplotlib seaborn
```
The Iris dataset contains 150 flowers from three species, each described by four petal and sepal measurements. It is the "Hello World" of ML. Here is the core of the workflow (load, split, train, predict, evaluate) in about a dozen lines of code:
```python
# Step 1: Import tools
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 2: Load data (already clean: 150 rows, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# Step 3: Split into training set (80%) and test set (20%)
# random_state=42 ensures the same split every run (reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Choose a model and train it on training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)  # Model learns from 120 examples

# Step 5: Make predictions on the 30 test examples the model never saw
predictions = model.predict(X_test)

# Step 6: Measure accuracy
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
```

Output:

```
Training samples: 120
Test samples: 30
Accuracy: 100.00%
```
The Iris dataset is clean and well-separated, so 100% accuracy is normal for this dataset. In real projects you will rarely see this — and that is expected.
| Line | What It Does |
|---|---|
| `load_iris(return_X_y=True)` | Returns features array X (150×4) and labels array y (150,) |
| `train_test_split(..., test_size=0.2)` | Randomly assigns 80% to train, 20% to test |
| `random_state=42` | Seeds the random number generator for reproducibility |
| `model.fit(X_train, y_train)` | The model learns the mapping from features to species |
| `model.predict(X_test)` | Applies learned rules to unseen examples |
| `accuracy_score(y_test, predictions)` | Compares predicted labels to true labels |
This is the most important concept in practical ML.
If you test the model on the same data you trained it on, you are asking a student to take the same test they studied from — with the same questions. Of course they will score 100%. That tells you nothing about whether they understood the material.
```
All Data (150 rows)
├── Training Set (120 rows)  ← Model learns from this
└── Test Set (30 rows)       ← Model is evaluated on this
                               (never seen during training)
```
The test set simulates real-world deployment: new data the model has never encountered.
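Because Iris is so easy, it hides the gap between training and test performance. The sketch below makes the gap visible on a synthetic dataset from `make_classification`, with roughly 20% of labels deliberately randomized (`flip_y=0.2`). The dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (parameters chosen for illustration).
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    flip_y=0.2,       # randomize about 20% of labels
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained decision tree can memorize its training data.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"Accuracy on training data: {train_accuracy:.2%}")
print(f"Accuracy on test data:     {test_accuracy:.2%}")
```

The tree scores perfectly on the rows it memorized, noise included, and noticeably worse on the held-out rows. Only the second number says anything about how the model would behave after deployment.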
In practice, you cycle back. EDA reveals a data quality issue, so you return to Step 2. Evaluation shows poor performance, so you return to Step 4 to engineer better features. Deployment reveals distribution shift, so you collect new data and retrain.
```
Problem Definition
         |
         v
   Collect Data <--------+
         |               |
         v               |
        EDA              |
         |               |
         v               |
Feature Engineering      |
         |               |
         v               |
    Train Model          |
         |               |
         v               |
     Evaluate ---(poor)--+
         |
       (good)
         v
      Deploy
         |
         v
      Monitor ---(drift)---> Collect Data
```