Vahdettin Karataş
Data & Reporting Consultant
Available for 1-2 projects / month
  • Location: Prague, Czech Republic
Tools & Focus Areas
  • Dashboards & reporting
  • Data cleaning & analysis
  • Reporting workflows
  • Data pipelines & ETL
  • ML training → serving patterns
How I can help
  • Client work: dashboards, reports, automation
  • ML systems & interactive apps (see menu)
  • Data cleaning & trustworthy spreadsheets
Recent outcomes
  • Weekly reporting time reduced from hours to minutes
  • Cleaner spreadsheets that are easier to maintain
  • Teams get faster answers from their data
Portfolio · ML systems

Batch scoring for real pipelines.

Turn a customer CSV into a scored file: validation, training-aligned features, predict_proba, versioned output. Built for jobs that look like ETL — not microsecond APIs.
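A minimal sketch of that flow, assuming a joblib-serialized scikit-learn Pipeline that bundles preprocessing with the estimator; `score_batch`, `REQUIRED`, and `MODEL_VERSION` are illustrative names, not the repo's actual API:

```python
from datetime import datetime, timezone

import joblib
import pandas as pd

# Illustrative subset of the input contract (the real one has more columns).
REQUIRED = ["customerID", "tenure", "Contract", "MonthlyCharges", "TotalCharges"]
MODEL_VERSION = "churn-demo-0.1"  # placeholder version tag

def score_batch(in_csv: str, out_csv: str, model_path: str,
                threshold: float = 0.5) -> pd.DataFrame:
    df = pd.read_csv(in_csv)

    # Gate: fail fast on missing columns instead of scoring garbage.
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"input is missing required columns: {missing}")

    # Prep: the serialized Pipeline carries the same feature engineering
    # used in training, so serving features cannot drift from training.
    model = joblib.load(model_path)
    features = df.drop(columns=["customerID"])

    # Score: probabilities plus a thresholded label.
    proba = model.predict_proba(features)[:, 1]

    # Out: versioned, timestamped rows so every file is traceable.
    out = pd.DataFrame({
        "customer_id": df["customerID"],
        "churn_score": proba,
        "predicted_label": (proba >= threshold).astype(int),
        "model_version": MODEL_VERSION,
        "scoring_timestamp": datetime.now(timezone.utc).isoformat(),
    })
    out.to_csv(out_csv, index=False)
    return out
```

Because the whole thing is a plain function over files, it drops into cron, Airflow, or a CI job without a serving layer.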

The demo below uses the API when you run uvicorn web.main:app from the repo. A static mirror of this page has no backend — use the CLI or CI there.

Tests · pytest suite
CLI · flags & exit codes
Optional · run .meta.json

Try the pipeline

Upload a Telco-style customer batch (same shape as the input contract below), or run on a small built-in sample — no CSV on your machine required. The server validates, preprocesses, and scores; then you can download the result.

Scoring uses the small E2E fixture model committed in this repo (under tests/fixtures/) so the full path is real and testable — it is not a production or client-specific model. “Run with sample batch” loads five Telco-style rows from the same fixture set.

Uploading a CSV is optional if you use “Run with sample batch”.

Positioning

Contrast: online API

Request/response, SLOs, autoscaling. Valuable, but a different skill slice. This repo stays intentionally narrow: offline orchestration and clean batch output.

End-to-end flow

In: Raw CSV → Gate: Validate → Prep: Preprocess → Score: Model → Out: Scored CSV

Stack

Python · pandas · scikit-learn · joblib · PyYAML · pytest

I/O contract

Input (Telco-style)

customerID, tenure, Contract,
MonthlyCharges, TotalCharges,
PhoneService, InternetService, …
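One way a validation gate over that contract could look in pandas; the `CONTRACT` mapping below covers only the columns listed above (the trailing “…” means the real contract has more), and none of these names come from the repo:

```python
import pandas as pd

# Illustrative contract: column name -> coarse kind. The real repo's
# contract is larger; this is a sketch of the gating idea only.
CONTRACT = {
    "customerID": "string",
    "tenure": "numeric",
    "Contract": "string",
    "MonthlyCharges": "numeric",
    "TotalCharges": "numeric",
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; empty means the batch passes."""
    problems = []
    for col, kind in CONTRACT.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if kind == "numeric":
            # Coerce, then flag values that were present but not numeric
            # (e.g. the blank-string TotalCharges that CSV exports produce).
            coerced = pd.to_numeric(df[col], errors="coerce")
            bad = coerced.isna() & df[col].notna()
            if bad.any():
                problems.append(f"{col}: {int(bad.sum())} non-numeric value(s)")
    if "customerID" in df.columns and df["customerID"].duplicated().any():
        problems.append("duplicate customerID values")
    return problems
```

Collecting all problems instead of raising on the first one makes a failed batch much faster to debug from a CI log.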

Output

customer_id
churn_score
predicted_label
model_version
scoring_timestamp

Output review (Jupyter)

After you run the pipeline locally, the repo includes a notebook that validates the scored CSV (schema, row counts, probabilities in [0,1]) and plots the churn-score distribution.
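Those checks can be approximated in a few lines of pandas; this is a sketch of the assertions, not the notebook's actual cells, and the column names are taken from the output contract above:

```python
import pandas as pd

EXPECTED_COLS = ["customer_id", "churn_score", "predicted_label",
                 "model_version", "scoring_timestamp"]

def review_scored(scored: pd.DataFrame, expected_rows: int) -> None:
    # Schema: exactly the contract columns, in order.
    assert list(scored.columns) == EXPECTED_COLS, list(scored.columns)
    # Row count: one score per input row, nothing dropped or duplicated.
    assert len(scored) == expected_rows, len(scored)
    # Probabilities: present and within [0, 1].
    assert scored["churn_score"].notna().all()
    assert scored["churn_score"].between(0, 1).all()
    # Labels: binary, consistent with thresholded probabilities.
    assert set(scored["predicted_label"]).issubset({0, 1})
```

For the distribution plot, `scored["churn_score"].hist(bins=20)` gives the same quick read the notebook renders.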

View notebook on GitHub →  ·  readable in the browser; clone the repo to execute cells.

Interview pitch (30 seconds)

  • Problem: After training, a lot of value is in recurring batch scores — not every product needs a real-time endpoint.
  • What I shipped: One pipeline: strict input checks, same feature engineering as training, probabilities and labels, plus model version and timestamp on every row.
  • Ops detail: CLI for cron/CI, non-zero exit on failure, optional JSON manifest so you know which input and threshold produced a file.
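That operational shape can be sketched with argparse; the flag names, manifest fields, and `run_pipeline` stub below are illustrative assumptions, not the repo's actual CLI:

```python
import argparse
import json
import sys
from pathlib import Path

import pandas as pd

def run_pipeline(input_csv: str, output_csv: str, threshold: float) -> int:
    # Stand-in for the real scoring pipeline (illustrative): a real version
    # would validate, preprocess, and score. Here it copies rows through
    # so the CLI skeleton is runnable end to end. Returns the row count.
    df = pd.read_csv(input_csv)
    df.to_csv(output_csv, index=False)
    return len(df)

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Batch-score a customer CSV.")
    parser.add_argument("--input", required=True, help="raw customer CSV")
    parser.add_argument("--output", required=True, help="scored CSV to write")
    parser.add_argument("--threshold", type=float, default=0.5)
    parser.add_argument("--write-meta", action="store_true",
                        help="also write <output>.meta.json")
    args = parser.parse_args(argv)

    try:
        rows = run_pipeline(args.input, args.output, args.threshold)
    except Exception as exc:
        # Non-zero exit lets cron/CI treat a bad batch as a failed job.
        print(f"scoring failed: {exc}", file=sys.stderr)
        return 1

    if args.write_meta:
        # Manifest records which input and threshold produced this file.
        meta = {"input": args.input, "threshold": args.threshold, "rows": rows}
        Path(args.output + ".meta.json").write_text(json.dumps(meta, indent=2))
    return 0
```

In a script entry point this would finish with `sys.exit(main())` so the exit code reaches the shell, which is what makes the cron/CI failure signal work.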

Batch Scoring Pipeline

Offline ML batch job · showcase page

© Vahdettin Karataş. All rights reserved.