Vahdettin Karataş
Data & Reporting Consultant
Available for 1-2 projects / month
  • Location: Prague, Czech Republic
Tools & Focus Areas
  • Dashboards & reporting
  • Data cleaning & analysis
  • Reporting workflows
  • Data pipelines & ETL
  • ML training → serving patterns
How I can help
  • Client work: dashboards, reports, automation
  • ML systems & interactive apps (see menu)
  • Data cleaning & trustworthy spreadsheets
Recent outcomes
  • Weekly reporting time reduced from hours to minutes
  • Cleaner spreadsheets that are easier to maintain
  • Teams get faster answers from their data
Portfolio · ML systems

Batch scoring for real pipelines.

Turn a customer CSV into a scored file: validation, training-aligned features, predict_proba, versioned output. Built for jobs that look like ETL — not microsecond APIs.
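A minimal sketch of that flow, assuming a joblib-serialized scikit-learn Pipeline that bundles preprocessing with the estimator; `score_batch`, `REQUIRED`, and `MODEL_VERSION` are illustrative names, not the repo's actual API:

```python
from datetime import datetime, timezone

import joblib
import pandas as pd

# Illustrative subset of the input contract (the real one has more columns).
REQUIRED = ["customerID", "tenure", "Contract", "MonthlyCharges", "TotalCharges"]
MODEL_VERSION = "churn-demo-0.1"  # placeholder version tag

def score_batch(in_csv: str, out_csv: str, model_path: str,
                threshold: float = 0.5) -> pd.DataFrame:
    df = pd.read_csv(in_csv)

    # Gate: fail fast on missing columns instead of scoring garbage.
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"input is missing required columns: {missing}")

    # Prep: the serialized Pipeline carries the same feature engineering
    # used in training, so serving features cannot drift from training.
    model = joblib.load(model_path)
    features = df.drop(columns=["customerID"])

    # Score: probabilities plus a thresholded label.
    proba = model.predict_proba(features)[:, 1]

    # Out: versioned, timestamped rows so every file is traceable.
    out = pd.DataFrame({
        "customer_id": df["customerID"],
        "churn_score": proba,
        "predicted_label": (proba >= threshold).astype(int),
        "model_version": MODEL_VERSION,
        "scoring_timestamp": datetime.now(timezone.utc).isoformat(),
    })
    out.to_csv(out_csv, index=False)
    return out
```

Because the whole thing is a plain function over files, it drops into cron, Airflow, or a CI job without a serving layer.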

The demo below uses the API when you run uvicorn web.main:app from the repo. A static mirror of this page has no backend — use the CLI or CI there.

Tests · pytest suite
CLI · flags & exit codes
Optional · run .meta.json

Try the pipeline

Upload a Telco-style customer batch (same shape as the input contract below), or run on a small built-in sample — no CSV on your machine required. The server validates, preprocesses, and scores; then you can download the result.

Scoring uses the small E2E fixture model committed in this repo (under tests/fixtures/) so the full path is real and testable — it is not a production or client-specific model. “Run with sample batch” loads five Telco-style rows from the same fixture set.

Uploading a CSV is optional if you use “Run with sample batch”.

Positioning

Contrast: online API

Request/response, SLOs, autoscaling. Valuable, but a different skill slice. This repo stays intentionally narrow: offline orchestration and clean batch output.

End-to-end flow

In: Raw CSV → Gate: Validate → Prep: Preprocess → Score: Model → Out: Scored CSV

Stack

Python · pandas · scikit-learn · joblib · PyYAML · pytest

I/O contract

Input (Telco-style)

customerID, tenure, Contract,
MonthlyCharges, TotalCharges,
PhoneService, InternetService, …
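One way a validation gate over that contract could look in pandas; the `CONTRACT` mapping below covers only the columns listed above (the trailing “…” means the real contract has more), and none of these names come from the repo:

```python
import pandas as pd

# Illustrative contract: column name -> coarse kind. The real repo's
# contract is larger; this is a sketch of the gating idea only.
CONTRACT = {
    "customerID": "string",
    "tenure": "numeric",
    "Contract": "string",
    "MonthlyCharges": "numeric",
    "TotalCharges": "numeric",
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; empty means the batch passes."""
    problems = []
    for col, kind in CONTRACT.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if kind == "numeric":
            # Coerce, then flag values that were present but not numeric
            # (e.g. the blank-string TotalCharges that CSV exports produce).
            coerced = pd.to_numeric(df[col], errors="coerce")
            bad = coerced.isna() & df[col].notna()
            if bad.any():
                problems.append(f"{col}: {int(bad.sum())} non-numeric value(s)")
    if "customerID" in df.columns and df["customerID"].duplicated().any():
        problems.append("duplicate customerID values")
    return problems
```

Collecting all problems instead of raising on the first one makes a failed batch much faster to debug from a CI log.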

Output

customer_id
churn_score
predicted_label
model_version
scoring_timestamp

Output review (Jupyter)

After you run the pipeline locally, the repo includes a notebook that validates the scored CSV (schema, row counts, probabilities in [0,1]) and plots the churn-score distribution.
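Those checks can be approximated in a few lines of pandas; this is a sketch of the assertions, not the notebook's actual cells, and the column names are taken from the output contract above:

```python
import pandas as pd

EXPECTED_COLS = ["customer_id", "churn_score", "predicted_label",
                 "model_version", "scoring_timestamp"]

def review_scored(scored: pd.DataFrame, expected_rows: int) -> None:
    # Schema: exactly the contract columns, in order.
    assert list(scored.columns) == EXPECTED_COLS, list(scored.columns)
    # Row count: one score per input row, nothing dropped or duplicated.
    assert len(scored) == expected_rows, len(scored)
    # Probabilities: present and within [0, 1].
    assert scored["churn_score"].notna().all()
    assert scored["churn_score"].between(0, 1).all()
    # Labels: binary, consistent with thresholded probabilities.
    assert set(scored["predicted_label"]).issubset({0, 1})
```

For the distribution plot, `scored["churn_score"].hist(bins=20)` gives the same quick read the notebook renders.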

View notebook on GitHub →  ·  readable in the browser; clone the repo to execute cells.

Interview pitch (30 seconds)

  • Problem: After training, a lot of value is in recurring batch scores — not every product needs a real-time endpoint.
  • What I shipped: One pipeline: strict input checks, same feature engineering as training, probabilities and labels, plus model version and timestamp on every row.
  • Ops detail: CLI for cron/CI, non-zero exit on failure, optional JSON manifest so you know which input and threshold produced a file.
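That operational shape can be sketched with argparse; the flag names, manifest fields, and `run_pipeline` stub below are illustrative assumptions, not the repo's actual CLI:

```python
import argparse
import json
import sys
from pathlib import Path

import pandas as pd

def run_pipeline(input_csv: str, output_csv: str, threshold: float) -> int:
    # Stand-in for the real scoring pipeline (illustrative): a real version
    # would validate, preprocess, and score. Here it copies rows through
    # so the CLI skeleton is runnable end to end. Returns the row count.
    df = pd.read_csv(input_csv)
    df.to_csv(output_csv, index=False)
    return len(df)

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Batch-score a customer CSV.")
    parser.add_argument("--input", required=True, help="raw customer CSV")
    parser.add_argument("--output", required=True, help="scored CSV to write")
    parser.add_argument("--threshold", type=float, default=0.5)
    parser.add_argument("--write-meta", action="store_true",
                        help="also write <output>.meta.json")
    args = parser.parse_args(argv)

    try:
        rows = run_pipeline(args.input, args.output, args.threshold)
    except Exception as exc:
        # Non-zero exit lets cron/CI treat a bad batch as a failed job.
        print(f"scoring failed: {exc}", file=sys.stderr)
        return 1

    if args.write_meta:
        # Manifest records which input and threshold produced this file.
        meta = {"input": args.input, "threshold": args.threshold, "rows": rows}
        Path(args.output + ".meta.json").write_text(json.dumps(meta, indent=2))
    return 0
```

In a script entry point this would finish with `sys.exit(main())` so the exit code reaches the shell, which is what makes the cron/CI failure signal work.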

Batch Scoring Pipeline

Offline ML batch job · showcase page

© Vahdettin Karataş. All rights reserved.