
Introduction

The objective of this workshop is to introduce the latest open, modern data analytics technologies and show how to design the right architecture for a given workload. More specifically, it distinguishes between two typical scenarios:
  • Batch data processing - Analyzing historical data in batches
  • Real-time data streaming - Processing data as it arrives

Workshop Objectives

High-Level Goals

  1. Master Modern Data Stack Components: Learn to work with DuckDB, ClickHouse, Kafka, and Metabase
  2. Understand Data Architecture Patterns: Explore both batch and real-time processing paradigms
  3. Build End-to-End Analytics Solutions: From data ingestion to visualization
  4. Practice Real-World Scenarios: Work with actual restaurant and coffee shop data from Prishtina

Learning Outcomes

By the end of this workshop, you will be able to:
  • Set up and configure modern data analytics tools
  • Process batch data using DuckDB and PostgreSQL
  • Implement real-time streaming with ClickHouse and Kafka
  • Create data lakehouses for scalable analytics
  • Build interactive dashboards with Metabase
  • Apply geospatial analytics to location-based data
  • Design data architectures for different use cases

Workshop Structure

Task 1: Batch Analytics with DuckDB

  • Focus: Historical data processing and analysis
  • Technologies: DuckDB, PostgreSQL, R2 Object Storage
  • Data: Namecheap premium domain names; Prishtina restaurants and coffee shops; historical revenue data

Task 2: Real-Time Analytics with ClickHouse

  • Focus: Real-time streaming and live analytics
  • Technologies: ClickHouse, Kafka, Metabase
  • Data: Live transaction streams from restaurants and coffee shops

Prerequisites

  • Basic understanding of SQL
  • Familiarity with command line tools
  • Knowledge of CSV, JSON, and tabular database formats
  • Understanding of basic data concepts

Data Overview

Namecheap Premium Domain Names

  • Format: CSV
  • Content: Premium domain names
  • Data Fields: domain, price, extensions_taken
  • Source: Namecheap marketplace
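
Before loading a CSV like this into an analytics engine, it can be useful to sanity-check the rows with plain Python. The sketch below uses the field names from the dataset description above, but the sample values and the parsing details are hypothetical, not taken from the actual workshop files.

```python
import csv
import io

# Hypothetical sample rows matching the documented fields:
# domain, price, extensions_taken
sample = io.StringIO(
    "domain,price,extensions_taken\n"
    "coffee.com,12500.00,3\n"
    "prishtina.io,899.00,1\n"
)

# Parse each row into a dict keyed by the header, casting price to float
reader = csv.DictReader(sample)
rows = [{**r, "price": float(r["price"])} for r in reader]

# Example question: which domain in the sample is most expensive?
top = max(rows, key=lambda r: r["price"])
print(top["domain"])  # coffee.com
```

In the workshop itself, DuckDB can ingest the CSV directly; this snippet only illustrates the shape of the data.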

Prishtina Places Dataset

  • Format: JSON
  • Content: Restaurants and coffee shops
  • Data Fields: name, location, rating, reviews, coordinates
  • Source: Scraped using SerpAPI from Google Places
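
Since the learning outcomes include geospatial analytics on location data, a minimal example of the kind of computation involved is the haversine (great-circle) distance between two places. The record shape below mirrors the documented fields; the names and coordinate values are made up for illustration.

```python
import json
import math

# Two hypothetical records shaped like the documented fields
places_json = """[
  {"name": "Cafe A", "rating": 4.5, "reviews": 120,
   "coordinates": {"lat": 42.6629, "lng": 21.1655}},
  {"name": "Cafe B", "rating": 4.2, "reviews": 80,
   "coordinates": {"lat": 42.6700, "lng": 21.1500}}
]"""

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlng = math.radians(lng2 - lng1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlng / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

a, b = json.loads(places_json)
d = haversine_km(a["coordinates"]["lat"], a["coordinates"]["lng"],
                 b["coordinates"]["lat"], b["coordinates"]["lng"])
print(f"{d:.2f} km")  # roughly 1.5 km for these sample points
```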

Historical Transaction Data

  • Format: PostgreSQL database
  • Content: Synthetic transaction data for each establishment
  • Time Range: Historical synthetic data from 2025 for trend analysis

Real-time Transaction Data

  • Format: Kafka stream
  • Content: Real-time transaction data from restaurants and coffee shops
  • Time Range: Real-time synthetic data for immediate analysis and decision-making
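
To make the streaming scenario concrete, here is a minimal stand-in for a Kafka consumer: a generator yields JSON-encoded transaction events, and a running revenue total is kept per establishment. The event fields and values are hypothetical; in the workshop, events arrive on a Kafka topic and are aggregated in ClickHouse rather than in Python.

```python
import json
from collections import defaultdict

def event_stream():
    """Stand-in for a Kafka consumer: yields JSON-encoded transaction events."""
    events = [
        {"place": "Cafe A", "amount": 3.50},
        {"place": "Cafe B", "amount": 7.20},
        {"place": "Cafe A", "amount": 2.80},
    ]
    for e in events:
        yield json.dumps(e)

# Running revenue per establishment, updated as each event arrives
totals = defaultdict(float)
for message in event_stream():
    event = json.loads(message)
    totals[event["place"]] += event["amount"]

print({k: round(v, 2) for k, v in totals.items()})
# {'Cafe A': 6.3, 'Cafe B': 7.2}
```

The same update-on-arrival pattern is what the real-time task implements at scale: each incoming message immediately refreshes the aggregate a dashboard reads.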

Getting Started

  1. Open the Workshop: Click the “Open in GitHub Codespaces” button above
  2. Follow Tasks Sequentially: Complete each task (e.g. 1.1) before moving to the next (e.g. 1.2)
  3. Use Hints: Each task includes helpful hints and answers in accordion sections
  4. Experiment: Don’t be afraid to try different approaches

Next Steps

Ready to begin? Let’s start with Task 1 to learn about ad-hoc and batch analytics with DuckDB.