Introduction

The objective of this workshop is to introduce participants to modern, open data analytics technologies and to show how to design the right architecture for a given problem. More specifically, a distinction is made between two typical scenarios:
  • Batch data processing - Analyzing historical data in batches
  • Real-time data streaming - Processing data as it arrives

Workshop Objectives

High-Level Goals

  1. Master Modern Data Stack Components: Learn to work with DuckDB, ClickHouse, Kafka, and Metabase
  2. Understand Data Architecture Patterns: Explore both batch and real-time processing paradigms
  3. Build End-to-End Analytics Solutions: From data ingestion to visualization
  4. Practice Real-World Scenarios: Work with actual restaurant and coffee shop data from Prishtina

Learning Outcomes

By the end of this workshop, you will be able to:
  • Set up and configure modern data analytics tools
  • Process batch data using DuckDB and PostgreSQL
  • Implement real-time streaming with ClickHouse and Kafka
  • Create data lakehouses for scalable analytics
  • Build interactive dashboards with Metabase
  • Apply geospatial analytics to location-based data
  • Design data architectures for different use cases

Workshop Structure

Task 1: Batch Analytics with DuckDB

  • Focus: Historical data processing and analysis
  • Technologies: DuckDB, PostgreSQL, R2 Object Storage
  • Data: Namecheap premium domain names, Prishtina restaurants and coffee shops, and historical revenue data
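The kind of batch query Task 1 builds toward can be sketched in plain SQL. The snippet below uses Python's built-in sqlite3 module as a stand-in engine (the workshop itself uses DuckDB; the table schema and sample rows are invented purely for illustration):

```python
import sqlite3

# In-memory database as a stand-in for DuckDB; the table and
# sample rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        establishment TEXT,
        amount REAL,
        day TEXT
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("Cafe A", 4.50, "2025-01-01"),
        ("Cafe A", 3.00, "2025-01-02"),
        ("Restaurant B", 12.75, "2025-01-01"),
    ],
)

# A typical batch query: total revenue per establishment
# over the full historical table.
rows = conn.execute("""
    SELECT establishment, ROUND(SUM(amount), 2) AS revenue
    FROM transactions
    GROUP BY establishment
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('Restaurant B', 12.75), ('Cafe A', 7.5)]
```

The same GROUP BY pattern carries over to DuckDB almost verbatim; the main difference in the workshop is that DuckDB can query CSV and object-storage files directly instead of requiring explicit INSERTs.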

Task 2: Real-Time Analytics with ClickHouse

  • Focus: Real-time streaming and live analytics
  • Technologies: ClickHouse, Kafka, Metabase
  • Data: Real-time transaction streams from restaurants and coffee shops
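The streaming side can be pictured as incremental aggregation over events as they arrive, rather than re-scanning a historical table. The sketch below simulates a Kafka-style event stream with a plain Python generator (the event fields are invented; in the workshop the events come from Kafka and are aggregated in ClickHouse):

```python
from collections import defaultdict

def event_stream():
    # Stand-in for a Kafka consumer; the fields are invented
    # for illustration and may differ from the workshop's schema.
    yield {"establishment": "Cafe A", "amount": 4.50}
    yield {"establishment": "Restaurant B", "amount": 12.75}
    yield {"establishment": "Cafe A", "amount": 3.00}

# Incremental aggregation: update running totals per establishment
# as each event arrives, so the answer is always current.
running_revenue = defaultdict(float)
for event in event_stream():
    running_revenue[event["establishment"]] += event["amount"]

print(dict(running_revenue))  # {'Cafe A': 7.5, 'Restaurant B': 12.75}
```

This is the core difference from Task 1: state is updated per event, which is what makes live dashboards (e.g. in Metabase over ClickHouse) possible.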

Prerequisites

  • Basic understanding of SQL
  • Familiarity with command line tools
  • Knowledge of CSV and JSON files and relational (tabular) database formats
  • Understanding of basic data concepts

Data Overview

Namecheap Premium Domain Names

  • Format: CSV
  • Content: Premium domain names
  • Data Fields: domain, price, extensions_taken
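A file in this schema can be loaded with nothing more than the standard csv module. The two sample rows below are invented to match the stated fields; real marketplace values will differ:

```python
import csv
import io

# Invented sample in the stated schema (domain, price, extensions_taken);
# real values from the Namecheap marketplace will differ.
sample = io.StringIO(
    "domain,price,extensions_taken\n"
    "example.com,1200,3\n"
    "coffee.dev,450,1\n"
)

reader = csv.DictReader(sample)
# CSV values arrive as strings, so cast the numeric columns.
rows = [
    {
        "domain": row["domain"],
        "price": float(row["price"]),
        "extensions_taken": int(row["extensions_taken"]),
    }
    for row in reader
]
print(rows[0])  # {'domain': 'example.com', 'price': 1200.0, 'extensions_taken': 3}
```

In the workshop itself DuckDB can read such a CSV directly, but it is useful to know what the raw rows look like.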
  • Source: Namecheap marketplace

Prishtina Places Dataset

  • Format: JSON
  • Content: Restaurants and coffee shops
  • Data Fields: name, location, rating, reviews, coordinates
  • Source: Scraped using SerpAPI from Google Places
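A record from such a JSON places file can be parsed and used for a simple geospatial check like the one below. The record shown is invented to match the listed fields (the real dataset's exact structure may differ), and the haversine formula is a standard way to turn coordinates into distances:

```python
import json
import math

# Invented sample record matching the listed fields;
# the real dataset's structure may differ.
record = json.loads("""
{
  "name": "Sample Cafe",
  "location": "Prishtina",
  "rating": 4.6,
  "reviews": 128,
  "coordinates": {"lat": 42.6629, "lon": 21.1655}
}
""")

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Distance from the record to an approximate city-centre point
# (here the same coordinates, so the distance is zero).
centre = (42.6629, 21.1655)
d = haversine_km(record["coordinates"]["lat"],
                 record["coordinates"]["lon"], *centre)
print(round(d, 1))  # 0.0
```

This is the kind of location-based calculation the geospatial parts of the workshop apply at dataset scale.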

Historical Transaction Data

  • Format: PostgreSQL database
  • Content: Synthetic transaction data for each establishment
  • Time Range: Historical synthetic data from 2025 for trend analysis

Real-time Transaction Data

  • Format: Kafka
  • Content: Real-time transaction data from restaurants and coffee shops
  • Time Range: Real-time synthetic data for immediate analysis and decision making

Getting Started

  1. Open the Workshop: Click the “Open in GitHub Codespaces” button above
  2. Follow Tasks Sequentially: Complete each task (e.g. 1.1) before moving on to the next (e.g. 1.2)
  3. Use Hints: Each task includes helpful hints and answers in accordion sections
  4. Experiment: Don’t be afraid to try different approaches

Next Steps

Ready to begin? Let’s start with Task 1 to learn about ad-hoc and batch analytics with DuckDB.