Introduction
This workshop aims to introduce participants to the latest open and modern data analytics technologies and to show how to design the right architecture for each use case. More specifically, a distinction is made between two typical scenarios:
- Batch data processing - Analyzing historical data in batches
- Real-time data streaming - Processing data as it arrives
Workshop Objectives
High-Level Goals
- Master Modern Data Stack Components: Learn to work with DuckDB, ClickHouse, Kafka, and Metabase
- Understand Data Architecture Patterns: Explore both batch and real-time processing paradigms
- Build End-to-End Analytics Solutions: From data ingestion to visualization
- Practice Real-World Scenarios: Work with actual restaurant and coffee shop data from Prishtina
Learning Outcomes
By the end of this workshop, you will be able to:
- Set up and configure modern data analytics tools
- Process batch data using DuckDB and PostgreSQL
- Implement real-time streaming with ClickHouse and Kafka
- Create data lakehouses for scalable analytics
- Build interactive dashboards with Metabase
- Apply geospatial analytics to location-based data
- Design data architectures for different use cases
Workshop Structure
Task 1: Batch Analytics with DuckDB
- Focus: Historical data processing and analysis
- Technologies: DuckDB, PostgreSQL, R2 Object Storage
- Data: Namecheap premium domain names, Prishtina restaurants and coffee shops, and historical revenue data
Task 2: Real-Time Analytics with ClickHouse
- Focus: Real-time streaming and live analytics
- Technologies: ClickHouse, Kafka, Metabase
- Data: Streaming real-time transactions from restaurants and coffee shops
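In Task 2, each transaction travels through Kafka as a small encoded message before landing in ClickHouse. A sketch of what one such event might look like (the field names here are illustrative assumptions, not the workshop's actual schema):

```python
import json
from datetime import datetime, timezone

# One synthetic transaction event; field names are hypothetical.
event = {
    "establishment": "Cafe Prishtina",
    "amount_eur": 4.50,
    "items": ["macchiato", "croissant"],
    "ts": datetime.now(timezone.utc).isoformat(),
}

# Kafka messages are plain bytes; JSON is a common encoding choice.
payload = json.dumps(event).encode("utf-8")

# With a broker running, a producer library would publish it, e.g.:
#   KafkaProducer(bootstrap_servers="localhost:9092").send("transactions", payload)

# A consumer (or a ClickHouse Kafka table engine) decodes it on the other side.
decoded = json.loads(payload)
print(decoded["establishment"])  # → Cafe Prishtina
```

ClickHouse can subscribe to such a topic directly via its Kafka table engine, which the later tasks walk through.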
Prerequisites
- Basic understanding of SQL
- Familiarity with command line tools
- Knowledge of CSV, JSON, and relational (tabular) database formats
- Understanding of basic data concepts
Data Overview
Namecheap Premium Domain Names
- Format: CSV
- Content: Premium domain names
- Data Fields: domain, price, extensions_taken
- Source: Namecheap marketplace
Prishtina Places Dataset
- Format: JSON
- Content: Restaurants and coffee shops
- Data Fields: name, location, rating, reviews, coordinates
- Source: Scraped using SerpAPI from Google Places
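A record in this dataset might look like the sketch below (values invented; the field names follow the Data Fields list above). The haversine helper hints at the kind of geospatial analytics the workshop applies to the coordinates:

```python
import json
import math

# One invented record shaped like the Prishtina places dataset.
record = json.loads("""
{
  "name": "Example Coffee",
  "location": "Prishtina",
  "rating": 4.6,
  "reviews": 128,
  "coordinates": {"lat": 42.6629, "lng": 21.1655}
}
""")

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km, handy for simple geospatial filters."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Distance from an assumed city-centre point (coordinates are approximate).
center = (42.6629, 21.1655)
d = haversine_km(record["coordinates"]["lat"], record["coordinates"]["lng"], *center)
print(record["name"], round(d, 3))
```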
Historical Transaction Data
- Format: PostgreSQL database
- Content: Synthetic transaction data for each establishment
- Time Range: Historical synthetic data from 2025 for trend analysis
Real-time Transaction Data
- Format: Kafka
- Content: Real-time transaction data from restaurants and coffee shops
- Time Range: Real-time synthetic data for immediate analysis and decision making
Getting Started
- Open the Workshop: Click the “Open in GitHub Codespaces” button above
- Follow Tasks Sequentially: Complete each task (e.g. 1.1) before moving to the next (e.g. 1.2)
- Use Hints: Each task includes helpful hints and answers in accordion sections
- Experiment: Don’t be afraid to try different approaches