Introduction
This workshop aims to inspire people with the latest open and modern data analytics technologies and to show how to design the right architectures around them. More specifically, it distinguishes between two typical scenarios:
- Batch data processing - Analyzing historical data in batches
- Real-time data streaming - Processing data as it arrives in real-time
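The difference between the two scenarios can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names and the sample amounts are invented here, and the workshop itself uses DuckDB, ClickHouse, and Kafka rather than this code.

```python
from typing import Iterable, Iterator


def batch_total(amounts: list[float]) -> float:
    """Batch: the full historical dataset is available, so aggregate it in one pass."""
    return sum(amounts)


def streaming_totals(amounts: Iterable[float]) -> Iterator[float]:
    """Streaming: update a running total as each value arrives and emit it immediately."""
    total = 0.0
    for amount in amounts:
        total += amount
        yield total


# Invented sample amounts, for illustration only.
sample = [4.5, 12.0, 3.25]

print(batch_total(sample))             # 19.75
print(list(streaming_totals(sample)))  # [4.5, 16.5, 19.75]
```

The batch function needs all of the data before it can answer, while the streaming generator produces an up-to-date result after every new value.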
Workshop Objectives
High-Level Goals
- Master Modern Data Stack Components: Learn to work with DuckDB, ClickHouse, Kafka, and Metabase
- Understand Data Architecture Patterns: Explore both batch and real-time processing paradigms
- Build End-to-End Analytics Solutions: From data ingestion to visualization
- Practice Real-World Scenarios: Work with actual restaurant and coffee shop data from Prishtina
Learning Outcomes
By the end of this workshop, you will be able to:
- Set up and configure modern data analytics tools
- Process batch data using DuckDB and PostgreSQL
- Implement real-time streaming with ClickHouse and Kafka
- Create data lakehouses for scalable analytics
- Build interactive dashboards with Metabase
- Apply geospatial analytics to location-based data
- Design data architectures for different use cases
Workshop Structure
Task 1: Batch Analytics with DuckDB
- Focus: Historical data processing and analysis
- Technologies: DuckDB, PostgreSQL, R2 Object Storage
- Data: Namecheap premium domain names, restaurants and coffee shops, and historical revenue data
Task 2: Real-Time Analytics with ClickHouse
- Focus: Real-time streaming and live analytics
- Technologies: ClickHouse, Kafka, Metabase
- Data: Streaming real-time transactions from restaurants and coffee shops
Prerequisites
- Basic understanding of SQL
- Familiarity with command line tools
- Knowledge of CSV, JSON, and tabular database formats
- Understanding of basic data concepts
Data Overview
Namecheap Premium Domain Names
- Format: CSV
- Content: Premium domain names
- Data Fields: domain, price, extensions_taken
- Source: Namecheap marketplace
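To get a feel for this dataset's shape, here is a minimal standard-library sketch that parses CSV text with the three listed columns. The sample rows are invented, and treating price as a number and extensions_taken as an integer count is an assumption; the real file may encode these fields differently.

```python
import csv
import io

# Invented sample rows; only the column names come from the dataset description.
SAMPLE_CSV = """domain,price,extensions_taken
example-one.com,1200.00,3
example-two.com,450.50,1
"""


def load_domains(text: str) -> list[dict]:
    """Parse CSV text into dicts with typed price and extensions_taken values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "domain": row["domain"],
            "price": float(row["price"]),                      # assumed numeric
            "extensions_taken": int(row["extensions_taken"]),  # assumed integer count
        })
    return rows


domains = load_domains(SAMPLE_CSV)
most_expensive = max(domains, key=lambda d: d["price"])
print(most_expensive["domain"])  # example-one.com
```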
Prishtina Places Dataset
- Format: JSON
- Content: Restaurants and coffee shops
- Data Fields: name, location, rating, reviews, coordinates
- Source: Scraped using SerpAPI from Google Places
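Because each place carries coordinates, a typical geospatial question is how far apart two establishments are. The sketch below computes the great-circle distance with the haversine formula. The sample records are invented, and the layout of the coordinates field (an object with lat and lng keys) is an assumption about the JSON, not something confirmed by the dataset description.

```python
import json
import math

# Invented sample records; the lat/lng layout of "coordinates" is an assumption.
SAMPLE_JSON = """[
  {"name": "Cafe A", "location": "Prishtina", "rating": 4.6, "reviews": 120,
   "coordinates": {"lat": 42.6629, "lng": 21.1655}},
  {"name": "Restaurant B", "location": "Prishtina", "rating": 4.2, "reviews": 85,
   "coordinates": {"lat": 42.6593, "lng": 21.1600}}
]"""

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


places = json.loads(SAMPLE_JSON)
a, b = places[0]["coordinates"], places[1]["coordinates"]
distance = haversine_km(a["lat"], a["lng"], b["lat"], b["lng"])
print(f"{places[0]['name']} -> {places[1]['name']}: {distance:.2f} km")
```

The haversine formula treats the Earth as a sphere, which is accurate enough for distances within a single city.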
Historical Transaction Data
- Format: PostgreSQL database
- Content: Synthetic transaction data for each establishment
- Time Range: Historical synthetic data from 2025 for trend analysis
Real-time Transaction Data
- Format: Kafka
- Content: Real-time transaction data from restaurants and coffee shops
- Time Range: Real-time synthetic data for immediate analysis and decision making
Getting Started
- Open the Workshop: Click the “Open in GitHub Codespaces” button above
- Follow Tasks Sequentially: Complete each task (e.g. 1.1) before moving to the next one (e.g. 1.2)
- Use Hints: Each task includes helpful hints and answers in accordion sections
- Experiment: Don’t be afraid to try different approaches