Dl/Overview
Data Tiers
| Diagram | Description |
|---|---|
| (diagram not available) | Bronze Tier, Silver Tier, Gold Tier, Platinum Tier |
ETL
ETL Flow
- 1: The user pushes ETL code to the ETL Code Repo
- 2: The Airflow Scheduler triggers the DAG (the DAG is generated from metadata)
  - The ETL job is a task within an Airflow DAG
- 3: The ETL executor pulls the code from the ETL Code Repo onto its local disk
- 4: The ETL executor uses the dbt library to submit the job to Apache Spark via a JDBC interface (e.g. via Thrift Server)
- 5: Thrift Server takes the SQL and passes it to Apache Spark for execution
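The flow above can be sketched in plain Python, without importing Airflow or dbt. This is only an illustration under stated assumptions: the `EtlJob` metadata shape, the function names, and the use of git for the ETL Code Repo are hypothetical and not taken from the actual setup. The connection to the Thrift Server is assumed to be configured in the dbt profile, so it does not appear on the command line. `graphlib` requires Python 3.9 or later.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EtlJob:
    """One metadata entry; each entry becomes one task in the generated DAG.

    The field names are hypothetical, chosen only for this sketch.
    """
    name: str                          # task id inside the DAG
    repo_url: str                      # location of the ETL Code Repo
    dbt_model: str                     # dbt model the task should run
    depends_on: Tuple[str, ...] = ()   # names of upstream jobs


def build_dag(jobs: List[EtlJob]) -> Dict[str, List[str]]:
    """Step 2: derive the DAG (task -> upstream tasks) from metadata."""
    known = {job.name for job in jobs}
    dag: Dict[str, List[str]] = {}
    for job in jobs:
        missing = [dep for dep in job.depends_on if dep not in known]
        if missing:
            raise ValueError(f"{job.name} depends on unknown jobs: {missing}")
        dag[job.name] = list(job.depends_on)
    return dag


def run_order(dag: Dict[str, List[str]]) -> List[str]:
    """Order in which a scheduler could trigger the tasks.

    Raises graphlib.CycleError if the metadata describes a cycle.
    """
    return list(TopologicalSorter(dag).static_order())


def executor_commands(job: EtlJob, workdir: Path) -> List[List[str]]:
    """Steps 3 and 4: the commands an executor could run for one task.

    Assumes the repo is a git repo and that dbt is invoked via its CLI.
    The commands are only built here, not executed.
    """
    checkout = workdir / job.name
    return [
        # Step 3: pull the ETL code onto local disk.
        ["git", "clone", "--depth", "1", job.repo_url, str(checkout)],
        # Step 4: let dbt submit the SQL to Spark (via the Thrift Server
        # connection defined in the dbt profile).
        ["dbt", "run", "--project-dir", str(checkout), "--select", job.dbt_model],
    ]
```

In a real deployment each `EtlJob` would map to an Airflow task and the dependency edges to task ordering; the sketch only shows how metadata alone is enough to determine the DAG shape and the per-task commands.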
BI Connection
Using MPP Engine
- The BI tool accesses Gold Tier and Platinum Tier data via the JDBC interface exposed by an MPP engine
- Why? An MPP engine provides faster interactive SQL queries than Spark Thrift Server
Using RDBMS
- Gold Tier and Platinum Tier data are replicated to an RDBMS, such as Oracle DB
- The BI tool accesses Gold Tier and Platinum Tier data via the JDBC interface exposed by the RDBMS
- This pattern does not work for a very large data lake, because the Gold Tier and Platinum Tier are too large to replicate to an RDBMS
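The choice between the two BI connection patterns comes down to whether the Gold and Platinum tiers fit into the RDBMS. A minimal sketch of that check follows; the function name, the `headroom` parameter, and its default value are hypothetical and only illustrate the reasoning, they are not a sizing rule from this page.

```python
def choose_bi_pattern(
    gold_bytes: int,
    platinum_bytes: int,
    rdbms_capacity_bytes: int,
    headroom: float = 0.5,
) -> str:
    """Return "rdbms" if both tiers fit in the usable RDBMS capacity, else "mpp".

    headroom is the fraction of RDBMS capacity assumed to be available
    for replicated data (a hypothetical default, tune per deployment).
    """
    if min(gold_bytes, platinum_bytes, rdbms_capacity_bytes) < 0:
        raise ValueError("sizes must be non-negative")
    if not 0.0 < headroom <= 1.0:
        raise ValueError("headroom must be in (0, 1]")
    usable = rdbms_capacity_bytes * headroom
    replicated = gold_bytes + platinum_bytes
    return "rdbms" if replicated <= usable else "mpp"
```

For example, with 1 TiB of Gold data, 0.25 TiB of Platinum data, and a 4 TiB RDBMS, the replicated data fits within the usable half of the capacity, so replication is viable; with tens of TiB in the Gold Tier it is not, and the MPP engine pattern applies.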