{{#mermaid:
graph TD
Scheduler[Apache Airflow/Scheduler]
ETLE[ETL Executor<Airflow Task>]
LC[Local ETL Code]
ER[ETL Code Repo]
JDBC[JDBC<Thrift Server>]
User[User<Data Engineer>]
Spark[Apache Spark]
Scheduler --2: trigger--> ETLE
ETLE --3: git pull -->LC
ETLE --4: dbt--> JDBC
JDBC --5:--> Spark
ER --> LC
User --1: git push-->ER
}}
<br />

{{#mermaid:
graph TD
subgraph storage
GoldTier[Gold Tier]
PlatinumTier[Platinum Tier]
end
MPP[MPP Engine #40;Starburst Trino#41;]
BI[BI Tool #40;Power BI#41;]
GoldTier --> MPP
PlatinumTier --> MPP
MPP --> BI
}}
<br />
</div>
</div>
<p></p>
Revision as of 18:53, 25 November 2025
Data Lake Knowledge Center
ETL
ETL Flow
- 1: User pushes ETL code to the ETL Code Repo
- 2: Airflow Scheduler triggers the DAG (the DAG is generated from metadata)
  - The ETL job is a task within an Airflow DAG
- 3: ETL executor pulls the code from the ETL Code Repo onto local disk
- 4: ETL executor uses the dbt library to submit the job to Apache Spark via a JDBC interface (e.g. via Thrift Server)
- 5: Thrift Server takes the SQL and passes it to Apache Spark for execution
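
Steps 3 and 4 can be sketched as a small Python helper that an Airflow task might call. This is only an illustration under stated assumptions: the `EtlJob` fields, the repo URL, the local directory and the dbt target name `spark_thrift` are hypothetical, and the actual executor may be wired differently. The sketch only plans the `git` and `dbt` command lines and hands them to a caller-supplied runner, so it can be dry-run without touching a repo or a Spark cluster.

```python
import shlex
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List

Command = List[str]


@dataclass(frozen=True)
class EtlJob:
    """Hypothetical description of one ETL job (all fields are assumptions)."""
    repo_url: str    # URL of the ETL Code Repo
    local_dir: Path  # where the executor keeps its local copy of the code
    dbt_target: str  # dbt target assumed to point at the Spark Thrift Server


def plan_commands(job: EtlJob, already_cloned: bool) -> List[Command]:
    """Return the command lines for step 3 (sync code) and step 4 (run dbt)."""
    if already_cloned:
        # Step 3: refresh the existing local copy.
        sync = ["git", "-C", str(job.local_dir), "pull", "--ff-only"]
    else:
        # Step 3: first run on this executor, so clone instead of pull.
        sync = ["git", "clone", job.repo_url, str(job.local_dir)]
    # Step 4: dbt compiles the models to SQL and submits them through the
    # connection configured for the chosen target.
    run = ["dbt", "run", "--project-dir", str(job.local_dir),
           "--target", job.dbt_target]
    return [sync, run]


def execute(job: EtlJob, runner: Callable[[Command], None]) -> None:
    """Run the planned commands in order using the supplied runner."""
    already_cloned = (job.local_dir / ".git").is_dir()
    for cmd in plan_commands(job, already_cloned):
        runner(cmd)


if __name__ == "__main__":
    demo = EtlJob(
        repo_url="https://git.example.com/data/etl-code.git",  # hypothetical
        local_dir=Path("/tmp/etl-code"),                        # hypothetical
        dbt_target="spark_thrift",                              # hypothetical
    )
    # Dry run: print the commands instead of executing them.
    execute(demo, lambda cmd: print(shlex.join(cmd)))
```

For a real run, the runner could be `lambda cmd: subprocess.run(cmd, check=True)`, so that a failing `git` or `dbt` command raises and fails the Airflow task.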
BI Connection
ETL Flow
- The BI tool accesses Gold Tier and Platinum Tier data via the JDBC interface exposed by an MPP engine
  - Why? An MPP engine provides better interactive SQL query speed than Spark Thrift Server
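
As a rough illustration of the JDBC interface mentioned above, the following Python sketch builds a connection URL in the form accepted by the Trino JDBC driver (`jdbc:trino://host:port/catalog/schema`). The host name, port, catalog name `lake` and schema name `gold` are assumptions made for the example and are not taken from this page.

```python
from typing import Optional
from urllib.parse import quote


def trino_jdbc_url(host: str, port: int,
                   catalog: Optional[str] = None,
                   schema: Optional[str] = None) -> str:
    """Build a Trino JDBC URL; catalog and schema are optional path parts."""
    if schema and not catalog:
        # The URL format places the schema after the catalog,
        # so a schema cannot be given on its own.
        raise ValueError("a schema requires a catalog")
    url = f"jdbc:trino://{host}:{port}"
    if catalog:
        url += "/" + quote(catalog, safe="")
    if schema:
        url += "/" + quote(schema, safe="")
    return url


if __name__ == "__main__":
    # Hypothetical endpoint and names, for illustration only.
    print(trino_jdbc_url("trino.example.internal", 8443, "lake", "gold"))
```

A BI tool would typically be given one such URL per tier, or a single catalog-level URL with the schema chosen per query.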