Dl/Overview

{{#mermaid:
graph TD
    User[User #40;Data Engineer#41;]
    ER[ETL Code Repo]
    Scheduler[Apache Airflow Scheduler]
    ETLE[ETL Executor #40;Airflow Task#41;]
    LC[Local ETL Code]
    JDBC[JDBC #40;Thrift Server#41;]
    Spark[Apache Spark]
    subgraph storage
        GoldTier[Gold Tier]
        PlatinumTier[Platinum Tier]
    end
    MPP[MPP Engine #40;Starburst Trino#41;]
    BI[BI Tool #40;Power BI#41;]
    User --1: git push--> ER
    ER --> LC
    Scheduler --2: trigger--> ETLE
    ETLE --3: git pull--> LC
    ETLE --4: dbt--> JDBC
    JDBC --5: SQL--> Spark
    GoldTier --> MPP
    PlatinumTier --> MPP
    MPP --> BI
}}
<br />


* 1: The user pushes ETL code into the ETL Code Repo
* 2: The Airflow Scheduler triggers the DAG (the DAG is generated from metadata; see the DAG sketch after this list)
** The ETL job is a task within an Airflow DAG
* 3: The ETL Executor pulls the code from the ETL Code Repo onto local disk
* 4: The ETL Executor uses the dbt library to submit the job to Apache Spark via the JDBC interface (e.g. via the Thrift Server)
* 5: The Thrift Server takes the SQL and passes it to Apache Spark for execution (see the connection sketch after this list)
* The BI Tool accesses Gold Tier and Platinum Tier data via the JDBC interface exposed by the MPP engine (see the query sketch after this list)
** Why an MPP engine? It provides better interactive SQL query speed than the Spark Thrift Server
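
Step 2 assumes the DAGs are generated from metadata. Below is a minimal sketch of that pattern, assuming a hypothetical metadata list, repo path, and dbt selectors; on the real platform the metadata would be read from a metadata store, and the actual DAG layout may differ.
<syntaxhighlight lang="python">
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical metadata; on the real platform this would come from a
# metadata store rather than being hard-coded.
ETL_JOBS = [
    {"name": "orders_gold", "select": "tag:gold"},
    {"name": "orders_platinum", "select": "tag:platinum"},
]

for job in ETL_JOBS:
    with DAG(
        dag_id=f"etl_{job['name']}",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Step 3: the ETL Executor pulls the latest ETL code onto local disk.
        git_pull = BashOperator(
            task_id="git_pull",
            bash_command="cd /opt/etl-code && git pull",
        )
        # Step 4: dbt submits the job to Spark via the Thrift Server
        # (assumes a dbt profile configured with the dbt-spark thrift method).
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=f"cd /opt/etl-code && dbt run --select {job['select']}",
        )
        git_pull >> dbt_run

    # Register each generated DAG so Airflow's DagBag can discover it.
    globals()[dag.dag_id] = dag
</syntaxhighlight>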
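
For step 5, a minimal sketch of a client submitting SQL over the Thrift Server, with a hypothetical host, port, and table; PyHive is used here only for illustration, since dbt performs the equivalent submission in step 4.
<syntaxhighlight lang="python">
from pyhive import hive

# Connect to the Spark Thrift Server (hypothetical host/port).
conn = hive.connect(host="spark-thrift.internal", port=10000, username="etl")
cursor = conn.cursor()

# The Thrift Server passes this SQL to Apache Spark for execution (step 5).
cursor.execute("SELECT COUNT(*) FROM gold.orders")
print(cursor.fetchone())
</syntaxhighlight>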
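
For the BI path, the MPP engine exposes the same data for interactive queries. A minimal sketch using the Trino Python client, with a hypothetical host, catalog, and schema; a BI tool such as Power BI would typically connect through its ODBC/JDBC connector instead.
<syntaxhighlight lang="python">
import trino

# Connect to the MPP engine (Starburst Trino); host, catalog, and
# schema are hypothetical.
conn = trino.dbapi.connect(
    host="trino.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="gold",
)
cur = conn.cursor()

# An interactive aggregation over Gold Tier data, as a BI tool would issue it.
cur.execute("SELECT order_date, SUM(amount) FROM orders GROUP BY order_date")
for row in cur.fetchall():
    print(row)
</syntaxhighlight>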
</div>
</div>
<p></p>

