Dl/Overview

{{#mermaid:
graph TD
    User[User #40;Data Engineer#41;]
    ER[ETL Code Repo]
    Scheduler[Apache Airflow Scheduler]
    ETLE[ETL Executor #40;Airflow Task#41;]
    LC[Local ETL Code]
    JDBC[JDBC #40;Thrift Server#41;]
    Spark[Apache Spark]
    subgraph storage
        GoldTier[Gold Tier]
        PlatinumTier[Platinum Tier]
    end
    MPP[MPP Engine #40;Starburst Trino#41;]
    BI[BI Tool #40;Power BI#41;]
    User --1: git push--> ER
    ER --> LC
    Scheduler --2: trigger--> ETLE
    ETLE --3: git pull--> LC
    ETLE --4: dbt--> JDBC
    JDBC --5: SQL--> Spark
    GoldTier --> MPP
    PlatinumTier --> MPP
    MPP --> BI
}}
<br />


* 1: The user pushes ETL code into the ETL Code Repo
* 2: The Airflow Scheduler triggers the DAG (the DAG is generated from metadata; see the DAG sketch after this list)
** The ETL job is a task within an Airflow DAG
* 3: The ETL Executor pulls the code from the ETL Code Repo onto local disk
* 4: The ETL Executor uses the dbt library to submit the job to Apache Spark via the JDBC interface (e.g. via the Thrift Server)
* 5: The Thrift Server takes the SQL and passes it to Apache Spark for execution (see the connection sketch after this list)
* The BI Tool accesses Gold Tier and Platinum Tier data via the JDBC interface exposed by the MPP engine (see the query sketch after this list)
** Why an MPP engine? It provides better interactive SQL query speed than the Spark Thrift Server
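
Step 2 assumes the DAGs are generated from metadata. Below is a minimal sketch of that pattern, assuming a hypothetical metadata list, repo path, and dbt selectors; on the real platform the metadata would be read from a metadata store, and the actual DAG layout may differ.
<syntaxhighlight lang="python">
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical metadata; on the real platform this would come from a
# metadata store rather than being hard-coded.
ETL_JOBS = [
    {"name": "orders_gold", "select": "tag:gold"},
    {"name": "orders_platinum", "select": "tag:platinum"},
]

for job in ETL_JOBS:
    with DAG(
        dag_id=f"etl_{job['name']}",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Step 3: the ETL Executor pulls the latest ETL code onto local disk.
        git_pull = BashOperator(
            task_id="git_pull",
            bash_command="cd /opt/etl-code && git pull",
        )
        # Step 4: dbt submits the job to Spark via the Thrift Server
        # (assumes a dbt profile configured with the dbt-spark thrift method).
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=f"cd /opt/etl-code && dbt run --select {job['select']}",
        )
        git_pull >> dbt_run

    # Register each generated DAG so Airflow's DagBag can discover it.
    globals()[dag.dag_id] = dag
</syntaxhighlight>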
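
For step 5, a minimal sketch of a client submitting SQL over the Thrift Server, with a hypothetical host, port, and table; PyHive is used here only for illustration, since dbt performs the equivalent submission in step 4.
<syntaxhighlight lang="python">
from pyhive import hive

# Connect to the Spark Thrift Server (hypothetical host/port).
conn = hive.connect(host="spark-thrift.internal", port=10000, username="etl")
cursor = conn.cursor()

# The Thrift Server passes this SQL to Apache Spark for execution (step 5).
cursor.execute("SELECT COUNT(*) FROM gold.orders")
print(cursor.fetchone())
</syntaxhighlight>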
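
For the BI path, the MPP engine exposes the same data for interactive queries. A minimal sketch using the Trino Python client, with a hypothetical host, catalog, and schema; a BI tool such as Power BI would typically connect through its ODBC/JDBC connector instead.
<syntaxhighlight lang="python">
import trino

# Connect to the MPP engine (Starburst Trino); host, catalog, and
# schema are hypothetical.
conn = trino.dbapi.connect(
    host="trino.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="gold",
)
cur = conn.cursor()

# An interactive aggregation over Gold Tier data, as a BI tool would issue it.
cur.execute("SELECT order_date, SUM(amount) FROM orders GROUP BY order_date")
for row in cur.fetchall():
    print(row)
</syntaxhighlight>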
</div>
</div>
<p></p>

