Data Lake Knowledge Center
ETL
ETL Flow
- 1: User pushes ETL code to the ETL Code Repo
- 2: Airflow Scheduler triggers the DAG (the DAG is generated from metadata; see the generation sketch after this list)
- The ETL job is a task within an Airflow DAG
- 3: ETL executor pulls the code from the ETL Code Repo onto local disk
- 4: ETL executor uses the dbt library to submit the job to Apache Spark via a JDBC interface (e.g. via the Spark Thrift Server; see the connection sketch after this list)
- 5: Thrift Server takes the SQL and passes it to Apache Spark for execution
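
A minimal sketch of the metadata-driven DAG generation in step 2, assuming a hypothetical `ETL_JOBS` metadata list; the repo path, job names, and schedules are illustrative, not part of the source:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical metadata; in practice this would be loaded from a metadata store.
ETL_JOBS = [
    {"name": "orders_daily", "schedule": "@daily", "model": "orders"},
    {"name": "users_hourly", "schedule": "@hourly", "model": "users"},
]

for job in ETL_JOBS:
    with DAG(
        dag_id=f"etl_{job['name']}",
        schedule=job["schedule"],
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        # Step 3: pull the latest ETL code from the ETL Code Repo to local disk.
        pull_code = BashOperator(
            task_id="pull_code",
            bash_command="git -C /opt/etl pull",  # repo location is an assumption
        )
        # Step 4: run dbt, which submits the job's SQL to Spark.
        run_dbt = BashOperator(
            task_id="run_dbt",
            bash_command=f"cd /opt/etl && dbt run --select {job['model']}",
        )
        pull_code >> run_dbt

    # Expose each generated DAG at module level so the Airflow Scheduler discovers it.
    globals()[f"etl_{job['name']}"] = dag
```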
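
For reference, steps 4 and 5 boil down to sending SQL over the HiveServer2 Thrift protocol to the Spark Thrift Server, which hands it to Spark for execution. A minimal sketch using PyHive (the library dbt's Spark adapter relies on for its `thrift` connection method); the host, port, username, and query here are assumptions:

```python
from pyhive import hive

# The Spark Thrift Server speaks the HiveServer2 protocol (default port 10000).
conn = hive.connect(host="spark-thrift.internal", port=10000, username="etl")
cursor = conn.cursor()

# Step 5: the Thrift Server passes this SQL to Apache Spark to execute.
cursor.execute("SELECT order_id, amount FROM silver.orders LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```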
BI Connection
BI Flow
- BI tools access Gold-tier and Platinum-tier data via a JDBC interface exposed by an MPP engine (see the sketch after this list)
- Why? An MPP engine provides faster interactive SQL queries than the Spark Thrift Server
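
The source does not name a specific MPP engine; assuming Trino as an example, an interactive BI-style query looks like the sketch below. The Python DB-API client stands in for the JDBC connection a BI tool would actually open; host, catalog, schema, and table names are illustrative:

```python
import trino

# Connect to the MPP engine's coordinator (Trino assumed here).
conn = trino.dbapi.connect(
    host="trino.internal",
    port=8080,
    user="bi_user",
    catalog="hive",   # catalog backed by the data lake
    schema="gold",    # Gold-tier schema name is an assumption
)
cursor = conn.cursor()

# The kind of interactive aggregation a BI dashboard would issue.
cursor.execute("SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region")
for region, revenue in cursor.fetchall():
    print(region, revenue)
```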