Dl/Overview

<p>[[dl/home|Data Lake Knowledge Center]]</p>
 
= Data Tiers =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">ETL Flow</div>
<div class="mw-collapsible-preview">Data Tiers</div>
<div class="mw-collapsible-content">
<div class="mw-collapsible-content">
{| class="wikitable grid mono section"
{| class="wikitable grid mono section"
Line 10: Line 9:
! Description
|-
|
{{#mermaid:
graph TD
    BronzeTier[Bronze Tier]
    SilverTier[Silver Tier]
    GoldTier[Gold Tier]
    PlatinumTier[Platinum Tier]
    ExternalData[External Data]

    ExternalData --ingestion--> BronzeTier --ingestion--> SilverTier --ETL--> GoldTier --ETL--> PlatinumTier
}}
|
<big><b>Bronze Tier</b></big>:<br /><br />
<b>The purpose of the bronze tier is to store data downloaded from the external world in the data lake, so that all sorts of tools inside the data lake can be used to process it further</b>
* raw data
* no uniform format; could be CSV, JSON, Avro, Parquet, binary, anything
* could even be unstructured
* no data quality assurance
<hr /><br />
 
<big><b>Silver Tier</b></big>:<br /><br />
<b>The purpose of the silver tier is to allow the data ingestion application to sanitize the data and verify its quality</b>

* Data quality is assured
* Data may not be normalized. One table may use UTC for a timestamp column while another table may use a timestamp without timezone. There is no standardization of column names.
* Data format is uniform; data is usually stored in a format that best fits the further ETL process, for example Parquet (see the promotion sketch after this table)
<hr /><br />
 
<big><b>Gold Tier</b></big>:<br /><br />
<b>Tables for star schema</b>
 
* dimension tables and fact tables that form a star schema
* a star schema is designed in such a way that it can answer any question about a business area
<hr /><br />
 
<big><b>Platinum Tier</b></big>:<br /><br />
<b>Query results for specific business questions, materialized in tables</b>

* Query results for specific business questions are materialized in tables
** Queries are generated from the star schema in the gold tier
* Tables may be replicated to an RDBMS for BI tools to access (sometimes you can expose them directly, e.g. via Spark Thrift Server)
 
|}
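
A minimal PySpark sketch of how data might move through these tiers; the paths, table names, and columns below are hypothetical, not part of any real lake layout:

<syntaxhighlight lang="python">
# Hypothetical tier-promotion job: bronze (raw) -> silver (clean Parquet)
# -> gold (star schema) -> platinum (materialized business answer).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tier-promotion").getOrCreate()

# Bronze: raw files exactly as downloaded (CSV here, but could be anything).
raw = spark.read.option("header", True).csv("/lake/bronze/orders/")

# Silver: sanitize, verify quality, normalize the timestamp, store as Parquet.
silver = (
    raw.dropDuplicates(["order_id"])
       .filter(F.col("order_id").isNotNull())
       .withColumn("order_ts", F.to_timestamp("order_ts"))
)
silver.write.mode("overwrite").parquet("/lake/silver/orders/")

# Gold: join into a fact table that is part of the star schema.
dim_customer = spark.read.parquet("/lake/silver/customers/")
fact_orders = silver.join(dim_customer, "customer_id")
fact_orders.write.mode("overwrite").parquet("/lake/gold/fact_orders/")

# Platinum: materialize the answer to one specific business question.
revenue_by_month = (
    fact_orders.groupBy(F.date_trunc("month", F.col("order_ts")).alias("month"))
               .agg(F.sum("amount").alias("revenue"))
)
revenue_by_month.write.mode("overwrite").parquet("/lake/platinum/revenue_by_month/")
</syntaxhighlight>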
</div>
</div>
<p></p>
 
= ETL =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">ETL Flow</div>
<div class="mw-collapsible-content">
{| class="wikitable grid mono section"
|-
|
{{#mermaid:
graph TD
     Scheduler[Apache Airflow/Scheduler]
     ETLE[ETL Executor#40;Airflow Task#41;]
     LC[Local ETL Code]
     ER[ETL Code Repo]
     JDBC[JDBC#40;Thrift Server#41;]
     User[User#40;Data Engineer#41;]
     Spark[Apache Spark]
     Scheduler --2: trigger--> ETLE
     ETLE --3: git pull -->LC
     ETLE --4: dbt--> JDBC
     JDBC --5:--> Spark
     ER --> LC
     User --1: git push-->ER
}}
| style="vertical-align:top;" |
<br />
* 1: User pushes ETL code into the ETL Code Repo
* 2: Airflow Scheduler triggers the DAG (the DAG is generated based on metadata)
** The ETL job is a task within an Airflow DAG
* 3: ETL executor pulls code from the ETL Code Repo onto local disk
* 4: ETL executor uses the dbt library to submit the job to Apache Spark via a JDBC interface (e.g. via Thrift Server); see the sketch after this table
* 5: Thrift Server takes the SQL and passes it to Apache Spark to execute
|}
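
A minimal sketch of the scheduler side of this flow, assuming a BashOperator-based task; the repo path, dbt project location, and schedule below are hypothetical:

<syntaxhighlight lang="python">
# Hypothetical Airflow DAG: pull the ETL code, then run dbt, whose profile
# points at the Spark Thrift Server so the compiled SQL executes on Spark.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lake_etl",                 # in practice generated from metadata
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Step 3: ETL executor pulls the latest code from the ETL Code Repo.
    pull_code = BashOperator(
        task_id="pull_code",
        bash_command="cd /opt/etl && git pull",
    )

    # Steps 4-5: dbt submits SQL over JDBC/Thrift; Spark executes it.
    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="cd /opt/etl && dbt run --profiles-dir .",
    )

    pull_code >> run_dbt
</syntaxhighlight>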
</div>
</div>
<p></p>
 
= BI Connection =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Using MPP Engine</div>
<div class="mw-collapsible-content">
{{#mermaid:
graph TD
    subgraph storage
        GoldTier[Gold Tier]
        PlatinumTier[Platinum Tier]
    end
    JDBC[JDBC Interface]
    MPP[MPP Engine #40;Starburst Trino#41;]
    BI[BI Tool #40;Power BI#41;]
    GoldTier --> MPP
    PlatinumTier --> MPP
    MPP --> JDBC --> BI
}}
<br />
 
* The BI tool accesses Gold Tier and Platinum Tier data via a JDBC interface exposed by an MPP engine
** Why? An MPP engine provides better interactive SQL query speed than Spark Thrift Server (see the connection sketch below)
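
A minimal sketch of a client-side connection, using the Python trino client; the host, catalog, schema, and table are hypothetical, and a BI tool such as Power BI would use its own Trino/Starburst connector instead:

<syntaxhighlight lang="python">
# Hypothetical interactive query against the MPP engine (Trino).
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",   # assumed coordinator hostname
    port=8080,
    user="bi_reader",
    catalog="lake",             # assumed catalog over the Gold/Platinum tiers
    schema="platinum",
)
cur = conn.cursor()
cur.execute("SELECT month, revenue FROM revenue_by_month ORDER BY month")
for month, revenue in cur.fetchall():
    print(month, revenue)
</syntaxhighlight>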
 
</div>
</div>
<p></p>
 
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Using RDBMS</div>
<div class="mw-collapsible-content">
{{#mermaid:
graph TD
    subgraph storage
        GoldTier[Gold Tier]
        PlatinumTier[Platinum Tier]
    end
    RDBMS[RDBMS#40;Oracle ADW#41;]
    JDBC[JDBC Interface]
    BI[BI Tool #40;Power BI#41;]
    GoldTier --replicate--> RDBMS
    PlatinumTier --replicate--> RDBMS
    RDBMS --> JDBC --> BI
}}
<br />
 
* Gold Tier and Platinum Tier data are replicated to an RDBMS, such as Oracle DB (see the sketch below)
* The BI tool accesses Gold Tier and Platinum Tier data via the JDBC interface exposed by the RDBMS
* This pattern does not work for a very large data lake, since the Gold Tier and Platinum Tier are too large to be replicated to an RDBMS
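
A minimal sketch of one way the replication step could be done, using Spark's built-in JDBC writer; the Oracle URL, target table, and credentials are hypothetical, and a dedicated replication tool is often used instead:

<syntaxhighlight lang="python">
# Hypothetical replication of a Platinum table into an RDBMS via JDBC.
# Requires the Oracle JDBC driver on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("replicate-to-rdbms").getOrCreate()

df = spark.read.parquet("/lake/platinum/revenue_by_month/")

(df.write.format("jdbc")
   .option("url", "jdbc:oracle:thin:@//adw.example.com:1521/lakehouse")
   .option("dbtable", "BI.REVENUE_BY_MONTH")
   .option("user", "replicator")
   .option("password", "***")   # use a secret store in practice
   .mode("overwrite")
   .save())
</syntaxhighlight>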
 
</div>
</div>
<p></p>
