Dl/Overview: Difference between revisions

From stonehomewiki
Jump to navigationJump to search
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
<p>[[dl/home|Data Lake Knowledge Center]]</p>
<p>[[dl/home|Data Lake Knowledge Center]]</p>
= Data Tiers =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Data Tiers</div>
<div class="mw-collapsible-content">
{| class="wikitable grid mono section"
|-
! Diagram
! Description
|-
|
{{#mermaid:
graph TD
    BornzeTier[Bronze Tier]
    SilverTier[Silver Tier]
    GoldTier[Gold Tier]
    PlatniumTier[Platnium Tier]
    ExternalData[External Data]
    ExternalData --ingestion--> BornzeTier --ingestion--> SilverTier --ETL--> GoldTier --ETL--> PlatniumTier
}}
|
<big><b>Bronze Tier</b></big>:<br /><br />
<b>The purpose for bronze tier is to store data downloaded from external world into data lake so we can use all sort of tools inside data lake to further process it</b>
* raw data
* no uniformed format, could be csv, JSON, AVRO, parquet, binary, anything
* could even be unstructured
* no data quality assurance
<hr /><br />
<big><b>Silver Tier</b></big>:<br /><br />
<b>The purpose for bronze tier is to allow data ingestion application to sanitize data, verify the quality of the data</b>
* Data quality is assurred
* Data may not be normalized. One table may use UTC for a timestamp column while aother table may use timestamp without timezone. No stadnardlization for column name.
* Data format is uniformed, usually it is stored as a format that is best fits the further ETL process, for example parquet.
<hr /><br />
<big><b>Gold Tier</b></big>:<br /><br />
<b>Tables for star schema</b>
* dimention tables and fact tables that forms star schema
* a star schema is designed in such a way that it can answer any question about a business area.
<hr /><br />
<big><b>Platinum Tier</b></big>:<br /><br />
<b>Various query result for specific business questions materized in tables</b>
* Various query result for specific business questions materized in tables
** Queries are generated from star schema from gold tier
* Tables may be replicated to a RDBMS for BI tool to access (sometime you can expose them directly, e.g. Spark Thrift Server)
|}
</div>
</div>
<p></p>


= ETL =
= ETL =
Line 8: Line 63:
graph TD
graph TD
     Scheduler[Apache Airflow/Scheduler]
     Scheduler[Apache Airflow/Scheduler]
     ETLE[ETL Executor&lt;Airflow Task&gt;]
     ETLE[ETL Executor#40;Airflow Task#41;]
     LC[Local ETL Code]
     LC[Local ETL Code]
     ER[ETL Code Repo]
     ER[ETL Code Repo]
     JDBC[JDBC&lt;Thrift Server&gt;]
     JDBC[JDBC#40;Thrift Server#41;]
     User[User&lt;Data Engineer&gt;]
     User[User#40;Data Engineer#41;]
     Spark[Apache Spark]
     Spark[Apache Spark]
     Scheduler --2: trigger--> ETLE
     Scheduler --2: trigger--> ETLE

Latest revision as of 21:02, 25 November 2025

Data Lake Knowledge Center

Data Tiers

ETL

BI Connection