Dl/Best Practices
From stonehomewiki
<p> [[dl/home|Data Lake Knowledge Center]] </p>
= Platform =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Apache Spark</div>
<div class="mw-collapsible-content">
Apache Spark is a good platform for both batch-based and streaming-based data processing. Advantages:
* Scalable
* Well supported (Databricks backs the project)
* Widely adopted
* Supported by many cloud providers ([https://aws.amazon.com/emr/ AWS EMR], [https://azure.microsoft.com/en-us/products/hdinsight Azure HDInsight], [https://cloud.google.com/dataproc GCP Dataproc], [https://www.oracle.com/big-data/data-flow/ OCI Data Flow])
* Instead of building your own data lake, you can use the [https://www.databricks.com/ Lakehouse] platform from Databricks, which supports AWS, Azure, and GCP.
</div>
</div>
<p></p>
= Data Ingestion =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Always save a copy of raw data</div>
<div class="mw-collapsible-content">
When you do data ingestion, save the raw data for the following reasons:
* Your ingestion pipeline may have bugs; saving the raw data allows you to fix the bugs and re-populate the data.
* Raw data may not meet your data-quality bar, and you may discard it. Keeping the raw data lets you inspect the quality problems, and sometimes you can ask the data producer to fix them.
* Raw data is owned by the data-source team, which has its own retention policy -- the raw data is not always accessible later.
</div>
</div>
<p></p>
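The "land the raw copy first, transform second" rule above can be sketched in plain Python. This is a minimal illustration, not the wiki's actual pipeline; the file names, directory layout, and the `ingest` function are all hypothetical.

```python
import json
import pathlib
import tempfile

def ingest(payload: bytes, raw_dir: pathlib.Path, curated_dir: pathlib.Path) -> dict:
    """Save the raw payload verbatim, then parse it into the curated zone."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    curated_dir.mkdir(parents=True, exist_ok=True)

    # 1. Always land an untouched copy in the raw zone first.
    raw_file = raw_dir / "orders_2024-01-15.json"  # hypothetical batch name
    raw_file.write_bytes(payload)

    # 2. Only then parse/transform. If this step has a bug, the raw copy
    #    lets us fix the code and re-populate the curated data later.
    record = json.loads(payload)
    curated_file = curated_dir / "orders_2024-01-15.parsed.json"
    curated_file.write_text(json.dumps(record, indent=2))
    return record

base = pathlib.Path(tempfile.mkdtemp())
result = ingest(b'{"order_id": 1, "amount": 9.99}', base / "raw", base / "curated")
print(result["order_id"])  # prints 1
```

If the transform in step 2 later turns out to be wrong, the untouched file from step 1 is all you need to re-run ingestion.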
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Use data connectors to manage data ingestions</div>
<div class="mw-collapsible-content">
* Create highly reusable "data connectors" to manage data ingestion.
** An anti-pattern is accumulating one-off, custom-written, poorly documented ingestion code.
</div>
</div>
<p></p>
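One way to read the "reusable data connector" advice is: one documented class per source ''type'', configured per source ''instance'', behind a common interface. The sketch below is an assumed design, not the wiki's actual framework; `DataConnector`, `CsvConnector`, and the registry are illustrative names.

```python
import csv
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

class DataConnector(ABC):
    """Common interface for all sources; replaces one-off ingestion scripts."""

    def __init__(self, config: Dict[str, str]):
        self.config = config  # per-instance settings (path, credentials, ...)

    @abstractmethod
    def fetch(self) -> Iterable[dict]:
        """Yield raw records from the source."""

class CsvConnector(DataConnector):
    """Hypothetical connector for CSV files; config holds the file path."""

    def fetch(self) -> Iterable[dict]:
        with open(self.config["path"], newline="") as f:
            yield from csv.DictReader(f)

# Registry: a new source type plugs in here instead of spawning a new script.
CONNECTORS = {"csv": CsvConnector}

def run_ingestion(kind: str, config: Dict[str, str]) -> List[dict]:
    return list(CONNECTORS[kind](config).fetch())
```

Adding, say, a REST or database source then means writing one new connector class and registering it, while scheduling, logging, and raw-copy handling stay shared.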
= Data Governance =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Keep a good structure for your data</div>
<div class="mw-collapsible-content">
* Raw, sometimes unstructured data
** You stage the raw data (to be ingested) here; sometimes this data is unstructured.
* Raw, ingested data
** Structured, e.g. in Parquet format. It captures all the information you are interested in from the raw data. It may not be organized well -- the purpose is to capture all raw information with minimum processing.
* Logical data layer
** Well modeled, perhaps around a subject model (a fact table with a group of dimension tables).
See also:
* [https://lingarogroup.com/blog/data-lake-architecture Data Lake Architecture: How to Create a Well Designed Data Lake]
</div>
</div>
<p></p>
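The three layers above are easiest to keep tidy with a fixed path convention, so every dataset has a predictable location in each layer. This is only a sketch under assumed zone names (`staging`, `raw`, `logical`); your lake may use different ones.

```python
import pathlib

# Hypothetical names for the three layers described above.
ZONES = ("staging", "raw", "logical")

def zone_path(lake_root: pathlib.Path, zone: str, dataset: str) -> pathlib.Path:
    """Deterministic location for a dataset in each layer of the lake."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return lake_root / zone / dataset

root = pathlib.Path("/data/lake")
print(zone_path(root, "raw", "orders"))
print(zone_path(root, "logical", "orders"))
```

With a convention like this, "promoting" data is just moving it from one zone's path to the next, and anyone can find a dataset's raw copy from its logical-layer name.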