Latest revision as of 00:40, 7 March 2023
Data Lake Knowledge Center
Purpose
Data Management
APIs
- CRUD dataset
- Add, remove dataframe from dataset
- Load dataset
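The API operations listed above could look roughly like the following in-memory sketch. This is only an illustration under stated assumptions: the `Dataset` fields, the `DatasetCatalog` class, and every method name are hypothetical and are not taken from an existing service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Dataset:
    # Hypothetical metadata fields, mirroring the ones named under "Use cases".
    name: str
    author: str
    team: str
    description: str = ""
    # Maps a dataframe id to the time it was published.
    dataframes: dict = field(default_factory=dict)


class DatasetCatalog:
    """In-memory stand-in for the dataset management API (assumed design)."""

    def __init__(self):
        self._datasets = {}

    def create(self, dataset):
        if dataset.name in self._datasets:
            raise ValueError(f"dataset {dataset.name!r} already exists")
        self._datasets[dataset.name] = dataset
        return dataset

    def load(self, name):
        # Raises KeyError when the dataset does not exist.
        return self._datasets[name]

    def update(self, name, **changes):
        dataset = self._datasets[name]
        for key, value in changes.items():
            if not hasattr(dataset, key):
                raise AttributeError(f"unknown field {key!r}")
            setattr(dataset, key, value)
        return dataset

    def delete(self, name):
        del self._datasets[name]

    def add_dataframe(self, name, dataframe_id, published_at=None):
        # Default the publish time to "now" in UTC when none is given.
        published_at = published_at or datetime.now(timezone.utc)
        self._datasets[name].dataframes[dataframe_id] = published_at

    def remove_dataframe(self, name, dataframe_id):
        del self._datasets[name].dataframes[dataframe_id]


catalog = DatasetCatalog()
catalog.create(Dataset(name="trades", author="alice", team="markets"))
catalog.add_dataframe("trades", "2023-03-06")
print(sorted(catalog.load("trades").dataframes))
```

Recording a publish time per dataframe is a deliberate choice in this sketch, because the SLA section below depends on knowing when each dataframe arrived.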
Web UI
- Browse dataset
- CRUD dataset
- Search dataset
Use cases:
- Users can browse datasets and view each one's author, team, description, schema, publish time, etc.
- Users can search datasets by author, team, description, schema, publish time, etc.
- Users can create a new dataset in the Web UI
- A data ingestion app can create a dataset via the API
- A data ingestion app can add a dataframe to a dataset via the API
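As a rough sketch of the search use case, metadata search can be expressed as a filter over dataset records. The record layout and the `search_datasets` helper are assumptions made for illustration; a real deployment would more likely delegate this to a database or search index.

```python
def search_datasets(datasets, **criteria):
    """Return datasets whose metadata contains every criterion.

    Matching is a case-insensitive substring match, which is an assumption;
    the page does not specify how search should behave.
    """
    def matches(dataset):
        return all(
            str(value).lower() in str(dataset.get(key, "")).lower()
            for key, value in criteria.items()
        )

    return [d for d in datasets if matches(d)]


# Hypothetical metadata records.
datasets = [
    {"name": "trades", "author": "alice", "team": "markets",
     "description": "Daily trade records"},
    {"name": "clicks", "author": "bob", "team": "web",
     "description": "Raw click stream"},
]

print([d["name"] for d in search_datasets(datasets, team="markets")])
# ['trades']
```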
Dataset SLA Management
Many datasets receive a new dataframe at a fixed frequency. If a new dataframe does not show up within the expected time, we need to be alerted so that we can fix the underlying issue, for example a broken data pipeline.
To support this:
- Users can define the expected frequency of new dataframes for a dataset
- The system monitors the dataset for new dataframes; if a new dataframe misses the expected frequency, the system sends an alert.
- Users can also see when each new dataframe was published, so they can check how often a dataset missed its freshness SLA.
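The freshness check described above can be sketched as two small functions: one decides whether a dataset is currently late, the other counts how often past publishes missed the SLA. The function names and the optional `grace` period are hypothetical; the page only says that a frequency is defined and an alert is sent.

```python
from datetime import datetime, timedelta


def is_sla_missed(last_published, expected_every, now, grace=timedelta(0)):
    """True when no new dataframe arrived within the expected interval plus grace."""
    return now - last_published > expected_every + grace


def count_sla_misses(publish_times, expected_every, grace=timedelta(0)):
    """Count gaps between consecutive publish times that exceeded the SLA."""
    ordered = sorted(publish_times)
    return sum(
        1
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected_every + grace
    )


# Hypothetical publish history for a dataset expected once a day.
history = [datetime(2023, 3, 1), datetime(2023, 3, 2), datetime(2023, 3, 4)]
daily = timedelta(days=1)

print(count_sla_misses(history, daily))                           # 1
print(is_sla_missed(history[-1], daily, datetime(2023, 3, 5, 6)))  # True
```

A monitoring daemon could call `is_sla_missed` on a schedule and send an alert when it returns `True`, while the Web UI could use `count_sla_misses` to show the miss history.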
Dataset Quality Management
- Allow user to define a set of rules to check dataset quality
- Web UI to render dataset quality
- Backend daemon to check dataset quality based on rules
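One possible shape for user-defined quality rules is a named predicate that the backend daemon evaluates against a dataframe's rows. The `Rule` class, the `evaluate` function, and the representation of rows as a list of dicts are all assumptions for illustration, not an existing design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    # Receives the rows (a list of dicts) and returns True when the rule passes.
    check: Callable[[list], bool]


def evaluate(rows, rules):
    """Run every rule against the rows and return {rule name: passed}."""
    return {rule.name: bool(rule.check(rows)) for rule in rules}


# Hypothetical rules a user might define.
rules = [
    Rule("non_empty", lambda rows: len(rows) > 0),
    Rule("price_not_null",
         lambda rows: all(r.get("price") is not None for r in rows)),
]

rows = [{"price": 10.5}, {"price": None}]
print(evaluate(rows, rules))
# {'non_empty': True, 'price_not_null': False}
```

The per-rule result dictionary is what a Web UI could render as the dataset's quality report.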