Dl/glossary: Difference between revisions

From stonehomewiki


<b>A dataframe should be immutable once it is published: the data it carries should never change.</b>
</div>
</div>
<p></p>
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Asset</div>
<div class="mw-collapsible-content">
An asset represents the underlying data that a dataframe is loaded from. For example:
* A Parquet file in an AWS S3 bucket
* A Parquet file in HDFS
* A Parquet file in the local filesystem
* A table in a database (RDBMS)
* A subset of a table in a database (RDBMS), selected via a filter
* A collection in MongoDB
* A subset of a collection in MongoDB, selected via a filter
</div>
</div>
</div>
</div>
<p></p>


<div class="toccolours mw-collapsible mw-collapsed mono expandable">
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Asset URI</div>
<div class="mw-collapsible-content">
<div class="mw-collapsible-content">
An asset represents a data storage location from which data can be loaded into a dataframe.
An asset URI is a URI that uniquely identifies an asset, for example:
* <code>asset://s3/bucket_name/foo.parquet</code> -- represents a Parquet file stored in AWS S3
* <code>asset://mysql/myserver/mydb/foo</code> -- represents a table in MySQL, where the server name is myserver, the database name is mydb, and the table name is foo
* <code>asset://mysql/myserver/mydb/foo/?batch_id=1&</code> -- represents the same MySQL table (server myserver, database mydb, table foo) restricted by a filter: the batch_id column must equal 1
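The URI layout above can be taken apart with a standard URL parser. The following is a minimal sketch, not a definitive implementation: the names <code>AssetRef</code> and <code>parse_asset_uri</code> are hypothetical, and it assumes that the first component after <code>asset://</code> is the storage type and that query parameters are equality filters on columns.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit, parse_qsl


@dataclass(frozen=True)
class AssetRef:
    """Hypothetical parsed form of an asset URI (illustration only)."""
    storage: str                                  # e.g. "s3" or "mysql"
    path: tuple                                   # remaining path segments
    filters: dict = field(default_factory=dict)   # column name -> required value


def parse_asset_uri(uri: str) -> AssetRef:
    parts = urlsplit(uri)
    if parts.scheme != "asset":
        raise ValueError(f"not an asset URI: {uri!r}")
    # In asset://mysql/myserver/mydb/foo the network-location slot
    # holds the storage type (assumption based on the examples above).
    segments = tuple(s for s in parts.path.split("/") if s)
    # parse_qsl skips empty fields, so a trailing "&" is harmless.
    filters = dict(parse_qsl(parts.query))
    return AssetRef(parts.netloc, segments, filters)
```

For example, <code>parse_asset_uri("asset://mysql/myserver/mydb/foo/?batch_id=1&")</code> yields storage <code>mysql</code>, path segments <code>myserver</code>, <code>mydb</code>, <code>foo</code>, and a single filter on <code>batch_id</code>. Filter values stay strings here; converting them to column types is left open.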
</div>
</div>
</div>
</div>
<p></p>


<div class="toccolours mw-collapsible mw-collapsed mono expandable">
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Dataset</div>
<div class="mw-collapsible-content">
<div class="mw-collapsible-content">
A dataset is a set of dataframes that share a common schema.
 
* A dataset name is not unique, but the combination of name + major_version + minor_version is unique
** Two datasets with the same name but different major_version values might have incompatible schemas
** Two datasets with the same name and the same major_version should have compatible schemas; a later minor_version only adds new columns.
* Each dataframe in a dataset has a unique key.
 
Each dataset has a unique URI, for example:
* <code>dataset://stock.msft/1.4/1</code>: represents a dataset whose name is stock.msft, major_version is 1.4, and minor_version is 1
 
Each dataframe of a dataset also has a URI, for example:
* <code>dataframe://stock.msft/1.4/1/2023-01-01</code>: represents a dataframe that belongs to the dataset whose name is stock.msft, major_version is 1.4, and minor_version is 1; the dataframe key is 2023-01-01.
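The naming and versioning rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions: <code>DatasetId</code>, <code>schema_compatible</code> and <code>is_additive_upgrade</code> are hypothetical names, major_version is kept as a string because the example uses 1.4, and minor_version is assumed to be an integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetId:
    """Hypothetical identity of a dataset: name + major_version + minor_version."""
    name: str
    major_version: str   # kept as a string: the example above uses "1.4"
    minor_version: int   # assumed to be an integer

    def uri(self) -> str:
        return f"dataset://{self.name}/{self.major_version}/{self.minor_version}"

    def dataframe_uri(self, key: str) -> str:
        # The dataframe key is appended to the dataset's identity.
        return (f"dataframe://{self.name}/{self.major_version}/"
                f"{self.minor_version}/{key}")


def schema_compatible(a: DatasetId, b: DatasetId) -> bool:
    """Same name and same major_version should imply compatible schemas."""
    return a.name == b.name and a.major_version == b.major_version


def is_additive_upgrade(old_columns, new_columns) -> bool:
    """A later minor_version may add columns but must keep all existing ones."""
    return set(old_columns) <= set(new_columns)
```

With these definitions, <code>DatasetId("stock.msft", "1.4", 1).dataframe_uri("2023-01-01")</code> reproduces the dataframe URI shown above.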
 
</div>
</div>
<p></p>
 
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Data Application</div>
<div class="mw-collapsible-content">
* A data application consumes zero or more datasets as input, generates dataframes, and updates a dataset with the new dataframes
** A data ingestion application is a data application that does not consume any dataset
** A data transformation application takes one or more datasets as input.
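The distinction between ingestion and transformation can be sketched as follows. Everything here is a hypothetical model rather than an actual API: a dataframe is represented as a tuple of row dictionaries, a dataset as a mapping from dataframe key to dataframe, and <code>DataApplication</code> is an illustrative name. Refusing to overwrite an existing key is an assumption derived from the rule that published dataframes are immutable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a dataframe is a tuple of rows (so it cannot be
# mutated in place), a dataset maps dataframe key -> dataframe.
Frame = Tuple[dict, ...]
Dataset = Dict[str, Frame]


@dataclass
class DataApplication:
    name: str
    inputs: List[str]   # names of consumed datasets; may be empty
    output: str         # name of the dataset that receives new dataframes
    run: Callable[[Dict[str, Dataset], str], Frame]

    @property
    def kind(self) -> str:
        # No input datasets means ingestion, otherwise transformation.
        return "ingestion" if not self.inputs else "transformation"

    def execute(self, datasets: Dict[str, Dataset], key: str) -> None:
        consumed = {name: datasets[name] for name in self.inputs}
        frame = self.run(consumed, key)
        target = datasets.setdefault(self.output, {})
        if key in target:
            # Assumption: a published dataframe is never replaced.
            raise ValueError(f"dataframe {key!r} already published in {self.output!r}")
        target[key] = frame
```

An ingestion application would be constructed with <code>inputs=[]</code>, while a transformation application lists the datasets it reads.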
</div>
</div>
<p></p>
 
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Data Pipeline</div>
<div class="mw-collapsible-content">
* A set of data applications with the dependencies between them defined
* Runs on a regular basis (has schedule information)
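One way to turn declared dependencies into a run order is a topological sort. The sketch below uses Python's standard <code>graphlib</code> module (Python 3.9 or later). The <code>Pipeline</code> class, the application names, and the cron-style schedule string are hypothetical; the glossary does not say how schedules are expressed.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Dict, List, Set


@dataclass
class Pipeline:
    """Hypothetical pipeline: applications, their dependencies, and a schedule."""
    # Maps each application name to the applications it depends on.
    dependencies: Dict[str, Set[str]]
    # Assumed cron-style schedule string; the actual format is not specified.
    schedule: str

    def execution_order(self) -> List[str]:
        # static_order() yields every application after all of its
        # dependencies and raises graphlib.CycleError on circular dependencies.
        return list(TopologicalSorter(self.dependencies).static_order())


pipeline = Pipeline(
    dependencies={
        "ingest_prices": set(),
        "clean_prices": {"ingest_prices"},
        "daily_report": {"clean_prices"},
    },
    schedule="0 2 * * *",
)
order = pipeline.execution_order()
```

Here <code>order</code> lists <code>ingest_prices</code> before <code>clean_prices</code>, which in turn comes before <code>daily_report</code>.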
</div>
</div>
</div>
</div>
<p></p>

Latest revision as of 00:52, 7 March 2023

Data Lake Knowledge Center