Dl/glossary: Difference between revisions
From stonehomewiki
Stonezhong (talk | contribs) No edit summary |
(12 intermediate revisions by the same user not shown)
A basic data unit that carries data. For example:
* A dataframe in Apache Spark
* A dataframe in Pandas
<b>A dataframe should be immutable once it is published -- the data it carries should never change.</b>
</div>
</div>
<p></p>
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Asset</div>
<div class="mw-collapsible-content">
Represents the underlying data that a dataframe is loaded from. For example:
* A parquet file in an AWS S3 bucket
* A parquet file in HDFS
* A parquet file in the local filesystem
* A table in a DB (RDBMS)
* A subset of a table in a DB (RDBMS) -- via a filter
* A collection in MongoDB
* A subset of a collection in MongoDB -- via a filter
</div>
</div>
<p></p>
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Asset URI</div>
<div class="mw-collapsible-content">
A URI that uniquely identifies an asset. For example:
* <code>asset://s3/bucket_name/foo.parquet</code> -- represents a parquet file stored in AWS S3
* <code>asset://mysql/myserver/mydb/foo</code> -- represents a table in MySQL: the server name is myserver, the database name is mydb, and the table name is foo
* <code>asset://mysql/myserver/mydb/foo/?batch_id=1&</code> -- represents the same MySQL table (server myserver, database mydb, table foo) with a filter: the batch_id column must equal 1
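The asset URIs above can be split into their parts with the Python standard library. This is only a minimal sketch: the <code>AssetRef</code> class and <code>parse_asset_uri</code> function are hypothetical names for illustration, and treating the host part as the storage type and the query string as column filters is an assumption drawn from the examples, not a documented specification.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit


@dataclass(frozen=True)
class AssetRef:
    """Hypothetical parsed form of an asset URI."""
    store: str      # e.g. "s3" or "mysql" (assumed to be the URI host part)
    path: tuple     # remaining path segments, e.g. ("myserver", "mydb", "foo")
    filters: dict   # column filters taken from the query string


def parse_asset_uri(uri: str) -> AssetRef:
    parts = urlsplit(uri)
    if parts.scheme != "asset":
        raise ValueError(f"not an asset URI: {uri!r}")
    # Drop empty segments so a trailing "/" before the query is harmless.
    segments = tuple(s for s in parts.path.split("/") if s)
    # parse_qsl skips empty pairs, so a trailing "&" is ignored.
    filters = dict(parse_qsl(parts.query))
    return AssetRef(store=parts.netloc, path=segments, filters=filters)


ref = parse_asset_uri("asset://mysql/myserver/mydb/foo/?batch_id=1&")
print(ref.store, ref.path, ref.filters)
```

Filter values come back as strings; converting them to the column type is left to whatever loads the asset.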
</div>
</div>
<p></p>
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Dataset</div>
<div class="mw-collapsible-content">
A set of dataframes that share a common schema.
* A dataset name is not unique, but the combination of name + major_version + minor_version is unique
** Two datasets with the same name but different major_version values might have incompatible schemas
** Two datasets with the same name and the same major_version should have compatible schemas; a later minor_version only adds new columns
* Each dataframe in a dataset has a unique key.
Each dataset has a unique URI, for example:
* <code>dataset://stock.msft/1.4/1</code>: represents a dataset whose name is stock.msft, major_version is 1.4, and minor_version is 1
Each dataframe in a dataset also has a URI, for example:
* <code>dataframe://stock.msft/1.4/1/2023-01-01</code>: represents a dataframe that belongs to the dataset named stock.msft with major_version 1.4 and minor_version 1; the dataframe key is 2023-01-01.
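The naming and versioning rules above can be illustrated with a small parser. This is a sketch under assumptions: <code>DatasetRef</code>, <code>parse_uri</code> and <code>should_be_compatible</code> are hypothetical names, the minor version is assumed to be an integer, and the dataframe key is assumed not to contain a <code>/</code>.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    """Hypothetical parsed form of a dataset:// or dataframe:// URI."""
    name: str
    major_version: str
    minor_version: int          # assumed to be an integer
    key: Optional[str] = None   # set only for dataframe:// URIs


def parse_uri(uri: str) -> DatasetRef:
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in ("dataset", "dataframe"):
        raise ValueError(f"unsupported URI: {uri!r}")
    parts = rest.split("/")
    expected = 3 if scheme == "dataset" else 4
    if len(parts) != expected:
        raise ValueError(f"expected {expected} path parts in {uri!r}")
    name, major, minor = parts[:3]
    key = parts[3] if scheme == "dataframe" else None
    return DatasetRef(name, major, int(minor), key)


def should_be_compatible(a: DatasetRef, b: DatasetRef) -> bool:
    # Same name and same major_version: schemas should be compatible,
    # with the later minor_version only adding columns.
    return a.name == b.name and a.major_version == b.major_version


ds = parse_uri("dataset://stock.msft/1.4/1")
df = parse_uri("dataframe://stock.msft/1.4/1/2023-01-01")
print(ds, df.key, should_be_compatible(ds, df))
```

The compatibility check only compares identifiers; it does not inspect the actual schemas.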
</div>
</div>
<p></p>
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Data Application</div>
<div class="mw-collapsible-content">
* A data application consumes zero or more datasets as input, generates dataframes, and updates a dataset with the new dataframes
** A data ingestion application is a data application that does not consume any dataset
** A data transformation application takes one or more datasets as input.
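The two kinds of data application can be sketched in plain Python. Everything here is an illustrative assumption: a dataset is modelled as a dict from dataframe key to an immutable tuple of values, and <code>run_application</code>, <code>ingest_prices</code> and <code>daily_change</code> are hypothetical names, not part of any real framework.

```python
from typing import Callable, Dict, List, Tuple

# Stand-ins for illustration only: a dataframe is an immutable tuple of
# values, and a dataset maps a dataframe key to a dataframe.
Dataframe = Tuple[float, ...]
Dataset = Dict[str, Dataframe]


def run_application(app: Callable[[List[Dataset]], List[float]],
                    inputs: List[Dataset],
                    output: Dataset,
                    key: str) -> None:
    """Run an application and publish its result under `key`."""
    if key in output:
        # Published dataframes are immutable, so never overwrite one.
        raise ValueError(f"dataframe {key!r} is already published")
    output[key] = tuple(app(inputs))


def ingest_prices(inputs: List[Dataset]) -> List[float]:
    # Ingestion application: consumes no dataset.
    return [100.0, 101.5, 99.0]


def daily_change(inputs: List[Dataset]) -> List[float]:
    # Transformation application: consumes one dataset.
    prices = inputs[0]["2023-01-01"]
    return [later - earlier for earlier, later in zip(prices, prices[1:])]


raw: Dataset = {}
changes: Dataset = {}
run_application(ingest_prices, [], raw, "2023-01-01")
run_application(daily_change, [raw], changes, "2023-01-01")
print(changes["2023-01-01"])  # (1.5, -2.5)
```

Refusing to overwrite an existing key is one simple way to respect the rule that a published dataframe never changes.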
</div>
</div>
<p></p>
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Data Pipeline</div>
<div class="mw-collapsible-content">
* A set of data applications with the dependencies between them defined
* Runs on a regular basis (has schedule information)
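One way to picture a pipeline is as a dependency graph plus a schedule. The sketch below uses <code>graphlib.TopologicalSorter</code> from the Python standard library (Python 3.9+) to derive a valid run order; the application names, the <code>pipeline</code> dict layout and the cron-style schedule string are assumptions made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline definition: each application maps to the set of
# applications it depends on; the schedule is an assumed cron expression.
pipeline = {
    "schedule": "0 2 * * *",
    "applications": {
        "ingest_prices": set(),
        "daily_change": {"ingest_prices"},
        "weekly_report": {"daily_change"},
    },
}


def execution_order(applications):
    """Return the applications in an order that respects dependencies."""
    return list(TopologicalSorter(applications).static_order())


order = execution_order(pipeline["applications"])
print(order)  # ['ingest_prices', 'daily_change', 'weekly_report']
```

A scheduler would trigger this ordered run at each time given by the schedule; that part is not shown here.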
</div>
</div>
<p></p>