Dl/glossary: Difference between revisions

Latest revision as of 00:52, 7 March 2023

Dataframe

A basic data unit that carries data. For example:

A dataframe should be immutable once it is published: the data it carries should never change.

Asset

Represents the underlying data that a dataframe is loaded from. For example:

Asset URI

A URI that uniquely identifies an asset, for example:

asset://s3/bucket_name/foo.parquet -- represents a Parquet file stored in AWS S3
asset://mysql/myserver/mydb/foo -- represents a table in MySQL: server name is myserver, database name is mydb, table name is foo
asset://mysql/myserver/mydb/foo/?batch_id=1& -- represents a table in MySQL (server name myserver, database name mydb, table name foo) with a filter: the batch_id column must equal 1
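
As an illustration of how the asset URIs above could be taken apart, here is a minimal sketch using Python's standard `urllib.parse`. The `AssetRef` type and `parse_asset_uri` function are hypothetical names introduced for this example, and treating the query string as a list of column filters is an assumption based on the `batch_id=1` example, not a documented grammar.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit


@dataclass(frozen=True)
class AssetRef:
    """Hypothetical parsed form of an asset URI (not defined by the glossary)."""
    store: str           # storage kind, e.g. "s3" or "mysql"
    path: tuple          # remaining path segments, e.g. ("myserver", "mydb", "foo")
    filters: tuple = ()  # (column, value) pairs taken from the query string


def parse_asset_uri(uri: str) -> AssetRef:
    parts = urlsplit(uri)
    if parts.scheme != "asset":
        raise ValueError(f"not an asset URI: {uri!r}")
    # urlsplit places the first segment after "asset://" in netloc.
    segments = tuple(s for s in parts.path.split("/") if s)
    # parse_qsl skips the empty field left by a trailing "&".
    filters = tuple(parse_qsl(parts.query))
    return AssetRef(store=parts.netloc, path=segments, filters=filters)
```

For the third example above, this sketch would yield the store `mysql`, the path `("myserver", "mydb", "foo")`, and a single filter pair for `batch_id`.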

Dataset

A set of dataframes that share a common schema.

A dataset name is not unique, but the combination name + major_version + minor_version is unique.
- Two datasets with the same name but different major_version values might have incompatible schemas.
- Two datasets with the same name and the same major_version should have compatible schemas; a later minor_version only adds new columns.
Each dataframe of a dataset has a unique key.
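
The versioning rule above can be expressed as a small check. This is only a sketch: `DatasetVersion` and `is_compatible_upgrade` are hypothetical names, the schema is modeled as a plain column-to-type mapping, and `major_version` is kept as a string because the examples on this page use values such as 1.4.

```python
from dataclasses import dataclass


@dataclass
class DatasetVersion:
    """Hypothetical record: a dataset identity plus its column schema."""
    name: str
    major_version: str
    minor_version: int
    columns: dict  # column name -> type name (illustrative schema model)


def is_compatible_upgrade(old: DatasetVersion, new: DatasetVersion) -> bool:
    """True if `new` is a later minor version of `old` that only adds columns."""
    if (old.name, old.major_version) != (new.name, new.major_version):
        return False  # different name or major version: no compatibility promise
    if new.minor_version <= old.minor_version:
        return False
    # Every existing column must still be present with the same type.
    return all(new.columns.get(col) == typ for col, typ in old.columns.items())
```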

Each dataset has a unique URI, for example:

dataset://stock.msft/1.4/1: represents a dataset whose name is stock.msft, major_version is 1.4, and minor_version is 1

Each dataframe of a dataset also has a URI, for example:

dataframe://stock.msft/1.4/1/2023-01-01: represents a dataframe that belongs to the dataset whose name is stock.msft, major_version is 1.4, and minor_version is 1; the dataframe key is 2023-01-01.
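
The two URI shapes above differ only in the trailing dataframe key, so one parser can cover both. The following is a rough sketch rather than a reference implementation: `DataRef`, `parse_data_uri`, and `dataset_uri_of` are hypothetical names, and the code assumes that names, versions, and keys never contain a `/`.

```python
from typing import NamedTuple, Optional


class DataRef(NamedTuple):
    """Hypothetical parsed form of a dataset:// or dataframe:// URI."""
    name: str
    major_version: str
    minor_version: str
    key: Optional[str]  # None for a dataset URI


def parse_data_uri(uri: str) -> DataRef:
    scheme, sep, rest = uri.partition("://")
    parts = rest.split("/")
    if sep and scheme == "dataset" and len(parts) == 3:
        return DataRef(parts[0], parts[1], parts[2], None)
    if sep and scheme == "dataframe" and len(parts) == 4:
        return DataRef(*parts)
    raise ValueError(f"unrecognized URI: {uri!r}")


def dataset_uri_of(ref: DataRef) -> str:
    """URI of the dataset that a dataset or dataframe reference belongs to."""
    return f"dataset://{ref.name}/{ref.major_version}/{ref.minor_version}"
```

With this sketch, the dataframe URI in the example resolves back to its owning dataset URI by dropping the key.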

Data Application

A data application consumes zero or more datasets as input, generates dataframes, and updates a dataset with the new dataframes.
- A data ingestion application is a data application that does not consume any dataset.
- A data transformation application takes one or more datasets as input.
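
One way to picture the distinction between ingestion and transformation applications is the sketch below. Everything in it is an assumption made for illustration: the `DataApplication` class, the `execute` helper, and the in-memory `catalog` that models a dataset as a mapping from dataframe key to rows. Rejecting an already-published key reflects the immutability rule stated for dataframes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# A dataset is modeled as {dataframe key: rows}; an illustrative stand-in only.
Dataset = Dict[str, list]


@dataclass
class DataApplication:
    name: str
    inputs: Tuple[str, ...]  # names of consumed datasets; empty for ingestion
    output: str              # name of the dataset this application updates
    run: Callable[[Dict[str, Dataset]], Dataset]  # returns new dataframes by key

    @property
    def kind(self) -> str:
        return "ingestion" if not self.inputs else "transformation"


def execute(app: DataApplication, catalog: Dict[str, Dataset]) -> None:
    """Run one application and publish its new dataframes into the catalog."""
    new_frames = app.run({name: catalog[name] for name in app.inputs})
    target = catalog.setdefault(app.output, {})
    for key, rows in new_frames.items():
        if key in target:
            # Published dataframes are immutable, so never overwrite one.
            raise ValueError(f"dataframe {key!r} already published")
        target[key] = rows
```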

Data Pipeline

- A set of data applications with dependencies defined between them.
- Runs on a regular basis (has schedule information).
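
A data pipeline, understood as data applications with defined dependencies plus schedule information, could be sketched as follows. The `Pipeline` class and its field names are hypothetical, the cron-style `schedule` string is an assumed format, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Dict, List, Set


@dataclass
class Pipeline:
    name: str
    schedule: str  # e.g. a cron expression; the format is an assumption
    depends_on: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, app: str, *upstream: str) -> None:
        """Register an application and the applications it depends on."""
        self.depends_on.setdefault(app, set()).update(upstream)

    def run_order(self) -> List[str]:
        # static_order() yields each application after all of its upstream
        # applications and raises graphlib.CycleError on circular dependencies.
        return list(TopologicalSorter(self.depends_on).static_order())
```

A scheduler would then trigger `run_order()` at each scheduled time and execute the applications in that order.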