Revision as of 23:23, 5 March 2023
Data Lake Knowledge Center
Dataframe
The basic unit that carries data. For example:
- A DataFrame in Apache Spark
- A DataFrame in Pandas
A dataframe should be immutable once it is published: the data it carries should never change.
Asset
Represents the underlying data that a dataframe is loaded from. For example:
- A Parquet file in an AWS S3 bucket
- A Parquet file in HDFS
- A Parquet file in the local filesystem
- A table in a database (RDBMS)
- A subset of a table in a database (RDBMS), selected by a filter
- A collection in MongoDB
- A subset of a collection in MongoDB, selected by a filter
Asset URI
A URI that uniquely identifies an asset, for example:
asset://s3/bucket_name/foo.parquet: represents a Parquet file stored in AWS S3; the bucket is bucket_name and the file is foo.parquet
asset://mysql/myserver/mydb/foo: represents a table in MySQL; the server name is myserver, the database name is mydb, and the table name is foo
asset://mysql/myserver/mydb/foo/?batch_id=1&: represents the same MySQL table restricted by a filter; only rows whose batch_id column equals 1 are included
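The asset URI examples above can be split into their parts with Python's standard library. This is a minimal sketch, not a defined API of the data lake: the names AssetRef and parse_asset_uri are hypothetical, and it assumes that the first URI component names the storage backend, the remaining path segments locate the asset, and query parameters act as equality filters.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit


@dataclass(frozen=True)
class AssetRef:
    """Hypothetical parsed form of an asset URI (illustration only)."""
    store: str      # storage backend, e.g. "s3" or "mysql"
    path: tuple     # location segments, e.g. ("myserver", "mydb", "foo")
    filters: tuple  # (column, value) pairs taken from the query string


def parse_asset_uri(uri: str) -> AssetRef:
    parts = urlsplit(uri)
    if parts.scheme != "asset":
        raise ValueError(f"not an asset URI: {uri!r}")
    if not parts.netloc:
        raise ValueError(f"missing storage backend in: {uri!r}")
    # Drop empty segments so a trailing "/" before the query is harmless.
    segments = tuple(s for s in parts.path.split("/") if s)
    # parse_qsl skips the empty field left by a trailing "&".
    filters = tuple(parse_qsl(parts.query))
    return AssetRef(parts.netloc, segments, filters)


table = parse_asset_uri("asset://mysql/myserver/mydb/foo/?batch_id=1&")
print(table.store, table.path, table.filters)
```

Filter values come back as strings; converting them to column types is left to whatever component actually loads the asset.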
Dataset
A set of dataframes that share a common schema.
- A dataset name is not unique, but the combination of name, major_version, and minor_version is unique.
- Two datasets with the same name but different major_version values might have incompatible schemas.
- Two datasets with the same name and the same major_version should have compatible schemas; a later minor_version only adds new columns.
- Each dataframe of a dataset has a unique key.
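The minor-version rule can be checked mechanically. The sketch below is one possible reading of it, under the assumption that a schema is represented as a plain dict mapping column name to a type name; the function name is_minor_compatible and the example columns are hypothetical.

```python
def is_minor_compatible(older: dict, newer: dict) -> bool:
    """Return True if `newer` keeps every column of `older` unchanged.

    Extra columns in `newer` are allowed; removed or retyped columns are not.
    """
    return all(newer.get(column) == type_name
               for column, type_name in older.items())


# Hypothetical schemas for two minor versions under one major_version.
minor_1 = {"date": "date", "close": "double"}
minor_2 = {"date": "date", "close": "double", "volume": "long"}

print(is_minor_compatible(minor_1, minor_2))  # True: a column was only added
print(is_minor_compatible(minor_2, minor_1))  # False: "volume" is missing
```

Under this reading, a reader written against an earlier minor_version can still load dataframes from a later one by ignoring the added columns.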
Each dataset has a unique URI, for example:
dataset://stock.msft/1.4/1: represents a dataset whose name is stock.msft, major_version is 1.4, and minor_version is 1
Each dataframe of a dataset also has a URI, for example:
dataframe://stock.msft/1.4/1/2023-01-01: represents a dataframe that belongs to the dataset with name stock.msft, major_version 1.4, and minor_version 1; the dataframe key is 2023-01-01
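The two URI forms share the same leading components, so one parser can handle both. This is a minimal sketch under stated assumptions: DataRef, parse_data_uri, and dataset_uri are hypothetical names, major_version is kept as a string because the example value is 1.4, minor_version is assumed to be an integer, and the dataframe key is assumed to contain no "/" characters.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DataRef:
    """Hypothetical parsed form of a dataset or dataframe URI."""
    name: str
    major_version: str         # kept as text, e.g. "1.4"
    minor_version: int
    key: Optional[str] = None  # None for a dataset URI


def parse_data_uri(uri: str) -> DataRef:
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    # A dataset URI has 2 path segments; a dataframe URI adds the key.
    expected = {"dataset": 2, "dataframe": 3}.get(parts.scheme)
    if expected is None:
        raise ValueError(f"unsupported scheme in: {uri!r}")
    if not parts.netloc or len(segments) != expected:
        raise ValueError(f"malformed URI: {uri!r}")
    key = segments[2] if parts.scheme == "dataframe" else None
    return DataRef(parts.netloc, segments[0], int(segments[1]), key)


def dataset_uri(ref: DataRef) -> str:
    """Build the URI of the dataset that `ref` is, or belongs to."""
    return f"dataset://{ref.name}/{ref.major_version}/{ref.minor_version}"


frame = parse_data_uri("dataframe://stock.msft/1.4/1/2023-01-01")
print(frame.key)           # 2023-01-01
print(dataset_uri(frame))  # dataset://stock.msft/1.4/1
```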