Introduction

Definition

A DataUnit object represents the basic unit that carries data.

Fields

id: UUID
Primary key

dataset_id: UUID
The dataset that owns this data unit.

publisher: str
Represents the user who published this data unit. Anyone with questions about this data unit should contact the publisher for clarification.

published_time: DateTime
Represents the time at which this data unit was published. A data unit is visible and accessible to users only once it has been published.

locations: List[DataLocation]
Represents the list of locations where this data exists. The content must be exactly the same at every location. For example, one location could serve an
ETL pipeline while another serves as a backup or archive. The payload may also be replicated to multiple regions with the same content, so that users
can read from the closest copy for better performance. In every case, the content at the different locations must be identical.
Note: every object in DataLake has labels, so you can attach a "purpose" label to each location object to distinguish these purposes.

is_recalled: bool
If the publisher wants to recall the data unit, they can set this flag to True. Users should generally ignore recalled data units, except for troubleshooting
purposes.

recall_time: DateTime
If the data unit is recalled, this represents the time at which it was recalled.

schema: DataType
The schema of the data. Not every data unit represents structured data; for example, a data unit could represent a JPEG image. When a data unit
does represent structured data, this field describes the schema of that data.

sources: List[DataUnit]
Represents the list of other data units that were used to produce this data unit. This allows any data-related problem to be traced back to its root cause.
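
The field list above can be sketched as a pair of Python dataclasses. This is only an illustration under stated assumptions, not DataLake's actual implementation: the shape of DataLocation (a `uri` plus a `labels` dictionary), the use of `Any` as a stand-in for DataType, the optional defaults, and the helper `locations_for` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional
from uuid import UUID


@dataclass(frozen=True)
class DataLocation:
    # Hypothetical shape: the page only states that locations carry labels.
    uri: str
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class DataUnit:
    id: UUID                       # primary key
    dataset_id: UUID               # dataset that owns this data unit
    publisher: str
    published_time: Optional[datetime] = None  # assumed None until published
    locations: List[DataLocation] = field(default_factory=list)
    is_recalled: bool = False
    recall_time: Optional[datetime] = None
    schema: Optional[Any] = None   # stand-in for DataType; None for unstructured data
    sources: List["DataUnit"] = field(default_factory=list)

    def locations_for(self, purpose: str) -> List[DataLocation]:
        """Return locations whose "purpose" label matches (hypothetical helper)."""
        return [loc for loc in self.locations if loc.labels.get("purpose") == purpose]
```

With this sketch, a reader that only wants the archive copy could call `unit.locations_for("archive")`, which relies on the "purpose" label convention mentioned in the locations field.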

Methods

  
@property
def is_dirty(self) -> bool

Returns True if any transitive source data unit has been recalled. In that case, this data unit should be reconstructed to maintain data accuracy and integrity.

For example, suppose data unit A backs report A and data unit B backs report B, and both consume data units X and Y. If X is later recalled and only A is
reconstructed, the dashboard built on top of A will be inconsistent with the dashboard built on top of B, which can cause considerable customer confusion.
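
The transitive check described above can be sketched as a graph traversal over `sources`. This is one possible way to write it, not necessarily how DataLake does it; the `Unit` class is a reduced, hypothetical stand-in for DataUnit that keeps only the fields the check needs.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Unit:
    # Reduced, hypothetical stand-in for DataUnit.
    name: str
    is_recalled: bool = False
    sources: List["Unit"] = field(default_factory=list)

    @property
    def is_dirty(self) -> bool:
        """True if any direct or indirect source has been recalled."""
        seen: Set[int] = set()
        stack = list(self.sources)
        while stack:
            unit = stack.pop()
            if id(unit) in seen:
                continue  # a source shared by several paths is visited once
            seen.add(id(unit))
            if unit.is_recalled:
                return True
            stack.extend(unit.sources)
        return False


x = Unit("X")
y = Unit("Y")
a = Unit("A", sources=[x, y])
b = Unit("B", sources=[x, y])

x.is_recalled = True  # the publisher recalls X
print(a.is_dirty, b.is_dirty)  # prints: True True
```

Because both A and B report dirty once X is recalled, a rebuild job can find every affected unit and reconstruct them together, which avoids the inconsistency between the two dashboards.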

Examples:

{
    id: "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
    publisher: "stone.zhong@gmail.com",
    published_time: "2023-08-20 00:00:00",
    is_recalled: False,
    recall_time: None,
    schema_id: "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
    source_ids: ["11111111-1111-1111-1111-111111111111", "22222222-2222-2222-2222-222222222222"],
    location_ids: ["33333333-3333-3333-3333-333333333333", "44444444-4444-4444-4444-444444444444"]
}

Considerations

Data units are immutable

When you reconstruct a dirty data unit, you should therefore create a new data unit and mark the original data unit as "recalled".
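
A minimal sketch of this rule, assuming the data unit is held as a frozen record. The `UnitRecord` class and the `reconstruct` function are hypothetical, and the page does not say how the recall flag is persisted; here the recalled state is modeled as a new record value that keeps the original id.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional, Tuple
from uuid import UUID, uuid4


@dataclass(frozen=True)
class UnitRecord:
    # Reduced, hypothetical stand-in for DataUnit.
    id: UUID
    publisher: str
    published_time: datetime
    is_recalled: bool = False
    recall_time: Optional[datetime] = None


def reconstruct(old: UnitRecord, now: datetime) -> Tuple[UnitRecord, UnitRecord]:
    """Return (recalled version of old, replacement unit with a fresh id)."""
    # The original keeps its id; only the recall metadata changes.
    recalled = replace(old, is_recalled=True, recall_time=now)
    # The rebuilt data is published as a brand-new data unit.
    replacement = UnitRecord(id=uuid4(), publisher=old.publisher, published_time=now)
    return recalled, replacement


old = UnitRecord(uuid4(), "publisher@example.com", datetime(2023, 8, 20, tzinfo=timezone.utc))
recalled, replacement = reconstruct(old, datetime.now(timezone.utc))
```

The original record object is never modified in place, so any reader still holding `old` continues to see a consistent snapshot.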
