Dl/DataType
Introduction
A DataType object describes the schema of nested data.
We follow the JSON Schema specification to represent types.
Fields
id: UUID
Primary key
type: str
The type of the data. For example: "integer", "number", "object".
properties: Optional[dict]
If type is "object", this field lists all properties.
description: Optional[str]
Human-readable documentation about this type
items: Optional[DataType]
If type is "array", this specifies the array element type
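The field list above could be mirrored in code roughly as follows. This is only a sketch: the class layout, the default UUID generation, and the consistency checks in __post_init__ are assumptions inferred from the field descriptions, not the actual model definition.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataType:
    """Hypothetical in-memory mirror of the fields listed above."""

    type: str                                          # e.g. "number", "object", "array"
    id: uuid.UUID = field(default_factory=uuid.uuid4)  # primary key (assumed auto-generated)
    properties: Optional[dict] = None                  # used when type is "object"
    description: Optional[str] = None                  # human-readable documentation
    items: Optional[dict] = None                       # used when type is "array"

    def __post_init__(self) -> None:
        # Assumed rules: properties and items only make sense for their own type.
        if self.properties is not None and self.type != "object":
            raise ValueError("properties is only valid when type is 'object'")
        if self.items is not None and self.type != "array":
            raise ValueError("items is only valid when type is 'array'")


number = DataType(type="number", description="Represent a float number")
```

Rejecting mismatched fields at construction time is one possible design; a real model might instead ignore them or validate at save time.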
Examples:
# This represents a "number" type
{
  id: "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
  type: "number",
  description: "Represent a float number"
}
# This represents a structure with two number member fields, "x" and "y".
{
  id: "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
  type: "object",
  properties: {
    "x": {
      type: {"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"},
      description: "The x coordinate value, measured in miles"
    },
    "y": {
      type: {"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"},
      description: "The y coordinate value, measured in miles"
    }
  },
  description: "Represent a point on a two dimensional canvas"
}
# This represents an array of structures, each with two number member fields, "x" and "y".
{
  id: "11111111-1111-1111-1111-111111111111",
  type: "array",
  items: {
    "$ref": "#/$defs/bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb"
  },
  description: "Represent a series of points on a two dimensional canvas"
}
Considerations
A type can be referenced in the format {"$ref": "#/$defs/<type_id>"}.
For example:
{"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"} references the type whose id is "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa".
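To illustrate how such a reference might be followed, here is a minimal sketch. The resolve_ref function and the registry dictionary (type id mapped to type definition) are hypothetical names introduced for this example; how types are actually stored and looked up is not described here.

```python
from typing import Dict

REF_PREFIX = "#/$defs/"


def resolve_ref(ref: dict, registry: Dict[str, dict]) -> dict:
    """Return the type definition that a {"$ref": ...} object points to."""
    pointer = ref["$ref"]
    if not pointer.startswith(REF_PREFIX):
        raise ValueError(f"unsupported reference: {pointer!r}")
    # Everything after the prefix is treated as the type id.
    type_id = pointer[len(REF_PREFIX):]
    if type_id not in registry:
        raise KeyError(f"no type with id {type_id!r}")
    return registry[type_id]


# Hypothetical registry holding the "number" type from the examples.
registry = {
    "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa": {
        "type": "number",
        "description": "Represent a float number",
    },
}
resolved = resolve_ref(
    {"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"}, registry
)
```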
JSON Schema supports only a few primitive types, such as string, number, and integer. Many systems support more primitive types than JSON Schema does. For example, a Parquet file can have "timestamp" columns, but "timestamp" is not a type in the JSON Schema specification. We will extend the JSON Schema specification with additional common primitive types to fit Data Lake use cases.
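One way to picture such an extension is a second set of primitive names checked alongside the standard ones. The sketch below is illustrative only: "timestamp" is the only extended type named in the text, and the choice to encode timestamps as ISO 8601 strings is an assumption made for this example.

```python
from datetime import datetime

# Primitive types defined by the JSON Schema specification.
JSON_SCHEMA_PRIMITIVES = {"string", "number", "integer", "boolean", "null"}

# Hypothetical extension set; only "timestamp" is named in the text.
EXTENDED_PRIMITIVES = {"timestamp"}


def is_primitive(type_name: str) -> bool:
    """True for standard JSON Schema primitives and for the extended ones."""
    return type_name in JSON_SCHEMA_PRIMITIVES or type_name in EXTENDED_PRIMITIVES


def is_valid_timestamp(value: object) -> bool:
    """Assumed encoding: a string that datetime.fromisoformat can parse."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True
```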
We build a full-text index on the description field. This lets users search for a type or field by keywords that appear in its description, which improves data discovery.
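The idea behind keyword search over descriptions can be sketched with a tiny in-memory inverted index. This is not the indexing technology actually used, which is not specified here; build_index and search are hypothetical helpers, and the sketch covers only type-level descriptions, not field descriptions.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def tokenize(text: str) -> list:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(types: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each word in a description to the ids of the types that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for type_id, description in types:
        for word in tokenize(description):
            index[word].add(type_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids whose description contains every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index([
    ("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa", "Represent a float number"),
    ("bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
     "Represent a point on a two dimensional canvas"),
])
hits = search(index, "point canvas")
```

Here hits contains only the id of the point type, because it is the only description that includes both query words.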
These constraints help us validate data uniformly, in a content-agnostic way.
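As a rough illustration of uniform validation, the sketch below checks a value against a type definition without knowing anything about what the data means. The validate function and the registry are hypothetical. It handles only "number", "string", "array", and "object", and it assumes JSON Schema's default behavior that a listed property may be absent.

```python
REF_PREFIX = "#/$defs/"


def validate(value, type_def: dict, registry: dict) -> bool:
    """Return True if value conforms to type_def; refs are looked up in registry."""
    if "$ref" in type_def:
        type_def = registry[type_def["$ref"][len(REF_PREFIX):]]
    kind = type_def["type"]
    if isinstance(kind, dict):
        # A property may carry its type as a nested reference, as in the examples.
        return validate(value, kind, registry)
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "string":
        return isinstance(value, str)
    if kind == "array":
        return isinstance(value, list) and all(
            validate(item, type_def["items"], registry) for item in value
        )
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = type_def.get("properties") or {}
        # Assumption: only properties that are present get checked.
        return all(
            validate(value[name], prop, registry)
            for name, prop in props.items()
            if name in value
        )
    raise ValueError(f"unsupported type: {kind!r}")


NUMBER_ID = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
POINT_ID = "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb"
registry = {
    NUMBER_ID: {"type": "number"},
    POINT_ID: {
        "type": "object",
        "properties": {
            "x": {"type": {"$ref": REF_PREFIX + NUMBER_ID}},
            "y": {"type": {"$ref": REF_PREFIX + NUMBER_ID}},
        },
    },
}
points_type = {"type": "array", "items": {"$ref": REF_PREFIX + POINT_ID}}
ok = validate([{"x": 1.0, "y": 2.5}], points_type, registry)
```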