Dl/DataType: Difference between revisions

 
(6 intermediate revisions by the same user not shown)
= Considerations =
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">url should have enough information to locate the data</div>
<div class="mw-collapsible-preview">Referencing a type</div>
<div class="mw-collapsible-content">
<div class="mw-collapsible-content">
The url field should contain enough information for a user to locate the data. For example, <code>"s3://mubucket/stock_quotes/2023-08-20.jsonl"</code> is a good url if your data lake lives in only one AWS region. If your data lake spans multiple AWS regions, include the region ID in the url so that you know which region the bucket belongs to.
A type can be referenced with an object of the form <code>{"$ref": "#/$defs/TYPE_ID"}</code>, where <code>TYPE_ID</code> is the id of the referenced type.
 
For example:
<pre><nowiki>
{"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"} references a type with id of "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
</nowiki></pre>
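A minimal Python sketch of how such a reference might be resolved. The <code>DEFS</code> registry, the <code>resolve_ref</code> helper, and the assumption that the referenced type is a plain number are all hypothetical and only illustrate the lookup; they are not part of the model definition.

```python
# Hypothetical registry mapping type ids to type definitions (assumption).
DEFS = {
    "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa": {
        "id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "type": "number",  # assumed for illustration
    },
}

REF_PREFIX = "#/$defs/"


def resolve_ref(ref, defs):
    """Return the type definition that a {"$ref": ...} object points to."""
    target = ref["$ref"]
    if not target.startswith(REF_PREFIX):
        raise ValueError(f"unsupported reference: {target!r}")
    type_id = target[len(REF_PREFIX):]
    if type_id not in defs:
        raise KeyError(f"unknown type id: {type_id}")
    return defs[type_id]


resolved = resolve_ref(
    {"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"}, DEFS
)
```

Stripping the <code>#/$defs/</code> prefix and looking up the remaining id keeps the reference format and the storage of type definitions independent of each other.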
</div>
</div>
<p></p>
 
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Possible extension of the JSON schema to support more primitive types</div>
<div class="mw-collapsible-content">
JSON Schema supports only a few primitive types, such as string, number, and integer. However, many systems support more primitive types than JSON does. For example, a Parquet file can have "timestamp" columns, but "timestamp" is not a type in the JSON Schema specification. We will extend the JSON Schema specification with more common primitive types so that it fits more data lake use cases.
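
As a rough illustration of what such an extension could look like, the Python sketch below keeps the standard JSON Schema primitives in one table and adds a "timestamp" primitive in a second one. The table names, the <code>is_instance_of</code> helper, and the choice to represent a timestamp as a Python <code>datetime</code> are assumptions made for this example; the actual set of extended types is not defined here.

```python
from datetime import datetime

# Primitive types defined by the JSON Schema specification.
JSON_SCHEMA_PRIMITIVES = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "null": lambda v: v is None,
}

# Hypothetical extension: primitives that JSON Schema does not define.
EXTENDED_PRIMITIVES = {
    "timestamp": lambda v: isinstance(v, datetime),
}


def is_instance_of(value, type_name):
    """Check a value against a standard or extended primitive type name."""
    checks = {**JSON_SCHEMA_PRIMITIVES, **EXTENDED_PRIMITIVES}
    if type_name not in checks:
        raise ValueError(f"unknown primitive type: {type_name!r}")
    return checks[type_name](value)
```

Keeping the extended types in a separate table makes it easy to see which type names go beyond the JSON Schema specification.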
</div>
</div>
<p></p>
 
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Description field helps to improve data discovery</div>
<div class="mw-collapsible-content">
We build a full-text index on the description field. This allows users to search for a type or a field by keywords that appear in its description, which improves data discovery.
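
The idea can be sketched in Python with a small in-memory inverted index. This is only an illustration under stated assumptions: the <code>tokenize</code>, <code>build_index</code>, and <code>search</code> helpers and the sample type are hypothetical, and a real deployment would more likely rely on a search engine or a database full-text index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(types):
    """Map each keyword to the (type id, field name) pairs whose description uses it.

    A field name of None means the keyword appears in the type's own description.
    """
    index = defaultdict(set)
    for t in types:
        for word in tokenize(t.get("description", "")):
            index[word].add((t["id"], None))
        for name, field in t.get("properties", {}).items():
            for word in tokenize(field.get("description", "")):
                index[word].add((t["id"], name))
    return index


def search(index, keyword):
    """Return matching (type id, field name) pairs in a stable order."""
    hits = index.get(keyword.lower(), set())
    return sorted(hits, key=lambda hit: (hit[0], hit[1] or ""))


# Sample type used only for this example.
POINT = {
    "id": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
    "description": "Represent a point on a two dimensional canvas",
    "properties": {
        "x": {"description": "The x coordinate value, measured in miles"},
        "y": {"description": "The y coordinate value, measured in miles"},
    },
}

index = build_index([POINT])
field_hits = search(index, "miles")
type_hits = search(index, "canvas")
```

Here <code>field_hits</code> contains the "x" and "y" fields, and <code>type_hits</code> contains the type itself.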
</div>
</div>
<p></p>
 
<div class="toccolours mw-collapsible mw-collapsed expandable">
<div class="mw-collapsible-preview">Schema definition can also include constraints</div>
<div class="mw-collapsible-content">
These constraints help us validate data uniformly, in a content-agnostic way.
 
For example:
<pre><nowiki>
# This represents a structure with two numeric member fields, "x" and "y".
{
    "id": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
    "type": "object",
    "properties": {
        "x": {
            "type": {"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"},
            "description": "The x coordinate value, measured in miles",
            "minimum": 0.0
        },
        "y": {
            "type": {"$ref": "#/$defs/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"},
            "description": "The y coordinate value, measured in miles",
            "minimum": 0.0
        }
    },
    "description": "Represent a point on a two dimensional canvas"
}
</nowiki></pre>
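A minimal Python sketch of how the <code>minimum</code> constraint above could be checked without knowing anything about what the data means. The <code>validate</code> helper is hypothetical, handles only <code>minimum</code>, and omits the <code>type</code> references for brevity; it is not a full JSON Schema validator.

```python
# The point type from the example above, reduced to its constraints.
POINT = {
    "id": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
    "type": "object",
    "properties": {
        "x": {"minimum": 0.0},
        "y": {"minimum": 0.0},
    },
}


def validate(value, schema):
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    for name, field in schema.get("properties", {}).items():
        if name not in value:
            continue  # this sketch does not treat fields as required
        minimum = field.get("minimum")
        if minimum is not None and value[name] < minimum:
            errors.append(
                f"{name}: {value[name]} is less than minimum {minimum}"
            )
    return errors


ok_errors = validate({"x": 1.5, "y": 2.0}, POINT)
bad_errors = validate({"x": -1.0, "y": 2.0}, POINT)
```

Because the check reads only the schema, the same function works for any type that declares a <code>minimum</code>, which is what makes the validation uniform.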
</div>
</div>
</div>
</div>
<p></p>

Latest revision as of 05:35, 23 August 2023


Introduction

Considerations