Internet-Draft | NTV tabular format (NTV-TAB) | December 2023 |
THOMY | Expires 21 June 2024 | [Page] |
This document describes a set of simple rules for unambiguously and concisely encoding semantic tabular and multidimensional data (NTV-TAB format). These rules are based on the NTV structure and its JSON representation (JSON-NTV format).¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 21 June 2024.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The main operational standard used to exchange textual tabular data is CSV format [RFC4180]. Unfortunately CSV format is obsolete (last revision in 2005) and current CSV tools do not comply with the standard.¶
It is therefore important to define an alternative format that meets the expectations of tabular and multidimensional data exchanges. The NTV-TAB format proposed here is a response to this need.¶
The format's focus is on simplicity, lightness and web usage.¶
The key features of this format are the following:¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document also uses the following terms:¶
Tabular data is data that is structured into rows, each of which contains information about some things. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the things described by each row. This is what differentiates tabular data from other line-oriented formats. [W3C_TAB]¶
Two main uses are identified for tabular data:¶
This document deals with this second use.¶
In structure-oriented use, columns and rows are not equivalent, the columns (or Fields) represent the 'semantics' of the data and the rows represent a specific combination of Field's values according to the structure defined by the tabular data (Dataset). The nature of the rows is often implicit.¶
Two basic patterns are present in Datasets:¶
Table 1 and Table 2 present an example of such patterns¶
Root | level 1 | level 2 |
---|---|---|
A | B | D |
A | B | E |
A | C | F |
A | C | G |
Value | row | col |
---|---|---|
1 | A | C |
2 | A | D |
3 | B | C |
4 | B | D |
Taking these structures into account leads to significant duplication of data. In the general case, Datasets mix these different structures.¶
If we now observe the relationships between Fields [TAB-ANA], we can identify four main uses:¶
Example: Price list of different foods based on packaging for the year 2022.Table 3¶
Id | Product | Food | Packaging | Weight | Price | Period | Availability |
---|---|---|---|---|---|---|---|
11 | apple | fruit | bag | 1 kg | 1 | 2nd half 2022 | Yes |
12 | apple | fruit | cardboard | 10 kg | 9 | 2nd half 2022 | Yes |
13 | orange | fruit | bag | 1 kg | 2 | 2nd half 2022 | end of 2022 |
14 | orange | fruit | cardboard | 10 kg | 18 | 2nd half 2022 | end of 2022 |
15 | pepper | vegetable | bag | 1 kg | 1.5 | 2nd half 2022 | end of 2022 |
16 | pepper | vegetable | cardboard | 10 kg | 13 | 2nd half 2022 | end of 2022 |
17 | banana | fruit | bag | 1 kg | 0.5 | 2nd half 2022 | Yes |
18 | banana | fruit | cardboard | 10 kg | 4 | 2nd half 2022 | Yes |
We find here:¶
A Field is an ordered set of Cells.¶
To represent this structure, several representations are possible depending on the nature of the data:¶
Three representations are available for a tabular object : row-oriented (list of Rows), cells-oriented (list of Cells), field-oriented (list of Fields).¶
The field-oriented representation is retained because it takes into account the semantics carried by the Fields as well as the inter-Field analysis presented above.¶
A Dataset is then seen as a set of Fields representing the properties of the entire Dataset.¶
The order of Fields or Rows is not relevant.¶
A Dataset is represented by the following NTV entities:¶
The JSON format of a NTVdataset is his JSON-NTV format.¶
This category is the usual representation of a Field with different values (Full format) or with several identical values (Unique format).¶
Full format :¶
The Full format is the format that does not use any coding. Codec and NTVfield are identical. The NTVfield is therefore a NTVlist where the NTVname is the name of the Field, the NTVtype is the default type of the NTVcells and the NTVvalue is the list of NTVcells.¶
Unique format :¶
The Unique format is used when all NTVcells are identical. The Codec is the NTVcell.Codec and NTVfield are identical (coding is implict). The NTVfield is therefore the NTVcell.¶
Note :¶
This category completes the simple formats with the other most common representations of a Field :¶
In those formats, Codec is explicit and is the TVlist of different Field NTVcells (Codec). The NTVfield is a NVlist where the NTVname is the name of the Field.¶
Complete format :¶
The "complete format" is equivalent to the format used to store categorical variables.¶
The NTVfield is a NVlist composed with two NTV entities :¶
The list of NTVcells is reconstituted by replacing the integers in the coding Vlist with the NTVcell at the coding index in the Codec (e.g. pandas categories and codes).¶
Example JsonNTVvalue ( "product" Field) :¶
Sparse format :¶
A specific format (one dimensional sparse LIL format) is used for sparse data. It is defined by:¶
The NTVfield is a NVlist composed with three NTV entities :¶
The list of NTVcells is reconstituted by replacing in a list of 'fill_value', the values with index in the Coding Vlist by the corresponding value defined by the Ref index in the Codec TVlist.¶
Example JsonNTVvalue ( "food" Field) :¶
Primary format :¶
This format is equivalent to the Complete format where the Keys Vlist is calculated with the "repetition coefficient".¶
The NTVfield is a NVlist composed with two NTV entities :¶
The Keys Vlist is generated with the formula:¶
where:¶
Example: coef = 2, period = 3, Keys length = 12¶
The Repetition coefficient is the number of adjacent identical values in the Keys list.¶
This category of formats reduces the size of Complete format with optimized Keys. The length of Keys is reduced with using of derived (Relative format) or coupled (Implicit format) relationships between two Fields.¶
In those formats, Codec is explicit and is the TVlist of different Field NTVcells (Codec). The NTVfield is a NVlist where the NTVname is the name of the Field.¶
Implicit format :¶
This representation is associated with "coupled" Fields. These Fields have a one-to-one correspondence.¶
The NTVfield is a NVlist composed with two NTV entities :¶
This format is equivalent to the Complete format where Keys is the Keys of the Field (with Complete format) defined by Ref.¶
Example JsonNTVvalue ( "weight" Field is associated with "packaging" Field ) :¶
Relative format :¶
This representation is associated with "derived" Fields. These Fields have a one-to-many correspondence.¶
The values of a "derived" Field are inferred from the values of the parent Field.¶
The Field is a NVlist composed with three NTV entities :¶
This format is equivalent to the Complete format where the Keys Vlist is obtained by replacing the values of the Keys Vlist of the parent Field with the corresponding values in the Relative Keys (the length of the Relative Keys is the length of the Codec of the parent Field).¶
Example JsonNTVvalue ( "food" Field - "product" Field is the parent Field of "food" Field) :¶
The NTVfield structure corresponding to the format defined above are in Table 4:¶
Structure | Codec | Ref | Coding | |
---|---|---|---|---|
format | NTV | TVlist | Vsingle | Vlist |
Relative |
NTVlist¶ len = 3¶ |
x |
index¶ or name¶ |
Relative Keys¶ len < len(Field)¶ |
Complete |
NTVlist¶ len = 2¶ |
x |
Keys¶ len = len(Field)¶ |
|
Sparse |
NTVlist¶ len = 3¶ |
x |
list of index¶ sp_value¶ |
sp_index¶ 1<len<len(Field)¶ |
Implicit |
NTVlist¶ len = 2¶ |
x |
index¶ or name¶ |
|
Primary |
NTVlist¶ len = 2¶ |
x |
coef¶ len = 1¶ |
|
Unique | NTVsingle | |||
Full |
NTVlist len = len(Field)¶ |
Three levels are available to convert tabular data in JSON structure Table 5.¶
Level 0: "simple" is the usual representation of tabular data.¶
Fields are converted with the Simple or Unique format.¶
Level 1: "default" avoids duplication of information by adding simple encoding.¶
Fields are converted according to their own structure (simple, unique, categorical, sparse, periodic).¶
Level 2: "optimize" avoids duplication of information and minimizes encoding. It is the usual representation of multidimensional data.¶
This level requires an analysis of the relationships between Fields ("partition")¶
Level | Structure | ||
---|---|---|---|
mode | Type Field | format | |
0 | simple | Unique | Unique |
Simple | Full | ||
1 | default | Unique | Unique |
Simple | Full | ||
Sparse | Sparse | ||
Categorical | Complete | ||
Periodic | Primary | ||
2 | optimize | Unique | Unique |
Root coupled | Full | ||
Root derived | Complete | ||
Primary | Primary | ||
Derived | Relative | ||
Coupled | Implicit |
The example in Section 2.2 has the following JSON representation Table 6:¶
Format | JsonNTV Representations |
---|---|
Full |
{ "price::float": [ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ] }¶ { "price": [ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ] }¶ [ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ]¶ |
Complete |
{"product":[["orange","pepper","apple","banana"],¶ [2,2,0,0,1,1,3,3]]}¶ {"product": [ ["orange","pepper","apple","banana"],¶ [2, 2, 0, 0, 1, 1, 3, 3] ]}¶ [ ["orange","pepper","apple","banana"],¶ [2, 2, 0, 0, 1, 1, 3, 3] ]¶ |
Unique |
{ "period": "2nd half 2022" }¶ "2nd half 2022"¶ |
Implicit |
{"weight":[{"::string":["1 kg","10 kg"]},"packaging"]}¶ [["1 kg","10 kg"],3]¶ |
Relative |
{"food": [ {"::string": [ "fruit" , "vegetable" ]},¶ "product", [ 0, 1, 0, 0 ]] }¶ [ [ "fruit" , "vegetable" ], 1, [ 0, 1, 0, 0 ] ]¶ |
Sparse |
{"food":[{"::string":["vegetable","vegetable","fruit"]},¶ [4,5,-1]]}¶ [["vegetable","vegetable","fruit"],[4,5, 1]]¶ |
Primary |
{"packaging":[{"::string":["cardboard","bag"]},[1]]}¶ [["cardboard","bag"],[1]]¶ {"product":[["apple","orange","peppers","banana"],[2]]}¶ [["apple","orange","peppers","banana"],[2]]¶ |
The examples in Table 7 below illustrate the optimize level:¶
Data | Optimize level | |
---|---|---|
type | Full format | JsonNTV |
matrix |
[['a','a','b','b','c','c'],¶ [10,20,10,20,10,20],¶ [1,2,3,4,5,6]]¶ |
[[['a','b','c'],[2]],¶ [[10,20],[1]],¶ [1,2,3,4,5,6]]¶ |
single |
[[1,2,3,4,5,6],¶ ['a','a','a','a','a','a']]¶ |
[[1,2,3,4,5,6],¶ 'a']¶ |
complete | [[1,2,3,3,5,5]] | [[[1,2,3,5],[0,1,2,2,3,3]]] |
coupled |
[[1,2,3,3,5,5],¶ ['a','b','c','c','e','e']]¶ |
[[[1,2,3,5],[0,1,2,2,3,3]],¶ [['a','b','c','e'],0]]¶ |
derived |
[[1,2,3,4,5,6],¶ ['a','a','b','b','c','c'],¶ [10,10,10,10,20,20]]¶ |
[[1,2,3,4,5,6],¶ [['a','b','c'],[0,0,1,1,2,2]],¶ [[10,20],1,[0,0,1]]]¶ |
matrix¶ +¶ coupled¶ |
[[6,6,7,7,8,8,9,9],¶ [10,20,10,20,10,20,10,20],¶ [1,1,2,2,3,3,4,4],¶ [1,2,3,4,5,6,7,8]]¶ |
[[[6,7,8,9],[2]],¶ [[10,20],[1]],¶ [[1,2,3,4],0],¶ [1,2,3,4,5,6,7,8]]¶ |
matrix¶ +¶ coupled¶ +¶ derived¶ |
[[6,6,7,7,8,8,9,9],¶ [10,20,10,20,10,20,10,20],¶ [1,1,2,2,3,3,4,4],¶ [11,11,22,22,22,22,22,22],¶ [1,2,3,4,5,6,7,8]]¶ |
[[[6,7,8,9],[2]],¶ [[10,20],[1]],¶ [[1,2,3,4],0],¶ [[11,22],0,[0,1,1,1]],¶ [1 2,3,4,5,6,7,8]]¶ |
The examples in Table 8 below illustre NTVdataset with a length equal to 0, 1 or 2:¶
[ ] or { } | Empty NTVdataset |
[25] or [[25]] | NTVdataset with 1 NTVfield and length 1 |
[2, 1] or [[2], [1]] or [2, [1]] | NTVdataset with 2 NTVfield and length 1 |
[[2, 1]] | NTVdataset with 1 NTVfield and length 2 |
[[2, 1], [4, 3]] | NTVdataset with 2 NTVfield and length 2 |
NTV-TAB format defines the representation of a Dataset into the NTV format. This conversion is reversible (lossless).¶
Furthermore, the NTV format defines the conversion into JSON format. This conversion is also reversible.¶
The exchange format (JsonText) of a Dataset is therefore obtained by a representation in NTV-TAB format then a conversion to JSON format and finally a conversion to text format (or binary format with CBOR conversion). The data is reconstituted identically by reverse conversions.¶
As explain in Section 2.2 cells are often duplicated in a Field. The principle of NTV-TAB format is to replace duplicated data with encoding based on integers.¶
This optimization considerably reduces the size of a representation of a Dataset. Appendix A details the methodology to optimize this size.¶
NTVcells in a NTVdataset are any NTVsingle. We can therefore include in a NTVdataset the data associated with the types defined in the NTV format.¶
The 'tab' and the 'field' NTVtypes are associated to NTVdataset and NTVfield. A NTVcell can also include a NTVdataset or a NTVfield.¶
Figure 1 is an example of nested Dataset. The 'nested' JsonNTV is the representation of a Dataset with length equal 2 and composed with two Fields 'field1' and 'field2'.¶
nested = { "field1": { "dataset1:tab":{ "dts1_field1": [1,2,3], "dts1_field2": [4,5,6] }, "dataset2:tab":{ "dts2_field1": [10,20,30], "dts2_field2": [40,50,60], "dts2_field3": [70,80,90] }, }, "field2":{ "field2_1:field": [1,2,3], "field2_2:field": [4,5,6], } }
A NTV parser generates a NTV entity from a JSON-value.¶
The decoding NTV entity is directly converted into the NTVdataset and a list of NTVfields.¶
For each NTVfield the format is deduced following the structure defined in the table xxx.¶
For each format, a decoder converts the NTVvalue of the NTVfield into the chosen object.¶
Note :¶
Several NTVvalue are ambiguous to deduce the Field format : ¶
To avoid this ambiguity, precautions can be taken for Dataset with length = 2 or 3 and with a Full format:¶
Any JsonValue is a JsonNTVValue and conversely, any JsonNTVvalue is a JsonValue.¶
Thus, any JSON data may or may not be treated as JsonNTV data, so there is no need to create a specific MIME media type for JsonNTV.¶
All properties of the MIME media type "application/json" are applicable.¶
The format used for NTV data exchanges is the JSON format. So, all the security considerations of [RFC8259] apply.¶
The NTV structure provides no cryptographic integrity protection of any kind.¶
This appendix presents an analysis of NTVdataset size optimization with the defined formats.¶
The principle of defined formats is to replace duplicated data with encoding based on integers.¶
We define the size of a Dataset representation (SZ) as the sum of the encoding size and the size of unencoded unduplicated values. The coding is modeled as being the product of the values remaining to be represented (nv - nc) with an average coding size (sc):¶
where :¶
example :¶
SZ is maximal when there is no coding (sc = sv) and minimal when the coding is perfect (sc = 0):¶
We then define the following indicators:¶
The optimization of the size of the representation is then evaluated by comparing the size obtained without coding and that obtained with coding:¶
The indicators are deduced from the following four measurable values:¶
We then deduce sv = SZmax / nv as well as sc = (SZ - nc * sv) / (nv - nc)¶
In the example above, the indicators are:¶
The formats used to represent an NTVfield are in general form:¶
The size of this format can then be written (without taking into account the overhead linked to the format):¶
Comparison with the structure defined in the previous chapter allows us to deduce the parameters:¶
The gain G is therefore equal to the maximum gain 1-UL reduced by the weight of the coding corresponding to the parameter k weighted by the average size of the values compared to an integer.¶
Table 9 below specifies the values of k for the different formats:¶
Format | k coefficient | comments |
---|---|---|
Full | 0 | R = 1 |
Unique | 0 | R = 1/nv (nc = 1) |
Complete | 1 | R = nc/nv + si/sv |
Primary | 1 / nv | R = nc/nv + si/sv/nv |
Coupled | 1 / nv | R = nc/nv + si/sv/nv |
Sparse | 2 * ns / nv | R = nc/nv + 2*si/sv*ns/nv |
Derived | nd / nv | R = nc/nv + si/sv*nd/nv |
This appendix presents the compatibility between Tableschema [TABLE] and the NTV-TAB format.¶
Table Schema is a simple language- and implementation-agnostic way to declare a schema for tabular data. A Table Schema is represented by a descriptor. The descriptor MUST be a JSON object with defined properties (JsonMember).¶
Table Schema define following descriptors and properties:¶
Fields property¶
Field descriptor¶
Field properties¶
Defined Properties:¶
Table properties¶
Three levels of compatibility are addressed :¶
Concepts¶
Type / Format¶
Constraints¶
These compatibility levels are reached, which makes it possible to validate an NTVdataset with a schema defined according to the Table Schema format.¶
The following principles should then be considered to validate an NTVdataset:¶
Figure 2 is an example of Dataset with Full format ('tab_data1' without NTVtypes) and with other formats ('tab_data2').¶
tab_data1 = { "index": [100, 200, 300, 400, 500, 600], "dates": ["1964-01-01", "1985-02-05", "2022-01-21", "1964-01-01", "1985-02-05", "2022-01-21"], "value": [10, 10, 20, 20, 30, 30], "coord": [[1,2], [3,4], [5,6], [7,8], [3,4], [5,6]], "names": ["john", "eric", "judith", "mila", "hector", "maria"], "unique": ["true, "true", "true", "true", "true", "true"] } tab_data2 = { "index": [100, 200, 300, 400, 500, 600], "dates": {"::date":[["1964-01-01","1985-02-05","2022-01-21"],[1]}, "value": [[10, 20, 30], [2]], "coord::point": [[1,2], [3,4], [5,6], [7,8], [3,4], [5,6]], "names::string":["john", "eric", "judith", "mila", "hector", "maria"], "unique": True }
The schema in Figure 3 is valid with 'tab_data1' and 'tab_data2' formats¶
tab_schema = { "fields": [ {"name":"index", "type":"integer", "constraint":{"minimum":50}}, {"name":"dates", "type":"date"}, {"name":"value", "type":"integer"}, {"name":"coord", "type":"geopoint", "format":"array"}, {"name":"names"}, {"name":"unique", "type":"boolean"} ] }
TBD¶