BSON 8 is a binary encoding for
JSON documents. The top-level entity described by the spec is the
document
, which contains the total size of the document, a
sequence of elements, and a trailing null byte.
We can describe the above in Poke with the following type definition:
type BSON_Doc = struct { offset<int<32>,B> size; BSON_Elem[size - size'size - 1#B] elements; byte endmark : endmark == 0x0; };
BSON elements come in different kinds, which correspond to the different types of JSON entities: 32-bit integers, 64-bit integers, strings, arrays, timestamps, and so on. Every element starts with a tag, which is a 8-bit unsigned integer that identifies it’s kind, and a name encoded as a NULL-terminated string. What comes next depends on the kind of element.
The following Poke type definition describes a subset of BSON elements, namely integers, big integers and strings:
type BSON_Elem = struct { byte tag; string name; union { int32 integer32 : tag == 0x10; int64 integer64 : tag == 0x12; BSON_String str : tag == 0x02; } value; };
The union in BSON_Elem
corresponds to the variable part. When
poke decodes an union, it tries to decode each alternative (union
field) in turn. The first alternative that is successfully decoded
without raising a constraint violation exception is the chosen one.
If no alternative can be decoded, a constraint violation exception is
raised.
To see this process in action, let’s use the BSON corresponding to the following little JSON document:
{ "name" : "Jose E. Marchesi", "age" : 40, "big" : 1076543210012345 }
Let’s take a look to the different elements:
(poke) .load bson.pk (poke) var d = BSON_Doc 0#B (poke) d.elements'length 0x3UL (poke) d.elements[0] BSON_Elem { tag=0x2UB, name="name", value=struct { str=BSON_String { size=0x11, value="Jose E. Marchesi", chars=[0x4aUB,0x6fUB,0x73UB,0x65UB...] } } } (poke) d.elements[1] BSON_Elem { tag=0x10UB, name="age", value=struct { integer32=0x27 } } (poke) d.elements[2] BSON_Elem { tag=0x12UB, name="big", value=struct { integer64=0x3d31c3f9e3eb9L } }
Note how unions decode into structs featuring different fields. What field is available depends on the alternative chosen while decoding.
In the session above, d.elements[1].value
contains an
integer32
field, whereas d.elements[2].value
contains an
integer64
field. Let’s see what happens if we try to access the
wrong field:
(poke) d.elements[1].value.integer64 unhandled invalid element exception
We get a run-time exception. This kind of errors cannot be caught at
compile time, since both integer32
and integer64
are
potentially valid fields in the union value.
Unlike in C, Poke unions are tagged. Unions have their own lexical closures, and it is the capured values that determine what field is chosen at every time. Wherever the union goes, its tag accompanies it.
To see this more clearly, consider the following alternative Poke description of the BSON elements:
type BSON_Elem = union { struct { byte tag : tag == 0x10; string name; int32 value; } integer32; struct { byte tag : tag == 0x12; string name; int64 value; } integer64; struct { byte tag : tag == 0x12; string name; BSON_String value; } str; };
This description is way more verbose than the one used in previous sections, but it shows a few important properties of Poke unions.
First, the constraints guiding the decoding are not required to appear
in the union itself: it is a recursive process. In this example,
BSON_String
could have constraints on it’s own, and these
constraints will also impact the decoding of the union.
Second, there are generally many different ways to express the same plain binary using different type structures. This is no different than getting different parse trees from the same sequence of tokens using different grammars denoting the same language.
See how different a BSON element looks (and feels) using this alternative description:
(poke) d.elements[0] BSON_Elem { str=struct { tag=0x2UB, name="name", value=BSON_String { size=0x11, value="Jose E. Marchesi", chars=[0x4aUB,0x6fUB,0x73UB,0x65UB...] } } } (poke) d.elements[1] BSON_Elem { integer32=struct { tag=0x10UB, name="age", value=0x27 } } (poke) d.elements[2] BSON_Elem { integer64=struct { tag=0x12UB, name="big", value=0x3d31c3f9e3eb9L } }
What is the best way? It certainly depends on the kind of data you want to manipulate, and the level of abstraction you want to achieve. Ultimately, it is up to you.
Generally speaking, the best structuring is the one that allows you to manipulate the data in terms of the structured abstractions as naturally as possible. That’s the art and craft of writing good pickles.