Next: , Previous: , Up: Structuring Data   [Contents][Index]


4.12 Dealing with Alternatives

4.12.1 BSON

BSON 8 is a binary encoding for JSON documents. The top-level entity described by the spec is the document, which contains the total size of the document, a sequence of elements, and a trailing null byte.

We can describe the above in Poke with the following type definition:

type BSON_Doc =
  struct
  {
    offset<int<32>,B> size;
    BSON_Elem[size - size'size - 1#B] elements;
    byte endmark : endmark == 0x0;
  };

BSON elements come in different kinds, which correspond to the different types of JSON entities: 32-bit integers, 64-bit integers, strings, arrays, timestamps, and so on. Every element starts with a tag, which is a 8-bit unsigned integer that identifies it’s kind, and a name encoded as a NULL-terminated string. What comes next depends on the kind of element.

The following Poke type definition describes a subset of BSON elements, namely integers, big integers and strings:

type BSON_Elem =
  struct
  {
    byte tag;
    string name;

    union
    {
      int32 integer32 : tag == 0x10;
      int64 integer64 : tag == 0x12;
      BSON_String str : tag == 0x02;
    } value;
  };

The union in BSON_Elem corresponds to the variable part. When poke decodes an union, it tries to decode each alternative (union field) in turn. The first alternative that is successfully decoded without raising a constraint violation exception is the chosen one. If no alternative can be decoded, a constraint violation exception is raised.

To see this process in action, let’s use the BSON corresponding to the following little JSON document:

{
    "name" : "Jose E. Marchesi",
    "age" : 40,
    "big" : 1076543210012345
}

Let’s take a look to the different elements:

(poke) .load bson.pk
(poke) var d = BSON_Doc  0#B
(poke) d.elements'length
0x3UL
(poke) d.elements[0]
BSON_Elem {
  tag=0x2UB,
  name="name",
  value=struct {
    str=BSON_String {
      size=0x11,
      value="Jose E. Marchesi",
      chars=[0x4aUB,0x6fUB,0x73UB,0x65UB...]
    }
  }
}
(poke) d.elements[1]
BSON_Elem {
  tag=0x10UB,
  name="age",
  value=struct {
    integer32=0x27
  }
}
(poke) d.elements[2]
BSON_Elem {
  tag=0x12UB,
  name="big",
  value=struct {
    integer64=0x3d31c3f9e3eb9L
  }
}

Note how unions decode into structs featuring different fields. What field is available depends on the alternative chosen while decoding.

In the session above, d.elements[1].value contains an integer32 field, whereas d.elements[2].value contains an integer64 field. Let’s see what happens if we try to access the wrong field:

(poke) d.elements[1].value.integer64
unhandled invalid element exception

We get a run-time exception. This kind of errors cannot be caught at compile time, since both integer32 and integer64 are potentially valid fields in the union value.

4.12.2 Unions are Tagged

Unlike in C, Poke unions are tagged. Unions have their own lexical closures, and it is the capured values that determine what field is chosen at every time. Wherever the union goes, its tag accompanies it.

To see this more clearly, consider the following alternative Poke description of the BSON elements:

type BSON_Elem =
  union
  {
    struct
    {
      byte tag : tag == 0x10;
      string name;
      int32 value;
    } integer32;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      int64 value;
    } integer64;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      BSON_String value;
    } str;
  };

This description is way more verbose than the one used in previous sections, but it shows a few important properties of Poke unions.

First, the constraints guiding the decoding are not required to appear in the union itself: it is a recursive process. In this example, BSON_String could have constraints on it’s own, and these constraints will also impact the decoding of the union.

Second, there are generally many different ways to express the same plain binary using different type structures. This is no different than getting different parse trees from the same sequence of tokens using different grammars denoting the same language.

See how different a BSON element looks (and feels) using this alternative description:

(poke) d.elements[0]
BSON_Elem {
  str=struct {
    tag=0x2UB,
    name="name",
    value=BSON_String {
      size=0x11,
      value="Jose E. Marchesi",
      chars=[0x4aUB,0x6fUB,0x73UB,0x65UB...]
    }
  }
}
(poke) d.elements[1]
BSON_Elem {
  integer32=struct {
    tag=0x10UB,
    name="age",
    value=0x27
  }
}
(poke) d.elements[2]
BSON_Elem {
  integer64=struct {
    tag=0x12UB,
    name="big",
    value=0x3d31c3f9e3eb9L
  }
}

What is the best way? It certainly depends on the kind of data you want to manipulate, and the level of abstraction you want to achieve. Ultimately, it is up to you.

Generally speaking, the best structuring is the one that allows you to manipulate the data in terms of the structured abstractions as naturally as possible. That’s the art and craft of writing good pickles.


Footnotes

(8)

http://bsonspec.org/spec.html


Next: Structured Integers, Previous: Padding and Alignment, Up: Structuring Data   [Contents][Index]