Decode-Compute-Encode (GNU poke Manual)

1.1.1 Decode-Compute-Encode

Computing with data whose form is not convenient to manipulate, as is often the case with unstructured binary data, requires a preliminary step that transforms the data into a more convenient representation, usually one with a higher level of abstraction. This step is known in computer jargon as unmarshalling when the data is fetched from some storage or transmission medium or, more generally, as decoding.

Once the computation has been performed, the result should be transformed back to the low-level representation to be stored or transmitted. This is performed in a closing step known as marshalling or, more generally, encoding.

Consider the following C program whose purpose is to read a 32-bit signed integer from a byte-oriented storage medium at a given offset, multiply it by two, and store the result at the same offset.

#include <unistd.h>

/* Byte-order tags for the `endian' argument.  */
enum { LITTLE, BIG };

void double_number (int fd, off_t offset, int endian)
{
   int number, i;
   unsigned char b[4];

   /* Decode.  */
   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      read (fd, &b[i], 1);

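   /* Assemble in unsigned arithmetic, so that shifting a byte into the
      sign bit position is well defined.  */
   if (endian == BIG)
     number = (int) ((unsigned) b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3]);
   else
     number = (int) ((unsigned) b[3] << 24 | b[2] << 16 | b[1] << 8 | b[0]);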

   /* Compute.  */
   number = number * 2;

   /* Encode.  */
   if (endian == BIG)
   {
     b[0] = (number >> 24) & 0xff;
     b[1] = (number >> 16) & 0xff;
     b[2] = (number >> 8) & 0xff;
     b[3] = number & 0xff;
   }
   else
   {
     b[3] = (number >> 24) & 0xff;
     b[2] = (number >> 16) & 0xff;
     b[1] = (number >> 8) & 0xff;
     b[0] = number & 0xff;
   }

   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      write (fd, &b[i], 1);
}

As we can see, decoding fetches the data from the storage in simple units, bytes, and then assembles the more abstract entity on which the computation will be performed, in this case a signed 32-bit integer. Considerations such as endianness, the encoding of negative numbers (assumed to be two’s complement in this example and handled automatically by C) and error conditions (omitted in this example for clarity) must be handled properly.
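To give an idea of what the omitted error handling involves, the decoding step could be sketched as below. This is only one possible arrangement, not part of the program above: the names `read_exact`, `decode_int32` and `enum endian` are hypothetical, and the sketch assumes a POSIX system.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical byte-order tag, introduced for this sketch.  */
enum endian { LITTLE, BIG };

/* Read exactly LEN bytes from FD into BUF, retrying after short reads
   and interrupted calls.  Return 0 on success, -1 on error or on
   premature end of file.  */
int
read_exact (int fd, unsigned char *buf, size_t len)
{
  size_t done = 0;

  while (done < len)
    {
      ssize_t n = read (fd, buf + done, len - done);

      if (n < 0 && errno == EINTR)
        continue;               /* Interrupted by a signal: try again.  */
      if (n <= 0)
        return -1;              /* Error, or end of file too early.  */
      done += (size_t) n;
    }
  return 0;
}

/* Assemble a signed 32-bit integer from the four bytes in B.  */
int32_t
decode_int32 (const unsigned char b[4], enum endian endian)
{
  uint32_t u;

  if (endian == BIG)
    u = (uint32_t) b[0] << 24 | (uint32_t) b[1] << 16
        | (uint32_t) b[2] << 8 | (uint32_t) b[3];
  else
    u = (uint32_t) b[3] << 24 | (uint32_t) b[2] << 16
        | (uint32_t) b[1] << 8 | (uint32_t) b[0];

  /* Interpret the bit pattern as two's complement without relying on
     an implementation-defined conversion.  */
  if (u <= INT32_MAX)
    return (int32_t) u;
  return (int32_t) (u - 0x80000000u) + INT32_MIN;
}
```

With these helpers, the bytes `ff ff ff fe` decode to -2 when read as big endian. Even this small fragment is several times longer than the computation it serves.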

Conversely, encoding turns the signed 32-bit integer into a sequence of bytes and then writes them out to the storage at the desired offset. Again, this requires taking endianness into account and handling error conditions.
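The encoding step can be sketched in the same spirit. As before, this is an illustration under stated assumptions rather than part of the program above: `encode_int32`, `write_exact` and `enum endian` are hypothetical names, and a POSIX system is assumed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical byte-order tag, introduced for this sketch.  */
enum endian { LITTLE, BIG };

/* Split NUMBER into four bytes stored in B, in the given byte order.  */
void
encode_int32 (int32_t number, unsigned char b[4], enum endian endian)
{
  /* Conversion to unsigned is well defined (modulo 2^32) and yields
     the two's complement bit pattern of NUMBER.  */
  uint32_t u = (uint32_t) number;
  int i;

  for (i = 0; i < 4; ++i)
    {
      /* Byte number I, counting from the least significant one.  */
      unsigned char byte = (unsigned char) ((u >> (8 * i)) & 0xff);

      if (endian == BIG)
        b[3 - i] = byte;
      else
        b[i] = byte;
    }
}

/* Write exactly LEN bytes from BUF to FD, retrying after short writes
   and interrupted calls.  Return 0 on success, -1 on error.  */
int
write_exact (int fd, const unsigned char *buf, size_t len)
{
  size_t done = 0;

  while (done < len)
    {
      ssize_t n = write (fd, buf + done, len - done);

      if (n < 0)
        {
          if (errno == EINTR)
            continue;           /* Interrupted by a signal: try again.  */
          return -1;
        }
      done += (size_t) n;
    }
  return 0;
}
```

Encoding -2 as big endian produces the bytes `ff ff ff fe`, and as little endian `fe ff ff ff`. Working on the unsigned bit pattern avoids right-shifting a negative value, whose result is implementation-defined in C.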

This example may look simplistic and artificial, and it is, but too often the computation proper (like multiplying the integer by two) is far more straightforward than the decoding and encoding of the data used for the computation.

Generally speaking, decoding and encoding binary data is laborious and error-prone. Think about sequences of elements, variable-length and clever compact encodings, elements not aligned to byte boundaries, the always bug-prone endianness, and so on. Dirty business, sometimes risky, and always boring.
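As a small taste of the unaligned case, here is a minimal sketch that decodes a field starting at an arbitrary bit offset. The function name `extract_bits` and the bit numbering convention (bit 0 is the most significant bit of the first byte) are assumptions made for this illustration. Other formats number bits differently, which is one more thing to get wrong.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the WIDTH bits (1 to 32) that start at bit offset BIT_OFF in
   BUF, where bit 0 is the most significant bit of BUF[0].  The caller
   must make sure the field lies entirely within the buffer.  */
uint32_t
extract_bits (const unsigned char *buf, size_t bit_off, unsigned width)
{
  uint32_t value = 0;
  unsigned i;

  for (i = 0; i < width; ++i)
    {
      size_t bit = bit_off + i;
      unsigned char byte = buf[bit / 8];
      unsigned shift = 7 - (unsigned) (bit % 8);

      /* Append the next bit at the least significant end.  */
      value = (value << 1) | ((byte >> shift) & 1u);
    }
  return value;
}
```

Given the two bytes `ab cd`, extracting 8 bits at bit offset 4 yields `0xbc`, a value that straddles both bytes.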