diff --git a/docs/binary-data.md b/docs/binary-data.md new file mode 100644 index 0000000..6749120 --- /dev/null +++ b/docs/binary-data.md @@ -0,0 +1,39 @@ +# Binary Data + +_Page not organized well and under development, but here are the highlights_... + +## Overview + +Support for binary data in the Universal Binary JSON specification was in discussion for 2 years before it was finalized. Many, many different approaches were considered and discarded all in the name of maintaining compatibility with JSON while keeping an eye on performance. The result is a surprisingly simple and binary-efficient construct that is also easily translated to JSON and back to UBJSON again with the help of a good library, namely: a [strongly-typed](container-types#optimized-format) _array_ of [_uint8_](value-types#numeric-types) values. + +## Compatibility with JSON + +Representing binary data efficiently in Universal Binary JSON while still maintaining compatibility with JSON is deceptively simple: leverage a [strongly-typed _array_](container-types#optimized-format) of [_uint8_](value-types#numeric-types) values -- essentially a list of integers. There is no explicit _binary_ [type](type-reference), but instead the ability to represent binary inside of Universal Binary JSON in a very optimized and JSON-compatible construct. The [#1 goal](http://ubjson.org/#goals) of Universal Binary JSON is compatibility with JSON. Compatibility is defined as: + +``` +if + A.ubjson -> translated to -> B.json + && + B.json -> translated to -> C.ubjson +then + A.ubjson == C.ubjson +``` + +All of the Universal Binary JSON value and container types are 1:1 compatible with JSON. The only _semantically_ (but not _structurally_) incompatible construct in UBJSON is strongly-typed containers in that once the container is converted to JSON the typing of the container is lost. Converting the container back to UBJSON and re-enabling the strong-typing does require assistance from the encoding library. 
Since JSON has no direct support for binary data or this style of strongly-typed container, the translation to JSON converts the strongly-typed _array_ to an _array_ of simple JSON types; in the case of binary data, it would be an _array_ of _number_ values (in the example above this is the translation step from A.ubjson to B.json). Going from JSON back to UBJSON (B.json -> C.ubjson) has the potential for losing the strongly-typed container information and has to be handled with care to re-enable the optimized representation of that information back in the UBJSON format. + +## Library Implementation Recommendation + +Library implementors are encouraged to provide this functionality in the form of two _optional settings_ that can be turned on during generation: + +* [x] Automatically use strongly typed containers when possible +* [x] Force use of strongly typed containers based on first element type + +> ⓘ Specific naming and implementation is up to the developer. This is merely a suggestion on how to handle this situation as elegantly as possible for the client. + +The idea is that the library can either make an automated attempt at reconstructing the strongly typed containers OR, if you have a lot of knowledge of your data, you can force the library to reconstitute what looks to be a strongly typed container based on the first element type. + +> ⚠ If _Force_ is used the library should take care to detect and fail if a different type of value is found in the container during generation. 
More specifically, the library should remember the first element type and continue checking types as it is generating UBJSON to ensure the type stays consistent. + +_Still under development..._ + +## Performance Considerations + +When converting UBJSON that contains a large amount of binary data, be aware that each strongly-typed container of _uint8_ values converts to a JSON array of _number_ values. Every byte becomes one to three decimal digits, and the translation also introduces a ',' character between every value in the array, so this **at least doubles the size** of the binary data. \ No newline at end of file diff --git a/docs/contact.md b/docs/contact.md new file mode 100644 index 0000000..9834704 --- /dev/null +++ b/docs/contact.md @@ -0,0 +1,5 @@ +# Contact + +Please file an issue at [GitHub](https://github.com/ubjson/universal-binary-json)! I really would like to get any comments, questions or feedback on the specification you think are important to share. UBJSON will only be successful through the passion of many. + +If you are using the Universal Binary JSON format in an application, we’d love to hear about it. If you wrote [a library](libraries) to add support for it to your favorite language, please let us know and we’ll add it to the site! \ No newline at end of file diff --git a/docs/container-types.md b/docs/container-types.md new file mode 100644 index 0000000..402d8f0 --- /dev/null +++ b/docs/container-types.md @@ -0,0 +1,420 @@ +# Container Types + +The Universal Binary JSON Specification defines a total of **2 container types** matching [JSON's container types](http://json.org/): + +1. [Array Type](#array-type) +2. [Object Type](#object-type) + +Ignoring special-case optimizations, the design of the Universal Binary JSON containers is intentionally identical to JSON (the same start/end markers) and is **streaming-friendly**; more specifically, containers can be written out on demand without knowing the size of the container ahead of time. 
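To make the streaming-friendly property concrete, here is a minimal Python sketch. The helper name `write_int8_array_streaming` is made up for illustration, and the sketch assumes every element fits in an int8; it only shows that the start marker, the elements and the end marker can be written as data arrives, without knowing the element count in advance.

```python
import io
import struct


def write_int8_array_streaming(out, values):
    """Write an unoptimized UBJSON array of int8 values to a binary stream.

    Hypothetical helper: elements are emitted as they arrive, so the
    total count is never needed up front.
    """
    out.write(b"[")  # array start marker
    for value in values:
        # 'i' is the int8 marker, followed by one signed byte.
        out.write(b"i" + struct.pack(">b", value))
    out.write(b"]")  # array end marker


buf = io.BytesIO()
# The source is a generator, so its length is unknown while writing.
write_int8_array_streaming(buf, (n * 2 for n in range(3)))
encoded = buf.getvalue()
print(encoded)
```

The trade-off is that a reader of this form has to scan for the end marker, which is what the optional **count** parameter avoids.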
+ +### Optimized Format + +Both _array_ and _object_ container types in UBJSON support being represented in a more optimized format that can increase parsing performance as well as shrink data size in most cases (without compression). Please see [Optimized Format](#optimized-format) below for details on how to leverage this support. + +# Array Type + +* * * + +* [Usage](#array-use) +* [Example](#array-example) + +The _array_ type in Universal Binary JSON is defined as: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Type | Size | Marker | Length | Data Payload |
| --- | --- | --- | --- | --- |
| array | 2+ bytes** | [ and ] | Optional | Yes (if non-empty) |
+ +** See [Optimized Format](#optimized-format) below. + +### Usage + +The _array_ type in Universal Binary JSON is equivalent to the **array** type from the [JSON specification](http://json.org/). + +### Example + +JSON snippet (42 bytes compacted): + +```json +[ + null, + true, + false, + 4782345193, + 153.132, + "ham" +] +``` + +UBJSON snippet (25 bytes, **40% smaller**): + +``` +[[] + [Z] + [T] + [F] + [L][4782345193] + [d][153.132] + [S][i][3][ham] +[]] +``` + +> ✓ Universal Binary JSON format is about **40% smaller** than the compacted JSON. The value 4782345193 does not fit in an int32, so it is written as an int64 ([L]). + +# Object Type + +* * * + +* [Usage](#object-use) +* [Example](#object-example) + +The _object_ type in Universal Binary JSON is defined as: + +
| Type | Size | Marker | Length | Data Payload |
| --- | --- | --- | --- | --- |
| object | 2+ bytes** | { and } | Optional | Yes (if non-empty) |
+ +** See [Optimized Format](#optimized-format) below. + +### Usage + +The _object_ type in Universal Binary JSON is equivalent to the **object** type from the [JSON specification](http://json.org/). + +### Example + +JSON snippet (90 bytes compacted): + +```json +{ + "post": { + "id": 1137, + "author": "rkalla", + "timestamp": 1364482090592, + "body": "I totally agree!" + } +} +``` + +UBJSON snippet (79 bytes, **12% smaller**): + +``` +[{] + [i][4][post][{] + [i][2][id][I][1137] + [i][6][author][S][i][6][rkalla] + [i][9][timestamp][L][1364482090592] + [i][4][body][S][i][16][I totally agree!] + [}] +[}] +``` + +> ⓘ **NOTE**: The [S] (_string_) marker is omitted from each of the _names_ in the _name/value_ pairings inside the object. The JSON specification does not allow non-_string_ _name_ values, therefore the [S] marker is redundant and **must not** be used. + +# Optimized Format + +* * * + +* [Array Example](#optimized-format-example-array) +* [Object Example](#optimized-format-example-object) +* [Special Cases](#optimized-special-cases) (Null and Boolean) +* [Size & Performance Benefits](#optimized-size-perf-benefits) +* [Binary Data Support](#optimized-binary-support) + +While the basic specification for the _array_ and _object_ types is identical to the JSON specification (i.e. simple beginning and end markers), both containers support _optional_ parameters that can help optimize the container for better parsing performance and smaller size. At a very high level, the optimized format for both _array_ and _object_ container types is built around two optional parameters: **type** and **count**. + +
| Type | Size | Marker | Arg. Type | Example | Desc |
| --- | --- | --- | --- | --- | --- |
| type | 1-byte | $ | [Value Type](value-types) or [Container Type](container-types) Marker | [$][S] | string type |
| count | 1-byte | # | Integer [Numeric Value](value-types#numeric-types) | [#][i][64] | count of 64 |
+ +The effect on the container when specifying one or both parameters is as follows: + +* **type** [**$**] - when a **type** is specified, all _value_ types stored in the container (either _array_ or _object_) are considered to be of that singular type and, as a result, type markers are omitted for each value in the container. This can be thought of as providing the ability to create a strongly typed container in UBJSON. + * If a **type** is specified, it must appear before a **count**. + * If a **type** is specified, a **count** must be specified as well (otherwise it is impossible to tell when a container is ending; e.g., did you just parse ']' or the int8 value of 93?) +* **count** [**#**] - when a **count** is specified, the parser is able to know ahead of time how many child elements will be parsed. This allows the parser to pre-size any internal construct used for parsing, verify that the promised number of child _values_ was found and avoid scanning for any terminating bytes while parsing. + * A **count** can be specified without a **type**. + +> ⓘ **NOTE**: Yes, it is possible for an _array_ or _object_ to define its **type** as '[' or '{' to signal that it contains additional containers! + +> ⬇ **BONUS**: Parsers can provide _highly-optimized_ implementations for strongly typed containers of non-variable-length types (e.g. numeric, boolean, etc.) because the exact byte-length of the data is known! + +Some rules that generators and parsers need to be aware of when dealing with these optional parameters are as follows: + +* [count] A **count** must be >= 0. +* [count] A **count** can be specified by itself. +* [count] If a **count** is specified the container must not specify an end-marker. +* [count] A container that specifies a **count** must contain the specified number of child elements. +* [type] If a **type** is specified, it must appear before **count**. +* [type] If a **type** is specified, a **count** must also be specified. 
A **type** cannot be specified by itself. +* [type] A container that specifies a **type** must not contain any additional type markers for any contained value. +* [type] The **type** cannot be No-op. Indeed, creating a container whose type is “nothing” (which is what No-op actually is) does not really mean anything. + + + +## Array Example + +Below are examples of incrementally more optimized representations of an _array_ in UBJSON. + +### No Optimization + +``` +[[] + [d][29.97] + [d][31.13] + [d][67.0] + [d][2.113] + [d][23.8889] +[]] +``` + +### Optimized with count + +``` +[[][#][i][5] // An array of 5 elements. + [d][29.97] + [d][31.13] + [d][67.0] + [d][2.113] + [d][23.8889] +// No end marker since a count was specified. +``` + +### Optimized with type & count + +``` +[[][$][d][#][i][5] // An array of 5 float32 elements. + [29.97] // Value type is known, so type markers are omitted. + [31.13] + [67.0] + [2.113] + [23.8889] +// No end marker since a count was specified. +``` + + + +## Object Example + +Below are examples of incrementally more optimized representations of an _object_ in UBJSON. + +> ⓘ Remember, in UBJSON the _string_ markers ([S]) are omitted from the _names_ in the _name-value_ pairs of an Object because JSON only allows _names_ of type _string_. + +### No Optimization + +``` +[{] + [i][3][lat][d][29.976] + [i][4][long][d][31.131] + [i][3][alt][d][67.0] +[}] +``` + +### Optimized with count + +``` +[{][#][i][3] // An object of 3 name:value pairs. + [i][3][lat][d][29.976] + [i][4][long][d][31.131] + [i][3][alt][d][67.0] +// No end marker since a count was specified. +``` + +### Optimized with type & count + +``` +[{][$][d][#][i][3] // An object of 3 name:float32-value pairs. + [i][3][lat][29.976] // Value type is known, so type markers are omitted. + [i][4][long][31.131] + [i][3][alt][67.0] +// No end marker since a count was specified. 
+``` + + + +## Special Cases (Null and Boolean) + +Up until now all the examples of leveraging **type** and **count** have illustrated the benefit of optimizing out the markers from [value types](value-types) that have a data payload (e.g. numeric values, strings, etc.); since the type of all the values is known, the markers are easily omitted. There are, however, a few special value types that have **no data payload** and the markers themselves represent the value, specifically: [_null_](value-types#null-value) and [boolean](value-types#boolean-types) (no-op is not a valid type for a container). This section will take a look at how those types behave when used with strongly-typed containers. At a high level, placing these values in a strongly-typed container essentially pre-defines the value for every element in the container. In the case of an _array_, that means all the values contained in it. In the case of an _object_, it means all the _values_ associated with all the _names_ in the _name-value_ pairs. + +### Array + +``` +[[][$][F][#][I][512] // 512 'false' values. +``` + +The example above is a strongly typed _array_ of **type** _false_ and with a **count** of 512. This simple declaration is equivalent to a **514-byte** _array_ containing 512 [F] markers; instead this single line is **7 bytes** (five 1-byte markers plus a 2-byte int16 count), providing a **99% size reduction**. Admittedly this is a selective example of leveraging this feature, but the point is that there are potentially very large performance and size optimizations available if your data can take advantage of this shorthand. + +> ⓘ Strongly-typed arrays of [_null_](value-types#null) and [boolean](value-types#boolean) **must** have an empty body. The header itself defines the container's contents. + +### Object + +``` +[{][$][Z][#][i][3] + [i][4][name] // name only, no value specified. + [i][8][password] + [i][5][email] +``` + +The example above is a strongly typed _object_ of **type** _null_ and with a **count** of 3. 
When used in the context of an _object_, specifying one of these special-case values as a **type** has the effect of setting the default _value_ for every _name-value_ pair in the object; therefore the _object_ only contains the _names_ of all the pairs. For _objects_ the space savings are typically a little less drastic than in the _array_ case, depending on the size of the _names_; in the case of small _names_, the savings could be significant, approaching a **50% reduction**. + +> ⓘ Strongly-typed objects of [_null_](value-types#null) and [boolean](value-types#boolean) **must not** have any _values_ specified in the body, just the _name_ portions of the _name-value_ pairs. The header itself defines the _value_ for every _name-value_ pair. + + +## Size & Performance Benefits + +* [Optimized for Parsing](#optimized-size-perf-benefits-parsing) +* [Simple Validation Mechanism](#optimized-size-perf-benefits-validation) +* [Reduce Size up to 50%](#optimized-size-perf-benefits-size) + +The benefits realized by leveraging the optimized container types in UBJSON depend heavily on the data being stored and the implementation of the generator or parser. Barring the frustration of "_it depends_" as an answer, the benefits can be viewed at a very high level as the following: + +### Optimized for Parsing + +By specifying a **count**, you are hinting to the parser about the number of elements to expect. The performance gains come primarily from allowing the parser to pre-size its internal data structures to exactly the right size to hold pointers to the parsed values. When you specify a **type** and **count**, the parser not only knows how many child elements to expect and has less data to parse and fewer conditions to check (no marker checks), but in the case of fixed-length values it also knows the exact **byte length** of the payload! For example, consider: + +``` +[[][$][l][#][I][1024] // 1,024 int32 values + [32] + [2147483647] + [101231] + [77832823] + ... 
1,020 more int32 values ... +``` + +After the parser parses the container's header, it knows the byte length of the entire payload is 4096 (1,024 values of 4 bytes each) and in a single read operation can read all the values in and quickly break them up into their _[int32](#numeric-types)_ representations. The real performance gains come from leveraging **type** and **count** together so that the parser understands the payload in more detail. + +### Simple Validation Mechanism + +By specifying a **count** parameter, you are telling the parser the number of child elements it should find in the container. If the parser is unable to find the specified number of child elements, it can quickly report a format error to the caller. This is a very simple form of verification and not as robust as, say, a checksum-based approach, but it still provides a benefit in addition to the performance gain. + +### Reduce Size up to 50% + +This is a 1-byte-per-value reduction in any container where strong typing is used. In the case of containers holding large amounts of fairly compact data (small numbers, chars, small strings or value-types like _null_), removing the type marker from the beginning of **each** of the values in the container can almost cut the size requirements for the data in half. The smaller the containers and the bigger the individual values (large numbers, large strings), the less **size benefit** this optimization will have, but it still gives the parser a potentially significant opportunity to optimize its code paths for parsing large chunks of same-type _values_ (without needing to worry about type changes mid-container). This is covered in more detail above in _Optimized for Parsing_. + +## Binary Data Support + +This section is here for referential convenience; please see [Binary Data](binary-data) for information on storing binary data in UBJSON. 
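As a quick illustration of the construct the Binary Data page describes, the following Python sketch wraps raw bytes in a strongly-typed _array_ of _uint8_ values and shows the JSON array the same bytes translate to. The helper names `encode_count` and `encode_binary` are hypothetical, and the `U` marker for _uint8_ is taken from the spec's type reference rather than from this page, so treat this as a sketch under those assumptions and not as a reference encoder.

```python
import json
import struct


def encode_count(n):
    """Encode a container count using the smallest signed integer type."""
    if n < 0:
        raise ValueError("count must be >= 0")
    if n <= 0x7F:
        return b"i" + struct.pack(">b", n)  # int8
    if n <= 0x7FFF:
        return b"I" + struct.pack(">h", n)  # int16
    if n <= 0x7FFFFFFF:
        return b"l" + struct.pack(">i", n)  # int32
    return b"L" + struct.pack(">q", n)      # int64


def encode_binary(payload):
    """Emit [[][$][U][#][count] followed by the raw bytes (no end marker)."""
    # Assumption: 'U' is the uint8 type marker (see the type reference).
    return b"[$U#" + encode_count(len(payload)) + bytes(payload)


payload = b"\x00\xff\x10"
blob = encode_binary(payload)

# The JSON translation is a plain array of numbers, one per byte.
as_json = json.dumps(list(payload), separators=(",", ":"))
print(blob, as_json)
```

Because a **type** and **count** are both present, the payload bytes follow the header directly with no per-value markers and no end marker.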
\ No newline at end of file diff --git a/docs/developer-resources.md b/docs/developer-resources.md new file mode 100644 index 0000000..8084afd --- /dev/null +++ b/docs/developer-resources.md @@ -0,0 +1,51 @@ +# Developer Resources + +This page contains information for developers looking to develop a Universal Binary JSON library. + +* [Library Implementation Requirements](#library_req) +* [Best Practices](#best_practice) +* [Example Files](#example_files) + +# Library Implementation Requirements + +Libraries implementing the Universal Binary JSON spec must adhere to the following guidelines: + +* Parsers must follow a "writer-makes-right" policy - more specifically, if a parser encounters unexpected or invalid data (e.g. negative container length value) an exception should be thrown and parsing stopped. + +# Best Practices + +* [Optimizing Container Performance](#best_container_perf) +* [Using Smallest Number Representation](#best_smallest_num) +* [Handling High-Precision Numbers on Unsupported Platforms](#best_high_prec_num) + +Through work with the community, feedback from others and our own experience with the specification, below are some of the best-practices collected into one place making it easy for folks working with the format to find answers to the more flexible portions of the spec. + +## Optimizing Container Performance + +> ✓ **Why:** (Potentially large) data size reduction and parsing performance increase. **How**: Homogeneous data type in a container. + +Very large performance advantages are available when writing out _ARRAY_ or _OBJECT_ containers that contain same-type values. Be sure to read through the [_optimized container format_](container-types#optimized-format) that can be leveraged in these cases. A typical level of optimization is being able to omit all the marker characters for all same-typed values in a container, making the sizes of all typical [_value types_](type-reference) 1-byte smaller. 
A less typical level of optimization, and the one that leads to the biggest reduction, applies to the [1-byte value types](type-reference) that are valid container types (e.g. _NULL_, _TRUE_, _FALSE_); when used in conjunction with the [_optimized container format_](container-types#optimized-format), the values themselves can be omitted from the container entirely, leading to a space savings that approaches 100% as the size of the container grows. + +## Using Smallest Number Representation + +> ✓ **Why:** [~50% size reduction](#size) for numbers > 5 digits and < 20 digits. **How**: Always use the most compact numeric type possible when writing UBJSON. + +Numeric values can be represented in [a number of ways](value-types#numeric-types) in UBJSON; you can reduce the size of your UBJSON by inspecting the stored value and ensuring it is written using the most compact numeric representation possible. Keep in mind that varying the type of values inside of a container may impact your ability to use the **type** parameter to [optimize container storage](container-types#optimized-format). + +## Handling High-Precision Numbers on Unsupported Platforms + +> ✓ **Why:** Cleanly handle > 64-bit numbers on platforms that don't support them. **How**: By using the _[high-precision](value-types#numeric-types-gt-64bit)_ type. + +Not every language supports arbitrarily long numbers, and some do not even support numbers greater than 64 bits in size. In order to safely allow the transport and handling of > 64-bit numbers across every platform, UBJSON provides the _high-precision_ numeric type. The _high-precision_ type is a string-based type (identical in format to the [_string_](value-types#string-type) type) that provides a universally compatible mechanism by which arbitrarily large or precise numbers can be handled. 
Platforms with arbitrarily large/precise number support are free to parse the _high-precision_ value into a native type; on platforms without support, the _high-precision_ value can be safely passed on, persisted to storage or handled in other non-numeric ways while still allowing the client to handle the request and not overflow or otherwise balk at the unsupported numeric type. That said, for libraries written to support platforms that do not natively support arbitrarily large or precise values, the following guidance can be employed to provide a safe and consistent behavior when encountering them: + +1. **[Default] Exception/Error:** Throw an exception (or return an error) when an unsupported _high-precision_ value is encountered during parsing. The platform doesn't support them, so give the client a chance to be aware of the fact that it is receiving data it won't know how to parse into a native type. +2. **[Optional] Handle as a String:** (must be user-enabled) In the case where the client doesn't need to do any processing of the value and is just doing pass-through like persisting it to a data store, treat the _high-precision_ value as a _string_ and return it to the caller. +3. **[Optional] Skip**: (must be user-enabled) Provide the ability for the parser to optionally skip unsupported values during parsing. Be aware that this is a dangerous approach and **will likely lead to data loss** (skipped values won't be visible to the client), but in the case where a client _must_ be able to parse any and all UBJSON it receives even if it doesn't support arbitrarily large or precise numbers, then this has to be considered. + +These guidelines should provide the most functional experience for a client to work with UBJSON on their platform of choice. 
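The three behaviors above could be exposed as a parser setting along the following lines. This is only a sketch: `HighPrecisionPolicy`, `UnsupportedHighPrecisionError`, `SKIPPED` and `handle_high_precision` are hypothetical names chosen for illustration, and the function stands in for the point in a parser where a _high-precision_ value has already been read as text on a platform that cannot represent it natively.

```python
from enum import Enum


class HighPrecisionPolicy(Enum):
    ERROR = "error"          # default: fail loudly
    AS_STRING = "as_string"  # opt-in: pass the text through untouched
    SKIP = "skip"            # opt-in: drop the value (risks data loss)


class UnsupportedHighPrecisionError(ValueError):
    """Raised when a high-precision value cannot be handled natively."""


# Sentinel a parser could use to mark a value that was skipped.
SKIPPED = object()


def handle_high_precision(text, policy=HighPrecisionPolicy.ERROR):
    """Apply the configured policy to an unsupported high-precision value."""
    if policy is HighPrecisionPolicy.AS_STRING:
        return text
    if policy is HighPrecisionPolicy.SKIP:
        return SKIPPED
    raise UnsupportedHighPrecisionError(
        f"high-precision value not supported on this platform: {text!r}"
    )


digits = "3.14159265358979323846264338327950288"
passed_through = handle_high_precision(digits, HighPrecisionPolicy.AS_STRING)
skipped = handle_high_precision(digits, HighPrecisionPolicy.SKIP)
```

Keeping the error behavior as the default argument means the two lossy or lenient modes only apply when the caller explicitly asks for them, which matches the "must be user-enabled" wording above.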
+ +# Example Files + +> ⚠ Example files below only support Draft 8 + +You can find files to test your implementation with [here](https://github.com/thebuzzmedia/universal-binary-json-java/tree/master/src/test/resources/org/ubjson). There are _formatted-json_, _compacted-json_ and _UBJ_ versions of each of the testing files contained in the repository. The simple Java classes that have matching names to the UBJ files are Java class representations of the files (for Java testing) and the _Marshaller_ classes are the hand-coded serialization and deserialization code used to write out and read in those test files from UBJ format. Even if you are not working in Java, you can use those classes as a high level guide if you are curious or ignore them completely and just test against the raw file resources. \ No newline at end of file diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..7c514d8 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,154 @@ +# Specification + +1. [Quick Start](#quickstart) +2. [License](#license) +3. [Why](#why) +4. [Goals](#goals) +5. [Data Format](#data-format) +6. [Size Requirements](#size-requirements) +7. [Endianness](#endianness) +8. [MIME Type](#mime-type) +9. [File Extension](#file-extension) +10. [Requests for Enhancement (RFE)](#rfe) + +# Quick Start + +You know what JSON is and you understand data formats and just want the good bits? + +* Keep the [Type Reference](type-reference) open in a tab to show you the markers and type definitions all in one page. +* Details on the [Value Types](value-types) (13 of them) +* Details on the [Container Types](container-types) (2 of them) + * Don't forget containers have an optional [_optimized format_](container-types#optimized-format) you can leverage. +* Grab a [UBJSON library](libraries) for your favorite language or platform (or write your own!) 
+* Discuss questions about the Spec or Libraries in the [Google Group](https://groups.google.com/forum/?fromgroups#!forum/universal-binary-json). +* File bugs or issues in [GitHub](https://github.com/thebuzzmedia/universal-binary-json/issues)! + +# License + + + +The Universal Binary JSON Specification is licensed under the [Apache 2.0 License](http://www.apache.org/licenses/LICENSE-2.0.html). Use of the spec, either as-defined or a customized extension of it, is intended to be commercial-friendly. The ultimate purpose of this specification is to provide a useful tool for software developers to leverage in any way they see fit. + +# Why + + + +[JSON](http://json.org/) has become a ubiquitous text-based file format for data interchange. Its simplicity, ease of processing and (relatively) rich data typing made it a natural choice for many developers needing to store or shuffle data between systems quickly and easily. Unfortunately, marshalling native programming language constructs in and out of a text-based representation does have a measurable processing cost associated with it. In high-performance applications, avoiding the text-processing step of JSON can net big wins in both processing time and size reduction of stored information, which is where a binary JSON format becomes helpful. Attempts to make using JSON faster through binary specifications like [BSON](http://bsonspec.org/), [BJSON](http://bjson.org/) or [Smile](http://wiki.fasterxml.com/SmileFormatSpec) exist, but have been [rejected](https://issues.apache.org/jira/browse/COUCHDB-702) from [mass-adoption](http://bsonspec.org/#/implementation) for two reasons: + +1. **Custom (Binary-Only) Data Types**: Inclusion of custom data types that have no analog in the original JSON spec, leaving room for incompatibilities to exist as different implementations of the spec handle the binary-only data types differently. +2. 
**Complexity**: Some specifications provide higher performance or smaller representations at the cost of a [much more complex specification](http://wiki.fasterxml.com/SmileFormatSpec), making implementations more difficult, which can slow or block adoption. One of the key reasons JSON became as popular as it did was its ease of use. + +BSON, for example, defines types for binary data, regular expressions, JavaScript code blocks and other constructs that have no equivalent data type in JSON. BJSON defines a _binary_ data type as well, again leaving the door wide open to interpretation that can potentially lead to incompatibilities between two implementations of the spec. Smile, while the closest, defines more complex data constructs and generation/parsing rules in the name of absolute space efficiency. These are not shortcomings, just trade-offs the different specs made in order to service specific use-cases. The existing binary JSON specifications all define incompatibilities or complexities that undo the singular tenet that made JSON so successful: **simplicity**. JSON's simplicity made it accessible to anyone, made implementations in every language available and made explaining it to anyone consuming your data immediate. Any successful binary JSON specification must carry these properties forward for it to be genuinely helpful to the community at large. This specification is defined around a singular marker-based construct used to build up and represent JSON values and objects. Reading and writing the format is trivial, designed with the goal of being understood in under 10 minutes (likely less if you are very comfortable with JSON already). +> ⓘ **TIP**: UBJSON is built exclusively out of marker-characters like 'C' (for CHAR), 'S' (for STRING), etc. followed by either the payload itself, or a length and then the payload... that's it! 
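The marker-then-payload pattern from the tip can be sketched in a few lines of Python. The function names below are hypothetical, the string helper is deliberately limited to lengths that fit in an int8, and the sketch assumes CHAR carries a single ASCII byte and STRING carries UTF-8 bytes, as described in the value type pages.

```python
def encode_null():
    """A marker alone: 'Z' needs neither a length nor a payload."""
    return b"Z"


def encode_char(ch):
    """Marker plus payload: 'C' followed by one ASCII byte."""
    data = ch.encode("ascii")
    if len(data) != 1:
        raise ValueError("CHAR must be exactly one ASCII character")
    return b"C" + data


def encode_short_string(text):
    """Marker, length, then payload: 'S', an int8 length, the UTF-8 bytes."""
    data = text.encode("utf-8")
    if len(data) > 127:
        raise ValueError("this sketch only handles lengths that fit in an int8")
    # 'i' marks the length as an int8; bytes([n]) writes it as one byte.
    return b"S" + b"i" + bytes([len(data)]) + data


ham = encode_short_string("ham")
letter = encode_char("A")
print(ham, letter, encode_null())
```

Every value follows one of these three shapes, which is why a reader can dispatch on the first byte alone.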
+ +Fortunately, while the Universal Binary JSON specification carries these tenets of simplicity forward, it is also able to take advantage of optimized binary data structures that are (on average) 30% smaller than compacted JSON and specified for ultimate read performance, bringing **simplicity, size** and **performance** all together into a single specification that is 100% compatible with JSON. + +## Why not JSON+gzip? + +On the surface simply gzipping your compacted JSON may seem like a valid (and smaller) alternative to using the Universal Binary JSON specification, but there are two significant costs associated with this approach that you should be aware of: + +1. At least a [50% performance overhead](http://www.cowtowncoder.com/blog/archives/2009/05/entry_263.html) for processing the data. +2. Lack of data clarity and inability to inspect it directly. + +While gzipping your JSON will give you great compression, about 75% on average, the overhead required to read/write the data becomes significantly higher. Additionally, because the data is now in a compressed format you can no longer open it directly in an editor and scan the human-readable portions of it easily, which can be important during debugging, testing or data verification and recovery. Utilizing the Universal Binary JSON format will typically provide [a 30% reduction in size](#size) _and_ store your data in an optimized format offering you much higher performance while still allowing you to open the file directly and read through it. If you have a usage scenario where your data is put into long-term cold storage and pulled out in large chunks for processing, you might even consider gzipping your Universal Binary JSON files and storing those; when they are pulled out and unzipped, you can then process them with all the speed advantages of UBJSON. As always, deciding which approach is right for your project depends heavily on what you need. 
+ +# Goals + + + +The Universal Binary JSON specification has 3 goals: + +**1\. Universal Compatibility** + +Meaning absolute compatibility with the JSON spec itself as well as only utilizing data types that are natively supported in all popular programming languages. + +This allows 1:1 transforms between standard JSON and Universal Binary JSON as well as efficient representation in all popular programming languages without requiring parser developers to account for strange data types that their language may not support. + +**2\. Ease of Use** + +The Universal Binary JSON specification is intentionally defined using a single core data structure to build up the entire specification. + +This accomplishes two things: it allows the spec to be understood quickly and allows developers to write trivially simple code to take advantage of it or interchange data with another system utilizing it. + +**3\. Speed / Efficiency** + +Typically the motivation for using a binary specification over a text-based one is speed and/or efficiency, so strict attention was paid to selecting data constructs and representations that are (roughly) 30% smaller than their compacted JSON counterparts and optimized for fast parsing. + +# Data Format + + + +The Universal Binary JSON specification utilizes a single construct with two optional segments (_length_ and _data_) for all types: + +
+[type, 1-byte char]([integer numeric length])([data])
+
+ +Each element in the tuple is defined as: + +* **type** + * A 1-byte ASCII char used to indicate the [type](type-reference) of the data following it. +* **length** (_OPTIONAL_) + * A positive, integer [numeric type](value-types#numeric-types) (int8, uint8, int16, int32, int64) specifying the length of the following data payload. +* **data** (_OPTIONAL_) + * A run of bytes representing the actual binary data for this type of value. + +Some values are simple enough that writing the 1-byte ASCII marker into the stream is enough to represent them (e.g. _null_). Others have a _type_ that is specific enough that no _length_ is needed, because the length is implied by the type (e.g. _int32_). Others still require both a _type_ and a _length_ to communicate their value (e.g. _string_). + +## Types + +Universal Binary JSON defines a number of [_Value Types_](value-types) and [_Container Types_](container-types) that map directly to [JSON's types](http://json.org/). For the most part the correlation is 1:1 except in the case of [_numeric_ types](value-types#numeric-types) where UBJSON defines many more specific types of number storage and representation than JSON's single _number_ type. + +* [Type Reference](type-reference) (Overview) + * [Value Types](value-types) + * [Container Types](container-types) + +# Size Requirements + + + +The Universal Binary JSON specification tries to strike the perfect balance between space savings, simplicity and performance. As a rule of thumb, data stored using the Universal Binary JSON format is on average **30% smaller** than compacted JSON. As you can see from some of the examples in this document though, it is not uncommon to see the binary representation of some data lead to [a 50% or 60% size reduction](container-types#array-type-example) without compression. The size reduction of your data depends heavily on the type of data you are storing. 
It is best to do your own benchmarking with a comprehensive sampling of your own data. +> 📄 The Universal Binary JSON specification does not use compression algorithms to achieve smaller storage sizes. The size reduction is a side effect of the efficient binary storage format. + +## Size Reduction Tips + +The amount of storage size reduction you'll experience with the Universal Binary JSON format will depend heavily on the type of data you are encoding. Some data shrinks considerably, some mildly and some not at all, but in every case your data will be stored in a much more efficient format that is faster to read and write. Below are pointers to give you an idea of how certain data may shrink in this format: + +* _null_, _true_ and _false_ values will be **75% smaller** (80% in the case of _false_). +* Large _numeric_ values (more than 5 and fewer than 20 digits) will be **50% smaller**. +* _array_ and _object_ containers will be **1-byte-per-value smaller**. +* Leveraging the [_optimized container format_](container-types#optimized-format) can lead to a **significant** size reduction in environments where container data is of the same type. +* _string_ values are 2-10 bytes bigger _per string_ (depending on the length of the string being represented by the smaller integer numeric type). + +One of the great things about the Universal Binary JSON format is that, beyond representing almost all of your data in a smaller footprint, it gives you two big wins: + +1. A smaller data format means faster writes and smaller reads. It also means less data to process when parsing. +2. Binary format means no encoding/decoding primitive values to text and no parsing primitive values from text. + +# Endianness + + + +The Universal Binary JSON specification requires that all numeric values be written in [Big-Endian](http://en.wikipedia.org/wiki/Endianness) order. 
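As a rough illustration of the Big-Endian requirement, here is a minimal Python sketch that writes an _int32_ as its `l` marker followed by four bytes, high byte first. The function name `encode_int32` is hypothetical and not part of any UBJSON library; the sketch only assumes the standard `struct` module.

```python
import struct


def encode_int32(value: int) -> bytes:
    """Hypothetical helper: UBJSON int32 = marker 'l' + 4 big-endian bytes."""
    # ">" selects big-endian byte order; "i" is a signed 32-bit integer.
    return b"l" + struct.pack(">i", value)


# The value 1 is written high byte first: 00 00 00 01, after the marker 0x6c.
print(encode_int32(1).hex())  # 6c00000001
```

A little-endian writer would emit `01 00 00 00` for the same value, which a conforming reader would decode as 16777216, so the byte order has to be fixed by the format rather than left to the host platform.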
+ +# MIME Type + + + +The Universal Binary JSON specification is a binary format and recommends using the following MIME type: +``` +application/ubjson +``` + +This was added directly to the specification in hopes of avoiding [similar confusion with JSON](http://stackoverflow.com/questions/477816/the-right-json-content-type). + +# File Extension + + + +"**ubj**" is the [recommended file extension](http://www.fileinfo.com/extension/ubj) when writing out files using the Universal Binary JSON format (e.g. "_user.ubj_"). The extension stands for "_Universal Binary JSON_" and has no known conflicting mappings to other file formats. + + +# Requests for Enhancement (RFE) + + + +All (proposed) changes to the specification are being tracked on [GitHub](https://github.com/thebuzzmedia/universal-binary-json/issues). \ No newline at end of file diff --git a/docs/libraries.md b/docs/libraries.md new file mode 100644 index 0000000..7b262b7 --- /dev/null +++ b/docs/libraries.md @@ -0,0 +1,58 @@ +# Libraries + +Below is a list of libraries, by language, that implement the Universal Binary JSON Specification. 
+ +* * * + +## ASM.JS + +* [UBJSON for JS](https://github.com/artcompiler/L16) (in [ASM.JS](http://asmjs.org/)) + +## C + +* [ubj](https://github.com/Steve132/ubj) +* [ubjsc](https://bitbucket.org/tsieprawski/ubjsc) + +## C++ + +* [protoc](http://sourceforge.net/p/protoc/wiki/Home/) +* [UbjsonCpp](https://github.com/WhiZTiM/UbjsonCpp) (C++ 14) + +## D + +* [Universal Binary JSON (UBJSON) for D](https://github.com/adilbaig/ubjsond) + +## Java + +* (dinocore) [UBJSON Java Library](https://github.com/dinocore1/ubjson) +* [Universal Binary JSON Java Library](https://github.com/thebuzzmedia/universal-binary-json-java) +* UBJSON ([Reader](http://libgdx.badlogicgames.com/nightlies/docs/api/com/badlogic/gdx/utils/UBJsonReader.html)/[Writer](http://libgdx.badlogicgames.com/nightlies/docs/api/com/badlogic/gdx/utils/UBJsonWriter.html)) in libGDX Game Engine + +## MATLAB + +* [JSONlab](http://iso2mesh.sourceforge.net/cgi-bin/index.cgi?jsonlab) + +## .NET + +* [Ubjson.NET](http://ubjsonnet.codeplex.com/) + +## Node.js + +* [Node-UBJSON](https://github.com/Sannis/node-ubjson) + +## PHP + +* [PHP-UBJSON](https://github.com/dizews/php-ubjson) + +## Python + +* [simpleubjson](https://code.google.com/p/simpleubjson/) +* [py-ubjson](https://github.com/Iotic-Labs/py-ubjson) + +## Qt + +* [Qt Component - Universal Binary JSON](http://qt-apps.org/content/show.php?content=162288) + +## Swift + +* [UBJSONSerialization](https://github.com/Frizlab/UBJSONSerialization) \ No newline at end of file diff --git a/docs/thanks.md b/docs/thanks.md new file mode 100644 index 0000000..151a391 --- /dev/null +++ b/docs/thanks.md @@ -0,0 +1,53 @@ +# Thanks + +Universal Binary JSON was originally motivated by a desire to provide an on-disk & over-the-wire format that required no parsing or marshalling in CouchDB ([inspiration](https://issues.apache.org/jira/browse/COUCHDB-702)). 
In its [original draft form](https://docs.google.com/document/d/12SimAfBVcl8Fd-lr_SSBkM5B_PyEhDRfhgu1Lzvfpfw/edit?hl=en_US), UBJSON was much too simple of a spec with too many holes but over the next number of years and **only** with the help of the following people (among many others) did the spec grow up. I want to express my personal thanks to each one of you for all the help you lent at the different stages of UBJSON's development (and continue to provide in some cases). Sincerely, Riyad Kalla + +* * * + +[**Adil Baig**](http://thoughtsimproved.wordpress.com/) + +Adil has been very involved in the in-depth and multi-year long discussions surrounding a more optimized container specification as well as binary data support. Adil also provided a very compelling, diff-typing proposal for an optimized container format that provided a lot of good guidance around elegant alternatives to consider. + +**[Alex Blewitt](http://twitter.com/#!/alblue)** + +Helped catch a number of specification errors around UTF-8 encoding in the original draft of the specification that would have been confusing/nasty to release. He also provided great feedback about the size and performance metrics for the specification. + +**[Alexander Shorin](http://code.google.com/p/simpleubjson/)** + +Alex is both the author of the [UBJSON Python library](https://code.google.com/p/simpleubjson/) and a valued collaborator on the Universal Binary JSON spec as it matured. Alex provided instrumental insight into the modifications made between Draft 8 and Draft 9 of the spec to help simplify the spec by removing all the duplicate (_compact_) type representations, simplifying the length-arguments for _STRING_ and _HUGE_ as well as being the one to point out that the _length_ arguments for the _ARRAY_ and _OBJECT_ container types are effectively useless once the streaming-format support was added (and do not make generator code or parsing code any easier or more performant). 
+ +[**Bjørn Reese**](https://github.com/breese) + +Bjørn has been involved in almost all of the _binary data support_ discussions that have taken place since 2012\. His detail-oriented contributions helped move the discussion forward. + +**[John Cowan](http://tech.groups.yahoo.com/group/json/message/1734)** + +John was the one who recommended using UTF-8 string-encoded values (or _huge_) for arbitrarily huge numbers after seeing my desire to avoid including any non-portable constructs into the binary format. + +Given that the discussion on numeric formats had been a very active one with lots of feelings on all sides, it was a boon to have John step up with such a simple suggestion that allowed for maximum compatibility and portability. It was a win-win all the way around. + +**[Michael Makarenko](http://www.m1xa.com/)** (aka "M1xA") + +Michael is the author behind the [Ubjson.NET library](libraries) and contributor of the _int16_ and _float_ numeric types to the specification. For numeric-heavy (e.g. scientific) data, the inclusion of the int16 and float types can lead to significant space savings when writing out values in the Universal Binary JSON format. + +Michael has also gone to great lengths to make the .NET implementation of UBJSON as tight and performant as possible; collaborating on benchmark design and testing data as well as compatibility testing between implementations to ensure a great Universal Binary JSON experience for .NET developers. + +In addition to development, Michael has helped contribute to the growth of the Universal Binary JSON community with [articles about the specification](http://habrahabr.ru/blogs/open_source/130112/). 
+ +**[Paul Davis](http://davispj.com/)** + +While approaching the CouchDB team for feedback on the Universal Binary JSON spec, I met Paul, who was willing to spend a significant amount of time reviewing the specification and suggesting changes and improvements based on everything the CouchDB team has learned by dealing closely with JSON for years. + +Paul pointed out the shortcomings of prefixing the length to the two container types if the specification was ever to be used easily with services or apps that streamed UBJ format for huge runs of data that the server couldn't load, buffer and count ahead of time before responding to the client. In order to more easily support streaming, unknown-length container types had to be added. + +Paul also pointed out the importance of a NO_OP/SKIP/IGNORE type that can be useful during a long-lived streaming operation where the server may be waiting on something (like a DB) and you need to keep the connection alive between client/server and avoid the client timing out, but you need the client to know the data it is receiving is just meant as a "Hang on" message from the server and not actual data. This is where the NO_OP command comes in handy. + +**[Stephan Beal](http://tech.groups.yahoo.com/group/json/message/1686)** + +Stephan helped quite a bit with understanding the implications of a >= 64-bit numeric format and the implications of portability across a number of popular platforms. + +* * * + +**[JSON Specification Group](http://tech.groups.yahoo.com/group/json/)** + +I would like to personally thank everyone in the JSON Specification Group. The amount of feedback and help with the specification has been wonderful, constructive and creative. It also led to one of the busiest conversations in the last year! 
\ No newline at end of file diff --git a/docs/type-reference.md b/docs/type-reference.md new file mode 100644 index 0000000..8ae11fc --- /dev/null +++ b/docs/type-reference.md @@ -0,0 +1,219 @@ +# Type Reference + +The table below is a quick reference for folks working closely with the Universal Binary JSON format who want all the information at their fingertips: + +
| Type | Size | Marker | Length | Data Payload |
| --- | --- | --- | --- | --- |
| **Value Types** | | | | |
| null | 1-byte | Z | No | No |
| no-op | 1-byte | N | No | No |
| true | 1-byte | T | No | No |
| false | 1-byte | F | No | No |
| int8 | 2-bytes | i | No | Yes |
| uint8 | 2-bytes | U | No | Yes |
| int16 | 3-bytes | I | No | Yes |
| int32 | 5-bytes | l | No | Yes |
| int64 | 9-bytes | L | No | Yes |
| float32 | 5-bytes | d | No | Yes |
| float64 | 9-bytes | D | No | Yes |
| high-precision number | 1-byte + int num val + string byte len | H | Yes | Yes (if non-empty) |
| char | 2-bytes | C | No | Yes |
| string | 1-byte + int num val + string byte len | S | Yes | Yes (if non-empty) |
| **Container Types** | | | | |
| array\*\* | 2+ bytes | [ and ] | Optional | Yes (if non-empty) |
| object\*\* | 2+ bytes | { and } | Optional | Yes (if non-empty) |
+** See container optimized format for details. + +## Example + +Below is an example of what a common JSON response would look like in UBJSON. This particular example was taken from the [GitHub developer docs](http://developer.github.com/v3/users/). + +**JSON Response** +```json +{ + "login": "octocat", + "id": 1, + "avatar_url": "https://github.com/images/error/octocat_happy.gif", + "gravatar_id": "somehexcode", + "url": "https://api.github.com/users/octocat", + "name": "monalisa octocat", + "company": "GitHub", + "blog": "https://github.com/blog", + "location": "San Francisco", + "email": "octocat@github.com", + "hireable": false, + "bio": "There once was...", + "public_repos": 2, + "public_gists": 1, + "followers": 20, + "following": 0, + "html_url": "https://github.com/octocat", + "created_at": "2008-01-14T04:33:35Z", + "type": "User", + "total_private_repos": 100, + "owned_private_repos": 100, + "private_gists": 81, + "disk_usage": 10000, + "collaborators": 8, + "plan": { + "name": "Medium", + "space": 400, + "collaborators": 10, + "private_repos": 20 + } +} +``` +**UBJSON Response (using block-notation)** +``` +[{] + [i][5][login][S][i][7][octocat] + [i][2][id][i][1] + [i][10][avatar_url][S][i][49][https://github.com/images/error/octocat_happy.gif] + [i][11][gravatar_id][S][i][11][somehexcode] + [i][3][url][S][i][36][https://api.github.com/users/octocat] + [i][4][name][S][i][16][monalisa octocat] + [i][7][company][S][i][6][GitHub] + [i][4][blog][S][i][23][https://github.com/blog] + [i][8][location][S][i][13][San Francisco] + [i][5][email][S][i][18][octocat@github.com] + [i][8][hireable][F] + [i][3][bio][S][i][17][There once was...] 
+ [i][12][public_repos][i][2] + [i][12][public_gists][i][1] + [i][9][followers][i][20] + [i][9][following][i][0] + [i][8][html_url][S][i][26][https://github.com/octocat] + [i][10][created_at][S][i][20][2008-01-14T04:33:35Z] + [i][4][type][S][i][4][User] + [i][19][total_private_repos][i][100] + [i][19][owned_private_repos][i][100] + [i][13][private_gists][i][81] + [i][10][disk_usage][I][10000] + [i][13][collaborators][i][8] + [i][4][plan][{] + [i][4][name][S][i][6][Medium] + [i][5][space][I][400] + [i][13][collaborators][i][10] + [i][13][private_repos][i][20] + [}] +[}] +``` \ No newline at end of file diff --git a/docs/value-types.md b/docs/value-types.md new file mode 100644 index 0000000..98a5a0e --- /dev/null +++ b/docs/value-types.md @@ -0,0 +1,386 @@ +# Value Types + +The Universal Binary JSON Specification defines a total of **13 value types** (to [JSON's 5 value types](http://json.org)). + +The reason for the increased number of value types is because UBJSON defines **8 numeric value types** (to JSON's 1) allowing for highly optimized storage/retrieval of numeric values depending on the necessary precision; in addition to a number of other more optimized representations of JSON values. + +The specifications for each of the Universal Binary JSON Specification value types are below. + +1. [Null Value](#null-value) +2. [No-Op Value](#no-op-value) +3. [Boolean Types](#boolean-types) +4. [Numeric Types](#numeric-types) +5. [Char Type](#char-type) +6. [String Type](#string-type) +7. [Binary Data](#binary-data) + +# Null Value + +* * * + +* [Usage](#null-use) +* [Example](#null-example) + +The _null_ value in Universal Binary JSON is defined as: + +| Type | Size | Marker | Length | Data Payload | +| --- | --- | --- | --- | --- | +| null | 1-byte | Z | No | No | + + +### Usage + +The _null_ value in Universal Binary JSON is equivalent to the **null** value from the [JSON specification](http://json.org/). 
+ +### Example + +JSON snippet: + +```json +{ + "passcode": null +} +``` + +UBJSON snippet (using block-notation): + +``` +[{] + [i][8][passcode][Z] +[}] +``` + +# No-Op Value + +* * * + +* [Usage](#noop-use) +* [Example](#noop-example) + +The _no-op_ value in Universal Binary JSON is defined as: + +| Type | Size | Marker | Length | Data Payload | +| --- | --- | --- | --- | --- | +| no-op | 1-byte | N | No | No | + +### Usage + +The intended usage of the _no-op_ value is as a valueless signal between a producer (most likely a server) and a consumer (most likely a client) to indicate activity; for example, as a **keep-alive** signal so a client knows a server is still working and hasn't hung or timed out. There is no equivalent to the _no-op_ value in the original [JSON specification](http://json.org/). The _no-op_ value is meant to be a **valueless** value, meaning it can be added to the **elements of a container** and, when parsed by the receiver, the _no-op_ values are simply skipped and carry no meaningful value with them. For example, the following two _array_ values are considered equal (using JSON format for readability): + +```json +["foo", "bar", "baz"] +``` + +and + +```json +["foo", no-op, "bar", no-op, no-op, no-op, "baz", no-op, no-op] +``` + +There are a number of interesting advantages to having a valueless-value defined directly in the spec. + +### Example + +Consider a web service that performs an expensive operation that can take quite a while (let's say 5 minutes): + +``` + +[N] +<10 second delay> +[N] +<10 second delay> +[N] +<10 second delay> +<...receiving data...> +<10 second delay> +[N] +<10 second delay> +[N] +<...receiving remainder of data...> + +``` + +Most clients by default will time out after 60 seconds and more aggressive clients will time out even faster. To help let clients know that the server has not hung, is still alive and is still processing the request, the server can reply at some determined interval (e.g. 
every X seconds) with the _no-op_ value and the client can parse it, acknowledge it and reset its timeout-disconnect timer as a result. + +Another example of leveraging _no-op_ in an interesting way is modeling an efficient **delete** operation for UBJSON on-disk when elements of a container are removed. Instead of reading the entire container, removing the elements and writing the whole thing out again, _no-op_ bytes can simply be written over the records that were removed from the containers. When the record is parsed, it is semantically identical to a container without the values. + +These are just a few examples of how you can leverage the _no-op_ value. + +# Boolean Types + +* * * + +* [Usage](#boolean-use) +* [Example](#boolean-example) + +The _boolean_ types in Universal Binary JSON are defined as: + +| Type | Size | Marker | Length | Data Payload | +| --- | --- | --- | --- | --- | +| true | 1-byte | T | No | No | +| false | 1-byte | F | No | No | + +### Usage + +A _boolean_ type is represented in Universal Binary JSON similar to the [JSON specification](http://json.org/): using a _T_ (true) and _F_ (false) character marker. 
+ +### Example + +JSON snippet: + +```json +{ + "authorized": true, + "verified": false +} +``` + +UBJSON snippet (using block-notation): + +``` +[{] + [i][10][authorized][T] + [i][8][verified][F] +[}] +``` + +# Numeric Types + +* * * + +* [Usage](#numeric-use) +* [Example](#numeric-example) +* [Infinity](#numeric-infinity) +* [Signage & Min/Max Values](#numeric-sign-min-max) +* [64-bit Values](#numeric-64bit) +* [Larger than 64-bit Values](#numeric-gt-64bit) +* [Byte Order / Endianness](#numeric-byte-order-endianness) +* [Storage Size](#numeric-storage-size) + +There are 8 numeric types in Universal Binary JSON, defined as: + +| Type | Size | Marker | Length | Data Payload | +| --- | --- | --- | --- | --- | +| int8 | 2-bytes | i | No | Yes | +| uint8 | 2-bytes | U | No | Yes | +| int16 | 3-bytes | I | No | Yes | +| int32 | 5-bytes | l | No | Yes | +| int64 | 9-bytes | L | No | Yes | +| float32 | 5-bytes | d | No | Yes | +| float64 | 9-bytes | D | No | Yes | +| high-precision number | 1-byte + int num val + string byte len | H | Yes | Yes (if non-empty) | + +In JavaScript (and JSON) the _[Number](http://people.mozilla.org/~jorendorff/es5.html#sec-8.5)_ type can represent any numeric value, while in most other languages multiple (discrete) numeric types exist to describe different sizes and types of numeric values; this allows the runtime to handle numeric operations more efficiently. + +In order for the Universal Binary JSON specification to be a performant alternative to JSON, support for these most common numeric types had to be added to allow for more efficient reading and writing of numeric values. + +Trying to maintain a single numeric type in UBJSON would have led to parsing complexity, requiring each language to further inspect the numeric value and marshal it down to the most appropriate internal type. Pre-defining these different numeric types directly in UBJSON allows for either a direct conversion into a native language type (e.g. 
Java) or a straightforward marshaling into the nearest-supported language type (e.g. Erlang). + +### Usage + +The intended usage of the different _numeric_ types is to efficiently store numbers in a space- and encoding-optimized format. + +> ⓘ It is always recommended to use the smallest numeric type that fits your needs. For payloads with a large amount of numeric data, this can cut down the size of the payloads significantly (on average a **50% reduction** in size). + +### Example + +JSON snippet: + +```json +{ + "int8": 16, + "uint8": 255, + "int16": 32767, + "int32": 2147483647, + "int64": 9223372036854775807, + "float32": 3.14, + "float64": 113243.7863123, + "huge1": "3.14159265358979323846", + "huge2": "-1.93+E190", + "huge3": "719..." +} +``` + +UBJSON snippets (using block-notation): + +``` +[i][4][int8][i][16] +[i][5][uint8][U][255] +[i][5][int16][I][32767] +[i][5][int32][l][2147483647] +[i][5][int64][L][9223372036854775807] +[i][7][float32][d][3.14] +[i][7][float64][D][113243.7863123] +[i][5][huge1][H][i][22][3.14159265358979323846] +[i][5][huge2][H][i][10][-1.93+E190] +[i][5][huge3][H][U][200][719...] +``` + +### Infinity + +Numeric values of **infinity** are encoded as a [_null_](#null) value. 
(See [ECMA](http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf) and [JSON](http://json.org/json.ppt)) + +### Signage & Min/Max Values + +The min/max range of values (_inclusive_) for each numeric type is as follows: + +| Type | Signed | Min Value | Max Value | +| --- | --- | --- | --- | +| int8 | Yes | -128 | 127 | +| uint8 | No | 0 | 255 | +| int16 | Yes | -32,768 | 32,767 | +| int32 | Yes | -2,147,483,648 | 2,147,483,647 | +| int64 | Yes | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 | +| float32 | Yes | See IEEE 754 Spec | See IEEE 754 Spec | +| float64 | Yes | See IEEE 754 Spec | See IEEE 754 Spec | +| high-precision number | Yes | Infinite | Infinite | + +### 64-bit Integers + +While almost all languages natively support 64-bit integers, not all do (e.g. C89 and JavaScript ([yet](http://wiki.ecmascript.org/doku.php?id=harmony:binary_data_discussion&s=int64))) and care must be taken when encoding 64-bit integer values into binary JSON and then attempting to decode them on a platform that doesn’t support them. + +If you are fully aware of the platforms and runtime environments your binary JSON is being used on and know they all support 64-bit integers, then you are fine. + +If you are trying to deserialize 64-bit integers in a client’s browser in JavaScript or another environment that does not support 64-bit integers, then you will want to take care to skip them in the input or have the client producing them encode them as _double_ or _high-precision_ values if that is easier to handle. + +Alternatively you might consider encoding your 64-bit values as doubles if you know you are going from the server to a client JavaScript environment with the binary-encoded information. + +### High-Precision Numbers (Larger than 64-bit) + +The _high-precision number_ type is an ultra-portable mechanism by which arbitrarily large (or precise) numbers, greater than 64-bit in size, are encoded as a UTF-8 string and passed between systems that support them. 
This allows _high-precision number_ values to degrade gracefully on systems that do not have a built-in type to support numeric values larger than 64-bit. Please refer to the [Best Practices](developer-resources#best_practice) page for techniques on working around the lack of larger-than-64-bit numeric types on certain platforms if you need them. + +_high-precision number_ values must be written out in accordance with the original [JSON _number_ type specification](http://json.org/). + +### Byte Order / Endianness + +All integer types (_int8, uint8, int16, int32_ and _int64_) are written in [most-significant-byte-first order](http://en.wikipedia.org/wiki/Most_significant_bit) (high byte written first, aka "[big endian](http://en.wikipedia.org/wiki/Endianness)"). + +_float32_ values are written in IEEE 754 [single precision floating point format](http://en.wikipedia.org/wiki/IEEE_754-1985), which has the following structure: + +* Bit 31 (1 bit) – sign +* Bit 30-23 (8 bits) – exponent +* Bit 22-0 (23 bits) – fraction (significand) + +_float64_ values are written in IEEE 754 [double precision floating point format](http://en.wikipedia.org/wiki/Double_precision_floating-point_format#Double_precision_binary_floating-point_format), which has the following structure: + +* Bit 63 (1 bit) – sign +* Bit 62-52 (11 bits) – exponent +* Bit 51-0 (52 bits) – fraction (significand) + +### Storage Size + +The size of the _high-precision number_ type "on-disk" follows the same structure and sizing as the [_string_](#string-type) type (see **Storage Size** section). + +The storage size of all other numeric types is listed at the beginning of this section as well as in the [Type Reference](type-reference) table. 
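The marker, range and byte-order rules in this section can be combined into a small encoder. The following Python sketch is only an illustration under stated assumptions: the names `encode_integer` and `encode_float64` are hypothetical (not taken from any UBJSON library), the ranges come from the Min/Max table above, and integers that do not fit in _int64_ fall back to the _high-precision number_ type as a UTF-8 string with an integer length prefix.

```python
import struct

# (marker, struct format, min, max), smallest type first.
# Ranges are copied from the Min/Max table; ">" means big-endian.
_INT_TYPES = [
    (b"i", ">b", -128, 127),                                   # int8
    (b"U", ">B", 0, 255),                                      # uint8
    (b"I", ">h", -32768, 32767),                               # int16
    (b"l", ">i", -2147483648, 2147483647),                     # int32
    (b"L", ">q", -9223372036854775808, 9223372036854775807),   # int64
]


def encode_integer(value: int) -> bytes:
    """Hypothetical helper: pick the smallest integer type that fits."""
    for marker, fmt, lo, hi in _INT_TYPES:
        if lo <= value <= hi:
            return marker + struct.pack(fmt, value)
    # Larger than 64-bit: high-precision number = 'H' + length + UTF-8 digits.
    text = str(value).encode("utf-8")
    return b"H" + encode_integer(len(text)) + text


def encode_float64(value: float) -> bytes:
    """Hypothetical helper: 'D' marker + 8-byte big-endian IEEE 754 double."""
    return b"D" + struct.pack(">d", value)


print(encode_integer(16).hex())     # 6910
print(encode_integer(32767).hex())  # 497fff
print(len(encode_integer(2147483647)))  # 5
```

The byte counts match the Size column of the numeric types table: one marker byte plus 1, 2, 4 or 8 payload bytes.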
+ +# Char Type + +* * * + +* [Usage](#string-use) +* [Example](#string-example) +* [Encoding](#string-encoding) +* [Storage Size](#string-storage-size) + +The _char_ type in Universal Binary JSON is defined as: + +| Type | Size | Marker | Length | Data Payload | +| --- | --- | --- | --- | --- | +| char | 2-bytes | C | No | Yes | + +### Usage + +The _char_ type in Universal Binary JSON is an unsigned byte meant to represent a single printable ASCII character (decimal values 0-127). Put another way, the _char_ type represents a single-byte UTF-8 encoded character. + +> 📄 The _char_ type is synonymous with a 1-byte, UTF-8 encoded value (decimal values 0-127). A _char_ value **must not** have a decimal value larger than 127. + +The _char_ type is functionally identical to the [_uint8_ type](#numeric), but semantically is meant to represent a character and not a numeric value. + +### Example + +JSON snippet: + +```json +{ + "rolecode": "a", + "delim": ";" +} +``` + +UBJSON snippet (using block-notation): + +``` +[{] + [i][8][rolecode][C][a] + [i][5][delim][C][;] +[}] +``` + +# String Type + +* * * + +* [Usage](#string-use) +* [Example](#string-example) +* [Encoding](#string-encoding) +* [Storage Size](#string-storage-size) + +The _string_ type in Universal Binary JSON is defined as: + +| Type | Size | Marker | Length | Data Payload | +| --- | --- | --- | --- | --- | +| string | 1-byte + int num val + string byte len | S | Yes | Yes (if non-empty) | + +### Usage + +The _string_ type in Universal Binary JSON is equivalent to the **string** type from the [JSON specification](http://json.org/). + +### Example + +JSON snippet: + +```json +{ + "username": "rkalla", + "imagedata": "...huge string payload..." +} +``` + +UBJSON snippet (using block-notation): + +``` +[{] + [i][8][username][S][i][6][rkalla] + [i][9][imagedata][S][l][2097152][...huge string payload...] 
+[}] +``` + +### Encoding (UTF-8) + +The JSON specification does not dictate a specific required encoding; it does, however, use [UTF-8](http://en.wikipedia.org/wiki/UTF-8) as the default encoding. + +The Universal Binary JSON specification dictates UTF-8 as the **required string encoding** (this includes the _high-precision number_ type as it is a string-encoded value). This will allow you to easily exchange binary JSON between open systems that all support and follow this encoding requirement as well as providing a number of [advantages and optimizations](http://en.wikipedia.org/wiki/UTF-8#Advantages). + +### Storage Size + +The size of the _string_ type varies depending on two things: + +1. The integral numeric type used to describe the length of the string (e.g. _int8, int16, int32_ or _int64_) +2. The UTF-8 encoded size, in bytes, of the string. + +For example, English typically uses 1-byte per character, so the string “hello” has a byte length of 5. The same string in Russian is “привет” with a byte length of 12 and in Arabic the text becomes “مرحبا” with a byte length of 10. + +Here are some examples of what different _string_ values look like to illustrate the point: + + +| Binary Representation | Description | +| --- | --- | +| `[S][i][5][hello]` | 8 bytes, string UTF-8 "hello" (English) | +| `[S][i][12][привет]` | 15 bytes, string UTF-8 "hello" (Russian) | +| `[S][i][10][مرحبا]` | 13 bytes, string UTF-8 "hello" (Arabic) | +| `[S][I][1024][...1k long string...]` | 1 + 3 + 1024 bytes = 1028 bytes total | + +# Binary Data + +* * * + +_Please see the [Binary Data](binary-data) page..._
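Returning to the **Storage Size** table of the _string_ type above, the byte counts can be reproduced with a short Python sketch. The helper names `encode_length` and `encode_string` are hypothetical and only illustrate the layout: an `S` marker, an integer length using the smallest type that fits, then the UTF-8 payload. The length is always the encoded byte count, not the number of characters.

```python
import struct


def encode_length(n: int) -> bytes:
    """Hypothetical helper: encode a byte length with the smallest integer type."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n <= 127:
        return b"i" + struct.pack(">b", n)   # int8
    if n <= 255:
        return b"U" + struct.pack(">B", n)   # uint8
    if n <= 32767:
        return b"I" + struct.pack(">h", n)   # int16
    if n <= 2147483647:
        return b"l" + struct.pack(">i", n)   # int32
    return b"L" + struct.pack(">q", n)       # int64


def encode_string(text: str) -> bytes:
    """Hypothetical helper: 'S' marker + length + UTF-8 encoded bytes."""
    payload = text.encode("utf-8")
    return b"S" + encode_length(len(payload)) + payload


sizes = [len(encode_string(word)) for word in ("hello", "привет", "مرحبا")]
print(sizes)                            # [8, 15, 13]
print(len(encode_string("x" * 1024)))   # 1028
```

The last line shows the jump from a 2-byte to a 3-byte length prefix once the payload no longer fits in a single-byte integer type.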