INTERNET-DRAFT                                                 S. Stuart
Intended Status: Proposed Standard                                Google
Expires: April 11, 2013                                      R. Fernando
                                                                   Cisco
                                                         October 8, 2012


           Encoding rules and MIME type for Protocol Buffers
                  draft-rfernando-protocol-buffers-00

Abstract

   This document describes the encoding format for Protocol Buffers
   encoded data and registers a MIME type associated with Protocol
   Buffers encoded data.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Fernando and Stuart      Expires April 11, 2013                 [Page 1]

INTERNET DRAFT             Protocol Buffers              October 8, 2012


Table of Contents

   1. Introduction
     1.1  Terminology
   2. Message Structure
   3. Encoding Rules
     3.1  Numbers as VarInts
     3.2  Encoding and Interpretation of Protobuf Messages
     3.3  Wire Types
       3.3.1  Wire Type 0
       3.3.2  Wire Type 1
       3.3.3  Wire Type 2
       3.3.4  Wire Type 5
   4. Embedded Messages
   5. Optional and Repeated Elements
   6. Field Order
   7. IANA Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
     10.1  Informative References
   Authors' Addresses

1. Introduction

   Protocol buffers, referred to as protobuf in this document, is a
   commonly used interchange format for serializing structured data for
   storage and for transmission between applications and systems. It
   supports simple and composite data types and provides rules to
   serialize those data types into a portable format that is both
   language and platform neutral. Because the encoding is binary rather
   than textual, it is compact and fast to encode and decode. It is
   also supported by a wide variety of programming languages.

   While protocol buffers has gained widespread use, it has so far been
   described only informally and has not been standardized. This
   document specifies the encoding rules for protobuf and registers the
   MIME type 'application/protobuf' for it in accordance with RFC 2048.
   This document borrows heavily from the web page [GPBENC].

1.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2. Message Structure

   Protobuf defines all data elements in discrete units called
   "messages" [GPBOVW]. A message is a logical collection of related
   data items. It is similar to a "record" or a "structure" in a
   traditional programming language. Many standard simple data types
   are available as field types, including bool, int32, float, double
   and string. One can also add further structure to the outer message
   by using enums and other messages as field types.

   The following is an example of a message definition in protobuf:

      message Person {
        enum PhoneType {
          MOBILE = 0;
          HOME = 1;
          WORK = 2;
        }

        required string name = 1;
        required int32 id = 2;
        optional string email = 3;

        message PhoneNumber {
          required string number = 1;
          optional PhoneType type = 2;
        }

        repeated PhoneNumber phone = 4;
      }

   Note the presence of simple data types such as string and int32, as
   well as complex data types such as enums and messages, in the above
   message definition.

   Each field is annotated with one of the following three modifiers:

   1. required: a value for the field must be provided; otherwise the
      message is considered malformed and the decoding entity will
      throw an exception.

   2. optional: the field may or may not be set. If an optional field
      value isn't set, a default value is used.

   3. repeated: the field may be repeated any number of times
      (including zero).
      The order of the repeated values is preserved in the protocol
      buffer encoding.

   The integer token to the right of the assignment operator is a field
   number. A field number uniquely identifies a field within a message
   and, together with the wire type, is used to form the key of the
   key-value pairs in the serialized data stream. Field numbers 1-15
   yield a one-byte key, one byte fewer than field numbers 16 through
   2047, so as an optimization one can reserve field numbers 1-15 for
   commonly used or repeated elements. Each element in a repeated field
   requires re-encoding the field number, so repeated fields are
   particularly good candidates for this optimization.

   This document does not describe every syntactic element of the
   protobuf language; it restricts discussion to those elements that
   are relevant to the encoding and decoding of data types.

3. Encoding Rules

   This section describes the encoding rules for the different field
   types.

3.1 Numbers as VarInts

   To understand protobuf encoding, one first needs to understand
   VarInts. Field keys, lengths, and the integer types listed in
   Section 3.3.1 are represented in protobuf as base 128
   variable-length integers (VarInts). VarInt is an encoding scheme
   that uses only as many bytes as are necessary to represent a number,
   and it can be used to encode arbitrarily large numbers. It achieves
   this by using a continuation bit in every byte. Each byte in a
   VarInt, except the last byte, has the most significant bit (msb)
   set, indicating that there are more bytes to come. The last byte has
   the msb set to zero. The remaining seven bits of each byte carry the
   number itself, least significant group first. To decode, the 7-bit
   groups (after the msb has been removed) are reversed in order and
   concatenated to produce a single binary representation of the
   number.

3.2 Encoding and Interpretation of Protobuf Messages

   Protobuf messages are not self-describing.
   In other words, the entity decoding the binary representation of the
   message needs to refer to the corresponding text definition of the
   message to interpret the fields. The "tag" associated with a field
   (the number following the "=" sign in the text definition) tells the
   decoder which field it is currently looking at. To achieve backward
   compatibility, a wire type is also included for every field. Using
   the wire type, the decoder can determine how long a field's value is
   and skip the field without interpreting it. This is useful when the
   decoder is not aware of a particular field's tag value.

   Every field is encoded as a (key, value) pair. The key is a VarInt
   with the value ((field-tag << 3) | wire-type). In other words, the
   three least significant bits of the key are the wire type and the
   remaining bits are the field tag.

3.3 Wire Types

   This document defines the following wire types, their
   interpretation, and the data types for which they are used.

3.3.1 Wire Type 0

   If the wire type is 0, the value field is simply a VarInt. This
   encoding is used to represent int32, int64, uint32, uint64, sint32,
   sint64, bool and enum. For non-negative integers the interpretation
   of the VarInt is straightforward, as explained in Section 3.1.

   For example, the following message, with the field 'a' set to 150,

      message Test1 {
        required int32 a = 1;
      }

   would be serialized as '08 96 01'. The first byte, 08, is the key
   ((1 << 3) | 0); the bytes 96 01 are the VarInt encoding of 150.

   If int32 or int64 is used to encode a negative integer, the
   resulting VarInt is always ten bytes long (the value is effectively
   treated as a very large unsigned integer). If a signed type (sint32
   or sint64) is used, a zigzag encoding scheme is applied, which
   assigns small VarInt values to numbers with small absolute values.
   In this scheme, the numbers -2, -1, 0, 1, 2 are represented as the
   VarInts 3, 1, 0, 2, 4, and so on. Mathematically, each value 'n' is
   encoded as (n << 1) ^ (n >> 31) for sint32 or (n << 1) ^ (n >> 63)
   for sint64, where '>>' is an arithmetic (sign-extending) shift.

3.3.2 Wire Type 1

   This is a fixed-length 64-bit quantity.
   This wire type is used to represent the fixed64, sfixed64 and double
   data types. The value is stored in little-endian format.

3.3.3 Wire Type 2

   This is a length-delimited stream of bytes. The value field is a
   VarInt-encoded length followed by the specified number of bytes of
   data.

   As an example, the following message,

      message Test2 {
        required string b = 2;
      }

   would be serialized as '12 0b 68 65 6c 6c 6f 20 77 6f 72 6c 64' if
   the string 'b' were set to "hello world". Here 12 is the key
   ((2 << 3) | 2), 0b is the length (11 bytes), and the remaining bytes
   are the characters of the string.

3.3.4 Wire Type 5

   This is a fixed-length 32-bit quantity. This wire type is used to
   represent the fixed32, sfixed32 and float data types. The value is
   stored in little-endian format.

4. Embedded Messages

   Embedded messages are encoded as follows. The inner (embedded)
   message is serialized first using the rules described above. The
   resultant byte stream is then treated as a Wire Type 2 field in the
   outer message and added to its encoding.

   Consider the example,

      message Test1 {
        required int32 foo = 1;
      }

      message Test2 {
        required Test1 c = 3;
      }

   If the field 'foo' were to take the value 150, the resultant encoded
   byte stream for the inner message would be '08 96 01', and for Test2
   it would be '1a 03 08 96 01', where 1a is the key ((3 << 3) | 2) and
   03 is the length of the inner message in bytes.

5. Optional and Repeated Elements

   If the message definition has 'repeated' elements, then the encoded
   message has zero or more key-value pairs with the same field number.
   These repeated values do not have to appear consecutively; they may
   be interleaved with other fields.

   If the message definition has 'optional' elements, then the encoded
   message may or may not have a key-value pair with that field number.

   A repeated field may be declared as a 'packed repeated field', in
   which case the encoding for the field is slightly different. A
   packed repeated field containing zero elements does not appear in
   the encoded message.
   Otherwise, all of the elements of the field are packed into a single
   key-value pair with wire type 2 (length delimited). Each element is
   encoded the same way it would be normally, except without a key
   preceding it.

6. Field Order

   When a message is serialized, its known fields should be written
   sequentially by field number. This allows parsing code to use
   optimizations that rely on field numbers being in sequence. However,
   protocol buffer parsers must be able to parse fields in any order,
   as not all messages are created by simply serializing an object; for
   instance, it is sometimes useful to merge two messages by simply
   concatenating them.

7. IANA Considerations

   The MIME media type for protobuf messages is application/protobuf.

   Type name: application

   Subtype name: protobuf

   Required parameters: n/a

   Optional parameters: n/a

   Encoding considerations: 8 bit binary, UTF-8

   Security considerations: Generally there are security issues with
   serialization formats if code is transmitted and executed on the
   decoder end. Since protobuf binary encoding does not carry code, we
   consider the encoding scheme itself to not introduce any security
   risks.

8. Security Considerations

   See Section 7.

9. Acknowledgements

   We thank the engineers at Google for giving us the protocol buffers
   serialization format. All the concepts described in this document
   come from the web pages [GPBENC] and [GPBOVW], which define the
   protocol buffer mechanisms. This document is merely an attempt to
   standardize those mechanisms in the IETF and to assign a MIME type
   for protobuf encoded messages.

10. References

10.1 Informative References

   [GPBENC]   Google Protocol Buffer Encoding,
              https://developers.google.com/protocol-buffers/docs/encoding

   [GPBOVW]   Google Protocol Buffer Overview,
              https://developers.google.com/protocol-buffers/docs/overview

Authors' Addresses

   Stephen Stuart
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   USA

   EMail: sstuart@google.com


   Rex Fernando
   Cisco Systems
   170 W. Tasman Dr.
   San Jose, CA 95134

   EMail: rex@cisco.com
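
Illustrative Encoding Sketch

   The following Python sketch is illustrative only and is not part of
   the encoding rules. It applies the VarInt, key and zigzag rules of
   Section 3 to reproduce the byte sequences quoted in Sections 3.3.1,
   3.3.3 and 4. All names in it (encode_varint, decode_varint,
   make_key, zigzag32, encode_int, encode_length_delimited and the two
   WIRE_ constants) are assumptions chosen for this illustration; they
   do not belong to any protobuf library. Input validation is minimal.

```python
from typing import Tuple

WIRE_VARINT = 0            # wire type 0 (Section 3.3.1)
WIRE_LENGTH_DELIMITED = 2  # wire type 2 (Section 3.3.3)


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base 128 VarInt (Section 3.1)."""
    if value < 0:
        raise ValueError("VarInt input must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F          # least significant 7-bit group
        value >>= 7
        if value:
            out.append(group | 0x80)  # more bytes follow: set the msb
        else:
            out.append(group)         # last byte: msb is zero
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode the VarInt starting at data[pos]; return (value, next_pos).

    Truncated input raises IndexError in this sketch.
    """
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # groups arrive low-order first
        if not byte & 0x80:
            return result, pos
        shift += 7


def make_key(field_tag: int, wire_type: int) -> bytes:
    """Build the key VarInt ((field-tag << 3) | wire-type) (Section 3.2)."""
    return encode_varint((field_tag << 3) | wire_type)


def zigzag32(n: int) -> int:
    """Zigzag-map a sint32 value; Python's >> on ints is arithmetic."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def encode_int(n: int) -> bytes:
    """Encode an int32/int64 value; negatives become ten-byte VarInts."""
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


def encode_length_delimited(field_tag: int, payload: bytes) -> bytes:
    """Encode a wire type 2 field: key, VarInt length, then the bytes."""
    return (make_key(field_tag, WIRE_LENGTH_DELIMITED)
            + encode_varint(len(payload)) + payload)


if __name__ == "__main__":
    test1 = make_key(1, WIRE_VARINT) + encode_int(150)
    print(test1.hex())                                   # 089601
    print(encode_length_delimited(2, b"hello world").hex())
    # 120b68656c6c6f20776f726c64
    print(encode_length_delimited(3, test1).hex())       # 1a03089601
    print([zigzag32(n) for n in (-2, -1, 0, 1, 2)])      # [3, 1, 0, 2, 4]
    print(len(encode_int(-1)))                           # 10
```

   The embedded-message case needs no code of its own: the serialized
   inner message is passed to encode_length_delimited as an opaque byte
   string, which mirrors the rule in Section 4.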