YAML Ain't Markup Language (YAML™) 1.0

Working Draft 2004-JAN-29

Oren Ben-Kiki

Clark Evans

Brian Ingerson

This version: http://www.yaml.org/spec/history/2004-01-29/
Latest version: http://www.yaml.org/spec/

This document may be freely copied provided it is not modified.

Status of this Document

This is an intermediate working draft and is being actively revised. Hopefully the next draft will be a release canidate.

We wish to thank implementers who have tirelessly tracked earlier versions of this specification, and our fabulous user community whose feedback has both validated and clarified our direction.

Abstract

YAML™ (rhymes with “camel”) is a human friendly, cross language, unicode based data serialization language designed around the common native structures of agile programming languages. It is broadly useful for programming needs ranging from configuration files to Internet messaging to object persistence to data auditing. Together with the Unicode standard for characters, this specification provides all the information necessary to understand YAML Version 1.0 and to construct programs that process YAML information.


Table of Contents

1. Introduction
1.1. Goals
1.2. Prior Art
1.3. Relation to XML
1.4. Terminology
2. Preview
2.1. Collections
2.2. Structures
2.3. Scalars
2.4. Tags
2.5. Full Length Example
3. Processing YAML Information
3.1. Processes
3.1.1. Represent
3.1.2. Serialize
3.1.3. Present
3.1.4. Parse
3.1.5. Compose
3.1.6. Construct
3.2. Information Models
3.2.1. Node Graph Representation
3.2.2. Event / Tree Serialization
3.2.3. Character Stream Presentation
3.3. Completeness
3.3.1. Well-Formed
3.3.2. Resolved
3.3.3. Recognized and Valid
3.3.4. Available
4. Syntax
4.1. Characters
4.1.1. Character Set
4.1.2. Encoding
4.1.3. Indicators
4.1.4. Line Breaks
4.1.5. Miscellaneous
4.2. Space Processing
4.2.1. Indentation
4.2.2. Throwaway comments
4.3. YAML Stream
4.3.1. Document
4.3.2. Directive
4.3.3. Presentation Node
4.3.4. Node Property
4.3.5. Tag
4.3.6. Anchor
4.4. Alias
4.5. Collection
4.5.1. Sequence
4.5.2. Mapping
4.6. Scalar
4.6.1. End Of line Normalization
4.6.2. Block Modifiers
4.6.3. Explicit Indentation
4.6.4. Chomping
4.6.5. Literal
4.6.6. Folding
4.6.7. Folded
4.6.8. Single Quoted
4.6.9. Escaping
4.6.10. Double Quoted
4.6.11. Plain
A. Tag Repository
A.1. Sequence
A.2. Mapping
A.3. String
B. YAML Terms

Chapter 1. Introduction

"YAML Ain't Markup Language" (abbreviated YAML) is a data serialization language designed to be human friendly and work well with modern programming languages for common everyday tasks. This specification is both an introduction to the YAML language and the concepts supporting it; and also a complete reference of the information needed to develop applications for processing YAML.

Open, interoperable and readily understandable tools have advanced computing immensely. YAML was designed from the start to be useful and friendly to the people working with data. It uses printable unicode characters, some of which provide structural information and the rest representing the data itself. YAML achieves a unique cleanness by minimizing the amount of structural characters, and allowing the data to show itself in a natural and meaningful way. For example, indentation is used for structure, colons separate pairs, and dashes are used for bulleted lists.

There are myriad flavors of data structures, but they can all be adequately represented with three basic primitives: mappings (hashes/dictionaries), sequences (arrays/lists) and scalars (strings/numbers). YAML leverages these primitives and adds a simple typing system and aliasing mechanism to form a complete language for encoding any data structure. While most programming languages can use YAML for data serialization, YAML excels in those languages that are fundamentally built around the three basic primitives. These include the new wave of agile languages such as Perl, Python, PHP, Ruby and Javascript.

There are hundreds of different languages for programming, but only a handful of languages for storing and transferring data. Even though its potential is virtually boundless, YAML was specifically created to work well for common use cases such as: configuration files, log files, interprocess messaging, cross-langauge data sharing, object persistence and debugging of complex data structures. When data is well organized and easy to understand, programming becomes a simpler task.

1.1. Goals

The design goals for YAML are:

  1. YAML documents are easily readable by humans.
  2. YAML uses the native data structures of agile languages.
  3. YAML data is portable between programming languages.
  4. YAML has a consistent model to support generic tools.
  5. YAML enables stream-based processing.
  6. YAML is expressive and extensible.
  7. YAML is easy to implement and use.

1.2. Prior Art

YAML's initial direction was set by the data serialization and markup language discussions among SML-DEV members. Later on it directly incorporated experience from Brian Ingerson's Perl module Data::Denter. Since then YAML has matured through ideas and support from its user community.

YAML integrates and builds upon concepts described by C, Java, Perl, Python, Ruby, RFC0822 (MAIL), RFC1866 (HTML), RFC2045 (MIME), RFC2396 (URI), XML, SAX and SOAP.

The syntax of YAML was motivated by Internet Mail (RFC0822) and remains partially compatible with that standard. Further, YAML borrows the idea of having multiple documents from MIME (RFC2045). YAML's top-level production is a stream of independent documents; ideal for message-based distributed processing systems.

YAML's indentation based block scoping is similar to Python's (without the ambiguities caused by tabs). Indented blocks facilitate easy inspection of a document's structure. YAML's literal scalar leverages this by enabling formatted text to be cleanly mixed within an indented structure without troublesome escaping.

YAML's double quoted scalar uses familar C-style escape sequences. This enables ASCII representation of non-printable or 8-bit (ISO 8859-1) characters such as \x3B. 16-bit Unicode and 32-bit (ISO/IEC 10646) characters are supported with escape sequences such as \u003B and \U0000003B.

Motivated by HTML's end-of-line normalization, YAML's folded scalar employs an intuitive method of handling white space. In YAML, single line breaks may be folded into a single space, while empty lines represent line break characters. This technique allows for paragraphs to be word-wrapped without affecting the canonical form of the content.

YAML's core type system is based on the requirements of Perl, Python and Ruby. YAML directly supports both collection (hash, array) values and scalar (string) values. Support for common types enables programmers to use their language's native data constructs for YAML manipulation, instead of requiring a special document object model (DOM).

Like XML's SOAP, YAML supports serializing native graph structures through a rich alias mechanism. Also like SOAP, YAML provides for application-defined types. This allows YAML to encode rich data structures required for modern distributed computing. YAML provides unique global type names using a namespace mechanism inspired by Java's DNS based package naming convention and XML's URI based namespaces.

YAML was designed to have an incremental interface that includes both a pull-style input stream and a push-style (SAX-like) output stream interfaces. Together this enables YAML to support the processing of large documents, such as a transaction log, or continuous streams, such as a feed from a production machine.

1.3. Relation to XML

Newcomers to YAML often search for its correlation to the eXtensible Markup Language (XML). While the two languages may actually compete in several application domains, there is no direct correlation between them.

YAML is primarily a data serialization language. XML was designed to be backwards compatible with the Standard Generalized Markup Language (SGML) and thus had many design constraints placed on it that YAML does not share. Inheriting SGML's legacy, XML is designed to support structured documents, where YAML is more closely targeted at messaging and native data structures. Where XML is a pioneer in many domains, YAML is the result of lessons learned from XML and other technologies.

It should be mentioned that there are ongoing efforts to define standard XML/YAML mappings. This generally requires that a subset of each language be used. For more information on using both XML and YAML, please visit http://yaml.org/xml/.

1.4. Terminology

This specification uses key words in accordance with RFC2119 to indicate requirement level. In particular, the following words are used to describe the actions of a YAML processor:

may
This word, or the adjective “optional”, mean that conformant YAML processors are permitted, but need not behave as described.
should
This word, or the adjective “recommended”, mean that there could be reasons for a YAML processor to deviate from the behavior described, but that such deviation could hurt interoperability and should therefore be advertised with appropriate notice.
must
This word, or the term “required” or “shall”, mean that the behavior described is an absolute requirement of the specification.

Chapter 2. Preview

This section provides a quick glimpse into the expressive power of YAML. It is not expected that the first-time reader grok all of the examples. Rather, these selections are used as motivation for the remainder of the specification.

2.1. Collections

YAML's block collections use indentation for scope and begin each member on its own line. Block sequences indicate each member with a dash (“-”). Block mappings use a colon to mark each (key: value) pair.

Example 2.1.  Sequence of scalars
(ball players)

- Mark McGwire
- Sammy Sosa
- Ken Griffey

Example 2.2.  Mapping of scalars to scalars
(player statistics)

hr:  65
avg: 0.278
rbi: 147

Example 2.3.  Mapping of scalars to sequences
(ball clubs in each league)

american:
  - Boston Red Sox
  - Detroit Tigers
  - New York Yankees
national:
  - New York Mets
  - Chicago Cubs
  - Atlanta Braves

Example 2.4.  Sequence of mappings
(players' statistics)

-
  name: Mark McGwire
  hr:   65
  avg:  0.278
-
  name: Sammy Sosa
  hr:   63
  avg:  0.288

YAML also has in-line flow styles for compact notation. The flow sequence is written as a comma separated list within square brackets. In a similar manner, the flow mapping uses curley braces. In YAML, the space after the “-” and “:” and “:” is mandatory.

Example 2.5. Sequence of sequences

- [name        , hr, avg  ]
- [Mark McGwire, 65, 0.278]
- [Sammy Sosa  , 63, 0.288]


Example 2.6. Mapping of mappings

Mark McGwire: {hr: 65, avg: 0.278}
Sammy Sosa: {
    hr: 63,
    avg: 0.288
  }

2.2. Structures

YAML uses three dashes (“---”) to separate documents within a stream. Comment lines begin with the pound sign (“#”). Three dots (“...”) indicate the end of a document without starting a new one, for use in communication channels.

Repeated nodes are first marked with the ampersand (“&”) and then referenced with an asterisk (“*”) thereafter.

Example 2.7.  Two documents in a stream
each with a leading comment

# Ranking of 1998 home runs
---
- Mark McGwire
- Sammy Sosa
- Ken Griffey

# Team ranking
---
- Chicago Cubs
- St Louis Cardinals

Example 2.8.  Play by play feed
from a game

---
time: 20:03:20
player: Sammy Sosa
action: strike (miss)
...
---
time: 20:03:47
player: Sammy Sosa
action: grand slam
...

Example 2.9.  Single document with two comments
 

---
hr: # 1998 hr ranking
  - Mark McGwire
  - Sammy Sosa
rbi:
  # 1998 rbi ranking
  - Sammy Sosa
  - Ken Griffey

Example 2.10.  Node for “Sammy Sosa
appears twice in this document

---
hr:
  - Mark McGwire
  # Following node labeled SS
  - &SS Sammy Sosa
rbi:
  - *SS # Subsequent occurance
  - Ken Griffey

The question mark indicates a complex key. Within a block sequence, mapping pairs can start immediately following the dash.

Example 2.11. Mapping between sequences

? # PLAY SCHEDULE
  - Detroit Tigers
  - Chicago Cubs
:
  - 2001-07-23

? [ New York Yankees,
    Atlanta Braves ]
: [ 2001-07-02, 2001-08-12,
    2001-08-14 ]

Example 2.12. Sequence key shortcut

---
# products purchased
- item    : Super Hoop
  quantity: 1
- item    : Basketball
  quantity: 4
- item    : Big Shoes
  quantity: 1


2.3. Scalars

Scalar values can be written in block form using a literal style (“|”) where all new lines count. Or they can be written with the folded style (“>”) for content that can be word wrapped. In the folded style, newlines are treated as a space unless they are part of a blank or indented line.

Example 2.13.  In literals,
newlines are preserved

# ASCII Art
--- |
  \//||\/||
  // ||  ||__

Example 2.14.  In the plain scalar,
newlines are treated as a space

---
  Mark McGwire's
  year was crippled
  by a knee injury.

Example 2.15.  Folded newlines preserved
for indented and blank lines

--- >
 Sammy Sosa completed another
 fine season with great stats.

   63 Home Runs
   0.288 Batting Average

 What a year!

Example 2.16.  Indentation determines scope
 

name: Mark McGwire
accomplishment: >
  Mark set a major league
  home run record in 1998.
stats: |
  65 Home Runs
  0.278 Batting Average

YAML's flow scalars include the plain style (most examples thus far) and quoted styles. The double quoted style provides escape sequences. Single quoted style is useful when escaping is not needed. All flow scalars can span multiple lines; intermediate whitespace is trimmed to a single space.

Example 2.17. Quoted scalars

unicode: "Sosa did fine.\u263A"
control: "\b1998\t1999\t2000\n"
hexesc:  "\x13\x10 is \r\n"

single: '"Howdy!" he cried.'
quoted: ' # not a ''comment''.'
tie-fighter: '|\-*-/|'

Example 2.18. Multiline flow scalars

plain:
  This unquoted scalar
  spans many lines.

quoted: "So does this
  quoted scalar.\n"

2.4. Tags

In YAML, plain (unquoted) scalars are given an implicit type depending on the application. The examples in this specification use types from YAML's tag repository, which includes types like integers, floating point values, timestamps, null, boolean, and string values.

Example 2.19. Integers

canonical: 12345
decimal: +12,345
sexagecimal: 3:25:45
octal: 014
hexadecimal: 0xC

Example 2.20. Floating point

canonical: 1.23015e+3
exponential: 12.3015e+02
sexagecimal: 20:30.15
fixed: 1,230.15
negative infinity: (-inf)
not a number: (NaN)

Example 2.21. Miscellaneous

null: ~
true: y
false: n
string: '12345'

Example 2.22. Timestamps

canonical: 2001-12-15T02:59:43.1Z
iso8601:  2001-12-14t21:59:43.10-05:00
spaced:  2001-12-14 21:59:43.10 -05:00
date:   2002-12-14

Explicit typing is denoted with a tag using the bang (“!”) symbol. Application tags should include a domain name and may use the caret (“^”) to abbreviate subsequent tags.

Example 2.23. Various explicit tags

---
not-date: !str 2002-04-28

picture: !binary |
 R0lGODlhDAAMAIQAAP//9/X
 17unp5WZmZgAAAOfn515eXv
 Pz7Y6OjuDg4J+fn5OTk6enp
 56enmleECcgggoBADs=

application specific tag: !!something |
 The semantics of the tag
 above may be different for
 different documents.

Example 2.24. Application specific tag

# Establish a tag prefix
--- !clarkevans.com,2002/graph/^shape
  # Use the prefix: shorthand for
  # !clarkevans.com,2002/graph/circle
- !^circle
  center: &ORIGIN {x: 73, y: 129}
  radius: 7
- !^line
  start: *ORIGIN
  finish: { x: 89, y: 102 }
- !^label
  start: *ORIGIN
  color: 0xFFEEBB
  value: Pretty vector drawing.

Example 2.25. Unorderd set

# sets are represented as a
# mapping where each key is
# associated with the empty string
--- !set
? Mark McGwire
? Sammy Sosa
? Ken Griff

Example 2.26. Ordered mappings

# ordered maps are represented as
# a sequence of mappings, with
# each mapping having one key
--- !omap
- Mark McGwire: 65
- Sammy Sosa: 63
- Ken Griffy: 58

2.5. Full Length Example

Below are two full-length examples of YAML. On the left is a sample invoice; on the right is a sample log file.

Example 2.27. Invoice

--- !clarkevans.com,2002/^invoice
invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city    : Royal Oak
        state   : MI
        postal  : 48046
ship-to: *id001
product:
    - sku         : BL394D
      quantity    : 4
      description : Basketball
      price       : 450.00
    - sku         : BL4438H
      quantity    : 1
      description : Super Hoop
      price       : 2392.00
tax  : 251.42
total: 4443.52
comments:
    Late afternoon is best.
    Backup contact is Nancy
    Billsmer @ 338-4338.

Example 2.28. Log file

---
Time: 2001-11-23 15:01:42 -05:00
User: ed
Warning:
  This is an error message
  for the log file
---
Time: 2001-11-23 15:02:31 -05:00
User: ed
Warning:
  A slightly different error
  message.
---
Date: 2001-11-23 15:03:17 -05:00
User: ed
Fatal:
  Unknown variable "bar"
Stack:
  - file: TopClass.py
    line: 23
    code: |
      x = MoreObject("345\n")
  - file: MoreClass.py
    line: 58
    code: |-
      foo = bar



Chapter 3. Processing YAML Information

YAML is both a text format and a method for representing native language data structures in this format. This specification defines two concepts: a class of data objects called YAML representations, and a syntax for encoding YAML representations as a series of characters, called a YAML stream. A YAML processor is a tool for converting information between these complementary views. It is assumed that a YAML processor does its work on behalf of another module, called an application. This chapter describes the information structures a processor must provide to or obtain from the application.

YAML information is used in two ways: for machine processing, and for human consumption. The challange of reconciling these two perspectives is best done in three distinct translation stages: representation, serialization, and presentation. Representation addresses how YAML views native language data structures to achieve portability between programming environments. Serialization concerns itself with turning a YAML representation into a serial form, that is, a form with sequential access constraints. Presentation deals with the formatting of a YAML serialization as a stream of characters, in a manner friendly to humans.

Figure 3.1. YAML Overview

YAML Overview

A processor need not expose the serialization or representation stages. It may translate directly between native objects and a character stream and (“dump” and “load” in the diagram above). However, such a direct translation should take place so that the native objects are constructed only from information available in the representation.

3.1. Processes

This section details the processes shown in the diagram above.

3.1.1. Represent

YAML representations model the data constructs from agile programming languages, such as Perl, Python, or Ruby. YAML representations view native language data objects in a generic manner, allowing data to be portable between various programming languages and implementations. Strings, arrays, hashes, and other user-defined types are supported. This specification formalizes what it means to be a YAML representatation and suggests how native language objects can be viewed as a YAML representation.

YAML representations are constructed with three primitives: the sequence, the mapping and the scalar. By sequence we mean an ordered collection, by mapping we mean an unordered association of unique keys to values, and by scalar we mean any object with opaque structure yet expressable as a series of unicode characters. When used generatively, these primitives construct directed graph structures. These primitives were chosen beacuse they are both powerful and familiar: the sequence corresponds to a Perl array and a Python list, the mapping corresponds to a Perl hashtable and a Python dictionary. The scalar represents strings, integers, dates and other atomic data types.

YAML represents any native language data object as one of these three primitives, together with a type specifier called a tag. Type specifiers are either global, using a syntax based on the domain name and registration date, or private in scope. For example, an integer is represented in YAML with a scalar plus a globally scoped tag:yaml.org,2002/int tag. Similarly, an invoice object, particular to a given organization, could be represented as a mapping together with a tag:private.yaml.org,2002:invoice tag. This simple model, based on the sequence and mapping and scalar together with a type specifier, can represent any data structure independent of programming language.

3.1.2. Serialize

For sequential access mediums, such as an event callback API, a YAML representation must be serialized to an ordered tree. Serialization is necessary since nodes in a YAML representation may be referenced more than once (more than one incoming arrow) and since mapping keys are unordered. Serialization is accomplished by imposing an ordering on mapping keys and by replacing the second and subsequent references to a given node with place holders called aliases. The result of this process, the YAML serialization tree, can then be traversed to produce a series of event calls for one-pass processing of YAML data.

3.1.3. Present

YAML character streams (or documents) encode YAML representations into a series of characters. Some of the characters in a YAML stream represent the content of the source information, while other characters are used for presentation style. Not only must YAML character streams store YAML representations, they must do so in a manner which is human friendly.

To address human presentation, the YAML syntax has a rich set of stylistic options which go far beyond the needs of data serialization. YAML has two approaches for expressing a node's nesting, one that uses indentation to designate depth in the serialization tree and another which uses begin and end delimiters. Depending upon escaping and how line breaks should be treated, YAML scalars may be written with many different styles. YAML syntax also has a comment mechanism for annotations othogonal to the “content” of a YAML representation. These presentation level details provide sufficient variety of expression.

In a similar manner, for human readable text, it is frequently desirable to omit data typing information which is often obvious to the human reader and not needed. This is especially true if the information is created by hand, expecting humans to bother with data typing detail is optimistic. Implicit type information may be restored using a data schema or similar mechanisms.

3.1.4. Parse

Parsing is the inverse process of presentation, it takes a stream of characters and produces a series of events.

3.1.5. Compose

Composing takes a series of events and produces a node graph representation. See completeness for more detail on the constraints composition must follow. When composing, one must deal with broken aliases and anchors, and other things of this sort.

3.1.6. Construct

Construction converts construct YAML representations into native language objects.

3.2. Information Models

This section has the formal details of the results of the processes.

Figure 3.2. YAML Information Models

YAML Information Models

To maximize data portability between programming languages and implementations, users of YAML should be mindful of the distinction between serialization or presentation properties and those which are part of the YAML representation. While imposing a order on mapping keys is necessary for flattening YAML representations to a sequential access medium, the specific ordering of a mapping should not be used to convey application level information. In a similar manner, while indentation technique or the specific scalar style is needed for character level human presentation, this syntax detail is not part of a YAML serialization nor a YAML representation. By carefully separating properties needed for serialization and presentation, YAML representations of native language information will be consistent and portable between various programming environments.

3.2.1. Node Graph Representation

In YAML's view, native data is represented as a directed graph of tagged nodes. Nodes that are defined in terms of other nodes are collections and nodes that are defined independent of any other nodes are scalars. YAML supports two kinds of collection nodes, sequence and mappings. Mapping nodes are somewhat tricky beacuse its keys are considered to be unordered and unique.

Figure 3.3. YAML Representation

YAML Representation

3.2.1.1. Nodes

A YAML representation is a rooted, connected, directed graph. By “directed graph” we mean a set of nodes and arrows, where arrows connect one node to another ( a formal definition ). Note that the YAML graph may include cycles, and a node may have more than one incoming arrow.

YAML nodes have a tag and can be of one of three kinds: scalar, sequence, or mapping. The node's tag serves to restrict the set of possible values which the node can have.

scalar

A scalar is a series of zero or more Unicode characters. YAML places no restriction on the length or content of the series.

sequence

A sequence is a series of zero or more nodes. In particular, a sequence may contain the same node more than once or it could even contain itself (directly or indirectly).

mapping

A mapping is an unordered set of key/value node pairs, with the restriction that each of the keys is unique. This restriction has non-trivial implications detailed below. YAML places no further restrictions on the nodes. In particular, keys may be arbitrary nodes, the same node may be used as a value in several pairs, and a mapping could even contain itself as a key or a value (directly or indirectly).

When appropriate, it is convient to consider sequences and mappings together, as a collection. In this view, sequences are treated as mappings with integer keys starting at zero. Having a unified collections view for sequences and mappings is helpful for both constructing practical YAML tools and APIs and for theoretical analysis.

YAML allows several representations to be encoded to the same character stream. Representations appearing in the same character stream are independent. That is, a given node may not appear in more than one representation graph.

3.2.1.2. Tags

YAML represents type information of native objects with a simple identifier, called a tag. These identifiers are URIs, using a subset of the “tag” URI scheme. YAML tags use only the domain based form, tag:domain,date:identifier, for example, tag:yaml.org,2002:str. YAML presentations provide several mechanisms to make this less verbose. Tags may be minted by those who own the domain at the specified date. The day must be omitted if it is the 1st of the month, and the month and day must be omitted for January 1st. The year is never omitted. Thus, each YAML tag has a single globally unique representation. More information on this URI scheme can be found at http://www.taguri.org (mirror).

YAML tags can be either globally unique, or private to a single representation graph. Private tags start with tag:private.yaml.org,2002:. Clearly private tags are not globally unique, since the domain name and the date are fixed.

YAML does not mandate any special relationship between different tags that begin with the same substring. Tags ending URI fragments (containing “#”) are no exception. Tags that share the same base URI but differ in their fragment part are considered to be different, independent tags. By convention, fragments are used to identify different “versions” of a tag, while “/” is used to define nested tag “namespace” hierarchies. However, this is merely a convention, and each tag may employ its own rules. For example, tag:perl.yaml.org,2002: tags use “::” to express namespace hierarchies, tag:java.yaml.org,2002: tags use “.”, etc.

YAML tags are used to associate meta information with each node. In particular, each tag is required to specify a the kind (scalar, sequence, or mapping) it applies to. Scalar tags must also provide mechanism for converting values to a canonical form for supporting equality testing. Furthermore, a tag may provide additional information such as the set of allowed values for validation, a mechanism for implicit typing, or any other data that is applicable to all of the tag's nodes.

3.2.1.3. Equality

Since YAML mappings require key uniqueness, representations must include a mechanism for testing the equality of nodes. This is non-trivial since YAML presentations allow various ways to write a given scalar. For example, the integer ten can be written as 10 or 0xA (hex). If both forms are used as a key in the same mapping, only a YAML processor which “knows” about integer tags and their presentation formats would correctly flag the duplicate key as an error.

canonical form

YAML supports the need for scalar equality by requiring that every scalar tag have a mechanism to produce a canonical form of its scalars. By canonical form, we mean a Unicode character string which represents the scalar's content and can be used for equality testing. While this requirement is stronger than a well defined equality operator, it has other uses, such as the production of digital signatures.

equality

Two nodes must have the same tag and value to be equal. Since each tag applies to exactly one kind, this implies that the two nodes must have the same kind to be equal. Two scalar nodes are equal only when their canonical values are character-by-character equivalent. Equality of collections is defined recursively. Two sequences are equal only when they have the same length and each node in one sequence is equal to the corresponding node in the other sequence. Two mappings are equal only when they have equal sets of keys, and each key in this set is associated with equal values in both mappings.

identity

Node equality should not be confused with node identity. Two nodes are identical only when they represent the same native object. Typically, this corresponds to a single memory address. During serialization, equal scalar nodes may be treated as if they were identical. In contrast, the seperate identity of two distinct, but equal, collection nodes must be preserved.

3.2.2. Event / Tree Serialization

To express a YAML representation using a serial API, it necessary to impose an order on mapping keys and employ alias nodes to indicate a subsequent occurence of a previously encountered node. The result of this serialization process is a tree structure, where each branch has an ordered set of children. This tree can be traversed for a serial event based API. Construction of native structures from the serial interface should not use key order or anchors for the preservation of important data.

Figure 3.4. YAML Serialization

YAML Serialization

3.2.2.1. Key Order

In the representation model, keys in a mapping do not have order. To serialize a mapping, it is necessary to impose an ordering on its keys. This order should not be used when composing a representation graph from serialized events.

In every case where node order is significant, a sequence must be used. For example, an ordered mapping can be represented by a sequence of mappings, where each mapping is a single key/value pair. YAML presentations provide convient shorthand syntax for this case.

3.2.2.2. Aliases

In the representation model, a node may appear in more than one context. When serializing such nodes, the first occurance of the node is serialized with an anchor and subsequent occurances are serialized as an alias which specifies the same anchor. Anchors need not be unique within a serialization. When composing a representation graph from serialized events, alias nodes refer to the most recent node in the serialization having the specified anchor.

An anchored node need not have an alias referring to it. It is therefore possible to provide an anchor for all nodes in serialization. After composing a representation graph, the anchors are discarded. Hence, anchors must not be used for encoding application data.

3.2.3. Character Stream Presentation

YAML presentations make use of styles, comments, directives and other syntactical details. Although the processor may provide this information, these features should not be used when constructing native structures.

Figure 3.5. YAML Presentation

YAML Presentation

3.2.3.1. Styles

In the syntax, each node has an additional style property, depending on its node. There are two types of styles, block and flow. Block styles use indentation to denote nesting and scope within the presentation. In contrast, flow styles rely on explicit markers to denote nesting and scope.

YAML provides several shorthand forms for collection styles, allowing for compact nesting of collections in common cases. For compact set notation, null mapping values may be omitted. For compact ordered mapping notation, a mapping with a single key:value pair may be specified directly inside a flow sequence collection. Also, simple block collections may begin in-line rather than the next line.

YAML provides a rich set of scalar style variants. Scalar block styles include the literal and folded styles; scalar flow styles include the plain, single quoted and double quoted styles. These styles offer a range of tradeoffs between expressive power and readability.

3.2.3.2. Comments

The syntax allows optional comment blocks to be interleaved with the node blocks. Comment blocks may appear before or after any node block. A comment block can't appear inside a scalar node value.

3.2.3.3. Directives

Each document may be associated with a set of directives. A directive is a key:value pair where both the key and the value are simple strings. Directives are instructions to the YAML processor, allowing for extending YAML in the future. This version of YAML defines a single directive, “YAML”. Additional directives may be added in future versions of YAML. A processor should ignore unknown directives with an appropriate warning. There is no provision for specifying private directives. This is intentional.

The “YAML” directive specifies the version of YAML the document adheres to. This specification defines version 1.0. A version 1.0 processor should accept documents with an explicit “%YAML:1.0” directive, as well as documents lacking a “YAML” directive. Documents with a directive specifying a higher minor version (e.g. “%YAML:1.1”) should be processed with an appropriate warning. Documents with a directive specifying a higher major version (e.g. “%YAML:2.0”) should be rejected with an appropriate error message.

3.3. Completeness

The process of converting YAML information from a character stream presentation to a native data structure has several potential failure points. The character stream may be ill-formed, implicit tags may be unresolvable, tags may be unrecognized, the content may be invalid, and a native type may be unavailable. Each of these failures results with an incomplete conversion.

A partial representation need not specify the tag of each node, and the canonical form of scalar values need not be available. This weaker representation is useful for cases of incomplete knowledge of tags used in the document.

Figure 3.6. YAML Completeness

YAML Completeness

3.3.1. Well-Formed

A well-formed character stream must match the productions specified in the next chapter. A YAML processor should reject ill-formed input. A processor may recover from syntax errors, but it must provide a mechanism for reporting such errors.

3.3.2. Resolved

It is not required that all tags in a complete YAML representation be explicitly specified in the character stream presentation. In this case, these implicit tags must be resolved.

When resolving tags, a YAML processor must only rely upon representation details, with one notable exception. It may consider whether a scalar was written in the plain style when resolving the scalar's tag. Other than this exception, the processor must not rely upon presentation or serialization details. In particular, it must not consider key order, anchors, styles, spacing, indentation or comments.

The plain scalar style exception allows unquoted values to signify numbers, dates, or other typed data, while quoted values are treated as generic strings. With this exception, a processor may match plain scalars against a set of regular expressions, to provide automatic resolution of such types without an explict tag.

If a document contains unresolved nodes, the processor is unable to compose a complete representation graph. However, the processor may compose an partial representation, based on each node's kind (mapping, sequence, scalar) and allowing for unresolved tags.

3.3.3.  Recognized and Valid

To be valid, a node must have a tag which is recognized by the processor and its value must satisfy the constraints imposed by its tag. If a document contains a scalar node with an unrecognized tag or an invalid value, only a partial representation may be composed. In constrast, a processor can always compose a complete YAML representation for an unrecognized or an invalid collection, since collection equality does not depend upon the collection's data type.

3.3.4. Available

In a given processing environment, there may not be an available native type corresponding to a given tag. If a node's tag is unavailable, a YAML processor will not be able to construct a native data structure for it. In this case, a complete YAML representation may still be composed, and an application may wish to use this representation directly.

Chapter 4. Syntax

Following are the BNF productions defining the syntax of YAML character streams. The productions introduce the relevant character classes, describe the processing of white space, and then follow with the decomposition of the stream into logical chunks. To make this chapter easier to follow, production names use Hungarian-style notation:

c-
a production matching a single special character
b-
a production matching a single line break
nb-
a production matching a single non-break character
s-
a production matching a single non-break space character
ns-
a production matching a single non-break non-space character
i-
a production matching indentation spaces
X - Y -
a production matching a sequence of characters, starting with an X- production and ending with a Y- production
l-
a production matching a single line (shorthand for i-b-)

4.1. Characters

4.1.1. Character Set

YAML streams use a subset of the Unicode character set. On input, a YAML processor must accept all printable ASCII characters, the space, tab, line break, and all Unicode characters beyond 0x9F. On output, a YAML processor must only produce those acceptable characters, and should also escape all non-printable Unicode characters.

[1] c-printable ::=   #x9 | #xA | #xD
| [#x20-#x7E] | #x85
| [#xA0-#xD7FF]
| [#xE000-#xFFFD]
| [#x10000-#x10FFFF]
/* characters as defined by the Unicode standard, excluding most control characters and the surrogate blocks */

This character range explicitly excludes the surrogate block [#xD800-#xDFFF], DEL 0x7F, the C0 control block [#x0-#x1F], the C1 control block [#x80-#x9F], #xFFFE and #xFFFF. Note that in UTF-16, characters above #xFFFF are represented with a surrogate pair. When present, DEL and characters in the C0 and C1 control block must be represented in a YAML stream using escape sequences.

4.1.2. Encoding

A YAML processor must support the UTF-16 and UTF-8 character encodings. If an input stream does not begin with a byte order mark, the encoding shall be UTF-8. UTF-16 (LE or BE) or UTF-8, as signaled by the byte order mark. Since YAML files may only contain printable characters, this does not raise any ambiguities. For more information about the byte order mark and the Unicode character encoding schemes see the Unicode FAQ.

[2] c-byte-order-mark ::= #xFEFF /* unicode BOM */

4.1.3. Indicators

Indicators are special characters that are used to describe the structure of a YAML document. In general, they cannot be used as the first character of a plain scalar.

[3] c-sequence-start ::= [ /* starts a flow sequence collection */
[4] c-sequence-end ::= ] /* ends a flow sequence collection */
[5] c-mapping-start ::= { /* starts a flow mapping collection */
[6] c-mapping-end ::= } /* ends a flow mapping collection */
[7] c-sequence-entry ::= - /* indicates a sequence entry */
[8] c-mapping-entry ::= : /* separates a key from its value */
[9] c-collect-entry ::= , /* separates flow collection entries */
[10] c-complex-key ::= ? /* a complex key */
[11] c-tag ::= ! /* indicates a tag property */
[12] c-anchor ::= & /* an anchor property */
[13] c-alias ::= * /* an alias node */
[14] c-literal ::= | /* a literal scalar */
[15] c-folded ::= > /* a folded scalar */
[16] c-single-quote ::= ' /* a single quoted scalar */
[17] c-double-quote ::= " /* a double quoted scalar */
[18] c-throwaway ::= # /* a throwaway comment */
[19] c-directive ::= % /* a directive */
[20] c-reserved ::= @” | “` /* reserved for future use */
[21] c-indicators ::=   [ | ] | { | }
| - | : | ? | ,
| ! | * | &
| | | > | ' | "
| # | % | @” | “`
/* indicator characters */

4.1.4. Line Breaks

The Unicode standard defines several line break characters. These line breaks can be grouped into two categories. Specific line breaks have well-defined semantics for breaking text into lines and paragraphs. Generic line breaks are not given meaning beyond “ending a line”.

[22] b-line-feed ::= #xA /* aSCII line feed (LF) */
[23] b-carriage-return ::= #xD /* aSCII carriage return (CR) */
[24] b-next-line ::= #x85 /* unicode next line (NEL) */
[25] b-line-separator ::= #x2028 /* unicode line separator (LS) */
[26] b-paragraph-separator ::= #x2029 /* unicode paragraph separator (PS) */
[27] b-char ::=   b-line-feed
| b-carriage-return
| b-next-line
| b-line-separator
| b-paragraph-separator
/* line break characters */
[28] b-generic ::=   ( b-carriage-return
    b-line-feed )
| b-carriage-return
| b-line-feed
| b-next-line
/* line break with non-specific semantics */
[29] b-specific ::=   b-line-separator
| b-paragraph-separator
/* line break with specific semantics */
[30] b-any ::= b-generic | b-specific /* any non-content line break */

Outside scalar text content, YAML allows any line break to be used to terminate lines, and in most cases also allows such line breaks to be preceded by trailing comment characters. On output, a YAML processor is free to emit such line breaks using whatever convention is most appropriate. YAML output should avoid using trailing line spaces.

4.1.5. Miscellaneous

This section includes several common character range definitions.

[31] nb-char ::= c-printable - b-char /* characters valid in a line */
[32] s-char ::= #x9 | #x20 /* white space valid in a line */
[33] ns-char ::= nb-char - s-char /* non-space characters valid in a line */
[34] ns-ascii-letter ::= [#x41-#x5A] | [#x61-#x7A] /* aSCII letters, A-Z or a-z */
[35] ns-decimal-digit ::= [#x30-#x39] /* 0-9 */
[36] ns-hex-digit ::= ns-decimal-digit
| [#x41-#x46] | [#x61-#x66]
/* 0-9, A-F or a-f */
[37] ns-word-char ::=   ns-decimal-digit
| ns-ascii-letter | “-
/* characters valid in a word */

4.2. Space Processing

YAML streams use lines and spaces to convey structure. This requires special processing rules for white space (space and tab).

4.2.1. Indentation

In a YAML character stream, structure is often determined from indentation, where indentation is defined as a line break character followed by zero or more space characters. With one notable exception, a node must be more indented than its parent node. All sibling nodes must use the exact same indentation level, however the content of each such node could be further indented. Indentation is used exclusively to delineate structure and is otherwise ignored; in particular, indentation characters must never be considered part o