Yet Another Markup Language (YAML) 1.0

Working Draft 16 Jun 2001

Editors:
Brian Ingerson <briani@ActiveState.com>
Clark C. Evans
Oren Ben-Kiki <oren@ben-kiki.org>

Copyright © 2001 Clark Evans, All Rights Reserved. This document may be freely copied provided it is not modified.


Abstract

YAML (pronounced "yaamel") is a straightforward machine parsable data serialization format designed for human readability and interaction with scripting languages such as Perl and Python. YAML is optimized for configuration settings, log files, Internet messaging and filtering. This specification describes the serialization format, a "C" API for the parser and emitter, and several language bindings.

Status of this Document

This specification is a working draft and reflects consensus reached by the members of the yaml-core mailing list. Any questions regarding this draft should be raised on this list. Changes are expected, so implementers should closely follow this mailing list to stay up-to-date on trends and announcements.

There are many productions which may have bugs, and the MIME productions are yet to be completed, although this should not stop an implementer! Feedback is welcome.

Table of Contents

1 Introduction
    1.1 Origin and Goals
    1.2 Key Concepts
    1.3 Example
    1.4 Relation to XML
    1.5 Terminology
2 Serialization
    2.1 Information Model
    2.2 Characters
        2.2.1 Character Set
        2.2.2 Encoding
        2.2.3 End-of-Line Normalization
        2.2.4 Indicators
        2.2.5 Escape Sequences
        2.2.6 Miscellaneous Characters
    2.3 Strings
        2.3.1 Indentation
        2.3.2 Whitespace Folding
        2.3.3 Quoted String
        2.3.4 Anchor String
        2.3.5 Class String
    2.4 Document
        2.4.1 Node
        2.4.2 Class
        2.4.3 Reference
        2.4.4 List
        2.4.5 Map
    2.5 Scalars
        2.5.1 Simple Scalar
        2.5.2 Block Scalar
        2.5.3 Quoted Scalar
        2.5.4 Binary Scalar
3 Sequential Processing
    3.1 Parser Interface
    3.2 Emitter Interface
    3.3 Interface Converter
4 Language Bindings
    4.1 "C" Language
    4.2 Python
    4.3 Perl
    4.4 Java
5 Changes From Other Versions
    5.1 Changes From The 09 Jun 2001 Draft
    5.2 Changes From The 26 May 2001 Draft
    5.3 Probable Future Changes

1 Introduction

Yet Another Markup Language, abbreviated YAML, describes a class of data objects called YAML documents and partially describes the behavior of computer programs that process them.

YAML documents encode into a serialized form information having a recursive scalar, map, or list structure. YAML also includes a method to encode references and binary values. At its core, a YAML document consists of a sequence of characters, some of which are considered part of the document's content, and others that are used to indicate structure within the information stream.

A software module called a YAML parser is used to read YAML documents and provide access to their content and structure. In a similar way, a YAML emitter is used to write YAML documents, serializing their content and structure. A YAML processor is a module that provides parser or emitter functionality or both. It is assumed that a YAML processor does its work on behalf of another module, called an application. This specification describes the interface and required behavior of a YAML processor in terms of how it must read or write YAML documents and the information it must provide or obtain from the application.

1.1 Origin and Goals

The design goals for YAML are:

  1. YAML documents are very readable by humans.
  2. YAML interacts well with scripting languages.
  3. YAML uses the host language's native data structures.
  4. YAML works well with Internet mail architecture.
  5. YAML allows binary and large formatted text.
  6. YAML has a consistent information model.
  7. YAML includes a stream based interface.
  8. YAML is expressive and extensible.
  9. YAML is easy to implement.

YAML was designed with experience gained from the construction and deployment of Data::Denter. YAML has also enjoyed much markup language critique from SML-DEV list participants, including experience with the Minimal XML and Common XML specifications.

This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters), provides all the information necessary to understand YAML Version 1.0 and construct computer programs to process it.

1.2 Key Concepts

YAML builds upon the structures and concepts described by XML, SOAP, Perl, HTML, Python, C, rfc0822, rfc2045, rfc2046, and SAX.

YAML's type structures are similar to those of Perl. In YAML, there are three fundamental structures: scalars (text and binary), maps (%) and lists (@). YAML also supports references to enable the serialization of graphs. Furthermore, each data value can be associated with a class name to allow the use of specific data types. This type structure is common to many other languages and provides a solid basis for an information model. Furthermore, it enables the programmer to use their programming language's native data constructs for YAML manipulation, instead of a document object.

YAML handles whitespace somewhat like HTML does. In YAML, sequences of spaces, tabs, and carriage return characters are usually folded into a single space during parsing. This wonderful technique makes markup code readable by allowing indentation and word-wrapping without affecting the canonical form of the content. This folding technique can also be found in the structured headers of RFC 822.

YAML's block scoping is similar to Python's. In YAML, the extent of a node is indicated by its child's nesting level, i.e., what column it is in. Block indenting provides for easy inspection of the document's structure and greatly improves readability. This block scoping is possible due to the whitespace folding technique.

YAML's quoted strings are similar to C's. In YAML, text scalars can be surrounded by quotes enabling escape sequences such as \n to represent a new line, \t to represent a tab, and \\ to represent the backslash. Unlike C, since whitespace is folded, YAML introduces bash style "\ " to escape additional spaces that are part of the content and should not be folded. Further, the trailing \ is used as a continuation marker, allowing content to be broken into multiple lines without introducing unwanted whitespace. Lastly, ISO 8859-1 characters can be specified using "\x3B" style escapes and Unicode characters can be specified using the similar escaping techniques "\u003B" and "\U0000003B".

The syntax of YAML is an extension of rfc822, allowing for direct usage of YAML in mail handlers. YAML also has a binary scalar type using base64 encoding.

In addition to a native in-memory load/save interface, YAML provides both a pull style input stream and a push style (SAX like) output stream interface. This enables YAML to directly support the processing of large documents, such as a transaction log, or continuous streams, such as a feed from a production machine.

1.3 Example

Following is a simple example of an invoice represented as a YAML document. The colon is used to separate name:value pairs. The percent sign following a colon indicates that the value is a mapping. The at sign indicates that the value is an ordered list.

buyer: %
    address     : %
       city       : Royal Oak
       line one   : 458 Wittigen's Way
       line two   : Suite #292
       postal     : 48046
       state      : MI
    family name : Dumars
    given name  : Chris
date    : 12-JAN-2001
delivery: %
    method : UZS Express Overnight
    price  : $45.50
comments :
    Mr. Dumars is frequently gone in the morning
    so it is best advised to try things in late
    afternoon. If Joe isn't around, try his house
    keeper, Nancy Billsmer @ (734) 338-4338.
invoice : 00034843
product : @
    %
        desc      : Grade A, Leather Hide Basketball
        id        : BL394D
        price     : $450.00
        quantity  : 4
    %
        desc      : Super Hoop (tm)
        id        : BL4438H
        price     : $2,392.00
        quantity  : 1
tax      : $0.00
total    : $4237.50
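
As a rough illustration of how such a document maps onto native data structures, part of the invoice above might be held in Python as nested dict and list objects, with every scalar kept as text. This is only a sketch: the variable names and the money helper are hypothetical and not part of any YAML binding.

```python
from decimal import Decimal

# Hypothetical in-memory form: '%' maps become dicts, '@' lists become
# lists, and every scalar stays a text string.
invoice = {
    "delivery": {"method": "UZS Express Overnight", "price": "$45.50"},
    "product": [
        {"desc": "Grade A, Leather Hide Basketball", "id": "BL394D",
         "price": "$450.00", "quantity": "4"},
        {"desc": "Super Hoop (tm)", "id": "BL4438H",
         "price": "$2,392.00", "quantity": "1"},
    ],
    "tax": "$0.00",
    "total": "$4237.50",
}


def money(text):
    """Convert a text scalar such as '$2,392.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


# The application, not YAML, gives the text scalars a numeric meaning.
subtotal = sum(money(p["price"]) * int(p["quantity"])
               for p in invoice["product"])
computed_total = (subtotal
                  + money(invoice["delivery"]["price"])
                  + money(invoice["tax"]))
```

No document object is involved; the application walks ordinary dictionaries and lists.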

1.4 Relation to XML

There are many differences between YAML and the eXtensible Markup Language ("XML").  XML was designed to be backwards compatible with Standard Generalized Markup Language ("SGML") and thus had many design constraints placed on it that YAML does not share. Also XML, inheriting SGML's legacy, is designed to support structured documents, where YAML is more closely targeted at messaging with direct support for the native data structures of modern programming languages. Further, XML is a pioneer in many domains and YAML has been grown on the lessons learned by the XML community. These points aside, there are many differences.

The YAML and XML information models are starkly different. In XML, the primary construct is an attributed tree, where each element has an ordered, named list of children and an unordered mapping of names to strings. In YAML, the primary hierarchical construct alternates between an anonymous list, named map, and scalar values. This difference is critical since YAML's model is directly supported by native data structures in most modern programming languages, where XML's model requires mapping, conventions, or an alternative programming component, a document object model.

The YAML and XML syntaxes vary significantly. In XML, tags are used to denote the beginning and end of an element. In YAML, scope is determined by line indentation. Where YAML builds upon RFC 822 and MIME for its declaration section and support for binary and/or large scalar values, XML has its own processing instruction mechanism and relies upon layered technologies for support for binary leaves. Furthermore, YAML has a simple whitespace policy, where XML's whitespace policy is completely configurable.

1.5 Terminology

The terminology used to describe YAML is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of a YAML processor:

may
Conformant YAML texts and processors are permitted to but need not behave as described.
must
Conformant YAML texts and processors are required to behave as described, otherwise they are in error.
error
A violation of the rules of this specification; results are undefined. Conforming software may detect and report an error and may recover from it.
fatal error
An error which a conforming YAML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application.

2 Serialization

A YAML document can reside in many different forms as long as it complies with the information model below. This includes a sequence of bytes in memory, on disk, or arriving via a network socket as much as it includes events from a sequential interface or a language specific in-memory representation. After covering the information model, this section focuses on the serialized representation.

2.1 Information Model

The information model for YAML is a directed acyclic graph having map, list, and scalar nodes. These data structures are directly supported in modern programming languages such as Python, Perl, Java, and C++.

Since the information model is a graph, a separate serialization model is required. The serialization model adds an additional node type, the reference, to record subsequent occurrences of a given node within a sequence. This enables a more compact notation so that duplicate occurrences of a given node need not be serialized more than once.

document
An ordered sequence of anonymous map nodes.
node
A YAML node can be one of four types: list, text scalar, binary scalar, map. Every node may optionally have a class.
list
An ordered sequence of zero or more nodes. Nodes are included by reference, thus they may be part of more than one list or map.
binary scalar
An ordered sequence of zero or more bits, where bits are an atomic construct able to have a value of one or zero.
text scalar
A sequence of zero or more characters.
map
An unordered sequence of zero or more (text, node) tuples where the text (key) is unique within the sequence.
class
An opaque object that has a serialization which complies with the class production.

The serialization model adds an anchor attribute to every node, and introduces the reference node. The reference node advertises an anchor to indicate the repetition of a node previously encountered.

anchor
An additional attribute added to each node that provides for identification of the node within a given node sequence. Only nodes which could be referenced further in the sequence must be given an anchor. It is not necessary that the anchor be unique.
reference
An additional node type consisting of an anchor which is used to indicate the repetition of the most recently encountered node with the same anchor.

In the serialization model, it is important that each node is serialized exactly once. If a node appears more than once in the graph, only the first occurrence of the node should be serialized. All remaining occurrences of this node should be represented with reference nodes. In this scheme, if a YAML document is loaded into a random access representation, then the reference nodes and anchor indicators should not be available as the non-serialized information model should be used. Also note that anchors can repeat to allow for concatenation, although only the most recent node with a given anchor may be referenced.
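
The rule that a shared node is serialized once and then referenced can be sketched in Python as a walk over native dict and list objects. This is an illustration under stated assumptions, not a prescribed algorithm: the event tuples, the function names, and the choice to give anchors only to maps and lists that occur more than once are all hypothetical.

```python
def count_occurrences(node, counts):
    """First pass: count how often each map or list object is reached."""
    if isinstance(node, (dict, list)):
        counts[id(node)] = counts.get(id(node), 0) + 1
        if counts[id(node)] > 1:
            return  # children were already counted on the first visit
        children = node.values() if isinstance(node, dict) else node
        for child in children:
            count_occurrences(child, counts)


def to_events(root):
    """Second pass: emit (type, key, anchor, text) tuples in document order."""
    counts = {}
    count_occurrences(root, counts)
    anchors = {}  # id(node) -> anchor string, set when first serialized
    events = []

    def visit(node, key=None):
        if not isinstance(node, (dict, list)):
            events.append(("scalar", key, None, node))
            return
        if id(node) in anchors:
            # Already serialized once: emit a reference node instead.
            events.append(("reference", key, anchors[id(node)], None))
            return
        anchor = None
        if counts[id(node)] > 1:
            anchor = str(len(anchors) + 1)
            anchors[id(node)] = anchor
        if isinstance(node, dict):
            events.append(("map", key, anchor, None))
            items = list(node.items())
        else:
            events.append(("list", key, anchor, None))
            items = [(None, child) for child in node]
        for child_key, child in items:
            visit(child, child_key)
        events.append(("end", None, None, None))

    visit(root)
    return events


# The same address map is reachable from two keys.
address = {"city": "Royal Oak"}
events = to_events({"bill to": address, "ship to": address})
```

Here the second occurrence of the address yields a single reference event carrying the anchor assigned at the first occurrence.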

2.2 Characters

Characters are the basis for a serialized version of a YAML document. Below is a general definition of a character followed by several characters which have specific meaning in particular contexts.

2.2.1 Character Set

By default, serialized YAML uses a subset of the ISO/IEC 10646 character set.
[01] char ::= #x9 | #xA | [#x20-#xD7FF]
| [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* a single Unicode character, including the space and new line characters */

Due to the end of line normalization rules, the carriage return (#xD) is not included. As with standard practice, the surrogate block, FFFE and FFFF are excluded.

In the information model binary scalar values do not comply with the above character set. Such values are taken to contain arbitrary non-textual data and therefore are not concerned with character set issues.
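
The char production [01] translates directly into a range check. The following Python sketch assumes the input is a single-character string; the function name is hypothetical.

```python
def is_yaml_char(ch):
    """Return True if ch falls within the ranges of production [01]."""
    cp = ord(ch)
    return (cp in (0x9, 0xA)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)
```

A processor could apply such a check after end-of-line normalization, since the carriage return is rejected by it.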

2.2.2 Encoding

A YAML processor is required to support both UTF-16 and UTF-8 character encodings. If an input stream begins with a byte order mark, then the initial character encoding shall be UTF-16. Otherwise, the initial encoding shall be UTF-8.
[02] bom ::= #xFEFF /* the Unicode ZERO WIDTH NON-BREAKING SPACE character used to mark a UTF-16 stream and determine byte ordering. */

If the stream begins with #xFFFE, then the byte order of the input stream must be swapped when reading. For more information on the byte order mark see the Unicode FAQ.
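
A minimal Python sketch of the detection rule follows. It assumes the byte order is deduced from the first two raw bytes of the stream (FE FF for big-endian, FF FE for little-endian); the function names are hypothetical.

```python
def detect_encoding(data: bytes) -> str:
    """Pick the initial encoding from an optional byte order mark."""
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    return "utf-8"  # no byte order mark: UTF-8 is the default


def decode_stream(data: bytes) -> str:
    """Decode the stream and drop the byte order mark, if one was present."""
    text = data.decode(detect_encoding(data))
    return text[1:] if text.startswith("\ufeff") else text
```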

2.2.3 End-of-Line Normalization

On input and before parsing, a compliant YAML parser must translate both the two-character sequence #xD #xA (CR LF) and any #xD (CR) which is not followed by a #xA (LF) into a single #xA (LF) character. This allows for the definition of an end-of-line marker.
[03] eol ::= #xA /* a normalized end of line marker */
[04] blank ::= eol eol /* a blank line */

On output, a YAML emitter is free to serialize end of line markers using whatever convention is most appropriate. For Internet mail, CRLF is the preferred form.
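
The input normalization rule can be sketched in Python as two replacements applied in order, so that a CR LF pair becomes one LF rather than two. The function name is hypothetical.

```python
def normalize_eol(text):
    """Translate CR LF pairs and lone CR characters into a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```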

2.2.4 Indicators

Indicators are special characters which are used to describe the structure of a YAML document.

[05] imap ::= '%' /* indicates a map node */
[06] ilist ::= '@' /* indicates a list node */
[07] iquote ::= '"' /* indicates a quoted string */
[08] ikey ::= ':' /* separates a key from a node */
[09] iblk ::= '|' /* indicates a block scalar */
[10] iblkend ::= '\' /* indicates end of a block */
[11] iescape ::= '\' /* indicates an escape sequence */
[12] iref ::= '*' /* indicates a reference node */
[13] iclass ::= '!' /* indicates a class attribute */
[14] inull ::= '~' /* indicates a null node */
[15] ianchor ::= '&' /* indicates an anchor attribute */
[16] ibin ::= '[' /* indicates begin of binary scalar */
[17] ibinend ::= ']' /* indicates end of binary scalar */
[18] ireserved ::= '^' | '#' | ';'
| '`' | ',' | '.'
| '(' | ')' | '{' | '}'
/* reserved */
[19] indicator ::= imap | ilist | iquote | ikey
| iblk | iblkend | iescape | iref
| iclass | ianchor | ibin | ibinend
| ireserved
/* indicator characters */

2.2.5 Escape Sequences

Escape sequences are used to denote significant whitespace, specify characters by a hexadecimal value, and produce the literal quote and escape indicators.

[20] eescape ::= iescape iescape /* escape literal */
[21] equote ::= iescape iquote /* quote literal */
[22] edquote ::= iquote iquote /* quote literal */
[23] ebel ::= iescape 'a' /* ASCII alert (BEL) */
[24] ebs ::= iescape 'b' /* ASCII backspace (BS) */
[25] eesc ::= iescape 'e' /* ASCII escape (ESC) */
[26] eff ::= iescape 'f' /* ASCII formfeed (FF) */
[27] eeol ::= iescape 'n' /* ASCII linefeed (LF) */
[28] eret ::= iescape 'r' /* ASCII carriage return (CR) */
[29] etab ::= iescape 't' /* ASCII horizontal tab (TAB) */
[30] evtab ::= iescape 'v' /* ASCII vertical tab (VTAB) */
[31] espace ::= iescape #x20 /* space */
[32] ex2 ::= iescape 'x' hex hex /* 8-bit character */
[33] eu4 ::= iescape 'u' hex hex hex hex /* 16-bit character */
[34] eu8 ::= iescape 'U'
    hex hex hex hex
    hex hex hex hex
/* 32-bit character */
[35] escape ::= eescape | equote | edquote
| eeol | etab | espace
| ex2 | eu4 | eu8
/* escape sequences */
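
As an illustration of how the escape productions might be decoded, the Python sketch below expands the escapes found in the content of a quoted string. It is an assumption-laden sketch, not a normative decoder: the function and table names are hypothetical, it accepts every escape defined in productions [20] through [34], and it does not handle whitespace folding or the trailing \ continuation marker.

```python
import re

# Single-character escapes: the character after '\' and its replacement.
SIMPLE = {"\\": "\\", '"': '"', "a": "\a", "b": "\b", "e": "\x1b",
          "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", " ": " "}
# Hexadecimal escapes: the introducing letter and the number of digits.
HEX_LENGTH = {"x": 2, "u": 4, "U": 8}


def unescape(text):
    """Expand escape sequences in the content of a quoted string."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"' and text[i + 1:i + 2] == '"':
            out.append('"')  # doubled quote (edquote)
            i += 2
        elif ch != "\\":
            out.append(ch)
            i += 1
        else:
            code = text[i + 1:i + 2]
            if code in SIMPLE:
                out.append(SIMPLE[code])
                i += 2
            elif code in HEX_LENGTH:
                n = HEX_LENGTH[code]
                digits = text[i + 2:i + 2 + n]
                # The hex production [46] allows 0-9 and A-F only.
                if not re.fullmatch("[0-9A-F]{%d}" % n, digits):
                    raise ValueError("invalid hexadecimal escape")
                out.append(chr(int(digits, 16)))
                i += 2 + n
            else:
                raise ValueError("invalid escape sequence")
    return "".join(out)
```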

2.2.6 Miscellaneous Characters

This section includes several common character range definitions.

[36] lwsp ::= #x20 | #x9 /* linear whitespace, the space or tab character. */
[37] wsp ::= lwsp | eol /* a single whitespace character, including the space, tab, new line. Excluding the carriage return. */
[38] pchr ::= char - wsp /* non-whitespace characters */
[39] lchr ::= pchr | lwsp /* printables and linear whitespace */
[40] qchr ::= ( ( pchr - iquote ) - iescape ) /* printables less the quote and escape character. */
[41] pqchr ::= ( ( pchr - iquote ) - iescape ) /* printables less the quote and escape character. */
[42] ichr ::= ( pchr - indicator ) /* printables less the indicator characters. */
[43] alpha ::= [#x41-#x5a] | [#x61-#x7a] /* ASCII alphabetic character, a-z and A-Z. */
[44] number ::= [#x30-#x39] /* ASCII numeric character, 0-9 */
[45] alphanum ::= alpha | number /* ASCII alpha numeric character */
[46] hex ::= number | [#x41-#x46] /* one hexadecimal digit 0-9 or A-F */
[47] bchr ::= number | alpha | '+' | '/' | '=' /* chars for base64 encoding */

2.3 Strings

At a higher level of abstraction are sequences of characters, or strings. This section describes whitespace folding and indentation policies, as well as quoted and raw strings.

2.3.1 Indentation

In a YAML serialization, structure is determined from indentation, where indentation is defined as an end of line marker followed by zero or more spaces or tabs. Indentation level is defined recursively.

[48] tab(0) ::= eol /* the first level of indentation is a single end of line marker (new line). */
[49] tab(n) ::= tab(n-1) lwsp+ /* the previous indentation setting plus one or more spaces or tabs. */

It should be noted that this production does not imply that tab(3) is a constant throughout the entire serialization, as the particular indentation style may change over time. Specifically, an indentation level tab(n) is set with the first subordinate line of a node at tab(n-1). All subordinate lines must share this same indentation setting. This allows indentation to vary over the serialization as long as it remains constant for a given node's content.

Since the YAML serialization depends upon indentation level to delineate blocks, additional productions are a function of an integer, based on the tab(n) production above. In all such cases, production(n) stands for a construct which is itself indented at tab(n) and whose subordinate lines are indented at tab(n+1).
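
One way to track indentation levels as defined above is a stack of leading-whitespace strings, sketched here in Python. The function name is hypothetical, and the sketch treats every line as the start of a node, ignoring multi-line scalars.

```python
def indent_levels(lines):
    """Return the indentation level n of each line, per tab(n)."""
    stack = [""]  # tab(0): no whitespace after the end of line marker
    levels = []
    for line in lines:
        ws = line[:len(line) - len(line.lstrip(" \t"))]
        if ws != stack[-1] and ws.startswith(stack[-1]):
            # First subordinate line: it sets tab(n) for this node.
            stack.append(ws)
        else:
            # Same level or a return to an enclosing level.
            while stack and stack[-1] != ws:
                stack.pop()
            if not stack:
                raise ValueError("indentation matches no enclosing level")
        levels.append(len(stack) - 1)
    return levels
```

Because each level is compared as a string, a node's children may be indented by any amount, provided all of them share the same setting.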

2.3.2 Whitespace Folding

Since a YAML serialization uses whitespace for indentation, in many cases the parser must condense sequences of one or more whitespace characters into a single space (#x20). When this functionality is implied, the lfold production below will be used.

[50] lfold ::= lwsp+ /* linear folded whitespace */

With this definition, there are two sorts of folded strings which are useful to describe.

[51] fstr ::= pchr+ ( lfold pchr+ )* /* folded string */
[52] ifstr ::= ichr+ ( lfold ichr+ )* /* indicator free, folded string */

A folded string starts with a printable which is not an indicator. It is then followed by a sequence of printable words with intermediate folded whitespace. At parse time, the folded space must be condensed into a single space (#x20) character.
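
Condensing folded whitespace per production [50] amounts to one substitution. A minimal Python sketch, with a hypothetical function name:

```python
import re


def fold(text):
    """Condense each run of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text)
```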

2.3.3 Quoted String

A quoted string is a mechanism to treat a sequence of characters as a single unit. Within a quoted string, indicators (with the exception of " and \) can be used without worry and escape sequences can be used to introduce significant whitespace which would otherwise be subject to folding.

A quoted string begins and ends with a quote character. It can extend for as many lines as necessary, although an editor or emitter is free to re-break and indent a quoted string as needed to maintain readability.

A line break in a multi-line quoted string is considered to be white space for the purpose of folding, unless prefixed by a \ in which case the line break, and the indentation following it, are completely ignored. This allows a quoted string to be broken into multiple lines at arbitrary positions.

A quoted string can't contain two consecutive line breaks without an intermediate printable or escaped character. Also, a quoted string cannot contain an un-escaped quote or an invalid escape sequence.

[53] qstr(n) ::= iquote
( qchr | escape | lfold | qline(n+1) )*
iquote
/* quoted string */
[54] qline(n) ::= iescape? tab(n) ( qchr | escape ) /* line break in a quoted string */

2.3.4 Anchor String

An anchor string is a sequence of numeric digits used when referencing a node previously visited in the document stream.

[55] astr ::= number+ /* any sequence of digits 0-9. */

2.3.5 Class String

Each node can be adorned with a class string. There are three types of class strings: local, qualified, and built-in.

[56] cstr ::= cloc | cqual | cbltn /* a local, qualified or built-in class name. */

Local class strings, which are unique within a controlled context, are a sequence of printable characters excluding the period. These strings may not be globally unique and should only be used locally or for experimentation where data exchange is not likely.

[57] cloc ::= ichr (pchr - '.' )* /* a local class string */

Qualified class strings are globally unique since they are serialized with a reverse domain name format similar to "com.clarkevans.timesheet". The definition of a qualified class is determined by the domain name holder.

[58] cqual ::= dnsrev /* a reverse dns qualified class string */

Built-in class strings, reserved for YAML usage, are also globally unique and begin with a period. Core classes include real, designating an imprecise floating point value, and int, designating a precise integer value.

[59] cbltn ::= '.' dnsseg /* a built-in class name */

The above definitions depend upon domain name strings. These are defined below.

[60] dnsrev ::= dnstop ( '.' dnsseg ) * /* a reverse dns name */
[61] dnstop ::= alpha
| ( alpha ( alphanum | '-' )* alphanum )
/* a top level domain, like "com", "org", ... */
[62] dnsseg ::= alphanum
| ( alphanum ( alphanum | '-' )* alphanum )
/* a domain name segment */
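
The three kinds of class string can be told apart with regular expressions built from the productions above. The Python sketch below rests on one assumption not stated in the grammar: since dnsrev also matches a bare name such as "local", a string is treated as qualified only if it contains at least one period. The names are hypothetical.

```python
import re

DNSSEG = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
DNSTOP = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
BUILT_IN = re.compile(r"\." + DNSSEG)
QUALIFIED = re.compile(DNSTOP + r"(?:\." + DNSSEG + r")+")
# Indicator characters listed in production [19].
INDICATORS = set('%@":|\\*!&[]^#;`,.(){}')


def classify(cstr):
    """Return 'built-in', 'qualified' or 'local' for a class string."""
    if BUILT_IN.fullmatch(cstr):
        return "built-in"
    if QUALIFIED.fullmatch(cstr):
        return "qualified"
    if (cstr and cstr[0] not in INDICATORS and "." not in cstr
            and not any(c in " \t\n" for c in cstr)):
        return "local"
    raise ValueError("not a valid class string: %r" % cstr)
```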

2.4 Document

A serialized object is a YAML document if, taken as a whole, it complies with the following production.

[63] document ::= bom? ( pair(0) ( tab(0) pair(0) )* )?
( blank ( pair(0) ( tab(0) pair(0) )* )? )*
eol
/* a byte order mark followed by a sequence of maps separated by a blank line. */

2.4.1 Node

A node begins at a particular level of indentation, n, and its content is indented at a level n+1. A node can either be a scalar, map, list, or reference.

[64] node(n) ::= ( ( color lwsp+ )?
( map(n) | list(n)
| scalar(n) | ref ) )
| color? tab(n+1) scalar(n+1)
/* an optional node color followed by either a map, list, scalar, or reference */
[65] color ::= anchor | class
| ( anchor lwsp+ class )
| ( class lwsp+ anchor )
/* either an anchor, or class, or both in any order */

2.4.2 Class

YAML indirectly supports the serialization of data types and object class names with a class attribute on each node. If this attribute is provided, each language binding may use it to dress the node appropriately; a binding that cannot do so must issue a warning and treat the data as if the class attribute had not been provided.

[66] class ::= iclass cstr /* associates a class string with a given node */
built-in:  !.real              23.42
local:     !local              Local class name.
non-class:                     Free of class.
qualified: !com.clarkevans.ts  Qualified class name.

2.4.3 Reference

An anchor is an indicator which can be used to color a node, giving it a sequential numeric identifier. The reference node type can then be used to indicate additional inclusions of an anchored node. The anchor string of a reference refers to the most recent node having the equivalent anchor string. Two anchor strings are equivalent if they are identical after removal of any leading 0 characters.

It is an error to have a reference use an anchor string which does not occur previously in the serialization.

[67] anchor ::= ianchor astr /* associates an anchor string with a given node for further reference */
[68] ref ::= iref astr /* a reference node */

anchor:    &0001         This scalar has an anchor.
clsanc:    !cls &0002    Has a local class and an anchor.
non-ref:                 Next node refers to the previous.
reference: *02
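
Anchor equivalence and the "most recent node" rule could be handled with a small lookup table, as in this Python sketch. The class and function names are hypothetical.

```python
import re


def canonical_anchor(astr):
    """Strip leading zeros so that equivalent anchor strings compare equal."""
    if not re.fullmatch(r"[0-9]+", astr):
        raise ValueError("an anchor string is a sequence of digits 0-9")
    return astr.lstrip("0")


class AnchorTable:
    """Remembers the most recent node seen for each anchor."""

    def __init__(self):
        self._nodes = {}

    def define(self, astr, node):
        # A repeated anchor simply replaces the earlier node.
        self._nodes[canonical_anchor(astr)] = node

    def resolve(self, astr):
        key = canonical_anchor(astr)
        if key not in self._nodes:
            raise ValueError("reference to an anchor not yet defined")
        return self._nodes[key]
```

With this table, the reference *02 in the example above resolves to the node anchored as &0002.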

2.4.4 List

A list is the simplest form of node: a sequence of nodes at a higher indentation.

Scalar list entries may be prefixed by the optional indicator : to improve readability.

[69] list(n) ::= ilist
( tab(n+1) ( node(n+1) | lnode(n+1) ) )*
/* a list of zero or more indented nodes. */
[70] lnode(n) ::= ikey ( lwsp+ color )?
  ( lwsp+ scalar(n)
  | tab(n+1) scalar(n+1) )
/* a prefixed scalar list node */
list: @
  First item in top list
  @
    Subordinate list
  @
  Above list is empty
  : A multi line entry
    with the optional prefix
  Sixth item in top list

2.4.5 Map

A map is an association of unique keys with values, where a key is either a quoted or a folded string.
[71] map(n) ::= imap ( tab(n) pair(n) )* /* a map indicator, followed by a list of map items */
[72] pair(n) ::= key(n) lwsp* ikey
lwsp+ node(n)
/* a key/node map pair indented appropriately. */
[73] key(n) ::= ifstr | qstr(n) /* a single line folded string, or a quoted string */

Within a given map there is the further restriction that two folded key values cannot be identical. Also, the canonical serialization orders the keys alphabetically.

map: %
   first : First entry
   second: %
     key: Subordinate map!
   third item: @
     Subordinate list
     %
     Previous map is empty.
   "@": This key had to be quoted.
   "This is
    a multi-line
    key" :
      Whose value is in the next line.
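
The uniqueness restriction and the canonical key order can be sketched in Python as follows. The function name is hypothetical, and "alphabetically" is assumed here to mean ordering by character code, which the text does not state.

```python
import re


def canonical_pairs(pairs):
    """Fold each key, reject duplicates, and return pairs sorted by key."""
    folded = {}
    for key, value in pairs:
        # Line breaks inside a multi-line key count as whitespace.
        folded_key = re.sub(r"[ \t\n]+", " ", key)
        if folded_key in folded:
            raise ValueError("duplicate key after folding: %r" % folded_key)
        folded[folded_key] = value
    return sorted(folded.items(), key=lambda pair: pair[0])
```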

2.5 Scalars

While most of the document productions above are fairly strict, the scalar production is generous. It offers four in-line styles of expressing scalar values depending upon the readability requirements. The table below describes the various styles.

Style   Folded?  Escaped?  Information Model
Simple  Yes      No        Text
Block   No       No        Text
Quoted  Yes      Yes       Text
Binary  Yes      No        Binary

Note that if a YAML document is loaded into a random access representation, then the representation may ignore the distinction between the three text scalar styles, but must distinguish between a text scalar and a binary scalar.

[74] scalar(n) ::= simple(n) | block(n)
| quoted(n) | binary(n)
/* in-line scalar nodes */

2.5.1 Simple Scalar

For this case, each line is indented, whitespace is folded (line breaks always count as white space), escaping is not available, and the content must not start with an indicator. To delineate the end of this scalar, indentation is employed.

[75] simple(n) ::= ichr ( fstr? tab(n+1) pchr )* fstr /* one or more indented, non-escaped, folded characters */
first: This is a one line scalar.
second:
    <html><head><title>Embedded HTML!</title></head>
    <body><p class="none">This can even
    have embedded HTML since there is no
    escaping, and since < (the starting character)
    is not an indicator!
    </p></body>
third: Indicators like @ : % are allowed, as
       well as quotes, as long as the first
       character is not an indicator.  Further,
       whitespace     is   folded.
fourth: @
    A single line entry.
    The second, multi-line,
        entry of the list.
    : The third list
      entry, with the
      optional : prefix.

2.5.2 Block Scalar

This mechanism supports source code or other text values with significant use of indicators, quoted escaping, or significant whitespace. To retain readability for this case, the block scalar allows raw text strings to be used in an indented fashion where each indented line is preceded by a block indicator. All whitespace is significant up to and including the end of line marker. To end the block, a block end indicator is used and any characters on this line are also included, up to and not including the end of line marker.

[76] block(n) ::= ( iblk lchr* )?
( tab(n+1) iblk lchr* )*
iblkend lchr*
/* an indented character block */
first: |This is a block scalar,    with significant
       |whitespace, and the use of " @, etc.
       |The \ is used to signify the end of the block.
       |     Whitespace    *is*    significant.
       \
second:
   \No leading nor trailing new line.
third: @
   |
   |First list item which has a
   |leading and trailing new line.
   \
   |Second list item. Does not have
   |leading nor trailing new line.
   |Does have three new lines in the
   \middle.
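
The way the | and \ indicators determine the content of a block scalar can be sketched in Python. The sketch assumes the caller has already removed the indentation, so that every line begins with its indicator; the function name is hypothetical.

```python
def read_block(lines):
    """Assemble a block scalar from lines that start with '|' or '\\'."""
    parts = []
    for line in lines:
        indicator, content = line[:1], line[1:]
        if indicator == "|":
            # The end of line marker is part of the content.
            parts.append(content + "\n")
        elif indicator == "\\":
            # The block end line contributes its text without a new line.
            parts.append(content)
            return "".join(parts)
        else:
            raise ValueError("expected a block or block end indicator")
    raise ValueError("block scalar is missing its end indicator")
```

Applied to the lines of the second list item in the example above, this yields text with three new lines in the middle and none at either end.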

2.5.3 Quoted Scalar

A quoted scalar uses quoted strings. With this mechanism, the indicators (excepting the quote and backslash) can be used freely, whitespace is folded (line breaks count as white space unless escaped), and escape sequences can be used to indicate significant whitespace, non-printable characters, quotes and backslashes.

[77] quoted(n) ::= qstr(n) /* a quoted scalar value. */
first: "Quoted scalar.\nWith a new line."
second: @
  "Spaces are folded so this
   new line is not significant."
  "Furthermore indicators such
   as @ # : can be added."
  "Escape sequences can be used
   to remove or \ add white\
   space, \\ and \" must be escaped."
  "The subordinate lines additional
      indentation may be more than a single
      space, and a line may be brok\
      en in mid-word."
  "This was a list of five quoted
   scalars!"

2.5.4 Binary Scalar

A binary scalar uses the base64 encoding. Base64 strings are a sequence of base64 characters intermixed with whitespace which is ignored by the base64 processor. An additional constraint is that no more than 76 base64 characters are allowed within each line.

[78] binary(n) ::= ibin
( bchr | lwsp | bline(n+1) )*
ibinend
/* a base64 encoded binary string */
[79] bline(n) ::= tab(n) bchr /* line break in a binary scalar */
binary: [R0lGODlhDAAMAIQAAP//9/X17unp5WZmZgAAAO
         fn515eXvPz7Y6OjuDg 4J+fn5OTk6enp56enmlp
         aWNjY6Ojo4SEhP/++f/++f/++f/++f/++f/++f
         /++f/++f/++f/++f/+ +f/++f/++f/++SH+Dk1h
         ZGUgd2l0aCBHSU1QACwAAAAADAAMAAAFLCAgjo
         EwnuNAFOhpEMTRiggc z4BNJHrv/zCFcLiwMWYN
         G84BwwEeECcgggoBADs=]
desc:
  The binary value above is a tiny arrow encoded
  as a gif image.
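As a rough illustration of the two rules stated for binary scalars (whitespace is ignored, and no line carries more than 76 base64 characters), the following Python sketch decodes the text found between the binary indicators. The function name decode_binary_scalar and the choice to raise ValueError are assumptions made for the example; they are not part of this specification.

```python
import base64

MAX_LINE = 76  # limit on base64 characters per line, as stated in the text


def decode_binary_scalar(text):
    """Decode the body of a binary scalar (the text between the indicators)."""
    for number, line in enumerate(text.splitlines(), start=1):
        # Count only non-whitespace characters; whitespace is ignored.
        count = sum(1 for ch in line if not ch.isspace())
        if count > MAX_LINE:
            raise ValueError(f"line {number} has {count} base64 characters")
    # Remove all whitespace before handing the data to the decoder.
    compact = "".join(text.split())
    return base64.b64decode(compact, validate=True)
```

A call such as decode_binary_scalar("R0lG\n   ODlh") returns the six bytes of a GIF header, regardless of how the characters are spread over lines.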

3 Sequential Processing

There are two basic types of interfaces used for sequential processing: pull and push. The difference lies primarily in who has primary control of the process, the consumer or the producer. With a push interface, the producer has primary control; with a pull interface, it is the consumer who drives the process. To give the application the most control, YAML's parser has a pull interface and YAML's emitter has a push interface. An application which pulls from the parser and pushes to the emitter therefore retains primary control of the process.

3.1 Parser Interface

The parser or iterator interface provides sequential access to a YAML document where the flow control is dictated by the consumer. The parser interface uses the serialization model, handling duplicate nodes within the graph through a reference. The parser interface has several components.

cursor A structure used to hold the current node, including the node's type (scalar, list, map, or reference), the node's class, the node's anchor, and, if the node is part of a map, the key associated with the node.
open() A function which returns the document's top level list cursor. This method may require an input stream such as a string buffer, a file object, or an open socket.
first() A function which takes a map or list cursor and returns a cursor for the first child within the sequence, if any. It is an error to call this function on a cursor currently holding a reference or scalar.
next() A method which advances the cursor passed to the next node within the current map or list. If there is no subsequent node within the sequence, then this function must free the cursor and should indicate that there are no further siblings.
close() A method which can be called on the top level cursor to stop iteration and invalidate all subordinate cursors.
read() A method which can be called on a scalar cursor to read an arbitrarily sized chunk of its value. A scalar's value can only be read once, but multiple calls to this function may be required.
deref() An optional method which can be called on a cursor with a reference node to return a new cursor for the node referenced. This method need only be supported by those iterators over a random access input stream.

As part of this interface, each cursor returned from the parser must remain valid until close() is called, or next() is called on the cursor, or next() is called on an ancestral cursor. Note that this implies that N cursors may be open at any given time, one cursor for each map or list in a given lineage. It is the parser's task to make sure that each cursor points to valid information.
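To make the pull style concrete, here is a minimal Python sketch of a cursor that iterates over an already loaded structure of nested lists, dictionaries, and strings. It is not the "C" API: the names Cursor, open_document, and walk are hypothetical, classes, anchors, references, close(), and deref() are omitted, and scalars are assumed to be plain strings.

```python
class Cursor:
    """Holds the current node and, inside a map, the key for that node."""

    def __init__(self, node, siblings=(), key=None):
        self.node = node
        self.key = key
        self._siblings = iter(siblings)  # remaining (key, node) pairs
        self._offset = 0                 # read position within a scalar

    @property
    def type(self):
        if isinstance(self.node, dict):
            return "map"
        if isinstance(self.node, list):
            return "list"
        return "scalar"

    def first(self):
        """Return a cursor for the first child, or None if there is none."""
        if self.type == "scalar":
            raise TypeError("first() called on a scalar cursor")
        if self.type == "map":
            pairs = self.node.items()
        else:
            pairs = ((None, item) for item in self.node)
        return Cursor(None, pairs).next()

    def next(self):
        """Advance in place to the next sibling; None means no more siblings."""
        try:
            self.key, self.node = next(self._siblings)
        except StopIteration:
            return None
        self._offset = 0
        return self

    def read(self, size):
        """Return the next chunk of a scalar's value ('' once exhausted)."""
        if self.type != "scalar":
            raise TypeError("read() called on a non-scalar cursor")
        chunk = self.node[self._offset:self._offset + size]
        self._offset += len(chunk)
        return chunk


def open_document(data):
    """Return the top level cursor for an in-memory document."""
    return Cursor(data)


def walk(cursor, depth=0, out=None):
    """Consumer-driven traversal collecting (depth, key, value-or-type)."""
    out = [] if out is None else out
    while cursor is not None:
        if cursor.type == "scalar":
            chunks = []
            while True:
                chunk = cursor.read(4)  # several calls may be required
                if not chunk:
                    break
                chunks.append(chunk)
            out.append((depth, cursor.key, "".join(chunks)))
        else:
            out.append((depth, cursor.key, cursor.type))
            walk(cursor.first(), depth + 1, out)
        cursor = cursor.next()
    return out
```

In walk() the consumer decides when to descend, when to advance, and how large each read is, and one cursor is alive per map or list on the current lineage, which matches the N cursors noted above.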

3.2 Emitter Interface

The emitter or visitor interface provides sequential access to a YAML document where the flow control is dictated by the producer. The emitter interface uses the serialization model, handling duplicate nodes within the graph through a reference. The emitter interface has several components.

cursor The emitter may require a cursor for its bookkeeping which is opaque.
start() A function which must be called to start the emitter. This function returns a cursor to be used for subsequent operations.
begin() A function to be called when a map, list, or scalar node is to be started. This function is passed the current cursor and returns a cursor to be used for subordinate operations. It also takes the node's type, a class, an anchor, and possibly a key if the node is in the context of a map.
write() A method which can be called multiple times within the context of a scalar to provide the scalar's character value.
ref() An optional method which can be called on a map or list cursor to add a reference to the sequence of children. This method takes the current cursor and an anchor as parameters. If this method is not provided, the caller must recursively visit the referenced node.
end() A method called on a cursor to indicate that the node in question is no longer in scope. The cursor passed should no longer be used.
finish() A method which must be called on the top level cursor to indicate that the document stream has finished.

As part of this interface, a cursor is deemed valid until end() has been called on it. It is an error to call end() on a cursor when a subordinate cursor is still valid. This implies that the emitter need not use N cursors, and that one cursor which is re-used with a depth indicator will suffice.
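The push style can be sketched in Python along the same lines. The names EmitCursor and TraceEmitter are hypothetical, the class and anchor parameters and the ref() method are left out, and the output is an indented trace chosen for the example rather than YAML syntax. The sketch uses the single re-used cursor with a depth indicator mentioned above.

```python
class EmitCursor:
    """Opaque bookkeeping: one cursor re-used with a depth counter."""

    def __init__(self):
        self.depth = 0
        self.lines = []
        self.pending = None  # chunks of the scalar currently being written


class TraceEmitter:
    def start(self):
        """Start the emitter and return the cursor for later calls."""
        return EmitCursor()

    def begin(self, cursor, node_type, key=None):
        """Open a map, list, or scalar node under the current cursor."""
        label = node_type if key is None else f"{key}: {node_type}"
        cursor.lines.append("  " * cursor.depth + label)
        cursor.depth += 1
        cursor.pending = [] if node_type == "scalar" else None
        return cursor

    def write(self, cursor, text):
        """Supply part of a scalar's value; may be called several times."""
        if cursor.pending is None:
            raise RuntimeError("write() outside the context of a scalar")
        cursor.pending.append(text)

    def end(self, cursor):
        """Close the innermost open node."""
        if cursor.depth == 0:
            raise RuntimeError("end() without a matching begin()")
        if cursor.pending is not None:
            cursor.lines[-1] += " " + repr("".join(cursor.pending))
            cursor.pending = None
        cursor.depth -= 1

    def finish(self, cursor):
        """Finish the document and return the accumulated trace."""
        if cursor.depth != 0:
            raise RuntimeError("finish() called with nodes still open")
        return "\n".join(cursor.lines)
```

Here the producer drives the process: it decides when each node begins and ends, and the emitter only reacts to the calls it receives.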

3.3 Interface Converter

In some cases it is necessary to convert a pull to a push interface, or a push to a pull interface. While converting from pull to push is relatively easy, converting from push to pull requires a multi-threaded approach or a large intermediate memory buffer. To help automate conversions, the YAML processing library includes both interface converters.
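The asymmetry between the two conversions can be illustrated with a small Python sketch. It does not show the converters of the YAML processing library; it models a pull source as an iterator of events and a push source as a callable that feeds a handler, and the names pull_to_push and push_to_pull are hypothetical. The push to pull direction uses the multi-threaded approach, with a bounded queue in place of a large intermediate buffer.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the pushed events


def pull_to_push(events, handler):
    """Drive a push-style handler from a pull-style iterator."""
    for event in events:
        handler(event)


def push_to_pull(producer, maxsize=1):
    """Expose a push-style producer as a pull-style iterator.

    `producer` is a callable that accepts a handler and pushes events to it.
    It runs in its own thread and blocks whenever the queue is full, so the
    consumer sets the pace.
    """
    channel = queue.Queue(maxsize=maxsize)

    def run():
        try:
            producer(channel.put)
        finally:
            # Always signal completion, even if the producer fails.
            channel.put(_DONE)

    threading.Thread(target=run, daemon=True).start()
    while True:
        event = channel.get()
        if event is _DONE:
            return
        yield event
```

The first function is a plain loop, while the second needs a second thread of control because the producer cannot be suspended between events in any other way short of buffering everything it emits. In this sketch an exception raised by the producer is not forwarded to the consumer; a real converter would need to handle that case.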

4 Language Bindings

The realization of the information model depends upon the programming language. Included in this specification are the official bindings for Python, Perl, and C. A language binding is not required to implement the sequential interface.

TODO: This section needs much work

4.1 "C" Language

Since the "C" programming language does not have dynamic lists or maps, this binding only implements the sequential interface.

4.2 Python

For the Python binding, unless specified otherwise by a class, scalar values are bound to the string type, lists are bound to the list type, and maps are bound to the dictionary type with the key represented as a string.

If a scalar has a real class, then the scalar must be bound as a floating point value if possible, otherwise a warning must be issued. Likewise, if a scalar has an int class, then the scalar must be bound to a normal integer, otherwise a warning must be issued. Conversions can fail if the scalar value is not parse-able or if the parsed value will not fit within the standard type.
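The binding of real and int scalars might look like the following sketch. The function name bind_scalar is hypothetical, and returning the original string after a failed conversion is an assumption, since the text only requires that a warning be issued. Classes other than real and int are left unconverted here.

```python
import warnings


def bind_scalar(text, cls=None):
    """Bind a scalar's text to a Python value according to its class."""
    converters = {"real": float, "int": int}
    convert = converters.get(cls)
    if convert is None:
        # No class, or a class not covered by this sketch: keep the string.
        return text
    try:
        return convert(text)
    except ValueError:
        warnings.warn(f"cannot bind {text!r} as {cls}")
        # Assumption: fall back to the unconverted string.
        return text
```

For example, bind_scalar("42", "int") gives the integer 42, while bind_scalar("abc", "int") issues a warning and gives back "abc".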

If any other class is encountered, then the results depend upon the node's type. First, the class identifier is used to construct a class with the same package name. Then, depending upon the type, initialization is carried out differently. If the node is a map, then the keys are assumed to be attributes, and the setattr method is used to initialize the object. If the node is a scalar, then repr is used. Otherwise, if the node is an array, then the additem method is used.

4.3 Perl

For the Perl binding...

4.4 Java

For the Java binding...

5 Changes From Other Versions

5.1 Changes From The 09 Jun 2001 Draft

Relationship with MIME
Beyond using base64 for binary scalars, no additional special relationship with MIME is expected. Hence references to the MIME and mail RFCs were moved from section 1.1 ("required reading") to section 1.2 ("background material").
Strict Indentation
Indentation is now completely strict for all scalar styles. Also, the productions were changed to use consistent semantics for the indentation level parameter.
List Scalar Prefixes
A list scalar entry may be prefixed by an optional : indicator to improve readability of multi-line simple scalar values.
Anchor Semantics
Leading zeros are now ignored for comparing anchor strings.
No Empty Line At Start
The document production was fixed so as not to require an empty line at the start of a document.
Character Escapes
The set of character escapes is now maximal (including the rare \e escape for the useful ASCII ESC character). Also, it is now possible to "escape" a line break in a quoted string (the previous drafts were inconsistent at this point).
32 Bit Characters
The current draft allows such characters, and includes a specialized escaping format ('\Uxxxxxxxx') to support them.

5.2 Changes From The 26 May 2001 Draft

Changes Section
The changes section was added for easier comparison of different versions. The final draft will not contain this section.
Class Indicator
The indicator was changed from # to ! to allow for # to be used for comments.
No Empty Line At End
The document production was fixed so as not to require an empty line at the end of a document.
Strict Indentation
Indentation in quoted strings and binary blocks is now strict to ensure readability.
Productions
Problems in the productions were fixed, especially where related to white space issues and formatting of the result.
BOM Comment
The link to the Unicode FAQ was moved to section 2.2.2.
Binary Scalars
The information model now distinguishes between text and binary scalars.

5.3 Probable Future Changes

API details
How the API handles the sequence of maps; push/pull issues; mapping to scripting languages of interest (Python, Perl); etc.
Color Idiom
It is possible to allow for schema evolution, attachment of comments, handling unknown classes and many other use cases by supporting the color idiom.
Comment Attribute
A textual comment may be attached to each node, similar to the class attribute. It is probably best to do this as part of the color idiom, but it could conceivably be added on its own.
Canonical Form
It may be useful to define a canonical form for a YAML document, completely specifying indentation, folding, and white space issues. There are already two references to such a form in the draft; if we don't define such a form these references should be removed.
32-bit Characters
The implications of supporting them need to be checked and proper wording added to the draft. Given any kind of choice, the effects on complying YAML processors should be minimized.
Reviewing Examples
Ensure there are enough examples. In particular, special syntax forms should be demonstrated to remove doubt in the interpretation of the productions.
Tidying the HTML
Ensure the draft is valid (X)HTML using Tidy or the like, and ensure all links (internal and external) are correct.
Verifying Productions
Verify all productions are correct, are actually used, and properly hyper-linked.
Polish
Spell and grammar checking, formatting, etc.