Copyright © 2001, 2002 Oren Ben-Kiki, Clark Evans & Brian Ingerson, all rights reserved. This document may be freely copied provided that it is not modified.
This specification is a working draft and reflects consensus reached
by the members of the yaml-core mailing list. Any questions regarding
this draft should be raised on this list at
With this release of the YAML specification, we now encourage development of YAML processors, so that the design of YAML can be validated. The specification is still subject to change; however, such changes will be limited to polish and fixing any logical flaws and bugs.
Therefore, this is "Last Call" for changes; if you have a pet feature, now is the very last time that it can be proposed before Release Candidate status. Changes which would cause "Last Call" YAML streams to be invalid will be seriously considered only if absolutely necessary.
YAML(tm) (rhymes with "camel") is a straightforward machine-parsable data serialization format designed for human readability and interaction with scripting languages such as Perl and Python. YAML is designed for data serialization, formatted dumping, configuration files, log files, Internet messaging and filtering. This specification describes the YAML information model and serialization format. Together with the Unicode standard for characters, it provides all the information necessary to understand YAML Version 1.0 and to construct programs to process YAML information.
YAML Ain't Markup Language, abbreviated YAML, is both a human-readable data serialization format and processing model. This text describes the class of data objects called YAML document streams and partially describes the behavior of computer programs that process them.
YAML document streams encode in a textual form the native data constructs of modern scripting languages. Strings, arrays, hashes, and other user-defined data types are supported. A YAML document stream consists of a sequence of characters, some of which are considered part of the document's content, and others that are used to indicate document structure.
YAML information can be viewed in two primary ways, for machine processing and for human presentation. A YAML processor is a tool for converting information between these complementary views. It is assumed that a YAML processor does its work on behalf of another module, called an application. This specification describes the required behavior of a YAML processor. It describes how a YAML processor must read or write YAML document streams and the information structures it must provide to or obtain from the application.
The design goals for YAML are:
YAML documents are very readable by humans.
YAML interacts well with scripting languages.
YAML uses host languages' native data structures.
YAML has a consistent model to support generic tools.
YAML enables stream-based processing.
YAML is expressive and extensible.
YAML is easy to implement.
YAML's initial direction was set by the markup language discussions among SML-DEV members. YAML was also conceived with experience gained from the construction and deployment of Brian Ingerson's Perl module Data::Denter. Since then YAML has matured through the ideas and support it has received from its user community.
YAML was first conceived as notation for a simple set of primitives, the sequence, the mapping and the scalar, which, when used recursively to form a graph structure, are strong enough for most machine processing needs. By sequence we mean an ordered collection, by mapping we mean an unordered association of unique keys to values, and by scalar we mean a series of Unicode characters. These primitives map cleanly to most modern programming languages; the sequence corresponds to a Perl array and a Python list, the mapping corresponds to a Perl hashtable and a Python dictionary. This basis is also formally justified, as both mapping and sequence are mathematical functions with well defined characteristics. With this core model, YAML supports machine processing with a balance of practical motivation and theory.
To meet the needs of serialization and human presentation, YAML has many syntactical aspects beyond the primitives described above. As a graph is flattened into a tree, ordering is imposed upon mapping keys and an alias mechanism is used to write subsequent occurrences of duplicate nodes. To enhance readability, various writing styles are provided for different aesthetic needs. Further, a comment mechanism allows for annotation orthogonal to the "content" of a YAML stream. YAML syntax also has other details, such as placement of line breaks and choice of escaping and scalar formats. While these aspects are essential to a human presentation of YAML, they are not needed for machine processing.
This split between machine processing and human presentation creates an inherent tension. While it may be tempting to drive machine processing with comments, key order, styles and other presentation information, this would greatly complicate the definition and operation of generic tools. Although one could argue that their YAML data is sufficiently isolated, information is often used in unforeseen ways. Therefore, applications should only rely upon the formal definition of YAML's primitives to drive processing. For example, sequences should be used when order is important for machine processing even though mapping key order may be available. Likewise, duplicate keys should never be used, even though a parser may report them without warning. Overall, this distinction is one of intent. Applications which respect the split between human presentation and machine processing will enjoy the ability to use generic tools such as path expression evaluators, graph transformation languages, or schema validators.
This separation does not prevent YAML processors from providing mechanisms to report or handle presentation aspects. Human readability is a prime directive for YAML. Therefore, a YAML processor may provide a shadow or wrapper mechanism to maintain and provide access to presentation aspects of a YAML text. In this way an application can have influence over how its information will be written to a stream for the best human impact. Since presentation aspects may be the same for a large class of YAML documents, a stylesheet could also be used to provide preferred key ordering, syntax styles, comments, and other presentation oriented instructions.
YAML integrates and builds upon structures and concepts described by C, Java, Perl, Python, Ruby, RFC0822 (MAIL), RFC1866 (HTML), RFC2045 (MIME), RFC2396 (URI), SAX, SOAP and XML.
YAML's core type system is based on the serialization requirements of Perl, Python and Ruby. YAML directly supports both scalar (string) values and collection (array, hash) values. Support for common types enables programmers to use their language's native data constructs for YAML manipulation, instead of requiring a special document object model (DOM).
Like XML's SOAP, the YAML serialization supports native graph structures through a rich alias mechanism. Also like SOAP, YAML provides for application-defined types. This allows YAML to serialize rich data structures required for modern distributed computing. YAML provides unique global type names using a namespace mechanism inspired by Java's DNS based package naming convention and XML's URI based namespaces.
YAML's block scoping is similar to Python's. In YAML, the extent of a node is indicated by its column. YAML's literal scalar leverages this by enabling formatted text to be cleanly mixed within an indented structure without troublesome escaping. Further, YAML's block indenting provides for easy inspection of the document's structure.
Motivated by HTML's end-of-line normalization, YAML's folded scalar introduces a unique method of handling white space. In YAML, single line breaks may be folded into a single space, while empty lines represent line break characters. This technique allows for paragraphs to be word-wrapped without affecting the canonical form of the content.
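The folding rule above can be sketched in Python. This is a simplified illustration only: it assumes the indentation has already been stripped from each line, ignores the special treatment of indented lines, and drops leading and trailing empty lines. The name `fold` is hypothetical and not part of any YAML processor.

```python
def fold(lines):
    """Fold de-indented text lines: a single break becomes a space,
    and each empty line becomes one line break character."""
    out = []
    pending_breaks = 0
    for line in lines:
        if line == "":
            # Remember the empty line; it turns into a line break later.
            pending_breaks += 1
            continue
        if out:
            # Join to the previous text with breaks or a single space.
            out.append("\n" * pending_breaks if pending_breaks else " ")
        pending_breaks = 0
        out.append(line)
    return "".join(out)


print(fold(["Mark McGwire's", "year was crippled", "by a knee injury."]))
```

Because the line breaks inside a paragraph fold away, re-wrapping the paragraph at a different column would produce the same folded string.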
YAML's double quoted scalar uses familiar C-style escape sequences. This enables ASCII representation of non-printable or 8-bit (ISO 8859-1) characters such as '\x3B'. 16-bit Unicode and 32-bit (ISO/IEC 10646) characters are supported with escape sequences such as '\u003B' and '\U0000003B'.
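As a rough illustration of the three numeric escape forms, the following Python sketch replaces '\x', '\u' and '\U' escapes with the characters they denote. It is deliberately incomplete: it does not handle the other C-style escapes (such as '\n' or '\t') or an escaped backslash, and the helper name `decode_numeric_escapes` is an assumption made for this example.

```python
import re

# \xHH (8-bit), \uHHHH (16-bit) and \UHHHHHHHH (32-bit) escapes.
_ESCAPE = re.compile(
    r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})"
)


def decode_numeric_escapes(text):
    """Replace numeric escape sequences with the characters they name."""
    def replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        return chr(int(digits, 16))
    return _ESCAPE.sub(replace, text)


# All three forms below denote the same character, ';'.
print(decode_numeric_escapes(r"\x3B \u003B \U0000003B"))
```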
The syntax of YAML was motivated by Internet Mail (RFC0822) and remains partially compatible with this standard. Further, YAML borrows the idea of having multiple documents from MIME (RFC2045). YAML's top-level production is a stream of independent documents; ideal for message-based distributed processing systems.
YAML was designed to have an incremental interface that includes both a pull-style input stream and a push-style (SAX-like) output stream interfaces. Together this enables YAML to support the processing of large documents, such as a transaction log, or continuous streams, such as a feed from a production machine.
Newcomers to YAML often search for its correlation to the eXtensible Markup Language (XML). While the two languages may actually compete in several application domains, there is no direct correlation between them. YAML is primarily a data serialization language. XML is often used for various types of data serialization but that is not its fundamental design goal.
There are many differences between YAML and XML. XML was designed to be backwards compatible with Standard Generalized Markup Language (SGML) and thus had many design constraints placed on it that YAML does not share. Also XML, inheriting SGML's legacy, is designed to support structured documents, where YAML is more closely targeted at messaging and native data structures. Where XML is a pioneer in many domains, YAML is the result of many lessons from the XML community.
The YAML and XML information models are starkly different. In XML, the primary construct is an attributed tree, where each element has an ordered, named list of children and an unordered mapping of names to strings. In YAML, the primary constructs are sequence (natively stored as an array), mapping (natively stored as a hash) and scalar values (string, integer, floating point). This difference is critical since YAML's model is directly supported by native data structures in most modern programming languages, where XML's model requires mapping conventions, or an alternative programming component (e.g. a document object model).
It should be mentioned that there are ongoing efforts to define standard XML/YAML mappings. This generally requires that a subset of each language be used.
The terminology used to describe YAML is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of a YAML processor:
may | Conformant YAML streams and processors are permitted to but need not behave as described.
should | Conformant YAML streams and processors are encouraged to behave as described, but may do otherwise if a warning message is provided to the user and any deviant behavior requires conscious effort to enable (i.e. a non-default setting).
must | Conformant YAML streams and processors are required to behave as described, otherwise they are in error.
error | A violation of the rules of this specification; results are undefined. Conforming software must detect and report an error and may recover from it.
This section provides a quick glimpse into the expressive power of YAML. It is not expected that the first-time reader grok all of the examples. Rather, these selections are used as motivation for the remainder of the specification.
    ---
    name: Mark McGwire
    hr: 65
    avg: 0.278
    ---
    name: Sammy Sosa
    hr: 63
    avg: 0.288

    # Ranking of players by
    # 1998 season home runs.
    ---
    - Mark McGwire
    - Sammy Sosa
    - Ken Griffey

    hr: # 1998 hr ranking
      - Mark McGwire
      - Sammy Sosa
    rbi: # 1998 rbi ranking
      - Sammy Sosa
      - Ken Griffey

    hr:
      - Mark McGwire
      # Following node labeled SS
      - &SS Sammy Sosa
    rbi:
      - *SS # Subsequent occurrence
      - Ken Griffey
The question mark indicates a complex key. Within a block sequence, mapping pairs can start immediately following the dash.
    ? # PLAY SCHEDULE
      - Detroit Tigers
      - Chicago Cubs
    :
      - 2001-07-23
    ? [ New York Yankees,
        Atlanta Braves ]
    : [ 2001-07-02, 2001-08-12,
        2001-08-14 ]

    invoice: 34843
    date : 2001-01-23
    bill-to: Chris Dumars
    product:
      - item : Super Hoop
        quantity: 1
      - item : Basketball
        quantity: 4
      - item : Big Shoes
        quantity: 1
Scalar values can be written in block form using a literal style (|) where all new lines count. Or they can be written with the folded style (>) for content that can be word wrapped. In the folded style, newlines are treated as a space unless they are part of a blank or indented line.
    --- |
      \/|\/|
      / | |_

    --- >
      Mark McGwire's
      year was crippled
      by a knee injury.

    --- >
     Sammy Sosa completed another
     fine season with great stats.

       63 Home Runs
       0.288 Batting Average

     What a year!

    name: Mark McGwire
    accomplishment: >
      Mark set a major league
      home run record in 1998.
    stats: |
      65 Home Runs
      0.278 Batting Average
YAML's flow scalars include the plain style (most examples thus far) and quoted styles. The double quoted style provides escape sequences. Single quoted style is useful when escaping is not needed. All flow scalars can span multiple lines; intermediate whitespace is trimmed to a single space.
    unicode: "Sosa did fine.\u263A"
    control: "\b1998\t1999\t2000\n"
    hexesc: "\x13\x10 is \r\n"
    single: '"Howdy!" he cried.'
    quoted: ' # not a ''comment''.'
    tie-fighter: '|\-*-/|'

    plain: This unquoted
           scalar spans
           many lines.
    quoted: "\
      So does this quoted
      scalar.\n"
In YAML, plain (unquoted) scalars are given an implicit type depending on the application. YAML's type repository includes integers, floating point values, timestamps, null, boolean, and string values.
    canonical: 12345
    decimal: +12,345
    octal: 014
    hexadecimal: 0xC

    canonical: 1.23015e+3
    exponential: 12.3015e+02
    fixed: 1,230.15
    negative infinity: (-inf)
    not a number: (NaN)

    null: ~
    true: +
    false: -
    string: '12345'

    canonical: 2001-12-15T02:59:43.1Z
    iso8601: 2001-12-14t21:59:43.10-05:00
    spaced: 2001-12-14 21:59:43.10 -05:00
    date: 2002-12-14
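To make the implicit integer forms above concrete, here is a small Python sketch that reads the four notations shown in the integer example (canonical, comma-separated decimal, octal with a leading zero, and hexadecimal). The rules are inferred from those examples alone, not from the type repository, and `parse_int` is a hypothetical helper name.

```python
def parse_int(text):
    """Parse the integer notations shown in the examples above."""
    digits = text.replace(",", "")      # "+12,345" -> "+12345"
    sign = 1
    if digits.startswith(("+", "-")):
        sign = -1 if digits[0] == "-" else 1
        digits = digits[1:]
    if digits.lower().startswith("0x"):
        return sign * int(digits[2:], 16)   # hexadecimal
    if len(digits) > 1 and digits.startswith("0"):
        return sign * int(digits, 8)        # octal
    return sign * int(digits, 10)           # decimal


for sample in ("12345", "+12,345", "014", "0xC"):
    print(sample, "->", parse_int(sample))
```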
Explicit typing is denoted with the bang (!) symbol. Application types should include a domain name and may use the caret (^) to abbreviate subsequent types.
    ---
    not-date: !str 2002-04-28
    picture: !binary#base64 |
     R0lGODlhDAAMAIQAAP//9/X
     17unp5WZmZgAAAOfn515eXv
     Pz7Y6OjuDg4J+fn5OTk6enp
     56enmleECcgggoBADs=
    hmm: !somewhere.com,2002/type |
     family above is short for
     taguri:somewhere.com,2002:type

    --- !clarkevans.com,2002/graph/^shape
    - !^circle
      center: &ORIGIN {x: 73, y: 129}
      radius: 7
    - !^line # !clarkevans.com,2002/graph/line
      start: *ORIGIN
      finish: { x: 89, y: 102 }
    - !^text
      start: *ORIGIN
      color: 0xFFEEBB
      value: Pretty vector drawing.
Below are two full-length examples of YAML. The first is a sample invoice; the second is a sample log file.
    --- !clarkevans.com,2002/^invoice
    invoice: 34843
    date : 2001-01-23
    bill-to: &id001
        given : Chris
        family : Dumars
        address:
            lines: |
                458 Walkman Dr.
                Suite #292
            city : Royal Oak
            state : MI
            postal : 48046
    ship-to: *id001
    product:
        - sku : BL394D
          quantity : 4
          description : Basketball
          price : 450.00
        - sku : BL4438H
          quantity : 1
          description : Super Hoop
          price : 2392.00
    tax : 251.42
    total: 4443.52
    comments: >
        Late afternoon is best.
        Backup contact is Nancy
        Billsmer @ 338-4338.

    ---
    Time: 2001-11-23 15:01:42 -05:00
    User: ed
    Warning: >
      This is an error message
      for the log file
    ---
    Time: 2001-11-23 15:02:31 -05:00
    User: ed
    Warning: >
      A slightly different error
      message.
    ---
    Date: 2001-11-23 15:03:17 -05:00
    User: ed
    Fatal: >
      Unknown variable "bar"
    Stack:
      - file: TopClass.py
        line: 23
        code: |
          x = MoreObject("345\n")
      - file: MoreClass.py
        line: 58
        code: |-
          foo = bar
Following are the BNF productions defining the syntax of YAML streams.
Characters are the basis for a YAML stream. Below is a general definition of a character followed by several characters that have specific meaning in particular contexts.
YAML streams use a subset of the Unicode character set. A YAML parser must accept all printable ASCII characters, the space, tab, line break, and all Unicode characters beyond 0x9F. A YAML emitter must only produce those characters accepted by the parser, but should also escape all non-printable Unicode characters if a character table is readily available.
[001] | printable_char | ::= | #x9 |
The range above explicitly excludes the surrogate block [#xD800-#xDFFF], DEL 0x7F, the C0 control block [#x0-#x1F], the C1 control block [#x80-#x9F], #xFFFE and #xFFFF. Note that in UTF-16, characters above #xFFFF are represented with a surrogate pair. DEL and characters in the C0 and C1 control block may be represented in a YAML stream using escape sequences.
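The accepted character set described above can be approximated with a short Python predicate. This is a sketch built from the prose only, since the production itself is not fully reproduced here. Treating #x85 (next line) as accepted is an assumption, based on it being listed among the line break characters, and `is_printable` is a hypothetical name.

```python
def is_printable(ch):
    """Return True if a parser following the text above would accept ch."""
    cp = ord(ch)
    if cp in (0x09, 0x0A, 0x0D, 0x85):
        return True            # tab and line break characters (0x85 assumed)
    if 0x20 <= cp <= 0x7E:
        return True            # space and printable ASCII
    if cp < 0xA0:
        return False           # rest of C0, DEL, and the C1 block
    if 0xD800 <= cp <= 0xDFFF:
        return False           # surrogate block
    return cp not in (0xFFFE, 0xFFFF)


print(is_printable("a"), is_printable("\x7f"), is_printable("\u263a"))
```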
A YAML parser is required to support the UTF-32, UTF-16 and UTF-8 character encodings. If an input stream does not begin with a byte order mark, the encoding shall be UTF-8. Otherwise the encoding shall be UTF-32 (LE or BE), UTF-16 (LE or BE) or UTF-8, as signaled by the byte order mark. Note that as YAML files may only contain printable characters, this does not raise any ambiguities. For more information about the byte order mark and the Unicode character encoding schemes see the Unicode FAQ.
[002] | byte_order_mark | ::= | #xFEFF |
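The encoding rule above (UTF-8 unless a byte order mark says otherwise) might be implemented along the following lines. This is a minimal Python sketch using the BOM constants from the standard `codecs` module; the function name `detect_encoding` and its return shape are choices made for this example.

```python
import codecs

# UTF-32 marks are tested first because the UTF-32 LE mark begins
# with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(data):
    """Return (encoding name, length of the byte order mark in bytes)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0          # no byte order mark: UTF-8 by definition


raw = codecs.BOM_UTF16_BE + "--- hello".encode("utf-16-be")
encoding, skip = detect_encoding(raw)
print(encoding, raw[skip:].decode(encoding))
```

Checking the longer marks first is safe here because a YAML stream may only contain printable characters, so a UTF-16 stream cannot continue with a NUL character directly after its mark.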
Indicators are special characters that are used to describe the structure of a YAML document.
[003] | sequence_entry_indicator | ::= | '-' |
[004] | mapping_entry_separator | ::= | ':' |
[005] | sequence_flow_start | ::= | '[' |
[006] | sequence_flow_end | ::= | ']' |
[007] | mapping_flow_start | ::= | '{' |
[008] | mapping_flow_end | ::= | '}' |
[009] | collect_line_separator | ::= | ',' |
[010] | top_key_indicator | ::= | '?' |
[011] | alias_indicator | ::= | '*' |
[012] | anchor_indicator | ::= | '&' |
[013] | transfer_indicator | ::= | '!' |
[014] | literal_indicator | ::= | '|' |
[015] | folded_indicator | ::= | '>' |
[016] | single_quote | ::= | ''' |
[017] | double_quote | ::= | '"' |
[018] | throwaway_indicator | ::= | '#' |
[019] | reserved_indicators | ::= | '%' | '@' | '`' |
Indicators can be grouped into two categories. The '-', ':', ',', '?' and '#' space indicators are always followed by a white space character (space, tab or line break). If followed by any other character, they are taken to be normal content characters. The remaining indicators are taken to be indicators even if followed by a non-space character.
[020] | space_indicators | ::= | sequence_entry_indicator |
[021] | non_space_indicators | ::= | sequence_flow_start |
The Unicode standard defines the following line break characters.
[022] | line_feed | ::= | #xA |
[023] | carriage_return | ::= | #xD |
[024] | next_line | ::= | #x85 |
[025] | line_separator | ::= | #x2028 |
[026] | paragraph_separator | ::= | #x2029 |
[027] | line_break_char | ::= | line_feed |
Line breaks can be grouped into two categories. Specific line breaks have well-defined semantics for breaking text into lines and paragraphs. The semantics of generic line break characters is not defined beyond "ending a line".
Outside scalar text content, YAML allows any line break to be used to terminate lines, and in most cases also allows such line breaks to be preceded by trailing comment characters. On output, a YAML emitter is free to emit such line breaks using whatever convention is most appropriate. An emitter should avoid emitting trailing line spaces.
[028] | generic_break | ::= | ( carriage_return greedy |
[029] | specific_break | ::= | line_separator |
[030] | any_break | ::= | generic_break |
This section includes several common character range definitions.
[031] | flow_char | ::= | printable_char |
[032] | flow_space | ::= | #x20 | #x9 |
[033] | flow_non_space | ::= | flow_char |
[034] | ascii_letter | ::= | [#x41-#x5A] |
[035] | decimal_digit | ::= | [#x30-#x39] |
[036] | hex_digit | ::= | decimal_digit |
[037] | word_char | ::= | decimal_digit |
YAML streams use lines and spaces to convey structure. This requires special processing rules for white space (space and tab).
In a YAML text representation, structure is determined from indentation, where indentation is defined as a line break character followed by zero or more space characters.
Tab characters are not allowed in indentation. Since different systems treat tabs differently, portability problems are a concern. Therefore, YAML's tab policy is conservative; they are not allowed. Note that most modern editors may be configured so that pressing the tab key results in the insertion of an appropriate number of spaces.
A node must be more indented than its parent node. All sibling nodes must use the exact same indentation level. However the content of each such node may be indented independently.
The indentation level is used exclusively to delineate structure. Indentation characters are otherwise ignored. In particular, they are never taken to be a part of the serialized text.
[038] | indent(n) | ::= | #x20 x n |
[039] | indent(<n) | ::= | indent(m) | m such that m < n */ |
[040] | indent(<=n) | ::= | indent(m) | m such that m <= n */ |
Since the YAML stream depends upon indentation level to delineate blocks, additional productions are a function of an integer, based on the indent(n), indent(<n) and indent(<=n) productions above. In some cases the notation production(any) is used; it is a shorthand for "production(n) for some specific value of n".
The '-' sequence entry indicator is considered to be part of the indentation, as this seems the way people tend to interpret it. Hence this indicator itself need not be indented relative to its parent node. Note that spaces following this indicator are not taken to be part of the indentation except for in one special case (map_in_seq).
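A processor needs the indentation level of each line before it can apply the rules above. The Python sketch below counts leading spaces and rejects a tab in the indentation, following the tab policy stated earlier. It does not handle the special treatment of the '-' indicator, and the name `indentation_level` is hypothetical.

```python
def indentation_level(line):
    """Count the leading space characters of a line.

    Raises ValueError if a tab appears within the indentation,
    since tabs are not allowed there.
    """
    count = 0
    for ch in line:
        if ch == " ":
            count += 1
        elif ch == "\t":
            raise ValueError("tab character in indentation")
        else:
            break
    return count


for line in ("product:", "  - item: Super Hoop", "    quantity: 1"):
    print(indentation_level(line), repr(line))
```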
Throwaway comments have no effect whatsoever on the data serialized in the stream. Their usual purpose is to communicate between the human maintainers of the file. A typical example is comments in a configuration file.
A throwaway comment always spans to the end of a line. It consists of white spaces, optionally followed by a '#' indicator, a white space character, and arbitrary comment characters to the end of the line.
Outside text content, empty lines or lines containing only white space are taken to be implicit throwaway comment lines. Lines containing indentation followed by '#' and comment characters are taken to be explicit throwaway comment lines.
A throwaway comment may appear before a document node or following any node. It may not appear inside a scalar node, but may precede or follow it.
[041] | | ::= | throwaway_indicator+ |
[042] | | ::= | comment_empty_line(n) |
[043] | | ::= | indent(<=n) |
[044] | | ::= | indent(<n) |
[045] | comment_break | ::= | ( flow_space+ |
    ### The first three lines of this stream

    ### are comments (the second one is empty).
    this: | # Comments may trail block indicators.
        contains three lines of text.
        The third one starts with a
        # character. This isn't a comment.

    # The last three lines of this stream
    # are comments (the first line is empty).
A sequence of bytes is a YAML stream if, taken as a whole, it complies with the following production. Note that an empty stream is a valid YAML stream containing no documents.
Encoding is assumed to be UTF-8 unless explicitly specified by including a byte order mark as the first character of the stream. While a byte order mark may also appear before additional document headers, the same encoding must be used for all documents contained in a YAML stream.
[046] | yaml_stream | ::= | implicit_document? |
[047] | implicit_document | ::= | byte_order_mark? |
[048] | explicit_document | ::= | byte_order_mark? |
A YAML stream may contain several independent YAML documents. A document header line is used to start a new document. This line must start with a document separator: '---' followed by a line break or a sequence of space characters. If no explicit header line is specified at the start of the stream, the parser should behave as if a header line containing '--- #YAML:1.0' was specified.
When YAML is used as the format for a communication stream, it is useful to be able to indicate the end of a document independent of starting the next one. Without such a marker, the YAML processor reading the stream would be forced to wait for the header of the next document (which may be a long time in coming) in order to detect the end of the previous document.
To support this scenario, a YAML document may be terminated by a '...' line. Nothing but throwaway comments may appear between this line and the (mandatory) header line of the following document.
[049] | document_header | ::= | document_start |
[050] | document_start | ::= | '-' '-' '-' |
[051] | document_trailer | ::= | document_end |
[052] | document_end | ::= | '.' '.' '.' |
    --- >
      This YAML stream contains a single text value.
      The next stream is a log file - a sequence of
      log entries. Adding an entry to the log is a
      simple matter of appending it at the end.

    ---
    at: 2001-08-12 09:25:00.00 Z
    type: GET
    HTTP: '1.0'
    url: '/index.html'
    ---
    at: 2001-08-12 09:25:10.00 Z
    type: GET
    HTTP: '1.0'
    url: '/toc.html'

    # This stream is an example of a top-level mapping.
    invoice : 34843
    date : 2001-01-23
    total : 4443.52

    # The following is a stream of three documents. The first is an empty
    # mapping, the second an empty sequence, and the last an empty string.
    --- {}
    --- [ ]
    --- ''

    # A communication channel based on a YAML stream.
    ---
    sent at: 2002-06-06 11:46:25.10 Z
    payload: Whatever
    # Receiver can process this as soon as the following is sent:
    ...
    # Even if the next message is sent long after:
    ---
    sent at: 2002-06-06 12:05:53.47 Z
    payload: Whatever
    ...
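The document boundaries described above can be located with a simple line scan. The Python sketch below splits a stream into raw document texts at '---' header lines and '...' trailer lines. It works on an already decoded string, drops comment lines that fall between documents, and is more lenient than the specification (for example, it does not insist on a header after a trailer). The name `split_documents` is an assumption for this example.

```python
def split_documents(text):
    """Split a YAML stream into the raw text of each document."""
    documents = []
    current = None                      # lines of the open document, if any
    for line in text.splitlines():
        if line == "---" or line.startswith("--- "):
            if current is not None:
                documents.append(current)
            current = [line]            # header line starts a new document
        elif line == "..." or line.startswith("... "):
            if current is not None:
                documents.append(current)
            current = None              # document ended without a successor
        elif current is not None:
            current.append(line)
        elif line.strip() and not line.lstrip().startswith("#"):
            current = [line]            # implicit document without a header
    if current is not None:
        documents.append(current)
    return ["\n".join(doc) for doc in documents]


stream = "--- {}\n--- [ ]\n--- ''\n"
print(split_documents(stream))
```

A streaming reader would hand each document to the application as soon as its '...' line arrives, rather than collecting them in a list as this sketch does.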
Directives are instructions to the YAML parser. Like throwaway comments, directives are not reflected in the data serialized in the stream. Directives apply to a single document. It is an error for the same directive to be specified more than once for the same document.
[053] | directive | ::= | throwaway_indicator |
[054] | directive_name | ::= | word_char+ |
[055] | directive_value | ::= | flow_non_space+ |
This version of YAML defines a single directive, #YAML. Additional directives may be added in future versions of YAML. A parser should ignore unknown directives with an appropriate warning. There is no provision for specifying private directives. This is intentional.
The #YAML directive specifies the version of YAML the document adheres to. This specification defines version 1.0.
A version 1.0 parser should accept documents with an explicit #YAML:1.0 directive, as well as documents lacking a #YAML directive. Documents with a directive specifying a higher minor version (e.g. #YAML:1.1) should be processed with an appropriate warning. Documents with a directive specifying a higher major version (e.g. #YAML:2.0) should be rejected with an appropriate error message.
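The version policy above can be expressed compactly. The following Python sketch takes the value of a #YAML directive (for example "1.1") and applies the three cases: accept, warn on a higher minor version, reject a higher major version. The function name `check_version` and the use of Python's `warnings` module for the warning are illustrative choices, not requirements of the specification.

```python
import warnings


def check_version(directive_value, supported=(1, 0)):
    """Apply the version policy to the value of a #YAML directive."""
    major, minor = (int(part) for part in directive_value.split("."))
    if major > supported[0]:
        # Higher major version: reject the document.
        raise ValueError("unsupported YAML version " + directive_value)
    if major == supported[0] and minor > supported[1]:
        # Higher minor version: process, but warn.
        warnings.warn("document uses newer YAML version " + directive_value)
    return major, minor


print(check_version("1.0"))
```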
A text node begins at a particular level of indentation, n, and its content is indented at some level >n. A text node can be a collection (mapping or sequence), a scalar (block or flow) or an alias.
A YAML document is a normal node. However a document can't be an alias (there is nothing it may refer to). Also if the header line is omitted the first document must be a block (not flow) collection.
[056] | top_value_node(n) | ::= | top_alias_node |
[057] | flow_value_node(n) | ::= | alias |
[058] | top_key_node(n) | ::= | ( top_key_indicator |
[059] | flow_key_node(n) | ::= | alias |
[060] | top_alias_node | ::= | flow_space+ |
[061] | top_collect_node(n) | ::= | blk_collect_node(n) |
[062] | blk_collect_node(n) | ::= | ( flow_space+ |
[063] | flow_collect_node(n) | ::= | ( collect_properties |
[064] | top_scalar_node(n) | ::= | blk_scalar_node(n) |
[065] | blk_scalar_node(n) | ::= | ( flow_space+ |
[066] | top_scalar_value_node(n) | ::= | ( scalar_properties |
[067] | flow_scalar_value_node(n) | ::= | ( scalar_properties |
[068] | flow_scalar_key_node(n) | ::= | ( scalar_properties |
Each text node may have anchor and transfer method properties. These properties are specified in a properties list appearing before the node value itself. For a top-level node (a document), the properties appear in the document header line, following the directives (if any). It is an error for the same property to be specified more than once for the same node.
[069] | collect_properties | ::= | ( collect_transfer |
[070] | scalar_properties | ::= | ( scalar_transfer |
The transfer method property specifies how to load the associated node. It includes the type family for the node and, for global scalar type families, an optional specific format used, separated by a '#' character.
Like throwaway comments and directives, formats are not reflected in the data serialized in the stream. In contrast, the type family is considered to be part of this data.
By providing an explicit transfer property to a node, implicit typing is prevented. However, an explicit empty transfer method property can be used to force implicit typing to be applied to a node. If either an empty explicit format or no explicit format is given, the loader automatically detects the format.
    implicit integer type family: 12
    also implicit integer family: ! "12"
    explicit integer, implicit format: !int 12
    also implicit format: !int# 0x12
    explicit format: !int#dec 0x12
YAML makes use of the taguri: scheme for defining URIs for its global type families and the x-private: scheme for its private type families. While these schemes provide the necessary semantics for identifying type families, they are rather verbose.
To increase readability, YAML does not use the full URI notation in the stream. Instead, it provides several shorthand notations for different groups of type family URIs. A parser may choose not to expand shorthand type family names to URIs. However, in such a case the parser must still perform escaping to ensure a single unique representation of each type family name.
If the type family begins with a '!
' character,
it is taken to be a private type family whose URI is under the
x-private:
scheme. URI fragments are allowed but
their semantics is completely up to the semantics of the
private type. In particular, they may or may not indicate a
format.
# Both examples below make use of the 'x-private:ball' # type family URI, but with different semantics. --- pool: !!ball { number: 8 } --- bearing: !!ball { material: steel }
If the type family contains no ':
' and no
'/
' characters it is assumed to be defined under
the yaml.org
domain. This domain is used to define
the core and language-independent YAML data types.
# The URI is 'taguri:yaml.org,2002:str' - !str a Unicode string
Otherwise, if the type family begins with a single word,
followed by a '/
' character, it is assumed to
belong to a sub-domain of yaml.org
.
Each domain language.yaml.org
will
include all globally unique types of the language that aren't
covered by the set of language-independent types. Globally
unique types for each language include any built-in types and
any standard library types. For languages such as Java and C#,
all type names based on reverse DNS strings are globally
unique. For languages such as Perl, that has a central
authority (CPAN) for
managing the global namespace, all the types sanctioned by the
central authority are globally unique. The list of supported
languages and their types is maintained as part of the YAML type repository.
# The URI is 'taguri:perl.yaml.org,2002:Text::Tabs' - !perl/Text::Tabs {}
Otherwise, the type family must begin with a domain name and
a date (separated by a ',
' character), followed by
a '/
' character. In this case it is taken to be
defined under the specified domain and date.
# The URI is 'taguri:clarkevans.com,2003-02:timesheet' - !clarkevans.com,2003-02/timesheet
Type families defined in the yaml.org domain or any of its sub-domains must be defined using the appropriate specialized shorthand rather than using the generic domain syntax. This ensures each type family has a unique representation as a shorthand, in addition to having a unique representation as a URI.
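The shorthand rules above can be sketched in Python as follows. This is only an illustration, not a reference implementation: the function name `expand_type_family` is hypothetical, the input is assumed to be the type family name with the leading transfer indicator already removed, the "single word" is assumed to consist of ASCII letters, digits and '-', and the date 2002 is taken from the examples in this draft.

```python
import re

YAML_DOMAIN = "yaml.org"
YAML_DATE = "2002"  # assumption: the date used by this draft's examples


def expand_type_family(shorthand: str) -> str:
    """Expand a shorthand type family name to its URI (illustrative sketch)."""
    # Rule 1: a leading '!' marks a private type family.
    if shorthand.startswith("!"):
        return "x-private:" + shorthand[1:]
    # Rule 2: no ':' and no '/' means a core yaml.org type.
    if ":" not in shorthand and "/" not in shorthand:
        return f"taguri:{YAML_DOMAIN},{YAML_DATE}:{shorthand}"
    # Rule 3: a single word followed by '/' names a language sub-domain.
    language = re.match(r"([A-Za-z0-9-]+)/(.*)\Z", shorthand, re.S)
    if language:
        word, rest = language.groups()
        return f"taguri:{word}.{YAML_DOMAIN},{YAML_DATE}:{rest}"
    # Rule 4: otherwise expect 'domain,date/type'.
    generic = re.match(r"([^,/]+),([^/]+)/(.*)\Z", shorthand, re.S)
    if generic:
        domain, date, rest = generic.groups()
        # yaml.org types must use the specialized shorthands above.
        if domain == YAML_DOMAIN or domain.endswith("." + YAML_DOMAIN):
            raise ValueError("yaml.org types must use the specialized shorthand")
        return f"taguri:{domain},{date}:{rest}"
    raise ValueError(f"not a valid shorthand type family: {shorthand!r}")
```

For instance, `expand_type_family("perl/Text::Tabs")` falls through to the language rule because the name contains ':' characters, which rules out the core yaml.org case.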
YAML allows non-printable Unicode characters to be used in a transfer method using escape sequences.
    # The following values have the same type family.
    - !domain.tld,2002/type\x30 value
    - !domain.tld,2002/type0 value
Sometimes it may be helpful for a YAML type family or transfer method to be expanded to a full URI. A YAML processor may provide a mechanism to perform such expansion. Since URIs support a limited ASCII-based character set, this expansion requires all characters outside this set to be encoded in UTF-8 and the resulting bytes to be encoded using % notation.
When an explicit % character appears in a transfer method, it is passed to the URI form unchanged, allowing explicit % escapes to be used in the transfer method where necessary. It is an error for a transfer method not to have a valid expanded URI format (e.g., to contain an invalid explicit % escape sequence).
    # The following are different as far as YAML is concerned.
    - !domain.tld,2002/type%30 value
    - !domain.tld,2002/type0 value
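A minimal Python sketch of the % expansion described above. The function name `to_uri_form` and the exact set of characters treated as URI-safe are assumptions made for illustration; the draft only says that URIs support "a limited ASCII-based character set".

```python
import re
import string

# Assumption: characters passed through unchanged. The specification text
# does not enumerate this set; this one is chosen only for illustration.
URI_SAFE = set(string.ascii_letters + string.digits + "-_.!~*'();/?:@&=+$,#")


def to_uri_form(name: str) -> str:
    """Percent-encode a transfer method name (illustrative sketch)."""
    # An explicit '%' must start a valid two-digit hex escape.
    if re.search(r"%(?![0-9A-Fa-f]{2})", name):
        raise ValueError("invalid explicit % escape sequence")
    out = []
    for ch in name:
        if ch == "%" or ch in URI_SAFE:
            out.append(ch)  # explicit escapes and safe characters pass through
        else:
            # Encode the character as UTF-8, then each byte as %XX.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

Because an explicit '%' is passed through unchanged, `type%30` and `type0` expand to different URIs, which matches the example above.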
YAML provides a convenient shorthand for the common case where a node and (most of) its descendants have global type families whose shorthand forms share a common prefix. For this case, YAML allows using the '^' character to separate the ancestor node's type family into a prefix and a suffix. The parser does not consider the separator to be part of the type family name.
When the parser encounters a descendant node whose type family name begins with '^', it prepends the ancestor node's prefix to it. Again, the '^' character is not taken to be part of the name.
It is possible for a descendant node to establish a different prefix. In this case the node may not make use of its ancestor node's prefix. It must specify a full type family name, separated into a prefix and suffix as above.
It is an error for a node's type family name to begin with '^' unless it has an ancestor node establishing a prefix. However, a node may establish a prefix even if none of its descendants make use of it.
Note that the type prefix mechanism is purely syntactical and does not imply any additional semantics. In particular, the prefix must not be assumed to be an identifier for anything.
    # 'taguri:domain.tld,2002:invoice' is some type family.
    invoice: !domain.tld,2002/^invoice
      # 'seq' is shorthand for 'taguri:yaml.org,2002:seq'.
      # This does not affect '^customer' below
      # because it does not specify a prefix.
      customers: !seq
        # '^customer' is shorthand for the full notation
        # '!domain.tld,2002/customer' that stands for the
        # URI 'taguri:domain.tld,2002:customer'.
        - !^customer
          given : Chris
          family : Dumars
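The prefix mechanism can be sketched as a small function that a parser might call for each node while walking down the tree. The name `resolve_prefix` and its calling convention (the inherited prefix goes in, the prefix for descendants comes out) are hypothetical.

```python
from typing import Optional, Tuple


def resolve_prefix(name: str, inherited: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (full shorthand name, prefix visible to descendants)."""
    if name.startswith("^"):
        # The node reuses the prefix established by an ancestor.
        if inherited is None:
            raise ValueError("'^' used without an ancestor establishing a prefix")
        return inherited + name[1:], inherited
    if "^" in name:
        # The node establishes a new prefix; '^' itself is dropped.
        prefix, suffix = name.split("^", 1)
        return prefix + suffix, prefix
    # No separator: the name is complete and the inherited prefix is unchanged.
    return name, inherited
```

Applied to the example above, `invoice` establishes the prefix `domain.tld,2002/`, `seq` leaves it untouched, and `^customer` resolves to `domain.tld,2002/customer`.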
[071] prefix_separator   ::= '^'
[072] format_separator   ::= '#'
[073] trans_char         ::= escape_sequence
[074] mundane_trans_char ::= trans_char - ':' - '/'
[075] collect_transfer   ::= transfer_indicator
[076] scalar_transfer    ::= collect_transfer
[077] private_family     ::= transfer_indicator
[078] global_family      ::= core_family
[079] format             ::= trans_char*
[080] core_family        ::= ( ( mundane_trans_char    /* taguri: type */
[081] language_family    ::= ( word_char+              /* taguri: language */
[082] domain_family      ::= ( word_char+              /* taguri: domain,date:type */
[083] domain_year        ::= decimal_digit x 4
[084] domain_day_month   ::= decimal_digit x 2
An anchor is a property that can be used to mark a node for future reference. An alias node can then be used to indicate additional inclusions of an anchored node by specifying the node's anchor.
[085] anchor_property ::= anchor_indicator
[086] anchor          ::= word_char+
An alias node does not directly exist in the data serialized in the stream. Instead, it represents a second occurrence of the data represented by the anchored node. The first occurrence of the node must be marked by an anchor to allow additional occurrences to be represented as alias nodes.
An alias refers to the most recent preceding node having the same anchor. It is an error to have an alias use an anchor that does not occur previously in the serialization of the document. It is not an error to have an anchor that is not used by any alias node.
[087] alias ::= alias_indicator anchor
    anchor : &A001 This scalar has an anchor.
    override : &A001 The alias node below is a repeated use of this value.
    alias : *A001
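The "most recent preceding anchor" rule can be sketched as a single pass over the nodes in serialization order. The event representation used here (tuples tagged "node" or "alias") is a hypothetical simplification, not an interface defined by this specification.

```python
def resolve_aliases(events):
    """Replace alias events by the most recently anchored value.

    Each event is either ("node", anchor_or_None, value) or ("alias", anchor).
    """
    anchors = {}
    resolved = []
    for event in events:
        if event[0] == "alias":
            name = event[1]
            if name not in anchors:
                # The anchor must occur earlier in the serialization.
                raise ValueError(f"alias *{name} has no preceding anchor")
            resolved.append(anchors[name])
        else:
            _, anchor, value = event
            if anchor is not None:
                # A later node with the same anchor overrides the earlier one.
                anchors[anchor] = value
            resolved.append(value)
    return resolved
```

An anchor that is never referenced simply stays in the table unused, which is not an error.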
Collection nodes come in two kinds, sequence and mapping. Each kind has two styles, block and flow. Block styles begin on the next line and use indentation for internal structure. Flow collection styles start on the current line, may span multiple lines, and rely on indicators to represent internal structure.
To enable line spanning in flow collections, wherever tokens may be separated by white space, it is possible to end the line (with an optional throwaway comment) and continue the collection in the next line. Line spanning functionality is indicated by the use of the optional_space and the required_space productions.
[088] blk_collection(n)  ::= blk_sequence(n)
[089] flow_collection(n) ::= flow_sequence(n)
[090] optional_space(n)  ::= flow_space*
[091] required_space(n)  ::= flow_space+
A sequence node is the simplest node kind. It is an ordered collection of sub-nodes at a higher indentation level. A flow style is available for short, simple sequences.
[092] blk_sequence(n)   ::= ( indent(n-1)
[093] blk_seq_entry(n)  ::= sequence_entry_indicator
[094] flow_sequence(n)  ::= sequence_flow_start
[095] flow_seq_entry(n) ::= flow_value_node(n)
empty: [] flow: [ one, two, three # May span lines, , four, # indentation is five ] # mostly ignored. block: - Note indicator is not indented. - - Subordinate sequence entry - > A folded sequence entry - Sixth item in top sequence
A mapping node is an unordered association of unique keys with values. It is an error for two equal key entries to appear in the same mapping node. In such a case the processor may continue, ignoring the second key and issuing an appropriate warning. This strategy preserves a consistent information model for streaming and random access applications.
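The permitted recovery strategy for duplicate keys (keep the first entry, warn about the second) might look like this in Python. The function name `build_mapping` is hypothetical, and keys are assumed to be hashable Python values compared with ordinary equality.

```python
import warnings


def build_mapping(pairs):
    """Build a mapping from (key, value) pairs, ignoring repeated keys."""
    mapping = {}
    for key, value in pairs:
        if key in mapping:
            # The first occurrence wins; the second is reported and dropped.
            warnings.warn(f"duplicate key {key!r} ignored")
            continue
        mapping[key] = value
    return mapping
```

Keeping the first occurrence means a streaming consumer, which has already acted on the first entry, and a random access consumer see the same data.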
A flow form is available for short, simple mapping nodes. Also, if a mapping node has no properties, and its first key is specified as a flow scalar without any properties, this first key may immediately follow the sequence entry indicator.
[096] blk_mapping(n)    ::= ( indent(n)
[097] map_in_seq(n)     ::= indent(m)
[098] blk_map_entry(n)  ::= top_key_node(n)
[099] flow_mapping(n)   ::= mapping_flow_start
[100] flow_map_entry(n) ::= flow_key_node(n)
empty: {} flow: { one: 1, two: 2 } spanning: { one: 1, two: 2 } block: key : value nested mapping: key: Subordinate mapping nested sequence: - Subordinate sequence !float 12 : This key is a float. "\a" : This key had to be escaped. ? '?' : This key had to be quoted. ? > This is a multi line folded key : Whose value is also multi-line. ? this also works as a key : with a value at the next line. ? - This key - is a sequence : - With a sequence value. ? This: key is a: mapping : with a: mapping value. --- - A key: value pair in a sequence. A second: key:value pair. - The previous entry is equal to the following one. - A key: value pair in a sequence. A second: key:value pair.
While most of the document productions are fairly strict, the scalar production is generous. It offers three flow style variants and two block style variants to choose from, depending upon the readability requirements.
Throwaway comments may follow a scalar node, but may not appear inside one. The comment lines following a block scalar node must be less indented than the block scalar value. Empty lines in a scalar node that are followed by a non-empty content line are interpreted as content rather than as implicit comments. Such lines may be less indented than the text content.
[101] blk_scalar(n)        ::= literal(n)
[102] top_scalar_value(n)  ::= single_quoted(n)
[103] flow_scalar_value(n) ::= single_quoted(n)
[104] flow_scalar_key(n)   ::= single_quoted(n)
Inside all scalar nodes, a compliant YAML parser must translate the two-character combination CR LF, any CR that is not followed by an LF, and any NEL into a single LF (this does not apply to escaped characters). LS and PS characters are preserved. These rules are compatible with Unicode's newline guidelines.
Normalization functionality is indicated by the use of the line_feed_break production defined below.
[105] line_feed_break  ::= generic_break
[106] normalized_break ::= line_feed_break
On output, a YAML emitter is free to serialize end of line markers using whatever convention is most appropriate, though again LS and PS must be preserved.
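The input-side normalization rule above (CR LF, lone CR and NEL become LF, while LS and PS are preserved) can be sketched with one regular expression. The function name `normalize_breaks` is hypothetical, and handling of escaped characters is left out.

```python
import re

# CR LF must be tried before a lone CR so that the pair becomes one LF.
_GENERIC_BREAK = re.compile("\r\n|\r|\x85")


def normalize_breaks(text: str) -> str:
    """Map CR LF, CR and NEL (U+0085) to LF; leave LS and PS untouched."""
    return _GENERIC_BREAK.sub("\n", text)
```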
Each block scalar may have explicit indentation and chomping modifiers. These modifiers are specified following the block style indicator. It is an error for the same modifier to be specified more than once for the same node.
[107] blk_modifiers ::= ( explicit_indent
Typically the indentation level of a block scalar node is detected from its first content line. This detection fails when this first line is empty, contains a leading '#' character, or contains leading white space characters.
In such cases YAML requires that the indentation level for the scalar node text content be given explicitly. This level is specified as the integer number of the additional indentation spaces used for the text content.
The indentation level is always non-zero, except for the top level node of each document. This node is commonly indented by zero spaces (not indented).
It is always valid to specify an explicit indentation level, though emitters should not do so in cases where detection succeeds. It is an error for detection to fail when there is no explicit indentation specified.
[108] explicit_indent ::= decimal_digit
# Explicit indentation must # be given in all the three # following cases. leading spaces: |2 This value starts with four spaces. leading line break: |2 This value starts with a line break. leading comment indicator: |2 # first line starts with a # character. # Explicit indentation may # also be given when it is # not required. redundant: |2 This value is indented 2 spaces. # Indentation may apply to top level nodes. --- | Usually top level nodes are not indented. --- | This text is indented two spaces. It contains no leading spaces. --- |0 This text contains two leading spaces.
Typically the final line break of a block scalar is considered to be a part of its value, and any trailing empty lines are taken to be comment lines. This default "clip" chomping behavior can be overridden by specifying a chomp control modifier.
[109] chomp_control ::= '-' | '+'
'-' : strip
    The '-' chomp control specifies that the final line break character of the block scalar should be stripped from its value.
'+' : keep
    The '+' chomp control specifies that any trailing empty lines following the block scalar should be considered to be a part of its value. If this modifier is not specified, such lines are considered to be empty throwaway comment lines and are ignored. When this functionality is implied, the trailing_lines(n) production will be used.
[110] trailing_lines(n) ::=    /* trailing content empty line (ignored unless '+' keep) */
    clipped: |
        This has one newline.

    same as "clipped" above: "This has one newline.\n"

    stripped: |-
        This has no newline.

    same as "stripped" above: "This has no newline."

    kept: |+
        This has two newlines.

    same as "kept" above: "This has two newlines.\n\n"
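A minimal sketch of the three chomping behaviors, assuming the block scalar's text has already been extracted (indentation removed, line breaks normalized to LF) together with any trailing empty lines. The function name `chomp` and the use of an empty string for the default "clip" behavior are illustrative choices.

```python
def chomp(raw: str, modifier: str = "") -> str:
    """Apply clip (default), strip ('-') or keep ('+') chomping."""
    body = raw.rstrip("\n")
    trailing = raw[len(body):]  # the final break plus any trailing empty lines
    if modifier == "-":
        return body  # strip: drop the final line break as well
    if modifier == "+":
        return body + trailing  # keep: trailing empty lines are content
    if modifier == "":
        # clip: keep exactly one final line break, if there was one
        return body + ("\n" if trailing else "")
    raise ValueError(f"unknown chomp control: {modifier!r}")
```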
A literal scalar is the simplest scalar form. No processing is performed on literal scalar characters aside from end of line normalization and stripping away the indentation. Indentation is detected from the first content line. Explicit indentation must be specified in case this yields the wrong result.
This restricts literal scalars to printable characters only. Also, long lines can't be broken. In exchange for these restrictions, a literal scalar may use any printable character, including line breaks. This makes literal scalars the most readable format for source code or other text values with significant use of indicators, quotes, escape sequences, and line breaks.
[111] literal(n)              ::= literal_indicator    /* literal scalar */
[112] literal_value(n)        ::= literal_chunk(n)+
[113] literal_chunk(n)        ::= line_feed_empty_line(n)*
[114] specific_empty_line(n)  ::= indent(<=n)
[115] line_feed_empty_line(n) ::= indent(<=n)
[116] literal_text_line(n)    ::= indent(n) flow_char+
empty: | literal: | The \ ' " characters may be freely used. Leading white space is significant. Line breaks are significant. Thus this value contains one empty line and ends with a single line break, but does not start with one. is equal to: "The \\ ' \" characters may \ be\nfreely used. Leading white\n space \ is significant.\n\nLine breaks are \ significant. Thus this value\ncontains \ one empty line and ends with a single\nline \ break, but does not start with one.\n" # Comments may follow a block scalar value. # They must be less indented. # Modifiers may be combined in any order. indented and chomped: |2- This has no newline. also written as: |-2 This has no newline. both are equal to: " This has no newline."
When folding is done, a single normalized line feed is converted to a single space (#x20). When two or more consecutive (possibly indented) normalized line feeds are encountered, the parser does not convert them into spaces. Instead, the parser ignores the first of the line feeds and preserves the rest. Thus a single line feed can be serialized as two, two line feeds can be serialized as three, etc.
When folding is done, specific line breaks are preserved and may be safely used to convey text structure.
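The basic folding rule can be sketched as follows, assuming line breaks have already been normalized to LF and any indentation between consecutive line feeds has been removed. The function name `fold_line_feeds` is hypothetical. The additional block scalar restrictions (for example, lines starting with a space) are not modeled here.

```python
import re


def fold_line_feeds(text: str) -> str:
    """Fold runs of LF: one becomes a space, n >= 2 become n - 1 line feeds."""
    def replace(match):
        count = len(match.group(0))
        return " " if count == 1 else "\n" * (count - 1)

    # LS (U+2028) and PS (U+2029) are not matched, so they are preserved.
    return re.sub(r"\n+", replace, text)
```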
Block scalars are based on indentation to convey structure. Hence leading white space in block scalar lines is always significant. Folding block scalars builds on this fact to offer powerful and intuitive semantics.
In block scalars, folding only applies to line feeds that separate text lines starting with a non-space character. Hence, folding does not apply to leading line feeds, line feeds surrounding an empty line ending with a specific line break, or line feeds surrounding a text line that starts with a space character.
The combined effect of the processing rules above is that each "paragraph" is interpreted as a single line, empty lines are used to represent a line feed, and "more indented" lines are preserved. Also, specific line breaks may be safely used to indicate text structure.
[117] space_line_feed   ::= line_feed_break
[118] ignored_line_feed ::= line_feed_break
[119] blk_line_feeds(n) ::= ignored_line_feed
Flow scalars depend on explicit indicators to convey structure, rather than indentation. Hence, in such scalars, all line space preceding or following a line break is not considered to be part of the scalar value. Hence folding flow scalars provides a more relaxed, less powerful semantics. In flow scalars, all leading and trailing white space is stripped from each line. All generic line breaks are folded (even if the line was "more indented").
The combined effect of these processing rules is that each "paragraph" is interpreted as a single line, empty lines are used to represent a line feed, and text can be freely "indented" without affecting the scalar value. Again, specific line breaks may be safely used to indicate text structure.
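Flow folding can be sketched in two steps: strip the white space on both sides of every line break, then fold the remaining line feeds. The function name `fold_flow` is hypothetical, and the input is assumed to be the raw text between the scalar's delimiters with line breaks normalized to LF.

```python
import re


def fold_flow(text: str) -> str:
    """Fold a multi-line flow scalar (illustrative sketch)."""
    # Step 1: white space next to a line break is not part of the value.
    stripped = re.sub(r"[ \t]*\n[ \t]*", "\n", text)

    # Step 2: a single LF becomes a space; n >= 2 LFs become n - 1 LFs.
    def replace(match):
        count = len(match.group(0))
        return " " if count == 1 else "\n" * (count - 1)

    return re.sub(r"\n+", replace, stripped)
```

Unlike block folding, "more indented" lines get no special treatment: their leading white space is simply discarded.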
[120] ignored_trail_spaces   ::= flow_space*
[121] ignored_lead_spaces(n) ::= indent(n)
[122] trail_space_line       ::= ignored_trail_spaces
[123] trail_line_feeds       ::= ignored_trail_spaces
[124] trail_specific_line    ::= ignored_trail_spaces
A folded scalar is similar to a literal scalar. However, unlike a literal scalar, a folded scalar is subject to (block) line folding. This allows long lines to be broken anywhere a space character (#x20) appears, at the cost of requiring an empty line to represent each line feed character.
[125] folded(n)             ::= folded_indicator
[126] folded_value(n)       ::= line_feed_empty_line(n)*
[127] folded_chunk(n)       ::= ( folded_paragraph(n)
[128] folded_paragraph(n)   ::= ( folded_text_line(n)
[129] folded_text_line(n)   ::= indent(n)
[130] folded_after_chunk(n) ::= normalized_break
[131] non_folded_chunk(n)   ::= non_folded_empty(n)
[132] non_folded_empty(n)   ::= indent(<=n)
[133] non_folded_indent(n)  ::= indent(n)
empty: > one paragraph: > Line feeds are converted to spaces, so this value contains no line breaks except for the final one. multiple paragraphs: >2 An empty line, either at the start or in the value: Is interpreted as a line break. Thus this value contains three line breaks. indented text: > This is a folded paragraph followed by a list: * first entry * second entry Followed by another folded paragraph, another list: * first entry * second entry And a final folded paragraph. above is equal to: | This is a folded paragraph followed by a list: * first entry * second entry Followed by another folded paragraph, another list: * first entry * second entry And a final folded paragraph. # Explicit comments may follow # but must be less indented.
The single quoted flow scalar style is indicated by surrounding single quote (') characters. Therefore, within a single quoted scalar such characters need to be escaped. No other form of escaping is done, limiting single quoted scalars to printable characters.
Single quoted scalars are subject to (flow) folding. This allows long lines to be broken everywhere a single space character (#x20) separates non-space characters, at the cost of requiring an empty line to represent each line feed character.
[134] single_quoted(n)         ::= single_quote
[135] single_quoted_value(n)   ::= single_line_feeds(n)
[136] single_line_feeds(n)     ::= ( trail_space_line
[137] single_specific_lines(n) ::= trail_specific_line+
[138] single_text_line(n)      ::= ignored_lead_spaces(n)
[139] single_text_chunk(n)     ::= single_quoted_char
[140] single_quoted_char       ::= escaped_single_quote
[141] escaped_single_quote     ::= single_quote
empty: '' second: '! : \ etc. can be used freely.' third: 'a single quote '' must be escaped.' span: 'this contains six spaces and one line break' is same as: "this contains six spaces\nand one line break"
Escaping allows YAML scalar nodes to specify arbitrary Unicode characters, using C-style escape codes. Non-escaped nodes are restricted to printable Unicode characters.
[142] escape                  ::= '\'
[143] esc_escape              ::= escape escape
[144] esc_double_quote        ::= escape double_quote
[145] esc_bel                 ::= escape 'a'
[146] esc_backspace           ::= escape 'b'
[147] esc_esc                 ::= escape 'e'
[148] esc_form_feed           ::= escape 'f'
[149] esc_line_feed           ::= escape 'n'
[150] esc_return              ::= escape 'r'
[151] esc_tab                 ::= escape 't'
[152] esc_vertical_tab        ::= escape 'v'
[153] esc_null                ::= escape '0'
[154] esc_space               ::= escape #x20
[155] esc_non_breaking_space  ::= escape '_'
[156] esc_next_line           ::= escape 'N'
[157] esc_line_separator      ::= escape 'L'
[158] esc_paragraph_separator ::= escape 'P'
[159] esc_8_bit               ::= escape 'x'
[160] esc_16_bit              ::= escape 'u'
[161] esc_32_bit              ::= escape 'U'
[162] escape_sequence         ::= esc_escape
An escaped line break is completely ignored.
[163] ignored_break ::= escape any_break
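The escape productions above can be turned into a small decoder. This sketch makes several assumptions: the function name `unescape` is hypothetical, the `x`, `u` and `U` escapes are taken to be followed by 2, 4 and 8 hexadecimal digits (inferred from the production names esc_8_bit, esc_16_bit and esc_32_bit), line breaks are already normalized to LF, and white space at the start of a continuation line after an escaped line break is discarded.

```python
import re

# Single-character escapes, following productions [143] to [158].
SIMPLE = {
    "\\": "\\", '"': '"', "a": "\x07", "b": "\x08", "e": "\x1b",
    "f": "\x0c", "n": "\n", "r": "\r", "t": "\t", "v": "\x0b",
    "0": "\x00", " ": " ", "_": "\xa0", "N": "\x85",
    "L": "\u2028", "P": "\u2029",
}
# Assumption: number of hex digits after each numeric escape.
HEX_WIDTH = {"x": 2, "u": 4, "U": 8}


def unescape(text: str) -> str:
    """Decode escape sequences in an escaped scalar (sketch)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling escape character")
        code = text[i + 1]
        if code in SIMPLE:
            out.append(SIMPLE[code])
            i += 2
        elif code in HEX_WIDTH:
            width = HEX_WIDTH[code]
            digits = text[i + 2:i + 2 + width]
            if not re.fullmatch(r"[0-9A-Fa-f]{%d}" % width, digits):
                raise ValueError(f"invalid \\{code} escape")
            out.append(chr(int(digits, 16)))
            i += 2 + width
        elif code == "\n":
            # An escaped line break is ignored, together with the
            # leading white space of the continuation line.
            i += 2
            while i < len(text) and text[i] in " \t":
                i += 1
        else:
            raise ValueError(f"invalid escape sequence \\{code}")
    return "".join(out)
```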
The double quoted style variant adds escaping to the single quoted style variant. This is indicated by surrounding '"' characters. Escaping allows arbitrary Unicode characters to be specified at the cost of some verbosity: escaping the printable '\' and '"' characters. It is an error for a double quoted value to contain invalid escape sequences.
Like single quoted scalars, double quoted scalars may span multiple lines, resulting in a single space content character for each line break. If the line break is escaped, any white space preceding it is preserved, and the line break and any leading white space in the continuation line are discarded.
[164] double_quoted(n)         ::= double_quote
[165] double_quoted_value(n)   ::= double_line_feeds(n)
[166] double_line_feeds(n)     ::= ( trail_space_line
[167] double_specific_lines(n) ::= trail_specific_line+
[168] double_text_line(n)      ::= ignored_lead_spaces(n)
[169] double_text_chunk(n)     ::= double_quoted_char
[170] double_escaped_line(n)   ::= flow_space*
[171] double_empty_lines(n)    ::= ( ignored_lead_spaces(n)
[172] double_quoted_char       ::= escape_sequence
empty: "" second: "! : etc. can be used freely." third: "a \" or a \\ must be escaped." fourth: "this value ends with an LF.\n" span: "this contains four \ spaces" is equal to: "this contains four spaces"
The plain style variant is a restricted form of the single quoted style variant. As it has no identifying markers, it may not start or end with white space characters, may not start with most indicators, and may not contain certain indicators. Also, a plain scalar is subject to implicit typing. This can be avoided by providing an explicit transfer method property.
Since it lacks identifying markers, the restrictions on a plain scalar depend on the context. There are three different such contexts, with increasing restrictions.
Top level plain values are the least restricted plain scalar format. While they can't start with most indicators, they may contain any indicator except ' #' and ':'. Plain scalars used in flow collections are further restricted not to contain the ',' indicator. Finally, plain keys are further restricted to a single line.
[173] plain_top_value(n)      ::= '-' | plain_value(n)
[174] plain_value(n)          ::= plain_first_line
[175] plain_key               ::= '-' | plain_first_line
[176] plain_first_line        ::= plain_first_char
[177] plain_first_char        ::= ( flow_non_space
[178] plain_char              ::= ( flow_non_space
[179] plain_space_indicator   ::= plain_top_indicator
[180] plain_top_indicator     ::= throwaway_indicator
[181] plain_flow_indicator    ::= plain_top_indicator
[182] plain_line_feeds(n)     ::= ( trail_space_line
[183] plain_specific_lines(n) ::= trail_specific_line+
[184] plain_text_line(n)      ::= ignored_lead_spaces(n)
[185] plain_text_chunk(n)     ::= plain_char
first: There is no unquoted empty string. second: 12 ## This is an integer. boolean: - ## This is (false). third: !str 12 ## This is a string. span: this contains six spaces and one line break indicators: this has no comments. #:foo and bar# are both text. flow: [ can span lines, # comment like this ] note: { one-line keys: but multi-line values }
Constraints beyond the syntax productions are required for the consistency of generic YAML utilities such as schema languages, transformation tools, path selection expressions, and bindings between languages. These constraints are best expressed as a set of models describing the various states of processing YAML. A conforming YAML processor must satisfy these constraints.
YAML processing may be understood in terms of four interacting representations of the data: a textual format, an event stream, a generic node graph and native language data structures. For each one of these representations, there is a corresponding information model.
Translating YAML information between these representations are six processing components: a parser, a linker, a loader, a dumper, a serializer and an emitter. The parser extracts structured information from the text stream. The linker resolves aliases in the resulting tree, creating a directed graph. The loader resolves types and formats and converts this graph to native data structures. The dumper, serializer and emitter perform the inverse operations.
    TEXT  --Parser-->   SERIAL  --Linker-->      GRAPH      --Loader-->  NATIVE
          <--Emitter--  (tree)  <--Serializer--  (generic)  <--Dumper--  (language)

                                      /\
                                      ||
                                      \/
                                  Application
A processor need not expose the event stream (serial model) or generic view (graph model) and may translate directly between a text stream and the native binding. However, such a direct translation should take place so that the native binding is constructed only from information available in the native model. In particular, information specific to the graph model (format), serial model (alias anchors and pair ordering) and text model (comments and styles) must not be relied upon in the construction of the native representation. Exceptions to this guideline include editors that must operate directly on the syntax.
The native model may be implemented by arbitrary native data structures of the programming language used. The only constraint on the native representation is that it preserve the information defined by the native model.
Implementations of the graph model are, by necessity, specific to a particular programming language. Such implementations are constrained to provide the information specified by the graph model.
The serial model is often implemented as an event stream, and is important for implementing one-pass operations on YAML data. Again, of necessity implementations are specific to a particular programming language, and are constrained to provide the information required by the serial model.
YAML text syntax is fully defined by this specification and hence any instance is independent of the particular programming language chosen. This allows the definition of generic YAML tools that may be applied independently of the programming language used, and provides a way to interchange data between applications implemented in differing languages.
The "human" view of YAML data contains presentation elements (comments, styles, anchors, key order, format) that are absent from the "machine" view of the data. In a similar manner, for human readable text, it is frequently desirable to omit data typing information which is often obvious to the human reader and not needed. This is especially true if the information is created by hand; expecting humans to bother with data typing detail is optimistic.
The native model abstracts data structures of common programming languages. In YAML's view, any native data is viewed as a directed graph of typed nodes. Nodes that are defined in terms of other nodes are collections and nodes that are defined independent of any other nodes are scalars. YAML supports two kinds of collection nodes, mappings and sequences. The native model also defines when two different nodes have the same content and provides a definition of node identity.
A native node is YAML's building block for data structures. A native node stands for anything from a single integer to a complex data structure such as a complete VRML scene or SQL database. A native node has the following properties:
type family
value
The type family mechanism provides an abstraction of data types that is portable across languages and platforms. Each native binding may have zero or more native concrete types or class constructs that correspond to a given type family.
YAML supports both global and private type families. Global type families have consistent semantics across all YAML documents. Private type families should not be expected to maintain the same semantics in different documents, even if these appear in the same document stream.
name
Global type family names are URIs under the taguri: scheme. Private type family names are URIs under the x-private: scheme. The taguri: scheme is described in http://www.taguri.org. YAML only makes use of taguri: URIs that take the form taguri:domain,date:identifier. Specifically, it does not make use of taguri: URIs that are based on an E-mail address. Nor does it make use of URIs outside the taguri: scheme.
definition
kind
Point structures are mappings). In other cases, deciding on the kind requires a data modeling decision (for example, whether a date is thought of as a single scalar or as a mapping with independent sub-parts).
scalar
sequence
mapping
domain
range
function
collection
In most programming languages, there are two distinct manners in which variables can be equivalent.
identity
equality
Equality is defined between scalar nodes and between collection nodes, as described below.
scalar
equality
collection
equality
For the purpose of node equivalence, all YAML collection type families are considered to be mutable and all scalar type families are considered to be immutable. It is possible to modify the value of a mutable (collection) node "in place". For immutable (scalar) nodes, it is impossible to do so; instead, modifications require the creation of a new, independent scalar value of the same type family and using it instead. To better understand this distinction, consider the following example:
    C syntax: |
        struct Point { int x; int y; } p = { 1, 2 };
    YAML syntax: !Point { x: 1, y: 2 }
It is impossible to modify the integer value 1. The only modification possible is constructing a new, unrelated integer value 3 and using this new value for the X coordinate. Performing this replacement would cause the point to change "in place" from { x: 1, y: 2 } to { x: 3, y: 2 }.
For immutable (scalar) type families, the distinction between equal and identical nodes is only of interest for efficiency reasons (reducing memory usage), and has no semantic significance. Hence for such type families a YAML processor may freely replace two equal but separate (non-identical) nodes with two occurrences of the same (identical) node, and vice versa.
For mutable (collection) type families, however, the distinction between equality and identity is an important part of the information model and a YAML processor is required to preserve node identity.
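The distinction can be illustrated with Python dictionaries standing in for mapping nodes. This is an analogy in one particular native binding, not something the specification prescribes: `==` plays the role of equality and `is` the role of identity.

```python
# Two separate mapping nodes with equal content.
point_a = {"x": 1, "y": 2}
point_b = {"x": 1, "y": 2}
# A second reference to the same node, as an alias would produce.
alias_of_a = point_a

equal = point_a == point_b      # True: same content
identical = point_a is point_b  # False: distinct nodes

# Modifying the collection "in place" is visible through the alias
# but does not affect the equal, non-identical node.
alias_of_a["x"] = 3
```

After the assignment, `point_a` is `{"x": 3, "y": 2}` while `point_b` still holds the original values, which is why a processor must not merge equal collection nodes or split an aliased one.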
A YAML stream is a sequence of disjoint directed graphs, each with a root node.
stream
document
The term disjoint means that for any two nodes x and y, there does not exist a third node z that is reachable from both x and y. A node x is reachable from a node y if and only if either x and y are identical, or y is a collection and there exists a node z in the domain or the range of y such that x is reachable from z.
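The reachability definition translates into a graph traversal. In this sketch, Python dicts and lists stand in for mapping and sequence nodes, and the function names are hypothetical. Only collection nodes are tracked; scalars are skipped on the assumption that, being immutable, their identity carries no semantic significance.

```python
def reachable_collections(root):
    """Return the ids of all collection nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            # Both the domain (keys) and the range (values) are followed.
            children = list(node.keys()) + list(node.values())
        elif isinstance(node, list):
            children = list(node)
        else:
            continue  # scalar: nothing further is reachable through it
        if id(node) in seen:
            continue  # already visited; also guards against cycles
        seen.add(id(node))
        stack.extend(children)
    return seen


def are_disjoint(doc_a, doc_b):
    """True if no collection node is reachable from both documents."""
    return not (reachable_collections(doc_a) & reachable_collections(doc_b))
```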
To access YAML information through a generic API, scalars must be viewed as strings. Since a native data type may be stringified in more than one way, the graph model extends the native model with the concept of a format. This model allows the operation of generic tools to be defined independent of language. Applications constructing a native binding from the graph model should not use formats for the preservation of important data.
It may be possible to write a string value of a scalar in more than one way. For example, an integer value of 255 can also be written in hex as 0xFF. This distinction is covered by the concept of a format. A format defines a way to write the values of a scalar type family as Unicode strings. Using formats allows generic YAML APIs to be implemented in terms of such strings and still allow handling of arbitrary native data.
name
definition
regexp
Formats are an extension required by the graph model, and are not part of the native model. Hence, when constructing native data structures from YAML data, format need not be preserved. For example, a YAML integer node should be loaded to a native integer data type, discarding the information that the integer was serialized in hex format.
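A loader discarding the format might look like the following sketch. The function names are hypothetical, and the two formats handled (decimal, and hexadecimal with a 0x prefix) are only the ones mentioned in the text above, not a complete list for any type family.

```python
def load_int(text: str) -> int:
    """Convert a graph-model integer string to a native integer."""
    if text.lower().startswith("0x"):
        return int(text, 16)  # hexadecimal format
    return int(text, 10)      # decimal format


def canonical_int(value: int) -> str:
    """Write an integer back out; decimal is assumed to be canonical."""
    return str(value)
```

Both `"0xFF"` and `"255"` load to the same native value, so a round trip through the native model cannot recover which format the stream used.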
Each type family used for scalar nodes has associated formats. These formats can be separated into two groups, implicit formats and explicit formats. In addition, one of the formats is designated to be the type family's canonical format.
Type families used for collection nodes do not have any associated formats.
implicit formats
explicit formats
canonical format
canonical format. This must be one of the implicit or explicit formats, or a subset of one of these formats. The canonical format must provide exactly one unique string representation for each possible value of the scalar.
In the graph model, each scalar node has an associated format that is one of those defined by the node's type family. Collection nodes do not have an associated format. The value of graph scalar nodes is a Unicode string that is a representation of the appropriate native value using the node's format.
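As an illustration of formats, the following sketch loads an integer written in either of two formats and writes it back in a single canonical one. The format names, the regular expressions and the choice of decimal as the canonical format are assumptions made for this example, not definitions taken from a YAML type family.

```python
import re

# Hypothetical formats for an integer type family: (name, pattern, base).
INT_FORMATS = [
    ("decimal", re.compile(r"[-+]?(0|[1-9][0-9]*)"), 10),
    ("hex", re.compile(r"[-+]?0x[0-9a-fA-F]+"), 16),
]

def load_int(text):
    """Convert a graph-model string to a native int, discarding the format."""
    for _name, pattern, base in INT_FORMATS:
        if pattern.fullmatch(text):
            return int(text, base)
    raise ValueError(f"not an integer in any known format: {text!r}")

def dump_int_canonical(value):
    """Write a native int using the (assumed) canonical decimal format."""
    return str(value)
```

Loading `0xFF` and `255` yields the same native value, and dumping that value in the canonical format gives one string regardless of how it was originally written.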
In the graph model, the type family and format are optional. When a type family or format is missing, we say that it is implicit. If a format is provided, the type family is mandatory. Since the type family is mandatory in the native model, the loader must resolve the type family and format using implicit typing. When native data is converted to YAML, the dumper is responsible for deciding which graph nodes will have explicit type family and format.
Since graph nodes may be implicitly typed, the loader may not be able to determine the type family of the node. Even when the type family is known, the loader may not have an appropriate native type. Therefore, a given YAML document need not have a native binding for every programming language or application.
To allow for YAML to be communicated as a sequence of events, an ordered tree structure must be used instead of a graph. This section describes an extension to the graph model where the graph is flattened and ordered to provide a serial interface by using aliases for recurring nodes and by imposing key order. Applications constructing a native binding from the serial model should not use these extensions for the preservation of important data.
To lay out graph nodes as a tree structure, a mechanism is needed to manage duplicate occurrences. This is solved using an additional node kind, alias. The first occurrence of a node is represented using a serial node of the appropriate kind. Subsequent occurrences of either a collection or a scalar are represented by an alias node.
All nodes in the serial model have the following properties in addition to the properties defined in the graph model:
parent
anchor
Note that when a serial node is converted to a graph node, the anchor, if any, is not converted. Likewise the parent property and the alias kind are not preserved as the graph node may be contained in several collections.
The alias node represents subsequent occurrences of a graph node in the serialization. Like all serial nodes, an alias node has a parent and an anchor property. In addition, an alias node has a single additional property:
referent
When an alias node is converted into a graph node it becomes a subsequent occurrence of the graph node represented by its referent node.
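The flattening of a graph into serial nodes can be sketched as follows. The `SerialNode` class, the `to_serial` function and the `id001`-style anchor names are hypothetical. For brevity the sketch only creates aliases for collections (lists and dicts), since identity carries no semantic significance for scalars, and it copies the referent's anchor onto the alias, which is an assumption of this example.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(eq=False)
class SerialNode:
    kind: str                                 # scalar, sequence, mapping or alias
    value: Any = None                         # str, list of nodes, or list of pairs
    parent: Optional["SerialNode"] = None
    anchor: Optional[str] = None
    referent: Optional["SerialNode"] = None   # set for alias nodes only

def to_serial(node, parent=None, seen=None, names=None):
    """Flatten a native graph into a tree of serial nodes."""
    seen = {} if seen is None else seen
    names = itertools.count(1) if names is None else names
    if isinstance(node, (list, dict)) and id(node) in seen:
        first = seen[id(node)]
        if first.anchor is None:              # anchor the first occurrence lazily
            first.anchor = f"id{next(names):03d}"
        return SerialNode("alias", parent=parent,
                          anchor=first.anchor, referent=first)
    if isinstance(node, list):
        serial = SerialNode("sequence", parent=parent)
        seen[id(node)] = serial
        serial.value = [to_serial(n, serial, seen, names) for n in node]
    elif isinstance(node, dict):
        serial = SerialNode("mapping", parent=parent)
        seen[id(node)] = serial
        serial.value = [(to_serial(k, serial, seen, names),
                         to_serial(v, serial, seen, names))
                        for k, v in node.items()]
    else:
        serial = SerialNode("scalar", value=str(node), parent=parent)
    return serial
```

Only nodes that actually recur receive an anchor; every other serial node keeps its anchor unset.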
A pair is an ordered set of two serial nodes. The first member of the set is the key and the second member of the set is the value.
Mapping serial nodes represent the first occurrence of a mapping in a given serialization. The value of mapping serial nodes is an ordered set of node pairs.
When a mapping serial node is converted into a graph node, three operations occur. The domain is constructed with the graph node for each key in its set of pairs. Likewise, the range is constructed with the graph node for each value in its set of pairs. Last, the function is constructed via association of key graph nodes to value graph nodes, as provided by the set of pairs. Note that the ordering of the pairs is explicitly not converted.
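The three operations can be sketched in a few lines. The function name `mapping_to_graph` is hypothetical, and the sketch assumes that the keys and values have already been converted to graph nodes and that keys are hashable Python values.

```python
def mapping_to_graph(pairs):
    """Build the domain, range and function from an ordered list of pairs."""
    domain = set()
    function = {}
    for key, value in pairs:
        if key in domain:                  # each key must be unique
            raise ValueError(f"duplicate key: {key!r}")
        domain.add(key)
        function[key] = value
    value_range = list(function.values())
    return domain, value_range, function
```

Two pair lists that differ only in their ordering produce equal domains and equal functions, which reflects the fact that pair order is not carried over to the graph model.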
When serializing a YAML graph, every serial node is put into a single linear sequence within a given document through the mapping pair ordering. With the composition of collections, this ordering becomes total. For any two nodes or aliases, x and y, we say that x precedes y when any of the following holds:
x is the parent of y.
x and y are nodes within a sequence, and x appears before y.
x is a key and y is a value in a given pair.
x and y are nodes in two pairs within a mapping, and the pair containing x comes before the pair containing y.
There exists a node z such that x precedes z and z precedes y.
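One linear sequence that is consistent with every rule above is obtained by a pre-order traversal of the serial tree. The sketch below assumes a hypothetical `Node` class whose `children` list holds the entries of a sequence in order, or the keys and values of a mapping interleaved as key, value, key, value.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        # Sequence: entries in order. Mapping: key, value, key, value, ...
        self.children = list(children)

def linear_order(root):
    """Return all nodes of the tree in pre-order (parent before children)."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(node.children))   # leftmost child is popped first
    return order

def precedes(x, y, root):
    """Return True if x comes strictly before y in the linear order."""
    position = {id(n): i for i, n in enumerate(linear_order(root))}
    return position[id(x)] < position[id(y)]
```

This corresponds to the order in which the nodes would be written to, or read from, a stream.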
To enhance readability, a YAML serialization extends the serial model with syntax styles, comments, directives and other syntactical details. Although the parser may provide this information, applications should take care not to use these features to encode information found in a native binding.
In the syntax, each node has an additional style property, depending on its kind.
scalar style
collection style
The syntax allows optional comment blocks to be interleaved with the node blocks. Comment blocks may appear before or after any node block. A comment block can't appear inside a scalar node value.
comment
Attached to each document is a document directive section.
directive section
The YAML syntax contains multiple mechanisms for increased readability, such as escaping, indentation, folding, line break normalization etc. While the parser may make such details available, they should not be used to encode information required for the construction of the native data.
Every native node has, by definition, a type family. However this type family may be missing (implicit) from the graph model. YAML provides three mechanisms for identifying the true type family (and format) of each node.
!', all by itself. The loader is then responsible for assigning a type family (and format) for such nodes. This is done in an application-specific manner. However, it is common practice to base such implicit typing on the implicit formats of scalar type families. Similarly, implicit typing of collection nodes may be based on the kind of the collection node and its contents. Implicit typing of a node may also depend on its position in the graph.
The implicit typing rules depend on the application. It is possible to parse a document without being aware of these rules. However, without knowledge of these rules, loading an implicitly typed node to native data structures is not possible.
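A loader's implicit typing of plain scalars can be sketched as an ordered list of patterns, each standing for the implicit format of one type family, with str as the fallback. The rule table, its regular expressions and the function name `resolve_plain_scalar` are assumptions of this example; a real application would define its own rules.

```python
import re

# Hypothetical, application-specific implicit formats, tried in order.
IMPLICIT_RULES = [
    ("int", re.compile(r"[-+]?[0-9]+"), int),
    ("float", re.compile(r"[-+]?[0-9]+\.[0-9]+"), float),
]

def resolve_plain_scalar(text):
    """Return (type family, native value) for an implicitly typed scalar."""
    for family, pattern, construct in IMPLICIT_RULES:
        if pattern.fullmatch(text):
            return family, construct(text)
    # Nothing matched: fall back to the str type family.
    return "str", text
```

Because the rules live in the application, a second application with a different table may load the same document to different native values.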
Following is a description of the three mandatory core type families. YAML requires support for the seq, map and str type families. YAML also provides a list of language-independent types that are not mandatory in the YAML type repository available at https://yaml.org/type. These types map to native data types in most programming languages or are useful in a wide range of applications. Hence applications are strongly encouraged to make use of them whenever they are appropriate in order to improve interoperability between YAML systems.
This type family is typically used for implicitly typing sequence nodes. Example bindings include the Perl array, Python's list or tuple, and Java's array or vector.
name: taguri:yaml.org,2002:seq
shorthand: !seq
definition: Collections indexed by sequential integers starting with zero.
kind: Sequence.
    # The following are equal seqs
    # with different identities.
    flow: [ one, two ]
    spanning: [ one,
         two ]
    block:
      - one
      - two
This type family is typically used for implicitly typing mapping nodes. Example bindings include the Perl hash, Python's dictionary, and Java's Hashtable.
name: taguri:yaml.org,2002:map
shorthand: !map
definition: Associative container, where each key is unique in the association and mapped to exactly one value.
kind: Mapping.
    # The following are equal maps
    # with different identities.
    flow: { one: 1, two: 2 }
    block:
      one: 1
      two: 2
This type family is used as the default for all scalar styles with the exception of plain scalars, unless they are given an explicit transfer method property. Also, it is typically used as the default implicit type family for all plain scalars that don't match any other implicit type.
This type is usually bound to the native language's string or character array construct. Note that generic YAML tools should have an immutable (const) interface to such constructs even when the language default is mutable (such as in C/C++).
URI: taguri:yaml.org,2002:str
shorthand: !str
definition: Unicode strings, a sequence of zero or more Unicode characters.
kind: Scalar.
formats:
    default (implicit, canonical) ~= .*
    # Assuming an application
    # using implicit integers.
    - 12 # An integer
    - ! "12" # Also an integer.
    # The following scalars
    # are loaded to the
    # string value '1' '2'.
    - !str 12
    - '12'
    - "12"
    - "\
      1\
      2"
    # Otherwise, everything is a string:
    - /foo/bar
    - 192.168.1.1