Exploring a Document, but Encoding a Text

markup = annotation or other marks within a text intended to instruct a compositor, typist, or web developer how a particular passage should be printed, laid out, or displayed

markup language = set of markup conventions specifying how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means

Generalizing from that sense, we define encoding as any means of making explicit an interpretation of a text using a markup language.

What is XML?

XML stands for eXtensible Markup Language. It is a widely used standard for storing and exchanging information. For our purposes as researchers, it is an excellent method for storing information and for preparing to share it with the public. We write XML to form hierarchies (nested structures) of information so that we can locate and extract that information later, whether for presentation as HTML, for data visualizations, or simply to make it searchable. XML is interested in the meaning of data more than in its presentation. Many other markup languages are mainly concerned with how a document appears; because XML does not have a fixed set of tags, it can extend beyond presentational markup. This makes XML documents multi-purpose: you can mark up a text once and then use it for multiple purposes.

This is an XML element:

<element attribute="attributeValue">Hello I am the content of this element.</element>

A tag is the text between a left angle bracket (<) and a right angle bracket (>). There are start tags and end tags. A start tag holds the element name inside angle brackets, for example <surname>; an end tag looks like a start tag, except that it has a forward slash after the opening angle bracket, for example </surname>.
An element is the start tag, the end tag, and everything in between. This can include text and/or other elements. Here is an example of nested elements: <person handle="RJP43" pronoun="she">Rebecca <surname>Parker</surname></person>. When we talk about an element, we’re referring to the whole thing. The element name is the word that follows the opening angle bracket in the start tag and the forward slash in the end tag (here, person and surname).
An attribute is a name-value pair inside the start tag of an element. Attributes are additional markup that gives supplementary information about an element (they are sort of like adjectives, or descriptive modifiers). Each one consists of an attribute name and an attribute value, as in handle="RJP43".
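
For readers who want to see these parts separated by a program, here is a minimal sketch that assumes Python 3 and its standard-library module xml.etree.ElementTree (Python is not part of this lesson; it is used here only as a convenient checker). It parses the nested example above and pulls out the element name, the attributes, and the content.

```python
import xml.etree.ElementTree as ET

# The nested example from the lesson, as a string.
xml_text = '<person handle="RJP43" pronoun="she">Rebecca <surname>Parker</surname></person>'

person = ET.fromstring(xml_text)   # parse the string into an element

element_name = person.tag          # the element name
attributes = person.attrib         # attribute names mapped to attribute values
own_text = person.text             # text that comes before the first child element
child = person.find("surname")     # the nested <surname> element
full_text = "".join(person.itertext())  # all text content, children included

print(element_name)   # person
print(attributes)     # {'handle': 'RJP43', 'pronoun': 'she'}
print(repr(own_text)) # 'Rebecca '
print(child.text)     # Parker
print(full_text)      # Rebecca Parker
```

Note that the element's content is both the text "Rebecca " and the whole surname element, which is why the full text has to be gathered from the element and its children together.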

This is a self-closing element:

<lb n="1251"/>
In special cases, XML elements can have no content at all. These are called self-closing elements, and they have a special syntax so that they open and close inside a single tag. Self-closing elements:

Don't contain text or any other elements.
Consist of a single tag that combines the start and end tags, with the forward slash placed just before the closing angle bracket.
May have attributes.
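
The three points above can be checked with a short sketch, again assuming Python 3 and the standard-library xml.etree.ElementTree module as an outside helper. It parses the self-closing example and compares it with the same element written as an empty start tag and end tag pair.

```python
import xml.etree.ElementTree as ET

self_closing = ET.fromstring('<lb n="1251"/>')
paired = ET.fromstring('<lb n="1251"></lb>')   # same element, written with two tags

has_no_text = self_closing.text is None        # no text content
has_no_children = len(self_closing) == 0       # no nested elements
line_number = self_closing.get("n")            # attributes are still allowed

# Compare name, attributes, and content of the two spellings.
same_as_paired = (
    (self_closing.tag, self_closing.attrib, self_closing.text)
    == (paired.tag, paired.attrib, paired.text)
)

print(has_no_text, has_no_children, line_number, same_as_paired)
```

To this parser the two spellings describe the same empty element; the self-closing form is simply the shorter way to write it.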

This is an XML comment:

<!-- This is an XML comment. -->

Note: Two dashes in the middle of a comment are not allowed. When writing XML comments we recommend encoders provide their initials and the date the comment is being left. We want you to think of XML comments as breadcrumbs to future encoders and processors of your XML; therefore, be sure to use complete sentences and leave logical comments that can be understood by others even after you are no longer working on the project.

These are XML Reserved Characters & How To Escape Them:

< less than - &lt;
> greater than - &gt;
& ampersand - &amp;
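
As an illustration of why escaping matters, the sketch below assumes Python 3 and two standard-library helpers: xml.sax.saxutils.escape, which replaces the three reserved characters with their escaped forms, and xml.etree.ElementTree, which parses the result. The sample sentence and the note element are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Tom & Jerry: 3 < 5 > 2"   # hypothetical text containing reserved characters
escaped = escape(raw)            # reserved characters replaced with &amp; &lt; &gt;

# With escaping, the text sits safely inside an element and reads back unchanged.
note = ET.fromstring(f"<note>{escaped}</note>")
round_trip = note.text

# Without escaping, the parser rejects the document.
try:
    ET.fromstring(f"<note>{raw}</note>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

print(escaped)           # Tom &amp; Jerry: 3 &lt; 5 &gt; 2
print(round_trip)        # Tom & Jerry: 3 < 5 > 2
print(unescaped_parses)  # False
```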

Understanding the XML Hierarchy

Elements when brought together conform to a particular hierarchy. The following three analogies will help you better understand a well-formed, properly-nested XML hierarchy:

Nesting Dolls

Elements are Russian nesting dolls: “well-formedness = nested-ness.” Everything is properly delimited, there is a single root element (“the big doll”) that contains all of the other elements, both structural and contextual in nature, and no elements overlap.
Family Tree

Elements form trees - Reference relationships: Ancestor, Descendant, Sibling, Parent, Grandparent. Humanities scholars use XML to represent their documents because the tree model is convenient both as a logical representation (meaning some aspects of the inherent structure of documents are tree-like) and for programming purposes (meaning computers can process tree representations efficiently).
Boxes in Boxes

Elements are boxes, and attributes distinguish the box types.
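
The family-tree analogy can be made concrete with a small sketch, assuming Python 3 and the standard-library xml.etree.ElementTree module. The letter document and its element names (letter, opener, salute, body, p) are hypothetical, chosen only to give the tree a few generations.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: one root element with nested descendants.
xml_text = """<letter>
  <opener><salute>Dear friend,</salute></opener>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</letter>"""

def outline(element, depth=0):
    """Return the element names, indented by how deeply each one is nested."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(xml_text)
tree_lines = outline(root)

# Map every child element to its parent so relationships can be looked up.
parent_of = {child: parent for parent in root.iter() for child in parent}
first_p, second_p = root.findall("./body/p")

print("\n".join(tree_lines))
print(parent_of[first_p].tag)                       # body (the parent)
print(parent_of[first_p] is parent_of[second_p])    # True (the two p elements are siblings)
print(parent_of[parent_of[first_p]].tag)            # letter (the grandparent, also the root)
```

The indented outline is the nesting-dolls view, and the parent lookups are the family-tree view of the same document.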

Rules for "Well-Formed" XML

The XML prolog is optional; however, if it exists, it must come first in the document.
Example XML prolog: <?xml version="1.0" encoding="UTF-8"?>
An XML document must be contained in a single element. That single element is called the root element, and it contains all the text and any other elements.
XML elements can't overlap: elements must be properly nested, and each one needs a start tag and an end tag (or must be self-closing).
XML element names are case sensitive, so the start and end tags must match exactly: <person>, <PERSON>, and <Person> are three different element names.
Attributes must have values, and those values must be enclosed within quotation marks.
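
One way to try these rules out is to hand small snippets to an XML parser and see which ones it accepts. The sketch below assumes Python 3 and the standard-library xml.etree.ElementTree module; the helper name is_well_formed and the sample snippets are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical snippets, one per rule listed above.
examples = {
    "properly nested": "<p><name>Ada</name></p>",
    "prolog first": '<?xml version="1.0" encoding="UTF-8"?><p>one</p>',
    "prolog after a comment": '<!-- note --><?xml version="1.0"?><p>one</p>',
    "two root elements": "<p>one</p><p>two</p>",
    "overlapping elements": "<p><name>Ada</p></name>",
    "case mismatch": "<person>Ada</Person>",
    "attribute without a value": "<p n>one</p>",
    "unquoted attribute value": "<p n=1>one</p>",
}

results = {label: is_well_formed(text) for label, text in examples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the first two snippets pass. Well-formedness is the minimum an XML document has to meet before it can be processed at all; this sketch checks nothing beyond that.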

The lessons and exercises constructed for this course incorporate materials from Dr. Elisa Beshero-Bondar's Digital Humanities courses, the Digital Mitford Coding School, the Text Encoding Initiative's learning resources, GitHub Guides, and the GitHub Help resources. This repository is public-facing; therefore, the lessons and exercises herein are licensed under a CC BY-NC-SA license.
