Working with Pandoc

Aaron edited this page Jan 16, 2015 · 15 revisions

About pandoc

Pandoc is free software, released by John MacFarlane under the GNU General Public License, that converts files from one markup format into another. Pandoc supports conversions between a tremendous variety of formats, including markdown, reStructuredText, textile, HTML, DocBook, LaTeX, MediaWiki markup, TWiki markup, OPML, Emacs Org-Mode, Txt2Tags, Microsoft Word docx, EPUB, and Haddock markup.

Pandoc-supported Conversions:

For more information, check out the pandoc project website.

Installing pandoc

Installing pandoc is straightforward:

Windows

Windows users can utilize a package installer, available from pandoc’s download page. To produce PDF output, you’ll also need to install LaTeX. Pandoc recommends MiKTeX.

Mac OS X

Mac OS X users can also install pandoc via a package installer from pandoc’s download page. If you later want to uninstall the package, you can do so by downloading this script and running it from your terminal with perl uninstall-pandoc.pl.

For PDF output, you’ll also need LaTeX. Because a full MacTeX installation takes more than a gigabyte of disk space, Pandoc recommends installing BasicTeX (64M), and using the tlmgr tool to install additional packages as needed. If you get errors warning of fonts not found, try:

tlmgr install collection-fontsrecommended

Linux

For 64-bit Debian and Ubuntu, pandoc provides a package installer via its download page. This will install the pandoc and pandoc-citeproc executables and man pages. Users of other Linux flavors can check whether pandoc is available through their package manager. Pandoc is in the Debian, Ubuntu, Slackware, Arch, Fedora, NixOS, and Gentoo repositories. Note, however, that versions in the repositories are often old, so you may be better off just installing from source. Check out pandoc's installation instructions for details.

For PDF output, you’ll need LaTeX. Pandoc recommends installing TeX Live.

Using pandoc

PLEASE NOTE: This section is intended as a 'quick-and-dirty' CCCS reference for using pandoc to run a few of the basic conversions that we use most often. The pandoc website provides comprehensive usage instructions and examples, including a great 'Getting Started' page that helps users who are not familiar with command-line operations learn how to navigate the file system and execute basic pandoc commands. Its 'User's Guide' gives detailed usage information and instructions. Please refer to the official pandoc documentation if you wish to run specific (or complicated) conversions that are not addressed here.

Understanding and Implementing Basic Conversion Operations

Accessing the command line

Pandoc is a command-line tool; there is no graphical user interface. In other words, to use pandoc, you need to work through a terminal interface [or 'virtual console'].

Deploying the terminal in Windows:

Windows users can use either the classic command prompt or the more modern PowerShell terminal. If you use Windows in desktop mode, you can access your terminal by typing either cmd or powershell into the Start menu's 'search' tool and double-clicking the application icon to run it.

Deploying the terminal in Mac OS X:

On OS X, the 'Terminal' application can be found in /Applications/Utilities. You can access the terminal via the 'Finder' window, navigating first to 'Applications' and then to 'Utilities'. Once again, double click the 'Terminal' icon to run it.

Deploying the terminal in Linux:

Most Linux users will be comfortable launching the terminal application. For those who are not, the way to access your terminal varies according to your distro. Usually, you can launch it using a key command such as ctrl+alt+t (or something close to it). To find the terminal via your GUI, try clicking the 'start' icon for your distro and entering terminal into the search field.

Basic operations

You operate pandoc by entering commands into the terminal. To start, let’s verify that pandoc is installed by issuing the following command:

pandoc --version

You should see a message telling you which version of pandoc is installed, and giving you some additional information.
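If you would rather get a clear message than a 'command not found' error, the check above can be wrapped in a small guard. This is only a sketch: the variable name pandoc_status is made up for illustration, and the wording of the messages is arbitrary.

```shell
# Report whether pandoc is reachable on PATH, then print its version line.
# pandoc_status is a hypothetical variable name used only in this sketch.
if command -v pandoc >/dev/null 2>&1; then
    pandoc_status="installed"
    # The first line of the --version output carries the version number.
    pandoc --version | head -n 1
else
    pandoc_status="missing"
    echo "pandoc was not found on PATH; revisit the installation steps"
fi
```

If the guard reports that pandoc is missing right after an install, opening a fresh terminal window is usually worth trying first, since an already-open terminal may not have picked up the updated PATH.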

Converting a file

Basic conversion requires only specifying a 'source' file and an 'output' file, as follows:

    pandoc input.txt -o output.html

In the example above, one option (also known as a 'flag') is used: -o. It means "output", and it can also be spelled out as a full word: --output.

The command syntax is fairly flexible: the order of the arguments can vary as long as the appropriate option flags are used. Thus, the following command produces the same result as the one above:

    pandoc -o output.html input.txt 

[NOTE: If no input-file is specified, input is read from stdin. If multiple input files are given, pandoc will concatenate them all (with blank lines between them) before parsing. Output goes to stdout by default.]
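The stdin, stdout, and multiple-file behavior described in the note can be tried out with a sketch like the one below. The file names (part1.md, part2.md, combined.html) are placeholders, and the guard simply skips the pandoc calls on a machine where pandoc is not installed.

```shell
# Create two small markdown files (placeholder names) to use as input.
printf '# Part one\n\nFirst file.\n' > part1.md
printf '# Part two\n\nSecond file.\n' > part2.md

if command -v pandoc >/dev/null 2>&1; then
    # No input file: markdown is read from stdin.
    # No -o option: the HTML is printed to stdout.
    printf 'Some *emphasised* text.\n' | pandoc -f markdown -t html

    # Several input files: pandoc concatenates them before parsing,
    # so both parts end up in one output document.
    pandoc part1.md part2.md -o combined.html
fi
```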

A document's format can be specified explicitly. The input format is specified using the -r / --read or -f / --from options. The output format is specified using the -w / --write or -t / --to options. Thus, to convert example.txt from markdown to a PDF generated by way of LaTeX, you could type:

    pandoc -f markdown -t latex example.txt -o example.pdf

or, equivalently:

    pandoc example.txt -f markdown -t latex -o example.pdf

If the input or output format is not specified explicitly, pandoc will attempt to guess it from the extensions of the input and output filenames. Thus, the command:

    pandoc -o example.pdf example.md

will convert example.md from markdown to a LaTeX-generated pdf document.

[NOTE: If the input files' extensions are unknown, the input format will be assumed to be markdown unless explicitly specified.]
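As a rough illustration of the guessing rules above, the sketch below converts the same text twice: once relying on the file extensions, and once stating the input format explicitly because the extension is not one pandoc would recognize. The file names, including the made-up .unknown extension, are placeholders.

```shell
# The same markdown text under a recognised and an unrecognised extension.
printf '# Title\n\nBody text.\n' > example.md
cp example.md example.unknown

if command -v pandoc >/dev/null 2>&1; then
    # Both formats are guessed from the extensions: markdown in, HTML out.
    pandoc example.md -o example.html

    # Unrecognised input extension: name the input format with -f
    # instead of relying on the markdown default.
    pandoc -f markdown example.unknown -o example-explicit.html
fi
```

Spelling out -f and -t costs a few keystrokes but makes a command easier to reuse in scripts, where file names may change.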

IMPORTANT: Pandoc uses the UTF-8 character encoding for both input and output. If your local character encoding is not UTF-8, you will need to pipe input and output through iconv:

    iconv -t utf-8 input.txt | pandoc | iconv -f utf-8
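Here is a slightly fuller sketch of the same idea with the encodings named explicitly. It assumes, purely for illustration, that the local encoding is ISO-8859-1 (Latin-1); the file names are placeholders, and the pandoc step is skipped if pandoc is not installed.

```shell
# Build a one-line Latin-1 file: \351 is the single byte for "é" in ISO-8859-1.
printf 'caf\351\n' > latin1.txt

# Re-encode to UTF-8, which is what pandoc expects as input.
iconv -f iso-8859-1 -t utf-8 latin1.txt > utf8.txt

if command -v pandoc >/dev/null 2>&1; then
    # Full pipeline: Latin-1 in, UTF-8 through pandoc, Latin-1 out.
    iconv -f iso-8859-1 -t utf-8 latin1.txt \
        | pandoc -f markdown -t html \
        | iconv -f utf-8 -t iso-8859-1 > latin1.html
fi
```

Naming both the source and target encodings avoids depending on whatever the current locale happens to be.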

Referencing a Template for Stylized File Conversions

Options -- Quick Reference Guide

The following 'General Options' should cover most use cases. A more comprehensive list of options is available via the pandoc 'User's Guide'.

General File Definition

-f FORMAT, -r FORMAT, --from=FORMAT, --read=FORMAT

Specify input format. Markdown syntax extensions can be individually enabled or disabled by appending +EXTENSION or -EXTENSION to the format name. So, for example, markdown_strict+footnotes+definition_lists is strict markdown with footnotes and definition lists enabled, and markdown-pipe_tables+hard_line_breaks is pandoc’s markdown without pipe tables and with hard line breaks. See Pandoc’s markdown for a list of extensions and their names.

-t FORMAT, -w FORMAT, --to=FORMAT, --write=FORMAT

Specify output format. Note that odt, epub, and epub3 output will not be directed to stdout; an output filename must be specified using the -o/--output option. If +lhs is appended to markdown, rst, latex, beamer, html, or html5, the output will be rendered as literate Haskell source (see Literate Haskell support). Markdown syntax extensions can be individually enabled or disabled by appending +EXTENSION or -EXTENSION to the format name, as described above under -f.

-o FILE, --output=FILE

Write output to FILE instead of stdout. If FILE is -, or if the option is omitted, output will go to stdout. (Exception: if the output format is odt, docx, epub, or epub3, output to stdout is disabled.)
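The -f, -t, and -o options described so far can be combined as in the sketch below. The file names are placeholders, and footnotes is used only because it is one of the extensions named above; the pandoc calls are skipped if pandoc is not installed.

```shell
# A small input file with one footnote (notes.md is a placeholder name).
printf 'A claim.[^1]\n\n[^1]: The supporting note.\n' > notes.md

if command -v pandoc >/dev/null 2>&1; then
    # -f: strict markdown plus the footnotes extension
    # -t: HTML output
    # -o: write the result to a file
    pandoc -f markdown_strict+footnotes -t html -o notes.html notes.md

    # -f: pandoc's markdown with the footnotes extension switched off,
    #     so the footnote syntax is not treated as a footnote
    # -o -: send the result to stdout
    pandoc -f markdown-footnotes -t html -o - notes.md
fi
```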

--data-dir=DIRECTORY

Specify the user data directory to search for pandoc data files. If this option is not specified, the default user data directory will be used. For UNIX or UNIX-like systems, this is

    $HOME/.pandoc

For Windows XP, the path is

    C:\Documents And Settings\USERNAME\Application Data\pandoc

For Windows 7, the path is

    C:\Users\USERNAME\AppData\Roaming\pandoc

[NOTE: You can find the default user data directory on your system by looking at the output of pandoc --version. A reference.odt, reference.docx, default.csl, epub.css, templates, slidy, slideous, or s5 directory placed in this directory will override pandoc’s normal defaults.]
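To see whether any overrides are already in place, you could run something like the sketch below. It assumes the UNIX-style default of $HOME/.pandoc given above (check the output of pandoc --version if your installation reports a different directory), and ./pandoc-data is a hypothetical project-local directory used only to show the --data-dir option.

```shell
# Default user data directory on UNIX-like systems, per the text above.
data_dir="$HOME/.pandoc"

# List which override files or directories, if any, are present there.
for name in reference.odt reference.docx default.csl epub.css templates; do
    if [ -e "$data_dir/$name" ]; then
        echo "override found: $data_dir/$name"
    fi
done

# Point pandoc at a different data directory for a single run.
# ./pandoc-data is a hypothetical path; it starts out empty here,
# so pandoc falls back to its normal defaults.
if command -v pandoc >/dev/null 2>&1; then
    mkdir -p ./pandoc-data
    printf 'Hello.\n' | pandoc --data-dir=./pandoc-data -f markdown -t html
fi
```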

Reader Options

-R, --parse-raw

Parse untranslatable HTML codes and LaTeX environments as raw HTML or LaTeX, instead of ignoring them. Affects only HTML and LaTeX input. Raw HTML can be printed in markdown, reStructuredText, HTML, Slidy, Slideous, DZSlides, reveal.js, and S5 output; raw LaTeX can be printed in markdown, reStructuredText, LaTeX, and ConTeXt output. The default is for the readers to omit untranslatable HTML codes and LaTeX environments. (The LaTeX reader does pass through untranslatable LaTeX commands, even if -R is not specified.)

-S, --smart

Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.” (Note: This option is significant only when the input format is markdown, markdown_strict, textile or twiki. It is selected automatically when the input format is textile or the output format is latex or context, unless --no-tex-ligatures is used.)

--old-dashes

Selects the pandoc <= 1.8.2.1 behavior for parsing smart dashes: - before a numeral is an en-dash, and -- is an em-dash. This option is selected automatically for textile input.

--base-header-level=NUMBER

Specify the base level for headers (defaults to 1).

--default-image-extension=EXTENSION

Specify a default extension to use when image paths/URLs have no extension. This allows you to use the same source for formats that require different kinds of images. Currently this option only affects the markdown and LaTeX readers.

--normalize

Normalize the document after reading: merge adjacent Str or Emph elements, for example, and remove repeated Spaces.

-p, --preserve-tabs

Preserve tabs instead of converting them to spaces (the default). Note that this will only affect tabs in literal code spans and code blocks; tabs in regular text will be treated as spaces.

--tab-stop=NUMBER

Specify the number of spaces per tab (default is 4).

--track-changes=accept|reject|all

Specifies what to do with insertions and deletions produced by the MS Word “track-changes” feature. accept (the default) inserts all insertions and ignores all deletions. reject inserts all deletions and ignores insertions. all puts in both insertions and deletions, wrapped in spans with insertion and deletion classes, respectively. The author and time of change are included. all is useful for scripting: only accepting changes from a certain reviewer, say, or before a certain date. This option only affects the docx reader.

--extract-media=DIR

Extract images and other media contained in a docx or epub container to the path DIR, creating it if necessary, and adjust the image references in the document so they point to the extracted files. This option only affects the docx and epub readers.
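The two docx-related reader options above, --track-changes and --extract-media, might be used together as in the sketch below. To stay self-contained it first builds a docx from a tiny markdown file, which means there are no tracked changes or images to find; in a real workflow you would start from an actual Word document. All file and directory names are placeholders.

```shell
# A tiny markdown source used only to manufacture a docx for this sketch.
printf '# Draft\n\nA paragraph to round-trip.\n' > draft.md

if command -v pandoc >/dev/null 2>&1; then
    # Make a docx to read back in (stand-in for a real Word file).
    pandoc draft.md -o draft.docx

    # Read the docx, keep both insertions and deletions as marked spans,
    # and extract any embedded media into ./media.
    pandoc draft.docx -f docx -t markdown \
        --track-changes=all --extract-media=media \
        -o draft-roundtrip.md
fi
```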

Writer Options

Citation Rendering

Math Rendering