Working with Pandoc
Pandoc is free software, released by John MacFarlane under the GNU General Public License, that converts files from one markup format into another. Pandoc supports conversions between a wide variety of formats, including markdown, reStructuredText, textile, HTML, DocBook, LaTeX, MediaWiki markup, TWiki markup, OPML, Emacs Org-Mode, Txt2Tags, Microsoft Word docx, EPUB, and Haddock markup.
Pandoc-supported Conversions:
For more information, check out the pandoc project website.
Installing pandoc is straightforward:
Windows
Windows users can use the package installer available from pandoc’s download page. To produce PDF output, you’ll also need to install LaTeX. Pandoc recommends MiKTeX.
Mac OS X
Mac OS X users can also install pandoc via a package installer from pandoc’s download page. If you later want to uninstall the package, you can do so by downloading this script and running it from your terminal with perl uninstall-pandoc.pl.
For PDF output, you’ll also need LaTeX. Because a full MacTeX installation takes more than a gigabyte of disk space, Pandoc recommends installing BasicTeX (64M) and using the tlmgr tool to install additional packages as needed. If you get errors warning of fonts not found, try:
tlmgr install collection-fontsrecommended
Linux
For 64-bit Debian and Ubuntu, pandoc provides a package installer via its download page. This will install the pandoc and pandoc-citeproc executables and man pages. Users of other Linux flavors can check whether installation is possible using their package manager. Pandoc is in the Debian, Ubuntu, Slackware, Arch, Fedora, NixOS, and Gentoo repositories. Note, however, that versions in the repositories are often old, so you may be better off installing from source. Check out pandoc's installation instructions for details.
For PDF output, you’ll need LaTeX. Pandoc recommends installing TeX Live.
PLEASE NOTE: This section is intended as a 'quick-and-dirty' CCCS reference for using pandoc to implement a few of the basic conversions that we run most often. The pandoc website provides comprehensive usage instructions and examples, including a great 'Getting Started' page that helps users unfamiliar with command-line operations navigate their file system and execute basic pandoc commands. The 'User's Guide' provides detailed usage information and instructions. Please refer to the official pandoc documentation if you wish to run specific (or complicated) conversions that are not addressed here.
Pandoc is a command-line tool; there is no graphical user interface. In other words, to use pandoc, you need to work through a terminal interface [or 'virtual console'].
Deploying the terminal in Windows:
Windows users can use either the classic command prompt or the more modern PowerShell terminal. If you use Windows in desktop mode, you can access your terminal by typing either cmd or powershell into the Start menu's 'search' tool and double-clicking the application icon to run it.
Deploying the terminal in Mac OS X:
On OS X, the 'Terminal' application can be found in /Applications/Utilities. You can access the terminal via the 'Finder' window, navigating first to 'Applications' and then to 'Utilities'. Once again, double-click the 'Terminal' icon to run it.
Deploying the terminal in Linux:
Most Linux users will be comfortable launching the terminal application. For those who are not, the way to access your terminal varies by distro. Usually, you can launch it with a keyboard shortcut such as Ctrl+Alt+T (or something close to it). To find the terminal via your GUI, try clicking the 'start' icon for your distro and entering terminal into the search field.
You operate pandoc by entering commands into the terminal. To start, let’s verify that pandoc is installed by issuing the following command:
pandoc --version
You should see a message telling you which version of pandoc is installed, and giving you some additional information.
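The same check can be scripted. The following is a minimal sketch, assuming a POSIX-style shell (the macOS or Linux terminal; it will not run as-is in cmd or PowerShell). The variable name pandoc_status is only illustrative.

```shell
# Check whether the pandoc executable is reachable from this shell.
if command -v pandoc >/dev/null 2>&1; then
    pandoc_status="installed"
    # Print only the first line of the version report.
    pandoc --version | head -n 1
else
    pandoc_status="missing"
    echo "pandoc was not found on your PATH; revisit the installation steps." >&2
fi
```

If the script reports that pandoc is missing even though you installed it, opening a new terminal window so that your PATH is reloaded is a reasonable first thing to try.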
Converting a file
Basic conversion requires only a 'source' file and an 'output' file, as follows:
pandoc input.txt -o output.html
The example above uses one 'option' tag (also known as a 'flag'): -o. This tag means "output", and it can also be spelled out as a full word: --output.
The command syntax is flexible: options and file names can appear in any order, as long as the appropriate option flags are used. Thus, the following command produces the same result as the one above:
pandoc -o output.html input.txt
[NOTE: If no input-file is specified, input is read from stdin. If multiple input files are given, pandoc will concatenate them all (with blank lines between them) before parsing. Output goes to stdout by default.]
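To make the stdin, stdout, and multiple-file behavior concrete, here is a small sketch, assuming a POSIX-style shell. The file names ch1.md, ch2.md, and book.html are hypothetical, and the pandoc calls are skipped if pandoc is not installed.

```shell
# Scratch directory holding two hypothetical markdown files.
workdir=$(mktemp -d)
printf '# Chapter One\n\nFirst chapter.\n' > "$workdir/ch1.md"
printf '# Chapter Two\n\nSecond chapter.\n' > "$workdir/ch2.md"

if command -v pandoc >/dev/null 2>&1; then
    # No input file: pandoc reads from stdin and prints the result to stdout.
    printf 'Hello *pandoc*!\n' | pandoc
    # Two input files are concatenated, then written to a single output file.
    pandoc "$workdir/ch1.md" "$workdir/ch2.md" -o "$workdir/book.html"
fi
```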
A document's format can be specified explicitly. The input format is specified using the -r/--read or -f/--from options. The output format is specified using the -w/--write or -t/--to options. Thus, to convert example.txt from markdown to a PDF generated via LaTeX, you could type:
pandoc -f markdown -t latex example.txt -o example.pdf
or, equivalently:
pandoc example.txt -f markdown -t latex -o example.pdf
If the input or output format is not specified explicitly, pandoc will attempt to guess it from the extensions of the input and output filenames. Thus, the command:
pandoc -o example.pdf example.md
will convert example.md from markdown to a LaTeX-generated PDF document.
[NOTE: If the input files' extensions are unknown, the input format will be assumed to be markdown unless explicitly specified.]
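Extension-based guessing is convenient for batch jobs. The sketch below, assuming a POSIX-style shell, converts every .md file in a directory to HTML without naming either format. The helper html_name and the sample files are hypothetical; when pandoc is not installed, the loop only prints the command it would run.

```shell
# Derive the output name by swapping the .md extension for .html.
html_name() {
    printf '%s\n' "${1%.md}.html"
}

# Scratch directory with two hypothetical markdown files.
workdir=$(mktemp -d)
printf '# Notes\n' > "$workdir/notes.md"
printf '# Agenda\n' > "$workdir/agenda.md"

for src in "$workdir"/*.md; do
    dest=$(html_name "$src")
    if command -v pandoc >/dev/null 2>&1; then
        # No -f or -t: pandoc infers markdown and HTML from the extensions.
        pandoc "$src" -o "$dest"
    else
        echo "would run: pandoc $src -o $dest"
    fi
done
```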
IMPORTANT: Pandoc uses the UTF-8 character encoding for both input and output. If your local character encoding is not UTF-8, you will need to pipe input and output through iconv:
iconv -t utf-8 input.txt | pandoc | iconv -f utf-8
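When the source encoding is known, it can be named explicitly with iconv's -f option. The following sketch assumes the input is ISO-8859-1 (Latin-1) and that a POSIX-style shell is in use; the file names are hypothetical, and the pandoc step is skipped if pandoc is not installed.

```shell
workdir=$(mktemp -d)
# Hypothetical input: "café" encoded as ISO-8859-1 (é is the single byte 0xE9).
printf 'caf\351\n' > "$workdir/latin1.txt"

# Re-encode to UTF-8 before handing the text to pandoc.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"

if command -v pandoc >/dev/null 2>&1; then
    # Convert, then re-encode pandoc's UTF-8 output back to ISO-8859-1.
    pandoc "$workdir/utf8.txt" | iconv -f UTF-8 -t ISO-8859-1 > "$workdir/latin1.html"
fi
```

Converting the output back is only needed if the rest of your toolchain expects the original encoding; otherwise the UTF-8 output can be kept as it is.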
Referencing a Template for Stylized File Conversions
The following 'General Options' should cover most use cases. A more comprehensive list of options is available via the pandoc 'User's Guide'.
General File Definition
-f FORMAT, -r FORMAT, --from=FORMAT, --read=FORMAT
Specify input format. Markdown syntax extensions can be individually enabled or disabled by appending +EXTENSION or -EXTENSION to the format name. So, for example, markdown_strict+footnotes+definition_lists is strict markdown with footnotes and definition lists enabled, and markdown-pipe_tables+hard_line_breaks is pandoc’s markdown without pipe tables and with hard line breaks. See Pandoc’s markdown for a list of extensions and their names.
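As an illustration of the extension syntax, the sketch below passes the first format string from the paragraph above to -f. It assumes a POSIX-style shell; the sample file and the variable name reader are hypothetical, and the conversion is skipped if pandoc is not installed.

```shell
workdir=$(mktemp -d)
# Hypothetical input containing a definition list and a footnote.
printf 'Term\n:   Definition of the term.\n\nA claim.[^1]\n\n[^1]: The supporting note.\n' > "$workdir/sample.txt"

# Strict markdown with two extensions switched on.
reader="markdown_strict+footnotes+definition_lists"

if command -v pandoc >/dev/null 2>&1; then
    pandoc -f "$reader" -t html "$workdir/sample.txt" -o "$workdir/sample.html"
fi
```

Dropping +definition_lists from the format string and re-running is a quick way to see what an individual extension contributes to the output.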
-t FORMAT, -w FORMAT, --to=FORMAT, --write=FORMAT
Specify output format. Note that odt, docx, epub, and epub3 output will not be directed to stdout; an output filename must be specified using the -o/--output option. If +lhs is appended to markdown, rst, latex, beamer, html, or html5, the output will be rendered as literate Haskell source (see Literate Haskell support). Markdown syntax extensions can be individually enabled or disabled by appending +EXTENSION or -EXTENSION to the format name, as described under -f above.
-o FILE, --output=FILE
Write output to FILE instead of stdout. If FILE is - (a single hyphen), output will go to stdout. (Exception: if the output format is odt, docx, epub, or epub3, output to stdout is disabled.)
--data-dir=DIRECTORY
Specify the user data directory to search for pandoc data files. If this option is not specified, the default user data directory will be used. For UNIX or UNIX-like systems, this is
$HOME/.pandoc
For Windows XP, the path is
C:\Documents And Settings\USERNAME\Application Data\pandoc
For Windows 7, the path is
C:\Users\USERNAME\AppData\Roaming\pandoc
[NOTE: You can find the default user data directory on your system by looking at the output of pandoc --version. A reference.odt, reference.docx, default.csl, epub.css, templates, slidy, slideous, or s5 directory placed in this directory will override pandoc’s normal defaults.]
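If you would rather keep overrides alongside a project than in your home directory, --data-dir can point anywhere. The sketch below assumes a POSIX-style shell; the directory name pandoc-data and the sample file are hypothetical, and the conversion is skipped if pandoc is not installed.

```shell
# Hypothetical project-local data directory with a templates subdirectory.
data_dir="$(mktemp -d)/pandoc-data"
mkdir -p "$data_dir/templates"

# Hypothetical input file for the conversion.
printf 'Some *sample* text.\n' > "$data_dir/sample.md"

if command -v pandoc >/dev/null 2>&1; then
    # Pandoc looks in data_dir for data files before using its built-in defaults.
    pandoc --data-dir="$data_dir" "$data_dir/sample.md" -o "$data_dir/sample.html"
fi
```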
Reader Options
-R, --parse-raw
Parse untranslatable HTML codes and LaTeX environments as raw HTML or LaTeX, instead of ignoring them. Affects only HTML and LaTeX input. Raw HTML can be printed in markdown, reStructuredText, HTML, Slidy, Slideous, DZSlides, reveal.js, and S5 output; raw LaTeX can be printed in markdown, reStructuredText, LaTeX, and ConTeXt output. The default is for the readers to omit untranslatable HTML codes and LaTeX environments. (The LaTeX reader does pass through untranslatable LaTeX commands, even if -R is not specified.)
-S, --smart
Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.” (Note: This option is significant only when the input format is markdown, markdown_strict, textile or twiki. It is selected automatically when the input format is textile or the output format is latex or context, unless --no-tex-ligatures is used.)
--old-dashes
Selects the pandoc <= 1.8.2.1 behavior for parsing smart dashes: - before a numeral is an en-dash, and -- is an em-dash. This option is selected automatically for textile input.
--base-header-level=NUMBER
Specify the base level for headers (defaults to 1).
--default-image-extension=EXTENSION
Specify a default extension to use when image paths/URLs have no extension. This allows you to use the same source for formats that require different kinds of images. Currently this option only affects the markdown and LaTeX readers.
--normalize
Normalize the document after reading: merge adjacent Str or Emph elements, for example, and remove repeated Spaces.
-p, --preserve-tabs
Preserve tabs instead of converting them to spaces (the default). Note that this will only affect tabs in literal code spans and code blocks; tabs in regular text will be treated as spaces.
--tab-stop=NUMBER
Specify the number of spaces per tab (default is 4).
--track-changes=accept|reject|all
Specifies what to do with insertions and deletions produced by the MS Word “track-changes” feature. accept (the default) inserts all insertions and ignores all deletions. reject inserts all deletions and ignores insertions. all puts in both insertions and deletions, wrapped in spans with insertion and deletion classes, respectively. The author and time of change are included. all is useful for scripting: only accepting changes from a certain reviewer, say, or before a certain date. This option only affects the docx reader.
--extract-media=DIR
Extract images and other media contained in a docx or epub container to the path DIR, creating it if necessary, and adjust the image references in the document so they point to the extracted files. This option only affects the docx and epub readers.
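The two docx reader options above can be combined in one command. This is a sketch only, assuming a POSIX-style shell; report.docx, report.md, and report-media are hypothetical names, and nothing runs unless both pandoc and the Word file are present.

```shell
# Hypothetical Word file; replace with the path to a real document.
docx="report.docx"
# Hypothetical directory that will receive the extracted images.
media_dir="report-media"

if command -v pandoc >/dev/null 2>&1 && [ -f "$docx" ]; then
    # Keep both insertions and deletions, and pull embedded media out to media_dir.
    pandoc "$docx" -f docx -t markdown \
        --track-changes=all \
        --extract-media="$media_dir" \
        -o report.md
else
    echo "skipping: this example needs pandoc and $docx" >&2
fi
```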
Writer Options
Citation Rendering
Math Rendering