Sunday, January 25, 2015

Pandoc: Customized LaTeX templates for PDF generation

One of the most interesting, yet rarely discussed, features of Pandoc is its support for LaTeX templates. Pandoc templates allow us to keep writing mostly in Markdown, instead of falling back to LaTeX, even when we have to produce documents that wouldn't be considered "normal". In those cases we can still take advantage of Markdown's simplicity by delegating all the gory details to our own templates, as long as they stick to Pandoc's template syntax and semantics.

As an illustration, let us suppose we regularly have to deliver PDFs of meeting minutes that are very formalized (with a lot of boilerplate text) but at the same time highly specific in their design, such as the following:

It would seem that Pandoc is not the best option for this task. I'll try to prove that Pandoc might be, on the contrary, particularly suitable.

Pandoc templates

As every Pandoc user knows, converting from Markdown to LaTeX relies on a default template included in the Pandoc installation. Issuing this command

pandoc --standalone --output my_document.pdf my_document.md

or, using short options for brevity,

pandoc -s -o my_document.pdf my_document.md

applies a default LaTeX template named default.latex that, on my system and with my current version (1.12.2.1), can be found at $PANDOC_DIR/data/templates/default.latex. The result of applying the template to the Markdown document is a LaTeX file that is processed transparently by a LaTeX engine (currently pdflatex by default) to produce the final PDF output.

A quick inspection into the default template reveals that a Pandoc template is just a normal LaTeX file in which certain specific constructs (variables, conditionals and loops) are included.

To use another template, either a customized version of the default template or our own template written from scratch, we need to pass its filename to the --template option. Assuming that the template file is in the same directory as the input file, the pandoc command would be as follows:

pandoc -s --template="my_template.latex" -o my_document.pdf my_document.md

Pandoc Variables

A variable in Pandoc has the following syntax.
$variable-name$

Every variable in the template is replaced with its actual value, which can be supplied in different ways. One of them is the command-line option -M variable-name=value (--metadata is the long form), as we will see later. Pandoc also has a -V (--variable) option that sets a template variable directly.

As for pre-defined variables contained in the default template, you can see more information about most of them in the Pandoc documentation (http://johnmacfarlane.net/pandoc/README.html#templates).

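Another useful way to find out which variables are actually present in the default template is to extract them with a Unix filter (adjust the path to wherever your installation keeps the template, or print the template with pandoc -D latex):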

grep -o '\$[^$]*\$' /usr/share/pandoc/data/templates/default.latex \
| grep -v '\$endif\$\|\$else\$\|\$endfor\$' \
| sort -u
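
The same extraction can be sketched in Python, which makes it easy to unwrap the control keywords and keep only the names. This is just an illustration: the function name template_variables and the sample template string are my own, and the regular expression only approximates Pandoc's template syntax (it ignores the $$ escape for a literal dollar sign, for instance).

```python
import re

# One $...$ token; [^$\n] keeps the match from running into the next token.
TOKEN = re.compile(r"\$([^$\n]+)\$")
# Control keywords that are not variable names.
KEYWORDS = {"else", "endif", "endfor", "sep"}


def template_variables(template_text):
    """Return the sorted list of variable names referenced in a template."""
    names = set()
    for token in TOKEN.findall(template_text):
        # Unwrap $if(name)$ and $for(name)$ to get at the name itself.
        wrapped = re.fullmatch(r"(?:if|for)\((.+)\)", token)
        if wrapped:
            names.add(wrapped.group(1))
        elif token not in KEYWORDS:
            names.add(token)
    return sorted(names)


# Hypothetical sample template, used only to exercise the function.
sample = r"""\documentclass{$if(my-doc-class)$$my-doc-class$$else$minimal$endif$}
\begin{document}
Participants: $for(participant)$$participant$$sep$, $endfor$
$greeting$
$body$
\end{document}"""

print(template_variables(sample))
# prints ['body', 'greeting', 'my-doc-class', 'participant']
```

To inspect a real template, read the file's text and pass it to template_variables instead of the sample string.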

It is very important to point out that there is one critical variable, $body$, that should be present in every template. This variable is implicitly replaced with the actual content of our input.

Let's check the latter by creating a minimal LaTeX template, simple.latex:

\documentclass{minimal}
\begin{document}
$body$
\end{document}

and running pandoc so that it takes its input from the terminal (the session is reproduced below):

$ pandoc -s --template="simple.latex" --to latex
Hi
Ctrl+D
\documentclass{minimal}
\begin{document}
Hi
\end{document}

The first line is the command issued: I'm telling Pandoc to apply our own template to whatever input it receives and to convert it to LaTeX. The next line is what I actually typed in for pandoc to consume; hitting Ctrl+D (on Linux and other Unixes) signals EOF and closes the standard input. The rest is the output produced by pandoc. Note that what I typed in, "Hi", now stands in place of the $body$ variable, as expected.

Let's try something a bit more complicated, adding our own variable, say $greeting$, to our template:

\documentclass{minimal}
\begin{document}
$greeting$
$body$
\end{document}

and test it as before

$ pandoc -s -M greeting="Hi, World" --template="simple.latex" --to latex
This is Pandoc!
Ctrl+D
\documentclass{minimal}
\begin{document}
Hi, World
This is Pandoc!
\end{document}

The command is almost the same. The critical addition is the -M option mentioned above. Unlike $body$, the values of our own variables need to be passed to pandoc explicitly, here via the -M option. Again, the variable in the template is replaced with the given value to produce the output.
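
If typing the input and pressing Ctrl+D gets tedious, the same experiment can be scripted. The following is a minimal sketch in Python, assuming pandoc is on the PATH and simple.latex is in the current directory; pandoc_command and render are names I made up for this example, and the options are exactly the ones used in the session above.

```python
import os
import shutil
import subprocess


def pandoc_command(template, metadata):
    """Build the argument list for: pandoc -s -M key=value ... --to latex."""
    args = ["pandoc", "-s"]
    for key, value in metadata:
        args += ["-M", f"{key}={value}"]
    args += [f"--template={template}", "--to", "latex"]
    return args


def render(text, template, metadata=()):
    """Feed text to pandoc on standard input and return the LaTeX it prints."""
    result = subprocess.run(
        pandoc_command(template, metadata),
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout


command = pandoc_command("simple.latex", [("greeting", "Hi, World")])
print(command)

# Only call pandoc if it is installed and the template file is present.
if shutil.which("pandoc") and os.path.exists("simple.latex"):
    print(render("This is Pandoc!\n", "simple.latex", [("greeting", "Hi, World")]))
```

Passing the arguments as a list avoids shell quoting problems with values such as "Hi, World".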

Conditionals

Conditionals have this syntax:
$if(variable)$
X
$else$
Y
$endif$ 

where the $else$ part is optional.

Let's suppose that we want to be able to choose between different document classes for the same document as needed. In particular, we want to create a minimal document by default, or another document class on demand. We can do this by creating this conditional:

$if(my-doc-class)$
$my-doc-class$
$else$
minimal
$endif$

Be careful with the syntax. Each control keyword (if(), else, endif) is enclosed in dollar signs, and so is every variable that is to be replaced by its value. Literal text, as well as the variable name inside the parentheses of if(), goes without them. Of course, we have to put this conditional in a suitable place, here the argument of the LaTeX \documentclass macro:

\documentclass{$if(my-doc-class)$
               $my-doc-class$
               $else$
               minimal
               $endif$}

One-liners may be less readable, but they keep stray line breaks and indentation out of the LaTeX argument. So let's put the corresponding one-liner into the template:

\documentclass{$if(my-doc-class)$$my-doc-class$$else$minimal$endif$}
\begin{document}
$greeting$
$body$
\end{document}

Now, let's check:

$ pandoc -s -M greeting="Hi, World" --template="simple.latex" --to latex
This should be a minimal
Ctrl+D
\documentclass{minimal}
\begin{document}
Hi, World
This should be a minimal
\end{document}

It works!

Let's try the second use case:

$ pandoc -s -M greeting="Hi, World" -M my-doc-class="book" --template="simple.latex" --to latex
And this one, a book
Ctrl+D
\documentclass{book}
\begin{document}
Hi, World
And this one, a book
\end{document}

It also works! Note that this time the variable my-doc-class is set to the value "book" by means of the -M option, as we did before with greeting.
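
To make the semantics of the conditional concrete, here is a toy model of it in Python. It is not how Pandoc implements templates: resolve_conditionals is a hypothetical helper, it only handles non-nested conditionals, and I'm assuming that a missing or empty value counts as false.

```python
import re

# $if(name)$ ... [$else$ ...] $endif$, non-nested; the else part is optional.
CONDITIONAL = re.compile(
    r"\$if\(([^)]+)\)\$(.*?)(?:\$else\$(.*?))?\$endif\$", re.DOTALL
)


def resolve_conditionals(template, values):
    """Keep the 'if' branch when the variable is set, the 'else' branch otherwise."""
    def choose(match):
        name, if_part, else_part = match.groups()
        branch = if_part if values.get(name) else (else_part or "")
        # Substitute the tested variable inside the chosen branch.
        return branch.replace(f"${name}$", str(values.get(name, "")))
    return CONDITIONAL.sub(choose, template)


line = r"\documentclass{$if(my-doc-class)$$my-doc-class$$else$minimal$endif$}"
print(resolve_conditionals(line, {}))                        # prints \documentclass{minimal}
print(resolve_conditionals(line, {"my-doc-class": "book"}))  # prints \documentclass{book}
```

The two calls mirror the two pandoc runs above: without my-doc-class the else branch wins, and with it the value is inserted.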

Loops

Loops behave in a similar manner. The basic syntax is as follows:
$for(variable)$
X
$sep$separator
$endfor$

The $sep$separator line (for defining a separator between consecutive items) is optional.

Let's say we want to list the participants in a meeting on the very first line of our document. We can use a variable participant in our template and let Pandoc fill it in. We want the participants' names separated by commas, so we can add this to the template:

$for(participant)$
$participant$
$sep$, 
$endfor$

Or just a one-liner in the corresponding place:

\documentclass{$if(my-doc-class)$$my-doc-class$$else$minimal$endif$}

\begin{document}
Participants: $for(participant)$$participant$$sep$, $endfor$

$greeting$
$body$
\end{document}

Testing again ... Now we add as many -M participant=... options as participants we wish to set.

$ pandoc -s -M greeting="Hi, World" -M participant="W. Shakespeare" -M participant="Edgar A. Poe" --template="simple.latex" --to latex
Good staff!
Ctrl+D
\documentclass{minimal}
\begin{document}
Participants: W. Shakespeare, Edgar A. Poe

Hi, World
Good staff!
\end{document}

Nice. Everything works fine!
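
The loop can be modelled in the same toy fashion. Again, this is only a sketch of the behaviour observed above, not Pandoc's implementation: expand_loops is a hypothetical helper that handles non-nested loops, and treating a single string as a one-item list is my assumption.

```python
import re

# $for(name)$ ... [$sep$ separator] $endfor$, non-nested; $sep$ is optional.
LOOP = re.compile(
    r"\$for\(([^)]+)\)\$(.*?)(?:\$sep\$(.*?))?\$endfor\$", re.DOTALL
)


def expand_loops(template, values):
    """Repeat the loop body once per item, joining the copies with the separator."""
    def expand(match):
        name, body, separator = match.groups()
        items = values.get(name, [])
        if isinstance(items, str):
            items = [items]  # assumption: a single value acts as a one-item list
        rendered = [body.replace(f"${name}$", item) for item in items]
        return (separator or "").join(rendered)
    return LOOP.sub(expand, template)


line = "Participants: $for(participant)$$participant$$sep$, $endfor$"
print(expand_loops(line, {"participant": ["W. Shakespeare", "Edgar A. Poe"]}))
# prints Participants: W. Shakespeare, Edgar A. Poe
```

Note that the separator only appears between items, never after the last one, which is why no trailing comma shows up.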

Metadata blocks

Being forced to pass all those things on the command line is cumbersome, to say the least. Fortunately, you don't have to: Pandoc provides so-called metadata blocks for this task. A metadata block for our previous experiments would look as follows:
---
my-doc-class: minimal
greeting: Hi, World
participant:
- W. Shakespeare
- Edgar A. Poe
---

This is just a chunk of text following the YAML specification (http://yaml.org/spec). A YAML block included in a document to be consumed by pandoc must begin with a line of three hyphens and end with a line of three dots or three hyphens.

A metadata block consists of fields. Each field has a name and an associated value, separated by a colon. Some fields can contain multiple values, each preceded by a hyphen, like the participant field in the previous example.

These blocks allow us to pass pandoc all the required information without bothering with command-line switches. The customary way is to add the block at the beginning of our input document. Another way, even better in my opinion, is to create a separate YAML file that we pass to pandoc along with our input file; pandoc concatenates its input files before parsing them, so the block is read as part of the document.

For instance, if we save the input of our last experiment (the "Good staff!" string) in a file named my_document.md and the metadata block above in a file named variables.yaml, pandoc can be called to produce exactly the same output as the one we obtained before (try it!) as follows:

pandoc -s --template="simple.latex" --to latex my_document.md variables.yaml

The final meeting document

Pandoc's flexibility, as explained above, makes our initial task feasible. It's just a matter of creating a normal LaTeX document (our template) with a few variables and loops. All of the gory details and boilerplate text are shifted to the template, while the actual, relevant content can be written, as usual and conveniently, in pure Markdown.

For reference, I reproduce below all the files involved in creating the document shown at the beginning of this entry. I won't comment further on the LaTeX here (it's a bit quick and dirty, BTW ;-)); I'm assuming readers already know LaTeX anyway.

my_document.md

1.  Lorem ipsum dolor sit amet,

    Lorem ipsum dolor sit amet consectetur adipiscing elit,
    sed do eiusmod tempor incididunt ut labore et dolore
    magna aliqua.

2.  Ut enim ad minim veniam,

    Ut enim ad minim veniam, quis nostrud exercitation ullamco
    laboris nisi ut aliquip ex ea commodo consequat.

3.  Duis aute irure dolor

    Duis aute irure dolor in reprehenderit in voluptate velit
    esse cillum dolore eu fugiat nulla pariatur.

4.  Excepteur sint occaecat

    Excepteur sint occaecat cupidatat non proident, sunt in
    culpa qui officia deserunt mollit anim id est laborum

my_template.latex

\documentclass[a4paper]{extreport}
\usepackage[T1]{fontenc}
\usepackage{marginnote}
\usepackage{background}

\setlength{\parindent}{0pt}

\SetBgScale{1}
\SetBgColor{black}
\SetBgAngle{0}
\SetBgHshift{-0.52\textwidth}
\SetBgVshift{-1mm}
\SetBgContents{\rule{0.4pt}{\textheight}}

\setlength{\marginparwidth}{32mm}
\renewcommand*{\raggedleftmarginnote}{}
\reversemarginpar

\begin{document}
\textbf{Report}\marginnote{$for(attendee)$$attendee$$sep$\\ $endfor$}

$body$

\section{}
Freedonia, $month$ $day$, $year$ 

\section{}
Chief of Department

\vspace{2cm}
\emph{Signature: atopos}
\end{document}

variables.yaml

---
day: 25
month: January
year: 2015
attendee:
- William Shakespeare
- Edgar A. Poe
- Miguel de Cervantes
---
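
Since the date fields change with every meeting, variables.yaml is a natural candidate for generation. Here is a minimal sketch in Python that writes the same block for a given date; metadata_block is a hypothetical helper of mine, the English month names are hard-coded to stay independent of the system locale, and names containing YAML-special characters (a colon, for example) would need quoting that this sketch doesn't do.

```python
from datetime import date

# Hard-coded so the result doesn't depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def metadata_block(when, attendees):
    """Return a YAML metadata block with day, month, year and attendee fields."""
    lines = [
        "---",
        f"day: {when.day}",
        f"month: {MONTHS[when.month - 1]}",
        f"year: {when.year}",
        "attendee:",
    ]
    lines += [f"- {name}" for name in attendees]
    lines.append("---")
    return "\n".join(lines) + "\n"


block = metadata_block(
    date(2015, 1, 25),
    ["William Shakespeare", "Edgar A. Poe", "Miguel de Cervantes"],
)
print(block)
```

For today's minutes, pass date.today() instead of a fixed date and write the returned string to variables.yaml.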

The Pandoc command to generate the PDF

pandoc -s --template="my_template.latex" -o my_document.pdf my_document.md variables.yaml


5 comments:

  1. Thank you for the guidance here, Works!!!

  2. I was searching for how to code the system date for the yaml block, did not find it; have you come across it? E.g. rmarkdown is '`r format(Sys.Date(), "%B %d, %Y")`'
    Thanks

    Replies
    1. No, I haven't, but it looks like it might work as follows:

      date: "`r format(Sys.time(), '%d %B, %Y')`"

      Taken from here:

      http://stackoverflow.com/questions/23449319/yaml-current-date-in-rmarkdown

  3. Thanks for responding, I could not get it working; using your initial yaml block, I also put the date in the margin note, ... Thanks again for the setup,
