Introduction to texor
Abhishek Ulayil
2024-08-16
Source:vignettes/introduction-to-texor.Rmd
The CRAN package texor helps convert older LaTeX-based documents, such as research papers, to HTML through a series of intermediate conversions. This was a particular problem for legacy R research papers, which were written when HTML export was not available and therefore have no web version.
Why migrate to a web format?
Web development has advanced considerably, and modern websites offer a far more interactive and accessible interface for the knowledge we consume. The advantages of a web format are:
- Better support for screen readers, hence better accessibility.
- Convenient to use on mobile devices with smaller screens.
- Maintaining parity with newer articles available in the web format.
- Easier translations for the text.
To maintain parity with modern articles, we convert the legacy articles to R Markdown, a Markdown-based format that allows PDF and web content to be published from a single document, with executable code chunks that reproduce the results at compile time.
How do we bring the legacy documents into a modern format?
Many legacy articles are available only as PDF. To bring these LaTeX-based documents to the web, we needed a conversion tool that could read LaTeX and generate a Markdown file. Such a tool exists in Pandoc, software written in Haskell that is fast, portable and well integrated into the R ecosystem. However, Pandoc has limitations when working with LaTeX articles, including:
- Pandoc understands a subset of the LaTeX language, which means we have to simplify any document we pass on to it for an efficient conversion.
- It does not recognize custom environments such as the example environment, which is built on top of the verbatim environment, so we need to devise methods to replace these custom environments with simple alternatives.
- For bibliographies, Pandoc will not convert embedded bibliographic entries into BibTeX, which is the preferred format in R Markdown.
- Pandoc can extract a few bits of metadata from the LaTeX article, but we need many more details to create the YAML front matter for an R Markdown document.
- Sweave code chunks are not natively understood by Pandoc; this requires a custom reader to separate the code chunks and transform them into R Markdown code chunks.
- Pandoc will never generate R code chunks for figures and tables; we would need to develop filters to get this functionality.
- Numbering and referencing of article components such as figures, equations and tables require modification by filters during conversion to achieve a web article consistent with the ones we currently publish.
- Some artifacts, such as included PDF images, tikz graphics and algorithm environments, need pre-processing and integration with the conversion process.
Sounds like a lot of hassle just to convert a single LaTeX article to an R Markdown file, right? How nice it would be if the workarounds for most of these limitations were automated, so that we did not need to perform them manually for each and every document. This was the exact thought behind the texor package and its sister package rebib. Together they do all of the above and reduce the conversion process for the end user to a single function call.
If you are converting an R Journal LaTeX article
texor::latex_to_web(path_to_folder)
or in case you are converting a Sweave article1
texor::rnw_to_rmd(path_to_file)
More customization options are available if you want things handled differently, but for the most part the default settings yield a reasonably good conversion to R Markdown, which can then be knitted to HTML.
This is the aim of the whole package: reducing complexity and automating repetitive tasks for a better conversion process.
A key point to note, however, is that not all documents will convert well, or at all. This is because LaTeX is highly customizable and places few restrictions on authors.
Internal Conversion Workflow:
To explain the internal conversion process in more depth, I have divided it into stages. The workflow here is indicative only and may differ from the actual sequence due to updates.
Stage 0 : Preparing to convert
In this stage, we check the basics, such as using the correct path, normalizing the path and extracting the file_name and wrapper_name, etc.
# normalizing path using xfun package
dir <- xfun::normalize_path(dir)
# getting wrapper file name
wrapper_file <- texor::get_wrapper_type(dir)
# getting the main LaTeX file name
file_name <- texor::get_texfile_name(dir)
Stage 1 : Copying/Removing Style files
Pandoc does not need all of the style files, as it is not trying to compile the document but rather to convert it. Hence, to work around certain limitations, we remove the RJournal.sty file and include a new style file which redefines certain commands.
# This function will remove RJournal.sty file,
# Copy the Metafix.sty file and link it in wrapper.
texor::include_style_file(dir)
Stage 2 : Handling Bibliographies
As we do not want the embedded bibliography to be included as a div element in the article itself, we need to convert it to BibTeX format.
To remove the bibliography div elements from the article, we use a Lua filter later on.
To convert the embedded bibliography, we use the rebib package. By default, the bibliography aggregation function is used, which creates or updates the BibTeX file as needed and links it in the article_tex_file (if not already linked).
# bibliography aggregation when both bibtex and embedded bibliography available,
# Using Bibtex file for bibliography if no embedded bibliography available,
# Create a new bibtex file using the embedded bibliography in the document.
rebib::aggregate_bibliography(dir)
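The three cases listed in the comments above can be read as a small decision rule. The sketch below is only an illustration of that rule in base R, not the rebib implementation: the helper name bib_strategy_sketch and its return labels are hypothetical, and it assumes an embedded bibliography is marked by the standard thebibliography environment.

```r
# Hypothetical helper (not part of rebib): decide how the bibliography
# would be handled, following the three cases described above.
bib_strategy_sketch <- function(tex_lines, bib_exists) {
  # Assumption: an embedded bibliography uses the `thebibliography` environment
  embedded <- any(grepl("\\begin{thebibliography}", tex_lines, fixed = TRUE))
  if (embedded && bib_exists) {
    "aggregate"                 # both sources present: merge them
  } else if (embedded) {
    "create from embedded"      # only embedded entries: build a new BibTeX file
  } else if (bib_exists) {
    "use existing bibtex"       # only a BibTeX file: use it as is
  } else {
    "none"
  }
}

tex <- c("\\begin{thebibliography}{9}",
         "\\bibitem{key} An entry.",
         "\\end{thebibliography}")
strategy <- bib_strategy_sketch(tex, bib_exists = FALSE)
```

In practice rebib::aggregate_bibliography(dir) makes this choice for you; the sketch only makes the branching explicit.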
Stage 3 : Handling Figures
The texor package creates a YAML report about the figure environments, including tikz and algorithm2e images. There is also a function which uses Pandoc's image data to convert PDF images to PNG.
data <- texor::handle_figures(dir, file_name)
Stage 4 : Patching Environments
Pandoc does not support certain environments, like:
in figures : figure*, algorithmic, algorithm.
in table : table*.
in code : example, example*, Sin, Sout, Scode, Sinput, Soutput, smallverbatim, boxedverbatim.
Here, texor will use the stream editor to patch these environments to the default types figure, table and verbatim.
There is also a function to patch equations (especially eqnarray environment).
texor::patch_code_env(dir)
texor::patch_table_env(dir)
texor::patch_figure_env(dir)
texor::patch_equations(dir)
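To make the idea of patching concrete, here is a minimal base R sketch of renaming environments on the lines of a LaTeX file. It is not the texor implementation; the helper name patch_env_sketch is hypothetical, and it only handles the simple case where \begin{...} and \end{...} appear literally in the text.

```r
# Hypothetical helper (not part of texor): rename \begin{<env>} / \end{<env>}
# pairs so that custom environments fall back to one Pandoc understands.
patch_env_sketch <- function(lines, from, to) {
  for (env in from) {
    # fixed = TRUE treats "\", "{" and "*" literally rather than as regex syntax
    lines <- gsub(paste0("\\begin{", env, "}"), paste0("\\begin{", to, "}"),
                  lines, fixed = TRUE)
    lines <- gsub(paste0("\\end{", env, "}"), paste0("\\end{", to, "}"),
                  lines, fixed = TRUE)
  }
  lines
}

tex <- c("\\begin{example}", "x <- 1", "\\end{example}",
         "\\begin{figure*}", "\\end{figure*}")
tex <- patch_env_sketch(tex, c("example", "example*"), "verbatim")
tex <- patch_env_sketch(tex, "figure*", "figure")
```

Literal matching is used because the starred names would otherwise need escaping in a regular expression.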
Stage 5 : Converting the LaTeX document to Markdown
Here we will convert the document to Markdown, with a lot of Lua filters modifying the document.
texor::convert_to_markdown(dir)
Stage 6 : Copying over dependency files
This function copies files such as figures of all kinds, the BibTeX file and PDFs to the /web folder.
texor::copy_other_files(dir)
Stage 7 : Converting the Markdown to Rmarkdown
In this stage we convert the Markdown to R Markdown by reading and adding metadata such as ctv, CRANpkgs, BIOpkgs, slug, author metadata, title and abstract.
We also add important parameters for rjtools::rjournal_web_article, such as:
- self_contained = TRUE
- toc = FALSE
- legacy_pdf = TRUE
- mathjax = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js"
texor::generate_rmd(dir)
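As a rough picture of what such front matter might look like, the sketch below builds the YAML header lines in base R using the four parameters listed above. The helper name front_matter_sketch, the reduced set of fields (title and author only) and the exact YAML layout are assumptions for illustration; the file written by texor::generate_rmd carries considerably more metadata.

```r
# Hypothetical helper (not part of texor): assemble minimal YAML front matter.
front_matter_sketch <- function(title, author) {
  # Escape double quotes so the values stay valid inside quoted YAML strings
  esc <- function(x) gsub('"', '\\"', x, fixed = TRUE)
  c(
    "---",
    paste0('title: "', esc(title), '"'),
    paste0('author: "', esc(author), '"'),
    "output:",
    "  rjtools::rjournal_web_article:",
    "    self_contained: true",
    "    toc: false",
    "    legacy_pdf: true",
    '    mathjax: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js"',
    "---"
  )
}

fm <- front_matter_sketch("An Example Article", "A. Author")
writeLines(fm)
```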
Stage 8 : Creating RJ-web-article from the Rmarkdown file
texor::produce_html(dir)
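Put together, the stages above could be chained roughly as follows. This is a sketch only: the wrapper name convert_article_sketch is hypothetical, it assumes the texor, rebib and xfun packages are installed, and, as noted earlier, the actual sequence inside texor::latex_to_web may differ.

```r
# Sketch only: calls the stage functions named above in the order described.
# `article_dir` is a hypothetical path to a folder holding the LaTeX article.
convert_article_sketch <- function(article_dir) {
  stopifnot(dir.exists(article_dir))
  dir <- xfun::normalize_path(article_dir)      # Stage 0: prepare paths
  file_name <- texor::get_texfile_name(dir)
  texor::include_style_file(dir)                # Stage 1: style files
  rebib::aggregate_bibliography(dir)            # Stage 2: bibliography
  texor::handle_figures(dir, file_name)         # Stage 3: figures
  texor::patch_code_env(dir)                    # Stage 4: patch environments
  texor::patch_table_env(dir)
  texor::patch_figure_env(dir)
  texor::patch_equations(dir)
  texor::convert_to_markdown(dir)               # Stage 5: LaTeX to Markdown
  texor::copy_other_files(dir)                  # Stage 6: dependency files
  texor::generate_rmd(dir)                      # Stage 7: Markdown to R Markdown
  texor::produce_html(dir)                      # Stage 8: HTML article
  invisible(dir)
}
```

For everyday use, the single call texor::latex_to_web(path_to_folder) shown earlier remains the intended entry point; running the stages one by one is mainly useful when a conversion fails and you want to see where.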
Note on Dependencies
Because this package tackles several distinct challenges, it has to rely on multiple software tools. The dependencies are:
- Pandoc (min: V3.1) 2 : To convert the basic LaTeX to rmarkdown.
- poppler pdfutils : To convert embedded PDF to PNG.
- LaTeX/TinyTex3 : For Tikz graphic conversion.
- rebib : For bibliography conversions/aggregations.
- rjtools : For the rjournal_web_article template.
- stringr : For string manipulation functions.
- Lua filters 4
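Before converting, it can help to check that some of these dependencies are present. The base R sketch below tests for a pandoc executable on the PATH at or above the minimum version stated above and for the R packages in the list. The helper names are hypothetical, and it assumes that the first line of `pandoc --version` contains the version number.

```r
# Hypothetical helpers (not part of texor) for a quick dependency check.

# Extract a version from a line such as "pandoc 3.1.2"; NULL if none is found.
parse_pandoc_version <- function(first_line) {
  m <- regmatches(first_line, regexpr("[0-9]+(\\.[0-9]+)+", first_line))
  if (length(m) == 0) return(NULL)
  numeric_version(m)
}

# TRUE if pandoc is on the PATH and meets the minimum version.
pandoc_ok <- function(min_version = "3.1") {
  if (!nzchar(Sys.which("pandoc"))) return(FALSE)
  out <- system2("pandoc", "--version", stdout = TRUE)
  v <- parse_pandoc_version(out[1])
  !is.null(v) && v >= numeric_version(min_version)
}

# Names of the listed R packages that are not installed.
missing_pkgs <- function(pkgs = c("rebib", "rjtools", "stringr")) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}
```

Calling pandoc_ok() and missing_pkgs() at the start of a session gives a quick indication of what still needs installing; the other tools in the list are not covered by this sketch.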