Jeromy Anglim's Blog: Psychology and Statistics


Friday, March 31, 2017

Generating APA style tables in R: Current challenges

This post reviews some aspects of generating formatted tables using R suitable for inclusion in a manuscript conforming to APA style. I review my current workflow that involves a large amount of manual formatting in Excel. I then discuss what it would take to automate more of these manual steps in R.

My current workflow for incorporating tables into a journal manuscript involves the following steps:
  • Create a data.frame in R with the core table data, where row names and column names carry the row and column headers. This usually includes rounding numbers to the desired precision (in order to avoid Excel rounding errors).
  • Export the data.frame as a CSV file, e.g., write.csv(mytab, file = "output/mytab.csv"), although sometimes I'll write to Excel to automate bolding.
  • Open the CSV file in Excel and apply manual formatting
  • Paste adjusted table into Word
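
As a rough sketch, the first two steps might look like the following. The data values and variable names are made up for illustration, and a temporary directory stands in for the "output" folder so that the snippet runs anywhere.

```r
# Hypothetical results; in practice this would come from your own analysis.
mytab <- data.frame(
  M  = c(34.2518, 52.1049),
  SD = c(8.9152, 12.3371),
  row.names = c("age", "income")
)

# Round in R rather than Excel. sprintf() returns text, which keeps
# trailing zeros such as "52.10" that a numeric round() would drop.
mytab_out <- mytab
mytab_out[] <- lapply(mytab, function(x) sprintf("%.2f", x))

# Export; tempdir() is a stand-in for the "output/" folder used above.
out_file <- file.path(tempdir(), "mytab.csv")
write.csv(mytab_out, file = out_file)
```

Converting to text before export is a design choice: it fixes the displayed precision in R, at the cost of the columns no longer being numeric.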

Pros and cons of manual formatting in Excel

Benefits of Excel approach
  • In many respects this approach is fairly efficient. 
  • If you are not updating your table results often, then it is often quicker to do formatting adjustments in Excel. 
Problems with Excel approach
  • If the data is updated multiple times, then reapplying the required formatting to the table each time can be time consuming. 
  • There is also the potential for errors to be introduced in the adjustment process. And the more times that the data is updated, the more the adjustment process might lead to transcription errors.
  • The time it takes to manually convert the table discourages making updates that would require this.
  • There is scope to standardise certain tables (e.g., correlation matrices, tables of descriptives by groups) and thus work spent automating could have benefits for future projects.
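
To illustrate what standardising a table could mean, here is a minimal sketch of a reusable helper for a table of descriptives by group. The function name descriptives_by_group, its arguments, and the example data are all hypothetical; it only produces the core numbers, not the formatting discussed below.

```r
# Hypothetical helper: M and SD for several variables, split by group.
descriptives_by_group <- function(data, vars, group, digits = 2) {
  fmt <- paste0("%.", digits, "f")
  groups <- split(data[vars], data[[group]])
  cells <- lapply(groups, function(g) {
    data.frame(
      M  = sprintf(fmt, sapply(g, mean)),
      SD = sprintf(fmt, sapply(g, sd)),
      row.names = vars
    )
  })
  # Column names become "<group>.M" and "<group>.SD" for each group.
  do.call(cbind, cells)
}

# Made-up data for illustration only.
dat <- data.frame(
  group = rep(c("control", "treatment"), each = 3),
  score = c(1, 2, 3, 4, 6, 8),
  age   = c(20, 30, 40, 25, 35, 45)
)
desc_tab <- descriptives_by_group(dat, vars = c("score", "age"), group = "group")
print(desc_tab)
```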

Review of activities done during Excel formatting

The following is influenced by terminology and formatting requirements of APA style (see Chapter 5 in APA 6th Edition Manual).
  • Modify fonts
    • Change font type and size to align with manuscript (e.g., 12 point Times New Roman)
    • Add selective font formats. Bolding certain numbers is quite common (e.g., correlations or factor loadings above a threshold), as is italicising certain statistical labels (e.g., M and SD, and 1, 2, 3, etc. in correlation tables).
    • Superscripted fonts related to specific table notes
  • Add or modify content
    • Convert R row and column names to the names used in the table. In particular, R variable names are almost always different from the labels shown in the table.
    • Ensure capitalisation meets style requirements
    • Add consecutive numbers followed by a period to row names. E.g., it is common to number variables in a correlation matrix "1. Age", "2. Income", etc. 
    • Add stub heading. I.e., the column heading for the first column (i.e., row.names) 
    • Adjust numbers: e.g., a p-value less than .001 might be shown as <.001, an adjusted r-squared value less than 0 might be displayed as 0.
    • Convert p-values to significance stars
  • Adjust cell alignment. 
    • Usually, headers are centred, numbers in the body are centred, and the first column is left aligned.
    • When row headings are nested, nested row stubs are indented (e.g., 3 spaces)
  • Delete cell content
    • Deleting lower or upper diagonal from symmetrical matrices: e.g., correlation matrix
    • Deleting diagonal from correlation matrices
  • Delete rows or columns
    • Ideally, the actual rows or columns of data have been specified correctly in R, but occasionally, it is simpler to remove rows or columns at the Excel stage. For example, the R output might list fit statistics for 6 models, but it is later decided that only 5 are relevant. That said, rearranging the order of rows should be done in R for increased reliability.
  • Add lines
    • Lines are placed above and below the column headers and below the last row
    • Decked column headings and table spanners require additional lines
  • Format numbers
    • Common tasks include adjusting number of decimal places, removing leading zeros (e.g., correlations, multiple r, p-values), putting parentheses around certain numbers, putting two numbers together in some way (e.g., ranges, confidence intervals, often have a separator like a comma or hyphen and may be surrounded by brackets).
  • Add line breaks in cells
    • Some cells have two or more bits of information that should be presented on distinct rows. E.g., column names may include the sample size on a second row (e.g., "Treatment {line-break} (n = 132)"), or a value is presented on the first line and its confidence interval on the second line. In the latter case, it is also possible to insert an additional row into the table and include these values in separate cells.
    • Some text is too long and needs to be split across multiple rows. This is usually done automatically. However, often this should include an indent on the second or subsequent row.
  • Adjust column widths
    • This is often a manual process in order to get the table to fit on the page and avoid cell wrapping.
  • Decked headings: Special requirements
    • Decked headings occur where two or more column headings are grouped under a column spanner (e.g., M and SD is shown for two groups where the group name is the spanner). 
    • Merge cells of column spanner (i.e., the heading that groups the two columns)
    • Insert line below the cells of the column spanner
    • Insert a small empty column between column spanner and other columns (this ensures that there is a gap between the line underneath the column spanners and makes it easier to see the intended grouping)
  • Table spanners: Special requirements
    • A table spanner is a centred heading that represents a major subdivision of a table. 
    • It involves inserting a new row with merged cells and centred text and adding a line to the bottom of the table division.
  • Table caption, title, and notes: Special requirements
    • In general, I specify these things in the manuscript. Mostly this works well. There is just the occasional bit of information that might be data driven. E.g., correlations above a certain value might be flagged as significant and this information might be included in the table note.
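
Several of the content and number-formatting tasks above are mechanical enough to sketch in base R. The helpers below (strip_leading_zero, format_p, p_stars, apa_cor_table) are hypothetical names, the star cut-offs are a common convention that I am assuming rather than taking from any particular source, and the data are made up. Fonts, lines, alignment, and merged cells are not addressed here.

```r
# Drop the leading zero from statistics bounded by 1 (correlations, p-values).
strip_leading_zero <- function(x, digits = 2) {
  out <- sprintf(paste0("%.", digits, "f"), x)
  sub("^(-?)0\\.", "\\1.", out)
}

# Show p-values below the display threshold as "<.001" rather than ".000".
format_p <- function(p, digits = 3) {
  threshold <- 10^(-digits)
  ifelse(p < threshold,
         paste0("<", strip_leading_zero(threshold, digits)),
         strip_leading_zero(p, digits))
}

# Convert p-values to significance stars (assumed cut-offs: .05, .01, .001).
p_stars <- function(p) {
  ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", "")))
}

# Correlation table: numbered row labels, numbered column headers,
# a stub heading, and the upper triangle and diagonal deleted.
apa_cor_table <- function(data, labels = names(data), digits = 2) {
  r <- cor(data, use = "pairwise.complete.obs")
  body <- matrix(strip_leading_zero(r, digits), nrow = nrow(r))
  body[upper.tri(body, diag = TRUE)] <- ""
  k <- seq_len(ncol(data))
  out <- data.frame(Variable = paste0(k, ". ", labels), body,
                    stringsAsFactors = FALSE)
  names(out)[-1] <- as.character(k)
  out
}

# Made-up data for illustration only.
dat <- data.frame(
  age    = c(1, 2, 3, 4, 5),
  income = c(2, 1, 4, 3, 5),
  stress = c(5, 3, 4, 1, 2)
)
cor_tab <- apa_cor_table(dat, labels = c("Age", "Income", "Stress"))
print(cor_tab, row.names = FALSE)
```

Passing the display labels separately from the data keeps the R variable names untouched, which is one way of handling the conversion from variable names to table labels.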

Reflections on manual formatting

Table formatting is complex. There is a visual quality to formatting tables. While some tables are approximated by a matrix with row and column headers, there are a huge number of common and not so common additional requirements. I often identify refinements to table formatting in an iterative fashion until it looks right.

While I attempted to document all the tasks that I do, I would not be surprised if there were additional tasks that did not come to mind. And presumably the common requirements of APA style tables in psychology are not the same as those relevant to other style guides and other disciplines.

It is possible to automate all of the above steps using R and output a table in a suitable format such as rtf, docx, or possibly HTML. However, at this point, this would require a lot of coding for each table.
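
To give a sense of the coding involved, here is a minimal base R sketch that writes a table to HTML with only the three basic horizontal lines (above and below the column headers, and below the last row). The function name apa_html_table and the inline CSS are my own assumptions, not part of any package, and decked headings, spanners, alignment, and fonts would all need further code.

```r
# Hypothetical helper: render a data.frame of text as an HTML table
# with lines above and below the header and below the last row.
apa_html_table <- function(tab) {
  esc <- function(x) {
    x <- gsub("&", "&amp;", x, fixed = TRUE)
    x <- gsub("<", "&lt;", x, fixed = TRUE)
    gsub(">", "&gt;", x, fixed = TRUE)
  }
  rule <- "1px solid black"
  head_cells <- paste0(
    "<th style=\"border-top:", rule, ";border-bottom:", rule, "\">",
    esc(names(tab)), "</th>", collapse = ""
  )
  n <- nrow(tab)
  rows <- vapply(seq_len(n), function(i) {
    # Only the last body row gets a bottom line.
    style <- if (i == n) paste0(" style=\"border-bottom:", rule, "\"") else ""
    values <- vapply(tab, function(col) as.character(col[i]), character(1))
    cells <- paste0("<td", style, ">", esc(values), "</td>", collapse = "")
    paste0("<tr>", cells, "</tr>")
  }, character(1))
  paste0("<table style=\"border-collapse:collapse\">",
         "<tr>", head_cells, "</tr>",
         paste(rows, collapse = ""),
         "</table>")
}

# Made-up table for illustration only.
tab <- data.frame(
  Variable = c("1. Age", "2. Income"),
  M = c("34.25", "52.10"),
  p = c("<.001", ".031")
)
html <- apa_html_table(tab)
writeLines(html, file.path(tempdir(), "mytab.html"))
```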

There are a few packages of relevance:

  • apaTables provides APA tables exported to RTF for a few very specific scenarios. And the author also adopts specific preferences, which while well reasoned, are not always what you want.
  • apaStyle is similar to apaTables in that it exports to Word format, although it seems a little more flexible. It has a generic table function that can handle decked headings, but it still seems a long way from the flexibility required to produce most tables.
  • rempsyc includes functions for outputting APA tables to Word from R.
  • xtable is one of the best packages for table production but it exports principally to HTML and LaTeX. It also doesn't really seem designed for capturing all the complexities of APA style tables.
  • htmlTable in Gmisc allows for some complexity. See this example.

The challenge is to design a flexible and efficient system that is also reliable (in that it limits the introduction of errors). I think a nice challenge for anyone willing to take this on would be to develop a simple set of functions in R that generate tables in Word or RTF format and that could be used to produce the 16 tables in the APA 6th edition style manual (ideally from hypothetical data to include the additional challenges of extracting and formatting the numbers, converting variable names, etc.). These tables include a range of the common requirements of APA style that are not well supported in existing packages.

**Update:**

  • After posting, I learnt about the papaja package. It seems specifically designed for writing APA style documents with R Markdown. The apa_table function seems like it's designed to capture many of the quirks of APA style, but at present its more advanced table-formatting features are limited to exporting LaTeX (i.e., R Markdown to LaTeX to PDF). A fully reproducible workflow has a lot to love, but at present I still find that collaboration and other features make Word my go-to option for manuscript preparation. 
  • huxtable (mentioned in the comments) has quite a lot of formatting flexibility. It exports to HTML and LaTeX format. See this vignette. It also supports row and column spans, although row spans are handled as separate columns whereas APA style uses indenting. I'm also not clear on how you would go from HTML to Word. My general impression is that HTML is less prescriptive by design.