---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
[![CRAN status](https://www.r-pkg.org/badges/version/striprtf)](https://cran.r-project.org/package=striprtf)
[![Build Status](https://travis-ci.org/kota7/striprtf.svg?branch=master)](https://travis-ci.org/kota7/striprtf)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/kota7/striprtf?branch=master&svg=true)](https://ci.appveyor.com/project/kota7/striprtf)
[![](http://cranlogs.r-pkg.org/badges/striprtf)](https://cran.r-project.org/package=striprtf)
[![R-CMD-check](https://github.com/kota7/striprtf/workflows/R-CMD-check/badge.svg)](https://github.com/kota7/striprtf/actions)
[![CircleCI build status](https://circleci.com/gh/kota7/striprtf.svg?style=svg)](https://circleci.com/gh/kota7/striprtf)
# striprtf: Extract Text from RTF (Rich Text Format) File
## Installation
The package is available on CRAN.
```R
install.packages("striprtf")
```
Alternatively, install the development version from GitHub using the `devtools` package.
```R
devtools::install_github("kota7/striprtf")
```
## Usage
The package exports two main functions:
- `read_rtf` takes the path to a Rich Text Format (RTF) file and extracts its plain text.
- `strip_rtf` does the same for a string of RTF content instead of a file path.
```{r}
library(striprtf)
x <- read_rtf(system.file("extdata/king.rtf", package = "striprtf"))
head(x)
```
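For string input, `strip_rtf` can be called in much the same way. The following is a minimal sketch: the RTF snippet is hand-written for illustration and is not one of the package's bundled files, and the exact shape of the result may depend on the package version.

```R
library(striprtf)

# A small hand-written RTF snippet (illustrative, not shipped with the package)
rtf <- "{\\rtf1\\ansi Hello, world!\\par A second paragraph.}"

# strip_rtf() takes the RTF content itself rather than a file path
txt <- strip_rtf(rtf)
print(txt)
```

This is convenient when the RTF content is already in memory, for example after reading it from a database column or a web response.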
The package has also been tested with documents in East Asian languages.
```{r}
read_rtf(system.file("extdata/amenimo.rtf", package = "striprtf"))
read_rtf(system.file("extdata/mean.rtf", package = "striprtf"))
```
## Important Change in the Function Names
As of version 0.3.1, the functions have been renamed as follows:
- `striprtf` --> `read_rtf`
- `rtf2text` --> `strip_rtf`
See NEWS for other updates.
## Tables (v0.4.1+)
Tables in documents are supported. Use the `row_start`, `row_end`, and `cell_end` arguments
to adjust the format of the tables.
Line breaks (and other special characters) within cells are also supported.
The parser has been more robust since v0.4.5.
Tested with files generated by Microsoft Word, Google Docs, and LibreOffice Writer.
```{r}
# example file added at v0.4.2
read_rtf(system.file("extdata/shakespeare.rtf", package = "striprtf"),
row_start = "**", row_end = "", cell_end = " --- ")
```
Note:
- No support for nested tables
- No support for merged cells
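The table delimiters can also serve as markers for post-processing. The sketch below splits table rows into cells using base R only. `split_table_rows` is a hypothetical helper, not part of striprtf, and it assumes that each table row comes back as a single element that begins with `row_start` and that every cell is followed by `cell_end`. The input vector is hand-written to mimic that assumed shape.

```R
# Hypothetical helper (not part of striprtf): split table rows into cells.
# Assumes each table row is one element of `x` that begins with `row_start`
# and that every cell is followed by `cell_end`.
split_table_rows <- function(x, row_start, cell_end) {
  is_row <- startsWith(x, row_start)
  # Drop the row marker, then split on the literal cell delimiter
  rows <- substring(x[is_row], nchar(row_start) + 1)
  strsplit(rows, cell_end, fixed = TRUE)
}

# Hand-written input mimicking the assumed output shape
x <- c("A heading",
       "<<ROW>>Play<<CELL>>Genre<<CELL>>",
       "<<ROW>>Hamlet<<CELL>>Tragedy<<CELL>>")
cells <- split_table_rows(x, row_start = "<<ROW>>", cell_end = "<<CELL>>")
print(cells)
```

With a real file, the same markers would be passed to `read_rtf` as `row_start` and `cell_end`, with `row_end = ""`. Choosing markers that are unlikely to occur in the document text keeps the split unambiguous. Note that `strsplit` drops a trailing empty string, so an empty last cell is lost in this sketch.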
## References
- [Stack Overflow thread](https://stackoverflow.com/a/188877) where the algorithm is discussed.
- [RTF specification 1.5](https://www.biblioscape.com/rtf15_spec.htm)