stringr is a super useful package for dealing with character strings. We will demonstrate how to wrangle and manipulate character strings by importing a PDF.
If this tutorial convinces you that stringr is awesome 😆… you can apparently buy wall art of the hex sticker to decorate your home or office at www.redbubble.com:
We are going to show an example of wrangling character strings from the PDF file for this article.
This article evaluated food consumption patterns in 195 countries for 15 different dietary risk factors that have probable associations with non-communicable disease (NCD). If you are interested in more about this, stay tuned for our case study.
This example will involve using many of the functions of the stringr
package. This package is part of the Tidyverse. The Tidyverse is a library of packages created by RStudio. These packages make data science in R especially efficient.
You will be able to identify when stringr
might be useful for particular kinds of data.
You will know how to use some very useful stringr
functions like:
Function | Use |
---|---|
str_replace() | replace or exchange a pattern of characters for another |
str_split() | split or divide strings of any size (words/sentences/paragraphs) into substrings |
str_subset() | select the strings that contain a particular pattern |
str_count() | count the occurrences of a specific character |
str_which() | find the indices of the strings in which a specific character occurs |
str_remove() | remove characters from your strings |
str_trim() | remove leading and trailing white space |
str_squish() | remove leading and trailing white space and collapse repeated interior white space |
For information on other functions see here.
You will know how to work with regular expressions.
You will be able to name other packages that are useful for wrangling data that contains characters.
We will begin by loading the packages that we will need:
An article was recently published in The Lancet that evaluates global dietary trends and the relationship of these dietary factors with mortality and fertility.
This article includes a table that contains dietary guidelines for dietary factors that are particularly associated with health risk.
We are interested in this table on page 3 of the article:
First let’s import the PDF using the pdftools
package.
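A minimal sketch of the import step (the local file name "lancet_diet.pdf" is an assumption; use whatever name you saved the article under):

```r
library(pdftools)

# pdf_text() returns a character vector with one string per PDF page
paper <- pdftools::pdf_text("lancet_diet.pdf")
```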
We can use the base summary() function to get a sense of what the data looks like. By base we mean that these functions are part of the base package and are loaded automatically, so library(base) is not required.
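Assuming the imported pages are stored in an object called paper (as in the rest of this tutorial), the call is simply:

```r
# summary() of a character vector reports its length, class, and mode
summary(paper)
```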
Length Class Mode
15 character character
We can see that we have 15 different character strings. Each one contains the text on each of the 15 different pages of the PDF.
We can get similar results using the glimpse()
function of the dplyr
package (it is also in the tibble
package).
chr [1:15] " "| __truncated__ ...
We will be using the %>%
pipe for sequential steps in our code later on. This will make more sense when we have multiple sequential steps using the same data object.
We could write the same code as above using this notation. For example, we first grab the paper object, then we glimpse it.
chr [1:15] " "| __truncated__ ...
Again, the table we are interested in is on the third page, so let’s grab just that portion of the PDF.
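Since pdf_text() returns one string per page, grabbing the third page can be done by indexing (a sketch, assuming the pages live in paper):

```r
# keep only page 3, which contains the table of dietary guidelines
table <- paper[3]
```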
Here is what the top of this page looks like before the table:
Length Class Mode
1 character character
Here we can see that the table
object now contains the text from the 3rd page as a single large character string.
chr " Articles\nin systolic blood pressure, and then estimated the Disease-specific deaths and disability-adjusted\nrelationship between change in systolic blood pressure life-years\nand disease outcomes.14 Data on disease-specific deaths and disability-adjusted\n life-years (DALYs) by age, sex, country, and year were\nOptimal level of intake "| __truncated__
The text is difficult to read because of the column structure in the PDF. Now let’s try to grab just the text in the table.
One way to approach this is to split the string by some pattern that we notice in the table.
Only the capitalized form of the word “Diet” appears to be within the table, and is not present in the preceding text (although “diet” is). All the rows of interest of the table appear to start with the word “Diet”.
Let’s use the str_split()
function of the stringr
package to split the data within the object called table
by the word “Diet”. Matching is case-sensitive, so only “Diet” (and not “diet”) will be matched. The text will be split into individual pieces every time the word “Diet” occurs, and the word itself will be removed.
In this case we are also using the magrittr assignment pipe, or double pipe, which looks like this: %<>%. This allows us to use the table data as input to the later steps while also reassigning the output to the same data object name.
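A sketch of the split with the assignment pipe:

```r
library(stringr)
library(magrittr)

# split the page text at every occurrence of "Diet" (case-sensitive)
# and reassign the resulting list back to table
table %<>%
  str_split(pattern = "Diet")
```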
Using the base::summary() and dplyr::glimpse() functions we can see that we created a list containing the 17 substrings produced by splitting the page text at the word “Diet”.
Length Class Mode
[1,] 17 -none- character
We can see that we start with the row that contains “Diet low in fruits”.
List of 1
$ : chr [1:17] " "| __truncated__ " low in fruits Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruit"| __truncated__ " low in vegetables Mean daily consumption of vegetables (fresh, frozen, cooked, canned, or dried v"| __truncated__ " low in legumes Mean daily consumption of legumes (fresh, frozen, cooked, canned, or dried legu"| __truncated__ ...
RStudio creates really helpful cheat sheets like this one which shows you all the major functions in stringr
. You can download others here.
You can see that we could have also used the str_split_fixed()
function which would also separate the substrings into different columns of a matrix.
Note: we would need to know the number of substrings or pieces that we would like returned.
For example…
If we use the fixed version with n = 3, we will create a matrix with 3 columns, holding the first 3 substrings produced by dividing the large string at the first 2 occurrences of “Diet” (any remaining text is kept in the last piece).
[1] "matrix"
We can also specify the number of splits with the str_split()
, but this will create a list of substrings, not a matrix.
[1] "list"
For more information about str_split()
see here and here.
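As a self-contained toy illustration of the difference (the string x is made up for the example):

```r
library(stringr)

x <- "intro Diet A Diet B Diet C"

# str_split_fixed() always returns a character matrix with n columns;
# the last column keeps any remaining text (" B Diet C" here)
m <- str_split_fixed(x, "Diet", n = 3)
is.matrix(m)  # TRUE

# str_split() with n returns a list of character vectors instead
l <- str_split(x, "Diet", n = 3)
is.list(l)    # TRUE
```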
Now, back to our single list of 17 character strings.
Let’s separate the values within the list using the base unlist() function. This will allow us to easily select the different substrings within the object called table.
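A sketch of that step:

```r
# flatten the one-element list into a character vector of 17 substrings
table <- unlist(table)
```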
Length Class Mode
17 character character
It’s important to realize that the first split will include the text before the first occurrence of Diet
as the first value in the output. We could use the first()
function of the dplyr
package to look at this value. However, we will suppress the output as this is quite large.
Instead we can take a look at the second element of the list using the nth() function of dplyr.
[1] " low in fruits Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding 250 g (200–300) per day 94·9\n fruit juices and salted or pickled fruits)\n "
Indeed this looks like the first row of interest in our table:
Using the last()
and the nth()
functions of the dplyr
package we can take a look at the last values of the list.
#to see the second to last value we can use nth()
#the -2 specifies that we want the second to last value
#-3 would be third to last and -1 would be the last value
dplyr::nth(table, -2)
[1] " high in sodium 24 h urinary sodium measured in g per day 3 g (1–5) per day* 26·2\n *To reflect the uncertainty in existing evidence on optimal level of intake for sodium, 1–5 g per day was considered as the uncertainty range for the optimal level of sodium where less than 2·3 g per day is the\n intake level of sodium associated with the lowest level of blood pressure in randomised controlled trials and 4–5 g per day is the level of sodium intake associated with the lowest risk of cardiovascular disease in\n observational studies.\n Table: "
[1] "ary risk factor exposure definitions, optimal level, and data representativeness index, 1990–2017\nwww.thelancet.com Published online April 3, 2019 http://dx.doi.org/10.1016/S0140-6736(19)30041-8 3\n"
Therefore, we don’t need this part of the table or the text before the table if we just want the consumption recommendations.
So we will select the 2nd through the second-to-last of the substrings; since we have 17 substrings, that means the 2nd through the 16th. However, a better way to do this than selecting by index is to select phrases that are unique to the text within the table that we want. We will use the str_subset() function of the stringr package to select the table rows with consumption guidelines. Most of the rows contain the phrase “Mean daily consumption”; however, some rows use other phrases instead, including “Mean daily intake” and “24 h”. So we will subset on each of these phrases.
# one could subset the table like this:
#table <- table[2:16]
table %<>%
str_subset(
pattern = "Mean daily consumption|Mean daily intake|24 h")
Notice that we separate the different patterns using the vertical bar character “|” and that all of the patterns are together within one set of quotation marks.
Question opportunity:
What other string patterns could you use to subset the rows of the table that we want?
Why might it be better to subset based on the text rather than the index?
Now the first row is what we want:
[1] " low in fruits Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding 250 g (200–300) per day 94·9\n fruit juices and salted or pickled fruits)\n "
And the last row is what we want:
[1] " high in sodium 24 h urinary sodium measured in g per day 3 g (1–5) per day* 26·2\n *To reflect the uncertainty in existing evidence on optimal level of intake for sodium, 1–5 g per day was considered as the uncertainty range for the optimal level of sodium where less than 2·3 g per day is the\n intake level of sodium associated with the lowest level of blood pressure in randomised controlled trials and 4–5 g per day is the level of sodium intake associated with the lowest risk of cardiovascular disease in\n observational studies.\n Table: "
Notice that the decimal points from the PDF are being recognized as an interpunct instead of a period. An interpunct is a centered dot, as opposed to a period, which is aligned to the bottom of the line.
The interpunct was historically used to separate words in some scripts, such as classical Latin.
It is important to replace these for later when we want these values to be converted from character strings to numeric. We will again use the stringr
package. This time we will use the str_replace_all()
function which replaces all instances of a pattern in an individual string. In this case we want to replace all instances of the interpunct with a decimal point.
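A sketch of the replacement:

```r
# swap every interpunct "·" for a decimal point "."
table %<>%
  str_replace_all(pattern = "·", replacement = ".")
```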
Now we will try to split the string for each row into the columns of the table based on the presence of 2 or more spaces, since the gaps between columns appear to be wider than a single space. The resulting substrings will be shown surrounded by quotes.
The second page of the stringr
cheat sheet has more information about using “Special Characters” in stringr
. For example, \\s is interpreted as a whitespace character: the \\ indicates that the s should be treated as a special character and not simply the letter s. The {2,} quantifier matches 2 or more of the preceding pattern (here, 2 or more spaces), while {2} would match exactly 2 spaces.
So here we will separate the substrings into columns by 2 or more spaces:
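A sketch of the split (printed with str() to keep the output compact):

```r
# split each row at runs of 2 or more whitespace characters
table %>%
  str_split(pattern = "\\s{2,}") %>%
  str()
```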
List of 15
$ : chr [1:6] " low in fruits" "Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding" "250 g (200–300) per day" "94.9" ...
$ : chr [1:7] " low in vegetables" "Mean daily consumption of vegetables (fresh, frozen, cooked, canned, or dried vegetables," "360 g (290–430) per day" "94.9" ...
$ : chr [1:5] " low in legumes" "Mean daily consumption of legumes (fresh, frozen, cooked, canned, or dried legumes)" "60 g (50–70) per day" "94.9" ...
$ : chr [1:7] " low in whole grains" "Mean daily consumption of whole grains (bran, germ, and endosperm in their natural" "125 g (100–150) per day" "94.9" ...
$ : chr [1:5] " low in nuts and seeds" "Mean daily consumption of nut and seed foods" "21 g (16–25) per day" "94.9" ...
$ : chr [1:6] " low in milk" "Mean daily consumption of milk including non-fat, low-fat, and full-fat milk, excluding soy" "435 g (350–520) per day" "94.9" ...
$ : chr [1:6] " high in red meat" "Mean daily consumption of red meat (beef, pork, lamb, and goat, but excluding poultry, fish," "23 g (18–27) per day" "94.9" ...
$ : chr [1:6] " high in processed meat" "Mean daily consumption of meat preserved by smoking, curing, salting, or addition of" "2 g (0–4) per day" "36.9" ...
$ : chr [1:6] " high in sugar-sweetened Mean daily consumption of beverages with ≥50 kcal per 226.8 serving, including carbonated" "3 g (0–5) per day" "36.9" "beverages" ...
$ : chr [1:6] " low in fibre" "Mean daily intake of fibre from all sources including fruits, vegetables, grains, legumes, and" "24 g (19–28) per day" "94.9" ...
$ : chr [1:5] " low in calcium" "Mean daily intake of calcium from all sources, including milk, yogurt, and cheese" "1.25 g (1.00–1.50) per day" "94.9" ...
$ : chr [1:5] " low in seafood omega-3 Mean daily intake of eicosapentaenoic acid and docosahexaenoic acid" "250 mg (200–300) per day" "94.9" "fatty acids" ...
$ : chr [1:7] " low in polyunsaturated" "Mean daily intake of omega-6 fatty acids from all sources, mainly liquid vegetable oils," "11% (9–13) of total daily energy" "94.9" ...
$ : chr [1:6] " high in trans fatty acids" "Mean daily intake of trans fat from all sources, mainly partially hydrogenated vegetable oils" "0.5% (0.0–1.0) of total daily energy" "36.9" ...
$ : chr [1:8] " high in sodium" "24 h urinary sodium measured in g per day" "3 g (1–5) per day*" "26.2" ...
If we look closely, we can see that the sugar-sweetened beverage and the seafood categories had only one space between the first and second columns (the column naming the dietary category and the one describing the consumption suggestion in more detail).
For these two categories the values of those two columns are therefore still together in the same substring; there are no quotation marks adjacent to the word "Mean".
Here you can see how the next substring should have started with the word "Mean", which would be indicated by a new quotation mark " before it.
We can add an extra space in front of the word "Mean"
for these particular categories and then try splitting again.
Since we originally split based on 2 or more spaces, we can just add a space in front of the word “Mean” and then try splitting again. We can use the str_which()
function of the stringr
package to find the index of these particular cases.
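For example:

```r
# which strings mention the seafood or sugar-sweetened categories?
str_which(table, pattern = "seafood|sugar")
```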
[1] 9 12
Here we can see just those strings that match the pattern:
[1] " high in sugar-sweetened Mean daily consumption of beverages with ≥50 kcal per 226.8 serving, including carbonated 3 g (0–5) per day 36.9\n beverages beverages, sodas, energy drinks, fruit drinks, but excluding 100% fruit and vegetable juices\n "
[2] " low in seafood omega-3 Mean daily intake of eicosapentaenoic acid and docosahexaenoic acid 250 mg (200–300) per day 94.9\n fatty acids\n "
Now we can replace these values within the table object after adding a space in front of “Mean”.
table[str_which(table, pattern = "seafood|sugar")] <-
  str_replace(
    string = table[str_which(table, pattern = "seafood|sugar")],
    pattern = "Mean",
    replacement = " Mean")
And now we can try splitting again by 2 or more spaces:
We could also have just added a space in front of all occurrences of “Mean” in the table, since the split is performed on 2 or more spaces; the other elements in table would then be split just as before despite the additional space.
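The repeated split, reassigned this time (a sketch):

```r
# split on 2 or more whitespace characters again, now that the problem
# rows have an extra space before "Mean", and keep the result
table %<>%
  str_split(pattern = "\\s{2,}")
str(table)
```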
List of 15
$ : chr [1:6] " low in fruits" "Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding" "250 g (200–300) per day" "94.9" ...
$ : chr [1:7] " low in vegetables" "Mean daily consumption of vegetables (fresh, frozen, cooked, canned, or dried vegetables," "360 g (290–430) per day" "94.9" ...
$ : chr [1:5] " low in legumes" "Mean daily consumption of legumes (fresh, frozen, cooked, canned, or dried legumes)" "60 g (50–70) per day" "94.9" ...
$ : chr [1:7] " low in whole grains" "Mean daily consumption of whole grains (bran, germ, and endosperm in their natural" "125 g (100–150) per day" "94.9" ...
$ : chr [1:5] " low in nuts and seeds" "Mean daily consumption of nut and seed foods" "21 g (16–25) per day" "94.9" ...
$ : chr [1:6] " low in milk" "Mean daily consumption of milk including non-fat, low-fat, and full-fat milk, excluding soy" "435 g (350–520) per day" "94.9" ...
$ : chr [1:6] " high in red meat" "Mean daily consumption of red meat (beef, pork, lamb, and goat, but excluding poultry, fish," "23 g (18–27) per day" "94.9" ...
$ : chr [1:6] " high in processed meat" "Mean daily consumption of meat preserved by smoking, curing, salting, or addition of" "2 g (0–4) per day" "36.9" ...
$ : chr [1:7] " high in sugar-sweetened" "Mean daily consumption of beverages with ≥50 kcal per 226.8 serving, including carbonated" "3 g (0–5) per day" "36.9" ...
$ : chr [1:6] " low in fibre" "Mean daily intake of fibre from all sources including fruits, vegetables, grains, legumes, and" "24 g (19–28) per day" "94.9" ...
$ : chr [1:5] " low in calcium" "Mean daily intake of calcium from all sources, including milk, yogurt, and cheese" "1.25 g (1.00–1.50) per day" "94.9" ...
$ : chr [1:6] " low in seafood omega-3" "Mean daily intake of eicosapentaenoic acid and docosahexaenoic acid" "250 mg (200–300) per day" "94.9" ...
$ : chr [1:7] " low in polyunsaturated" "Mean daily intake of omega-6 fatty acids from all sources, mainly liquid vegetable oils," "11% (9–13) of total daily energy" "94.9" ...
$ : chr [1:6] " high in trans fatty acids" "Mean daily intake of trans fat from all sources, mainly partially hydrogenated vegetable oils" "0.5% (0.0–1.0) of total daily energy" "36.9" ...
$ : chr [1:8] " high in sodium" "24 h urinary sodium measured in g per day" "3 g (1–5) per day*" "26.2" ...
Looks better!
We want just the first column (the food category) and the third column (the optimal consumption amount suggested) for each row in the table.
We can use the map
function of the purrr
package to accomplish this.
The map
function allows us to perform the same action multiple times across each element within an object.
This following will allow us to select the 1st or 3rd substring from each element of the table
object.
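A sketch, using map() with an integer index to pluck an element from each row:

```r
library(purrr)

# take the 1st substring (food category) and the 3rd substring
# (suggested amount) from every row of the split table
category <- map(table, 1)
amount   <- map(table, 3)
```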
[[1]]
[1] " low in fruits"
[[2]]
[1] " low in vegetables"
[[3]]
[1] " low in legumes"
[[4]]
[1] " low in whole grains"
[[5]]
[1] " low in nuts and seeds"
[[6]]
[1] " low in milk"
[[1]]
[1] "250 g (200–300) per day"
[[2]]
[1] "360 g (290–430) per day"
[[3]]
[1] "60 g (50–70) per day"
[[4]]
[1] "125 g (100–150) per day"
[[5]]
[1] "21 g (16–25) per day"
[[6]]
[1] "435 g (350–520) per day"
Now we will create a tibble
using this data. However, currently both category
and amount
are of class list
. To create a tibble
we need to unlist the data to create vectors.
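A sketch:

```r
class(category)  # "list" before flattening

# unlist() turns each list into a plain character vector
category <- unlist(category)
amount   <- unlist(amount)

class(category)  # "character" afterwards
```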
[1] "list"
[1] "character"
[1] " low in fruits" " low in vegetables"
[3] " low in legumes" " low in whole grains"
[5] " low in nuts and seeds" " low in milk"
[7] " high in red meat" " high in processed meat"
[9] " high in sugar-sweetened" " low in fibre"
[11] " low in calcium" " low in seafood omega-3"
[13] " low in polyunsaturated" " high in trans fatty acids"
[15] " high in sodium"
[1] "250 g (200–300) per day"
[2] "360 g (290–430) per day"
[3] "60 g (50–70) per day"
[4] "125 g (100–150) per day"
[5] "21 g (16–25) per day"
[6] "435 g (350–520) per day"
[7] "23 g (18–27) per day"
[8] "2 g (0–4) per day"
[9] "3 g (0–5) per day"
[10] "24 g (19–28) per day"
[11] "1.25 g (1.00–1.50) per day"
[12] "250 mg (200–300) per day"
[13] "11% (9–13) of total daily energy"
[14] "0.5% (0.0–1.0) of total daily energy"
[15] "3 g (1–5) per day*"
We could have done all of this at once in one command like this:
Now we will create a tibble
, which is an important data frame structure in the tidyverse which allows us to use other packages in the tidyverse with our data.
We will name our tibble
columns now as we create our tibble
using the tibble()
function of both the tidyr
and the tibble
packages, as names are required in tibbles.
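A sketch of the construction:

```r
library(tibble)

# column names are supplied as we build the tibble
guidelines <- tibble(category = category,
                     amount   = amount)
guidelines
```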
# A tibble: 15 x 2
category amount
<chr> <chr>
1 " low in fruits" 250 g (200–300) per day
2 " low in vegetables" 360 g (290–430) per day
3 " low in legumes" 60 g (50–70) per day
4 " low in whole grains" 125 g (100–150) per day
5 " low in nuts and seeds" 21 g (16–25) per day
6 " low in milk" 435 g (350–520) per day
7 " high in red meat" 23 g (18–27) per day
8 " high in processed meat" 2 g (0–4) per day
9 " high in sugar-sweetened" 3 g (0–5) per day
10 " low in fibre" 24 g (19–28) per day
11 " low in calcium" 1.25 g (1.00–1.50) per day
12 " low in seafood omega-3" 250 mg (200–300) per day
13 " low in polyunsaturated" 11% (9–13) of total daily energy
14 " high in trans fatty acids" 0.5% (0.0–1.0) of total daily energy
15 " high in sodium" 3 g (1–5) per day*
Looking pretty good!
However, we want to separate the different amounts within the amount column.
Recall what the original table looked like:
We can use the tidyr::separate()
function to separate the data within the amount column into three new columns based on the optimal level and the optimal range. We can separate the values based on the open parenthesis "(" and the en dash "–" characters.
# The first column will be called optimal
# It will contain the 1st part of the amount column data before the 1st open parenthesis "("
# The 2nd column will be called lower
# It will contain the data after the "("
# The 3rd column will be called upper
# It will contain the 2nd part of the data based on the "–"
guidelines %<>%
  tidyr::separate(amount,
                  c("optimal", "lower", "upper"),
                  sep = "[[(|–]]")
head(guidelines)
# A tibble: 6 x 4
category optimal lower upper
<chr> <chr> <chr> <chr>
1 " low in fruits" "250 g " 200 300) per day
2 " low in vegetables" "360 g " 290 430) per day
3 " low in legumes" "60 g " 50 70) per day
4 " low in whole grains" "125 g " 100 150) per day
5 " low in nuts and seeds" "21 g " 16 25) per day
6 " low in milk" "435 g " 350 520) per day
Let’s also create a new variable/column in our tibble that indicates the direction (low or high) that can be harmful for each dietary factor.
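The exact code for this step is uncertain; one way to produce the direction and food columns (an assumption about the original approach) is to split category on the phrase " in ":

```r
# split " low in fruits" into direction = " low" and food = "fruits";
# extra = "merge" keeps multi-word foods such as "nuts and seeds" whole
guidelines %<>%
  tidyr::separate(category,
                  c("direction", "food"),
                  sep = " in ",
                  extra = "merge")
```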
# A tibble: 15 x 5
direction food optimal lower upper
<chr> <chr> <chr> <chr> <chr>
1 " low" fruits "250 g " 200 300) per day
2 " low" vegetables "360 g " 290 430) per day
3 " low" legumes "60 g " 50 70) per day
4 " low" whole grains "125 g " 100 150) per day
5 " low" nuts and seeds "21 g " 16 25) per day
6 " low" milk "435 g " 350 520) per day
7 " high" red meat "23 g " 18 27) per day
8 " high" processed meat "2 g " 0 4) per day
9 " high" sugar-sweetened "3 g " 0 5) per day
10 " low" fibre "24 g " 19 28) per day
11 " low" calcium "1.25 g " 1.00 1.50) per day
12 " low" seafood omega-3 "250 mg " 200 300) per day
13 " low" polyunsaturated "11% " 9 13) of total daily energy
14 " high" trans fatty acids "0.5% " 0.0 1.0) of total daily energy
15 " high" sodium "3 g " 1 5) per day*
If we wanted to remove the direction variable we could use the purrr::modify_at() function:
OK, looking better, but we still need a bit of cleaning to remove symbols and extra words from the columns. Some of the extra symbols include: "%"
, ")"
and the "*"
.
The "*" and the ")" are what we call metacharacters. These are characters that have special meanings within regular expressions.
Now we need the "\\" to indicate that we want these characters to be matched literally and not interpreted with their special meanings.
See here for more info about regular expressions in R.
Also, here we have a bit of an example using the str_count() function of stringr, which counts the number of matches of a pattern in a string. In this case we will look for individual characters, but you could also search for words or phrases.
Count the letter t:
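Defining the test string first (its contents match the output below):

```r
# a string containing a tab (\t), parentheses, and an asterisk
regextest <- "Testing for ts or\ttabs can be tricky.(yes, it really can!*)\n"
regextest
str_count(regextest, "t")
```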
[1] "Testing for ts or\ttabs can be tricky.(yes, it really can!*)\n"
[1] 5
Count tabs:
[1] 1
Count parentheses:
# this does not work because ")" on its own is an invalid regular expression
#str_count(regextest, ")")
# this does not work either because "\)" is an invalid escape in an R string
#str_count(regextest, "\)")
str_count(regextest, "\\)") #this works!
[1] 1
Count the occurrences of the asterisk:
# this also does not work; "*" on its own is an invalid regular expression
#str_count(regextest, "*")
# nor does this, because "\*" is an invalid escape in an R string
#str_count(regextest, "\*")
str_count(regextest, "\\*")#this works!
[1] 1
We also want to make a unit variable so that we can make sure that our units are consistent later.
[1] "250 g " "360 g " "60 g " "125 g " "21 g " "435 g " "23 g "
[8] "2 g " "3 g " "24 g " "1.25 g " "250 mg " "11% " "0.5% "
[15] "3 g "
Notice that the values that are percentages don’t have spaces between the number and the unit. We can separate the optimal
values by a space or a percent symbol "%"
using "|"
to indicate that we want to separate by either. In this case we will lose the “%” and will need to add it back to those values.
We can specify a space using an actual space or \\s
.
# A tibble: 15 x 6
direction food lower optimal unit upper
<chr> <chr> <chr> <chr> <chr> <chr>
1 " low" fruits 200 250 g 300) per day
2 " low" vegetables 290 360 g 430) per day
3 " low" legumes 50 60 g 70) per day
4 " low" whole grains 100 125 g 150) per day
5 " low" nuts and seeds 16 21 g 25) per day
6 " low" milk 350 435 g 520) per day
7 " high" red meat 18 23 g 27) per day
8 " high" processed meat 0 2 g 4) per day
9 " high" sugar-sweetened 0 3 g 5) per day
10 " low" fibre 19 24 g 28) per day
11 " low" calcium 1.00 1.25 g 1.50) per day
12 " low" seafood omega-3 200 250 mg 300) per day
13 " low" polyunsaturated 9 11 "" 13) of total daily ener…
14 " high" trans fatty acids 0.0 0.5 "" 1.0) of total daily ene…
15 " high" sodium 1 3 g 5) per day*
Great, so now we will add “%” to the unit
variable for the low in polyunsaturated
and high in trans fatty acids
rows.
First we need to replace the empty values with NA using the na_if()
function of the dplyr
package.
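A sketch:

```r
library(dplyr)

# turn empty-string units into NA so they can be filled in next
guidelines %<>%
  mutate(unit = na_if(unit, ""))
```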
# A tibble: 15 x 6
direction food lower optimal unit upper
<chr> <chr> <chr> <chr> <chr> <chr>
1 " low" fruits 200 250 g 300) per day
2 " low" vegetables 290 360 g 430) per day
3 " low" legumes 50 60 g 70) per day
4 " low" whole grains 100 125 g 150) per day
5 " low" nuts and seeds 16 21 g 25) per day
6 " low" milk 350 435 g 520) per day
7 " high" red meat 18 23 g 27) per day
8 " high" processed meat 0 2 g 4) per day
9 " high" sugar-sweetened 0 3 g 5) per day
10 " low" fibre 19 24 g 28) per day
11 " low" calcium 1.00 1.25 g 1.50) per day
12 " low" seafood omega-3 200 250 mg 300) per day
13 " low" polyunsaturated 9 11 <NA> 13) of total daily ener…
14 " high" trans fatty acids 0.0 0.5 <NA> 1.0) of total daily ene…
15 " high" sodium 1 3 g 5) per day*
Then to replace the NA
values, we can use the replace_na()
function in the tidyr
package and the mutate()
function of dplyr
to specify which values to replace, in this case the NA
values within the variable unit
. Essentially the existing variable gets reassigned with the new values, even though we mostly think of the mutate() function as creating new variables.
guidelines %<>%
dplyr::mutate(unit = replace_na(unit, "%"))
#now just to show these rows
guidelines %>%
filter(unit == "%")
# A tibble: 2 x 6
direction food lower optimal unit upper
<chr> <chr> <chr> <chr> <chr> <chr>
1 " low" polyunsaturated 9 11 % 13) of total daily energy
2 " high" trans fatty acids 0.0 0.5 % 1.0) of total daily ener…
Let’s also move unit
to be the last column. We can use the select()
and everything()
functions of the dplyr
package to do this.
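A sketch, relying on the behavior explained in the issue linked below: a negative selection followed by everything() puts the dropped column back at the end.

```r
# moves unit to the last column position
guidelines %<>%
  dplyr::select(-unit, dplyr::everything())
```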
Here you can see Hadley Wickham’s (Chief Scientist at RStudio) explanation for this behavior of select()
:
https://github.com/tidyverse/dplyr/issues/2838#issuecomment-306062800
To remove all of the remaining extra characters and words we will again use the stringr
package. This time we will use the str_remove()
function to remove all instances of these characters.
guidelines <- as_tibble(
  map(guidelines,
      str_remove,
      pattern = "\\) per day|\\) of total daily energy"))
guidelines <- as_tibble(
  map(guidelines,
      str_remove,
      pattern = "\\*"))
guidelines
guidelines
# A tibble: 15 x 6
direction food lower optimal upper unit
<chr> <chr> <chr> <chr> <chr> <chr>
1 " low" fruits 200 250 300 g
2 " low" vegetables 290 360 430 g
3 " low" legumes 50 60 70 g
4 " low" whole grains 100 125 150 g
5 " low" nuts and seeds 16 21 25 g
6 " low" milk 350 435 520 g
7 " high" red meat 18 23 27 g
8 " high" processed meat 0 2 4 g
9 " high" sugar-sweetened 0 3 5 g
10 " low" fibre 19 24 28 g
11 " low" calcium 1.00 1.25 1.50 g
12 " low" seafood omega-3 200 250 300 mg
13 " low" polyunsaturated 9 11 13 %
14 " high" trans fatty acids 0.0 0.5 1.0 %
15 " high" sodium 1 3 5 g
Nice! That’s pretty clean, but we can do a bit more.
One of the next things to notice about our data is the character classes of our variables.
Notice that the optimal amounts of consumption are currently of class character as indicated by the <chr>
just below the column names / variable names of the guidelines
tibble:
# A tibble: 15 x 6
direction food lower optimal upper unit
<chr> <chr> <chr> <chr> <chr> <chr>
1 " low" fruits 200 250 300 g
2 " low" vegetables 290 360 430 g
3 " low" legumes 50 60 70 g
4 " low" whole grains 100 125 150 g
5 " low" nuts and seeds 16 21 25 g
6 " low" milk 350 435 520 g
7 " high" red meat 18 23 27 g
8 " high" processed meat 0 2 4 g
9 " high" sugar-sweetened 0 3 5 g
10 " low" fibre 19 24 28 g
11 " low" calcium 1.00 1.25 1.50 g
12 " low" seafood omega-3 200 250 300 mg
13 " low" polyunsaturated 9 11 13 %
14 " high" trans fatty acids 0.0 0.5 1.0 %
15 " high" sodium 1 3 5 g
To convert these values to numeric we can use the mutate_at()
function of the dplyr
package.
The mutate_at()
function allows us to perform a function on specific columns/variables within a tibble. We need to indicate which variables that we would like to convert using vars()
. In this case if we look at the beginning of the guidelines tibble, we can see that lower, optimal, and upper should be converted. As these three columns are sequential, we can simply put a : between the first and last of them (lower:upper) to indicate that we want all the variables from one to the other, inclusive, to be converted.
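A sketch, assuming the column order shown above (lower, optimal, and upper are adjacent):

```r
# convert the three amount columns from character to numeric
guidelines %<>%
  dplyr::mutate_at(dplyr::vars(lower:upper), as.numeric)
```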
# A tibble: 15 x 6
direction food lower optimal upper unit
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 " low" fruits 200 250 300 g
2 " low" vegetables 290 360 430 g
3 " low" legumes 50 60 70 g
4 " low" whole grains 100 125 150 g
5 " low" nuts and seeds 16 21 25 g
6 " low" milk 350 435 520 g
7 " high" red meat 18 23 27 g
8 " high" processed meat 0 2 4 g
9 " high" sugar-sweetened 0 3 5 g
10 " low" fibre 19 24 28 g
11 " low" calcium 1 1.25 1.5 g
12 " low" seafood omega-3 200 250 300 mg
13 " low" polyunsaturated 9 11 13 %
14 " high" trans fatty acids 0 0.5 1 %
15 " high" sodium 1 3 5 g
Great! Now these variables are of class <dbl>
(stands for double) which indicates that they are numeric. Here is a link for more info on numeric classes in R.
If we had not replaced the "·" interpunct with a period, the conversion from character to numeric would be problematic and would result in NA values.
We seem to have lost the word "beverages" from the "sugar-sweetened beverages" category, as well as "fatty acids" from the "seafood omega-3 fatty acids" and "polyunsaturated fatty acids" categories, because those full category names were printed on two lines within the table. We would like to replace these values with the full names.
To select the food column we will show you several options. Only a couple will work well for reassigning the data in that particular variable within guidelines without assigning an intermediate data object. We will look at using mutate_at(), pull(), select(), and two styles of brackets: [,c("variable name")] and [["variablename"]].
The bracket [,c("variable name")] option and the select() option will grab a tibble (data frame) version of the food column out of guidelines. However, select() can't appear on the left-hand side of an assignment.
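A sketch of those two commands (assuming the guidelines tibble from above); each produces one of the identical tibble printouts that follow:

```r
#Both return a 15 x 1 tibble containing the food column
guidelines[, c("food")]
select(guidelines, food)
```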
# A tibble: 15 x 1
food
<chr>
1 fruits
2 vegetables
3 legumes
4 whole grains
5 nuts and seeds
6 milk
7 red meat
8 processed meat
9 sugar-sweetened
10 fibre
11 calcium
12 seafood omega-3
13 polyunsaturated
14 trans fatty acids
15 sodium
# A tibble: 15 x 1
food
<chr>
1 fruits
2 vegetables
3 legumes
4 whole grains
5 nuts and seeds
6 milk
7 red meat
8 processed meat
9 sugar-sweetened
10 fibre
11 calcium
12 seafood omega-3
13 polyunsaturated
14 trans fatty acids
15 sodium
pull() and the bracket [["variable name"]] option, in contrast, will grab the vector version of the food data:
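For example (assuming guidelines as above), either of these produces the character vector printed below:

```r
#Both return a character vector of the food names
pull(guidelines, "food")
guidelines[["food"]]
```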
[1] "fruits" "vegetables" "legumes"
[4] "whole grains" "nuts and seeds" "milk"
[7] "red meat" "processed meat" "sugar-sweetened"
[10] "fibre" "calcium" "seafood omega-3"
[13] "polyunsaturated" "trans fatty acids" "sodium"
[1] "fruits" "vegetables" "legumes"
[4] "whole grains" "nuts and seeds" "milk"
[7] "red meat" "processed meat" "sugar-sweetened"
[10] "fibre" "calcium" "seafood omega-3"
[13] "polyunsaturated" "trans fatty acids" "sodium"
The pull() function can be very useful when combined with other functions (for example, you typically want to use a vector with the str_replace() function), but just like select(), pull() can't appear on the left-hand side of an assignment.
This is not possible and will result in an error:
select(guidelines, food) <-
str_replace(
pull(guidelines,"food"),
pattern = "sugar-sweetened",
replacement = "sugar-sweetened beverages")
This will only print the result, but not reassign the food variable values:
guidelines %>%
pull(food)%>%
str_replace(
pattern = "sugar-sweetened",
replacement = "sugar-sweetened beverages")
[1] "fruits" "vegetables"
[3] "legumes" "whole grains"
[5] "nuts and seeds" "milk"
[7] "red meat" "processed meat"
[9] "sugar-sweetened beverages" "fibre"
[11] "calcium" "seafood omega-3"
[13] "polyunsaturated" "trans fatty acids"
[15] "sodium"
Using select() would also print a result, although the result structure is different (str_replace() coerces the tibble into a single deparsed string):
guidelines %>%
select(food)%>%
str_replace(
pattern = "sugar-sweetened",
replacement = "sugar-sweetened beverages")
[1] "c(\"fruits\", \"vegetables\", \"legumes\", \"whole grains\", \"nuts and seeds\", \"milk\", \"red meat\", \"processed meat\", \"sugar-sweetened beverages\", \"fibre\", \"calcium\", \"seafood omega-3\", \"polyunsaturated\", \"trans fatty acids\", \"sodium\")"
Question opportunity:
Why do these commands not reassign the food variable values?
The bracket option is a great alternative and allows us to reassign the values within guidelines easily. Either of the two styles of brackets, [,c("variable name")] and [["variablename"]], will work.
#1st method: `[,c("variable name")]`
#Replacing "sugar-sweetened" with "sugar-sweetened beverages"
guidelines[,c("food")] <-
str_replace(
pull(guidelines,"food"),
pattern = "sugar-sweetened",
replacement = "sugar-sweetened beverages")
#2nd method: `[["variablename"]]`
#Replacing "seafood omega-3" with "seafood omega-3 fatty acids"
guidelines[["food"]] <-
str_replace(
pull(guidelines,"food"),
pattern = "seafood omega-3",
replacement = "seafood omega-3 fatty acids")
guidelines
# A tibble: 15 x 6
direction food lower optimal upper unit
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 " low" fruits 200 250 300 g
2 " low" vegetables 290 360 430 g
3 " low" legumes 50 60 70 g
4 " low" whole grains 100 125 150 g
5 " low" nuts and seeds 16 21 25 g
6 " low" milk 350 435 520 g
7 " high" red meat 18 23 27 g
8 " high" processed meat 0 2 4 g
9 " high" sugar-sweetened beverages 0 3 5 g
10 " low" fibre 19 24 28 g
11 " low" calcium 1 1.25 1.5 g
12 " low" seafood omega-3 fatty acids 200 250 300 mg
13 " low" polyunsaturated 9 11 13 %
14 " high" trans fatty acids 0 0.5 1 %
15 " high" sodium 1 3 5 g
Finally, the best option is probably the mutate_at() function from dplyr. In this case we need to include ~ in front of the function that we would like to use on the values in our food variable. We also include . as a placeholder to reference the data that we want to use within str_replace() (which in this case is the food variable values of guidelines).
Notice we didn't need this when we previously used mutate_at() with the as.numeric() function. This is because the str_replace() function requires us to specify what data we are using as one of its arguments, while as.numeric() does not.
#Replacing "polyunsaturated" with "polyunsaturated fatty acids"
guidelines%<>%
mutate_at(vars(food),
~str_replace(
string = .,
pattern = "polyunsaturated",
replacement = "polyunsaturated fatty acids"))
guidelines
# A tibble: 15 x 6
direction food lower optimal upper unit
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 " low" fruits 200 250 300 g
2 " low" vegetables 290 360 430 g
3 " low" legumes 50 60 70 g
4 " low" whole grains 100 125 150 g
5 " low" nuts and seeds 16 21 25 g
6 " low" milk 350 435 520 g
7 " high" red meat 18 23 27 g
8 " high" processed meat 0 2 4 g
9 " high" sugar-sweetened beverages 0 3 5 g
10 " low" fibre 19 24 28 g
11 " low" calcium 1 1.25 1.5 g
12 " low" seafood omega-3 fatty acids 200 250 300 mg
13 " low" polyunsaturated fatty acids 9 11 13 %
14 " high" trans fatty acids 0 0.5 1 %
15 " high" sodium 1 3 5 g
This might be considered the best option because it is more readable: it is clear that the values we are replacing come from the food variable of guidelines.
There is one last minor detail… the direction variable still has leading spaces. We can use str_trim() to fix that! (You could also use str_squish(), which removes repeated interior white space as well as leading and trailing white space.)
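The trimming step itself can be sketched with the same mutate_at() pattern (assuming the %<>% pipe, as in the rest of this tutorial):

```r
#Trim the leading white space from the direction column
guidelines %<>%
  mutate_at(vars(direction), str_trim)
guidelines
```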
# A tibble: 15 x 6
direction food lower optimal upper unit
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 low fruits 200 250 300 g
2 low vegetables 290 360 430 g
3 low legumes 50 60 70 g
4 low whole grains 100 125 150 g
5 low nuts and seeds 16 21 25 g
6 low milk 350 435 520 g
7 high red meat 18 23 27 g
8 high processed meat 0 2 4 g
9 high sugar-sweetened beverages 0 3 5 g
10 low fibre 19 24 28 g
11 low calcium 1 1.25 1.5 g
12 low seafood omega-3 fatty acids 200 250 300 mg
13 low polyunsaturated fatty acids 9 11 13 %
14 high trans fatty acids 0 0.5 1 %
15 high sodium 1 3 5 g
#gives identical results in this case
guidelines%<>%
mutate_at(vars(direction), str_squish)
guidelines
# A tibble: 15 x 6
direction food lower optimal upper unit
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 low fruits 200 250 300 g
2 low vegetables 290 360 430 g
3 low legumes 50 60 70 g
4 low whole grains 100 125 150 g
5 low nuts and seeds 16 21 25 g
6 low milk 350 435 520 g
7 high red meat 18 23 27 g
8 high processed meat 0 2 4 g
9 high sugar-sweetened beverages 0 3 5 g
10 low fibre 19 24 28 g
11 low calcium 1 1.25 1.5 g
12 low seafood omega-3 fatty acids 200 250 300 mg
13 low polyunsaturated fatty acids 9 11 13 %
14 high trans fatty acids 0 0.5 1 %
15 high sodium 1 3 5 g
OK! Now we know how much of each dietary factor we generally need for optimal health according to the guidelines used in this article.
You can identify when stringr might be useful for particular kinds of data: genomic sequence data, text data, etc.
You know how to use some very useful stringr functions like:
Function | Use |
---|---|
str_replace() |
replace or exchange a pattern of characters for another |
str_split() |
split or divide strings of any size (words/sentences/paragraphs/) into substrings |
str_subset() |
select part of a string based on a characteristic |
str_count() |
count the occurrence of a specific character |
str_which() |
identify where an occurrence of a specific character occurs |
str_remove() |
remove characters from your strings |
str_trim() |
remove leading and trailing white space |
str_squish() |
remove repeated interior white space as well as leading/trailing white space |
Don’t forget \\
!
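For instance, to match a literal period you must escape it, since . is a regex wildcard (a small illustration, not from the original tutorial):

```r
library(stringr)

str_remove("a.b", ".")     # "." matches any character, so "a" is removed: ".b"
str_remove("a.b", "\\.")   # "\\." matches the literal period: "ab"
```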
pdftools
tidyr
dplyr
purrr
For more helpful tutorials of a similar style as this one see here. More will be coming including a longer version of this tutorial!
For more information about the tidyverse see here.
For information on other stringr
functions see here.
See here for more info about regular expressions in R.
Get Cheat Sheets here.
Here are links for these packages and the others used in this tutorial:
Package | Use |
---|---|
here | to easily load and save data |
dplyr | to arrange/filter/select/compare specific subsets of the data |
pdftools | to read a pdf into R |
stringr | to manipulate the text within the pdf of the data |
magrittr | to use the %<>% piping operator |
purrr | to perform functions on all columns of a tibble |
tibble | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr |
tidyr | to separate data within a column into multiple columns |
Here is a summary of helpful links about many of the functions used in this tutorial:
(Thanks to Leonardo Collado-Torres for gathering these!)
here::here()
dplyr::nth() last() and first()
pdftools::pdf_text()
dplyr::glimpse()
stringr::str_split() and str_split_fixed()
stringr::str_replace_all()
purrr::map()
tibble::tibble()
tidyr::separate()
purrr::modify_at()
readr::read_file()
stringr::str_count()
dplyr::pull()
dplyr::na_if()
dplyr::mutate()
dplyr::filter() (not stats::filter() !!)
dplyr::vars()