-
Notifications
You must be signed in to change notification settings - Fork 0
/
external-data.qmd
295 lines (199 loc) · 10.3 KB
/
external-data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
---
title: "Starting with Data"
engine: knitr
code-annotations: hover
---
## Library Loading
Begin by loading any needed libraries.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Load the `tidyverse` meta-package.
```{r r-start, warning = F, message = F}
# Load needed library
library(tidyverse)
```
## [{{< fa brands python >}} Python]{.py}
Load the `pandas` and `os` libraries.
```{python py-libs}
# Load needed libraries
import os
import pandas as pd
```
:::
## Data Import
We can now load an external dataset derived from the `lterdatasampler` [{{< fa brands r-project >}} R]{.r} package (see [here](https://lter.github.io/lterdatasampler/)) with both [{{< fa brands r-project >}} R]{.r} and [{{< fa brands python >}} Python]{.py}. This relatively simple operation is also a nice chance to showcase how 'namespacing' (i.e., indicating which package a given function comes from) differs between the two languages.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Namespacing in [{{< fa brands r-project >}} R]{.r} is accomplished by doing `package_name::function_name` and is optional (though, in my opinion, good practice!). Note that we use the assignment operator (`<-`) to assign the contents of the CSV to an object.
```{r r-ext-data}
# Read in vertebrate data CSV
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))
```
Recall from our file path module that the `file.path` function accounts for computer operating system differences.
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py}, namespacing is required and is done via `package_name.function_name`. Note that we use the assignment operator (`=`) to assign the contents of the CSV to a variable.
```{python py-ext-data}
# Read in vertebrate data CSV
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))
```
Recall from our file path module that the `join` function accounts for computer operating system differences.
:::
## Tabular Data [Type]{.py}/[Class]{.r}
Data stored in CSVs (and similar data formats like Microsoft Excel, etc.) has a unique [type]{.py}/[class]{.r} that differs from some of the categories we covered in the "Fundamentals" section.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
In [{{< fa brands r-project >}} R]{.r}, such data are class `data.frame`
```{r r-class}
# Check class of a data object
class(vert_r)
```
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py}, such data are type `DataFrame` and this variable type is defined by the `pandas` library. This is the standard type returned by the `pandas` `read_csv` function.
```{python py-class}
# Check type of a data object
type(vert_py)
```
:::
## Making Heads or Tails of Data
Checking the 'head' or 'tail' of the data (i.e., the first or last few rows of the data respectively) is a nice way of getting a sense for the general format of the dataframe being assessed.
::: panel-tabset
## `r fontawesome::fa("r-project", fill = "#FF9B00", a11y = "sem")` R
In [{{< fa brands r-project >}} R]{.r}, we use the `head` or `tail` function to return the first or last rows respectively.
```{r r-head}
# Check out head
utils::head(vert_r, n = 2) # <1>
```
1. Note that we're using the _optional_ `n` argument to specify the number of rows to return
```{r r-tail}
# Check out tail
utils::tail(vert_r, n = 3)
```
## `r fontawesome::fa("python", fill = "#077DC2", a11y ="sem")` Python
In [{{< fa brands python >}} Python]{.py}, we use the `head` or `tail` method to return the first or last rows respectively. Note that these methods are only available to variables of type `DataFrame`. All methods are appended to the end of the variable of the appropriate type separated by a period.
```{python py-head}
# Check out head
vert_py.head(3)
```
```{python py-tail}
# Check out tail
vert_py.tail(2)
```
:::
## Data Structure
While it is nice to know the [type]{.py}/[class]{.r} of the data table generally, we often need to know the [type]{.py}/[class]{.r} of the columns within those data tables. Our operations are typically aimed at modifying particular columns and pre-requisite to that is knowing the [type]{.py}/[class]{.r} of the column to know what actions are available to us.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
[{{< fa brands r-project >}} R]{.r} uses the `str` function to assess data structure. Structure includes the dimensions of the data (i.e., number of rows and columns) as well as the class of the data (`data.frame`) and the class of each column.
```{r r-structure}
utils::str(vert_r)
```
Be careful to not confuse this with the [{{< fa brands python >}} Python]{.py} function `str` that coerces values to type "string"!
## [{{< fa brands python >}} Python]{.py}
When we want to know the type of each column in a [{{< fa brands python >}} Python]{.py} `DataFrame`, we can use the `dtypes` "attribute". Attributes are akin to a method but they completely lack arguments that might modify their behavior. As a consequence they are extremely precisely defined. The `dtypes` attribute returns the type of each column.
```{python py-structure}
vert_py.dtypes
```
:::
## Data Summaries
We often want to begin our exploration of a given dataset by getting a summary of each column. This is rarely what we actually need for statistics or visualization but it is a nice high-level way of getting a sense for the composition of the data.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
[{{< fa brands r-project >}} R]{.r} provides the `summary` function to summarize all columns in a dataset. Note that it is not terribly informative for columns of class character.
```{r r-describe}
# Get summary of data
summary(vert_r)
```
## [{{< fa brands python >}} Python]{.py}
[{{< fa brands python >}} Python]{.py} has the `describe` method for getting similar information about each column of a `DataFrame`.
```{python py-describe}
# Describe data
vert_py.describe()
```
[{{< fa brands python >}} Python]{.py} also offers the `info` method to identify only the count of non-null entries in each column as well as the type of each column.
```{python py-describe2}
# Check info of data
vert_py.info()
```
:::
## Identifying Columns
If we want to skip these steps and just identify the full set of column [labels]{.py}/[names]{.r} there is a small operation for stripping that information out in both languages.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
[{{< fa brands r-project >}} R]{.r} has the `names` function to quickly return a vector of the column names in a given `data.frame` object.
```{r r-names}
# Check column names
names(vert_r)
```
## [{{< fa brands python >}} Python]{.py}
[{{< fa brands python >}} Python]{.py} has the `columns` attributes for returning a simple list of column labels in a given `DataFrame` variable.
```{python py-names}
# Check column labels
vert_py.columns
```
:::
## Accessing Column(s)
We may want to access a particular column or set of columns. There are several approaches we might use to access a particular column in either [{{< fa brands python >}} Python]{.py} or [{{< fa brands r-project >}} R]{.r}. They are:
- Indexing the column by its location
- Indexing the column by its [label]{.py}/[name]{.r}
- Indexing the column with an operator
In all of the following examples we'll use the `head` [method]{.py}/[function]{.r} to simplify the output.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
If we know the order of the columns we can use the same syntax as when we indexed a vector.
```{r r-cols1}
# Use column indexing
head(vert_r[1])
```
Note that if we also want to account for rows we would put the desired column number on the right and the desired row number on the left.
```{r r-cols1b}
# Use row/column indexing
head(vert_r[, 1])
```
If preferred we could instead substitute the number for the column name.
```{r r-cols2}
head(vert_r["year"])
```
In [{{< fa brands r-project >}} R]{.r}, the column operator is a `$` and we place that character between the object and column names (e.g., `data_name$column_name`).
```{r r-cols3}
head(vert_r$year)
```
## [{{< fa brands python >}} Python]{.py}
If we know the index position of the desired column we can use the `iloc` method (short for "integer location"). Note that the `iloc` method uses square brackets instead of parentheses (typical methods use parentheses). We must include a colon with either nothing or the start/stop bounds (see "Slicing" in "Fundamentals"). A colon without specifying bounds returns all rows.
```{python py-cols1}
# Use row/column indexing
vert_py.iloc[: , 0].head()
```
If preferred we could instead substitute the number for the column label.
```{python py-cols2}
vert_py["year"].head()
```
In [{{< fa brands python >}} Python]{.py}, the column operator is a `.` and we place that character between the variable name and column label (e.g., `data_name.column_name`).
```{python py-cols3}
vert_py.year.head()
```
:::
When we want to access multiple columns we can still use either index positions or column [labels]{.py}/[names]{.r} but we _cannot_ use the column operator.
We'll continue to use the `head` [method]{.py}/[function]{.r} to simplify outputs.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
In [{{< fa brands r-project >}} R]{.r} we could use a concatenated vector of index positions to specify particular columns by their positions.
```{r r-multicols1}
# Use multiple column index positions
head(vert_r[, c(1:2, 13)])
```
Instead we could use a vector of column names for extra precision.
```{r r-multicols2}
# Use multiple column names
head(vert_r[, c("year", "sitecode", "weight_g")])
```
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py} we could use the `iloc` method for multiple columns and simply supply a list of the column index positions in which we are interested. Note that this method is different from many of the other methods we've covered so far because it uses square brackets instead of parentheses.
```{python py-multicols1}
# Use multiple column index positions
vert_py.iloc[:, [0, 1, 12]].head()
```
Or we could use the `.loc` method and supply a list of column labels. Note that the `loc` method also uses square brackets instead of parentheses and requires a colon to the left of the comma in order to return all rows.
```{python py-multicols2}
vert_py.loc[:, ["year", "sitecode", "weight_g"]].head()
```
:::