**Logical operators:** We often want to filter a data frame for certain
observations (e.g. flights from a particular airport) so that we can
examine observations with certain characteristics separately from the rest of
the data. To do so we use the `filter` function and a series of
**logical operators**. The most commonly used logical operators for data
analysis are as follows:

- `==` means "equal to"
- `!=` means "not equal to"
- `>` or `<` means "greater than" or "less than"
- `>=` or `<=` means "greater than or equal to" or "less than or equal to"
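
As a quick illustration of these operators inside `filter`, here is a minimal sketch on a small made-up data frame. The name `toy_ops` and all of its values are hypothetical and are not taken from the flights data.

```{r}
library(dplyr)

# Hypothetical toy data, for illustration only
toy_ops <- data.frame(
  origin    = c("JFK", "LGA", "EWR", "JFK"),
  dep_delay = c(-3, 12, 0, 45)
)

# `==` keeps rows where origin is exactly "JFK"
from_jfk <- toy_ops %>% filter(origin == "JFK")

# `!=` keeps rows where origin is anything other than "JFK"
not_jfk <- toy_ops %>% filter(origin != "JFK")

# `>=` keeps rows with a departure delay of at least 12 minutes
late_12 <- toy_ops %>% filter(dep_delay >= 12)

nrow(from_jfk)
nrow(not_jfk)
nrow(late_12)
```

Each call returns a data frame containing only the rows for which the condition is `TRUE`.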

#### `summarize` (calculate statistics)
We can also obtain numerical summaries for these flights:
```{r lax-flights-summ}
lax_flights %>%
  summarize(mean_dd   = mean(dep_delay, na.rm = TRUE),
            median_dd = median(dep_delay, na.rm = TRUE),
            n_dd      = n())
```
Note that in the `summarize` function we created a list of three different
numerical summaries that we were interested in. The names of these elements are
user defined, like `mean_dd`, `median_dd`, `n_dd`, and you can customize these names
as you like (just don't use spaces in your names). Calculating these summary
statistics also requires that you know the function calls. Note that `n()` reports
the sample size.

**Summary statistics:** Some useful function calls for summary statistics for a
single numerical variable are as follows:

- `mean`
- `median`
- `sd`
- `IQR`
- `min`
- `max`

Note that each of these functions takes a single vector as an argument and
returns a single value.

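Functions you may not be familiar with (and that we will see in more detail in coming weeks) include the standard deviation and the interquartile range:
$$\mbox{sd} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \overline{X})^2}$$
$$\mbox{IQR} = Q_3 - Q_1 \quad \mbox{(75th percentile minus 25th percentile)}$$
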
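
To connect these formulas to the function calls, the sketch below computes both quantities by hand and compares them with `sd` and `IQR`. The vector `x` holds made-up values chosen only for illustration, and the comparison assumes R's default quantile method.

```{r}
# Hypothetical delays (in minutes), for illustration only
x <- c(2, 4, 4, 5, 10)

# Standard deviation from the formula: sum the squared deviations from the
# mean, divide by n - 1, then take the square root
n <- length(x)
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

# IQR as the 75th percentile minus the 25th percentile
iqr_by_hand <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# Both should agree with the built-in functions
all.equal(sd_by_hand, sd(x))
all.equal(iqr_by_hand, IQR(x))
```
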
We can also filter based on multiple criteria. Suppose we are interested in
flights headed to San Francisco (SFO) in February:
```{r}
sfo_feb_flights <- flights %>%
  filter(dest == "SFO", month == 2)
```
Note that we can separate the conditions using commas if we want flights that
are both headed to SFO **and** in February. If we are interested in flights that
are either headed to SFO **or** in February, we can use `|` instead of the comma.

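
The difference between the comma and `|` is easiest to see on a tiny example. This is a sketch on a made-up data frame; `toy_dest` and its values are hypothetical.

```{r}
library(dplyr)

# Hypothetical toy data: one row for each combination of destination and month
toy_dest <- data.frame(
  dest  = c("SFO", "SFO", "LAX", "LAX"),
  month = c(2, 7, 2, 7)
)

# Comma: both conditions must hold
both <- toy_dest %>% filter(dest == "SFO", month == 2)

# `|`: at least one of the conditions must hold
either <- toy_dest %>% filter(dest == "SFO" | month == 2)

nrow(both)    # 1 row: SFO in February
nrow(either)  # 3 rows: every row except LAX in July
```
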
1. Create a new data frame that includes flights headed to SFO in February,
and save this data frame as `sfo_feb_flights`. How many flights
meet these criteria?
2. Describe the distribution of the **arrival** delays of these flights using summary and/or appropriate summary statistics.

#### `group_by` (group before summarizing)
Another useful technique is quickly calculating summary
statistics for various groups in your data frame. For example, we can modify the
above command using the `group_by` function to get the same summary stats for
each origin airport:
```{r summary-custom-list-origin}
sfo_feb_flights %>%
  group_by(origin) %>%
  summarize(median_dd = median(dep_delay, na.rm = TRUE),
            iqr_dd    = IQR(dep_delay, na.rm = TRUE),
            n_flights = n())
```
Here, we first grouped the data by `origin`, and then calculated the summary
statistics.
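
To see what grouping changes, here is a minimal sketch on made-up data. The data frame `toy_delays` and its values are hypothetical. After `group_by`, `summarize` returns one row per group rather than one row overall.

```{r}
library(dplyr)

# Hypothetical toy data with one missing delay
toy_delays <- data.frame(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  dep_delay = c(0, 10, NA, 5, 15)
)

toy_by_origin <- toy_delays %>%
  group_by(origin) %>%
  summarize(median_dd = median(dep_delay, na.rm = TRUE),
            n_flights = n())

toy_by_origin
```

Note that `n()` counts every row in the group, including the row with a missing delay, while `median` with `na.rm = TRUE` ignores it.
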

3. Calculate the median and interquartile range for `arr_delay` of flights
in the `sfo_feb_flights` data frame, grouped by carrier. Which carrier
has the most variable arrival delays (as measured by IQR)?

### `arrange` departure delays over months
Which month would you expect to have the highest average delay departing from an
NYC airport?

Let's think about how we would answer this question:

- First, calculate monthly averages for departure delays. With the new language
  we are learning, we need to
    + `group_by` months, then
    + `summarize` mean departure delays.
- Then, we need to `arrange` these average delays in `desc`ending order.

```{r mean-dep-delay-months}
flights %>%
  group_by(month) %>%
  summarize(mean_dd = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_dd))
```

### On time departure rate for NYC airports
Suppose you will be flying out of NYC and want to know which of the
three major NYC airports has the best on time departure rate.
Suppose also that for you a flight that is delayed for less than 5 minutes is
basically "on time". You consider any flight delayed for 5 minutes or more to be
"delayed".

In order to determine which airport has the best on time departure rate,
we need to

- first classify each flight as "on time" or "delayed",
- then group flights by origin airport,
- then calculate on time departure rates for each origin airport,
- and finally arrange the airports in descending order of on time departure
  percentage.

#### `mutate` (create a new variable)
Let's start with classifying each flight as "on time" or "delayed" by
creating a new variable with the `mutate` function.
```{r dep-type}
flights <- flights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
```
The first argument in the `mutate` function is the name of the new variable
we want to create, in this case `dep_type`. Its value comes from `ifelse`: if
`dep_delay < 5` we classify the flight as `"on time"`, and otherwise, i.e. if
the flight is delayed for 5 or more minutes, as `"delayed"`.

Note that we are also overwriting the `flights` data frame with the new
version of this data frame that includes the new `dep_type` variable.

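
The sketch below shows how `ifelse` behaves at the 5 minute cutoff and for missing delays. The data frame `toy_dep` and its values are hypothetical.

```{r}
library(dplyr)

# Hypothetical delays around the cutoff, plus one missing value
toy_dep <- data.frame(dep_delay = c(-2, 4, 5, 30, NA))

toy_dep <- toy_dep %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

toy_dep$dep_type
```

A delay of exactly 5 minutes is classified as `"delayed"`, and a missing `dep_delay` yields a missing `dep_type`, which is why `na.rm = TRUE` is still useful in later summaries.
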
We can handle all the remaining steps in one code chunk:
```{r}
flights %>%
  group_by(origin) %>%
  summarize(ot_dep_rate = mean(dep_type == "on time", na.rm = TRUE)) %>%
  arrange(desc(ot_dep_rate))
```
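
The rate calculation relies on the fact that R treats `TRUE` as 1 and `FALSE` as 0, so the mean of a logical vector is the proportion of `TRUE` values. Here is a minimal sketch with a made-up vector; `dep_type_demo` is a hypothetical name.

```{r}
# Hypothetical classifications, including one missing value
dep_type_demo <- c("on time", "delayed", "on time", "on time", NA)

# Comparison gives a logical vector: TRUE FALSE TRUE TRUE NA
is_on_time <- dep_type_demo == "on time"

# Proportion of TRUE among the non-missing values: 3 out of 4
ot_rate_demo <- mean(is_on_time, na.rm = TRUE)
ot_rate_demo
```
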
4. If you were selecting an airport (of the three NYC airports in the dataset) simply based on on time departure percentage, which NYC airport would you choose to fly out of? (How did you define "on time"? 0 min? 5 min? Something else?)

* * *

## To Turn In
5. Mutate the data frame so that it includes a new variable that contains the
average speed, `avg_speed`, traveled by the plane for each flight (in mph).
**Hint:** Average speed can be calculated as distance divided by
number of hours of travel, and note that `air_time` is given in minutes.
6. Another useful `dplyr` filtering helper function is `between`. What does it do? Use it to find flights that left between 0 and 60 minutes late. How many such flights are there?
7. Suppose you really dislike departure delays, and you want to schedule
your travel in a month that minimizes your potential departure delay leaving
NYC. One option is to choose the month with the lowest mean departure delay.
Another option is to choose the month with the lowest median departure delay.
What are the pros and cons of these two choices? Which month would you choose?
8. Which month has the highest average arrival delay from an NYC airport? What
about the highest median arrival delay? Which of these measures is more
reliable for deciding which month(s) to avoid flying if you really dislike
delayed flights?