---
title: "Dataset Summary"
author: "Daniel Falster & Susie Zajitschek"
date: "06/07/2018"
output:
  html_document:
    fig_height: 6
    fig_width: 10
    df_print: paged
    rows.print: 10
    code_folding: show
    theme: yeti
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: true
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE, echo=TRUE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, cache=FALSE)
root.dir = rprojroot::find_root("README.md")
knitr::opts_knit$set(root.dir = root.dir)
```
```{r}
library(readr)
library(dplyr)
library(skimr)
library(ggplot2)
library(scales)
library(viridis)
library(knitr)
library(pander)
library(kableExtra)
source("R/data_load_clean.R")
```
# Loading data
Read in the cleaned data (stored as `.rds`, so variable types are preserved):
```{r}
data <- readRDS("export/data_clean.rds")
```
Note: cleaned data were generated by running
```{r, eval=FALSE}
data <- load_raw("data/dr7.0_all_control_data.csv") %>% clean_raw_data()
```
which uses the following function:
```{r}
clean_raw_data
```
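As a side note on the `.rds` format used above, the following is a minimal sketch of why it is convenient for cached, cleaned data. It uses a small toy data frame (the columns and values are made up for illustration, not taken from the dataset) and shows that column classes survive a `saveRDS()`/`readRDS()` round trip, whereas a CSV round trip loses the factor unless types are specified again on import.

```r
# Toy data frame standing in for the cleaned data (hypothetical values).
toy <- data.frame(
  sex = factor(c("male", "female", "male")),
  age_in_days = c(70L, 112L, 98L),
  weight = c(21.5, 19.2, 24.1)
)

# Round trip through .rds: column classes come back unchanged.
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
from_rds <- readRDS(rds_path)

# Round trip through .csv: the factor comes back as character
# unless column types are declared again when reading.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compare the column classes after each round trip.
sapply(from_rds, class)
sapply(from_csv, class)
```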
# Data overview
Number of rows & columns:
```{r}
data %>% dim()
```
There are `r data %>% names() %>% length()` variables in the dataset:
```{r}
data %>% names()
```
Now use `skimr` to take a quick look at all variables:
```{r, results='asis'}
x <- data %>% skimr::skim()
pander::pander(x)
```
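For readers without `skimr` installed, a rough base-R sketch of the same idea is given here. The helper `column_overview()` is hypothetical (it is not part of this project or of `skimr`) and reports only the class, the number of missing values, and the number of distinct non-missing values per column, demonstrated on a toy data frame.

```r
# Hypothetical helper: one row per column with class, missing count,
# and number of distinct non-missing values.
column_overview <- function(df) {
  data.frame(
    variable  = names(df),
    class     = vapply(df, function(col) class(col)[1], character(1),
                       USE.NAMES = FALSE),
    n_missing = vapply(df, function(col) sum(is.na(col)), integer(1),
                       USE.NAMES = FALSE),
    n_unique  = vapply(df, function(col) length(unique(col[!is.na(col)])),
                       integer(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Toy data (made-up values) to show the output shape.
toy <- data.frame(
  sex = c("male", "female", NA, "male"),
  weight = c(21.5, NA, 24.1, 21.5),
  stringsAsFactors = FALSE
)
overview <- column_overview(toy)
overview
```

On the real data this would be called as `column_overview(data)`.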
Next we'll look at some specific variables of potential importance.
# Production center
Contributions by `production_center`:
```{r}
x <- data %>% group_by(production_center) %>% summarise(n=n())
ggplot(x, aes(reorder(production_center, n), n)) +
geom_col() + coord_flip()
x
```
# Strains
There are several strains under the variable `strain_name`:
```{r}
data$strain_name %>% unique() %>% length()
```
```{r}
data$strain_name %>% table() %>% sort(decreasing = TRUE) %>%
tibble(variable=names(.), count = .) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
```
There are also several strains under the variable `strain_accession_id`:
```{r}
data$strain_accession_id %>% unique() %>% length()
```
```{r}
data$strain_accession_id %>% table() %>% sort(decreasing = TRUE) %>%
tibble(variable=names(.), count = .) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
```
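If the two counts above differ, it may be worth checking whether `strain_name` and `strain_accession_id` map one-to-one. The following is a minimal base-R sketch of such a check on toy data; the strain names and IDs are invented for illustration, and only the two column names come from the dataset.

```r
# Toy data: strain "B" is (hypothetically) recorded under two accession IDs.
toy <- data.frame(
  strain_name = c("A", "A", "B", "B", "C"),
  strain_accession_id = c("ID:1", "ID:1", "ID:2", "ID:3", "ID:4"),
  stringsAsFactors = FALSE
)

# Unique name/ID combinations observed in the data.
name_id_pairs <- unique(toy[, c("strain_name", "strain_accession_id")])

# Number of distinct IDs recorded for each strain name.
ids_per_name <- table(name_id_pairs$strain_name)

# Names linked to more than one ID break a one-to-one mapping.
ambiguous_names <- names(ids_per_name)[ids_per_name > 1]
ambiguous_names
```

The same check can be run in the other direction (names per ID) by tabulating `name_id_pairs$strain_accession_id` instead.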
# Weights
Overall distribution of weights:
```{r}
ggplot(data, aes(x=weight)) +
geom_histogram(bins=50)
```
Weights by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=weight, fill=sex)) +
geom_histogram(bins=50) +
scale_y_log10() +
facet_wrap( ~ production_center, ncol=1)
```
# Ages
Some records have strongly negative values of `age_in_days`, and the overall range in the data is implausibly wide:
```{r}
range(data$age_in_days, na.rm=TRUE)
ggplot(data, aes(x=age_in_days)) +
geom_histogram(bins=50)
```
So for now we'll filter those out, keeping only ages between 0 and 500 days, to give a reasonable distribution of ages:
```{r}
data <- data %>% filter(age_in_days > 0 & age_in_days < 500)
ggplot(data, aes(x=age_in_days)) +
geom_histogram(bins=50)
```
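Before relying on the filtered data, it can help to know how many records the age filter removes and why. Note that `dplyr::filter()` also drops rows where the condition evaluates to `NA`, so records with a missing age are removed as well. The sketch below tallies each case on a toy vector of ages (made-up values); the category labels are only illustrative.

```r
# Toy ages (made-up values), including boundary cases and a missing value.
toy <- data.frame(age_in_days = c(-300, 0, 56, 120, 499, 500, 2000, NA))

# Classify each record by how the condition
# `age_in_days > 0 & age_in_days < 500` treats it.
age_status <- with(toy, ifelse(
  is.na(age_in_days), "missing",
  ifelse(age_in_days <= 0, "non-positive",
         ifelse(age_in_days >= 500, "500 or more", "kept"))
))

# Number of records in each category.
status_counts <- table(age_status)
status_counts
```

Running the same classification on the unfiltered data would show how much of the dataset is affected by each rule.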
Age by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=age_in_days, fill=sex)) +
geom_histogram(bins=50) +
scale_y_log10() +
facet_wrap( ~ production_center, ncol=1)
```
Age vs weight by sex:
```{r}
data %>%
filter(sex %in% c("male", "female")) %>%
ggplot(aes(x=age_in_days, y=weight)) +
geom_hex() +
viridis::scale_fill_viridis() +
coord_fixed() +
facet_wrap( ~ sex, ncol=1)
```
# Procedures
Contributions by `procedure_name`:
```{r, fig.height=12}
x <- data %>% group_by(procedure_name) %>% summarise(n=n())
ggplot(x, aes(reorder(procedure_name, n), n)) +
geom_col() + coord_flip()
```
```{r, results='asis'}
data$procedure_name %>% table() %>% sort(decreasing = TRUE) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "500px", height = "400px")
```
Note the uneven distribution of procedures by `production_center`:
```{r, fig.height=25}
x <- data %>%
group_by(production_center, procedure_name) %>%
summarise(n=n())
ggplot(x, aes(reorder(production_center, n), n)) +
geom_col() + coord_flip() +
facet_wrap( ~ procedure_name, ncol=4)
```
```{r, results='asis'}
t(table(data$production_center, data$procedure_name)) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
```
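One way to summarise the unevenness in the cross-table is to count, for each procedure, how many centers contributed at least one record. The following base-R sketch does this on toy data; the center and procedure labels are placeholders, not values from the dataset.

```r
# Toy records (placeholder center and procedure labels).
toy <- data.frame(
  production_center = c("C1", "C1", "C2", "C2", "C3", "C1"),
  procedure_name = c("Body Weight", "Body Weight", "Body Weight",
                     "Open Field", "Open Field", "X-ray"),
  stringsAsFactors = FALSE
)

# Rows are procedures, columns are centers, cells are record counts.
counts <- table(toy$procedure_name, toy$production_center)

# Number of centers with at least one record for each procedure.
centers_per_procedure <- rowSums(counts > 0)

# Procedures recorded at a single center only.
single_center <- names(centers_per_procedure)[centers_per_procedure == 1]
centers_per_procedure
single_center
```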
# Parameters
There are a lot of unique values under the variable `parameter_name`:
```{r}
data$parameter_name %>% unique() %>% length()
```
```{r}
data$parameter_name %>% table() %>% sort(decreasing = TRUE) %>%
tibble(variable=names(.), count = .) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
```
# Individuals
Each individual is identified by the variable `biological_sample_id`. Based on this, there are `r data$biological_sample_id %>% unique() %>% length()` unique individuals, and most appear to have multiple records. For example, here are the records for `biological_sample_id=107609`:
```{r}
data %>%
  filter(biological_sample_id == "107609") %>%
  select(sex, production_center, biological_sample_id, age_in_days, weight, parameter_name) %>%
  arrange(age_in_days) %>%
  data.frame()
```
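To see how common multiple records are across all individuals, rather than for a single ID, one option is to count records per `biological_sample_id` and then tabulate those counts. The sketch below shows this on toy data with invented IDs and parameter names.

```r
# Toy records (invented IDs and parameter names).
toy <- data.frame(
  biological_sample_id = c("a", "a", "a", "b", "c", "c"),
  parameter_name = c("p1", "p2", "p3", "p1", "p1", "p2"),
  stringsAsFactors = FALSE
)

# Number of records for each individual.
records_per_id <- table(toy$biological_sample_id)

# How many individuals have 1, 2, 3, ... records.
records_distribution <- table(as.vector(records_per_id))

# Number of individuals with more than one record.
n_multiple <- sum(records_per_id > 1)
records_distribution
n_multiple
```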