---
title: "Dataset Summary"
author: "Daniel Falster & Susie Zajitschek"
date: "06/07/2018"
output:
html_document:
fig_height: 6
fig_width: 10
df_print: paged
rows.print: 10
code_folding: hide
theme: yeti
toc: yes
toc_depth: 3
toc_float:
collapsed: false
smooth_scroll: true
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE, echo=TRUE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, cache=FALSE)
root.dir = rprojroot::find_root("README.md")
knitr::opts_knit$set(root.dir = root.dir)
library(readr)
library(dplyr)
library(skimr)
library(ggplot2)
library(scales)
library(viridis)
library(knitr)
library(kableExtra)
```
Here is an initial exploration of the mouse data we are working on.
The data is big, but not so large that we can't use our standard tools on a desktop. I'd suggest using packages from the [tidyverse](https://www.tidyverse.org/) family, in particular `readr`, `dplyr`, `ggplot2`, `skimr`, `scales`, `viridis`.
# Dataset overview
Read in data, specifying variable types:
```{r cars}
data_raw <- read_csv("data/dr7.0_all_control_data.csv",
col_types = cols(
.default = col_character(),
project_id = col_character(),
id = col_character(),
parameter_id = col_character(),
age_in_days = col_integer(),
date_of_experiment = col_datetime(format = ""),
weight = col_double(),
phenotyping_center_id = col_character(),
production_center_id = col_character(),
weight_date = col_datetime(format = ""),
date_of_birth = col_datetime(format = ""),
procedure_id = col_character(),
pipeline_id = col_character(),
biological_sample_id = col_character(),
biological_model_id = col_character(),
weight_days_old = col_integer(),
datasource_id = col_character(),
experiment_id = col_character(),
data_point = col_double(),
age_in_weeks = col_integer(),
`_version_` = col_character()
)
)
```
Number of rows & columns:
```{r}
data_raw %>% dim()
```
Of the `r data %>% names() %>% length()` variables in the dataset, lots of variables are full of NAs:
```{r}
n_records <- data_raw %>% nrow()
NA_count <- data_raw %>%
summarise_all(funs(sum(is.na(.))/n_records*100)) %>%
unlist() %>% sort(decreasing=TRUE) %>%
tibble(variable=names(.), percent_NA = .)
# make a plot
ggplot(NA_count, aes(percent_NA)) + geom_histogram(bins=50)
```
Let's remove those variables that have all NAs:
```{r}
(all_NAs <- NA_count$variable[NA_count$percent_NA == 100])
data <- data_raw %>% select(-one_of(all_NAs))
```
Now we `r data %>% names() %>% length()` variables in the dataset:
```{r}
data %>% names()
```
Now use `skimr` to take a quick look of all variables:
```{r, results='asis'}
x <- data %>% skimr::skim()
pander::pander(x)
```
Next we'll look at some specific variables of potential importance.
# Production center
Contributions by `production_center`:
```{r}
x <- data %>% group_by(production_center) %>% summarise(n=n())
ggplot(x, aes(reorder(production_center, n), n)) +
geom_col() + coord_flip()
x
```
# Weights
Overall distribution of weights:
```{r}
ggplot(data, aes(x=weight)) +
geom_histogram(bins=50)
```
Weights by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=weight, fill=sex)) +
geom_histogram(bins=50) +
scale_y_log10() +
facet_wrap( ~ production_center, ncol=1)
```
# Ages
There seems to be an issue with some very negative values of age. The range in the raw data is too wide:
```{r}
range(data$age_in_days, na.rm=TRUE)
ggplot(data, aes(x=age_in_days)) +
geom_histogram(bins=50)
```
So for now we'll filter those out, to give an reasonable distribution of ages:
```{r}
data <- data %>% filter(age_in_days > 0 & age_in_days < 500)
ggplot(data, aes(x=age_in_days)) +
geom_histogram(bins=50)
```
Age by center and sex:
```{r, fig.height=12}
ggplot(data, aes(x=age_in_days, fill=sex)) +
geom_histogram(bins=50) +
scale_y_log10() +
facet_wrap( ~ production_center, ncol=1)
```
Age vs weight by sex:
```{r}
data %>%
filter(sex %in% c("male", "female")) %>%
ggplot(aes(x=age_in_days, y=weight)) +
geom_hex() +
viridis::scale_fill_viridis() +
coord_fixed() +
facet_wrap( ~ sex, ncol=1)
```
# Procedures
Contributions by `procedure_name`:
```{r, fig.height=12}
x <- data %>% group_by(procedure_name) %>% summarise(n=n())
ggplot(x, aes(reorder(procedure_name, n), n)) +
geom_col() + coord_flip()
```
```{r, results='asis'}
data$procedure_name %>% table() %>% sort(decreasing = TRUE) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "500px", height = "400px")
```
Note the uneven distribution of procdures by production_center:
```{r, fig.height=25}
x <- data %>%
group_by(production_center, procedure_name) %>%
summarise(n=n())
ggplot(x, aes(reorder(production_center, n), n)) +
geom_col() + coord_flip() +
facet_wrap( ~ procedure_name, ncol=4)
```
```{r, results='asis'}
t(table(data$production_center, data$procedure_name)) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
```
# Parameters
There are a lot of unique values under the variable `parameter_name`:
```{r}
data$parameter_name %>% unique() %>% length()
```
```{r}
data$parameter_name %>% table() %>% sort(decreasing = TRUE) %>%
tibble(variable=names(.), count = .) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
```
# Individuals
There seem to be multiple records for an individual, which is identified by the varaible `biological_sample_id`. Based on this there ar `r data$biological_sample_id %>% unique() %>% length()` unique individuals. And there are multiple records per individual. For example, here are records for `biological_sample_id=107609`:
```{r}
select(filter(data_raw, biological_sample_id == "107609"), sex, production_center, external_sample_id, biological_sample_id, age_in_days, weight, parameter_name) %>% arrange(age_in_days) %>% data.frame()
```
```{r}
data_i <- data %>%
filter(parameter_name == "Body weight") %>%
group_by(biological_sample_id) %>%
summarise(
production_center = production_center[1],
project_name = project_name[1],
sex = sex[1],
n = n(),
med_age= median(age_in_days)
)
data_i
```
# Datasource
more recently, Jeremy states: "I provided you too much data (!) which should not be included. A quarter to a third of the data is from legacy projects and should be excluded. As I mentioned, exclude anything without datasource_name = `IMPC`"
By source:
```
data_raw$datasource_name %>% table()
```
So reducing to IMPC reduces the data by one third.
It also greatly reduces the number of parameters:
```{r}
filter(data_raw, datasource_name == 'IMPC') %>% pull(parameter_name) %>% unique() %>% length()
```
```
data_raw %>% pull(parameter_name) %>% unique() %>% length()
```
# Issues:
1. `procedure_name` are values eding in `(GMC)` different?
2. Spelling / case differences: e.g. `Body weight` vs `Body Weight`