Exploratory Data Analysis

Astro 497: Week 2, Monday

Logistics

Added due dates for reading questions (through mid-term exam) onto Canvas
Lab 2
- Added link to create Lab 2 starter repository
- There will be time to start working on it during class
- Let me know by end-of-business Tuesday if you have any breakout room requests
First COVID close contacts reported
- Thanks for being cautious
- Recording today's class for student(s) who miss it
- Will start posting

Overview

Choose data to explore
Ingest data
Validate data
Clean data
Describe/Visualize data
Identify potential relationships in data
Make a plan

Choose Data to Explore

Classical Astronomy approach:

Choose scientific problem
Decide what data is needed
Request telescope time
Conduct observations
Ingest data you collect

Classical archival science approach:

Choose scientific problem
Decide what data is needed
Learn about/query multiple surveys/datasets that might have data to address your question.
Prioritize which to consider first
Query archive(s) to ingest data others collected.

Survey-science key-project approach:

Choose scientific problem
Decide what data is needed
Obtain funding
Build observatory, telescope, detector, software pipeline, archive, etc. to meet your specifications
Conduct survey (observations, calibration, data reduction, archiving, etc.)
Query database(s) to ingest data from survey
Release data to public

Survey-science ancillary science approach:

Identify exciting dataset(s)
Learn about how they were collected, limitations, uncertainties, biases, etc.
Decide if they have the potential to address your science question
Query database(s) to ingest data others have collected

Many variations

Spectrum of approaches for how to identify questions/datasets
Combine survey, archival and targeted approaches to address a common question.

Ingest Data

Construct a query
Download the results of that query
Store the data locally
Read the data into memory.
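The four steps above can be sketched in a few lines of Julia. This is a minimal illustration, not a recipe for any particular archive: the endpoint `archive.example.org`, the query parameters, and the column names are all hypothetical, and a small stand-in file replaces the real download so the sketch runs offline.

```julia
# 1. Construct a query (hypothetical endpoint and parameters).
base_url = "https://archive.example.org/query"
params = ["table" => "stars", "max_mag" => 14, "format" => "csv"]
query_url = base_url * "?" * join(("$k=$v" for (k, v) in params), "&")

# 2.-3. Download the results and store them locally.
# In a real session this could be Downloads.download(query_url, local_path)
# from the Downloads standard library. Here a stand-in result is written instead.
local_path = joinpath(mktempdir(), "stars.csv")
write(local_path, "id,ra,dec,mag\n1,10.5,-3.2,12.1\n2,11.0,-2.9,13.4\n3,10.8,-3.0,11.7\n")

# 4. Read the data into memory as a vector of named tuples (one per row).
lines = readlines(local_path)
colnames = Tuple(Symbol.(split(lines[1], ",")))
rows = [NamedTuple{colnames}(Tuple(parse.(Float64, split(l, ",")))) for l in lines[2:end]]
```

For real catalogs, a CSV or table-reading package would normally replace the hand-written parsing in step 4.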

Tip

Options for storing/organizing your data

Vectors, matrices, and higher-dimensional arrays
DataFrames & Tables: reduces risk of bookkeeping errors
Databases (e.g., multiple tables of different lengths)
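One way to see the bookkeeping risk, as a small sketch with made-up star names and magnitudes: sorting one of two parallel vectors silently breaks the pairing, while a table-like collection of rows keeps related values together.

```julia
# Hypothetical data stored as two parallel vectors.
star_names = ["A", "B", "C"]
mags = [12.1, 13.4, 11.7]

# Sorting only one vector breaks the correspondence between the two.
sorted_mags = sort(mags)
mismatched = star_names[1] => sorted_mags[1]   # pairs "A" with a magnitude that belongs to "C"

# Table-like storage: each row carries its own name and magnitude.
table = [(name = n, mag = m) for (n, m) in zip(star_names, mags)]
sorted_table = sort(table, by = row -> row.mag)
brightest = first(sorted_table)                # the row for "C", still intact
```

A package such as DataFrames.jl offers the same idea with many more tools; the vector of named tuples here is only the simplest stand-in.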

Validate Data

What is the size and shape of the data?
What are the types of data?
What are the ranges of values?
Is there missing data?
Check if a representative subset of the data is consistent with expectations.
Are some entries suspiciously discrepant from expectations/other data?
What is the approximate empirical distribution of values?
Are values self-consistent?
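Several of these questions can be answered with one-liners. The sketch below assumes a single hypothetical column of magnitudes; the values and the "plausible" bounds are made up for illustration.

```julia
# Hypothetical column with one missing entry and one suspicious value.
mag = [12.1, 13.4, missing, 11.7, 99.0, 12.9]

n_total = length(mag)                   # size of the data
element_type = eltype(mag)              # type of the data, including missingness
n_missing = count(ismissing, mag)       # is there missing data?
lo, hi = extrema(skipmissing(mag))      # range of the observed values

# Flag entries outside a plausible range (the bounds are an assumption).
plausible = (0.0, 30.0)
suspicious = findall(x -> !ismissing(x) && !(plausible[1] <= x <= plausible[2]), mag)
```

Here `suspicious` holds the indices worth a closer look; whether those entries are errors or real extremes is a question about the data source, not something the code can decide.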

Clean Data

Are some data values:

missing?
clearly erroneous?
suspiciously discrepant from expectations?
suspiciously discrepant from other data?

Tip

Any large dataset is likely to have some suspicious data!

Could these issues affect my analysis?
Could these values interfere with even exploratory data analysis?
Should I try to understand my data source better before I proceed?
Should I fix the issues now or proceed with caution?
- 80%/20% rule
If proceeding, how will I make sure that I (and my team) don't forget these concerns?
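One possible way to proceed with caution is to mask suspicious values instead of deleting them, and to keep a flag and a note alongside the data so the concern is not forgotten. The data and the threshold below are hypothetical.

```julia
# Hypothetical column with one missing entry and one implausible value.
mag = [12.1, 13.4, missing, 11.7, 99.0, 12.9]

# Flag implausible entries (the threshold of 30 is an assumption).
is_bad = [!ismissing(x) && x > 30.0 for x in mag]

# Build a cleaned copy; the original column is left untouched.
mag_clean = [bad ? missing : x for (x, bad) in zip(mag, is_bad)]

# Record what was done so that collaborators can revisit the decision later.
cleaning_note = "Masked $(count(is_bad)) value(s) with mag > 30; source not yet understood."
```

Keeping `mag`, `is_bad`, and `cleaning_note` together makes the masking reversible if the "bad" values later turn out to be real.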

Describe/Visualize Data

Location: mean, median, mode
Scale: standard deviation, quantiles, bounds
Higher-order moments: skewness, kurtosis, behavior of tails
Transformations
- Linear transformations (shift, scale, rotate)
- Non-linear transformations for visualization (e.g., log, sqrt)
- Power transforms to standardize distributions (e.g., Box-Cox transform)
Other strategies
- Clamping data to limit effects of outliers
- Imputing missing data to allow for fast exploratory analysis
Statistical tests
- Test for normality
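A minimal sketch of several items from this list, using only Julia's `Statistics` standard library. The sample is made up, the moment-based skewness and kurtosis are the simple population forms, and the Box-Cox exponent of 0.5 is an arbitrary choice for illustration.

```julia
using Statistics

# Hypothetical sample with one outlier and one missing entry.
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 50.0, missing]

# Impute the missing entry with the median of the observed values.
x = coalesce.(raw, median(skipmissing(raw)))

# Location and scale.
location = (mean = mean(x), median = median(x))
scale = (std = std(x), q25 = quantile(x, 0.25), q75 = quantile(x, 0.75))

# Higher-order moments from standardized values.
z = (x .- mean(x)) ./ std(x, corrected = false)
skew = mean(z .^ 3)
excess_kurt = mean(z .^ 4) - 3

# Non-linear and power transformations.
log_x = log10.(x)
boxcox(v, λ) = λ == 0 ? log(v) : (v^λ - 1) / λ
bc_x = boxcox.(x, 0.5)

# Clamp to the central 90% of the data to limit the effect of outliers.
lo, hi = quantile(x, [0.05, 0.95])
clamped = clamp.(x, lo, hi)
```

Comparing `location.mean` with `location.median` is a quick check on how strongly the outlier pulls the mean.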

Identify Potential Relationships in Data

Look for relationships between values:

For each object
Across objects
In space
In time

Statistics

Correlation coefficients
Rank correlation coefficient
Dangers of statistics
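A short sketch of why a single number can mislead, using `cor` from the `Statistics` standard library. The rank correlation here is a simple hand-rolled Spearman coefficient that assumes no tied values; the data are made up.

```julia
using Statistics

# Ranks for data without ties (ties would need average ranks).
ranks(v) = invperm(sortperm(v))

# Rank (Spearman) correlation as the Pearson correlation of the ranks.
rank_cor(a, b) = cor(ranks(a), ranks(b))

# Monotonic but strongly non-linear relationship.
x = collect(1.0:10.0)
y = exp.(x)
pearson = cor(x, y)          # well below 1 despite a perfect monotonic relation
spearman = rank_cor(x, y)    # 1 (up to rounding) for any strictly increasing relation

# Perfect dependence, yet no linear correlation.
u = collect(-5.0:5.0)
w = u .^ 2
pearson_uw = cor(u, w)       # essentially zero
```

Neither coefficient replaces looking at the data: `w` is completely determined by `u`, but the linear correlation does not show it.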

Visualizations

Scatter plot
2-d histograms or density estimates
Limitations of visualizations
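For large datasets a scatter plot saturates, and a 2-d histogram is a common alternative. The helper below, `hist2d`, is a hypothetical hand-written function (not from any package) that only counts points per bin; the bin edges and data are made up.

```julia
# Count points in a grid of bins; bins are closed on the left, open on the right.
# Points outside the edges are ignored.
function hist2d(x, y, xedges, yedges)
    counts = zeros(Int, length(xedges) - 1, length(yedges) - 1)
    for (xi, yi) in zip(x, y)
        i = searchsortedlast(xedges, xi)   # index of the bin containing xi
        j = searchsortedlast(yedges, yi)   # index of the bin containing yi
        if 1 <= i <= size(counts, 1) && 1 <= j <= size(counts, 2)
            counts[i, j] += 1
        end
    end
    return counts
end

x = [0.5, 1.5, 1.5, 2.5, 7.0]
y = [0.5, 0.5, 1.5, 2.5, 0.5]
counts = hist2d(x, y, 0:1:3, 0:1:3)
```

The resulting matrix can be passed to a plotting package's heatmap function. The choice of bin edges changes the picture, which is one of the limitations of visualizations worth keeping in mind.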

Make a Plan

Is this question/dataset combination worthy of more of my time?
Should I consider combining with other dataset(s) to fill gaps?
What needs to be done before beginning quantitative analysis?
What apparent relationships should be evaluated quantitatively?
What potential concerns should be kept in mind?

Helper Code

ChooseDisplayMode()

TableOfContents(aside=true)

begin
    using PlutoUI, PlutoTeachingTools
end

Built with Julia 1.8.2 and

PlutoTeachingTools 0.1.5
PlutoUI 0.7.39

To run this tutorial locally, download this file and open it with Pluto.jl.