Exploratory Data Analysis
Astro 497: Week 2, Monday
Logistics
Added due dates for reading questions (through mid-term exam) onto Canvas
Lab 2
Added link to create Lab 2 starter repository
There will be time to start working on it during class
Let me know by end-of-business Tuesday if you have any breakout room requests
First COVID close contacts reported
Thanks for being cautious
Recording today's class for student(s) who miss
Will start posting
Overview
Choose data to explore
Ingest data
Validate data
Clean data
Describe/Visualize data
Identify potential relationships in data
Make a plan
Choose Data to Explore
Classical Astronomy approach:
Choose scientific problem
Decide what data is needed
Request telescope time
Conduct observations
Ingest data you collect
Classical archival science approach:
Choose scientific problem
Decide what data is needed
Learn about/query multiple surveys/datasets that might have data to address your question.
Prioritize which to consider first
Query archive(s) to ingest data others collected.
Survey-science key-project approach:
Choose scientific problem
Decide what data is needed
Obtain funding
Build observatory, telescope, detector, software pipeline, archive, etc. to meet your specifications
Conduct survey (observations, calibration, data reduction, archiving, etc.)
Query database(s) to ingest data from survey
Release data to public
Survey-science ancillary science approach:
Identify exciting dataset(s)
Learn about how they were collected, limitations, uncertainties, biases, etc.
Decide if they have the potential to address your science question
Query database(s) to ingest data others have collected
Many variations
Spectrum of approaches for how to identify questions/datasets
Combine survey, archival and targeted approaches to address a common question.
Ingest Data
Construct a query
Download the results of that query
Store the data locally
Read the data into memory.
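The four steps above can be sketched with Julia's standard library alone. This is a minimal illustration, not a recipe for any particular archive: the base URL, the query parameter names, and the helper names `build_query_url`, `fetch_cached`, and `read_table` are all hypothetical. The demonstration writes a small stand-in file so that nothing is actually downloaded.

```julia
using DelimitedFiles  # standard library: reader for delimited text
using Downloads       # standard library: file downloads

# 1. Construct a query. The parameter names here are placeholders.
function build_query_url(base_url::AbstractString, query::AbstractString)
    return string(base_url, "?format=csv&query=", replace(query, " " => "+"))
end

# 2-3. Download the result and store it locally. If the file already exists,
# reuse it instead of querying the archive again.
function fetch_cached(url::AbstractString, path::AbstractString)
    isfile(path) || Downloads.download(url, path)
    return path
end

# 4. Read the data into memory as a numeric matrix plus a vector of column names.
function read_table(path::AbstractString)
    cells, header = readdlm(path, ',', Float64; header = true)
    return cells, vec(header)
end

# Demonstration without network access: create a small stand-in file first,
# so fetch_cached finds it on disk and skips the download.
demo_path = joinpath(mktempdir(), "query_result.csv")
write(demo_path, "period,radius\n3.5,1.2\n10.1,2.4\n365.25,1.0\n")
demo_url = build_query_url("https://archive.example/query", "select period, radius")
table_values, column_names = read_table(fetch_cached(demo_url, demo_path))
```

Caching the download on disk keeps the notebook from querying the archive again every time it is rerun.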
tip(md"""
**Options for storing/organizing your data**
- Vectors, Matrices and higher-dimensional arrays:
- DataFrames & Tables: reduces risk of bookkeeping errors
- Databases (e.g., multiple tables of different lengths)
""")
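A small sketch of the bookkeeping risk mentioned above, using made-up object names and magnitudes. A vector of named tuples stands in for a table here; in practice a package such as DataFrames.jl would play that role.

```julia
# Separate vectors: nothing ties element i of one vector to element i of another.
object_names = ["A", "B", "C"]
magnitudes = [12.1, 9.8, 14.3]

# Sorting one vector alone silently breaks the correspondence with the other.
sorted_magnitudes_only = sort(magnitudes)

# Table-like alternative: each row keeps its fields together,
# so sorting moves the name along with the magnitude.
rows = [(name = n, mag = m) for (n, m) in zip(object_names, magnitudes)]
rows_by_mag = sort(rows; by = r -> r.mag)
brightest = first(rows_by_mag)
```

After the sort, `sorted_magnitudes_only[1]` no longer belongs to `object_names[1]`, while `brightest.name` and `brightest.mag` still describe the same object.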
Validate Data
What is the size and shape of the data?
What are the types of data?
What are the ranges of values?
Is there missing data?
Check if a representative subset of the data is consistent with expectations.
Are some entries suspiciously discrepant from expectations/other data?
What is the approximate empirical distribution of values?
Are values self-consistent?
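Several of these questions can be answered with a few one-liners. The sketch below uses a made-up column of orbital periods containing one missing entry and one physically impossible value; the variable names are illustrative only.

```julia
using Statistics

# Hypothetical column: one missing entry and one non-physical (negative) period.
period = Union{Missing, Float64}[3.5, 10.1, missing, 365.25, -1.0]

n_entries = length(period)                 # size of the data
entry_type = eltype(period)                # type, including whether missing is allowed
n_missing = count(ismissing, period)       # is there missing data?
period_min, period_max = extrema(skipmissing(period))   # range of values

# Approximate empirical distribution of the observed values
period_quartiles = quantile(collect(skipmissing(period)), [0.25, 0.5, 0.75])

# Self-consistency check: an orbital period must be positive.
n_nonpositive = count(p -> !ismissing(p) && p <= 0, period)
```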
Clean Data
Are some data values:
missing?
clearly erroneous?
suspiciously discrepant from expectations?
suspiciously discrepant from other data?
tip(md"""
**Any large dataset is likely to have some suspicious data!**
- Could these issues affect my analysis?
- Could these values interfere even with exploratory data analysis?
- Should I try to understand my data source better before I proceed?
- Should I fix the issues now or proceed with caution?
- 80%/20% rule
- If proceeding, how will I make sure that I (and my team) don't forget these concerns?
""")
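One way to proceed with caution is to flag suspicious entries rather than delete them, so the original data and the reason for each exclusion stay visible. The sketch below uses made-up radii, and the cut at 10 median absolute deviations is an arbitrary illustrative threshold, not a recommendation.

```julia
using Statistics

# Hypothetical column: one missing entry and one wildly discrepant value.
radius = Union{Missing, Float64}[1.1, 0.9, missing, 1.3, 250.0, 1.0]

# Robust center and spread (median and median absolute deviation), so that a
# single extreme value does not hide itself by inflating the scale estimate.
observed = collect(skipmissing(radius))
center = median(observed)
mad = median(abs.(observed .- center))

# Flag values far from the bulk of the data. The factor of 10 is arbitrary.
is_suspicious = [!ismissing(r) && abs(r - center) > 10 * mad for r in radius]
is_usable = (.!ismissing.(radius)) .& (.!is_suspicious)

# Keep the original column untouched; analyze only the usable subset.
clean_radius = Float64.(radius[is_usable])
```

Keeping `is_suspicious` alongside the data is one simple way to make sure the concern is not forgotten later.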
Describe/Visualize Data
Location: mean, median, mode
Scale: standard deviation, quantiles, bounds
Higher-order moments: skewness, kurtosis, behavior of tails
Transformations
Linear transformations (shift, scale, rotate)
Non-linear transformations for visualization (e.g., log, sqrt)
Power transforms to standardize distributions (e.g., Box-Cox transform)
Other strategies
Clamping data to limit effects of outliers
Imputing missing data to allow for fast exploratory analysis
Statistical tests
Test for normality
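A minimal sketch of several of the summaries and transformations listed above, using only the `Statistics` standard library and made-up values. The helper names `sample_skewness` and `boxcox` are defined here for illustration, and the 5th to 95th percentile clamping range is an arbitrary choice.

```julia
using Statistics

flux = [1.0, 2.0, 2.0, 3.0, 4.0, 100.0]   # made-up values with one outlier

# Location and scale
flux_mean, flux_median, flux_std = mean(flux), median(flux), std(flux)
flux_quartiles = quantile(flux, [0.25, 0.5, 0.75])

# Higher-order moment: third central moment divided by the cube of the
# (uncorrected) standard deviation.
sample_skewness(v) = mean((v .- mean(v)) .^ 3) / std(v; corrected = false)^3

# Non-linear transformation for visualization
log_flux = log10.(flux)

# Box-Cox power transform for positive data; lambda = 0 is the log transform.
boxcox(v, lambda) = lambda == 0 ? log.(v) : (v .^ lambda .- 1) ./ lambda

# Clamp to the 5th-95th percentile range to limit the effect of outliers.
clamp_lo, clamp_hi = quantile(flux, [0.05, 0.95])
flux_clamped = clamp.(flux, clamp_lo, clamp_hi)
```

With one large outlier the mean sits far above the median, and the skewness of `log_flux` is smaller than that of `flux`, which is why a log scale is often easier to read for such data.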
Identify Potential Relationships in Data
Look for relationships between values:
For each object
Across objects
In space
In time
Statistics
Correlation coefficients
Rank correlation coefficient
Dangers of statistics
Visualizations
Scatter plot
2-d histograms or density estimates
Limitations of visualizations
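The difference between the two correlation coefficients listed above can be seen on a small made-up example. `cor` from the `Statistics` standard library gives the Pearson coefficient; the helpers `average_ranks` and `spearman` are written out here for illustration (a package such as StatsBase.jl would normally supply them).

```julia
using Statistics

# Rank of each element (1 = smallest); tied values share the average of their ranks.
function average_ranks(v::AbstractVector)
    order = sortperm(v)
    ranks = zeros(Float64, length(v))
    i = 1
    while i <= length(v)
        j = i
        # Extend j to cover the run of tied values that starts at position i.
        while j < length(v) && v[order[j + 1]] == v[order[i]]
            j += 1
        end
        ranks[order[i:j]] .= (i + j) / 2
        i = j + 1
    end
    return ranks
end

# Spearman rank correlation: the Pearson correlation of the ranks.
spearman(a, b) = cor(average_ranks(a), average_ranks(b))

t = [1.0, 2.0, 3.0, 4.0, 5.0]
signal = exp.(t)   # monotonic in t, but strongly non-linear

pearson_r = cor(t, signal)
spearman_r = spearman(t, signal)
```

Here the rank correlation is essentially 1 because the relationship is perfectly monotonic, while the Pearson coefficient is noticeably lower because the relationship is not linear. A single number can hide this kind of structure, which is one reason to pair statistics with a scatter plot.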
Make a Plan
Is this question/dataset combination worthy of more of my time?
Should I consider combining with other dataset(s) to fill gaps?
What needs to be done before beginning quantitative analysis?
What apparent relationships should be evaluated quantitatively?
What potential concerns should be kept in mind?
Helper Code
ChooseDisplayMode()
TableOfContents(aside=true)
begin
using PlutoUI, PlutoTeachingTools
end
Built with Julia 1.8.2 and PlutoTeachingTools 0.1.5, PlutoUI 0.7.39
To run this tutorial locally, download this file and open it with Pluto.jl.