Exploratory Data Analysis

Astro 497: Week 2, Monday

Logistics

  • Added due dates for reading questions (through mid-term exam) onto Canvas

  • Lab 2

    • Added link to create Lab 2 starter repository

    • There will be time to start working on it during class

    • Let me know by end-of-business Tuesday if you have any breakout room requests

  • First COVID close contacts reported

    • Thanks for being cautious

    • Recording today's class for student(s) who miss it

    • Will start posting

Overview

  1. Choose data to explore

  2. Ingest data

  3. Validate data

  4. Clean data

  5. Describe/Visualize data

  6. Identify potential relationships in data

  7. Make a plan

Choose Data to Explore

Classical Astronomy approach:

  1. Choose scientific problem

  2. Decide what data is needed

  3. Request telescope time

  4. Conduct observations

  5. Ingest data you collect

Classical archival science approach:

  1. Choose scientific problem

  2. Decide what data is needed

  3. Learn about/query multiple surveys/datasets that might have data to address your question.

  4. Prioritize which to consider first

  5. Query archive(s) to ingest data others collected.

Survey-science key-project approach:

  1. Choose scientific problem

  2. Decide what data is needed

  3. Obtain funding

  4. Build observatory, telescope, detector, software pipeline, archive, etc. to meet your specifications

  5. Conduct survey (observations, calibration, data reduction, archiving, etc.)

  6. Query database(s) to ingest data from survey

  7. Release data to public

Survey-science ancillary science approach:

  1. Identify exciting dataset(s)

  2. Learn about how they were collected, limitations, uncertainties, biases, etc.

  3. Decide if they have the potential to address your science question

  4. Query database(s) to ingest data others have collected

Many variations

  • Spectrum of approaches for how to identify questions/datasets

  • Combine survey, archival and targeted approaches to address a common question.

Ingest Data

  • Construct a query

  • Download the results of that query

  • Store the data locally

  • Read the data into memory.
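The last two steps above can be sketched with only Julia's standard library. This is a minimal illustration, not a recommended reader: the column names and values are made up, the text stands in for the result of a real query (which could be fetched with the `Downloads` standard library), and `read_simple_csv` is a hypothetical helper that assumes no quoted fields or embedded commas.

```julia
# Hypothetical query result; in practice this text would be downloaded from an archive.
csv_text = """
name,period_days,radius_earth
planet_a,3.5,1.2
planet_b,10.1,
planet_c,365.2,1.0
"""

# Store the data locally (here, in a temporary directory).
path = joinpath(mktempdir(), "query_result.csv")
write(path, csv_text)

# Read a simple comma-separated file into a vector of NamedTuples (one per row).
# Empty fields become `missing`; fields that parse as numbers become Float64.
function read_simple_csv(path)
    lines = filter(!isempty, readlines(path))
    header = Symbol.(split(lines[1], ','))
    rows = map(lines[2:end]) do line
        fields = split(line, ',')
        vals = map(fields) do f
            isempty(f) ? missing : something(tryparse(Float64, f), String(f))
        end
        NamedTuple{Tuple(header)}(Tuple(vals))
    end
    return rows
end

# Read the data into memory.
table = read_simple_csv(path)
```

For real datasets, a dedicated package such as CSV.jl handles quoting, type detection and large files far more robustly than this sketch.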

Tip

Options for storing/organizing your data

  • Vectors, matrices and higher-dimensional arrays

  • DataFrames & Tables: reduce risk of bookkeeping errors

  • Databases (e.g., multiple tables of different lengths)
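To illustrate the bookkeeping point, here is a small sketch comparing separate vectors with a row-oriented table. The data are invented, and a vector of NamedTuples is used only as a dependency-free stand-in for a table type such as a DataFrames.jl `DataFrame`.

```julia
# Separate vectors: every selection must be applied to each vector by hand.
planet_names = ["a", "b", "c"]
periods      = [3.5, 10.1, 365.2]   # days (made-up values)
radii        = [1.2, 2.5, 1.0]      # Earth radii (made-up values)

keep = periods .< 100
short_names = planet_names[keep]
short_radii = radii[keep]           # easy to forget a vector or reuse a stale mask

# Table as a vector of NamedTuples: the fields of each row stay together.
rows = [(name = n, period = p, radius = r)
        for (n, p, r) in zip(planet_names, periods, radii)]
short_rows = filter(row -> row.period < 100, rows)
```

With the table, one filter keeps every column consistent, so there is no way for names and radii to fall out of step.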

Validate Data

  • What is the size and shape of the data?

  • What are the types of data?

  • What are the ranges of values?

  • Is there missing data?

  • Check if a representative subset of the data is consistent with expectations.

  • Are some entries suspiciously discrepant from expectations/other data?

  • What is the approximate empirical distribution of values?

  • Are values self-consistent?
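Several of these checks can be automated for a single column. The sketch below assumes a made-up column of orbital periods with one missing entry and one non-physical value; `summarize_column` is a hypothetical helper, not part of any package.

```julia
using Statistics

# Hypothetical column: one missing entry and one suspicious (negative) period.
periods = [3.5, 10.1, missing, 365.2, -1.0]

# Collect size, element type, missing count and range for one column.
function summarize_column(col)
    present = collect(skipmissing(col))
    return (
        n         = length(col),
        type      = eltype(col),
        n_missing = count(ismissing, col),
        min       = minimum(present),
        max       = maximum(present),
        med       = median(present),
    )
end

report = summarize_column(periods)

# Self-consistency check: periods should be positive.
n_negative = count(x -> !ismissing(x) && x < 0, periods)
```

A negative minimum or a nonzero `n_negative` is a prompt to go back to the data source's documentation before trusting the column.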

Clean Data

Are some data values:

  • missing?

  • clearly erroneous?

  • suspiciously discrepant from expectations?

  • suspiciously discrepant from other data?
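One possible way to flag each category is sketched below. The radii are invented, the physical lower bound of zero and the threshold of five median absolute deviations are arbitrary assumptions, and `flag_values` is a hypothetical helper. It flags values rather than deleting them, so the decision about what to do stays explicit.

```julia
using Statistics

# Hypothetical radii in Earth radii; -99.0 is assumed to be a "no measurement" sentinel.
radii = [1.2, missing, 0.9, -99.0, 1.1, 45.0, 1.0]

function flag_values(col; lower = 0.0, n_mad = 5.0)
    is_missing   = ismissing.(col)
    # Clearly erroneous: at or below the assumed physical lower bound.
    is_erroneous = [!ismissing(x) && x <= lower for x in col]
    # Suspicious: far from the median of the remaining values, in units of the
    # median absolute deviation (MAD).
    good = [x for x in skipmissing(col) if x > lower]
    med  = median(good)
    mad  = median(abs.(good .- med))
    is_suspicious = [!ismissing(x) && x > lower && abs(x - med) > n_mad * mad
                     for x in col]
    return (; is_missing, is_erroneous, is_suspicious)
end

flags = flag_values(radii)
```

Here the value 45.0 is flagged as suspicious, not erroneous: it may be a real giant planet, which is exactly the kind of case that deserves a closer look.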

Tip

Any large dataset is likely to have some suspicious data!

  • Could these issues affect my analysis?

  • Could these values interfere with even exploratory data analysis?

  • Should I try to understand my data source better before I proceed?

  • Should I fix the issues now or proceed with caution?

    • 80%/20% rule

  • If proceeding, how will I make sure that I (and my team) don't forget these concerns?

Describe/Visualize Data

  • Location: mean, median, mode

  • Scale: standard deviation, quantiles, bounds

  • Higher-order moments: skewness, kurtosis, behavior of tails

  • Transformations

    • Linear transformations (shift, scale, rotate)

    • Non-linear transformations for visualization (e.g., log, sqrt)

    • Power transforms to standardize distributions (e.g., Box-Cox transform)

  • Other strategies

    • Clamping data to limit effects of outliers

    • Imputing missing data to allow for fast exploratory analysis

  • Statistical tests

    • Test for normality
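Most of the descriptive quantities above are available in Julia's `Statistics` standard library. The following sketch uses an invented right-skewed sample; the `skewness` definition (standardized third moment, no bias correction) and the 5%/95% clamping limits are illustrative choices, and packages such as StatsBase.jl provide more complete versions.

```julia
using Statistics

# Hypothetical positive, right-skewed sample with one large outlier.
x = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 100.0]

# Location and scale.
location = (mean = mean(x), median = median(x))
scale    = (std = std(x), quartiles = quantile(x, [0.25, 0.75]), bounds = extrema(x))

# Skewness as the standardized third moment (no bias correction).
skewness(v) = mean(((v .- mean(v)) ./ std(v; corrected = false)) .^ 3)

# Non-linear transformation for visualization.
log_x = log10.(x)

# Clamp to the 5th-95th percentile range to limit the effect of outliers.
clamped_x = clamp.(x, quantile(x, 0.05), quantile(x, 0.95))

# Impute missing values with the median for quick exploratory work.
y = [1.0, missing, 3.0, 4.0]
imputed_y = coalesce.(y, median(skipmissing(y)))
```

The mean is pulled far above the median by the single outlier, and the log transform reduces the skewness, which is why both location estimates and a transformed view are worth checking.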

Identify Potential Relationships in Data

Look for relationships between values:

  • For each object

  • Across objects

  • In space

  • In time

Statistics

  • Correlation coefficients

  • Rank correlation coefficient

  • Dangers of statistics
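A small sketch of why the choice of statistic matters: for a monotonic but strongly non-linear relationship, the (Pearson) correlation coefficient and the rank (Spearman) correlation coefficient tell different stories. The data are synthetic, and the `ranks` helper assumes there are no ties (ties would need averaged ranks).

```julia
using Statistics

# Synthetic data: y increases monotonically, but not linearly, with x.
x = collect(1.0:10.0)
y = exp.(x)

# Pearson correlation measures linear association.
pearson = cor(x, y)

# Spearman rank correlation: Pearson correlation of the ranks.
ranks(v) = invperm(sortperm(v))   # valid only when all values are distinct
spearman = cor(ranks(x), ranks(y))
```

The rank correlation is essentially 1 because the ordering is perfectly preserved, while the Pearson coefficient is noticeably lower, so a single number can hide the true form of a relationship.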

Visualizations

  • Scatter plot

  • 2-d histograms or density estimates

  • Limitations of visualizations
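The counts behind a 2-d histogram can be computed directly, which is useful when a scatter plot would be too crowded. This is a minimal sketch with made-up points; `histogram2d` is a hypothetical helper using equal-width bins, and plotting packages offer their own binned or density-estimate plots.

```julia
# Count points in an nx-by-ny grid of equal-width bins covering
# [xlo, xhi) in x and [ylo, yhi) in y. Points outside the grid are skipped.
function histogram2d(x, y; xlo, xhi, ylo, yhi, nx, ny)
    counts = zeros(Int, nx, ny)
    for (xi, yi) in zip(x, y)
        (xlo <= xi < xhi && ylo <= yi < yhi) || continue
        # `min` guards against floating-point rounding at the upper edge.
        i = min(floor(Int, (xi - xlo) / (xhi - xlo) * nx) + 1, nx)
        j = min(floor(Int, (yi - ylo) / (yhi - ylo) * ny) + 1, ny)
        counts[i, j] += 1
    end
    return counts
end

x = [0.1, 0.2, 0.6, 0.9, 0.95]
y = [0.1, 0.3, 0.7, 0.8, 0.85]
counts = histogram2d(x, y; xlo = 0.0, xhi = 1.0, ylo = 0.0, yhi = 1.0, nx = 2, ny = 2)
```

The result depends on the bin edges and bin count chosen, which is one of the limitations to keep in mind when reading such plots.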

Make a Plan

  • Is this question/dataset combination worthy of more of my time?

  • Should I consider combining with other dataset(s) to fill gaps?

  • What needs to be done before beginning quantitative analysis?

  • What apparent relationships should be evaluated quantitatively?

  • What potential concerns should be kept in mind?

Helper Code

begin
    # PlutoUI provides TableOfContents; PlutoTeachingTools provides ChooseDisplayMode and tip
    using PlutoUI, PlutoTeachingTools
end
ChooseDisplayMode()
TableOfContents(aside=true)

Built with Julia 1.8.2 and

PlutoTeachingTools 0.1.5
PlutoUI 0.7.39

To run this tutorial locally, download this file and open it with Pluto.jl.
