Data Science Lifecycle

Astro 497, Week 9, Day 1

TableOfContents()

Reading questions

question(md"""
Is there one type of star that is more frequently found having exoplanets orbiting it, and if so, could that be due to selection effects as well?
""")

Question

Is there one type of star that is more frequently found having exoplanets orbiting it, and if so, could that be due to selection effects as well?

By number of known planets:

Most planets discovered by RVs around G & K type stars (sweet spot for RVs)
Most planets discovered by transits around G & F type stars (brighter)

By occurrence rate of planets:

Cool stars
Metal-rich stars

What biases could contribute to these apparent trends?

hint(md"""
- Cooler main sequence stars have smaller masses and radii →
  - RV amplitude is larger for given planet mass
  - Transit depth is larger for given planet size
- Metal-rich stars have larger opacity in photosphere →
  - Brighter for given mass and age

- Early indication of a preference for giant planets around metal-rich stars led some RV surveys to intentionally select metal-rich stars.
  
""")

Hint

Cooler main sequence stars have smaller masses and radii →
- RV amplitude is larger for given planet mass
- Transit depth is larger for given planet size
Metal-rich stars have larger opacity in photosphere →
- Brighter for given mass and age
Early indication of a preference for giant planets around metal-rich stars led some RV surveys to intentionally select metal-rich stars.
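
The first two scalings in the hint can be checked with a few lines of Julia. This is only a rough sketch: it assumes a circular, edge-on orbit with planet mass much smaller than the stellar mass, uses approximate unit conversions, and the function names and the 0.2 solar-radius, 0.2 solar-mass "M dwarf" are hypothetical choices made for illustration.

```julia
# Approximate unit conversions (assumed values, good to a few tenths of a percent)
rsun_in_rearth = 109.2   # solar radius in Earth radii
mjup_in_mearth = 317.8   # Jupiter mass in Earth masses

# Transit depth ≈ (Rp / Rstar)^2, with Rp in Earth radii and Rstar in solar radii
transit_depth(rp_earth, rstar_sun) = (rp_earth / (rstar_sun * rsun_in_rearth))^2

# RV semi-amplitude in m/s for a circular, edge-on orbit with Mp << Mstar:
# K ≈ 28.4 m/s * (Mp/MJup) * (Mstar/Msun)^(-2/3) * (P/yr)^(-1/3)
rv_amplitude(mp_earth, mstar_sun, period_yr) =
    28.4 * (mp_earth / mjup_in_mearth) * mstar_sun^(-2/3) * period_yr^(-1/3)

# Same Earth-like planet (1 Earth radius, 1 Earth mass, 1 yr period) around two stars
depth_sun    = transit_depth(1.0, 1.0)
depth_mdwarf = transit_depth(1.0, 0.2)
k_sun        = rv_amplitude(1.0, 1.0, 1.0)
k_mdwarf     = rv_amplitude(1.0, 0.2, 1.0)

println("Transit depth ratio (M dwarf / Sun): ", depth_mdwarf / depth_sun)
println("RV amplitude ratio  (M dwarf / Sun): ", k_mdwarf / k_sun)
```

Under these assumptions both signals are larger for the smaller star, which is one way a survey can end up with more detections around cool stars even before any difference in true occurrence rate.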

question(md"What are the selection effects for other methods of exoplanet detection?")

Question

What are the selection effects for other methods of exoplanet detection?

Transit Timing Variations
- Systems near mean-motion resonances
- Very closely spaced planets
- TTV period short enough to see TTVs during Kepler mission
- Orbital periods long enough that TTV amplitude could be detected
Imaging
- Large orbital separations
- Prefer nearly face-on orbits
- Planets bright in IR →
  - Nearby
  - Hot → Massive & Young
Microlensing
- Sweet spot in projected angular distance
- More massive planets have longer microlensing signatures

question(md"Do rogue planets impact exoplanet selection effects or number distributions at all? ")

Question

Do rogue planets impact exoplanet selection effects or number distributions at all?

Microlensing

question(md"""
Will the detection ability of multi-observatory transit surveys be limited by "red noise"? """)

Question

Will the detection ability of multi-observatory transit surveys be limited by "red noise"?

When using different telescopes from different locations, better coverage in time-domain is achieved. However, correlated noise due to atmospheric effects or stellar variability will still affect transit survey sensitivity.
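
One way to see why correlated noise matters is to compare how white and red noise average down when observations are binned. The sketch below is a toy illustration, not a model of any real survey: it assumes an AR(1) process with lag-1 correlation 0.9 as a stand-in for red noise, and `ar1_noise` and `binned_scatter` are hypothetical helper names.

```julia
using Random, Statistics

# AR(1) noise with lag-1 correlation phi and unit marginal variance (assumed toy model)
function ar1_noise(rng, n, phi)
    x = zeros(n)
    x[1] = randn(rng)
    for i in 2:n
        x[i] = phi * x[i-1] + sqrt(1 - phi^2) * randn(rng)
    end
    return x
end

# Standard deviation of the means of consecutive bins of length nbin
function binned_scatter(x, nbin)
    nbins = length(x) ÷ nbin
    bin_means = [mean(x[(k-1)*nbin+1:k*nbin]) for k in 1:nbins]
    return std(bin_means)
end

rng = MersenneTwister(42)
n = 20_000
nbin = 50
white = randn(rng, n)
red = ar1_noise(rng, n, 0.9)

scatter_white = binned_scatter(white, nbin)  # should be close to 1/sqrt(nbin)
scatter_red = binned_scatter(red, nbin)      # averages down much more slowly

println("white: ", scatter_white, "  red: ", scatter_red, "  1/sqrt(nbin): ", 1 / sqrt(nbin))
```

Both series have the same point-to-point variance, but the binned red noise retains much more scatter, so adding more observations of the same transit helps less than the white-noise 1/sqrt(N) rule would suggest.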

Periodograms

question(md"""
Can you explain the LS periodogram?
""")

Question

Can you explain the LS periodogram?

See periodograms
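
As a rough illustration of what a Lomb-Scargle (LS) periodogram computes, the sketch below evaluates the classic LS power on a grid of trial frequencies for unevenly sampled data. It is a minimal version written for clarity, not the implementation used in the labs: the function name `lomb_scargle_power`, the synthetic times, the 7.3-unit period, and the frequency grid are all assumptions made for this example.

```julia
using Statistics

# Classic Lomb-Scargle power at angular frequency omega for times t and values y
function lomb_scargle_power(t, y, omega)
    r = y .- mean(y)   # subtract the mean before fitting sinusoids
    # The time offset tau makes the sine and cosine terms orthogonal at this frequency
    tau = atan(sum(sin.(2 .* omega .* t)), sum(cos.(2 .* omega .* t))) / (2 * omega)
    c = cos.(omega .* (t .- tau))
    s = sin.(omega .* (t .- tau))
    return 0.5 * (sum(r .* c)^2 / sum(c .^ 2) + sum(r .* s)^2 / sum(s .^ 2))
end

# Synthetic, noiseless example with deterministic uneven sampling (hypothetical data)
t = [i + 0.3 * sin(i^2) for i in 1:100]
true_period = 7.3
y = sin.(2π .* t ./ true_period)

# Evaluate the power on a grid of trial frequencies and pick the highest peak
freqs = 0.01:0.0005:0.5
power = [lomb_scargle_power(t, y, 2π * f) for f in freqs]
best_period = 1 / freqs[argmax(power)]

println("Period at highest peak: ", best_period)
```

At each trial frequency the power measures how much of the variance in the data a sinusoid at that frequency can account for, so peaks mark candidate periods. Real data add noise, aliases, and a floating mean, which this sketch does not treat.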

question(md"""
Is it possible to measure multiple periodogram power peaks that are all similar in magnitude and above the detection threshold, and what should be done in these cases?
""")

Question

Is it possible to measure multiple periodogram power peaks that are all similar in magnitude and above the detection threshold, and what should be done in these cases?

Labs

question(md"""
Crossjoin:  What does the cartesian product of rows exactly mean?
""")

Question

Crossjoin: What does the cartesian product of rows exactly mean?

df1 = DataFrame(:x=>1:3, :a=>["a","b","c"] )

	x	a
1	1	"a"
2	2	"b"
3	3	"c"

df2 = DataFrame(:y=>10:10:30, :b=>rand(3) )

	y	b
1	10	0.0673925
2	20	0.85735
3	30	0.774101

crossjoin(df1,df2)

	x	a	y	b
1	1	"a"	10	0.0673925
2	1	"a"	20	0.85735
3	1	"a"	30	0.774101
4	2	"b"	10	0.0673925
5	2	"b"	20	0.85735
6	2	"b"	30	0.774101
7	3	"c"	10	0.0673925
8	3	"c"	20	0.85735
9	3	"c"	30	0.774101
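
The same idea can be written out with plain Julia collections. This is a sketch of the concept only, not of how DataFrames implements `crossjoin`; the names `xs`, `ys`, and `pairs_xy` are made up for the example.

```julia
xs = 1:3        # plays the role of the rows of df1
ys = 10:10:30   # plays the role of the rows of df2

# Every element of xs is paired with every element of ys.
# x varies slowest and y fastest, matching the row order of the crossjoin output above.
pairs_xy = [(x, y) for x in xs for y in ys]

n_pairs = length(pairs_xy)   # length(xs) * length(ys)
println(n_pairs, " combinations, starting with ", pairs_xy[1])
```

The Cartesian product contains one row for each possible pairing, so the number of output rows is the product of the two input row counts.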

File Formats

What type of data does it store?

Text
Documents
Numerical values
Time-series
Images
Data cubes

Very common file formats

Text (ASCII or Unicode)
- Delimited (e.g., CSV, TSV)
- Fixed-width (e.g., AAS machine-readable tables)
- Markup languages (e.g., html, xml, toml, yaml,...)
Binary
- FITS: Standard for astronomical observations
- HDF5: Standard for numerical simulations

Key questions to ask when choosing a file format

How big is the dataset?
Will users want to read all data at once or small pieces of data?
Is dataset highly structured (e.g., large tables or images)?
Is it important to include machine-readable metadata?
Does it make sense to compress data?

What not to use

Your own custom binary file format
File formats that depend on versions of your software (e.g., pickle)
Markup languages for highly structured data
Text formats for large datasets
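
To give a feel for the last item, the sketch below compares the size of 1000 floating-point values stored as text with their in-memory binary size. The dataset is hypothetical, and the raw byte count is used only as a size reference: it is not a recommendation to write raw binary files, since standard formats such as FITS or HDF5 also carry metadata.

```julia
data = [sqrt(i) for i in 1:1000]        # hypothetical dataset of Float64 values

# Text representation: one value per line, as in a simple delimited file
text = join(string.(data), "\n")
text_bytes = sizeof(text)

# Binary representation: 8 bytes per Float64, before any header or metadata
binary_bytes = sizeof(data)

# Reading text back requires parsing every value
roundtrip = parse.(Float64, split(text, "\n"))

println("text: ", text_bytes, " bytes, binary: ", binary_bytes, " bytes")
println("round trip exact: ", roundtrip == data)
```

For full-precision values the text form is noticeably larger and must be parsed on every read, which becomes a real cost as datasets grow.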

Data Science Lifecycle

Example of a Data Science Lifecycle

(This is just one of many.)

Ask an interesting question
- What is the scientific goal?
- What would you do if you had all the data?
- What do you want to predict or estimate?
Get the data
- How were the data sampled?
- Which data are relevant?
- Are there privacy issues?
Explore the data
- Plot the data.
- Are there anomalies?
- Are there patterns?
Model the data
- Build a model.
- Fit the model.
- Validate the model.
Communicate and visualize the results
- What did we learn?
- Do the results make sense?
- Can we tell a story?

-- Blitzstein & Pfister for Harvard CS109

What's missing?

hint(md"""
- Making iterative process/loops explicit
- Interpreting results for oneself
- Deploying model to work for future data
""")

Hint

Making iterative process/loops explicit
Interpreting results for oneself
Deploying model to work for future data

Some workflows common in industry

OSEMN

Obtain
Scrub
Explore
Model
iNterpret

CRISP-DM

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Emphasizes loops and deployment

Team Data Science Process (TDSP)

Combines a workflow with project templates and recommendations for infrastructure and tools. Favors MS products.

Domino’s data science life cycle has six stages:

Ideation
Data Acquisition and Exploration
Research & Development
Validation
Delivery
Monitoring

Emphasizes frequent iteration, collaboration and reproducibility.

Adapting Data Science Workflows from Industry to Scientific Setting

Reinterpret terms like "business case" and "customer"
Often don't know how to quantify success when we start a project
Generally, place more value on interpretability
Can accommodate projects requiring longer timescales
Increasingly, plan to share data & code with the public
Often communication is primarily with other scientists

ICDS Fall 2022 Symposium

Data Science, AI, and a Sustainable, Resilient, and Equitable Future

Keynote speaker: danah boyd

Partner Researcher at Microsoft Research, the founder of Data & Society and a Distinguished Visiting Professor at Georgetown University

Quotes that stood out to me.

"I found that the people who ascribe the most power to statistics and data are not people who do statistics and data science. They are executives who give the vision talks about the power of data..." - Jeff Hammerbacher (2016), former lead of Data Science at Facebook
"Performing math is different than doing math." (in context of redistrictors making it look like you're being objective)
"You're not making true claims. You're making invitations for inquiry." (in context of redistricting)
"It's easier for me to agree with the model" (in context of career risk)

Setup

ChooseDisplayMode()

begin
using PlutoUI, PlutoTeachingTools, HypertextLiteral
using DataFrames
end

question(str; invite="Question") = Markdown.MD(Markdown.Admonition("tip", invite, [str]))

question (generic function with 1 method)

Built with Julia 1.8.2 and

DataFrames 1.4.1
HypertextLiteral 0.9.4
PlutoTeachingTools 0.2.3
PlutoUI 0.7.44

To run this tutorial locally, download this file and open it with Pluto.jl.