Data Science Lifecycle
Astro 497, Week 9, Day 1
TableOfContents()
Reading questions
Is there one type of star that is more frequently found having exoplanets orbiting it, and if so, could that be due to selection effects as well?
By number of known planets:
Most planets discovered by RVs orbit G- and K-type stars (the sweet spot for RVs)
Most planets discovered by transits orbit G- and F-type stars (brighter)
By occurrence rate of planets:
Cool stars
Metal-rich stars
What biases could contribute to these apparent trends?
- Cooler main sequence stars have smaller masses and radii →
  - RV amplitude is larger for a given planet mass
  - Transit depth is larger for a given planet size
- Metal-rich stars have larger opacity in photosphere →
  - Brighter for given mass and age
- Early indication of a preference for giant planets around metal-rich stars led some RV surveys to intentionally select metal-rich stars.
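The scalings behind the first bias can be checked with a few lines of Julia. This is a minimal sketch, not part of the original notebook: it assumes a circular, edge-on orbit, rounded physical constants, and an illustrative cool star with 0.2 solar radii and 0.2 solar masses. The helper names `transit_depth` and `rv_semiamplitude` are hypothetical.

```julia
# Rounded physical constants in SI units (assumed values for illustration)
const G = 6.674e-11        # gravitational constant
const M_sun = 1.989e30     # kg
const R_sun = 6.957e8      # m
const M_earth = 5.972e24   # kg
const R_earth = 6.371e6    # m
const year = 3.156e7       # s

# Transit depth is roughly the ratio of the projected disk areas.
transit_depth(Rp, Rstar) = (Rp / Rstar)^2

# RV semi-amplitude for a circular, edge-on orbit with period P.
rv_semiamplitude(Mp, Mstar, P) = (2π * G / P)^(1/3) * Mp / (Mstar + Mp)^(2/3)

# Earth-like planet around a Sun-like star vs. an assumed small, cool star
depth_sun  = transit_depth(R_earth, R_sun)
depth_cool = transit_depth(R_earth, 0.2 * R_sun)
K_sun  = rv_semiamplitude(M_earth, M_sun, year)
K_cool = rv_semiamplitude(M_earth, 0.2 * M_sun, year)

println("Transit depth ratio (cool/Sun): ", depth_cool / depth_sun)
println("RV amplitude ratio (cool/Sun):  ", K_cool / K_sun)
```

Under these assumptions the transit depth scales as the inverse square of the stellar radius and the RV amplitude as the stellar mass to the -2/3 power at fixed period, so the same planet is easier to detect around the smaller star with either method.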
What are the selection effects for other methods of exoplanet detection?
Transit Timing Variations
Systems near mean-motion resonances
Very closely spaced planets
TTV period short enough to see TTVs during Kepler mission
Orbital periods long enough that TTV amplitude could be detected
Imaging
Large orbital separations
Prefer nearly face-on orbits
Planets bright in IR →
Nearby
Hot → Massive & Young
Microlensing
Sweet spot in projected angular distance
More massive planets have longer microlensing signatures
Do rogue planets impact exoplanet selection effects or number distributions at all?
Mainly for microlensing, which can detect free-floating planets
Will the detection ability of multi-observatory transit surveys be limited by "red noise"?
When using different telescopes from different locations, better coverage in time-domain is achieved. However, correlated noise due to atmospheric effects or stellar variability will still affect transit survey sensitivity.
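To see why correlated noise matters even with good time coverage, one can compare how the variance of an average of N measurements shrinks for white versus correlated noise. The sketch below is an illustration only: it assumes a stationary AR(1) process with lag-one correlation ρ as a simple stand-in for red noise, and the function name `var_of_mean` is hypothetical.

```julia
# Variance of the mean of N equally weighted samples drawn from a stationary
# process with per-point variance σ^2 and autocorrelation ρ^k at lag k
# (an AR(1) model, assumed here as a stand-in for "red noise").
function var_of_mean(N, ρ; σ=1.0)
    corr_sum = sum((1 - k / N) * ρ^k for k in 1:N-1; init=0.0)
    return σ^2 / N * (1 + 2 * corr_sum)
end

N = 100
white = var_of_mean(N, 0.0)   # uncorrelated noise: σ^2 / N
red   = var_of_mean(N, 0.9)   # strongly correlated noise

println("Variance of mean, white noise: ", white)
println("Variance of mean, red noise:   ", red)
println("Ratio (red/white):             ", red / white)
```

With ρ = 0 the familiar σ²/N scaling is recovered. With strong correlation, averaging many points within a transit helps much less, which is why adding observatories improves coverage but does not remove the red-noise floor.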
Periodograms
Can you explain the LS periodogram?
See periodograms
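As a rough illustration of the idea, a least-squares periodogram can be built by fitting a constant plus a sinusoid at each trial frequency and recording how much the fit improves on a constant-only model. This is a sketch of the floating-mean variant, not the notebook's implementation. The function name `ls_power`, the synthetic data, and the frequency grid are all assumptions made for the example.

```julia
using LinearAlgebra

# For each trial frequency, fit y ≈ c + a*cos(2πft) + b*sin(2πft) by linear
# least squares and return the fractional reduction in the sum of squared
# residuals relative to a constant-only model (0 = no improvement, 1 = perfect fit).
function ls_power(t, y, freqs)
    ss_const = sum(abs2, y .- sum(y) / length(y))
    return map(freqs) do f
        A = hcat(ones(length(t)), cos.(2π * f .* t), sin.(2π * f .* t))
        coeffs = A \ y                  # least-squares solution
        resid = y .- A * coeffs
        1 - sum(abs2, resid) / ss_const
    end
end

# Synthetic, unevenly sampled, noise-free data with a 7.3-day period (assumed)
t = [i + 0.3 * sin(i) for i in 0:79]
y = 2 .* sin.(2π .* t ./ 7.3) .+ 1

freqs = range(0.01, 0.5, length=500)
power = ls_power(t, y, freqs)
best_freq = freqs[argmax(power)]

println("Best-fit period: ", 1 / best_freq)
```

The peak lands near the input period because the model at that frequency explains almost all of the variance. Real data add noise, aliases from the sampling pattern, and the need for a detection threshold.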
Is it possible to measure multiple periodogram power peaks that are all similar in magnitude and above the detection threshold, and what should be done in these cases?
Labs
Crossjoin: What exactly does the Cartesian product of rows mean?
df1 = DataFrame(:x=>1:3, :a=>["a","b","c"] )
row | x | a |
---|---|---|
1 | 1 | "a" |
2 | 2 | "b" |
3 | 3 | "c" |
df2 = DataFrame(:y=>10:10:30, :b=>rand(3) )
row | y | b |
---|---|---|
1 | 10 | 0.0673925 |
2 | 20 | 0.85735 |
3 | 30 | 0.774101 |
crossjoin(df1,df2)
row | x | a | y | b |
---|---|---|---|---|
1 | 1 | "a" | 10 | 0.0673925 |
2 | 1 | "a" | 20 | 0.85735 |
3 | 1 | "a" | 30 | 0.774101 |
4 | 2 | "b" | 10 | 0.0673925 |
5 | 2 | "b" | 20 | 0.85735 |
6 | 2 | "b" | 30 | 0.774101 |
7 | 3 | "c" | 10 | 0.0673925 |
8 | 3 | "c" | 20 | 0.85735 |
9 | 3 | "c" | 30 | 0.774101 |
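In other words, every row of the first table is paired with every row of the second, so the result has (rows in first) × (rows in second) rows. The sketch below checks this by building the pairs by hand. It assumes DataFrames.jl is installed and uses fixed values in column `b` in place of `rand(3)` so the comparison is reproducible.

```julia
using DataFrames

df1 = DataFrame(:x => 1:3, :a => ["a", "b", "c"])
df2 = DataFrame(:y => 10:10:30, :b => [0.1, 0.2, 0.3])  # fixed stand-in for rand(3)

cj = crossjoin(df1, df2)

# Pair every row of df1 with every row of df2 (df1 varies slowest).
pairs_by_hand = [(r1.x, r1.a, r2.y, r2.b) for r1 in eachrow(df1) for r2 in eachrow(df2)]
pairs_from_join = [(r.x, r.a, r.y, r.b) for r in eachrow(cj)]

println("Rows in cross join: ", nrow(cj))
println("Matches hand-built pairs: ", pairs_by_hand == pairs_from_join)
```

No key column is involved: unlike `innerjoin` or `leftjoin`, `crossjoin` does not match rows on shared values, which is why the two tables must not share column names.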
File Formats
What type of data does it store?
Text
Documents
Numerical values
Time-series
Images
Data cubes
Very common file formats
Text (ASCII or Unicode)
Delimited (e.g., CSV, TSV)
Fixed-width (e.g., AAS machine-readable tables)
Markup languages (e.g., HTML, XML, TOML, YAML, ...)
Binary
FITS: Standard for astronomical observations
HDF5: Standard for numerical simulations
Key questions to ask when choosing a file format
How big is the dataset?
Will users want to read all data at once or small pieces of data?
Is dataset highly structured (e.g., large tables or images)?
Is it important to include machine-readable metadata?
Does it make sense to compress data?
What not to use
Your own custom binary file format
File formats that depend on versions of your software (e.g., pickle)
Markup languages for highly structured data
Text formats for large datasets
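One reason to avoid text formats for large datasets is size. The sketch below writes 10,000 floating-point values as delimited text and compares the file size with the 8 bytes per value that a binary representation needs. It is an illustration under stated assumptions: it uses Julia's DelimitedFiles standard library and a temporary file, and the byte count of the in-memory array stands in for a binary container such as FITS or HDF5, which would add some header overhead.

```julia
using DelimitedFiles

samples = [sqrt(i) for i in 1:10_000]   # 10,000 Float64 values

text_path = tempname()
writedlm(text_path, samples)            # one number per line, written as text
text_bytes = filesize(text_path)

binary_bytes = sizeof(samples)          # 8 bytes per Float64

roundtrip = vec(readdlm(text_path))     # parse the text back into numbers

println("Text file size (bytes):   ", text_bytes)
println("Binary payload (bytes):   ", binary_bytes)
println("Round trip preserves data: ", roundtrip == samples)

rm(text_path)
```

Text is also slower to parse and cannot be read in small pieces without scanning, which ties back to the questions about dataset size and access pattern.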
Data Science Lifecycle
Example of a Data Science Lifecycle
(This is just one of many.)
Ask an interesting question
What is the scientific goal?
What would you do if you had all the data?
What do you want to predict or estimate?
Get the data
How were the data sampled?
Which data are relevant?
Are there privacy issues?
Explore the data
Plot the data.
Are there anomalies?
Are there patterns?
Model the data
Build a model.
Fit the model.
Validate the model.
Communicate and visualize the results
What did we learn?
Do the results make sense?
Can we tell a story?
-- Blitzstein & Pfister for Harvard CS109
What's missing?
- Making iterative process/loops explicit
- Interpreting results for oneself
- Deploying model to work for future data
Some workflows common in industry
OSEMN
Obtain
Scrub
Explore
Model
iNterpret
CRISP-DM
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Emphasizes loops and deployment
Team Data Science Process (TDSP)
Combines a workflow with project templates and recommendations for infrastructure and tools. Favors Microsoft products.
Domino’s data science life cycle has six stages:
Ideation
Data Acquisition and Exploration
Research & Development
Validation
Delivery
Monitoring
Emphasizes frequent iteration, collaboration and reproducibility.
Adapting Data Science Workflows from Industry to Scientific Setting
Reinterpret terms like "business case" and "customer"
Often don't know how to quantify success when we start a project
Generally, place more value on interpretability
Can accommodate projects requiring longer timescales
Increasingly, plan to make data & code public
Often communication is primarily with other scientists
ICDS Fall 2022 Symposium
Data Science, AI, and a Sustainable, Resilient, and Equitable Future
Keynote speaker: danah boyd
Partner Researcher at Microsoft Research, the founder of Data & Society and a Distinguished Visiting Professor at Georgetown University
Quotes that stood out to me.
"I found that the people who ascribe the most power to statistics and data are not people who do statistics and data science. They are executives who give the vision talks about the power of data..." - Jeff Hammerbacher (2016), former lead of Data Science at Facebook
"Performing math is different than doing math." (in context of redistricters making it look like you're being objective)
"You're not making true claims. You're making invitations for inquiry." (in context of redistricting)
"It's easier for me to agree with the model" (in context of career risk)
Setup
ChooseDisplayMode()
begin
using PlutoUI, PlutoTeachingTools, HypertextLiteral
using DataFrames
end
question(str; invite="Question") = Markdown.MD(Markdown.Admonition("tip", invite, [str]))
question (generic function with 1 method)
Built with Julia 1.8.2 and
DataFrames 1.4.1
HypertextLiteral 0.9.4
PlutoTeachingTools 0.2.3
PlutoUI 0.7.44
To run this tutorial locally, download this file and open it with Pluto.jl.