Queries & Data Wrangling
Astro 497, Week 8, Monday
TableOfContents()
Databases
Simplest form
Store data
Retrieve data
Commonly implemented as a set of tables
Columns contain different fields (e.g., ID, magnitude, RA, Dec)
Rows contain entries (e.g., 51 Pegasi, Kepler-10, HAT-P-13,... )
Value-added features
Return subset of data efficiently
Many strategies for how to filter data (e.g., order of operations)
Database server can use heuristics to pick good strategy
Allow for transactions to update database
Fundamental properties of databases
Atomicity: All part of a transaction succeed, or the database is rolled back to its previous state
Consistency: Data in database always satisfies its validation rules
Isolation: Even if multiple transactions are made concurrently, there is no interference between transactions
Durability: Once a transaction is committed, it will remain committed
SQL-based Database Servers
Open-source:
MySQL
PostgreSQL & Greenplum
Commercial:
Microsoft SQL Server:
IBM DB2
Oracle Database
...
Continuing innovation in database systems
SciDB (array & disk based database)
MonetDB (column store)
JuliaDB (pure Julia, for persistent data)
When selecting a database for a project, consider:
How much data is to be stored?
How frequent/large will transactions be?
Are there specific hardware or OS requirements?
Does the team have someone dedicated to supporting database?
Database Clients
One database server many clients simultaneously
Different clients can use different interfaces
Command line
Webpage
URL-based
Custom Graphical user interface (GUI)
TopCat
Queries
Query: An expression that requests database to return a specific subset of data.
Query languages:
Structured Query Language (SQL): Dated, but by far the most common
Astronomical Data Query Language (ADQL): Astronomy-specific
Language Integrated Query (LINQ): Microsoft-supported
Many more
SQL essentials
Selecting (columns)
Filtering (for rows)
Joining (multiple tables)
Aggregating (rows within a table)
SQL programming
Variables
Functions
Procedures
Data management
Transactions
Virtual Observatory (VO)
Defines standards that help astronomers to collaborate effectively, emphasizing working with multiple data sources.
Astronomical Data Query Language (ADQL)
Table Access Protocol (TAP)
Astronomy-specific functions in ADQL
AREA
BOX
CENTROID
CIRCLE
CONTAINS
COORD1
COORD2
COORDSYS
DISTANCE
INTERSECTS
POINT
POLYGON
REGION
Example Bad SQL Query
Do not send
select * from SomeTable
Why?
Example SQL Queries
Take a quick peak two columns data for first few entries
select top 10 X, Y from SomeTable
Find extreme values of X
select top 10 X, Y
from SomeTable
order by X
Filter which rows are returned using expression
select top 10 x,y
from SomeTable
where x*x+y*y between 0 and 1
order by x
Check how many rows are in a table
select COUNT(designation) as N from gaiadr2.gaia_source
Check how many rows satisfy a filter
select COUNT(designation) as n, AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave
from gaiadr2.gaia_source
where phot_g_mean_mag < 14
Grouping data to make a histogram
select COUNT(designation) as N,
AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave,
AVG(phot_g_mean_mag) as phot_g_mean_mag_ave,
ROUND(phot_g_mean_mag,1) as bin
from gaiadr2.gaia_source
where phot_g_mean_mag < 14
group by bin
order by bin
Table Access Protocol (TAP)
Start with ADQL
SELECT <column list> FROM <table> WHERE <constraints>
but transform it into a url, by
Prepend a base service url
convert spaces to
+
'sDeal with other special characters (e.g., +, quotes)
Optionally, specify format for results
e.g.,
https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,pl_masse,ra,dec+from+ps
url_ex1 = make_tap_query_url(nexsci_query_base_url, "ps", select_cols="pl_name,gaia_id,sy_kepmag,ra,dec", where="default_flag=1")
"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,gaia_id,sy_kepmag,ra,dec+from+ps+where+default_flag=1&format=tsv"
df_ex1 = query_to_df(url_ex1)
pl_name | gaia_id | sy_kepmag | ra | dec | |
---|---|---|---|---|---|
1 | "OGLE-TR-10 b" | "Gaia DR2 4056443366649948160" | missing | 267.868 | -29.8765 |
2 | "BD-08 2823 c" | "Gaia DR2 3770419611540574080" | missing | 150.197 | -9.51657 |
3 | "HR 8799 c" | "Gaia DR2 2832463659640297472" | missing | 346.87 | 21.134 |
4 | "HD 110014 b" | "Gaia DR2 3676091134604409728" | missing | 189.811 | -7.99567 |
5 | "HIP 5158 b" | "Gaia DR2 2351405057377686272" | missing | 16.5095 | -22.4536 |
6 | "HD 44219 b" | "Gaia DR2 3001428566419966592" | missing | 95.06 | -10.7251 |
7 | "HD 132563 b" | "Gaia DR2 1585765117538284800" | missing | 224.589 | 44.0429 |
8 | "Kepler-24 c" | "Gaia DR2 2052823535171095296" | 14.925 | 290.413 | 38.3437 |
9 | "CHXR 73 b" | "Gaia DR2 5201175987817179136" | missing | 166.619 | -77.6259 |
10 | "alf Ari b" | missing | missing | 31.7933 | 23.4624 |
... | |||||
5197 | "TOI-1749 d" | "Gaia DR2 2253774094189458432" | missing | 282.737 | 64.4195 |
desig = replace_spaces_for_tap(df_ex1.gaia_id[8])
"Gaia+DR2+2052823535171095296"
url_ex2 = make_tap_query_url(gaia_query_base_url, "gaiadr2.gaia_source", where="designation='$(desig)'",select_cols="*",max_rows=5)
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+5+*+from+gaiadr2.gaia_source+where+designation='Gaia+DR2+2052823535171095296'&format=tsv"
df_ex2 = query_to_df(url_ex2)
solution_id | designation | source_id | random_index | ref_epoch | ra | ra_error | dec | ... | |
---|---|---|---|---|---|---|---|---|---|
1 | 1635721458409799680 | "Gaia DR2 2052823535171095296" | 2052823535171095296 | 197836529 | 2015.5 | 290.413 | 0.020076 | 38.3437 |
Joins
Joining tables is a fundamental concept that can be applied either to DataFrames stored locally or as part of SQL/ADQL queries.
innerjoin
&semijoin
: Return rows for values of the key that exist in both tablesouterjoin
: Return rows for values of the key that exist in either tableleftjoin
: Return rows for values of the key that exist in first tablerightjoin
: Return rows for values of the key that exist in second table
antijoin
: Return rows Return rows for values of the key that exist in first table but not the second table
crossjoin
: Return table with every row from first table as rows and every row from second table as columns
Examples
df_ex3 = innerjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
pl_name | gaia_id | sy_kepmag | ra | dec | solution_id | source_id | random_index | ... | |
---|---|---|---|---|---|---|---|---|---|
1 | "Kepler-24 c" | "Gaia DR2 2052823535171095296" | 14.925 | 290.413 | 38.3437 | 1635721458409799680 | 2052823535171095296 | 197836529 | |
2 | "Kepler-24 d" | "Gaia DR2 2052823535171095296" | 14.925 | 290.413 | 38.3437 | 1635721458409799680 | 2052823535171095296 | 197836529 | |
3 | "Kepler-24 b" | "Gaia DR2 2052823535171095296" | 14.925 | 290.413 | 38.3437 | 1635721458409799680 | 2052823535171095296 | 197836529 | |
4 | "Kepler-24 e" | "Gaia DR2 2052823535171095296" | 14.925 | 290.413 | 38.3437 | 1635721458409799680 | 2052823535171095296 | 197836529 |
names(df_ex3)
99-element Vector{String}: "pl_name" "gaia_id" "sy_kepmag" "ra" "dec" "solution_id" "source_id" ⋮ "radius_percentile_lower" "radius_percentile_upper" "lum_val" "lum_percentile_lower" "lum_percentile_upper" "datalink_url"
tip(md"Originally, both tables contained columns named `ra` and `dec`. The joined table contains columns `ra` and `ra_1` (and `dec` and `dec_1`) because we set `makeunique`.")
Originally, both tables contained columns named ra
and dec
. The joined table contains columns ra
and ra_1
(and dec
and dec_1
) because we set makeunique
.
What if we didn't know the Gaia designation?
targetpos = (; ra = df_ex1.ra[1], dec = df_ex1.dec[1] )
(ra = 267.8677483, dec = -29.8764758)
url_ex4 = make_tap_query_url(gaia_query_base_url, "gaiadr3.gaia_source", where="1=contains(POINT($(targetpos.ra),$(targetpos.dec)),CIRCLE(ra,dec,30./3600.))", select_cols="*,DISTANCE(POINT($(targetpos.ra),$(targetpos.dec)),POINT(ra,dec))+AS+ang_sep",order_by_cols="ang_sep",max_rows=1000)
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+1000+*,DISTANCE(POINT(267.8677483,-29.8764758),POINT(ra,dec))+AS+ang_sep+from+gaiadr3.gaia_source+where+1=contains(POINT(267.8677483,-29.8764758),CIRCLE(ra,dec,30./3600.))+order+by+ang_sep&format=tsv"
df_ex4 = query_to_df(url_ex4)
solution_id | designation | source_id | random_index | ref_epoch | ra | ra_error | dec | ... | |
---|---|---|---|---|---|---|---|---|---|
1 | 1636148068921376768 | "Gaia DR3 4056443366649948160" | 4056443366649948160 | 628955980 | 2016.0 | 267.868 | 0.0484435 | -29.8765 | |
2 | 1636148068921376768 | "Gaia DR3 4056443366695974528" | 4056443366695974528 | 1783248586 | 2016.0 | 267.867 | 0.301148 | -29.8763 | |
3 | 1636148068921376768 | "Gaia DR3 4056443366694853888" | 4056443366694853888 | 82166964 | 2016.0 | 267.868 | 0.597072 | -29.8772 | |
4 | 1636148068921376768 | "Gaia DR3 4056443366696002944" | 4056443366696002944 | 586843531 | 2016.0 | 267.869 | 0.238754 | -29.8765 | |
5 | 1636148068921376768 | "Gaia DR3 4056443370989465856" | 4056443370989465856 | 1370584052 | 2016.0 | 267.869 | 9.60017 | -29.8771 | |
6 | 1636148068921376768 | "Gaia DR3 4056443366695975424" | 4056443366695975424 | 1032450251 | 2016.0 | 267.869 | 0.207094 | -29.8754 | |
7 | 1636148068921376768 | "Gaia DR3 4056443366694961152" | 4056443366694961152 | 1119708763 | 2016.0 | 267.867 | 0.596232 | -29.8779 | |
8 | 1636148068921376768 | "Gaia DR3 4056443370989465728" | 4056443370989465728 | 589362746 | 2016.0 | 267.869 | 5.98055 | -29.8774 | |
9 | 1636148068921376768 | "Gaia DR3 4056443366694892160" | 4056443366694892160 | 781324219 | 2016.0 | 267.869 | 0.40115 | -29.8777 | |
10 | 1636148068921376768 | "Gaia DR3 4056443370989472384" | 4056443370989472384 | 1372304020 | 2016.0 | 267.866 | 0.953113 | -29.8758 | |
... | |||||||||
276 | 1636148068921376768 | "Gaia DR3 4056443370989452672" | 4056443370989452672 | 1608917468 | 2016.0 | 267.871 | 5.46342 | -29.8843 |
Wait, which row is the best match?
sort(df_ex4[!,[:designation,:ang_sep,:phot_g_mean_mag] ], :ang_sep)
designation | ang_sep | phot_g_mean_mag | |
---|---|---|---|
1 | "Gaia DR3 4056443366649948160" | 7.49048e-7 | 15.669 |
2 | "Gaia DR3 4056443366695974528" | 0.000588803 | 18.363 |
3 | "Gaia DR3 4056443366694853888" | 0.000686453 | 18.3986 |
4 | "Gaia DR3 4056443366696002944" | 0.00116672 | 18.6382 |
5 | "Gaia DR3 4056443370989465856" | 0.00118059 | 20.2377 |
6 | "Gaia DR3 4056443366695975424" | 0.00138877 | 18.5433 |
7 | "Gaia DR3 4056443366694961152" | 0.00148488 | 19.7284 |
8 | "Gaia DR3 4056443370989465728" | 0.00152604 | 20.2813 |
9 | "Gaia DR3 4056443366694892160" | 0.00153313 | 19.1686 |
10 | "Gaia DR3 4056443370989472384" | 0.00170832 | 18.3691 |
... | |||
276 | "Gaia DR3 4056443370989452672" | 0.00831395 | 19.7214 |
df_ex1.sy_kepmag[8]
14.925
Questions?
Would it be possible to use available data sets to discover planets that have not been found yet by anyone else?
Setup & Helper Code
ChooseDisplayMode()
begin
using CSV, DataFrames, Query
using HTTP
using PlutoUI, PlutoTeachingTools
end
begin
nexsci_query_base_url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query="
gaia_query_base_url =
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY="
end;
begin
make_tap_query_url_url = "#" * (PlutoRunner.currently_running_cell_id |> string)
"""
`make_tap_query_url(base_url, table_name; ...)`
Returns url for a Table Access Protocol (TAP) query.
Inputs:
- base url
- table name
Optional arguments (and default):
- `max_rows` (all)
- `select_cols` (all)
- `where` (no requirements)
- `order_by_cols` (not sorted)
- `format` (tsv)
See [NExScI](https://exoplanetarchive.ipac.caltech.edu/docs/TAP/usingTAP.html#sync) or [Virtual Observatory](https://www.ivoa.net/documents/TAP/) for more info.
"""
function make_tap_query_url(query_base_url::String, query_table::String; max_rows::Integer = 0, select_cols::String = "", where::String = "", order_by_cols::String = "", format::String="tsv" )
query_select = "select"
if max_rows > 0
query_select *= "+top+" * string(max_rows)
end
if length(select_cols) >0
query_select *= "+" * select_cols
else
query_select *= "+*"
end
query_from = "+from+" * query_table
query_where = length(where)>0 ? "+where+" * where : ""
query_order_by = length(order_by_cols) > 0 ? "+order+by+" * order_by_cols : ""
query_format = "&format=" * format
url = query_base_url * query_select * query_from * query_where * query_order_by * query_format
end
end
"""
`query_to_df(url)` downloads data from a URL and attempts to place it into a DataFrame
"""
query_to_df(url) = CSV.read(HTTP.get(url).body,DataFrame)
"""`replace_spaces_for_tap(str)`
Replace spaces with +'s as expected for TAP queries.
"""
replace_spaces_for_tap(s::AbstractString) = replace(s," "=>"+")
begin
"""
`select_cols_for_tap(cols)`
Returns a string of comma-separated columns names from a vector of columns names (as either strings or symbols), for using in a TAP query.
"""
function select_cols_for_tap end
select_cols_for_tap(cols_to_keep::AbstractVector{Symbol}) = select_cols_for_tap(string.(cols_to_keep)) #string(map(s->string(s) * "+", cols_to_keep)...)[1:end-1]
select_cols_for_tap(cols_to_keep::AbstractVector{AS}) where {AS<:AbstractString} = string(map(s->s * ",", cols_to_keep)...)[1:end-1]
select_cols_for_tap(col_to_keep::Symbol) = string(col_to_keep)
select_cols_for_tap(col_to_keep::AbstractString) = col_to_keep
end
select_cols_for_tap (generic function with 4 methods)
Built with Julia 1.8.2 and
CSV 0.10.4DataFrames 1.3.6
HTTP 1.4.0
PlutoTeachingTools 0.2.3
PlutoUI 0.7.43
Query 1.0.0
To run this tutorial locally, download this file and open it with Pluto.jl.
To run this tutorial locally, download this file and open it with Pluto.jl.
To run this tutorial locally, download this file and open it with Pluto.jl.
To run this tutorial locally, download this file and open it with Pluto.jl.