Queries & Data Wrangling

Astro 497, Week 8, Monday

TableOfContents()

Databases

Simplest form

Store data
Retrieve data
Commonly implemented as a set of tables
- Columns contain different fields (e.g., ID, magnitude, RA, Dec)
- Rows contain entries (e.g., 51 Pegasi, Kepler-10, HAT-P-13,... )

Value-added features

Return subset of data efficiently
Many strategies for how to filter data (e.g., order of operations)
Database server can use heuristics to pick good strategy
Allow for transactions to update database

Fundamental properties of databases

Atomicity: All part of a transaction succeed, or the database is rolled back to its previous state
Consistency: Data in database always satisfies its validation rules
Isolation: Even if multiple transactions are made concurrently, there is no interference between transactions
Durability: Once a transaction is committed, it will remain committed

SQL-based Database Servers

Open-source:

MySQL
PostgreSQL & Greenplum

Commercial:

Microsoft SQL Server:
IBM DB2
Oracle Database
...

Continuing innovation in database systems

SciDB (array & disk based database)
MonetDB (column store)
JuliaDB (pure Julia, for persistent data)

When selecting a database for a project, consider:

How much data is to be stored?
How frequent/large will transactions be?
Are there specific hardware or OS requirements?
Does the team have someone dedicated to supporting database?

Database Clients

One database server many clients simultaneously
Different clients can use different interfaces
- Command line
- Webpage
- URL-based
- Custom Graphical user interface (GUI)
  - TopCat

Queries

Query: An expression that requests database to return a specific subset of data.

Query languages:

Structured Query Language (SQL): Dated, but by far the most common
Astronomical Data Query Language (ADQL): Astronomy-specific
Language Integrated Query (LINQ): Microsoft-supported
Many more

SQL essentials

Selecting (columns)
Filtering (for rows)
Joining (multiple tables)
Aggregating (rows within a table)

SQL programming

Variables
Functions
Procedures
Data management
Transactions

Virtual Observatory (VO)

Defines standards that help astronomers to collaborate effectively, emphasizing working with multiple data sources.

Astronomical Data Query Language (ADQL)
Table Access Protocol (TAP)

Astronomy-specific functions in ADQL

AREA
BOX
CENTROID
CIRCLE
CONTAINS
COORD1
COORD2
COORDSYS
DISTANCE
INTERSECTS
POINT
POLYGON
REGION

Example Bad SQL Query

Do not send

select * from SomeTable

Why?

Example SQL Queries

Take a quick peak two columns data for first few entries

select top 10 X, Y from SomeTable

Find extreme values of X

select top 10 X, Y 
from SomeTable 
order by X

Filter which rows are returned using expression

select top 10 x,y
from SomeTable 
where x*x+y*y between 0 and 1
order by x

Check how many rows are in a table

select COUNT(designation) as N from gaiadr2.gaia_source

Gaia Archive

Check how many rows satisfy a filter

select COUNT(designation) as n, AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave
from gaiadr2.gaia_source
where phot_g_mean_mag < 14

Grouping data to make a histogram

select COUNT(designation) as N, 
       AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave,  
       AVG(phot_g_mean_mag) as phot_g_mean_mag_ave,
       ROUND(phot_g_mean_mag,1) as bin
from gaiadr2.gaia_source
where phot_g_mean_mag < 14
group by bin 
order by bin

Table Access Protocol (TAP)

Start with ADQL

SELECT <column list> FROM <table> WHERE <constraints>

but transform it into a url, by

Prepend a base service url
convert spaces to +'s
Deal with other special characters (e.g., +, quotes)
Optionally, specify format for results

e.g.,

https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,pl_masse,ra,dec+from+ps

url_ex1 = make_tap_query_url(nexsci_query_base_url, "ps", select_cols="pl_name,gaia_id,sy_kepmag,ra,dec", where="default_flag=1")

"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,gaia_id,sy_kepmag,ra,dec+from+ps+where+default_flag=1&format=tsv"

df_ex1 = query_to_df(url_ex1)

	pl_name	gaia_id	sy_kepmag	ra	dec
1	"OGLE-TR-10 b"	"Gaia DR2 4056443366649948160"	missing	267.868	-29.8765
2	"BD-08 2823 c"	"Gaia DR2 3770419611540574080"	missing	150.197	-9.51657
3	"HR 8799 c"	"Gaia DR2 2832463659640297472"	missing	346.87	21.134
4	"HD 110014 b"	"Gaia DR2 3676091134604409728"	missing	189.811	-7.99567
5	"HIP 5158 b"	"Gaia DR2 2351405057377686272"	missing	16.5095	-22.4536
6	"HD 44219 b"	"Gaia DR2 3001428566419966592"	missing	95.06	-10.7251
7	"HD 132563 b"	"Gaia DR2 1585765117538284800"	missing	224.589	44.0429
8	"Kepler-24 c"	"Gaia DR2 2052823535171095296"	14.925	290.413	38.3437
9	"CHXR 73 b"	"Gaia DR2 5201175987817179136"	missing	166.619	-77.6259
10	"alf Ari b"	missing	missing	31.7933	23.4624
...
5197	"TOI-1749 d"	"Gaia DR2 2253774094189458432"	missing	282.737	64.4195

desig = replace_spaces_for_tap(df_ex1.gaia_id[8])

"Gaia+DR2+2052823535171095296"

url_ex2 = make_tap_query_url(gaia_query_base_url, "gaiadr2.gaia_source", where="designation='$(desig)'",select_cols="*",max_rows=5)

"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+5+*+from+gaiadr2.gaia_source+where+designation='Gaia+DR2+2052823535171095296'&format=tsv"

df_ex2 = query_to_df(url_ex2)

	solution_id	designation	source_id	random_index	ref_epoch	ra	ra_error	dec	...
1	1635721458409799680	"Gaia DR2 2052823535171095296"	2052823535171095296	197836529	2015.5	290.413	0.020076	38.3437

Joins

Joining tables is a fundamental concept that can be applied either to DataFrames stored locally or as part of SQL/ADQL queries.

innerjoin & semijoin: Return rows for values of the key that exist in both tables
outerjoin: Return rows for values of the key that exist in either table
leftjoin: Return rows for values of the key that exist in first table
rightjoin: Return rows for values of the key that exist in second table

antijoin: Return rows Return rows for values of the key that exist in first table but not the second table

crossjoin: Return table with every row from first table as rows and every row from second table as columns

Examples

df_ex3 = innerjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )

	pl_name	gaia_id	sy_kepmag	ra	dec	solution_id	source_id	random_index
1	"Kepler-24 c"	"Gaia DR2 2052823535171095296"	14.925	290.413	38.3437	1635721458409799680	2052823535171095296	197836529
2	"Kepler-24 d"	"Gaia DR2 2052823535171095296"	14.925	290.413	38.3437	1635721458409799680	2052823535171095296	197836529
3	"Kepler-24 b"	"Gaia DR2 2052823535171095296"	14.925	290.413	38.3437	1635721458409799680	2052823535171095296	197836529
4	"Kepler-24 e"	"Gaia DR2 2052823535171095296"	14.925	290.413	38.3437	1635721458409799680	2052823535171095296	197836529

names(df_ex3)

99-element Vector{String}:
 "pl_name"
 "gaia_id"
 "sy_kepmag"
 "ra"
 "dec"
 "solution_id"
 "source_id"
 ⋮
 "radius_percentile_lower"
 "radius_percentile_upper"
 "lum_val"
 "lum_percentile_lower"
 "lum_percentile_upper"
 "datalink_url"

tip(md"Originally, both tables contained columns named `ra` and `dec`.  The joined table contains columns `ra` and `ra_1` (and `dec` and `dec_1`) because we set `makeunique`.")

Tip

Originally, both tables contained columns named ra and dec. The joined table contains columns ra and ra_1 (and dec and dec_1) because we set makeunique.

What if we didn't know the Gaia designation?

targetpos = (; ra = df_ex1.ra[1], dec = df_ex1.dec[1] )

(ra = 267.8677483, dec = -29.8764758)

url_ex4 = make_tap_query_url(gaia_query_base_url, "gaiadr3.gaia_source", where="1=contains(POINT($(targetpos.ra),$(targetpos.dec)),CIRCLE(ra,dec,30./3600.))", select_cols="*,DISTANCE(POINT($(targetpos.ra),$(targetpos.dec)),POINT(ra,dec))+AS+ang_sep",order_by_cols="ang_sep",max_rows=1000)

"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+1000+*,DISTANCE(POINT(267.8677483,-29.8764758),POINT(ra,dec))+AS+ang_sep+from+gaiadr3.gaia_source+where+1=contains(POINT(267.8677483,-29.8764758),CIRCLE(ra,dec,30./3600.))+order+by+ang_sep&format=tsv"

df_ex4 = query_to_df(url_ex4)

	solution_id	designation	source_id	random_index	ref_epoch	ra	ra_error	dec
1	1636148068921376768	"Gaia DR3 4056443366649948160"	4056443366649948160	628955980	2016.0	267.868	0.0484435	-29.8765
2	1636148068921376768	"Gaia DR3 4056443366695974528"	4056443366695974528	1783248586	2016.0	267.867	0.301148	-29.8763
3	1636148068921376768	"Gaia DR3 4056443366694853888"	4056443366694853888	82166964	2016.0	267.868	0.597072	-29.8772
4	1636148068921376768	"Gaia DR3 4056443366696002944"	4056443366696002944	586843531	2016.0	267.869	0.238754	-29.8765
5	1636148068921376768	"Gaia DR3 4056443370989465856"	4056443370989465856	1370584052	2016.0	267.869	9.60017	-29.8771
6	1636148068921376768	"Gaia DR3 4056443366695975424"	4056443366695975424	1032450251	2016.0	267.869	0.207094	-29.8754
7	1636148068921376768	"Gaia DR3 4056443366694961152"	4056443366694961152	1119708763	2016.0	267.867	0.596232	-29.8779
8	1636148068921376768	"Gaia DR3 4056443370989465728"	4056443370989465728	589362746	2016.0	267.869	5.98055	-29.8774
9	1636148068921376768	"Gaia DR3 4056443366694892160"	4056443366694892160	781324219	2016.0	267.869	0.40115	-29.8777
10	1636148068921376768	"Gaia DR3 4056443370989472384"	4056443370989472384	1372304020	2016.0	267.866	0.953113	-29.8758
...
276	1636148068921376768	"Gaia DR3 4056443370989452672"	4056443370989452672	1608917468	2016.0	267.871	5.46342	-29.8843

Wait, which row is the best match?

sort(df_ex4[!,[:designation,:ang_sep,:phot_g_mean_mag] ], :ang_sep)

	designation	ang_sep	phot_g_mean_mag
1	"Gaia DR3 4056443366649948160"	7.49048e-7	15.669
2	"Gaia DR3 4056443366695974528"	0.000588803	18.363
3	"Gaia DR3 4056443366694853888"	0.000686453	18.3986
4	"Gaia DR3 4056443366696002944"	0.00116672	18.6382
5	"Gaia DR3 4056443370989465856"	0.00118059	20.2377
6	"Gaia DR3 4056443366695975424"	0.00138877	18.5433
7	"Gaia DR3 4056443366694961152"	0.00148488	19.7284
8	"Gaia DR3 4056443370989465728"	0.00152604	20.2813
9	"Gaia DR3 4056443366694892160"	0.00153313	19.1686
10	"Gaia DR3 4056443370989472384"	0.00170832	18.3691
...
276	"Gaia DR3 4056443370989452672"	0.00831395	19.7214

df_ex1.sy_kepmag[8]

14.925

Questions?

Question

Would it be possible to use available data sets to discover planets that have not been found yet by anyone else?

Setup & Helper Code

ChooseDisplayMode()

Full Width Mode Present Mode

begin
    using CSV, DataFrames, Query
    using HTTP
    using PlutoUI, PlutoTeachingTools
end

begin
    nexsci_query_base_url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query="
    gaia_query_base_url = 
    "https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY="
end;

begin
    make_tap_query_url_url = "#" * (PlutoRunner.currently_running_cell_id |> string)
"""
`make_tap_query_url(base_url, table_name; ...)`

Returns url for a Table Access Protocol (TAP) query.
Inputs:
- base url 
- table name
Optional arguments (and default):
- `max_rows` (all)
- `select_cols` (all)
- `where` (no requirements)
- `order_by_cols` (not sorted)
- `format` (tsv)
See [NExScI](https://exoplanetarchive.ipac.caltech.edu/docs/TAP/usingTAP.html#sync) or [Virtual Observatory](https://www.ivoa.net/documents/TAP/) for more info.
"""
function make_tap_query_url(query_base_url::String, query_table::String; max_rows::Integer = 0, select_cols::String = "", where::String = "", order_by_cols::String = "", format::String="tsv" )
    
    query_select = "select"
    if max_rows > 0 
        query_select *= "+top+" * string(max_rows)
    end
    if length(select_cols) >0
        query_select *= "+" * select_cols 
    else
        query_select *= "+*"
    end
    query_from = "+from+" * query_table
    query_where = length(where)>0 ? "+where+" * where : ""
    query_order_by = length(order_by_cols) > 0 ? "+order+by+" * order_by_cols : ""
    query_format = "&format=" * format
    url = query_base_url * query_select * query_from * query_where * query_order_by * query_format
end
end

"""
`query_to_df(url)` downloads data from a URL and attempts to place it into a DataFrame
"""
query_to_df(url) = CSV.read(HTTP.get(url).body,DataFrame)

"""`replace_spaces_for_tap(str)`

Replace spaces with +'s as expected for TAP queries.
"""
replace_spaces_for_tap(s::AbstractString) = replace(s," "=>"+")

begin
    """ 
    `select_cols_for_tap(cols)`

    Returns a string of comma-separated columns names from a vector of columns names (as either strings or symbols), for using in a TAP query.
    """
    function select_cols_for_tap end
    select_cols_for_tap(cols_to_keep::AbstractVector{Symbol}) = select_cols_for_tap(string.(cols_to_keep)) #string(map(s->string(s) * "+", cols_to_keep)...)[1:end-1]
    select_cols_for_tap(cols_to_keep::AbstractVector{AS}) where {AS<:AbstractString} = string(map(s->s * ",", cols_to_keep)...)[1:end-1]
    select_cols_for_tap(col_to_keep::Symbol) = string(col_to_keep)
    select_cols_for_tap(col_to_keep::AbstractString) = col_to_keep
end

select_cols_for_tap (generic function with 4 methods)

Built with Julia 1.8.2 and

CSV 0.10.4
DataFrames 1.3.6
HTTP 1.4.0
PlutoTeachingTools 0.2.3
PlutoUI 0.7.43
Query 1.0.0

To run this tutorial locally, download this file and open it with Pluto.jl.