Queries & Data Wrangling

Astro 497, Week 8, Monday

TableOfContents()

Databases

Simplest form

  • Store data

  • Retrieve data

  • Commonly implemented as a set of tables

    • Columns contain different fields (e.g., ID, magnitude, RA, Dec)

    • Rows contain entries (e.g., 51 Pegasi, Kepler-10, HAT-P-13,... )

Value-added features

  • Return subset of data efficiently

  • Many strategies for how to filter data (e.g., order of operations)

  • Database server can use heuristics to pick good strategy

  • Allow for transactions to update database

Fundamental properties of databases

  • Atomicity: All part of a transaction succeed, or the database is rolled back to its previous state

  • Consistency: Data in database always satisfies its validation rules

  • Isolation: Even if multiple transactions are made concurrently, there is no interference between transactions

  • Durability: Once a transaction is committed, it will remain committed

SQL-based Database Servers

Open-source:

  • MySQL

  • PostgreSQL & Greenplum

Commercial:

  • Microsoft SQL Server:

  • IBM DB2

  • Oracle Database

  • ...

Continuing innovation in database systems

  • SciDB (array & disk based database)

  • MonetDB (column store)

  • JuliaDB (pure Julia, for persistent data)

When selecting a database for a project, consider:

  • How much data is to be stored?

  • How frequent/large will transactions be?

  • Are there specific hardware or OS requirements?

  • Does the team have someone dedicated to supporting database?

Database Clients

  • One database server many clients simultaneously

  • Different clients can use different interfaces

    • Command line

    • Webpage

    • URL-based

    • Custom Graphical user interface (GUI)

      • TopCat

Queries

Query: An expression that requests database to return a specific subset of data.

Query languages:

  • Structured Query Language (SQL): Dated, but by far the most common

  • Astronomical Data Query Language (ADQL): Astronomy-specific

  • Language Integrated Query (LINQ): Microsoft-supported

  • Many more

SQL essentials

  • Selecting (columns)

  • Filtering (for rows)

  • Joining (multiple tables)

  • Aggregating (rows within a table)

SQL programming

  • Variables

  • Functions

  • Procedures

  • Data management

  • Transactions

Virtual Observatory (VO)

Defines standards that help astronomers to collaborate effectively, emphasizing working with multiple data sources.

  • Astronomical Data Query Language (ADQL)

  • Table Access Protocol (TAP)

Astronomy-specific functions in ADQL

  • AREA

  • BOX

  • CENTROID

  • CIRCLE

  • CONTAINS

  • COORD1

  • COORD2

  • COORDSYS

  • DISTANCE

  • INTERSECTS

  • POINT

  • POLYGON

  • REGION

Example Bad SQL Query

Do not send

select * from SomeTable

Why?

Example SQL Queries

Take a quick peak two columns data for first few entries

select top 10 X, Y from SomeTable 

Find extreme values of X

select top 10 X, Y 
from SomeTable 
order by X

Filter which rows are returned using expression

select top 10 x,y
from SomeTable 
where x*x+y*y between 0 and 1
order by x 

Check how many rows are in a table

select COUNT(designation) as N from gaiadr2.gaia_source

Check how many rows satisfy a filter

select COUNT(designation) as n, AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave
from gaiadr2.gaia_source
where phot_g_mean_mag < 14

Grouping data to make a histogram

select COUNT(designation) as N, 
       AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave,  
       AVG(phot_g_mean_mag) as phot_g_mean_mag_ave,
       ROUND(phot_g_mean_mag,1) as bin
from gaiadr2.gaia_source
where phot_g_mean_mag < 14
group by bin 
order by bin

Table Access Protocol (TAP)

Start with ADQL

SELECT <column list> FROM <table> WHERE <constraints>

but transform it into a url, by

  1. Prepend a base service url

  2. convert spaces to +'s

  3. Deal with other special characters (e.g., +, quotes)

  4. Optionally, specify format for results

e.g.,

https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,pl_masse,ra,dec+from+ps
url_ex1 = make_tap_query_url(nexsci_query_base_url, "ps", select_cols="pl_name,gaia_id,sy_kepmag,ra,dec", where="default_flag=1")
"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,gaia_id,sy_kepmag,ra,dec+from+ps+where+default_flag=1&format=tsv"
df_ex1 = query_to_df(url_ex1)
pl_name gaia_id sy_kepmag ra dec
1 "OGLE-TR-10 b" "Gaia DR2 4056443366649948160" missing 267.868 -29.8765
2 "BD-08 2823 c" "Gaia DR2 3770419611540574080" missing 150.197 -9.51657
3 "HR 8799 c" "Gaia DR2 2832463659640297472" missing 346.87 21.134
4 "HD 110014 b" "Gaia DR2 3676091134604409728" missing 189.811 -7.99567
5 "HIP 5158 b" "Gaia DR2 2351405057377686272" missing 16.5095 -22.4536
6 "HD 44219 b" "Gaia DR2 3001428566419966592" missing 95.06 -10.7251
7 "HD 132563 b" "Gaia DR2 1585765117538284800" missing 224.589 44.0429
8 "Kepler-24 c" "Gaia DR2 2052823535171095296" 14.925 290.413 38.3437
9 "CHXR 73 b" "Gaia DR2 5201175987817179136" missing 166.619 -77.6259
10 "alf Ari b" missing missing 31.7933 23.4624
...
5197 "TOI-1749 d" "Gaia DR2 2253774094189458432" missing 282.737 64.4195
desig = replace_spaces_for_tap(df_ex1.gaia_id[8])
"Gaia+DR2+2052823535171095296"
url_ex2 = make_tap_query_url(gaia_query_base_url, "gaiadr2.gaia_source", where="designation='$(desig)'",select_cols="*",max_rows=5)
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+5+*+from+gaiadr2.gaia_source+where+designation='Gaia+DR2+2052823535171095296'&format=tsv"
df_ex2 = query_to_df(url_ex2)
solution_id designation source_id random_index ref_epoch ra ra_error dec ...
1 1635721458409799680 "Gaia DR2 2052823535171095296" 2052823535171095296 197836529 2015.5 290.413 0.020076 38.3437

Joins

Joining tables is a fundamental concept that can be applied either to DataFrames stored locally or as part of SQL/ADQL queries.

  • innerjoin & semijoin: Return rows for values of the key that exist in both tables

  • outerjoin: Return rows for values of the key that exist in either table

  • leftjoin: Return rows for values of the key that exist in first table

  • rightjoin: Return rows for values of the key that exist in second table

  • antijoin: Return rows Return rows for values of the key that exist in first table but not the second table

  • crossjoin: Return table with every row from first table as rows and every row from second table as columns

Examples

df_ex3 = innerjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
pl_name gaia_id sy_kepmag ra dec solution_id source_id random_index ...
1 "Kepler-24 c" "Gaia DR2 2052823535171095296" 14.925 290.413 38.3437 1635721458409799680 2052823535171095296 197836529
2 "Kepler-24 d" "Gaia DR2 2052823535171095296" 14.925 290.413 38.3437 1635721458409799680 2052823535171095296 197836529
3 "Kepler-24 b" "Gaia DR2 2052823535171095296" 14.925 290.413 38.3437 1635721458409799680 2052823535171095296 197836529
4 "Kepler-24 e" "Gaia DR2 2052823535171095296" 14.925 290.413 38.3437 1635721458409799680 2052823535171095296 197836529
names(df_ex3)
99-element Vector{String}:
 "pl_name"
 "gaia_id"
 "sy_kepmag"
 "ra"
 "dec"
 "solution_id"
 "source_id"
 ⋮
 "radius_percentile_lower"
 "radius_percentile_upper"
 "lum_val"
 "lum_percentile_lower"
 "lum_percentile_upper"
 "datalink_url"
tip(md"Originally, both tables contained columns named `ra` and `dec`.  The joined table contains columns `ra` and `ra_1` (and `dec` and `dec_1`) because we set `makeunique`.")
Tip

Originally, both tables contained columns named ra and dec. The joined table contains columns ra and ra_1 (and dec and dec_1) because we set makeunique.

What if we didn't know the Gaia designation?

targetpos = (; ra = df_ex1.ra[1], dec = df_ex1.dec[1] )
(ra = 267.8677483, dec = -29.8764758)
url_ex4 = make_tap_query_url(gaia_query_base_url, "gaiadr3.gaia_source", where="1=contains(POINT($(targetpos.ra),$(targetpos.dec)),CIRCLE(ra,dec,30./3600.))", select_cols="*,DISTANCE(POINT($(targetpos.ra),$(targetpos.dec)),POINT(ra,dec))+AS+ang_sep",order_by_cols="ang_sep",max_rows=1000)
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+1000+*,DISTANCE(POINT(267.8677483,-29.8764758),POINT(ra,dec))+AS+ang_sep+from+gaiadr3.gaia_source+where+1=contains(POINT(267.8677483,-29.8764758),CIRCLE(ra,dec,30./3600.))+order+by+ang_sep&format=tsv"
df_ex4 = query_to_df(url_ex4)
solution_id designation source_id random_index ref_epoch ra ra_error dec ...
1 1636148068921376768 "Gaia DR3 4056443366649948160" 4056443366649948160 628955980 2016.0 267.868 0.0484435 -29.8765
2 1636148068921376768 "Gaia DR3 4056443366695974528" 4056443366695974528 1783248586 2016.0 267.867 0.301148 -29.8763
3 1636148068921376768 "Gaia DR3 4056443366694853888" 4056443366694853888 82166964 2016.0 267.868 0.597072 -29.8772
4 1636148068921376768 "Gaia DR3 4056443366696002944" 4056443366696002944 586843531 2016.0 267.869 0.238754 -29.8765
5 1636148068921376768 "Gaia DR3 4056443370989465856" 4056443370989465856 1370584052 2016.0 267.869 9.60017 -29.8771
6 1636148068921376768 "Gaia DR3 4056443366695975424" 4056443366695975424 1032450251 2016.0 267.869 0.207094 -29.8754
7 1636148068921376768 "Gaia DR3 4056443366694961152" 4056443366694961152 1119708763 2016.0 267.867 0.596232 -29.8779
8 1636148068921376768 "Gaia DR3 4056443370989465728" 4056443370989465728 589362746 2016.0 267.869 5.98055 -29.8774
9 1636148068921376768 "Gaia DR3 4056443366694892160" 4056443366694892160 781324219 2016.0 267.869 0.40115 -29.8777
10 1636148068921376768 "Gaia DR3 4056443370989472384" 4056443370989472384 1372304020 2016.0 267.866 0.953113 -29.8758
...
276 1636148068921376768 "Gaia DR3 4056443370989452672" 4056443370989452672 1608917468 2016.0 267.871 5.46342 -29.8843

Wait, which row is the best match?

sort(df_ex4[!,[:designation,:ang_sep,:phot_g_mean_mag] ], :ang_sep)
designation ang_sep phot_g_mean_mag
1 "Gaia DR3 4056443366649948160" 7.49048e-7 15.669
2 "Gaia DR3 4056443366695974528" 0.000588803 18.363
3 "Gaia DR3 4056443366694853888" 0.000686453 18.3986
4 "Gaia DR3 4056443366696002944" 0.00116672 18.6382
5 "Gaia DR3 4056443370989465856" 0.00118059 20.2377
6 "Gaia DR3 4056443366695975424" 0.00138877 18.5433
7 "Gaia DR3 4056443366694961152" 0.00148488 19.7284
8 "Gaia DR3 4056443370989465728" 0.00152604 20.2813
9 "Gaia DR3 4056443366694892160" 0.00153313 19.1686
10 "Gaia DR3 4056443370989472384" 0.00170832 18.3691
...
276 "Gaia DR3 4056443370989452672" 0.00831395 19.7214
df_ex1.sy_kepmag[8]
14.925

Questions?

Question

Would it be possible to use available data sets to discover planets that have not been found yet by anyone else?

Setup & Helper Code

ChooseDisplayMode()
     
begin
    using CSV, DataFrames, Query
    using HTTP
    using PlutoUI, PlutoTeachingTools
end
begin
    nexsci_query_base_url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query="
    gaia_query_base_url = 
    "https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY="
end;
begin
    make_tap_query_url_url = "#" * (PlutoRunner.currently_running_cell_id |> string)
"""
`make_tap_query_url(base_url, table_name; ...)`

Returns url for a Table Access Protocol (TAP) query.
Inputs:
- base url 
- table name
Optional arguments (and default):
- `max_rows` (all)
- `select_cols` (all)
- `where` (no requirements)
- `order_by_cols` (not sorted)
- `format` (tsv)
See [NExScI](https://exoplanetarchive.ipac.caltech.edu/docs/TAP/usingTAP.html#sync) or [Virtual Observatory](https://www.ivoa.net/documents/TAP/) for more info.
"""
function make_tap_query_url(query_base_url::String, query_table::String; max_rows::Integer = 0, select_cols::String = "", where::String = "", order_by_cols::String = "", format::String="tsv" )
    
    query_select = "select"
    if max_rows > 0 
        query_select *= "+top+" * string(max_rows)
    end
    if length(select_cols) >0
        query_select *= "+" * select_cols 
    else
        query_select *= "+*"
    end
    query_from = "+from+" * query_table
    query_where = length(where)>0 ? "+where+" * where : ""
    query_order_by = length(order_by_cols) > 0 ? "+order+by+" * order_by_cols : ""
    query_format = "&format=" * format
    url = query_base_url * query_select * query_from * query_where * query_order_by * query_format
end
end
"""
`query_to_df(url)` downloads data from a URL and attempts to place it into a DataFrame
"""
query_to_df(url) = CSV.read(HTTP.get(url).body,DataFrame)
"""`replace_spaces_for_tap(str)`

Replace spaces with +'s as expected for TAP queries.
"""
replace_spaces_for_tap(s::AbstractString) = replace(s," "=>"+")
begin
    """ 
    `select_cols_for_tap(cols)`

    Returns a string of comma-separated columns names from a vector of columns names (as either strings or symbols), for using in a TAP query.
    """
    function select_cols_for_tap end
    select_cols_for_tap(cols_to_keep::AbstractVector{Symbol}) = select_cols_for_tap(string.(cols_to_keep)) #string(map(s->string(s) * "+", cols_to_keep)...)[1:end-1]
    select_cols_for_tap(cols_to_keep::AbstractVector{AS}) where {AS<:AbstractString} = string(map(s->s * ",", cols_to_keep)...)[1:end-1]
    select_cols_for_tap(col_to_keep::Symbol) = string(col_to_keep)
    select_cols_for_tap(col_to_keep::AbstractString) = col_to_keep
end
select_cols_for_tap (generic function with 4 methods)

Built with Julia 1.8.2 and

CSV 0.10.4
DataFrames 1.3.6
HTTP 1.4.0
PlutoTeachingTools 0.2.3
PlutoUI 0.7.43
Query 1.0.0

To run this tutorial locally, download this file and open it with Pluto.jl.

To run this tutorial locally, download this file and open it with Pluto.jl.

To run this tutorial locally, download this file and open it with Pluto.jl.

To run this tutorial locally, download this file and open it with Pluto.jl.