SQL Tutorial Level 2
Sloan Digital Sky Survey III
SkyServer DR9
 
DR9 Help
 Start Here
 FAQ
 Glossary
 Cooking with Sloan
 Tool User Guides
 Contact Help Desk
 
 Searching SDSS-III
 SQL Tutorial
 SQL in SkyServer
 About the Database
 Table Descriptions
 Sample SQL Queries
 Schema Browser
 Searching Advice
 Query Limits
 
 Details of the Data
 Algorithms
 Data Publications
 API
Using SQL with SkySever

SQL is the Structured Query Language, a standard means of asking for data from databases, and is used to query the Catalog Archive Server (CAS). This page provides a brief overview of SQL. Query examples are also available, with comments, as well as a page of links to more detailed off-site documentation.


Database Fundamentals    Back to Top

The CAS provides access to SDSS-III catalog data that is stored in a relational databse managment system (DBMS) for data integrity and access speed. The data is organized into tables in a relational database. The SkyServer is the Web portal to the CAS databases and allows you to submit SQL queries to extract the data that you need from these databases. However, you do not usually need to specify which database your query is run on, since the SkyServer is configured by default to submit your queries to a particular database. This site is configured to submit your queries to the BESTDR9 database, which contains the best data and most recent processings for the entire released sky area.

The BESTDR9 database contains a large number of tables, some of which contain photometric measurements (such as PhotoObj), spectroscopic measurements (such as SpecObj), or information about the observing conditions (Field) or survey geometry(TileBoundary). See the data model page for more details.

In addition to the tables, we have defined Views, which are subsets or combinations of the data stored in the tables. Views are queried the same way Tables are; they exist just to make your life easier. For instance, the view Galaxy can be used to get photometric data on objects we classify as galaxies, without having to specify the classification in your query.

Both the Skyserver and CasJobs interfaces have a Schema Browser. It shows you all of the available databases, the tables in each database, and the quantities stored in each column of the tables.

Finally, we have created a variety of functions and stored procedures which let you easily perform some common operations. Usually, their names are prefixed by f or sp, like in fPhotoStatus or spGetFiberList. The full list of functions and store procedures is found in the Schema Browser. Note that some functions are scalar-valued, meaning that they return a single value, while others (such as the commonly used dbo.fGetNearbyObjEq, are table-valued; they actually return a table of data, and not a single number. This is important when interpreting the returned data and performing joins.

Please note the caution about using function calls as noted in the Optimizing Queries section when attempting queries over that return a large number of objects.


Query Fundamentals    Back to Top

Now that we have an overview of the database structure, how do we actually get data out? You will have to write a query using SQL. The most basic query consists of three parts:

  1. A SELECT clause, which specifies the parameters you wish to retrieve;
  2. A FROM clause, which specifies the database tables you want to extract the data from;
  3. A WHERE clause, which specifies the limitations/predicates you want to place on the extracted data.

The WHERE clause is not necessary if you want to retrieve parameters of all objects in a specified table, but this typically will be an overwhelming amount of data!

Note that the query language is insensitive to splitting the query over many lines. It is also not case sensitive. To make queries more readable, it is common practice to write the distinct query clauses on separate lines. The Sample Queries button on the CasJobs Query page provides a variety of samples, ordered in complexity. For instance, to obtain the list of unique Fields that have been loaded into the database, we use:

SELECT FieldID
FROM Field

You can just copy and paste this (or any other) query into the SQL Search window of SkyServer, and press submit, or into the CasJobs query window, and press the submit button.

If we want to retrieve multiple parameters from the database, we separate them with commas:

SELECT ra,dec 
FROM Galaxy

Of course, the parameters you request must be included in the table(s) you are querying! Now, let's say we want magnitudes of all bright galaxies. We will need to specify a magnitude range to do this:

SELECT u,g,r,i,z
FROM Galaxy
WHERE r<12 and r>0

Here, we have used the WHERE clause to provide a magnitude range. The and operator is used to require that multiple limits be met. This leads us to...


Simple Logical and Mathematical Operators    Back to Top

Not only can we place limits on individual parameters, we can place multiple limits using logical operators, as well as place limits on the results of mathematical operations on multiple parameters. We may also retrieve results that are logical joins of multiple queries. Here we list the logical, comparison, and mathematical operators.

The LOGICAL operators are AND,OR,NOT; they work as follows:

aba AND ba OR b
TRUETRUETRUETRUE
TRUEFALSEFALSETRUE
TRUENULLNULLTRUE
FALSEFALSEFALSEFALSE
FALSENULLFALSENULL
NULLNULLNULLNULL

When comparing values, you will use the COMPARISON operators:

OperatorDescription
< less than
> greater than
<= less than or equal to
>= greater than or equal to
= equal
<> or != not equal

In addition to the comparison operators, the special BETWEEN construct is available.
a BETWEEN x AND y is equivalent to a >= x AND a <= y

Similarly,
a NOT BETWEEN x AND y is equivalent to a < x OR a > y

Finally, the MATHEMATICAL operators (both numeric and bitwise) are:

NameDescriptionExampleResult
+ Addition2 + 35
- Subtraction2 - 3-1
* Multiplication2 * 36
/ Division 4 / 22
% Modulo (remainder)5 % 41
POWER ExponentiationPOWER (2.0,3.0)8.0
SQRT Square rootSQRT (25.0)5.0
ABS Absolute valueABS (-5.0)5.0
& Bitwise AND91 & 15
01011011 & 00001111
11
00001011
| Bitwise OR32 | 3
00100000 | 00000011
35
00100011
^ Bitwise XOR17 # 5
00010001 # 00000101
20
00010100
~ Bitwise NOT~1-2
AVG AverageAVG(ModelMag_r)
MIN MinimumMIN(ModelMag_r)
MAX MaximumMAX(ModelMag_r)
LOG Natural LogarithmLOG(petroMag_r)
LOG10 Base-10 LogarithmLOG10(petroMag_r)
EXP Exponential Value (ex)EXP(2.5)12.182494

In addition, the usual mathematical and trigonometric functions are available in SQL, such as COS, SIN, TAN, ACOS, etc..


Querying Bit Flags    Back to Top

Several SDSS tables contain bit-encoded flags to indicate various types of information about the object or quantity in question (e.g., the PhotoObjAll table and the PhotoTag view each have the flags field, SpecObj has zWarning flags etc.).

One of the most important uses of bit flags is to indicate why an object was targeted for spectroscopy. A list of spectroscopic target flags is available on the DR9 Spectroscopic Target Flags page.

This section describes how you can test for flag values in your query. For sample queries that demonstrate the use of flags, see the Errors using flags, Elliptical galaxies with model fits, Diameter limited sample, LRG sample, and Clean photometry with flags sample queries for examples on how to use flags.

Checking a single flag    Back to Top

To return rows for which the flag is set, the basic syntax for the constraint is:

(flag-column & bitmask) > 0
and to return rows for which the flag is not set:
(flag-column & bitmask) = 0
where bitmask is the binary value in which the bit corresponding to the flag is 1 and all other bits are 0. You can use the flag functions provided by the SkyServer (listed in Schema Browser) to get the bitmask for a given flag, but if you are scanning a large fraction of a large table like PhotoObj, you are better off not making a function call for each row, and in that case you should first get the actual binary value of the bitmask first and substitute that instead. This is described in the Using dbo functions in your query subsection of the Optimizing Queries section below.

For example, to select objects for which the BLENDED flag is set in PhotoTag, you would use a query like:
SELECT top 10 objid, flags FROM PhotoTag
WHERE flags & dbo.fPhotoFlags('BLENDED') > 0

and to select only objects for which the flag is NOT set, use

SELECT top 10 objid, flags FROM PhotoTag
WHERE flags & dbo.fPhotoFlags('BLENDED') = 0 

Checking multiple flags    Back to Top

To test if multiple flags are set, you can combine the values by adding them and then testing the result.

To select objects for which all of several flags are set, generate the combined bitmask by adding the individual flag bitmasks, then compare the result of ANDing the combined bitmask with the flag column with the combined bitmask itself, e.g.,

SELECT top 10 objid, flags FROM PhotoTag
WHERE 
    ( flags & (dbo.fPhotoFlags('NODEBLEND')
               + dbo.fPhotoFlags('BINNED1')
	       + dbo.fPhotoFlags('BINNED2')) ) 
	= ( dbo.fPhotoFlags('NODEBLEND')
            + dbo.fPhotoFlags('BINNED1')
	    + dbo.fPhotoFlags('BINNED2') )

To select objects for which at least one of several flags is set, you just need to check that ANDing the combined bitmask with the flag column returns a non-zero result, e.g.,

SELECT top 10 objid, flags FROM PhotoTag
WHERE 
    ( flags & (dbo.fPhotoFlags('NODEBLEND')
               + dbo.fPhotoFlags('BINNED1')
               + dbo.fPhotoFlags('BINNED2')) ) > 0

To select objects for which none of several flags is set, the result of ANDing the flag column with the combined bitmask must be 0, e.g.,

SELECT top 10 objid, flags FROM PhotoTag
WHERE 
    ( flags & (dbo.fPhotoFlags('NODEBLEND')
               + dbo.fPhotoFlags('BINNED1')
               + dbo.fPhotoFlags('BINNED2')) ) = 0


Clean Photometry    Back to Top

The SDSS photo pipeline sets a number of flags that indicate the quality of the photometry for a given object in the catalog. If you desire objects with only clean photometry for science, you should be aware that you need to filter out unwanted objects yourself in your query. This is not done automatically for you (e.g. with a view of the PhotoObjAll table). The main reason is that the flag constraints that are required for this filtering often impose a significant performance penalty on your query .

Please see the Clean Photometry sample query for help on how to use the photometry flags to select only objects with clean photometry.

Excluding Invalid Data Values    Back to Top

As mentioned in the EDR Paper, the database designates quantities that are not calculated for a particular object in a table with special values, as follows:

  • The value of a quantity that has not been calculated is set to -9999.
  • The value of an error that has not been calculated is set to -1000.
To exclude such invalid values from your query result, you should include constraints in your WHERE clause explicity to filter them out, e.g.

SELECT ra,dec,u,err_u FROM PhotoObj
WHERE 
	ra BETWEEN 180 AND 181
	AND dec BETWEEN -0.5 AND 0.5
	AND u BETWEEN -9999 AND 20.0 	-- or "u > -9999 AND u < 20.0", 
					-- instead of just "u < 20.0"
	AND err_u BETWEEN -1000 AND 0.1	-- or err_u > -1000 AND err_u < 0.1, 
					-- instead of just "err_u < 0.1"


Changing Precision of Query Output    Back to Top

Use the STR(column,n,d) SQL construct (where n is the total number of digits and d is the number of decimal places) to set the precision of the column that your query requests. The SkyServer returns values with a default precision that is set for each data type, and this may not be enough for columns like ra, dec etc. See the Selected neighbors in run or the Uniform Quasar Sample sample queries for examples of how to use STR.


Joins: Querying With Multiple Tables    Back to Top

You may wish to obtain quantities from multiple tables, or place constraints on quantities in one table while obtaining measurements from another. For instance, you may want magnitudes (from PhotoObj) from all objects spectroscopically identified (SpecObj) as galaxies. To perform these types of queries, you must use a join. You can join any two (or more) tables in the databases as long as they have some quantity in common (typically an object or field ID). To actually perform the join, you must have a JOIN subclause in the FROM clause of your query that specifies the common quantity to be equal in the two tables. Here is an example, getting the g magnitudes for stars in fields where the PSF fitting worked well:

SELECT TOP 10 s.psfMag_g
FROM Star s JOIN Field f ON s.fieldID = f.fieldID
WHERE
  s.psfMag_g < 20 
  AND  f.pspStatus = 2

Notice how we define abbreviations for the table names in the FROM clause; this is not necessary but makes for a lot less typing. Also, you do not have to ask for quantities to be returned from all the tables. You must specify all the tables on which you place constraints (including the join) in the FROM clause, but you can use any subset of these tables in the SELECT. If you use more than two tables, they do not all need to be joined on the same quantity. For instance, this three way join is perfectly acceptable:

SELECT TOP 10 p.objID,f.field,z.z
FROM PhotoObj p 
   JOIN Field f ON f.fieldid = p.fieldid
   JOIN photoz z ON p.objid = z.objid
WHERE
   f.psfWidth_r > 1.2
   AND  p.colc > 400.0

The type of joins shown above are called inner joins. In the above examples, we only return those objects which are matched between the multiple tables. If we want to include all rows of one of the tables, regardless of whether or not they are matched to another table, we must perform an outer join. One example is to get photometric data for all objects, while getting the spectroscopic data for those objects that have spectroscopy.

In the example below, we perform a left outer join, which means that we will get all entries (regardless of matching) from the table on the left side of the join. In the example below, the join is on P.objID = s.BestObjID; therefore, we will get all photometric (P) objects, with data from the spectroscopy if it exists. If there is no spectroscopic data for an object, we'll still get the photometric measurements but have nulls for the corresponding cpectroscopy.

select top 100 P.objID, P.ra, P.dec, S.SpecObjId, S.ra, S.dec
from PhotoObj as P left outer join SpecObjAll as S
on P.objID = s.BestObjID

When using table valued functions, you must do the join explicitly (rather than using "="). To do this, we use the syntax
SELECT quantities
FROM table1
JOIN table2 on table1.quantity = table2.quantity
WHERE constraints

For instance, in the example below, we use the function dbo.fGetNearbyObjEq to get all objects within a given radius (in this case, 1') of a specified coordinate. This is a table-valued, so it returns a table, containing the ObjIDs and distances of nearby objects. We want to get further photometric parameters on the returned objects, so we must join the output table with PhotoObj.:

SELECT G.objID, GN.distance
FROM Galaxy as G
JOIN dbo.fGetNearbyObjEq(115.,32.5, 1.0) AS GN
  ON G.objID = GN.objID 
WHERE (G.flags & dbo.fPhotoFlags('saturated')) = 0

Manipulating Query Output    Back to Top

SQL provides a number of ways to reorder, group, or otherwise arrange the output of your queries. Some of these options are:

  • count: Just tell me how many objects would be returned by my query. You can specify a column name as the argument to the count function or just "*" to mean all columns. It doesn't really matter (unless you are including a DISTINCT qualifier, see below) since it will count all rows that match your query anyway. Example:

    SELECT count(*) 
    FROM Galaxy
    WHERE ra between 180.1 and 180.5
    

  • distinct: Return only the unique values of the quantities requested in the SELECT statement. Example:

    SELECT distinct run 
    FROM Field
    

    Here a COUNT would return different counts depending on the column you selected, e.g.
    SELECT count(distinct run) 
    FROM Field
    

    would return a different number in general from:
    SELECT count(distinct field) 
    FROM Field
    

  • top: Return only the first n rows of the query results. We've already been using this above. Note that the rows selected in the TOP x are not deterministic, i.e., there is no order implied in the rows selected. This is true of all database queries in general. You have to explicitly enforce an order in your query if you want, and it is an expensive option in terms of query execution time usually (see ORDER BY clause below). Example:

    SELECT top 10 r
    FROM Star
    

  • order by: Order the output by the specified quantities. Default is ascending order, but you can specify descending as well. You can also order by multiple columns. Example:

    SELECT top 10 u,g,r
    FROM Star
    order by g,r desc
    

    Note how repeatedly executing this query returns the same 100 rows. This is not true of the previous query, especially if you run it at different times so the cache does not come into play..

  • group by: Group the output by the specified quantities. For instance, you could have all the stars in the output, followed by the galaxies. You could also perform operations on the grouped quantities. You could get the min and max magnitudes for stars and galaxies separately, as shown below:

    SELECT min(r),max(r)
    FROM PhotoPrimary
    group by type
    
    You can use this to count how many of each object type is loaded as primary photometric objects, for instance:

    SELECT count(r)
    FROM PhotoPrimary
    group by type
    


Optimizing Queries    Back to Top

It is easy to construct very complex queries which can take a long time to execute. When writing queries, one can often rewrite them to run faster. This is called optimization.

The first, and most trivial, optimization trick is to use the minimal Table or View for your query. For instance, if all you care about are galaxies, use the Galaxy view in your FROM clause, instead of PhotoObj. We have also created a 'thin' version of PhotoObjAll, called PhotoTag. This vertical subset contains all the objects in PhotoObjAll, but only a subset of the measured quantities. Using the PhotoTag view to speed up the query only makes sense if you do NOT want parameters that are only available in the full PhotoObjAll.

It is extremely useful to think about how a database handles queries, rather than trying to write a plain, sequential list of constraints. NOT every query that is syntactically correct will necessarily be efficient; the built-in query optimizer is not perfect! Thus, writing queries such that they use the tricks below can produce significant speed improvements.

Using indices in your query    Back to Top

Another simple way to make queries faster is to use indexed quantities to search on. There are two types of indices in the CAS, indices built into the database, and an external spatial index that we have added to make spatial searches much faster, called the Hierarchical Triangular Mesh.. The latter is explicitly invoked by using the built-in spatial search functions like fGetNearbyObjEq (does a radial search in equatorial coordinates), or fGetObjFromRectEq (searches in a rectangular area). Database indices are invoked automatically when you include columns in your search (in the WHERE clause) that have indices built on them. There are 3 types of database indices:

  1. Primary Key (PK) indices - these are also called clustered indices because the data is physically arranged (clustered) on disk in the ascending order of this key. Each table has exactly one clustered or PK index, built on its primary search key. For most of the photometric data tables, this is the objID column, and for most of the spectroscopic data tables, this is the specObjID column.
  2. Foreign Key (FK) indices - these are indices built on columns that define a relationship with other tables, e.g. objID in the SpecObjAll table is a foreign key on the PhotoObjAll table. Having FK indices helps to speed up queries that include a JOIN between the SpecObjAll table (and its associated views) and the PhotoObjAll table (and its views).
  3. Covering Indices - these are indices created on (groups of) columns that are frequently used (together). Thus they "cover" the search space defined by those columns. They may be unique or non-unique (allow duplicates).
All database indices defined on a given table in the CAS are listed in the Schema Browser under the Indices tab. If you click on the the Indices tab, you will get a listing of all indices for all tables. If you want to view the indices for a single table, expand the the Indices tab, scroll down the list to the table you want and just click on the table name to see all its indices.

If you must search on non-indexed columns in addition to the indexed ones, you can still benefit by first performing a query using only the indexed quantities, and then select those parameters from the returned subset of objects. An indexed quantity is one where a look-up table has effectively been calculated, so that the database software does not have to do a time-consuming sequential search through all the objects in the table. For instance, sky coordinates cx,cy,cz are indexed using a Hierarchical Triangular Mesh (HTM). So, you can make a query faster by rewriting it such that it is nested; the inner query grabs the entire row for objects of interest based on the indexed quantities, while the outer query then gets the specific quantities desired.

Using dbo functions in your query    Back to Top

Finally, a word of caution about using function calls in queries. If your query is going to match a large number of objects (million or more), using a function call, especially one that operates on a constant or literal, in the WHERE clause is not a good idea, because the function will be called once per matching row in that table, resulting in a significant performance hit. Here is an example of this:

SELECT ... 
FROM PhotoObj
WHERE
    flags & dbo.fPhotoFlags('BLENDED') > 0

In this case, it would be better to first do the pre-query:

SELECT dbo.fPhotoFlags('BLENDED')

to get the bitmask value for that flag, and then rewrite the above query as:

SELECT ... 
FROM PhotoObj
WHERE
    flags & 8 > 0

This will avoid the wastefully repeated function call for each and every photobj in the table.


Roy Gal, Ani Thakar, Jim Gray, Alex Szalay
Last updated Apr 8, 2004.