Databases/SQL tutorial

Training materials on working with large datasets using SQL, R, and Python. See the top menu for specific topics.


1 Introduction to SQL

1.1 Getting started

Here is a simple query that selects the first five rows (and all columns, based on the * wildcard) from the questions table.

select * from questions limit 5

To run this from R, we provide the SQL syntax as a string, passed as the second argument to dbGetQuery.

library(RSQLite)
drv <- dbDriver("SQLite")
dir <- 'data' # relative or absolute path to where the .db file is
dbFilename <- 'stackoverflow-2021.db'
db <- dbConnect(drv, dbname = file.path(dir, dbFilename))
dbGetQuery(db, "select * from questions limit 5")
##   questionid        creationdate score viewcount answercount
## 1   65534165 2021-01-01 22:15:54     0       112           2
## 2   65535296 2021-01-02 01:33:13     2      1109           0
## 3   65535910 2021-01-02 04:01:34    -1       110           1
## 4   65535916 2021-01-02 04:03:20     1        35           1
## 5   65536749 2021-01-02 07:03:04     0       108           1
##   commentcount favoritecount                               title
## 1            0            NA     Can't update a value in sqlite3
## 2            0            NA Install and run ROS on Google Colab
## 3            8             0       Operators on date/time fields
## 4            0            NA          Plotting values normalised
## 5            5            NA     Export C# to word with template
##    ownerid
## 1 13189393
## 2 14924336
## 3   651174
## 4 14695007
## 5 14899717

Now let’s see some more interesting usage of other SQL syntax.

First we get the questions that are viewed the most by filtering to the rows for which the ‘viewcount’ is greater than 100000. We’ll limit the results to the first 5 rows so we don’t print too much out.

dbGetQuery(db, "select * from questions where viewcount > 100000 limit 5")
##   questionid        creationdate score viewcount answercount
## 1   65547199 2021-01-03 06:22:52   124    110832           7
## 2   65549858 2021-01-03 12:30:19    52    130479          11
## 3   65630743 2021-01-08 14:20:57    77    107140          19
## 4   65632698 2021-01-08 16:22:59    74    101044           9
## 5   65896334 2021-01-26 05:33:33   111    141899          12
##   commentcount favoritecount
## 1            2             0
## 2            0             0
## 3            4             0
## 4            1             0
## 5            7             0
##                                                                                   title
## 1                                                          Using Bootstrap 5 with Vue 3
## 2 "ERESOLVE unable to resolve dependency tree" when installing npm react-facebook-login
## 3                          How to solve flutter web api cors error only with dart code?
## 4                                            How to open a link in a new Tab in NextJS?
## 5                              Python Pip broken with sys.stderr.write(f"ERROR: {exc}")
##    ownerid
## 1 11232893
## 2 12425004
## 3 12373446
## 4  9578961
## 5   202576

Next, let’s find the number of views for the 15 questions viewed the most.

dbGetQuery(db, "select viewcount from questions 
                order by viewcount desc limit 15")
##    viewcount
## 1    1236876
## 2     816368
## 3     610026
## 4     505992
## 5     458856
## 6     445775
## 7     426798
## 8     315861
## 9     307961
## 10    303399
## 11    296364
## 12    286886
## 13    286810
## 14    278432
## 15    276806

Let’s lay out the various verbs in SQL. Here’s the form of a standard query (but note that the sorting done by ORDER BY is computationally expensive and should be used sparingly):

SELECT <column(s)> FROM <table> WHERE <condition(s) on column(s)> ORDER BY <column(s)>

SQL keywords are often written in ALL CAPITALS by convention, although I won’t necessarily do that in this tutorial.

And here is a table of some important keywords:

Keyword | What it does
SELECT | select columns
FROM | which table to operate on
WHERE | filter (choose) rows satisfying certain conditions
LIKE, IN, <, >, =, <=, >=, !=, etc. | used as part of filtering conditions
ORDER BY | sort based on columns

Some other keywords are: DISTINCT, ON, JOIN, GROUP BY, AS, USING, UNION, INTERSECT, HAVING, SIMILAR TO (not available in SQLite), SUBSTR in SQLite and SUBSTRING in PostgreSQL.
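
To see several of these keywords in action at once, here’s a quick sketch (using the questions table from above) combining where, like, and order by:

dbGetQuery(db, "select title, viewcount from questions
                where title like '%python%' and viewcount > 10000
                order by viewcount desc limit 5")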

Challenge: Return a few rows from the users, questions, answers, and tags tables so you can get a sense for what the entries in the tables are like.

Challenge: Find the users in the database with the most upvotes.

1.2 Getting unique results (DISTINCT)

A useful SQL keyword is DISTINCT, which eliminates duplicate rows from the result of a query (or duplicate values when selecting a single column).

## Find the unique tags:
dbGetQuery(db, "select distinct tag from questions_tags limit 15")
##           tag
## 1     sorting
## 2  visual-c++
## 3         mfc
## 4   cgridctrl
## 5         css
## 6      anchor
## 7        divi
## 8      python
## 9  python-3.x
## 10      audio
## 11        vlc
## 12        ios
## 13     arrays
## 14  dataframe
## 15 javascript
## Count the number of unique tags:
dbGetQuery(db, "select count(distinct tag) from questions_tags")
##   count(distinct tag)
## 1               42137

1.3 Grouping / stratifying (GROUP BY)

A common pattern of operation is to stratify the dataset, i.e., collect it into mutually exclusive and exhaustive subsets, and then do some aggregation operation within each subset. In SQL this is done with the GROUP BY keyword.

Here’s a basic example where we count the occurrences of different tags.

dbGetQuery(db, "select tag, count(*) as n from questions_tags
                group by tag order by n desc limit 100")

Also note the use of as to define a name for the new column.

Challenge: What specifically does that query do? Describe the table that would be returned.

In general GROUP BY statements will involve some aggregation operation on the subsets. Options include: COUNT, MIN, MAX, AVG, SUM.

The result of a query that uses group by is a table with as many rows as there are groups.
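
For example, here’s a sketch (using the questions table) that computes several aggregates per group in a single query; the result has one row per distinct ownerid:

dbGetQuery(db, "select ownerid, count(*) as n, avg(viewcount) as mean_views,
                max(score) as max_score
                from questions group by ownerid limit 5")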

Warning: To filter the result of a grouping operation, we need to use having rather than where: where filters rows before the group by is applied, while having filters the resulting groups afterward.

dbGetQuery(db, "select tag, count(*) as n from questions_tags
                group by tag having n > 100000 limit 10")

Warning: Determining which fields can be selected when using group by can be tricky, because it varies by DBMS. For example, with Postgres, you can only select fields created by aggregation and the fields that group by is applied to (plus fields determined by those through what is called a functional dependency). SQLite allows more flexibility. For example, the following can be done in SQLite to find user and answer information for the answer to each question from the user with the highest reputation, whereas Postgres gives the error ‘ERROR: column “u.userid” must appear in the GROUP BY clause or be used in an aggregate function’.

dbGetQuery(db, "select *, max(reputation) from users U join answers A
                on A.ownerid = U.userid group by A.questionid limit 5")

Challenge: Write a query that will count the number of answers for each question, returning the IDs of the most answered questions. Hint: consider which field in the “answers” table we do the grouping on (and you shouldn’t need to use the “questions” table).

1.4 Joins

1.4.1 Introduction to joins

Suppose in the example of students in classes, we want a result that has the grades of all students in 9th grade. For this we need information from the Student table (to determine grade level) and information from the ClassAssignment table (to determine the class grade for each class a student takes). Getting information from multiple tables, where a row in one table is matched with one or more rows in another table is called a join. In this case the join would look for all rows in the ClassAssignment table that match a given row (i.e., student) in the Student table, using the column in each of the tables containing the student ID to do the matching of rows.

The syntax generally looks like this (again the WHERE and ORDER BY are optional):

SELECT <column(s)> FROM <table1> JOIN <table2> ON <columns to match on>
   WHERE <condition(s) on column(s)> ORDER BY <column(s)>

Let’s see an example join on the Stack Overflow database. In particular let’s select only the questions with the tag “python”.

result1 <- dbGetQuery(db, "select * from questions join questions_tags 
                           on questions.questionid = questions_tags.questionid 
                           where tag = 'python'")
head(result1)           
##   questionid        creationdate score viewcount answercount
## 1   65526804 2021-01-01 01:54:10     0      2087           3
## 2   65527402 2021-01-01 05:14:22     1        56           1
## 3   65529525 2021-01-01 12:06:43     1       175           1
## 4   65529971 2021-01-01 13:14:40     1        39           0
## 5   65532644 2021-01-01 18:46:52    -2        49           1
## 6   65534179 2021-01-01 22:17:15     1       476           0
##   commentcount favoritecount
## 1            3            NA
## 2            0            NA
## 3            0            NA
## 4            1            NA
## 5            1            NA
## 6            4            NA
##                                                              title
## 1            How to play an audio file starting at a specific time
## 2                                 Join dataframe columns in python
## 3                              Issues with pygame.time.get_ticks()
## 4 How to check if Windows prompts a notification box using python?
## 5                      How I divide this text file in a Dataframe?
## 6                         Suppress OpenCV Output Message in Python
##    ownerid questionid    tag
## 1 14718094   65526804 python
## 2  1492229   65527402 python
## 3 13720770   65529525 python
## 4 13845215   65529971 python
## 5 14122166   65532644 python
## 6 10355409   65534179 python

It’s also possible to get the same exact result without using the JOIN keyword, but you’ll need the WHERE keyword to ensure that the rows get matched correctly.

result2 <- dbGetQuery(db, "select * from questions, questions_tags
                           where questions.questionid = questions_tags.questionid 
                           and tag = 'python'")


identical(result1, result2)
## [1] TRUE

We’ll explain what is going on in the next section.

Here’s a three-way join (both with and without the JOIN keyword) with some additional use of aliases to abbreviate table names. What does this query ask for?

result1 <- dbGetQuery(db, "select * from questions Q
                           join questions_tags T on Q.questionid = T.questionid
                           join users U on Q.ownerid = U.userid
                           where tag = 'python' and upvotes > 100")

Once again, we could do that without JOIN and using WHERE to match the rows appropriately.

result2 <- dbGetQuery(db, "select * from questions Q, questions_tags T, users U
                           where Q.questionid = T.questionid 
                           and Q.ownerid = U.userid
                           and tag = 'python' and upvotes > 100")

Challenge: Write a query that would return all the answers to questions with the Python tag.

Challenge: Write a query that will count the number of answers for each question, returning the most answered questions and their information. Note that this extends the question in the previous section.

Challenge: Write a query that would return the users who have answered a question with the Python tag.

1.4.2 Types of joins

We’ve seen a bunch of joins but haven’t discussed the full taxonomy of types of joins. There are various possibilities for how to do a join depending on whether there are rows in one table that do not match any rows in another table.

Inner joins: In database terminology an inner join is when the result has a row for each match of a row in one table with the rows in the second table, where the matching is done on the columns you indicate. If a row in one table corresponds to more than one row in another table, you get all of the matching rows in the second table, with the information from the first table duplicated for each of the resulting rows. For example in the Stack Overflow data, an inner join of questions and answers would pair each question with each of the answers to that question. However, questions without any answers or (if this were possible) answers without a corresponding question would not be part of the result.

Outer joins: Outer joins add additional rows from one table that do not match any rows from the other table, as follows. A left outer join gives all the rows from the first table but only those from the second table that match a row in the first table. A right outer join is the converse, while a full outer join includes at least one copy of all rows from both tables. So a left outer join of the Stack Overflow questions and answers tables would, in addition to the matched questions and their answers, include a row for each question without any answers, as would a full outer join. In this case there should be no answers that do not correspond to a question, so a right outer join should be the same as an inner join. Note that one cannot do a right outer join (or a full outer join) in SQLite; you’ll need to switch the order of the tables and do a left outer join.
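
For instance, here’s a quick sketch of a left outer join of questions and answers; questions without answers appear with NULL in the answer columns:

dbGetQuery(db, "select Q.questionid, A.answerid
                from questions Q left outer join answers A
                on Q.questionid = A.questionid limit 5")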

Cross joins: A cross join gives the Cartesian product of the two tables, namely the pairwise combination of every row from each table, analogous to expand.grid in R. I.e., take a row from the first table and pair it with each row from the second table, then repeat that for all rows from the first table. Since cross joins pair each row in one table with all the rows in another table, the resulting table can be quite large (the product of the number of rows in the two tables). In the Stack Overflow database, a cross join would pair each question with every answer in the database, regardless of whether the answer is an answer to that question.

Here’s a table of the different kinds of joins:

Type of join | Rows from first table | Rows from second table
inner (default) | all that match on specified condition | all that match on specified condition
left outer | all | all that match first table
right outer | all that match second table | all
full outer | all | all
cross | all, combined pairwise with second table | all, combined pairwise with first table

A ‘natural’ join is an inner join that doesn’t require you to specify the common columns between tables on which to enforce equality, but it’s often good practice to not use a natural join and to explicitly indicate which columns are being matched on.

Simply listing two or more tables separated by commas as we saw earlier is the same as a cross join. Alternatively, listing two or more tables separated by commas, followed by conditions that equate rows in one table to rows in another is the same as an inner join.

In general, an inner join can be seen as a cross join followed by a condition that enforces matching between the rows of the two tables. More broadly, here are five equivalent queries that all perform an inner join:

select * from table1 join table2 on table1.id = table2.id ## explicit inner join
select * from table1, table2 where table1.id = table2.id  ## without explicit JOIN
select * from table1 cross join table2 where table1.id = table2.id 
select * from table1 join table2 using(id)
select * from table1 natural join table2

Note that in the last query the join would be based on all common columns, which could be a bit dangerous if you don’t look carefully at the schema of both tables. Assuming id is the common column, then the last of these queries is the same as the others.

Challenge: Create a view with one row for every question-tag pair, including questions without any tags.

Challenge: Write a query that would return the displaynames of all of the users who have never posted a question. The NULL keyword will come in handy – it’s like NA in R. Hint: NULLs should be produced if you do an outer join.

Challenge: How many questions tagged with ‘random-forest’ were unanswered? (You should need two different kinds of joins to answer this.)

1.4.3 Counting NULLs

We’ve seen that one can use outer joins as a way to find rows in one table that do not appear in another table. If you want to count the number of entries by group, with a count of 0 when a group doesn’t appear in a table, apply the count to a field produced by the outer join that contains NULL values, rather than to all fields.

For example, if we wanted to count the number of answers provided by each user, making sure to include people who have answered no questions (assigning them a value of 0), we could count the answerid field, which will be NULL for users with no answers; count ignores NULLs, so those users get 0:

dbGetQuery(db, "select userid, count(answerid) as n_answers
                from users left outer join answers on userid = ownerid
                group by userid order by userid limit 10")

If we had done count(*), then each user with no answers would incorrectly be assigned a 1 (since they still have a row associated with them) instead of a 0 (note user #56 in the output).

dbGetQuery(db, "select userid, count(*) as n_answers
                from users left outer join answers on userid = ownerid
                group by userid order by userid limit 10")

1.4.4 Joining a table with itself (self join)

Sometimes we want to query information across rows of the same table. For example, suppose we want to analyze the time lags between when the same person posts a question: do people tend to post in bursts or uniformly over the year? To do this we need contrasts between the times of the different posts. (One can also address this using window functions, discussed later.)

So we need to join two copies of the same table, which means dealing with resolving the multiple copies of each column.

This would look like this:

dbGetQuery(db, "select * from questions Q1 join questions Q2
                on Q1.ownerid = Q2.ownerid")

That should create a new table with all pairs of questions asked by the same person.

Actually, there’s a problem here.

Challenge: What kinds of rows will we get that we don’t want?

A solution to that problem of having the same question paired with itself is:

dbGetQuery(db, "create view question_contrasts as
                select * from questions Q1 join questions Q2
                on Q1.ownerid = Q2.ownerid
                where Q1.creationdate != Q2.creationdate")

Challenge: There’s actually a further, similar problem. What is it, and how can we fix it by changing two characters in the query above? Hint: even as character strings, the creationdate column has an ordering.

1.5 Temporary tables and views

You can think of a view as a temporary table that is the result of a query and can be used in subsequent queries. In any given query you can use both views and tables. The advantage is that views provide modularity in our querying. For example, if a given operation (portion of a query) is needed repeatedly, one can abstract that as a view and then make use of that view.

Suppose we always want the upvotes and displayname of question owners available. Once we have the view we can query it like a regular table.

## note there is a creationdate in users too, hence disambiguation
dbExecute(db, "create view questions_plus as
               select questionid, questions.creationdate, score, viewcount, 
                      title, ownerid, upvotes, displayname
               from questions join users on questions.ownerid = users.userid")

## don't be confused by the "0" response --
## it just means that nothing is returned to R; the view _has_ been created
               
dbGetQuery(db, "select * from questions_plus where upvotes > 100 limit 5")

One use of a view would be to create a mega table that stores all the information from multiple tables in the (unnormalized) form you might have if you simply had one data frame in R or Python.

dbExecute(db, "drop view questions_plus") # drop the view if we no longer need it

If you want to create a temporary table just for a single query, you can use a subquery or a WITH clause, as discussed in Section 3.2.

2 Additional SQL topics

2.1 Creating database tables

Often one would create tables from within R or Python (though one can create tables from within the sqlite and psql command line interfaces as well). Here’s the syntax from R.

## Option 1: pass directly from CSV to database
dbWriteTable(conn = db, name = "student", value = "student.csv", row.names = FALSE,
             header = TRUE)

## Option 2: pass from data in an R data frame
## First create your data frame:
# student_df <- data.frame(...)
## or
# student_df <- read.csv(...)
dbWriteTable(conn = db, name = "student", value = student_df, row.names = FALSE,
             append = FALSE)

2.2 String processing and creating new fields

We can do some basic matching with LIKE, using % as a wildcard and _ to stand in for any single character:

dbGetQuery(db, "select * from questions_tags where tag like 'r-%' limit 10")
##    questionid        tag
## 1    65598394   r-factor
## 2    65999758 r-markdown
## 3    66007924    r-exams
## 4    66036936 r-markdown
## 5    65985449    r-caret
## 6    66035257 r-markdown
## 7    66135867 r-markdown
## 8    65878099 r-markdown
## 9    65973815 r-markdown
## 10   66102594   r-lavaan

In Postgres, in addition to the basic use of LIKE to match character strings, one can use regular expression syntax with SIMILAR TO.

SIMILAR TO is not available in SQLite so the following can only be done in the Postgres instance of our example database. Here we’ll look for all tags that are of the form “r-”, “-r”, “r” or “-r-”. SQL uses % as a wildcard (this is not standard regular expression syntax).

## Try in postgreSQL, not SQLite
result <- dbGetQuery(db, "select * from questions_tags 
                          where tag similar to 'r-%|%-r|r|%-r-%' limit 10")
## Standard regex for 'any character' doesn't seem to work:
## result <- dbGetQuery(db, "select * from questions_tags 
##                           where tag SIMILAR TO 'r-.*|.*-r|r|.*-r-.*' limit 10")

Note: The matching is not done on substrings unless one uses wildcards at the beginning and end of the pattern, so “r” will only find “r” and not, for example, “dplyr”.

To extract substrings we can use SUBSTRING in Postgres. Here’s a basic example:

dbGetQuery(db, "select substring(creationdate from '^[[:digit:]]{4}') as year
                from questions limit 3")

If you need to specify the pattern to be extracted relative to the surrounding characters, Postgres requires that the pattern to be extracted be surrounded by #" (one could use another character in place of #). For use from R, we need to escape the double quote with a backslash so that it is treated as part of the string passed to Postgres rather than as ending the R string. We also need the % wildcard character when extracting in this way.

dbGetQuery(db, "select substring(creationdate from
                '%-#\"[[:digit:]]{2}#\"-%' for '#') as month
                from questions limit 3")

Warning: SQLite provides SUBSTR for substrings, but SUBSTR seems to be much less flexible than SUBSTRING in PostgreSQL.

Here is some documentation on string functions in PostgreSQL.

Challenge: Select the questions that have “java” but not “javascript” in their titles using regular expression syntax.

Challenge: Figure out how to calculate the length (in characters) of the title of each question.

Challenge: Process the creationdate field to create year, day, and month fields in a new view. Note that this is good practice for string manipulation, but in real work you would want to handle dates and times using the material in the next section rather than string processing.

2.3 Dates and times

Here we’ll see how you can work with dates and times in SQLite, but the functionality should be similar in other DBMS.

SQLite doesn’t have specific date-time types, but it’s standard to store date-times as strings in a TEXT field in the ISO-8601 format: YYYY-MM-DD HH:MM:SS.SSS. That’s the format of the dates in the Stack Overflow database:

dbGetQuery(db, "select distinct creationdate from questions limit 5")
##          creationdate
## 1 2021-01-01 22:15:54
## 2 2021-01-02 01:33:13
## 3 2021-01-02 04:01:34
## 4 2021-01-02 04:03:20
## 5 2021-01-02 07:03:04

SQLite then provides some powerful functions for manipulating and extracting information in such fields. Here are just a few examples; note that strftime is particularly powerful. Other DBMS should have similar functionality, but I haven’t investigated further.

## Julian days (decimal days since noon UTC/Greenwich time November 24, 4714 BC (Yikes!)). 
output <- dbGetQuery(db, "select creationdate, julianday(creationdate)
                from questions limit 5")
output
##          creationdate julianday(creationdate)
## 1 2021-01-01 22:15:54                 2459216
## 2 2021-01-02 01:33:13                 2459217
## 3 2021-01-02 04:01:34                 2459217
## 4 2021-01-02 04:03:20                 2459217
## 5 2021-01-02 07:03:04                 2459217
## Julian day is decimal-valued:
formatC(output[ , 2], 6, format = 'f')
## [1] "2459216.427708" "2459216.564734" "2459216.667755"
## [4] "2459216.668981" "2459216.793796"
## Convert to local time
dbGetQuery(db, "select distinct creationdate, datetime(creationdate, 'localtime')
                from questions limit 5")
##          creationdate datetime(creationdate, 'localtime')
## 1 2021-01-01 22:15:54                 2021-01-01 14:15:54
## 2 2021-01-02 01:33:13                 2021-01-01 17:33:13
## 3 2021-01-02 04:01:34                 2021-01-01 20:01:34
## 4 2021-01-02 04:03:20                 2021-01-01 20:03:20
## 5 2021-01-02 07:03:04                 2021-01-01 23:03:04
## Eastern time, manually, ignoring daylight savings
dbGetQuery(db, "select distinct creationdate, datetime(creationdate, '-05:00')
                from questions limit 5")
##          creationdate datetime(creationdate, '-05:00')
## 1 2021-01-01 22:15:54              2021-01-01 17:15:54
## 2 2021-01-02 01:33:13              2021-01-01 20:33:13
## 3 2021-01-02 04:01:34              2021-01-01 23:01:34
## 4 2021-01-02 04:03:20              2021-01-01 23:03:20
## 5 2021-01-02 07:03:04              2021-01-02 02:03:04
## day of week: Jan 1 2021 was a Friday (0=Sunday, 6=Saturday)
dbGetQuery(db, "select creationdate, strftime('%w', creationdate)
                from questions limit 5")
##          creationdate strftime('%w', creationdate)
## 1 2021-01-01 22:15:54                            5
## 2 2021-01-02 01:33:13                            6
## 3 2021-01-02 04:01:34                            6
## 4 2021-01-02 04:03:20                            6
## 5 2021-01-02 07:03:04                            6

Unfortunately I’m not sure if the actual dates in the database are Greenwich time or some US time zone, but we’ll ignore that complication here.

Let’s put it all together to do something meaningful.

result <- dbGetQuery(db, "select strftime('%H', creationdate) as hour,
                          count() as n from questions group by hour")
head(result)
##   hour     n
## 1   00 35444
## 2   01 33989
## 3   02 35542
## 4   03 37745
## 5   04 39609
## 6   05 45229
plot(as.numeric(result$hour), result$n, xlab = 'hour of day (UTC/Greenwich???)',
                                        ylab = 'number of questions')

Here’s some documentation of the syntax for the functions, including strftime.

3 More advanced SQL

3.1 Set operations: UNION, INTERSECT, EXCEPT

You can do set operations like union, intersection, and set difference using the UNION, INTERSECT, and EXCEPT keywords on tables that have the same schema (same column names and types), though most often these would be used on single columns (i.e., single-column tables).

Note: While one can often set up an equivalent query without using INTERSECT or UNION, set operations can be very handy.

Here’s an example of a query that can be done with or without an intersection. Suppose we want to know the names of all individuals who have asked both an R question and a Python question. We can do this with INTERSECT:

system.time(
   result1 <- dbGetQuery(db, "select displayname, userid from
                              questions Q join users U on U.userid = Q.ownerid
                              join questions_tags T on Q.questionid = T.questionid
                              where tag = 'r'
                              intersect
                              select displayname, userid from
                              questions Q join users U on U.userid = Q.ownerid
                              join questions_tags T on Q.questionid = T.questionid
                              where tag = 'python'")
               )
##    user  system elapsed 
##   3.912   1.338   6.470

Alternatively we can do a self-join. Note that the syntax gets complicated as we are doing multiple joins.

system.time(
   result2 <- dbGetQuery(db, "select displayname, userid from
                              (questions Q1 join questions_tags T1
                              on Q1.questionid = T1.questionid)
                              join
                              (questions Q2 join questions_tags T2
                              on Q2.questionid = T2.questionid)
                              on Q1.ownerid = Q2.ownerid
                              join users on Q1.ownerid = users.userid
                              where T1.tag = 'r' and T2.tag = 'python'")
               )
##    user  system elapsed 
##   5.331   3.048   9.693
identical(result1, result2)
## [1] FALSE

Challenge: Those two queries return equivalent information, but the results are not exactly the same. What causes the difference? How can we modify the second query to get the exact same results as the first?

Which is faster? The second one looks more involved in terms of the joins, so the timing results seen above make sense.

We could use UNION or EXCEPT to find people who have asked either or only one type of question, respectively.

Challenge: Find the users who have asked either an R question or a Python question.

Challenge: Find the users who have asked but not answered a question.

3.2 Subqueries

A subquery is a full query that is embedded in a larger query.

3.2.1 Subqueries in the FROM statement

We can use subqueries in the FROM statement to create a temporary table to use in a query. Here we’ll do it in the context of a join.

Challenge: What does the following do?

dbGetQuery(db, "select * from questions join answers A
                on questions.questionid = A.questionid
                join
                (select ownerid, count(*) as n_answered from answers
                group by ownerid order by n_answered desc limit 1000) most_responsive
                on A.ownerid = most_responsive.ownerid")

It might be hard to just come up with that full query all at once. A good strategy is probably to think about creating a view that is the result of the inner query and then have the outer query use that. You can then piece together the complicated query in a modular way. For big databases, you are likely to want to submit this as a single query and not two queries so that the SQL optimizer can determine the best way to do the operations. But you want to start with code that you’re confident will give you the right answer!
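
Concretely, here’s a sketch of that modular approach, creating the inner query as a view and then using it (equivalent to the single query above, though run as separate statements):

dbExecute(db, "create view most_responsive as
               select ownerid, count(*) as n_answered from answers
               group by ownerid order by n_answered desc limit 1000")
dbGetQuery(db, "select * from questions join answers A
                on questions.questionid = A.questionid
                join most_responsive on A.ownerid = most_responsive.ownerid")
dbExecute(db, "drop view most_responsive")  ## clean up once we're done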

Note we could also have done that query using a subquery in the WHERE statement, as discussed in the next section.

Challenge: Write a query that finds the number of answers per question, but only answers from users with at least 100 upvotes.

Finally one can use subqueries in the SELECT clause to create new variables, but we won’t go into that here.

3.2.2 Subqueries in the WHERE statement

Instead of a join, we can use subqueries as a way to combine information across tables, with the subquery involved in a WHERE statement. The subquery creates a set and we then can check for inclusion in (or exclusion from with not in) that set.

For example, suppose we want to know the average number of UpVotes for users who have posted a question with the tag “python”.

dbGetQuery(db, "select avg(upvotes) from users where userid in
                (select distinct ownerid from
                questions join questions_tags
                on questions.questionid = questions_tags.questionid
                where tag = 'python')")       
##   avg(upvotes)
## 1     62.72529

In some cases one can do a join rather than using a subquery, but in the following example the join gives the wrong answer.

Challenge: What’s wrong with the following query as an attempt to answer the question above? (See if you can figure it out before looking at the answer below.)

dbGetQuery(db, "select avg(upvotes) from questions, questions_tags, users
                where questions.questionid = questions_tags.questionid and
                questions.ownerid = users.userid and
                tag = 'python'")

(Answer: In the subquery, we find the IDs of the users we are looking for and then average over the UpVotes of those individuals. In the join version we found all the questions that had a Python tag and averaged over the UpVotes of the individuals associated with those questions. So the latter includes multiple UpVotes values from individuals who have posted multiple Python questions.)

Challenge: Write a query that would return the user information for users who have answered a question with the Python tag. We’ve seen this challenge before, but do it now based on a subquery.

Challenge: Find the users who have asked but not answered a question. We’ve seen this before, but now make use of subqueries instead of a join.

Challenge: How would you find all the answers associated with the user with the most upvotes?

Challenge: Create a frequency list of the tags used in the top 100 most answered questions. Note there is a way to do this with a JOIN and a way without a JOIN.

3.2.3 Using WITH

The WITH clause allows you to create a temporary table to then use in an associated SELECT statement. So it provides similar functionality to using a view but without it being a persistent part of the database. The temporary table is only available within the associated SELECT statement. WITH can only occur as part of a query with SELECT.

Let’s see use of WITH to accomplish what we did with a subquery in the FROM statement above.

dbGetQuery(db, "with most_responsive as (
                select ownerid, count(*) as n_answered from answers
                group by ownerid order by n_answered desc limit 1000
                )
                select * from questions join answers A
                on questions.questionid = A.questionid
                join most_responsive on A.ownerid = most_responsive.ownerid")

One could also replace the subquery in the WHERE statement above using WITH.

dbGetQuery(db, "with tmp as (select distinct ownerid from
                questions join questions_tags
                on questions.questionid = questions_tags.questionid
                where tag = 'python')
                select avg(UpVotes) from users where userid in
                tmp")       

Finally, you can create multiple temporary tables in the WITH clause. This can help make your query more modular without the complication of creating views that will only be used once.
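
For example, here’s a sketch defining two temporary tables in a single WITH clause and joining them, counting the users who have asked both an R question and a Python question (similar information to the INTERSECT example earlier):

dbGetQuery(db, "with r_askers as
                (select distinct ownerid from questions
                 join questions_tags using(questionid) where tag = 'r'),
                python_askers as
                (select distinct ownerid from questions
                 join questions_tags using(questionid) where tag = 'python')
                select count(*) as n from r_askers
                join python_askers using(ownerid)")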

3.3 Window functions

Window functions provide the ability to perform calculations across sets of rows that are related to the current query row.

Comments:

  • The result of applying a window function is the same number of rows as the input, even though the functionality is similar to group by. Hint: think about the result of group by + mutate in dplyr in R.
  • One can apply a window function within groups or across the whole table.
  • The functions one can apply include standard aggregation functions such as avg and count as well as non-standard functions (specific to using window functions) such as rank and cume_dist.
  • Unless you’re simply grouping into categories, you’ll generally need to order the rows for the window function to make sense.

The syntax is a bit involved, so let’s look at a range of examples:

  • Aggregate within groups but with one output value per input row
## Total number of questions for each owner
dbGetQuery(db, "select ownerid,
                count() over (partition by ownerid) as n
                from questions where ownerid is not NULL limit 10")
##    ownerid n
## 1       33 1
## 2       51 1
## 3       56 3
## 4       56 3
## 5       56 3
## 6       58 2
## 7       58 2
## 8       95 3
## 9       95 3
## 10      95 3
  • Compute cumulative calculations; note the need for ORDER BY within the OVER clause (the other ORDER BY is just for display purposes here):
## Rank (based on ordering by creationdate) of questions by owner
dbGetQuery(db, "select *,
                rank() over (partition by ownerid order by creationdate) as rank
                from questions order by ownerid desc limit 10")
##    questionid        creationdate score viewcount answercount
## 1    70347322 2021-12-14 08:47:41     1       123           0
## 2    65620082 2021-01-07 21:19:16     0        44           1
## 3    65671366 2021-01-11 17:04:26     0        40           1
## 4    66034594 2021-02-03 19:43:00     1       111           2
## 5    66237702 2021-02-17 07:49:58     0       174           2
## 6    66547208 2021-03-09 12:49:23     2       307           1
## 7    66907039 2021-04-01 15:07:28     3      1540           1
## 8    67099166 2021-04-14 21:11:18     1       289           1
## 9    67355292 2021-05-02 10:28:59    -2      5001           1
## 10   67374899 2021-05-03 19:42:10     1        66           0
##    commentcount favoritecount
## 1             5            NA
## 2             9            NA
## 3             0            NA
## 4             0            NA
## 5             0            NA
## 6             0            NA
## 7             0            NA
## 8             0            NA
## 9             5            NA
## 10            7            NA
##                                                                                                   title
## 1                                            Adjust brightness of Arduino's built-in LED with pyFirmata
## 2             How to resolve squiggly lines in HTML files after importing Bootstrap to Angular project?
## 3                                               Input form fields are cannot be updated in Angular app?
## 4                                                      Select first largest value using TOP 1 in MySQL?
## 5                                                                Cron expression in Spring, validation?
## 6  Playing with Strategy Design Pattern using lambda expression by following Venkat Subramaniam's book?
## 7                                                      GraphQL and Spring Security using @PreAuthorize?
## 8                                           Emitting from Flux<String> until one of conditions are met?
## 9                                        Cannot open older Angular project with latest Angular version?
## 10 How to make vertical scrollbar's height same as window's height and make horizontal scrollbar fixed?
##     ownerid rank
## 1  20674445    1
## 2  20390023    1
## 3  20390023    2
## 4  20390023    3
## 5  20390023    4
## 6  20390023    5
## 7  20390023    6
## 8  20390023    7
## 9  20390023    8
## 10 20390023    9

(Sidenote: we rely here on the fact that ordering alphabetically by creationdate is equivalent to time ordering.)

  • Do a lagged analysis
## Get previous value (based on ordering by creationdate) by owner
dbGetQuery(db, "select ownerid, creationdate,
                lag(creationdate, 1) over
                (partition by ownerid order by creationdate)
                as previous_date
                from questions order by ownerid desc limit 5")
##    ownerid        creationdate       previous_date
## 1 20674445 2021-12-14 08:47:41                <NA>
## 2 20390023 2021-01-07 21:19:16                <NA>
## 3 20390023 2021-01-11 17:04:26 2021-01-07 21:19:16
## 4 20390023 2021-02-03 19:43:00 2021-01-11 17:04:26
## 5 20390023 2021-02-17 07:49:58 2021-02-03 19:43:00

So one could now calculate the difference between the previous and current dates to analyze the time gaps between a given user’s posts, as sketched below.
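
Here’s a sketch of that calculation, using julianday to convert the dates so the gap comes out in (decimal) days; the first question for each owner has no previous question and so gets NULL:

dbGetQuery(db, "select ownerid, creationdate,
                julianday(creationdate) -
                julianday(lag(creationdate, 1) over
                (partition by ownerid order by creationdate)) as gap_days
                from questions order by ownerid limit 5")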

  • Do an analysis within an arbitrary window of rows based on the values in one of the columns
## Summarize questions within 15 days of current question 
dbGetQuery(db, "select ownerid, creationdate,
                count() over
                (partition by ownerid order by julianday(creationdate)
                range between 15 preceding and 15 following)
                as n_window
                from questions where ownerid is not null limit 30")
##    ownerid        creationdate n_window
## 1       33 2021-06-10 07:58:44        1
## 2       51 2021-06-09 18:07:55        1
## 3       56 2021-04-21 10:20:45        1
## 4       56 2021-11-23 09:40:20        2
## 5       56 2021-12-07 14:19:36        2
## 6       58 2021-01-22 19:30:22        1
## 7       58 2021-06-09 14:56:50        1
## 8       95 2021-02-11 19:52:55        1
## 9       95 2021-07-21 13:22:25        1
## 10      95 2021-10-15 07:28:39        1
## 11     101 2021-03-16 18:53:19        1
## 12     114 2021-03-11 16:30:45        1
## 13     116 2021-02-15 21:58:48        2
## 14     116 2021-02-25 03:07:55        2
## 15     116 2021-03-24 16:10:27        1
## 16     116 2021-04-08 20:48:23        2
## 17     116 2021-04-15 19:03:33        2
## 18     116 2021-07-12 00:37:47        1
## 19     159 2021-03-19 21:48:17        1
## 20     159 2021-05-21 17:29:36        2
## 21     159 2021-05-26 21:24:16        2
## 22     188 2021-08-04 01:48:51        1
## 23     199 2021-01-28 23:04:47        1
## 24     199 2021-06-07 08:10:25        1
## 25     199 2021-10-05 10:10:37        1
## 26     199 2021-12-17 23:29:05        1
## 27     214 2021-04-29 14:39:04        1
## 28     230 2021-03-26 16:30:09        2
## 29     230 2021-04-09 13:14:16        2
## 30     234 2021-01-04 01:02:03        1

There the ‘15 preceding’ and ‘15 following’ mean to include all rows within each ownerid that are within 15 Julian days (based on ‘creationdate’) of each row.

So one could now analyze bursts of activity.

One can also choose a fixed number of rows by replacing ‘range’ with ‘rows’. The ROWS and RANGE syntax allow one to specify the window frame in more flexible ways than simply using the categories of a categorical variable.
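
For example, here’s a sketch using a ROWS-based frame, averaging the viewcount over the three questions (rows) preceding each question within each owner:

dbGetQuery(db, "select ownerid, creationdate, viewcount,
                avg(viewcount) over
                (partition by ownerid order by creationdate
                rows between 3 preceding and 1 preceding) as avg_prev3
                from questions where ownerid is not null limit 10")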

So the syntax of a window function will generally have these elements:

  • a call to some function
  • OVER
  • PARTITION BY (optional)
  • ORDER BY (optional)
  • RANGE or ROWS (optional)
  • AS (optional)

You can also name window functions, which comes in handy if you want multiple functions applied to the same window:

dbGetQuery(db, "select ownerid, creationdate,
                lag(creationdate, 1) over w as lag1,
                lag(creationdate, 2) over w as lag2
                from questions where ownerid is not null
                window w as (partition by ownerid order by creationdate)
                order by ownerid limit 5")
##   ownerid        creationdate                lag1
## 1      33 2021-06-10 07:58:44                <NA>
## 2      51 2021-06-09 18:07:55                <NA>
## 3      56 2021-04-21 10:20:45                <NA>
## 4      56 2021-11-23 09:40:20 2021-04-21 10:20:45
## 5      56 2021-12-07 14:19:36 2021-11-23 09:40:20
##                  lag2
## 1                <NA>
## 2                <NA>
## 3                <NA>
## 4                <NA>
## 5 2021-04-21 10:20:45

What does that query do?

Finally, you can use window functions on the entire table, without partitioning.

## Summarize questions within 15 (decimal) days of current question 
dbGetQuery(db, "select ownerid, creationdate,
                count() over
                (order by julianday(creationdate)
                range between 15 preceding and 15 following)
                as n_window
                from questions where ownerid is not null limit 30")
##     ownerid        creationdate n_window
## 1  13708122 2021-01-01 00:00:01    65899
## 2  14920712 2021-01-01 00:00:59    65901
## 3  10407800 2021-01-01 00:01:40    65901
## 4  14593381 2021-01-01 00:02:06    65901
## 5  14783072 2021-01-01 00:02:09    65901
## 6  14853091 2021-01-01 00:02:17    65901
## 7  14920717 2021-01-01 00:03:01    65902
## 8  11645517 2021-01-01 00:03:20    65902
## 9  10197813 2021-01-01 00:04:32    65903
## 10 14694500 2021-01-01 00:05:43    65906
## 11  6335637 2021-01-01 00:05:46    65906
## 12  2242096 2021-01-01 00:05:47    65906
## 13  9574155 2021-01-01 00:06:09    65908
## 14  6281777 2021-01-01 00:06:15    65908
## 15 14260231 2021-01-01 00:07:02    65908
## 16 14920186 2021-01-01 00:07:03    65908
## 17 13103324 2021-01-01 00:08:01    65909
## 18  1127065 2021-01-01 00:09:37    65918
## 19 10841085 2021-01-01 00:10:40    65919
## 20  7336289 2021-01-01 00:11:05    65920
## 21 14634129 2021-01-01 00:11:17    65920
## 22 14920707 2021-01-01 00:12:16    65924
## 23 14461250 2021-01-01 00:13:16    65927
## 24 14920741 2021-01-01 00:13:23    65927
## 25 11035194 2021-01-01 00:16:10    65934
## 26   735332 2021-01-01 00:17:34    65940
## 27 10707986 2021-01-01 00:19:19    65944
## 28 12743240 2021-01-01 00:20:09    65944
## 29  1098815 2021-01-01 00:21:35    65950
## 30 14496928 2021-01-01 00:21:42    65951

Challenge: Use a window function to compute the average viewcount for each ownerid for the 10 questions preceding each question.

Challenge: For each question, get the answer given by the user with the maximum reputation amongst users answering the question. Hint: you’ll need to first create a subquery that determines the maximum reputation amongst the answers for each question and then use that to get the answer of interest for each question.

Challenge (hard): Find the users who have asked one question that is highly-viewed (viewcount > 1000) with their remaining questions not highly-viewed (viewcount < 20 for all other questions).

3.4 Putting it all together to do complicated queries

Here are some real-world style questions one might try to create queries to answer. The context for these questions is a situation in which you have data on user sessions on a website or data on messages between users.

1) Given a table of user sessions with the format

date | session_id | user_id | session_time

calculate the distribution of the average daily total session time in the last month. I.e., you want to get each user’s daily average and then find the distribution over users. The output should be something like:

minutes_per_day | number_of_users

2) Consider a table of messages of the form

sender_id | receiver_id | message_id

For each user, find the three users they message the most.

3) Suppose you are running an online experiment and have a table on the experimental design:

user_id | test_group | date_first_exposed

Suppose you also have a messages table that indicates if each message was sent on web or mobile:

date | sender_id | receiver_id | message_id | interface (web or mobile)

For each test group, what is the average (over users) of each user’s average number of messages sent per day, considering only users who have sent messages exclusively on mobile in the last month?

Challenge questions with the Stack Overflow data

Translating the three challenges above into problems for the Stack Overflow data, one could consider the following three problems.

1) For each user who has asked at least one question find the average number of questions per month. Then determine the distribution of that average across the users. (I.e., determine the values that would go into a histogram of the average number of questions per month across users.) The output should be something like:

number of questions per month (rounded) | number of users

Next try to figure out how to include the users who have asked no questions. Hint: think about how to get NULL values included and then count a column containing such values.

2) For each user, find the three most common tags they apply to their questions.

The output should be something like:

userid | tag | count

Hints: You’ll need to use subqueries, and the final selection of just the top 3 tags will need to be done in its own query rather than as part of defining a field using a window function.

3) Consider trying to determine whether users who answer a lot of questions also ask a lot of questions. Grouping the users based on the number of questions they’ve answered (0, 1, 2, etc.), determine the average number of questions asked per month for each group.

The output should be something like:

number of answers | average number of questions per month

You’ll want to work through this in pieces. Try to think about the initial tables you would need and then build up your query in a nested fashion.

3.5 A summary of SQL syntax by example

This section shows the syntax for various queries so as to demonstrate syntax by example. It may be useful to test your understanding by figuring out (either with or without running the query) what the query does.

Selecting columns

select ownerid, title from questions
select ownerid, title from questions limit 5
select * from questions
select * from questions order by answercount desc
select count(*) as n from questions
select count(ownerid) as n from questions
select sum(answercount) from questions

Using distinct

select distinct tag from questions_tags limit 15
select distinct ownerid, answercount from questions limit 15
select count(distinct tag) from questions_tags limit 15

Filtering rows with where

select * from questions where answercount > 40
select * from questions where answercount > 40 order by answercount desc
select * from questions where answercount = 10 limit 5
select * from questions_tags where tag like 'r-%' limit 10
select * from questions_tags where tag similar to 'r-%|%-r|r|%-r-%' limit 10
select * from questions_tags where tag in ('java','r','python') limit 10

Grouping and reduction/aggregation

select tag, count(*) as n from questions_tags \
    group by tag

select tag, count(*) as n from questions_tags \
    group by tag having n > 1000
    
select ownerid, count(*) as n from questions \
    group by ownerid order by n desc limit 15
    
select ownerid, sum(viewcount) as viewed from questions \
    group by ownerid

select *, sum(viewcount) as viewed from questions \
    group by ownerid

select answercount, commentcount, count(*) as n from questions \
    group by answercount, commentcount

select tag, count(*) as n from questions_tags \
    where tag like 'python%' group by tag having n > 1000

The query above starting with select *, sum(viewcount) behaves differently in SQLite and DuckDB.

Joins

Inner joins

select * from questions join questions_tags \
    on questions.questionid = questions_tags.questionid
    
select * from questions Q join questions_tags T \
    on Q.questionid = T.questionid
    
select * from questions Q join questions_tags T \
    using(questionid)
    
select * from questions Q, questions_tags T \
    where Q.questionid = T.questionid

Outer joins

select * from questions Q left outer join answers A \
    on Q.questionid = A.questionid 
    
select * from questions Q left outer join answers A \
    on Q.questionid = A.questionid \
    where A.creationdate is NULL
    
-- Note: no right (or full) outer join in SQLite; to run this there, reverse the order of the tables and use a left outer join
select * from questions Q right outer join answers A \
    on Q.questionid = A.questionid \
    where Q.creationdate is NULL

select Q.questionid, count(*) as n_tags from questions Q join questions_tags T \
    on Q.questionid = T.questionid \
    group by Q.questionid

Self joins

First we’ll set up a view (a temporary table) that combines questions and tags, for ease of illustrating ideas around self joins.

create view QT as select * from questions join questions_tags using(questionid)

In small groups, discuss what these queries do.

select * from QT as QT1 join QT as QT2 \
    using(questionid)

select * from QT as QT1 join QT as QT2 \
    using(questionid) where QT1.tag < QT2.tag
    
select QT1.tag, QT2.tag, count(*) as n from QT as QT1 join QT as QT2 \
    using(questionid) where QT1.tag < QT2.tag \
    group by QT1.tag, QT2.tag order by n desc limit 10


select * from QT as QT1 join QT as QT2 using(ownerid)

Set operations

select ownerid from QT where tag='python' \
    intersect \
    select ownerid from QT where tag='r'
    
select ownerid from QT where tag='python' \
    except \
    select ownerid from QT where tag='r'
    
select ownerid from QT where tag='python' \
    union \
    select ownerid from QT where tag='r'

select userid, displayname, location from users \
    where location like '%United States%' \
    intersect \
    select userid, displayname, location from users \
    where reputation > 10

Subqueries

select * from \
    answers A \
    join \
    ( select ownerid, count(*) as n_answered from answers \
        group by ownerid order by n_answered desc limit 1000 ) most_responsive \
    on A.ownerid = most_responsive.ownerid

select avg(upvotes) from users \
    where userid in \
    ( select distinct ownerid from \
    questions join questions_tags using(questionid) \
    where tag = 'python' )

4 Efficient SQL queries

4.1 Overview

In general, your DBMS should examine your query and try to implement it in the fastest way possible.

Some tips for faster queries include:

  • use indexes on fields used in WHERE and JOIN clauses (see the next section)
    • try to avoid wildcards at the start of a LIKE string comparison when you have an index on the field (as this requires looking at all of the rows)
    • similarly, try to avoid using functions on indexed columns in a WHERE clause, as this requires doing the calculation on all the rows in order to check the condition (see the sketch after this list)
  • only select the columns you really need
  • create (temporary) tables to store intermediate results that you need to query repeatedly
  • use filtering (WHERE clauses) in inner statements when you have nested subqueries
  • use LIMIT as seen in the examples here if you only need some of the rows a query returns
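
As an illustration of the point about functions on indexed columns, here’s a sketch (assuming an index on creationdate) comparing two ways of filtering to 2021 questions. The first applies strftime to every row, which defeats the index; the second is a plain range condition that an index can use directly:

## function applied to the column: the index can't be used
dbGetQuery(db, "select count(*) from questions
                where strftime('%Y', creationdate) = '2021'")
## equivalent range condition: the index can be used
dbGetQuery(db, "select count(*) from questions
                where creationdate >= '2021-01-01' and creationdate < '2022-01-01'")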

4.2 Indexes

An index is an ordering of rows based on one or more fields. DBMS use indexes to look up values quickly, either when filtering (if the index is involved in the WHERE condition) or when doing joins (if the index is involved in the JOIN condition). So in general you want your tables to have indexes.

DBMS use indexing to provide sub-linear time lookup. Without indexes, a database needs to scan through every row sequentially, which is called linear time lookup: if there are n rows, the lookup is O(n) in computational cost. With indexes, lookup may be logarithmic, O(log n) (if using tree-based indexes), or constant time, O(1) (if using hash-based indexes). A binary tree-based search is logarithmic: at each step through the tree you can eliminate half of the possibilities.
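
To make that concrete, a balanced binary tree over a million rows needs only about 20 comparisons per lookup:

log2(1e6)
## [1] 19.93157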

Here’s how we create an index, with some time comparison for a simple query.

system.time(dbGetQuery(db, "select * from questions where viewcount > 10000"))  # 2.4 seconds
system.time(dbExecute(db, "create index count_index on questions (viewcount)")) # 5.6 seconds
system.time(dbGetQuery(db, "select * from questions where viewcount > 10000"))   # 0.9 seconds
## restore earlier state by removing index
system.time(dbExecute(db, "drop index count_index"))

In many contexts (but not the example above), an index can save huge amounts of time. So if you’re working with a database and speed is important, check to see if there are indexes.

One downside of indexes is that creation of indexes can be very time-consuming, as seen above. And if the database is updated frequently, this could be detrimental.

Finally, using indexes in a lookup is not always advantageous, as discussed next.

4.2.1 Index lookup vs. sequential scan

Using an index is good in that the database can jump directly, via random access to the relevant disk locations, to the data of interest. But if the query requires examining a large number of rows, using the index may not be better than a sequential scan. An advantage of sequential scan is that it makes good use of the CPU cache, reading contiguous chunks of data and then accessing the individual pieces of data quickly.

For example, if you change the query above to filter on a much smaller viewcount threshold than 10000, so that many more rows match, you will probably see that the query takes more time with the index than without it.
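
Here's a sketch of that comparison (the exact timings will vary with your machine and with what the operating system has cached):

dbExecute(db, "create index count_index on questions (viewcount)")
system.time(dbGetQuery(db, "select * from questions where viewcount > 100"))  # many rows match; index lookup
dbExecute(db, "drop index count_index")
system.time(dbGetQuery(db, "select * from questions where viewcount > 100"))  # sequential scan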

Ideally you’d do sequential scan of exactly the subset of the rows that you need, with that subset available in contiguous storage.

4.2.2 How indexes work

Indexes are often implemented using tree-based methods. For example, in Postgres, b-tree indexes are used for fields that have an ordering. Trees are basically like decision trees: at each node in the tree, a condition sends one down the left or the right branch (there might also be more than two branches). Eventually one reaches the leaves of the tree, which hold the actual values one is looking for; associated with each value is the address of where that row of data is stored. With a tree-based index, the time cost of lookup is logarithmic (based on the binary search through the tree), so it does grow with the number of elements in the table, but slowly. The lookup process is as follows: given a value (often referred to as a key), one walks down the tree, comparing the value to the condition at each split, until one finds the elements corresponding to the value, and then one retrieves the addresses of where the desired rows are stored.

Here’s some information on how such trees are constructed and searched.

In SQLite, indexes are implemented by creating a separate index table that maps from the value to the row index in the indexed table, allowing for fast lookup of a row.
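
As a quick sketch, SQLite's EXPLAIN QUERY PLAN (a more readable variant of its EXPLAIN output) will report whether a query would use an index:

dbExecute(db, "create index count_index on questions (viewcount)")
dbGetQuery(db, "explain query plan select * from questions where viewcount > 100000")
## Expect a plan along the lines of 'SEARCH questions USING INDEX count_index (viewcount>?)'.
dbExecute(db, "drop index count_index")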

4.3 SQL query plans and EXPLAIN

You can examine the query plan that the system will use for a given query using the EXPLAIN keyword. I'd suggest trying this in Postgres, as the output is more interpretable than SQLite's.

dbGetQuery(db, "explain select * from webtraffic where count > 500")

In PostgreSQL that gives the following:

                                                                        QUERY PLAN
1                             Gather  (cost=1000.00..388634.17 rows=8513 width=61)
2                                                               Workers Planned: 2
3   ->  Parallel Seq Scan on webtraffic  (cost=0.00..386782.88 rows=3547 width=61)
4                                                            Filter: (count > 500)

The “Workers Planned: 2” indicates that Postgres plans to use some parallelization, even without our asking for it.
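
To see the corresponding serial plan, you can turn off parallel workers for the session (a sketch; max_parallel_workers_per_gather is a standard Postgres setting, and db here is assumed to be a Postgres connection, as in the output above):

dbExecute(db, "set max_parallel_workers_per_gather = 0")  # disable parallel plans for this session
dbGetQuery(db, "explain select * from webtraffic where count > 500")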

Now let’s see what query plan is involved in a join and when using indexes.

dbGetQuery(db, "explain select * from questions join questions_tags on
               questions.questionid = questions_tags.questionid")
                                                                         QUERY PLAN
1                   Hash Join  (cost=744893.91..2085537.32 rows=39985376 width=118)
2                     Hash Cond: (questions_tags.questionid = questions.questionid)
3     ->  Seq Scan on questions_tags  (cost=0.00..634684.76 rows=39985376 width=16)
4                     ->  Hash  (cost=365970.96..365970.96 rows=13472796 width=102)
5         ->  Seq Scan on questions  (cost=0.00..365970.96 rows=13472796 width=102)
dbGetQuery(db, "explain select * from questions join questions_tags on
               questions.questionid = questions_tags.questionid where tag like 'python'")
                                                                                                QUERY PLAN
1                                                 Gather  (cost=15339.05..899172.92 rows=687748 width=118)
2                                                                                       Workers Planned: 2
3                                        ->  Nested Loop  (cost=14339.05..829398.12 rows=286562 width=118)
4         ->  Parallel Bitmap Heap Scan on questions_tags  (cost=14338.61..252751.63 rows=286562 width=16)
5                                                                          Filter: (tag ~~ 'python'::text)
6               ->  Bitmap Index Scan on questions_tags_tag_idx  (cost=0.00..14166.68 rows=687748 width=0)
7                                                                       Index Cond: (tag = 'python'::text)
8                     ->  Index Scan using questions_pkey on questions  (cost=0.43..2.01 rows=1 width=102)
9                                                     Index Cond: (questionid = questions_tags.questionid)

Here’s additional information on interpreting what you see: https://www.postgresql.org/docs/current/static/using-explain.html.

The main thing to look for is whether the query will be carried out using an index or by a sequential scan (i.e., looking at all the rows).

Finally, let’s compare the query plans for an inner join versus a cross join followed by a WHERE that produces equivalent results.

dbGetQuery(db, "explain select * from questions join questions_tags on
               questions.questionid = questions_tags.questionid")
                                                                         QUERY PLAN
1                   Hash Join  (cost=744893.91..2085537.32 rows=39985376 width=118)
2                     Hash Cond: (questions_tags.questionid = questions.questionid)
3     ->  Seq Scan on questions_tags  (cost=0.00..634684.76 rows=39985376 width=16)
4                     ->  Hash  (cost=365970.96..365970.96 rows=13472796 width=102)
5         ->  Seq Scan on questions  (cost=0.00..365970.96 rows=13472796 width=102)
6                                                                              JIT:
7                                                                     Functions: 10
8       Options: Inlining true, Optimization true, Expressions true, Deforming true
dbGetQuery(db, "explain select * from questions cross join questions_tags where
               questions.questionid = questions_tags.questionid")
                                                                         QUERY PLAN
1                   Hash Join  (cost=744893.91..2085537.32 rows=39985376 width=118)
2                     Hash Cond: (questions_tags.questionid = questions.questionid)
3     ->  Seq Scan on questions_tags  (cost=0.00..634684.76 rows=39985376 width=16)
4                     ->  Hash  (cost=365970.96..365970.96 rows=13472796 width=102)
5         ->  Seq Scan on questions  (cost=0.00..365970.96 rows=13472796 width=102)
6                                                                              JIT:
7                                                                     Functions: 10
8       Options: Inlining true, Optimization true, Expressions true, Deforming true

We see that the query plan indicates the two queries are using the same steps, with the same cost.

4.4 Disk caching

You might think that database queries will generally be slow (and slower than in-memory manipulation such as in R or Python when all the data can fit in memory) because the database stores the data on disk. However, as mentioned on the introduction page, the operating system will generally cache files/data in memory when it reads from disk. If that information is still in memory the next time it is needed, it will be much faster to access the second time around. Other processes might need the memory and 'invalidate' the cache, but often, once the data has been read once, the database will be able to do queries quite quickly. This also means that even if you're using a database, you can benefit from a machine with a lot of memory if you have a large database (ideally a machine with rather more RAM than the size of the table(s) you'll be accessing).
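
A simple way to see the effect of caching (a sketch; the first timing depends on what the operating system has already cached):

system.time(dbGetQuery(db, "select avg(viewcount) from questions"))  # first read: likely from disk
system.time(dbGetQuery(db, "select avg(viewcount) from questions"))  # repeat: likely served from the cache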

Given this, it generally won’t be helpful to force your database to reside in memory (e.g., using :memory: for SQLite or putting the database on a RAM disk).

4.5 Parallelization and partitioning

To speed up your work, you might try to split a task into multiple queries that run in parallel. However, you're likely to have problems running parallel queries from a single R or Python session.

In contrast, multiple queries to the same database from separate R or Python sessions will generally run fine, though they can compete for access to disk/memory. That said, in some basic experiments the slowdown was moderate, so one may be able to parallelize across processes in a manual fashion, as sketched below.
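
Here's a rough sketch of such manual parallelization using the parallel package (each worker opens its own connection, since a database connection can't be shared across processes; the tags and file path are just illustrative):

library(parallel)
tags <- c('python', 'r', 'c++', 'java')
cl <- makeCluster(4)   # one R worker process per tag
results <- parLapply(cl, tags, function(tag) {
    library(RSQLite)
    con <- dbConnect(SQLite(), file.path('data', 'stackoverflow-2021.db'))
    on.exit(dbDisconnect(con))
    dbGetQuery(con, paste0(
        "select count(*) as n from questions_tags where tag = '", tag, "'"))
})
stopCluster(cl)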

As of version 9.6 of Postgres, there is some capability for doing parallel queries: https://www.postgresql.org/docs/current/static/parallel-query.html.

Finally, Postgres supports partitioning tables. Generally one would divide a large table into smaller tables based on the values of a key. For example, if your data had timestamps, you could partition into subtables for each month or each year. This allows faster queries when the data of interest reside in one or a small number of partitions, and it can also ease manual implementation of parallelization, as sketched below. Here's some information: https://www.postgresql.org/docs/current/static/ddl-partitioning.html.
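
Here's a minimal sketch of declarative range partitioning in Postgres (the table, the column subset, and the connection name dbp are hypothetical):

dbExecute(dbp, "create table questions_part (
                    questionid bigint, creationdate timestamp, viewcount integer
                ) partition by range (creationdate)")
dbExecute(dbp, "create table questions_2021 partition of questions_part
                for values from ('2021-01-01') to ('2022-01-01')")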

4.6 DuckDB

For a serverless database, DuckDB is a nice alternative to SQLite that may speed up queries substantially. DuckDB stores data column-wise, which can lead to big speedups for queries that operate on large portions of tables (so-called “online analytical processing” (OLAP)). Also, in such cases, working with a column-wise format may be faster than using an index.

Let’s compare timing using DuckDB and SQLite versions of the StackOverflow database.

First, we’ll see that simple queries that have to process an entire column can be much faster in DuckDB.

library(duckdb)
drv <- duckdb()
dbFilename <- 'stackoverflow-2021.duckdb'
dbDuck <- dbConnect(drv, file.path(dir, dbFilename))

system.time(dbGetQuery(db, "select count(ownerid) from questions"))
##    user  system elapsed 
##   0.132   0.097   1.527
system.time(dbGetQuery(dbDuck, "select count(ownerid) from questions"))
##    user  system elapsed 
##   0.089   0.009   0.134
system.time(result1 <- dbGetQuery(db, "select distinct ownerid from questions"))
##    user  system elapsed 
##   1.739   0.981   2.869
system.time(result2 <- dbGetQuery(dbDuck, "select distinct ownerid from questions"))
##    user  system elapsed 
##   0.244   0.030   0.067

Now let’s compare timings for some of the queries we ran previously. There are substantial speed-ups in both cases.

Here’s a simple join with a filter.

system.time(result1 <- dbGetQuery(db, "select * from questions join questions_tags 
                           on questions.questionid = questions_tags.questionid 
                           where tag = 'python'"))
##    user  system elapsed 
##   2.931   0.737   4.248
system.time(result2 <- dbGetQuery(dbDuck, "select * from questions join questions_tags 
                           on questions.questionid = questions_tags.questionid 
                           where tag = 'python'"))
##    user  system elapsed 
##   0.998   0.127   1.165

And here’s a subquery in the FROM statement.

system.time(result1 <- dbGetQuery(db, "select * from questions join answers A
                on questions.questionid = A.questionid
                join
                (select ownerid, count(*) as n_answered from answers
                group by ownerid order by n_answered desc limit 1000) most_responsive
                on A.ownerid = most_responsive.ownerid"))
##    user  system elapsed 
## 618.105  48.290 676.191
system.time(result2 <- dbGetQuery(dbDuck, "select * from questions join answers A
                on questions.questionid = A.questionid
                join
                (select ownerid, count(*) as n_answered from answers
                group by ownerid order by n_answered desc limit 1000) most_responsive
                on A.ownerid = most_responsive.ownerid"))
##    user  system elapsed 
##   2.761   0.162   1.462

DuckDB will run in parallel by using multiple threads, which can speed up computations on top of the efficiencies available through the column-wise storage, though given the speed of the DuckDB queries above, I don't think parallelization made much difference in those cases.

You can manage the number of threads like this:

dbExecute(dbDuck, "set threads to 4")
## [1] 0
dbGetQuery(dbDuck, "SELECT current_setting('threads')")
##   current_setting('threads')
## 1                          4
dbDisconnect(dbDuck, shutdown = TRUE)

Finally, DuckDB databases tend to take up less space on disk than SQLite databases, because the column-wise storage allows for better compression.
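
You can check this on your own copies of the databases (a sketch, using the file names from earlier):

file.size(file.path(dir, 'stackoverflow-2021.db'))      # SQLite file size, in bytes
file.size(file.path(dir, 'stackoverflow-2021.duckdb'))  # DuckDB file size, in bytes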