Here is a simple query that selects the first five rows (and all columns, based on the * wildcard) from the questions table.
select * from questions limit 5
To run this from R we provide the SQL syntax as a string as the second argument to dbGetQuery.
library(RSQLite)
drv <- dbDriver("SQLite")
dir <- 'data'   # relative or absolute path to where the .db file is
dbFilename <- 'stackoverflow-2021.db'
db <- dbConnect(drv, dbname = file.path(dir, dbFilename))
dbGetQuery(db, "select * from questions limit 5")
questionid creationdate score viewcount answercount
1 65534165 2021-01-01 22:15:54 0 112 2
2 65535296 2021-01-02 01:33:13 2 1109 0
3 65535910 2021-01-02 04:01:34 -1 110 1
4 65535916 2021-01-02 04:03:20 1 35 1
5 65536749 2021-01-02 07:03:04 0 108 1
commentcount favoritecount title
1 0 NA Can't update a value in sqlite3
2 0 NA Install and run ROS on Google Colab
3 8 0 Operators on date/time fields
4 0 NA Plotting values normalised
5 5 NA Export C# to word with template
ownerid
1 13189393
2 14924336
3 651174
4 14695007
5 14899717
Now let’s see some more interesting usage of other SQL syntax.
First we get the questions that are viewed the most by filtering to the rows for which the ‘viewcount’ is greater than 100000. We’ll limit the results to the first 5 rows so we don’t print too much out.
dbGetQuery(db, "select * from questions where viewcount > 100000 limit 5")
questionid creationdate score viewcount answercount
1 65547199 2021-01-03 06:22:52 124 110832 7
2 65549858 2021-01-03 12:30:19 52 130479 11
3 65630743 2021-01-08 14:20:57 77 107140 19
4 65632698 2021-01-08 16:22:59 74 101044 9
5 65896334 2021-01-26 05:33:33 111 141899 12
commentcount favoritecount
1 2 0
2 0 0
3 4 0
4 1 0
5 7 0
title
1 Using Bootstrap 5 with Vue 3
2 "ERESOLVE unable to resolve dependency tree" when installing npm react-facebook-login
3 How to solve flutter web api cors error only with dart code?
4 How to open a link in a new Tab in NextJS?
5 Python Pip broken with sys.stderr.write(f"ERROR: {exc}")
ownerid
1 11232893
2 12425004
3 12373446
4 9578961
5 202576
Next, let’s find the number of views for the 15 questions viewed the most.
dbGetQuery(db, "select viewcount from questions order by viewcount desc limit 15")
Let’s lay out the various verbs in SQL. Here’s the form of a standard query (but note that the sorting done by ORDER BY is computationally expensive and should be used sparingly):
SELECT <column(s)> FROM <table> WHERE <condition(s) on column(s)> ORDER BY <column(s)>
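For example, here’s a query in that form against the questions table (an illustrative sketch; any columns in the table could be used):
## Titles and view counts of heavily-answered questions, most-viewed first
dbGetQuery(db, "select title, viewcount from questions
    where answercount > 5 order by viewcount desc limit 10")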
SQL keywords are often written in ALL CAPITALS by convention, although I won’t necessarily do that in this tutorial.
And here is a table of some important keywords:
Keyword                                 What it does
-------                                 ------------
SELECT                                  select columns
FROM                                    which table to operate on
WHERE                                   filter (choose) rows satisfying certain conditions
LIKE, IN, <, >, =, <=, >=, !=, etc.     used as part of filtering conditions
ORDER BY                                sort based on columns
Some other keywords are: DISTINCT, ON, JOIN, GROUP BY, AS, USING, UNION, INTERSECT, HAVING, SIMILAR TO (not available in SQLite), SUBSTR in SQLite and SUBSTRING in PostgreSQL.
Challenge
Return a few rows from the users, questions, answers, and tags tables so you can get a sense for what the entries in the tables are like.
Challenge
Find the users in the database with the most upvotes.
1.2 Getting unique results (DISTINCT)
A useful SQL keyword is DISTINCT, which allows you to eliminate duplicate rows from a result (or remove duplicate values when selecting only a single column).
## Find the unique tags:
dbGetQuery(db, "select distinct tag from questions_tags limit 15")
## Count the number of unique tags:
dbGetQuery(db, "select count(distinct tag) from questions_tags")
count(distinct tag)
1 42137
1.3 Grouping / stratifying (GROUP BY)
A common pattern of operation is to stratify the dataset, i.e., collect it into mutually exclusive and exhaustive subsets. One would then generally do some aggregation operation on each subset. The aggregation is always done within each of the groups. In SQL this is done with the GROUP BY keyword.
Here’s a basic example where we count the occurrences of different tags.
dbGetQuery(db, "select tag, count(*) as n from questions_tags group by tag order by n desc limit 100")
Also note the use of as to define a name for the new column.
Challenge
What specifically does that query do? Describe the table that would be returned.
In general GROUP BY statements will involve some aggregation operation on the subsets. Options include: COUNT, MIN, MAX, AVG, SUM.
The result of a query that uses group by is a table with as many rows as groups.
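For example, here’s a sketch that aggregates within each owner, returning one row per ownerid:
## Number of questions and average view count per owner
dbGetQuery(db, "select ownerid, count(*) as n, avg(viewcount) as avg_views
    from questions group by ownerid order by n desc limit 10")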
where vs. having
To filter the result of a grouping operation, we need to use having rather than where. (Note that where would filter before the application of the group by).
dbGetQuery(db, "select tag, count(*) as n from questions_tags group by tag having n > 100000 limit 10")
Fields and group by
Determining what fields can be selected when using group by can be tricky, because it varies by DBMS. For example, with Postgres you can only select fields created by aggregation, the fields that group by is applied to, and fields that are functionally dependent on the grouping fields. SQLite allows more flexibility. For example, the following can be done in SQLite to find user and answer information for the answer to each question from the user with the highest reputation. However, Postgres gives the error ‘ERROR: column “u.userid” must appear in the GROUP BY clause or be used in an aggregate function’.
dbGetQuery(db, "select *, max(reputation) from users U join answers A on A.ownerid = U.userid group by A.questionid limit 5")
Challenge
Write a query that will count the number of answers for each question, returning the IDs of the most answered questions. Hint: consider which field in the “answers” table we do the grouping on (and you shouldn’t need to use the “questions” table).
count and NULL values
When applied to a specific field, COUNT will not count elements that are NULL. That can be useful in cases such as determining the number of non-matches in an outer join. In contrast, COUNT(*) will count the number of rows, regardless of the contents.
1.4 Joins
1.4.1 Introduction to joins
Suppose in the example of students in classes, we want a result that has the grades of all students in 9th grade. For this we need information from the Student table (to determine grade level) and information from the ClassAssignment table (to determine the class grade for each class a student takes). Getting information from multiple tables, where a row in one table is matched with one or more rows in another table is called a join. In this case the join would look for all rows in the ClassAssignment table that match a given row (i.e., student) in the Student table, using the column in each of the tables containing the student ID to do the matching of rows.
The syntax generally looks like this (again the WHERE and ORDER BY are optional):
SELECT <column(s)> FROM <table1> JOIN <table2> ON <columns to match on>
WHERE <condition(s) on column(s)> ORDER BY <column(s)>
Let’s see an example join on the Stack Overflow database. In particular let’s select only the questions with the tag “python”.
result1 <- dbGetQuery(db, "select * from questions join questions_tags
    on questions.questionid = questions_tags.questionid
    where tag = 'python'")
head(result1)
questionid creationdate score viewcount answercount
1 65526804 2021-01-01 01:54:10 0 2087 3
2 65527402 2021-01-01 05:14:22 1 56 1
3 65529525 2021-01-01 12:06:43 1 175 1
4 65529971 2021-01-01 13:14:40 1 39 0
5 65532644 2021-01-01 18:46:52 -2 49 1
6 65534179 2021-01-01 22:17:15 1 476 0
commentcount favoritecount
1 3 NA
2 0 NA
3 0 NA
4 1 NA
5 1 NA
6 4 NA
title
1 How to play an audio file starting at a specific time
2 Join dataframe columns in python
3 Issues with pygame.time.get_ticks()
4 How to check if Windows prompts a notification box using python?
5 How I divide this text file in a Dataframe?
6 Suppress OpenCV Output Message in Python
ownerid questionid tag
1 14718094 65526804 python
2 1492229 65527402 python
3 13720770 65529525 python
4 13845215 65529971 python
5 14122166 65532644 python
6 10355409 65534179 python
It’s also possible to get the same exact result without using the JOIN keyword, but you’ll need the WHERE keyword to ensure that the rows get matched correctly.
result2 <- dbGetQuery(db, "select * from questions, questions_tags
    where questions.questionid = questions_tags.questionid and tag = 'python'")
identical(result1, result2)
[1] TRUE
We’ll explain what is going on in the next section.
Here’s a three-way join (both with and without the JOIN keyword) with some additional use of aliases to abbreviate table names. What does this query ask for?
result1 <- dbGetQuery(db, "select * from questions Q
    join questions_tags T on Q.questionid = T.questionid
    join users U on Q.ownerid = U.userid
    where tag = 'python' and upvotes > 100")
Once again, we could do that without JOIN and using WHERE to match the rows appropriately.
result2 <- dbGetQuery(db, "select * from questions Q, questions_tags T, users U
    where Q.questionid = T.questionid and Q.ownerid = U.userid
    and tag = 'python' and upvotes > 100")
Challenge
Write a query that would return all the answers to questions with the Python tag.
Challenge
Write a query that will count the number of answers for each question, returning the most answered questions and their information. Note that this extends the question in the previous section.
Challenge
Write a query that would return the users who have answered a question with the Python tag.
1.4.2 Types of joins
We’ve seen a bunch of joins but haven’t discussed the full taxonomy of types of joins. There are various possibilities for how to do a join depending on whether there are rows in one table that do not match any rows in another table.
Inner joins: In database terminology an inner join is when the result has a row for each match of a row in one table with the rows in the second table, where the matching is done on the columns you indicate. If a row in one table corresponds to more than one row in another table, you get all of the matching rows in the second table, with the information from the first table duplicated for each of the resulting rows. For example in the Stack Overflow data, an inner join of questions and answers would pair each question with each of the answers to that question. However, questions without any answers or (if this were possible) answers without a corresponding question would not be part of the result.
Outer joins: Outer joins add additional rows from one table that do not match any rows from the other table as follows. A left outer join gives all the rows from the first table but only those from the second table that match a row in the first table. A right outer join is the converse, while a full outer join includes at least one copy of all rows from both tables. So a left outer join of the Stack Overflow questions and answers tables would, in addition to the matched questions and their answers, include a row for each question without any answers, as would a full outer join. In this case there should be no answers that do not correspond to a question, so a right outer join should be the same as an inner join. Note that one cannot do a right outer join (or a full outer join) in SQLite; you’ll need to switch the order of the tables and do a left outer join.
Cross joins: A cross join gives the Cartesian product of the two tables, namely the pairwise combination of every row from each table, analogous to expand.grid in R. I.e., take a row from the first table and pair it with each row from the second table, then repeat that for all rows from the first table. Since cross joins pair each row in one table with all the rows in another table, the resulting table can be quite large (the product of the number of rows in the two tables). In the Stack Overflow database, a cross join would pair each question with every answer in the database, regardless of whether the answer is an answer to that question.
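Since the full result of a cross join can be enormous, here’s a small sketch that limits the output; each returned row pairs a question with an arbitrary answer from the whole table:
dbGetQuery(db, "select * from questions cross join answers limit 5")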
Here’s a table of the different kinds of joins:
Type of join      Rows from first table                      Rows from second table
------------      ---------------------                      ----------------------
inner (default)   all that match on specified condition      all that match on specified condition
left outer        all                                        all that match first table
right outer       all that match second table                all
full outer        all                                        all
cross             all combined pairwise with second table    all combined pairwise with first table
A ‘natural’ join is an inner join that doesn’t require you to specify the common columns between tables on which to enforce equality, but it’s often good practice to not use a natural join and to explicitly indicate which columns are being matched on.
Simply listing two or more tables separated by commas as we saw earlier is the same as a cross join. Alternatively, listing two or more tables separated by commas, followed by conditions that equate rows in one table to rows in another is the same as an inner join.
In general, inner joins can be seen as a form of cross join followed by a condition that enforces matching between the rows of the tables. More broadly, here are five queries that all perform the equivalent of an inner join:
select * from table1 join table2 on table1.id = table2.id   ## explicit inner join
select * from table1, table2 where table1.id = table2.id    ## without explicit JOIN
select * from table1 cross join table2 where table1.id = table2.id
select * from table1 join table2 using(id)
select * from table1 natural join table2
Note that in the last query the join would be based on all common columns, which could be a bit dangerous if you don’t look carefully at the schema of both tables. Assuming id is the common column, then the last of these queries is the same as the others.
Challenge
Create a view with one row for every question-tag pair, including questions without any tags.
Challenge
Write a query that would return the displaynames of all of the users who have never posted a question. The NULL keyword will come in handy – it’s like NA in R. Hint: NULLs should be produced if you do an outer join.
Challenge
How many questions tagged with ‘random-forest’ were unanswered? (You should need two different kinds of joins to answer this.)
1.4.3 Counting NULLs
We’ve seen that one can use outer joins as a way to find rows in one table that do not appear in another table. If you want to count the number of entries by group, with a count of 0 when a group doesn’t appear in a table, make sure to apply the count only to a field produced by the outer join that contains NULL values, and not to all fields.
For example, if we wanted to count the number of answers provided by each user, making sure to include people who have answered no questions with a value of 0, we could count the answerid field, which will be NULL for users with no answers and therefore contribute a count of 0:
dbGetQuery(db, "select userid, count(answerid) as n_answers from users left outer join answers on userid=ownerid group by userid order by userid limit 10")
If we had done count(*), then each row for a user with no answers would be incorrectly assigned a 1 (since they have a row associated with them) instead of a 0 (note user #56 below).
dbGetQuery(db, "select userid, count(*) as n_answers from users left outer join answers on userid=ownerid group by userid order by userid limit 10")
1.4.4 Joining a table with itself (self join)
Sometimes we want to query information across rows of the same table. For example, suppose we want to analyze the time lags between when the same person posts a question. Do people tend to post in bursts or do they tend to post uniformly over the year? To do this we need contrasts between the times of the different posts. (One can also address this using window functions, discussed later.)
So we need to join two copies of the same table, which means dealing with resolving the multiple copies of each column.
This would look like this:
dbGetQuery(db, "select * from questions Q1 join questions Q2 on Q1.ownerid = Q2.ownerid")
That should create a new table with all pairs of questions asked by a single person.
Actually, there’s a problem here.
Challenge
What kinds of rows will we get that we don’t want?
A solution to that problem of having the same question paired with itself is:
dbGetQuery(db, "create view question_contrasts as select * from questions Q1 join questions Q2 on Q1.ownerid = Q2.ownerid where Q1.creationdate != Q2.creationdate")
Challenge
There’s actually a further similar problem. What is the problem and how can we fix it by changing two characters in the query above? Hint, even as character strings, the creationdate column has an ordering.
1.5 Temporary tables and views
You can think of a view as a temporary table that is the result of a query and can be used in subsequent queries. In any given query you can use both views and tables. The advantage is that they provide modularity in our querying. For example, if a given operation (portion of a query) is needed repeatedly, one could abstract that as a view and then make use of that view.
Suppose we always want the upvotes and displayname of question owners available. Once we have the view we can query it like a regular table.
## note there is a creationdate in users too, hence disambiguation
dbExecute(db, "create view questions_plus as
    select questionid, questions.creationdate, score, viewcount,
        title, ownerid, upvotes, displayname
    from questions join users on questions.ownerid = users.userid")
## don't be confused by the "0" response --
## it just means that nothing is returned to R; the view _has_ been created
dbGetQuery(db, "select * from questions_plus where upvotes > 100 limit 5")
One use of a view would be to create a mega table that stores all the information from multiple tables in the (unnormalized) form you might have if you simply had one data frame in R or Python.
dbExecute(db, "drop view questions_plus") # drop the view if we no longer need it
If you want to create a temporary table just for a single query, you can use a subquery or a WITH clause, as discussed in Section 3.2.
## Option 1: pass directly from CSV to database
dbWriteTable(conn = db, name = "student", value = "student.csv",
             row.names = FALSE, header = TRUE)

## Option 2: pass from data in an R data frame
## First create your data frame:
# student_df <- data.frame(...)
## or
# student_df <- read.csv(...)
dbWriteTable(conn = db, name = "student", value = student_df,
             row.names = FALSE, append = FALSE)
2.2 String processing and creating new fields
We can do some basic matching with LIKE, using % as a wildcard and _ to stand in for any single character:
dbGetQuery(db, "select * from questions_tags where tag like 'r-%' limit 10")
In Postgres, in addition to the basic use of LIKE to match character strings, one can use regular expression syntax with SIMILAR TO.
SIMILAR TO is not available in SQLite, so the following can only be done in the Postgres instance of our example database. Here we’ll look for all tags that start with “r-”, end with “-r”, are exactly “r”, or contain “-r-”. SQL uses % as a wildcard (this is not standard regular expression syntax).
## Try in postgreSQL, not SQLite
result <- dbGetQuery(db, "select * from questions_tags
    where tag similar to 'r-%|%-r|r|%-r-%' limit 10")
## Standard regex for 'any character' doesn't seem to work:
## result <- dbGetQuery(db, "select * from questions_tags
##     where tag SIMILAR TO 'r-.*|.*-r|r|.*-r-.*' limit 10")
Warning
The pattern does not match on substrings unless one uses wildcards at the beginning and end of the pattern, so “r” will only find “r” and not, for example, “dplyr”.
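To match on substrings, one can put wildcards at both ends of the pattern; for example, a sketch with LIKE in SQLite:
## '%r%' matches any tag containing 'r' anywhere, including 'dplyr'
dbGetQuery(db, "select distinct tag from questions_tags where tag like '%r%' limit 10")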
To extract substrings we can use SUBSTRING in Postgres. Here’s a basic example:
dbGetQuery(db, "select substring(creationdate from '^[[:digit:]]{4}') as year from questions limit 3")
If you need to specify the pattern to be extracted relative to the surrounding characters, Postgres requires that the pattern to be extracted be surrounded by #" (one could use another character in place of #). For use from R, we need to escape the double quote with a backslash so it is treated as part of the string passed to Postgres rather than as ending the R string. We also need the % wildcard character when extracting in this way.
dbGetQuery(db, "select substring(creationdate from '%-#\"[[:digit:]]{2}#\"-%' for '#') as month from questions limit 3")
Substrings
SQLite provides SUBSTR for substrings, but the flexibility of SUBSTR seems to be much less than use of SUBSTRING in PostgreSQL.
Challenge
Select the questions that have “java” but not “javascript” in their titles using regular expression syntax.
Challenge
Figure out how to calculate the length (in characters) of the title of each question.
Challenge
Process the creationdate field to create year, day, and month fields in a new view. Note that this would be good practice for string manipulation but you would want to handle dates and times using the material in the next section and not use string processing.
2.3 Dates and times
Here we’ll see how you can work with dates and times in SQLite, but the functionality should be similar in other DBMS.
SQLite doesn’t have specific date-time types, but it’s standard to store date-times as strings in a text field in the ISO-8601 format: YYYY-MM-DD HH:MM:SS.SSS. That’s the format of the dates in the StackOverflow database:
dbGetQuery(db, "select distinct creationdate from questions limit 5")
Then SQLite provides some powerful functions for manipulating and extracting information in such fields. Here are just a few examples, noting that strftime is particularly powerful. Other DBMS should have similar functionality, but I haven’t investigated further.
## Julian days (decimal days since noon UTC/Greenwich time November 24, 4714 BC (Yikes!)).
output <- dbGetQuery(db, "select creationdate, julianday(creationdate) from questions limit 5")
output
## day of week: Jan 1 2021 was a Friday (0=Sunday, 6=Saturday)
dbGetQuery(db, "select creationdate, strftime('%w', creationdate) from questions limit 5")
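As another sketch of what strftime can do, one could tabulate how many questions were asked in each month ('%m' extracts the two-digit month from the ISO-8601 string):
dbGetQuery(db, "select strftime('%m', creationdate) as month, count(*) as n
    from questions group by month order by month")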
3.1 Set operations
You can do set operations like union, intersection, and set difference using the UNION, INTERSECT, and EXCEPT keywords on tables that have the same schema (same column names and types), though most often these would be used on single columns (i.e., single-column tables).
Note
While one can often set up an equivalent query without using INTERSECT or UNION, set operations can be very handy.
Here’s an example of a query that can be done with or without an intersection. Suppose we want to know the names of all individuals who have asked both an R question and a Python question. We can do this with INTERSECT:
system.time(
  result1 <- dbGetQuery(db, "
    select displayname, userid from questions Q
        join users U on U.userid = Q.ownerid
        join questions_tags T on Q.questionid = T.questionid
        where tag = 'r'
    intersect
    select displayname, userid from questions Q
        join users U on U.userid = Q.ownerid
        join questions_tags T on Q.questionid = T.questionid
        where tag = 'python'")
)
user system elapsed
4.239 1.434 7.565
Alternatively we can do a self-join. Note that the syntax gets complicated as we are doing multiple joins.
system.time(
  result2 <- dbGetQuery(db, "
    select displayname, userid from
        (questions Q1 join questions_tags T1 on Q1.questionid = T1.questionid)
        join
        (questions Q2 join questions_tags T2 on Q2.questionid = T2.questionid)
        on Q1.ownerid = Q2.ownerid
        join users on Q1.ownerid = users.userid
        where T1.tag = 'r' and T2.tag = 'python'")
)
user system elapsed
5.977 3.565 10.854
identical(result1, result2)
[1] FALSE
Challenge
Those two queries return equivalent information, but the results are not exactly the same. What causes the difference? How can we modify the second query to get the exact same results as the first?
Which is faster? The second one looks more involved in terms of the joins, so the timing results seen above make sense.
We could use UNION or EXCEPT to find people who have asked either or only one type of question, respectively.
Challenge
Find the users who have asked either an R question or a Python question.
Challenge
Find the users who have asked but not answered a question.
3.2 Subqueries
A subquery is a full query that is embedded in a larger query.
3.2.1 Subqueries in the FROM statement
We can use subqueries in the FROM statement to create a temporary table to use in a query. Here we’ll do it in the context of a join.
Try to figure out what the following query does.
dbGetQuery(db, "select * from questions join answers A on questions.questionid = A.questionid join (select ownerid, count(*) as n_answered from answers group by ownerid order by n_answered desc limit 1000) most_responsive on A.ownerid = most_responsive.ownerid")
Here the subquery is select ownerid, count(*) as n_answered from answers group by ownerid order by n_answered desc limit 1000, which creates a temporary table, named most_responsive, having the ownerid (and the number of their answers) for the 1000 users who have answered the most questions.
In the ‘outer’ main query, that temporary table is joined to the questions and answers tables.
It might be hard to just come up with that full query all at once. A good strategy is probably to think about creating a view that is the result of the inner query and then have the outer query use that. You can then piece together the complicated query in a modular way. For big databases, you are likely to want to submit this as a single query and not two queries so that the SQL optimizer can determine the best way to do the operations. But you want to start with code that you’re confident will give you the right answer!
Note we could also have done that query using a subquery in the WHERE statement, as discussed in the next section.
Challenge
Write a query that finds the number of answers per question, but only answers from users with at least 100 upvotes.
Finally one can use subqueries in the SELECT clause to create new variables, but we won’t go into that here.
3.2.2 Subqueries in the WHERE statement
Instead of a join, we can use subqueries as a way to combine information across tables, with the subquery involved in a WHERE statement. The subquery creates a set and we then can check for inclusion in (or exclusion from with not in) that set.
For example, suppose we want to know the average number of UpVotes for users who have posted a question with the tag “python”.
dbGetQuery(db, "select avg(upvotes) from users where userid in (select distinct ownerid from questions join questions_tags on questions.questionid = questions_tags.questionid where tag = 'python')")
avg(upvotes)
1 62.72529
In some cases one can do a join rather than using a subquery, but in the following example the join gives the wrong answer.
Challenge
What’s wrong with the following query as an attempt to answer the question above? (See if you can figure it out before looking at the answer below.)
dbGetQuery(db, "select avg(upvotes) from questions, questions_tags, users where questions.questionid = questions_tags.questionid and questions.ownerid = users.userid and tag = 'python'")
Answer
In the subquery, we find the IDs of the users we are looking for and then average over the UpVotes of those individuals. In the join version we found all the questions that had a Python tag and averaged over the UpVotes of the individuals associated with those questions. So the latter includes multiple UpVotes values from individuals who have posted multiple Python questions.
Challenge
Write a query that would return the user information for users who have answered a question with the Python tag. We’ve seen this challenge before, but do it now based on a subquery.
Challenge
Find the users who have asked but not answered a question. We’ve seen this before, but now make use of subqueries instead of a join.
Challenge
How would you find all the answers associated with the user with the most upvotes?
Challenge
Create a frequency list of the tags used in the top 100 most answered questions. Note there is a way to do this with a JOIN and a way without a JOIN.
3.2.3 Using WITH
The WITH clause allows you to create a temporary table to then use in an associated SELECT statement. So it provides similar functionality to using a view but without it being a persistent part of the database. The temporary table is only available within the associated SELECT statement. WITH can only occur as part of a query with SELECT.
Let’s see use of WITH to accomplish what we did with a subquery in the FROM statement above.
dbGetQuery(db, "with most_responsive as ( select ownerid, count(*) as n_answered from answers group by ownerid order by n_answered desc limit 1000 ) select * from questions join answers A on questions.questionid = A.questionid join most_responsive on A.ownerid = most_responsive.ownerid")
One could also replace the subquery in the WHERE statement above using WITH.
dbGetQuery(db, "with tmp as (select distinct ownerid from questions join questions_tags on questions.questionid = questions_tags.questionid where tag = 'python') select avg(UpVotes) from users where userid in tmp")
Finally, you can create multiple temporary tables in the WITH clause, separated by commas. This can help make your query more modular without the complication of creating views that will only be used once.
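Here’s a sketch with two temporary tables; the names r_askers and python_askers are just illustrative, and the query counts users who have asked both an R and a Python question:
dbGetQuery(db, "with
    r_askers as (select distinct ownerid from questions
        join questions_tags using(questionid) where tag = 'r'),
    python_askers as (select distinct ownerid from questions
        join questions_tags using(questionid) where tag = 'python')
    select count(*) from r_askers join python_askers using(ownerid)")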
3.3 Window functions
Window functions provide the ability to perform calculations across sets of rows that are related to the current query row.
Comments:
The result of applying a window function is the same number of rows as the input, even though the functionality is similar to group by. Hint: think about the result of group by + mutate in dplyr in R.
One can apply a window function within groups or across the whole table.
The functions one can apply include standard aggregation functions such as avg and count as well as non-standard functions (specific to using window functions) such as rank, row_number, and cume_dist.
Unless you’re simply grouping into categories, you’ll generally need to order the rows for the window function to make sense.
The syntax is a bit involved, so let’s work through a range of examples:
Aggregate within groups but with one output value per input row
## Total number of questions for each owner
dbGetQuery(db, "select ownerid, count() over (partition by ownerid) as n
    from questions where ownerid is not NULL limit 10")
Compute cumulative calculations; note the need for ORDER BY within the PARTITION clause (the other ORDER BY is just for display purposes here):
## Rank (based on ordering by creationdate) of questions by owner
dbGetQuery(db, "select *, rank() over (partition by ownerid order by creationdate) as rank
    from questions order by ownerid desc limit 10")
questionid creationdate score viewcount answercount
1 70347322 2021-12-14 08:47:41 1 123 0
2 65620082 2021-01-07 21:19:16 0 44 1
3 65671366 2021-01-11 17:04:26 0 40 1
4 66034594 2021-02-03 19:43:00 1 111 2
5 66237702 2021-02-17 07:49:58 0 174 2
6 66547208 2021-03-09 12:49:23 2 307 1
7 66907039 2021-04-01 15:07:28 3 1540 1
8 67099166 2021-04-14 21:11:18 1 289 1
9 67355292 2021-05-02 10:28:59 -2 5001 1
10 67374899 2021-05-03 19:42:10 1 66 0
commentcount favoritecount
1 5 NA
2 9 NA
3 0 NA
4 0 NA
5 0 NA
6 0 NA
7 0 NA
8 0 NA
9 5 NA
10 7 NA
title
1 Adjust brightness of Arduino's built-in LED with pyFirmata
2 How to resolve squiggly lines in HTML files after importing Bootstrap to Angular project?
3 Input form fields are cannot be updated in Angular app?
4 Select first largest value using TOP 1 in MySQL?
5 Cron expression in Spring, validation?
6 Playing with Strategy Design Pattern using lambda expression by following Venkat Subramaniam's book?
7 GraphQL and Spring Security using @PreAuthorize?
8 Emitting from Flux<String> until one of conditions are met?
9 Cannot open older Angular project with latest Angular version?
10 How to make vertical scrollbar's height same as window's height and make horizontal scrollbar fixed?
ownerid rank
1 20674445 1
2 20390023 1
3 20390023 2
4 20390023 3
5 20390023 4
6 20390023 5
7 20390023 6
8 20390023 7
9 20390023 8
10 20390023 9
(Sidenote: we rely here on the fact that ordering alphabetically by creationdate is equivalent to time ordering.)
Do a lagged analysis
## Get previous value (based on ordering by creationdate) by owner
dbGetQuery(db, "select ownerid, creationdate,
    lag(creationdate, 1) over (partition by ownerid order by creationdate) as previous_date
    from questions order by ownerid desc limit 5")
So one could now calculate the difference between the previous and current date to analyze the time gaps between users posting questions.
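For example, here’s a sketch that uses julianday to compute the gap in (decimal) days between consecutive questions from the same owner:
dbGetQuery(db, "select ownerid, creationdate,
    julianday(creationdate) - julianday(lag(creationdate, 1) over
        (partition by ownerid order by creationdate)) as gap_days
    from questions order by ownerid desc limit 5")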
Do an analysis within an arbitrary window of rows based on the values in one of the columns
## Summarize questions within 15 days of current question
dbGetQuery(db, "select ownerid, creationdate,
    count() over (partition by ownerid order by julianday(creationdate)
        range between 15 preceding and 15 following) as n_window
    from questions where ownerid is not null limit 30")
There the ‘15 preceding’ and ‘15 following’ mean to include all rows within each ownerid that are within 15 Julian days (based on ‘creationdate’) of each row.
So one could now analyze bursts of activity.
One can also choose a fixed number of rows by replacing ‘range’ with ‘rows’. The ROWS and RANGE syntax allow one to specify the window frame in more flexible ways than simply the categories of a categorical variable.
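For example, here’s a sketch using ‘rows’ to count over a fixed window consisting of the current row and the two preceding rows:
dbGetQuery(db, "select ownerid, creationdate,
    count() over (partition by ownerid order by creationdate
        rows between 2 preceding and current row) as n_last3
    from questions where ownerid is not null limit 10")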
Ranking becomes more complicated when there are ties. RANK assigns the same value to any ties and then skips values based on the number of ties; e.g., you can get 1, 1, 1, 4 if the three lowest values are tied. DENSE_RANK avoids skipping values, giving 1, 1, 1, 2 in the same situation. ROW_NUMBER simply numbers the rows, breaking ties arbitrarily (and thus ambiguously), giving 1, 2, 3, 4 in the same situation.
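Here’s a sketch comparing the three functions under the same ordering (the aliases rnk, drnk, and rownum are just illustrative; score has many ties):
dbGetQuery(db, "select score,
    rank() over (order by score) as rnk,
    dense_rank() over (order by score) as drnk,
    row_number() over (order by score) as rownum
    from questions limit 10")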
So the syntax of a window function will generally have these elements:
a call to some function that calculates within the window and assigns value(s) to the rows in the window
OVER
PARTITION BY (optional)
ORDER BY (optional)
RANGE or ROWS (optional)
AS (optional)
You can also name window functions, which comes in handy if you want multiple functions applied to the same window:
dbGetQuery(db, "select ownerid, creationdate, lag(creationdate, 1) over w as lag1, lag(creationdate, 2) over w as lag2 from questions where ownerid is not null window w as (partition by ownerid order by creationdate) order by ownerid limit 5")
Finally, you can use window functions on the entire table, without partitioning.
## Summarize questions within 15 (decimal) days of current question
dbGetQuery(db, "select ownerid, creationdate,
    count() over (order by julianday(creationdate)
        range between 15 preceding and 15 following) as n_window
    from questions where ownerid is not null limit 30")
Challenge
Use a window function to compute the average viewcount for each ownerid for the 10 questions preceding each question.
Challenge
For each question, get the answer given by the user with the maximum reputation amongst users answering the question.
Hint: you’ll need to first create a subquery that determines the maximum reputation amongst the answers for each question and then use that to get the answer of interest for each question.
Challenge (hard)
Find the users who have asked one question that is highly-viewed (viewcount > 1000) with their remaining questions not highly-viewed (viewcount < 20 for all other questions).
3.4 Putting it all together to do complicated queries
Here are some real-world style questions one might try to create queries to answer. The context for these questions is a situation in which you have data on user sessions on a website or data on messages between users.
Given a table of user sessions with the format
date | session_id | user_id | session_time
calculate the distribution of the average daily total session time in the last month. I.e., you want to get each user’s daily average and then find the distribution over users. The output should be something like:
minutes_per_day | number_of_users
Consider a table of messages of the form
sender_id | receiver_id | message_id
For each user, find the three users they message the most.
Suppose you are running an online experiment and have a table on the experimental design:
user_id | test_group | date_first_exposed
Suppose you also have a messages table that indicates if each message was sent on web or mobile:
date | sender_id | receiver_id | message_id | interface (web or mobile)
For each test group, what is the average (over users) of each user’s average number of messages sent per day, considering only users who have sent messages exclusively on mobile in the last month?
Challenge questions with the Stack Overflow data
If we take the three challenges above and translate into problems for the Stack Overflow data, one could consider the following three problems.
For each user who has asked at least one question find the average number of questions per month. Then determine the distribution of that average across the users. (I.e., determine the values that would go into a histogram of the average number of questions per month across users.) The output should be something like:
number of questions per month (rounded) | number of users
Next try to figure out how to include the users who have asked no questions. Hint: think about how to get NULL values included and then count a column containing such values.
For each user, find the three most common tags they apply to their questions.
The output should be something like:
userid | tag | count
Hints: You’ll need to use subqueries and the final selection of just the top 3 tags will need to be done in its own query and not as part of defining a field using a window function.
Consider trying to determine whether users who answer a lot of questions also ask a lot of questions. Grouping the users based on the number of questions they’ve answered (0, 1, 2, etc.), determine the average number of questions per month for each group.
The output should be something like:
number of answers | average number of questions per month
You’ll want to work through this in pieces. Try to think about the initial tables you would need and then build up your query in a nested fashion.
3.5 A summary of SQL syntax by example
This section shows the syntax for various queries so as to demonstrate syntax by example. It may be useful to test your understanding by figuring out (either with or without running the query) what the query does.
Selecting columns
select ownerid, title from questions
select ownerid, title from questions limit 5
select * from questions
select * from questions order by answercount desc
select count(*) as n from questions
select count(ownerid) as n from questions
select sum(answercount) from questions
Using distinct
select distinct tag from questions_tags limit 15
select distinct ownerid, answercount from questions limit 15
select count(distinct tag) from questions_tags limit 15
Filtering rows with where
select * from questions where answercount > 40
select * from questions where answercount > 40 order by answercount desc
select * from questions where answercount = 10 limit 5
select * from questions_tags where tag like 'r-%' limit 10
select * from questions_tags where tag similar to 'r-%|%-r|r|%-r-%' limit 10
select * from questions_tags where tag in ('java', 'r', 'python') limit 10
Grouping and reduction/aggregation
select tag, count(*) as n from questions_tags
    group by tag
select tag, count(*) as n from questions_tags
    group by tag having n > 1000
select ownerid, count(*) as n from questions
    group by ownerid order by n desc limit 15
select ownerid, sum(viewcount) as viewed from questions
    group by ownerid
select *, sum(viewcount) as viewed from questions
    group by ownerid
select answercount, commentcount, count(*) as n from questions
    group by answercount, commentcount
select tag, count(*) as n from questions_tags
    where tag like 'python%' group by tag having n > 1000
The query above starting with select *, sum(viewcount) behaves differently in SQLite and DuckDB.
Joins
Inner joins
select * from questions join questions_tags
    on questions.questionid = questions_tags.questionid
select * from questions Q join questions_tags T
    on Q.questionid = T.questionid
select * from questions Q join questions_tags T
    using(questionid)
select * from questions Q, questions_tags T
    where Q.questionid = T.questionid
Outer joins
select * from questions Q left outer join answers A
    on Q.questionid = A.questionid

select * from questions Q left outer join answers A
    on Q.questionid = A.questionid
    where A.creationdate is NULL

## Note: no right outer join in SQLite, so to run this there,
## reverse the order of answers and questions and use a left outer join
select * from questions Q right outer join answers A
    on Q.questionid = A.questionid
    where Q.creationdate is NULL

select Q.questionid, count(*) as n_tags from questions Q join questions_tags T
    on Q.questionid = T.questionid
    group by Q.questionid
Self joins
First we’ll set up a view (a temporary table) that combines questions and tags, for ease of illustrating ideas around self joins.
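The definition of QT isn’t shown here, but based on how it’s used below, a plausible definition (an assumption about its columns) would be:
dbExecute(db, "create view QT as
    select questionid, ownerid, tag
    from questions join questions_tags using(questionid)")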
select * from QT as QT1 join QT as QT2
    using(questionid)
select * from QT as QT1 join QT as QT2
    using(questionid) where QT1.tag < QT2.tag
select QT1.tag, QT2.tag, count(*) as n from QT as QT1 join QT as QT2
    using(questionid) where QT1.tag < QT2.tag
    group by QT1.tag, QT2.tag order by n desc limit 10
select * from QT as QT1 join QT as QT2 using(ownerid)
Set operations
select ownerid from QT where tag = 'python'
intersect
select ownerid from QT where tag = 'r'

select ownerid from QT where tag = 'python'
except
select ownerid from QT where tag = 'r'

select ownerid from QT where tag = 'python'
union
select ownerid from QT where tag = 'r'

select userid, displayname, location from users
    where location like '%United States%'
intersect
select userid, displayname, location from users
    where reputation > 10
Subqueries
select * from
    answers A
    join
    (select ownerid, count(*) as n_answered from answers
     group by ownerid order by n_answered desc limit 1000) most_responsive
    on A.ownerid = most_responsive.ownerid
select avg(upvotes) from users
    where userid in
    (select distinct ownerid from
     questions join questions_tags using(questionid)
     where tag = 'python')
4 Efficient SQL queries
4.1 Overview
In general, your DBMS should examine your query and try to implement it in the fastest way possible.
Some tips for faster queries include:
use indexes on fields used in WHERE and JOIN clauses (see next section)
try to avoid wildcards at the start of LIKE string comparison when you have an index on the field (as this requires looking at all of the rows)
similarly, try to avoid using functions on indexed columns in a WHERE clause, as this requires doing the calculation on all the rows in order to check the condition (see the sketch after this list)
only select the columns you really need
create (temporary) tables to store intermediate results that you need to query repeatedly
use filtering (WHERE clauses) in inner statements when you have nested subqueries
use LIMIT as seen in the examples here if you only need some of the rows a query returns
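To illustrate the tip about functions on indexed columns, here’s a sketch assuming an index exists on creationdate: the first form can use the index, while the second applies strftime to every row and cannot.
## Can use an index on creationdate (range condition on the raw column):
dbGetQuery(db, "select count(*) from questions
    where creationdate >= '2021-06-01' and creationdate < '2021-07-01'")
## Cannot use the index (function applied to the column):
dbGetQuery(db, "select count(*) from questions
    where strftime('%m', creationdate) = '06'")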
4.2 Indexes
An index is an ordering of rows based on one or more fields. DBMS use indexes to look up values quickly, either when filtering (if the index is involved in the WHERE condition) or when doing joins (if the index is involved in the JOIN condition). So in general you want your tables to have indexes.
DBMS use indexing to provide sub-linear time lookup. Without indexes, a database needs to scan through every row sequentially, which is called linear time lookup – if there are n rows, the lookup is \(O(n)\) in computational cost. With indexes, lookup may be logarithmic – O(log(n)) – (if using tree-based indexes) or constant time – O(1) – (if using hash-based indexes). A binary tree-based search is logarithmic; at each step through the tree you can eliminate half of the possibilities.
Here’s how we create an index, with some time comparison for a simple query.
system.time(dbGetQuery(db, "select * from questions where viewcount > 10000"))   # 2.4 seconds
system.time(dbExecute(db, "create index count_index on questions (viewcount)"))  # 5.6 seconds
system.time(dbGetQuery(db, "select * from questions where viewcount > 10000"))   # 0.9 seconds
## restore earlier state by removing index
system.time(dbExecute(db, "drop index count_index"))
In many contexts (but not the example above), an index can save huge amounts of time. So if you’re working with a database and speed is important, check to see if there are indexes.
One downside of indexes is that creation of indexes can be very time-consuming, as seen above. And if the database is updated frequently, this could be detrimental.
Finally, using indexes in a lookup is not always advantageous, as discussed next.
4.2.1 Index lookup vs. sequential scan
Using an index is good in that the computer can go directly to the rows of interest via random access to their disk locations, but if the query requires examining a large number of rows, an index may not beat a sequential scan. An advantage of sequential scan is that it makes good use of the CPU cache, reading contiguous chunks of data and then accessing the individual pieces of data quickly.
For example, if you change the query above to filter on viewcount with a much smaller threshold than 10000, you will probably see that the query takes more time with the index than without it.
Ideally you’d do sequential scan of exactly the subset of the rows that you need, with that subset available in contiguous storage.
4.2.2 How indexes work
Indexes are often implemented using tree-based methods. For example, in Postgres, b-tree indexes are used for indexes on things that have an ordering. Trees are basically like decision trees: at each node in the tree, a condition sends one down the left or right branch (there might also be more than two branches). Eventually one reaches the leaves of the tree, which contain the actual values one is looking for, each associated with the address of where that row of data is stored. With a tree-based index, the time cost of b-tree lookup is logarithmic (based on the binary lookup), so it grows with the number of elements in the table, but slowly. The lookup process is that, given a value (often referred to as a key), one walks down the tree, comparing the value to the condition at each split, until one finds the elements corresponding to the value and the addresses of where the desired rows are stored.
Here’s some information on how such trees are constructed and searched.
In SQLite, indexes are implemented by creating a separate index table that maps from the value to the row index in the indexed table, allowing for fast lookup of a row.
4.3 SQL query plans and EXPLAIN
You can actually examine the query plan that the system is going to use for a query using the EXPLAIN keyword. I’d suggest trying this in Postgres as the output is more interpretable than SQLite.
dbGetQuery(db, "explain select * from webtraffic where count > 500")
The output shows the steps the system plans to use, along with their estimated costs.
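In SQLite, the analogous information comes from EXPLAIN QUERY PLAN, which gives terser but readable output:
dbGetQuery(db, "explain query plan select * from questions where viewcount > 10000")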
4.4 Disk caching
You might think that database queries will generally be slow (and slower than in-memory manipulation such as in R or Python when all the data can fit in memory) because the database stores the data on disk. However, as mentioned on the introduction page, the operating system will generally cache files/data in memory when it reads from disk. Then if that information is still in memory the next time it is needed, it will be much faster to access it the second time around. Other processes might need memory and ‘invalidate’ the cache, but often once the data is read once, the database will be able to do queries quite quickly. This also means that even if you’re using a database, you can benefit from a machine with a lot of memory if you have a large database (ideally a machine with rather more RAM than the size of the table(s) you’ll be accessing).
Given this, it generally won’t be helpful to force your database to reside in memory (e.g., using :memory: for SQLite or putting the database on a RAM disk).
4.5 Parallelization and partitioning
To speed up your work, you might try to split your queries into multiple queries to run in parallel. However, you’re likely to have problems running parallel queries from a single R or Python session.
However, multiple queries to the same database from separate R or Python sessions will generally run fine but can compete for access to disk/memory. That said, in some basic experiments, the slowdown was moderate, so one may be able to parallelize across processes in a manual fashion.
Finally, Postgres supports partitioning tables. Generally one would divide a large table into smaller tables based on unique values of a key. For example, if your data had timestamps, you could partition into subtables for each month or each year. This allows faster queries when considering data that reside on one or a small number of partitions and could also ease manual implementation of parallelization. Here’s some information: https://www.postgresql.org/docs/current/static/ddl-partitioning.html.
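As a sketch of what the Postgres DDL looks like (the table and connection names here are hypothetical, and this won’t run in SQLite):
## Parent table partitioned by range on creationdate (Postgres only):
# dbExecute(pg_db, "create table questions_part (
#     questionid integer, creationdate timestamp, viewcount integer
# ) partition by range (creationdate)")
# dbExecute(pg_db, "create table questions_2021 partition of questions_part
#     for values from ('2021-01-01') to ('2022-01-01')")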
4.6 DuckDB
For a serverless database, DuckDB is a nice alternative to SQLite that may speed up queries substantially. DuckDB stores data column-wise, which can lead to big speedups when doing queries operating on large portions of tables (so-called “online analytical processing” (OLAP)). Also, in this case, working with a column-wise format may be faster than using an index.
Let’s compare timing using DuckDB and SQLite versions of the StackOverflow database.
First, we’ll see that simple queries that have to process an entire column can be much faster in DuckDB.
library(duckdb)
drv <- duckdb()
dbFilename <- 'stackoverflow-2021.duckdb'
dbDuck <- dbConnect(drv, file.path(dir, dbFilename))
system.time(dbGetQuery(db, "select count(ownerid) from questions"))
user system elapsed
0.169 0.157 1.551
system.time(dbGetQuery(dbDuck, "select count(ownerid) from questions"))
user system elapsed
0.034 0.004 0.078
system.time(result1 <- dbGetQuery(db, "select distinct ownerid from questions"))
user system elapsed
1.900 1.266 3.331
system.time(result2 <- dbGetQuery(dbDuck, "select distinct ownerid from questions"))
user system elapsed
0.207 0.020 0.038
Now let’s compare timings for some of the queries we ran previously. There are substantial speed-ups in both cases.
Here’s a simple join with a filter.
system.time(result1 <- dbGetQuery(db, "select * from questions
    join questions_tags on questions.questionid = questions_tags.questionid
    where tag = 'python'"))
user system elapsed
3.474 0.872 4.770
system.time(result2 <- dbGetQuery(dbDuck, "select * from questions
    join questions_tags on questions.questionid = questions_tags.questionid
    where tag = 'python'"))
user system elapsed
0.881 0.126 1.180
And here’s a subquery in the FROM statement.
system.time(result1 <- dbGetQuery(db, "select * from questions
    join answers A on questions.questionid = A.questionid
    join (select ownerid, count(*) as n_answered from answers
          group by ownerid order by n_answered desc limit 1000) most_responsive
    on A.ownerid = most_responsive.ownerid"))
user system elapsed
8.130 1.862 10.237
system.time(result2 <- dbGetQuery(dbDuck, "select * from questions
    join answers A on questions.questionid = A.questionid
    join (select ownerid, count(*) as n_answered from answers
          group by ownerid order by n_answered desc limit 1000) most_responsive
    on A.ownerid = most_responsive.ownerid"))
user system elapsed
2.310 0.125 1.484
DuckDB will run in parallel by using multiple threads, which can help speed up computations on top of efficiencies available through the column-wise storage, though with the speed of the DuckDB queries above, I don’t think parallelization made much difference in those cases.