Workshop on databases and SQL. See the top menu for specific topics.
View the Project on GitHub berkeley-scf/scf-databases-2024
This project is maintained by berkeley-scf, the UC Berkeley Statistical Computing Facility
Hosted on GitHub Pages — Theme by orderedlist
Topics:
Discussion question:
I noticed a data analysis challenge (on Kaggle) involves analyzing data on motor vehicle collisions in New York City. Each record represents an individual collision, including the date, time and location of the accident, vehicles and victims involved, and factors contributing to the accident (e.g., weather, excessive speed, drunk driving, …). Consider trying to set this up as a single table (or dataframe). What difficulties do you run into?
Now consider how you would create database containing multiple tables with this sort of information. What would the tables be?
StackOverflow example database
The database we’ll work with as our example is data on questions and answers on the Stack Overflow code discussion website from 2021. We don’t have the full text of the questions or the answers, but we do have:
Each question may have zero or more answers and each question has one or more tags indicating what the question is above.
Python:
import duckdb
dir = 'data' # relative or absolute path to where the .duckdb file is
dbFilename = 'stackoverflow-2021.duckdb' # possibly 'stackoverflow-2021-oldv.duckdb'
import os
db = duckdb.connect(os.path.join(dir, dbFilename))
db.sql("select * from questions limit 5")
To return a result as a dataframe:
db.sql("select * from questions limit 5").fetch_df()
R:
library(duckdb)
drv <- duckdb()
dir <- 'data' # relative or absolute path to where the .db file is
dbFilename <- 'stackoverflow-2021.duckdb'
db <- dbConnect(drv, file.path(dir, dbFilename))
# simple query to get 5 rows from a table
dbGetQuery(db, "select * from questions limit 5")
Let’s review the basic syntax (the “verbs”) used in SQL by way of some examples.
In small groups, discuss what these queries do. Feel free to try running them if you’re not sure or want to confirm your thoughts.
Note that for many of these, if you actually run them, you may want to add limit 5
at the end of the query to limit how much output you get back.
select ownerid, title from questions
select ownerid, title from questions limit 5
select * from questions
select * from questions order by answercount desc
select count(*) as n from questions
select count(ownerid) as n from questions
select sum(answercount) from questions
select distinct tag from questions_tags limit 15
select distinct ownerid, answercount from questions limit 15
select count(distinct tag) from questions_tags limit 15
where
select * from questions where answercount > 40
select * from questions where answercount > 40 order by answercount desc
select * from questions where answercount = 10 limit 5
select * from questions_tags where tag like 'r-%' limit 10
select * from questions_tags where tag similar to 'r-%|%-r|r|%-r-%' limit 10
select * from questions_tags where tag in ('java','r','python') limit 10