Berkeley Statistics Logo

SCF Databases/SQL January 2024 workshop

Workshop on databases and SQL. See the top menu for specific topics.

View the Project on GitHub berkeley-scf/scf-databases-2024

This project is maintained by berkeley-scf, the UC Berkeley Statistical Computing Facility

Hosted on GitHub Pages — Theme by orderedlist

Topics:

Databases overview

Memory and disk

Schema and normalization

Discussion question:

I noticed a data analysis challenge (on Kaggle) involves analyzing data on motor vehicle collisions in New York City. Each record represents an individual collision, including the date, time and location of the accident, vehicles and victims involved, and factors contributing to the accident (e.g., weather, excessive speed, drunk driving, …). Consider trying to set this up as a single table (or dataframe). What difficulties do you run into?

Now consider how you would create database containing multiple tables with this sort of information. What would the tables be?

Example dataset of StackOverflow questions and answers

Accessing a database from Python and R

Python:

import duckdb
dir = 'data' # relative or absolute path to where the .duckdb file is
dbFilename = 'stackoverflow-2021.duckdb'  # possibly 'stackoverflow-2021-oldv.duckdb'
import os
db = duckdb.connect(os.path.join(dir, dbFilename))
db.sql("select * from questions limit 5")

To return a result as a dataframe:

db.sql("select * from questions limit 5").fetch_df()

R:

library(duckdb)
drv <- duckdb()
dir <- 'data' # relative or absolute path to where the .db file is
dbFilename <- 'stackoverflow-2021.duckdb'
db <- dbConnect(drv, file.path(dir, dbFilename))
# simple query to get 5 rows from a table
dbGetQuery(db, "select * from questions limit 5")  

Basic SQL syntax

Let’s review the basic syntax (the “verbs”) used in SQL by way of some examples.

In small groups, discuss what these queries do. Feel free to try running them if you’re not sure or want to confirm your thoughts.

Note that for many of these, if you actually run them, you may want to add limit 5 at the end of the query to limit how much output you get back.

Select columns

select ownerid, title from questions
select ownerid, title from questions limit 5
select * from questions
select * from questions order by answercount desc
select count(*) as n from questions
select count(ownerid) as n from questions
select sum(answercount) from questions

Using distinct

select distinct tag from questions_tags limit 15
select distinct ownerid, answercount from questions limit 15
select count(distinct tag) from questions_tags limit 15

Filtering rows with where

select * from questions where answercount > 40
select * from questions where answercount > 40 order by answercount desc
select * from questions where answercount = 10 limit 5
select * from questions_tags where tag like 'r-%' limit 10
select * from questions_tags where tag similar to 'r-%|%-r|r|%-r-%' limit 10
select * from questions_tags where tag in ('java','r','python') limit 10