We’ll illustrate some basic database management using a different example dataset that contains data on webtraffic to Wikipedia pages. Note that the input files used here involved some pre-processing relative to the data you get directly from the Wikistats dataset available through Amazon Web Services (AWS), because in the data posted on AWS the datetime information is part of the filename rather than being field(s) in the table.
You can get the raw input files of Wikistats data here.
1 SQLite
1.1 Setting up a database and using the SQLite command line
With SQLite you don’t need to deal with all the permissions and administrative overhead of a client-server style of DBMS because an SQLite database is simply a file that you can access without a password or connecting to a database server process.
To start the SQLite interpreter in Linux, either operating on or creating a database named wikistats.db:
sqlite3 wikistats.db
Here’s the syntax to create an (empty) table:
create table webtraffic (date char(8), hour char(6), site varchar, page varchar, count integer, size double precision);
.quit
1.2 Populating a table
Here’s an example of reading from multiple files into SQLite using the
command line. We create a file import.sql
that has the configuration
for the import:
.separator " "
.import /dev/stdin webtraffic
Then we can iterate through our files from the UNIX shell, piping the
output of gzip to the sqlite3
interpreter:
for file in $(ls part*gz); do
echo "copying $file"
gzip -cd $file | sqlite3 wikistats.db '.read import.sql'
done
Let’s check that the records are in the database. We could do this from R or Python, but here we’ll do it in the SQLite interface:
select * from webtraffic limit 5;
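Here’s how the same check could look from R, as a minimal sketch using the DBI and RSQLite packages (an assumption; any DBI-compliant SQLite driver would work), run with wikistats.db in the R working directory:
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver for DBI
db <- dbConnect(SQLite(), "wikistats.db")  # the database is just this file; no server involved
dbGetQuery(db, "select * from webtraffic limit 5")
dbDisconnect(db)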
1.3 Data cleaning
One problem with importing the data files into SQLite as above is the presence of double quote (") characters that are not meant to delineate strings but are actually part of a field. In this case, probably the easiest thing is simply to strip out those quotes from the UNIX command line. Here we use sed to search and replace, creating versions of the input files that don’t have the quotes.
for file in $(ls *gz); do
gzip -cd ${file} | sed "s/\"//g" | gzip -c > wikistats-cleaned/${file}
done
Warning: If you want to read the data into SQLite yourself, you will need to do something about the quotes; I haven’t stripped them out of the files.
2 DuckDB
DuckDB also has an interpreter (a command line interface) that you can use instead of accessing DuckDB via R, Python, or other languages.
I won’t demonstrate that here. Instead I’ll demonstrate creating a
database directly from R without having to read the data into memory in
R. We control the schema from R as well, using standard arguments of
read.csv()
(which is used behind the scenes in setting up the DuckDB
database but not for reading in all the data).
library(duckdb)
dbname <- "wikistats.duckdb"
drv <- duckdb()
db <- dbConnect(drv, dbname)
duckdb_read_csv(db, 'webtraffic', list.files('.', pattern = "^part-"),
                delim = ' ', header = FALSE, na.strings = 'NA',
                colClasses = c('character','character','character','character','integer','numeric'),
                col.names = c('date','hour','site','page','count','size'))
dbDisconnect(db, shutdown = TRUE)
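As a quick check that the import worked, we can reconnect and query the new table (a minimal sketch, assuming the wikistats.duckdb file created above):
db <- dbConnect(duckdb(), "wikistats.duckdb")   # reopen the database file
dbGetQuery(db, "select count(*) from webtraffic")
dbGetQuery(db, "select * from webtraffic limit 5")
dbDisconnect(db, shutdown = TRUE)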
The DuckDB Python interface has read_csv
and read_parquet
functions
for creating a database from one or more data files.
3 PostgreSQL
3.1 Setting up a database and using the Postgres command line
First make sure Postgres is installed on your machine.
On Ubuntu, you can install Postgres easily via apt-get:
sudo apt-get install postgresql postgresql-contrib
Next we’ll see how to set up a database. You’ll generally need to
operate as the postgres
user for these sorts of manipulations. Of
course if you’re just a user accessing an existing database and existing
tables, you don’t need to worry about this.
sudo -u postgres -i # become the postgres user
psql # start postgres interpreter
Now from within the Postgres interpreter, you can create a database, tables within the database, and authenticate users to do things with those tables.
create database wikistats;
create user paciorek with password 'test';
grant all privileges on database wikistats to paciorek;
PostgreSQL and other DBMS (not SQLite) allow various kinds of control over permissions to access and modify databases and tables as well. It can get a bit involved because the administrator has fine-grained control over what each user can do/access.
Now let’s create a table in the database, after first connecting to the specific database so as to operate on it.
\connect wikistats
create table webtraffic (date char(8), hour char(6), site varchar, page varchar,
count integer, size double precision);
grant all privileges on table webtraffic to paciorek;
\quit
Note: Notice the use of \ to do administrative tasks (as opposed to executing SQL syntax), and the use of ; to end each statement. Without the semicolon, Postgres will return without doing anything.
If you want control over where the database is stored (you probably only need to worry about this if you are creating a large database), you can do things like this:
show data_directory;
create tablespace dbspace location '/var/tmp/pg';
create database wikistats tablespace dbspace;
create user paciorek with password 'test';
grant all privileges on database wikistats to paciorek;
3.2 Populating a table
Here’s an example of importing a single file into Postgres from within
the psql interpreter running as the special postgres user. In this case
we have space-delimited text files. You can obtain the file part-00000
as discussed in the introduction (you’ll need to run
gunzip part-00000.gz
first).
\connect wikistats
copy webtraffic from 'part-00000' delimiter ' ';
If one had CSV files, one could do the following:
copy webtraffic from 'part-00000' csv;
To actually handle the Wikistats input files, we need to deal with
backslash characters occurring at the end of text for a given column in
some rows. Ordinarily in standard Postgres ‘text’ format (different from
Postgres ‘csv’ format), a backslash is used to ‘quote’ characters that
would usually be treated as row or column delimiters (i.e., preceding
such a character by a backslash means it is treated as a character that
is part of the field). But we just want the backslash treated as a
character itself. So we need to tell Postgres not to treat a backslash
as the quoting character. To do that we specify the quote
character.
However, the quote keyword is only provided when importing ‘csv’ format.
In ‘csv’ format the double-quote character is by default treated as
delineating the beginning and end of text in a field, but the Wikistats
files have double-quotes as part of the fields. So we need to set the
quote character as neither a double-quote nor a backslash. The following
syntax does that by specifying that the quote character is a character (\b)
that never actually appears in the file. The ‘e’ part is so that
Postgres treats \b
as a single character, i.e., ‘escaping’ the
backslash, and the ‘csv’ is because the quote keyword only works with
the csv format, but note that by setting the delimiter to a space, it’s
not really a CSV file!
copy webtraffic from 'part-00000' delimiter ' ' quote e'\b' csv;
Often you’ll need to load data from a large number of possibly zipped text files. As an example of how you would load data in a case like that, here’s some shell scripting that will iterate through multiple (gzipped) input files of Wikistats data, running as the regular user:
export PGPASSWORD=test # set password via UNIX environment variable
for file in $(ls part*gz); do # loop thru files whose names start with 'part' and end with 'gz'
echo "copying $file"
## unzip and then pass by UNIX pipe to psql run in non-interactive mode
gzip -cd $file |
psql -d wikistats -h localhost -U paciorek -p 5432 -c "\copy webtraffic from stdin delimiter ' ' quote e'\b' csv"
done
Note: Using \copy as above invokes the psql copy command (plain copy would invoke the standard SQL copy command), which allows one to operate as a regular user and to use relative paths. In turn, \copy invokes copy in a specific way.
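Once the table is populated, we can also check the result from R. Here’s a minimal sketch using the DBI and RPostgres packages (an assumption; any DBI-compliant Postgres driver would work), connecting with the database, user, and password created above:
library(DBI)        # generic database interface
library(RPostgres)  # one possible Postgres driver for DBI
db <- dbConnect(Postgres(), dbname = 'wikistats', host = 'localhost',
                user = 'paciorek', password = 'test', port = 5432)
dbGetQuery(db, "select count(*) from webtraffic")
dbDisconnect(db)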
3.3 Data cleaning
One complication is that often the input files will have anomalies in them. Examples include missing columns for some rows, individual elements in a column that are not of the correct type (e.g., a string in a numeric column), and characters that can’t be handled. In the Wikistats data case, one issue was lines without the full set of columns and another was the presence of a backslash character at the end of the text for a column.
With large amounts of data or many files, this can be a hassle to deal with. UNIX shell commands can sometimes be quite helpful, including use of sed and awk. Or one might preprocess files in chunks using Python.
For example, the following shell scripting loop over Wikistats files ensures each row has 6 fields/columns by pulling out only rows with the full set of columns. I used this to process the input files before copying them into Postgres as done above. There was actually even more preprocessing, because in the data available from Amazon’s storage service the date/time information was part of the filename and not part of the data files.
for file in $(ls *gz); do
gzip -cd $file | grep "^.* .* .* .* .* .*$" | gzip -c > ../wikistats-fulllines/$file
done
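The same sort of filtering could also be done in chunks from R (or analogously in Python, as mentioned above). Here’s a minimal sketch with illustrative file names, reading a gzipped file in blocks of lines and keeping only the rows that split into the full set of six space-delimited fields:
infile <- gzfile("part-00000.gz", open = "r")               # hypothetical input file
outfile <- gzfile("part-00000-fulllines.gz", open = "w")    # hypothetical output file
repeat {
    lines <- readLines(infile, n = 100000)   # process 100,000 lines at a time
    if (length(lines) == 0) break
    ok <- lengths(strsplit(lines, " ", fixed = TRUE)) == 6  # rows with exactly 6 fields
    writeLines(lines[ok], outfile)
}
close(infile); close(outfile)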
Note that this restriction to rows with a full set of fields has already been done in the data files I provide to you.
4 Database administration and configuration miscellanea
You can often get configuration information by making a query. For example, here’s how one can get information on the cache size in SQLite or on various settings in Postgres.
# SQLite
dbGetQuery(db, "pragma cache_size")
dbGetQuery(db, "pragma cache_size=90000")
# sets cache size to ~90 MB (90000 pages at 1 KB/page), but not really relevant
# as the operating system should do disk caching automatically
# Postgres
dbGetQuery(db, "select * from pg_settings")
dbGetQuery(db, "select * from pg_settings where name='dynamic_shared_memory_type'")
5 Remote access to PostgreSQL databases
If you want to connect to a Postgres database running on a different machine, here’s one approach that involves SSH port forwarding. For example, you could connect to a Postgres database running on some server while working as usual in R or Python on your laptop.
First, on your machine, set up the port forwarding where 63333 should be an unused port on your local machine and PostgresHostMachine is the machine on which the database is running.
For Linux/Mac, from the terminal:
ssh -L 63333:localhost:5432 yourUserName@PostgresHostMachine
Using Putty on Windows, go to ‘Connection -> SSH -> Tunnels’ and put ‘63333’ as the ‘Source port’ and ‘127.0.0.1:5432’ as the ‘Destination’. Click ‘Add’ and then connect to the machine via Putty.
In either case, the result is that port 63333 on your local machine is being forwarded to port 5432 (the standard port used by Postgres) on the server. The use of ‘localhost’ is a bit confusing - it means that you are forwarding port 63333 to port 5432 on ‘localhost’ on the server.
Then (on your local machine) you can connect by specifying the port on your local machine, with the example here being from R:
library(DBI); library(RPostgres)  # assuming the RPostgres driver; other DBI drivers would work
drv <- Postgres()
db <- dbConnect(drv, dbname = 'wikistats', user = 'yourUserName',
                password = 'yourPassword', host = 'localhost', port = 63333)
6 UNIX tools for examining disk access (I/O) and memory use
6.1 I/O
iotop
shows disk input/output in real time on a per-process basis,
while iostat shows overall disk use.
iotop # shows usage in real time
iostat 1 # shows usage every second
6.2 Memory
To see how much memory is available, one needs to have a clear understanding of disk caching. As discussed above, the operating system will generally cache files/data in memory when it reads from disk. Then if that information is still in memory the next time it is needed, it will be much faster to access it the second time around. While the cached information is using memory, that same physical memory is immediately available to other processes, so the memory is available even though it is in use.
We can see this via free -h
(the -h is for ‘human-readable’, i.e. show
in GB (G)).
total used free shared buff/cache available
Mem: 251G 998M 221G 2.6G 29G 247G
Swap: 7.6G 210M 7.4G
You’ll generally be interested in the Mem row. (See below for some
comments on Swap
.) The shared
column is complicated and probably
won’t be of use to you. The buff/cache
column shows how much space is
used for disk caching and related purposes but is actually available.
Hence the available
column is the sum of the free
and buff/cache
columns (more or less). In this case only about 1 GB is in use
(indicated in the used
column).
top
and vmstat
both show overall memory use, but remember that the
amount available is the amount free plus any buffer/cache usage. Here is
some example output from vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 215140 231655120 677944 30660296 0 0 1 2 0 0 18 0 82 0 0
It shows 232 GB free and 31 GB used for cache and therefore available, for a total of 263 GB available.
Here are some example lines from top:
KiB Mem : 26413715+total, 23180236+free, 999704 used, 31335072 buff/cache
KiB Swap: 7999484 total, 7784336 free, 215148 used. 25953483+avail Mem
We see that this machine has 264 GB RAM (the total column in the Mem row), with 259.5 GB available (232 GB free plus 31 GB buff/cache as seen in the Mem row). (I realize the numbers don’t quite add up for reasons I don’t fully understand, but we probably don’t need to worry about that degree of exactness.) Only 1 GB is in use.
swap
is essentially the reverse of disk caching. It is disk space that
is used for memory when the machine runs out of physical memory. You
never want your machine to be using swap for memory, because your jobs
will slow to a crawl. Here the swap line in both free and top shows 8 GB
swap space, with very little in use, as desired.