- No books required
- Avoid textbooks, use professional books
Grade | Percentage |
---|---|
Midterm | 24% |
Midterm | 25% |
Lab & Quizzes | 20% |
Final Exam | 25% |
Project* | 30% |
* Groups of 3 or 4 for project
Information Data with a meaning
Store data is mostly text, numbers, dates, and (less onen) mulfmedia documents.
as efficiently as possible.
1970
Edgar F. Codd 1923 - 2003
Each column in the table stores a piece of data, and one row represents a "known fact".
All the pieces of data in a row are related, hence "relational".
Codd's big idea was that as relations are well known mathemafcal animals (sets), you could operate on them, and get new sets.
If tables are variables, you can change their content massively, and combine them in complex operafons to obtain new results, as you could with numbers.
Before you can do fun operafons with tables, you must design your database well, a step known as modelling.
First of all, you must know which data you need/want to manage. It depends on whom you are.
Duplicates are forbidden in relational tables.
What differentiates one row from another.
Key cannot be changed.
For purely commercial reasons, though, it may be assumed that the combinafon title/country/year of films will be unique.
First Normal Form
Each column should only contain ONE piece of informafon.
Most choice revolves along assessing whether an attribute is or can be considered unambiguous or whether even unusual cases are important and could break everything, in which case you may need to split data between two or even three tables.
In a database model, you call "entity" something that has a life ot its own. eg. people, films, countries.
Relationships connect entities. They have no life of their own.
If a film project is abandoned, there will be no rows for this film in CREDITS.
An "Enfty Relafonship Diagram" (E/R Diagram) is a way to represent tables in a database (you even have hundreds of tables in a diagram). Crow feet indicate "cardinality".
"Cardinality" is just a complicated word to say "number".
There is always one country per film, but there are several films per country.
In some cases, some attributes haven't always a value; "main language" for instance, because you have silent films.
Several actors usually appear in a film, and an actor usually plays in several films.
(m,n)
cardinality implies a relationship table.
All about splitting
Every non key attribute must provide a fact about the key, the whole key, and nothing but the key. -- William Kent (1936 - 2005)
- ALPHA
- SQL (born SEQUEL, Don Chamberlin with Ray Boyce)
- QUEL (Michael Stonebraker)
- QBE (Query By Example, Moshe Zloof)
Oates, Ellison, Miner - The founders of Oracle
deals with tables
CREATE
ALTER
DROP
deals with data
INSERT
UPDATE
DELETE
SELECT
There is an official SQL standard that no product fully implements, and many subtly and sometimes irritatingly different dialects. We'll see the main variants of SQL.
- PostgreSQL
- ORACLE
- MySQL
- IBM DB2
- Microsoft SQL Server
- CAN BE enforced for tables in SQL
But you have to create your tables well.
- NOT enforced for query results in SQL
You have to be extra-careful if the result of a query is the starting point for another query, which happens often.
SQL keywords (words that have a special meaning in SQL) are NOT case sensitive and can be typed in any case you want.
Same story with identifiers, the names you give to tables or, as you will soon see, columns, aren't case-sensitive.
Table (and column) names must start with a leder (PostgreSQL tolerates an underscore) and only contain letters, digits, or underscores.
The $ sign is also accepted, and some products allow #.
Note that names can sometimes be quoted between double quotes or square brackets, in which case spaces are allowed AND names become case-sensitive. Better to avoid it.
create table table_name
(column_name datatype,
,
)
- Text
- Number
- Date
- Binary
That's basically what you find in a database; nothing fancy.
char(length)
char
char() is for fixed-size columns (data is padded with spaces if shorter). Used for codes.
varchar(max length)
varchar2(max length)
varchars don't pad. They are limited in length (a few thousand bytes).
text clob
CLOB (called TEXT in MySQL) allows to store much bigger text (Gb).
int
float
numeric(precision, scale)
number(precision, scale)
date
includes time, down to second with Oracle, not with other products.
datetime
down to second (other than Oracle, except DB2)
timestamp
down to 0.000001 second
raw(max length)
varbinary(max length)
RAW in Oracle, and VARBINARY (SQL Server) are the binary equivalent of VARCHAR.
blob
BLOB is the binary equivalent of CLOB (BLOB means Binary Large Object).
PostgreSQL calls the binary datatype BYTEA
, don't ask me why.
It indicates the absence of a value, because we don't yet know it, or because in that case the attribute is irrelevant, or because we haven't the slightest idea about what this should be.
For more on NULL
, see Lecture 3.
create table people ( peopleid int not null,
first_name varchar(30),
surname varchar(30) not null,
born numeric(4),
died numeric(4))
comment on column people.surname is 'Surname or stage name';
-- Stored in the database
-- comments in an SQL statement start with a double dash
-- Not usually stored in the database
create table people ( peopleid int not null
primary key,
first_name varchar(30),
check(first_name = upper(first_name)),
surname varchar(30) not null,
check(surname = upper(surname)),
born numeric(4),
died numeric(4),
unique(first_name, surname))
-- MySQL accepts CHECK but doesn't enforce it.
PRIMARY KEY
indicates two things:
- The value is mandatory (the additional
NOT NULL
doesn't hurt but is redundant). - The values are unique (no duplicates allowed in the column).
With many products data is case sensitive and different capitalization means different values that wouldn't violate a uniqueness constraint. You must standardize case. eg. Oracle, PostgreSQL and DB2.
create table movies (movieid int not null primary key,
title vercahr(60) not null,
country char(2) not null,
year_released numeric(4) not null,
check(year_released >= 1895),
unique (title, country, year_released),
foreign key(country)
references countryies(country_code))
A foreign key can be composed of a combination of columns (rare).
Note that the constraint works both ways.
We won't be able to delete a country if movies reference it, because we would get "orphaned rows" and the database would become inconsistent.
Creating tables requires:
- Proper modelling (cardinalities)
- Defining keys (what identifies rows)
- Determining correct data types
- Defining constraints
INSERT
enters data into a table.
insert into table_name
(list of columns)
values (list of values)
Values must match column-names one by one.
insert into countries
(country_code, country_name, continent)
values('us', 'United States', 'AMERICA')
For strings, SINGLE quotes is the standard and what works everywhere.
The standard SQL way to escape a quote is to double it.
There are only two ways to enter a date, as a string or as the result of a function or computation.
to_date('07/20/1969' , 'MM/DD/YYYY')
CURRENT_DATE
(no time) and CURRENT_TIMESTAMP
(time included) are recognized by all products
select * from movies
In a program, always name columns.
Restriction
select ...
from ...
where ...
SELECT
is followed by the names of the columns you want to return, FROM
by the name of the tables that you query, and WHERE
by filtering conditions.
select *
from movies
where country = 'us'
and year_released between 1940 and 1949
and
is "stronger" than or
=
<>
or!=
<
<=
>
>=
Whenever you are comparing data of slightly different types, you should use functions that "cast" data types. It will avoid bad surprises.
Another frequent mistake is with datetime values.
where issued >= `<Monday’s date>`
-- <Monday 00:00:00>
and issued <= `<Friday's date>`
-- <Friday 00:00:00>
where country in ('us', 'gb')
and year_released between 1940 and 1949
country not in ('us', 'gb')
-
%
any number of characters, including none -
_
one and only one character
select * from movies
where title not like '%A%'
and title not like '%a%'
select * from movies
where upper(title) not like '%A%'
Not good to apply a function to a searched column.
The only way to test NULL
.
where column_name is null
where column_name is not null
'hello' || ' world'
Most products
'hello' + ' world'
SQL Server
concat('hello', ' world')
MySQL
Avoid applying functions to columns that are used for comparison.
case upper(color)
when 'Y' then 'Color'
when 'N' then 'B&W'
else '?'
end as color,
NULL
cannot be tested in a WHEN
branch.
round(3.141592, 3)
3.142
trunc(3,141592, 3)
3.141
floor()
ceiling()
upper
lower()
substr('Citizen Kane'), 5, 3
'zen'
trim(' Oops ')
'Oops'
replace('Sheep', 'ee', 'i')
'Ship'
length()
select distinct country
from movies
aggregate function will aggregate all rows that share a feature and return a characteristic of each group of aggregated rows.
select country, year_released, count(*) number_of_movies
from movies
group by country, year_released
You can also group on several columns. Every column that isn't an aggregate function and appears after SELECT
must also appear after GROUP BY
.
count(*)
count(col)
min(col)
max(col)
avg(col)
sum(col)
stddev()
SQLite hasn't
stddev()
, which computes the standard deviation
select * from (
select country,
min(year_released) oldest_movie
from movies
group by country
) earliest_movies_per_country
where oldest_movie < 1940
The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate functions.
--- W3Schools
select country,
min(year_released) oldest_movie
from movies
group by country
having min(year_released) < 1940
having country = 'us'
hurts performance.
where country = 'us'
is way more efficient.
When you apply a function or operators to a null, with very few exceptions the result is null because the result of a transformation applied to something unknown is an unknown quantity.
Aggregate functions ignore NULL
s.
Counting a mandatory column such as BORN
will return the same value as COUNT(*)
. The third count, though, will only return the number of dead people in the table.
select count(*) people_count,
count(born) birth_year_count,
count(died) death_year_count
from people
select count(distinct colname)
select peopleid, count(*) as number_of_roles
from (select distinct peopleid, credited_as
from credits
where credited_as in ('A', 'D')
) all_actors_and_directors
group by peopleid
having count(*) = 2
select title, country_name, year_released
from movies
join countries
on country_code = country
where country_code <> 'us'
Not supported by SQL Server
select distinct first_name, surname
from people
join credits
using (peopleid)
where credited_as = 'D'
select distinct p.first_name, p.surname
from people p
join credits c
on c.peopleid = p.peopleid
where credited_as = 'D'
the regular join
Use LEFT OUTER JOIN
only.
CONTRARY TO WHAT HAPPENS WITH THE ORDINARY (inner) JOIN, ORDER IS IMPORTANT.
takes an indeterminate number of parameters and returns the first one that isn't NULL
, available with all products.
With LEFT OUTER JOINs, apply all condiIons before joining.
It's not necessary to have a subquery of British films for MOVIES, but it is necessary to have a subquery that only returns directors.
select m.year_released, m.title,
p.first_name, p.surname
from (select movieid, year_released, title
from movies
where country = 'gb') m
left outer join (select movieid, peopleid
from credits
where credited_as = 'D') c
on c.movieid = m.movieid
left outer join people p
on p.peopleid = c.peopleid
An outer join is always a qualifying join, unless it is associated with an IS NULL
condiIon, meaning that not finding a match is significant.
If a join is removed, MORE rows?
- Yes -> Filtering
- No -> Qualifying
takes two result sets and combine them into a single result set.
- must return the same number of columns,
- the data types of corresponding columns must match.
intersect
-> inner join
except
-> outer join
select distinct country
from movies
select country_code
from countries
except
select distinct country
from movies
select c.country_code
from countries c
left outer join
(select distinct country
from movies) m
on m.country = c.country_code
where m.country is null
select m.title, m.year_released,
(select c.country_name
from countries c
where c.country_code = m.country)
as country_name
from movies m
where m.country <> 'us'
Strictly speaking, a subquery afer the SELECT is more equivalent to a LEFT OUTER JOIN.
select m.title,
c.country_name
from movies m
left outer join countries c
on c.country_code = m.country
where m.country <> 'us'
A subquery in the FROM cannot be correlated.
from
clause uncorrelated
select
list correlated
where
clause uncorrelated or correlated
in (select col
from ...
where ...)
select country, year_released, title
from movies
where country in
(select country_code
from countries
where continent = 'EUROPE')
Some products (Oracle, DB2, PostgreSQL with some twisting) even allow comparing a set of column values (the correct word is "tuple") to the result of a subquery.
(col1, col2) in
(select col3, col4
from t
where ...)
IN()
means an implicit DISTINCT in the subquery.
If demonstrably unique, no distinct
wtih a JOIN
(no need with a IN()
).
select * from people
where born >= 1970
and first_name not in
(select first_name
from people
where born < 1970)
col in ('a', 'b', 'c') = (col = 'a' or col = 'b' or col = 'c')
Throw a NULL
in, we have a condition that is never true but because of OR
it can just be ignored.
col = NULL
never true.
A subquery that returns a NULL
in a NOT IN ()
will always give a false condition, and the result will vanish. If you want to be safe, you should add a condition saying that you DON'T WANT NULL
values if they are possible.
select * from people
where born >= 1970
and first_name not in
(select first_name
from people
where born < 1970
and first_name is not null)
Correlated queries in the WHERE
clause are used with the (NOT) EXISTS
construct.
NEVER try to correlate an IN()
!
EXISTS
NOT EXISTS
select distinct m.title
from movies m
where exists
(select null
from credits c
inner join people p
on p.peopleid = c.peopleid
where c.credited_as = 'A'
and p.born >= 1970
and c.movieid = m.movieid)
order by col1 desc, col2 asc
ASC
is the default and nobody uses it.
Ordering depends on the data type.
It depends on the DBMS.
SQL Server, MySQL and SQLite consider by default that nothing is smaller than everything.
DB2, Oracle and PostgreSQL that it's greater than anything.
Local text sorting rules.
Using CASE .. END
to replace each code with a value that sorts as intended.
order by
case credited_as
when 'D' then 1
when 'P' then 2
when 'A' then 3
end
PostgreSQL, MySQL, SQLite
limit 10
IBM DB2, ORACLE, PostgreSQL
fetch first 10 rows
SQL Server
select top 10 title, ...
In Oracle, if you just want to keep the first ten rows, since the rownum of each row is assigned before the sort, the orered query must be nested.
Oracle
select *
from (select title, country, year_released
from movies
order by title) m
where rownum <= 10
PostgreSQL, MySQL, SQLite
select title,
country,
year_released
from movies
order by title limit 10 offset 20
If you are using with SQL Server or DB2 the equivalent of LIMIT, then there is no OFFSET. You must cheat.
Turning the "ancestry" into an attribute.
Dynamic Ordering
select message, ....
from forum_posts ...
connect by answered_postid = prior postid
start with answered_postid is null
and topicid = ...
order siblings by postid
Recursive queries
with q(postid, message) as (select postid, message
from forum_posts
where answered_postid is null
and topicid = ...
union all
select f.postid, f.message
from forum_posts f
inner join q
on f.answered_postid = q.postid)
select *
from q
Recursive queries operate level by level from top to bottom.
Like scalar functions, they return a result for a single row; but like aggregate functions, this result is computed out of several rows.
func(parameters) over (magic clause)
With DBMS products that support window funcIons, every aggregate function can be used as a window function.
min(year_released) over (partition by country)
Window functions always operate againts rows that belong to a result set.
One related characteristics is that they can only appear after the
SELECT
, not in theWHERE
clause, and there is nothing with them similar toHAVING
with aggregate functions.
You can have an empty OVER
clause to indicate that you want the result computed over all rows selected.
min(year_released) over()
When there is an
ORDER BY
you cannot start returning rows before you have seen all of them. The same thing can be obtained withCROSS JOIN
(also called a Cartesian join).
With a ranking window function you MUST have an ORDER BY
clause in the OVER()
(you cannot have an empty OVER()
clause). You can combine it with a PARTITION BY
to order with groups.
over (order by ...)
over (partition by col1, col2, ...
order by col3, col4, ...)
assigns distinct, sequential numbers to everyone.
assigns the same number to ties, but there is a gap in ranks.
assigns the same number to ties, with no gap
<tag></tag>
...
SQL isn't fully relational.
With SQL (at least with complex queries, and they occur a lot in reporting) you keep juggling between relational and non-relational features of the language.
Example:
select * from people
where born >= 1970
-- Relational
select * from people
where born >= 1970
order by born
-- Not relational, it's more like an array.
select * from (
select * from people
where born >= 1970
order by born
)
where rownum <= 3
-- Relational again.
-- Just the fact that you have three of the oldest people born in 1970 or later.
CTEs are simply a kind of factorization of subqueries.
with something as ...
"Filling the gaps"
select x.year_released, count(m.movieid) as films
from (select 1925 as year_released
union all select 1926
union all select 1927
union all select 1928
union all select 1929
union all select 1930
union all select 1931
union all select 1932
union all select 1933
union all select 1934) x
left outer join movies m
on m.year_released = x.year_released
group by x.year_released
-- There are various tricks for generating lists of values
-- This will return 0 when we have no films for that year.
All values are equal iff:
having min(...) = max(...)
If you only want to check whether the result of the count is zero or not, use exists
rather than count
.
Reason: If you use exists
and if the third row you inspect verifies the condition, you can stop here. If you use count
, you must check all rows to find how many other rows also verify the condition.
GROUP BY
can sometimes replace self-joins, especially when a
table appears many, many times.
Aim: Display the names of cities between which you have flights with this (legitimate) query.
select d.city, a.city, ...
from flights f
inner join airports d
on d.code = f.departure
inner join airports a
on a.code = f.arrival
-- Original version.
select a.city, ...
from flights f
inner join airports a
on a.code in (f.departure,
f.arrival)
-- Shown as:
-- Paris
-- Beijing
select case a.code
when f.departure then a.city
else null
end, ...
from flights f
inner join airports a
on a.code in (f.departure,
f.arrival)
-- Shown as:
-- Paris
-- Beijing
select max(case a.code
when f.departure then a.city
else null
end), ...
from flights f
inner join airports a
on a.code in (f.departure,
f.arrival)
group by ...
-- Shown as:
-- Paris Beijing
Apply MAX(), which ignores NULL, and you squash your two rows into one.
Another Interesting use of window functions is limiting damage with runaway queries.
Common way:
select count(*)
from (original query)
if count <= maxcnt:
original query
If the first query saves on data transfers and data rendition, it still executes the painful part. And if the query is OK, in practice you run it twice.
Better way:
select ... ,
count(*) over () count
from ...
where ...
and rownum <= maxcnt + 1
order by ...
If we more than what we want, the ordered set may be wrong, but we won't display it so it doesn't matter. We're putting a cap over what may go wrong.
Basically you retain the first lemer, drop vowels, lemers that sound like vowels (w = oo), h (ofen silent), then replace similarly sounding consonents by the same digit, before eliminating one of two successive identical digits.
Problem: Strong Anglo-Saxon Bias
Split a text in (pure) words, eliminate words that are too common, then associate each word with the film identifier.
Example:
Split 2001: a space odyssey
as:
2001
a SPACE
ODYSSEY
Then rank all movies related to these key words by hit number.
Queries must be thought as successive layers, from the inside out.
Identifying the tables we need.
We must perform them before joining when possible.
Use a main filter that defines the most precisely the subset of rows we want to retrieve.
Core joins are either the ones that contribute to filtering (not in that case) or that returns information that you should return and that shall be here.
Additional information, ordering, etc. Make the result nicer.
begin / start transaction
A transaction ends when you issue either COMMIT
or ROLLBACK
.
Backup the database while it's active.
Example:
Bank records you may think:
Account type | Number | Balance |
---|---|---|
CURRENT ACNT | 1234567 | 300.00 |
SAVINGS ACNT | 8765432 | 1600.00 |
Real life bank records:
Account type | Number | Balance | Date |
---|---|---|---|
CURRENT ACNT | 1234567 | 300.00 | 1-Sep |
SAVINGS ACNT | 8765432 | 1600.00 | 1-Sep |
Account type | Amount | Operation | Date |
---|---|---|---|
1234567 | 100.00 | DEBIT | 3-Sep |
8765432 | 100.00 | CREDIT | 3-Sep |
In banking system, what is stored is operations, and balances are recomputed once in a while.
What appears as an update
may be in fact an insert
.
Most products (exceptions are Oracle and SQLite) allow inserting several rows in one statement, with a comma- separated list of row data between parentheses.
insert into table_name
(column1, column2, ..., columnn)
values (value1, value2, ..., valuen),
...
(valuep, valueq, ..., valuez)
If you don't specify the columns, it's understood as "all the columns, in the same order as they are displayed when running select *
".
If you omit a column in insert
, the value inserted is the default one if defined, otherwise it will be NULL
.
create table <table_name> (...
<column_name> <data type>
default <default_value> not null, ...)
If I have a default value for a mandatory column, it will be OK to omit it in an insert statement.
If the column is nullable, nothing prevents you from explicitly inserting NULL
, and the default value won't be used.
Querying the next value to use is a sure recipe for conflicts. Several users may get the same one:
select max(movieid) + 1
from movies
There are two approaches.
Simply a number generator.
IBM DB2, Oracle, SQL Server 2012, PostgreSQL
create sequence movie_seq
SQL Server
create table movies (movieid int not null identity primary key)
PostgreSQL, MySQL
serial primary key
SQLite
integer primary key
PostgreSQL actually creates a sequence behind the scene, which it "attaches" to the table so that dropping the table drops the sequence.
Oracle (since version 12, it wasn't possible before) can do it PostgreSQL style, but more explicitly.
Linux Line 1\nLine 2
Windows Line1\r\nLine2
- Comma Separated Values (CSV)
- Tab-separated
- Fixed-field Files
- XML
When everything else fails, using a scripting language to generate INSERT statements is usually the simplest solution.
Without a WHERE
all rows are affected.
update table_name
set column_name = new_value,
other_col = other_val,
...
where ...
UPDATE
is a SET operation.
- Updates in loops are WRONG.
- Think massive operations.
Like a join in a select, same issues with nulls and duplicates!
Oracle and DB2 both support subqueries returning several columns (SQLite also now).
SQL Server and PostgreSQL both support the same older-join type of syntax allowing to join the updated table to the one from which we are getting data.
MySQL allows a join with the newer syntax.
It's usually forbidden to update a key - it's the identifier. You cannot change an identifier. You can only delete the row and insert another.
SQL Server, Oracle
MERGE
A interesting operation would be to update a film we know, and insert it if we don't. That's the purpose of MERGE.
MySQL
on duplicate key update
SQLite
insert or replace
When none of the above is available, you should try to update, and if nothing is affected insert.
NEVER count first to see if the row is already here! It's useless work.
delete from table_name
where ...
If you omit the WHERE
clause, then (as with UPDATE
) the statement affects all rows and you EMPTY table_name!
without a WHERE
clause. Can't be rolled back. Leave it to senior DBAs.
Once you have one movie, you are prevented from deleting the country otherwise the foreign key on table MOVIES would no longer work for films from that country.
begin transaction
commit
create function full_name(p_fname varchar, p_sname varchar)
returns varchar
as $$
begin
return case
when p_fname is null then ''
else p_fname || ' '
end |
case position('(' in p_sname,)
when 0 then p_sname
else trim(')' from substr(p_sname,
position('(' in p_sname) + 1))
|| ' '
|| trim(substr(p_sname, 1,
position('(' in p_sname) - 1))
end;
end;
$$ language plpgsql;
select full_name(first_name, surname) as name, born, died
from people order by surname
select col1, col2, ...
into local_var1, local_var2, ...
from ...
"row variables"
...