Databases

Learning Objectives

  • Define what a database and a Database Management System (DBMS) are.
  • Distinguish between Relational (SQL) and Non-Relational (NoSQL) database paradigms, including CAP theorem principles.
  • Understand core relational concepts like Primary, Foreign, and Composite Keys, and common database constraints.
  • Explain the ACID properties (Atomicity, Consistency, Isolation, Durability) and the role of transaction logs.
  • Describe the process and goals of database normalization (1NF, 2NF, 3NF).
  • Understand the basics of SQL querying (SELECT, INSERT, UPDATE, DELETE, JOIN).
  • Explain how indexing works using B-Trees and its impact on performance.

A comprehensive guide to Database Management Systems, SQL vs. NoSQL paradigms, relational algebra, ACID properties, Normalization (1NF, 2NF, 3NF), and Indexing.

Database

A Database is an organized collection of structured or unstructured data, typically stored electronically in a computer system. It is managed by a Database Management System (DBMS), which acts as the software interface allowing users and applications to securely create, read, update, and delete (CRUD) the data.

Schema

A Schema is the logical structure or blueprint of a database. It defines how data is organized, what tables exist, the fields within those tables, and the relationships between them.

Query

A Query is a precise request for information or for an action to be performed on data within a database, typically written in a standardized query language like SQL.

1. Types of Databases: SQL vs. NoSQL

The two primary paradigms for storing data are Relational (SQL) and Non-Relational (NoSQL). Choosing between them depends entirely on the structure of your data and how it needs to scale.

1.1 Relational Databases (SQL)

Data is stored in highly structured Tables (Relations) consisting of rows (records) and columns (attributes). Tables are strictly linked together using Primary and Foreign Keys to form complex relationships. Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.

  • Rigid Schema: You must define exactly what data type (integer, string, date) goes into each column before inserting any data. If you want to add a new attribute (like "MiddleName") later, you must alter the entire table structure.

  • ACID Compliance: Guarantees strong data integrity and absolute reliability for complex, multi-step transactions (like financial bank transfers).

  • Vertical Scaling (Scale-Up): Primarily scales by increasing the horsepower of a single server (buying a bigger, more expensive server with more RAM/CPU). Distributing a relational database across many servers is complex.

1.2 Non-Relational Databases (NoSQL)

Data is stored in flexible formats, often as JSON-like documents, key-value pairs, wide-columns, or graph nodes, rather than strict tables. Examples: MongoDB (Document), Redis (Key-Value), Cassandra (Wide-Column), Neo4j (Graph).

  • Dynamic Schema: Schema-less design. You can insert varied data structures into the same collection without altering the database. One user document might have a "phone_number" field, while another does not.

  • Horizontal Scaling (Scale-Out): Designed from the ground up to scale horizontally by adding cheaper, commodity servers to distribute the load across a massive cluster. Perfect for massive, unstructured Big Data and real-time web apps.

  • BASE Properties: Often trades strict ACID consistency for higher availability and performance (Basically Available, Soft state, Eventual consistency).

The CAP Theorem

When discussing distributed databases (like NoSQL), the CAP Theorem states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency (every read receives the most recent write or an error), Availability (every request receives a non-error response), and Partition Tolerance (the system continues to operate despite an arbitrary number of messages being dropped by the network).

2. Core Relational Concepts

Relational databases rely on specific keys and constraints to mathematically organize and link data across multiple tables.

Key Types:

  • Primary Key (PK): A column (or combination of columns) that uniquely identifies every single row in a table. It must be unique and cannot be NULL. (e.g., A StudentID, a SocialSecurityNumber, or an auto-incrementing integer ID).

  • Foreign Key (FK): A column in one table that specifically references the Primary Key of another table. This creates the relational link between the tables, enforcing referential integrity. (e.g., An Orders table has a CustomerID column linking back to the Customers table).

  • Composite Key: A Primary Key that consists of two or more columns combined to create a unique identifier (often used in linking tables for Many-to-Many relationships).

Common Database Constraints:

  • UNIQUE: Ensures all values in a column are different.
  • NOT NULL: Ensures a column cannot have a NULL value.
  • CHECK: Ensures values in a column satisfy a specific condition.
  • DEFAULT: Provides a default value for a column when none is specified.

Transaction

A Transaction is a logical unit of work that contains one or more SQL statements (e.g., deducting money from Account A, and adding it to Account B).

3. The ACID Properties

Relational DBMS systems guarantee reliability even during power failures, system crashes, or massive concurrent access by enforcing four strict rules known as the ACID properties for every database transaction.

  • Atomicity ("All or Nothing"): A transaction must either complete entirely, or not happen at all. If a server crashes halfway through a bank transfer, the database entirely "rolls back" the partial changes to the exact starting state.

  • Consistency: A transaction can only bring the database from one mathematically valid state to another valid state. It must obey all predefined rules, constraints (like NOT NULL), and cascading triggers.

  • Isolation: Concurrent executions of transactions by hundreds of users must leave the database in the exact same state as if they were executed sequentially (one after the other). This prevents "dirty reads" where two users withdraw from the same account simultaneously, corrupting the final balance.

  • Durability: Once a transaction has been successfully committed, the changes are permanently stored on non-volatile disk. They will survive a catastrophic system crash or power outage immediately afterward. This is heavily reliant on Transaction Logs, where the database records changes before writing them to the main tables, allowing recovery upon a crash.

Normalization

Normalization is the systematic process of organizing data in a relational database to reduce data redundancy (duplicate data) and improve data integrity. It involves breaking large, messy tables into smaller, tightly-related tables and defining relationships between them.

The Normalization Process

  1. First Normal Form (1NF): Eliminate repeating groups or arrays in individual columns. Create a separate table for each set of related data, and identify each row with a unique Primary Key. Rule: Every column must contain atomic (indivisible) single values. (e.g., A "PhoneNumbers" column cannot contain "555-1234, 555-9876").
  2. Second Normal Form (2NF): Must fulfill 1NF. Create separate tables for sets of values that apply to multiple records, and relate these tables with a Foreign Key. Rule: No partial dependency. Every non-key column must depend on the ENTIRE primary key (only relevant if the PK is a composite key).
  3. Third Normal Form (3NF): Must fulfill 2NF. Eliminate fields that do not depend strictly on the primary key. Rule: No transitive dependency. A non-key column cannot depend on another non-key column. (e.g., If a table has ZipCode and City, City depends on ZipCode, not the CustomerID primary key. Move them to a separate Locations table).

The Goal of Normalization

The goal is to ensure that every non-key attribute in a table provides a fact about the Primary Key, the whole Primary Key, and nothing but the Primary Key. This prevents Insertion, Update, and Deletion Anomalies (e.g., updating a customer's address in one place automatically updates it everywhere, preventing conflicting data).

5. Structured Query Language (SQL)

SQL is the standard programming language used to communicate with, manage, and query Relational Databases.

  • SELECT (Read): Retrieves data from a database. (e.g., SELECT FirstName, LastName FROM Users WHERE Age > 18 ORDER BY LastName;)

  • INSERT (Create): Adds new rows (records) into a table.

  • UPDATE (Update): Modifies existing data within specific rows based on a WHERE condition.

  • DELETE (Delete): Removes specific rows from a table based on a WHERE condition.

  • JOIN: The most powerful feature of SQL. It dynamically combines rows from two or more separate tables based on a related column (usually Primary/Foreign keys) to return a unified dataset.

Interact with the simulation below to see how basic SQL SELECT queries filter the data returned from a table.

SQL Queries

SQL> SELECT * FROM Employees;

Result Set: Employees Table

IDNameDeptSalary
1AliceHR$55,000
2BobIT$75,000
3CharlieIT$82,000
4DianaSales$60,000
4 row(s) returned

6. Indexing and Performance

An Index is a specialized data structure (often a B-Tree) created on a database column that drastically improves the speed of data retrieval operations (SELECTs), at the cost of slower writes (INSERTs/UPDATEs) and increased storage space.

How Indexing Works

Without an index, the database engine must perform a "Full Table Scan" (checking every single row from top to bottom) to find WHERE Username = 'John'. If the Username column is indexed, the database uses a sorted tree structure to find 'John' nearly instantly (O(log n) time) instead of scanning millions of rows (O(n) time).

The Trade-off: Every time a new user is inserted or updated, the index tree must be computationally rebuilt and re-sorted. Therefore, over-indexing a table slows down write operations significantly.

Key Takeaways
  • SQL databases use rigid tables with predefined schemas, ideal for highly structured data requiring complex joins and absolute transactional integrity.
  • NoSQL databases use flexible, unstructured formats, ideal for rapid agile development, changing data shapes, and massive horizontal scaling.
  • The CAP Theorem implies distributed systems cannot guarantee Consistency, Availability, and Partition Tolerance all at once.
  • Primary keys guarantee row uniqueness within a single table.
  • Foreign keys mathematically link tables together, preventing orphan records (referential integrity).
  • ACID properties (Atomicity, Consistency, Isolation, Durability) ensure bulletproof data reliability, particularly for financial, medical, and critical systems.
  • Normalization organizes schemas to eliminate data redundancy and prevent anomalies during database modifications.
  • 1NF requires atomic values; 2NF removes partial dependencies; 3NF removes transitive dependencies.
  • SQL provides standardized commands (SELECT, INSERT, UPDATE, JOIN) to query and manipulate relational data.
  • Indexing drastically speeds up data retrieval (SELECTs) by maintaining sorted lookups using B-Trees.
  • Over-indexing slows down data modification (INSERT/UPDATE/DELETE) because the indices must be constantly maintained.