Reading List

Note: Readings may change during the semester but are generally guaranteed to be accurate one lecture in advance.
 Date  Topic  Reading  Notes
     Non-Technical Pointers Database Metatheory: Asking the Big Queries (1995)
You and Your Research (1986)

Four takes on how to read a paper:
Th 8/28
Introduction and Systems Considerations Optional:  The 5 Minute Rule for Trading Memory for Disc Accesses and the 5 Byte Rule for Trading Memory for CPU Time (1985)
Optional: The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb (1997)
Optional: The Five-Minute rule twenty years later, and how flash memory changes the rules (2007)

Optional: Numbers Everyone Should Know (and 2012 projection site)

Optional: Operating System Support for Database Management (1981)
Optional: On planning for failure, ThemisMR: An I/O-Efficient MapReduce (2012; Section 2 Only)
JMH notes to self
Tu 9/2 Relational Roots Background: Anatomy of a Relational Database System (2005)
Background: Architecture of a Database System (2008)

System R: Relational Approach to Data Management (1976)
Optional: A History and Evaluation of System R (1981)
The Design of Postgres (1986)
Optional: The POSTGRES Next-Generation Database System (1991)

Optional: What Goes Around Comes Around (2005)
JMH notes to self
Th  9/4 Parallel + Distributed Classics Computation and communication in R*: A distributed database manager (1984)
R*: An Overview of the Architecture (1981)
The Gamma database machine project (1990)

Optional: The Case for Shared Nothing (1985)
Optional: R* Optimizer Validation and Performance Evaluation for Distributed Queries (1986)
Optional: Parallel Database Systems: The Future of High Performance Database Processing [alternative PDF] (1992)
JMH notes to self

Scribe Notes (Alan Fekete)
Tu  9/9 Transactions I: Locks and OCC Background: (no summary): The Transaction Concept: Virtues and Limitations (1981)

Granularity of locks and degrees of consistency in a shared data base (1976; Part I only)
On Optimistic Methods for Concurrency Control (1981)

Optional: Concurrency control in distributed database systems (1981)
JMH notes to self

Scribe Notes (Evan Sparks)
Th 9/11 Transactions II: MVCC, Performance, Isolation Levels Concurrency control performance modeling: alternatives and implications (1987)
Granularity of locks and degrees of consistency in a shared data base (1976; Part II only)

We will discuss the following two papers, so skim if you have time, but don't review:
Calvin: fast distributed transactions for partitioned database systems (2012)
Spanner: Google’s Globally-Distributed Database (2012)

Optional: Consensus on Transaction Commit (2006)
Optional, esp. if you've read Spanner: Practical Uses of Synchronized Clocks In Distributed Systems (1991; esp. Section 7)
Scribe Notes (Ankur Dave)
Tu 9/16 Distributed Consistency and Optimistic Replication Background (optional): The dangers of replication and a solution (1996)

Managing update conflicts in Bayou, a weakly connected replicated storage system (1995)
Dynamo: Amazon's highly available key-value store (2007)

Optional, related to Dynamo: "How Eventual is Eventual?" -- PBS demo, original paperProbabilistically Bounded Staleness for practical partial quorums (2012)

Optional, more "NoSQL": Bigtable: A Distributed Storage System for Structured Data (2006)

Useful surveys (optional, additional reading)
Replicated data consistency explained through baseball (2013)
Optimistic Replication (2005)
Eventual Consistency Today: Limitations, Extensions, and Beyond [PDF] (2013)
PDB notes

Scribe Notes (Dan Haas)

Improved query performance with variant indexes (1997)
C-store: a column-oriented DBMS (2005)

Optional: Column-Stores vs. Row-Stores: How Different Are They Really? (2008)

Optional: MonetDB/X100: Hyper-Pipelining Query Execution (2004)
Optional: The log-structured merge-tree (LSM-tree) (1996)
Optional: Database Cracking (2007)
Useful survey (optional): The Design and Implementation of Modern Column-Oriented Database Systems (2013)
JMH Notes to self

Scribe Notes (Dan Crankshaw)
Tu 9/23 Semi-structured Storage A fast index for semistructured data (2001)
Scalable semantic web data management using vertical partitioning (2007)

Optional: Dataguides: Enabling query formulation and optimization in semistructured databases (1997)

Optional: repeatability of the 2007 VLDB paper (via Alan) 
Column-Store Support for RDF Data Management:  not all swans are white
JMH Notes to self

Scribe Notes (Qifan Pu)
Th 9/25 Weak Isolation (Alan Fekete) Generalized isolation level definitions (2000)
Simple rational guidance for chopping up transactions (1992)

Additional reading (optional): Highly Available Transactions: Virtues and Limitations (2014; esp. Table 2 on page 3)

Additional reading (optional): Weak consistency: a generalized theory and optimistic implementations for distributed transactions (1999)
Tu 9/30 Main-memory Storage OLTP through the looking glass, and what we found there (2008)
Hekaton: SQL server's memory-optimized OLTP engine (2013)

Optional: Dali: A high performance main memory storage manager (1994)
Optional: Main Memory Database Systems: An Overview (1992)
Optional: Implementation techniques for main memory database systems (1984)

Mentioned in class: High Volume Transaction Processing Without Concurrency Control, Two Phase Commit, SQL or C++ (1997)
Scribe Notes (Ben Zhang)
Th 10/2 Advanced Query Optimization IIs Query Optimization a “Solved” Problem? [PDF] (2014; no summary)
Robust query processing through progressive optimization (2004)
Eddies: Continuously adaptive query processing (2000)

Mentioned in class: Foundations and Trends survey on Adaptive Query Processing (2007)
Mentioned in class (STeMs): Using State Modules for Adaptive Query Processing (2003
Mentioned in class (STAIRs): Lifting the Burden of History from Adaptive Query Processing (2004)
Mentioned in class (Postgres Eddies): An Initial Study of Overheads of Eddies (2004)

Mentioned in class: 
Postgres selectivity heuristics "This is not an operator, so we guess at the selectivity. THIS IS A HACK TO GET V4 OUT THE DOOR. FUNCS SHOULD BE ABLE TO HAVE SELECTIVITIES THEMSELVES. -- JMH 7/9/92"
Scribe Notes (Anurag Khandelwal)
Tu 10/7 Advanced Query Optimization II The Volcano optimizer generator (1991)
Query execution techniques for caching expensive methods (1996)

Optional: Extensible/rule based query rewrite optimization in Starburst (1992)

Mentioned in class (datalog optimizer implemented in datalog): 
Evita Raced: Metacompilation for Declarative Networks (2008)
Mentioned in class (column store optimizer example):
DB2 with BLU Acceleration: So Much More than Just a Column Store (2013)
Scribe Notes (Joao Carreira)
Th 10/9 No Readings   Email Peter if you'd like to meet about projects during class time.  
Tu 10/14 MapReduce Revolution!
MapReduce Revolution?
MapReduce: simplified data processing on large clusters (2003)
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (2008) 

Skim: Shark: SQL and rich analytics at scale (2013; no summary)
Useful survey (optional): Massively Parallel Databases and MapReduce Systems (2013)
Optional: A comparison of approaches to large-scale data analysis (2009)
Optional: MapReduce: A major step backwards [PDF] (2008, no summary)

Mentioned in class: Quora: Who were the engineers who developed Hadoop at Yahoo? (2013)
Mentioned in class, on planning for failure: ThemisMR: An I/O-Efficient MapReduce (2012)
Scribe Notes (Frank Nothaft)
Th 10/16 Graph Processing (Joey Gonzalez)Pregel: a system for large-scale graph processing (2010)
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs (2012)

Optional: GraphChi: Large-Scale Graph Computation on Just a PC (2012)
Scribe Notes (Dan Haas)
Tu 10/21 Streaming FundamentalsTelegraphCQ: continuous dataflow processing (2003)
Exploiting punctuation semantics in continuous data streams (2003)

Optional: Aurora: a new model and architecture for data stream management (2003)
Optional: The CQL Continuous Query Language: Semantic Foundations and Query Execution (2006)
Optional: Stream: The stanford data stream management system (2004)
Scribe Notes (Qifan Pu)
Th 10/23 Distributed Streams Highly Available, Fault-Tolerant, Parallel Dataflows (2004)
The Design of the Borealis Stream Processing Engine (2005; no summary)
Discretized streams: Fault-tolerant streaming computation at scale (2013)
Skim: MillWheel: Fault-Tolerant Stream Processing at Internet Scale (2013; no summary)

Optional: Storm@Twitter (2014)
Optional: Summingbird: A Framework for Integrating Batch and Online MapReduce Computations (2014)
Tu 10/28 Approximate Query ProcessingOnline aggregation (1997)
BlinkDB: queries with bounded errors and bounded response times on very large data (2013)

Optional: An improved data stream summary: the count-min sketch and its applications (2004)
Optional: Random sampling with a reservoir (1985)
Optional: Probabilistic counting algorithms for data base applications (1985)
Optional survey: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches (2012)
Th 10/30 Large Scale SQL    Dremel: Interactive Analysis Of Web-Scale Datasets
Shark: SQL And Rich Analytics At Scale
Tu 11/4
Materialized ViewsEfficiently updating materialized views (1986)
Automated Selection of Materialized Views and Indexes in SQL Databases (2000)

Useful survey (optional): Materialized views (2011)
Optional: Lazy maintenance of materialized views (2007)
Th 11/6 Data CubingBackground (no summary): Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals (1997)

Implementing data cubes efficiently (1996)
On the computation of multidimensional aggregates (1996)
Tu 11/11 Buffer Day    
Th 11/13 Clustering and Classification Scalable k-means++ (2012)
Planet: massively parallel learning of tree ensembles with MapReduce (2009)

Optional: BIRCH: an efficient data clustering method for very large databases (1996)
Optional: SPRINT: A Scalable Parallel Classifier for Data Mining (1996)
The MADlib analytics library: or MAD skills, the SQL (2012)
SystemML: Declarative machine learning on MapReduce (2011)

Optional: Towards a unified architecture for in-RDBMS analytics (2012)
Th 11/20
Buffer Day
Tu 11/25 Information Extraction PEER FEEDBACK DUE
SystemT: an algebraic approach to declarative information extraction (2010)
Hybrid in-database inference for declarative information extraction (2011)
Th 11/27  No class
Tu 12/2 Application-Level Consistency Consistency Analysis in Bloom: a CALM and Collected Approach (2012)
Coordination Avoidance in Database Systems (2014) 

Optional: Building on Quicksand (2009)
Additional reading (optional): A comprehensive study of convergent and commutative replicated data types (2011)
Th 12/4 Buffer Day  Background (no summary): The Google file system (2003)

The datacenter as a computer: An introduction to the design of warehouse-scale machines (Chapter 2 only, remainder optional; 2009)

Optional: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing (2012)
Optional: Dremel: interactive analysis of web-scale datasets (2010)
Optional: Pig latin: a not-so-foreign language for data processing (2008)
Optional: Dryad: distributed data-parallel programs from sequential building blocks (2007)
Optional: F1: A distributed SQL database that scales (2013)
Th 12/11 Poster Session 1-2:30PM  
Th 12/18      Final Papers due at 5PM