Indexing and Hashing

TOC:

Basic Concepts
Ordered Indices: Primary index, Secondary index
B⁺-Tree Index
Hash Indices
Index Definition in SQL

Basic Concepts

Indexing mechanisms are used to speed up access to desired data.

Search Key

An attribute or set of attributes used to look up records in a file.

An index file consists of records (called index entries) of the form

| search-key | pointer |

Index files are typically much smaller than the original file.

Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

Index Evaluation Metrics:

Access types supported efficiently.
- records with a specified value in the attribute
- or records with an attribute value falling in a specified range of values.
Access time
Insertion time
Deletion time
Space overhead

Ordered Indices

In an ordered index, index entries are stored sorted on the search-key value:

Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
Also called a clustering index.
The search key of a primary index is usually but not necessarily the primary key.
Index-sequential file: an ordered sequential file with a primary index.
Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called a non-clustering index.

Primary Index

An index built on the field by which the data file is sorted.

Dense Indices

Dense index

An index record appears for every search-key value in the file.

There is an index entry for every distinct search-key value, not for every record.
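
As a rough illustration, a dense index over a file sorted on the search key can be modelled as a sorted list of (search-key, pointer) pairs with one entry per distinct value. The sample records, the names `build_dense_index` and `dense_lookup`, and the use of list positions as pointers are all assumptions made for this sketch.

```python
from bisect import bisect_left

# Hypothetical data file: records sorted on the search key (branch name).
# A "pointer" is modelled as a position in this list.
records = [
    ("Brighton", 750),
    ("Downtown", 500),
    ("Downtown", 600),
    ("Mianus", 700),
    ("Perryridge", 400),
    ("Perryridge", 900),
]

def build_dense_index(records):
    """One entry per distinct search-key value, pointing at its first record."""
    index = []
    for pos, (key, _) in enumerate(records):
        if not index or index[-1][0] != key:
            index.append((key, pos))
    return index

def dense_lookup(index, records, key):
    """Return all records with the given search-key value."""
    keys = [k for k, _ in index]
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return []                      # value not present in the index
    pos = index[i][1]
    found = []
    # Records with equal keys are adjacent because the file is sorted.
    while pos < len(records) and records[pos][0] == key:
        found.append(records[pos])
        pos += 1
    return found

dense_index = build_dense_index(records)
```

Six records produce only four index entries here, which is the point of the note above: entries follow distinct values, not records.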

Sparse Indices

Sparse index

Contains index records for only some search-key values. Applicable only when records are sequentially ordered on the search key.

For this reason, a secondary index cannot be sparse.

To locate a record with search-key value K we:
- Find the index record with the largest search-key value ≤ K
- Search the file sequentially starting at the record to which the index record points
A sparse index needs less space and less maintenance overhead for insertions and deletions than a dense index.
It is generally slower than a dense index for locating records.
Good tradeoff: a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
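
The lookup procedure above, combined with the one-entry-per-block tradeoff, might be sketched as follows. The block contents, integer keys, and the name `sparse_lookup` are assumptions for illustration only.

```python
from bisect import bisect_right

# Hypothetical file: blocks of records sorted on an integer search key.
blocks = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

# One index entry per block: (least search-key value in block, block number).
sparse_index = [(block[0], n) for n, block in enumerate(blocks)]

def sparse_lookup(sparse_index, blocks, key):
    """Return (block number, offset) of the record with this key, or None."""
    keys = [k for k, _ in sparse_index]
    i = bisect_right(keys, key) - 1    # last entry with search-key value <= key
    if i < 0:
        return None                    # key is smaller than every indexed value
    block_no = sparse_index[i][1]
    # Scan sequentially from the record the index entry points to.
    for offset, k in enumerate(blocks[block_no]):
        if k == key:
            return (block_no, offset)
        if k > key:
            break
    return None
```

Because each entry holds the least key of its block, the sequential scan never has to leave the block the index points to.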

Multilevel Index

If the primary index does not fit in memory, access becomes expensive. To reduce the number of disk accesses to index records, treat the primary index kept on disk as a sequential file and construct a sparse index on it.

outer index – a sparse index of primary index
inner index – the primary index file

If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indices at all levels must be updated on insertion or deletion from the file.

The result is a tree-structured index.
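
A two-level lookup could look roughly like the sketch below: the outer index selects one block of the inner index, and the inner index yields the record pointer. `FANOUT`, the generated keys, and the string "pointers" are hypothetical.

```python
from bisect import bisect_right

FANOUT = 3  # hypothetical number of index entries per index block

# Inner index: the primary index, stored as blocks of (search-key, pointer).
inner_entries = [(k, f"rec{k}") for k in range(10, 100, 10)]
inner_blocks = [inner_entries[i:i + FANOUT]
                for i in range(0, len(inner_entries), FANOUT)]

# Outer index: a sparse index with one entry per inner-index block.
outer_index = [(block[0][0], n) for n, block in enumerate(inner_blocks)]

def multilevel_lookup(key):
    """Follow outer index -> inner index block -> record pointer."""
    outer_keys = [k for k, _ in outer_index]
    i = bisect_right(outer_keys, key) - 1   # largest outer key <= key
    if i < 0:
        return None
    for k, ptr in inner_blocks[outer_index[i][1]]:
        if k == key:
            return ptr
    return None
```

Only the small outer index needs to stay in memory; each lookup then reads a single inner-index block.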

Index Update: Deletion

If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.

Single-level index deletion:

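Dense indices:
- If the index stores pointers to all records with the same search-key value, delete the pointer to the deleted record;
- If the index stores a pointer only to the first record with that value, and that record is the one deleted, update the pointer to point to the next record;
Sparse indices:
- If the index has no entry for the search-key value of the deleted record, nothing is done;
- If the deleted record was the only record with its search-key value, replace the index entry with the next search-key value in search-key order (if that value already has an entry, delete the entry instead);
- Otherwise, if the index entry points to the deleted record, update the pointer to point to the next record with the same search-key value;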

Multilevel deletion algorithms are simple extensions of the single-level algorithms.
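
As a minimal sketch of the dense-index case that stores pointers to all records, the index can be modelled as a dictionary from search-key value to a list of pointers. The function name and the dictionary representation are assumptions, not a prescribed layout.

```python
def delete_from_dense_index(index, key, pointer):
    """Remove one record's pointer from a dense index.

    `index` maps each search-key value to the list of pointers to all
    records with that value (a hypothetical in-memory model).
    Returns True if a pointer was removed.
    """
    pointers = index.get(key)
    if pointers is None or pointer not in pointers:
        return False              # nothing to delete
    pointers.remove(pointer)      # delete the pointer to the deleted record
    if not pointers:
        del index[key]            # last record with this value: drop the entry
    return True
```

Deleting the only record for a value removes the whole entry, matching the rule stated at the start of this section.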

Index Update: Insertion

Single-level index insertion:

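Perform a lookup using the search-key value appearing in the record to be inserted.
Dense indices:
- If the search-key value does not appear in the index, insert an index entry for it;
- If the index stores pointers to all records with that value, add a pointer to the new record;
- If the index stores a pointer only to the first record with that value, update the pointer only if the new record becomes the first such record;
Sparse indices (assuming one index entry per block):
- If the new record is not the first in its block, no change is needed;
- If a new block is created, insert an index entry for the first search-key value in the new block;
- If the new record has the least search-key value in its block, update the index entry pointing to that block;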

Multilevel insertion algorithms are simple extensions of the single-level algorithms.
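
The three sparse-index outcomes (no change, insert an entry, update an entry) can be seen in a small sketch. It assumes one index entry per block, a non-empty file, a hypothetical `BLOCK_CAPACITY`, and a simple split-in-half policy for full blocks; the index is kept as a list of least keys aligned with the list of blocks.

```python
from bisect import bisect_right, insort

BLOCK_CAPACITY = 3  # hypothetical number of records per block

def insert_record(blocks, sparse_index, key):
    """Insert `key` into a sorted, blocked file and maintain the sparse index.

    sparse_index[i] is the least search-key value in blocks[i].
    Returns a string naming the action taken on the index.
    """
    # Lookup: the block whose index entry has the largest key <= new key
    # (or the first block if the new key is smaller than every entry).
    i = max(bisect_right(sparse_index, key) - 1, 0)
    block = blocks[i]
    if len(block) < BLOCK_CAPACITY:
        insort(block, key)
        if block[0] == sparse_index[i]:
            return "no change"
        sparse_index[i] = block[0]          # new least key in this block
        return "update entry"
    # Block is full: split it in two, which creates a new block and
    # therefore a new index entry.
    insort(block, key)
    mid = len(block) // 2
    left, right = block[:mid], block[mid:]
    blocks[i:i + 1] = [left, right]
    sparse_index[i:i + 1] = [left[0], right[0]]
    return "insert entry"
```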

Secondary Indices

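An index built on a field by which the data file is not sorted. It is dense, and it holds a pointer to every record (search-key values may be duplicated).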

Primary and Secondary Indices

Secondary indices have to be dense.
Indices offer substantial benefits when searching for records.
When a file is modified, every index on the file must be updated, so indices impose overhead on database modification.
A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive:
- each record access may fetch a new block from disk
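
To make the cost difference concrete, here is a small sketch of a dense secondary index on balance over a file stored in account-number order. The sample accounts, `build_secondary_index`, and list positions standing in for record pointers are assumptions.

```python
from collections import defaultdict

# Hypothetical account file, stored in order of account number.
accounts = [
    ("A-101", 500),
    ("A-102", 400),
    ("A-110", 600),
    ("A-201", 900),
    ("A-215", 700),
    ("A-217", 500),
]

def build_secondary_index(records, field):
    """Dense secondary index: every record gets a pointer in its value's bucket."""
    buckets = defaultdict(list)
    for pos, rec in enumerate(records):
        buckets[rec[field]].append(pos)
    # Keep the index entries sorted on the search key.
    return dict(sorted(buckets.items()))

balance_index = build_secondary_index(accounts, field=1)

# Order in which file positions are visited when scanning by balance.
scan_order = [pos for ptrs in balance_index.values() for pos in ptrs]
```

Scanning in balance order visits file positions 1, 0, 5, 2, 4, 3, so consecutive accesses jump around the file rather than reading it front to back.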

B⁺-Tree

A B⁺-tree is a rooted tree satisfying the following properties:

All paths from root to leaf are of the same length.
Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
A leaf node has between ⌈(n–1)/2⌉ and n–1 values.
Special cases:
- If the root is not a leaf, it has at least 2 children.
- If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
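
The occupancy limits are easy to tabulate for a given n. The helper below is only a convenience sketch; the name `bplus_bounds` and the dictionary keys are hypothetical.

```python
from math import ceil

def bplus_bounds(n):
    """Occupancy limits (min, max) for a B+-tree with up to n pointers per node."""
    return {
        "internal_children": (ceil(n / 2), n),
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        "root_children_if_not_leaf": (2, n),
        "root_values_if_leaf": (0, n - 1),
    }
```

For example, with n = 4 an internal node has 2 to 4 children and a leaf holds 2 to 3 values; with n = 5 the ranges are 3 to 5 children and 2 to 4 values.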

Static Hashing

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. The hash function is used to locate records for access, insertion, and deletion.

In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
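
A toy hash file organization might look like the sketch below. The number of buckets, the character-sum hash function, and the `HashFile` class are assumptions chosen for illustration; bucket overflow is not handled.

```python
NUM_BUCKETS = 4  # hypothetical number of buckets

def h(key):
    """Toy hash function: sum of character codes modulo the number of buckets."""
    return sum(ord(c) for c in key) % NUM_BUCKETS

class HashFile:
    """Records are stored directly in the bucket chosen by h(search key)."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, key, record):
        self.buckets[h(key)].append((key, record))

    def lookup(self, key):
        # Different keys can share a bucket, so the key must still be compared.
        return [rec for k, rec in self.buckets[h(key)] if k == key]

    def delete(self, key):
        bucket = self.buckets[h(key)]
        bucket[:] = [(k, rec) for k, rec in bucket if k != key]
```

The same function h drives lookup, insertion, and deletion, so each operation touches exactly one bucket.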

Index Definition in SQL

Create an index

create [unique] index <index-name> on <relation-name>(<attribute-list>)

E.g.: create index b-index on branch(branch-name)

create cluster index
create noncluster index
create unique index
To drop an index
- drop index <index-name>
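
The basic create and drop statements can be tried out with Python's built-in `sqlite3` module, as in this sketch. The table layout and sample rows are assumptions, and the names use underscores (`b_index`, `branch_name`) because a hyphen is not valid in an unquoted SQL identifier. The cluster and noncluster variants are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table branch (branch_name text, branch_city text, assets integer)"
)
# A unique index also enforces that no two rows share a branch name.
conn.execute("create unique index b_index on branch(branch_name)")
conn.execute("insert into branch values ('Perryridge', 'Horseneck', 1700000)")

def index_names(conn):
    """List the indices SQLite currently knows about."""
    rows = conn.execute(
        "select name from sqlite_master where type = 'index'"
    ).fetchall()
    return sorted(name for (name,) in rows)

try:
    conn.execute("insert into branch values ('Perryridge', 'Brooklyn', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True   # the unique index refused the duplicate key

names_before = index_names(conn)
conn.execute("drop index b_index")
names_after = index_names(conn)
```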

Multiple-Key Access

Use multiple single-key indices for certain types of queries, e.g.:

  select account-number
  from account
  where branch-name = “Perryridge” and balance = 1000

Possible strategies for processing query using indices on single attributes:

1. Use the index on branch-name to find accounts at the Perryridge branch; test balance = 1000.

2. Use the index on balance to find accounts with balances of $1000; test branch-name = “Perryridge”.

3. Use the branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use the index on balance. Take the intersection of both sets of pointers obtained.
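
Two of these strategies are sketched below: using one index and then testing the other attribute, and intersecting the pointer sets from both indices. The sample accounts, list positions used as pointers, and the function names are assumptions.

```python
# Hypothetical account file: (account-number, branch-name, balance).
accounts = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 1000),
    ("A-201", "Perryridge", 900),
    ("A-215", "Mianus", 1000),
    ("A-218", "Perryridge", 1000),
]

def build_index(records, field):
    """Single-attribute index: value -> set of record pointers (positions)."""
    index = {}
    for pos, rec in enumerate(records):
        index.setdefault(rec[field], set()).add(pos)
    return index

branch_index = build_index(accounts, 1)
balance_index = build_index(accounts, 2)

def filter_after_index(branch, balance):
    """Use the branch-name index, then test the balance on each fetched record."""
    return sorted(accounts[p][0]
                  for p in branch_index.get(branch, set())
                  if accounts[p][2] == balance)

def intersect_pointers(branch, balance):
    """Intersect the pointer sets from both indices, then fetch the records."""
    pointers = (branch_index.get(branch, set())
                & balance_index.get(balance, set()))
    return sorted(accounts[p][0] for p in pointers)
```

Both functions return the same accounts; the intersection variant fetches only records that satisfy both conditions, whereas the first fetches every Perryridge record before testing the balance.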