One of my favourite interview questions at an intermediate level has always been: “Can I have a primary key on a table that does not contribute to the clustered index?”. Of course, it’s not always the same question – there are multiple variants, such as :
- Can a table have a primary key that does not participate in a clustered index?
- Does designating a column(s) as a primary key automatically create a clustered index on the table based on the key column(s)?
- Can a clustered index be created using a column(s) other than the primary key of the table?
The answer that most candidates come up with is that there is a direct dependency between the primary key and the clustered index – the primary key has to be the set of columns on which one wishes to create a clustered index.
Contrary to popular belief, the correct answer is “Yes, a table may have a clustered index defined on a column other than the primary key of the table”. It’s just that most developers are so used to the default design and coding mechanisms that generally the thought of going against the flow may not come to the mind.
Please note that this post is not about the recommended practices for creation of clustered index and primary keys, but is intended to achieve an understanding of the fact that having a primary key does not automatically imply a clustered index in all situations.
As supporting proof, allow me to create a simple scenario. As part of the demonstration, I would be executing the following steps:
- Create a test table without any keys or indexes defined
- Define a primary key on the test table, with an explicit NONCLUSTERED keyword added to the definition
- Define a CLUSTERED index on the table on a column that’s not the primary key
- Check the details of the indexes created using sys.indexes catalog view
USE tempdb GO --Safety Check IF OBJECT_ID('pkTest') IS NOT NULL DROP TABLE pkTest GO --Create test table definition CREATE TABLE pkTest (Id INT NOT NULL IDENTITY(1,1), DummyId INT NOT NULL, KeyName VARCHAR(20) ) GO --Notice the use of NONCLUSTERED ALTER TABLE pkTest ADD CONSTRAINT pk_pkTest_pkTestId PRIMARY KEY NONCLUSTERED (Id) GO --Now create a clustered index on a column other than the Primary Key CREATE CLUSTERED INDEX idx_pkTest_DummyId ON pkTest (DummyId) GO
As you can see, we created the Primary Key with a keyword – NONCLUSTERED. This should create a non-clustered index for the primary key and a clustered index for the DummyId column. Let’s run the following query to check:
--Check the types of Indexes that have been created SELECT si.object_id, OBJECT_NAME(si.object_id) AS ObjectName, si.name, si.index_id, si.type, si.type_desc, si.is_unique, si.is_unique_constraint, si.is_primary_key FROM sys.indexes AS si WHERE si.object_id = OBJECT_ID('pkTest') GO
Query execution yields the following results:
A clustered index can be defined on any column of a table as long as the column satisfies the criteria for a clustered index. Having a primary key does not imply that clustered index will exist on the primary key under all situations.
- Clustered Index Design Guidelines (BOL/MSDN): http://msdn.microsoft.com/en-us/library/ms190639(v=sql.105).aspx
Until we meet next time,
Even I ask the same question during interview.
Primary Key and Clustered Index are two different things. Developers get confused because they usually misunderstand the following line “On a table when a Primary Key defined by default SQL Server creates a Unique Clustered Index on those column(s) where primary key is defined”
As we know that we can’t have 2 clustered index on a table, When we create a Primary key where a Clustered index is already exist in that case , SQL Server creates a Unique Non Clustered Index.
Sometime I ask a different version of same topic question? i.e. What will happen if I wanted to create a Primary on a column on which already a clustered index exist?
Thanks & Regards,
Thank-you for sharing your views, Sandip. I do like the nice twist to the question. Thank-you for taking the time out and reading this post.
However there is a difference. When you create a table with column with Primary key defined, by default the column which is defined as Primary key will have clustered index. Now see what happens when you create this table
CREATE TABLE pkTest (Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
DummyId INT NOT NULL,
Column Id is automatically associated with a clustered index. Your case is valid only when you alter the table to have primary key constraint
You are right,What you mentioned here is the default behavior.
Kindly refer the **Conclusion** section, Here It is mentioned that “Having a primary key does not imply that clustered index will exist on the primary key under all situations”.
Thanks & Regards,
@Madhivanan try this:
CREATE TABLE pkTest
CONSTRAINT [PK__pkTest__Id] PRIMARY KEY NONCLUSTERED ( [id] )
There is no need for an ALTER TABLE, you can create a non-clustered primary key directly in the CREATE TABLE statement.
Here’s the real question: Why have a clustered index on a column or columns OTHER than the Primary Key?
> A clustered index can be defined on
> any column of a table as long as the
> column satisfies the criteria for a
> clustered index
What exactly are the critieria for clustered indexes?
I have posted similar stuff in 2006 @ http://vadivel.blogspot.in/2006/03/primary-keys-without-clustered-index.html …
Thank-you for sharing @Vadivel!
@Marc: Criteria for clustered indexes can be found in the references section of my post. The link is repeated here for your kind reference: [http://msdn.microsoft.com/en-us/library/ms190639(v=sql.105).aspx]
@Nakul, there are no “criteria” listed, only recommendations or situations where a clustered index should be considered.
> Here’s the real question: Why have a
> clustered index on a column or columns
> OTHER than the Primary Key?
Well, for example, because primary key is multicolumn and large, so you risk inefficient INSERTs, many page splits, and having large, inefficient clustering key in every nonclustering index on the table. Acquaint yourself with the concept of [surrogate keys].
@chojrak: After working with SQL Server for the past 18 years, including 5 years at Microsoft, I’m very familiar with the concepts (and worst practices) around primary keys, candidate keys and surrogate keys. I’ve spent a good portion of my career improving the performance of people who blindly create an [int] IDENTITY(1, 1) based prirmary key on every table
If your primary key is clustered, the risks you list are valid. If the primary key is nonclustered, none of the risks you list are valid.
Create your clustered index to satisfy your most common sorting requirements. Create your primary key to satisfy your most relevant uniqueness constraints. If they are the same column (or columns), use a clusterer primary key. If they are not, the most common sorting requirement with the least volitile data should be the clustered index.
thus you just answered your very question. Offtopic – if possible, I’d like to know more about problems with those “[int] IDENTITY(1, 1) based prirmary key on every table”. I have to confess that, although not blindly, I use them from time to time, and until now successfully 🙂
@Chojrak11: my question was rhetorical… a question designed to get the audience to think. Here are some more rhetorical questions:
Problems with primary keys based on [int] IDENTITY(1,1)
Lets’s say that you have a customer list, which is most often sorted by name.
A) What does a sorting operation on [CustomerName] look like when the primary key is clustered and based on [CustomerId]?
B) What does a sorting operation on [CustomerName] look like when the primary key is nonclustered and based on [CustomerID] and there is a clustered index on [CustomerName]?
Both A and B will work. Which will work faster?
Which will work faster when retrieving records based on a customer name (a lookup)?
Now let’s say that you have a link table that links Vendors to the Products they sell. Products can be provided by multiple Vendors and Vendors sell multiple Products.
CREATE TABLE [Products]
[ProductId] [int] IDENTITY (1, 1),
CREATE TABLE [Vendors]
[VendorId] [int] IDENTITY (1,1),
CREATE TABLE [lnk_Vendors_Products]
[recordId] [int] IDENTITY(1,1),
[StartDate] [date], — date vendor starting offering product
[EndDate] [date] — date vendor stopped offering product
Is it going to be best to create a clustered primary key on [lnk_Vendors_Products].[recordId] with a uniqueness constraint on ([fkVendorId], [fkProductId]) or just have a clustered primary key directly on ([fkVendorId], [fkProductId])?
In order to keep key proliferation to a minimum (any children of [lnk_Vendors_Products], should reference the primary key [recordId]. But the clustered index should go on ([fkVendorId], [fkProductId]). This allows children to reference only the [recordId], but allow for fast retrieval based on [fkVendorId].
Note: if you are searching for Products more often than Vendors, the clustered index should go on [fkProductId], [fkVendorId]; rather than the other way around.
@Marc Jellinek: Thank-you very much for the questions. They do get me to think hard and think more – which is always exciting. Also, thank-you very much for sharing some of the nuances around key design.
Regarding your note:
> Note: if you are searching for
> Products more often than Vendors, the
> clustered index should go on
> [fkProductId], [fkVendorId]; rather
> than the other way around.
I am sure you remember, but for the benefit of others I would like to refer a previous post here which would highlight one of the many reasons behind the note: [SQL Server – T-SQL – Best Practice – Composite Primary Keys – Performance (and accuracy) impact]
This has been a very, very interesting discussion. Thank-you, all.
@Nakul, I’m not sure I understand the reference to the blog post about composite primary keys. For that matter, I don’t understand what best practice it recommends (other than referencing ALL columns in a primary key when making a join)
@Marc: Yes, referencing ALL columns of the composite key is definitely a take-away and the other (I agree, it’s not so obvious) point was that switching the order of the columns does have an impact on the way SQL Server would use the index defined on the key.
@Marc thank you for detailed answer. A lot of additional rhetorical questions in it 🙂
So, this is an additional thing to consider while designing a table. In case of clustered keys on most often sorted columns that aren’t ever-increasing on INSERT, we gain performance on /some/ queries, but one must be sure that amount of INSERTs (and associated splits) will not kill overall performance and not cause locks on high concurrency. Some performance can be gained by properly selecting fill factor, but then we increasing # of I/Os for reads. That said such a key may possibly be better for larger tables that are mostly read and sometimes updated, but for smaller transactional tables heavily INSERTed/DELETEd I’d use small, monotonic clustering key. But as always in SQL world, it’d be best to try and measure.
I’ll do my homework and analyze this subject matter thoroughly.
Thanks again, I’d upvote you more times, but it doesn’t let me.
Returning to the orignal theme of “default behaviour” for a moment… Adding a primary key to a table which is already clustered creates a new non-clustered
index by default.
— The NONCLUSTERED keyword is optional in the DDL statement:
ALTER TABLE pkTest
ADD CONSTRAINT pk_pkTest_pkTestId PRIMARY KEY /* NONCLUSTERED */ (Id)
This is true **even when the primary key is added to the clustering key**:
— A redundant nonclustered index is created on the clustering key, DummyId:
ALTER TABLE pkTest
ADD CONSTRAINT pk_pkTest_pkTestId PRIMARY KEY /* NONCLUSTERED */ (DummyId)
As you’d expect, you can’t “promote” an already-existing clustering key to a primary key without dropping it first. eg:
ALTER TABLE pkTest
ADD CONSTRAINT pk_pkTest_pkTestId PRIMARY KEY CLUSTERED (DummyId)
Fails with error:
Msg 1902, Level 16, State 3, Line 3
Cannot create more than one clustered index on table ‘pkTest’. Drop the existing clustered index ‘idx_pkTest_DummyId’ before creating another.
Msg 1750, Level 16, State 0, Line 3
Could not create constraint. See previous errors.
Again the fundamental is same. See my first comment.
Primary Key and Clustered Index are two different thing.
When we define primary key on a table It goes through 2 stages. One is constraint creation and then Index creation. So When you define a primary key on a column where already Clustered index is exist it will create a non clustered index on the same column.
Think of a composite key. If you define a clustered index on two columns and then You are defining Primary key in one of the column , It has to create a index on that column.
@sqlcommitted: I was expanding on Nakul’s post about default behaviour, in particular implicit CLUSTERED/NONCLUSTERED primary key constraint.
Another spin on the interview question:
A database contains an empty table called SomeClusteredData and defines the field, ClusteredKey int IDENTITY.
What type of index is created when you execute the following batch:
ALTER TABLE SomeClusteredData
ADD CONSTRAINT SomeClusteredDataPKClusteredKey PRIMARY KEY (ClusteredKey);
The answer: it depends. If you don’t have appropriate rights on the table, an error is thrown and no index is created! Assuming you have rights and a clustered index already exists on SomeClusteredData at the time the batch is run, a new **unqiue, non-clustered index** is created with the key, ClusteredKey. Conversely, the batch will create a **clustered index** with clustering key, ClusteredKey, if SomeClusteredData is a heap when run by someone with appropriate access.
This question tests several things:
1) the candidates’ knowledge of the default behaviour when creating a primary key constraint.
2) the candidates’ appreciation of object-naming conventions. Object names have no relation to the structure of the object; implying a structure with the object name is bad practice (except in one or two exceptional circumstances).
3) the candidates’ attention to detail and ability to ‘think outside the box’. By considering security, they demonstrate broad vision; they question the scenario that most would take as given.
When interviewing candidates, I believe it’s equally important to test soft skills. Deterministic questions provide a basis for assessing technical knowledge. Formulating slighlty open or ambiguous questions – with the occaisional typo – allows candidates to demonstrate their softer skills (like attention to detail) rather than just making claim to them.
As stated perviously, primary keys and clustered indexes are two distinct thigns. One has nothing to do with the other. A primary key uniquely identifies a row, a clustered index orders the data.
To state this:
Here’s the real question: Why have a clustered index on a column or columns OTHER than the Primary Key?
indicates no real understanding of relational database design, the SQL database engine, or the query optimizer.
There are many reasons, far too many to list in a forum posting, where the primary key should not be the clustered index.
As for the criteria for creating a clustered index, the stated article is only a GUIDE to point you in the right direction on choosing the proper field(s) in a table for an index. The rules are any column, or set of columns, can be part of an index.
Agree with Michael
I agree with Michael
If candidates demonstrate they don’t know the difference between a primary key and an index (let alone why you might omit the primary key from a clustered key) the question of default behaviour is moot. Unless I was looking to train a junior, I’d politely show them the door.
@Michael: I know I’m a stickler for detail but:
> The rules are any column, or set of columns, can be part of an index.
Is not accurate. Columns of type text, ntext, and image cannot form part of an index. Further restrictions apply to the index’s key (<=900 bytes, up to 16 columns, non-LOB types etc…)
I stand corrected!!! Thanks! And, to make matters worse, I attempted to created an index with included columns about an hour before I posted this. One of the columns was a binary type!!! Short term memory loss, I guess!!!
Why was this reposted, without the comments or discussion?
@Marc: I am not sure what went wrong, however, will ask the webmaster to check the logs. I can see most of the comments back now.
Here is what happened:
Originally, we had comments paging by 10 records. Then we saw that many posts are having more than 10 comments and then it becomes bit un-easy as a reader to look at the latest comments. So we increased the page size to 20.
However, the email notification was assuming to be 10 rows per page and was taking you to page 3, which is empty. So the paging parameter on the query string (when you open it from the email directly) was having the problem. If you open the post directly, you could see all comments.
We just fixed it and currently monitoring it.
In Data Wareouse situations, especially when you’re partitioning your cube on date, a clustered index that only includes the date can be very efficient, because it does not produce too much index pages (fast first row) and the cube will always read the whole data for one day, so including more columns does not speed up things. We do this as a default on all OLAP fact tables.
In this case, you might want to include a primary unique index to enforce data integrity (e.g. the list of all surrogate keys in the data) or help while searching the data for reference entries.