Most technology users don’t really stop to think about HOW their technology works. Instead, we take for granted that the actions we perform will rapidly and reliably produce the results we seek. For example, if you were to Google something, you’d be frustrated if you couldn’t quickly find the information you sought. The expectation is that Google’s massive servers will efficiently run your query and sift the best possible results to the top. (Never fear: Just now, the search “how does Google work” produced ~888,000 results in 0.46 seconds, with a link from Google itself labeled “How Google Search Works” sitting comfortably atop the list.)
In actuality, Google relies on a complex, interconnected, multi-layered, and continually updated web of systems that feed into a single Google Search index containing hundreds of billions of webpages and weighing in at over 100 million gigabytes. And in actuality, much of our technology relies on the successful querying and retrieval of information — a crucial functionality powered by database applications. As businesses continue to explore new ways to harness the power of “big data,” properly functioning databases will become increasingly crucial in supporting our applications and websites.
Over the past decade, Distillery has had the good fortune of working with several companies who are innovating in the data space, including Tala, Syncplicity, and nēdl. Our dev team is strongly focused on ensuring seamless database transactions and lightning-fast search results. In this spirit, we’re reposting the blog below, written by Database Expert Kristina Kucherova, in the hopes that it can help other software developers keep their applications’ databases working at top performance.
Sometimes, in the process of designing a database, developers underestimate the scope of a project. If the project succeeds, the resulting load can put undue strain on both the system and its developers down the line. Eventually, someone realizes that the primary key of a huge table is an INT identity key, with a maximum value of 2,147,483,647.
On the one hand, 2 billion values should be pretty much enough. But what if more than 10 million records are being added daily, and you have already used more than one billion of them? That leaves you only about 114 days to make the necessary fixes or suffer the consequences. That’s not much, especially if the identity key is the kind that is “hardwired” into both the web interface and the client application.
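The arithmetic behind that estimate is easy to reproduce against your own data. The sketch below uses the article’s placeholder table and column names and assumes a steady insert rate; substitute your own observed daily volume:

```sql
-- Rough runway estimate: days until an INT identity key overflows,
-- assuming a constant daily insert rate.
DECLARE @maxInt        INT = 2147483647;
DECLARE @dailyInserts  INT = 10000000;  -- replace with your observed rate
DECLARE @currentMax    INT;

SELECT @currentMax = MAX(TableWithPKViolationId)
FROM dbo.TableWithPKViolation;

SELECT (@maxInt - @currentMax) / @dailyInserts AS DaysUntilOverflow;
-- With @currentMax = 1,000,000,000 this works out to 114 days.
```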
Should you have reached that regrettable moment, have no fear: this article is here to help, even if it’s already too late to change anything and your application is working 24/7. In the worst-case scenario, you have already seen the following error message: “Arithmetic overflow error converting IDENTITY to data type INT”. We at Distillery can offer several useful solutions for when this situation turns into an emergency on your MS SQL Server.
- Switch to BIGINT. In a perfect world, the best option would be to invent a time machine, go back, and choose BIGINT instead of INT. In reality, changing INT to BIGINT at a later stage of development requires changes not only in the database, but in both the server and client applications. If you have the opportunity to change INT to BIGINT without serious consequences, do it, and do it fast.
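On a table that is still small, the switch itself can be sketched as below. Note the caveats: the constraint name `PK_TableWithPKViolation` is a hypothetical placeholder, the primary key must be dropped before the column can be altered, and on a large table this rewrites every row under a lock, which is exactly why a staged migration is usually needed instead:

```sql
-- Sketch only: in-place INT -> BIGINT conversion.
-- Locks and rewrites the whole table; practical only while the table is small.
ALTER TABLE dbo.TableWithPKViolation
    DROP CONSTRAINT PK_TableWithPKViolation;  -- hypothetical constraint name

ALTER TABLE dbo.TableWithPKViolation
    ALTER COLUMN TableWithPKViolationId BIGINT NOT NULL;

ALTER TABLE dbo.TableWithPKViolation
    ADD CONSTRAINT PK_TableWithPKViolation
        PRIMARY KEY CLUSTERED (TableWithPKViolationId);
```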
- Use negative values. The increment step is set to one by default — IDENTITY(1,1). When you are about to reach 2 billion, you can reseed the identity to the negative end of the range:
DBCC CHECKIDENT ('TableWithPKViolation', RESEED, -2147483647)
thus buying additional time before switching to BIGINT. However, this can create a problem if your domain logic prevents the use of negative values. Make sure your domain logic can accommodate them before making the switch.
- Create a table of unused values. The idea is to find the identity key values that were skipped, thus building a pool of free, unused values. Like the previous approach, this buys additional time before moving to BIGINT. Whenever you use increment keys, you will see skipped values, because a value is reserved at the beginning of a transaction and is never returned to the pool if the transaction rolls back; the next transaction simply gets a new value. Deleting data from the table frees up values as well. Your main task is to build a table containing all available values, plus a logical field that acts as a reservation flag. Reserved values can be deleted from this table on a schedule, e.g. once a day. You then create a procedure that hands out the required number of identity keys, and insert the returned values using the SET IDENTITY_INSERT ON command. If you regularly delete rows from the main table, the deleted keys can be added back to the pool as free values. Note that this method works only if you don’t need to preserve the order of records in the table; in other words, only if you are not using ORDER BY Id. Otherwise, you will have to select a different sort key.
Below, you will find examples of the third method. If you have fewer than one billion records and have already deleted some rows from the table, it is more efficient to generate a full table of values and then remove the identity keys that are already in use.
- Create a table to generate a sequence and fill it with values from 1 to 2,147,483,647.
CREATE TABLE [dbo].[IntRange](
    [Id] [int] NOT NULL
) ON [PRIMARY]
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[spNewIDPopulateInsert]
    @batchsize INT = 10000,
    @startFrom INT = NULL
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    IF @startFrom IS NULL
    BEGIN
        SELECT @startFrom = MAX(id) FROM dbo.IntRange;
    END;

    DECLARE @startId INT = ISNULL(@startFrom, 0);
    DECLARE @rowscount INT = @batchsize;
    DECLARE @maxId INT = 2147483647;

    WHILE @rowscount = @batchsize
    BEGIN
        INSERT INTO dbo.IntRange (id)
        SELECT id
        FROM (
            SELECT TOP (@batchsize)
                   @startId + ROW_NUMBER() OVER (ORDER BY TableWithPKViolationId) AS id
            FROM dbo.TableWithPKViolation -- any table where you have @batchsize rows
        ) AS genids
        WHERE id < @maxId;

        SET @rowscount = @@ROWCOUNT;
        SET @startId = @startId + @rowscount;

        PRINT CONVERT(VARCHAR(50), GETDATE(), 121) + ' ' + CAST(@startId AS VARCHAR(50));
    END
END
GO
exec dbo.spNewIDPopulateInsert @batchsize = 10000000
This script uses the TableWithPKViolation table to generate the sequence. However, you are free to use any method you prefer, including the SEQUENCE object built into MS SQL Server. We selected this method as the most performant one.
ALTER TABLE [dbo].[IntRange]
ADD PRIMARY KEY CLUSTERED ([Id] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
      IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
      ALLOW_PAGE_LOCKS = OFF) ON [PRIMARY]
GO
- Create a table for free values and fill it.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[NewIds](
    [NewId] [int] NOT NULL,
    [DateUsedUtc] [datetime] NULL
) ON [PRIMARY]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[spNewIDPopulateInsertFiltered]
    @batchsize INT = 10000,
    @startFrom INT = NULL,
    @endTill INT = NULL
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    IF @startFrom IS NULL
    BEGIN
        SELECT @startFrom = MAX([NewId]) FROM dbo.NewIds;
    END;

    DECLARE @startId INT = ISNULL(@startFrom, 0);
    DECLARE @rowscount INT = @batchsize;
    DECLARE @maxId INT = ISNULL(@endTill, 2147483647);
    DECLARE @endId INT = @startId + @batchsize;

    WHILE @startId < @maxId
    BEGIN
        INSERT INTO [NewIds] ([NewId])
        SELECT IR.id
        FROM [dbo].[IntRange] AS IR
        WHERE IR.id >= @startId AND IR.id < @endId
          AND NOT EXISTS (
                SELECT 1
                FROM [dbo].[TableWithPKViolation] AS Tb WITH (NOLOCK)
                WHERE Tb.TableWithPKViolationId = IR.id
          );

        SET @rowscount = @@ROWCOUNT;
        SET @startId = @endId;
        SET @endId = @endId + @batchsize;
        IF @endId > @maxId SET @endId = @maxId;

        PRINT CONVERT(VARCHAR(50), GETDATE(), 121) + ' ' + CAST(@startId AS VARCHAR(50));
    END
END
GO
----- Run each part in a separate window, in parallel -----

--part 1
EXEC dbo.spNewIDPopulateInsertFiltered @batchsize = 10000000, @startFrom = 1, @endTill = 500000000
--end of part 1

--part 2
EXEC dbo.spNewIDPopulateInsertFiltered @batchsize = 10000000, @startFrom = 500000000, @endTill = 1000000000
--end of part 2

--part 3
EXEC dbo.spNewIDPopulateInsertFiltered @batchsize = 10000000, @startFrom = 1000000000, @endTill = 1500000000
--end of part 3

--part 4
DECLARE @maxId INT
SELECT @maxId = MAX(TableWithPKViolationId) FROM dbo.TableWithPKViolation
EXEC dbo.spNewIDPopulateInsertFiltered @batchsize = 10000000, @startFrom = 1500000000, @endTill = @maxId
--end of part 4
ALTER TABLE [dbo].[NewIds]
ADD CONSTRAINT [PK_NewIds] PRIMARY KEY CLUSTERED ([NewId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
      IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
      ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_NewIds_DateUsedUtc]
ON [dbo].[NewIds] ([DateUsedUtc] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
      DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
      ALLOW_PAGE_LOCKS = OFF)
GO

ALTER TABLE [dbo].[NewIds] SET (LOCK_ESCALATION = DISABLE)
GO
Check that the table was generated properly: the NewIds table must not contain any identity keys that already exist in the main TableWithPKViolation table.
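One way to verify this, using the article’s table names, is a join between the pool and the main table that should come back empty:

```sql
-- Sanity check: no value in the free pool may collide with an existing key.
-- This query should return 0.
SELECT COUNT(*) AS Collisions
FROM dbo.NewIds AS N
INNER JOIN dbo.TableWithPKViolation AS T
        ON T.TableWithPKViolationId = N.[NewId];
```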
- Create a process that will mark and return all the available values.
CREATE PROCEDURE [dbo].[spGetTableWithPKViolationIds]
    @batchsize INT = 1
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    DECLARE @rowcount INT,
            @now DATETIME = GETUTCDATE();

    BEGIN TRAN

    UPDATE TOP (@batchsize) dbo.NewIds
    SET DateUsedUtc = @now
    OUTPUT inserted.[NewId]
    WHERE DateUsedUtc IS NULL;

    SET @rowcount = @@ROWCOUNT;

    IF @rowcount != @batchsize
    BEGIN
        DECLARE @msg NVARCHAR(2048);
        SET @msg = 'TableWithPKViolationId out of ids. '
                 + 'sp spGetTableWithPKViolationIds, table NewIds. '
                 + 'Ids requested ' + CAST(@batchsize AS NVARCHAR(255))
                 + ', IDs available ' + CAST(@rowcount AS NVARCHAR(255));
        RAISERROR(@msg, 16, 1);
        ROLLBACK;
    END
    ELSE
    BEGIN
        COMMIT TRAN
    END;
END
GO
- Modify all the procedures that insert into the main table and return SCOPE_IDENTITY() so that they call the new procedure instead. If your time is limited or your hardware is relatively powerful, you can alternatively create an INSTEAD OF INSERT trigger, thus avoiding changes to a huge number of procedures. Here is an example of such a procedure change.
CREATE TABLE #tmp_Id (Id INT);

INSERT INTO #tmp_Id
EXEC spGetTableWithPKViolationIds @batchsize = @IDNumber;

SELECT @newVersionId = Id FROM #tmp_Id;

SET IDENTITY_INSERT [dbo].[TableWithPKViolation] ON;
Please note that the SET IDENTITY_INSERT option requires that the executing user have ALTER permission on the TableWithPKViolation table.
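The trigger-based alternative mentioned above could look roughly like the sketch below. It is not from the original article: the trigger name and the `SomeColumn` payload column are placeholders, and existing code that relies on SCOPE_IDENTITY() or OUTPUT clauses against this table would still need review, since an INSTEAD OF trigger changes how inserts behave:

```sql
-- Sketch: substitute pooled ids for the identity on every insert.
CREATE TRIGGER trTableWithPKViolation_Insert
ON dbo.TableWithPKViolation
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @cnt INT = (SELECT COUNT(*) FROM inserted);

    -- Reserve exactly as many free ids as there are incoming rows.
    DECLARE @ids TABLE (rn INT IDENTITY(1,1), Id INT);
    INSERT INTO @ids (Id)
    EXEC dbo.spGetTableWithPKViolationIds @batchsize = @cnt;

    -- Pair each incoming row with a reserved id.
    SET IDENTITY_INSERT dbo.TableWithPKViolation ON;

    INSERT INTO dbo.TableWithPKViolation (TableWithPKViolationId, SomeColumn)
    SELECT i.Id, src.SomeColumn
    FROM (SELECT SomeColumn,
                 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
          FROM inserted) AS src
    JOIN @ids AS i ON i.rn = src.rn;

    SET IDENTITY_INSERT dbo.TableWithPKViolation OFF;
END;
```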
- Next, schedule regular cleanup of the used values from the identity key table.
CREATE PROCEDURE dbo.spCleanupNewIds
    @batchSize INT = 4999
AS
BEGIN
    SET NOCOUNT ON

    DECLARE @minId INT
    DECLARE @maxId INT

    SELECT @minId = MIN([NewId]), @maxId = MAX([NewId])
    FROM dbo.NewIds WITH (NOLOCK)
    WHERE DateUsedUtc IS NOT NULL;

    DECLARE @totRowCount INT = 0
    DECLARE @rowCount INT = @batchSize

    WHILE @rowCount = @batchSize
    BEGIN
        DELETE TOP (@batchSize)
        FROM dbo.NewIds
        WHERE DateUsedUtc IS NOT NULL
          AND [NewId] >= @minId AND [NewId] <= @maxId

        SET @rowCount = @@ROWCOUNT
        SET @totRowCount = @totRowCount + @rowCount
    END

    PRINT 'Total records cleaned up - ' + CAST(@totRowCount AS VARCHAR(100))
END
GO
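One convenient way to run this on a schedule is a SQL Server Agent job. The sketch below is an assumption, not part of the original article: the job name, schedule, and `YourDatabase` placeholder are all arbitrary, and you may equally well use whatever scheduler your team already relies on:

```sql
-- Sketch: run spCleanupNewIds daily at 03:00 via SQL Server Agent.
USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'CleanupNewIds';
EXEC dbo.sp_add_jobstep
    @job_name      = N'CleanupNewIds',
    @step_name     = N'Run cleanup',
    @subsystem     = N'TSQL',
    @command       = N'EXEC dbo.spCleanupNewIds;',
    @database_name = N'YourDatabase';  -- placeholder database name
EXEC dbo.sp_add_jobschedule
    @job_name          = N'CleanupNewIds',
    @name              = N'Daily',
    @freq_type         = 4,        -- daily
    @freq_interval     = 1,
    @active_start_time = 030000;   -- 03:00
EXEC dbo.sp_add_jobserver @job_name = N'CleanupNewIds';
GO
```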
If the sequence of identity keys plays an important role, this method is still very helpful, because all new values are issued in increasing order. However, you need to handle the transition period: the best solution is to switch any existing procedures that rely on proper ordering to a different field in the same table, e.g. a date field.
There you go! This solution buys you some additional time before switching to BIGINT. Nevertheless, we highly recommend avoiding such difficult situations altogether by planning the transition in advance. If you need help administering your database, our DBA team will be happy to help!