SQL insert question

**JeenLeen** · 2021-07-13, 10:58 PM (ISO 8601)

For SQL Server (MS SQL Server in particular), when I use a INSERT INTO line to add data to a table, are the data added in the same order as the source?

E.g., if I use
INSERT INTO storageTable
select <variables> from sourceTable
;

Would the records in storageTable necessarily be in the same order (sort order) as they are in sourceTable?

I think the answer is no, based on what I'm seeing, but can't seem to find if that should be expected or if there's likely a bug somewhere.

**factotum** · 2021-07-14, 01:23 AM (ISO 8601)

There is never any guarantee of rows being inserted or returned in any particular order in SQL server--it's all up to whatever the query optimiser decides is the best order to handle things. This is why you have to use an ORDER BY in a SELECT if you want the results returned the way you want.

**sihnfahl** · 2021-07-14, 07:17 AM (ISO 8601)

Also, how many records are you anticipating in the destination table?

If it's a large amount, do you have an index defined for that table?

**Khedrac** · 2021-07-14, 08:21 AM (ISO 8601)

To a fair extent the order records are in the database should not matter at all, the display order should be based on whatever field(s) you choose to sort by.
I think most SQL databases default to returning records in order of creation, but I wouldn't guarantee it.

You could try adding an ORDER BY clause to the SELECT statement you are inserting from, but again I don't think there are any guarantees for the actual order in this case, at least partially because SQL is defined as not being sequential so your basic SELECT and INSERT statements are notionally simultaneous.

**JeenLeen** · 2021-07-14, 09:23 AM (ISO 8601)

Thanks. That's what some Googling made me think, but hearing it from another source (and in clearer English) makes me feel more confident I didn't accidentally screw something up. (My other checks also give evidence that I didn't screw something up, but the changing order made me nervous.)

To answer questions:
I'm not sure if the data would count as indexed or not. I'm doing all the coding in SAS, but SAS is 'talking to' a SQL table via a... I think OMBC connection, but I don't have the code in front of me.
It was sorted in SAS, then SAS connected to the SQL Server and told it what SQL code to run within the SQL Server.... so, yeah, not sure what that exactly means. But the SQL table itself doesn't have an index.
Hundreds of thousands of records, at least.

**factotum** · 2021-07-14, 10:20 AM (ISO 8601)

Indexes are pretty important. Without them, any query you execute against that table essentially has to search through the entire table to get the results, and while that might be quite fast so long as the entire table fits into the memory available to the SQL server, it doesn't scale at all once you get beyond that.

**JeenLeen** · 2021-07-14, 10:39 AM (ISO 8601)

Originally Posted by factotum

Indexes are pretty important. Without them, any query you execute against that table essentially has to search through the entire table to get the results, and while that might be quite fast so long as the entire table fits into the memory available to the SQL server, it doesn't scale at all once you get beyond that.

Thanks.
I'll plan to bring that up when we're in code review later this year. It's "crunch time" right now, so folk are moving too fast to fix things to be more efficiently, but the team is making notes for the next project year. Adding some indices to the CREATE TABLE code would be a good idea, as I know some of our tables get really large.

This all came up because I realized two of the tables had fields created as Int, when they really needed to hold decimal data. I was told to DROP the old table and CREATE it anew, but I wanted to do a simple PROC COMPARE (SAS procedure) to check the data I was inserting into the newly-made table, to make sure it really got everything correctly. When almost nothing matched, I realized it was probably that SQL inserted the data in a (to my eyes, but of course not really) random order.

**Khedrac** · 2021-07-14, 11:44 AM (ISO 8601)

Yes - if you want to compare two versions of a table then throwing in a ORDER BY to sort it (usually on the primary key is best) is basically necessary.

Indexes are a bit of a two-edged sword - they greatly speed up queries against the table that use them (so an index on one field is of no use if you query on another), but having too many can slow down adding, deleting and modifying records (as you have to update all of the indexes). Back in the day it wasn't uncommon when doing major data uploads to drop the indexes, load the data and then re-index.

If your table has a primary key defined (and it should) then it is indexed on that key.

**factotum** · 2021-07-14, 02:14 PM (ISO 8601)

Depends on the application, obviously, but querying the data tends to happen a *lot* more than updating does, for most tables at any rate. Also, JeenLeen, leaving indexes for "we'll optimise when we're done" is a really bad idea, because that's when you *do* end up with multiple indexes due to one not being sufficient--the design of the queries themselves really needs to go hand-in-hand with the index design, because carefully-designed queries might be able to use a single index rather than having to set up two differerent ones.

For example: let's say you have a table of names, and you set up an index on it which is indexed via last name, then first name. If your queries are querying against last name, or both last name and first name, they can use this index no problem. A query which is *only* looking for first name can't make best use of the index, though, because first name is the second column in the index and such a query would have to perforce scan through all the last names. The index might still provide a benefit if it's much smaller than the actual table, because it will take less time to scan it, but it won't be as big an advantage as it otherwise might.

**JeenLeen** · 2021-07-14, 02:33 PM (ISO 8601)

Ah; that makes a lot of sense.

I'm relatively new to this position, so I'm not sure what queries are done most often... but I think there's just a few ID fields that we normally will filter on, so I could see something like an index on 2 or 3 fields.

One of them is something like a "Publication Number". It is generally 1, and ideally is just 1 for all records. But if we have to have a redo (due to some error or just a change in specs), we retain all the PubNum=1 records and load new data as PubNum=2. Then we'd query against PubNum=2 when pulling data there.
Would cases that like generally be good for indexing?
I recall it's bad to index on something where it changes for most all records (like a unique ID number)... eh, I guess I'm really saying I don't recall the basic index Best Practices I read a year or two ago, so asking this.

**MCerberus** · 2021-07-14, 04:31 PM (ISO 8601)

Originally Posted by JeenLeen

Ah; that makes a lot of sense.

I'm relatively new to this position, so I'm not sure what queries are done most often... but I think there's just a few ID fields that we normally will filter on, so I could see something like an index on 2 or 3 fields.

One of them is something like a "Publication Number". It is generally 1, and ideally is just 1 for all records. But if we have to have a redo (due to some error or just a change in specs), we retain all the PubNum=1 records and load new data as PubNum=2. Then we'd query against PubNum=2 when pulling data there.
Would cases that like generally be good for indexing?
I recall it's bad to index on something where it changes for most all records (like a unique ID number)... eh, I guess I'm really saying I don't recall the basic index Best Practices I read a year or two ago, so asking this.

Under a normalized database this would generally be split into two tables. This is a classic one-to-many.

The first table would contain the information not related to publish numbers and you would store PubNum data on a second table referring to the first one with a foreign key with all the field that can change on the second table.

edit - or are you trying to have a running log or record OF the revisions themselves?

edit2- but if you want to guarantee you have a way of recalling records in the order they were inserted into the table, put an identity index on the table. You can't trust sql on un-indexed tables ever.

**factotum** · 2021-07-15, 01:16 AM (ISO 8601)

Originally Posted by JeenLeen

Would cases that like generally be good for indexing?

Columns that only have a few unique values are generally a *bad* choice for an index. Indexes work best when your query is distinct enough that it'll only return a few rows.

**Khedrac** · 2021-07-15, 02:14 AM (ISO 8601)

Something to bear in mind is that modern databases are a **lot** better at optimizing than they used to be. If a query gets run frequently the underlying database will optimize the query regardless of the formal indexes defined on the database.

Also anything defined as a "Key" (primary or foreign) is indexed automatically - it has to be to be a Key, so you don't need to index them individually.
(The only twist here is you have a multiple-field Key, e.g. Country-State/Region-County, the fields that come down the order in the index - i.e. State/Region and County - are not indexed individually and so queries against them won't benefit from the Key unless the first field - country - is also referenced in the query.)

**Jasdoif** · 2021-07-15, 01:50 PM (ISO 8601)

Originally Posted by factotum

Columns that only have a few unique values are generally a *bad* choice for an index. Indexes work best when your query is distinct enough that it'll only return a few rows.

A column where you're expecting the same value in the overwhelming majority of rows is particularly bad; you might honestly be better off with a table scan rather than incurring the overhead of an index lookup pointing to nearly all the rows, that'll then need to be scanned to fetch the other columns in the query anyway.

**Wintermoot** · 2021-07-19, 01:18 PM (ISO 8601)

Originally Posted by Jasdoif

A column where you're expecting the same value in the overwhelming majority of rows is particularly bad; you might honestly be better off with a table scan rather than incurring the overhead of an index lookup pointing to nearly all the rows, that'll then need to be scanned to fetch the other columns in the query anyway.

oracle has a bitmap index which is build for those low-cardinality type indices, but generally speaking, they are of very low worth. Still, on a case by case basis, worth looking into.

For example. A geographic bitmap index on state for a very large table of data is a good use of it, if the majority of the reports are going to be limiting on state codes. We have a database with several hundred million values, and slapping a bitmap index on the state code saved us a ton of cycles because 90% of the reports on the data are looking for a single state or small group of states.

**Wintermoot** · 2021-07-19, 01:35 PM (ISO 8601)

Originally Posted by JeenLeen

Ah; that makes a lot of sense.

I'm relatively new to this position, so I'm not sure what queries are done most often... but I think there's just a few ID fields that we normally will filter on, so I could see something like an index on 2 or 3 fields.

One of them is something like a "Publication Number". It is generally 1, and ideally is just 1 for all records. But if we have to have a redo (due to some error or just a change in specs), we retain all the PubNum=1 records and load new data as PubNum=2. Then we'd query against PubNum=2 when pulling data there.
Would cases that like generally be good for indexing?
I recall it's bad to index on something where it changes for most all records (like a unique ID number)... eh, I guess I'm really saying I don't recall the basic index Best Practices I read a year or two ago, so asking this.

If I'm following you have a database where 99% of the records have a PubNum of 1. But a few of the records there has been a change so a PubNum 2 exists.

So for example, you have a book "the zen of motorcycle malignance", and there had to be a reprint or something, so there are two records for that book.

So most of your queries want to return the just the highest pubnum for each book. In that case, my advice would be to add a current_record flag or delete flag, then set it to flag the old pubnum records. Then you can just put "where current_record = 'Y'" or similar in all your views or queries. I would probably build a view with just the current records and run most queries off the view.

**Jasdoif** · 2021-07-19, 01:45 PM (ISO 8601)

Originally Posted by Wintermoot

Originally Posted by Jasdoif

A column where you're expecting the same value in the overwhelming majority of rows is particularly bad; you might honestly be better off with a table scan rather than incurring the overhead of an index lookup pointing to nearly all the rows, that'll then need to be scanned to fetch the other columns in the query anyway.

oracle has a bitmap index which is build for those low-cardinality type indices, but generally speaking, they are of very low worth. Still, on a case by case basis, worth looking into.

For example. A geographic bitmap index on state for a very large table of data is a good use of it, if the majority of the reports are going to be limiting on state codes. We have a database with several hundred million values, and slapping a bitmap index on the state code saved us a ton of cycles because 90% of the reports on the data are looking for a single state or small group of states.

It's not the cardinality that's the problem, it's the extremely uneven distribution. If e.g. 98% of all rows have a value of "1", and it is expected that nearly all rows have a value of "1"....Then when you're looking at the index for the value of "1", you'll get references to 98% of the rows. Following that many references row by row to pull out the other columns in the query could easily incur more overhead than simply going through the rows with a table scan. Sure, it'll be great when you're looking for values other than "1"; but with the expectation that nearly all rows will be "1", you'd need to come up with a specific scenario where that comes up often enough to warrant the overhead of maintaining that index (in this particular case, maybe some sort of regular audit on records that needed to have corrections published?).

**factotum** · 2021-07-19, 03:12 PM (ISO 8601)

Also, the OP said he was specifically (or at the very least, mostly) using MS SQL Server, which has no such index type that I'm aware of.

**veti** · 2021-07-19, 03:43 PM (ISO 8601)

I would think the column to index would be the (foreign key id) column whose value remains the same, when PubNum changes from 1 to 2 to whatever. Because that's the column with non-unique values that you're going to do a lot of querying against.

For instance, to get the latest version of each record, you would want to run a query of the form:

Code:

select (stuff) from BigTable bt
    left join BigTable bt1 on bt.nonuniqueId = bt1.nonuniqueId
                          and bt1.PubNum > bt.PubNum
    where bt1.uniqueId is null

So the nonuniqueId column is going to be doing a lot of lifting.

**Khedrac** · 2021-07-20, 02:27 AM (ISO 8601)

Originally Posted by veti

I would think the column to index would be the (foreign key id) column whose value remains the same, when PubNum changes from 1 to 2 to whatever. Because that's the column with non-unique values that you're going to do a lot of querying against.

For instance, to get the latest version of each record, you would want to run a query of the form:

Code:

select (stuff) from BigTable bt
    left join BigTable bt1 on bt.nonuniqueId = bt1.nonuniqueId
                          and bt1.PubNum > bt.PubNum
    where bt1.uniqueId is null

So the nonuniqueId column is going to be doing a lot of lifting.

I've worked on a database that kept the record history within the table for the data and the table design was slightly more complex than this but the queries were much simpler.

Basically you add three columns to every data table (not the reference tables):

Last Modified (timestamp)
Modified by (user id)
Active Indicator (boolean)

Every time a record is "updated" it's Active Indicator is set to FALSE and a new record is created with the new details, plus the relevant info for the extra fields. Under normal circumstances every query has the extra WHERE condition "table.active_indicator = TRUE" and there's no need for extra joins and their processing overhead. If you need to pull out the history of a record it's so much easier than any other audit system I have ever seen (though I stopped being an SQL-write 10+ years ago).

In practice it's a very good storage model for a data set where there is a high volume of information with a low update rate and the ability to access audit information is important. (If you have a high rate of change to records it quickly becomes a poor system due to space requirements.)
The actual database I worked with actually used a single character text string (Y or N) for the active_ind rather than a boolean which helped with readability for the uninitiated (both code and record histories).

This would make the equivalent code this:

Code:

SELECT (stuff) FROM BigTable bt
  WHERE bt.active_ind IS TRUE

**MCerberus** · 2021-07-20, 05:06 PM (ISO 8601)

Originally Posted by Khedrac

I've worked on a database that kept the record history within the table for the data and the table design was slightly more complex than this but the queries were much simpler.

Basically you add three columns to every data table (not the reference tables):

Last Modified (timestamp)
Modified by (user id)
Active Indicator (boolean)

Every time a record is "updated" it's Active Indicator is set to FALSE and a new record is created with the new details, plus the relevant info for the extra fields. Under normal circumstances every query has the extra WHERE condition "table.active_indicator = TRUE" and there's no need for extra joins and their processing overhead. If you need to pull out the history of a record it's so much easier than any other audit system I have ever seen (though I stopped being an SQL-write 10+ years ago).

In practice it's a very good storage model for a data set where there is a high volume of information with a low update rate and the ability to access audit information is important. (If you have a high rate of change to records it quickly becomes a poor system due to space requirements.)
The actual database I worked with actually used a single character text string (Y or N) for the active_ind rather than a boolean which helped with readability for the uninitiated (both code and record histories).

This would make the equivalent code this:

Code:

SELECT (stuff) FROM BigTable bt
  WHERE bt.active_ind IS TRUE

Alright I just had a bad kind of flashback to a warehouse management product I used to sub-contract for had their entire menu on a self-joined mess of a table structure. And they weren't indexed on the join.

I would still actually think this is a case to split into two tables because the redundancy of data is massive here. Then making an inner-join statement into a view so that the execution plan is mostly pre-compiled. Recent era MS SQL is actually very nice efficiency wise if you take the right steps.

Like I'm talking pushing 5 figure records out to multiple clients through an IIS service where the database and IIS a running off an old celeron laptop

**Jasdoif** · 2021-07-20, 06:50 PM (ISO 8601)

Originally Posted by MCerberus

Alright I just had a bad kind of flashback to a warehouse management product I used to sub-contract for had their entire menu on a self-joined mess of a table structure. And they weren't indexed on the join.

I would still actually think this is a case to split into two tables because the redundancy of data is massive here. Then making an inner-join statement into a view so that the execution plan is mostly pre-compiled. Recent era MS SQL is actually very nice efficiency wise if you take the right steps.

If we're not talking about indexes like JeenLeen was (sort of) asking about, and if we don't have the luxury of changing the underlying table design, I'd say a view would benefit from using a window function rather than a self-join. Something like

Code:

SELECT pubId,
     {stuff},
     pubNum,
     MAX(pubNum) OVER (PARTITION BY pubId) AS MaxPubNum
WHERE
     pubNum = MaxPubNum

**SpoonR** · 2021-07-24, 07:39 PM (ISO 8601)

Quick sanity check. Do your source tables have a primary key? A database I took over did overnight runs for reports a couple pages long. I go digging; the table has about a decade’s worth of cruft, and no key. At all. Add key, some other fixes, and it ran pretty much instantly.

Re indexing, the goal is to break the table into smaller, similar size chunks. For example, a real index breaks all page lookups to 26 (approx), one for each letter. 98%=1, 2%=2 is pretty much identical to no index. A separate index bin for each record is also problematic - at that point you’re kludging together a key.

Also, if you need to go googling, it’s “odbc” not ombc.

Thread: SQL insert question

SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Re: SQL insert question

Posting Permissions