No, this isn't the latest agony aunt column - I'm talking about data relationships! Specifically, relationships between columns within single tables. For example, let's take a simple list of UK towns with their associated county and country. This was the first hit on Google when I did such a search. Now assume we also have a table of people and the town in which they live. So we might want to ask the very simple and reasonable question: "Get me all the people who live in a town which is in England and in the county of Essex."
So why would this cause a problem? Well, first we need to understand how Oracle retrieves information from the database and its strategy (plan) for doing so. This is a topic on which entire books have been written, so I'm just going to give a very basic example with just two strategies.
- We could query the towns table to get a list of all the towns matching our predicate (the town is in Essex, England). Then for each town we find, we go to the people table and get the list of people in that town. This kind of approach is commonly known as Nested Loops.
- We query the list of towns matching our predicate. Then we query the list of people. Once we have those two lists together, we match the two up (how this is done can vary). This kind of approach is commonly known as a Hash Join where rows are matched using a hash function.
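The two strategies above can be sketched in a few lines of Python. This is purely a toy illustration of the access patterns, not Oracle internals; the town and people rows are hypothetical stand-ins for the tables described above.

```python
# Hypothetical rows: (town, county, country) and (person, town).
towns = [("Chelmsford", "Essex", "England"), ("Colchester", "Essex", "England"),
         ("Cardiff", "Glamorgan", "Wales")]
people = [("Alice", "Chelmsford"), ("Bob", "Cardiff"), ("Carol", "Colchester")]

def nested_loops(towns, people):
    # For each town matching the predicate, go and fetch its people.
    result = []
    for town, county, country in towns:
        if county == "Essex" and country == "England":
            for person, person_town in people:
                if person_town == town:
                    result.append(person)
    return result

def hash_join(towns, people):
    # Build a hash table on the filtered towns, then probe it once per person.
    matching_towns = {t[0] for t in towns if t[1] == "Essex" and t[2] == "England"}
    return [person for person, person_town in people if person_town in matching_towns]

print(sorted(nested_loops(towns, people)))  # ['Alice', 'Carol']
print(sorted(hash_join(towns, people)))     # ['Alice', 'Carol']
```

Both strategies return the same rows; the difference is purely in the work done. The nested loop repeats the inner lookup once per matching town, while the hash join makes a single pass over each list.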
So which approach is better? It depends. If we found only one town in the county of Essex, then clearly it would be quicker to read that one row and then retrieve just the people living in that town using an index access path. We are done. However, what if we found 10,000 towns in Essex? Fetching the list of people for a single town might be very quick, but doing it 10,000 times likely isn't. No matter how quick something is, if we do it enough times then the process as a whole becomes slow. In this case we would be quicker reading out the two lists up front and then matching them (i.e. hash joining).
So how does Oracle know how many rows (the cardinality) it is going to process before it has actually processed them? Well... it doesn't! It has to estimate this up front at the time it parses the query, based upon the information it knows (statistics on the table and its columns). In simple examples this is very basic mathematics:
Number of rows in the table / Number of distinct values in the column
So if we have a table of 10,000 rows and a column with 500 distinct values, then when we filter on one value of that column Oracle will expect to retrieve 10,000/500 = 20 rows.
This basic mathematics extends even further when we filter upon multiple columns. We use standard probability theory, which states that the probability of two independent events both occurring is the product of their individual probabilities:
P(X and Y) = P(X) * P(Y)
The probability of retrieving any given row (the selectivity) is simply the number of rows we expect to retrieve divided by the total number of rows in the table. So again using the example above, say we have the same 10,000 records, with column A having 500 distinct values and column B having 20 distinct values. The probability of retrieving a row with single-value predicates on both A and B is:
(10,000/500)/10,000 * (10,000/20)/10,000 = 0.002 (0.2%) * 0.05 (5%) = 0.0001 (0.01%)

Multiplying back up by the row count, Oracle would estimate 10,000 * 0.0001 = 1 row.
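The same calculation, written out step by step with the numbers from the example:

```python
# Combined selectivity under the independence assumption:
# sel(A and B) = sel(A) * sel(B).
num_rows = 10_000
sel_a = (num_rows / 500) / num_rows   # 0.002 - equivalently 1/NDV of column A
sel_b = (num_rows / 20) / num_rows    # 0.05  - equivalently 1/NDV of column B
combined = sel_a * sel_b              # 0.0001, i.e. 0.01%

# Multiply back up by the row count to get the cardinality estimate.
print(num_rows * combined)            # 1.0
```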
Most of the time this works perfectly fine. However... when we have relationships within our dataset, this breaks down somewhat. For example, if we pick the county of Essex, we naturally know that Essex doesn't exist in any other country. Oracle doesn't though. So considering our single dataset, we can estimate the cardinality using the above rules without even loading it into Oracle, simply by looking at the data. So let's do that.
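To see why correlation matters before we get to the real data, here is a tiny hypothetical dataset where county fully determines country, as it does in real life. The counties and row counts are made up purely for illustration:

```python
# Hypothetical (county, country) rows: 5 counties, 2 countries, 20 rows each.
# County determines country, so the two predicates are correlated.
rows = ([("Essex", "England")] * 20 + [("Kent", "England")] * 20
        + [("Surrey", "England")] * 20 + [("Glamorgan", "Wales")] * 20
        + [("Gwynedd", "Wales")] * 20)

num_rows = len(rows)                            # 100
ndv_county = len({county for county, _ in rows})    # 5
ndv_country = len({country for _, country in rows}) # 2

# Independence-based estimate, exactly as in the formula above.
estimate = num_rows * (1 / ndv_county) * (1 / ndv_country)  # 10.0

# Actual cardinality: every Essex row is in England.
actual = sum(1 for county, country in rows
             if county == "Essex" and country == "England")  # 20

print(estimate, actual)
```

The independence assumption estimates 10 rows, but the correlated data actually returns 20: the country predicate filters out nothing once the county is fixed, yet the formula still halves the estimate.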