PDA

View Full Version : Help with normalizing tables.



Mystic Muse
2014-03-02, 03:20 AM
I need help normalizing two tables in Microsoft Word for a class of mine.

[http://i1206.photobucket.com/albums/bb443/Zernobog888/Capture_zps07c928d0.png (http://s1206.photobucket.com/user/Zernobog888/media/Capture_zps07c928d0.png.html)[/B]

http://i1206.photobucket.com/albums/bb443/Zernobog888/Capture1_zps460326d9.png (http://s1206.photobucket.com/user/Zernobog888/media/Capture1_zps460326d9.png.html)

Here are the tables. Sorry about the battery thing in the middle of the picture, my cord is funky.

I just need help getting walked through it a bit.

Teddy
2014-03-02, 07:43 AM
Well, I won't do your assignment for you, but at least I can explain the different normalisation forms.

1NF is the first form, and it basically forbids attributes to be multivalued. If a tuple (i.e. a single entry in a table) is of variable length, you're not in 1NF.

2NF requires the table to be in 1NF and all attributes in the table to be fully functionaly dependent on the whole of any candidate key. If just one part of a candidate key (only applicable in the instances when there exists a candidate key consisting of several attributes) is enough to determine the value of any other attribute when doing a query, you're not in 2NF.

3NF required the table to be in 2NF and forbids any non-prime attribute to be transitively dependent of any candidate key (that is, no non-key attribute is allowed to determine any other non-key attribute). If you can't use an attribute to find the right row in a table, but still use it to determine any other attribute which isn't part of a candidate key, it's not in 3NF.

To help you get started though, I can say that the first table has Department as its primary key and isn't normalised to start with. I'm trying to figure out the second table, but it's hard to tell whether correlation of data actually implies dependencies or just coincidence, and thus what the candidate keys may be (my own database course is doing these insufficiently varied tables as well, much to my great annoyance). Is there any other information which comes with the assignment?

Manga Shoggoth
2014-03-02, 03:29 PM
Puts on DBA hat...

Looking at the two tables, they both appear to be a combination of four entities, and how the attributes are split across the entities will vary depending on what you are actually trying to achieve.

(At least, I would have split each of them into at least four entities - the person who devised the questions may have different ideas).

The first table consists of Department, Product [Description], Aisle Number, Price and Unit of Measure. It is obvious that Department needs to come into its own entity, and it is possible that at least one other item should, and it is not clear if you need to use any intersection entities in either case.

The second table consists of Supplier, Product [Description], Cost, Markup, Price and Dept Code. Again, it is immediately obvious that Supplier needs to be split out, but there are a number of ways you could normalise the rest of the data, depending on whether the Product is tightly coupled with Supplier. (There is also a repeating group in Product, but that doesn't really affect how you would normalise the table, it just requires splitting a record).

ChristianSt
2014-03-04, 05:42 AM
Teddy has presented what each of the NF's basically means. For more there is also a Wikipedia articles on Normalization in general (http://en.wikipedia.org/wiki/Database_normalization), 1NF (http://en.wikipedia.org/wiki/First_normal_form), 2NF (http://en.wikipedia.org/wiki/Second_normal_form) and 3NF (http://en.wikipedia.org/wiki/Third_normal_form).

For your table that basically means:


1NF: The first table clearly violates 1NF, because of the Department column. In 1NF basically each cell of your table needs to have a single proper value.
For the second table it depends on the type of the supplier attribute. You could argue that it is a valid single-valued attribute. You could also split it in two attributes (e.g. Supplier Id and Supplier Name). There is also the entry "tomatoes, plum" which most likely should also be but in separate cells. (And produces a second tuple/row with "Very Veggie". Though it could also mean that tomatoes and plums form a single product which seems kinda stupid to me.)
2NF: Since each table supplies only candidate keys with a single attribute ("Product" in each case), this doesn't apply. [It is easy to show that there isn't another candidate key in both tables: "Product" is clearly a candidate key, because it is a single attribute with no repetition. If you take all other attributes besides "Product" in any of the tables, there isn't enough information left to identify all tuples (i.e. I can give you a combination of all those attributes, and you aren't able to say which Product I mean)]
3NF: This is where the real problem starts: The table have multiple transitive dependencies which result in redundant data.
The first table features "Product" -> "Department" -> "Aisle Number".
The second one features "Product" -> "Supplier" -> "Dept Code".
Basically the solution is in both cases to split the table in two tables (so Product + Department/Supplier).
The second table could also be problematic in regards to "Cost"/"Markup"/"Price": Normally I would assume that there is a dependencies between two of them and the third (and from the values inside the table there is. The given table even suggest a dependency from "Cost" or "Price" to the other two). To solve this you need another table for the dependent entries. [Though depending on the dependency you could argue that one of them is a derived attribute, though from the numbers in the table there isn't a simple formula at work, so I wouldn't feel good with that on the given problem.] With more values it could be clear that there aren't any dependencies between them, but under the assumption that you have a representative extension there are.


@Manga Shoggoth:
I would be interested in what entities you see in both tables. As presented there is imo clearly two (and only two) entities in each table (Product and Department in the first, Product and Supplier in the second). [Though Dept Code from the second table most likely represent a foreign key to an entity, which from the given scenario is most likely Department from the first table]

Manga Shoggoth
2014-03-04, 09:03 AM
@Manga Shoggoth:
I would be interested in what entities you see in both tables. As presented there is imo clearly two (and only two) entities in each table (Product and Department in the first, Product and Supplier in the second). [Though Dept Code from the second table most likely represent a foreign key to an entity, which from the given scenario is most likely Department from the first table]

I would be delighted to answer, but unfortunately I'm at work at the moment (photobucket is on the restricted list...) - I'll post a full reply tonight if I get a chance...

Teddy
2014-03-04, 11:04 AM
1NF: The first table clearly violates 1NF, because of the Department column. In 1NF basically each cell of your table needs to have a single proper value.
For the second table it depends on the type of the supplier attribute. You could argue that it is a valid single-valued attribute. You could also split it in two attributes (e.g. Supplier Id and Supplier Name). There is also the entry "tomatoes, plum" which most likely should also be but in separate cells. (And produces a second tuple/row with "Very Veggie". Though it could also mean that tomatoes and plums form a single product which seems kinda stupid to me.)

Good call on the atomicity of the Supplier attribute. I didn't want to press the subject of atomicity too hard, because it's very tricky to decide where the line for an atomic value should be drawn. Mystic Muse should probably try and recall what has been said about atomicity earlier in her course to decide whether it's needed or not, but if in doubt, split (and create a new table).

Also, "tomatoes, plum" probably referes to plum tomatoes (http://en.wikipedia.org/wiki/Plum_tomato), and not tomatoes and plums. One could argue that "boneless chicken breasts" should be called "chicken breasts, boneless" for simplified indexing, but that's their problem, not ours...


2NF: Since each table supplies only candidate keys with a single attribute ("Product" in each case), this doesn't apply. [It is easy to show that there isn't another candidate key in both tables: "Product" is clearly a candidate key, because it is a single attribute with no repetition. If you take all other attributes besides "Product" in any of the tables, there isn't enough information left to identify all tuples (i.e. I can give you a combination of all those attributes, and you aren't able to say which Product I mean)]

This holds true for these tables right now, yes, but it's not very reasonable to assume a grocery store will only ever use one single supplier for any given product. Then again, if there haven't been any directives to think outside of the box, it's probably better to just think of the tables as an isolated occurence when normalising...


3NF: This is where the real problem starts: The table have multiple transitive dependencies which result in redundant data.
The first table features "Product" -> "Department" -> "Aisle Number".
The second one features "Product" -> "Supplier" -> "Dept Code".
Basically the solution is in both cases to split the table in two tables (so Product + Department/Supplier).
The second table could also be problematic in regards to "Cost"/"Markup"/"Price": Normally I would assume that there is a dependencies between two of them and the third (and from the values inside the table there is. The given table even suggest a dependency from "Cost" or "Price" to the other two). To solve this you need another table for the dependent entries. [Though depending on the dependency you could argue that one of them is a derived attribute, though from the numbers in the table there isn't a simple formula at work, so I wouldn't feel good with that on the given problem.] With more values it could be clear that there aren't any dependencies between them, but under the assumption that you have a representative extension there are.


Once again, I wouldn't say it's reasonable in a real world perspective to assume that a supplier can't supply more than one department, but yes, my point above stands.

ChristianSt
2014-03-04, 04:23 PM
Good call on the atomicity of the Supplier attribute. I didn't want to press the subject of atomicity too hard, because it's very tricky to decide where the line for an atomic value should be drawn. Mystic Muse should probably try and recall what has been said about atomicity earlier in her course to decide whether it's needed or not, but if in doubt, split (and create a new table).

Also, "tomatoes, plum" probably referes to plum tomatoes (http://en.wikipedia.org/wiki/Plum_tomato), and not tomatoes and plums. One could argue that "boneless chicken breasts" should be called "chicken breasts, boneless" for simplified indexing, but that's their problem, not ours...



This holds true for these tables right now, yes, but it's not very reasonable to assume a grocery store will only ever use one single supplier for any given product. Then again, if there haven't been any directives to think outside of the box, it's probably better to just think of the tables as an isolated occurence when normalising...



Once again, I wouldn't say it's reasonable in a real world perspective to assume that a supplier can't supply more than one department, but yes, my point above stands.

You're totally right.

But I assumed to judge this information as complete, representative extensions. Otherwise it is basically impossible to tell what "the solution" is, because it is possible to argue for all kind of different scenarios.

Even the assumption that the product is a valid candidate key is possible wrong. The first thing I would probably add is a field ProductId to solve this.

Also without further information it is just impossible to judge how these enteties relate to each other.


(To the plum tomatoes: that is quite likely. But since I never heard that term, I haven't thought of that.)

Manga Shoggoth
2014-03-04, 04:34 PM
OK. Here goes:


Things to split out:

Department, obviously.
It is not unusual for things to be placed in multiple locations (particularly for promotions), so there is potentially an intersection between Aisle and Product.


So, we could have:


Department ---< Product >--- Location


or:


Department ---< Product ---< Product Location >--- Location (aka Aisle)




Things to split out:

Supplier, obviously.
There may be more than one supplier per product (and I would be amazed if there wasn't in a real-life situation).
If so, the cost would probably be different between suppliers and the actual price and markup may or may not be different between suppliers.

So, we could have:


Supplier ---< Product >--- Department


or:


Supplier ---< Product Supplied >--- Product >---- Department


With the Cost, Markup and Price being on either Product or Product Supplied as required. My best guess would be Price against Product, and Markup and Cost against Product Supplied.



I freely admit that I may be over-thinking this, but I have seen plenty of supposedly simple ideas requiring complex data models. And customer requirements that haven't been thought through...

...And also I am used to dealing with complex situations. For the same reason I try not to help my daughter with her homework as she generally requires a simpler answer than I would normally supply.

Teddy
2014-03-04, 04:51 PM
You're totally right.

But I assumed to judge this information as complete, representative extensions. Otherwise it is basically impossible to tell what "the solution" is, because it is possible to argue for all kind of different scenarios.

Even the assumption that the product is a valid candidate key is possible wrong. The first thing I would probably add is a field ProductId to solve this.

Also without further information it is just impossible to judge how these enteties relate to each other.

Yeah, if the assignment doesn't require you to motivate your choices, that's probably the far better solution. And I agree that adding a ProductId would be the best idea, even Product + Supplier is unrealistic since the same supplier may provide different package sizes...


(To the plum tomatoes: that is quite likely. But since I never heard that term, I haven't thought of that.)

An easy mistake to make. They're sold in our grocery store, which is why I know...

ChristianSt
2014-03-04, 05:02 PM
OK. Here goes:


Things to split out:

Department, obviously.
It is not unusual for things to be placed in multiple locations (particularly for promotions), so there is potentially an intersection between Aisle and Product.


So, we could have:


Department ---< Product >--- Location


or:


Department ---< Product ---< Product Location >--- Location (aka Aisle)




Things to split out:

Supplier, obviously.
There may be more than one supplier per product (and I would be amazed if there wasn't in a real-life situation).
If so, the cost would probably be different between suppliers and the actual price and markup may or may not be different between suppliers.

So, we could have:


Supplier ---< Product >--- Department


or:


Supplier ---< Product Supplied >--- Product >---- Department


With the Cost, Markup and Price being on either Product or Product Supplied as required. My best guess would be Price against Product, and Markup and Cost against Product Supplied.



I freely admit that I may be over-thinking this, but I have seen plenty of supposedly simple ideas requiring complex data models. And customer requirements that haven't been thought through...

...And also I am used to dealing with complex situations. For the same reason I try not to help my daughter with her homework as she generally requires a simpler answer than I would normally supply.

I can see your reasoning behind it, but I think you make assumptions that cannot be verified. For example there is no information wheter aisle and product directly relate. Or how department is used in the second table.

Also I think there isn't enough information to treat location and department in the second table as entities. It also depends on what detail-level is needed (individual physical products? Only different product classes?)


But certainly if you would make a usable model it would be much more complicated. But imo the OP doesn't want a more realistic model, but only a database normalization.

Manga Shoggoth
2014-03-05, 02:16 PM
But imo the OP doesn't want a more realistic model, but only a database normalization.

Very probably, I imagine.

It will be interesting if she posts the actual answers when she gets them.

EDIT:
It appears that the worksheet is posted here with the solution (http://andyoppel.com/X4025/lectures/Class_01_Exercise_Normalization_1_Solution_2.pdf) (PDF File). One for you, then...