Talk:Data vault modeling
This article is rated Start-class on Wikipedia's content assessment scale. |
Okay - edited it a bit, but I still see a few issues. Will update it this afternoon. RonaldKunenborg (talk) 09:48, 9 July 2009 (UTC)
- I'm done for now. Added several references, lots of context and some explanation. Removed all the flags for lack of context and lack of notability. Moved the marketing blurb to a quote (since it was written by Dan Linstedt on wikipedia it's a bit hard to refer to it :)) and added some background and general information. I may add more once I read up a bit more on Data Vault. I haven't addressed Point-In-Time tables, for instance.RonaldKunenborg (talk) 12:52, 9 July 2009 (UTC)
- Removed the tag for the introduction length (yes, it was a bit too long) and cleaned up the article, adding a few references here and there. More work needs to be done in that area so I leave the tag for that item where it is. RonaldKunenborg (talk) 15:57, 6 December 2009 (UTC)
- I've changed the article a bit more, cleaning up and adding references (as requested). Also added a section on reference tables (how appropriate :)). I will try and incorporate the request by Jomsviking sometime this week. And add some pictures and other resources. RonaldKunenborg (talk) 22:31, 13 September 2011 (UTC)
Can somebody Explain
editCan somebody explain what you can do with a Data vault model, but can't do with f.ex "dimensional modelling" (or why it's advantageous to use data vault over dimensional modelling, some cases)??Jomsviking (talk) 14:54, 7 January 2011 (UTC)
- Hi ! :-)
- Although I know it's been 11 years since your question, but...
- ...your question is fundamental, and therefore conceptual. I will try to give you a sense, in the most ordinary way possible, of the train of thought that will help you to understand the real content of your own question :-)
- The DV is an answer to how to create a "Business Change Resilient" data storage structure.
- This is interesting because the optimal (i.e. sufficiently accurate and fast) way to create large amounts and multiple layers of interrelated data requires different data structures than the analytical retrieval of the same data. On today's 'average' computers, the efficiency of data processing is not only based on the 'speed' of the built-in devices, such as the processor, or memory, or storage devices, but also on a data layout, called data structuring. It is the combination of devices and data organisation that allows efficiency to be achieved. In other words, to put it simply, the purpose for which the data is managed determines the structure in which it is stored. If the same data have to be managed in different ways, "as many databases" are needed as there are different ways of managing the data. Therefore, in a company there are "source systems" that handle data in a certain way (such as creating and searching for a specific data), like HR or CRM systems, and as soon as the purpose is to combine and analyse different data, a new layout is needed, so "Data Warehouse", "Data Lake", etc. systems are created. These systems are therefore fundamentally different in the way they organise data.
- Each "system", whether it is a data processing system or some mechanical system, has basically many characteristics that affect its most important property, its performance. Hence, there are many solutions to the same problem, based on different values of the characteristics considered, and of course, because of this, they give different performance.
- Data warehouse systems also differ, depending on which characteristics they focus on. But basically, the shortening of data processing time is their most important value metric, in addition of course to accuracy and reliability.
- The currently known 3 major Data Warehouse architectures ("layouts", such as Kimball, Inmon, Linstedt) or concepts have different performance due to their different concepts, but the choice between them is not a choice between "good" and "bad", but a conscious (considered) decision to achieve the chosen goal.
- When you look at one of the above Data Warehouse solutions, you will see that they all use many "layers" to store data. With the basic idea I have explained, it should be clear to you that these layers are different arrangements of the same data to best suit the data management goals you need at the time. The goal is to transform the initial ("received") arrangement of data for non-analytical purposes into an arrangement that is appropriate for data analysis purposes, through one or more steps ("rearrangement").
- The basic difference between the commonly used concepts is, to put it very simply, but keeping the essence in mind :
- Kimball : 2 steps are sufficient to make the data analysable, and as many independent analysis nodes (data sets) can be constructed as needed. If one wishes, then - to create consistency of common features (data categories) - these island-like data nodes can be connected along common features (data). The analysis nodes, data sets are called : Data Marts. These are the "final" results that Data Warehouses "produce".
- (*) The main problem with the structure and solution used is that it is sensitive to changes in the data structure in the source data, and to the way they are passed through the various structure transformation layers ("steps"). The main reason for this is that the Business Rules, which ensure the reliability and compliance of the data, are already applied in the first layer, so changing the rules is a lot of work due to the dependencies between each further layer.
- Inmon : 3 steps are needed in any case, because we also need a structure that combines data from different source systems, covering only one specific area, into a "big, unified map" that shows the whole (say large Enterprises) context (relation between isolated, functionaly oriented source data). The end result is the same here : a series of Data Marts.
- The main problem with the structure and solution used is : see (*) section under Kimball's concept description.
- Linstedt : 3 steps are needed in any case, but in fact as many as needed. Indeed, the point is that it uses a data structure that - with the exception of the Data Mart layer - has a flexible layout in the other layers that is almost completely resistant to rapid business changes, which of course mean data structure changes at the source systems level, which then have to be driven through all the layers of the Data Warehouses. Data Vault promises to do this with the least "pain" (i.e. cost and effort). Besides the change tolerant structure, the enforcement of Business Rules is part of the creation (data upload) of the Data Mart layer, so compared to the other two concepts, changing the rules is only a single step change. The consequence of it's universal structure and postponed Business Rule impact is improved auditability and a more agile data management process.
- Your question is a mix of one about the overall architecture ("Why Data Vault ?") and one about a specific model part ("dimensional modeling").
- Well, I hope that after my longer answer, "Why Data Vault ?" has become more understandable.
- Your model-detail question refers to a "stand-alone" domain with a relatively stable structure, used in the same way by all three Data Warehouse concepts, and is Kimball's basic idea. This is the Data Mart part of a Data Warehouse, and one of the base logical model behind is "Dimensional Modeling". Data Marts are common final output for every different concept. Because of this, your entire question cannot be interpreted clearly, as you seem to be missing some basic information, which I hope I have managed to fill in.
- Regards,
- Tamás 193.245.81.8 (talk) 11:31, 28 April 2022 (UTC)
Hi - sorry, this took ages and I forgot about your comment. A small explanation should be put in the article but for now, let;s just say this.
You (should want to) separate the layers of your DWH into at least three layers nowadays, as a best practice. One layer is where the data comes in, modelled to reflect the source. Let's call this the staging area. Then we have one area designed for the users, where data is stored as the users like to receive it, ready for consumption. Let's call this the data mart area.
And then we have the area inbetween, where we have to store the data in such a manner that it will keep for a long time - the storage area. We manage to achieve this long-time stability by modelling the data in such a way that we disconnect keys from attributes, relationships from the keys, and functional areas from each other. We also want to load this area fast, but do not care about query performance all that much. Hence, the Data Vault: highly scalable during loads, disconnects all areas and only connects data over the business keys that are much more stable than the underlying software systems. However, for query performance almost everyone will use a datamart on top of this, with facts and dimensions. The good news is that you can transform a well-designed DataVault into facts and dimensions quite easily, often with views. This means that dropping the datamart before reloading is not a problem and is actually standard practice when rebuilding smaller data warehouses. This makes it very easy to change business rules, dimensions, add columns to facts or change the granularity of a fact table. RonaldKunenborg (talk) 21:44, 26 September 2012 (UTC)
Another point is that it is often better with a 3nf model and a persistent staging area that has good metadata in ETL tool. In some respects data vault has many of the problems with dimenensional but is impossible the query. A 3nf model is almost as easy to load, says more about the data and is possible to query even if performence is worsa than a deimansional model With a modern applienace like Netezza, MS PDW, Pivotal, teradata you often get away with 3nf and some simplifying views. With DV (Data Vault) you MUST have views and since you need to self join tables and they are plentiful the SQL is impossible to generate with query tools or push down SQL generated from ETL tools. The best advice would be to avoid data vault in a EDW architecture except for sotuations where preserving history where sources don't is the primary interest. Abergdahl (talk) 21:22, 10 December 2013 (UTC)
Example inconsistency
editIn the Link example, the surrogate primary key L_DRIVER_ID is optional, but in the satellite example (which refers to the link example), the field L_DRIVER_ID is mandatory as a foreign key to the driver link. This requires resolution/explanation. --Joachim Pense (talk) 11:35, 7 April 2013 (UTC)
Link Example is potentially confusing
editThe link example uses the example of a "Driver" (a person driving a car). Link's in data vault also have the concept of "Driving Key", readers might get confused between the "Driver" example, and the "Driving Key".. — Preceding unsigned comment added by 203.53.49.245 (talk) 06:40, 19 April 2013 (UTC)
- Good point, I'll take a lot at it later. The example needs a touch-up anyway, I think. RonaldKunenborg (talk)
Which Hubs?
editHow do you figure out the Hubs you should have? Is there some kind of technique to make use of? 62.220.160.156 (talk) 21:34, 30 October 2013 (UTC)
- Hi, I'll re-read the article to see if I missed something there but the technique is simple: use the business keys. Like: license plate, invoice number, SSN, etcetera. If a system doesn't have natural business keys, you need to find other keys, perhaps even the surrogate keys. For more information you should really read the references at the end of the article, they explain this in more detail. RonaldKunenborg (talk)
Linstedt point of view
editThe article is written quite much from Linstedt's point of view. To my understanding, he was the original author. However, there are also other experts who have been both training and developing the data vault modeling practices, at least Genesee academy with whom Linstedt was in close collaboration previously. They have developed other data vault aspects on technical side. However, Linstedt's brand Data vault 2.0 also includes new material on modeling processing and higher level architecture point of view. All in all, is a link to Genesee Academy in "External links" enough or should we add something more? Misna (talk) 09:45, 5 February 2020 (UTC)