
The increasing importance of data modeling for good data


Data modeling has long been an important foundation for application development. However, it has also been somewhat out of the spotlight for a while. In our opinion, this is partly because, in the context of agile development, recording knowledge in documents is positioned as less important. In this blog we describe the increasing importance of data modeling and its value in the context of data quality. Data quality is also the higher goal of data management as a whole: ensuring that the quality of data matches its use.


A data model is a structure of data, including a definition of the elements in that structure. This structure is important not only for recording data but also for exchanging it. The exchange of data across organizational boundaries will continue to increase. As a result, data also crosses language boundaries, and it becomes ever more important that its meaning is clear in order to prevent misinterpretation. An important aspect of data quality is the extent to which data corresponds to reality. If you do not understand reality properly and do not properly define what the data means, the basis for speaking about quality is missing. The meaning of data, captured in the form of definitions, is therefore the most important reason why data models matter so much for data quality.
Another important reason why data models matter from a data quality perspective is that data models also describe rules. Think of rules that indicate which data is mandatory, which values it can take, or which relationships must exist between data elements. These rules are also called constraints or quality rules, which directly indicates why they are important for data quality. Incidentally, these rules go beyond what you can represent visually in a data model; they can also place constraints on more complex relationships between data. Various (formal) languages are available for this, such as OCL and SHACL. These rules are really an integral part of the data model, but they can also be used directly as data quality checks: you can program them in SQL or configure them in a data quality tool. The result is a data quality report or dashboard that shows all records that do not comply with the rules, so that you can work on continuous improvement of data quality.
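As a minimal sketch of what such a check looks like in SQL, assume a hypothetical customer table in which email is mandatory and a birth date may not lie in the future; the quality rule then becomes a query that returns every violating record for the report or dashboard:

```sql
-- Hypothetical table and columns, used only to illustrate the idea.
-- Quality rule: email is mandatory and birth_date may not lie in the future.
SELECT customer_id,
       email,
       birth_date
FROM   customer
WHERE  email IS NULL
   OR  birth_date > CURRENT_DATE;
```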
If you look at data models from the perspective of data quality and quality rules, it suddenly becomes clear that recording a data model in more detail adds a lot of value, also for existing applications and databases. You can record specific properties that simplify the creation of quality rules. Consider, for example, registering the domains: the values that the data can take. This means, among other things, recording the data type, field length, minimum and maximum values and the format of fields. For the latter, you could use regular expressions to express the format. These expressions can be used directly in quality rules and/or in the queries derived from them, as in the sketch below. If you have also neatly connected the logical data model to the physical data model, you could potentially even generate the quality rules or queries automatically, which considerably accelerates the drafting of quality rules. You can apply these rules at several places in the process: preferably already when data is collected or received, but an important basis is also created if you define them on data that is already stored, for example in your data warehouse. That quickly creates a first insight.
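To illustrate how a recorded format flows directly into a quality query, assume a hypothetical sales_order table whose postal_code attribute has the format `^[1-9][0-9]{3} ?[A-Z]{2}$` (a Dutch postal code) registered in the data model; the regular-expression operator shown below is PostgreSQL syntax and differs per database:

```sql
-- Domain/format check derived from the data model (hypothetical names).
-- The !~ operator is PostgreSQL's "does not match regular expression".
SELECT order_id,
       postal_code
FROM   sales_order
WHERE  postal_code IS NULL
   OR  postal_code !~ '^[1-9][0-9]{3} ?[A-Z]{2}$';
```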
If you look further at data quality, you could also capture more advanced aspects of data quality in the data model. An important frame of reference in that context are the ISO/IEC 25012 and 25024 standards, which describe standard dimensions and indicators for data quality. Think, for example, of accuracy, completeness, currentness, consistency, precision and traceability of data. Many of these quality aspects are defined at the attribute level, so you would preferably record them as a quality requirement in the data model. For a specific attribute you could then indicate, for example, that the precision is 2 digits after the decimal point or that the positional accuracy is half a meter. Some of these types of rules can also be checked automatically.
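As a small, hypothetical example of such an automatic check, assume an invoice table whose amount attribute carries the quality requirement "precision of 2 digits after the decimal point" in the data model and is stored as a DECIMAL; the requirement translates into a query for records that violate it:

```sql
-- Precision check (hypothetical names): amount may have at most
-- 2 digits after the decimal point, assuming a DECIMAL column.
SELECT invoice_id,
       amount
FROM   invoice
WHERE  amount <> ROUND(amount, 2);
```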
Traceability is also an aspect of data quality and is receiving increasing attention. I recently organized a workshop on data quality in which I showed the participants a number of map images and asked what information should have been in the metadata about data quality to determine whether the data is usable. The participants' number one answer was that they would like to know who had created the data, how and when. Financial institutions are also making increasing demands on the traceability of data: data in formal reports must be traceable throughout the entire chain, including the operations it has undergone along the way. This traceability must first be made explicit at the level of the data models. The physical data model must be connected to the logical data model, and that in turn to the conceptual data model; this is also known as linkage. Next, the processing along the chain must be expressed in terms of relationships between attributes in different data models; this is also known as data lineage. This places high demands on the way in which data models are recorded and managed.
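A very simple way to start recording such attribute-level lineage, sketched here with hypothetical model and attribute names, is a mapping table that links each target attribute to its source attribute and the transformation applied in between:

```sql
-- Minimal lineage register (hypothetical names): which source attribute
-- feeds which target attribute, and which operation was applied.
CREATE TABLE data_lineage (
    target_model     VARCHAR(100),
    target_attribute VARCHAR(100),
    source_model     VARCHAR(100),
    source_attribute VARCHAR(100),
    transformation   VARCHAR(255)
);

INSERT INTO data_lineage VALUES
    ('regulatory_report', 'total_exposure',
     'loan_dm',           'outstanding_amount',
     'SUM grouped by counterparty');
```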
