The purpose and importance of a data dictionary

As per the state of data 2018 report, “The estimated global annual spend on data initiatives by companies in 2018 was $114 billion”. Despite significant investments in data lakes, most organizations don’t have an easy way for humans to discover, access and share data. Collecting vast amounts of data is useless if you can’t interpret or analyze it.

Usually, the database administrator or engineer handles transforming and storing this data in warehouses or databases for further analysis. Now imagine if this person were to suddenly disappear tomorrow. Is there documentation somewhere that will explain everything that you need to know to take over the reins?

If you have a data dictionary in place, this won’t be a problem. A data dictionary can help team members learn everything about a data set.

But this isn’t the only reason that you should care about a data dictionary.

Here are the four biggest benefits of a modern data dictionary:

  1. Detect anomalies quickly
  2. Evaluate data quality
  3. Get more trustworthy data
  4. Build transparency within data teams

What is a data dictionary?

Often, the humans of data (aka folks like you) spend an insane amount of time figuring out what data means and whether or not it’s credible. As per HBR, “80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis”.

A data dictionary describes the data stored in a database. In simple terms, it provides information and insights about your database. In other words, a data dictionary is documentation for all the data assets in a database.

Traditional data dictionaries usually only make sense to engineering, operations or IT, leaving business people in the dark.

A traditional data dictionary cannot solve this problem. Then what will? A modern data dictionary. It is a repository of all column descriptions, along with metrics that describe the characteristics of each column, such as mean, median, and missing values.
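As a rough illustration of column descriptions paired with metrics, here is a minimal Python sketch. The column names, descriptions, and sample rows are made up for the example, and `profile_column` is a hypothetical helper, not part of any particular tool.

```python
from statistics import mean, median

def profile_column(values):
    """Summarize one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present) if present else None,
        "median": median(present) if present else None,
    }

# Hypothetical column descriptions and sample data, for illustration only.
descriptions = {
    "age": "Customer age in years",
    "order_total": "Order value in USD",
}
rows = [
    {"age": 34, "order_total": 20.0},
    {"age": None, "order_total": 35.5},
    {"age": 40, "order_total": 12.5},
]

# One entry per column: the human-written description plus computed metrics.
data_dictionary = {
    column: {"description": text, **profile_column([row[column] for row in rows])}
    for column, text in descriptions.items()
}

for column, entry in data_dictionary.items():
    print(column, entry)
```

Keeping the description and the metrics in the same entry is what lets a business reader see both what a column means and how complete it is.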

How to create a data dictionary

To create a data dictionary, you should answer these six questions:

  1. What does each variable/element/field/attribute within a data set mean? What is it describing?
  2. How did you collect each variable? How did you measure it?
  3. If there are numeric values, are these values raw or are they calculated using a formula?
  4. What are the tests or checks you need to run to determine whether your data is trustworthy?
  5. Who collected your data? Are they still the owners or is it somebody else? Who has interacted with your data, and what are the changes that they made? Who oversees the changes made to your data?
  6. How can you reach out to the owners, admins, and editors of your data?
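One way to keep track of the answers is to store them in a small record per field. The sketch below is only an assumption about how such a record could look: the `FieldEntry` class, its attribute names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FieldEntry:
    name: str
    meaning: str = ""                      # question 1: what the field describes
    collection_method: str = ""            # question 2: how it was collected or measured
    is_calculated: bool = False            # question 3: raw value or formula
    formula: Optional[str] = None          # question 3: the formula, if calculated
    quality_checks: List[str] = field(default_factory=list)  # question 4
    owner: str = ""                        # question 5: who owns the data now
    change_log: List[str] = field(default_factory=list)      # question 5: who changed what
    contact: str = ""                      # question 6: how to reach the owner

    def unanswered(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        gaps = []
        if not self.meaning:
            gaps.append("meaning")
        if not self.collection_method:
            gaps.append("collection_method")
        if self.is_calculated and not self.formula:
            gaps.append("formula")
        if not self.quality_checks:
            gaps.append("quality_checks")
        if not self.owner:
            gaps.append("owner")
        if not self.contact:
            gaps.append("contact")
        return gaps

# Hypothetical entry with some questions still open.
entry = FieldEntry(
    name="order_total",
    meaning="Order value in USD, including tax",
    collection_method="Computed by the checkout service",
    is_calculated=True,
    owner="Data engineering",
)
print(entry.unanswered())
```

A check like `unanswered()` makes the gaps visible while the data is still being modeled, when the answers are easiest to find.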

You might notice that it’s harder to find these answers once your data’s already modeled, prepped, and being actively used for analysis.

That’s why it’s a best practice to start building a data dictionary right when you’re modeling your data—it makes it a lot easier to define what each variable stands for, how it is being measured or calculated, who can make changes, and who is responsible for monitoring the changes made.

If you are looking to build a modern data dictionary, consider going beyond a standard data dictionary: Dataedo is modern data catalog software built on the premise of embedded collaboration, which is key in today’s modern workplace.

Examples of what is included in a data dictionary:

A data dictionary should become the go-to tool for the humans of data in your organization to understand everything about a data set and check data quality at a glance. It will have information such as:

  • Referential constraints — foreign keys and primary keys
  • Date and time when the property was created or changed
  • Data profiling with descriptive statistics — missing values, min-max values, and histogram distribution
  • Owners and editors of data sets that contain these variables.
  • Social metadata associated with each data asset, stored as tags, notes, and chat transcripts
  • Auto-classification of PII and other sensitive data assets
  • Table names and descriptions
  • Table relationships
  • Number of columns, column name, and descriptions
  • Permissible values and validation rules for a field
  • Data types
  • Column nullability

Most importantly, a data dictionary should be right next to your data table with all information easily accessible.
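Several of the structural items listed above (column names, data types, nullability, primary keys, and foreign keys) can usually be read straight from the database catalog. The sketch below assumes SQLite and uses its `PRAGMA table_info` and `PRAGMA foreign_key_list` statements through Python's built-in `sqlite3` module. The `customers` and `orders` tables and the `describe_table` helper are hypothetical, and other databases expose the same information differently.

```python
import sqlite3

# Hypothetical schema in an in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        note TEXT
    );
""")

def describe_table(conn, table):
    """Collect structural metadata for one table (table name must be trusted)."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = [
        {
            "name": name,
            "type": col_type,
            "not_null": bool(notnull),   # NOT NULL as declared in the schema
            "primary_key": bool(pk),
        }
        for _cid, name, col_type, notnull, _default, pk
        in conn.execute(f"PRAGMA table_info({table})")
    ]
    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
    foreign_keys = [
        {"column": row[3], "references": f"{row[2]}.{row[4]}"}
        for row in conn.execute(f"PRAGMA foreign_key_list({table})")
    ]
    return {"table": table, "columns": columns, "foreign_keys": foreign_keys}

orders = describe_table(conn, "orders")
for column in orders["columns"]:
    print(column)
print(orders["foreign_keys"])
```

Descriptions, owners, and validation rules are not stored in the catalog, so they still have to be written by people and attached to this extracted structure.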

Detect anomalies quickly

Identifying anomalies in data or missing data is easier with a dictionary since it displays the results of data checks such as minimum and maximum values or the count of distinct values. Spot duplicate, inaccurate or questionable data at a glance.
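The checks mentioned here can be sketched in a few lines of Python. This is a minimal example under stated assumptions: `check_column` is a hypothetical helper, the sample ages are invented, and the permitted range of 0 to 120 is an assumed validation rule.

```python
from collections import Counter

def check_column(values, minimum=None, maximum=None):
    """Run basic data checks on one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "distinct": len(counts),
        # Values that occur more than once.
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
        # Values outside the permitted range, if a range was given.
        "out_of_range": [
            v for v in present
            if (minimum is not None and v < minimum)
            or (maximum is not None and v > maximum)
        ],
    }

# Hypothetical sample: one missing value, one repeat, two implausible ages.
ages = [34, 29, None, 214, 29, -1]
report = check_column(ages, minimum=0, maximum=120)
print(report)
```

Whether a repeated value is a problem depends on the column: a repeated age is normal, while a repeated identifier usually is not.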

Evaluate data quality

Data dictionaries make it easier to create a standard set of variable names and descriptions across an organization. This helps you automatically understand the quality of your data and makes data analysis quicker and easier. Quickly evaluate data quality and speed up your analysis!
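A standard set of variable names can also be checked mechanically. The sketch below is illustrative only: the standard names, the incoming column list, and the `compare_to_standard` helper are all hypothetical.

```python
# Hypothetical standard names agreed on across the organization.
standard_names = {"customer_id", "order_total", "created_at"}

def compare_to_standard(columns, standard):
    """Report columns that are not in the standard set, and standard names not found."""
    columns = set(columns)
    return {
        "unknown": sorted(columns - standard),
        "missing": sorted(standard - columns),
    }

# A data set that spells one column differently from the standard.
result = compare_to_standard(["customer_id", "OrderTotal", "created_at"], standard_names)
print(result)
```

A mismatch such as `OrderTotal` versus `order_total` is exactly the kind of inconsistency that slows down analysis when two teams combine data sets.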

Get more trustworthy data

With all of the information about a data set (sources, owners, descriptions, discussions, etc.) recorded in one place, data becomes more reliable. Now you can truly say, “In data we trust!”

Build transparency within data teams

When the entire organization understands what every detail within a data set means, it gets everyone on the same page, reduces dependencies, helps everyone use the data in the same way, and makes onboarding a breeze.

Well, now that you know how handy a data dictionary can be, let’s see how to create one.