Data Integrity: Best Practices for Clean Data

I’ve had many discussions around data challenges with technical consultants and engineers, and have learned a few things along the way.

Here are a few takeaways that will help you identify, visualize and fix data anomalies quickly and efficiently:

Data Integrity Defined
If you want to learn about data integrity best practices, it helps to know exactly what we’re talking about here. According to BusinessDictionary, data integrity is defined as:

“The accuracy and consistency of stored data, indicated by an absence of any alteration in data between two updates of a data record. Data integrity is imposed within a database at its design stage through the use of standard rules and procedures, and is maintained through the use of error checking and validation routines.”

For data to be useful, it has to meet a number of expectations:

  • Complete: All data necessary to meet current and future business demand are available in the data source – no omissions or holes.
  • Accurate: All data must be an accurate representation of its source – it should consist of the right value, and that value should be represented correctly in the business model and architecture.
  • Consistent: Keep all information uniform as it moves across a network and between applications. Consistency, in the context of databases, means that data cannot be written in a way that would violate the database’s rules for valid data. If a transaction attempts to introduce inconsistent data, the entire transaction is rolled back and an error is returned to the user.
  • Point in Time: Reference pieces of data via timestamps and other markers, so you can restore a crashed system to the point in time where all data was in place. Amazon, for example, processes transactions continuously, 24 hours a day. Its backup is a snapshot of a single point in time for all systems. If a crash occurs, engineers must restore the system to that snapshot in time.
  • Transaction Consistency: A transaction is a logical unit of work that may contain a number of file or database updates. During normal processing, transaction consistency is present only at certain points: before any transactions have run, after a transaction completes successfully and before the next one begins, and when the application ends normally or the database is closed.
    • “A good example of the importance of transaction consistency is a database that handles the transfer of money. Suppose a money transfer requires two operations: writing a debit in one place, and a credit in another. If the system crashes or shuts down when one operation has completed but the other has not, and there is nothing in place to correct this, the system can be said to lack transaction consistency. With a money transfer, it is desirable that either the entire transaction completes, or none of it completes. Both of these scenarios keep the balance in check.” (A code sketch of this pattern follows this list.)
  • Application Consistency: Application consistency is merely transaction consistency between programs.
    • “For example, if the banking program communicates with a tax program on the computer, application consistency means that the information moving between the programs will remain in its original state. Without application consistency, the same problems arise here as do under flawed transaction consistency: there will be no way to tell whether a value entered into the system remains correct over time.”
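
To make the money-transfer example above concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table, account names and amounts are illustrative rather than taken from this article; the point is that both updates commit or roll back together.

```python
import sqlite3

# Illustrative schema and balances -- not a prescribed design.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one atomic unit of work."""
    # 'with conn' opens a transaction: both UPDATEs commit together, and
    # if either fails the whole transaction is rolled back, so the database
    # never shows a debit without its matching credit.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, "alice", "bob", 25)
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# [('alice', 75), ('bob', 75)]
```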

The primary advantage of ensuring data consistency is maintaining the integrity of the information stored across the network. Without it, you can’t tell whether the data stored on a computer today will be the same following a crash, installation or other major system event – which is why all three types of consistency must work together.

Business and Data Integrity
Businesses rely on data for all aspects of their operations – financials, compliance, quality, profitability and productivity. Their success relies on accurately entered and maintained data. Data integrity is also important because it affects the data reported to customers, regulators and other businesses – such as airlines reporting to the FAA, or your tax returns to the IRS.

Accurate, reliable and timely information is vital to effective decision-making in almost every aspect of human and business behavior. At the state and local level, for example, most state and city websites offer an open data portal, and we use this information to hold our elected officials accountable for representing us as we expect them to.

As employees, we want our corporate leaders making sound decisions to ensure we continue to prosper. As consumers, we expect that the products and services we purchase are of the highest quality standards, and that our private information is held securely. Accurate, reliable and consistent data is imperative for these expectations to be achieved.

With today’s technological advances, businesses and consumers are being inundated with more data, from more sources, than ever before. In the business sector it is becoming increasingly common – some might even say essential – to extract and analyze minutely detailed information about customer behavior.

Using this information to determine the profitability of individual products or services, points of sale, business units or even individual employees – in pursuit of goals such as market share, productivity or profit – has also become the norm.

Data Integrity Best Practices
Data integrity isn’t a one-size-fits-all practice. There are both right and wrong ways to ensure your data is reliable. Here are a few best practices for you to consider:

  • Data Encryption: Guarantee your data’s safety by encrypting it. Encrypted or encoded information is known as ciphertext: plaintext transformed into a form that is incomprehensible to a human or computer without the appropriate cipher to decrypt it. (A minimal encryption sketch follows this list.)
  • Data Backup: Storing your data in a location separate from the original is essential. In the event of a system crash, this backup can restore databases and apps to the point in time where all your data was complete. (A snapshot sketch follows this list.)
  • Access Controls: Assign read/write privileges, and use locks and strong login credentials, to reduce the risk of human error.
  • Input Validation: Verify that user-supplied content contains only expected information – for example, serve PDF documents whose content is static and in which only designated fields are editable.
  • Data Validation: Ensure that a program operates on clean, correct and useful data, using routines called “validation rules” that check the correctness, meaningfulness and security of data input into the system. (A small validation sketch follows this list.)
  • Cleansing: Standardize input values in a column (Jan vs. January).
  • Aggregating: Sort and sum.
  • Filtering: Group specific interests (sort customers by state, for example).
  • Merging: Combine and consolidate data that’s scattered in multiple datasets.
  • Appending: Stack datasets with the same or similar fields. Instead of tracking daily sales in separate datasets, for example, append them into one larger dataset.
  • Deduping: Remove duplicate data.
  • Transforming: Use column functions that produce a new outcome (for example, if you’re measuring students’ SAT scores but some have ACT scores, apply a transformation to the ACT column to approximate SAT scores). A sketch of several of these cleanup operations follows this list.
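
To illustrate the encryption bullet, here is a minimal sketch using the Fernet recipe from the third-party cryptography package – one common choice, not the only one; the sample plaintext is made up. Without the key, the stored ciphertext is unreadable.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Generate a secret key. In practice, store it securely and separately
# from the data it protects.
key = Fernet.generate_key()
cipher = Fernet(key)

plaintext = b"customer_id=1041,balance=75"      # invented sample record
ciphertext = cipher.encrypt(plaintext)          # unreadable without the key
assert cipher.decrypt(ciphertext) == plaintext  # round-trips with the key
```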
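
For the backup bullet, a point-in-time snapshot might look like the following sketch, which uses sqlite3’s built-in online backup API. The file names are illustrative, and a real backup would be written to an alternate location on separate storage.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative file names -- not from the article.
src = sqlite3.connect("app.db")
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
dst = sqlite3.connect(f"app-backup-{stamp}.db")

# Copies a consistent point-in-time snapshot, even if app.db is in use.
src.backup(dst)

dst.close()
src.close()
```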
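
The two validation bullets lend themselves to a short illustration. Below is a minimal sketch of “validation rules” in plain Python; the field names, expected values and score range are hypothetical, chosen only for demonstration.

```python
import re

# Hypothetical validation rules: each checks one field of an incoming record.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "state": lambda v: v in {"MI", "OH", "IN", "IL"},     # expected values only
    "sat_score": lambda v: isinstance(v, int) and 400 <= v <= 1600,
}

def validate(record):
    """Return the fields that violate a rule; an empty list means clean input."""
    return [field for field, rule in RULES.items()
            if field not in record or not rule(record[field])]

print(validate({"email": "pat@example.com", "state": "MI", "sat_score": 1200}))
# []
print(validate({"email": "not-an-email", "state": "TX", "sat_score": 9999}))
# ['email', 'state', 'sat_score']
```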
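
Several of the cleanup operations above map to one-liners in a dataframe library. Here is a sketch using pandas; the sample data is invented, and the ACT-to-SAT conversion is a rough stand-in rather than an official concordance.

```python
import pandas as pd

# Invented sample data for illustration.
df = pd.DataFrame({
    "month": ["Jan", "January", "Feb", "Feb"],
    "state": ["MI", "MI", "OH", "OH"],
    "sales": [100, 100, 250, 250],
})

# Cleansing: standardize input values in a column (Jan vs. January).
df["month"] = df["month"].replace({"January": "Jan"})

# Deduping: remove duplicate rows (rows 0 and 1 are now identical).
df = df.drop_duplicates()

# Filtering: group specific interests (customers in one state, for example).
michigan = df[df["state"] == "MI"]

# Aggregating: sort and sum.
totals = df.groupby("state")["sales"].sum().sort_values(ascending=False)

# Appending: stack datasets with the same fields.
combined = pd.concat([df, df], ignore_index=True)

# Transforming: derive a new column (hypothetical ACT-to-SAT approximation).
scores = pd.DataFrame({"act": [24, 30]})
scores["approx_sat"] = scores["act"] * 45  # rough stand-in, not official

print(totals)
```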

Hopefully, you’re already using some or all of these best practices with your data. If not, our iDashboards Data Integrator helps automate this cleanup process – giving you the ability to consolidate data from multiple locations in diverse formats, and transform it into a useful data set without the need for programming.

What other best practices do you follow in your organization?
