Home » Automatic data classification for Azure SQL Server

Automatic data classification for Azure SQL Server

It is important, also for regulatory reasons, to know which data is stored and what kind of data it is. It can be financial data, personal data or any other type. For GDPR reasons, its especially important to mark personal data and have a documentation ready, which data is stored and why.
This can be a challenging and annoying task if you have to do this for an existing database with a few hundred tables. Fortunately, there ways to make life easier.

Built-in Data Discovery and Classification

The easiest way is to use the built-in data discovery and classification feature that comes with Azure SQL Databases. Just go to the ‘Advanced data security’ tab and enable the feature:

By doing that, SQL Server will try to automatically detect Columns and Tables that need to be classified and you can accept or decline those changes:

You can also create a report out of it or export it.

SQL Script to classify data

If you don’t want to use that feature, but still need to classify the data, then another way is to write a SQL script which will create a table containing all classified data. The following script can be extended by columns or tables to classify those data:

SELECT schema_name(tab.schema_id) AS schema_name,
    tab.name AS table_name,
    col.name AS column_name,
    t.name AS data_type,
	
    -- start data classification
    CASE 

    -- ignore specific columns in table
    WHEN CONCAT(tab.name, '.', col.name) IN (
    'my_table.ignore_this_column'
    ) 
    -- ignore all columns in these tables
    OR tab.name IN (
    'table1', 'table2'
    ) THEN ''

    -- personal data
    -- match column names
    WHEN col.name IN (
    'first_name', 'middle_name', 'last_name', 'date_of_birth', 'email'    
    ) 
    
    -- match specific columns (table_name.column_name)
    OR CONCAT(tab.name, '.', col.name) IN (
    'user.state', 'table1.column1'
    ) THEN 'personal data' 
    
    -- XXX Data
    -- match column names
    WHEN col.name IN (
    'column1', 'column2'
    ) 

    -- match specific columns (table_name.column_name)
    OR CONCAT(tab.name, '.', col.name) IN (
        'users.display_name'
    ) THEN 'XXX data'

    ELSE 
    '' 
    
    END 
	AS classification 

FROM sys.tables AS tab
    INNER JOIN sys.columns AS col ON tab.object_id = col.object_id
    LEFT JOIN sys.types AS t ON col.user_type_id = t.user_type_id
	
ORDER BY schema_name,
    table_name,
    column_id

The result will be table that looks like:

Explanation of the SQL Script

This SQL Script basically contains at first, a selection of all schemes, tables and columns in the database:

SELECT schema_name(tab.schema_id) AS schema_name,
    tab.name AS table_name,
    col.name AS column_name,
    t.name AS data_type

FROM sys.tables AS tab
    INNER JOIN sys.columns AS col ON tab.object_id = col.object_id
    LEFT JOIN sys.types AS t ON col.user_type_id = t.user_type_id
	
ORDER BY schema_name,
    table_name,
    column_id

This is then extended by a column ‘classification’ and a simple CASE statement to e.g. classify all columns with name first_name or last_name as personal data:

CASE WHEN col.name IN ('first_name', 'last_name') THEN 'personal data' ELSE '' END 
AS classification 

Finally, I added a few more CASEs, to allow to:

-- Do not classify specific column names (e.g.: classify all columns with name 'first_name' as personal data, BUT ignore the column in the table 'dogs')
WHEN CONCAT(tab.name, '.', col.name) IN ( 'dogs.first_name' ) THEN ''
-- Do not classify columns in specific tables (e.g. ignore all columns in table pets)
WHEN tab.name IN ( 'pets' ) THEN ''
-- Both combined in one statement
WHEN CONCAT(tab.name, '.', col.name) IN ( 'dogs.first_name' ) 
OR tab.name IN ( 'pets' ) THEN ''

-- Only match columns in a specific table (column 'display_name' is only personal data in table 'users')
WHEN CONCAT(tab.name, '.', col.name) IN ( 'users.display_name' ) THEN 'personal data'

Additional information

One comment

Leave a Reply

Your email address will not be published. Required fields are marked *