Chapter 11: Data Governance, Catalog, and Access Control

Learning Objectives

Part 1: Why Governance and Catalog Layers

Pre-reading Check — Part 1

1. Why is a data catalog often described as the "connective tissue" between regulation and engineering?

   a. It compresses Parquet files for cheaper storage.
   b. It links inventory, policy, and audit trail so compliance claims are verifiable.
   c. It replaces IAM with a single key-value access list.
   d. It auto-generates SQL queries from regulatory text.

2. What role does the Glue Data Catalog play when Athena runs a query against a table whose data lives in S3?

   a. It executes the SQL plan and returns rows.
   b. It stores the actual Parquet files Athena scans.
   c. It supplies the schema, partition layout, and S3 location so Athena can prune and read.
   d. It signs the user's IAM credentials.

3. Which statement best contrasts technical metadata and business metadata?

   a. Technical metadata is for engines (types, partitions); business metadata is for humans (terms, owners).
   b. Business metadata is stored only in CSV files outside AWS.
   c. Technical metadata always includes glossary terms; business metadata never does.
   d. They are different names for the same thing.

4. Lineage answers two complementary questions. Which pairing is correct?

   a. Provenance (upstream) and Impact analysis (downstream).
   b. Throughput (upstream) and Latency (downstream).
   c. Encryption at rest and Encryption in transit.
   d. Cost reporting and Quota enforcement.

5. Why does an ungoverned data lake tend to balloon in cost?

   a. S3 storage prices rise when no IAM policy is attached.
   b. Engineers re-derive the same metrics in many projects because canonical versions are not discoverable.
   c. Every uncatalogued file is automatically replicated three times more than catalogued files.
   d. Athena charges a 10x premium for queries against unregistered tables.

A modern data platform that no one trusts is worse than no platform at all. Governance is the discipline that turns a pile of well-engineered storage into an enterprise asset. Think of it as the building code for a city of data: anyone can stack bricks, but the building code ensures the wiring is up to spec, the exits are marked, and the inspector can verify both.

Compliance: GDPR, HIPAA, SOC 2

Regulations and audit frameworks such as GDPR, HIPAA, and SOC 2 require that PII, PHI, and financial records be tracked, restricted, and auditable end to end. In AWS this translates to three concrete needs:

  1. A canonical inventory of every table, column, and partition: the AWS Glue Data Catalog plays this role.
  2. A policy engine that decides who can see what: Lake Formation grants and IAM policies, plus row and column filters.
  3. An audit trail showing which principal accessed which resource and when: CloudTrail logs of Lake Formation events.

Key Takeaway

Data Quality, Trust, and Cost

Trust is the currency of analytics. A dashboard whose numbers contradict last week's report drives users back to private spreadsheets. Governance enforces trust by exposing provenance, freshness, and quality metrics. Glue Data Quality scores embedded in Glue Data Catalog tables become a visible attribute consumers see before they query — the data product equivalent of a USDA stamp on a milk carton.

Ungoverned lakes also balloon in cost. Engineers re-derive the same daily_active_users metric in five projects because they cannot find the canonical version. Storage tiers go unmanaged. Query costs spike because partition strategies are invisible. A catalog attacks these costs on two fronts: discovery prevents duplication, and metadata-aware tooling (partition pruning, column projection) prevents wasteful scans.

The Three-Tier Metadata Model

Metadata is data about data. Without it, a Parquet file in S3 is an opaque blob. AWS organizes metadata in three layers:

Layer              | What It Describes                                     | Where It Lives
-------------------|-------------------------------------------------------|----------------------------
Technical metadata | Schema, types, partitions, formats, locations         | Glue Data Catalog
Business metadata  | Glossary terms, owners, descriptions, classifications | DataZone, Glue tags
Lineage metadata   | Source-to-target maps, transformation history         | Glue, DataZone, OpenLineage

Animation: Three-Tier Metadata Flow

Technical, business, and lineage metadata each route to their canonical AWS service before joining at the lineage graph.

[Animation panels: Technical Metadata (schema, types, partitions, formats, S3 locations, serdes) -> Glue Data Catalog; Business Metadata (glossary terms, owners, descriptions, classifications, tags) -> Amazon DataZone; Lineage Metadata (source -> target, transform history, impact graph) -> Glue + DataZone + OpenLineage. All three feed a Unified Lineage view used by Athena, Redshift, and portal search.]

Glue Data Catalog as Hive Metastore

The AWS Glue Data Catalog is a Hive-compatible metastore: a managed registry of databases, tables, columns, types, partitions, and storage locations. Any engine that integrates with it, including Athena, Redshift Spectrum, EMR (Spark, Presto, Hive), and Lake Formation itself, can read it. Register a table once, query it from anywhere.

SELECT activity_name, time_dt, src_endpoint.ip
FROM "amazon_security_lake_glue_db_us_east_1"
     ."amazon_security_lake_table_us_east_1_eks_audit_2_0"
WHERE time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '7' DAY
                  AND CURRENT_TIMESTAMP;

Athena does not know the file layout; it asks Glue. Glue replies "those files are Parquet, partitioned by time_dt, here are the S3 prefixes," and Athena prunes the scan. The catalog also tracks schema evolution — the OCSF version progression from _1_0 to _2_0 shows how the catalog tracks versioned schemas so older partitions can still be queried with their original layout.
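
The lookup-then-prune exchange described above can be sketched with a toy in-memory catalog. This is not the Glue API: the `CATALOG` dictionary, the `example-bucket` prefixes, the `eks_audit` table name, and the `prune_partitions` helper are all hypothetical names chosen for illustration.

```python
from datetime import date, timedelta

# Hypothetical in-memory stand-in for a catalog entry: the file format, the
# partition key, and one S3 prefix per partition value (30 daily partitions).
CATALOG = {
    "eks_audit": {
        "format": "parquet",
        "partition_key": "time_dt",
        "partitions": {
            date(2024, 1, 1) + timedelta(days=i):
                f"s3://example-bucket/eks_audit/time_dt={date(2024, 1, 1) + timedelta(days=i)}/"
            for i in range(30)
        },
    }
}


def prune_partitions(table: str, start: date, end: date) -> list[str]:
    """Return only the S3 prefixes whose partition value lies in [start, end]."""
    partitions = CATALOG[table]["partitions"]
    return [prefix for day, prefix in sorted(partitions.items()) if start <= day <= end]


# A seven-day window touches 7 of the 30 registered prefixes.
prefixes = prune_partitions("eks_audit", date(2024, 1, 24), date(2024, 1, 30))
print(len(prefixes), "of", len(CATALOG["eks_audit"]["partitions"]), "prefixes scanned")
```

The engine never lists the bucket: it reads only the prefixes the catalog returns for the requested date range.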

Business Glossary in DataZone

Technical metadata tells you a column is a VARCHAR(36). It does not tell you the column represents the canonical UUID issued by Identity Service v3, considered PII under GDPR Article 4. That second sentence is business metadata, and Amazon DataZone manages it.

DataZone's business glossary is a hierarchical collection of standardized terms. A term like Customer Identifier can sit under a parent category Customer Domain, link to a metadata form, and be attached to assets, schemas, and individual columns across many Glue tables.
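
One way to picture the glossary hierarchy is as a small tree of terms, each of which can be attached to columns in many tables. The sketch below is a plain-Python model, not the DataZone API; the `GlossaryTerm` class, the `terms_for` helper, and the `orders` table are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str
    parent: "GlossaryTerm | None" = None
    # (table, column) pairs this term is attached to
    attached_to: set[tuple[str, str]] = field(default_factory=set)

    def path(self) -> str:
        """Full hierarchical name, e.g. 'Customer Domain / Customer Identifier'."""
        return self.name if self.parent is None else f"{self.parent.path()} / {self.name}"


customer_domain = GlossaryTerm("Customer Domain")
customer_identifier = GlossaryTerm("Customer Identifier", parent=customer_domain)

# Attach the same term to columns in several (hypothetical) Glue tables.
customer_identifier.attached_to.update(
    {("customers", "customer_id"), ("orders", "customer_id")}
)


def terms_for(table: str, column: str, terms: list[GlossaryTerm]) -> list[str]:
    """Look up the business terms attached to a technical column."""
    return [t.path() for t in terms if (table, column) in t.attached_to]


print(terms_for("orders", "customer_id", [customer_domain, customer_identifier]))
```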

Lineage and Impact Analysis

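Lineage is the directed graph that connects sources to derived datasets. It answers two complementary questions:

  1. Provenance (upstream): which sources and transformations produced this dataset?
  2. Impact analysis (downstream): if this source changes or breaks, which derived datasets are affected?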

Lineage is the airline route map. If a storm grounds a hub airport, the impact map shows every connecting flight that will be disrupted. If a passenger asks how their bag traveled, the provenance map shows every leg.
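
Both directions reduce to a graph traversal over the same edges, walked forward for impact and backward for provenance. The following is a minimal sketch with a made-up lineage graph; the dataset names and the `impact` and `provenance` helpers are illustrative only and do not correspond to any AWS API.

```python
from collections import deque

# Hypothetical lineage edges: (source, derived dataset).
EDGES = [
    ("raw_events", "sessions"),
    ("raw_customers", "customers_clean"),
    ("sessions", "daily_active_users"),
    ("customers_clean", "daily_active_users"),
    ("daily_active_users", "exec_dashboard"),
]


def _reachable(start: str, edges: list[tuple[str, str]]) -> set[str]:
    """Breadth-first search over a list of directed (from, to) edges."""
    neighbours: dict[str, list[str]] = {}
    for src, dst in edges:
        neighbours.setdefault(src, []).append(dst)
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact(dataset: str) -> set[str]:
    """Downstream: everything that would be affected if `dataset` broke."""
    return _reachable(dataset, EDGES)


def provenance(dataset: str) -> set[str]:
    """Upstream: everything `dataset` was derived from (edges reversed)."""
    return _reachable(dataset, [(dst, src) for src, dst in EDGES])


print(sorted(impact("raw_events")))
print(sorted(provenance("daily_active_users")))
```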

Figure: AWS Lineage Chain (Mermaid)

flowchart LR
    A[CloudTrail / VPC Flow / EKS] --> B[Lake Formation<br/>Ingestion Control]
    B --> C[(Glue Data Catalog<br/>Metadata Registration)]
    C --> D{Query Engines}
    D --> D1[Athena]
    D --> D2[Redshift Spectrum]
    D --> D3[EMR / Spark]
    D1 --> E[Curated Data Product]
    D2 --> E
    D3 --> E
    E --> F[DataZone Lineage View]
    E --> G[Dashboards / ML]

Key Takeaways

Part 2: Lake Formation and LF-TBAC

Pre-reading Check — Part 2

1. What is the main reason Lake Formation grants are easier to author than raw S3 IAM policies for analytics?

   a. They use database-style permissions (SELECT, INSERT) on databases, tables, and columns instead of S3 prefix JSON.
   b. They are written in YAML rather than JSON.
   c. They eliminate the need for a query engine.
   d. They are automatically generated by CloudTrail.

2. Which scenario is the strongest argument for switching from named-resource grants to LF-TBAC?

   a. A single one-table dataset with no growth.
   b. A two-person team that only uses one database.
   c. A company with 5,000 tables and 200 roles where named grants do not scale.
   d. A workload that does not use the Glue Data Catalog.

3. An LF-Tag is best defined as:

   a. A row-level encryption key.
   b. A key/value pair attached to a database, table, or column to drive permissions.
   c. An IAM policy variable resolved at sign-in.
   d. A backup snapshot identifier.

4. In Lake Formation cross-account sharing, what actually crosses the account boundary?

   a. All raw S3 data is copied into the consumer account.
   b. Only metadata and grants cross; underlying S3 data stays in the producer account.
   c. The Glue Data Catalog is rebuilt nightly on the consumer side.
   d. A snapshot of the producer's IAM users.

5. A user runs SELECT * on customers, but the Lake Formation grant is SELECT (customer_id, email) only. What does Athena return?

   a. An access-denied error, no rows.
   b. All columns; the grant only restricts INSERT.
   c. Only customer_id and email; Athena enforces the column filter at runtime.
   d. Only the ssn column, because filtered columns invert.

AWS Lake Formation layers governance on top of the Glue Data Catalog. Where Glue describes your data, Lake Formation decides who can use it. It introduces a permission model that is simpler, finer-grained, and more portable than raw IAM policies — and it integrates with every Glue-aware query engine.

Permissions vs IAM

Pure IAM policies on S3 prefixes are clumsy for analytics. They cannot express "Alice can see all columns of customers except ssn," and they cannot scale to hundreds of tables without a thicket of statements. Lake Formation introduces a relational permission model — SELECT, INSERT, ALTER, DROP, DESCRIBE — granted on databases, tables, and columns.

The hand-off works like this: Lake Formation registers an S3 location and takes over its access decisions. When Athena runs a query, it asks Lake Formation for temporary, vended credentials to the registered location, then returns only the rows and columns the user is allowed to see. The user's underlying IAM identity needs just enough permission to call Lake Formation; Lake Formation does the heavy lifting.

Tier             | Mechanism                  | Example
-----------------|----------------------------|---------------------------------------------------------
Catalog-level    | Glue database/table grants | GRANT DESCRIBE ON DATABASE finance_db TO IAM:role/Analyst
Column/row-level | Lake Formation filters     | Mask ssn; row filter region = 'EU'
Principal-based  | IAM roles, cross-account   | Third-party subscribers via IAM roles

A simple example shows the column-level filter at work. A customers table contains customer_id, email, address, ssn. The marketing analyst's grant looks like:

GRANT SELECT (customer_id, email)
    ON TABLE customers TO IAM:role/MarketingAnalyst;

Even if the analyst writes SELECT *, Athena returns only the two permitted columns. The query engine, not the policy author, enforces the filter at runtime.
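
The runtime behaviour can be imitated in a few lines: the engine expands SELECT * against the permitted column list rather than the full schema. The sketch below is a simulation in plain Python, not Athena or Lake Formation code; `COLUMN_GRANTS`, `CUSTOMERS`, and `select_star` are hypothetical, and the sample rows are invented.

```python
# Hypothetical grant table: role -> columns that role may SELECT on `customers`.
COLUMN_GRANTS = {"MarketingAnalyst": ["customer_id", "email"]}

# Invented sample rows with the four columns named in the text.
CUSTOMERS = [
    {"customer_id": 1, "email": "a@example.com", "address": "1 Main St", "ssn": "000-00-0001"},
    {"customer_id": 2, "email": "b@example.com", "address": "2 Oak Ave", "ssn": "000-00-0002"},
]


def select_star(role: str, rows: list[dict]) -> list[dict]:
    """Emulate SELECT * under a column-level grant: project permitted columns only."""
    allowed = COLUMN_GRANTS.get(role, [])
    if not allowed:
        raise PermissionError(f"{role} has no SELECT grant on this table")
    return [{col: row[col] for col in allowed} for row in rows]


result = select_star("MarketingAnalyst", CUSTOMERS)
print(result[0])
```

The analyst's query text never changes; only the projection applied on their behalf does.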

Tag-Based Access Control (LF-TBAC)

Named-resource grants do not scale. Imagine 5,000 tables and 200 roles — a million potential cell entries. LF-TBAC inverts the model: instead of writing grants per resource, you tag resources with attributes and write grants that match those attributes.

An LF-Tag is a key/value pair attached to a database, table, or column. Common patterns include domain=finance, classification=restricted, pii=true. Tags inherit hierarchically: a database tag flows to its tables, a table tag flows to its columns, and lower levels can override.

GRANT SELECT ON
  LF-TAG-EXPRESSION (
    domain         IN ('customer'),
    classification IN ('internal')
  )
  TO IAM:role/MarketingAnalyst;

This means "Marketing can see anything tagged domain=customer AND classification=internal." Add 50 new customer tables and tag them appropriately — Marketing automatically gets access. Tag a new column as classification=restricted and Marketing automatically loses it.
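
Tag inheritance and expression matching can be modelled together. The sketch below is a simplified simulation under stated assumptions, not Lake Formation's evaluation engine: the tag dictionaries and the `effective_tags`, `matches`, and `visible_columns` helpers are hypothetical, and the database is assumed to carry both tags so that every column inherits them unless it overrides one.

```python
# Hypothetical tag assignments. Lower levels override higher ones per key.
DATABASE_TAGS = {"domain": "customer", "classification": "internal"}
TABLE_TAGS = {"customers": {}}
COLUMN_TAGS = {
    "customers": {
        "customer_id": {},
        "email": {},
        "ssn": {"classification": "restricted"},  # column-level override
    }
}

# Grant expression for MarketingAnalyst: every key must match one allowed value.
MARKETING_EXPRESSION = {"domain": {"customer"}, "classification": {"internal"}}


def effective_tags(table: str, column: str) -> dict[str, str]:
    """Merge database -> table -> column tags; later levels win."""
    return {**DATABASE_TAGS, **TABLE_TAGS[table], **COLUMN_TAGS[table][column]}


def matches(tags: dict[str, str], expression: dict[str, set[str]]) -> bool:
    """True when the resource carries an allowed value for every key in the expression."""
    return all(tags.get(key) in allowed for key, allowed in expression.items())


def visible_columns(table: str, expression: dict[str, set[str]]) -> list[str]:
    """Columns of `table` whose effective tags satisfy the grant expression."""
    return [c for c in COLUMN_TAGS[table] if matches(effective_tags(table, c), expression)]


print(visible_columns("customers", MARKETING_EXPRESSION))
```

No grant mentions `customers` by name. Registering another table under the same database would make it visible to the same role with no new grant, and retagging a column removes it just as quietly.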

Animation: LF-TBAC Tag-Match Evaluation Flow

Athena requests a query, Lake Formation looks up tags, matches the grant expression, and either vends scoped credentials or masks the column.

[Animation steps: the analyst runs SELECT * FROM customers in Athena; Lake Formation authorizes the request by reading the Glue Catalog tags on customers (domain = customer, classification = internal, pii = ssn on one column only) and comparing them with the MarketingAnalyst grant expression (domain IN ('customer'), classification IN ('internal')). On a tag match it vends scoped credentials for customer_id and email; the ssn (pii) column is masked or dropped.]

Granularity | Tag Example          | Effect
------------|----------------------|--------------------------------------
Database    | domain=finance       | All tables inherit
Table       | dataset=transactions | Overrides domain default if conflict
Column      | pii=ssn              | Specific column-level restriction

The analogy is a museum's security badging. You do not list every painting each guard can stand near; you give the guard a badge level (Bronze, Silver, Gold) and tag each painting with a required level. Adding a new painting is just sticking on a tag, not rewriting every guard's job description.

Cross-Account and Cross-Region Sharing

Most enterprises are multi-account by design — a Producer Account owns the data, a Consumer Account does the analytics, and a Central Governance Account holds the catalog. Lake Formation makes this viable by supporting cross-account grants on both named resources and LF-Tags.

Producer Account (111111111111)
   GRANTS LF-Tag(domain=customer) DESCRIBE/ASSOCIATE
       to Consumer Account (222222222222)
   GRANTS SELECT on LF-Tag(domain=customer, classification=internal)
       to Consumer Account (222222222222)

Consumer Account (222222222222)
   GRANTS SELECT on LF-Tag(domain=customer, classification=internal)
       to IAM:role/MarketingAnalyst

The key insight is that only the catalog metadata crosses boundaries — the underlying S3 data stays put — so consumers see a federated view rather than copies of the data.
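
The two-hop structure of the grants above can be modelled as a pair of checks that must both pass. The following is a simplified sketch under assumptions, not how Lake Formation evaluates cross-account permissions internally: `PRODUCER_GRANTS`, `CONSUMER_GRANTS`, and `can_select` are hypothetical names, and tags are reduced to flat key/value dictionaries.

```python
# Hop 1: tag expressions the producer account has shared with each consumer account.
PRODUCER_GRANTS = {
    "222222222222": {"domain": {"customer"}, "classification": {"internal"}},
}

# Hop 2: tag expressions each consumer account's admin has sub-granted to local roles.
CONSUMER_GRANTS = {
    ("222222222222", "MarketingAnalyst"): {
        "domain": {"customer"},
        "classification": {"internal"},
    },
}


def _matches(tags: dict[str, str], expression: dict[str, set[str]]) -> bool:
    """True when the resource carries an allowed value for every key in the expression."""
    return all(tags.get(key) in allowed for key, allowed in expression.items())


def can_select(account: str, role: str, resource_tags: dict[str, str]) -> bool:
    """A role can read a shared table only if both hops match the table's tags."""
    shared = PRODUCER_GRANTS.get(account)
    local = CONSUMER_GRANTS.get((account, role))
    if shared is None or local is None:
        return False
    return _matches(resource_tags, shared) and _matches(resource_tags, local)


internal_table = {"domain": "customer", "classification": "internal"}
print(can_select("222222222222", "MarketingAnalyst", internal_table))
```

Because the consumer-side check sits behind the producer-side check, a consumer admin in this model cannot hand out more than the producer shared. Nothing here touches S3; only grants and tags are consulted.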

Figure: Cross-Account Governance Topology (Mermaid)

graph TD
    subgraph Producer["Producer Account 111111111111"]
        PS[(S3 Data Lake)]
        PG[Glue Catalog]
        PLF[Lake Formation Admin]
    end
    subgraph Central["Central Governance Account"]
        CC[Shared Catalog View]
        Tags[LF-Tags<br/>domain, classification]
    end
    subgraph Consumer["Consumer Account 222222222222"]
        CLF[Lake Formation Admin]
        CRole[IAM:role/MarketingAnalyst]
        CAthena[Athena Workgroup]
    end
    PG -->|register| Tags
    PLF -->|GRANT DESCRIBE/ASSOCIATE<br/>on LF-Tag| CC
    PLF -->|GRANT SELECT<br/>on tag expression| CLF
    CLF -->|sub-grant| CRole
    CRole -->|query| CAthena
    CAthena -.->|read in place<br/>vended credentials| PS

Key Takeaways

Part 3: DataZone and Data Mesh

Pre-reading Check — Part 3

1. The data mesh pattern is best characterized as:

   a. A single physical warehouse owned by one central team.
   b. Decentralized domain ownership of data products with a central governance fabric.
   c. A backup-and-restore strategy for object storage.
   d. An IAM policy template generator.

2. In DataZone, what is a "data product"?

   a. A raw S3 prefix with no metadata.
   b. A curated bundle of one or more assets enriched with business metadata, owner, glossary terms.
   c. A Lambda function that transforms data.
   d. A stored procedure inside Redshift.

3. When a consumer's subscription to a Glue-table-backed data product is approved in DataZone, what happens automatically?

   a. DataZone calls Lake Formation to create the appropriate grants and wires the consumer environment.
   b. DataZone emails an IAM policy JSON for manual application.
   c. DataZone copies the data to the consumer's S3 bucket.
   d. DataZone disables encryption on the asset.

4. What is the benefit of the SageMaker Catalog integration with DataZone for a data scientist?

   a. Approved subscriptions appear in Studio with Lake Formation grants pre-wired, so no manual IAM tickets are needed.
   b. SageMaker bypasses Lake Formation entirely.
   c. Studio re-creates raw S3 data inside its own account.
   d. Glossary terms are removed during model training.

5. Which is the correct ordering of the DataZone publish/subscribe lifecycle?

   a. Subscribe -> Publish -> Approve -> Discover -> Fulfill.
   b. Publish -> Discover -> Subscribe -> Approve -> Fulfill.
   c. Approve -> Discover -> Publish -> Subscribe -> Fulfill.
   d. Fulfill -> Discover -> Subscribe -> Publish -> Approve.

A data mesh is an organizational and technical pattern that decentralizes data ownership. Instead of one central team controlling a monolithic warehouse, individual business domains (Customer, Finance, Logistics) own and publish their data as data products, while a central platform team supplies the governance, discovery, and policy machinery. Amazon DataZone is AWS's managed implementation of this pattern.

Domains and Data Products

DataZone organizes the world into domains and projects. A domain is a top-level container, typically deployed in a central governance account, that holds the catalog, glossary, projects, and policy rules. A project is a use-case workspace inside a domain — a place where a small group collaborates with a defined set of tools (Athena, Redshift, SageMaker) and data assets.

A data product is the unit of consumption. Concretely, it is a curated bundle of one or more assets — a Glue table, a Redshift view, an S3 prefix — wrapped in business metadata: name, description, owner, sensitivity, lineage, freshness, glossary terms, sample queries.

DataZone Concept   | What It Is                  | Example
-------------------|-----------------------------|---------------------------
Domain             | Org-wide container          | acme-corp
Project (Producer) | Publishes data              | Customer Identity Team
Project (Consumer) | Subscribes to data          | Marketing Analytics
Asset              | Underlying technical object | Glue table customers_v3
Data Product       | Curated, published bundle   | "Active Customers (Daily)"

Think of data products as books in a public library. The library (DataZone) does not own the books; it lets independent publishers (domain teams) shelve them with consistent metadata so any reader (consumer) can find and check them out.

Publish/Subscribe Model

DataZone implements a producer/consumer workflow that mirrors how application teams consume APIs:

  1. Producers create assets, add business metadata and glossary terms, and publish the assets as a data product to the domain catalog.
  2. Consumers discover the data product via search powered by metadata and glossary terms.
  3. The consumer subscribes on behalf of their project, attaching a justification.
  4. The data owner approves or rejects the request in the data portal.
  5. On approval, fulfillment workflows automatically grant access — for Glue tables DataZone calls Lake Formation; for Redshift it adjusts data sharing; for non-native assets it publishes an EventBridge event for custom fulfillment.
  6. The consumer's project environment is automatically wired so analysts can query immediately via Athena or Redshift.
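
The workflow above behaves like a small state machine in which fulfillment is reachable only through approval. The sketch below models that ordering in plain Python. It is an illustration, not the DataZone API: the `State` names, the action strings, and the `advance` helper are hypothetical.

```python
from enum import Enum


class State(Enum):
    PUBLISHED = "published"    # product is discoverable in the catalog
    REQUESTED = "requested"    # consumer has subscribed with a justification
    APPROVED = "approved"      # data owner accepted the request
    REJECTED = "rejected"      # data owner declined the request
    FULFILLED = "fulfilled"    # grants created and environment wired


# Allowed transitions, following the numbered steps above.
TRANSITIONS = {
    (State.PUBLISHED, "subscribe"): State.REQUESTED,
    (State.REQUESTED, "approve"): State.APPROVED,
    (State.REQUESTED, "reject"): State.REJECTED,
    (State.APPROVED, "fulfill"): State.FULFILLED,
}


def advance(state: State, action: str) -> State:
    """Apply one action, refusing any step that skips the workflow order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} from state {state.name}") from None


state = State.PUBLISHED
for action in ("subscribe", "approve", "fulfill"):
    state = advance(state, action)
print(state.name)
```

Trying to fulfill a subscription that was never approved raises an error, which is the property that lets access be granted automatically without losing the approval step.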

Animation: DataZone Publish/Subscribe Lifecycle

Producer publishes a data product, consumer discovers and subscribes, the owner approves, and Lake Formation fulfills the grant automatically.

[Animation steps across Producer, Domain Catalog, Consumer, and Owner + Lake Formation: 1. publish data product; 2. search/browse; 3. subscribe + justify; 4. notify approval request; 5. approve -> LF grant + env wiring; 6. query Athena/Redshift. No emailed JSON policies, no manual IAM tickets.]

Figure: Publish/Subscribe Sequence (Mermaid)

sequenceDiagram
    participant P as Producer Project
    participant D as Domain Catalog
    participant O as Data Owner
    participant LF as Lake Formation
    participant C as Consumer Project
    P->>P: Curate asset + glossary terms
    P->>D: Publish data product
    C->>D: Search / browse catalog
    D-->>C: Discover data product
    C->>D: Subscribe (with justification)
    D->>O: Notify approval request
    O->>D: Approve subscription
    D->>LF: Fulfillment: create grants
    LF-->>C: Wire env (Athena/Redshift)
    C->>LF: Run query
    LF-->>C: Return governed results

SageMaker Catalog Integration

In late 2024, AWS unified DataZone with SageMaker under the SageMaker Catalog umbrella. Approved DataZone subscriptions appear directly in SageMaker Studio, where data scientists can query them via Athena, load them into Spark sessions, or pipe them into training jobs — without manual data movement and with end-to-end governance preserved.

A practical scenario:

  1. A data scientist in the Churn Modeling project searches the catalog for "active customers."
  2. They subscribe to the Active Customers (Daily) data product, citing customer churn prediction.
  3. The Identity team owner approves overnight.
  4. The next morning, the scientist opens Studio. The dataset is already accessible through the pre-wired Athena workgroup — no IAM tickets, no S3 paths to memorize.
  5. Lake Formation enforces column masking: ssn is hidden, email is hashed.
  6. Every query is logged in CloudTrail and visible in the DataZone lineage view.
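
From the scientist's point of view, the masking in step 5 means rows arrive without ssn and with email replaced by a digest. The sketch below shows only that observable effect; it says nothing about how the masking is implemented on the platform side. The `mask_row` helper and the sample row are hypothetical, and SHA-256 is an assumption, since the scenario says only that email is "hashed".

```python
import hashlib


def mask_row(row: dict) -> dict:
    """Drop `ssn` and replace `email` with its SHA-256 hex digest."""
    masked = {key: value for key, value in row.items() if key != "ssn"}
    masked["email"] = hashlib.sha256(row["email"].encode("utf-8")).hexdigest()
    return masked


# Invented sample row for illustration.
row = {"customer_id": 1, "email": "a@example.com", "ssn": "000-00-0001"}
masked = mask_row(row)
print(sorted(masked))
```

A deterministic digest keeps the column usable as a join key across datasets while hiding the raw address.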

This loop turns governance from a roadblock into a runway. The scientist gets data faster because governance is automated, not despite it.

Key Takeaways
