Chapter 11: Data Governance, Catalog, and Access Control
Learning Objectives
Differentiate technical metadata, business metadata, and lineage, and explain why each matters for compliance and discovery.
Use the AWS Glue Data Catalog and AWS Lake Formation together to govern lake data, register schemas, and enforce permissions across query engines.
Configure Amazon DataZone for enterprise data publishing and subscription, including domains, projects, business glossaries, and data products.
Implement fine-grained access control using row filters, column masks, and LF-TBAC, and reason about cross-account sharing.
Part 1: Why Governance and Catalog Layers
Pre-reading Check — Part 1
1. Why is a data catalog often described as the "connective tissue" between regulation and engineering?
A. It compresses Parquet files for cheaper storage.
B. It links inventory, policy, and audit trail so compliance claims are verifiable.
C. It replaces IAM with a single key-value access list.
D. It auto-generates SQL queries from regulatory text.
2. What role does the Glue Data Catalog play when Athena runs a query against a table whose data lives in S3?
A. It executes the SQL plan and returns rows.
B. It stores the actual Parquet files Athena scans.
C. It supplies the schema, partition layout, and S3 location so Athena can prune and read.
D. It signs the user's IAM credentials.
3. Which statement best contrasts technical metadata and business metadata?
A. Technical metadata is for engines (types, partitions); business metadata is for humans (terms, owners).
B. Business metadata is stored only in CSV files outside AWS.
C. Technical metadata always includes glossary terms; business metadata never does.
D. They are different names for the same thing.
4. Lineage answers two complementary questions. Which pairing is correct?
A. Provenance (upstream) and Impact analysis (downstream).
B. Throughput (upstream) and Latency (downstream).
C. Encryption at rest and Encryption in transit.
D. Cost reporting and Quota enforcement.
5. Why does an ungoverned data lake tend to balloon in cost?
A. S3 storage prices rise when no IAM policy is attached.
B. Engineers re-derive the same metrics in many projects because canonical versions are not discoverable.
C. Every uncatalogued file is automatically replicated three times more than catalogued files.
D. Athena charges a 10x premium for queries against unregistered tables.
A modern data platform that no one trusts is worse than no platform at all. Governance is the discipline that turns a pile of well-engineered storage into an enterprise asset. Think of it as the building code for a city of data: anyone can stack bricks, but the building code ensures the wiring is up to spec, the exits are marked, and the inspector can verify both.
Compliance: GDPR, HIPAA, SOC 2
Regulations dictate that PII, PHI, and financial records be tracked, restricted, and auditable end to end. In AWS this translates to three concrete needs:
A canonical inventory of every table, column, and partition — the AWS Glue Data Catalog plays this role.
A policy engine that decides who can see what — Lake Formation grants, IAM, plus row and column filters.
An audit trail proving who accessed which table and when, provided by CloudTrail logs of Lake Formation events.
Key Takeaway
Governance is not a feature you bolt on; it is the connective tissue between regulation and engineering. If your catalog, your policy engine, and your audit log do not agree, your compliance story is fiction.
Data Quality, Trust, and Cost
Trust is the currency of analytics. A dashboard whose numbers contradict last week's report drives users back to private spreadsheets. Governance enforces trust by exposing provenance, freshness, and quality metrics. Glue Data Quality scores embedded in Glue Data Catalog tables become a visible attribute consumers see before they query — the data product equivalent of a USDA stamp on a milk carton.
Ungoverned lakes also balloon in cost. Engineers re-derive the same daily_active_users metric in five projects because they cannot find the canonical version. Storage tiers go unmanaged. Query costs spike because partition strategies are invisible. A catalog drives down both storage and query costs: discovery prevents duplication, and metadata-aware tooling (partition pruning, column projection) prevents wasteful scans.
The Three-Tier Metadata Model
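Metadata is data about data. Without it, a Parquet file in S3 is an opaque blob. AWS organizes metadata in three layers: technical metadata in the Glue Data Catalog, business metadata in the DataZone glossary, and lineage metadata in the lineage graph.
Figure: Technical, business, and lineage metadata each route to their canonical AWS service before joining at the lineage graph.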
Glue Data Catalog as Hive Metastore
The AWS Glue Data Catalog is a Hive-compatible metastore: a managed registry of databases, tables, columns, types, partitions, and storage locations. Any tool that speaks the Hive metastore protocol — Athena, Redshift Spectrum, EMR (Spark, Presto, Hive), and Lake Formation itself — can read it. Register a table once, query it from anywhere.
SELECT activity_name, time_dt, src_endpoint.ip
FROM "amazon_security_lake_glue_db_us_east_1"
."amazon_security_lake_table_us_east_1_eks_audit_2_0"
WHERE time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '7' DAY
AND CURRENT_TIMESTAMP;
Athena does not know the file layout; it asks Glue. Glue replies, in effect, "those files are Parquet, partitioned by time_dt, here are the S3 prefixes," and Athena prunes the scan to the partitions that match the WHERE clause. The catalog also carries schema versions: the _1_0 and _2_0 suffixes on the table names mark successive versions of the OCSF-based schema, so data written under the older version stays queryable with its original layout.
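The pruning step can be sketched in a few lines. This is a toy in-memory model, not the Glue API: the CatalogTable and Partition classes, the example-bucket prefix, and the ten daily partitions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Partition:
    time_dt: date    # partition key value
    s3_prefix: str   # where this partition's files live


@dataclass(frozen=True)
class CatalogTable:
    name: str
    file_format: str
    columns: dict      # column name -> type
    partitions: tuple  # tuple of Partition


def prune(table: CatalogTable, start: date, end: date) -> list:
    """Return only the S3 prefixes whose partition value falls in [start, end]."""
    return [p.s3_prefix for p in table.partitions if start <= p.time_dt <= end]


# Hypothetical table with one partition per day for ten days.
table = CatalogTable(
    name="eks_audit",
    file_format="parquet",
    columns={"activity_name": "string", "time_dt": "date"},
    partitions=tuple(
        Partition(date(2024, 1, d), f"s3://example-bucket/eks_audit/time_dt=2024-01-{d:02d}/")
        for d in range(1, 11)
    ),
)

# A seven-day window touches 7 of the 10 partitions; the rest are never read.
prefixes = prune(table, date(2024, 1, 4), date(2024, 1, 10))
print(len(prefixes), "of", len(table.partitions), "partitions scanned")
```

The point of the sketch is that the engine decides what to read from metadata alone, before touching any data files.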
Business Glossary in DataZone
Technical metadata tells you a column is a VARCHAR(36). It does not tell you the column represents the canonical UUID issued by Identity Service v3, considered PII under GDPR Article 4. That second sentence is business metadata, and Amazon DataZone manages it.
DataZone's business glossary is a hierarchical collection of standardized terms. A term like Customer Identifier can sit under a parent category Customer Domain, link to a metadata form, and be attached to assets, schemas, and individual columns across many Glue tables.
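As a rough mental model, a glossary term is a node in a tree that carries a list of the columns it is attached to. The sketch below is plain Python, not the DataZone API; the GlossaryTerm class, the terms_for helper, and the orders.customer_id column are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GlossaryTerm:
    name: str
    parent: Optional["GlossaryTerm"] = None
    attached_to: list = field(default_factory=list)  # "table.column" strings

    def path(self) -> str:
        """Full path from the root category down to this term."""
        if self.parent is None:
            return self.name
        return f"{self.parent.path()} / {self.name}"


def terms_for(column: str, glossary: list) -> list:
    """Look up every glossary term attached to a given column."""
    return [term.path() for term in glossary if column in term.attached_to]


customer_domain = GlossaryTerm("Customer Domain")
customer_identifier = GlossaryTerm(
    "Customer Identifier",
    parent=customer_domain,
    attached_to=["customers_v3.customer_id", "orders.customer_id"],
)
glossary = [customer_domain, customer_identifier]

print(terms_for("orders.customer_id", glossary))
```

One term attached to many columns is what lets a search for "Customer Identifier" return every table that holds it, regardless of how each team named the column.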
Lineage and Impact Analysis
Lineage is the directed graph that connects sources to derived datasets. It answers two complementary questions:
Provenance (upstream): Where did the values in this dashboard come from?
Impact (downstream): If I change this source column, what breaks?
Lineage is the airline route map. If a storm grounds a hub airport, the impact map shows every connecting flight that will be disrupted. If a passenger asks how their bag traveled, the provenance map shows every leg.
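Both questions are graph traversals in opposite directions. The following is a minimal sketch over a made-up lineage graph; the dataset names in EDGES are hypothetical and the functions are illustrative, not part of any AWS SDK.

```python
from collections import deque

# Edges point downstream: source -> datasets derived from it.
EDGES = {
    "raw.events": ["curated.sessions"],
    "raw.customers": ["curated.sessions", "curated.active_customers"],
    "curated.sessions": ["dash.daily_active_users"],
    "curated.active_customers": ["dash.daily_active_users", "ml.churn_features"],
}


def _reachable(start, adjacency):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact(node):
    """Downstream: what could break if this node changes?"""
    return _reachable(node, EDGES)


def provenance(node):
    """Upstream: where did this node's values come from?"""
    reverse = {}
    for source, targets in EDGES.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return _reachable(node, reverse)


print(sorted(impact("raw.events")))
print(sorted(provenance("ml.churn_features")))
```

Impact analysis follows the edges as stored; provenance follows the same edges reversed, which is why a single lineage graph answers both questions.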
Figure: AWS Lineage Chain (Mermaid)
flowchart LR
A[CloudTrail / VPC Flow / EKS] --> B[Lake Formation Ingestion Control]
B --> C[(Glue Data Catalog Metadata Registration)]
C --> D{Query Engines}
D --> D1[Athena]
D --> D2[Redshift Spectrum]
D --> D3[EMR / Spark]
D1 --> E[Curated Data Product]
D2 --> E
D3 --> E
E --> F[DataZone Lineage View]
E --> G[Dashboards / ML]
Key Takeaways
The Glue Data Catalog is the Rosetta Stone of AWS analytics — one definition serves Athena, Redshift Spectrum, EMR, and Lake Formation.
Technical metadata is for engines; business metadata is for humans. A platform that captures only one will be either unsearchable or unqueryable.
Lineage gives the data platform memory: provenance answers "where did this come from?", impact answers "what breaks if I change this?".
Post-reading Check — Part 1
1. Why is a data catalog often described as the "connective tissue" between regulation and engineering?
A. It compresses Parquet files for cheaper storage.
B. It links inventory, policy, and audit trail so compliance claims are verifiable.
C. It replaces IAM with a single key-value access list.
D. It auto-generates SQL queries from regulatory text.
2. What role does the Glue Data Catalog play when Athena runs a query against a table whose data lives in S3?
A. It executes the SQL plan and returns rows.
B. It stores the actual Parquet files Athena scans.
C. It supplies the schema, partition layout, and S3 location so Athena can prune and read.
D. It signs the user's IAM credentials.
3. Which statement best contrasts technical metadata and business metadata?
A. Technical metadata is for engines (types, partitions); business metadata is for humans (terms, owners).
B. Business metadata is stored only in CSV files outside AWS.
C. Technical metadata always includes glossary terms; business metadata never does.
D. They are different names for the same thing.
4. Lineage answers two complementary questions. Which pairing is correct?
A. Provenance (upstream) and Impact analysis (downstream).
B. Throughput (upstream) and Latency (downstream).
C. Encryption at rest and Encryption in transit.
D. Cost reporting and Quota enforcement.
5. Why does an ungoverned data lake tend to balloon in cost?
A. S3 storage prices rise when no IAM policy is attached.
B. Engineers re-derive the same metrics in many projects because canonical versions are not discoverable.
C. Every uncatalogued file is automatically replicated three times more than catalogued files.
D. Athena charges a 10x premium for queries against unregistered tables.
Part 2: Lake Formation and LF-TBAC
Pre-reading Check — Part 2
1. What is the main reason Lake Formation grants are easier to author than raw S3 IAM policies for analytics?
A. They use database-style permissions (SELECT, INSERT) on databases, tables, and columns instead of S3 prefix JSON.
B. They are written in YAML rather than JSON.
C. They eliminate the need for a query engine.
D. They are automatically generated by CloudTrail.
2. Which scenario is the strongest argument for switching from named-resource grants to LF-TBAC?
A. A single one-table dataset with no growth.
B. A two-person team that only uses one database.
C. A company with 5,000 tables and 200 roles where named grants do not scale.
D. A workload that does not use the Glue Data Catalog.
3. An LF-Tag is best defined as:
A. A row-level encryption key.
B. A key/value pair attached to a database, table, or column to drive permissions.
C. An IAM policy variable resolved at sign-in.
D. A backup snapshot identifier.
4. In Lake Formation cross-account sharing, what actually crosses the account boundary?
A. All raw S3 data is copied into the consumer account.
B. Only metadata and grants cross; underlying S3 data stays in the producer account.
C. The Glue Data Catalog is rebuilt nightly on the consumer side.
D. A snapshot of the producer's IAM users.
5. A user runs SELECT * on customers, but the Lake Formation grant is SELECT (customer_id, email) only. What does Athena return?
A. An access-denied error, no rows.
B. All columns, because the grant only restricts INSERT.
C. Only customer_id and email; Athena enforces the column filter at runtime.
D. Only the ssn column, because filtered columns invert.
AWS Lake Formation layers governance on top of the Glue Data Catalog. Where Glue describes your data, Lake Formation decides who can use it. It introduces a permission model that is simpler, finer-grained, and more portable than raw IAM policies — and it integrates with every Glue-aware query engine.
Permissions vs IAM
Pure IAM policies on S3 prefixes are clumsy for analytics. They cannot express "Alice can see all columns of customers except ssn," and they cannot scale to hundreds of tables without a thicket of statements. Lake Formation introduces a relational permission model — SELECT, INSERT, ALTER, DROP, DESCRIBE — granted on databases, tables, and columns.
The hand-off works like this: Lake Formation registers an S3 location and takes over its access decisions. When Athena runs a query, it asks Lake Formation for temporary vended credentials for that location, and the engine then applies the row and column filters attached to the user's grants. The user's underlying IAM identity gets just enough permission to call Lake Formation; Lake Formation does the heavy lifting.
Tier | Mechanism | Example
Catalog-level | Glue database/table grants | GRANT DESCRIBE ON DATABASE finance_db TO IAM:role/Analyst
Column/row-level | Lake Formation filters | Mask ssn; row filter region = 'EU'
Principal-based | IAM roles, cross-account | Third-party subscribers via IAM roles
A simple example shows the column-level filter at work. A customers table contains customer_id, email, address, ssn. The marketing analyst's grant looks like:
GRANT SELECT (customer_id, email)
ON TABLE customers TO IAM:role/MarketingAnalyst;
Even if the analyst writes SELECT *, Athena returns only the two permitted columns. The query engine, not the policy author, enforces the filter at runtime.
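Assuming the grant is issued through the Lake Formation API rather than typed as SQL, the same intent can be sketched as a request payload in the shape that boto3's lakeformation grant_permissions call accepts. The account ID in the role ARN and the sales_db database name are assumptions, and visible_columns is a hypothetical helper that only imitates what the engine does; nothing here contacts AWS.

```python
def column_grant_request(role_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def visible_columns(table_columns, request):
    """Imitate the engine: SELECT * is narrowed to the granted columns."""
    allowed = set(request["Resource"]["TableWithColumns"]["ColumnNames"])
    return [c for c in table_columns if c in allowed]


request = column_grant_request(
    role_arn="arn:aws:iam::111111111111:role/MarketingAnalyst",  # assumed account ID
    database="sales_db",                                         # assumed database name
    table="customers",
    columns=["customer_id", "email"],
)

# With credentials configured, the payload could be submitted as:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)

print(visible_columns(["customer_id", "email", "address", "ssn"], request))
```

Listing the permitted columns, rather than the forbidden ones, means a column added to the table later stays hidden until someone grants it.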
Tag-Based Access Control (LF-TBAC)
Named-resource grants do not scale. Imagine 5,000 tables and 200 roles — a million potential cell entries. LF-TBAC inverts the model: instead of writing grants per resource, you tag resources with attributes and write grants that match those attributes.
An LF-Tag is a key/value pair attached to a database, table, or column. Common patterns include domain=finance, classification=restricted, pii=true. Tags inherit hierarchically: a database tag flows to its tables, a table tag flows to its columns, and lower levels can override.
GRANT SELECT ON
LF-TAG-EXPRESSION (
domain IN ('customer'),
classification IN ('internal')
)
TO IAM:role/MarketingAnalyst;
This means "Marketing can see anything tagged domain=customer AND classification=internal." Add 50 new customer tables and tag them appropriately — Marketing automatically gets access. Tag a new column as classification=restricted and Marketing automatically loses it.
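The matching rule is easy to state in code: every key in the grant expression must be present on the resource, and the resource's value for that key must be one of the listed values. The sketch below is a plain-Python imitation of that rule, not Lake Formation itself; the table names and their tags are made up.

```python
def matches(expression, resource_tags):
    """AND across keys; within a key, any listed value is acceptable."""
    return all(resource_tags.get(key) in values for key, values in expression.items())


def accessible_tables(expression, tables):
    return sorted(name for name, tags in tables.items() if matches(expression, tags))


# domain IN ('customer') AND classification IN ('internal')
marketing_grant = {"domain": {"customer"}, "classification": {"internal"}}

tables = {
    "customers_v3": {"domain": "customer", "classification": "internal"},
    "customer_ssn_vault": {"domain": "customer", "classification": "restricted"},
    "ledger": {"domain": "finance", "classification": "internal"},
}
print(accessible_tables(marketing_grant, tables))

# Tagging a new table grants access with no change to the grant itself.
tables["customer_segments"] = {"domain": "customer", "classification": "internal"}
print(accessible_tables(marketing_grant, tables))

# Retagging a table as restricted removes access, again without touching the grant.
tables["customers_v3"]["classification"] = "restricted"
print(accessible_tables(marketing_grant, tables))
```

In all three cases the grant is written once and never edited; only the tags on the data change.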
Animation: LF-TBAC Tag-Match Evaluation Flow
Athena requests a query, Lake Formation looks up tags, matches the grant expression, and either vends scoped credentials or masks the column.
Granularity | Tag Example | Effect
Database | domain=finance | All tables inherit
Table | dataset=transactions | Overrides domain default if conflict
Column | pii=ssn | Specific column-level restriction
The analogy is a museum's security badging. You do not list every painting each guard can stand near; you give the guard a badge level (Bronze, Silver, Gold) and tag each painting with a required level. Adding a new painting is just sticking on a tag, not rewriting every guard's job description.
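The inheritance rule from the table (database tags flow down, lower levels override) amounts to a layered dictionary merge. The following sketch assumes that simplified model; effective_tags is a hypothetical helper and the tag values are examples, not output from Lake Formation.

```python
def effective_tags(database_tags, table_tags=None, column_tags=None):
    """Merge tags top-down so the most specific level wins on a key conflict."""
    merged = dict(database_tags)
    merged.update(table_tags or {})
    merged.update(column_tags or {})
    return merged


database = {"domain": "finance", "classification": "internal"}
table = {"dataset": "transactions"}
ssn_column = {"classification": "restricted", "pii": "ssn"}

# A table with no conflicting keys simply adds its own tags to the inherited ones.
print(effective_tags(database, table))

# The ssn column overrides the inherited classification and adds a pii tag.
print(effective_tags(database, table, ssn_column))
```

Because the column-level value replaces the inherited one, a grant that matches classification=internal covers the table but not the ssn column.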
Cross-Account and Cross-Region Sharing
Most enterprises are multi-account by design — a Producer Account owns the data, a Consumer Account does the analytics, and a Central Governance Account holds the catalog. Lake Formation makes this viable by supporting cross-account grants on both named resources and LF-Tags.
Producer Account (111111111111)
  GRANTS LF-Tag(domain=customer) DESCRIBE/ASSOCIATE
    to Consumer Account (222222222222)
  GRANTS SELECT on LF-Tag(domain=customer, classification=internal)
    to Consumer Account (222222222222)
Consumer Account (222222222222)
  GRANTS SELECT on LF-Tag(domain=customer, classification=internal)
    to IAM:role/MarketingAnalyst
The key insight is that only the catalog metadata crosses boundaries — the underlying S3 data stays put — so consumers see a federated view rather than copies of the data.
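The two-hop grant chain above can be modeled as a walk from the analyst's role back to the producer account: access holds only if every hop has a matching grant. This is a simplified simulation under that assumption, not the Lake Formation API; the Grant class and can_select function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    grantor: str
    grantee: str
    permission: str
    tags: frozenset  # (key, value) pairs that a resource must carry


def can_select(principal, producer, resource_tags, grants):
    """Follow grants from the principal back to the producer account."""
    current, visited = principal, set()
    while current != producer:
        if current in visited:  # guard against circular grants
            return False
        visited.add(current)
        link = next(
            (g for g in grants
             if g.grantee == current
             and g.permission == "SELECT"
             and g.tags <= resource_tags),
            None,
        )
        if link is None:
            return False
        current = link.grantor
    return True


PRODUCER, CONSUMER = "111111111111", "222222222222"
EXPRESSION = frozenset({("domain", "customer"), ("classification", "internal")})
grants = [
    Grant(PRODUCER, CONSUMER, "SELECT", EXPRESSION),                 # account to account
    Grant(CONSUMER, "role/MarketingAnalyst", "SELECT", EXPRESSION),  # account to role
]

internal_table = frozenset({("domain", "customer"), ("classification", "internal")})
restricted_table = frozenset({("domain", "customer"), ("classification", "restricted")})

print(can_select("role/MarketingAnalyst", PRODUCER, internal_table, grants))
print(can_select("role/MarketingAnalyst", PRODUCER, restricted_table, grants))
```

Nothing in the model copies data: the grants are the only objects that exist in both accounts, which mirrors the metadata-first sharing described above.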
graph TD
subgraph Producer["Producer Account 111111111111"]
PS[(S3 Data Lake)]
PG[Glue Catalog]
PLF[Lake Formation Admin]
end
subgraph Central["Central Governance Account"]
CC[Shared Catalog View]
Tags[LF-Tags domain, classification]
end
subgraph Consumer["Consumer Account 222222222222"]
CLF[Lake Formation Admin]
CRole[IAM:role/MarketingAnalyst]
CAthena[Athena Workgroup]
end
PG -->|register| Tags
PLF -->|GRANT DESCRIBE/ASSOCIATE on LF-Tag| CC
PLF -->|GRANT SELECT on tag expression| CLF
CLF -->|sub-grant| CRole
CRole -->|query| CAthena
CAthena -.->|read in place vended credentials| PS
Key Takeaways
Lake Formation replaces S3 IAM acrobatics with a database-style permission model. If a permission is hard to express in IAM, it is probably a one-line Lake Formation grant.
Tags scale where named resources do not. With LF-TBAC, governance becomes attribute-driven: tag once, grant once, and your permissions follow your data.
Cross-account sharing is metadata-first. Data does not move; access does. This is what makes a real data mesh feasible at AWS scale.
Post-reading Check — Part 2
1. What is the main reason Lake Formation grants are easier to author than raw S3 IAM policies for analytics?
A. They use database-style permissions (SELECT, INSERT) on databases, tables, and columns instead of S3 prefix JSON.
B. They are written in YAML rather than JSON.
C. They eliminate the need for a query engine.
D. They are automatically generated by CloudTrail.
2. Which scenario is the strongest argument for switching from named-resource grants to LF-TBAC?
A. A single one-table dataset with no growth.
B. A two-person team that only uses one database.
C. A company with 5,000 tables and 200 roles where named grants do not scale.
D. A workload that does not use the Glue Data Catalog.
3. An LF-Tag is best defined as:
A. A row-level encryption key.
B. A key/value pair attached to a database, table, or column to drive permissions.
C. An IAM policy variable resolved at sign-in.
D. A backup snapshot identifier.
4. In Lake Formation cross-account sharing, what actually crosses the account boundary?
A. All raw S3 data is copied into the consumer account.
B. Only metadata and grants cross; underlying S3 data stays in the producer account.
C. The Glue Data Catalog is rebuilt nightly on the consumer side.
D. A snapshot of the producer's IAM users.
5. A user runs SELECT * on customers, but the Lake Formation grant is SELECT (customer_id, email) only. What does Athena return?
A. An access-denied error, no rows.
B. All columns, because the grant only restricts INSERT.
C. Only customer_id and email; Athena enforces the column filter at runtime.
D. Only the ssn column, because filtered columns invert.
Part 3: DataZone and Data Mesh
Pre-reading Check — Part 3
1. The data mesh pattern is best characterized as:
A. A single physical warehouse owned by one central team.
B. Decentralized domain ownership of data products with a central governance fabric.
C. A backup-and-restore strategy for object storage.
D. An IAM policy template generator.
2. In DataZone, what is a "data product"?
A. A raw S3 prefix with no metadata.
B. A curated bundle of one or more assets enriched with business metadata, owner, glossary terms.
C. A Lambda function that transforms data.
D. A stored procedure inside Redshift.
3. When a consumer's subscription to a Glue-table-backed data product is approved in DataZone, what happens automatically?
A. DataZone calls Lake Formation to create the appropriate grants and wires the consumer environment.
B. DataZone emails an IAM policy JSON for manual application.
C. DataZone copies the data to the consumer's S3 bucket.
D. DataZone disables encryption on the asset.
4. What is the benefit of the SageMaker Catalog integration with DataZone for a data scientist?
A. Approved subscriptions appear in Studio with Lake Formation grants pre-wired, so no manual IAM tickets are needed.
B. SageMaker bypasses Lake Formation entirely.
C. Studio re-creates raw S3 data inside its own account.
D. Glossary terms are removed during model training.
5. Which is the correct ordering of the DataZone publish/subscribe lifecycle?
A data mesh is an organizational and technical pattern that decentralizes data ownership. Instead of one central team controlling a monolithic warehouse, individual business domains (Customer, Finance, Logistics) own and publish their data as data products, while a central platform team supplies the governance, discovery, and policy machinery. Amazon DataZone is AWS's managed implementation of this pattern.
Domains and Data Products
DataZone organizes the world into domains and projects. A domain is a top-level container, typically deployed in a central governance account, that holds the catalog, glossary, projects, and policy rules. A project is a use-case workspace inside a domain — a place where a small group collaborates with a defined set of tools (Athena, Redshift, SageMaker) and data assets.
A data product is the unit of consumption. Concretely, it is a curated bundle of one or more assets — a Glue table, a Redshift view, an S3 prefix — wrapped in business metadata: name, description, owner, sensitivity, lineage, freshness, glossary terms, sample queries.
DataZone Concept | What it is | Example
Domain | Org-wide container | acme-corp
Project (Producer) | Publishes data | Customer Identity Team
Project (Consumer) | Subscribes to data | Marketing Analytics
Asset | Underlying technical object | Glue table customers_v3
Data Product | Curated, published bundle | "Active Customers (Daily)"
Think of data products as books in a public library. The library (DataZone) does not own the books; it lets independent publishers (domain teams) shelve them with consistent metadata so any reader (consumer) can find and check them out.
Publish/Subscribe Model
DataZone implements a producer/consumer workflow that mirrors how application teams consume APIs:
Producers create assets, add business metadata and glossary terms, and publish the assets as a data product to the domain catalog.
Consumers discover the data product via search powered by metadata and glossary terms.
The consumer subscribes on behalf of their project, attaching a justification.
The data owner approves or rejects the request in the data portal.
On approval, fulfillment workflows automatically grant access — for Glue tables DataZone calls Lake Formation; for Redshift it adjusts data sharing; for non-native assets it publishes an EventBridge event for custom fulfillment.
The consumer's project environment is automatically wired so analysts can query immediately via Athena or Redshift.
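The steps above form a small state machine: a request is created with a justification, the owner decides, and fulfillment runs only after approval. The sketch below models that flow in plain Python; it is not the DataZone API, and the State enum, Subscription class, asset-type strings, and action labels are hypothetical.

```python
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"
    FULFILLED = "fulfilled"


def fulfillment_action(asset_type: str) -> str:
    """Pick the fulfillment path by asset type, as described in the steps above."""
    if asset_type == "glue_table":
        return "lake_formation_grant"
    if asset_type == "redshift":
        return "redshift_data_sharing"
    return "eventbridge_event"  # non-native assets: custom fulfillment


class Subscription:
    def __init__(self, product: str, consumer_project: str, justification: str):
        if not justification.strip():
            raise ValueError("a subscription request needs a justification")
        self.product = product
        self.consumer_project = consumer_project
        self.justification = justification
        self.state = State.REQUESTED
        self.actions = []

    def decide(self, approved: bool) -> None:
        if self.state is not State.REQUESTED:
            raise RuntimeError("only a pending request can be decided")
        self.state = State.APPROVED if approved else State.REJECTED

    def fulfill(self, asset_types: list) -> None:
        if self.state is not State.APPROVED:
            raise RuntimeError("fulfillment runs only after approval")
        self.actions = [fulfillment_action(t) for t in asset_types]
        self.state = State.FULFILLED


sub = Subscription("Active Customers (Daily)", "Marketing Analytics", "campaign targeting")
sub.decide(approved=True)
sub.fulfill(["glue_table", "s3_custom"])
print(sub.state.name, sub.actions)
```

Making fulfillment impossible before approval is the property that turns an ad-hoc access request into an auditable workflow.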
Animation: DataZone Publish/Subscribe Lifecycle
Producer publishes a data product, consumer discovers and subscribes, the owner approves, and Lake Formation fulfills the grant automatically.
Figure: Publish/Subscribe Sequence (Mermaid)
sequenceDiagram
participant P as Producer Project
participant D as Domain Catalog
participant O as Data Owner
participant LF as Lake Formation
participant C as Consumer Project
P->>P: Curate asset + glossary terms
P->>D: Publish data product
C->>D: Search / browse catalog
D-->>C: Discover data product
C->>D: Subscribe (with justification)
D->>O: Notify approval request
O->>D: Approve subscription
D->>LF: Fulfillment: create grants
LF-->>C: Wire env (Athena/Redshift)
C->>LF: Run query
LF-->>C: Return governed results
SageMaker Catalog Integration
In late 2024, AWS unified DataZone with SageMaker under the SageMaker Catalog umbrella. Approved DataZone subscriptions appear directly in SageMaker Studio, where data scientists can query them via Athena, load them into Spark sessions, or pipe them into training jobs — without manual data movement and with end-to-end governance preserved.
A practical scenario:
A data scientist in the Churn Modeling project searches the catalog for "active customers."
They subscribe to the Active Customers (Daily) data product, citing customer churn prediction.
The Identity team owner approves overnight.
The next morning, the scientist opens Studio. The dataset is already accessible through the pre-wired Athena workgroup — no IAM tickets, no S3 paths to memorize.
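Lake Formation enforces column-level filtering: ssn is excluded from the results, and email is exposed only in hashed form, which assumes the producer publishes a hashed column or view because Lake Formation filters include or exclude columns rather than transform their values.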
Every query is logged in CloudTrail and visible in the DataZone lineage view.
This loop turns governance from a roadblock into a runway. The scientist gets data faster because governance is automated, not despite it.
Key Takeaways
A data mesh is not a tool but a contract: domains own their data products, and a central platform supplies the governance fabric. DataZone is the fabric.
Publish/subscribe converts ad-hoc access requests into a versioned, auditable, automated workflow. Producers never see a Lake Formation console; consumers never see an IAM role ARN.
When DataZone, Lake Formation, and SageMaker Catalog cooperate, governance becomes invisible — permissions arrive as a side effect of subscription.
Post-reading Check — Part 3
1. The data mesh pattern is best characterized as:
A. A single physical warehouse owned by one central team.
B. Decentralized domain ownership of data products with a central governance fabric.
C. A backup-and-restore strategy for object storage.
D. An IAM policy template generator.
2. In DataZone, what is a "data product"?
A. A raw S3 prefix with no metadata.
B. A curated bundle of one or more assets enriched with business metadata, owner, glossary terms.
C. A Lambda function that transforms data.
D. A stored procedure inside Redshift.
3. When a consumer's subscription to a Glue-table-backed data product is approved in DataZone, what happens automatically?
A. DataZone calls Lake Formation to create the appropriate grants and wires the consumer environment.
B. DataZone emails an IAM policy JSON for manual application.
C. DataZone copies the data to the consumer's S3 bucket.
D. DataZone disables encryption on the asset.
4. What is the benefit of the SageMaker Catalog integration with DataZone for a data scientist?
A. Approved subscriptions appear in Studio with Lake Formation grants pre-wired, so no manual IAM tickets are needed.
B. SageMaker bypasses Lake Formation entirely.
C. Studio re-creates raw S3 data inside its own account.
D. Glossary terms are removed during model training.
5. Which is the correct ordering of the DataZone publish/subscribe lifecycle?