Skip to main content

Schema Extraction

The extraction phase connects to your PostgreSQL database and exports its complete schema as a JSON representation. This JSON serves as the “current state” for schema comparison.

How It Works

pgtofu queries PostgreSQL’s system catalogs directly to gather comprehensive schema information:
┌───────────────────────────────────────┐
│        PostgreSQL Database            │
├──────────────────────────────────────┤
│  pg_catalog                           │
│  ├── pg_class (tables, indexes)       │
│  ├── pg_attribute (columns)           │
│  ├── pg_constraint (constraints)      │
│  ├── pg_index (index details)         │
│  ├── pg_proc (functions)              │
│  ├── pg_trigger (triggers)            │
│  ├── pg_type (types)                  │
│  └── pg_extension (extensions)        │
└───────────────────────────────────────┘


┌───────────────────────────────────────┐
│     pgtofu Extractor                  │
│     ─────────────────                 │
│     Queries → Parsing → Normalization │
└───────────────────────────────────────┘


┌───────────────────────────────────────┐
│     JSON Schema Output                │
│     ──────────────────                │
│     {                                 │
│       "tables": [...],                │
│       "views": [...],                 │
│       "functions": [...]              │
│     }                                 │
└───────────────────────────────────────┘

Extraction Order

Objects are extracted in dependency order:
  1. Schemas - Database namespaces
  2. Extensions - Installed extensions
  3. Custom Types - Enums, composites, domains
  4. Sequences - Serial/identity sources
  5. Tables - With columns, constraints, indexes
  6. Views - Regular and materialized
  7. Functions - All languages
  8. Triggers - Table triggers
  9. Hypertables - TimescaleDB (if installed)
  10. Continuous Aggregates - TimescaleDB (if installed)

What Gets Extracted

Tables

For each table, pgtofu extracts:
{
  "schema": "public",
  "name": "users",
  "columns": [
    {
      "name": "id",
      "data_type": "bigint",
      "position": 1,
      "is_nullable": false,
      "is_identity": true,
      "identity_generation": "ALWAYS"
    },
    {
      "name": "email",
      "data_type": "character varying",
      "max_length": 255,
      "is_nullable": false
    }
  ],
  "constraints": [
    {
      "name": "users_pkey",
      "type": "PRIMARY KEY",
      "columns": ["id"]
    },
    {
      "name": "users_email_key",
      "type": "UNIQUE",
      "columns": ["email"]
    }
  ],
  "indexes": [
    {
      "name": "idx_users_created_at",
      "columns": ["created_at"],
      "type": "btree"
    }
  ],
  "comment": "Application users"
}

Column Details

PropertyDescription
nameColumn name
data_typePostgreSQL type name
positionOrdinal position (1-based)
is_nullableWhether NULL is allowed
defaultDefault value expression
max_lengthFor VARCHAR/CHAR types
precision, scaleFor NUMERIC types
is_arrayWhether it’s an array type
is_identityGENERATED AS IDENTITY column
identity_generationALWAYS or BY DEFAULT
is_generatedComputed/generated column
generation_expressionExpression for generated columns
commentColumn comment

Constraint Details

Constraint TypeExtracted Properties
PRIMARY KEYColumns
FOREIGN KEYColumns, referenced table, ON DELETE/UPDATE
UNIQUEColumns, deferrable settings
CHECKCheck expression
EXCLUDEColumns, operators, where clause

Index Details

PropertyDescription
nameIndex name
columnsIndexed columns
include_columnsINCLUDE columns (covering index)
typebtree, hash, gin, gist, etc.
is_uniqueUnique index
wherePartial index condition
definitionFull CREATE INDEX statement

Functions

{
  "schema": "public",
  "name": "update_updated_at",
  "language": "plpgsql",
  "argument_types": [],
  "return_type": "trigger",
  "volatility": "VOLATILE",
  "is_strict": false,
  "is_security_definer": false,
  "body": "BEGIN\n    NEW.updated_at = NOW();\n    RETURN NEW;\nEND;",
  "definition": "CREATE OR REPLACE FUNCTION..."
}

Views

{
  "schema": "public",
  "name": "active_users",
  "definition": "SELECT id, email, name FROM users WHERE status = 'active'",
  "is_updatable": false,
  "check_option": "NONE"
}

TimescaleDB Objects

If TimescaleDB is installed, pgtofu also extracts:
{
  "hypertables": [
    {
      "schema": "public",
      "table_name": "metrics",
      "time_column_name": "time",
      "chunk_time_interval": "1 day",
      "compression_enabled": true,
      "compression_settings": {
        "segment_by_columns": ["device_id"],
        "order_by_columns": [{"column": "time", "desc": true}]
      },
      "retention_policy": {
        "drop_after": "2 years"
      }
    }
  ],
  "continuous_aggregates": [
    {
      "schema": "public",
      "view_name": "metrics_hourly",
      "query": "SELECT time_bucket('1 hour', time) AS bucket...",
      "refresh_policy": {
        "start_offset": "3 days",
        "end_offset": "1 hour",
        "schedule_interval": "1 hour"
      }
    }
  ]
}

Schema Filtering

Automatically Excluded

pgtofu excludes system and internal schemas by default:
CategorySchemas
PostgreSQLpg_catalog, information_schema, pg_toast
TimescaleDB_timescaledb_*, timescaledb_information, timescaledb_internal
Hasurahdb_catalog

Manual Exclusion

Exclude additional schemas using --exclude-schema:
pgtofu extract \
  --exclude-schema _prisma \
  --exclude-schema graphql_public \
  --output schema.json
Common third-party schemas to exclude:
ToolSchemas
Prisma_prisma, _prisma_migrations
Supabaseauth, storage, graphql_public, supabase_*
PostgRESTpostgrest
pgAdminpgagent

Type Normalization

pgtofu normalizes type names for consistent comparison:
PostgreSQL RepresentationNormalized Form
int4integer
int8bigint
boolboolean
varcharcharacter varying
timestamptimestamp without time zone
timestamptztimestamp with time zone

Performance Considerations

Query Efficiency

  • All extraction queries use indexed system catalog columns
  • Single-pass extraction minimizes database round trips
  • Typical extraction takes 5-30 seconds for most databases

Large Databases

For databases with 1000+ tables:
  • Consider extracting specific schemas
  • Use output streaming (--output -)
  • Extraction time may reach 60+ seconds

Troubleshooting

Ensure the user has SELECT access to system catalogs:
GRANT SELECT ON ALL TABLES IN SCHEMA pg_catalog TO myuser;
GRANT SELECT ON ALL TABLES IN SCHEMA information_schema TO myuser;
Objects in excluded schemas won’t appear. Check the exclusion list:
  • System schemas (automatic)
  • --exclude-schema flags
Ensure TimescaleDB extension is installed and accessible:
SELECT * FROM pg_extension WHERE extname = 'timescaledb';

Output Verification

After extraction, verify the output:
# Check table count
jq '.tables | length' schema.json

# List all tables
jq '.tables[].name' schema.json

# Check specific table
jq '.tables[] | select(.name == "users")' schema.json

See Also