diff --git a/content/post/postgres/extension-ecosystem-summit/index.md b/content/post/postgres/extension-ecosystem-summit/index.md
index c6d3a2da..9134aafa 100644
--- a/content/post/postgres/extension-ecosystem-summit/index.md
+++ b/content/post/postgres/extension-ecosystem-summit/index.md
@@ -2,7 +2,7 @@
 title: Extension Ecosystem Summit 2024
 slug: extension-ecosystem-summit
 date: 2024-02-27T17:46:58Z
-lastMod: 2024-02-27T17:46:58Z
+lastMod: 2024-03-06T21:50:27Z
 description: |
   Some pals and I organized a summit at PGConf.dev on May 28 to work together
   as a community toward comprehensive indexing, discovery, and binary
@@ -54,21 +54,22 @@
 outline the problems they want to solve, their attempts to do so, challenges
 discovered along the way, and dreams for an ideal extension ecosystem in the
 future. Tentative speaker lineup:
 
-* March 6: [David Wheeler], PGXN: “History and Context of Extension Distribution”
-* March 20: [Ian Stanton], Tembo: “Trunk”
-* April 3: [Devrim Gündüz]: “Overview of the yum.postgresql.org architecture,
-  how new RPMs are added, and issues and challenges with distributing RPMed
+* March 6: [David Wheeler], PGXN: “State of the Extension Ecosystem”
+* March 20: [Ian Stanton], Tembo: “Building Trunk: A Postgres Extension Registry and CLI”
+* April 3: [Devrim Gündüz]: “yum.postgresql.org and the challenges RPMifying
   extensions”
-* April 17: TBD
+* April 17: [Jonathan Katz]: “TLE Vision and Specifics”
 * May 1: [Yurii Rashkovskii], Omnigres: “Universally buildable extensions: dev
   to prod”
-* May 15: [David Wheeler], PGXN: “Metadata for All: Enabling discovery,
+* May 15: (Placeholder) [David Wheeler], PGXN: “Metadata for All: Enabling discovery,
   packaging, and community”
 
 Hit the [event page][mini-event] for details. Many thanks to my co-organizers
 [Jeremy Schneider], [David Christensen], [Keith Fiske], and [Devrim Gündüz],
 as well as the [PGConf.dev organizers] for making this all happen!
 
+**Update:** 2024-03-06: Updated the talk schedule.
+
 [Extension Ecosystem Summit]: https://www.pgevents.ca/events/pgconfdev2024/schedule/session/191-extension-ecosystem-summit/ "PGConf.dev: Extensions Ecosystem Summit: Enabling comprehensive indexing, discovery, and binary distribution"
 [PGConf.dev]: https://2024.pgconf.dev "PostgreSQL Development Conference 2024"
@@ -79,6 +80,7 @@ as well as the [PGConf.dev organizers] for making this all happen!
 [David Wheeler]: {{% ref "/" %}}
 [Ian Stanton]: https://www.linkedin.com/in/istanton
 [Devrim Gündüz]: https://github.com/devrimgunduz
+[Jonathan Katz]: https://jkatz05.com
 [Yurii Rashkovskii]: https://ca.linkedin.com/in/yrashk
 [Jeremy Schneider]: https://about.me/jeremy_schneider
 [David Christensen]: https://www.crunchydata.com/blog/author/david-christensen
diff --git a/content/post/postgres/state-of-the-extension-ecosystem.md b/content/post/postgres/state-of-the-extension-ecosystem/index.md
similarity index 85%
rename from content/post/postgres/state-of-the-extension-ecosystem.md
rename to content/post/postgres/state-of-the-extension-ecosystem/index.md
index 0818f190..9eacedd9 100644
--- a/content/post/postgres/state-of-the-extension-ecosystem.md
+++ b/content/post/postgres/state-of-the-extension-ecosystem/index.md
@@ -2,7 +2,7 @@
 title: "Talk: State of the Extension Ecosystem"
 slug: state-of-the-extension-ecosystem
 date: 2024-03-04T18:50:24Z
-lastMod: 2024-03-04T20:12:11Z
+lastMod: 2024-03-06T21:50:27Z
 description: |
   A quick reminder that I'll be giving a brief talk on the "State of the
   Extension Ecosystem" on Wednesday at noon US Eastern / 17:00 UTC.
@@ -16,6 +16,8 @@ image:
   copyright: 2011 David E. Wheeler
 ---
 
+**Update:** 2024-03-06: Slides and video linked below.
+
 A quick reminder that I'll be giving a brief talk on the "State of the
 Extension Ecosystem" on Wednesday at noon US Eastern / 17:00 UTC.
@@ -33,6 +35,12 @@ summit].
 without using Eventbrite, hit me up at `david@` this domain, [on Mastodon], or
 via the [#extensions] channel on the [Postgres Slack].
 
+**Update:** 2024-03-06: Great turnout and discussion, thank you! Links:
+
+* [Video](https://www.youtube.com/watch?v=6o1N1-Eq-Do)
+* [Keynote]({{% link "state-of-the-ecosystem.key" %}})
+* [PDF Slides]({{% link "state-of-the-extension-ecosystem.pdf" %}})
+
 [mini-summit]: https://www.eventbrite.com/e/851125899477/
   "Postgres Extension Ecosystem Mini-Summit"
 [the summit]: https://www.pgevents.ca/events/pgconfdev2024/schedule/session/191
diff --git a/content/post/postgres/state-of-the-extension-ecosystem/state-of-the-ecosystem.key b/content/post/postgres/state-of-the-extension-ecosystem/state-of-the-ecosystem.key
new file mode 100755
index 00000000..9697cd18
Binary files /dev/null and b/content/post/postgres/state-of-the-extension-ecosystem/state-of-the-ecosystem.key differ
diff --git a/content/post/postgres/state-of-the-extension-ecosystem/state-of-the-extension-ecosystem.pdf b/content/post/postgres/state-of-the-extension-ecosystem/state-of-the-extension-ecosystem.pdf
new file mode 100644
index 00000000..92c0f499
Binary files /dev/null and b/content/post/postgres/state-of-the-extension-ecosystem/state-of-the-extension-ecosystem.pdf differ
diff --git a/feed.xml b/feed.xml
new file mode 100644
index 00000000..94174503
--- /dev/null
+++ b/feed.xml
@@ -0,0 +1,4103 @@
+
+
+ https://justatheory.com/tags/postgres/
+ Postgres on Just a Theory
+ An ongoing list of Just a Theory posts about Postgres
+ 2024-03-04T18:50:24Z
+
+
+
+
+ David E. Wheeler
+ david@justatheory.com
+ https://justatheory.com/
+
+ Hugo
+ https://justatheory.com/icon-512x512.png
+ © David E. Wheeler. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
+ + https://justatheory.com/2024/03/state-of-the-extension-ecosystem/ + <![CDATA[Talk: State of the Extension Ecosystem]]> + + 2024-03-04T20:12:11Z + 2024-03-04T18:50:24Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + +
+ Photo of the summit of Mount Hood + +
+ +
+

A quick reminder that I’ll be giving a brief talk on the “State of the Extension +Ecosystem” on Wednesday at noon US Eastern / 17:00 UTC. This talk is the first +in a series of community talks and discussions on the postgres +extension ecosystem leading up to the Extension Ecosystem Summit +at pgconf.dev on May 28.

+

I plan to give a brief history of Postgres extension tools and distribution, +the challenges encountered, recent developments, and opportunities for the +future. It should take about 30 minutes, followed by discussion. Following +this pattern for all the talks in the series, I hope to set up +some engaging discussions and to surface significant topics ahead of the +summit.

+

Join us! Need other information or just want an invitation without using Eventbrite? Hit me up at david@ this domain, on Mastodon, or via the #extensions channel on the Postgres Slack.

+ +
+ + + +]]>
+
+ + https://justatheory.com/2024/02/extension-ecosystem-summit/ + <![CDATA[Extension Ecosystem Summit 2024]]> + + 2024-02-27T17:46:58Z + 2024-02-27T17:46:58Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + + +
+ Logo for PGConf.dev + +
+ +
+

I’m pleased to announce that some pals and I have organized and will host the +(first annual?) Extension Ecosystem Summit at PGConf.dev in Vancouver (and +more, see below) on May 28:

+
+

Enabling comprehensive indexing, discovery, and binary distribution.

+

Participants will collaborate to examine the ongoing work on PostgreSQL +extension distribution, examine its challenges, identify questions, propose +solutions, and agree on directions for execution.

+
+

Going to PGConf? Select it as an “Additional Option” when you register, or +update your registration if you’ve already registered. Hope to see +you there!

+
+ + +
+ Photo of the summit of Mount Hood + +
+ +

Extension Ecosystem Mini-Summit

+

But if you can’t make it, that’s okay, because in the lead-up to the Summit we’re hosting a series of six virtual gatherings, the Postgres Extension Ecosystem Mini-Summit.

+

Join us for an hour or so every other Wednesday starting March 6 to hear contributors to a variety of community and commercial extension initiatives outline the problems they want to solve, their attempts to do so, challenges discovered along the way, and dreams for an ideal extension ecosystem in the future. Tentative speaker lineup:

+
  • March 6: David Wheeler, PGXN: “History and Context of Extension Distribution”
  • March 20: Ian Stanton, Tembo: “Trunk”
  • April 3: Devrim Gündüz: “Overview of the yum.postgresql.org architecture, how new RPMs are added, and issues and challenges with distributing RPMed extensions”
  • April 17: TBD
  • May 1: Yurii Rashkovskii, Omnigres: “Universally buildable extensions: dev to prod”
  • May 15: David Wheeler, PGXN: “Metadata for All: Enabling discovery, packaging, and community”

Hit the event page for details. Many thanks to my co-organizers +Jeremy Schneider, David Christensen, Keith Fiske, and Devrim Gündüz, +as well as the PGConf.dev organizers for making this all happen!

+ +
+ + + +]]>
+
+ + https://justatheory.com/2024/02/extension-metadata-typology/ + <![CDATA[RFC: Extension Metadata Typology]]> + + 2024-02-27T17:19:24Z + 2024-02-27T17:19:24Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + +
+

Lately I’ve been thinking a lot about metadata for Postgres extensions. +Traditional use cases include control file metadata, which lives in +.control files used by CREATE EXTENSION and friends, and PGXN metadata, +which lives in META.json files used by PGXN to index and publish extensions. +But these two narrow use cases for SQL behavior and source code distribution +don’t provide the information necessary to enable other use cases, including +building, installing, configuration, and more.

+

So I have also been exploring other metadata formats, including:

+ +

These standards from neighboring communities reveal a great deal of overlap, as one might expect (everything has a name, a version, an author, license, and so on), but also types of metadata that had not occurred to me. As I took notes and gathered suggestions from colleagues and coworkers, I began to recognize natural groupings of metadata. This led to the realization that it might be easier — and more productive — to think about these groupings rather than individual fields.

+

I therefore propose a typology for Postgres extension metadata.

+

Extension Metadata Typology

+

Essentials

+

Essential information about the extension itself, including its name (or unique +package name), version, list of authors, license, etc. Pretty much every +metadata format encompasses this data. Ecosystem applications use it for +indexing, installation locations, naming conventions, and display information.

+

Artifacts

+

A list of links and checksums for downloading the extension in one or more +formats, including source code, binaries, system packages, and more. Apps use +this information to determine the best option for installing an extension on a +particular system.

+

Resources

+

External information about the extension, mostly links, including source code +repository, bug reporting, documentation, badges, funding, etc. Apps use this +data for links, of course, but also full text indexing, documentation rendering, +and displaying useful information about the extension.

+

Contents

+

A description of what’s included in the extension package. Often an “extension” consists of multiple extensions, such as PostGIS, which includes postgis, postgis_tiger_geocoder, address_standardizer, and more. Furthermore, some extensions are not CREATE EXTENSION-type extensions at all, such as background workers, command-line apps, libraries, etc. Each should be listed along with documentation links where they differ from the package overall (or are simply more specific).

+
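For instance, here’s a quick sketch of how such a multi-extension package surfaces in a running cluster, assuming the PostGIS packages happen to be installed on the system:

SELECT name, default_version, comment
  FROM pg_available_extensions
 WHERE name LIKE 'postgis%' OR name LIKE 'address_standardizer%'
 ORDER BY name;

A contents typology would capture this package-to-extensions mapping up front, before anything is installed.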

Prerequisites

+

A list of external dependencies required to configure, build, test, install, and run the extension. These include not only other extensions, but also external libraries and OS-specific lists of binary package dependencies. And let’s not forget the versions of Postgres required, as well as any OS and version dependencies (e.g., does it work on Windows? FreeBSD? What versions?) and architectures (arm64, amd64, etc.).

+
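Some of this already exists for extension-to-extension dependencies, at least, since control files may declare them; a minimal sketch of peeking at that data on an existing cluster (the requires column comes straight from each control file):

SELECT name, version, requires
  FROM pg_available_extension_versions
 WHERE requires IS NOT NULL
 ORDER BY name, version;

External libraries, OS packages, Postgres version ranges, and architectures have no such machine-readable home today; that’s the gap this typology would fill.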

How to Build It

+

Metadata that apps use to determine how to build the extension. Does it use the +PostgreSQL PGXS build pipeline? Or perhaps it needs the cargo-based pgrx +toolchain. Maybe a traditional ./configure && make pattern? Perl, Ruby, +Python, Go, Rust, or NPM tooling? Whatever the pattern, this metadata needs to +be sufficient for an ecosystem app to programmatically determine how to build +an extension.

+

How to Install It

+

Usually an extension of the build metadata, the install metadata describes how +to install the extension. That could be PGXS or pgrx again, but could also +use other patterns — or multiple patterns! For example, perhaps an extension +can be built and installed with PGXS, but it might also be TLE-safe, and +therefore provide details for handing the SQL files off to a TLE installer.

+

This typology might include additional data, such as documentation files to +install (man pages anyone?), or directories of dependent files or libraries, +and the like — whatever needs to be installed for the extension.

+

How to Run It

+

Not all Postgres extensions are CREATE EXTENSION extensions. Some provide background workers to perform various tasks; others simply provide utility applications like pg_top and pg_repack. In fact pg_repack provides both a command-line application and a CREATE EXTENSION extension in one package!

+

This metadata also provides configuration information: not only control file parameters like trusted, superuser, and schema, but also load configuration information, like whether an extension needs its libraries included in shared_preload_libraries to enable LOAD or requires a cluster restart. (Arguably this information should be in the “install” typology rather than “run”.)

+
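Postgres already exposes a few of these run-time knobs from the control file; a minimal sketch, assuming pg_repack happens to be packaged on the box and the server is Postgres 13 or later (which added the trusted column):

SELECT name, version, superuser, trusted, relocatable, schema
  FROM pg_available_extension_versions
 WHERE name = 'pg_repack';

The shared_preload_libraries and background-worker requirements, by contrast, generally live only in each extension’s documentation, which is exactly why they belong in this typology.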

Classification

+

Classification metadata lets the extension developer associate additional +information to improve discovery, such as key words. It might also allow +selections from a curated list of extension classifications, such as the +category slugs supported for the cargo categories field. Ecosystem apps use +this data to organize extensions under key words or categories, making it easier +for users to find extensions often used together or for various workloads or +tasks.

+

Metrics and Reports

+

This final typology differs from the others in that its metadata derives from +third party sources rather than the extension developer. It includes data such +as number of downloads, build and test status on various Postgres/OS/version +combinations, binary packaging distributions, test coverage, security scan +results, vulnerability detection, quality metrics and user ratings, and more.

+

In the broader ecosystem, it would be the responsibility of the root registry to ensure that such data in the canonical record for each extension comes only from trusted sources, although applications downstream of the root registry might extend metrics and reports metadata with their own information.

+

What More?

+

Reading through various metadata standards, I suspect this typology is fairly +comprehensive, but I’m usually mistaken about such things. What other types of +metadata do you find essential for the use cases you’re familiar with? Do they +fit one of the types here, or do they require some other typology I’ve failed to +imagine? Hit the #extensions channel on the Postgres Slack to contribute to +the discussion, or give me a holler on Mastodon.

+

Meanwhile, I’ll be refining this typology and assigning all the metadata fields +to them in the coming weeks, with an eye to proposing a community-wide metadata +standard. I hope it will benefit us all; your input will ensure it does.

+ +
+ + + +]]>
+
+ + https://justatheory.com/2024/02/extension-versioning/ + <![CDATA[The History and Future of Extension Versioning]]> + + 2024-02-22T19:33:12Z + 2024-02-22T19:33:12Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + +
+

Every software distribution system deals with versioning. Early in the design of +PGXN, I decided to require semantic versions (SemVer), a +clearly-defined and widely-adopted version standard, even in its pre-1.0 +specification. I implemented the semver data type that would properly sort +semantic versions, later ported to C by Sam Vilain and eventually updated to +semver 2.0.0.

+
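To make that sorting concrete, here’s a minimal sketch using the semver data type, assuming the PGXN semver extension is installed:

CREATE EXTENSION semver;
SELECT '1.2.2'::semver < '1.2.3'::semver;        -- true
SELECT '1.2.3'::semver < '1.3.0'::semver;        -- true: precedence compares left to right
SELECT '1.0.0-beta1'::semver < '1.0.0'::semver;  -- true: pre-releases sort before releases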

As I’ve been thinking through the jobs and tools for the Postgres extension +ecosystem, I wanted to revisit this decision, the context in which it was made, +and survey the field for other options. Maybe a “PGXN v2” should do something +different?

+

But first that context, starting with Postgres itself.

+

PostgreSQL Extension Version Standard

+

From the introduction of extensions in PostgreSQL 9.1, the project side-stepped the need for version standardization and enforcement by requiring extension authors to adopt a file naming convention instead. For example, an extension named “pair” must have a file with its name, two dashes, then the version as listed in its control file, like so:

+
pair--1.1.sql
+

As long as the file name is correct and the version part byte-compatible with the control file entry, CREATE EXTENSION will find it. To upgrade an extension the author must provide a second file with the extension name, the old version, and the new version, all delimited by double dashes. For example, to upgrade our “pair” extension to version 1.2, the author supplies all the SQL commands necessary to upgrade it in this file:

+
pair--1.1--1.2.sql
+

This pattern avoids the whole question of version standards, ordering for +upgrades or downgrades, and all the rest: extension authors have full +responsibility to name their files correctly.

+
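In practice, those two files are all CREATE EXTENSION needs; a brief sketch of the upgrade dance, assuming the control file’s default_version is 1.1:

CREATE EXTENSION pair;                 -- runs pair--1.1.sql
ALTER EXTENSION pair UPDATE TO '1.2';  -- runs pair--1.1--1.2.sql
SELECT extversion FROM pg_extension WHERE extname = 'pair';  -- 1.2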

PGXN Versions

+

SemVer simplified a number of issues for PGXN in ways that the PostgreSQL +extension versioning did not (without having to re-implement the core’s file +naming code). PGXN wants all metadata for an extension in its META.json +file, and not to derive it from other sources that could change over time.

+

Following the CPAN model, PGXN also required that extension releases never decrease the version.1 The well-defined sortability of semantic versions made this validation trivial. PGXN later relaxed enforcement to allow updates to previously-released versions. SemVer’s clearly specified sorting made this change possible, as the major.minor.patch precedence intuitively compares from left to right.

+

In other words, if one had previously released version 1.2.2, then released +1.3.0, a follow-up 1.2.3 is allowed, increasing the 1.2.x branch version, but +not, say, 1.2.1, which decreases the 1.2.x branch version.

+

Overall, semantic versions have been great for clarity of versioning of PGXN extensions. The one bit of conflict comes from extensions that use some other version standard in the control file, usually a two-part x.y version not allowed by SemVer, which requires x.y.z (or, more specifically, major.minor.patch).

+

But such versions are usually compatible with SemVer, and because PGXN cares +only about the contents of the META.json, they’re free to use their own +versions in the control file, just as long as the META.json file uses SemVers.

+

For example, the recent nominatim_fdw v1.0.0 release, which of course lists +"version": "1.0.0" in its META.json file, sticks to its preferred +default_version = '1.0' in its control file. The extension author simply +appends .0 to create a valid SemVer from their preferred version, and as long +as they never use any other patch number, it remains compatible.

+
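The semver extension can even help with that conversion; a quick sketch from memory (so double-check against the extension’s docs), contrasting its lax to_semver() constructor with the strict cast:

SELECT to_semver('1.0');  -- 1.0.0
SELECT '1.0'::semver;     -- should raise an error: not a full major.minor.patch version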

Versioning Alternatives

+

Surveying the versioning landscape in 2024 yields a number of approaches. Might +we prefer an alternative for future extensions distribution? Let’s look at the +possibilities.

+

Ad Hoc Versions

+

As described above, the Postgres file naming convention allows ad hoc +versions. As far as I can tell, so does the R Project’s CRAN. This approach +seems fine for systems that don’t need to follow version changes themselves, but +much trickier for systems that do. If I want to install the latest version of an +extension, how does the installer know what that latest version is?

+

The answer is that the extension author must always release them in the proper +order. But if someone releases 1.3.1 of an extension, and then 1.2.1, well then +1.2.1 is the latest, isn’t it? It could get confusing pretty quickly.

+

Seems better to require some system, so that download and install clients can +get the latest version — or the latest maintenance version of an earlier +release if they need it.

+

User Choice

+

Quite a few registries allow users to choose their own versioning standards, but +generally with some very specific recommendations to prevent confusion for +users.

+
    +
  • Python Packaging is fairly liberal in the versions it allows, but strongly +recommends semantic versioning or calendar versioning +(more on that below).
  • +
  • CPAN (Perl) is also fairly liberal, due to its long history of module +distribution, but currently requires “Decimal versions”, which are evaluated +as floating-point numbers, or dotted integer versions, which require +three dot-separated positive integers and must begin with the letter v.
  • +
  • RubyGems does not enforce a versioning policy, but warns that “using an +‘irrational’ policy will only be a disservice to those in the community who +use your gems.” The project therefore urges developers to follow SemVer.
  • +
+

These three venerable projects date from an earlier period of registration and +distribution, and have made concessions to times when no policies existed. Their +solutions either try to cover as many legacy examples as possible while +recommending better patterns going forward (Python, Perl), or simply make +recommendations and punt responsibility to developers.

+

SemVer

+

More recently-designed registries avoid this problem by requiring some level of +versioning standard from their inception. Nearly all use SemVer, including:

+
    +
  • Go Modules, where “Each version starts with the letter v, followed by a +semantic version.”
  • +
  • Cargo (Rust), which “uses SemVer for specifying version numbers. This +establishes a common convention for what is compatible between different +versions of a package.”
  • +
  • npm, where the “version must be parseable by node-semver, which is +bundled with npm as a dependency.”
  • +
+

CalVer

+

CalVer eschews context-free incrementing integers in favor of +semantically-meaningful versions, at least for some subset of a version string. +In other words: make the version date-based. CalVer-versioned projects usually +include the year and sometimes the month. Some examples:

+
    +
  • Ubuntu uses YY.0M.MICRO, e.g., 23.04, released in April 2023, and +23.10.1, released in October 2023
  • +
  • Twisted uses YY.MM.MICRO, e.g., 22.4.0, released in April 2022
  • +
+

Ultimately, adoption of a CalVer format is more a choice about embedding calendar-based meaning into a version than about standardizing a specific format. One can of course use CalVer semantics in a semantic version, as in the Twisted example, which is fully SemVer-compliant.

+

In other words, adoption of CalVer need not necessitate rejection of SemVer.

+

Package Managers

+

What about package managers, like RPM and Apt? Some canonical examples:

+
    +
  • +

    RPM packages use the format:

    +
    <name>-<version>-<release>.<architecture>
    +

    Here <version> is the upstream version, but RPM practices a reasonable (if +baroque) version comparison of all its parts. But it does not impose a +standard on upstream packages, since they of course vary tremendously +between communities and projects.

    +
  • +
  • +

    Apt packages use a similar format:

    +
    [epoch:]upstream_version[-debian_revision]
    +

    Again, upstream_version is the version of the upstream package, and not +enforced by Apt.

    +
  • +
  • +

    APK (Alpine Linux) packages use the format

    +
    {digit}{.digit}...{letter}{_suf{#}}...{-r#}
    +

    I believe that {digit}{.digit}...{letter} is the upstream package version.

    +
  • +
+

This pattern makes perfect sense for registries that repackage software from +dozens of upstream sources that may or may not have their own policies. But a +system that defines the standard for a specific ecosystem, like Rust or +PostgreSQL, need not maintain that flexibility.

+

Recommendation

+

Given this survey, I’m inclined to recommend that the PostgreSQL community +follow the PGXN (and Go, and Rust, and npm) precedent and continue to rely on +and require semantic versions for extension distribution. It’s not +perfect, given the contrast with the core’s lax version requirements. CalVer +partisans can still use it, though with fewer formatting options (SemVer forbids +leading zeros, as in the Ubuntu 23.04 example).

+

But with its continuing adoption, and especially its requirement by more recent, +widely-used registries, and capacity to support date-based semantics for those +who desire it, I think it continues to make the most sense.

+
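Here’s a quick illustration of the leading-zero wrinkle mentioned above, using the semver data type (again from memory, so verify against the extension’s docs):

SELECT '22.4.0'::semver;   -- fine: Twisted-style CalVer is a valid semantic version
SELECT '23.04.0'::semver;  -- should raise an error: SemVer forbids the leading zero in "04"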

Wrong!

+

I’m probably wrong. I’m often mistaken in one way or another, on the details or +the conclusion. Please tell me how I’ve messed up! Find me on the #extensions +channel on the Postgres Slack or ping me on Mastodon.

+
+
+
  1. Why? Because every module on CPAN has one and only one entry in the index file. Ricardo Signes explains. ↩︎
+
+ +
+ + + +]]>
+
+ + https://justatheory.com/2024/02/decentralized-extension-publishing/ + <![CDATA[Contemplating Decentralized Extension Publishing]]> + + 2024-02-01T15:50:00Z + 2024-02-01T15:50:00Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + +
+

TL;DR

+

As I think through the future of the Postgres extension ecosystem as a key part +of the new job, I wanted to understand how Go decentralized publishing +works. In this post I work it out, and think through how we might do something +similar for Postgres extension publishing. It covers the +Go architecture, namespacing challenges, +and PGXS abuse; then experiments with +URL-based namespacing and ponders +reorganizing installed extension files; +and closes with a high-level design for +making it work now and in the future.

+

It is, admittedly, a lot, mainly written for my own edification and for the +information of my fellow extension-releasing travelers.

+

I find it fascinating and learned a ton. Maybe you will too! But feel free to +skip this post if you’re less interested in the details of the journey and want +to wait for more decisive posts once I’ve reached the destination.

+

Introduction

+

Most language registries require developers to take some step to make releases. +Many automate the process in CI/CD pipelines, but it requires some amount of +effort on the developer’s part:

+
    +
  • Register for an account
  • +
  • Learn how to format things to publish a release
  • +
  • Remember to publish again for every new version
  • +
  • Create a pipeline to automate publishing (e.g., a GitHub workflow)
  • +
+

Decentralized Publishing

+

Go decentralized publishing has revised this pattern: it does not require user registration or authentication to publish a module to pkg.go.dev. Rather, Go developers simply tag the source repository, and the first time someone refers to the tag in Go tools, the Go module index will include it.

+

For example, publishing v1.2.1 of a module in the github.com/golang/example +repository takes just three commands:

+
git tag v1.2.1 -sm 'Tag v1.2.1'
+git push --tags
+go list -m github.com/golang/example@v1.2.1
+

After a few minutes, the module will show up in the index and then on +pkg.go.dev. Anyone can run go get -u github.com/golang/example to get the +latest version. Go developers rest easy in the knowledge that they’re getting +the exact module they need thanks to the global checksum database, which Go +uses “in many situations to detect misbehavior by proxies or origin servers”.

+

This design requires go get to understand multiple source code management +systems: it supports Git, Subversion, Mercurial, Bazaar, and Fossil.1 +It also needs the go.mod metadata file to live in the project defining the +package.

+

But that’s really it. From the developer’s perspective it could not be easier to +publish a module, because it’s a natural extension of the module development +tooling and workflow of committing, tagging, and fetching code.

+

Decentralized Extension Publishing

+

Could we publish Postgres extensions in such a decentralized pattern? It might +look something like this:

+
    +
  • The developer places a metadata file in the proper location (control file, +META.json, Cargo.toml, whatever — standard TBD)
  • +
  • To publish a release, the developer tags the repository and calls some sort +of indexing service hook (perhaps from a tag-triggered release workflow)
  • +
  • The indexing service validates the extension and adds it to the index
  • +
+

Note that there is no registration required. It simply trusts the source code +repository. It also avoids name collision: github.com/bob/hash +is distinct from github.com/carol/hash.

+

This design does raise challenges for clients, whether they’re compiling +extensions on a production system or building binary packages for distribution: +they have to support various version control systems to pull the code (though +starting with Git is a decent 90% solution).

+

Namespacing

+

Then there are name conflicts. Perhaps github.com/bob/hash and github.com/carol/hash both create an extension named hash. By the current control file format, the script directory and module path can use any name, but in all likelihood they use these defaults:

+
directory = 'extension'
+module_pathname = '$libdir/hash'
+

Meaning .sql files will be installed in the Postgres share/extension +subdirectory — along with all the other installed extensions — and library +files will be installed in the library directory along with all other libraries. +Something like this:

+
pgsql
+├── lib
+│   └── hash.so
+└── share
+    └── extension
+    │   └── hash.control
+    │   ├── hash--1.0.0.sql
+    └── doc
+        └── hash.md
+

If both projects include, say, hash.control, hash--1.0.0.sql, and hash.so, +the files from one will stomp all over the files of the other.

+

Installer Abuse

+

Go avoids this issue by using the domain and path from each package’s repository +in its directory structure. For example, here’s a list of modules from +google.golang.org repositories:

+
$ ls -1 ~/go/pkg/mod/google.golang.org
+api@v0.134.0
+api@v0.152.0
+appengine@v1.6.7
+genproto
+genproto@v0.0.0-20230731193218-e0aa005b6bdf
+grpc@v1.57.0
+grpc@v1.59.0
+protobuf@v1.30.0
+protobuf@v1.31.0
+protobuf@v1.32.0
+

The ~/go/pkg/mod directory has subdirectories for each VCS host name, and each +then subdirectories for package paths. For the github.com/bob/hash example, +the files would all live in ~/go/pkg/mod/github.com/bob/hash.

+

Could a Postgres extension build tool follow a similar distributed pattern by +renaming the control file and installation files and directories to something +specific for each, say github.com+bob+hash and github.com+carol+hash? That +is, using the repository host name and path, but replacing the slashes in the +path with some other character that wouldn’t create subdirectories — because +PostgreSQL won’t find control files in subdirectories. The control file entries +for github.com/carol/hash would look like this:

+
directory = 'github.com+carol+hash'
+module_pathname = '$libdir/github.com+carol+hash'
+

Since PostgreSQL expects the control file to have the same name as the +extension, and for SQL scripts to start with that name, the files would have to +be named like so:

+
hash
+├── Makefile
+├── github.com+carol+hash.control
+└── sql
+    └── github.com+carol+hash--1.0.0.sql
+

And the Makefile contents:

+
EXTENSION  = github.com+carol+hash
+MODULEDIR  = $(EXTENSION)
+DATA       = sql/$(EXTENSION)--1.0.0.sql
+PG_CONFIG ?= pg_config
+
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+

In other words, the extension name is the full repository host name and path, and the Makefile MODULEDIR variable tells pg_config to put all the SQL and documentation files into a directory named github.com+carol+hash — preventing them from conflicting with any other extension.

+

Finally, the github.com+carol+hash.control file — so named because it must have the same name as the extension — contains:

+
default_version = '1.0.0'
+relocatable = true
+directory = 'github.com+carol+hash'
+module_pathname = '$libdir/github.com+carol+hash'
+

Note the directory parameter, which must match MODULEDIR from the +Makefile, so that CREATE EXTENSION can find the SQL files. Meanwhile, +module_pathname ensures that the library file has a unique name — the same +as the long extension name — again to avoid conflicts with other projects.

+

That unsightly naming extends to SQL: using the URL format could get to be a +mouthful:

+
CREATE EXTENSION "github.com+carol+hash";
+

Which is do-able, but some new SQL syntax might be useful, perhaps something +like:

+
CREATE EXTENSION hash FROM "github.com+carol+hash";
+

Or, if we’re gonna really go for it, use slashes after all!

+
CREATE EXTENSION hash FROM "github.com/carol/hash";
+

Want to use both extensions but they have conflicting objects (e.g., both create a “hash” data type)? Put them into separate schemas (assuming relocatable = true in the control file):

+
CREATE EXTENSION hash FROM "github.com/carol/hash" WITH SCHEMA carol;
+CREATE EXTENSION hash FROM "github.com/bob/hash" WITH SCHEMA bob;
+CREATE TABLE try (
+    h1 carol.hash,
+    h2 bob.hash
+);
+

Of course it would be nice if PostgreSQL added support for something like +Oracle packages, but using schemas in the meantime may be sufficient.

+

Clearly we’re getting into changes to the PostgreSQL core, so put that aside and +we can just use long names for creating, modifying, and dropping extensions, but +not necessarily otherwise:

+
CREATE EXTENSION "github.com+carol+hash" WITH SCHEMA carol;
+CREATE EXTENSION "github.com+bob+hash" WITH SCHEMA bob;
+CREATE EXTENSION "gitlab.com+barack+kicker_type";
+CREATE TABLE try (
+    h1 carol.hash,
+    h2 bob.hash
+    kt kicker
+);
+

Namespacing Experiment

+

To confirm that this approach might work, I committed 24134fd and pushed it in +the namespace-experiment branch of the semver extension. This commit changes +the extension name from semver to github.com+theory+pg-semver, and follows +the above steps to ensure that its files are installed with that name.

+

Abusing the Postgres extension installation infrastructure like this does +work, but suffers from a number of drawbacks, including:

+
    +
  • The extension name is super long, as before, but now so too are the files in +the repository (as opposed to the installer renaming them on install). The +shared library file has to have the long name, so therefore does the .c +source file. The SQL files must all start with +github.com+theory+pg-semver, although I skipped that bit in this commit; +instead the Makefile generates just one from sql/semver.sql.
  • +
  • Any previous installation of the semver type would remain unchanged, with +no upgrade path. Changing an extension’s name isn’t a great idea.
  • +
+

I could probably script renaming and modifying file contents like this and make +it part of the build process, but it starts to get complicated. We could also +modify installers to make the changes, but there are a bunch of moving parts +they would have to compensate for, and given how dynamic this can be (e.g., the +semver Makefile reads the extension name from META.json), we would rapidly +enter the territory of edge case whac-a-mole. I suspect it’s simply too +error-prone.

+

Proposal: Update Postgres Extension Packaging

+

Perhaps the Go directory pattern could inspire a similar model in Postgres, +eliminating the namespace issue by teaching the Postgres extension +infrastructure to include all but one of the files for an extension in a +single directory. In other words, rather than files distributed like so for +semver:

+
pgsql
+├── lib
+│   └── semver.so
+└── share
+    └── extension
+    │   └── semver.control
+    │   ├── semver--0.32.1.sql
+    │   ├── semver--0.32.0--0.32.1.sql
+    └── doc
+        └── semver.md
+

Make it more like this:

+
pgsql
+└── share
+    └── extension
+        └── github.com
+            └── theory
+                └── pg-semver
+                    └── extension.control
+                    └── lib
+                    │   └── semver.so
+                    └── sql
+                    │   └── semver--0.32.1.sql
+                    │   └── semver--0.32.0--0.32.1.sql
+                    └── doc
+                        └── semver.md
+

Or perhaps:

+
pgsql
+└── share
+    └── extension
+        └── github.com
+            └── theory
+                └── pg-semver
+                    └── extension.control
+                    └── semver.so
+                    └── semver--0.32.1.sql
+                    └── semver--0.32.0--0.32.1.sql
+                    └── semver.md
+

The idea is to copy the files exactly as they’re stored in or compiled in the +repository. Meanwhile, the new semver.name file — the only relevant file +stored outside the extension module directory — simply points to that path:

+
github.com/theory/pg-semver
+

Then for CREATE EXTENSION semver, Postgres reads semver.name and knows where +to find all the files to load the extension.

+

This configuration would require updates to the control file, now named +extension.control, to record the full package name and appropriate locations. +Add:

+
name = 'semver'
+package = 'github.com/theory/pg-semver'
+

This pattern could also allow aliasing. Say we try to install a different +semver extension from github.com/example/semver. This is in its +extension.control file:

+
name = 'semver'
+package = 'github.com/example/pg-semver'
+

The installer detects that semver.name already exists for a different package +and raises an error. The user could then give it a different name by running +something like:

+
make install ALIAS_EXTENSION_NAME=semver2
+

This would add semver2.name right next to semver.name, and its contents would contain github.com/example/semver, where all of its files are installed. This would allow CREATE EXTENSION semver2 to load it without issue (assuming no object conflicts, hopefully resolved by relocate-ability).

+

I realize a lot of extensions with libraries could wreak some havoc on the +library resolver having to search so many library directories, but perhaps +there’s some way around that as well? Curious what techniques experienced C +developers might have adopted.

+

Back to Decentralized Publishing

+

An updated installed extension file structure would be nice, and is surely worth +a discussion, but even if it shipped in Postgres 20, we need an updated +extension ecosystem today, to work well with all supported versions of Postgres. +So let’s return to the idea of decentralized publishing without such changes.

+

I can think of two pieces that’d be required to get Go-style decentralized +extension publishing to work with the current infrastructure.

+

Module Uniqueness

+

The first is to specify a new metadata field to be unique for the entire index, +and which would contain the repository path. Call it module, after Go (a +single Git repository can have multiple modules). In PGXN Meta Spec-style JSON +it’d look something like this:

+
{
+    "module": "github.com/theory/pg-semver",
+    "version": "0.32.1",
+    "provides": {
+      "semver": {
+         "abstract": "A semantic version data type",
+      }
+    }
+}
+

Switch from the PGXN-style uniqueness on the distribution name (usually the name +of the extension) and let the module be globally unique. This would allow +another party to release an extension with the same name. Even a fork where only +the module is changed:

+
{
+    "module": "github.com/example/pg-semver",
+    "version": "0.32.1",
+    "provides": {
+      "semver": {
+         "abstract": "A semantic version data type",
+      }
+    }
+}
+

Both would be indexed and appear under the module name, and both would be +find-able by the provided extension name, semver.

+

Where that name must still be unique is in a given install. In other words, +while github.com/theory/pg-semver and github.com/example/pg-semver both +exist in the index, the semver extension can be installed from only one of +them in a given Postgres system, where the extension name semver defines its +uniqueness.

+

This pattern would allow for much more duplication of ideas while preserving the +existing per-cluster namespacing. It also allows for a future Postgres release +that supports something like the flexible per-cluster packaging as described +above.2

+

Extension Toolchain App

+

The second piece is an extension management application that understands all +this stuff and makes it possible. It would empower both extension development +workflows — including testing, metadata management, and releasing — and +extension user workflows — finding, downloading, building, and installing.

+

Stealing from Go, imagine a developer making a release with something like this:

+
git tag v1.2.1 -sm 'Tag v1.2.1'
+git push --tags
+pgmod list -m github.com/theory/pg-semver@v1.2.1
+

The creatively named pgmod tells the registry to index the new version +directly from its Git repository. Thereafter anyone can find it and install it +with:

+
    +
  • pgmod get github.com/theory/pg-semver@v1.2.1 — installs the specified version
  • +
  • pgmod get github.com/theory/pg-semver — installs the latest version
  • +
  • pgmod get semver — installs the latest version or shows a list of +matching modules to select from
  • +
+

Any of these would fail if the cluster already has an extension named semver +with a different module name. But with something like the updated extension +installation locations in a future version of Postgres, that limitation could be +loosened.

+

Challenges

+

Every new idea comes with challenges, and this little thought experiment is no +exception. Some that immediately occur to me:

+
    +
  • Not every extension can be installed directly from its repository. Perhaps +the metadata could include a download link for a tarball with the results of +any pre-release execution?
  • +
  • Adoption of a new CLI could be tricky. It would be useful to include the +functionality in existing tools people already use, like pgrx.
  • +
  • Updating the uniqueness constraint in existing systems like PGXN might be +a challenge. Most record the repository info in the resources META.json +object, so it would be do-able to adapt into a new META format, either +on PGXN itself or in a new registry, should we choose to build one.
  • +
  • Getting everyone to standardize on standardized versioning tags might take +some effort. Go had the benefit of controlling its entire toolchain, while +Postgres extension versioning and release management has been all over the +place. However PGXN long ago standardized on semantic versioning and +those who have released extensions on PGXN have had few issues (one can +still use other version formats in the control file, for better or worse).
  • +
  • Some PGXN distributions have shipped different versions of extensions in a +single release, or the same version as in other releases. The release +version of the overall package (repository, really) would have to become +canonical.
  • +
+

I’m sure there are more, I just thought of these offhand. What have you thought +of? Post ’em if you got ’em in the #extensions channel on the Postgres +Slack, or give me a holler on Mastodon or via email.

+
+
+
  1. Or does it? Yes, it does. Although the Go CLI downloads most public modules from a module proxy server like proxy.golang.org, it still must know how to download modules from a version control system when a proxy is not available. ↩︎
  2. Assuming, of course, that if and when the Postgres core adopts more bundled packaging, they’d use the same naming convention as we have in the broader ecosystem. Not a perfectly safe assumption, but given the Go precedent and wide adoption of host/path-based projects, it seems sound. ↩︎
+
+ +
+ + + +]]>
+
+ + https://justatheory.com/2024/01/pgxn-tools-v1.4/ + <![CDATA[PGXN Tools v1.4]]> + + + 2024-01-31T17:13:40Z + 2024-01-31T17:13:40Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + +
+

Over on the PGXN Blog I’ve posted a brief update on recent bug fixes and +improvements to the pgxn-tools Docker image, which is used fairly widely these +days to test, bundle, and release Postgres extensions to PGXN. This fix is +especially important for Git repositories:

+
+

v1.4.1 fixes an issue where git archive was never actually used to build a +release zip archive. This changed at some point without noticing due to the +introduction of the safe.directory configuration in recent versions of Git. +Inside the container the directory was never trusted, and the pgxn-bundle +command caught the error, decided it wasn’t working with a Git repository, and +used the zip command, instead.

+
+

I also posted a gist listing PGXN distributions with a .git directory.

+ +
+ + + +]]>
+
+ + https://justatheory.com/2024/01/pgxn-challenges/ + <![CDATA[PGXN Challenges]]> + + 2024-01-30T00:11:11Z + 2024-01-30T00:11:11Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + +
+ PGXN Gear +
+ +
+

Last week, I informally shared Extension Ecosystem: Jobs and Tools with colleagues in the #extensions channel on the Postgres Slack. The document surveys the jobs to be done by the ideal Postgres extension ecosystem and suggests the tools and services required to do those jobs — without reference to existing extension registries and packaging systems.

+

The last section enumerates some questions we need to ponder and answer. The +first one on the list is:

+
+

What will PGXN’s role be in this ideal extension ecosystem?

+
+

The PostgreSQL Extension Network, or PGXN, is the original extension +distribution system, created 2010–11. It has been a moderate success, but as we +in the Postgres community imagine the ideal extension distribution future, it’s +worthwhile to also critically examine existing tools like PGXN, both to inform +the project and to realistically determine their roles in that future.

+

With that in mind, I here jot down some thoughts on the challenges with PGXN.

+

PGXN Challenges

+

PGXN sets a lot of precedents, particularly in its decoupling of the registry +from the APIs and services that depend on it. It’s not an all-in-one thing, and +designed for maximum distributed dissemination via rsync and static JSON files.

+

But there are a number of challenges with PGXN as it currently stands; a +sampling:

+
    +
  • +

    PGXN has not comprehensively indexed all public PostgreSQL extensions. While +it indexes more extensions than any other registry, it falls far short of +all known extensions. To be a truly canonical registry, we need to make it +as simple as possible for developers to register their extensions. (More +thoughts on that topic in a forthcoming post.)

    +
  • +
  • +

    In that vein, releasing extensions is largely a manual process. The +pgxn-tools Docker image has improved the situation, allowing developers to +create relatively simple GitHub workflows to automatically test and +release extensions. Still, it requires intention and work by extension +developers. The more seamless we can make publishing extensions the better. +(More thoughts on that topic in a forthcoming post.)

    +
  • +
  • +

    It’s written in Perl, and therefore doesn’t feel modern or easily +accessible to other developers. It’s also a challenge to build and +distribute the Perl services, though Docker images could mitigate this +issue. Adopting a modern compiled language like Go or Rust might +increase community credibility and attract more contributions.

    +
  • +
  • +

    Similarly, pgxnclient is written in Python and the pgxn-utils +developer tools in Ruby, increasing the universe of knowledge and skill +required for developers to maintain all the tools. They’re also more +difficult to distribute than compiled tools would be. Modern +cross-compilable languages like Go and Rust once again simplify +distribution and are well-suited to building both web services and CLIs (but +not, perhaps native UX applications — but then neither are dynamic +languages like Ruby and Python).

    +
  • +
  • +

    The PGXN Search API uses the Apache Lucy search engine library, a +project that retired in 2018. Moreover, the feature never worked very +well, thanks to the decision to expose separate search indexes for different +objects — and requiring the user to select which to search. People often +can’t find what they need because the selected index doesn’t contain it. +Worse, the default index on the site is “Documentation”, on the +surface a good choice. But most extensions include no documentation other +than the README, which appears in the “Distribution” index, not +“Documentation”. Fundamentally the search API and UX needs to be completely +re-architected and -implemented.

    +
  • +
  • +

    PGXN uses its own very simple identity management and basic +authentication. It would be better to have tighter community identity, +perhaps through the PostgreSQL community account.

    +
  • +
+

Given these issues, should we continue building on PGXN, rewrite some or all of its components, or abandon it for new services? The answer may come as a natural result of designing the overall extension ecosystem architecture or from the motivations of community consensus. But perhaps not. In the end, we’ll need a clear answer to the question.

+

What are your thoughts? Hit us up in the #extensions channel on the Postgres +Slack, or give me a holler on Mastodon or via email. We expect to start +building in earnest in February, so now’s the time!

+ +
+ + + +]]>
+
+ + https://justatheory.com/2024/01/tembonaut/ + <![CDATA[I’m a Postgres Extensions Tembonaut]]> + + 2024-01-22T17:00:26Z + 2024-01-22T17:00:26Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + +
+ Tembo Logo +
+ +
+

New year, new job.

+

I’m pleased to announce that I started a new job on January 2 at Tembo, a +fully-managed PostgreSQL developer platform. Tembo blogged the news, too.

+

I first heard from Tembo CTO Samay Sharma last summer, when he inquired about +the status of PGXN, the PostgreSQL Extension Network, which I built in +2010–11. Tembo bundles extensions into Postgres stacks, which let developers +quickly spin up Postgres clusters with tools and features optimized for specific +use cases and workloads. The company therefore needs to provide a wide variety +of easy-to-install and well-documented extensions to power those use cases. +Could PGXN play a role?

+

I’ve tended to PGXN’s maintenance for the last fourteen years, thanks in no small part to hosting provided by depesz. As of today’s stats it distributes 376 extensions on behalf of 419 developers. PGXN has been a moderate success, but Samay asked how we could collaborate to build on its precedent to improve the extensions ecosystem overall.

+

It quickly became apparent that we share a vision for what that ecosystem could +become, including:

+
    +
  • Establishing the canonical Postgres community index of extensions, something +PGXN has yet to achieve
  • +
  • Improving metadata standards to enable new patterns, such as automated binary +packaging
  • +
  • Working with the Postgres community to establish documentation standards +that encourage developers to provide comprehensive extension docs
  • +
  • Designing and building developer tools that empower more developers to +build, test, distribute, and maintain extensions
  • +
+

Over the past decade I’ve had many ideas and discussions on these topics, but seldom had the bandwidth to work on them. In the last couple of years I’ve enabled TLS and improved the site display, increased password security, and added a notification queue with hooks that post to both Twitter (RIP @pgxn) and Mastodon (@pgxn@botsin.space). Otherwise, aside from keeping the site going, periodically approving new accounts, and eyeing the latest releases, I’ve had little bandwidth for PGXN or the broader extension ecosystem.

+

Now, thanks to the vision and strategy of Samay and Tembo CEO Ry Walker, I +will focus on these projects full time. The Tembo team have already helped me +enumerate the extension ecosystem jobs to be done and the tools required to do +them. This week I’ll submit it to collaborators from across the Postgres +community1 to fill in the missing parts, make adjustments and +improvements, and work up a project plan.

+

The work also entails determining the degree to which PGXN and other extension +registries (e.g., dbdev, trunk, pgxman, pgpm (WIP), etc.) will play a +role or provide inspiration, what bits should be adopted, rewritten, or +discarded.2 Our goal is to build the foundations for a community-owned +extensions ecosystem that people care about and will happily adopt and +contribute to.

+

I’m thrilled to return to this problem space, re-up my participation in the PostgreSQL community, and work with great people to build out the extensions ecosystem for the future.

+

Want to help out or just follow along? Join the #extensions channel on the +Postgres Slack. See you there.

+
+
+
  1. Tembo was not the only company whose representatives have reached out in the past year to talk about PGXN and improving extensions. I’ve also had conversations with Supabase, Omnigres, Hydra, and others. ↩︎
  2. Never be afraid to kill your darlings. ↩︎
+
+ +
+ + + +]]>
+
+ + https://justatheory.com/2023/10/sql-jsonpath-operators/ + <![CDATA[JSON Path Operator Confusion]]> + + 2023-10-14T22:39:55Z + 2023-10-14T22:39:55Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + @@ and @? +confused me. Here’s how I figured out the difference.]]> + + +
+

The CipherDoc service offers a robust secondary key lookup API and search +interface powered by JSON/SQL Path queries run against a GIN-indexed JSONB +column. SQL/JSON Path, introduced in SQL:2016 and added to Postgres in +version 12 in 2019, nicely enables an end-to-end JSON workflow and entity +lifecycle. It’s a powerful enabler and fundamental technology underpinning +CipherDoc. I’m so happy to have found it.

+

Confusion

+

However, the distinction between the SQL/JSON Path operators @@ and @? +confused me. Even as I found that the @? operator worked for my needs and @@ +did not, I tucked the problem into my mental backlog for later study.

+

The question arose again on a recent work project, and I can take a hint. It’s +time to figure this thing out. Let’s see where it goes.

+

The docs say:

+
+
+
jsonb @? jsonpath → boolean
+
Does JSON path return any item for the specified JSON value?
+
+

'{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)' → t

+
+
+
jsonb @@ jsonpath → boolean
+
Returns the result of a JSON path predicate check for the specified JSON +value. Only the first item of the result is taken into account. If the +result is not Boolean, then NULL is returned.
+
+

'{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2' → t

+
+

These read quite similarly to me: Both return true if the path query returns an +item. So what’s the difference? When should I use @@ and when @?? I went so +far as to ask Stack Overflow about it. The one answer directed my attention +back to the jsonb_path_query() function, which returns the results from a +path query.

+
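Before diving into queries, here’s a minimal side-by-side sketch of the two operators against the documentation’s sample value, as I read the docs quoted above:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)';  -- t: the path returns at least one item
select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2';        -- t: the predicate itself evaluates true
select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] ? (@ > 2)';  -- null: first item is 3, not a boolean

Keep that last case in mind; it’s where the two operators diverge.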

So let’s explore how various SQL/JSON Path queries work, and what values various expressions return.

+

Queries

+

The docs for jsonb_path_query say:1

+
+
+
jsonb_path_query ( target jsonb, path jsonpath [, vars jsonb [, silent boolean ]] ) → setof jsonb
+
Returns all JSON items returned by the JSON path for the specified JSON +value. If the vars argument is specified, it must be a JSON object, and +its fields provide named values to be substituted into the jsonpath +expression. If the silent argument is specified and is true, the +function suppresses the same errors as the @? and @@ operators do.
+
+
select * from jsonb_path_query(
+    '{"a":[1,2,3,4,5]}',
+    '$.a[*] ? (@ >= $min && @ <= $max)',
+    '{"min":2, "max":4}'
+) 
+ jsonb_path_query
+------------------
+ 2
+ 3
+ 4
+
+
+
+

The first thing to note is that a SQL/JSON Path query may return more than one +value. This feature matters for the @@ and @? operators, which return a +single boolean value based on the values returned by a path query. And path queries +can return a huge variety of values. Let’s explore some examples, derived from +the sample JSON value and path query from the docs.2

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
+    jsonb_path_query    
+------------------------
+ {"a": [1, 2, 3, 4, 5]}
+(1 row)
+

This query returns the entire JSON value, because that’s what $ selects at the +start of the path expression. The ?() filter returns true because its +predicate expression finds at least one value in the $.a array greater than +2. Here’s what happens when the filter returns false:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 5)');
+ jsonb_path_query 
+------------------
+(0 rows)
+

None of the values in the $.a array are greater than five, so the query +returns no value.

+

To select just the array, append it to the path expression after the ?() +filter:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2).a');
+ jsonb_path_query 
+------------------
+ [1, 2, 3, 4, 5]
+(1 row)
+

Path Modes

+

One might think you could select $.a at the start of the path query to get the +full array if the filter returns true, but look what happens:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a ?(@[*] > 2)');
+ jsonb_path_query 
+------------------
+ 3
+ 4
+ 5
+(3 rows)
+

That’s not the array, but the individual array values that each match the +predicate. Turns out this is a quirk of the Postgres implementation of path +modes. From what I can glean, the SQL:2016 standard dictates something like +these SQL Server descriptions:

+
+
  • In lax mode, the function returns empty values if the path expression contains an error. For example, if you request the value $.name, and the JSON text doesn’t contain a name key, the function returns null, but does not raise an error.
  • In strict mode, the function raises an error if the path expression contains an error.
+
+

But the Postgres lax mode does more than suppress errors. From the docs (emphasis added):

+
+

The lax mode facilitates matching of a JSON document structure and path +expression if the JSON data does not conform to the expected schema. If an +operand does not match the requirements of a particular operation, it can be +automatically wrapped as an SQL/JSON array or unwrapped by converting its +elements into an SQL/JSON sequence before performing this operation. +Besides, comparison operators automatically unwrap their operands in the lax +mode, so you can compare SQL/JSON arrays out-of-the-box.

+
+

There are a few more details, but this is the crux of it: In lax mode, which is +the default, Postgres always unwraps an array. Hence the unexpected list of +results.3 This could be particularly confusing when querying multiple +rows:

+
select jsonb_path_query(v, '$.a ?(@[*] > 2)')
+        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
+ jsonb_path_query 
+------------------
+ 3
+ 4
+ 5
+ 3
+ 5
+ 8
+(6 rows)
+

Switching to strict mode by prepending strict to the JSON Path query restores the expected behavior:

+
select jsonb_path_query(v, 'strict $.a ?(@[*] > 2)')
+        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
+ jsonb_path_query 
+------------------
+ [1, 2, 3, 4, 5]
+ [3, 5, 8]
+(2 rows)
+

Important gotcha to watch for, and a good reason to test path queries thoroughly +to ensure you get the results you expect. Lax mode nicely prevents errors when a +query references a path that doesn’t exist, as this simple example demonstrates:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
+ERROR:  JSON object does not contain key "b"
+
+select jsonb_path_query('{"a":[1,2,3,4,5]}', 'lax $.b');
+ jsonb_path_query 
+------------------
+(0 rows)
+

In general, I suggest always using strict mode when executing queries. Better +still, perhaps always prefer strict mode with our friends the @@ and @? +operators, which suppress some errors even in strict mode:

+
+

The jsonpath operators @? and @@ suppress the following errors: missing +object field or array element, unexpected JSON item type, datetime and numeric +errors. The jsonpath-related functions described below can also be told to +suppress these types of errors. This behavior might be helpful when searching +JSON document collections of varying structure.

+
+

Have a look:

+
select '{"a":[1,2,3,4,5]}' @? 'strict $.a';
+ ?column? 
+----------
+ t
+(1 row)
+
+select '{"a":[1,2,3,4,5]}' @? 'strict $.b';
+ ?column? 
+----------
+ <null>
+(1 row)
+

No error for the unknown JSON key b in that second query! As for the error +suppression in the jsonpath-related functions, that’s what the silent +argument does. Compare:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
+ERROR:  JSON object does not contain key "b"
+
+select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b', '{}', true);
+ jsonb_path_query 
+------------------
+(0 rows)
+

Boolean Predicates

+

The Postgres SQL/JSON Path Language docs briefly mention a pretty significant +deviation from the SQL standard:

+
+

A path expression can be a Boolean predicate, although the SQL/JSON standard +allows predicates only in filters. This is necessary for implementation of the +@@ operator. For example, the following jsonpath expression is valid in +PostgreSQL:

+

$.track.segments[*].HR < 70

+
+

This pithy statement has pretty significant implications for the return value +of a path query. The SQL standard allows predicate expressions, which are akin +to an SQL WHERE expression, only in ?() filters, as seen previously:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
+    jsonb_path_query    
+------------------------
+ {"a": [1, 2, 3, 4, 5]}
+(1 row)
+

This can be read as “return the path $ if @.a[*] > 2 is true.” But have a look at a predicate-only path query:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a[*] > 2');
+ jsonb_path_query 
+------------------
+ true
+(1 row)
+

This path query can be read as “Return the result of the predicate $.a[*] > 2”, which in this case is true. This is quite the divergence from the standard, which returns contents from the JSON queried, while a predicate query returns the result of the predicate expression itself. It’s almost like they’re two different things!

+

Don’t confuse the predicate path query return value with selecting a boolean +value from the JSON. Consider this example:

+
select jsonb_path_query('{"a":[true,false]}', '$.a ?(@[*] == true)');
+ jsonb_path_query 
+------------------
+ true
+(1 row)
+

Looks the same as the predicate-only query, right? But it’s not, as shown by +adding another true value to the $.a array:

+
select jsonb_path_query('{"a":[true,false,true]}', '$.a ?(@[*] == true)');
+ jsonb_path_query 
+------------------
+ true
+ true
+(2 rows)
+

This path query returns the trues it finds in the $.a array. The fact that it returns values from the JSON rather than the filter predicate becomes more apparent in strict mode, which returns all of $.a if one or more elements of the array has the value true:

+
select jsonb_path_query('{"a":[true,false,true]}', 'strict $.a ?(@[*] == true)');
+  jsonb_path_query   
+---------------------
+ [true, false, true]
+(1 row)
+

This brief aside, and its mention of the @@ operator, turns out to be key to understanding the difference between @? and @@. Because it’s not just that this feature is “necessary for implementation of the @@ operator”. No, I would argue that it’s the only kind of expression usable with the @@ operator.

+

Match vs. Exists

+

Let’s get back to the @@ operator. We can use a boolean predicate JSON Path +like so:

+
select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2';
+ ?column? 
+----------
+ t
+(1 row)
+

It returns true because the predicate JSON path query $.a[*] > 2 returns true. +And when it returns false?

+
select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 6';
+ ?column? 
+----------
+ f
+(1 row)
+

So far so good. What happens when we try to use a filter expression that returns +a true value selected from the JSONB?

+
select '{"a":[true,false]}'::jsonb @@ '$.a ?(@[*] == true)';
+ ?column? 
+----------
+ t
+(1 row)
+

Looks right, doesn’t it? But recall that this query returns all of the true values from $.a, but @@ wants only a single boolean. What happens when we add another?

+
select '{"a":[true,false,true]}'::jsonb @@ 'strict $.a ?(@[*] == true)';
+ ?column? 
+----------
+ <null>
+(1 row)
+

Now it returns NULL, even though it’s clearly true that @[*] == true +matches. This is because it returns all of the values it matches, as +jsonb_path_query() demonstrates:

+
select jsonb_path_query('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
+ jsonb_path_query 
+------------------
+ true
+ true
+(2 rows)
+

This clearly violates the @@ documentation claim that “Only the first item of +the result is taken into account”. If that were true, it would see the first +value is true and return true. But it doesn’t. Turns out, the corresponding +jsonb_path_match() function shows why:

+
select jsonb_path_match('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
+ERROR:  single boolean result is expected
+

Conclusion: The documentation is inaccurate. Only a single boolean is expected +by @@. Anything else is an error.

+

Furthermore, it’s dangerous, at best, to use an SQL standard JSON Path expression with @@. If you need to use it with a filter expression, you can turn it into a boolean predicate by wrapping it in exists():

+
select jsonb_path_match('{"a":[true,false,true]}'::jsonb, 'exists($.a ?(@[*] == true))');
+ jsonb_path_match 
+------------------
+ t
+(1 row)
+

But there’s no reason to do so, because that’s effectively what the @? operator (and the corresponding, cleverly-named jsonb_path_exists() function) does: it returns true if the SQL standard JSON Path expression contains any results:

+
select '{"a":[true,false,true]}'::jsonb @? '$.a ?(@[*] == true)';
+ ?column? 
+----------
+ t
+(1 row)
+

Here’s the key thing about @?: you don’t want to use a boolean predicate path +query with it, either. Consider this predicate-only query:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, '$.a[*] > 6');
+ jsonb_path_query 
+------------------
+ false
+(1 row)
+

But see what happens when we use it with @?:

+
select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] > 6';
+ ?column? 
+----------
+ t
+(1 row)
+

It returns true even though the query itself returns false! Why? Because false +is a value that exists and is returned by the query. Even a query that returns +null is considered to exist, as it will when a strict query encounters an +error:

+
select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, 'strict $[*] > 6');
+ jsonb_path_query 
+------------------
+ null
+(1 row)
+
+select '{"a":[1,2,3,4,5]}'::jsonb @? 'strict $[*] > 6';
+ ?column? 
+----------
+ t
+(1 row)
+

The key thing to know about the @? operator is that it returns true if +anything is returned by the path query, and returns false only if nothing is +selected at all.
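For contrast, here’s a quick example of my own (not from the docs): a filter that selects nothing at all, so @? returns false rather than true or NULL:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 10)';
 ?column? 
----------
 f
(1 row)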

+

The Difference

+

In summary, the difference between the @? and @@ JSONB operators is this:

+
  • @? (and jsonb_path_exists()) returns true if the path query returns any values — even false or null — and false if it returns no values. This operator should be used only with SQL-standard JSON path queries that select data from the JSONB. Do not use predicate-only JSON path expressions with @?.
  • @@ (and jsonb_path_match()) returns true if the path query returns the single boolean value true and false otherwise. This operator should be used only with Postgres-specific boolean predicate JSON path queries, which return the result of the predicate expression rather than data selected from the JSONB. Do not use SQL-standard JSON path expressions with @@.
+
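To make the rule of thumb concrete, here’s a minimal recap sketch of my own, using the same sample value as the examples above:

-- SQL-standard path query: ask @? whether anything matches.
select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)';   -- t

-- Postgres-specific boolean predicate: ask @@ to evaluate the predicate itself.
select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2';          -- t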

This difference of course assumes awareness of this distinction between +predicate path queries and SQL standard path queries. To that end, I submitted +a patch that expounds the difference between these types of JSON Path +queries, and plan to submit another linking these differences in the docs for +@@ and @?.

+

Oh, and probably another to explain the difference in return values between +strict and lax queries due to array unwrapping.

+

Thanks

+

Many thanks to Erik Wienhold for patiently answering my pgsql-hackers +questions and linking me to a detailed pgsql-general thread in which the +oddities of @@ were previously discussed in detail.

+
+
+
  1. Well almost. The docs for jsonb_path_query actually say, about the last two arguments, “The optional vars and silent arguments act the same as for jsonb_path_exists.” I replaced that sentence with the relevant sentences from the jsonb_path_exists docs, about which more later. ↩︎

  2. Though omitting the vars argument, as variable interpolation just gets in the way of understanding basic query result behavior. ↩︎

  3. In fairness, the Oracle docs also discuss “implicit array wrapping and unwrapping”, but I don’t have a recent Oracle server to experiment with at the moment. ↩︎
+
+ +
+ + + +]]>
+
+ + https://justatheory.com/2023/10/cipherdoc/ + <![CDATA[CipherDoc: A Searchable, Encrypted JSON Document Service on Postgres]]> + + 2023-10-03T23:02:21Z + 2023-10-01T21:36:13Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + +
+

Over the last year, I designed and implemented a simple web service, code-named +“CipherDoc”, that provides a CRUD API for creating, updating, searching, and +deleting JSON documents. The app enforces document structure via JSON +schema, while JSON/SQL Path powers the search API by querying a hashed subset +of the schema stored in a GIN-indexed JSONB column in Postgres.
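The post doesn’t include the schema, but a minimal sketch of the storage layout described above might look something like this (the table, column, and index names are my own assumptions, not taken from the service):

CREATE TABLE documents (
    id       uuid  PRIMARY KEY,
    document bytea NOT NULL, -- the encrypted JSON document
    search   jsonb NOT NULL  -- hashed subset of schema fields used for search
);

-- GIN index to accelerate JSON/SQL Path queries against the search column.
CREATE INDEX idx_documents_search ON documents USING GIN (search jsonb_path_ops);

-- A secondary-key lookup might then be a path query along these lines:
-- SELECT id FROM documents WHERE search @? '$.title_hash ? (@ == "abc123")';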

+

In May I gave a public presentation on the design and implementation of the service at PGCon: CipherDoc: A Searchable, Encrypted JSON Document Service on Postgres. Links:

+ +

I enjoyed designing this service. The ability to dynamically change the JSON +schema at runtime without database changes enables more agile development cycles +for busy teams. Its data privacy features required a level of intellectual +challenge and raw problem-solving (a.k.a., engineering) that challenge and +invigorate me.

+

Two minor updates since May:

+
  1. I re-implemented the JSON/SQL Path parser using the original Postgres path grammar and goyacc, replacing the hand-written parser roundly castigated in the presentation.
  2. The service has yet to be open-sourced, but I remain optimistic, and continue to work with leadership at The Times towards an open-source policy to enable its release.
+ +
+ + + +]]>
+
+ + https://justatheory.com/2020/10/release-postgres-extensions-with-github-actions/ + <![CDATA[Automate Postgres Extension Releases on GitHub and PGXN]]> + + 2023-02-20T23:55:17Z + 2020-10-25T23:48:36Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + +
+

Back in June, I wrote about testing Postgres extensions on multiple versions of Postgres using GitHub Actions. The pattern relies on a Docker image, pgxn/pgxn-tools, which contains scripts to build and run any version of PostgreSQL, install additional dependencies, build, test, bundle, and release an extension. I’ve since updated it to support testing on the latest development release of Postgres, meaning one can test on any major version from 8.4 to (currently) 14. I’ve also created GitHub workflows for all of my PGXN extensions (except for pgTAP, which is complicated). I’m quite happy with it.

+

But I was never quite satisfied with the release process. Quite a number of Postgres extensions also release on GitHub; indeed, Paul Ramsey told me straight up that he did not want to manually upload extensions like pgsql-http and PostGIS to PGXN, but for PGXN to automatically pull them in when they were published on GitHub. It’s pretty cool that newer packaging systems like pkg.go.dev auto-index any packages on GitHub. Adding such a feature to PGXN would be an interesting exercise.

+

But since I’m low on TUITs for such a significant undertaking, I decided instead +to work out how to automatically publish a release on GitHub and PGXN via +GitHub Actions. After experimenting for a few months, I’ve worked out a +straightforward method that should meet the needs of most projects. I’ve proven +the pattern via the pair extension’s release.yml, which successfully +published the v0.1.7 release today on both GitHub and +PGXN. With that success, I updated the pgxn/pgxn-tools +documentation with a starter example. It looks like this:

+
+ +
+
 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
+25
+26
+27
+28
+29
+30
+31
+32
+33
+34
+35
+36
+37
+38
+39
+40
+41
+42
+43
+
+
name: Release
+on:
+  push:
+    tags:
+      - 'v*' # Push events matching v1.0, v20.15.10, etc.
+jobs:
+  release:
+    name: Release on GitHub and PGXN
+    runs-on: ubuntu-latest
+    container: pgxn/pgxn-tools
+    env:
+      # Required to create GitHub release and upload the bundle.
+      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+    steps:
+    - name: Check out the repo
+      uses: actions/checkout@v3
+    - name: Bundle the Release
+      id: bundle
+      run: pgxn-bundle
+    - name: Release on PGXN
+      env:
+        # Required to release on PGXN.
+        PGXN_USERNAME: ${{ secrets.PGXN_USERNAME }}
+        PGXN_PASSWORD: ${{ secrets.PGXN_PASSWORD }}
+      run: pgxn-release
+    - name: Create GitHub Release
+      id: release
+      uses: actions/create-release@v1
+      with:
+        tag_name: ${{ github.ref }}
+        release_name: Release ${{ github.ref }}
+        body: |
+          Changes in this Release
+          - First Change
+          - Second Change          
+    - name: Upload Release Asset
+      uses: actions/upload-release-asset@v1
+      with:
+        # Reference the upload URL and bundle name from previous steps.
+        upload_url: ${{ steps.release.outputs.upload_url }}
+        asset_path: ./${{ steps.bundle.outputs.bundle }}
+        asset_name: ${{ steps.bundle.outputs.bundle }}
+        asset_content_type: application/zip
+
+
+

Here’s how it works:

+
    +
  • +

    Lines 4-5 trigger the workflow only when a tag starting with the letter v is +pushed to the repository. This follows the common convention of tagging +releases with version numbers, such as v0.1.7 or v4.6.0-dev. This +assumes that the tag represents the commit for the release.

    +
  • +
  • +

    Line 10 specifies that the job run in the pgxn/pgxn-tools container, where +we have our tools for building and releasing extensions.

    +
  • +
  • +

    Line 13 passes the GITHUB_TOKEN variable into the container. This is the +GitHub personal access token that’s automatically set for every build. It +lets us call the GitHub API via actions later in the workflow.

    +
  • +
  • +

    Step “Bundle the Release”, on Lines 17-19, validates the extension +META.json file and creates the release zip file. It does so by simply +reading the distribution name and version from the META.json file and +archiving the Git repo into a zip file. If your process for creating a +release file is more complicated, you can do it yourself here; just be sure +to include an id for the step, and emit a line of text so that later +actions know what file to release. The output should be appended to the +$GITHUB_OUTPUT file like this, with $filename representing the name of +the release file, usually $extension-$version.zip:

    +
    echo bundle=$filename >> $GITHUB_OUTPUT
    +
  • +
  • +

    Step “Release on PGXN”, on lines 20-25, releases the extension on PGXN. We +take this step first because it’s the strictest, and therefore the most +likely to fail. If it fails, we don’t end up with an orphan GitHub release +to clean up once we’ve fixed things for PGXN.

    +
  • +
  • +

    With the success of a PGXN release, step “Create GitHub Release”, on lines +26-35, uses the GitHub create-release action to create a release +corresponding to the tag. Note the inclusion of id: release, which will be +referenced below. You’ll want to customize the body of the release; for the pair extension, I added a simple make target to generate a file, then pass it +via the body_path config:

    +
    - name: Generate Release Changes
    +  run: make latest-changes.md
    +- name: Create GitHub Release
    +  id: release
    +  uses: actions/create-release@v1
    +  with:
    +    tag_name: ${{ github.ref }}
    +    release_name: Release ${{ github.ref }}
    +    body_path: latest-changes.md
    +
  • +
  • +

    Step “Upload Release Asset”, on lines 36-43, adds the release file to the +GitHub release, using output of the release step to specify the URL to +upload to, and the output of the bundle step to know what file to upload.

    +
  • +
+

Lotta steps, but works nicely. I only wish I could require that the testing +workflow finish before doing a release, but I generally tag a release once it +has been thoroughly tested in previous commits, so I think it’s acceptable.

+

Now if you’ll excuse me, I’m off to add this workflow to my other PGXN +extensions.

+ +
+ + + +]]>
+
+ + https://justatheory.com/2018/08/pgenv/ + <![CDATA[pgenv]]> + + 2018-08-02T04:31:03Z + 2018-08-02T04:31:03Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + +
+

For years, I’ve managed multiple versions of PostgreSQL by regularly editing and +running a simple script that builds each major version from source and +installs it in /usr/local. I would shut down the current version, remove the +symlink to /usr/local/pgsql, symlink the one I wanted, and start it up again.

+

This is a pain in the ass.

+

Recently I wiped my work computer (because reasons) and started reinstalling all +my usual tools. PostgreSQL, I decided, no longer needs to run as the postgres +user from /usr/local. What would be much nicer, when it came time to test +pgTAP against all supported versions of Postgres, would be to use a tool like +plenv or rbenv to do all the work for me.

+

So I wrote pgenv. To use it, clone it into ~/.pgenv (or wherever you want) +and add its bin directories to your $PATH environment variable:

+
$ git clone https://github.com/theory/pgenv.git
+echo 'export PATH="$HOME/.pgenv/bin:$HOME/.pgenv/pgsql/bin:$PATH"' >> ~/.bash_profile
+

Then you’re ready to go:

+
$ pgenv build 10.4
+

A few minutes later, it’s there:

+
$ pgenv versions
+pgsql-10.4
+

Let’s use it:

+
$ pgenv use 10.4
+The files belonging to this database system will be owned by user "david".
+This user must also own the server process.
+#    (initdb output elided)
+waiting for server to start.... done
+server started
+PostgreSQL 10.4 started
+

Now connect:

+
$ psql -U postgres
+psql (10.4)
+Type "help" for help.
+
+postgres=# 
+

Easy. Each version you install – as far back as 8.0 – has the default super +user postgres for compatibility with the usual system-installed version. It +also builds all contrib modules, including PL/Perl using /usr/bin/perl.

+

With this little app in place, I quickly built all the versions I need. Check it +out:

+
$ pgenv versions
+     pgsql-10.3
+  *  pgsql-10.4
+     pgsql-11beta2
+     pgsql-8.0.26
+     pgsql-8.1.23
+     pgsql-8.2.23
+     pgsql-8.3.23
+     pgsql-8.4.22
+     pgsql-9.0.19
+     pgsql-9.1.24
+     pgsql-9.2.24
+     pgsql-9.3.23
+     pgsql-9.4.18
+     pgsql-9.5.13
+     pgsql-9.6.9
+

Other commands include start, stop, and restart, which act on the +currently active version; version, which shows the currently-active version +(also indicated by the asterisk in the output of the versions command); +clear, to clear the currently-active version (in case you’d rather fall back +on a system-installed version, for example); and remove, which will remove a +version. See the docs for details on all the commands.

+

How it Works

+

All this was written in an uncomplicated Bash script. I’ve only tested it on a couple of Macs, so YMMV, but as long as you have Bash, Curl, and /usr/bin/perl on a system, it ought to just work.

+

How it works is by building each version in its own directory: ~/.pgenv/pgsql-10.4, ~/.pgenv/pgsql-11beta2, and so on. The currently-active version is nothing more than a symlink, ~/.pgenv/pgsql, to the proper version directory. There is no other configuration. pgenv downloads and builds versions in the ~/.pgenv/src directory, and the tarballs and compiled source are left in place, in case they’re needed for development or testing. pgenv never uses them again unless you delete a version and pgenv build it again, in which case pgenv deletes the old build directory and unpacks from the tarball again.

+

Works for Me!

+

Over the last week, I hacked on pgenv to get all of these commands working. It works very well for my needs. Still, I think it might be useful to add support for a configuration file. It might allow one to change the name of the default superuser, the location of Perl, and perhaps a method to change postgresql.conf settings following an initdb. I don’t know when (or if) I’ll need that stuff. Maybe you do, though? Pull requests welcome!

+

But even if you don’t, give it a whirl and let me know if you find any +issues.

+ +
+ + + +]]>
+
+ + https://justatheory.com/2013/10/indexing-nested-hstore/ + <![CDATA[Indexing Nested hstore]]> + + 2013-10-25T14:36:00Z + 2013-10-25T14:36:00Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + +
+

In my first Nested hstore post yesterday, I ran a query against unindexed +hstore data, which required a table scan. But hstore is able to take advantage +of GIN indexes. So let’s see what that looks like. Connecting to the same +database, I indexed the review column:

+
reviews=# CREATE INDEX idx_reviews_gin ON reviews USING GIN(review);
+CREATE INDEX
+Time: 360448.426 ms
+reviews=# SELECT pg_size_pretty(pg_database_size(current_database()));
+ pg_size_pretty 
+----------------
+ 421 MB
+

Well, that takes a while, and makes the database a lot bigger (it was 277 MB +unindexed). But is it worth it? Let’s find out. Oleg and Teodor’s patch adds +support for a nested hstore value on the right-hand-side of the @> operator. +In practice, that means we can specify the full path to a nested value as an +hstore expression. In our case, to query only for Books, instead of using this +expression:

+
WHERE review #> '{product,group}' = 'Book'
+

We can use an hstore value with the entire path, including the value:

+
WHERE review @> '{product => {group => Book}}'
+

Awesome, right? Let’s give it a try:

+
reviews=# SELECT
+    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
+    round(avg(review #^> '{review,rating}'), 2) AS review_average,
+    count(*)
+FROM
+    reviews
+WHERE
+    review @> '{product => {group => Book}}'
+GROUP BY
+    title_length_bucket
+ORDER BY
+    title_length_bucket;
+ title_length_bucket | review_average | count  
+---------------------+----------------+--------
+                   1 |           4.42 |  56299
+                   2 |           4.33 | 170774
+                   3 |           4.45 | 104778
+                   4 |           4.41 |  69719
+                   5 |           4.36 |  47110
+                   6 |           4.43 |  43070
+(6 rows)
+
+Time: 849.681 ms
+

That time looks better than yesterday’s, but in truth I first ran this query +just before building the GIN index and got about the same result. Must be that +Mavericks is finished indexing my disk or something. At any rate, the index is +not buying us much here.

+

But hey, we’re dealing with 1998 Amazon reviews, so querying against books +probably isn’t very selective. I don’t blame the planner for deciding that a +table scan is cheaper than an index scan. But what if we try a more selective +value, say “DVD”?

+
reviews=# SELECT
+    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
+    round(avg(review #^> '{review,rating}'), 2) AS review_average,
+    count(*)
+FROM
+    reviews
+WHERE
+    review @> '{product => {group => DVD}}'
+GROUP BY
+    title_length_bucket
+ORDER BY
+    title_length_bucket;
+ title_length_bucket | review_average | count 
+---------------------+----------------+-------
+                   1 |           4.27 |  2646
+                   2 |           4.44 |  4180
+                   3 |           4.53 |  1996
+                   4 |           4.38 |  2294
+                   5 |           4.48 |   943
+                   6 |           4.42 |   738
+(6 rows)
+
+Time: 73.913 ms
+

Wow! Under 100ms. That’s more like it! Inverted indexing FTW!
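To see whether the planner actually picks the GIN index for the more selective value, a quick EXPLAIN is the easiest check (my own suggestion; the plan output will vary by machine and data, so I won’t guess at it here):

reviews=# EXPLAIN SELECT count(*) FROM reviews WHERE review @> '{product => {group => DVD}}';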

+ +
+ + + +]]>
+
+ + https://justatheory.com/2013/10/testing-nested-hstore/ + <![CDATA[Testing Nested hstore]]> + + 2013-10-23T10:26:00Z + 2013-10-23T10:26:00Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + +
+

I’ve been helping Oleg Bartunov and Teodor Sigaev with documentation for the +forthcoming nested hstore patch for PostgreSQL. It adds support for +arrays, numeric and boolean types, and of course arbitrarily nested data +structures. This gives it feature parity with JSON, but unlike the +JSON type, its values are stored in a binary representation, which makes it +much more efficient to query. The support for GiST and GIN indexes to +speed up path searches doesn’t hurt, either.

+

As part of the documentation, we wanted to include a short tutorial, something +to show off the schemaless flexibility of the new hstore. The CitusDB guys +were kind enough to show off their json_fdw with some Amazon review data in +a blog post a few months back; it even includes an interesting query against +the data. Let’s see what we can do with it. First, load it:

+
> createdb reviews
+> psql -d reviews -c '
+    CREATE EXTENSION HSTORE;
+    CREATE TABLE reviews(review hstore);
+'
+CREATE TABLE
+> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
+ | sed -e "s/'/''/g" | sed -e 's/":/" =>/g' > /tmp/hstore.copy
+> time psql -d reviews -c "COPY reviews FROM '/tmp/hstore.copy'"
+COPY 589859
+       0.00s user 0.00s system 0% cpu 13.059 total
+

13 seconds to load 589,859 records from a file – a little over 45k records +per second. Not bad. Let’s see what the storage looks like:

+
> psql -d reviews -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
+ pg_size_pretty 
+----------------
+ 277 MB
+

The original, uncompressed data is 208 MB on disk, so roughly a third bigger +given the overhead of the database. Just for fun, let’s compare it to JSON:

+
> createdb reviews_js
+> psql -d reviews_js -c 'CREATE TABLE reviews(review json);'
+CREATE TABLE
+> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
+ | sed -e "s/'/''/g" > /tmp/json.copy
+> time psql -d reviews_js -c "COPY reviews FROM '/tmp/json.copy'"
+COPY 589859
+       0.00s user 0.00s system 0% cpu 7.434 total
+> psql -d reviews_js -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
+ pg_size_pretty 
+----------------
+ 239 MB
+

Almost 80K records per second, faster, I’m guessing, because the JSON type doesn’t convert the data to binary representation on its way in. JSON currently uses less overhead for storage, as well; I wonder if that’s the benefit of TOAST storage?

+

Let’s try querying these guys. I adapted the query from the CitusDB blog post and ran it on my 2013 MacBook Air (1.7 GHz Intel Core i7) with iTunes and a bunch of other apps running in the background [yeah, I’m lazy]. Check out those operators, by the way! Given a path, #^> returns a numeric value:

+
reviews=# SELECT
+    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
+    round(avg(review #^> '{review,rating}'), 2) AS review_average,
+    count(*)
+FROM
+    reviews
+WHERE
+    review #> '{product,group}' = 'Book'
+GROUP BY
+    title_length_bucket
+ORDER BY
+    title_length_bucket;
+ title_length_bucket | review_average | count  
+---------------------+----------------+--------
+                   1 |           4.42 |  56299
+                   2 |           4.33 | 170774
+                   3 |           4.45 | 104778
+                   4 |           4.41 |  69719
+                   5 |           4.36 |  47110
+                   6 |           4.43 |  43070
+(6 rows)
+
+Time: 2301.620 ms
+

The benefit of the native type is pretty apparent here. I ran this query +several times, and the time was always between 2.3 and 2.4 seconds. The Citus +json_fdw query took “about 6 seconds on a 3.1 GHz CPU core.” Let’s see how +well the JSON type does (pity there is no operator to fetch a value as +numeric; we have to cast from text):

+
reviews_js=# SELECT
+    width_bucket(length(review #>> '{product,title}'), 1, 50, 5) title_length_bucket,
+    round(avg((review #>> '{review,rating}')::numeric), 2) AS review_average,
+    count(*)
+FROM
+    reviews
+WHERE
+    review #>> '{product,group}' = 'Book'
+GROUP BY
+    title_length_bucket
+ORDER BY
+    title_length_bucket;
+ title_length_bucket | review_average | count  
+---------------------+----------------+--------
+                   1 |           4.42 |  56299
+                   2 |           4.33 | 170774
+                   3 |           4.45 | 104778
+                   4 |           4.41 |  69719
+                   5 |           4.36 |  47110
+                   6 |           4.43 |  43070
+(6 rows)
+
+Time: 5530.120 ms
+

A little faster than the json_fdw version, but comparable. It takes well over twice as long as the hstore version, though. For queries, hstore is the clear winner. Yes, you pay up-front for loading and storage, but the payoff at query time is substantial. Ideally, of course, we would have the insert and storage benefits of JSON and the query performance of hstore. There was talk last spring at PGCon of using the same representation for JSON and hstore; perhaps that can still come about.

+

Meanwhile, I expect to play with some other data sets over the next week; +watch this spot for more!

+ +
+ + + +]]>
+
+ + https://justatheory.com/2013/09/the-power-of-enums/ + <![CDATA[The Power of Enums]]> + + + 2022-05-22T21:37:05Z + 2013-09-29T14:50:00Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + +
+

Jim Mlodgenski on using Enums in place of references to small lookup tables:

+
+

I saw something else I didn’t expect: […] There was a 8% increase in performance. I was expecting the test with the enums to be close to the baseline, but I wasn’t expecting it to be faster. Thinking about it, it makes sense. Enums values are just numbers so we’re effectively using surrogate keys under the covers, but the users would still see the enum labels when they are looking at the data. It ended up being a no brainer to use enums for these static tables. There was a increase in performance while still maintaining the integrity of the data.

+
+

I’ve been a big fan of Enums since Andrew and Tom Dunstan released a patch for +them during the PostgreSQL 8.2 era. Today they’re a core feature, and as of +9.1, you can even modify their values! You’re missing out if you’re not using +them yet.
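If you haven’t tried them, here’s a minimal sketch (the type and table names are mine, not from Jim’s post):

CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     serial PRIMARY KEY,
    status order_status NOT NULL DEFAULT 'pending'
);

-- As of 9.1, the set of labels can be extended in place:
ALTER TYPE order_status ADD VALUE 'cancelled';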

+ +
+ + + +]]>
+
+ + https://justatheory.com/2013/08/understanding-window-functions/ + <![CDATA[Understanding Window Functions]]> + + + 2013-08-28T17:25:00Z + 2013-08-28T17:25:00Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + +
+

Dimitri Fontaine:

+
+

There was SQL before window functions and SQL after window functions: +that’s how powerful this tool is. Being that of a deal breaker unfortunately +means that it can be quite hard to grasp the feature. This article aims at +making it crystal clear so that you can begin using it today and are able to +reason about it and recognize cases where you want to be using window +functions.

+
+

Great intro to a powerful feature.
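For a small taste of what the article covers, a sketch of my own (hypothetical table and columns):

SELECT department,
       employee,
       salary,
       rank() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;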

+ +
+ + + +]]>
+
+ + https://justatheory.com/2013/06/agile-db-dev/ + <![CDATA[Agile Database Development Tutorial]]> + + 2022-01-02T17:18:32Z + 2013-06-06T19:02:55Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + +
+

I gave a tutorial at PGCon a couple weeks back, entitled “Agile Database +Development with Git, Sqitch, and pgTAP.” It went well, I think. The Keynote +document and an exported PDF have been posted on PGCon.org, and also uploaded here and to Speaker Deck. And embedded +below, too. Want to follow along? Clone the tutorial Git repository and follow +along. Here’s the teaser:

+
+

Hi, I’m David. I like to write database apps. Just as much as I like to write +web apps. (Maybe more!) How? Not by relying on bolted-on, half-baked database +integration tools like migrations, I’ll tell you that!. Instead, I make +extensive use of best-of-breed tools for source control (Git), database unit +testing (pgTAP), and database change management and deployment (Sqitch). +If you’d like to get as much pleasure out of database development as you do +application development, join me for this tutorial. We’ll develop a sample +application using the processes and tools I’ve come to depend on, and you’ll +find out whether they might work for you. Either way, I promise it will at +least be an amusing use of your time.

+
+

+

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2013/02/bootstrap-bucardo-mulitmaster/ + <![CDATA[Bootstrapping Bucardo Master/Master Replication]]> + + 2013-02-12T22:11:19Z + 2013-02-12T22:11:19Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + +
+

Let’s say you have a production database up and running and you want to set up a +second database with Bucardo-powered replication between them. Getting a new +master up and running without downtime for an existing master, and without +losing any data, is a bit fiddly and under-documented. Having just figured out +one way to do it with the forthcoming Bucardo 5 code base, I wanted to blog it +as much for my own reference as for yours.

+

First, let’s set up some environment variables to simplify things a bit. I’m +assuming that the database names and usernames are the same, and only the host +names are different:

+
export PGDATABASE=widgets
+export PGHOST=here.example.com
+export PGHOST2=there.example.com
+export PGSUPERUSER=postgres
+

And here are some environment variables we’ll use for Bucardo configuration +stuff:

+
export BUCARDOUSER=bucardo
+export BUCARDOPASS=*****
+export HERE=here
+export THERE=there
+

First, let’s create the new database as a schema-only copy of the existing +database:

+
createdb -U $PGSUPERUSER -h $PGHOST2 $PGDATABASE
+pg_dump -U $PGSUPERUSER -h $PGHOST --schema-only $PGDATABASE \
+ | psql -U $PGSUPERUSER -h $PGHOST2 -d $PGDATABASE
+

You might also have to copy over roles; use pg_dumpall --globals-only to do +that.

+

Next, we configure Bucardo. Start by telling it about the databases:

+
bucardo add db $HERE$PGDATABASE dbname=$PGDATABASE host=$PGHOST user=$BUCARDOUSER pass=$BUCARDOPASS
+bucardo add db $THERE$PGDATABASE dbname=$PGDATABASE host=$PGHOST2 user=$BUCARDOUSER pass=$BUCARDOPASS
+

Tell it about all the tables we want to replicate:

+
bucardo add table public.foo public.bar relgroup=myrels db=$HERE$PGDATABASE 
+

Create a multi-master database group for the two databases:

+
bucardo add dbgroup mydbs $HERE$PGDATABASE:source $THERE$PGDATABASE:source
+

And create the sync:

+
bucardo add sync mysync relgroup=myrels dbs=mydbs autokick=0
+

Note autokick=0. This ensures that, while deltas are logged, they will not be +copied anywhere until we tell Bucardo to do so.

+

And now that we know that any changes from here on in will be queued for +replication, we can go ahead and copy over the data. The only caveat is that we +need to disable the Bucardo triggers on the target system, so that our copying +does not try to queue up. We do that by setting the session_replication_role +GUC to “replica” while doing the copy:

+
pg_dump -U $PGSUPERUSER -h $PGHOST --data-only -N bucardo $PGDATABASE \
+  | PGOPTIONS='-c session_replication_role=replica' \
+    psql -U $PGSUPERUSER -h $PGHOST2 -d $PGDATABASE
+

Great, now all the data is copied over, we can have Bucardo copy any changes +that have been made in the interim, as well as any going forward:

+
bucardo update sync mysync autokick=1
+bucardo reload config
+

Bucardo will fire up the necessary syncs and copy over any interim deltas. And +any changes you make to either system in the future will be copied to the other.

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2012/11/postgres-format-function/ + <![CDATA[New in PostgreSQL 9.2: format()]]> + + 2012-11-16T01:31:00Z + 2012-11-16T01:31:00Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + +
+

There’s a new feature in PostgreSQL 9.2 that I don’t recall seeing blogged about +elsewhere: the format() function. From the docs:

+
+

Format a string. This function is similar to the C function sprintf; but only +the following conversion specifications are recognized: %s interpolates the +corresponding argument as a string; %I escapes its argument as an SQL +identifier; %L escapes its argument as an SQL literal; %% outputs a literal %. +A conversion can reference an explicit parameter position by preceding the +conversion specifier with n$, where n is the argument position.

+
+
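A quick illustration of those conversion specifiers (my own example, not from the docs):

SELECT format('SELECT %I FROM %I WHERE %I = %L', 'user name', 'users', 'id', 42);
-- SELECT "user name" FROM users WHERE id = '42'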

If you do a lot of dynamic query building in PL/pgSQL functions, you’ll +immediately see the value in format(). Consider this function:

+
CREATE OR REPLACE FUNCTION make_month_partition(
+    base_table   TEXT,
+    schema_name  TEXT,
+    month        TIMESTAMP
+) RETURNS VOID LANGUAGE plpgsql AS $_$
+DECLARE
+    partition TEXT := quote_ident(base_table || '_' || to_char(month, '"y"YYYY"m"MM'));
+    month_start TIMESTAMP := date_trunc('month', month);
+BEGIN
+    EXECUTE '
+        CREATE TABLE ' || quote_ident(schema_name) || '.' || partition || ' (CHECK (
+                created_at >= ' || quote_literal(month_start) || '
+            AND created_at < '  || quote_literal(month_start + '1 month'::interval) || '
+        )) INHERITS (' || quote_ident(schema_name) || '.' || base_table || ')
+    ';
+    EXECUTE 'GRANT SELECT ON ' || quote_ident(schema_name) || '.' || partition || '  TO dude;';
+END;
+$_$;
+

Lots of concatenation and use of quote_ident() to get things just right. I don’t know about you, but I always found this sort of thing quite difficult to read. But format() allows us to eliminate most of the operators and function calls. Check it:

+
CREATE OR REPLACE FUNCTION make_month_partition(
+    base_table   TEXT,
+    schema_name  TEXT,
+    month        TIMESTAMP
+) RETURNS VOID LANGUAGE plpgsql AS $_$
+DECLARE
+    partition TEXT := base_table || '_' || to_char(month, '"y"YYYY"m"MM');
+    month_start TIMESTAMP := date_trunc('month', month);
+BEGIN
+    EXECUTE format(
+        'CREATE TABLE %I.%I (
+            CHECK (created_at >= %L AND created_at < %L)
+        ) INHERITS (%I.%I)',
+        schema_name, partition,
+        month_start, month_start + '1 month'::interval,
+        schema_name, base_table
+    );
+    EXECUTE format('GRANT SELECT ON %I.%I TO dude', schema_name, partition);
+END;
+$_$;
+

I don’t know about you, but I find that a lot easier to read, which means it’ll be easier to maintain. So if you do much dynamic query generation inside the database, give format() a try; I think you’ll find it a winner.

+

Update 2012-11-16: Okay, so I somehow failed to notice that format() was +actually introduced in 9.1 and covered by depesz. D’oh! Well, hopefully my +little post will help to get the word out more, at least. Thanks to my +commenters.

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2012/11/mock-postgres-serialization-failures/ + <![CDATA[Mocking Serialization Failures]]> + + 2012-11-02T22:16:28Z + 2012-11-02T22:16:28Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + +
+

I’ve been hacking on the forthcoming Bucardo 5 code base the last couple +weeks, as we’re going to start using it pretty extensively at work, and it +needed a little love to get it closer to release. The biggest issue I fixed was +the handling of serialization failures.

+

When copying deltas from one database to another, Bucardo sets the transaction isolation to “Serializable”. As of PostgreSQL 9.1, this is true serializable isolation. However, there were no tests for it in Bucardo. And since previous versions of PostgreSQL had poorer isolation (retained in 9.1 as “Repeatable Read”), I don’t think anyone really noticed it much. As I’m doing all my testing against 9.2, I was getting the serialization failures about half the time I ran the test suite. It took me a good week to chase down the issue. Once I did, I posted to the Bucardo mail list pointing out that Bucardo was not attempting to run a transaction again after failure, and at any rate, the model for how it thought to do so was a little wonky: it let the replicating process die, on the assumption that a new process would pick up where it left off. It did not.

+

Bucardo maintainer Greg Sabino Mullane proposed that we let the replicating +process try again on its own. So I went and made it do that. And then the tests +started passing every time. Yay!

+

Returning to the point of this post, I felt that there ought to be tests for +serialization failures in the Bucardo test suite, so that we can ensure that +this continues to work. My first thought was to use PL/pgSQL in 8.4 and higher +to mock a serialization failure. Observe:

+
david=# \set VERBOSITY verbose
+david=# DO $$BEGIN RAISE EXCEPTION 'Serialization error'
+       USING ERRCODE = 'serialization_failure'; END $$;
+ERROR:  40001: Serialization error
+LOCATION:  exec_stmt_raise, pl_exec.c:2840
+
+

Cool, right? Well, the trick is to get this to run on the replication target, +but only once. When Bucardo retries, we want it to succeed, thus properly +demonstrating the COPY/SERIALIZATION FAIL/ROLLBACK/COPY/SUCCESS pattern. +Furthermore, when it copies deltas to a target, Bucardo disables all triggers +and rules. So how to get something trigger-like to run on a target table and +throw the serialization error?

+

Studying the Bucardo source code, I discovered that Bucardo itself does not disable triggers and rules. Rather, it sets the session_replication_role GUC to “replica”. This causes PostgreSQL to disable the triggers and rules — except for those that have been set to ENABLE REPLICA. The PostgreSQL ALTER TABLE docs:

+
+

The trigger firing mechanism is also affected by the configuration variable +session_replication_role. Simply enabled triggers will fire when the +replication role is “origin” (the default) or “local”. Triggers configured as +ENABLE REPLICA will only fire if the session is in “replica” mode, and +triggers configured as ENABLE ALWAYS will fire regardless of the current +replication mode.

+
+

Well how cool is that? So all I needed to do was plug in a replica trigger and have it throw an exception once but not twice. Via email, Kevin Grittner pointed out that a sequence might work, and indeed it does. Because sequence values are non-transactional, sequences return different values every time they’re accessed.

+

Here’s what I came up with:

+
CREATE SEQUENCE serial_seq;
+
+CREATE OR REPLACE FUNCTION mock_serial_fail(
+) RETURNS trigger LANGUAGE plpgsql AS $_$
+BEGIN
+    IF nextval('serial_seq') % 2 = 0 THEN RETURN NEW; END IF;
+    RAISE EXCEPTION 'Serialization error'
+            USING ERRCODE = 'serialization_failure';
+END;
+$_$;
+
+CREATE TRIGGER mock_serial_fail AFTER INSERT ON bucardo_test2
+    FOR EACH ROW EXECUTE PROCEDURE mock_serial_fail();
+ALTER TABLE bucardo_test2 ENABLE REPLICA TRIGGER mock_serial_fail;
+

The first INSERT (or, in Bucardo’s case, COPY) to bucardo_test2 will die +with the serialization error. The second INSERT (or COPY) succeeds. This +worked great, and I was able to write test in a few hours and get them +committed. And now we can be reasonably sure that Bucardo will always properly +handle serialization failures.

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2012/04/postgres-use-timestamptz/ + <![CDATA[Always Use TIMESTAMP WITH TIME ZONE]]> + + 2022-05-22T21:36:58Z + 2012-04-16T22:08:26Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + +
+

My recommendations for sane time zone management in PostgreSQL:

+
  • Set timezone = 'UTC' in postgresql.conf. This makes UTC the default time zone for all connections.
  • Use timestamp with time zone (aka timestamptz) and time with time zone (aka timetz). They store values as UTC, but convert them on selection to whatever your time zone setting is.
  • Avoid timestamp without time zone (aka timestamp) and time without time zone (aka time). These columns do not know the time zone of a value, so different apps can insert values in different zones and no one would ever know.
  • Always specify a time zone when inserting into a timestamptz or timetz column. Unless the zone is UTC. But even then, append a “Z” to your value: it’s more explicit, and will keep you sane.
  • If you need to get timestamptz or timetz values in a zone other than UTC, use the AT TIME ZONE expression in your query. But be aware that the returned value will be a timestamp or time value, with no more time zone. Good for reporting and queries, bad for storage.
  • If your app always needs data in some other time zone, have it SET timezone to that zone on connection. All values then retrieved from the database will be in the configured time zone. The app should still include the time zone in values sent to the database.
+
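A short sketch of these recommendations in practice (the table name is mine; assumes timezone = 'UTC'):

CREATE TABLE log_events (
    id          bigserial   PRIMARY KEY,
    occurred_at timestamptz NOT NULL
);

-- Always send the zone, or at least a trailing "Z" for UTC:
INSERT INTO log_events (occurred_at) VALUES ('2012-04-16 22:08:26Z');

-- Convert on the way out for reporting; note the result is a zone-less timestamp:
SELECT occurred_at AT TIME ZONE 'America/New_York' FROM log_events;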

The one exception to the rule preferring timestamptz and timetz is a special +case: partitioning. When partitioning data on timestamps, you must not use +timestamptz. Why? Because almost no expression involving timestamptz +comparison is immutable. Use one in a WHERE clause, and constraint exclusion +may well be ignored and all partitions scanned. This is usually something you +want to avoid.

+

So in this one case and only in this one case, use a timestamp without time zone column, but always insert data in UTC. This will keep things consistent with the timestamptz columns you have everywhere else in your database. Unless your app changes the value of the timezone GUC when it connects, it can just assume that everything is always UTC, and should always send updates as UTC.
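Here’s a sketch of that exception (names are mine): created_at is a plain timestamp, and every value inserted into it is assumed to be UTC:

CREATE TABLE metrics (
    created_at timestamp NOT NULL, -- no time zone; values are always UTC
    value      numeric   NOT NULL
);

-- Constraint exclusion can use this immutable comparison on plain timestamps:
CREATE TABLE metrics_2012_04 (
    CHECK (created_at >= '2012-04-01' AND created_at < '2012-05-01')
) INHERITS (metrics);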

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2011/11/iovationeering/ + <![CDATA[iovationeering]]> + + 2022-06-12T03:26:54Z + 2011-11-30T05:34:11Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + + +
+

Since June, as part of my work for PGX, I’ve been doing on-site full-time +consulting for iovation here in Portland. iovation is in the business of +deterring online fraud via device identification and reputation. Given the +nature of that business, a whole lot of data arrives every day, and I’ve been +developing PostgreSQL-based solutions to help get a handle on it. The work has +been truly engaging, and a whole hell of a lot of fun. And there are some really +great, very smart people at iovation, whom I very much like and respect.

[Image: iovation]

So much so, in fact, that I decided to accept their offer of a full time +position as “Senior Data Architect.” I started on Monday.

+

I know, crazy, right? They’ve actually been talking me up about it for a long +time. In our initial contact close to two years ago, as I sought to land them as +a PGX client, they told me they wanted to hire someone, and was I interested. I +said “no.” I said “no” through four months of contracting this summer and fall, +until one day last month I said to myself, “wait, why don’t I want this job?” +I had been on automatic, habitually insisting I wasn’t interested in a W2 +position. And with good reason. Aside from 15 months as CTO at values of n +(during which time I worked relatively independently anyway), I’ve been an +independent consultant since I founded Kineticode in November of 2001. Yeah. +Ten Years.

+

Don’t get me wrong, those ten years have been great! Not only have I been able +to support myself doing the things I love—and learned a ton in the process—but +I’ve managed to write a lot of great code. Hell, I will be +continuing as an associate with PGX, though with greatly reduced +responsibilities. And someday I may go indy again. But in the meantime, the +challenges, opportunities, and culture at iovation are just too good to pass up. +I’m loving the work I’m doing there, and expect to learn a lot over the next few +years.

[Image: Kineticode]

So what, you might ask, does this mean for Kineticode, the company I founded to +offer support, consulting, and training services for Bricolage CMS? The truth +is that Kineticode has only a few technical support customers left; virtually +all of my work for the last two years has been through PGX. So I’ve decided to +shut Kineticode down. I’m shifting the Bricolage tech support offerings over to +PGX and having Kineticode’s customers move there as their contacts come up for +renewal. They can expect the same great service as always. Better even, as there +are 10 associates in PGX, and, lately, only me at Kineticode. Since Kineticode +itself is losing its Raison d’être, it’s going away.

[Image: PGX]

I intend to remain involved in the various open-source projects I work on. I still function as the benevolent dictator of Bricolage CMS, though other folks have stepped up their involvement quite a lot in the last few years. And I expect to keep improving PGXN and DesignScene as time allows (I’ve actually been putting some effort into both in the last few weeks; watch for PGXN and Lunar/Theory announcements in the coming weeks and months!). And, in fact, I expect that a fair amount of the work I do at iovation will lead to blog posts, conference presentations, and more open-source code.

+

This is going to be a blast. Interested in a front-row seat? Follow me on +Twitter.

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2011/09/dbix-connector-and-ssi/ + <![CDATA[DBIx::Connector and Serializable Snapshot Isolation]]> + + 2011-09-26T19:09:48Z + 2011-09-26T19:09:48Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + +
+

I was at Postgres Open week before last. This was a great conference, very +welcoming atmosphere and lots of great talks. One of the more significant, for +me, was the session on serializable transactions by Kevin Grittner, who +developed SSI for PostgreSQL 9.1. I hadn’t paid much attention to this feature +before now, but it became clear to me, during the talk, that it’s time.

+

So what is SSI? Well, serializable transactions are almost certainly how you +think of transactions already. Here’s how Kevin describes them:

+
+

True serializable transactions can simplify software development. Because any +transaction which will do the right thing if it is the only transaction +running will also do the right thing in any mix of serializable transactions, +the programmer need not understand and guard against all possible conflicts. +If this feature is used consistently, there is no need to ever take an +explicit lock or SELECT FOR UPDATE/SHARE.

+
+

This is, in fact, generally how I’ve thought about transactions. But I’ve +certainly run into cases where it wasn’t true. Back in 2006, I wrote an article +on managing many-to-many relationships with PL/pgSQL which demonstrated a race +condition one might commonly find when using an ORM. The solution I offered was +to always use a PL/pgSQL function that does the work, and that function +executes a SELECT...FOR UPDATE statement to overcome the race condition. This +creates a lock that forces conflicting transactions to be performed serially.

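The heart of that workaround is just an early row-level lock; here is a minimal sketch, with a hypothetical table and key:

BEGIN;
-- Lock the parent row so conflicting transactions queue up behind it
-- and effectively run serially (collections and collection_id are made up).
SELECT 1 FROM collections WHERE collection_id = 42 FOR UPDATE;
-- ...do the dependent INSERTs and UPDATEs here...
COMMIT;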
+

Naturally, this is something one would rather not have to think about. Hence +SSI. When you identify a transaction as serializable, it will be executed in a +truly serializable fashion. So I could actually do away with the +SELECT...FOR UPDATE workaround — not to mention any other race conditions I +might have missed — simply by telling PostgreSQL to enforce transaction +isolation. This essentially eliminates the possibility of unexpected +side-effects.

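Opting in is just a matter of declaring the isolation level when the transaction begins. A minimal sketch (the table and values are hypothetical, and the client must still be prepared to retry):

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- No explicit locks here; PostgreSQL 9.1 detects unsafe interleavings itself.
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;  -- may fail with SQLSTATE 40001, in which case the whole transaction should be retried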
+

This comes at a cost, however. Not in terms of performance so much, since the +SSI implementation uses some fancy, recently-developed algorithms to keep +things efficient. (Kevin tells me via IRC: “Usually the rollback and retry work +is the bulk of the additional cost in an SSI load, in my testing so far. A +synthetic load to really stress the LW locking, with a fully-cached database +doing short read-only transactions will have no serialization failures, but can +run up some CPU time in LW lock contention.”) No, the cost is actually in +increased chance of transaction rollback. Because SSI will catch more +transaction conflicts than the traditional “read committed” isolation level, +frameworks that expect to work with SSI need to be prepared to handle more +transaction failures. From the fine manual:

+
+

The Serializable isolation level provides the strictest transaction isolation. +This level emulates serial transaction execution, as if transactions had been +executed one after another, serially, rather than concurrently. However, like +the Repeatable Read level, applications using this level must be prepared to +retry transactions due to serialization failures.

+
+

And that brings me to DBIx::Connector, my Perl module for safe connection and transaction management. It currently has no such retry smarts built into it. The feature closest to that is the “fixup” connection mode, wherein if execution of a code block fails due to a connection failure, DBIx::Connector will re-connect to the database and execute the code reference again.

+

I think I should extend DBIx::Connector to take isolation failures and deadlocks +into account. That is, fixup mode would retry a code block not only on +connection failure but also on serialization failure (SQLSTATE 40001) and +deadlocks (SQLSTATE 40P01). I would also add a new attribute, retries, to +specify the number of times to retry such execution, with a default of three +(which likely will cover the vast majority of cases). This has actually been an +oft-requested feature, and I’m glad to have a new reason to add it.

+

There are a few design issues to overcome, however:

+
    +
  • Fixup mode is supported not just by txn(), which scopes the execution of a +code reference to a single transaction, but also run(), which does no +transaction handling. Should the new retry support be added there, too? I +could see it either way (a single SQL statement executed in run() is +implicitly transaction-scoped).
  • +
  • Fixup mode is also supported by svp(), which scopes the execution of a +code reference to a savepoint (a.k.a. a subtransaction). Should the rollback +and retry be supported there, too, or would the whole transaction have to be +retried? I’m thinking the latter, since that’s currently the behavior for +connection failures.
  • +
  • Given these issues, will it make more sense to perhaps create a new mode? +Maybe it would be supported only by txn().
  • +
+

This is do-able; it will likely just take some experimentation to figure out and settle on the appropriate API. I’ll need to find the tuits for that soon.

+

In the meantime, given currently in-progress changes, I’ve just released a new +version of DBIx::Connector with a single change: All uses of the deprecated +catch syntax now throw warnings. The previous version threw warnings only the +first time the syntax was used in a particular context, to keep error logs from +getting clogged up. Hopefully most folks have changed their code in the two +months since the previous release and switched to Try::Tiny or some other +model for exception handling. The catch syntax will be completely removed in +the next release of DBIx::Connector, likely around the end of the year. +Hopefully the new SSI-aware retry functionality will have been integrated by +then, too.

+

In a future post I’ll likely chew over whether or not to add an API to set the +transaction isolation level within a call to txn() and friends.

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/11/postgres-fk-locks-project/ + <![CDATA[Fixing Foreign Key Deadlocks in PostgreSQL]]> + + 2010-11-24T22:30:53Z + 2010-11-24T22:30:53Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + + +
+

PGX had a client come to us recently with a rather nasty deadlock issue. It +took far longer than we would have liked to figure out the issue, and once we +did, they were able to clear it up by dropping an unnecessary index. Still, it +shouldn’t have been happening to begin with. Joel Jacobson admirably explained +the issue on pgsql-hackers (and don’t miss the screencast).

+

Some might consider it a bug in PostgreSQL, but the truth is that PostgreSQL can +obtain stronger than necessary locks. Such locks cause some operations to block +unnecessarily and some other operations to deadlock, especially when foreign +keys are used in a busy database. And really, who doesn’t use FKs in their busy +database?

+

Fortunately, Simon Riggs proposed a solution. And it’s a good one. So good +that PGX is partnering with Glue Finance and Command Prompt as +founding sponsors on a new FOSSExperts project to actually get it done. +Álvaro Herrera is doing the actual hacking on the project, and has already +blogged about it here and here.

+

If you use foreign key constraints (and you should!) and you have a high +transaction load on your database (or expect to soon!), this matters to you. In +fact, if you use ActiveRecord with Rails, there might even be a special place in +your heart for this issue, says Mina Naguib. We’d really like to get this +done in time for the PostgreSQL 9.1 release. But it will only happen if the +project can be funded.

+

Yes, that’s right, as with PGXN, this is a community project for which we’re raising funds from the community to get it done. I think that more and more work could be done this way, as various interested parties contribute small amounts to collectively fund improvements to the benefit of us all. So can you help out? Hit the FOSSExperts project page for all the project details, and to make your contribution.

+

Help us help the community to make PostgreSQL better than ever!

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/08/postgres-key-value-pairs/ + <![CDATA[Managing Key/Value Pairs in PostgreSQL]]> + + 2010-08-09T13:00:00Z + 2010-08-09T13:00:00Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + +
+

Let’s say that you’ve been following the latest research in key/value data storage and are interested in managing such data in a PostgreSQL database. You want to have functions to store and retrieve pairs, but there is no natural way to represent pairs in SQL. Many languages have hashes or data dictionaries to fulfill this role, and you can pass them to functional interfaces. SQL’s got nothin’. In PostgreSQL, you have two options: use nested arrays (simple, fast) or use a custom composite data type (sugary, legible).

+

Let’s assume you have this table for storing your pairs:

+
CREATE TEMPORARY TABLE kvstore (
+    key        TEXT PRIMARY KEY,
+    value      TEXT,
+    expires_at TIMESTAMPTZ DEFAULT NOW() + '12 hours'::interval
+);
+

To store pairs, you can use nested arrays like so:

+
 SELECT store(ARRAY[ ['foo', 'bar'], ['baz', 'yow'] ]);
+

Not too bad, and since SQL arrays are a core feature of PostgreSQL, there’s +nothing special to do. Here’s the store() function:

+
CREATE OR REPLACE FUNCTION store(
+    params text[][]
+) RETURNS VOID LANGUAGE plpgsql AS $$
+BEGIN
+    FOR i IN 1 .. array_upper(params, 1) LOOP
+        UPDATE kvstore
+            SET value      = params[i][2],
+                expires_at = NOW() + '12 hours'::interval
+            WHERE key        = params[i][1];
+        CONTINUE WHEN FOUND;
+        INSERT INTO kvstore (key, value)
+        VALUES (params[i][1], params[i][2]);
+    END LOOP;
+END;
+$$;
+

I’ve seen worse. The trick is to iterate over each nested array, try an update +for each, and insert when no row is updated. Alas, you have no control over how +many elements a user might include in a nested array. One might call it as:

+
SELECT store(ARRAY[ ['foo', 'bar', 'baz'] ]);
+

Or:

+
SELECT store(ARRAY[ ['foo'] ]);
+

No errors will be thrown in either case. In the first the “baz” will be ignored, +and in the second the value will default to NULL. If you really didn’t like +these behaviors, you could add some code to throw an exception if +array_upper(params, 2) returns anything other than 2.

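If you wanted that stricter behavior, a guard at the top of store() would do it; a sketch:

-- Inside store(), before the loop: insist that each nested array is a pair.
IF array_upper(params, 2) IS DISTINCT FROM 2 THEN
    RAISE EXCEPTION 'store() expects key/value pairs; got % element(s) per pair',
        array_upper(params, 2);
END IF;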
+

Let’s look at fetching values for keys. PostgreSQL 8.4 added variadic function +arguments, so it’s easy to provide a nice interface for retrieving one or more +values. The obvious one fetches a single value:

+
CREATE OR REPLACE FUNCTION getval(
+    text
+) RETURNS TEXT LANGUAGE SQL AS $$
+    SELECT value FROM kvstore WHERE key = $1;
+$$;
+

Nice and simple:

+
SELECT getval('baz');
+
+ getval 
+--------
+ yow
+
+

The variadic version looks like this:

+
CREATE OR REPLACE FUNCTION getvals(
+    variadic text[]
+) RETURNS SETOF text LANGUAGE SQL AS $$
+    SELECT value
+      FROM kvstore
+      JOIN (SELECT generate_subscripts($1, 1)) AS f(i)
+        ON kvstore.key = $1[i]
+     ORDER BY i;
+$$;
+

Note the use of ORDER BY i to ensure that the values are returned in the same +order as the keys are passed to the function. So if I’ve got the key/value pairs +'foo' => 'bar' and 'baz' => 'yow', the output is:

+
SELECT * FROM getvals('foo', 'baz');
+
+ getvals 
+---------
+ bar
+ yow
+
+

If we want the rows to have the keys and values together, we can return them as arrays, like so:

+
CREATE OR REPLACE FUNCTION getpairs(
+    variadic text[]
+) RETURNS SETOF text[] LANGUAGE SQL AS $$
+    SELECT ARRAY[key, value]
+      FROM kvstore
+      JOIN unnest($1) AS k ON kvstore.key = k
+$$;
+

Here I’m assuming that order isn’t important, which means we can use unnest +to “flatten” the array, instead of the slightly more baroque +generate_subscripts() with array access. The output:

+
SELECT * FROM getpairs('foo', 'baz');
+
+  getpairs   
+-------------
+ {baz,yow}
+ {foo,bar}
+
+

Now, this is good as far as it goes, but the use of nested arrays to represent +key/value pairs is not exactly ideal: just looking at the use of a function, +there’s nothing to indicate that you’re using key/value pairs. What would be +ideal is to use row constructors to pass arbitrary pairs:

+
SELECT store( ROW('foo', 'bar'), ROW('baz', 42) );
+

Alas, one cannot pass RECORD values (the data type returned by ROW()) to +non-C functions in PostgreSQL.1 But if you don’t mind your +keys and values always being TEXT, we can get almost all the way there by +creating an “ordered pair” data type as a composite type like so:

+
CREATE TYPE pair AS ( k text, v text );
+

Then we can create store() with a signature of VARIADIC pair[] and pass in +any number of these suckers:

+
CREATE OR REPLACE FUNCTION store(
+    params variadic pair[]
+) RETURNS VOID LANGUAGE plpgsql AS $$
+DECLARE
+    param pair;
+BEGIN
+    FOR param IN SELECT * FROM unnest(params) LOOP
+        UPDATE kvstore
+           SET value = param.v,
+               expires_at = NOW() + '12 hours'::interval
+         WHERE key = param.k;
+        CONTINUE WHEN FOUND;
+        INSERT INTO kvstore (key, value) VALUES (param.k, param.v);
+    END LOOP;
+END;
+$$;
+

Isn’t it nice how we can access keys and values as param.k and param.v? Call +the function like this:

+
SELECT store( ROW('foo', 'bar')::pair, ROW('baz', 'yow')::pair );
+

Of course, that can get a bit old, casting to pair all the time, so let’s +create some pair constructor functions to simplify things:

+
CREATE OR REPLACE FUNCTION pair(anyelement, text)
+RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';
+
+CREATE OR REPLACE FUNCTION pair(text, anyelement)
+RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';
+
+CREATE OR REPLACE FUNCTION pair(anyelement, anyelement)
+RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';
+
+CREATE OR REPLACE FUNCTION pair(text, text)
+RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair;';
+

I’ve created four variants here to allow for the most common combinations of +types. So any of the following will work:

+
SELECT pair('foo', 'bar');
+SELECT pair('foo', 1);
+SELECT pair(12.3, 'foo');
+SELECT pair(1, 43);
+

Alas, you can’t mix any other types, so this will fail:

+
SELECT pair(1, 12.3);
+
+ERROR:  function pair(integer, numeric) does not exist
+LINE 1: SELECT pair(1, 12.3);
+

We could create a whole slew of additional constructors, but since we’re using a +key/value store, it’s likely that the keys will usually be text anyway. So now +we can call store() like so:

+
SELECT store( pair('foo', 'bar'), pair('baz', 'yow') );
+

Better, eh? Hell, we can go all the way and create a nice binary operator to +make it still more sugary. Just map each of the pair functions to the operator +like so:

+
CREATE OPERATOR -> (
+    LEFTARG   = text,
+    RIGHTARG  = anyelement,
+    PROCEDURE = pair
+);
+
+CREATE OPERATOR -> (
+    LEFTARG   = anyelement,
+    RIGHTARG  = text,
+    PROCEDURE = pair
+);
+
+CREATE OPERATOR -> (
+    LEFTARG   = anyelement,
+    RIGHTARG  = anyelement,
+    PROCEDURE = pair
+);
+
+CREATE OPERATOR -> (
+    LEFTARG   = text,
+    RIGHTARG  = text,
+    PROCEDURE = pair
+);
+

Looks like a lot of repetition, I know, but check out the new syntax:

+
SELECT store( 'foo' -> 'bar', 'baz' -> 1 );
+

Cute, eh? I chose to use -> because => is deprecated as an operator in +PostgreSQL 9.0: SQL 2011 reserves that operator for named parameter +assignment.2

+

As a last twist, let’s rewrite getpairs() to return pairs instead of arrays:

+
CREATE OR REPLACE FUNCTION getpairs(
+    variadic text[]
+) RETURNS SETOF pair LANGUAGE SQL AS $$
+    SELECT key -> value
+      FROM kvstore
+      JOIN unnest($1) AS k ON kvstore.key = k
+$$;
+

Cute, eh? Its use is just like before, only now the output is more table-like:

+
SELECT * FROM getpairs('foo', 'baz');
+
+  k  |   v   
+-----+-------
+ baz | yow
+ foo | bar
+

You can also get them back as composites by omitting * FROM:

+
SELECT getpairs('foo', 'baz');
+
+  getpairs   
+-------------
+ (foo,bar)
+ (baz,yow)
+

Anyway, just something to consider the next time you need a function that allows +any number of key/value pairs to be passed. It’s not perfect, but it’s pretty +sweet.

+
+
+
    +
  1. +

    In the recent pgsql-hackers discussion that inspired +this post, Pavel Stehule suggested adding something like Oracle COLLECTIONs +to address this shortcoming. I don’t know how far this idea will get, but +it sure would be nice to be able to pass objects with varying kinds of +data, rather than be limited to data all of one type (values in an SQL +array must all be of the same type). ↩︎

    +
  2. +

    No, you won’t be able to use named parameters for this +application because named parameters are inherently non-variadic. That is, +you can only pre-declare so many named parameters: you can’t anticipate +every parameter that’s likely to be wanted as a key in our key/value store. ↩︎

    +
+
+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/08/pgxn-blog-twitterstream/ + <![CDATA[PGXN Blog and Twitterstream]]> + + 2010-08-04T16:51:39Z + 2010-08-04T16:51:39Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + +
+

I created the PGXN Blog yesterday. Tune in there for news and announcements. I’ll also be posting status reports once development gets underway, so that all you fans out there can follow my progress. Once the site is done (or at 1.0 anyway), the blog will be used for announcements, discussion of support issues, etc. So tune in!

+

Oh, and I created a PGXN Twitterstream, too. You should follow it! New blog +posts will be tweeted, and once the site gets going, new uploads will be +tweeted, too. Check it out!

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/07/introducing-mytap/ + <![CDATA[Introducing MyTAP]]> + + 2010-07-28T19:38:54Z + 2010-07-28T19:38:54Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + + + +
+

I gave my OSCON tutorial (slides) last week. It went okay. I spent way +too much time helping to get everyone set up with pgTAP, and then didn’t have +time to have the attendees do the exercises, and I had to rush through 2.5 hours +of material in 1.5 hours. Yikes! At least the video will be better when it’s +released (more when that happens).

+

But as often happens, I was asked whether something like pgTAP exists for +MySQL. But this time I was asked by MySQL Community Manager Giuseppe Maxia, +who also said that he’d tried to create a test framework himself (a fellow Perl +hacker!), but that it wasn’t as nice as pgTAP. Well, since I was at OSCON and +tend to like to hack on side projects while at conferences, and since I hoped +that Giuseppe will happily take it over once I’ve implemented the core, I +started hacking on it myself. And today, I’m pleased to announce the release of +MyTAP 0.01 (downloads).

+

Once you’ve downloaded it, install it against your MySQL server like so:

+
mysql -u root < mytap.sql
+
+

Here’s a very simple example script:

+
-- Start a transaction.
+BEGIN;
+
+-- Plan the tests.
+SELECT tap.plan(1);
+
+-- Run the tests.
+SELECT tap.pass( 'My test passed, w00t!' );
+
+-- Finish the tests and clean up.
+CALL tap.finish();
+ROLLBACK;
+

You can run this test from a .sql file using the mysql client like so:

+
mysql -u root --disable-pager --batch --raw --skip-column-names --unbuffered --database try --execute 'source test.sql'
+
+

But that’s a PITA and can only run one test at a time. Instead, put all of your +tests into a directory, perhaps named tests, each with the suffix “.my”, and +use my_prove (install TAP::Parser::SourceHandler::MyTAP from CPAN to get +it) instead:

+
my_prove -u root --database try tests/
+
+

For MyTAP’s own tests, the output looks like this:

+
tests/eq.my ........ ok
+tests/hastap.my .... ok
+tests/matching.my .. ok
+tests/moretap.my ... ok
+tests/todotap.my ... ok
+tests/utils.my ..... ok
+All tests successful.
+Files=6, Tests=137,  1 wallclock secs
+(0.06 usr  0.03 sys +  0.01 cusr  0.02 csys =  0.12 CPU)
+Result: PASS
+
+

Nice, eh? Of course there are quite a few more assertion functions. See the +complete documentation for details.

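For example, a slightly less trivial script might use eq(), MyTAP’s analog of pgTAP’s is() (a sketch only; I’m assuming the same have/want/description argument order here, so check the documentation for the exact signature):

BEGIN;
SELECT tap.plan(2);
-- eq() stands in for pgTAP's is().
SELECT tap.eq( 'foo', 'foo', 'strings should compare equal' );
SELECT tap.eq( 1 + 1, 2,     'and so should numbers' );
CALL tap.finish();
ROLLBACK;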
+

Now, I did my best to keep the interface the same as pgTAP, but there are a few +differences:

+
    +
  • MySQL temporary tables are teh suck, so I had to use permanent tables to track test state. To make this more feasible, MyTAP is always installed in its own database (named “tap” by default), and you must always schema-qualify your use of the MyTAP functions.
  • +
  • Another side-effect of permanent tables is that MyTAP must keep track of +test outcomes without colliding with the state from tests running in +multiple concurrent connections. So MyTAP uses connection_id() to keep +track of state for a single test run. It also deletes the state when tests +finish(), but if there’s a crash before then, data can be left in those +tables. If the connection ID is ever re-used, this can lead to conflicts. +This seems mostly avoidable by using InnoDB tables and transactions in the +tests.
  • +
  • The word “is” is strictly reserved by MySQL, so the function that +corresponds to pgTAP’s is() is eq() in MyTAP. Similarly, isnt() is +called not_eq() in MyTAP.
  • +
  • There is no way to throw an exception in MySQL functions and procedures, so the code cheats by instead performing an illegal operation: selecting from a non-existent column, where the name of that column is the error message (see the sketch just after this list). Hinky, but it should get the point across.
  • +
+
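To illustrate that last trick: the failure path boils down to selecting a column that cannot exist, using the message as its name. A sketch (the message text is made up):

-- MySQL rejects this with something like:
--   ERROR 1054 (42S22): Unknown column 'Looks like you planned 1 test but ran 2' in 'field list'
SELECT `Looks like you planned 1 test but ran 2`;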

Other than these issues, things went fairly smoothly. I finished up the 0.01 +version last night and released it today with most of the core functionality in +place. And now I want to find others to take over, as I am not a MySQL hacker +myself and thus unlikely ever to use it. If you’re interested, my +recommendations for things to do next are:

+ +

So fork on GitHub or contact me if you’d like to be added as a collaborator +(I’m looking at you, Giuseppe!).

+

Hope you find it useful.

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/06/pgxn-development-project/ + <![CDATA[PGXN Development Project]]> + + 2022-05-22T21:36:57Z + 2010-06-15T17:56:33Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + +
+

I’m pleased to announce the launch of the PGXN development project. I’ve written a detailed specification and pushed it through general approval on pgsql-hackers. I’ve written up a detailed project plan and estimated things at a highly reduced PostgreSQL Experts rate to come up with a fundraising goal: $25,000. And now, thanks to founding contributions from myYearbook.com and PostgreSQL Experts, we have started the fundraising phase of the project.

+

So what’s this all about? PGXN, the PostgreSQL Extension Network, is modeled on +CPAN, the Perl community’s archive of “all things Perl.” PGXN will provide +four major pieces of infrastructure to the PostgreSQL community:

+ +

I’ve been wanting to start this project for a long time, but given my need to +pay the bills, it didn’t seem like I’d ever be able to find the time for it. +Then Josh Berkus suggested that we try to get community interest and raise money +for me to have the time to work on it. So I jumped on that, putting in the hours +needed to get general approval from the core PostgreSQL developers and to create +a reasonable project plan and web site. And thanks to MyYearbook’s and PGX’s +backing, I’m really excited about it. I hope to start on it in August.

+

If you’d like to contribute, first: Thank You! The PGXN site has a Google Checkout widget that makes it easy to make a donation. If you’d rather pay by some other means (checks are great for us!), drop me a line and we’ll work something out. We have a few levels of contribution as well, including permanent linkage on the PGXN site for your organization, as well as the usual t-shirts and launch party invitations.

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/05/pgan-bikeshedding/ + <![CDATA[PGAN Bikeshedding]]> + + 2010-05-24T19:15:55Z + 2010-05-24T19:15:55Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + +
+

I’ve put together a description of PGAN, the PostgreSQL extension distribution system I plan to develop later this year based on the Comprehensive Perl Archive Network, or CPAN. Its primary features will be:

+
    +
  • Extension distribution
  • +
  • Search site with extension documentation
  • +
  • Client for downloading, building, testing, and installing extensions.
  • +
+

I’ve never been thrilled with the name, though, so I’m asking for suggestions +for a better one. I’ve used the term “extension” here because it seems to be the +term that the PostgreSQL community has settled on, but other terms might work, +since things other than extensions might be distributed.

+

What I’ve come up with so far is:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name   | Long Name                                         | Pronunciation         | Advantages                 | Disadvantages                                      |
|--------|---------------------------------------------------|-----------------------|----------------------------|----------------------------------------------------|
| PGAN   | PostgreSQL Add-on Network                         | pee-gan               | Short, similar to CPAN     | Ugly                                               |
| PGEX   | PostgreSQL Extensions                             | pee-gee-ex or pee-gex | Short, easier to pronounce | Too similar to PGX                                 |
| PGCAN  | PostgreSQL Comprehensive Archive Network          | pee-gee-can           | Similar to CPAN            | Similar to CPAN                                    |
| PGDAN  | PostgreSQL Distribution Archive Network           | pee-gee-dan           | Short, easy to pronounce   | Who’s “Dan”? Doesn’t distribute PostgreSQL itself. |
| PGEDAN | PostgreSQL Extension Distribution Archive Network | pee-gee-ee-dan        | References extensions      | Long, sounds stupid                                |
+

Of these, I think I like “PGEX” best, but none are really great. So I’m opening +up the bike shed to all. What’s a better name? Or if you can’t think of one, +which of the above do you like best? Just leave a comment on this post. The only +requirements for suggestions are that a .org domain be available and that it +suck less than the alternatives.

+

Comments close in 2 weeks. Thanks!

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/04/execute-sql-on-connect/ + <![CDATA[Execute SQL Code on Connect]]> + + 2010-04-28T00:14:07Z + 2010-04-28T00:14:07Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + + + + +
+

I’ve been writing a fair bit of PL/Perl for a client, and one of the things +I’ve been doing is eliminating a ton of duplicate code by creating utility +functions in the %_SHARED hash. This is great, as long as the code that +creates those functions gets executed at the beginning of every database +connection. So I put the utility generation code into a single function, called +prepare_perl_utils(). It looks something like this:

+
CREATE OR REPLACE FUNCTION prepare_perl_utils(
+) RETURNS bool LANGUAGE plperl IMMUTABLE AS $$
+    # Don't bother if we've already loaded.
+    return 1 if $_SHARED{escape_literal};
+
+    $_SHARED{escape_literal} = sub {
+        $_[0] =~ s/'/''/g; $_[0] =~ s/\\/\\\\/g; $_[0];
+    };
+
+    # Create other code refs in %_SHARED…
+$$;
+

So now all I have to do is make sure that all the client’s apps execute this +function as soon as they connect, so that the utilities will all be loaded up +and ready to go. Here’s how I did it.

+

First, for the Perl app, I just took advantage of the DBI’s callbacks to +execute the SQL I need when the DBI connects to the database. That link might +not work just yet, as the DBI’s callbacks have only just been documented and +that documentation appears only in dev releases so far. Once 1.611 drops, the +link should work. At any rate, the use of callbacks I’m exploiting here has been +in the DBI since 1.49, which was released in November 2005.

+

The approach is the same as I’ve described before: Just specify the +Callbacks parameter to DBI->connect, like so:

+
my $dbh = DBI->connect_cached($dsn, $user, $pass, {
+    PrintError     => 0,
+    RaiseError     => 1,
+    AutoCommit     => 1,
+    Callbacks      => {
+        connected => sub { shift->do('SELECT prepare_perl_utils()') },
+    },
+});
+

That’s it. The connected method is a no-op in the DBI that gets called to +alert subclasses that they can do any post-connection initialization. Even +without a subclass, we can take advantage of it to do our own initialization.

+

It was a bit trickier to make the same thing happen for the client’s Rails app. Rails, alas, provides no on-connection callbacks. So we instead have to monkey-patch Rails to do what we want. With some help from “dfr|mac” on #rubyonrails (I haven’t touched Rails in 3 years!), I got it down to this:

+
class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter
+  def initialize_with_perl_utils(*args)
+    returning(initialize_without_perl_utils(*args)) do
+      execute('SELECT prepare_perl_utils()')
+    end
+  end
+  alias_method_chain :initialize, :perl_utils
+end
+

Basically, we overpower the PostgreSQL adapter’s initialize method so that it calls the original implementation and then executes prepare_perl_utils() before it returns. It’s a neat trick; if you’re going to practice fuck typing, alias_method_chain makes it about as clean as can be, albeit a little too magical for my tastes.

+

Anyway, recorded here for posterity (my blog is my other brain!).

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/03/no-more-use-pgxs/ + <![CDATA[No more USE_PGXS=1?]]> + + 2010-03-15T18:33:18Z + 2010-03-15T18:33:18Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + +
+

I’ve become very tired of having to set USE_PGXS=1 every time I build pgTAP +outside the contrib directory of a PostgreSQL distribution:

+
make USE_PGXS=1
+make USE_PGXS=1 install
+make USE_PGXS=1 installcheck
+

I am forever forgetting to set it, and it’s just not how one normally expects a +build incantation to work. It was required because that’s how the core contrib +extensions work: They all have this code in their Makefiles, which those of +us who develop third-party modules have borrowed:

+
ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/citext
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+

They generally expect ../../src/Makefile.global to exist, and if it doesn’t, +you have to tell it so. I find this annoying, because third-party extensions are +almost never built from the contrib directory, so one must always remember to +specify USE_PGXS=1.

+

I’d like to propose, instead, that those of us who maintain third-party +extensions like pgTAP, PL/Parrot, and Temporal PostgreSQL not force our +users to have to remember this special variable by instead checking to see if +it’s needed ourselves. As such, I’ve just added this code to pgTAP’s +Makefile:

+
ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+else
+ifeq (exists, $(shell [ -e ../../src/bin/pg_config/pg_config ] && echo exists) ) 
+top_builddir = ../..
+PG_CONFIG := $(top_builddir)/src/bin/pg_config/pg_config
+else
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+endif
+endif
+

So it still respects USE_PGXS=1, but if it’s not set, it looks to see if it +can find pg_config where it would expect it to be if built from the contrib +directory. If it’s not there, it simply uses pg_config as if USE_PGXS=1 was +set. This makes building from the contrib directory or from anywhere else the +same process:

+
make
+make install
+make installcheck
+

Much better, much easier to remember.

+

Is there any reason why third-party PostgreSQL extensions should not adopt +this pattern? I don’t think it makes sense for contrib extensions in core to do +it, but for those that will never be in core, I think it makes a lot of sense.

+

Comments?

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+ + https://justatheory.com/2010/01/somethingest-from-each-entity/ + <![CDATA[SQL Hack: The Something-est From Each Entity]]> + + 2022-06-12T22:42:22Z + 2010-01-12T06:12:02Z + + David E. Wheeler + david@justatheory.com + https://justatheory.com/ + + + + + + + + +
+

This is a pattern that I have dealt with many times, but never figured out how +to adequately handle. Say that you have imported a mailbox into your database, +and you want a list of the latest messages between each pair of recipients +(sender and receiver — I’m ignoring multiple receivers for the moment). The data +might look like this:

+
BEGIN;
+
+CREATE TABLE messages (
+    sender   TEXT        NOT NULL,
+    receiver TEXT        NOT NULL,
+    sent_at  TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
+    body     TEXT        NOT NULL DEFAULT ''
+);
+
+INSERT INTO messages ( sender, receiver, body )
+VALUES ('Theory', 'Strongrrl', 'Hi There.' );
+
+INSERT INTO messages ( sender, receiver, body )
+VALUES ('Strongrrl', 'Theory', 'Hi yourself.' );
+
+INSERT INTO messages ( sender, receiver, body )
+VALUES ('Anna', 'Theory', 'What''s for dinner?' );
+
+INSERT INTO messages ( sender, receiver, body )
+VALUES ('Theory', 'Anna', 'Brussels Sprouts.' );
+
+INSERT INTO messages ( sender, receiver, body )
+VALUES ('Anna', 'Theory', 'Oh man!' );
+
+COMMIT;
+

So the goal is to show the most recent message between Theory and Strongrrl and +the most recent message between Theory and Anna, without regard to who is the +sender and who is the receiver. After running into this many times, today I +consulted my colleagues, showing them this dead simple (and wrong!) query to +demonstrate what I wanted:

+
SELECT sender, receiver, sent_at, body
+  FROM messages
+ GROUP BY sender, receiver
+HAVING sent_at = max(sent_at);
+

That’s wrong because one can’t have columns in the SELECT expression that are not either aggregate expressions or included in the GROUP BY expression. It’s a violation of the standard (and prone to errors, I suspect). Andrew immediately said, “Classic case for DISTINCT ON”. This lovely little expression is a PostgreSQL extension not included in the SQL standard. Its implementation looks like this:

+
SELECT DISTINCT ON (
+          CASE WHEN receiver > sender
+               THEN receiver || sender
+               ELSE sender   || receiver
+          END
+       ) sender, receiver, sent_at, body
+  FROM messages
+ ORDER BY CASE WHEN receiver > sender
+               THEN receiver || sender
+               ELSE sender   || receiver
+          END, sent_at DESC;
+

This query is saying, “fetch one row for each distinct combination of sender and receiver, ordered by sent_at DESC.” The CASE statement to get a uniform value for the combination of sender and receiver is a bit unfortunate, but it does the trick:

+
  sender   | receiver |            sent_at            |     body     
+-----------+----------+-------------------------------+--------------
+ Anna      | Theory   | 2010-01-12 05:00:07.026711+00 | Oh man!
+ Strongrrl | Theory   | 2010-01-12 05:00:07.02589+00  | Hi yourself.
+
+

Great, exactly the data I wanted. And the CASE statement can actually be +indexed to speed up filtering. But I wondered if it would be possible to get the +same results without the DISTINCT ON. In other words, can this be done with +standard SQL? If you’re using PostgreSQL 8.4, the answer is “yes.” All you have +to do is exploit window functions and a subquery. It looks like this:

+
SELECT sender, receiver, sent_at, body
+  FROM (
+    SELECT sender, receiver, sent_at, body,
+           row_number() OVER ( PARTITION BY 
+               CASE WHEN receiver > sender
+                    THEN receiver || sender
+                    ELSE sender   || receiver
+               END
+               ORDER BY sent_at DESC
+           ) AS rnum
+      FROM messages
+  ) AS t
+ WHERE rnum = 1;
+

Same nasty CASE statement as before (no way around it with this database design, alas), but this is fully conforming SQL. It’s also the first time I’ve ever used window functions. If you just focus on the row_number() OVER () expression, it’s simply partitioning the table by the same value used in the DISTINCT ON expression, but ordering by sent_at directly. The result is a row number, where 1 corresponds to the most recent message for each combination of recipients. Then we just filter for that in the WHERE clause.

+

Not exactly intuitive (I’m really only understanding it now as I write it out), but quite straight-forward once you accept the expressivity of this particular OVER expression. It might be easier to understand if we remove some of the cruft. If instead we wanted the most recent message from each sender (regardless of the recipient), we’d write:

+
SELECT sender, receiver, sent_at, body
+  FROM (
+    SELECT sender, receiver, sent_at, body,
+           row_number() OVER (
+               PARTITION BY sender ORDER BY sent_at DESC
+           ) AS rnum
+      FROM messages
+  ) AS t
+ WHERE rnum = 1;
+

And that yields:

+
  sender   | receiver |            sent_at            |     body     
+-----------+----------+-------------------------------+--------------
+ Anna      | Theory   | 2010-01-12 05:00:07.026711+00 | Oh man!
+ Strongrrl | Theory   | 2010-01-12 05:00:07.02589+00  | Hi yourself.
+ Theory    | Anna     | 2010-01-12 05:00:07.24982+00  | Brussels Sprouts.
+
+

Furthermore, we can use a common table expression to eliminate the subquery. +This query is functionally identical to the subquery example (returning to +uniqueness for sender and receiver), just with the WITH clause coming before +the SELECT clause, setting things up for it:

+
WITH t AS (
+    SELECT sender, receiver, sent_at, body,
+           row_number() OVER (PARTITION BY CASE
+               WHEN receiver > sender
+                   THEN receiver || sender
+                   ELSE sender   || receiver
+                   END
+               ORDER BY sent_at DESC
+           ) AS rnum
+      FROM messages
+) SELECT sender, receiver, sent_at, body
+    FROM t
+   WHERE rnum = 1;
+

So it’s kind of like putting the subquery first, only it’s not a subquery, it’s +more like a temporary view. Nice, eh? Either way, the results are the same as +before:

+
  sender   | receiver |            sent_at            |     body     
+-----------+----------+-------------------------------+--------------
+ Anna      | Theory   | 2010-01-12 05:00:07.026711+00 | Oh man!
+ Strongrrl | Theory   | 2010-01-12 05:00:07.02589+00  | Hi yourself.
+
+

I hereby dub this “The Entity’s Something-est” pattern (I’m certain someone else +has already come up with a good name for it, but this will do). I can see it +working any place requiring the highest, lowest, latest, earliest, or something +else-est item from each of a list of entities. Perhaps the latest headline from +every news source:

+
WITH t AS (
+    SELECT source, headline, dateline, row_number() OVER (
+               PARTITION BY source ORDER BY dateline DESC
+           ) AS rnum
+      FROM news
+) SELECT source, headline, dateline
+    FROM t
+   WHERE rnum = 1;
+

Or perhaps the lowest score for each basketball team over the course of a season:

+
WITH t AS (
+    SELECT team, date, score, row_number() OVER (
+               PARTITION BY team ORDER BY score
+           ) AS rnum
+      FROM games
+) SELECT team, date, score
+    FROM t
+   WHERE rnum = 1;
+

Easy! How have you handled a situation like this in your database hacking?

+ +

Looking for the comments? Try the old layout.

+
+ + + +]]>
+
+