postgresbson
============
BSON support for PostgreSQL
Introduction
============
This PostgreSQL extension implements the BSON data type, together with functions
to create and inspect BSON objects for expressive and performant querying.
BSON (http://bsonspec.org/) is a high-performance, richly-typed data carrier
similar to JSON but offers a number of attractive features including:
* Datetime is a first class type. In pure
JSON, datetimes must be represented as strings (typically ISO 8601),
requiring conversion, potentially introducing lossiness, and impairing
native operations like `>` and `<=`.
* Numeric values are unambiguously carried in 4 first class types:
int32, int64, float, and decimal.
Unlike JSON there is no risk of inadvertently emitting a float of
1.0 as 1 and having it be parsed by a consumer as an integer.
A `decimal` value of 10.09 is preserved *exactly* as 10.09 with no risk of
floating point precision error, e.g. `10.08999999999`.
* `byte[]` is a first class type. This relieves the program of dealing
with base64 or other encodings of binary data to stringify it for JSON.
* Performance. Moving binary BSON in and out of the database under some
conditions is almost 10x faster than using native `jsonb` or `json` because
it avoids to- and from-string and to-dictionary conversion.
* Roundtrip ability. BSON is a binary spec, not a string. There is no whitespace,
quoting rules, etc. BSON that goes into Postgres comes out *exactly* the
same way, each time, every time.
* Standard SDK implementations in upwards of 20 languages.

Roundtripping and the many available language SDKs enable seamless creation,
manipulation, transmission, and querying of data in a distributed system without
coordinating IDLs, compile-time dependencies, nuances in platform type
representation, etc. Here is an example of a typical processing chain: Java
program -> message bus -> python util -> save to database -> other Java program
wakes up on insert trigger:
1. Java program constructs an `org.bson.Document` (which honors `java.util.Map` interface), e.g. `doc.put("name", "Steve")`, `doc.put("balance", new BigDecimal("143.99"))` etc.
2. Java program encodes `org.bson.Document` using Java BSON SDK to a BSON `byte[]`.
3. Java program publishes `byte[]` to a Kafka topic.
4. python listener wakes up on topic and receives `byte[]`.
5. python listener decodes `byte[]` using python BSON SDK into `dict`, **not**
a string. It is a fully reconstituted `dict` object with substructures, `datetime.datetime` for date fields, `Decimal` for penny-precise fields, etc. The listener prints some things but does not change anything in the dict.
6. python listener encodes the `dict` back to a `byte[]` and `INSERT`s it into a BSON column in Postgres using the `psycopg2` module (a minimal sketch of steps 4-6 appears after this list).
7. A different Java program wakes up on a Postgres insert trigger and `SELECT`s
the BSON column as a `byte[]` (e.g. `select bson_column::bytea where ...`)
8. This `byte[]` is *identical* to the one created in step 2.
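Steps 4-6 in python might look like the following minimal sketch. The Kafka consumer object, DSN, table name `bsontest`, and field names are illustrative assumptions, not part of the extension:
```
import bson        # from the pymongo distribution
import psycopg2

conn = psycopg2.connect("host=machine port=5432 dbname=foo")   # illustrative DSN
curs = conn.cursor()

for msg in consumer:                # `consumer` is an assumed Kafka consumer; msg.value is the raw byte[]
    doc = bson.decode(msg.value)    # fully typed dict: datetimes, Decimal128, bytes, ...
    print(doc.get("name"), doc.get("balance"))   # inspect but do not modify
    raw = bson.encode(doc)          # re-encode; key order is preserved, so the bytes should match step 2
    curs.execute("INSERT INTO bsontest (bdata) VALUES (%s)", (raw,))
    conn.commit()
```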
Another example of roundtrip ability involves hashes and digital signatures. BSON
offers an easy, robust, and precise means to associate arbitrarily complex data
with a hash of that data in a way that is difficult to do *reliably* with JSON or JSONB:
```
# Setup (assumptions: pymongo's bson package and psycopg2 are installed, table
# foo has a BSON column bson_data, and the DSN is illustrative):
import collections
import datetime
import hashlib
from decimal import Decimal

import bson
import psycopg2
from bson.codec_options import CodecOptions

conn = psycopg2.connect("host=machine port=5432 dbname=mydb")
curs = conn.cursor()

# Create complex data. Note array g is polymorphic:
amt = bson.decimal128.Decimal128(Decimal("-10267.01"))
data = {"dt":datetime.datetime.utcnow(),"amt":amt,"msg":"Hello!","g":["2",2,2.0,{'v':2}]}
# Probably in a util function, take data and associate it with metadata.
# Get the SHA2 of this data:
raw = bson.encode(data)
h2 = hashlib.sha256(raw).hexdigest()
meta = {'security':{'hash':{'algo':'SHA256','v':h2}}}
doc = {'d': data,'m': meta}
# Encode the *whole* doc (data + meta) and insert into BSON column:
rb2 = bson.encode(doc)
curs.execute('insert into foo (bson_data) values (%s);', (rb2,))
conn.commit()
# -- L A T E R --
# Pull the material back out of the database. Be sure to cast to
# bytea so that psycopg2 does not try to turn it into an EJSON string!
curs.execute("select bson_data::bytea from foo;")
all_recs = curs.fetchall()
# bytea comes in as type 'memoryview' for efficiency; get the bytes
# from row 0, column 0:
r = bytes(all_recs[0][0])
doc2 = bson.decode(r, codec_options=CodecOptions(document_class=collections.OrderedDict))
# Extract the stored hash value:
h3 = doc2['m']['security']['hash']['v']
# REHASH the data portion:
r2 = bson.encode(doc2['d'])
h4 = hashlib.sha256(r2).hexdigest()
if h3 == h4:
    print("rehash of data OK")
else:
    print("WARN: rehash of data does not match associated hash")
```
The extension offers two kinds of accessor suites:
* Typesafe high performance accessor functions that take a dotpath notation
to get to a field e.g.<br>
```
select bson_get_datetime(bson_column, 'msg.header.event.ts') from table;
```
* Arrow and double arrow operators similar to those found in the JSON type e.g.<br>
```
select (bson_column->'msg'->'header'->'event'->>'ts')::timestamp from table;
```
The BSON type is castable to JSON in so-called [EJSON](https://www.mongodb.com/docs/manual/reference/mongodb-extended-json/) format to preserve type fidelity.
Thus, the wealth
of functions and operations and even other extensions built around the JSON type
can be used on BSON. These of course can be combined:
```
-- Use dotpath to quickly get to event substructure, then cast to jsonb and
-- use ?& operator to ask if both `id` and `type` are present as top level tags:
select (bson_get_bson(bson_column, 'msg.header.event')::jsonb) ?& array['id','type'] from table;
```
Why dotpath accessors are better in BSON (*and* native json and jsonb, too)
---------------------------------------------------------------------------
In general, the dotpath functions will be much faster and more memory efficient,
especially for larger and/or deeper structures. This is because the dotpath
implementation in the underlying C library itself will "walk" the BSON structure
and only vend allocated material at the terminal of the path. The arrow operators,
by contrast, necessitate the construction of a fully inflated substructure at each
step in the path, exactly as the arrow operators on the native `json` and `jsonb`
types do. For example, consider this instance with approx. 3K of data, from which
we want to query where `id` is `AAA`:
```
{
  header: { event: {id: "E123"} },
  data: {
    payload: {
      product: {
        definition: {
          id: "AAA",
          ... another 1K of data ...
        },
        constraints: {
          ... another 1K of data ...
        },
        approvals: {
          ... another 1K of data ...
        }
      }
    }
  }
}
```
The dotpath method will make postgres pass the 3K structure to the BSON extension (internally of course, not back to the client!), which will only have to examine roughly 64 bytes of data to dig down to the `id` field to return a string which is then examined for equality to `AAA`:
```
select * from table where bson_get_string(bson_column, 'data.payload.product.definition.id') = 'AAA';
```
With arrow operators, *at each arrow*, almost 3K of data has to be processed.
Remember that the double arrow operator yields a text type, which fortunately is
easily compared to a literal string; no casting is necessary, but the journey
there is expensive because each single arrow forces construction of a whole new
object to pass to the next stage. This happens internal to the engine, but still:
```
select * from table where
bson_column->'data'->'payload'->'product'->'definition'->>'id' = 'AAA';
^ ^ ^ ^ ^ ^
| | | | | |
+- initial pull of a little more than 3K | |
| | | | |
+- almost 3K reprocessed | |
| | | |
+- another 3K reprocessed| |
| | |
+- another 3K | |
| |
+- about 1K here
|
+- a handful of bytes
Total: about 13K of data, in 4 separate pass-and-construct chunks of roughly 3K each, processed to extract 3 bytes.
```
Again, this is *exactly* the same situation that occurs with native `json` and `jsonb`
types using the arrow operators; it is *not* particular to BSON. This is why postgres
provides the `#>` operator and the corresponding `json_extract_path` and `jsonb_extract_path` functions for these native types.
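For reference, here is a sketch of what the single-pass path form looks like after casting to `jsonb`; note the cast itself still converts the whole document, which is exactly the cost `bson_get_string` avoids:
```
-- #>> takes a text[] path and yields text, so it compares directly to the literal:
select * from table where
    (bson_column::jsonb) #>> '{data,payload,product,definition,id}' = 'AAA';

-- equivalent function form:
select * from table where
    jsonb_extract_path_text(bson_column::jsonb,
        'data','payload','product','definition','id') = 'AAA';
```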
Arrays, BSON, and JSON
----------------------
Unlike JSON, BSON can only be created from a key:value object (which can be
empty); it cannot be created from a simple scalar or in particular an array.
In the course of descending into substructure, it is of course possible to
encounter an array and have that returned. The BSON spec handles this by
encoding an array as a string-indexed object whose keys are the string form of
the integer positions in the array, starting at offset zero, and that is how
such a value comes back. From the CLI, this will be seen as:
```
select bson_column->'d'->'payload'->'vector' from ...

returns {"0":"some value", "1":"another value", "2":{"fldx":"a rich shape"}}
not     ["some value", "another value", {"fldx":"a rich shape"}]
```
The postgres array type does not help here because postgres requires arrays to
have a homogeneous element type, while BSON arrays may be heterogeneous.
There are three ways to deal with this:
1. If you don't need the whole array, access individual items using strings; note the double arrow and the quotes around '0':
```
select bson_column->'d'->'payload'->'vector'->>'0' from ...
```
2. "Back up one level" in the arrow chain and cast to `jsonb`:
```
mydb=# select (bson_column->'d'->'payload')::jsonb->'vector' from ...
?column?
--------------
[21, 17, 19]
-- Note no quotes around 0 because now it is jsonb:
mydb=# select (bson_column->'d'->'payload')::jsonb->'vector'->>0 from ...
?column?
--------------
21
```
3. Use the `bson_get_jsonb_array` function:
```
mydb=# select bson_get_jsonb_array(bson_column,'d.payload.vector') from ...
?column?
--------------
[21, 17, 19]
mydb=# select bson_get_jsonb_array(bson_column,'d.payload.vector')->0 from ...
?column?
--------------
21
```
The subtle but at times very important benefit of `bson_get_jsonb_array`
over approach #2 above is that the dotpath will quickly and efficiently
navigate to just the `vector`. The issue with "backing up one level" is
that if `payload` is very big, the whole structure must be converted to
`jsonb` only to pull out what could be a small `vector`.
Remember: For objects, if you wish to use the `jsonb` functions, simply
cast to `jsonb`:
```
mydb=# select bson_get_bson(bson_column,'d.payload')::jsonb ...
```
Status
======
Experimental. All contribs / PRs / comments / issues welcome.
Example
=======
CREATE EXTENSION pgbson;
CREATE TABLE data_collection ( data BSON );
-- Programmatic insert of material through postgres SDK e.g. psycopg2
-- can use native types directly; this is the high-fidelity sweet spot
-- for the BSON extension:
import bson
import psycopg2
import datetime
conn = psycopg2.connect(DSN) # 'host=machine port=5432 dbname=foo .... '
sdata = {
    "header": {
        "ts": datetime.datetime(2022,5,5,12,13,14,456),
        "evId": "E23234"
    },
    "data": {
        "id": "ID0",
        "notIndexed": "N0",
        # BSON decimal128 adheres more strongly to the IEEE spec than native
        # python Decimal esp. wrt NaN so we must use that. We are
        # encoding to BSON later anyway so no extra dependencies here.
        "amt": bson.decimal128.Decimal128("107.78"),
        "txDate": datetime.datetime(2022,12,31),
        "userPrefs": [
            {"type": "DEP", "updated": datetime.datetime(2021,4,4,12,13,14), "u": {
                "favoriteCar": "Bugatti",
                "thumbnail": bson.binary.Binary(bytes("Pretend this is a JPEG", 'utf-8'))
            }},
            {"type": "X2A", "updated": datetime.datetime(2021,3,3,12,13,14), "u": {
                "listOfPrimes": [2,3,5,7,11,13,17,19],
                "atomsInBowlOfSoup": 283572834759209881,
                "pi": 3.1415926
            }}
        ]
    }
}
# raw_bson is byte[]. BSON is castable to/from bytea type in PG.
# See pgbson--2.0.sql about byte[] validation to ensure that the bytea
# being stored is real BSON not malformed junk. Note that null handling is
# the same as for regular types and the to/from JSON and validation machinery
# is NOT invoked. In other words:
#
# mydb=# insert into bsontest (bdata,marker) values (null,'hello!');
# INSERT 0 1
# mydb=# select bdata,marker from bsontest where bdata is null;
# bdata | marker
# -------+--------
# | hello!
# (1 row)
# mydb=# select bdata->'d'->'payload',marker from bsontest where bdata is null;
# OR mydb=# select bdata::bytea,marker from bsontest where bdata is null;
# OR mydb=# select bdata::json,marker from bsontest where bdata is null;
# ?column? | marker
# ----------+--------
# | hello!
# Following a path from a NULL BSON still produces NULL.
raw_bson = bson.encode(sdata)
curs = conn.cursor()
curs.execute("INSERT INTO bsontest (data) VALUES (%s)",(raw_bson,))
conn.commit()
curs.execute("""
SELECT
2 * bson_get_decimal128(data, 'data.amt'),
4 + bson_get_datetime(data, 'data.txDate')::date
from bsontest
""")
rr = curs.fetchall()[0] # one row
print("amt: ", rr[0], type(rr[0])) # amt: 215.56 <class 'decimal.Decimal'> fetched as relaxed native Decimal type
print("txDate: ", rr[1], type(rr[1])) # txDate: 2023-01-04 <class 'datetime.date'>
-- EJSON is recognized upon insertion. For example, the $date substructure
-- "ts": {"$date":"2022-03-03T12:13:14.789Z"}
-- is converted upon parse to a single scalar of type timestamp:
-- "ts": timestamp
INSERT INTO data_collection (data) values (
'{"d":{
"recordId":"R1",
"notIndexed":"N1",
"baz":27,
"bigint":{"$numberLong":"88888888888888888"},
"dbl":3.1415,
"ts": {"$date":"2022-03-03T12:13:14.789Z"},
"amt": {"$numberDecimal":"77777809838.97"},
"payload": {
"fun":"scootering",
"val":13,
"vector":[21,17,19],
"image" : { "$binary" : { "base64" : "VGhpcyBpcyBhIHRlc3Q=", "subType" : "00" } }
}
}
}'
);
-- Functional indexes work as well, enabling high performance queries
-- into any part of the structure; note the dotpath in the example below:
CREATE INDEX ON data_collection( bson_get_string(data, 'd.recordId'));
-- ... and those indexes help. Queries on peer fields in a substructure,
-- one with an index, one without, yields nearly a 10000x improvement in
-- performance.
select count(*) from data_collection;
count
---------
1000000
explain analyze select count(*) from btest where bson_get_string(data,'d.recordId') = 'R31';
                                        QUERY PLAN
----------------------------------------------------------------------------------------------------
Aggregate (cost=8.45..8.46 rows=1 width=8) (actual time=0.076..0.077 rows=1 loops=1)
-> Index Scan using btest_bson_get_string_idx on btest (cost=0.42..8.44 rows=1 width=0) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (bson_get_string(data, 'd.recordId'::cstring) = 'R31'::text)
Planning Time: 0.296 ms
Execution Time: 0.108 ms
explain analyze select count(*) from btest where bson_get_string(data,'d.notIndexed') = 'N31';
Aggregate (cost=215012.50..215012.51 rows=1 width=8) (actual time=993.471..993.472 rows=1 loops=1)
-> Seq Scan on btest (cost=0.00..215000.00 rows=5000 width=0) (actual time=0.124..993.464 rows=1 loops=1)
Filter: (bson_get_string(data, 'd.notIndexed'::cstring) = 'N31'::text)
Rows Removed by Filter: 999999
Planning Time: 0.079 ms
Execution Time: 993.505 ms
-- These are equivalent queries but the dotpath accessors will be
-- significantly faster (and much more memory efficient) if the path depth > 3.
-- Note of course that the BSON accessor functions can be used in
-- a selection set (select), a predicate (where), or essentially anywhere else.
select bson_get_bson(data, 'd.payload') from btest where bson_get_string(data,'d.recordId') = 'R1';
select data->'d'->'payload' from btest where bson_get_string(data,'d.recordId') = 'R1';
?column?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
{ "fun" : "scootering", "val" : 13, "vector" : [ 1, 2, 3 ], "fancy" : [ { "A" : 1 }, { "B" : [ "X", "Z" ] } ], "image" : { "$binary" : { "base64" : "VGhpcyBpcyBhIHRlc3Q=", "subType" : "00" } } }
-- The ->> operator, like JSON, yields a text type, which you can cast
-- to whatever type you believe is appropriate
select (data->'d'->>'amt')::numeric as my_amount from btest where bson_get_string(data,'d.recordId') = 'R1';
my_amount
----------------
77777809838.97
Building
========
Requires:
* postgres development SDK (mostly for `.h` files in `.../postgresql/server`).
On OS X you can use `brew`. On RH 8 it is a little trickier because many
repos do not have version 14.x. Here is a [a good step-by-step install
of 14.4 for RH 8](https://www.linuxshelltips.com/install-postgresql-rhel)
Note you will need both server and development (e.g. `postgresql14-devel`)
packages because you need `.h` files. It is not necessary to do a complete
build of postgres.
* `pg_config` (which comes with postgres), which is used by the Makefile.
Note that some earlier versions of postgres did not include the
`pg_config` exec and the `pgxs` environment.
* `libbson.so` and BSON C SDK `.h` files. You can make or install these
separately and there is plenty of material on this topic. There are some
quick tips at the top of the `Makefile`.
* C compiler. No C++ used. The compiler arguments are driven by the
environment set up by `pg_config`.
Then:
```
git clone https://github.com/buzzm/postgresbson.git
# edit the Makefile to point at the BSON includes and dynamic lib; then:
make PGUSER=postgres # compiles pgbson.c to pgbson.so
make PGUSER=postgres install # copies .so, .sql, and .control files into target dirs in postgres environment
```
There are too many build directory permissions, install directory permissions,
and other local oddments to document here. Neither postgres nor root privilege
is required to compile and link the shared lib, but *installation* in your
particular environment will vary. In general, on OS X using `brew` you won't
need root because root does not own `/opt/homebrew`, but on Linux, lots of
things are done with `sudo yum` and resources end up owned by root in
`/usr/pgsql-nn` and `/usr/lib64`.
In addition, on RHEL 9, there appears to be an oddment around the compilation
target in make trying to create a `.bc` LLVM file using `clang` in addition to
the regular `.so` shlib using `gcc`. On many platforms, `clang` may not be
installed, and even if it is, the `pg_config`-modified `Makefile` may use the
wrong path to it, e.g. `/usr/lib64/ccache/clang` instead of `/usr/bin/clang` or
just `clang`. The good news is that the `.bc` LLVM output appears *not* to be necessary
for the postgres extension. As long as the `.so` is installed into the proper
path the extension will work. At your discretion you can "manually make" the
`.bc` file by editing the command line and then running `make PGUSER=postgres install` again.
Make sure you install *and then restart* your postgres server to properly
pick up the new BSON extension.
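A minimal smoke test after the restart might look like this; the literal document is illustrative and assumes the text-to-BSON cast goes through the EJSON parser, as in the `INSERT` example above:
```
CREATE EXTENSION pgbson;

-- Parse a small EJSON literal and pull a field back out with a dotpath accessor:
SELECT bson_get_string('{"greeting": {"msg": "hello"}}'::bson, 'greeting.msg');
-- expected: hello
```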
Testing
========
Tested on
* PG 16.3:
* OS X 14.3.1 Sonoma; Apple clang version 15.0.0 (clang-1500.1.0.2.5)
* RHEL 9.3
* PG 16.1.3:
* OS X 14.3.1 Sonoma; Apple clang version 15.0.0 (clang-1500.1.0.2.5)
* OS X 13.2 Ventura
* RHEL 9.3
* PG 15.5.3: OS X 13.2 Ventura, RHEL 9.3
* PG 14.4: OS X 10.15.7, OS X 13.2 Ventura, and RHEL 8.6.
```
# Make sure postgresql16-devel (or 14 or 15) is installed *first*.
# pip3 install psycopg2 uses pg_config; thus the PATH must also be set!
PATH=/path/to/pgsql-16/bin:$PATH pip3 install psycopg2
pip3 install pymongo # for bson only; we won't be using the mongo driver
python3 pgbson_test.py
```
See the excuse at the top of the test file regarding the non-standard test driver approach.
Quick reference
===============
The module defines the BSON data type, with operator families for B-TREE and HASH indexes. A short usage sketch follows the reference lists below.
Field access (supports dot notation):
* bson_get_string(bson_column, dotpath) RETURNS text
* bson_get_int32(bson_column, dotpath) RETURNS int4
* bson_get_int64(bson_column, dotpath) RETURNS int8
* bson_get_double(bson_column, dotpath) RETURNS float8
* bson_get_decimal(bson_column, dotpath) RETURNS numeric
* bson_get_datetime(bson_column, dotpath) RETURNS timestamp without time zone
* bson_get_binary(bson_column, dotpath) RETURNS bytea
* bson_get_boolean(bson_column, dotpath) RETURNS boolean
* bson_get_bson(bson_column, dotpath) RETURNS bson
* bson_get_jsonb_array(bson_column, dotpath) RETURNS jsonb
* bson_as_text(bson_column, dotpath) RETURNS text
Operators and comparison:
* Operators: =, <>, <=, <, >=, >, == (binary equality), <<>> (binary inequality)
* bson_hash(bson) RETURNS INT4
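A short usage sketch tying these together; the table and dotpaths follow the EJSON `INSERT` example earlier, and the exact output is not shown:
```
-- Typed field extraction via dotpath accessors:
select bson_get_string(data, 'd.recordId'),
       bson_get_datetime(data, 'd.ts')
from   data_collection;

-- The comparison operators support ORDER BY and B-TREE indexes on whole values;
-- == and <<>> compare the raw binary encodings:
select data from data_collection order by data;
select count(*) from data_collection where data == data;  -- byte-for-byte equality

-- bson_hash supports HASH indexes and grouping:
select bson_hash(data) from data_collection;
```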
TO DO
========
1. **Significantly** tidy up test driver
2. Need something better when extracting arrays.
3. Need something better when bson_get_bson() resolves to a scalar because
simple scalars like a string are not BSON. Currently, it just returns NULL
which is "technically correct" but unsatisfying.
4. Need additional safety checks when doing BSON compare and other operations
because corrupted BSON has a tendency to segfault the PG process.
5. Need more docs or a more clever solution for when calling `bson_get_{type}`
points to a field that exists but the type is wrong. Currently it just
returns null because that is "safest."
See also
========
* PostgreSQL - http://www.postgresql.org/
* BSON - http://bsonspec.org/
* MongoDB - http://mongodb.org
", Assign "at most 3 tags" to the expected json: {"id":"7025","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"