# The Nimble File Format
Nimble (formerly known as _“Alpha”_) is a new columnar file format for large
datasets created by Meta. Nimble is meant to be a replacement for file formats
such as Apache Parquet and ORC.
Watch [this talk](https://www.youtube.com/watch?v=bISBNVtXZ6M) to learn more
about Nimble’s internals.
Nimble has the following design principles:
- **Wide:** Nimble is better suited for workloads that are wide in nature, such
as tables with thousands of columns (or streams) which are commonly found in
feature engineering workloads and training tables for machine learning.
- **Extensible:** Since the state-of-the-art in data encoding evolves faster
than the file layout itself, Nimble decouples stream encoding from the
underlying physical layout. Nimble allows encodings to be extended by library
users and recursively applied (cascading).
- **Parallel:** Nimble is meant to fully leverage highly parallel hardware by
providing encodings which are SIMD and GPU friendly. Although this is not
implemented yet, we intend to expose metadata to allow developers to better
plan decoding trees and schedule kernels without requiring the data streams
themselves.
- **Unified:** More than a specification, Nimble is a product. We strongly
discourage developers from (re-)implementing Nimble’s spec, to prevent the
environmental fragmentation issues observed with similar projects in the past.
We encourage developers to leverage the single unified Nimble library and to
create high-quality bindings to other languages as needed.
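
The cascading idea from the **Extensible** principle can be illustrated with a minimal sketch. The Python below is not Nimble's API: the function names (`dictionary_encode`, `rle_encode`, `cascade_encode`) are hypothetical, and dictionary plus run-length encoding are chosen only as familiar examples of one encoding being applied to the output of another.

```python
from itertools import groupby


def dictionary_encode(values):
    """Map each value to an index into a sorted list of distinct values."""
    alphabet = sorted(set(values))
    index_of = {value: i for i, value in enumerate(alphabet)}
    return alphabet, [index_of[value] for value in values]


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    return [value for value, count in runs for _ in range(count)]


def cascade_encode(values):
    # Stage 1: dictionary encoding turns the values into small integer indices.
    alphabet, indices = dictionary_encode(values)
    # Stage 2: the index stream is itself encoded again (the "cascade").
    return alphabet, rle_encode(indices)


def cascade_decode(alphabet, runs):
    # Undo the stages in reverse order.
    return [alphabet[i] for i in rle_decode(runs)]


column = ["us", "us", "us", "br", "br", "us"]
alphabet, runs = cascade_encode(column)
assert cascade_decode(alphabet, runs) == column
```

In this toy version the nesting is fixed in code. The principle above describes encodings that library users can extend and apply recursively, which this sketch does not attempt to model.
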
Nimble has the following features:
- Lighter metadata organization to efficiently support thousands to tens of
thousands of columns and streams.
- Uses FlatBuffers instead of Thrift/Protobuf to access large metadata sections
more efficiently.
- Uses block encoding instead of stream encoding to provide predictable memory
usage while decoding/reading.
- Supports many encodings out-of-the-box, and additional encodings can be added
as needed.
- Supports cascading (recursive/composite) encoding of streams.
- Supports pluggable encoding selection policies.
- Provides extensibility APIs through which encodings and other aspects of the
file can be extended.
- Clear separation between logical and physical encoded types.
- And more.
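
One way to read the block-encoding feature is that a reader only ever materializes one fixed-size block at a time. The sketch below illustrates that general idea under stated assumptions: it is not Nimble's on-disk layout or API, `encode_blocks` and `decode_blocks` are hypothetical names, `zlib` stands in for a real encoding, and the block size of 4 is arbitrary.

```python
import struct
import zlib


def encode_blocks(values, block_size):
    """Yield independently compressed blocks of 64-bit signed integers."""
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        raw = struct.pack(f"<{len(chunk)}q", *chunk)
        yield zlib.compress(raw)


def decode_blocks(blocks):
    """Decode lazily, one block at a time.

    At most one decoded block is held by this generator at any moment, so
    the decoder's working set is bounded by the block size rather than by
    the length of the whole stream.
    """
    for block in blocks:
        raw = zlib.decompress(block)
        yield list(struct.unpack(f"<{len(raw) // 8}q", raw))


values = list(range(10))
blocks = list(encode_blocks(values, block_size=4))
decoded = [value for block in decode_blocks(blocks) for value in block]
assert decoded == values
```
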
Nimble is a work in progress, and many of the features above are still under
design and/or active development. As such, Nimble does not provide stability or
versioning guarantees (yet). These will eventually be provided with a future
stable release. Use it at your own risk.
## Build
Nimble’s CMake build system is self-sufficient and able to either locate its
main dependencies or compile them locally. In order to compile it, one can
simply:
```shell
$ git clone git@github.com:facebookincubator/nimble.git
$ cd nimble
$ make
```
To override the default behavior and force the build system to, for example,
build a dependency locally (bundle it), one can:
```shell
$ folly_SOURCE=BUNDLED make
```
Nimble builds have been tested using clang 15 and 16. The build should
automatically compile the following dependencies: gtest, glog, folly, abseil,
and velox. You may first need to install the following system dependencies for
these to compile (example from Ubuntu 22.04):
```shell
$ sudo apt install -y \
git \
cmake \
flatbuffers-compiler \
protobuf-compiler \
libflatbuffers-dev \
libgflags-dev \
libunwind-dev \
libgoogle-glog-dev \
libdouble-conversion-dev \
libevent-dev \
liblz4-dev \
liblzo2-dev \
libelf-dev \
libdwarf-dev \
libsnappy-dev \
libssl-dev \
bison \
flex \
libfl-dev
```
Although Nimble’s codebase is currently closely coupled with velox, we intend
to decouple them in the future.
## License
Nimble is licensed under the Apache 2.0 License. A copy of the license
[can be found here.](LICENSE)