git: 49cf560f1881 - main - filesystems/py-kerchunk: Add py-kerchunk 0.2.7

From: Po-Chuan Hsieh <sunpoet_at_FreeBSD.org>
Date: Thu, 29 May 2025 04:52:52 UTC
The branch main has been updated by sunpoet:

URL: https://cgit.FreeBSD.org/ports/commit/?id=49cf560f18818c74f2e587a5bfb8bc1933ae9250

commit 49cf560f18818c74f2e587a5bfb8bc1933ae9250
Author:     Po-Chuan Hsieh <sunpoet@FreeBSD.org>
AuthorDate: 2025-05-29 04:39:19 +0000
Commit:     Po-Chuan Hsieh <sunpoet@FreeBSD.org>
CommitDate: 2025-05-29 04:52:18 +0000

    filesystems/py-kerchunk: Add py-kerchunk 0.2.7
    
    Kerchunk is a library that provides a unified way to represent a variety of
    chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient
    access to the data from traditional file systems or cloud object storage. It
    also provides a flexible way to create virtual datasets from multiple files. It
    does this by extracting the byte ranges, compression information and other
    information about the data and storing this metadata in a new, separate object.
    This means that you can create a virtual aggregate dataset over potentially many
    source files, for efficient, parallel and cloud-friendly in-situ access without
    having to copy or translate the originals. It is a gateway to in-the-cloud
    massive data processing while the data providers still insist on using legacy
    formats for archival storage.
    
    We provide the following things:
    - completely serverless architecture
    - metadata consolidation, so you can understand a many-file dataset (metadata
      plus physical storage) in a single read
    - read from all of the storage backends supported by fsspec, including object
      storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive)
      and network protocols (ftp, ssh, hdfs, smb...)
    - loading of various file types (currently netcdf4/HDF, grib2, tiff, fits,
      zarr), potentially heterogeneous within a single dataset, without a need to go
      via the specific driver (e.g., no need for h5py)
    - asynchronous concurrent fetch of many data chunks in one go, amortizing the
      cost of latency
    - parallel access with a library like zarr without any locks
    - logical datasets viewing many (>~millions) data files, and direct
      access/subselection to them via coordinate indexing across an arbitrary number
      of dimensions
---
 filesystems/Makefile              |  1 +
 filesystems/py-kerchunk/Makefile  | 29 +++++++++++++++++++++++++++++
 filesystems/py-kerchunk/distinfo  |  3 +++
 filesystems/py-kerchunk/pkg-descr | 28 ++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+)

diff --git a/filesystems/Makefile b/filesystems/Makefile
index 7225d1423458..c61eae0c5e36 100644
--- a/filesystems/Makefile
+++ b/filesystems/Makefile
@@ -94,6 +94,7 @@
     SUBDIR += py-fsspec-xrootd
     SUBDIR += py-fusepy
     SUBDIR += py-gcsfs
+    SUBDIR += py-kerchunk
     SUBDIR += py-libzfs
     SUBDIR += py-llfuse
     SUBDIR += py-prometheus-zfs
diff --git a/filesystems/py-kerchunk/Makefile b/filesystems/py-kerchunk/Makefile
new file mode 100644
index 000000000000..4fef8b7643e6
--- /dev/null
+++ b/filesystems/py-kerchunk/Makefile
@@ -0,0 +1,29 @@
+PORTNAME=	kerchunk
+PORTVERSION=	0.2.7
+CATEGORIES=	filesystems python
+MASTER_SITES=	PYPI
+PKGNAMEPREFIX=	${PYTHON_PKGNAMEPREFIX}
+
+MAINTAINER=	sunpoet@FreeBSD.org
+COMMENT=	Functions to make reference descriptions for ReferenceFileSystem
+WWW=		https://fsspec.github.io/kerchunk/ \
+		https://github.com/fsspec/kerchunk
+
+LICENSE=	MIT
+LICENSE_FILE=	${WRKSRC}/LICENSE
+
+BUILD_DEPENDS=	${PYTHON_PKGNAMEPREFIX}setuptools>=42:devel/py-setuptools@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}setuptools-scm>=7:devel/py-setuptools-scm@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}wheel>=0:devel/py-wheel@${PY_FLAVOR}
+RUN_DEPENDS=	${PYTHON_PKGNAMEPREFIX}fsspec>=0:filesystems/py-fsspec@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}numcodecs>=0:misc/py-numcodecs@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}numpy>=0,1:math/py-numpy@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}ujson>=0:devel/py-ujson@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}zarr>=0.1<3,1:devel/py-zarr@${PY_FLAVOR}
+
+USES=		python
+USE_PYTHON=	autoplist concurrent pep517
+
+NO_ARCH=	yes
+
+.include <bsd.port.mk>
diff --git a/filesystems/py-kerchunk/distinfo b/filesystems/py-kerchunk/distinfo
new file mode 100644
index 000000000000..45262a4ccc15
--- /dev/null
+++ b/filesystems/py-kerchunk/distinfo
@@ -0,0 +1,3 @@
+TIMESTAMP = 1748107898
+SHA256 (kerchunk-0.2.7.tar.gz) = 0425aa0fbf56f898053ee4c4dd40b35cea12d2fc986e036086e99a4ad16bd4e6
+SIZE (kerchunk-0.2.7.tar.gz) = 709052
diff --git a/filesystems/py-kerchunk/pkg-descr b/filesystems/py-kerchunk/pkg-descr
new file mode 100644
index 000000000000..a351e30943fe
--- /dev/null
+++ b/filesystems/py-kerchunk/pkg-descr
@@ -0,0 +1,28 @@
+Kerchunk is a library that provides a unified way to represent a variety of
+chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient
+access to the data from traditional file systems or cloud object storage. It
+also provides a flexible way to create virtual datasets from multiple files. It
+does this by extracting the byte ranges, compression information and other
+information about the data and storing this metadata in a new, separate object.
+This means that you can create a virtual aggregate dataset over potentially many
+source files, for efficient, parallel and cloud-friendly in-situ access without
+having to copy or translate the originals. It is a gateway to in-the-cloud
+massive data processing while the data providers still insist on using legacy
+formats for archival storage.
+
+We provide the following things:
+- completely serverless architecture
+- metadata consolidation, so you can understand a many-file dataset (metadata
+  plus physical storage) in a single read
+- read from all of the storage backends supported by fsspec, including object
+  storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive)
+  and network protocols (ftp, ssh, hdfs, smb...)
+- loading of various file types (currently netcdf4/HDF, grib2, tiff, fits,
+  zarr), potentially heterogeneous within a single dataset, without a need to go
+  via the specific driver (e.g., no need for h5py)
+- asynchronous concurrent fetch of many data chunks in one go, amortizing the
+  cost of latency
+- parallel access with a library like zarr without any locks
+- logical datasets viewing many (>~millions) data files, and direct
+  access/subselection to them via coordinate indexing across an arbitrary number
+  of dimensions
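
The pkg-descr above says Kerchunk extracts byte ranges and stores them in a separate metadata object, so that chunks can be read in place without copying the originals. The sketch below illustrates only that idea using the Python standard library; it does not call kerchunk or fsspec. The helper names (build_references, read_chunk), the "data/N" key layout, and the fixed-size chunking are assumptions made for illustration, loosely modelled on a key -> [path, offset, length] mapping.

```python
import json
import os
import tempfile


def build_references(path, chunk_size):
    """Map chunk keys to [path, offset, length] without copying any data.

    Hypothetical helper: real formats have variable-sized, compressed chunks
    whose offsets must be discovered by parsing the file's own index.
    """
    size = os.path.getsize(path)
    refs = {}
    for i, offset in enumerate(range(0, size, chunk_size)):
        refs[f"data/{i}"] = [path, offset, min(chunk_size, size - offset)]
    return refs


def read_chunk(refs, key):
    """Fetch a single chunk by seeking to its recorded byte range."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Create a small stand-in for an archival data file (10 bytes: 0..9).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(bytes(range(10)))
    source_path = tmp.name

# Extract the references and serialise them as a separate JSON object.
refs = build_references(source_path, chunk_size=4)
sidecar = json.dumps(refs)

# A reader needs only the sidecar to locate and fetch one chunk in place.
chunk = read_chunk(json.loads(sidecar), "data/1")

os.remove(source_path)
```

Because each entry is an independent byte range, a reader could fetch many chunks concurrently and without locks, which is the property the feature list relies on. With remote storage, the seek-and-read step would presumably become a ranged request against the original object.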