Query Terabytes of Parquet Data Directly from S3: Introducing ParquetReader On-Prem
2025-10-14

A New Era for On-Premise Analytics
Until now, ParquetReader made it easy to preview and query local Parquet, Avro, ORC, and Feather files: no setup, no database.
Today, we're taking that one step further. You can now connect ParquetReader directly to your own S3 or compatible object storage (including Google Cloud Storage, MinIO, Backblaze, and Wasabi).
Run SQL queries on terabytes of data instantly, all inside your own infrastructure, without moving or uploading a single byte.
Why We Built It
Many teams want full control over their data while keeping the power of instant analytics. Cloud services like Athena and BigQuery are great, but they come with vendor lock-in, cost, and complexity.
That's why we created the ParquetReader Self-Hosted Edition: a lightweight, Docker-ready engine that runs entirely in your cloud or data center, with direct access to your own object storage.
Step 1: Deploy ParquetReader in Your Cloud
Run ParquetReader on-premise using Docker or Kubernetes. The setup is simple: one container, zero dependencies, no telemetry.
Once deployed, open the web UI and you'll find the familiar ParquetReader interface; it's the same experience, now hosted within your own secure network.
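For Docker Compose users, a single-container deployment like the one described above can be sketched in a few lines. Note that the image name and port here are placeholders, not the official values; check the deployment docs for your edition:

```yaml
services:
  parquetreader:
    # Placeholder image name; substitute the image from your download page.
    image: parquetreader/self-hosted:latest
    ports:
      - "8080:8080"   # expose the web UI on localhost:8080 (port is illustrative)
    restart: unless-stopped
```

Because there are no external dependencies, no other services need to be declared; the one container is the whole stack.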
Step 2: Connect Your S3 or GCS Bucket
In the self-hosted UI, you'll see the new Connect Storage panel.
Enter your S3 or GCS credentials once: access key, secret key, and endpoint (for example, storage.googleapis.com for Google Cloud).
From there, ParquetReader automatically lists and connects to the Parquet files in that bucket or folder path.
Step 3: Query Without Uploading Anything
Your data stays right where it is; ParquetReader streams and queries files directly in place.
Behind the scenes, multiple Parquet files are merged automatically, so schema drift or missing columns never break your query.
In the SQL editor, your bucket is represented as a single table called dataset. Just type:
`SELECT country, COUNT(*) FROM dataset WHERE active = true GROUP BY country;`
and see results instantly, even on multi-gigabyte or terabyte-scale datasets.
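The schema-merge behavior above is easy to reason about with a plain-Python sketch: take the union of the columns seen across all files, and pad any column a file lacks with None (SQL NULL), so a query over `dataset` never fails on an absent column. This is a conceptual illustration only, not ParquetReader's actual implementation:

```python
def merge_rows(files):
    """Union rows from several files, padding columns that some files lack.

    `files` is a list of row lists; each row is a dict of column -> value.
    """
    # Collect the union of every column name seen in any file,
    # preserving first-seen order.
    columns = []
    for rows in files:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Emit each row with the full column set, filling gaps with None (NULL).
    return [{col: row.get(col) for col in columns}
            for rows in files for row in rows]

# Two "files" with drifted schemas: the second dropped `active`
# and added `signup_date`.
old_file = [{"country": "DE", "active": True}]
new_file = [{"country": "FR", "signup_date": "2025-10-01"}]

merged = merge_rows([old_file, new_file])
# Every merged row now carries all three columns, with NULLs where needed.
```

A filter like `WHERE active = true` then simply evaluates to false for the padded NULLs instead of erroring out.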
Step 4: Explore, Filter, and Export
Filter, sort, or search directly from the ParquetReader UI. The table view updates in real time while your queries execute on remote storage.
Need to share results? You can export them as CSV or JSON without ever downloading the full dataset.
Security and Compliance First
Your data never leaves your environment. All queries are executed locally in your container: no external APIs, no telemetry, no uploads.
You have full control over credentials, IAM roles, and storage endpoints, making it a good fit for regulated industries and privacy-sensitive organizations.
Why This Matters
Data teams no longer need to upload files, spin up clusters, or maintain expensive query services just to explore their Parquet data.
With the new S3 integration, ParquetReader becomes your own in-house analytics engine: fast, private, and storage-agnostic.
Query terabytes of Parquet data directly from your own S3 or GCS bucket: instantly, securely, and without moving a single byte.
Get Started with ParquetReader On-Prem
Ready to try it out? Download the Self-Hosted Edition and deploy it inside your cloud or data center today.
Once running, open your browser, add your S3 credentials in the UI, and start running SQL queries across all your Parquet files. It's that simple.
