Ask HN: Distributed SQL engine for ultra-wide tables


I ran into a practical limitation while working on ML feature engineering and multi-omics data.

At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.

What I observed in practice:

- Standard SQL databases usually cap out around ~1,000–1,600 columns.
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.
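
For context, the join-based workaround in the last bullet usually looks something like the sketch below: features kept in a "long" (entity–attribute–value) table and pivoted back into wide rows at query time. Table and column names are illustrative, not taken from any particular feature store.

    -- Features stored one-per-row ("long" / EAV form).
    CREATE TABLE features_long (
        entity_id     BIGINT,
        feature_name  TEXT,
        feature_value DOUBLE PRECISION
    );

    -- Reassembling even three features into a wide row needs a pivot;
    -- at tens of thousands of features this query becomes unmanageable.
    SELECT
        entity_id,
        MAX(CASE WHEN feature_name = 'f_00001' THEN feature_value END) AS f_00001,
        MAX(CASE WHEN feature_name = 'f_00002' THEN feature_value END) AS f_00002,
        MAX(CASE WHEN feature_name = 'f_00003' THEN feature_value END) AS f_00003
    FROM features_long
    GROUP BY entity_id;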

At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.

I experimented with a different approach:

- no joins
- no transactions
- columns distributed instead of rows
- SELECT as the primary operation

With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
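
Concretely, the access pattern is just an ordinary projection over a handful of named columns from a very wide table. The table and column names below (omics_wide, sample_id, gene_*, cohort) are made up for illustration.

    -- A hypothetical table with ~1M feature columns, one row per sample.
    -- Only the columns named in the projection and filter need to be touched.
    SELECT sample_id, gene_000017, gene_042330, gene_185420
    FROM omics_wide
    WHERE cohort = 'discovery';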

On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:

- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second
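
In standard SQL terms, the first two operations map roughly onto the statements below. This is only a sketch of the workload shape, not the engine's actual DDL/DML, and all names are invented.

    -- 1M-column DDL: the real statement is generated; only a stub is shown.
    CREATE TABLE wide_features (
        row_id    BIGINT,
        f_0000001 DOUBLE PRECISION,
        f_0000002 DOUBLE PRECISION
        -- ... column definitions continue up to f_1000000
    );

    -- "Inserting a single column with 1M values" is an ALTER plus a bulk
    -- write in row-oriented SQL; with columns distributed instead of rows,
    -- it can amount to appending one column's values to one shard.
    ALTER TABLE wide_features ADD COLUMN f_1000001 DOUBLE PRECISION;
    UPDATE wide_features SET f_1000001 = 0.0;  -- placeholder bulk write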

I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?


Comments URL: https://news.ycombinator.com/item?id=46624190

Points: 4

# Comments: 2
