
Thousands of files created while unloading with INSERT INTO FILES() SELECT ... FROM table - Exporting Parquet to S3 #56267

Open
DhyeyMoliya opened this issue Feb 25, 2025 · 0 comments
Labels
type/bug Something isn't working

DhyeyMoliya commented Feb 25, 2025

Steps to reproduce the behavior (Required)

  1. Shared-data cluster with AWS S3 as the storage volume.
  2. A table with roughly the following structure:
CREATE TABLE `messages` (
  `tenantId` varchar(500) NOT NULL,
  `created` datetime NOT NULL,
  `msgId` varchar(500) NOT NULL,
  `from` varchar(30) NULL,
  `to` varchar(100) NULL,
  `type` varchar(30) NULL,
  `source` varchar(500) NULL,
  `sent` datetime NULL,
  `code` varchar(100) NULL,
  `description` varchar(1048576) NULL,
  `payload` json NULL,
  `metadata` json NULL,
  `updatedAt` datetime NULL,
  `version` int(11) NULL
)
PRIMARY KEY(`tenantId`, `created`, `msgId`)
PARTITION BY date_trunc('day', `created`)
DISTRIBUTED BY HASH(`tenantId`)
ORDER BY(`tenantId`, `created`)
PROPERTIES (
	"bloom_filter_columns" = "msgId",
	"compression" = "LZ4",
	"datacache.enable" = "true",
	"enable_async_write_back" = "false",
	"enable_persistent_index" = "true",
	"persistent_index_type" = "CLOUD_NATIVE",
	"replication_num" = "1",
	"storage_volume" = "builtin_storage_volume"
);
  3. The table has about 600M records, averaging roughly 4M records per `created` date partition, with an average record size of around 1 KB.
  4. Export one day of data to S3 in Parquet format with zstd compression using INSERT INTO FILES(), submitted via the following SUBMIT TASK query:
SUBMIT TASK AS 
INSERT
	INTO
	FILES (
		"path" = "s3://test/exports/export1/messages/2024_12_26",
		"format" = "parquet",
		"compression" = "zstd",
		"single" = "true", -- turn this on or off
		"target_max_file_size" = "104857600",
		"aws.s3.access_key" = "AAAA",
		"aws.s3.secret_key" = "BBBB",
		"aws.s3.region" = "ap-south-1",
		"aws.s3.use_instance_profile" = "false"
	)
SELECT
	tenantId,
	created,
	msgId,
	`from`,
	`to`,
	type,
	source,
	sent,
	code,
	description,
	json_string(`payload`) as payload,
	json_string(`metadata`) as metadata,
	updatedAt,
	version
FROM
	messages
WHERE
	created >= '2024-12-25T18:30:00.000Z'
	AND created < '2024-12-26T18:30:00.000Z';
  5. Check the output .parquet files in the destination S3 location (a sketch of one way to sanity-check the export follows below).
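
One way to sanity-check the unloaded data (a minimal sketch, not part of the reproduction itself): read the exported objects back with the FILES() table function and compare the row count against the source query. The path and credentials below reuse the placeholders from the report, and the trailing "/*" wildcard over the export prefix is an assumption.

-- Hedged sketch: count rows across all Parquet objects written under the export prefix.
-- The "/*" wildcard and the reuse of the report's placeholder credentials are assumptions.
SELECT COUNT(*) AS exported_rows
FROM FILES (
	"path" = "s3://test/exports/export1/messages/2024_12_26/*",
	"format" = "parquet",
	"aws.s3.access_key" = "AAAA",
	"aws.s3.secret_key" = "BBBB",
	"aws.s3.region" = "ap-south-1",
	"aws.s3.use_instance_profile" = "false"
);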

Expected behavior (Required)

  1. Case 1 - with single=true:
    1. There should be a single .parquet file in the destination S3 location.
  2. Case 2 - with single=false:
    1. There should be multiple files of around 1 GB each in the destination S3 location.

Real behavior (Required)

  1. Case 1 - with single=true:
    1. Randomly seeing thousands of files (1500+) for one day of data; more if the query targets more rows.
    2. File sizes range from 100 KB to 1 GB (only 4-5 files are close to 1 GB).
  2. Case 2 - with single=false:
    1. Randomly seeing thousands of files (1500+) for one day of data; more if the query targets more rows.
    2. File sizes range from 100 KB to 1 GB (only 4-5 files are close to 1 GB).
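
To rule out a failed or retried task run as the cause of the extra files (a minimal sketch, not from the original report): since the export is submitted with SUBMIT TASK, the run state can be inspected in the task-run metadata. This assumes the information_schema.task_runs view and the named columns are available in this release.

-- Hedged sketch: inspect the most recent asynchronous task runs for the export.
-- Column names (TASK_NAME, STATE, ERROR_MESSAGE, CREATE_TIME) are assumed to exist in information_schema.task_runs.
SELECT TASK_NAME, STATE, ERROR_MESSAGE, CREATE_TIME
FROM information_schema.task_runs
ORDER BY CREATE_TIME DESC
LIMIT 5;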

StarRocks version (Required)

  • 3.3.9
  • 3.3.7