
Thousands of files created while unloading with INSERT INTO FILES() SELECT ... FROM table - Exporting Parquet to S3 #56267

Open
DhyeyMoliya opened this issue Feb 25, 2025 · 0 comments
Labels
type/bug Something isn't working

DhyeyMoliya commented Feb 25, 2025

Steps to reproduce the behavior (Required)

  1. Shared-data cluster with AWS S3 as the storage volume.
  2. A table with roughly the following structure:
CREATE TABLE `messages` (
  `tenantId` varchar(500) NOT NULL,
  `created` datetime NOT NULL,
  `msgId` varchar(500) NOT NULL,
  `from` varchar(30) NULL,
  `to` varchar(100) NULL,
  `type` varchar(30) NULL,
  `source` varchar(500) NULL,
  `sent` datetime NULL,
  `code` varchar(100) NULL,
  `description` varchar(1048576) NULL,
  `payload` json NULL,
  `metadata` json NULL,
  `updatedAt` datetime NULL,
  `version` int(11) NULL
)
PRIMARY KEY(`tenantId`, `created`, `msgId`)
PARTITION BY date_trunc('day', `created`)
DISTRIBUTED BY HASH(`tenantId`)
ORDER BY(`tenantId`, `created`)
PROPERTIES (
	"bloom_filter_columns" = "msgId",
	"compression" = "LZ4",
	"datacache.enable" = "true",
	"enable_async_write_back" = "false",
	"enable_persistent_index" = "true",
	"persistent_index_type" = "CLOUD_NATIVE",
	"replication_num" = "1",
	"storage_volume" = "builtin_storage_volume"
);
  3. The table has about 600M records, averaging roughly 4M records per `created` date partition, with an average record size of around 1 KB.
  4. Export one day of data to S3 in Parquet format with zstd compression using INSERT INTO FILES(), submitted via the following SUBMIT TASK query:
SUBMIT TASK AS 
INSERT
	INTO
	FILES (
		"path" = "s3://test/exports/export1/messages/2024_12_26",
		"format" = "parquet",
		"compression" = "zstd",
		"single" = "true", -- turn this on or off
		"target_max_file_size" = "104857600",
		"aws.s3.access_key" = "AAAA",
		"aws.s3.secret_key" = "BBBB",
		"aws.s3.region" = "ap-south-1",
		"aws.s3.use_instance_profile" = "false"
	)
SELECT
	tenantId,
	created,
	msgId,
	`from`,
	`to`,
	type,
	source,
	sent,
	code,
	description,
	json_string(`payload`) as payload,
	json_string(`metadata`) as metadata,
	updatedAt,
	version
FROM
	messages
WHERE
	created >= '2024-12-25T18:30:00.000Z'
	AND created < '2024-12-26T18:30:00.000Z';
  5. Check the output .parquet files in the destination S3 location (a sketch of one way to sanity-check the export follows below).
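
One way to sanity-check the unloaded data (a minimal sketch, not part of the reproduction itself): read the exported objects back with the FILES() table function and compare the row count against the source query. The path and credentials below reuse the placeholders from the report, and the trailing "/*" wildcard over the export prefix is an assumption.

-- Hedged sketch: count rows across all Parquet objects written under the export prefix.
-- The "/*" wildcard and the reuse of the report's placeholder credentials are assumptions.
SELECT COUNT(*) AS exported_rows
FROM FILES (
	"path" = "s3://test/exports/export1/messages/2024_12_26/*",
	"format" = "parquet",
	"aws.s3.access_key" = "AAAA",
	"aws.s3.secret_key" = "BBBB",
	"aws.s3.region" = "ap-south-1",
	"aws.s3.use_instance_profile" = "false"
);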

Expected behavior (Required)

  1. Case 1 - with single=true:
    1. There should be a single .parquet file in the destination S3 location.
  2. Case 2 - with single=false:
    1. There should be multiple files of around 1 GB each in the destination S3 location.

Real behavior (Required)

  1. Case 1 - with single=true:
    1. Randomly seeing thousands of files (1500+) for one day of data; more if the query targets more rows.
    2. File sizes range from 100 KB to 1 GB (only 4-5 files are close to 1 GB).
  2. Case 2 - with single=false:
    1. Randomly seeing thousands of files (1500+) for one day of data; more if the query targets more rows.
    2. File sizes range from 100 KB to 1 GB (only 4-5 files are close to 1 GB).
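
To rule out a failed or retried task run as the cause of the extra files (a minimal sketch, not from the original report): since the export is submitted with SUBMIT TASK, the run state can be inspected in the task-run metadata. This assumes the information_schema.task_runs view and the named columns are available in this release.

-- Hedged sketch: inspect the most recent asynchronous task runs for the export.
-- Column names (TASK_NAME, STATE, ERROR_MESSAGE, CREATE_TIME) are assumed to exist in information_schema.task_runs.
SELECT TASK_NAME, STATE, ERROR_MESSAGE, CREATE_TIME
FROM information_schema.task_runs
ORDER BY CREATE_TIME DESC
LIMIT 5;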

StarRocks version (Required)

  • 3.3.9
  • 3.3.7