[SUPPORT] Flink Hudi Job Configuration and Parameter Conflict Issues #12024
Comments
Hi @usberkeley Thank you for reporting this issue. I was able to replicate the issue using Spark as well. Specifically, I created a table using Spark SQL, specifying the primary key columns as id and name. Then, I inserted data into the table using a DataFrame (df), specifying the record key fields as id and age instead of id and name. Spark Code:
package com.ranga
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
object Test12024 extends App {
val name = this.getClass.getSimpleName.replace("$", "")
val sparkConf = new SparkConf().setAppName(name).setIfMissing("spark.master", "local[2]")
val spark = SparkSession.builder.appName(name).config(sparkConf)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
.config("spark.sql.hive.convertMetastoreParquet", "false")
.getOrCreate()
spark.sql(
"""
|CREATE TABLE IF NOT EXISTS t_test (
| `id` VARCHAR(20),
| `name` VARCHAR(10),
| `age` INT,
| `ts` Long
|) USING HUDI TBLPROPERTIES (primaryKey = 'id,name', preCombineField = 'ts')
| LOCATION '/tmp/warehouse/t_test'
""".stripMargin)
val input_schema = StructType(Seq(
StructField("id", LongType),
StructField("name", StringType),
StructField("age", IntegerType),
StructField("ts", LongType),
))
val input_data = Seq(
Row(1L, "hello", 42, 1695159649087L),
Row(2L, "world", 13, 1695091554788L),
Row(3L, "spark", 7, 1695115999911L),
Row(1L, "hello", 43, 1695159649087L),
)
val tableName = name
val basePath = f"file:///tmp/$tableName"
val hoodieConf = scala.collection.mutable.Map[String, String]()
hoodieConf.put("hoodie.datasource.write.recordkey.field", "id,age")
hoodieConf.put("hoodie.table.precombine.field", "ts")
hoodieConf.put("hoodie.table.name", tableName)
val input_df = spark.createDataFrame(spark.sparkContext.parallelize(input_data), input_schema)
input_df.write.format("hudi").
options(hoodieConf).
mode("overwrite").
save(basePath)
spark.read.format("hudi").load(basePath).show(false)
spark.stop()
}
Created upstream jira to track the issue:
I later tested the append mode as well and encountered the same issue.
This is by design.
@danny0405 In this case, should we directly throw an error, informing the user of the conflict and asking them to correct the configuration?
Maybe we just log some warnings there.
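To make the warn-versus-error trade-off concrete, here is a minimal, hypothetical sketch (not Hudi's actual API; the object name, method, and failOnConflict flag are made up for illustration) of a writer-side check that compares the record key from the table config with the one from the write options:

import org.slf4j.LoggerFactory

object RecordKeyConflictCheck {
  private val log = LoggerFactory.getLogger(getClass)

  // Compare the record key fields stored in the table config (e.g. hoodie.properties)
  // with the record key fields supplied by the writer, then warn or fail fast.
  def validate(tableConfigKeys: Seq[String], writeOptionKeys: Seq[String], failOnConflict: Boolean): Unit = {
    if (tableConfigKeys.nonEmpty && writeOptionKeys.nonEmpty && tableConfigKeys != writeOptionKeys) {
      val msg = s"Record key conflict: table config has '${tableConfigKeys.mkString(",")}', " +
        s"write options have '${writeOptionKeys.mkString(",")}'"
      if (failOnConflict) throw new IllegalArgumentException(msg) // strict: fail the job
      else log.warn(msg) // lenient: keep the table config and only warn
    }
  }
}

In the repro above, validate(Seq("id", "name"), Seq("id", "age"), failOnConflict = true) would fail the write, while failOnConflict = false would only log a warning and proceed.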
Tested the same code using Hive Sync. No issue is reported while writing and reading.
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val name = "Hudi_Test"
val sparkConf = new SparkConf().setAppName(name).setIfMissing("spark.master", "local[2]")
val spark = SparkSession.builder.appName(name).config(sparkConf)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
.config("spark.sql.hive.convertMetastoreParquet", "false")
.getOrCreate()
spark.sql(
"""
|CREATE TABLE IF NOT EXISTS t_test (
| `id` VARCHAR(20),
| `name` VARCHAR(10),
| `age` INT,
| `ts` Long
|) USING HUDI TBLPROPERTIES (primaryKey = 'id,name', preCombineField = 'ts')
| LOCATION '/tmp/warehouse/t_test'
""".stripMargin)
val input_schema = StructType(Seq(
StructField("id", LongType),
StructField("name", StringType),
StructField("age", IntegerType),
StructField("ts", LongType),
))
val input_data = Seq(
Row(1L, "hello", 42, 1695159649087L),
Row(2L, "world", 13, 1695091554788L),
Row(3L, "spark", 7, 1695115999911L),
Row(1L, "hello", 43, 1695159649087L),
)
val tableName = name
val basePath = f"file:///tmp/$tableName"
val databaseName = "test"
val hoodieConf = Map(
"hoodie.datasource.write.recordkey.field" -> "id,age",
"hoodie.table.precombine.field" -> "ts",
"hoodie.table.name" -> tableName,
"hoodie.database.name" -> databaseName,
"hoodie.datasource.meta.sync.enable" -> "true",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> tableName,
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.datasource.hive_sync.mode" -> "hms",
"hoodie.datasource.write.hive_style_partitioning" -> "true"
)
val input_df = spark.createDataFrame(spark.sparkContext.parallelize(input_data), input_schema)
input_df.write.format("hudi").options(hoodieConf).mode("overwrite").save(basePath)
spark.read.format("hudi").load(basePath).show(false)
Hi @usberkeley I got the expected exception when we specify the same location while creating the table and saving the data. Exception:
24/10/17 12:10:12 INFO HoodieTableConfig: Loading table properties from file:/tmp/hudi/Test12114/.hoodie/hoodie.properties
24/10/17 12:10:12 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from file:///tmp/hudi/Test12114
Exception in thread "main" org.apache.hudi.exception.HoodieException: Config conflict(key current value existing value):
RecordKey: id,age id,name
at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:229)
at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}
object Test12024 extends App {
val name = this.getClass.getSimpleName.replace("$", "")
val sparkConf = new SparkConf().setAppName(name).setIfMissing("spark.master", "local[2]")
val spark = SparkSession.builder.appName(name).config(sparkConf)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
.config("spark.sql.hive.convertMetastoreParquet", "false")
.getOrCreate()
val tableName = name
val basePath = f"file:///tmp/hudi/$tableName"
spark.sql(
f"""
|CREATE TABLE IF NOT EXISTS ${tableName} (
| `id` VARCHAR(20),
| `name` VARCHAR(10),
| `age` INT,
| `ts` Long
|) USING HUDI TBLPROPERTIES (primaryKey = 'id,name', preCombineField = 'ts')
| LOCATION '${basePath}'
""".stripMargin)
val input_schema = StructType(Seq(
StructField("id", LongType),
StructField("name", StringType),
StructField("age", IntegerType),
StructField("ts", LongType),
))
val input_data = Seq(
Row(1L, "hello", 42, 1695159649087L),
Row(2L, "world", 13, 1695091554788L),
Row(3L, "spark", 7, 1695115999911L),
Row(1L, "hello", 43, 1695159649087L),
)
val hoodieConf = scala.collection.mutable.Map[String, String]()
hoodieConf.put("hoodie.datasource.write.recordkey.field", "id,age")
hoodieConf.put("hoodie.table.precombine.field", "ts")
hoodieConf.put("hoodie.table.name", tableName)
val input_df = spark.createDataFrame(spark.sparkContext.parallelize(input_data), input_schema)
input_df.write.format("hudi").
options(hoodieConf).
mode("append").
save(basePath)
spark.read.format("hudi").load(basePath).show(false)
spark.stop()
}
Hi @usberkeley Please let me know if there is any update.
@rangareddy Wow, that's great! From the code, it seems Spark has checks in place. Could you please help take a look at Flink as well?
Hi @usberkeley Have you tried to replicate the issue with Flink?
Describe the problem you faced
When the Flink SQL keywords PRIMARY KEY and PARTITIONED BY conflict with the user configurations hoodie.datasource.write.recordkey.field and hoodie.datasource.write.partitionpath.field, the job does not throw an error. Instead, it prioritizes the Flink SQL keywords over the Hoodie configurations.
To Reproduce
Steps to reproduce the behavior:
1. Configuration Conflict in Flink Hudi Job: the key and partition settings that actually take effect are the ones persisted to hoodie.properties.
2. Parameter Conflict Handling: configure hoodie.datasource.write.recordkey.field = 'uuid,name' and hoodie.datasource.write.keygenerator.class = 'org.apache.hudi.keygen.SimpleAvroKeyGenerator'.
3. Flink SQL Keywords Conflict: the job prioritizes the PRIMARY KEY and PARTITIONED BY declarations over the Hoodie Config settings (a sketch of such a conflicting DDL follows below).
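For illustration, here is a minimal sketch of the kind of conflicting Flink DDL the steps above describe. Only the recordkey.field / keygenerator.class options and the PRIMARY KEY / PARTITIONED BY clash come from the report; the table name, columns, path, partitionpath.field value, and remaining connector options are assumptions.

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object FlinkConflictSketch extends App {
  val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())
  // PRIMARY KEY (uuid) and PARTITIONED BY (`part`) conflict with the hoodie.* options
  // below; per the report, the SQL keywords win silently instead of failing the job.
  tEnv.executeSql(
    """
      |CREATE TABLE t_conflict (
      |  uuid VARCHAR(20),
      |  name VARCHAR(10),
      |  age INT,
      |  ts BIGINT,
      |  `part` VARCHAR(20),
      |  PRIMARY KEY (uuid) NOT ENFORCED
      |) PARTITIONED BY (`part`) WITH (
      |  'connector' = 'hudi',
      |  'path' = 'hdfs:///tmp/hudi/t_conflict',
      |  'hoodie.datasource.write.recordkey.field' = 'uuid,name',
      |  'hoodie.datasource.write.partitionpath.field' = 'name',
      |  'hoodie.datasource.write.keygenerator.class' = 'org.apache.hudi.keygen.SimpleAvroKeyGenerator'
      |)
      |""".stripMargin)
}

With a DDL like this, the table ends up keyed by uuid (from PRIMARY KEY) and partitioned by part, while the hoodie.* options are silently overridden, which is the behavior the report describes.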
Expected behavior
Discussion item: I'd like to ask whether we should strictly check for configuration conflicts. Should we directly report an error in case of a conflict, rather than internally modifying user parameters?
I prefer directly reporting an error: if we don't report an error, users might mistakenly believe their configuration is valid, which could lead to confusion.
Environment Description
Hudi version : 0.15.0
Spark version : none
Hive version : none
Hadoop version : 3.3.5
Storage (HDFS/S3/GCS..) : HDFS
Running on Docker? (yes/no) : no