16 August 2022
Spark Tune Ui
- 检查不同的执行计划
select * from table where value=0.0; => spark cast value to Double
select * from table where value=0; => spark cast value to Int
this maybe cause different result - schema prune ,模式裁剪 可以较少不必要的字段的的读取 spark.conf.set(“spark,sql.optimizer.nestedSchemaPruning.enabled”,true)
- collapse project
- cross session cached
- A sql query=>mutiple job
- A job => A directed acyclic graph
- 优化
- 找到不平衡的地方
- 持久化
- join 优化
- SortMergeJoins(Standard) => both sides is big
- Broadcast Joins(Fastest) => one side is small( one sided < spark.sql.autoBroadcastJoinThreshold)(default 10M)
- Sknew Joins
- Range Joins
- BroadcastedNestedLoopJoin(BNLJ)
- 处理数据倾斜
- 减少耗时的操作
- repartition :use coalesce or shuffle partition count
- count :Do you really need it
- use approxCountDistinct instead of distinctCount
- use dropDuplicateds instead distinct before join or group by
- udfs
- 多维度并行
- spark 包含三个维度的并行度
- driver 并行度
- horizontal
- executor