Release Notes

Spark

This release provides Spark 1.6.3 with no additional Apache patches.

HDP 2.6.1 provided Spark 1.6.3 with no additional Apache patches.

HDP 2.6.1 also provided Spark 2.1.1 and the following Apache patches:

  • SPARK-4105: retry the fetch or stage if shuffle block is corrupt.

  • SPARK-12717: Adding thread-safe broadcast pickle registry.

  • SPARK-13931: Resolve stage hanging up problem in a particular case.

  • SPARK-14658: When an executor is lost, DAGScheduler may submit one stage twice even if the first running taskset for this stage is not finished.

  • SPARK-16251: Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message.

  • SPARK-16929: Speculation-related synchronization bottleneck in checkSpeculatableTasks.

  • SPARK-17424: Fix unsound substitution bug in ScalaReflection.

  • SPARK-17663: SchedulableBuilder should handle invalid data access via scheduler.allocation.file.

  • SPARK-17685: Make SortMergeJoinExec's currentVars null when calling createJoinKey.

  • SPARK-18099: Spark distributed cache should throw an exception if the same file is specified in both --files and --archives.

  • SPARK-18113: Use ask to replace askWithRetry in canCommit and make receiver idempotent.

  • SPARK-18251: DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class.

  • SPARK-18406: Race between end-of-task and completion iterator read lock release.

  • SPARK-18535: Redact sensitive information.

  • SPARK-18579: Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing.

  • SPARK-18629: Fix numPartition of JDBCSuite Testcase.

  • SPARK-18967: Locality preferences should be used when scheduling even when delay scheduling is turned off.

  • SPARK-18986: ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator.

  • SPARK-19059: Unable to retrieve data from parquet table whose name starts with an underscore.

  • SPARK-19104: Lambda variables in ExternalMapToCatalyst should be global.

  • SPARK-19218: Fix SET command to show a result correctly and in a sorted order.

  • SPARK-19219: Fix Parquet log output defaults.

  • SPARK-19220: SSL redirect handler only redirects the server's root.

  • SPARK-19263: DAGScheduler should avoid sending conflicting task set.

  • SPARK-19263: Fix race in SchedulerIntegrationSuite.

  • SPARK-19276: FetchFailures can be hidden by user (or sql) exception handling.

  • SPARK-19539: Block duplicate temp table during creation.

  • SPARK-19556: Broadcast data is not encrypted when I/O encryption is on.

  • SPARK-19570: Allow to disable hive in pyspark shell.

  • SPARK-19631: OutputCommitCoordinator should not allow commits for already failed tasks.

  • SPARK-19688: Do not read `spark.yarn.credentials.file` from checkpoint.

  • SPARK-19727: Fix for round function that modifies original column.

  • SPARK-19775: Remove an obsolete `partitionBy().insertInto()` test case.

  • SPARK-19796: taskScheduler fails serializing long statements received by thrift server.

  • SPARK-19812: YARN shuffle service fails to relocate recovery DB acro….

  • SPARK-19868: Conflicting TaskSetManager leads to Spark being stopped.

  • SPARK-20211: Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0.

  • SPARK-20217: Executor should not fail stage if killed task throws non-interrupted exception.

  • SPARK-20250: Improper OOM error when a task has been killed while spilling data.

  • SPARK-20275: Do not display "Completed" column for in-progress applications.

  • SPARK-20341: Support BigInt's value that does not fit in long value range.

  • SPARK-20342: Update task accumulators before sending task end event.

  • SPARK-20358: Executors failing stage on interrupted exception thrown by cancelled tasks.

  • SPARK-20393: Strengthen Spark to prevent XSS vulnerabilities.

  • SPARK-20405: Dataset.withNewExecutionId should be private.

  • SPARK-20412: Throw ParseException from visitNonOptionalPartitionSpec instead of returning null values.

  • SPARK-20426: OneForOneStreamManager occupies too much memory.

  • SPARK-20439: Fix Catalog API listTables and getTable when failed to fetch table metadata.

  • SPARK-20459: JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException.

  • SPARK-20496: Bug in KafkaWriter Looks at Unanalyzed Plans.

  • SPARK-20517: Fix broken history UI download link.

  • SPARK-20540: Fix unstable executor requests.

  • SPARK-20546: spark-class gets syntax error in posix mode.

  • SPARK-20555: Fix mapping of Oracle DECIMAL types to Spark types in read path.

  • SPARK-20558: clear InheritableThreadLocal variables in SparkContext when stopping it.

  • SPARK-20566: ColumnVector should support `appendFloats` for array.

  • SPARK-20603: Set default number of topic partitions to 1 to reduce the load.

  • SPARK-20613: Remove excess quotes in Windows executable.

  • SPARK-20615: SparseVector.argmax throws IndexOutOfBoundsException.

  • SPARK-20616: RuleExecutor logDebug of batch results should show diff to start of batch.

  • SPARK-20627: Drop the Hadoop distribution name from the Python version.

  • SPARK-20631: LogisticRegression._checkThresholdConsistency should use values not Params.

  • SPARK-20665: "Bround" and "Round" functions return NULL.

  • SPARK-20685: Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg.

  • SPARK-20686: PropagateEmptyRelation incorrectly handles aggregate without grouping.

  • SPARK-20687: mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix.

  • SPARK-20688: correctly check analysis for scalar sub-queries.

  • SPARK-20705: The sort function cannot be used in the master page when you use Firefox or Google Chrome.

  • SPARK-20735: Enable cross join in TPCDSQueryBenchmark.

  • SPARK-20756: yarn-shuffle jar references unshaded guava.

  • SPARK-20759: SCALA_VERSION in _config.yml should be consistent with pom.xml.

  • SPARK-20763: The `month` and `day` functions return unexpected values.

  • SPARK-20769: Incorrect documentation for using Jupyter notebook.

  • SPARK-20781: The location of Dockerfile in docker.properties.template is wrong.

  • SPARK-20796: the location of start-master.sh in spark-standalone.md is wrong.

  • SPARK-20798: GenerateUnsafeProjection should check if a value is null before calling the getter.

  • SPARK-20843: Add a config to set driver terminate timeout.

  • SPARK-20848: Shutdown the pool after reading parquet files.

  • SPARK-20862: Avoid passing float to ndarray.reshape in LogisticRegressionModel.

  • SPARK-20868: UnsafeShuffleWriter should verify the position after FileChannel.transferTo.

  • SPARK-20874: Add Structured Streaming Kafka Source to examples project.

  • SPARK-20914: Javadoc contains code that is invalid.

  • SPARK-20920: ForkJoinPool pools are leaked when writing hive tables with many partitions.

  • SPARK-20922: Add whitelist of classes that can be deserialized by the launcher.

  • SPARK-20922: Don't use Java 8 lambdas in older branches.

  • SPARK-20940: Replace IllegalAccessError with IllegalStateException.

  • SPARK-20974: we should run REPL tests if SQL module has code changes.

  • SPARK-21041: SparkSession.range should be consistent with SparkContext.range.

  • SPARK-21064: Fix the default value bug in NettyBlockTransferServiceSuite.

  • SPARK-21072: TreeNode.mapChildren should only apply to the children node.

  • SPARK-21083: Store zero size and row count when analyzing empty table.

  • SPARK-21114: Fix test failure in Spark 2.1/2.0 due to name mismatch.

  • SPARK-21123: Options for file stream source are in a wrong table - version to fix 2.1.

  • SPARK-21138: Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different.

  • SPARK-21159: Don't try to connect to launcher in standalone cluster mode.

  • SPARK-21167: Decode the path generated by File sink to handle special characters.

  • SPARK-21176: Limit number of selector threads for admin ui proxy servlets to 8.

  • SPARK-21181: Release byteBuffers to suppress netty error messages.

  • SPARK-21203: Fix wrong results of insertion of Array of Struct.

  • SPARK-21306: For branch 2.1, OneVsRest should support setWeightCol.

  • SPARK-21312: correct offsetInBytes in UnsafeRow.writeToStream.

  • SPARK-21330: Bad partitioning does not allow to read a JDBC table with extreme values on the partition column.

  • SPARK-21332: Incorrect result type inferred for some decimal expressions.

  • SPARK-21345: SparkSessionBuilderSuite should clean up stopped sessions.

  • SPARK-21376: Token is not renewed in yarn client process in cluster mode.

  • SPARK-21441: Incorrect Codegen in SortMergeJoinExec results in failures in some cases.

  • SPARK-21446: Fix setAutoCommit never executed.

  • SPARK-21522: Fix flakiness in LauncherServerSuite.

  • SPARK-21555: RuntimeReplaceable should be compared semantically by its canonicalized child.

  • SPARK-21588: SQLContext.getConf(key, null) should return null.

HDP 2.6.0 provided Spark 1.6.3 and the following Apache patches:

  • SPARK-6717: Clear shuffle files after checkpointing in ALS.

  • SPARK-6735: Add window based executor failure tracking mechanism for long running service.

  • SPARK-6847: Stack overflow on updateStateByKey which is followed by a stream with checkpoint set.

  • SPARK-7481: Add spark-cloud module to pull in aws+azure object store FS accessors; test integration.

  • SPARK-7889: Jobs progress of apps on complete page of HistoryServer shows uncompleted.

  • SPARK-10582: When using dynamic executor allocation, if the AM fails, a new AM is started, but the new AM does not allocate executors to the driver.

  • SPARK-11137: Make StreamingContext.stop() exception-safe.

  • SPARK-11314: Add service API and test service for Yarn Cluster schedulers.

  • SPARK-11315: Add YARN extension service to publish Spark events to YARN timeline service (part of SPARK-1537).

  • SPARK-11323: Add History Service Provider to service application histories from YARN timeline server (part of SPARK-1537).

  • SPARK-11627: Spark Streaming backpressure mechanism has no initial rate limit; receivers receive data at the maximum speed, which might cause an OOM exception.

  • SPARK-12001: StreamingContext cannot be completely stopped if the stop() is interrupted.

  • SPARK-12009: Avoid re-allocating YARN containers while the driver wants to stop all executors.

  • SPARK-12142: Can't request executor when container allocator is not ready.

  • SPARK-12241: Improve failure reporting in Yarn client obtainTokenForHBase().

  • SPARK-12353: Wrong output for countByValue and countByValueAndWindow.

  • SPARK-12513: SocketReceiver hang in Netcat example.

  • SPARK-12523: Support long-running Spark applications on HBase and Hive metastore.

  • SPARK-12920: Fix high CPU usage in Spark thrift server with concurrent users.

  • SPARK-12948: OrcRelation uses HadoopRDD which can broadcast conf objects frequently.

  • SPARK-12967: NettyRPC races with SparkContext.stop() and throws exception.

  • SPARK-12998: Enable OrcRelation even when connecting via spark thrift server.

  • SPARK-13021: Fail fast when custom RDDs violate RDD.partition's API contract.

  • SPARK-13117: WebUI should use the local ip not 0.0.0.0.

  • SPARK-13278: Launcher fails to start with JDK 9 EA.

  • SPARK-13308: ManagedBuffers passed to OneToOneStreamManager need to be freed in non error cases.

  • SPARK-13360: pyspark related environment variable is not propagated to driver in yarn-cluster mode.

  • SPARK-13468: Fix a corner case where the page UI should show DAG but it doesn't show.

  • SPARK-13478: Use real user when fetching delegation tokens.

  • SPARK-13885: Fix attempt id regression for Spark running on Yarn.

  • SPARK-13902: Make DAGScheduler not create duplicate stages.

  • SPARK-14062: Fix log4j and upload metrics.properties automatically with distributed cache.

  • SPARK-14091: Consider improving performance of SparkContext.getCallSite().

  • SPARK-15067: YARN executors are launched with fixed perm gen size.

  • SPARK-1537: Add integration with Yarn's Application Timeline Server.

  • SPARK-15705: Change the default value of spark.sql.hive.convertMetastoreOrc to false.

  • SPARK-15844: HistoryServer doesn't come up if spark.authenticate = true.

  • SPARK-15990: Add rolling log aggregation support for Spark on yarn.

  • SPARK-16110: Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set.

  • SPARK-19033: HistoryServer still uses old ACLs even if ACLs are updated.

  • SPARK-19306: Fix inconsistent state in DiskBlockObjectWriter when exception occurred.

  • SPARK-19970: Table owner should be USER instead of PRINCIPAL in kerberized clusters.