spark3.0.1集成delta0.7.0之delta如何进行DDL操作

小编给大家分享一下spark 3.0.1集成delta 0.7.0之delta如何进行DDL操作，相信大部分人都还不怎么了解，因此分享这篇文章给大家参考一下，希望大家阅读完这篇文章后大有收获，下面让我们一起去了解一下吧！

10余年的延津网站建设经验，针对设计、前端、开发、售后、文案、推广等六对一服务，响应快，48小时及时工作处理。营销型网站建设的优势是能够根据用户设备显示端的尺寸不同，自动调整延津建站的显示方式，使网站能够适用不同显示终端，在浏览器中调整网站的宽度，无论在任何一种浏览器上浏览网站，都能展现优雅布局与设计，从而大程度地提升浏览体验。成都创新互联从事“延津网站设计”,“延津网站推广”以来，每个客户项目都认真落实执行。

分析

delta在0.7.0以前是不能够进行save表操作的，只能存储到文件中，也就是说他的元数据是和spark的其他元数据是分开的，delta是独立存在的，也是不能和其他表进行关联操作的，只有到了delta 0.7.0版本以后，才真正意义上和spark进行了集成，这也得益于spark 3.x的Catalog plugin API 特性。
还是先从delta的configurate sparksession入手,如下:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("...")
  .master("...")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

对于第二个配置 config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") 从spark configuration,我们可以看到对该spark.sql.catalog.spark_catalog的解释是

A catalog implementation that will be used as the v2 interface to Spark's built-in v1 catalog: spark_catalog. This catalog shares its identifier namespace with the spark_catalog and must be consistent with it; for example, if a table can be loaded by the spark_catalog, this catalog must also return the table metadata. To delegate operations to the spark_catalog, implementations can extend 'CatalogExtension'.

也就是说，通过该配置可以实现元数据的统一性,其实这也是spark社区和delta社区进行交互的一种结果

spark 3.x的Catalog plugin API

为了能搞懂delta为什么能够进行DDL和DML操作，就得先知道spark 3.x的Catalog plugin机制SPARK-31121.

首先是interface CatalogPlugin,该接口是catalog plugin的顶级接口，正如注释所说:

* A marker interface to provide a catalog implementation for Spark. *

* Implementations can provide catalog functions by implementing additional interfaces for tables, * views, and functions. *

* Catalog implementations must implement this marker interface to be loaded by * {@link Catalogs#load(String, SQLConf)}. The loader will instantiate catalog classes using the * required public no-arg constructor. After creating an instance, it will be configured by calling * {@link #initialize(String, CaseInsensitiveStringMap)}. *

* Catalog implementations are registered to a name by adding a configuration option to Spark: * {@code spark.sql.catalog.catalog-name=com.example.YourCatalogClass}. All configuration properties * in the Spark configuration that share the catalog name prefix, * {@code spark.sql.catalog.catalog-name.(key)=(value)} will be passed in the case insensitive * string map of options in initialization with the prefix removed. * {@code name}, is also passed and is the catalog's name; in this case, "catalog-name".

可以通过spark.sql.catalog.catalog-name=com.example.YourCatalogClass集成到spark中
该类的实现还可以集成其他额外的tables views functions的接口，这里就得提到接口TableCatalog,该类提供了与tables相关的方法:

/** * List the tables in a namespace from the catalog. *

* If the catalog supports views, this must return identifiers for only tables and not views. * * @param namespace a multi-part namespace * @return an array of Identifiers for tables * @throws NoSuchNamespaceException If the namespace does not exist (optional). */ Identifier[] listTables(String[] namespace) throws NoSuchNamespaceException; /** * Load table metadata by {@link Identifier identifier} from the catalog. *

* If the catalog supports views and contains a view for the identifier and not a table, this * must throw {@link NoSuchTableException}. * * @param ident a table identifier * @return the table's metadata * @throws NoSuchTableException If the table doesn't exist or is a view */ Table loadTable(Identifier ident) throws NoSuchTableException;

这样就可以基于TableCatalog开发自己的catalog，从而实现multi-catalog support

还得有个接口DelegatingCatalogExtension,这是个实现了CatalogExtension接口的抽象类，而CatalogExtension继承了TableCatalog, SupportsNamespaces。DeltaCatalog实现了DelegatingCatalogExtension ，这部分后续进行分析。
最后还有一个class CatalogManager,这个类是用来管理CatalogPlugins的，且是线程安全的:

/**
 * A thread-safe manager for [[CatalogPlugin]]s. It tracks all the registered catalogs, and allow
 * the caller to look up a catalog by name.
 *
 * There are still many commands (e.g. ANALYZE TABLE) that do not support v2 catalog API. They
 * ignore the current catalog and blindly go to the v1 `SessionCatalog`. To avoid tracking current
 * namespace in both `SessionCatalog` and `CatalogManger`, we let `CatalogManager` to set/get
 * current database of `SessionCatalog` when the current catalog is the session catalog.
 */
// TODO: all commands should look up table from the current catalog. The `SessionCatalog` doesn't
//       need to track current database at all.
private[sql]
class CatalogManager(
    conf: SQLConf,
    defaultSessionCatalog: CatalogPlugin,
    val v1SessionCatalog: SessionCatalog) extends Logging {

我们看到CatalogManager管理了v2版本的 CatalogPlugin和v1版本的sessionCatalog,这个是因为历史的原因导致必须得兼容v1版本

那CatalogManager在哪里被调用呢。我们看一下BaseSessionStateBuilder ,可以看到该类中才是正宗使用CatalogManager的地方：

/**
   * Catalog for managing table and database states. If there is a pre-existing catalog, the state
   * of that catalog (temp tables & current database) will be copied into the new catalog.
   *
   * Note: this depends on the `conf`, `functionRegistry` and `sqlParser` fields.
   */
  protected lazy val catalog: SessionCatalog = {
    val catalog = new SessionCatalog(
      () => session.sharedState.externalCatalog,
      () => session.sharedState.globalTempViewManager,
      functionRegistry,
      conf,
      SessionState.newHadoopConf(session.sparkContext.hadoopConfiguration, conf),
      sqlParser,
      resourceLoader)
    parentState.foreach(_.catalog.copyStateTo(catalog))
    catalog
  }

  protected lazy val v2SessionCatalog = new V2SessionCatalog(catalog, conf)

  protected lazy val catalogManager = new CatalogManager(conf, v2SessionCatalog, catalog)

SessionCatalog 是v1版本的，主要是跟底层的元数据存储通信，以及管理临时视图，udf的，这一部分暂时不分析，重点放到v2版本的sessionCatalog，我们看一下V2SessionCatalog:

/**
 * A [[TableCatalog]] that translates calls to the v1 SessionCatalog.
 */
class V2SessionCatalog(catalog: SessionCatalog, conf: SQLConf)
  extends TableCatalog with SupportsNamespaces {
  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.NamespaceHelper
  import V2SessionCatalog._

  override val defaultNamespace: Array[String] = Array("default")

  override def name: String = CatalogManager.SESSION_CATALOG_NAME

  // This class is instantiated by Spark, so `initialize` method will not be called.
  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {}

  override def listTables(namespace: Array[String]): Array[Identifier] = {
    namespace match {
      case Array(db) =>
        catalog
          .listTables(db)
          .map(ident => Identifier.of(Array(ident.database.getOrElse("")), ident.table))
          .toArray
      case _ =>
        throw new NoSuchNamespaceException(namespace)
    }
  }

我们分析一下listTables方法可知，v2的sessionCatalog操作都是委托给了v1版本的sessionCatalog去操作的，其他的方法也是一样，而且name默认为CatalogManager.SESSION_CATALOG_NAME,也就是spark_catalog，这里后面也会提到，注意一下。而且，catalogmanager在逻辑计划中的分析器和优化器中也会用到，因为会用到其中的元数据：

protected def analyzer: Analyzer = new Analyzer(catalogManager, conf) {
...
protected def optimizer: Optimizer = {
    new SparkOptimizer(catalogManager, catalog, experimentalMethods) {
      override def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] =
        super.earlyScanPushDownRules ++ customEarlyScanPushDownRules

      override def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] =
        super.extendedOperatorOptimizationRules ++ customOperatorOptimizationRules
    }
  }

而analyzer和optimizer正是spark sql进行解析的核心中的核心，当然还有物理计划的生成。那这些analyzer和optimizer是在哪里被调用呢？
我们举一个例子，DataSet中的filter方法就调用了：

 */
  def filter(conditionExpr: String): Dataset[T] = {
    filter(Column(sparkSession.sessionState.sqlParser.parseExpression(conditionExpr)))
  }

sessionState.sqlParser就是刚才所说的sqlParser:

protected lazy val sqlParser: ParserInterface = {
    extensions.buildParser(session, new SparkSqlParser(conf))
  }

只有整个逻辑从sql解析到使用元数据的数据链路，我们就能大致知道怎么一回事了。

delta的DeltaCatalog

我们回过头来看看，delta的DeltaCatalog是怎么和spark 3.x进行结合的，上源码DeltaCatalog：

class DeltaCatalog(val spark: SparkSession) extends DelegatingCatalogExtension
  with StagingTableCatalog
  with SupportsPathIdentifier {

  def this() = {
    this(SparkSession.active)
  }
  ...

就如之前所说的DeltaCatalog继承了DelegatingCatalogExtension,从名字可以看出这是一个委托类，那到底是怎么委托的呢以及委托给谁呢？

public abstract class DelegatingCatalogExtension implements CatalogExtension {

  private CatalogPlugin delegate;

  public final void setDelegateCatalog(CatalogPlugin delegate) {
    this.delegate = delegate;
  }

该类中有个setDelegateCatalog方法，该方法在CatalogManager中的loadV2SessionCatalog方法中被调用:

private def loadV2SessionCatalog(): CatalogPlugin = {
    Catalogs.load(SESSION_CATALOG_NAME, conf) match {
      case extension: CatalogExtension =>
        extension.setDelegateCatalog(defaultSessionCatalog)
        extension
      case other => other
    }
  }

而该方法被v2SessionCatalog调用:

private[sql] def v2SessionCatalog: CatalogPlugin = {
    conf.getConf(SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).map { customV2SessionCatalog =>
      try {
        catalogs.getOrElseUpdate(SESSION_CATALOG_NAME, loadV2SessionCatalog())
      } catch {
        case NonFatal(_) =>
          logError(
            "Fail to instantiate the custom v2 session catalog: " + customV2SessionCatalog)
          defaultSessionCatalog
      }
    }.getOrElse(defaultSessionCatalog)
  }

这个就是返回默认的v2版本的SessionCatalog实例,分析一下这个方法：

   首先得到配置项SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION，也就是spark.sql.catalog.spark_catalog配置，
   如果spark配置了的话，就调用loadV2SessionCatalog加载该类，，否则就加载默认的v2SessionCatalog，也就是V2SessionCatalog实例

这里我们就发现了:
delta配置的spark.sql.catalog.spark_catalog为"org.apache.spark.sql.delta.catalog.DeltaCatalog",也就是说，spark中的V2SessionCatalog是DeltaCatalog的实例，而DeltaCatalog的委托给了BaseSessionStateBuilder中的V2SessionCatalog实例。

具体看看DeltaCatalog 的createTable方法，其他的方法类似：

override def createTable(
      ident: Identifier,
      schema: StructType,
      partitions: Array[Transform],
      properties: util.Map[String, String]): Table = {
    if (DeltaSourceUtils.isDeltaDataSourceName(getProvider(properties))) {
      createDeltaTable(
        ident, schema, partitions, properties, sourceQuery = None, TableCreationModes.Create)
    } else {
      super.createTable(ident, schema, partitions, properties)
    }
  }
...
private def createDeltaTable(
      ident: Identifier,
      schema: StructType,
      partitions: Array[Transform],
      properties: util.Map[String, String],
      sourceQuery: Option[LogicalPlan],
      operation: TableCreationModes.CreationMode): Table = {
     ...
    val tableDesc = new CatalogTable(
      identifier = TableIdentifier(ident.name(), ident.namespace().lastOption),
      tableType = tableType,
      storage = storage,
      schema = schema,
      provider = Some("delta"),
      partitionColumnNames = partitionColumns,
      bucketSpec = maybeBucketSpec,
      properties = tableProperties.toMap,
      comment = Option(properties.get("comment")))
    // END: copy-paste from the super method finished.

    val withDb = verifyTableAndSolidify(tableDesc, None)
    ParquetSchemaConverter.checkFieldNames(tableDesc.schema.fieldNames)
    CreateDeltaTableCommand(
      withDb,
      getExistingTableIfExists(tableDesc),
      operation.mode,
      sourceQuery,
      operation,
      tableByPath = isByPath).run(spark)

    loadTable(ident)
      }
 override def loadTable(ident: Identifier): Table = {
    try {
      super.loadTable(ident) match {
        case v1: V1Table if DeltaTableUtils.isDeltaTable(v1.catalogTable) =>
          DeltaTableV2(
            spark,
            new Path(v1.catalogTable.location),
            catalogTable = Some(v1.catalogTable),
            tableIdentifier = Some(ident.toString))
        case o => o
      }
    
  }

判断是否是delta数据源，如果是的话，跳到createDeltaTable方法，否则直接调用super.createTable方法，
createDeltaTable先会进行delta特有的CreateDeltaTableCommand.run()命令写入delta数据，之后载loadTable
loadTable则会调用super的loadTable，而方法会调用V2SessionCatalog的loadTable,而V2SessionCatalog最终会调用v1版本sessionCatalog的getTableMetadata方法，从而组成V1Table(catalogTable)返回，这样就把delta的元数据信息持久化到了v1 SessionCatalog管理的元数据库中
如果不是delta数据源，则调用super.createTable方法，该方法调用V2SessionCatalog的createTable,而最终还是调用v1版本sessionCatalog的createTable方法

我们这里重点分析了delta数据源到元数据的存储，非delta数据源的代码就没有粘贴过来，有兴趣的自己可以编译源码跟踪一下

我们还得提一下spark.sql.defaultCatalog的默认配置为spark_catalog，也就是sql的默认catalog为spark_catalog,对应到delta的话就是DeltaCatalog。

以上是“spark 3.0.1集成delta 0.7.0之delta如何进行DDL操作”这篇文章的所有内容，感谢各位的阅读！相信大家都有了一定的了解，希望分享的内容对大家有所帮助，如果还想学习更多知识，欢迎关注创新互联行业资讯频道！

当前名称：spark3.0.1集成delta0.7.0之delta如何进行DDL操作
链接URL：http://chengdu.cdxwcx.cn/article/jgsesh.html

甜橘子，专注成都网站制作网站设计与营销型网站建设与优化

首页

网站建设

网站制作案例

解决方案

网站设计报价

网站制作动态

关于我们

联系我们

成都网站建设设计将想法与焦点和您一起共享

spark3.0.1集成delta0.7.0之delta如何进行DDL操作

分析

spark 3.x的Catalog plugin API

delta的DeltaCatalog

其他资讯

vbnet读取ini,vbnet读取内存地址

包含苹果macos系统重装的词条

关于html斜体css样式的信息

nosql适合储存吗,为什么需要nosql

linux常用命令中am,linux常用命令中cd

甜橘子，专注成都网站制作网站设计与营销型网站建设与优化

成都网站建设设计 将想法与焦点和您一起共享

spark3.0.1集成delta0.7.0之delta如何进行DDL操作

分析

spark 3.x的Catalog plugin API

delta的DeltaCatalog

其他资讯

vbnet读取ini,vbnet读取内存地址

包含苹果macos系统重装的词条

关于html斜体css样式的信息

nosql适合储存吗,为什么需要nosql

linux常用命令中am,linux常用命令中cd

成都网站建设设计将想法与焦点和您一起共享