SemyonSinchenko opened a new pull request, #46368: URL: https://github.com/apache/spark/pull/46368
### What changes were proposed in this pull request?

In PySpark Connect there is no access to the JVM, so `queryExecution().optimizedPlan.stats` cannot be called, and the only way to get the size-in-bytes estimate of a plan is to parse the output of `explain` with regular expressions. This PR fills that gap by adding a `sizeInBytesApproximation` method to the JVM, PySpark Classic, and PySpark Connect APIs. Under the hood it is just a call to `queryExecution().optimizedPlan.stats.sizeInBytes`. The JVM and PySpark Classic APIs were updated only for parity.

1. Update `Dataset.scala` in JVM Connect by adding the new API
2. Update `Dataset.scala` in JVM Classic by adding the new API
3. Update `dataframe.py` in `sql` by adding the signature and docstring of the new API
4. Update `dataframe.py` in `connect` by adding an implementation of the new API
5. Update `dataframe.py` in `classic` by adding an implementation of the new API
6. Update `base.proto` (the `AnalyzeRequest` / `AnalyzeResponse` parts) by adding a new message
7. Regenerate the Python files from the proto definitions
8. Update `SparkConnectAnalyzeHandler` by extending the `match` and adding a call to `queryExecution`
9. Update `SparkConnectClient` by adding a new method that builds the new request
10. Update `SparkSession` by adding a call to the client and parsing of the response
11. Add/update the corresponding tests

### Why are the changes needed?

To give PySpark Connect users a way to obtain a DataFrame size estimate at runtime without forcing them to parse the string output of `df.explain`. The other changes are needed for parity across Connect / Classic and PySpark / JVM Spark.

### Does this PR introduce _any_ user-facing change?

Only a new API, aimed mostly at PySpark Connect users.

### How was this patch tested?

Because the actual logic lives in `queryExecution`, the new tests cover only the syntax / call paths: they check that the size returned for a DataFrame is greater than zero.

### Was this patch authored or co-authored using generative AI tooling?

No.
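Step 6 (the `base.proto` change) could look roughly like the following. This is a hypothetical sketch only: the message names, nesting, and field numbers are illustrative and do not reproduce the actual diff, which has to fit into the existing `AnalyzeRequest` / `AnalyzeResponse` layout:

```protobuf
// Hypothetical sketch -- not the actual PR diff. The analyze RPC would gain a
// request/response message pair for the new call:
message SizeInBytesApproximation {
  Relation relation = 1;  // the plan whose optimized-plan stats are requested
}

message SizeInBytesApproximationResult {
  int64 size_in_bytes = 1;  // queryExecution().optimizedPlan.stats.sizeInBytes
}
```

Note that `stats.sizeInBytes` is a `BigInt` on the JVM side, so the real wire type may need to be wider than a plain `int64` (e.g. a string); that choice is left to the actual implementation.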
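For context, the Connect-side workaround this PR makes unnecessary (parsing the size estimate out of explain output with regular expressions) can be sketched in plain Python. The explain string and the helper below are illustrative only, not part of this PR; real cost-mode explain output varies by plan and Spark version:

```python
import re

# Illustrative cost-mode explain output (df.explain(mode="cost")); not from a real run.
sample_explain = """== Optimized Logical Plan ==
Project [id#0L], Statistics(sizeInBytes=24.0 B)
+- Range (0, 3, step=1, splits=8), Statistics(sizeInBytes=24.0 B)
"""

# Binary-prefix multipliers used in Spark's human-readable size strings.
_UNITS = {"B": 1, "KiB": 1024**1, "MiB": 1024**2, "GiB": 1024**3,
          "TiB": 1024**4, "PiB": 1024**5, "EiB": 1024**6}


def size_in_bytes_from_explain(text: str) -> int:
    """Parse the first sizeInBytes value out of a cost-mode explain string."""
    m = re.search(r"sizeInBytes=([\d.]+)\s*(B|KiB|MiB|GiB|TiB|PiB|EiB)", text)
    if m is None:
        raise ValueError("no sizeInBytes found in explain output")
    value, unit = float(m.group(1)), m.group(2)
    return int(value * _UNITS[unit])


print(size_in_bytes_from_explain(sample_explain))  # 24
```

This kind of string scraping is fragile, which is the motivation for exposing the value through a proper API instead.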
@grundprinzip We discussed this ticket with you; could you please take a look? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org