SemyonSinchenko opened a new pull request, #46368:
URL: https://github.com/apache/spark/pull/46368

   ### What changes were proposed in this pull request?
   
   In PySpark Connect there is no access to the JVM, so 
`queryExecution().optimizedPlan.stats` cannot be called. As a result, the only 
way to get the plan's size-in-bytes estimate is to parse the string output of 
`explain` with regular expressions. This PR fills that gap by adding a 
`sizeInBytesApproximation` method to the JVM, PySpark Classic, and PySpark 
Connect APIs. Under the hood it is just a call to 
`queryExecution().optimizedPlan.stats.sizeInBytes`. The JVM and PySpark Classic 
APIs are updated only for parity.
   
   1. Update `Dataset.scala` in JVM Connect by adding the new API
   2. Update `Dataset.scala` in JVM Classic by adding the new API
   3. Update `dataframe.py` in sql by adding the signature and docstring of the new API
   4. Update `dataframe.py` in connect by adding an implementation of the new API
   5. Update `dataframe.py` in classic by adding an implementation of the new API
   6. Update `base.proto` in the `AnalyzeRequest` / `AnalyzeResponse` part by 
adding new messages
   7. Regenerate the Python files from the proto definitions
   8. Update `SparkConnectAnalyzeHandler` by extending the `match` and adding a 
call to `queryExecution`
   9. Update `SparkConnectClient` by adding a new method that builds the new 
request
   10. Update `SparkSession` by adding a call to the client and parsing of the 
response
   11. Add/update corresponding tests
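
   The explain-parsing workaround that this PR aims to replace can be sketched 
as follows. This is an illustrative assumption, not code from the PR: the 
sample plan text, the regex, and the unit table are hypothetical, since the 
exact `explain(mode="cost")` output format varies across Spark versions.

```python
import re

# Hypothetical sample of `df.explain(mode="cost")` output; illustrative only,
# not captured from a real Spark run.
EXPLAIN_OUTPUT = """\
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=Some(1)), Statistics(sizeInBytes=800.0 B, rowCount=100)
"""

# Binary units as Spark prints them in statistics (assumed subset).
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

def size_in_bytes_from_explain(text: str) -> int:
    """Extract the optimizer's sizeInBytes estimate by regex -- the fragile
    workaround that `sizeInBytesApproximation` makes unnecessary."""
    m = re.search(r"sizeInBytes=([\d.]+)\s+(B|KiB|MiB|GiB)", text)
    if m is None:
        raise ValueError("no sizeInBytes statistic found in explain output")
    value, unit = m.groups()
    return int(float(value) * _UNITS[unit])

print(size_in_bytes_from_explain(EXPLAIN_OUTPUT))  # 800
```

   Any change in the statistics formatting silently breaks such parsing, which 
is why a first-class API is preferable.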
   
   ### Why are the changes needed?
   To give PySpark Connect users a way to obtain the DataFrame size estimate at 
runtime without forcing them to parse the string output of `df.explain`. The 
other changes are needed for parity across Connect / Classic and PySpark / JVM 
Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, but only a new API. It is mostly intended for PySpark Connect users.
   
   ### How was this patch tested?
   Because the actual logic lives in `queryExecution`, I added tests only for 
the new API surface and call paths. The tests check that the size returned for 
a DataFrame is greater than zero.
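
   A minimal sketch of the shape of such a test, using `unittest.mock` as a 
stand-in for a real Connect DataFrame so the contract check runs without a 
Spark cluster. The attribute name `sizeInBytesApproximation` comes from this 
PR; the concrete value and the integer return type are assumptions.

```python
from unittest.mock import MagicMock

def check_size_contract(df) -> bool:
    """The property the PR's tests assert: for any DataFrame, the size
    estimate is a positive integer."""
    size = df.sizeInBytesApproximation
    return isinstance(size, int) and size > 0

# MagicMock stands in for a Connect DataFrame; a real session would
# compute this value from the optimized plan's statistics.
df = MagicMock()
df.sizeInBytesApproximation = 1024

print(check_size_contract(df))  # True
```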
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   
   
   @grundprinzip We discussed this ticket with you; could you please take a 
look? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

