Social Question


Python+Spark beginner: What am I not learning when I practice on a local Spark cluster?

Asked by capet (988 points) April 1st, 2021

I’m starting to learn to use Python and Spark. I’m spending most of my time doing structured tutorials and exploring small local datasets. But I’m trying to get used to tools and habits that will give me a smooth transition into doing very distributed computation.

I have been practicing with PySpark by running a local session with spark = SparkSession.builder.master("local[4]").appName("local").getOrCreate() or similar, and then calling whatever I need on that spark object.
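For concreteness, here is roughly what my local setup looks like (same builder call as above, just spread out; the alternative master URLs in the comment are only illustrative, not something I've actually used):

from pyspark.sql import SparkSession

# Local session using 4 worker threads. To point at a real cluster you would
# change only the master URL (e.g. "yarn" or "spark://host:7077" -- example
# values, not a setup I have running).
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("local")
    .getOrCreate()
)

spark.range(10).show()  # tiny sanity check that the session works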

By doing things this way, I'm sure I'm missing out on experience that I will need if I ever run PySpark on a remote cluster, doing complicated stuff, etc. Can y'all recommend some tasks that would complement my practicing on a local cluster?

I’m interested in beginner-level tasks that would help prepare me both for the syntax and the higher-level organization involved in using PySpark to do very distributed computation.

One idea I had was to spin up an HDInsight cluster on Azure and work on that. Does that make sense?
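One pattern I've seen suggested (not sure if it's the right habit) is keeping the master URL out of the script entirely, so the same code could run locally during practice and via spark-submit on a cluster later. Something like the sketch below, where the file path and column name are made-up placeholders:

from pyspark.sql import SparkSession

# No .master() here: locally I'd pass --master local[4] to spark-submit or set
# spark.master in spark-defaults.conf, and on a cluster the launcher
# (e.g. spark-submit --master yarn) would decide instead.
spark = SparkSession.builder.appName("portable-job").getOrCreate()

# "data/sample.csv" and "category" are placeholders I made up; on a real
# cluster this would presumably point at HDFS or blob storage instead.
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)
df.groupBy("category").count().show()

spark.stop()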


0 Answers
