Results

After executing all the scripts, we obtained the charts below:

1. Top 10 Most Expensive Stocks in the US (historically and for the year 2021)

2. Top 10 Most Expensive Stocks in China (historically and for the year 2021)

3. Top 10 Most Expensive Stocks in India (historically and for the year 2021)

4. Top 10 Highest Growth Stocks in the US (historically and for the year 2021)

5. Top 10 Highest Growth Stocks in China (historically and for the year 2021)

6. Top 10 Highest Growth Stocks in India (historically and for the year 2021)
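These rankings are produced by simple Spark aggregations. As an illustration only, a minimal sketch of a "top 10 most expensive" query is shown below; the input file name and the column names ("symbol", "close") are assumptions made for the example, not necessarily those used in the actual scripts.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("top10MostExpensive").getOrCreate()

df = spark.read.csv("stocks_us.csv", header=True)  # hypothetical input file

top10 = (df.withColumn("close", f.col("close").cast("float"))
           .groupBy("symbol")
           .agg(f.max(f.col("close")).alias("max_close"))  # highest price reached by each stock
           .orderBy(f.col("max_close").desc())
           .limit(10))
top10.show()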

Performance on Cloud

This is the performance evaluation on Google Cloud obtained from the scripts top5HighestGrowth.py and top10MostExpensive.py. We ran several experiments with different numbers of worker nodes and threads.

The results can be seen in the charts below, where:

X axis: number of executors and threads
Y axis: running time in minutes
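The exact cluster settings are not reproduced here; as a rough sketch, the number of executors and the number of cores (threads) per executor can be set when building the Spark session, for example as follows (the values are illustrative only):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("top10MostExpensive")
         .config("spark.executor.instances", "2")   # number of executors
         .config("spark.executor.cores", "8")       # threads per executor
         .getOrCreate())

When submitting with spark-submit, the same properties can also be passed with --conf.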

    Calculation of speedup (T_sequential / T_parallel):

    s = 123.23 / 71.46 = 1.72 times faster

    s = 123.23 / 32.25 = 3.82 times faster

    We can see that the code runs faster with more threads, with a large gap between the run with 1 executor and 1 thread and the runs with 2 executors and 2 threads and with 2 executors and 8 threads.

    Calculation of speedup (T_sequential / T_parallel):

    s = 41.23 / 31.02 = 1.329 times faster

    s = 41.23 / 15.27 = 2.7 times faster

    As mentioned previously, increasing the number of threads significantly improves the running time.

    Calculation of speedup (T_sequential / T_parallel):

    s = 123.23 / 88.54 = 1.39 times faster

    s = 123.23 / 68.5 = 1.79 times faster

    Calculation of speedup (T_sequential / T_parallel):

    s = 41.23 / 34.41 = 1.198 times faster

    s = 41.23 / 28.22 = 1.461 times faster
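The speedup figures above are simply the ratio of the sequential running time to the parallel running time; a minimal sketch of the calculation:

def speedup(t_sequential, t_parallel):
    # Speedup = sequential running time / parallel running time.
    return t_sequential / t_parallel

print(round(speedup(123.23, 32.25), 2))  # 3.82
print(round(speedup(41.23, 15.27), 2))   # 2.7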

The other scripts in the program perform better, since they process a smaller amount of data.

Advanced functions

Through this project, we had the opportunity to learn many new Spark functions.

Some of these functions are:

- pyspark.sql.DataFrame.agg: This method performs aggregation operations: it can compute sums, averages, maximums, and other aggregates over one or more columns of the DataFrame.

df.agg(f.max(f.col("close"))) # takes the maximum value of the "close" column (f is pyspark.sql.functions)

- pyspark.sql.DataFrame.withColumn: This method is used to add a new column or to replace an existing one with a column expression.

df.withColumn("close", col("close").cast("float")) 
# convert the value in the "close" column from type string to type float

- pyspark.sql.functions.lag: This function is a window function and is used to get the value of a column from a previous row. It must be applied over a window specification (pyspark.sql.Window).

df.withColumn("prev_value", f.lag(f.col("close"))) # create a column with the previous value of close

- pyspark.sql.functions.to_date: This function is used to convert a column of type String to a column of type Date. We can specify the date format through its second parameter (e.g. dd-MM-yyyy).

df.withColumn("date", f.to_date(df[0])) # convert the values of the "date" column from string to DateType()

- pyspark.sql.DataFrame.drop: This method is used to remove one or more columns.

df.drop("prev_value") # we remove the column "prev_value"