How did we make it?

The aim of this project was to analyze large stock datasets from different parts of the world, drawing conclusions about the best-performing stocks and estimating their potential growth.

To manage our project effectively, we chose GitHub as our version control platform. Our collaboration took place in Visual Studio Code (VS Code), a versatile IDE that supports Python development and integrates well with Git.
For data processing, we opted for PySpark, the Python API for Apache Spark, which enabled us to analyze large datasets efficiently. We curated datasets from Kaggle, focusing on stock data from the US, China, and India.
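As a small illustration of that setup, loading one of the Kaggle CSV files into a Spark DataFrame looks roughly like this (the file name and the inferred schema below are placeholders, not the exact layout of our datasets):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder.appName("stock-analysis").getOrCreate()

# Load a stock dataset; a header row and schema inference are typical for Kaggle CSVs
us_stocks = spark.read.csv("./Sample/us_stocks.csv", header=True, inferSchema=True)

us_stocks.printSchema()
us_stocks.show(5)
```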

Technical description

- Programming Language: the application is primarily developed in Python for scripting and data analysis.
- Big Data Processing Framework: PySpark, the Python API for Apache Spark, a powerful open-source distributed computing system. The application leverages PySpark for distributed data processing.
- Version Control and Collaboration: GitHub, which facilitates code synchronization among team members and efficient version control.
- Development Environment: Visual Studio Code (VS Code) as the integrated development environment (IDE).
- Datasets: stock market datasets obtained from Kaggle, specifically focusing on data from the US, China, and India.
- Platform for Local Execution: the application can be executed locally on a machine with the following components:
  - Python and pip, required for running the Python scripts
  - a Unix-like command interpreter: Linux, Google Cloud Shell, or Windows Subsystem for Linux (WSL)
  - a Java Runtime Environment (JRE), necessary for running Spark
  - PySpark, installed with the pip package manager
- Platform for Cloud Execution (Google Cloud): the application is designed to be executed on Google Cloud for scalability and reduced configuration requirements on local machines, using Google Cloud Dataproc to create clusters for distributed data processing.
- Usage: run main.py and select the function you want; the last line of output contains the result (the script uses the datasets in ./Sample by default). The result is a list of tuples, each consisting of a company name and the corresponding value.
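To make that output format concrete, here is the shape of a returned result (the tickers and values below are invented purely for illustration):

```python
# A result is a list of (company, value) tuples, for example the
# most expensive stocks by price. These numbers are made up.
result = [("AAPL", 182.52), ("MSFT", 348.10), ("GOOG", 139.70)]

for company, value in result:
    print(f"{company}: {value}")
```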

Try it out!

Here is a link to our GitHub repository, where you can download our program. In the README file you will find a detailed explanation of how to execute our scripts, and we have provided some sample datasets so you can run a few tests.

Our application can be executed both locally and on Google Cloud for scalability:

- For local execution, we provide detailed instructions for setting up the Python environment, the Java Runtime Environment (JRE), and PySpark. In our case, we used VS Code for convenience.

- For cloud execution, we leveraged Google Cloud, specifically Google Cloud Dataproc, to create clusters for distributed data processing. The application integrates seamlessly with Google Cloud, and we have also outlined the steps for cluster creation and Spark job submission.
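As a rough sketch of the job-submission step, here is how a PySpark job can be submitted to an existing Dataproc cluster with the google-cloud-dataproc Python client (the project ID, region, cluster name, and bucket path are all placeholders):

```python
from google.cloud import dataproc_v1

region = "europe-west1"  # placeholder region

# The client must point at the regional Dataproc endpoint
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "stock-cluster"},  # placeholder cluster name
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/main.py"},  # placeholder path
}

operation = job_client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
response = operation.result()  # blocks until the Spark job finishes
print(response.driver_output_resource_uri)
```

The same submission can also be done with the gcloud command-line tool.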

Here are some sample tests from our program (a rough PySpark sketch of the first query follows the list):

1. View the 10 historically most expensive stocks

2. View the 10 most expensive stocks in a specific year

3. View the 10 most expensive stocks in a specific country historically

4. View the 10 most expensive stocks in a specific country in a specific year

5. View the 5 stocks with the highest historical growth

6. View the 5 stocks with the highest growth in a specific year

7. View the 5 stocks with the highest growth in a specific country historically

8. View the 5 stocks with the highest growth in a specific country in a specific year

9. View the probability of a stock increasing in value in a specific year
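For example, the first query can be expressed in PySpark roughly as follows. The column names Company and Close are assumptions about the dataset schema (not necessarily the ones our datasets use), and the ranking criterion here is the highest closing price each company ever reached:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stock-queries").getOrCreate()

# Assumed schema: one row per company per trading day,
# with a Company column and a Close (closing price) column.
stocks = spark.read.csv("./Sample/us_stocks.csv", header=True, inferSchema=True)

# Query 1: the 10 historically most expensive stocks,
# ranked by the maximum closing price ever reached.
top10 = (
    stocks.groupBy("Company")
          .agg(F.max("Close").alias("max_close"))
          .orderBy(F.desc("max_close"))
          .limit(10)
)

# Collect as (company, value) tuples, matching the program's output format
result = [(row["Company"], row["max_close"]) for row in top10.collect()]
print(result)
```

The year- and country-specific variants can be expressed the same way, with an added filter on the date or on the chosen country's dataset before aggregating, and the growth queries compare prices across the chosen period instead of taking a maximum.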