Data Science projects from Kaggle

Airbnbs in Amsterdam

Airbnb is an American company based in San Francisco that operates an online marketplace for short-term homestays and experiences. It was founded in 2008 and now holds a significant share of that market worldwide. More and more people list a flat or house on the platform as a steady source of income.

 

In this project, I took the data for Airbnb listings in Amsterdam. After an exploratory analysis to understand the dataset, I predict the prices of the listings.

 

link to GitHub

Airfares

Many people travel every day using different modes of transport such as train, bus, car and plane. Statistics on fatalities per year suggest that the airplane has become the safest of these modes.

 

Airfares can depend on factors such as distance, connections, time of day, travel class and time of year. In this project, I explore a dataset of airfares for flights between 7 cities in India.

 

link to GitHub

New customer segments

Segmentation (or clustering) is a well-known unsupervised learning technique for grouping unlabelled data. It creates classes from the data itself, without predefined labels.

 

The aim of this project is to redefine customer segments and place customers in the correct class.
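
As a rough illustration of how clustering assigns customers to segments, here is a minimal k-means sketch in plain Python. It is not the project's code: the choice of k-means, the two features (annual spend and number of visits) and all values are assumptions made up for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns (labels, centroids) for a list of numeric tuples."""
    # Simplification: use the first k points as initial centroids.
    # Real implementations use random or k-means++ initialisation.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical customers: (annual spend, number of visits).
customers = [(200, 2), (220, 3), (250, 2), (900, 12), (950, 15), (1000, 14)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

With real data the features would normally be scaled first, since otherwise the feature with the largest range (here, spend) dominates the distance.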

link to GitHub

Scrapped wafers during manufacturing

Semiconductors are the brains of modern electronics. They enable technologies critical to a country's economic growth, national security and global competitiveness. From smartphones to airplanes, semiconductors have driven improvements in technology that we rely on every day. Industries that depend heavily on semiconductors include computing, telecommunications, consumer electronics, banking, security, automotive/transportation, healthcare and manufacturing.

 

To build the devices on a chip, we need the basic substrate, the wafer. A wafer is a thin slice of semiconductor, such as crystalline silicon (c-Si), used to make integrated circuits. While silicon is the basic material, there are wafers that contain other elements. 

 

Chip manufacturers order wafers from third-party suppliers based on their order volume and on their experiments for process and technology development. No wafer manufacturer wants to ship products with anomalies. In this project, I try to build a model that predicts which wafers have anomalies.

 

link to GitHub

Forecasting house prices

Buying a house is one of the biggest dreams of many families and individuals. It is a long-term project and a lifetime investment, because the loan is repaid to the financial institution over many years.

 

In this project, I took the data for houses in Ames, Iowa (USA). After the necessary analysis, I build a predictive model to forecast house prices.

link to GitHub

Apache Spark

 

Pandas - SQL - Spark: Commands similarities

Pandas is an open-source Python library built for data analysis and manipulation. It is fast, powerful, flexible and easy to use.

SQL (Structured Query Language) is a standardized programming language that is used to manage relational databases and perform various operations on the data in them.

Apache Spark is an open-source, multi-language engine for big data processing. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size.

 

What are the command similarities?
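
To give a flavour of the comparison, the sketch below runs the same "filter, group, average, sort" query in pandas and in SQL (using Python's built-in sqlite3), with a PySpark equivalent shown only as a comment because it needs a Spark session. The table name `listings`, its columns and its values are invented for this example and are not taken from the tutorial.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data, invented for illustration.
listings = pd.DataFrame({
    "city": ["Amsterdam", "Amsterdam", "Paris", "Paris", "Rome"],
    "room_type": ["Entire home", "Private room", "Entire home",
                  "Private room", "Entire home"],
    "price": [180.0, 90.0, 150.0, 70.0, 110.0],
})

# pandas: boolean mask = WHERE, groupby = GROUP BY, sort_values = ORDER BY.
pandas_result = (
    listings[listings["price"] > 80]
    .groupby("city", as_index=False)["price"]
    .mean()
    .rename(columns={"price": "avg_price"})
    .sort_values("avg_price", ascending=False)
    .reset_index(drop=True)
)

# SQL: the same query against an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
listings.to_sql("listings", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT city, AVG(price) AS avg_price
    FROM listings
    WHERE price > 80
    GROUP BY city
    ORDER BY avg_price DESC
    """,
    conn,
)
conn.close()

# PySpark (not executed here; assumes a Spark DataFrame `sdf` with the same columns):
#   from pyspark.sql import functions as F
#   sdf.filter(F.col("price") > 80) \
#      .groupBy("city") \
#      .agg(F.avg("price").alias("avg_price")) \
#      .orderBy(F.desc("avg_price"))

print(pandas_result)
print(sql_result)
```

The three versions line up almost clause by clause, which is the kind of similarity the tutorial explores.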

 

Link to GitHub

PySpark MLlib vs Scikit-Learn

Apache Spark is a very powerful tool for big data, used in both academia and industry. The aim of this short tutorial is to show the similarities and differences between the ML library of PySpark and Scikit-Learn, a conventional Python library for machine learning, from the documentation to the results.

 

Link to GitHub

An instance of a Data Pipeline using ETL

As the amount of data and the number of data sources and data types in organizations grow, so does the importance of making use of that data. A data pipeline is a set of tools and activities for moving data from one system, with its own method of storage and processing, to another system in which it can be stored and managed differently.


In this mini project, I use PySpark to create the ETL pipeline. The data comprise Airbnb listings from 6 European capitals. The end user is expected to find the transformed data in the data warehouse and run whatever analysis they need.
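
The three ETL stages can be sketched in a few lines. This is not the project's PySpark code: to stay runnable without a Spark installation it swaps in Python's standard csv and sqlite3 modules, and the column names, sample rows and the in-memory "warehouse" are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; columns and values are invented for illustration.
RAW_CSV = """city,room_type,price
Amsterdam,Entire home,$180.00
Paris,Private room,$70.00
Rome,Entire home,
Vienna,Private room,$65.50
"""


def extract(text):
    """Extract: read the raw source into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a price, normalise text, cast price to float."""
    cleaned = []
    for row in rows:
        price = row["price"].replace("$", "").strip()
        if not price:
            continue  # skip listings with a missing price
        cleaned.append((row["city"], row["room_type"].lower(), float(price)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (city TEXT, room_type TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", rows)
    conn.commit()


# An in-memory SQLite database stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT city, room_type, price FROM listings ORDER BY city"
).fetchall()
print(loaded)
```

Keeping extract, transform and load as separate functions mirrors the structure of a larger pipeline, where each stage can be tested and replaced independently.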

 

Link to GitHub

 

Interactive Web Apps

 

Airbnb prices comparison per neighbourhood

Airbnb has become very popular and a good accommodation option for people on vacation or on a business trip. It started in San Francisco (USA). Clients can choose among the different accommodation options offered on the platform.

 

For this project, I took the data from Kaggle for 6 European cities (Amsterdam, Lisbon, London, Paris, Rome and Vienna) and compared prices per neighbourhood for the different room types.

 

link to GitHub

link to WebApp

Safari may fail to open the WebApp. If so, please try another web browser!

I recommend that you open the app from a PC rather than a mobile phone.

Airfares comparison

The airplane has become the safest form of transport and is used by many people every day. Airfares can depend on factors such as distance, connecting flights, the time of day of the flight and travel class.

 

For this project, I took the dataset from Kaggle, which includes flights between 7 cities in India, operated by different airlines. By selecting different options, you can explore the dataset and compare the prices.

 

link to GitHub

link to WebApp

Safari may fail to open the WebApp. If so, please try another web browser!

I recommend that you open the app from a PC rather than a mobile phone.
