This project aims to perform an exploratory data analysis (EDA) of São Paulo’s real estate market using pure SQL, with a focus on:
- Understanding price patterns;
- Identifying seasonal variations;
- Investigating the impact of features such as area, bedrooms, and parking spaces; and
- Mapping the most expensive and most affordable neighborhoods.
The goal is to extract actionable insights from a real-world dataset from the Zap Imóveis platform, demonstrating how SQL can be used to support business decisions.
- ⚡ Real-world, practical analysis: I chose a public, real dataset because I believe data analysis should drive action and real-world impact. This is a relevant topic for professionals working in BI, marketing, pricing, or territorial expansion.
- 🎯 SQL focus for quick decision-making: This is a more direct EDA, aligned with SQL’s strengths — quick querying, preprocessing, and business-focused calculations. For deeper explorations, I usually prefer Python (see my complementary Python EDA project here).
- 🧱 Technical coverage: I used CTEs, Window Functions, and created a
VIEW
to filter outliers using percentiles. This makes the analysis cleaner and enhances the statistical representativeness of the metrics. - 🗣️ Commented code and results: I added comments throughout the SQL script to explain the logic and findings. This improves async communication, knowledge sharing, and supports my own analytical process.
- Platform: Zap Imóveis
- Dataset: Kaggle - Discover São Paulo Apartment Prices Insights (public source, anonymized data of 27k apartments for sale in São Paulo)
- Initial inspection and data understanding
- Null value checking
- Outlier detection and removal using percentiles
VIEW
creation for reusability- Univariate and bivariate exploration
- Bucketing (area) and ranking (price by neighborhood)
- Aggregations by year, month, and neighborhood
- Conclusion
- Average price and price per square meter peaked in 2021 and have been declining since.
- In 2024, prices dropped by around 18% compared to the peak.
- There is seasonality in pricing, with lower prices in April and higher ones in February, November, and December.
- The most expensive neighborhoods were: Ibirapuera, Jardim América, and Jardim Panorama.
- The most affordable neighborhoods included Guapira, Jardim Patente, and Jaraguá.
- Small apartments (< 50m²) had an average price of R$ 587k, with some reaching over R$ 2 million.
- Area, number of bedrooms, bathrooms, and parking spaces showed a direct relationship with price — confirming common assumptions with data.
- Calculate the Pearson correlation coefficient between numerical features and price.
- Build a price prediction model using Python and regression.
- Enrich with external data (e.g. income per neighborhood, mobility, safety) and additional platforms to enable more robust analyses.
- PostgreSQL (via DBeaver)
- SQL (CTEs, Window Functions, Views, percentiles, aggregations)
- GitHub for version control and documentation
Arthur Curry
Data Analyst with a background in Growth and RevOps
LinkedIn • GitHub