Takehome_Ex03

Take-home Exercise 3: Predicting HDB Public Housing Resale Prices using Geographically Weighted Methods

Background

Housing is an essential component of household wealth worldwide, and buying a house is a major investment for most people. Housing prices are affected by many factors. Some are global in nature, such as a country's general economy or inflation rate; others are specific to the properties themselves. These property-specific factors can be further divided into structural and locational factors. Structural factors are variables related to the property itself, such as its size, fittings, and tenure. Locational factors are variables related to the property's neighbourhood, such as proximity to childcare centres, public transport services and shopping centres.

Conventionally, housing resale price predictive models were built using the Ordinary Least Squares (OLS) method. However, this method fails to take into consideration the spatial autocorrelation and spatial heterogeneity that exist in geographic data sets such as housing transactions. In the presence of spatial autocorrelation, OLS estimation of predictive housing resale pricing models can lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, geographically weighted models were introduced for calibrating predictive models for housing resale prices.

Objective

To predict HDB resale prices at the sub-market level for the months of January and February 2023 in Singapore. The predictive models will be built using both the conventional OLS method and GWR methods. The objective is to compare the performance of the conventional OLS method against the geographically weighted methods.

Concepts: GWR and the Hedonic Pricing Model

Geographically weighted regression (GWR) is a spatial statistical technique that takes non-stationary variables (e.g., climate, demographic factors, physical environment characteristics) into consideration.

In this take-home assignment, we need to build a predictive model using GWR. The predictive model needs to take into consideration locational factors such as proximity to childcare centres, public transport services and shopping centres.

Hence, we first have to identify the relevant locational factors for this assignment, which include:

  • Proximity to CBD

  • Proximity to eldercare

  • Proximity to foodcourt/hawker centres

  • Proximity to MRT

  • Proximity to park

  • Proximity to good primary school

  • Proximity to shopping mall

  • Proximity to supermarket

  • Number of kindergartens within 350m

  • Number of childcare centres within 350m

  • Number of bus stops within 350m

  • Number of primary schools within 1km

Steps

Load Necessary Libraries

pacman::p_load(olsrr, corrplot, sf, spdep, GWmodel, tmap, tidyverse, gtsummary, readxl, jsonlite, rvest)

Importing geospatial data (MP2019)

mpsz2019 = st_read(dsn = "data/geospatial", layer = "MPSZ-2019")
Reading layer `MPSZ-2019' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

Updating CRS information

mpsz_svy21 <- st_transform(mpsz2019, 3414)

Check whether the CRS has been correctly assigned:

st_crs(mpsz_svy21)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Check for invalid geometries

length(which(st_is_valid(mpsz_svy21) == FALSE))
[1] 6

Make the invalid geometries valid

mpsz_svy21 <- st_make_valid(mpsz_svy21)
length(which(st_is_valid(mpsz_svy21) == FALSE))
[1] 0

Aspatial Data Wrangling

Importing the aspatial data

resale = read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
Rows: 148277 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We plan to look at four-room flats transacted between 1st January 2021 and 31st December 2022.

resale <- resale %>% 
  filter(flat_type == "4 ROOM") 
resale
# A tibble: 61,961 × 11
   month   town    flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
   <chr>   <chr>   <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
 1 2017-01 ANG MO… 4 ROOM  472   ANG MO… 10 TO …      92 New Ge…    1979 61 yea…
 2 2017-01 ANG MO… 4 ROOM  475   ANG MO… 07 TO …      91 New Ge…    1979 61 yea…
 3 2017-01 ANG MO… 4 ROOM  629   ANG MO… 01 TO …      94 New Ge…    1981 63 yea…
 4 2017-01 ANG MO… 4 ROOM  546   ANG MO… 01 TO …      92 New Ge…    1981 63 yea…
 5 2017-01 ANG MO… 4 ROOM  131   ANG MO… 01 TO …      98 New Ge…    1979 61 yea…
 6 2017-01 ANG MO… 4 ROOM  254   ANG MO… 04 TO …      97 New Ge…    1977 59 yea…
 7 2017-01 ANG MO… 4 ROOM  470   ANG MO… 04 TO …      92 New Ge…    1979 61 yea…
 8 2017-01 ANG MO… 4 ROOM  601   ANG MO… 04 TO …      91 New Ge…    1980 62 yea…
 9 2017-01 ANG MO… 4 ROOM  463   ANG MO… 04 TO …      92 New Ge…    1980 62 yea…
10 2017-01 ANG MO… 4 ROOM  207   ANG MO… 04 TO …      97 New Ge…    1976 58 yea…
# … with 61,951 more rows, 1 more variable: resale_price <dbl>, and abbreviated
#   variable names ¹​flat_type, ²​street_name, ³​storey_range, ⁴​floor_area_sqm,
#   ⁵​flat_model, ⁶​lease_commence_date, ⁷​remaining_lease

However, when we look at the data, there is no longitude and latitude information, which means we need to geocode it. To make sure we get accurate geocoding results, we have to combine the block and street name as the input:

In the OneMap SG API, “ST.” is usually written as “SAINT”, so it is better for us to replace “ST.” with “SAINT”:

resale$street_name <- gsub("ST\\.", "SAINT", resale$street_name)
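As a quick sanity check, the pattern only replaces the literal “ST.” token (the address below is illustrative):

```r
# The escaped dot "\\." matches a literal period, so only the "ST." token changes
gsub("ST\\.", "SAINT", "461 ST. GEORGE'S RD")
# "461 SAINT GEORGE'S RD"
```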

Here is the code for geocoding the aspatial data. (Reference was taken from the senior sample submissions for the code in this section, with credit to MEGAN SIM TZE YEN.)

# 
# library(httr)
# geocode <- function(block, streetname) {
#   base_url <- "https://developers.onemap.sg/commonapi/search"
#   address <- paste(block, streetname, sep = " ")
#   query <- list("searchVal" = address, 
#                 "returnGeom" = "Y",
#                 "getAddrDetails" = "N",
#                 "pageNum" = "1")
#   
#   res <- GET(base_url, query = query)
#   restext<-content(res, as="text")
#   
#   output <- fromJSON(restext)  %>% 
#     as.data.frame %>%
#     select(results.LATITUDE, results.LONGITUDE)
# 
#   return(output)
# }
# resale$LATITUDE <- 0
# resale$LONGITUDE <- 0
# 
# for (i in 1:nrow(resale)){
#   temp_output <- geocode(resale[i, 4], resale[i, 5])
#   
#   resale$LATITUDE[i] <- temp_output$results.LATITUDE
#   resale$LONGITUDE[i] <- temp_output$results.LONGITUDE}
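One caveat with the loop above: a single failed request or malformed response aborts the entire run. A hedged sketch of a defensive wrapper (geocode_safe and the stub failing function are illustrative, not part of the original submission):

```r
# Illustrative wrapper: record NAs for a bad address instead of aborting the loop.
# `geocode_fn` stands in for the geocode() function defined above.
geocode_safe <- function(block, streetname, geocode_fn) {
  tryCatch(
    geocode_fn(block, streetname),
    error = function(e) data.frame(results.LATITUDE = NA_real_,
                                   results.LONGITUDE = NA_real_)
  )
}

# Demonstration with a stub that always fails:
failing_fn <- function(block, streetname) stop("request timed out")
geocode_safe("472", "ANG MO KIO AVE 10", failing_fn)
# returns a one-row data frame of NAs instead of raising an error
```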

Make a copy of the geocoded data so we don’t need to run the geocoding process again:

#write_csv(resale, "data/rds/resalecopy1.csv")

Read the geocoded data

resale = read_csv("data/rds/resalecopy1.csv")
Rows: 23656 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (5): floor_area_sqm, lease_commence_date, resale_price, LATITUDE, LONGITUDE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Update CRS

resale.sf <- st_as_sf(resale,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

Structural Factors

FLOOR LEVEL

unique(resale$storey_range)
 [1] "04 TO 06" "01 TO 03" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
 [7] "19 TO 21" "22 TO 24" "28 TO 30" "25 TO 27" "34 TO 36" "31 TO 33"
[13] "43 TO 45" "37 TO 39" "40 TO 42" "46 TO 48" "49 TO 51"

We can observe that there are 17 categories of storey ranges. To represent each range, we will use the pivot_wider() function to create duplicate variables, assigning a value of 1 if the observation falls within the range and 0 if it does not.

resale <- resale %>%
  pivot_wider(names_from = storey_range, values_from = storey_range, 
              values_fn = list(storey_range = ~1), values_fill = 0) 
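A toy illustration of this dummy-encoding step on made-up rows (the tibble `toy` is purely illustrative):

```r
library(tidyverse)

# Three made-up transactions across two storey ranges
toy <- tibble(id = 1:3,
              storey_range = c("01 TO 03", "04 TO 06", "01 TO 03"))

# Each distinct storey range becomes its own 0/1 column
dummies <- toy %>%
  pivot_wider(names_from = storey_range, values_from = storey_range,
              values_fn = list(storey_range = ~1), values_fill = 0)
dummies
# row 1 has `01 TO 03` = 1 and `04 TO 06` = 0, and so on
```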

REMAINING LEASE

The current format of the remaining_lease variable is a string, but we need it to be in numeric format. To achieve this, we can split the string into its month and year values, calculate the remaining lease in years, and replace the original string values with the calculated numeric values.

str_list <- str_split(resale$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      resale$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    resale$remaining_lease[i] <- year
  }
}
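The loop above works, but the same conversion can also be done in a vectorized way; a sketch with stringr on illustrative sample strings:

```r
library(stringr)

# Made-up remaining-lease strings in the two formats seen in the data
lease <- c("61 years 04 months", "63 years")

yrs  <- as.numeric(str_extract(lease, "^\\d+"))         # leading year count
mths <- as.numeric(str_extract(lease, "\\d+(?= month)")) # month count, if present
mths[is.na(mths)] <- 0                                   # no month component -> 0

lease_yrs <- round(yrs + mths/12, 2)
lease_yrs
# 61.33 63.00
```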
resale_sf <- st_as_sf(resale, 
                      coords = c("LONGITUDE", 
                                 "LATITUDE"), 
                      crs=4326) %>%
  st_transform(crs = 3414)

Locational Factors

CBD

As proximity to the CBD is one of the locational factors we are interested in for improving our predictive model, let’s take the coordinates of the Downtown Core to be the coordinates of the CBD:

lat <- 1.287953
lng <- 103.851784

cbd_sf <- data.frame(lat, lng) %>%
  st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
  st_transform(crs=3414)
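As a sanity check on the projection, distances computed with st_distance() are in metres once both layers are in EPSG:3414. The second point below is made up purely for illustration:

```r
library(sf)

# CBD point (as above) and an arbitrary illustrative point, both projected to SVY21
cbd <- st_transform(st_sfc(st_point(c(103.851784, 1.287953)), crs = 4326), 3414)
pt  <- st_transform(st_sfc(st_point(c(103.8, 1.35)), crs = 4326), 3414)

d <- as.numeric(st_distance(cbd, pt))
d  # straight-line distance in metres, roughly 9 km for this pair
```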

Eldercare

Eldercare_sf <- st_read(dsn = "data/geospatial/eldercare-services-shp", 
                layer = "ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial\eldercare-services-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
st_crs(Eldercare_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
Eldercare_sf <- st_transform(Eldercare_sf, crs=3414)

Kindergartens

To get the kindergartens’ geometry data, I referred to Megan’s method of using the OneMap API to do the geocoding.

The steps include:

  1. Get the token from the OneMap API
  2. Search the themes and check which themes are available
  3. Use the get_theme function to get the geospatial information of the kindergartens
# library(sf)
# library(onemapsgapi)
# 
# token <- "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjEwMDUyLCJ1c2VyX2lkIjoxMDA1MiwiZW1haWwiOiJ0YWtvdGFrb3YwMDBAZ21haWwuY29tIiwiZm9yZXZlciI6ZmFsc2UsImlzcyI6Imh0dHA6XC9cL29tMi5kZmUub25lbWFwLnNnXC9hcGlcL3YyXC91c2VyXC9zZXNzaW9uIiwiaWF0IjoxNjc5NTU4NDYxLCJleHAiOjE2Nzk5OTA0NjEsIm5iZiI6MTY3OTU1ODQ2MSwianRpIjoiZDgyZTg1MDMzMTUzMTRkZjEzYjk5MWJmMDJkMDQ1NjIifQ.070RoJrraz95GuLVvYZpfzyMyGWQZ6S0D5FsLL39WGU"
# search_themes(token, "kindergartens") 
# Kindergartens_tibble <- get_theme(token, "kindergartens")
# Kindergartens_sf <- st_as_sf(Kindergartens_tibble, coords=c("Lng", "Lat"), crs=4326)
# Kindergartens_sf <- st_transform(Kindergartens_sf, crs=3414)

Childcare Center

# search_themes(token, "childcare")
# library(sf)
# library(onemapsgapi)
# 
# themetibble <- get_theme(token, "childcare")
# childcaresf <- st_as_sf(themetibble, coords=c("Lng", "Lat"), crs=4326)
# childcaresf <- st_transform(childcaresf, crs=3414)

Park

# search_themes(token, "parks")
# library(sf)
# library(onemapsgapi)
# Parks_themetibble <- get_theme(token, "nationalparks")
# Parks_sf <- st_as_sf(Parks_themetibble, coords=c("Lng", "Lat"), crs=4326)
# Parks_sf <- st_transform(Parks_sf, crs=3414)

Shopping mall

(Reference was taken from the senior sample submissions for the code in this section, with credit to NOR AISYAH BINTE AJIT.) The approach extracts the names of the malls from Wikipedia and then uses the get_coords function to obtain their respective coordinates.

The following code chunk performs three steps:

  • Reads the Wikipedia HTML page containing the Shopping Malls in Singapore.

  • Reads the text portion of the unordered list element selected by html_nodes().

  • Appends the extracted mall names to an empty mall_list.

# get_coords <- function(add_list){
#   
#   # Create a data frame to store all retrieved coordinates
#   postal_coords <- data.frame()
#     
#   for (i in add_list){
#     #print(i)
# 
#     r <- GET('https://developers.onemap.sg/commonapi/search?',
#            query=list(searchVal=i,
#                      returnGeom='Y',
#                      getAddrDetails='Y'))
#     data <- fromJSON(rawToChar(r$content))
#     found <- data$found
#     res <- data$results
#     
#     # Create a new data frame for each address
#     new_row <- data.frame()
#     
#     # If single result, append 
#     if (found == 1){
#       postal <- res$POSTAL 
#       lat <- res$LATITUDE
#       lng <- res$LONGITUDE
#       new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
#     }
#     
#     # If multiple results, drop NIL and append top 1
#     else if (found > 1){
#       # Remove those with NIL as postal
#       res_sub <- res[res$POSTAL != "NIL", ]
#       
#       # Set as NA first if no Postal
#       if (nrow(res_sub) == 0) {
#           new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
#       }
#       
#       else{
#         top1 <- head(res_sub, n = 1)
#         postal <- top1$POSTAL 
#         lat <- top1$LATITUDE
#         lng <- top1$LONGITUDE
#         new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
#       }
#     }
# 
#     else {
#       new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
#     }
#     
#     # Add the row
#     postal_coords <- rbind(postal_coords, new_row)
#   }
#   return(postal_coords)
# }
# url <- "https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore"
# malls_list <- list()
# 
# for (i in 2:7){
#   malls <- read_html(url) %>%
#     html_nodes(xpath = paste('//*[@id="mw-content-text"]/div[1]/div[',as.character(i),']/ul/li',sep="") ) %>%
#     html_text()
#   malls_list <- append(malls_list, malls)
# }
# library(httr)
# malls_list_coords <- get_coords(malls_list) %>% 
#   rename("mall_name" = "address")

Some of the shopping mall names in our dataset are not up to date, which can cause issues in the analysis. For example, POMO has been renamed to GR.ID and OD Mall has been renamed to The Grandstand. Hence, we should address these issues before proceeding. One of the shopping malls, Yew Tee Shopping Centre, could not be found when researched further, so we remove this row from our dataset.

# malls_list_coords <- subset(malls_list_coords, mall_name!= "Yew Tee Shopping Centre")
# invalid_malls<- subset(malls_list_coords, is.na(malls_list_coords$postal))
# invalid_malls_list <- unique(invalid_malls$mall_name)
# corrected_malls <- c("Clarke Quay", "City Gate", "Raffles Holland V", "Knightsbridge", "Mustafa Centre", "GR.ID", "Shaw House",
#                      "The Poiz Centre", "Velocity @ Novena Square", "Singapore Post Centre", "PLQ Mall", "KINEX", "The Grandstand")
# 
# for (i in 1:length(invalid_malls_list)) {
#   malls_list_coords <- malls_list_coords %>% 
#     mutate(mall_name = ifelse(as.character(mall_name) == invalid_malls_list[i], corrected_malls[i], as.character(mall_name)))
# }

Then create a list to store the unique mall names:

# malls_list <- sort(unique(malls_list_coords$mall_name))

To retrieve the coordinates of shopping malls

# malls_coords <- get_coords(malls_list)
# malls_coords[(is.na(malls_coords$postal) | is.na(malls_coords$latitude) | is.na(malls_coords$longitude)), ]
# malls_sf <- st_as_sf(malls_coords,
#                     coords = c("longitude", 
#                                "latitude"),
#                     crs=4326) %>%
#   st_transform(crs = 3414)

Primary School

# pri_sch <- read_csv("data/geospatial/general-information-of-schools.csv")
# pri_sch <- pri_sch %>%
#   filter(mainlevel_code == "PRIMARY"| mainlevel_code == "MIXED LEVELS") %>%
#   select(school_name, address, postal_code, mainlevel_code)
# prisch_list <- sort(unique(pri_sch$postal_code))
# prisch_coords <- get_coords(prisch_list)
# prisch_coords[(is.na(prisch_coords$postal) | is.na(prisch_coords$latitude) | is.na(prisch_coords$longitude)), ]
# prisch_coords = prisch_coords[c("postal","latitude", "longitude")]
# pri_sch <- left_join(pri_sch, prisch_coords, by = c('postal_code' = 'postal'))
# prisch_sf <- st_as_sf(pri_sch,
#                     coords = c("longitude", 
#                                "latitude"),
#                     crs=4326) %>%
#   st_transform(crs = 3414)

Select top 10 school

I referred to this website to check which are the exact top 10 primary schools and get their postal codes, then filtered them from the primary school dataset:

# postal_codes <- c(599986, 449761, 597610, 536451, 579767, 128806, 569405, 738907, 579646, 227988)
# 
# # Filter prisch_sf for rows with postal code in the vector
# top10prisch_sf <- prisch_sf %>%
#   filter(postal_code %in% postal_codes)
# top10prisch_sf

Hawker Center

# hawker_sf <- st_read("data/geospatial/hawker-centres-kml.kml") 
# hawker_sf <- st_transform(hawker_sf, crs=3414)
# st_crs(hawker_sf)

Bus Stop

# bus_sf <- st_read(dsn="data/geospatial/BusStop_Feb2023", layer="BusStop")
# bus_sf <- st_transform(bus_sf, crs=3414)

MRT

# rail_network_sf <- st_read(dsn="data/geospatial/TrainStation_Feb2023", layer="RapidTransitSystemStation")
# rail_network_sf<- st_transform(rail_network_sf, crs=3414)

Supermarket

# supermkt_sf <- st_read("data/geospatial/supermarkets-geojson.geojson")
# supermkt_sf<- st_transform(supermkt_sf, crs=3414)

Check invalid geometries for the locational factors

Bus stop

# length(which(st_is_valid(bus_sf) == FALSE))

MRT

# length(which(st_is_valid(rail_network_sf) == FALSE))

This layer contains invalid geometries; here is the code to convert them to a valid geometry type and confirm the fix:

# # Identify the invalid geometries
# invalid_geom <- which(!st_is_valid(rail_network_sf))
# 
# # Use st_cast() to convert the invalid geometries to a valid geometry type
# rail_network_sf <- st_cast(rail_network_sf, "MULTILINESTRING")
# 
# # Confirm that no invalid geometries remain
# length(which(st_is_valid(rail_network_sf) == FALSE))

Eldercare

# length(which(st_is_valid(Eldercare_sf) == FALSE))

CBD

# cbd_sf <- st_make_valid(cbd_sf)
# length(which(st_is_valid(cbd_sf) == FALSE))

Childcare

# length(which(st_is_valid(childcaresf) == FALSE))

Kindergartens

# length(which(st_is_valid(Kindergartens_sf) == FALSE))

Malls

# length(which(st_is_valid(malls_sf) == FALSE))

Parks

# length(which(st_is_valid(Parks_sf) == FALSE))

Primary school

# length(which(st_is_valid(prisch_sf) == FALSE))

Top primary school

# length(which(st_is_valid(top10prisch_sf) == FALSE))

Calculate Proximity

# library(units)
# library(matrixStats)
# proximity <- function(df1, df2, varname) {
#   dist_matrix <- st_distance(df1, df2) %>%
#     drop_units()
#   df1[,varname] <- rowMins(dist_matrix)
#   return(df1)
# }
# resale.sf <-
#   proximity(resale.sf, cbd_sf, "PROX_CBD") %>%
#   proximity(., childcaresf, "PROX_CHILDCARE") %>%
#   proximity(., Eldercare_sf, "PROX_ELDERCARE") %>%
#   proximity(., hawker_sf, "PROX_HAWKER") %>%
#   proximity(., rail_network_sf, "PROX_MRT") %>%
#   proximity(., Parks_sf, "PROX_PARK") %>%
#   proximity(., top10prisch_sf, "PROX_TOPPRISCH") %>%
#   proximity(., malls_sf, "PROX_MALL") %>%
#   proximity(., supermkt_sf, "PROX_SPRMKT")
# num_radius <- function(df1, df2, varname, radius) {
#   dist_matrix <- st_distance(df1, df2) %>%
#     drop_units() %>%
#     as.data.frame()
#   df1[,varname] <- rowSums(dist_matrix <= radius)
#   return(df1)
# }
# resale.sf <- 
#   num_radius(resale.sf, Kindergartens_sf, "NUM_KNDRGTN", 350) %>%
#   num_radius(., childcaresf, "NUM_CHILDCARE", 350) %>%
#   num_radius(., bus_sf, "NUM_BUS_STOP", 350) %>%
#   num_radius(., prisch_sf, "NUM_PriSch", 1000)
# resale.sf <- resale.sf %>%
#   mutate() %>%
#   rename("AREA_SQM" = "floor_area_sqm",
#          "LEASE_YRS" = "remaining_lease",
#          "PRICE"= "resale_price") %>%
#   relocate(`PRICE`)
# st_write(resale.sf, "data/geospatial/resale-final_cleaned.shp")
resale.sf <- st_read(dsn="data/geospatial", layer="resale-final_cleaned")
Reading layer `resale-final_cleaned' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 23656 features and 24 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11519.79 ymin: 28217.39 xmax: 42645.18 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
resale.sf <- resale.sf %>%
  rename( "AREA_SQM"= "AREA_SQ",
          "LEASE_YRS" = "LEASE_Y" , 
          "PROX_CBD" = "PROX_CB",
          "PROX_CHILDCARE" = "PROX_CH" ,
          "PROX_ELDERCARE" = "PROX_EL",
          "PROX_HAWKER"= "PROX_HA" ,
        "PROX_PARK"=  "PROX_PA",
          "PROX_MRT" ="PROX_MR",
          "PROX_SPRMKT"= "PROX_SP",
        "PROX_MALL"= "PROX_MA",
         "NUM_KNDRGTN"= "NUM_KND" ,
         "NUM_CHILDCARE"="NUM_CHI" ,
        "NUM_BUS_STOP"="NUM_BUS" ,
         "PROX_TOPPRISCH"= "PROX_TO"
      
      ) %>%
  relocate(`PRICE`)

Drawing Statistical Point Map

tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)
tm_shape(resale.sf) +  
  tm_dots(col = "PRICE",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))

From the statistical point map, we can observe that the southern and central areas of Singapore have higher HDB resale prices, at around 600k SGD.

EDA

ggplot(data=resale.sf, aes(x=`PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The figure above reveals a right-skewed distribution, which means that more 4-room units were transacted at relatively lower prices. Statistically, the skewed distribution can be normalised by using a log transformation. We can see that most transactions fall within 400k to 600k SGD.

resale.sf <- resale.sf %>%
  mutate(`LOG_PRICE` = log(PRICE))
ggplot(data=resale.sf, aes(x=`LOG_PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

resale.sf <- resale.sf %>%
  mutate(LEASE_YRS = as.integer(str_sub(LEASE_YRS, 1, 2)))

Multiple Histogram Plots of Variable Distributions

Let’s also look at the distributions of the variables using multiple histogram plots:

library(ggpubr)
Warning: package 'ggpubr' was built under R version 4.2.3
AREA_SQM <- ggplot(data = resale.sf, aes(x = `AREA_SQM`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_CBD <- ggplot(data = resale.sf, aes(x = `PROX_CBD`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_CHILDCARE <- ggplot(data = resale.sf, aes(x = `PROX_CHILDCARE`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_ELDERCARE <- ggplot(data = resale.sf, aes(x = `PROX_ELDERCARE`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_HAWKER <- ggplot(data = resale.sf, aes(x = `PROX_HAWKER`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_MRT <- ggplot(data = resale.sf, aes(x = `PROX_MRT`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_PARK <- ggplot(data = resale.sf, aes(x = `PROX_PARK`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_TOPPRISCH <- ggplot(data = resale.sf, aes(x = `PROX_TOPPRISCH`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_MALL <- ggplot(data = resale.sf, aes(x = `PROX_MALL`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_SPRMKT <- ggplot(data = resale.sf, aes(x = `PROX_SPRMKT`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')


ggarrange(AREA_SQM, PROX_CBD, PROX_CHILDCARE, PROX_ELDERCARE, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_TOPPRISCH, PROX_MALL, PROX_SPRMKT, ncol = 2, nrow = 5)

NUM_CHILDCARE <- ggplot(data = resale.sf, aes(x = `NUM_CHILDCARE`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

NUM_KNDRGTN <- ggplot(data = resale.sf, aes(x = `NUM_KNDRGTN`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

NUM_BUS_STOP <- ggplot(data = resale.sf, aes(x = `NUM_BUS_STOP`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

LEASE_YRS <- ggplot(data = resale.sf, aes(x = `LEASE_YRS`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

ggarrange(NUM_KNDRGTN, NUM_CHILDCARE, NUM_BUS_STOP,LEASE_YRS, ncol = 2, nrow = 2)

summary(resale.sf)
     PRICE            month               town             flt_typ         
 Min.   : 250000   Length:23656       Length:23656       Length:23656      
 1st Qu.: 440000   Class :character   Class :character   Class :character  
 Median : 490000   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 526124                                                           
 3rd Qu.: 568000                                                           
 Max.   :1370000                                                           
    block             strt_nm            stry_rn             AREA_SQM     
 Length:23656       Length:23656       Length:23656       Min.   : 70.00  
 Class :character   Class :character   Class :character   1st Qu.: 91.00  
 Mode  :character   Mode  :character   Mode  :character   Median : 93.00  
                                                          Mean   : 94.67  
                                                          3rd Qu.:100.00  
                                                          Max.   :145.00  
   flt_mdl             ls_cmm_       LEASE_YRS        PROX_CBD      
 Length:23656       Min.   :1967   Min.   :44.00   Min.   :  999.4  
 Class :character   1st Qu.:1988   1st Qu.:65.00   1st Qu.: 9701.8  
 Mode  :character   Median :2001   Median :79.00   Median :13106.4  
                    Mean   :2001   Mean   :78.33   Mean   :12130.6  
                    3rd Qu.:2015   3rd Qu.:92.00   3rd Qu.:14885.0  
                    Max.   :2019   Max.   :97.00   Max.   :19650.1  
 PROX_CHILDCARE   PROX_ELDERCARE    PROX_HAWKER        PROX_MRT       
 Min.   :  0.00   Min.   :   0.0   Min.   :  30.6   Min.   :   9.112  
 1st Qu.: 64.35   1st Qu.: 318.6   1st Qu.: 396.5   1st Qu.: 243.245  
 Median :105.50   Median : 600.4   Median : 684.5   Median : 427.835  
 Mean   :112.23   Mean   : 780.3   Mean   : 796.3   Mean   : 514.828  
 3rd Qu.:153.82   3rd Qu.:1083.9   3rd Qu.:1025.1   3rd Qu.: 704.065  
 Max.   :547.39   Max.   :3301.6   Max.   :2867.6   Max.   :2111.909  
   PROX_PARK       PROX_TOPPRISCH      PROX_MALL        PROX_SPRMKT    
 Min.   :  45.42   Min.   :  79.58   Min.   :  43.45   Min.   :   0.0  
 1st Qu.: 490.12   1st Qu.:2314.17   1st Qu.: 394.59   1st Qu.: 162.1  
 Median : 703.40   Median :3661.82   Median : 604.26   Median : 251.6  
 Mean   : 796.82   Mean   :3650.32   Mean   : 710.02   Mean   : 274.4  
 3rd Qu.:1014.12   3rd Qu.:4857.56   3rd Qu.: 917.68   3rd Qu.: 362.4  
 Max.   :2411.72   Max.   :8763.99   Max.   :2668.57   Max.   :1571.3  
  NUM_KNDRGTN     NUM_CHILDCARE     NUM_BUS_STOP       NUM_PrS     
 Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.: 3.000   1st Qu.: 6.000   1st Qu.:2.000  
 Median :1.0000   Median : 5.000   Median : 8.000   Median :3.000  
 Mean   :0.9865   Mean   : 4.841   Mean   : 7.958   Mean   :3.285  
 3rd Qu.:1.0000   3rd Qu.: 6.000   3rd Qu.:10.000   3rd Qu.:4.000  
 Max.   :8.0000   Max.   :22.000   Max.   :19.000   Max.   :9.000  
          geometry       LOG_PRICE    
 POINT        :23656   Min.   :12.43  
 epsg:3414    :    0   1st Qu.:12.99  
 +proj=tmer...:    0   Median :13.10  
                       Mean   :13.15  
                       3rd Qu.:13.25  
                       Max.   :14.13  

Hedonic Pricing Modelling

Multiple Linear Regression Method

resale_nogeo <- resale.sf %>%
  st_drop_geometry() %>%
  select_if(is.numeric) 
corrplot::corrplot(cor(resale_nogeo[, 2:17]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

Before building a multiple regression model, it is important to ensure that the independent variables used are not highly correlated with each other. If highly correlated independent variables are used in building a regression model by mistake, the quality of the model will be compromised. This phenomenon is known as multicollinearity in statistics.
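Multicollinearity can also be quantified with variance inflation factors (VIF). A toy sketch using olsrr’s ols_vif_tol() on made-up data; a VIF above roughly 10 is a common warning threshold, not a hard rule:

```r
library(olsrr)

# Two nearly collinear predictors on simulated data
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # almost an exact copy of x1
y  <- 2 * x1 + rnorm(100)

vifs <- ols_vif_tol(lm(y ~ x1 + x2))
vifs
# the VIFs for x1 and x2 come out extremely large, flagging the collinearity
```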

From the correlation matrix, we can see that lease_commence_date is highly correlated with LEASE_YRS, so we remove lease_commence_date to prevent multicollinearity.

#resale.sf<- select(resale.sf, -lease_commence_date)
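The correlation plot above flags such pairs visually; the same check can be automated by scanning the correlation matrix for pairs above a threshold. A minimal base-R sketch on hypothetical data (the 0.8 cut-off and the toy variables are illustrative, not part of the original analysis):

```r
# Toy data: b is nearly collinear with a, c is independent
set.seed(42)
df <- data.frame(a = rnorm(100))
df$b <- df$a + rnorm(100, sd = 0.05)
df$c <- rnorm(100)

# Keep only the lower triangle so each pair is reported once
cm <- abs(cor(df))
cm[upper.tri(cm, diag = TRUE)] <- 0

# Flag variable pairs with |r| above the chosen threshold
which(cm > 0.8, arr.ind = TRUE)
```

Only the (b, a) pair should be flagged; one of each flagged pair would then be dropped, as done with lease_commence_date above.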

Data Sampling

training_data <- resale.sf %>% 
  filter(month >= "2021-01" & month <= "2022-10")

# start the test window after the training window to avoid overlapping months
testing_data <- resale.sf %>%
  filter(month >= "2022-11" & month <= "2022-12")
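Because the month field is stored as a "YYYY-MM" string, string comparison respects chronological order, and a quick toy sketch (hypothetical months, not the actual data) shows how to verify that the two windows do not share any months:

```r
# Hypothetical months for one year
months  <- sprintf("2022-%02d", 1:12)

# Same style of string-comparison filters as used for the split above
train_m <- months[months >= "2022-01" & months <= "2022-10"]
test_m  <- months[months >= "2022-11" & months <= "2022-12"]

# An empty intersection confirms the windows do not overlap
intersect(train_m, test_m)  # → character(0)
```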

Building a non-spatial multiple linear regression

price_mlr <- lm(PRICE ~ AREA_SQM + PROX_CBD +  PROX_ELDERCARE + PROX_HAWKER+ PROX_MRT +  PROX_PARK + PROX_TOPPRISCH + PROX_MALL +PROX_SPRMKT +LEASE_YRS + NUM_CHILDCARE+NUM_KNDRGTN+ NUM_PrS,
                data=training_data)
write_rds(price_mlr, "data/ML_results/price_mlr.rds")
gtsummary::tbl_regression(price_mlr)
Characteristic    Beta      95% CI¹           p-value
AREA_SQM          3,070     2,915, 3,226      <0.001
PROX_CBD          -18       -19, -18          <0.001
PROX_ELDERCARE    -9.5      -11, -7.7         <0.001
PROX_HAWKER       -25       -27, -23          <0.001
PROX_MRT          -34       -37, -31          <0.001
PROX_PARK         9.3       6.7, 12           <0.001
PROX_TOPPRISCH    -0.34     -0.97, 0.30       0.3
PROX_MALL         8.2       5.3, 11           <0.001
PROX_SPRMKT       13        6.1, 20           <0.001
LEASE_YRS         5,224     5,143, 5,306      <0.001
NUM_CHILDCARE     -2,098    -2,642, -1,555    <0.001
NUM_KNDRGTN       10,654    9,505, 11,802     <0.001
NUM_PrS           -8,889    -9,659, -8,120    <0.001

¹ CI = Confidence Interval
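As a complement to the correlation matrix, variance inflation factors (VIF) of a fitted model can flag multicollinearity directly; values above about 10 are commonly treated as problematic. A minimal base-R sketch on the built-in mtcars data (a hypothetical stand-in for the resale predictors; the VIF formula 1/(1 − R²) is standard):

```r
# Manual VIF: regress each predictor on the remaining predictors,
# then compute VIF = 1 / (1 - R^2)
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

# VIFs for three correlated predictors from mtcars
vif_manual(mtcars[, c("disp", "hp", "wt")])
```

The olsrr package loaded earlier provides the same diagnostic through ols_vif_tol() on a fitted lm object.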

Building Hedonic Pricing Models using GWmodel

Converting the sf data.frame to SpatialPointDataFrame

train_data_sp <- as_Spatial(training_data)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 21761 
extent      : 11519.79, 42645.18, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 25
names       :   PRICE,   month,       town, flt_typ, block,       strt_nm,  stry_rn, AREA_SQM,       flt_mdl, ls_cmm_, LEASE_YRS,         PROX_CBD,   PROX_CHILDCARE,   PROX_ELDERCARE,      PROX_HAWKER, ... 
min values  :  250000, 2021-01, ANG MO KIO,  4 ROOM,     1,  ADMIRALTY DR, 01 TO 03,       70, Adjoined flat,    1967,        44, 999.393538715878, 1.4128908036e-05, 1.9894378743e-05, 30.6031805185926, ... 
max values  : 1370000, 2022-10,     YISHUN,  4 ROOM,    9B, YUNG SHENG RD, 49 TO 51,      145,       Type S1,    2019,        97, 19650.0691667807, 547.386819517238, 3301.63731686804, 2867.63031236184, ... 
# bw_adaptive <- bw.gwr(PRICE ~ AREA_SQM + PROX_CBD +  PROX_ELDERCARE + PROX_HAWKER+ PROX_MRT +  PROX_PARK + PROX_TOPPRISCH + PROX_MALL +PROX_SPRMKT +LEASE_YRS + NUM_CHILDCARE + NUM_KNDRGTN,
#                   data=train_data_sp,
#                   approach="CV",
#                   kernel="gaussian",
#                   adaptive=TRUE,
#                   longlat=FALSE)
# 
# saving the result as an rds object
# write_rds(bw_adaptive, "data/ML_results/bw_adaptive.rds")

Constructing the adaptive bandwidth gwr model

bw_adaptive <- read_rds("data/ML_results/bw_adaptive.rds")
# gwr_adaptive <- 
#   
#   gwr.basic(formula = PRICE ~ AREA_SQM + PROX_CBD +  PROX_ELDERCARE + PROX_HAWKER+ PROX_MRT +  PROX_PARK + PROX_TOPPRISCH + PROX_MALL +PROX_SPRMKT +LEASE_YRS + NUM_CHILDCARE + NUM_KNDRGTN,
#                           data=train_data_sp,
#                           bw=bw_adaptive, 
#                           kernel = 'gaussian', 
#                           adaptive=TRUE,
#                           longlat = FALSE)
# write_rds(gwr_adaptive, "data/ML_results/gwr_adaptive.rds")
gwr_adaptive <- read_rds("data/ML_results/gwr_adaptive.rds")
gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-26 22:26:06 
   Call:
   gwr.basic(formula = PRICE ~ AREA_SQM + PROX_CBD + PROX_ELDERCARE + 
    PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_TOPPRISCH + PROX_MALL + 
    PROX_SPRMKT + LEASE_YRS + NUM_CHILDCARE + NUM_KNDRGTN, data = train_data_sp, 
    bw = bw_adaptive, kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  PRICE
   Independent variables:  AREA_SQM PROX_CBD PROX_ELDERCARE PROX_HAWKER PROX_MRT PROX_PARK PROX_TOPPRISCH PROX_MALL PROX_SPRMKT LEASE_YRS NUM_CHILDCARE NUM_KNDRGTN
   Number of data points: 21761
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-283341  -50985   -3418   45691  582840 

   Coefficients:
                    Estimate Std. Error  t value Pr(>|t|)    
   (Intercept)    66495.3963  9054.3131    7.344 2.15e-13 ***
   AREA_SQM        3088.0245    80.2651   38.473  < 2e-16 ***
   PROX_CBD         -19.5026     0.1778 -109.708  < 2e-16 ***
   PROX_ELDERCARE    -7.4127     0.9215   -8.044 9.12e-16 ***
   PROX_HAWKER      -23.3608     1.0982  -21.272  < 2e-16 ***
   PROX_MRT         -23.9847     1.5076  -15.909  < 2e-16 ***
   PROX_PARK          8.3200     1.3475    6.175 6.75e-10 ***
   PROX_TOPPRISCH     0.7344     0.3234    2.271   0.0232 *  
   PROX_MALL         13.9977     1.4544    9.624  < 2e-16 ***
   PROX_SPRMKT       15.9805     3.6244    4.409 1.04e-05 ***
   LEASE_YRS       5293.4698    41.9127  126.298  < 2e-16 ***
   NUM_CHILDCARE  -2787.6671   279.0819   -9.989  < 2e-16 ***
   NUM_KNDRGTN    11528.6684   591.4796   19.491  < 2e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 76360 on 21748 degrees of freedom
   Multiple R-squared: 0.6532
   Adjusted R-squared: 0.653 
   F-statistic:  3413 on 12 and 21748 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 1.267994e+14
   Sigma(hat): 76337.7
   AIC:  551095.5
   AICc:  551095.5
   BIC:  529586.2
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 75 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                         Min.     1st Qu.      Median     3rd Qu.       Max.
   Intercept      -4.5747e+09 -1.2223e+06  9.7471e+04  2.5338e+06 1087267026
   AREA_SQM       -5.7334e+04  1.5505e+03  2.6810e+03  4.7707e+03     133142
   PROX_CBD       -7.7697e+04 -2.1551e+02 -1.8201e+01  1.0786e+02     477174
   PROX_ELDERCARE -1.8221e+04 -5.0766e+01  1.3162e+01  9.0825e+01      40394
   PROX_HAWKER    -1.3933e+05 -8.8827e+01 -5.4893e+00  6.7709e+01      45626
   PROX_MRT       -7.6797e+04 -1.1851e+02 -4.1459e+01  2.5467e+01      60516
   PROX_PARK      -6.8828e+04 -9.1626e+01 -1.4159e+01  4.0267e+01      74312
   PROX_TOPPRISCH -4.9480e+05 -1.3808e+02 -1.6049e+00  2.2624e+02      84282
   PROX_MALL      -2.0066e+05 -8.5682e+01 -2.2139e+00  7.5807e+01      85617
   PROX_SPRMKT    -1.6146e+04 -6.8659e+01 -1.1844e+01  4.4411e+01      21304
   LEASE_YRS      -5.7258e+04 -1.2576e+02  3.5584e+03  6.0068e+03      16062
   NUM_CHILDCARE  -2.0052e+06 -3.7452e+03  4.5055e+02  3.8454e+03      98876
   NUM_KNDRGTN    -1.1298e+06 -1.0413e+04 -1.6427e+03  5.5618e+03    1250099
   ************************Diagnostic information*************************
   Number of data points: 21761 
   Effective number of parameters (2trace(S) - trace(S'S)): 1524.432 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 20236.57 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 525447.9 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 524059.5 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 513415.9 
   Residual sum of squares: 3.462735e+13 
   R-square value:  0.905288 
   Adjusted R-square value:  0.898153 

   ***********************************************************************
   Program stops at: 2023-03-26 22:34:27 
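The adaptive bandwidth of 75 nearest neighbours reported above can be illustrated with a small base-R sketch on hypothetical random coordinates (not the HDB data). The Gaussian kernel formula matches the one GWmodel uses, w = exp(−0.5 (d/b)²), and with an adaptive bandwidth b is the distance to the k-th nearest neighbour, so dense areas get narrow kernels and sparse areas wide ones:

```r
# Gaussian kernel weight as used by bw.gwr/gwr.basic with kernel = "gaussian"
gauss_w <- function(d, b) exp(-0.5 * (d / b)^2)

set.seed(1)
pts <- matrix(runif(200), ncol = 2)   # 100 hypothetical point locations
d   <- as.matrix(dist(pts))[1, ]      # distances from point 1 to all points
b   <- sort(d)[75]                    # adaptive bandwidth: 75th nearest (self included)
w   <- gauss_w(d, b)                  # weight of every observation at point 1
```

The regression point itself gets weight 1 (distance zero), and weights decay smoothly toward zero with distance, which is what makes the local coefficient estimates vary across space.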

Preparing coordinates data

coords <- st_coordinates(resale.sf)
coords_train <- st_coordinates(training_data)
coords_test <- st_coordinates(testing_data)
write_rds(coords_train, "data/ML_results/coords_train.rds")
write_rds(coords_test, "data/ML_results/coords_test.rds")
coords_train <- read_rds("data/ML_results/coords_train.rds")
coords_test <- read_rds("data/ML_results/coords_test.rds")

Dropping geometry field

library(dplyr)
library(sf)
train_data <- training_data %>% 
  st_drop_geometry()
write_rds(train_data, "data/ML_results/train_data.rds")

Calibrating Random Forest Model

pacman::p_load(sf, tmap, sfdep, tidyverse, olsrr, ggpubr, GWmodel, SpatialML, tidymodels, jsonlite, readxl, Rfast, corrplot, gtsummary, spdep, Metrics, ggplot2)
Note that the grf() function used below for the geographically weighted random forest comes from the SpatialML package loaded above, not from the CRAN grf package (which implements generalized random forests).
# set.seed(1234)
# bw_adaptive <- bw.gwr(PRICE ~ AREA_SQM + PROX_CBD +  PROX_ELDERCARE + PROX_HAWKER+ PROX_MRT +  PROX_PARK + PROX_TOPPRISCH + PROX_MALL +PROX_SPRMKT +LEASE_YRS + NUM_CHILDCARE + NUM_KNDRGTN,
#                       data=train_data_sp, 
#                       approach="CV", 
#                       kernel="gaussian", 
#                       adaptive=TRUE, 
#                       longlat=FALSE)

# set.seed(1234)
# grf_adaptive <- grf(formula = PRICE ~ AREA_SQM + PROX_CBD +  PROX_ELDERCARE + PROX_HAWKER+ PROX_MRT +  PROX_PARK + PROX_TOPPRISCH + PROX_MALL +PROX_SPRMKT +LEASE_YRS + NUM_CHILDCARE + NUM_KNDRGTN,
#                      dframe=train_data,
#                      bw=75,
#                      kernel="adaptive",
#                      coords=coords_train,
#                       ntree = 5)
# write_rds(grf_adaptive, "data/ML_results/grf_adaptive.rds")
file.exists("data/ML_results/grf_adaptive.rds")
[1] TRUE
grf_adaptive <- read_rds("data/ML_results/grf_adaptive.rds")
class(grf_adaptive)
[1] "gwrm"

Prediction

test_data <- cbind(testing_data, coords_test) %>%
  st_drop_geometry()
# gwRF_pred <- predict.grf(grf_adaptive, 
#                            test_data, 
#                            x.var.name="X",
#                            y.var.name="Y", 
#                            local.w=1,
#                            global.w=0)
# grf_pred <- write_rds(gwRF_pred, "data/ML_results/GRF_pred.rds")
# grf_pred <- read_rds("data/ML_results/GRF_pred.rds")
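Once the predictions are read back, the two models can be compared on the test set using root mean square error. A minimal base-R sketch with hypothetical actual and predicted prices (the Metrics package loaded earlier also provides an equivalent rmse() function):

```r
# Root mean square error: square root of the mean squared prediction error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical actual vs. predicted resale prices
rmse(c(100, 200), c(110, 190))  # → 10
```

A lower RMSE on the same test set indicates better predictive performance, which is the basis for the model comparison in the conclusion below.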

Visualization

# test_data_plot <- cbind(test_data, grf_pred)
# ggplot(data = test_data_plot,
#        aes(x = grf_pred,
#            y = PRICE)) +
#   geom_point() +
#   geom_abline(col = "Red")

Conclusion

Comparing the two models, the geographically weighted random forest outperformed the multiple linear regression model. The regression results suggest that, at the 5% significance level, all independent variables except PROX_TOPPRISCH are significantly associated with the dependent variable PRICE. However, other factors not accounted for in this analysis may also influence resale prices.