Helsinki city bikes: a summer love story

Although a bit reluctant first about availability and amount of stations, this summer I decided to purchase a seasonal pass (costs 30€) of Helsinki City bikes. As I have been living in a town outside Helsinki, the bikes have proved to be a great means of transportation for short distances where I would have otherwise needed to buy a rather costly transport ticket.

HSL webpage offers the possibility to see the basic information of the travels done with the city bikes.

The page also shows some basic data about the different trips. However due to a lack of possibility to download the data, an easy option in order to get the data into a tabular form is web scraping.

So let’s dive into web scraping with R!

Web scrapping step by step

In order to simplify the scrapping since we need to sign in, first step is to go to the HSL city bike webpage https://kaupunkipyorat.hsl.fi/en and downloading the page locally as HTML.

Then we need to go to the page where all travel segments are shown by clicking Katso Kaikki (show all)

First step is to click right mouse button inside the webpage and click save as Save as

Just save the complete webpage into your project folder

savecomplete

savecomplete

Next step is to Inspect the page elements (this can be done in the actual webpage, or also in the downloaded page).

Now we can see some of the code of the webpage, good! But we only need certain parts. So next step is to click the button to select elements, or use the shorcut (Ctrl + Shift + C).

Now every time we hover something in the webpage, it shows parts of its code. As an example, the departure station

Final step is to get the selector for that specific element. We can do that by right mouse click button in the code part related to that element, then we choose Copy -> Copy selector

From there we get the following selector for that specific element:

‘#activity-feed > div > div:nth-child(31) > div.station-info > span.return-station’

However we are interested in getting all the similar pieces of text, in this case departure-station. For that we can delete the part :nth-child(31), thus getting the whole list.

We need to do this for all the relevant points of data we are interested on scrapping. Once we have those we can start with getting the info we are interested in.

We will use read_html() from xml2 package and html_nodes() and html_text() from rvest package. Here is the list of all the packages used:

#Scrapping
library(rvest)
library(xml2)
library(jsonlite)
#Data transformation
library(tidyverse)
library(lubridate)
library(glue)
#API calls to HSL
library(ghql) #devtools::install_github("ropensci/ghql")
#Spatial data and maps
library(googlePolylines)
library(sf)
library(ggmap)
library(osmdata)
#Nice table formatting
library(kableExtra)

First step is reading the html page.

webpage <- read_html("data/citybikes/Omat matkasi _ Kaupunkipyörät.html")

Then we will create a named vector with the element selectors.Note that here the selectors are shorter, since in the next step we will be pasting the selector initial part that is repeated in all of the selectors.

selectors <- c(departure_station = "station-info > span.departure-station",
  return_station = "station-info > span.return-station",
  departure_time = "date-info > span.departure-date",
  return_time = "date-info >  span.return-date",
  duration = "trip-info > span.duration",
  distance = "trip-info > span.covered-distance")

Finally, we will loop over all the selectors to read those specific nodes and convert them into text. Column names are defined by the names of the vector.

df_raw <- map_dfc(selectors,~html_nodes(webpage,paste0( "#activity-feed > div > div > div.",.x)) %>% 
    html_text(),
  .id = names(.x))

Data transformation

When data scraping is over, now we need to clean the data into something usable for the analysis including station ID numbers and getting numbers for duration and distance. Trips starting and ending in same station are excluded since they are “false rides”.

df <- df_raw %>% 
  mutate(departure_time = dmy_hm(departure_time),
    arrival_time = dmy_hm(return_time),
    duration_min = parse_number(duration),
    distance_km = parse_number(distance),
    dep_station_id = str_sub(departure_station,1,3),
    arr_station_id = str_sub(return_station,1,3)) %>% 
  filter(dep_station_id != arr_station_id)

In order to get the coordinates of the stops we will make a data_frame with the unique departure-arrival combinations.

stops_unique <- df %>% 
  select(dep_station_id,arr_station_id) %>%
  unique()

Coordinates of bike stations can be found in https://www.hsl.fi/en/opendata Helsinki and Vantaa are separated, so rbind will be used to get data of all stations.

all_stations_raw <- read_csv("data/citybikes/HKI_bikestations.csv") %>% 
  rbind(read_csv("data/citybikes/Vantaa_bikestations.csv"))

all_stations <- all_stations_raw %>%
  select(ID,x,y)

Finally, we join the departure station ID with the coordinates, followed by doing the same with the arrival station. This provides us a data.frame with the travels and coordinates for arrival and departure. Which will be used to feed the API and get the route and estimated times.

df_with_coord <- df %>% 
  left_join(all_stations,by = c("dep_station_id" = "ID")) %>% 
  left_join(all_stations,by = c("arr_station_id" = "ID"),suffix = c("_dep_","_arr_")) 

Talking API

Next step is to use an API to get estimated times between directions as well as route details.

HSL offers a graphql API environment to fetch data from. ropensci offers a package to do API calls from R, this are the steps needed to follow:

Initialize client

path <- "https://api.digitransit.fi/routing/v1/routers/hsl/index/graphql"
client <- GraphqlClient$new(
  url = path
)

Make a Query class object

qry <- Query$new()

Query function

route_query <- function(from_lat,from_lon,to_lat,to_lon){
  #Generate random ID for the query
 query_hash <- stringi::stri_rand_strings(1,5) 
  qry$query(query_hash,glue::glue('
{
  plan(
    from: {lat: <<from_lat>>, lon: <<from_lon>>}
    to: {lat: <<to_lat>>, lon: <<to_lon>>}
    numItineraries: 1
    transportModes: [{mode: BICYCLE}]
  ){
    itineraries {
      legs {
        mode
        duration
        distance
        legGeometry {
          points
        }
      }
    }
  }
}
  ',.open = "<<", 
    .close = ">>"))
#Always ghet the last response from the queries
response <- client$exec(qry$queries[[length(qry$queries)]])
#Convert the response to more readable format
result <- fromJSON(response,flatten = TRUE) %>% 
  data.frame()
#Filter routes, to take just the longest single trip. (Avoids considering walking few meters)
result[[1]][[1]] %>% 
  filter(mode == "BICYCLE") %>% 
  filter(distance == max(distance))
}
#Example query
route_query(60.15978,24.91842,60.18204,24.92756)

Once query function is tested and working, we will again use the map family to pass coordinated arguments into the query and arrange them into a data frame.

routes_df <- pmap_dfr(list(df_with_coord$y_dep_,df_with_coord$x_dep_,
  df_with_coord$y_arr_,df_with_coord$x_arr_),route_query)

write_csv2(routes_df,"output/routes_data_HSL_API.csv")

Saving the output, to avoid repeating the API calls in the future.

Data wrangling and geometries

The API returns a handy googlePolylines text string to define the route. This needs to be converted into points.

routes_df <- routes_df %>% 
  mutate(duration_min = duration/60,
    distance_km = distance/1000) %>% 
  select(duration_min_HSL = duration_min,
    distance_km_HSL = distance_km,
    polyline = legGeometry.points)

points_all <- map_dfr(routes_df$polyline,~decode(.x) %>% as.data.frame(),.id = "trip")

For handling easier, SF will be used and a table with each route and the containing LINESTRING, which is very easy to plot with geom_sf()

linestrings <- points_all %>% 
  mutate(point_order = row_number()) %>% 
  st_as_sf(coords = c("lon","lat")) %>% 
  sf::st_set_crs(4326) %>% 
  group_by(trip) %>% 
  summarize(n = n(),do_union = FALSE) %>% 
  st_cast("LINESTRING")

Final wranglings are combining the data from the webpage and API, together with the newly created linestring table.

all_data_df <- cbind(df_with_coord,routes_df) %>% 
  mutate(time_diff = duration_min_HSL - duration_min,
    dist_diff = distance_km_HSL - distance_km,
    trip_num = as.character(row_number()) )

all_data_sf <- linestrings %>% 
  left_join(all_data_df,by = c("trip" = "trip_num"))

Simple data exploration

With the data ready and some simple dplyr, it is easy to find the longest bike ride. According to the API, it took 27 minutes more than needed to complete the ride. This was due to a distraction on the way, there was a flea market in the area.

all_data_df %>% 
  filter(duration_min == max(duration_min)) %>% 
  select(departure_station,return_station,departure_time,time_diff,dist_diff) %>% 
  kable() %>%
  kable_styling()  
departure_station return_station departure_time time_diff dist_diff
141 Intiankatu 137 Arabian kauppakeskus 2019-05-25 11:52:00 -27.05 -1.353054

In general I have been riding faster than the time defined by the HSL API, which might mean I was taking shortcuts from roads without designated biking lane, or then the speed defined in the API is a safe assumption and rather slow.

ggplot(all_data_sf) +
  geom_jitter(mapping = aes(duration_min,duration_min_HSL)) +
  geom_abline(intercept = 0, slope = 1,color = "red",linetype = "dashed") +
  jechaveR::theme_ensto() +
  labs(x = "Trip duration (min)",
    y = "Trip estimation HSL (min)",
    title = "Do I ride fast?",
    subtitle = "Above red dotted line, trips riding faster than estimation")

In the plot it can be seen that there is a group of points gathered horizontally, these correspond to the same route done in different days. From Viikki to Arabianranta. Let’s calculate some simple statistics for these trips.

all_data_df %>% 
  filter(dep_station_id %in% c("139","239"),
    arr_station_id %in% c("139","239")) %>% 
  mutate(direction = case_when(dep_station_id == "139" ~ "Arabianranta -> Viikki",
    TRUE ~ "Viikki -> Arabianranta")) %>% 
  group_by(direction) %>% 
  summarize(trips = n(),
    mean_duration = mean(duration_min),
    max_duration = max(duration_min),
    min_duration = min(duration_min)) %>% 
  kable() %>%
  kable_styling()  
direction trips mean_duration max_duration min_duration
Arabianranta -> Viikki 15 10.60000000 13 8
Viikki -> Arabianranta 9 11.44444444 14 10

The trip from Arabianranta to Viikki was mainly to catch the bus, which could explain the shorter durations (fear of missing the bus).

Visualizing all the routes

To finalize, a map visualization where all the routes done with bike are shown. I have used ggmap and the OSM capabilities to get the map of Helsinki. Then in order to be able to plot something on top, the bbox and CRS (coordinate reference system) needs to be changed as pointed in the following stackoverflow question:

ggmap_bbox <- function(map) {
  if (!inherits(map, "ggmap")) stop("map must be a ggmap object")
  # Extract the bounding box (in lat/lon) from the ggmap to a numeric vector, 
  # and set the names to what sf::st_bbox expects:
  map_bbox <- setNames(unlist(attr(map, "bb")), 
                       c("ymin", "xmin", "ymax", "xmax"))

  # Coonvert the bbox to an sf polygon, transform it to 3857, 
  # and convert back to a bbox (convoluted, but it works)
  bbox_3857 <- st_bbox(st_transform(st_as_sfc(st_bbox(map_bbox, crs = 4326)), 3857))

  # Overwrite the bbox of the ggmap object with the transformed coordinates 
  attr(map, "bb")$ll.lat <- bbox_3857["ymin"]
  attr(map, "bb")$ll.lon <- bbox_3857["xmin"]
  attr(map, "bb")$ur.lat <- bbox_3857["ymax"]
  attr(map, "bb")$ur.lon <- bbox_3857["xmax"]
  map
}

Get the map and transform CRS

hki_map <- get_map(getbb("Helsinki"),maptype = "toner-lite",zoom = 12,crop = TRUE,color = "bw")
hki_map <- get_map(location = c(left = 24.85,
  bottom = 60.149, 
  right = 25.05,
  top = 60.23),maptype = "toner-lite",zoom = 13,color = "bw")
hki_map_ggplot <- ggmap_bbox(hki_map)
linestrings_3857 <- st_transform(linestrings, 3857)

Plotting everything together for final result

ggmap(hki_map_ggplot) +
  coord_sf(crs = st_crs(3857)) +
  geom_sf(data = linestrings_3857,color = alpha("#FED100",0.5),size = 0.75, inherit.aes = FALSE) +
  theme_void() +
   theme(legend.position = "bottom",
    plot.title = element_text(size = 18, hjust = 0.5,face = "bold",color = "#FED100"),
    plot.subtitle = element_text(size = 10, hjust = 0.5, vjust = 1, color = "grey40",margin = margin(0,10,10,0)),
    strip.text.x = element_text(size = 10, face = "bold"),
    plot.caption = element_text(size = 7.5, color = "grey40"),
    #plot.background = element_rect(fill = "#dddddd", color = NA),
#panel.background = element_rect(fill = "#dddddd", color = NA),
legend.background = element_rect(fill = "#dddddd", color = NA)) +
  labs(title = "Itineraries in Helsinki City bikes",
    subtitle = "Summer 2019, data from HSL",
    caption = "@perspectivalean")

Conclusion: In this post we have done some web scrapping, created a personalized function that does API calls and put everything on a map to visualize the usage of the city bikes during the summer season in Helsinki. This has been my story of how I fell in love with those bikes . Cannot wait to see those yellow bikes again on the city.

Avatar
Jose Maria Echave
Business data analyst

Jose Maria Echave Sanz is Business Data Analyst in Ensto. He holds a master in Industrial Engineering and Management from TalTech. He started the working life being involved in Lean project management, successfully implementing a lean culture in two factories located in Spain and France. Always being curious about data and analytics, in 2014 he jumped from mainly using excel to R universe, and haven’t stopped since, by continuous self-learning. He now lives in Finland and supports the company on making data-driven decisions to support the strategy, as well as working with business processes and tool prototyping with Shiny. You can find him on twitter @perspectivalean.

comments powered by Disqus