Counting cyclists in Berlin

I recently found out that Berlin has a bunch of automatic bike-counters dotted around the city…and the data is public! There’s even an interactive map of the stations. This is all pretty awesome. Although the data is stored in multiple sheets of an Excel file (#?&!!), I thought it’d be an interesting exercise to clean it up and see how useful it could be for other projects.

Get the data

Let’s grab the data.

filename <- 'cyclist_counts.xlsx'

if (!file.exists(filename))
  download.file(url, filename, method = 'libcurl')

It’s probably a good idea to take a look at the Excel file so that it’s easier to follow along with the code below. You can also open up the R markdown in RStudio.

Unfortunately, there’s no official checksum provided to guarantee the data were downloaded correctly, but I can show you what my checksum is.

digest::digest(filename, algo = 'md5', file = TRUE)
[1] "e79c4ea5b5e1437d90a7d06f572f741e"

If you get a different md5 checksum, then you don’t have the same file as I do.
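If you’d rather have the notebook fail fast than silently carry on with a different file, the comparison can be folded into a stopifnot. This is just a sketch using the checksum of my copy from above:

```r
# halt immediately if the downloaded file doesn't match my checksum
stopifnot(
  digest::digest(filename, algo = 'md5', file = TRUE) ==
    'e79c4ea5b5e1437d90a7d06f572f741e'
)
```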

The counters

The first sheet, Hinweise (= notes), is mostly text describing some general information about the data. The notable parts tell us that it is the unedited data from the counter systems and is published at the end of the counting-year. What’s a counting-year? Good question; no idea. The Excel file was last modified on the 20th of April 2018, so I guess the next update will probably be around April 2019.

There are also some disclaimers basically telling us that the counters don’t necessarily give an exact count due to weather conditions, maintenance, and other complications. This is described in more detail on the FAQ page (in German). They work using an induction loop, which detects a metal object moving past. Full-carbon bikes can’t be counted, but there are likely so few of them that the difference is absorbed into the measurement error anyway. Apparently bikes with more or fewer than two wheels can also cause miscounts. However, groups of bikes can be counted accurately so long as they don’t ride too closely side-by-side.

The FAQ page also tells us that some counters are set up in pairs in order to count the two directions of traffic separately. We’ll see how to detect these pairs when we inspect the data.

The first data sheet, Standortdaten (= location data), lists the counters with a unique ID, a street name, latitude-longitude coordinates (not shown here), and an installation date.

counters <- filename %>%
  readxl::read_excel(sheet = 'Standortdaten') %>%
  transmute(
    id = Zähler %>% as_factor(),
    pair = str_replace(id, '-[NSOW]$', '') %>% as_factor(),
    location = `Beschreibung - Fahrtrichtung`,
    installation_date = as.Date(Installationsdatum)
  ) %>%
  arrange(installation_date, location) %>%
  write_csv('counters.csv')

n_counters <- counters %>% distinct(id) %>% nrow()

There are 26 counters listed in the sheet. I have seen it written that there are only 17, but that number doesn’t count the paired counters separately. The pairs are listed as north/south or east/west (in German), with the last letter of the ID indicating the direction. This allows us to identify the pair by stripping the direction suffix from the ID.

The counter locations.
id           pair       location                installation_date
12-PA-SCH    12-PA-SCH  Schwedter Steg          2012-03-01
02-MI-JAN-N  02-MI-JAN  Jannowitzbrücke Nord    2015-04-01
02-MI-JAN-S  02-MI-JAN  Jannowitzbrücke Süd     2015-04-01
13-CW-PRI    13-CW-PRI  Prinzregentenstraße     2015-04-01
18-TS-YOR-O  18-TS-YOR  Yorckstraße Ost         2015-04-01
18-TS-YOR-W  18-TS-YOR  Yorkstraße West         2015-04-01
27-RE-MAR    27-RE-MAR  Markstraße              2015-05-01
19-TS-MON    19-TS-MON  Monumentenstraße        2015-05-01
03-MI-SAN-O  03-MI-SAN  Invalidenstraße Ost     2015-06-01
03-MI-SAN-W  03-MI-SAN  Invalidenstraße West    2015-06-01
05-FK-OBB-O  05-FK-OBB  Oberbaumbrücke Ost      2015-06-01
05-FK-OBB-W  05-FK-OBB  Oberbaumbrücke West     2015-06-01
26-LI-PUP    26-LI-PUP  Paul-und-Paula-Uferweg  2015-06-01
24-MH-ALB    24-MH-ALB  Alberichstraße          2015-07-01
10-PA-BER-N  10-PA-BER  Berliner Straße Nord    2016-05-01
10-PA-BER-S  10-PA-BER  Berliner Straße Süd     2016-05-01
17-SK-BRE-O  17-SK-BRE  Breitenbachplatz Ost    2016-05-01
17-SK-BRE-W  17-SK-BRE  Breitenbachplatz West   2016-05-01
23-TK-KAI    23-TK-KAI  Kaisersteg              2016-05-01
15-SP-KLO-S  15-SP-KLO  Klosterstraße Süd       2016-05-01
20-TS-MAR-N  20-TS-MAR  Mariendorfer Damm Nord  2016-05-01
20-TS-MAR-S  20-TS-MAR  Mariendorfer Damm Süd   2016-05-01
21-NK-MAY    21-NK-MAY  Maybachufer             2016-05-01
06-FK-FRA-O  06-FK-FRA  Frankfurter Allee Ost   2016-06-01
06-FK-FRA-W  06-FK-FRA  Frankfurter Allee West  2016-06-01
15-SP-KLO-N  15-SP-KLO  Klosterstraße Nord      2016-06-01

I’m not sure what the numbers in the ID are, but the first group of letters is an abbreviation of the district (Mitte, Pankow, etc.), the second group is an abbreviation of the street name, and the third group (where it exists) is the direction of traffic (north, south, east, west).

Take a deep breath

Now for the data we came here for. As of this writing, there are 6 sheets, one for each year from 2012 to 2017.

years <- 2012:2017

The corresponding sheets are called Jahresdatei YYYY (= yearly data file).

sheets0 <- years %>%
  as.character() %>%
  paste0('Jahresdatei ', .)

sheets0
[1] "Jahresdatei 2012" "Jahresdatei 2013" "Jahresdatei 2014"
[4] "Jahresdatei 2015" "Jahresdatei 2016" "Jahresdatei 2017"

Let’s turn that into a named list so that it works well with the map_dfr function below.

sheets <- sheets0 %>%
  as.list() %>%
  set_names(., .)

The read_excel function isn’t as flexible as its read_csv cousin, so we’ll write a wrapper for loading the data from a particular sheet.

read_excel <- function(sheet, .filename = filename, .n = n_counters) {
  readxl::read_excel(
    .filename,
    sheet = sheet,
    col_types = c('date', rep('numeric', .n))
  )
}

Now we can simply map that function over all of our sheet names to get a dataframe from each sheet, then bind the rows of these dataframes together so that we have all the data in one place. The columns also have awkward headers, so we’ll rename them to the counter IDs.

df_untidy <- sheets %>%
  map_dfr(read_excel, .id = 'sheet') %>%
  # get rid of redundant info in the col names
  # and use english names
  set_names(names(.) %>% str_replace_all(" .*", "")) %>%
  rename(measured_at = Zählstelle)

Example of untidy output. Only the counters in Mitte are shown.
sheet            measured_at          02-MI-JAN-N  02-MI-JAN-S  03-MI-SAN-O  03-MI-SAN-W
Jahresdatei 2017 2017-12-31 19:00:00  29           29           19           14
Jahresdatei 2017 2017-12-31 20:00:00  31           22           16           15
Jahresdatei 2017 2017-12-31 21:00:00  21           18           11           5
Jahresdatei 2017 2017-12-31 22:00:00  11           18           6            10
Jahresdatei 2017 2017-12-31 23:00:00  7            9            10           20

All that remains is to tidy up the data:

1. Each variable forms a column: we have the ID of the counter, the hour of measurement, and the actual measurement.
2. Each observation forms a row: every hourly measurement should have exactly one row.
3. Each type of observational unit forms a table: there’s only the hourly count.

The untidy data has a possible measurement for every counter, so we’ll need to gather those columns together. The nulls exist only because of the untidy format used - we can simply drop them.

df_tidy <- df_untidy %>%
  gather(counter_id, total, -measured_at, -sheet) %>%
  filter(!is.na(total)) %>%
  mutate(counter_id = counter_id %>% factor(levels = levels(counters$id)))
Example of tidy output.
sheet measured_at counter_id total
Jahresdatei 2017 2017-12-31 18:00:00 27-RE-MAR 13
Jahresdatei 2017 2017-12-31 19:00:00 27-RE-MAR 4
Jahresdatei 2017 2017-12-31 20:00:00 27-RE-MAR 5
Jahresdatei 2017 2017-12-31 21:00:00 27-RE-MAR 2
Jahresdatei 2017 2017-12-31 22:00:00 27-RE-MAR 2
Jahresdatei 2017 2017-12-31 23:00:00 27-RE-MAR 4

After performing some checks, I realised there are a lot of duplicated data. In particular, Jahresdatei 2012 also contains a complete copy of Jahresdatei 2013. Let’s get rid of the duplicates. Whilst we’re at it, we’ll add in some extra info about each counter by joining on the counter dataframe.
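One quick way to see the duplication for yourself (just a diagnostic, not part of the main pipeline) is to count the rows sharing each (counter, hour) pair; anything with n > 1 must have been loaded from more than one sheet:

```r
# every (counter_id, measured_at) pair should occur exactly once;
# pairs with n > 1 came from more than one sheet
df_tidy %>%
  count(counter_id, measured_at, sort = TRUE) %>%
  filter(n > 1)
```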

df <- df_tidy %>%
  distinct(counter_id, measured_at, .keep_all = TRUE) %>%
  ungroup() %>%
  arrange(counter_id, measured_at) %>%
  # add a unique row id
  mutate(id = 1:n()) %>%
  # reorder columns
  select(id, sheet, counter_id, measured_at, total) %>%
  # save the data
  write_csv('counter_data.csv') %>%
  inner_join(counters, by = c('counter_id' = 'id')) %>%
  arrange(id)

This leaves us with a total of 526,790 measurements.

I’ll use the stopifnot function to list a bunch of properties that I know the data must satisfy if it is correct. This prevents this notebook from compiling if any of them have an entry that doesn’t evaluate to TRUE.

stopifnot(
  # all counts non-negative
  df$total >= 0,

  # measurements occur after installation
  df$measured_at >= df$installation_date,

  # measurements occur within the time range
  year(df$measured_at) <= max(years),
  year(df$measured_at) >= min(years),

  # installations occur within the time range
  year(df$installation_date) <= max(years),
  year(df$installation_date) >= min(years),

  # there are as many row ids as rows
  max(df$id) == nrow(df),

  # at most as many measurements as hours in the year
  df %>%
    group_by(counter_id, year = year(measured_at)) %>%
    count() %>%
    transmute(8760 + if_else(year %% 4 == 0, 24, 0) - n >= 0) %>%
    pull(),

  # exactly 26 counters
  n_counters == df %>%
    distinct(counter_id) %>%
    nrow()
)

Exploring the data

In principle there should be one measurement per counter per hour, starting from when each counter is installed. Let’s check how many gaps appear in the data.

gaps <- df %>%
  group_by(counter_id) %>%
  mutate(
    next_measurement = lead(measured_at),
    difference = difftime(next_measurement, measured_at, units = 'hours')
  ) %>%
  filter(difference > 1) %>%
  select(id, counter_id, difference, measured_at, next_measurement)
Gaps in measurement records.
id counter_id difference measured_at next_measurement
578 12-PA-SCH 2 hours 2012-03-25 01:00:00 2012-03-25 03:00:00
580 12-PA-SCH 2 hours 2012-03-25 04:00:00 2012-03-25 06:00:00
103630 13-CW-PRI 16 hours 2015-09-21 23:00:00 2015-09-22 15:00:00
146616 18-TS-YOR-O 13 hours 2017-11-15 16:00:00 2017-11-16 05:00:00
354380 10-PA-BER-N 4 hours 2016-05-14 00:00:00 2016-05-14 04:00:00
356679 10-PA-BER-N 6 hours 2016-08-17 22:00:00 2016-08-18 04:00:00
408515 15-SP-KLO-N 16 hours 2017-08-13 23:00:00 2017-08-14 15:00:00
414641 17-SK-BRE-O 8 hours 2016-08-24 20:00:00 2016-08-25 04:00:00
426976 17-SK-BRE-W 2 hours 2016-05-21 02:00:00 2016-05-21 04:00:00
466572 20-TS-MAR-S 485 hours 2017-07-24 23:00:00 2017-08-14 04:00:00
494816 23-TK-KAI 198 hours 2017-07-01 23:00:00 2017-07-10 05:00:00
494859 23-TK-KAI 14 hours 2017-07-11 23:00:00 2017-07-12 13:00:00

There are only 12 gaps, and all but two are resolved fairly quickly. I’m not sure what caused the two larger gaps.
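If a later analysis needs a strictly regular hourly series, one option is to make these gaps explicit rather than leaving them implicit. This is a sketch using tidyr’s complete, which I haven’t run against this data:

```r
# pad each counter's record out to a full hourly grid,
# leaving total = NA for the hours with no measurement
df_padded <- df %>%
  group_by(counter_id) %>%
  complete(measured_at = seq(min(measured_at), max(measured_at), by = 'hour')) %>%
  ungroup()
```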

Some counters are in higher-traffic locations than others. Note that the larger the median is, the larger the interquartile range is too. This happens in almost all of the plots.

by_counter <- df %>%
  group_by(counter_id) %>%
  summarise(
    q25 = quantile(total, 0.25),
    q50 = quantile(total, 0.50),
    q75 = quantile(total, 0.75)
  )

There is also a commuter-effect, with peaks showing at 08:00 and 17:00. There is almost no traffic in the wee hours of the morning.

by_hour <- df %>%
  group_by(h = hour(measured_at)) %>%
  summarise(
    q25 = quantile(total, 0.25),
    q50 = quantile(total, 0.50),
    q75 = quantile(total, 0.75)
  )

Since there is a strong commuter-effect and some counters are paired to count the two directions of traffic separately, it makes sense to take a look at the difference in traffic between the pairs during the day. We see that, for example, at Jannowitzbrücke (02-MI-JAN) cyclists travel south-bound in the morning and north-bound in the evening. Interestingly, traffic seems to flow mostly north-bound at Berliner Straße (10-PA-BER) at almost any time of the day. I’m not sure why this might be the case, but it’s worth looking into later. The one exception is at 08:00, when many cyclists are likely commuting to the centre of Berlin (south-bound).

pairs_by_hour <- df %>%
  filter(str_detect(location, 'Nord|Ost|Süd|West')) %>%
  inner_join(df, by = c('pair', 'measured_at')) %>%
  filter(as.numeric(counter_id.x) < as.numeric(counter_id.y)) %>%
  select(measured_at, matches('id|total')) %>%
  mutate(
    comparison = paste0(counter_id.x, ' - ', counter_id.y),
    difference = total.x - total.y
  ) %>%
  group_by(g = hour(measured_at), comparison) %>%
  summarise(difference = mean(difference))

In Klosterstraße (15-SP-KLO) and Mariendorfer Damm (20-TS-MAR) it looks like traffic is roughly equal in both directions all day. The former flows slightly more north-bound in the evening, the latter more north-bound in the morning. Given the positions of these two pairs of counters, it is unlikely that they are part of a commuter-loop. The overall counts for these four counters are fairly small on average, so these apparent effects may just be noise.

The weekday counts seem fairly constant and fall by about a third on the weekend. This is consistent with the commuter-effect seen above.

by_weekday <- df %>%
  group_by(g = wday(measured_at, label = TRUE, week_start = 1)) %>%
  summarise(
    q25 = quantile(total, 0.25),
    q50 = quantile(total, 0.50),
    q75 = quantile(total, 0.75)
  )

There is also a yearly seasonal pattern. Apart from Berliners being more willing to commute by bike in summer, I’d be willing to bet that tourists are also responsible for the summer peak in June/July. In winter, the roads can become unsafe for cyclists due to ice, snow, and a lack of maintenance.

by_month <- df %>%
  group_by(g = month(measured_at, label = TRUE)) %>%
  summarise(
    q25 = quantile(total, 0.25),
    q50 = quantile(total, 0.50),
    q75 = quantile(total, 0.75)
  )

There is no clear year-on-year trend, as the median just ambles around 60 cyclists per hour.

by_year <- df %>%
  group_by(g = year(measured_at)) %>%
  summarise(
    q25 = quantile(total, 0.25),
    q50 = quantile(total, 0.50),
    q75 = quantile(total, 0.75)
  )

What’s next?

So…that was all easier than expected. The weird north-bound pattern at Berliner Straße and the two large measurement gaps still need to be looked into, but overall the data are in a good state. I’d like to be able to combine this with data on bike-related traffic accidents, but from what I can see, the Berlin police chose one of the few formats worse than an Excel sheet: pdf tables.

With 3D pie charts.