Get started with z22 • z22

library(z22)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(terra)
library(sf)

The German 2022 census marks the second German census for which spatial census data is published at a grid level. While some progress has been made in the way the data is processed, it is still only available as CSV dumps. Working with these data programmatically can be quite burdensome. z22 aims to make it easier to read and process census grid data from both 2022 and 2011 censuses. The functions establish a standardized feature language that can translate data from 2011 to 2022, and vice versa. It relies on preprocessed parquet chunks from the z22data data repository where the data chunks are stored. The package is a successor of Stefan Jüngers z11 package.

Data chunks are uniquely identified using features and categories. When I talk about a feature, I talk about an indicator aggregated to grid cells, for example, age groups or the number of dwellings. When I talk about a category, I talk about the discrete classifications of features, e.g., ages 10 to 19, 20 to 29, 30 to 39, etc. You can retrieve a list of all available features using:

z22_features()
#> # A tibble: 55 × 8
#>    theme      feature            english                                german                z22   z11_100m z11_1km has_cat
#>    <chr>      <chr>              <chr>                                  <chr>                 <lgl> <lgl>    <lgl>   <lgl>  
#>  1 Population population         Population                             Bevölkerung           TRUE  TRUE     TRUE    FALSE  
#>  2 Population citizens           Number of german citizens, 18 or older Deutsche Staatsangeh… TRUE  FALSE    FALSE   FALSE  
#>  3 Population foreigners         Share of foreigners                    Ausländeranteil       TRUE  FALSE    TRUE    FALSE  
#>  4 Population foreigners_from_18 Share of foreigners, 18 or older       Ausländeranteil ab 1… TRUE  FALSE    FALSE   FALSE  
#>  5 Population birth_country      Country of birth (groups)              Geburtsland (Gruppen) TRUE  TRUE     FALSE   TRUE   
#>  6 Population sex                Sex                                    Geschlecht            FALSE TRUE     FALSE   TRUE   
#>  7 Population women              Share of women                         Frauenanteil          FALSE FALSE    TRUE    FALSE  
#>  8 Population religion           Religion                               Religion              FALSE TRUE     FALSE   TRUE   
#>  9 Population citizenship        Citizenship                            Staatsangehörigkeit   TRUE  TRUE     FALSE   TRUE   
#> 10 Population citizenship_group  Citizenship (groups)                   Staatsangehörigkeit … TRUE  TRUE     FALSE   TRUE   
#> # ℹ 45 more rows

Similarly, you can view a list of all categories for a feature.

z22_categories("birth_country")
#> # A tibble: 6 × 3
#>    code german      english     
#>   <dbl> <chr>       <chr>       
#> 1     1 Deutschland Germany     
#> 2    20 Ausland     Foreign     
#> 3    21 EU27-Land   EU27 country
#> 4    22 Europa      Europe      
#> 5    23 Welt        World       
#> 6    24 Sonstige    Other

When I talk about grids, I mean a geographical grid net according to the EU INSPIRE standard. Grids always have a fixed resolution (100x100 m, 1x1 km, 10x10 km). Such grids are useful to aggregate sensitive spatial data to a relatively detailed level that is not as arbitrary as, e.g., municipality boundaries. You can get a raw grid dataset based on the GeoGitter dataset using:

grid <- z22_grid("10km", as_sf = TRUE)

plot(grid, pch = 15, cex = 0.5)

Although the data chunks are relatively small in size, downloading grids and features again and again can take a while and strain internet tariffs unnecessarily. That’s why data files are automatically cached, i.e., if a file is downloaded again, it is simply read from the cache. You can update stale cache files using the update_cache argument.

To equip a raw raster with feature values, you can use the z22_data function. Here, I download data on the type of heating in buildings across Germany. Using the rasterize argument, I can convert the output to a SpatRaster from the terra package.

grid_ht <- z22_data("building_heat_type", res = "10km", rasterize = TRUE)

After rasterization, the results are returned in a SpatRasterDataset, which is essentially a list of SpatRasters. To plot them, you can use the z22_pivot_longer function to turn them to a long (and facetable) table.

df_ht <- z22_pivot_longer(grid_ht, "building_heat_type")
ggplot(df_ht) +
  geom_raster(aes(x, y, fill = value)) +
  facet_wrap(~category, nrow = 2, labeller = label_wrap_gen(width = 35)) +
  coord_sf(crs = 3035) +
  scale_fill_viridis_c("Count", na.value = "transparent", transform = "log2") +
  theme_bw(base_size = 8) +
  labs(x = NULL, y = NULL) +
  theme(panel.grid = element_blank(), axis.text = element_blank())

However, as you can see, we are dealing with absolute counts, which makes the meaningfulness of the data questionable on its own. The spatial patterns look very similar in each facet and mostly trace spatial population patterns of Germany. To retrieve normalized (= percentage) figures, we can use the normalize argument of z22_data. Re-downloading the data should also take much less time because the data files are still cached.

grid_ht <- z22_data(
  "building_heat_type",
  res = "10km",
  rasterize = TRUE,
  normalize = TRUE
)

Code for the plot

df_ht <- z22_pivot_longer(grid_ht, "building_heat_type")
ggplot(df_ht) +
  geom_raster(aes(x, y, fill = value)) +
  facet_wrap(~category, nrow = 2, labeller = label_wrap_gen(width = 35)) +
  coord_sf(crs = 3035) +
  scale_fill_viridis_c("Share", na.value = "transparent", transform = "log2") +
  theme_bw(base_size = 8) +
  labs(x = NULL, y = NULL) +
  theme(panel.grid = element_blank(), axis.text = element_blank())

By visualizing shares, we can observe that spatial differences are not nearly as pronounced as the total numbers had us believe. Particularly central heating is almost spatially homogeneous. Note that the normalize argument is only sensible for features that are counts.

There is one more issue that you can see in the map above. Some cells are NA or entirely missing from the dataset and thus not shown at all on the maps. These are cells where no observations can be counted. In the case of heating, the issue is not as pressing but in other cases the feature grid becomes a patchwork. Some marital statuses, for example, are rarely seen in some areas of Germany.

grid_ht <- z22_data(
  "marital_status",
  categories = 6,
  res = "10km",
  rasterize = TRUE
)

Code for the plot

df_ht <- z22_pivot_longer(grid_ht, "marital_status")
ggplot(df_ht) +
  geom_raster(aes(x, y, fill = value)) +
  coord_sf(crs = 3035) +
  facet_wrap(~category, nrow = 2) +
  scale_fill_viridis_c("Count", na.value = "transparent") +
  theme_bw() +
  labs(x = NULL, y = NULL) +
  theme(panel.grid = element_blank(), axis.text = element_blank())

To fill these gaps, you can request all cells using the all_cells argument.

grid_ht <- z22_data(
  "marital_status",
  categories = 6,
  res = "10km",
  rasterize = TRUE,
  all_cells = TRUE
)

Code for the plot

df_ht <- z22_pivot_longer(grid_ht, "marital_status")
ggplot(df_ht) +
  geom_raster(aes(x, y, fill = value)) +
  coord_sf(crs = 3035) +
  facet_wrap(~category, nrow = 2) +
  scale_fill_viridis_c("Count", na.value = "transparent") +
  theme_bw() +
  labs(x = NULL, y = NULL) +
  theme(panel.grid = element_blank(), axis.text = element_blank())

Working with 10x10 km grids is cool and all, but the census grids get really exciting when looking at 100x100 m grids. Let’s, for example, look at the average age distribution in the area of Berlin.

grid_age <- z22_data("age_avg", res = "100m", rasterize = TRUE, all_cells = TRUE)

ext <- ext(c(4532432, 4574087, 3257338, 3287499))
berlin_age <- crop(grid_age$cat_0, ext)
plot(berlin_age)