Memory Management - Kernel


Memory is among the most basic and essential resources of any operating system. It is hard to imagine any system, human ones included, working without it.


You can close your eyes to reality but not to memories. - Stanislaw Jerzy Lec


Agenda

In this article, we will cover how the kernel manages its own memory and how that differs from the user space memory management we are more familiar with, i.e. memory management for a process. Getting hold of memory in kernel space is harder than in user space: allocation errors are difficult to deal with, and in many contexts the kernel cannot sleep.

Disclaimer: Inspired mostly by the book Linux Kernel Development by Robert Love. It’s a must-read and an amazingly lucid explanation of the Linux kernel.


Pages

For the kernel, the basic unit of memory is the page, unlike the processor, whose smallest addressable unit is a byte or a word. Every architecture defines its own page size, and pages are what the MMU (Memory Management Unit), the hardware that manages memory and performs virtual-to-physical address translations, deals in.

Many architectures even support multiple page sizes; most 32-bit architectures have 4 KB pages, whereas many 64-bit architectures use 8 KB pages. From this, you can work out the number of distinct pages for a given amount of memory. For example, a machine with a 4 KB page size and 4 GB of physical memory has 4 GB / 4 KB = 2^32 / 2^12 = 1,048,576 distinct pages.
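As a quick sanity check, here is that arithmetic as a tiny stand-alone C sketch (ordinary user-space code; the 4 GB and 4 KB figures are just the example values above):

#include <stdio.h>

int main(void)
{
  unsigned long long phys_mem  = 4ULL << 30;  /* 4 GB of physical memory */
  unsigned long long page_size = 4ULL << 10;  /* 4 KB page size */

  /* number of distinct page frames = physical memory / page size */
  printf("%llu pages\n", phys_mem / page_size);  /* prints 1048576 */
  return 0;
}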

In Linux kernel v2.6, the representation of a page is defined as follows (a distilled view):

struct page {
  unsigned long         flags;      /* status of the page (dirty, locked, ...) */
  unsigned long         private;    /* private mapping data */
  atomic_t              _count;     /* usage (reference) count of the page */
  atomic_t              _mapcount;  /* number of page-table entries mapping it */
  pgoff_t               index;      /* offset within the mapping */
  struct address_space  *mapping;   /* the address space the page belongs to */
  struct list_head      lru;        /* LRU list linkage */
  void                  *virtual;   /* kernel virtual address, if mapped */
};

Here it is important to understand that the page structure is associated with physical pages, not virtual pages. Also, it describes only the page itself, not the data inside the page.
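To make that concrete, here is a minimal sketch (the function name is made up; alloc_pages(), page_to_pfn(), page_address() and __free_pages() are the kernel's standard page-allocator interfaces) that grabs one physical page and peeks at its struct page:

#include <linux/kernel.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static int grab_one_page(void)
{
  /* ask the page allocator for a single physical page (order 0) */
  struct page *page = alloc_pages(GFP_KERNEL, 0);

  if (!page)
    return -ENOMEM;  /* allocations can fail; always check */

  /* the struct page describes the physical frame; page_address() gives
   * its kernel virtual address (pages from high memory may not have one) */
  pr_info("pfn=%lu virt=%p\n", page_to_pfn(page), page_address(page));

  __free_pages(page, 0);  /* hand the frame back to the allocator */
  return 0;
}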


Zones

A zone is a mechanism for grouping pages with similar properties. The kernel cannot treat all pages alike, so this division is required. It also stems from hardware limitations related to memory addressing, such as:

  • Some hardware devices can perform direct memory access to only certain memory addresses.
  • Some architectures can physically address a larger amount of memory than they can virtually address.

Linux has four primary memory zones:

Zone           Description                                                                    Physical Memory
ZONE_DMA       Pages that can undergo DMA                                                     < 16 MB
ZONE_DMA32     Like ZONE_DMA, but these pages are accessible only by 32-bit devices           < 16 MB
ZONE_NORMAL    Normal, regularly mapped pages                                                 16–896 MB
ZONE_HIGHMEM   “High memory”: pages not permanently mapped into the kernel’s address space    > 896 MB

The layout and the actual use of the memory zones are architecture-dependent.

Note:

The division into zones is not physical; it is only a logical division at the kernel level that facilitates pooling so pages can be allocated on demand. For example, if pages are required for DMA, the kernel can simply pull pages from ZONE_DMA without touching the rest of memory.

If memory becomes scarce, though, the kernel will not hesitate to dip into ZONE_DMA to satisfy a normal allocation; the reverse does not work, because pages from ZONE_NORMAL or ZONE_HIGHMEM may lie outside the range a DMA-constrained device can address.
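In practice, a caller steers an allocation toward a zone through the gfp flags it passes to the allocator. A hedged sketch (the function name is made up; GFP_DMA, GFP_HIGHUSER, kmap() and kunmap() are the standard interfaces):

#include <linux/gfp.h>
#include <linux/highmem.h>

static void zone_examples(void)
{
  /* must be reachable by legacy DMA hardware: satisfied from ZONE_DMA */
  struct page *dma_page = alloc_pages(GFP_DMA, 0);

  /* may come from ZONE_HIGHMEM: not permanently mapped, so it has to be
   * mapped into the kernel's address space temporarily before use      */
  struct page *high_page = alloc_pages(GFP_HIGHUSER, 0);

  if (high_page) {
    void *vaddr = kmap(high_page);  /* create a temporary mapping */
    /* ... touch the page through vaddr ... */
    kunmap(high_page);
    __free_pages(high_page, 0);
  }
  if (dma_page)
    __free_pages(dma_page, 0);
}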

The zone representation is also a struct; a distilled view of the code looks like this:

struct zone {
  unsigned long             watermark[NR_WMARK];
  unsigned long             lowmem_reserve[MAX_NR_ZONES];
  struct per_cpu_pageset    pageset[NR_CPUS];
  struct free_area          free_area[MAX_ORDER];
  struct zone_lru {
    struct list_head        list;
    unsigned long           nr_saved_scan;
  }                         lru[NR_LRU_LISTS];
  struct zone_reclaim_stat  reclaim_stat;
  struct pglist_data        *zone_pgdat;
  spinlock_t                lock;
  spinlock_t                lru_lock;
  wait_queue_head_t         *wait_table;
  atomic_long_t             vm_stat[NR_VM_ZONE_STAT_ITEMS];
  unsigned long             pages_scanned;
  unsigned long             flags;
  int                       prev_priority;
  unsigned int              inactive_ratio;
  unsigned long             wait_table_hash_nr_entries;
  unsigned long             wait_table_bits;
  unsigned long             zone_start_pfn;
  unsigned long             spanned_pages;
  unsigned long             present_pages;
  const char                *name;
};
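
Kernel code rarely touches these fields directly, but walking the zones shows how they are consumed. A small sketch (the function name is made up; for_each_zone() is the standard helper from <linux/mmzone.h>):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmzone.h>

static void dump_zones(void)
{
  struct zone *zone;

  /* iterate over every zone on the system and print its name and size */
  for_each_zone(zone) {
    pr_info("zone %-8s start_pfn=%lu spanned=%lu present=%lu\n",
            zone->name, zone->zone_start_pfn,
            zone->spanned_pages, zone->present_pages);
  }
}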

Free Lists

Allocating data structures and freeing them are among the most frequent operations inside any kernel. To expedite these recurrent allocations and deallocations, the kernel uses free lists.

A free list is a container of already allocated data structures that code can request on demand instead of creating and initializing a data structure on the fly. Later, when the data structure is no longer required, it is returned to the free list instead of being destroyed. For this reason a free list is also known as an object cache.
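As a rough illustration of the idea (plain user-space C, not actual kernel code), a free list is little more than a linked list of recycled objects:

#include <stdlib.h>

struct thing {
  struct thing *next;  /* free-list linkage */
  /* ... the actual payload of the data structure ... */
};

static struct thing *free_list;  /* head of the object cache */

/* take an object from the free list, falling back to malloc() when empty */
static struct thing *thing_get(void)
{
  struct thing *t = free_list;

  if (t) {
    free_list = t->next;
    return t;
  }
  return malloc(sizeof(*t));
}

/* return an object to the free list instead of destroying it */
static void thing_put(struct thing *t)
{
  t->next = free_list;
  free_list = t;
}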

Hold on, did you notice a problem with free lists? When memory is low, the kernel has no idea how many free lists exist or where they live, so it cannot ask any of them to shrink and give memory back. A control center for all these free lists roaming around is required.


Slab Layer

The slab layer was introduced to act as exactly that control center for free lists. The concept was first implemented in Sun Microsystems’ SunOS 5.4; Linux shares the same basic design and the same name, the slab allocator.

The aims of the slab layer are:

  • Frequently used data structures are allocated and freed often, so cache them.
  • Frequent allocation & deallocation can result in fragmentation¹; to prevent this, the cached free lists are arranged contiguously.
  • Free lists already improve performance, and if the slab allocator is aware of properties of the lists such as object size, page size, and total cache size, it can make more intelligent decisions.

The slab layer divides the different objects into groups called caches, each of which stores a different type of object. The caches are further divided into slabs, and the slabs are composed of one or more physically contiguous pages. Typically each slab is composed of a single page, and each cache can contain multiple slabs. This strategy reduces fragmentation.
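Kernel code talks to this layer through the kmem_cache interface. A minimal sketch (the struct, cache name, and function names are invented for illustration; kmem_cache_create(), kmem_cache_alloc() and kmem_cache_free() are the standard calls):

#include <linux/slab.h>

/* a hypothetical object that is allocated and freed frequently */
struct my_thing {
  int  id;
  char name[32];
};

static struct kmem_cache *my_thing_cache;

static int my_thing_cache_init(void)
{
  /* create a cache: the slab layer carves slabs (whole pages) into
   * equally sized my_thing objects and keeps the freed ones around */
  my_thing_cache = kmem_cache_create("my_thing", sizeof(struct my_thing),
                                     0, SLAB_HWCACHE_ALIGN, NULL);
  return my_thing_cache ? 0 : -ENOMEM;
}

static void my_thing_example(void)
{
  /* grab an object from the cache instead of allocating it from scratch */
  struct my_thing *t = kmem_cache_alloc(my_thing_cache, GFP_KERNEL);

  if (!t)
    return;
  /* ... use the object ... */

  /* hand it back to the cache's free list, not to the page allocator */
  kmem_cache_free(my_thing_cache, t);
}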

A slab itself is represented by a struct like this:

struct slab {
  struct list_head  list;       /* full, partial, or empty list */
  unsigned long     colouroff;  /* offset for the slab colouring */
  void              *s_mem;     /* first object in the slab */
  unsigned int      inuse;      /* number of allocated objects in the slab */
  kmem_bufctl_t     free;       /* first free object, if any */
};

The slab layer handles all the low-level alignment, coloring, allocations, freeing, and reaping during low-memory conditions.


Wrap Up

Now you have a good idea that a lot goes on around memory inside the kernel, and that it is quite different from the memory landscape of a process. This is not the end, though: a few more mechanisms around kernel memory management remain untouched, especially how stacks work in kernel space. I am planning to take that up in another post, as this one has already grown pretty lengthy.

Stay healthy, stay blessed!



  1. The inability to find large chunks of contiguous memory blocks ↩︎