The Linux memory subsystem is a rather multifaceted construction. To understand its details you have to immerse yourself in the topic deliberately, including reading the kernel sources, and not everyone needs that. For developing and operating server software, however, it is important to have at least a basic idea of how it works, and it never ceases to amaze me how few people have one. In this post I will try to briefly run through the main things; without understanding them, in my opinion, it is very easy to mess things up.

What is memory?

Physical and virtual

Let’s start from afar. The specification of any computer, and of a server in particular, always includes a line like “N gigabytes of RAM”: that is how much physical memory it has at its disposal.

The task of distributing available resources, including physical memory, among running programs rests on the shoulders of the operating system, in our case Linux. To maintain the illusion of complete independence, it gives each program its own virtual address space and a low-level interface for working with it. This spares programs from having to know about each other, about the amount of physical memory available, or about how much of it is currently in use. Addresses in a process’s virtual space are called logical.

To track the correspondence between physical and virtual memory, the Linux kernel uses a hierarchical set of data structures kept in its own service area of physical memory (only the kernel works with physical memory directly), together with specialized hardware circuits collectively referred to as the MMU.

Tracking every byte of memory separately would be too costly, so the kernel operates on fairly large blocks of memory called pages, typically 4 kilobytes in size.
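
To make the page arithmetic concrete, here is a minimal Python sketch. The helper names split_address and pages_needed are made up for this illustration; the only system facility used is mmap.PAGESIZE, which reports the page size of the machine the script runs on.

```python
import mmap

# Page size of the current machine; typically 4096 bytes on x86-64 Linux.
PAGE_SIZE = mmap.PAGESIZE


def split_address(addr: int, page_size: int = PAGE_SIZE) -> tuple[int, int]:
    """Return (page number, offset inside the page) for a virtual address."""
    return divmod(addr, page_size)


def pages_needed(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Number of whole pages required to hold nbytes, rounded up."""
    return -(-nbytes // page_size)


if __name__ == "__main__":
    print("page size:", PAGE_SIZE)
    print(split_address(0x12345, 4096))  # (18, 837)
    print(pages_needed(10_000, 4096))    # 3
```

Even a one-byte allocation occupies a whole page from the kernel's point of view, which is why pages_needed rounds up.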

It is also worth mentioning that hardware usually supports an additional level of abstraction in the form of RAM “segments”, which can be used to divide programs into parts. Unlike some other operating systems, Linux makes practically no use of it: the logical address always coincides with the linear one (the address inside a segment, and the segments are configured in a fixed manner).

File and Anonymous

Applications have many ways to allocate memory for themselves. High-level programming languages and libraries often hide from developers which of these ways is actually used, along with other details (although they can always be uncovered with strace). If I went into the specifics of every available option, this article would quickly turn into a book. Instead, I propose dividing them into two groups, which in my opinion are extremely important, according to the kind of memory they yield:

  • File memory corresponds unambiguously to a file, or part of one, in the file system. First of all, it usually holds the executable code of the program itself. For application needs, you can ask for a file to be mapped into the virtual address space of a process with the mmap system call, after which you can work with it like any other memory area, without explicit read/write calls. What happens to the data in the file system, and what other processes that have mapped the same file will see, depends on the settings.
  • Any other allocated memory is called anonymous, since no file, and therefore no name, corresponds to it. This includes both variables on the stack and areas allocated with functions like malloc (which, by the way, usually use mmap with a special set of settings behind the scenes for large blocks, and brk/sbrk or previously freed memory for everything else).

At first glance the difference does not look like anything special, but the fact that file memory areas are named lets the operating system save physical memory, sometimes very significantly, by mapping the virtual addresses of several processes that work with the same file onto the same physical page. This works transparently, from the code of multiple running copies of one application to systems designed specifically around this optimization.
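
The two kinds of memory can be obtained from Python's standard mmap module, which wraps the same system call. The sketch below is only an illustration: the function name demo_mappings and the temporary file are made up for the example, and it relies on the module's Unix default of a shared, writable mapping.

```python
import mmap
import tempfile


def demo_mappings() -> tuple[bytes, bytes]:
    """Create one file-backed and one anonymous mapping and write to both."""
    # File memory: the mapping is backed by a real file. On Unix the default
    # is a shared writable mapping, so the change ends up in the file itself.
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello world\n")
        f.flush()
        with mmap.mmap(f.fileno(), 0) as file_map:  # length 0 = whole file
            file_map[0:5] = b"HELLO"
            file_map.flush()
        f.seek(0)
        file_content = f.read()

    # Anonymous memory: a file descriptor of -1 means no file stands behind it.
    with mmap.mmap(-1, mmap.PAGESIZE) as anon_map:
        anon_map[0:4] = b"data"
        anon_content = anon_map[0:4]

    return file_content, anon_content


if __name__ == "__main__":
    print(demo_mappings())
```

The data written to the anonymous mapping disappears when the mapping is closed, while the change made through the file mapping is visible to anyone who reads the file afterwards.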

Evictable and not

The total amount of virtual memory used by all programs can easily exceed the amount of available physical memory. At the same time, at any given moment applications may be using only a small subset of the data stored at their virtual addresses. This means the operating system can move currently unused data from RAM to disk (“evicting” it from memory) and then, when something tries to access this data, copy it back into physical RAM. The event that triggers the copy back is officially called a major page fault, but a plain “page fault” usually means the same thing, since minor page faults are of little concern (the difference is that in the minor case the kernel finds the requested data already loaded in memory for some other purpose, so no disk access takes place).
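
A process can look at its own fault counters. The following sketch uses getrusage from Python's standard resource module, whose ru_minflt and ru_majflt fields hold the minor and major fault counts; the function names and the 16 MB default are chosen just for this illustration.

```python
import mmap
import resource


def fault_counters() -> tuple[int, int]:
    """Return (minor faults, major faults) accumulated by this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_anonymous_memory(size: int = 16 * 1024 * 1024) -> tuple[int, int]:
    """Write one byte to every page of a fresh region; return fault deltas."""
    minor_before, major_before = fault_counters()
    with mmap.mmap(-1, size) as region:
        for offset in range(0, size, mmap.PAGESIZE):
            region[offset] = 1  # first touch makes the kernel supply a page
    minor_after, major_after = fault_counters()
    return minor_after - minor_before, major_after - major_before


if __name__ == "__main__":
    minor, major = touch_anonymous_memory()
    print(f"minor faults: {minor}, major faults: {major}")
```

On a machine with enough free memory one would expect the growth to be almost entirely in the minor counter, because fresh anonymous pages need no disk access. The exact numbers depend on the kernel and its settings, so none are quoted here.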

While the data requested by the application is being restored, its execution is suspended and control passes to the kernel, which performs the corresponding procedure. How long it takes before the application can continue depends directly on the type of disk used:

  • Reading 4 KB of data from an ordinary 7200 rpm server hard disk takes about 10 ms, a little less if circumstances are favorable.
    • If many pages have been evicted, this easily adds up to noticeable fractions of a second (noticeable both to end users and to internal systems, depending on the task).
    • Cyclic page faults are especially dangerous: when two or more regularly used memory areas do not fit into physical memory together, they keep evicting each other indefinitely.
    • In each of these cases the disk has to do an honest seek, which may itself come at a bad time, for example if a database is working with the same disk.
  • With an SSD the situation is somewhat more optimistic: because there is no mechanical movement, a similar operation takes about an order of magnitude less, around 1 ms or a fraction of it, depending on the type and specific model of the disk. But the years go by, and SSDs remain a niche product that compromises on price per unit of capacity.
  • And now for comparison: if the page were already in memory, accessing it would take on the order of a hundred nanoseconds. That is 3 to 4 orders of magnitude faster than a page fault, even on an SSD.
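
To see how quickly rare page faults come to dominate, here is a small back-of-the-envelope calculation. It uses the round figures from the list above (100 ns for RAM, 1 ms for an SSD fault, 10 ms for a hard disk fault); the fault probability is an arbitrary number picked for the illustration, not a measurement.

```python
RAM_NS = 100               # access to a page already in memory
SSD_FAULT_NS = 1_000_000   # about 1 ms per major page fault on an SSD
HDD_FAULT_NS = 10_000_000  # about 10 ms per major page fault on a hard disk


def average_access_ns(fault_probability: float, fault_cost_ns: float,
                      ram_ns: float = RAM_NS) -> float:
    """Expected cost of one access when some accesses hit a page fault."""
    return (1 - fault_probability) * ram_ns + fault_probability * fault_cost_ns


def slowdown(fault_probability: float, fault_cost_ns: float,
             ram_ns: float = RAM_NS) -> float:
    """How many times slower the average access is than a pure RAM access."""
    return average_access_ns(fault_probability, fault_cost_ns, ram_ns) / ram_ns


if __name__ == "__main__":
    p = 0.001  # assumption: one access in a thousand causes a major fault
    print(f"HDD: {slowdown(p, HDD_FAULT_NS):.1f}x slower on average")
    print(f"SSD: {slowdown(p, SSD_FAULT_NS):.1f}x slower on average")
```

With these inputs, a single fault per thousand accesses makes the average access roughly a hundred times slower on a hard disk and roughly eleven times slower on an SSD.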

Note that from the application’s point of view all of this is transparent and imposed from outside, which means it can happen at the most inopportune moment for the task the application is solving.

I think it is clear that applications for which high performance and stable response time matter should avoid page faults by every available means, and it is to those means that we now turn.

Memory subsystem management methods

swap

With file memory everything is simple: if the data in it has not changed, nothing special is needed to evict it. The kernel simply discards it, and it can always be restored from the file system.

With anonymous memory this trick does not work: no file corresponds to it, so for the data not to be lost irretrievably it has to be put somewhere else. The so-called swap partition or swap file serves this purpose. Using one is possible, but in practice not necessary. If swap is turned off, anonymous memory becomes non-evictable, which makes the time to access it predictable.

A downside of turning swap off might seem to be that if, for example, an application leaks memory, it is guaranteed to hold physical memory for nothing (the leaked memory cannot be evicted). But it is better to look at it the other way: this helps to detect and fix the bug sooner.
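
Whether swap is configured on a given machine can be checked from /proc/meminfo, whose SwapTotal field is zero when there is none. The sketch below is one possible way to read it; the function names are made up for the example.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # sizes are given in kB
    return fields


def swap_enabled(meminfo: dict[str, int]) -> bool:
    """Swap counts as enabled when the kernel reports a non-zero SwapTotal."""
    return meminfo.get("SwapTotal", 0) > 0


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print("swap enabled:", swap_enabled(parse_meminfo(f.read())))
    except FileNotFoundError:
        print("/proc/meminfo is not available on this system")
```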

mlock

By default all file memory is evictable, but the Linux kernel makes it possible to forbid eviction, with a granularity not just of whole files but of individual pages within a file.

To do this, call the mlock system call on an area of virtual memory obtained with mmap. If you do not want to go down to the level of system calls, I recommend looking at the vmtouch console utility, which does the same thing from outside the application.

A few examples where this might be appropriate:

  • The application has a large executable with many branches, some of which run rarely but regularly. This is best avoided for other reasons, but if there is no other way, you can forbid eviction of the code for these rare branches so that you do not have to wait for it to load.
  • Database indexes are often physically files that are accessed through mmap; mlock is then needed to minimize latency and the number of I/O operations on disks that are already loaded.
  • The application uses some kind of static dictionary, for example a mapping of IP subnets to the countries they belong to. This is doubly relevant if several processes on the same server work with this dictionary.
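
From a high-level language, mlock can be reached through the C library. The sketch below calls it on a Python mmap region via ctypes; the function name lock_in_ram is made up for the example, and it assumes a writable mapping and a Unix-like system where the C library is already loaded into the process. The call can legitimately fail, for instance when the process's locked-memory limit (RLIMIT_MEMLOCK) is too low, so the result is reported rather than assumed.

```python
import ctypes
import mmap


def lock_in_ram(region: mmap.mmap) -> bool:
    """Try to pin all pages of a writable mmap region in physical memory."""
    # CDLL(None) gives access to symbols already loaded into the process,
    # which includes the C library and therefore mlock.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    # from_buffer exposes the address of the mapping without copying it.
    view = ctypes.c_char.from_buffer(region)
    try:
        return libc.mlock(ctypes.addressof(view), len(region)) == 0
    finally:
        del view  # release the buffer export so the mmap can be closed


if __name__ == "__main__":
    with mmap.mmap(-1, mmap.PAGESIZE) as region:
        print("locked:", lock_in_ram(region))
```

Unmapping the region, as happens when the with block ends, also removes the lock.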

OOM killer

If you overdo it with non-evictable memory, it is easy to drive the operating system into a situation where physical memory has run out and nothing can be evicted. It only looks hopeless at first glance: instead of evicting memory, the kernel can free it.

This is done by rather radical means. The mechanism that gives this section its name selects, according to a certain algorithm, the process that is currently most suitable to sacrifice; once that process is killed, the memory it was using can be redistributed among the survivors. The main selection criterion is the current consumption of physical memory and other resources. You can also intervene and manually mark processes as more or less valuable, or exclude them from consideration altogether. If you disable the OOM killer completely, then in the event of a total shortage the system has nothing left to do but reboot.
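
On current kernels the manual marking is done through the file /proc/<pid>/oom_score_adj, which accepts values from -1000 (never pick this process) to 1000 (pick it first). The sketch below wraps reading and writing that file; the function names and the proc_root parameter are made up for the example, the latter only so that the code can be tried against an ordinary directory. Lowering the value normally requires elevated privileges, while raising it does not.

```python
import os

OOM_ADJ_MIN = -1000  # the process is excluded from OOM killer selection
OOM_ADJ_MAX = 1000   # the process becomes the preferred victim


def set_oom_score_adj(value: int, pid: str = "self",
                      proc_root: str = "/proc") -> None:
    """Write an OOM score adjustment for the given process."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be within "
                         f"[{OOM_ADJ_MIN}, {OOM_ADJ_MAX}], got {value}")
    with open(os.path.join(proc_root, pid, "oom_score_adj"), "w") as f:
        f.write(str(value))


def read_oom_score_adj(pid: str = "self", proc_root: str = "/proc") -> int:
    """Read the current OOM score adjustment of the given process."""
    with open(os.path.join(proc_root, pid, "oom_score_adj")) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    try:
        print("current adjustment:", read_oom_score_adj())
    except FileNotFoundError:
        print("oom_score_adj is not available on this system")
```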

cgroups

By default, all user processes compete on equal terms for almost all of the physical memory available on a server. This behavior is rarely acceptable. Even if the server is nominally single-purpose, for example it only serves static files over HTTP with nginx, there are always some service processes such as syslog, or some temporary command run by a person. If several production processes run on the server at the same time, for example the popular option of placing memcached next to a web server, it is highly desirable that they cannot start “fighting” each other for memory when it runs short.

To isolate important processes, modern kernels provide the cgroups mechanism. With it you can divide processes into logical groups and statically configure how much physical memory each group may be given. Each group then gets its own, almost independent memory subsystem, with its own eviction tracking, OOM killer and other joys.

The cgroups mechanism is much broader than just control of memory consumption: it can be used to distribute computing resources, pin groups to processor cores, limit I/O and much more. The groups themselves can be organized into a hierarchy, and many “lightweight” virtualization systems, including the now fashionable Docker containers, are built on top of cgroups.

But in my opinion, control of memory consumption is the necessary minimum that is definitely worth configuring; the rest is optional and depends on need.
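
As an illustration of what configuring a memory limit looks like, here is a minimal sketch for the cgroup v2 interface, where a group is a directory and the limit lives in its memory.max file. This is an assumption about the setup: older systems use the v1 interface with different file names, the real hierarchy is usually mounted at /sys/fs/cgroup, writing there requires root, and the memory controller has to be enabled for the parent group. The function name and the group name in the usage example are hypothetical.

```python
import os


def configure_memory_group(group_dir: str, limit_bytes: int,
                           pids: list[int]) -> None:
    """Create a cgroup v2 group, cap its memory and move processes into it."""
    os.makedirs(group_dir, exist_ok=True)  # on cgroupfs, mkdir creates a group
    with open(os.path.join(group_dir, "memory.max"), "w") as f:
        f.write(str(limit_bytes))
    for pid in pids:
        # The kernel expects one PID per write to cgroup.procs.
        with open(os.path.join(group_dir, "cgroup.procs"), "w") as f:
            f.write(str(pid))


if __name__ == "__main__":
    # Hypothetical usage; needs root and a mounted cgroup v2 hierarchy:
    # configure_memory_group("/sys/fs/cgroup/webserver", 2 * 1024**3, [1234])
    pass
```

Once the group exceeds its limit and nothing more can be evicted within it, the OOM killer acts inside that group only, leaving the rest of the server alone.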

NUMA

In multiprocessor systems not all memory is equal. If the motherboard has N processors (for example, 2 or 4), then as a rule all the RAM slots are physically divided into N groups so that each group sits closer to its own processor. This scheme is called NUMA.

Thus each processor can access a certain 1/N part of physical memory faster (by about one and a half times) than the remaining (N-1)/N.

The Linux kernel can detect all of this on its own and, by default, takes it into account quite sensibly when scheduling processes and allocating memory to them. You can see how it all looks and adjust it with the numactl utility and a number of system calls, in particular get_mempolicy and set_mempolicy.
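
The topology the kernel has detected is also exposed under /sys/devices/system/node, one nodeN directory per NUMA node, each with a cpulist file such as "0-3,8-11". The sketch below reads it; the function names and the sys_root parameter are made up for the illustration, and on a machine without that directory the result is simply empty.

```python
import glob
import os


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as "0-3,8-11" into individual CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # a node without CPUs has an empty list
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(sys_root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Return {NUMA node id: list of CPU ids} as reported by the kernel."""
    topology = {}
    for path in glob.glob(os.path.join(sys_root, "node[0-9]*")):
        node_id = int(os.path.basename(path)[len("node"):])
        with open(os.path.join(path, "cpulist")) as f:
            topology[node_id] = parse_cpulist(f.read())
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: CPUs {cpus}")
```

Knowing which CPUs belong to which node is the first step; binding the memory itself to a node is what numactl and set_mempolicy are for.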

Memory operations

There are several topics that in practice only C/C++ developers of low-level systems deal with, and it is not for me to lecture them. But even if you never run into this directly, in my opinion it is useful to know in general terms what nuances exist:

  • Memory operations:
    • Most of them are not atomic (that is, another thread can “see” them half done). Without explicit synchronization, atomicity is possible only for blocks of memory no larger than a pointer (usually 64 bits), and even then only under certain conditions.
    • In reality they do not always happen in the order in which they are written in the program’s source code: processors and optimizing compilers may reorder them as they see fit. In multi-threaded programs such optimizations can often break the program’s logic. To prevent such errors developers can use special tools, in particular memory barriers: instructions that forbid moving memory operations across the barrier, between the parts of the program before and after it.
  • New processes are created with the fork system call, which produces a copy of the current process (to start a different program in the new process there is a separate family of system calls, exec). The virtual space of the copy is almost completely identical to the parent’s, and it consumes no additional physical memory until one or the other begins to modify it. This mechanism is called copy on write, and you can exploit it to create a large number of independent processes of the same type (for example, request handlers) at a minimal extra cost in physical memory. In some cases this is more convenient to live with than a multi-threaded application.
  • Between the processor and RAM there are several levels of caches, access to which is faster than RAM by up to orders of magnitude: fractions of a nanosecond for the fastest, a few nanoseconds for the slowest. You can build micro-optimizations on the peculiarities of how they work, but from high-level programming languages you cannot really get at them.
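
The visible half of copy on write, that parent and child stop sharing a page as soon as one of them writes to it, can be shown with a short fork example. The function name fork_demo is made up for the illustration. Note that this only demonstrates the isolation; the memory savings themselves are not measured here, and in CPython they tend to be smaller than in C because reference counting writes to objects that are merely read.

```python
import os


def fork_demo() -> tuple[int, int]:
    """Fork, modify data in the child, and compare what each side sees."""
    data = bytearray(4096)        # zero-filled buffer shared after fork
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the write below makes the kernel give it a private copy.
        os.close(read_fd)
        data[0] = 42
        os.write(write_fd, bytes([data[0]]))
        os._exit(0)
    # Parent: its own copy of the page stays untouched.
    os.close(write_fd)
    child_value = os.read(read_fd, 1)[0]
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_value, data[0]


if __name__ == "__main__":
    print(fork_demo())  # (42, 0)
```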

Summary

The Linux memory subsystem should not be left to its own devices. At a minimum, the following indicators should be monitored and put on dashboards (both in total and per process or group of processes):

  • The rate of occurrence of major page faults;
  • OOM killer activations;
  • The current amount of physical memory in use (this number is usually called RSS, not to be confused with the format of the same name for publishing text content).

In normal operation all three indicators should be stable (and the first two close to zero). Spikes or steady growth should be treated as an anomaly whose causes are worth investigating. As for how, I hope I have shown enough directions in which to dig.
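
As a starting point for such monitoring, the three indicators can be read from procfs. This sketch assumes a reasonably recent kernel: /proc/vmstat carries a system-wide pgmajfault counter and, on newer kernels, an oom_kill counter, while /proc/<pid>/status has a VmRSS line. The function names are made up for the example, and a real setup would more likely rely on an existing monitoring agent.

```python
from typing import Optional


def parse_counters(text: str) -> dict[str, int]:
    """Parse "name value" lines in the format used by /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def rate_per_second(previous: dict[str, int], current: dict[str, int],
                    key: str, interval_s: float) -> float:
    """Growth rate of a cumulative counter between two samples."""
    return (current.get(key, 0) - previous.get(key, 0)) / interval_s


def rss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print("RSS of this process, kB:", rss_kb(f.read()))
    except FileNotFoundError:
        print("procfs is not available on this system")
```

Sampling /proc/vmstat at a fixed interval and feeding two consecutive samples to rate_per_second gives the major fault rate; any increase of the oom_kill counter is an event worth an alert on its own.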

  • The article was written with modern Debian-like Linux distributions and physical hardware with two Intel Xeon processors in mind. The general principles do not depend on this and hold even for other operating systems, but the details can vary greatly, even with the kernel build or configuration.
  • Most of the system calls, functions and commands mentioned above are documented in man, which I recommend consulting for details on their use and operation. If you do not have a Linux machine at hand where you can type man foo, the pages are usually easy to find online with the same query.
  • If you would like to dig deeper into any of the topics touched upon in passing, write about it in the comments; any of them may become the subject of a separate article.

PS

Finally, let me repeat the numbers that I strongly recommend remembering:

  • 0.0001 ms (100 ns) – access to main memory
  • 0.1–1 ms (0.1–1 million ns) – access to an SSD on a major page fault, 3–4 orders of magnitude more expensive
  • 5–10 ms (5–10 million ns) – access to a traditional hard disk on a major page fault, another order of magnitude more expensive

// ms – milliseconds, ns – nanoseconds.