LPC2019 – pidfds: Process file descriptors on Linux


– Hey, I’m Christian. I work as a kernel engineer at Canonical, do a lot of upstream work, and maintain a few bits and pieces in the upstream kernel. Originally I was supposed to give a talk about writing a kernel driver in Rust — which is still work we’re doing, actually — but I got bronchitis, so you get pidfds. I don’t know if it’s a fair trade, but so we’re all on the same page: this is pidfds, not Rust. Right. We can do this in different ways; usually I don’t mind taking questions during my talk, so if you have a pressing issue just ask right away. Okay, that will switch us to the next slide. Great. So, pidfds — what is that, I guess, is the first question. Who has heard about this work in the first place? Ah, okay, so LWN did a good job. (Christian laughs)
(audience laughs) So, the basic idea: it’s a file descriptor that refers to a process. That’s not a new idea, and I’m going to get to the prior art in a little bit, but specifically, in the initial implementation, it’s an FD that refers to a thread-group leader. Right now you cannot have a pidfd that refers to a single thread, just because we didn’t have a use case for it. It’s not necessarily out of scope — it’s something we could think about — but it would probably require a lot more thinking about the semantics. The idea is that a pidfd serves as a stable, private handle that guarantees it will always refer to the same process. And it reuses something that has existed in the kernel for a long time: the kernel’s version of a stable handle on a process, struct pid. Note that we use struct pid, not task_struct. A task — a thread, if you want to put it like that — is identified by a task_struct. You could argue that we could’ve made a pidfd refer to a task_struct. Why didn’t we do this? Does somebody have an obvious intuition why we didn’t?
– [Audience Member] It recycles them too fast?
– Sorry?
– [Audience Member] It recycles them too fast?
– Probably that, yes, but also just the sheer size. If you read the comment for struct pid in the kernel, it explains why it exists in the first place, and the reason was basically: we need to keep a stable reference to a process a lot of the time, and we want to recycle task_struct because it burns too much memory. It’s pretty big — if you look at the file, there are massive amounts of information in there. So that’s why you use struct pid, exactly. So why did we do this in the first place? A stable, private handle —
why would you want this? Why aren’t PIDs enough, I guess, is the burning first question. And why did we use FDs and not, for example, UUIDs, which some people suggested? Well, the main reason we did it is PID recycling, which some people think is not really an issue. (Oh, and by the way, if anyone knows more about something than I do, please yell as well.) PID allocation in the kernel happens cyclically: if you hit the maximum number of PIDs configured on your system, it wraps around and finds the next free PID. This way PIDs get recycled, especially on a system under a lot of pressure, where you create a lot of processes that exit very fast. It shouldn’t be much of an issue if you bump the maximum PID to the very highest value, around four million, but on a standard system it’s about 32,000, which is fairly quick to recycle. And it’s not just a theoretical issue — I linked to a bunch of CVEs and problems. The most well-known one is, I guess, the one in polkit; I think Jann might have found it. You could win a race against polkit by recycling processes so fast that you wrapped the PID, and then trick polkit into authenticating you with the wrong process. So this is an issue that actually happens. There are also a bunch of PID-based macOS exploits, actually. That’s something I found
which is pretty interesting. So they have issues with this as well, and as far as I know — I don’t know the Apple source code, so this is a wild guess on my part — they don’t have something like a stable handle on a process. And there are more issues I linked to that were discussed, another CVE as well. Another reason was shared libraries: basically, forking off invisible helper processes without having to rely on SIGCHLD to get exit notifications for a given process. This is especially relevant if you’re a generic shared library, as some people call it, with an event loop that has a bunch of callbacks, where some of those callbacks react to SIGCHLD. If they get one, they try to wait — maybe generically on all processes — and they might end up reaping processes they didn’t really want to reap, taking them away from the other callback in the event loop that actually wanted to wait on them. pidfds — and hopefully we get to this bit time-wise — will eventually let you solve this cleanly; they partially already do. And then there’s process management delegation, which requires a bit more work than what we have right now; right now we just have a sort of skeleton for process handling. The idea is to hand off a handle to a non-parent process, for example for waiting and signaling safely. I would like to at least explore making it possible for non-parent processes to wait on a process if you hand them a pidfd. I haven’t worked on this specifically, but it would be pretty cool if it were possible. It needs to be safe, though, and it would probably need to be a property you specify at process creation time — when you create the pidfd. We’ll see in a bit how we create pidfds. And the last reason — this is sort of my, I guess, defense against
why we didn’t use UUIDs — is the ubiquity of FDs in userspace. We already have a lot of patterns in userspace to deal with FDs. That includes, for example, parsing fdinfo out of /proc/self/fdinfo and then the FD; that’s generic code you can reuse. Most userspace programs on any kind of Linuxy, UNIXy system know how to deal with FDs: they usually have an event loop where you can stuff FDs in and listen for events, and they have code to receive and send FDs. So it should be pretty easy to switch to using pidfds, which was also pretty important. Now, does userspace really care about this feature? That’s a question we ask ourselves quite a lot as kernel engineers, right? I would be fine with just doing my work for fun, obviously, but it’s also pretty cool when you can come up and say: by the way, this is really a feature that people want and actually use. And it is a feature people actually use. Some projects got in touch and said: cool that this work exists, we’ve been trying to use it. Qt was one of the examples, systemd was another, and CRIU and lmkd definitely — Joel, is Joel around? Ah, Joel Fernandes from Google was involved in part of this work as well. So D-Bus, for example,
has something that is called ConnectionUnixProcessIDHandle, which is PID-based right now. It’s used to track a remote peer, and it’s obviously vulnerable to PID races as well. They have an issue up where they discuss switching to pidfds to get rid of this problem and reliably track peers. Qt were once involved in an initial version of the patch set a long while ago — I’ll mention this in a bit. They want to fork off invisible helper processes, because they fall under the category of generic shared libraries I’ve been mentioning: they don’t know what other callbacks in the event loop will fork off helper processes. And systemd wants to use it for process management toto caelo, as far as I understand. A specific issue they have up right now is reliably killing processes when the freezer cgroup isn’t available. If you have the freezer cgroup, you just freeze the cgroup and then kill all the processes — zombies can’t do syscalls, so you’re fine. But on systems without the freezer cgroup, you still want to reliably kill processes, and you need to identify whether they’re in the right cgroup. What you can do is get a pidfd and read what cgroup that process is in, and because you’re holding an FD — a stable handle — you can then reliably kill the process. So pidfd_send_signal makes it safe to kill off processes reliably. And CRIU has something called detect_pid_reuse. They do predumps — Adrian can correct me if I’m wrong — a series of predumps where they store all the information of a process that they later want to restore. If you do multiple predumps, for example to track memory changes over time for a process, you need to make sure it’s still the same process. Their detect_pid_reuse function uses, I think, the process start time or something; it’s really just a heuristic and not really reliable, and they want to switch to pidfds for this as well, which would let them get rid of this problem. And lmkd is Android’s low-memory-killer daemon, which wants to use pidfds to avoid PID recycling issues too. They’re probably the ones who profit the most from this, since I’m assuming they’re under memory pressure a lot (laughs) and they fork off a bunch of processes. Right, so prior art — this
was always important. Obviously, this is completely my idea and nobody had it before. (laughs) No, I’m joking. It’s something that — Alexa actually reminded me of this — is pretty obvious if you look at Linux itself: a /proc/<pid> directory already pins a process, right? That’s sort of why struct pid exists as well. If you have an FD to a /proc/<pid> directory, it pins struct pid in the kernel already. You can’t do anything with it and it doesn’t help you at all, but the concept is there; just staring at the code for a while, you could probably have figured this out. But there are also other systems that had similar ideas. My fault — which is sort of the fault a lot of people born later in time have — was that I naively assumed some systems had it when they didn’t, so I didn’t get my history right, basically. For example, I always assumed that Solaris had it, and illumos, which is the open-source alternative to it. But they don’t; they actually just have a pure userspace emulation of a process-table handle. There’s procopen, procrun, procclose, procfree, which is vulnerable to all the problems that I detailed at the beginning of the talk. OpenBSD and NetBSD also don’t have it — I’ve looked at their kernel source code. There are no private, stable process handles; there are references to it, it’s sometimes mentioned, but there is no implementation. FreeBSD has it; I guess that’s the most well-known example. I think it derives from the Capsicum project — that’s why they implemented it. They have something which is not called pidfd — obviously, on Linux you always have to come up with your own name — they call it procdesc: a process descriptor. And they have three syscalls, pdfork, pdgetpid, and pdkill, and on Linux we sort of have two of those now — pdfork and pdkill. The semantics actually differ in a bunch of aspects, and I can go into more detail if you really want to know, but for now the concept is at least similar. The semantics are sometimes different: for example, on Linux, when a parent explicitly ignores SIGCHLD, we auto-reap the child processes so they just go away — nothing fancy going on. What FreeBSD actually does is reparent the process to init, and then init gets a SIGCHLD; FreeBSD is, for example, saying (claps): PID 1, go deal with it. So this means they have to do some things differently from the way we did it, or intend to do it, for future features on Linux. And on Linux there were
multiple pushes to get a concept of a private, stable process handle via an FD: forkfd and CLONE_FD. CLONE_FD might be the most well-known one; it came from a collaboration between Qt and — is it David Drysdale? I don’t want to lie right now. There is a patch for this out; you can Google for CLONE_FD and you should find the patch set. Even back in the day — I guess 2015 — people were very receptive to the idea, but the patch set itself didn’t get in. Sorry, yes?
– [Audience Member] Do you want more prior art?
– [Host] Mic.
– [Audience Member] Another piece of prior art is the work from Casey Dahlin, I think it was in 2009: waitfd, which let you do a waitpid on file descriptors so you could poll things. This also worked for thread-directed signals, such as you get from waitpid on ptraced processes, which unfortunately it looks like this won’t do, because it’s a per-process thing only. A shame.
– Currently, yeah — to repeat, there is technically —
– [Audience Member] There is actually a use case, but I’ll get to that later.
– Yes, there are use cases. I originally thought, for example, for pthread management or something, it would be really nice to have this. So there is technically no reason not to do it. When we started this work I spoke to Florian Weimer and asked what he thought about making use of this for pthreads, and he was like: mm, it sounds interesting, but we also need to be backwards compatible. So if they ever come and say, this is a use case we have — please, let’s get this done, and we can think about it. It’s really just that Oleg Nesterov basically said: if we do CLONE_THREAD with pidfds right now, it gets really hairy, and are we really sure we want to do this right now? (staff member murmurs) Ah, okay. Right, so that’s prior art. I can talk about CLONE_FD. I tried to figure out
why it didn’t get merged. I think the reason is that it tried to do many things at the same time in one patch set, which is understandable, right? Usually you have a really cool idea, you think of all the features you can build on top of it, and here’s the patch set that makes all of them available at once. That’s usually not an approach that flies very well on Linux — which is also fine, because if you merge it, you have to route it through a tree, you have to be responsible for it, somebody needs to maintain it, and it’s not guaranteed that people stay around. Doing it in a more piecemeal fashion is usually the better approach. The idea was similar, but it also had auto-reaping semantics, so that a process just exits and goes away and nobody has to wait on it with FDs, and so on. So it wasn’t as clear-cut, I think, as it should have been, and it didn’t land. Right. So what did we do? Well, we tried to build a new API, and so far this work is
spanning four kernel releases. That sounds like a massive amount of code and many, many changes — actually, no. The changes we needed in the kernel are not that huge. It’s just that I wanted a sustainable pace. Being the one who sort of tried to guide this a little — there were a lot more people involved in the discussion and design, so this is obviously not my personal achievement — I wanted a speed at which we could be sure that the things we were doing were correct, and at which we had time to react to bugs. If you push a lot of new features at the same time and things break, they might break in a bunch of places at once. If you do it piecemeal over a couple of kernels, you have time to catch design mistakes, too. On building a new API, a comment I would like to make — this is the first time I’ve actually spoken about this, so lucky you, or not: in my imagination at least (people involved in this work, like Joel and a bunch of other Google folks, might disagree), my intention was never to say that pidfds replace the PID API completely. I always thought of it as an alternative way of managing processes that is very useful when you need to be very, very sure that the process you’re operating on is actually the process you think you’re operating on. And there is a connection between the PID API and the pidfd API, so you don’t have to choose between doing pidfds or doing PIDs — you can use both at the same time. That has limits, obviously, but the way we designed it — you’ll see this when we talk about the CLONE_PIDFD flag — there’s a nice interaction between the two. So the pidfd API is not saying PIDs are a completely wrong concept, don’t use them anymore, here’s the new way of doing process management. Sure, I expect there will be future features that are based on pidfds and that you can’t base on PIDs, by virtue of how they’re implemented, but that should be about it. Right. So, one of the first things we did, in kernel 5.1, was pidfd_send_signal — a way to send signals through pidfds. You could argue that we
started the wrong way around: we didn’t implement something that lets you create a pidfd right away, we started with a syscall that operates on a pidfd. The reason is that it’s the most obvious thing userspace wants or would ask for: reliably sending signals. Especially if you think about any kind of process management, this is what you want. You don’t want to end up accidentally sending a signal to the wrong process — especially if we’re talking about SIGKILL — so it was very easy to make the case for pidfd_send_signal. Here we get into a tricky area, to some extent. It being the obvious piece that you really, really want also meant that people had a lot of opinions — which is fine — but everybody pushed in a different direction, because everybody had different needs. Some people just wanted a pidfd to be a very abstract handle, myself included. Other people wanted it to correspond to a /proc/<pid> directory so they could get easy metadata access, which then proved really difficult in terms of security when you think about creating a process from the clone syscall, and so on. So there was a lot of back-and-forth, and long, long email threads, but we finally came to a sort of compromise, I think. With pidfd_send_signal you can use a shortcut: you can open a process’s /proc/<pid> directory and use the FD you get from it — which in-kernel is already a stable handle, as it pins a struct pid — and pass that to pidfd_send_signal to send signals to processes. This is a very nice shortcut for userspace; actually, they like it a lot. I don’t like it from the perspective that I would’ve liked an API that
is very, very consistent, where you don’t have two different types of FDs that your API is dealing with. But it’s only this one syscall, and actually there is precedent in the kernel. For example, with the new mount API we’ll likely gain an fsinfo syscall, and fsinfo will operate on regular FDs that you get from directories that are mount points, but also on FDs returned from the new mount API syscalls such as fspick or fsopen, which are a totally different type. So we have precedent for this in the kernel; I don’t think it’s that bad. pidfd_send_signal currently does the job of kill with a positive PID and a signal. There’s currently no way — we can enable this later — to say: I want to signal this specific thread. It’s always some thread in the thread group that doesn’t have the signal blocked that gets it, and so on; this is exactly how kill operates today. Also, we don’t allow you to signal a pidfd that lives in a PID namespace of which your PID namespace is not an ancestor — PID namespaces are hierarchical. So if you somehow get access to a pidfd from a different PID namespace that is a sibling of yours, or an ancestor of yours, you can’t signal upwards and you can’t signal horizontally. There might be future use cases where you’d want to send signals between arbitrary PID namespaces via pidfds, but there wasn’t a use case for it, so we didn’t see a reason to come up with complicated semantics right away. Again, nothing prevents us from doing it in the future — Eric was actually in favor of it — there just wasn’t a use case.
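To make the /proc shortcut concrete, here is a minimal sketch (my own example, not from the talk); the fallback syscall number is the x86-64 one, assumed because glibc had no wrapper at the time:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424 /* x86-64; in Linux since 5.1 */
#endif

/* Fork a child, grab a stable handle on it by opening /proc/<pid>
 * (which pins struct pid), and SIGKILL it through that FD.
 * Returns the signal that terminated the child, or -1 on error. */
int pidfd_kill_demo(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        pause();        /* child: wait to be killed */
        _exit(0);
    }
    if (pid < 0)
        return -1;

    char path[64];
    snprintf(path, sizeof(path), "/proc/%d", pid);
    int pidfd = open(path, O_RDONLY | O_CLOEXEC);
    if (pidfd < 0)
        return -1;

    /* No glibc wrapper in 2019, so raw syscall(2). */
    if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0) {
        close(pidfd);
        return -1;
    }
    close(pidfd);

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```

Because the FD pins the struct pid, the signal cannot land on a recycled PID, which is the whole point of the API.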
And then in 5.2 we landed — ah, it’s not the most important bit, but I like that piece of code specifically. Oh, and by the way, since I talked about code size: the pidfd_send_signal syscall is really small. It lives in kernel/signal.c, and if you look at the patch, it’s not a complicated syscall and it’s not a lot of code. So: CLONE_PIDFD. The idea was that you want to be able to create pidfds at process creation time, and here we ran into another challenge — the same one we had with pidfd_send_signal: what type of file descriptor are we going to make this? At first it was: for consistency, please make CLONE_PIDFD return an FD to the /proc/<pid> directory of the process — Linus had opinions about this, I think. Then Jann and I teamed up and implemented two different solutions, because we thought it wouldn’t be feasible to use file descriptors to /proc/<pid> directories. For security reasons it would’ve meant reworking proc to make it safer — for example, /proc/<pid>/net contains information that lets you snoop on the networking of other processes, so we would have needed a way to restrict access to that in order to safely send these file descriptors around — and the code was really complicated to get right. The patch set is still on LKML, because we sent out two RFCs at the same time: one for the approach we went with, and one for the /proc/<pid> approach. And the /proc/<pid> approach — even though we tried to make the code as elegant as possible — is really ugly, partly because of how proc works, and so on. I’m pretty happy with what we ended up with. So, what we did, as you can see, is use anon_inodes — anonymous inodes. Does everybody know what that is? Okay, a brief explanation: it’s basically just a
single inode in the kernel — it’s a small, little subsystem — and this inode is shared between all file descriptors. Timerfds use it, signalfds use it, BPF uses it, and I guess the new mount API uses it as well. They don’t really require full inodes, so you don’t need to allocate a new inode every time and destroy it when all references to it are dropped. The inode just serves as something to hang a bunch of file operations on; that’s all you need it for. So it’s a really cheap way of creating a stable handle, and the other nice thing is that all of the infrastructure is already there — we didn’t need to come up with a separate tiny file system for pidfds. That’s the core code (there are more changes than this) in fork.c that creates a pidfd: you specify a flag at process creation time, you allocate a new file, you stash a reference to the struct pid of the process you just created in there, and then you have a stable process handle and return that FD. So, we stole (laughs) the last usable flag from clone for this — when CLONE_PIDFD landed, clone was saturated. There are a bunch of unused flags that the kernel currently ignores, but we can’t safely reuse them. I started looking at userspace — glibc and musl (that’s the way you pronounce it, right, muscle? useful muscle) — and they, for example, still pass CLONE_DETACHED down, and CLONE_DETACHED has been ignored since kernel 2.6-something. But it means that if someone decided, let’s recycle the CLONE_DETACHED flag because nobody should be using it anymore in userspace — well, two libcs are broken. So that wasn’t going to fly. Out of clone flags — we’ve
solved that problem later: we have a new clone version. Another specialty about pidfds — which I tried to push for the new mount API as well, but that wasn’t super well received — is that they are CLOEXEC by default. Especially for pidfds I guess it makes sense: you really don’t want to leak pidfds into a child process. So, as you can see, they’re CLOEXEC by default. In my ideal world, every new file descriptor type we create would be CLOEXEC by default, because you can use — what did we decide, how do you say fcntl?
– [Man] I, I mean, I say functal.
– Functal, okay. Let’s go with functal. You can take fcntl and take away the CLOEXEC flag; doing it race-free the other way around is a bit more difficult. But there are certain factions of the kernel that think that’s not a great idea, because then we end up with some file descriptors that are O_CLOEXEC by default and some that are not. Actually, even before the pidfd change landed this was already the case: the seccomp notifier FD — a new file descriptor type you can get from seccomp — is CLOEXEC by default as well, so that ship had already sailed. If you’re adding a new file descriptor type, I would strongly urge you to consider making it CLOEXEC by default, because userspace will really thank you for it. And also, we wanted to have a connection — there are two ways the pidfd API and the PID API are connected here. If you specify CLONE_PIDFD — in the original implementation that we had, and also in the original
CLONE_FD patch, last I looked at it — if you specified the flag, clone didn’t return a PID, it returned you an FD. So basically you did type switching based on a flag, which is not very nice, but that was the first implementation we had. It also meant that you couldn’t return zero, right? For a file descriptor, zero is a perfectly fine value if stdin is closed, but zero as a return value is used to indicate that this is the child process and not the parent, so we could not have allocated file descriptors starting from zero — which, again, is not very nice. So Oleg suggested that, at least for clone, we abuse the parent TID pointer argument — which is already reused as a return argument for CLONE_PARENT_SETTID — and, for legacy clone, make CLONE_PARENT_SETTID incompatible with CLONE_PIDFD. So what you get right now, even with legacy clone, is a PID back — normal behavior — but you also get a pidfd placed in the parent TID pointer argument. So no additional effort is needed to find out what the PID is for that pidfd. This is, again, different from FreeBSD: pdfork gives you a procdesc back, and then you need another system call, pdgetpid I think, to get the PID back for the process descriptor. For us it’s both at the same time, PID and pidfd, which is nice — you can choose what you’re operating on. Though if you didn’t want pidfds, you presumably wouldn’t have specified CLONE_PIDFD, but, yeah.
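A small illustration of that calling convention (my own sketch, not from the talk; note that raw clone(2) argument order is architecture-specific — this assumes the x86-64 order):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* since Linux 5.2 */
#endif

/* fork()-style raw clone that also hands back a pidfd via the
 * parent_tid slot. Returns the child PID (> 0 in the parent,
 * 0 in the child) and stores the pidfd through *pidfd. */
static pid_t fork_with_pidfd(int *pidfd)
{
    /* Raw clone on x86-64: (flags, stack, parent_tid, child_tid, tls).
     * Other architectures order the trailing arguments differently. */
    return syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, 0, pidfd, 0, 0);
}

/* Demo: both the PID (return value) and the pidfd (out argument)
 * come back from the same call. Returns the child's exit code. */
int clone_pidfd_demo(void)
{
    int pidfd = -1;
    pid_t pid = fork_with_pidfd(&pidfd);
    if (pid == 0)
        _exit(0);               /* child: exit immediately */
    if (pid < 0 || pidfd < 0)
        return -1;
    close(pidfd);

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent gets the PID as the normal return value and the pidfd through the reused pointer argument, so legacy clone callers can adopt pidfds without any type switching.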
And also, the reverse: if you have a pidfd but you don’t have a PID, and you want to learn what PID this pidfd refers to, we have fdinfo — /proc/<pid>/fdinfo/<fd> will currently give you the PID number, translated into the PID namespace of the proc instance you’re looking at, so you can parse out the PID. (There’s an alternative way of doing this.)
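A hypothetical helper for that fdinfo parsing might look like this (my own sketch; the open_self_pidfd() helper uses pidfd_open, which comes up later in the talk, and the fallback syscall number is the x86-64 one):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434 /* x86-64; since Linux 5.3 */
#endif

/* Helper for the demo: get a pidfd referring to ourselves. */
int open_self_pidfd(void)
{
    return (int)syscall(__NR_pidfd_open, getpid(), 0);
}

/* Parse the "Pid:" field out of /proc/self/fdinfo/<fd> for a pidfd.
 * Returns the PID, or -1 if the field is missing. */
int pidfd_to_pid(int pidfd)
{
    char path[64], line[256];
    int pid = -1;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        /* pidfd fdinfo contains a line of the form "Pid:\t<pid>" */
        if (sscanf(line, "Pid:\t%d", &pid) == 1)
            break;
    }
    fclose(f);
    return pid;
}
```

This is exactly the generic fdinfo-parsing pattern mentioned earlier as one reason FDs were chosen over UUIDs.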
Right — so now you can create pidfds race-free at process creation time. And with the new clone system call that I added, clone3, you also have a dedicated return argument: it’s now a structure which gives you back the pidfd, so it’s cleaner. Also, we’re no longer out of flags. Next, polling. This is work that was done by Joel, and something we discussed early on. It also had some controversy associated with it, because we had different requirements and different ideas of what we wanted from it. But basically, right now, (coughs) if you have a pidfd, you can get exit notifications even as a non-parent process. You don’t currently get the exit status, but you at least get notified that the process has exited. Well — technically, to be very precise — you get notified when the thread-group leader exits and the thread group is empty. Anything else should never be the case; that’s a bug. There was actually a bug where you could have a zombie thread-group leader while other threads were still alive, which shouldn’t happen, because then you have the problem that you can’t send signals to a zombie thread-group leader — you’d have to find all the threads and kill them one by one. But yes: when the thread-group leader exits and the thread group is empty, you get notified that the process is now gone. Which means, for a shared library, if you use pidfds you can now turn off SIGCHLD and say: I don’t want a signal when that process exits, because I have a pidfd in an epoll loop and I just want to be notified over the pidfd, not via SIGCHLD. That made a lot of people very happy, apparently. So you can hand off these pidfds, stuff them in epoll loops, and then watch for process exits.
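A minimal sketch of that pattern (my own example — it uses plain poll(2) rather than a full epoll loop, and assumes the x86-64 raw clone argument order):

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* since Linux 5.2 */
#endif

/* Fork a short-lived child with CLONE_PIDFD and block on the pidfd
 * until the child exits (pidfds became pollable in Linux 5.3).
 * Returns POLLIN on success, -1 on error. */
int wait_for_exit_via_pidfd(void)
{
    int pidfd = -1;
    /* Raw clone, x86-64 argument order: flags, stack, parent_tid, ... */
    pid_t pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, 0, &pidfd, 0, 0);
    if (pid == 0)
        _exit(0);               /* child: exit right away */
    if (pid < 0 || pidfd < 0)
        return -1;

    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    if (poll(&pfd, 1, -1) != 1)  /* becomes readable once the child exits */
        return -1;

    waitpid(pid, NULL, 0);       /* reap; the status isn't read off the pidfd */
    close(pidfd);
    return pfd.revents & POLLIN;
}
```

The notification fires as soon as the process becomes a zombie, before it is reaped, so a non-parent holding the pidfd sees the exit without ever touching SIGCHLD.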
Again, this is the code — it lives in two different files, but I compacted it together here. pidfd_poll lives in fork.c, because it’s a file method, and do_notify_pidfd lives in — don’t let me lie — signal.c? Oh, yes. So when a process exits and do_notify_parent is called, it calls do_notify_pidfd, and if you’ve been watching closely, you saw that struct pid has — hlist_head task... no, nonsense — a wait_queue_head_t wait_pidfd. So everyone who holds a reference to that struct pid via such a pidfd gets notified. Okay — that’s a pretty interesting piece of work, but it’s not as advanced as what FreeBSD has, to draw comparisons. FreeBSD has kqueue, and via kqueue they can also get notifications — such as the exit status — for non-parent processes. That’s because kqueue, in contrast to epoll, gives you information back from the kernel. With epoll, you can put your own stuff from userspace in there, and when you get a notification on the FD you get that stuff back; kqueue additionally gives you back information the kernel placed in there, especially for process descriptors. So you can also watch when one of these process descriptors forks or exits, and so on — that’s pretty nice. Maybe at some point in the future we can have something similar, but since we don’t have kqueue, and epoll can’t give you data back from the kernel — at least not as far as I know — there’s currently no nice way to do this, and we didn’t want to implement read(2) on pidfds at this point in time; Jann was against that as well, for security reasons. So currently, as a non-parent you can read proc, obviously, to get the exit status, but there is no easy way to read it off the pidfd. We can probably find ways to do this. And also in 5.3: pidfds
without CLONE_PIDFD. So, this came especially from the lmkd guys at Android, and systemd also had a use case for it. So if you had forked a process and wanted to get a pollable pidfd for it, then you couldn't do this, but with pidfd_open you can. You specify the PID and
then you get a pidfd for it which is pollable as well. And for 5.4, this is sort of the last bit
for the skeleton, more or less, is waiting, this is proposed
so it’s been sent to Linus, I don’t know if he’s going
to take it, maybe he won’t, but it’s up for the 5.4 merge window, is waiting through pidfds. We had a bit of a discussion how we exactly wanted to implement this. The most obvious way to a lot of people seemed to add a new type to the
waitid system call, P_PIDFD, which, yes, which you can specify a pidfd and then you can wait on it. So that’s sort of the core
There is a bunch of ideas that I still have that I
have started working on but haven’t fully, I’m not fully sure about
what semantics I want. Like, one of the things
for example that I like, and that a lot of userspace
projects also like is the kill-on-close semantics that FreeBSD has by default. So if you close the pidfd and it’s the last FD that refers to the struct file that stashes away struct pid, then the process
automatically gets SIGKILLed. We have it the other way around: on FreeBSD, if you want to keep the process alive even when the last pidfd has been closed, you need to specify a special flag called PD_DAEMON, I think. For us, it's the default. The process stays alive
even if the last pidfd is closed. But we could implement a flag at process creation time that lets you kill a process when the last FD
referring to it is closed. The problem is that on FreeBSD cleaning up struct files, basically the release
method that is called when a file is destroyed
is called synchronously, so by the time close returns on FreeBSD you are sure that the
release method has run. On Linux we have a work queue, so if you close an FD and close returns and it’s the last FD, the release method might
not necessarily have been called, because it's been added to a work queue; the kernel might decide, ah, I'm gonna do this a little bit later because, you know, memory pressure or whatever. So it's asynchronous, and it's kind of, I'm kind of on the
fence whether that still makes it a good idea or not, but I think I’m getting,
running out of time. So, I hope I could give you an overview of the API we sort of built, and hopefully convinced
you of its usefulness, and a little bit of
a glimpse into the future. There's one more thing that we would need to make
the shared library case completely usable, but yeah, I’m pretty happy
that we have this right now. And if you have any questions. – [Host] Also if, excuse me, if Jerome’s here, could you
please start hooking up? – [Christian] Oh, yes, I can also unplug. – [Host] Thank you. (audience claps) – [Man In White] Wait, I have a question. – Oh, yeah. – Have you given any thought
about integrating with cgroups? Like, cgroups 2’s cgroup.procs? – Mmm, no, what exactly
would be your idea? – Well, when you open cgroup.procs it gives you a list of
file descriptors, right, or PIDs, right? And between the time you do
something with those PIDs and you opened it, it
could’ve changed, right? – You could, ah, I think
we had a discussion about, did we have this discussion about it? A flag where you can kill
per cgroup or something? There are certainly ways
where we can think about this where you can make it so that you could take down a whole cgroup, for example, but so far I haven’t thought
about this, but yeah. To be honest, there are
so many possibilities that you could go with, that you could do, that it’s kind of sometimes hard to be, to stay calm about this, but I don’t want to rush things. But we’re always open for patches. Like, the pidfds stuff has
its own tree, so, yeah. One thing I would really like, and that I tried to make so, is this: often on Linux, if you create a process with a specific property at process creation, you can then later on use a prctl, or some other syscall, to change this property, to switch it on or switch it off, which is something that
I don’t really like especially if you think
about process delegation, process management delegation, you really want that
property to be created at process creation time, and that property sticks with the process until it goes away, and that's something
that I would really like. So basically treat it like a, almost like a capability
on the file descriptor, that would be something I
would really like to see more. Yeah, sorry. – So, do I understand
correctly that pidfds prevent PID recycling, the
recycling for the PIDs? – [Man] The PID can be
recycled but you won’t– – Oh, sorry, ah yes, I
should be very clear, I didn't mention this. Oh, this was a total snafu on my part. So, the PID can be recycled itself, so it doesn't pin the PID. It's not like, I think, on Windows, where the PID just stays around. It guarantees you that when the process has exited and, for example, you send a signal to it, you say kill this process, but the process is already gone, then the kernel will tell you ESRCH, which is errno for there is no such process, it has gone. So that pidfd is a stable
handle in the sense you can’t be tricked into operating on a process that doesn’t exist anymore. But the PID itself can be recycled. – But that means that, for
example, to track the cgroup I need to open the
pidfd, check the cgroup, and check if the pidfd is still valid. – Yes, there is code, there is a sample program
exactly for that reason, because I knew this would come up. There is a sample program in the samples directory, /pidfd something something, that shows you how to, in a race-free manner, turn an anonymous inode pidfd into a proc PID directory. It basically involves parsing
out the proc pidfd info, the PID from that file, then opening that file
and then sending a signal and checking whether it’s
still the same process. So, it's already in there, you can see how this can be done in a race-free manner; we exactly thought about this case– – [Host] Okay, I'm sorry to interrupt but, thank you very much. Let's thank the speaker and
let’s welcome our next speaker. (audience claps)
