feat: include ancestors in process events #2938
Hello. Update: I found even more problems, so I'm converting this back to a draft for now.
I'll look shortly. Sorry, I was travelling and then catching up. I should be able to get to this today or tomorrow, thanks!
e2e9ea1 to 558ee86
I think i misunderstood the purpose of both So now there seems to be no real reason to return The biggest obstacle now is that due to current implementation of process cleanup, as described in commit 45745a0, it becomes impossible to reconstruct full process ancestry in some cases. Assume the following scenario:
If n > 3, then all processes with id < n-1 would have their refcnt set to 0 and, as a result, be removed from the event cache, breaking the ancestry chain. I don't think I can resolve that without introducing potentially breaking changes, as this PR already has quite a lot of code to review. There is also still an inconsistency in the ancestors field's value across the protobuf messages in api/v1/tetragon/tetragon.proto and vendor/github.com/cilium/tetragon/api/v1/tetragon/tetragon.proto. I'm still not sure how to properly handle that. @jrfastab, may I ask you for a code review now? Sorry for the delay.
558ee86 to 9332a4d
pkg/process/process.go
Outdated
for process.process.Pid.Value > 2 {
	if process, err = procCache.get(process.process.ParentExecId); err != nil {
		logger.GetLogger().WithError(err).WithField("id in event", execId).Debugf("ancestor process not found in cache")
		return nil
Should we return the ancestors we were able to get here and just increment the error metric?
I guess it depends on whether we want to try to redo it later if we were not able to do it this time. I chose to return nil here so I could then add a condition to the `ec.Add()` call in `GetProcessExec()`, `GetProcessExit()`, etc. that adds the corresponding events to the event cache if we were not able to get all ancestors.
Here:
if useCache {
	if ec := eventcache.Get(); ec != nil &&
		(ec.Needed(tetragonProcess) ||
			(tetragonProcess.Pid.Value > 1 && ec.Needed(tetragonParent)) ||
			(option.Config.EnableProcessAncestors && tetragonParent.Pid.Value > 2 && tetragonAncestors == nil)) {
		ec.Add(proc, tetragonEvent, event.Unix.Msg.Common.Ktime, event.Unix.Process.Ktime, event)
		return nil
	}
}
Later, in `Retry()`/`RetryInternal()`, I call `eventcache.CacheRetries(eventcache.AncestorsInfo).Inc()` if `GetAncestorProcessesInternal()` returns nil again.
I guess we could return both the ancestors slice and an error from `GetAncestorProcessesInternal()`, but then what should we do in `Retry()`/`RetryInternal()`? Or maybe we shouldn't even bother to retry if we were unable to get all ancestors?
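For illustration, the "return both the slice and an error" option discussed above could look roughly like the sketch below. It is not the PR's code: `proc`, `collectAncestors`, `errAncestorMissing` and the map-based cache are hypothetical stand-ins for `ProcessInternal`, `GetAncestorProcessesInternal()` and the process cache.

```go
package main

import (
	"errors"
	"fmt"
)

// proc is a hypothetical stand-in for process.ProcessInternal.
type proc struct {
	execID       string
	parentExecID string
	pid          uint32
}

// errAncestorMissing signals that the chain was cut short by a cache miss.
var errAncestorMissing = errors.New("ancestor process not found in cache")

// collectAncestors walks parent links starting at execID. It returns every
// ancestor it managed to resolve; the error is non-nil when the walk stopped
// early, so the caller can decide whether to use the partial chain, retry,
// or only bump a metric.
func collectAncestors(cache map[string]*proc, execID string) ([]*proc, error) {
	p, ok := cache[execID]
	if !ok {
		return nil, errAncestorMissing
	}
	var ancestors []*proc
	for p.pid > 2 {
		parent, ok := cache[p.parentExecID]
		if !ok {
			return ancestors, fmt.Errorf("%w: exec id %q", errAncestorMissing, p.parentExecID)
		}
		ancestors = append(ancestors, parent)
		p = parent
	}
	return ancestors, nil
}

func main() {
	cache := map[string]*proc{
		"c": {execID: "c", parentExecID: "b", pid: 30},
		"b": {execID: "b", parentExecID: "a", pid: 20},
		// "a" is deliberately absent to simulate an evicted ancestor.
	}
	ancestors, err := collectAncestors(cache, "c")
	fmt.Println(len(ancestors), err != nil)
}
```

With this shape, the retry path can tell "nothing resolved" apart from "chain partially resolved", which is the distinction the question above turns on.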
ok, it's fine
pkg/grpc/exec/exec.go
Outdated
@@ -427,6 +470,20 @@ func (msg *MsgExitEventUnix) RetryInternal(ev notify.Event, timestamp uint64) (*
	internal, parent := process.GetParentProcessInternal(msg.ProcessKey.Pid, timestamp)
	var err error

	if option.Config.EnableProcessAncestors && ev.GetAncestors() == nil &&
Could we put this in a function and call it from all the other places?
I think we could, but I might have to reimplement it due to the process cleanup problem, which I've described in the latest PR comment. Furthermore, I'm still not sure what to do about that problem, as all my attempts to deal with it so far have not been very reliable. I would really appreciate some advice.
ok, yea.. seems like a problem.. not sure yet how to deal with that
I guess I can make it so `getCleanupEvent` returns `MsgProcessCleanupEventUnix` only if the `MsgExecveEventUnix` event's process has the `clone` flag.
func (msg *MsgExecveEventUnix) getCleanupEvent() *MsgProcessCleanupEventUnix {
	flags := strings.Join(readerexec.DecodeCommonFlags(msg.Unix.Process.Flags), " ")
	if msg.Unix.Msg.CleanupProcess.Ktime == 0 || !strings.Contains(flags, "clone") {
		return nil
	}
	return &MsgProcessCleanupEventUnix{
		PID:   msg.Unix.Msg.CleanupProcess.Pid,
		Ktime: msg.Unix.Msg.CleanupProcess.Ktime,
	}
}
Then, in the `MsgExecveEventUnix` event, I can make it so `parent.RefInc("parent")` is called only if `tetragonProcess` has either of the `clone` / `procFS` flags.
if parent != nil && (strings.Contains(tetragonProcess.Flags, "clone") ||
	strings.Contains(tetragonProcess.Flags, "procFS")) {
	parent.RefInc("parent")
}
So then, at the `MsgExitEventUnix` event, all `exec()` processes would have their refcnt set to 1, `clone()` processes to 0, and the actual parent to 2. We would then just have to call `ancestor.RefDec("parent")` for all ancestors returned by `GetAncestorProcessesInternal()` with the condition `process.process.Pid.Value == tetragonProcess.Pid.Value`, where `process` is the process inside `GetAncestorProcessesInternal()` and `tetragonProcess` is the process from the `MsgExitEventUnix` event. So basically, `GetAncestorProcessesInternal()` would look like this:
// GetAncestorProcessesInternal returns a slice representing a continuous sequence of ancestors of
// the process, including the immediate parent, that satisfy a given condition. The last element of
// the slice corresponds to the ancestor that no longer meets the given condition.
// The initial process, identified by the execId parameter, must meet the condition as well.
func GetAncestorProcessesInternal(execId string, condition func(*ProcessInternal) bool) []*ProcessInternal {
	var process *ProcessInternal
	var err error
	if process, err = procCache.get(execId); err != nil {
		return nil
	}
	var ancestors []*ProcessInternal
	for condition(process) {
		if process, err = procCache.get(process.process.ParentExecId); err != nil {
			logger.GetLogger().WithError(err).WithField("id in event", execId).Debug("ancestor process not found in cache")
			return nil
		}
		ancestors = append(ancestors, process)
	}
	return ancestors
}
This approach has an obvious flaw, of course. If the `exec()` ancestor with id=M has already been removed from the process cache at that time for some reason, then we won't be able to call `RefDec()` for any of the previous M ancestors.
@tpapagian any idea on this?
I will try to write down an example to understand exactly the situation and post that here. This may help to understand what we miss exactly.
Right, sorry. Currently, if I'm not mistaken, it should be like this:
...
exec() (id=1) (refcnt=1)
...
exec() (id=1) (refcnt=2)
clone() (id=2) (refcnt=1) - clone: process=1, parent++
...
exec() (id=1) (refcnt=2)
clone() (id=2) (refcnt=0) - cleanup: process--, parent--
exec() (id=3) (refcnt=1) - exec: process=1, parent++
...
exec() (id=1) (refcnt=1)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=1) - cleanup: process--, parent--
exec() (id=4) (refcnt=1) - exec: process=1, parent++
...
exec() (id=1) (refcnt=1)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=0)
exec() (id=4) (refcnt=1) - cleanup: process--, parent--
exec() (id=5) (refcnt=1) - exec: process=1, parent++
...
exec() (id=1) (refcnt=1)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=0)
exec() (id=4) (refcnt=0)
exec() (id=5) (refcnt=1) - cleanup: process--, parent--
exec() (id=6) (refcnt=1) - exec: process=1, parent++
...
exec() (id=1) (refcnt=1)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=0)
exec() (id=4) (refcnt=0)
exec() (id=5) (refcnt=0)
exec() (id=6) (refcnt=1) - cleanup: process--, parent--
exec() (id=7) (refcnt=1) - exec: process=1, parent++
...
exec() (id=1) (refcnt=1)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=0)
exec() (id=4) (refcnt=0)
exec() (id=5) (refcnt=0)
exec() (id=6) (refcnt=0)
exec() (id=7) (refcnt=0)
exit() (id=7) - exit: process--, parent--
Now, if I call `RefInc` for all ancestors inside each exec, it will look like this:
...
exec() (id=1) (refcnt=1)
...
exec() (id=1) (refcnt=2)
clone() (id=2) (refcnt=1) - clone: process=1, parent++
...
exec() (id=1) (refcnt=2)
clone() (id=2) (refcnt=0) - cleanup: process--, parent--
exec() (id=3) (refcnt=1) - exec: process=1, parent++, ancestors++
...
exec() (id=1) (refcnt=2)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=1) - cleanup: process--, parent--
exec() (id=4) (refcnt=1) - exec: process=1, parent++, ancestors++
...
exec() (id=1) (refcnt=3)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=1)
exec() (id=4) (refcnt=1) - cleanup: process--, parent--
exec() (id=5) (refcnt=1) - exec: process=1, parent++, ancestors++
...
exec() (id=1) (refcnt=4)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=2)
exec() (id=4) (refcnt=1)
exec() (id=5) (refcnt=1) - cleanup: process--, parent--
exec() (id=6) (refcnt=1) - exec: process=1, parent++, ancestors++
...
exec() (id=1) (refcnt=5)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=3)
exec() (id=4) (refcnt=2)
exec() (id=5) (refcnt=1)
exec() (id=6) (refcnt=1) - cleanup: process--, parent--
exec() (id=7) (refcnt=1) - exec: process=1, parent++, ancestors++
...
exec() (id=1) (refcnt=4)
clone() (id=2) (refcnt=0)
exec() (id=3) (refcnt=2)
exec() (id=4) (refcnt=1)
exec() (id=5) (refcnt=0)
exec() (id=6) (refcnt=0)
exec() (id=7) (refcnt=0)
exit() (id=7) - exit: process--, parent--, ancestors--
Let me provide my thoughts. Let's assume the following in order to make things work:
- `getAncestors` from exec calls `RefInc` for each ancestor
- `getAncestors` from exit calls `RefDec` for each ancestor
- `getAncestors` from clone calls `RefDec` for each ancestor
- `getCleanupEvent` calls `RefDec` for each ancestor (apart from the process and parent, which it already handles)
Initially we have 1 process:
init /usr/init pid=1 exec_id=1 refCnt=1
After calling `clone()` on process with pid=1:
/usr/init pid=1 exec_id=1 refCnt=2 [+1 from clone]
clone /usr/init pid=2 exec_id=2 refCnt=1 [+1 from clone]
After calling `clone()` on process with pid=2:
/usr/init pid=1 exec_id=1 refCnt=3 [+1 from getAncestors]
/usr/init pid=2 exec_id=2 refCnt=2 [+1 from clone]
clone /usr/init pid=3 exec_id=3 refCnt=1 [+1 from clone]
After calling `exec("/usr/a")` on process with pid=3:
/usr/init pid=1 exec_id=1 refCnt=3 [+1 from getAncestors | -1 from getCleanupEvent]
/usr/init pid=2 exec_id=2 refCnt=2 [+1 from getAncestors | -1 from getCleanupEvent]
/usr/init pid=3 exec_id=3 refCnt=1 [+1 from exec | -1 from getCleanupEvent]
exec /usr/a pid=3 exec_id=4 refCnt=1 [+1 from exec]
After calling `clone()` on process with pid=3:
/usr/init pid=1 exec_id=1 refCnt=4 [+1 from getAncestors]
/usr/init pid=2 exec_id=2 refCnt=3 [+1 from getAncestors]
/usr/init pid=3 exec_id=3 refCnt=2 [+1 from getAncestors]
/usr/a pid=3 exec_id=4 refCnt=2 [+1 from clone]
clone /usr/a pid=4 exec_id=5 refCnt=1 [+1 from clone]
After calling `exec("/usr/b")` on process with pid=4:
/usr/init pid=1 exec_id=1 refCnt=4 [+1 from getAncestors | -1 from getCleanupEvent]
/usr/init pid=2 exec_id=2 refCnt=3 [+1 from getAncestors | -1 from getCleanupEvent]
/usr/init pid=3 exec_id=3 refCnt=2 [+1 from getAncestors | -1 from getCleanupEvent]
/usr/a pid=3 exec_id=4 refCnt=2 [+1 from getAncestors | -1 from getCleanupEvent]
/usr/a pid=4 exec_id=5 refCnt=1 [+1 from exec | -1 from getCleanupEvent]
exec /usr/b pid=4 exec_id=6 refcnt=1 [+1 from exec]
After calling `exit()` on process with pid=4:
/usr/init pid=1 exec_id=1 refCnt=3 [-1 from getAncestors]
/usr/init pid=2 exec_id=2 refCnt=2 [-1 from getAncestors]
/usr/init pid=3 exec_id=3 refCnt=1 [-1 from getAncestors]
/usr/a pid=3 exec_id=4 refCnt=1 [-1 from getAncestors]
/usr/a pid=4 exec_id=5 refCnt=0 [-1 from exit] [DELETED]
exit /usr/b pid=4 exec_id=6 refcnt=0 [-1 from exit] [DELETED]
After calling `exit()` on process with pid=3:
/usr/init pid=1 exec_id=1 refCnt=2 [-1 from getAncestors]
/usr/init pid=2 exec_id=2 refCnt=1 [-1 from getAncestors]
/usr/init pid=3 exec_id=3 refCnt=0 [-1 from exit] [DELETED]
exit /usr/a pid=3 exec_id=4 refCnt=0 [-1 from exit] [DELETED]
After calling `exit()` on process with pid=2:
/usr/init pid=1 exec_id=1 refCnt=1 [-1 from exit]
exit /usr/init pid=2 exec_id=2 refCnt=0 [-1 from exit] [DELETED]
After calling `exit()` on process with pid=1:
exit /usr/init pid=1 exec_id=1 refCnt=0 [-1 from exit] [DELETED]
As I initially said, in order to make that work, we need to make some modifications to `getAncestors` and `getCleanupEvent`. I also haven't thought about how this will work when we need to add something to the eventcache. But we will find out those details if needed.
Does this make sense? Am I missing anything?
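To make the core idea easier to reason about, here is a deliberately simplified toy model of "every entry pins its whole ancestor chain for as long as it lives". It only covers clone-like creation and exit; the exec and `getCleanupEvent` interplay from the walkthrough above is left out. `entry`, `procCache`, `add` and `release` are hypothetical names, not Tetragon's API.

```go
package main

import "fmt"

// entry is a hypothetical, stripped-down process cache entry.
type entry struct {
	execID string
	parent string // exec id of the parent; "" for the root
	refCnt int
}

type procCache map[string]*entry

// ancestors resolves the parent chain of id, nearest ancestor first.
func (c procCache) ancestors(id string) []*entry {
	var out []*entry
	e := c[id]
	for e != nil && e.parent != "" {
		e = c[e.parent]
		if e != nil {
			out = append(out, e)
		}
	}
	return out
}

// add registers a new entry holding its own reference and pins every ancestor.
func (c procCache) add(id, parent string) {
	c[id] = &entry{execID: id, parent: parent, refCnt: 1}
	for _, a := range c.ancestors(id) {
		a.refCnt++
	}
}

// release drops the entry's own reference and unpins every ancestor,
// deleting whatever reaches zero. The chain is resolved before any deletion.
func (c procCache) release(id string) {
	e, ok := c[id]
	if !ok {
		return
	}
	chain := append([]*entry{e}, c.ancestors(id)...)
	for _, x := range chain {
		x.refCnt--
		if x.refCnt == 0 {
			delete(c, x.execID)
		}
	}
}

func main() {
	c := procCache{}
	c.add("1", "")
	c.add("2", "1")
	c.add("3", "2")
	fmt.Println(c["1"].refCnt, c["2"].refCnt, c["3"].refCnt) // 3 2 1

	c.release("2")                     // the middle process exits first
	fmt.Println(len(c.ancestors("3"))) // 2: the chain is still intact

	c.release("3")
	fmt.Println(len(c)) // 1: only the root remains
}
```

The counts after the two clone-like steps (3, 2, 1) match the second `clone()` step in the walkthrough; the point of the sketch is that an ancestor that exits early stays resolvable until its last descendant is released.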
It does make sense, yes. This was the first thing I tried and, unironically, I think it worked the best. I chose to try something else because `getAncestors` and `getCleanupEvent` here do a lot of +1/-1 calls that basically just cancel each other out. So we're doing most of the work for no actual value.
Plus, I couldn't call RefInc/RefDec for ancestors right away inside `GetAncestorProcessesInternal`, because at that moment I couldn't guarantee that I would be able to collect all of them there. So I looped over all ancestors at least twice: the first time to collect them all, the second to call RefInc/RefDec for each.
The more processes I had with lots of ancestors, the worse it seemed. So I eventually decided to abandon that approach and try to call RefInc/RefDec only when I need to. Honestly, I'm not sure if that was the right call.
I'll implement your suggestions and come back, thank you.
> because getAncestors and getCleanupEvent here do a lot of +1/-1 calls that basically just cancel each other

Yes, I agree on that. But this is just a matter of increasing/decreasing a counter (i.e. minimal overhead) in order to get a simpler design. Making reference counting correct is a bit challenging, so I would prioritize a simpler design here.

> Honestly, i'm not sure, if that was the right call.

Not sure either, to be honest. I just provided my thoughts on how this could work, but I haven't spent time to understand all the details. So what I provided in my previous message could have issues as well when dealing with the eventcache etc.

> I'll implement your suggestions and come back, thank you.

I would propose spending some time first to understand all the details, and please ask any questions here in case something does not make sense. Once everything is clear, you can proceed with the implementation.
pkg/filters/binary_regex.go
Outdated
process = GetProcess(ev)
var processes []*tetragon.Process
switch level {
case 0: // Process
please add const/iota for Process/Parent/Ancestors values
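A minimal sketch of what the requested constants could look like. The type and constant names (`filterLevel`, `levelProcess`, and so on) are hypothetical; the PR may name them differently.

```go
package main

import "fmt"

// filterLevel is a hypothetical name for the selector the filter switches on.
type filterLevel int

const (
	levelProcess   filterLevel = iota // 0: the process itself
	levelParent                       // 1: the immediate parent
	levelAncestors                    // 2: ancestors beyond the parent
)

// String makes log lines and test failures readable.
func (l filterLevel) String() string {
	switch l {
	case levelProcess:
		return "process"
	case levelParent:
		return "parent"
	case levelAncestors:
		return "ancestors"
	default:
		return "unknown"
	}
}

func main() {
	for _, l := range []filterLevel{levelProcess, levelParent, levelAncestors} {
		fmt.Printf("%d=%s\n", int(l), l)
	}
}
```

The `switch level` in the hunk above can then use `case levelProcess:` instead of `case 0: // Process`, which removes the need for the trailing comments.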
pkg/grpc/exec/exec.go
Outdated
proc := ev.GetProcess()
parent := ev.GetParent()
tetragonProcess := ev.GetProcess()
tetragonParent := ev.GetParent()
this rename makes the whole change more complex, please keep proc and parent
I renamed them for consistency, mostly because `GetProcessExec` uses these names for the same objects, i.e. `proc` for `*process.ProcessInternal` and `tetragonProcess` for `*tetragon.Process`.
In some functions a process of type `*process.ProcessInternal` is named `proc`, in others `internal`. And in `MsgCloneEventUnix.Retry`, `proc` has type `*tetragon.Process`. It's kind of inconvenient.
Also, I would say that tracing.go is much easier to understand, as it uses the same naming in all functions. I'll revert these changes, but in the future I think it would be better to make variable names more consistent.
ok, I'd prefer this rename in a separate change that wouldn't change the behavior
	return err
}
parent.RefInc("parent")
ev.SetParent(parent.UnsafeGetProcess())
same here, it looks like we can keep the same path as original code, just add eventcache.CacheRetries to the error path?
pkg/grpc/exec/exec.go
Outdated
(ec.Needed(tetragonEvent.Process) || (tetragonProcess.Pid.Value > 1 && ec.Needed(tetragonEvent.Parent))) {
(ec.Needed(tetragonProcess) ||
(tetragonProcess.Pid.Value > 1 && ec.Needed(tetragonParent)) ||
(option.Config.EnableProcessAncestors && tetragonParent.Pid.Value > 2 && tetragonAncestors == nil)) {
this condition is used everywhere.. could we add `ec.NeededAncestors`?
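One possible shape for such a helper, sketched as a free function so that it runs on its own. `process` and `neededAncestors` are hypothetical; in the PR the check would presumably be a method on the event cache that reads `option.Config.EnableProcessAncestors` itself. The nil guard on the parent is an addition that is not present in the condition quoted above.

```go
package main

import "fmt"

// process is a hypothetical stand-in for *tetragon.Process.
type process struct {
	pid uint32
}

// neededAncestors reports whether an event should be parked in the event
// cache because ancestor collection is enabled, the parent sits deep enough
// in the tree to have ancestors worth reporting, and none were collected.
func neededAncestors(enabled bool, parent *process, ancestors []*process) bool {
	return enabled && parent != nil && parent.pid > 2 && ancestors == nil
}

func main() {
	parent := &process{pid: 1234}
	fmt.Println(neededAncestors(true, parent, nil))                   // true
	fmt.Println(neededAncestors(true, parent, []*process{{pid: 1}})) // false
	fmt.Println(neededAncestors(false, parent, nil))                  // false
}
```

Each call site then adds a single extra `||` term instead of repeating the three-part condition.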
9332a4d to bd44a87
Hello @olsajiri @tpapagian. I made some changes based on your suggestions:
I also renamed some variables for the sake of consistency.
Thanks for the update, I will review that possibly on Monday.
Allow to include ancestors of the process beyond the immediate parent (up to PID 1 / PID 2) in process_exec, process_exit, process_uprobe, process_kprobe, process_lsm, process_tracepoint events via `--enable-process-ancestors` option. Turn `--enable-process-ancestors` option off by default. Signed-off-by: t0x01 <[email protected]>
Implement a new export filter that can filter over ancestor binary names using RE2 regular expressions. Signed-off-by: t0x01 <[email protected]>
Add information about ancestors, ancestor filter and ancestors related metrics to documentation. Signed-off-by: t0x01 <[email protected]>
bd44a87 to 815a7d7
Fixes 2420
Description
Reason: Option to include all ancestors of the process in process events can be very useful for observability and filtering purposes. E.g. to apply complex correlation rules later in data processing pipeline, or to filter out extra events.
Changes made:
- `enable-process-ancestors` from the config file. Turn option `enable-process-ancestors` off by default.
- If `enable-process-ancestors` is set, try to include ancestors (up to PID 1/PID 2) of the process beyond the immediate parent in `process_exec`, `process_exit`, `process_uprobe`, `process_kprobe`, `process_lsm`, `process_tracepoint` events in the respective protobuf message for the given process.
- If `enable-process-ancestors` is set and we were unsuccessful when trying to collect the process's ancestors in an event, add that event to the eventcache for reprocessing.
- If `enable-process-ancestors` is set, try to collect the process's ancestors again.