locks and kernel randomness...

Wed Feb 25 01:01:15 UTC 2015

> On Feb 24, 2015, at 5:29 PM, John-Mark Gurney <jmg at funkthat.com> wrote:
> 
> Ian Lepore wrote this message on Tue, Feb 24, 2015 at 17:02 -0700:
>> On Tue, 2015-02-24 at 15:19 -0800, John-Mark Gurney wrote:
>>> Alfred Perlstein wrote this message on Tue, Feb 24, 2015 at 16:16 -0500:
>>>> On 2/24/15 1:25 PM, John-Mark Gurney wrote:
>>>>> Alfred Perlstein wrote this message on Tue, Feb 24, 2015 at 13:04 -0500:
>>>>>> On 2/24/15 12:40 PM, John-Mark Gurney wrote:
>>>>>>> Warner Losh wrote this message on Tue, Feb 24, 2015 at 07:56 -0700:
>>>>>>>> Then again, if you want to change random(), provide a weak_random() that's
>>>>>>>> the traditional non-crypto thing that's fast and lockless. That would make it easy
>>>>>>>> to audit in our tree. The scheduler doesn't need cryptographic randomness, it
>>>>>>>> just needs to make different choices sometimes to ensure its notion of fairness.
>>>>>>> 
>>>>>>> I do not support having a weak_random...  If the consumer is sure
>>>>>>> enough that you don't need a secure random, then they can pick an LCG
>>>>>>> and implement it themselves and deal (or not) w/ the locking issues...
>>>>>>> 
>>>>>>> It appears that the scheduler had an LCG but for some reason the authors
>>>>>>> didn't feel like using it here..
>>>>>> 
>>>>>> The way I read this argument is that no low quality sources of
>>>>>> randomness shall be allowed.
>>>>> 
>>>>> No, I'm saying that the person who needs the predictable randomness
>>>>> needs to do extra work to get it...  If they care that much about
>>>>> performance/predictability/etc, then a little extra work won't hurt
>>>>> them..  And if they don't know what an LCG is, then they aren't
>>>>> qualified to make the decision that a weaker RNG is correct for their
>>>>> situation..
>>>>> 
>>>>>> So we should get rid of rand(3)?  When do we deprecate that?
>>>>> 
>>>>> No, we should replace it w/ proper randomness like OpenBSD has...
>>>>> I'm willing to go that far and I think FreeBSD should...  OpenBSD has
>>>>> done a lot of leg work in tracking down ports that correctly use
>>>>> rand(3), and letting them keep their deterministic randomness, while
>>>>> the remaining get real random..
>>>>> 
>>>>>> Your argument doesn't hold water.
>>>>> 
>>>>> Sorry, your argument sounds like it's from the 90's when we didn't
>>>>> know any better on how to make secure systems...  Will you promise to
>>>>> audit all new uses of randomness in the system to make sure that they
>>>>> are using the correct, secure API?
>>>>> 
>>>>> Considering that it's been recommended that people NOT use
>>>>> read_random(9) for 14 years, yet people continue to use it in new code,
>>>>> demonstrates that people do not know what they are doing (wrt
>>>>> randomness), and the only way to make sure they do the correct, secure
>>>>> thing is to only provide the secure API...
>>>> 
>>>> That speaks to more of the drive-by czars we have in BSD land that take 
>>>> an area with a hard lock and then go away.
>>> 
>>> It also speaks to the armchair quarterbacking that stops people from
>>> wanting to contribute...  Someone comes along and tries to make an
>>> improvement, then x number of people raise their arms about oh, I
>>> still use grdc (sorry dteske, not trying to pick on you) as tcp keep
>>> alive, and then the person abandons or leaves incomplete the work that
>>> they started...
>>> 
>>> I was very close to NOT posting the email to -arch, but after various
>>> questions from twitter, and adrian's continued pleas to talk changes
>>> more publicly, I decided to do so...  If people continue to react this
>>> way, it just demonstrates that doing things publicly is NOT a way to
>>> get things to move forward in FreeBSD, and people will continue to do
>>> things in private...  Luckily, I'm consulting, so I have a few more
>>> hours (for now) to fight these fights, but if it continues to be an
>>> issue, we'll continue to have this problem of czars that come in, drop
>>> a bunch of code and then leave, because dealing w/ this becomes too
>>> expensive...
>>> 
>>> So far, only ONE person has commented on the patch on reviews, and that
>>> is delphij...
>>> 
>>>> Also, do not want to attempt to be like openbsd, learn from for sure, 
>>>> but to be like, no way.
>>> 
>>> I'm fine not being like OpenBSD, but as you said, we should learn from
>>> them, and leverage their work...  Though I agree w/ OpenBSD's work to
>>> replace random(3), it also isn't who FreeBSD is, but if we want to
>>> continue to be relevant, we do need to take security seriously, and
>>> IMO, this is one of those steps.
>>> 
>>> If someone does find a performance issue w/ my patch, I WILL work with
>>> them on a solution, but I will not work w/ people who make unfounded
>>> claims about the impact of this work...
>> 
>> Yeah, the problem could be all that.
>> 
>> Or it could be people who "collaborate" by saying I'm going to make this
>> change.  I'm not going to justify it in any way, and if anybody
> 
> I have justified it…

I think you should explain what you explained to me on IRC.

Specifically, through a timing attack, you can find (by default) the lower 7
bits of the value returned by random(). Since random() is not MP safe,
it can sometimes return the same value twice (when a race between callers
happens to be lost). This means other users can see this data.

In this instance, it isn’t so much what sched_ule is doing, but rather what
others are able to glean from it. Now, it isn’t clear that these 7 bits are a big
deal since you also have to lose the race and know the race was lost. Other
things in the system might care if you expose this state.

Also, in this specific case, it can use the existing random generator in
sched_ule to get this number as well. It’s run on a time scale of ticks, with
some jitter. It doesn’t need to be using random(), but it isn’t clear whether
the get_cyclecount() stuff provides enough low-order bits that are random
enough to meet sched_ule’s needs. It also isn’t clear that it doesn’t (the only
cause for concern is a beat pattern with a cycle count that’s low-resolution,
but I don’t think we have any of these on SMP workloads).
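
For reference, a lockless per-CPU generator of the kind being discussed could
look roughly like the sketch below. This is a userland illustration, not the
sched_ule code: NCPU, the array, and the function names are hypothetical, the
multiplier and increment are just one classic LCG choice, and seeding from a
cycle counter sample is an assumption about how one might start it.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 4	/* hypothetical CPU count for this sketch */

/* One private state word per CPU, so updating it needs no lock. */
static uint32_t cpu_rng_state[NCPU];

/* Seed one CPU's generator, e.g. from a cycle counter sample (assumption). */
static void
cpu_rng_seed(unsigned cpu, uint32_t seed)
{
	assert(cpu < NCPU);
	cpu_rng_state[cpu] = seed;
}

/*
 * Advance this CPU's LCG and return its upper 16 bits. The low-order bits
 * of a power-of-two-modulus LCG repeat with short periods, so they are
 * shifted away.
 */
static uint32_t
cpu_rng_next(unsigned cpu)
{
	assert(cpu < NCPU);
	cpu_rng_state[cpu] = cpu_rng_state[cpu] * 69069u + 5u;
	return (cpu_rng_state[cpu] >> 16);
}
```

Because each CPU only touches its own word, two CPUs can never hand out the
same value by racing on shared state, and nothing here is shared with other
consumers of random().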

Ideally, since there’s a small chance of a performance regression, we should
find some benchmark to run that would exercise this code path and see if
a regression can be measured or not. After looking at the code, I’m skeptical
that there would be one. But data would settle this once and for all, since this
is an interaction with the scheduler, which historically has made people very
nervous. 

>> disagrees I'm just going to dismiss their concerns and demand that THEY
>> hold the burden of proof that my unnecessary change is harmful, and if
> 
> How many audits of the random() calls in the kernel have you done?
> 
> You've raised concerns, I've said I've looked and don't see any, how can
> I prove a negative?  What can I do to convince you that you're wrong?
> All you have to do to convince me I'm wrong is show me a place in the
> kernel where it is a performance issue.  Is it really that hard to
> come up w/ one?

You can prove a negative with benchmarks. Then we’d be arguing over
their efficacy, but at least that would be progress :) Or you can strongly
suggest a negative by failing to reject the null hypothesis of no change.
That too would be progress.
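
As a rough sketch of what "failing to reject the null hypothesis" means for
two sets of benchmark timings, something like the following would do. The
function names are hypothetical, it uses Welch's t statistic as one reasonable
choice of test, and the caller has to supply the critical value from a t table
(for example 2.306 for a two-sided 5% test with 8 degrees of freedom). It
compares squared quantities so it doesn't need sqrt().

```c
#include <assert.h>
#include <stddef.h>

/* Sample mean of n benchmark timings. */
static double
mean(const double *x, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += x[i];
	return (sum / (double)n);
}

/* Unbiased sample variance around mean m (divides by n - 1). */
static double
variance(const double *x, size_t n, double m)
{
	double ss = 0.0;

	for (size_t i = 0; i < n; i++)
		ss += (x[i] - m) * (x[i] - m);
	return (ss / (double)(n - 1));
}

/*
 * Returns 1 if the absolute value of Welch's t statistic for the two
 * samples exceeds t_crit, i.e. "no change" is rejected; 0 otherwise.
 * Compares t^2 against t_crit^2 to avoid a square root.
 */
static int
rejects_no_change(const double *before, size_t nb,
    const double *after, size_t na, double t_crit)
{
	assert(nb >= 2 && na >= 2);
	double mb = mean(before, nb);
	double ma = mean(after, na);
	double se2 = variance(before, nb, mb) / (double)nb +
	    variance(after, na, ma) / (double)na;
	double diff = ma - mb;

	return (diff * diff > t_crit * t_crit * se2);
}
```

A return of 0 doesn't prove there is no regression, it only says the runs
can't tell the two apart at that threshold. ministat(1) in the base system
does a comparable comparison on real data.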

>> they don't, then screw the collaboration thing, I'm just going to do it
>> anyway.
> 
> It goes both ways, I see it that you're objecting w/o complete
> information, and no matter what evidence or work I do, you'll just
> ignore it, and still say it isn't correct or that there's this
> unprovable condition that prevents the work from going in…

Data is going to break this log-jam.

Warner