duplicate popularity-contest ID


duplicate popularity-contest ID

Bill Allombert-4
Dear Debian developers,

Each Debian popularity-contest submitter is supposed to have
a different random 128bit popcon ID.
However, the popularity-contest server <https://popcon.debian.org>
receives a lot of submissions with identical popcon IDs, which causes them
to be treated as a single submission.
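
For illustration, a minimal sketch of what "a different random 128bit ID" per submitter amounts to. This is not the code the popcon postinst actually uses (the thread does not show it); the function name `new_popcon_id` is hypothetical.

```python
import secrets


def new_popcon_id() -> str:
    """Return a random 128-bit identifier as 32 lowercase hex characters."""
    # 16 random bytes = 128 bits, drawn from the OS random source.
    return secrets.token_hex(16)


if __name__ == "__main__":
    print(new_popcon_id())
```

An ID like this is only unique if it is generated on each installed system; copying a configuration file that already contains one copies the ID too.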

I am not quite sure what the reason for this problem is.
Maybe people use prebuilt system images with a pregenerated
/etc/popularity-contest.conf file (instead of one generated
by the popcon postinst).

I am not sure what to do about this.

Cheers,
--
Bill. <[hidden email]>

Imagine a large red swirl here.


Re: duplicate popularity-contest ID

"Yao Wei (魏銘廷)"-2


On Aug 5, 2019, at 20:29, Bill Allombert <[hidden email]> wrote:

> I am not quite sure what the reason for this problem is.
> Maybe people use prebuilt system images with a pregenerated
> /etc/popularity-contest.conf file (instead of one generated
> by the popcon postinst).

Could this be caused by Debian-live installer based on Calamares?

Yao Wei

(This email is sent from a phone; sorry for HTML email if it happens.)

Re: duplicate popularity-contest ID

Jonathan Carter (highvoltage)-2
Hey Yao and Bill

On 2019/08/05 14:31, "Yao Wei (魏銘廷)" wrote:
>> I am not quite sure what the reason for this problem is.
>> Maybe people use prebuilt system images with a pregenerated
>> /etc/popularity-contest.conf file (instead of one generated
>> by the popcon postinst).
>
> Could this be caused by Debian-live installer based on Calamares?

Very unlikely, we don't install popularity-contest on live media and
it's not added/removed at any point by Calamares, so essentially when
you install popularity-contest on a calamares-live-installed system,
it's basically the same as installing it on any other type of Debian
system that didn't have it before.

I also just double-checked whether any /etc/popularity-contest.conf
exists on debian live images, and can confirm that it doesn't.

Bill, it might also be a good idea to ask on the debian-derivatives
mailing list; perhaps someone there might know. I don't suppose there are
any server logs with IPs that you could use to deduce which country
the submissions are coming from?

-Jonathan

--
  ⢀⣴⠾⠻⢶⣦⠀  Jonathan Carter (highvoltage) <jcc>
  ⣾⠁⢠⠒⠀⣿⡁  Debian Developer - https://wiki.debian.org/highvoltage
  ⢿⡄⠘⠷⠚⠋   https://debian.org | https://jonathancarter.org
  ⠈⠳⣄⠀⠀⠀⠀  Be Bold. Be brave. Debian has got your back.


Re: duplicate popularity-contest ID

Andrey Rahmatullin-3
In reply to this post by Bill Allombert-4
On Mon, Aug 05, 2019 at 02:29:33PM +0200, Bill Allombert wrote:
> Dear Debian developers,
>
> Each Debian popularity-contest submitter is supposed to have
> a different random 128bit popcon ID.
> However, the popularity-contest server <https://popcon.debian.org>
> receives a lot of submissions with identical popcon IDs, which causes them
> to be treated as a single submission.
Do you mean just one ID or several IDs with multiple submissions each?

--
WBR, wRAR


Re: duplicate popularity-contest ID

merkys
In reply to this post by Bill Allombert-4
On 2019-08-05 15:29, Bill Allombert wrote:
> However, the popularity-contest server <https://popcon.debian.org>
> receives a lot of submissions with identical popcon IDs, which causes them
> to be treated as a single submission.

I would suspect cloned VMs to have identical popcon IDs. In this case
the collation of identical IDs would be a desirable property, IMO.

Best,
Andrius

--
Andrius Merkys
Vilnius University Institute of Biotechnology, Saulėtekio al. 7, room V325
LT-10257 Vilnius, Lithuania


Re: duplicate popularity-contest ID

Russ Allbery-2
In reply to this post by Bill Allombert-4
Bill Allombert <[hidden email]> writes:

> Each Debian popularity-contest submitter is supposed to have a different
> random 128bit popcon ID.  However, the popularity-contest server
> <https://popcon.debian.org> receives a lot of submissions with identical
> popcon IDs, which causes them to be treated as a single submission.

Are you getting lots and lots of submissions with one identical popcon ID,
or lots of cases of 10-20 systems duplicating different popcon IDs?  I
think those lead to different conclusions.

If it's the second, I agree with the suggestion of cloned VMs.  Containers
are making this a bit less common, but building out a system and then
cloning it repeatedly used to be the most common way of scaling a web
service in environments such as AWS.

--
Russ Allbery ([hidden email])               <http://www.eyrie.org/~eagle/>


Re: duplicate popularity-contest ID

Marco d'Itri
In reply to this post by Bill Allombert-4
On Aug 05, Bill Allombert <[hidden email]> wrote:

> Each Debian popularity-contest submitter is supposed to have
> a different random 128bit popcon ID.
> However, the popularity-contest server <https://popcon.debian.org>
> receives a lot of submissions with identical popcon IDs, which causes them
> to be treated as a single submission.

> I am not quite sure what the reason for this problem is.
> Maybe people use prebuilt system images with a pregenerated
> /etc/popularity-contest.conf file (instead of one generated
> by the popcon postinst).
Probably yes.

> I am not sure what to do about this.
Change popularity-contest to transmit the hostid after it has been
hashed with the content of /etc/machine-id.

--
ciao,
Marco


Re: duplicate popularity-contest ID

Bill Allombert-4
In reply to this post by Russ Allbery-2
On Mon, Aug 05, 2019 at 08:46:15AM -0700, Russ Allbery wrote:

> Bill Allombert <[hidden email]> writes:
>
> > Each Debian popularity-contest submitter is supposed to have a different
> > random 128bit popcon ID.  However, the popularity-contest server
> > <https://popcon.debian.org> receives a lot of submissions with identical
> > popcon IDs, which causes them to be treated as a single submission.
>
> Are you getting lots and lots of submissions with one identical popcon ID,
> or lots of cases of 10-20 systems duplicating different popcon IDs?  I
> think those lead to different conclusions.

Both.
Yesterday I received the same popcon ID 2600 times, 4700 different IDs were received
twice, and 22000 IDs were received exactly once.
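
A breakdown like the one above can be produced by counting how often each ID occurs and then counting the counts. The following is only a sketch over an in-memory list of IDs; how the popcon server actually stores and tallies submissions is not described in this thread, and `multiplicity_histogram` is a hypothetical name.

```python
from collections import Counter


def multiplicity_histogram(submitted_ids):
    """Map 'times an ID was seen' to 'number of distinct IDs seen that often'."""
    per_id = Counter(submitted_ids)      # ID -> number of submissions
    return Counter(per_id.values())      # number of submissions -> number of IDs


if __name__ == "__main__":
    sample = ["aaa"] * 3 + ["bbb", "bbb", "ccc", "ccc", "ddd"]
    for times, n_ids in sorted(multiplicity_histogram(sample).items()):
        print(f"{n_ids} ID(s) received {times} time(s)")
```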

I understand the need for totally identical systems, but then probably
it does not make sense for them to report to popcon.

A related issue is that the submission time is randomized, but if
2600 systems have identical /etc/cron.d/popularity-contest files, they
will report at the same time, causing network spikes.

Cheers,
--
Bill. <[hidden email]>

Imagine a large red swirl here.


Re: duplicate popularity-contest ID

Russ Allbery-2
Bill Allombert <[hidden email]> writes:

> Both.

> Yesterday I received the same popcon ID 2600 times, 4700 different IDs
> were received twice, and 22000 IDs were received exactly once.

Hm.  I think that's still in the range of what could be explained by VM
cloning, although the 2600 with the same ID is surprising.

> I understand the need for totally identical systems, but then probably
> it does not make sense for them to report to popcon.

Marco's suggestion of hashing with /etc/machine-id is a good one.  That
file is unique per cloned VM (if the cloning is done properly), and if you
hash it with the popcon ID to form a new ID, there shouldn't be any
realistic chance of leaking a unique identifier for the system that might
be useful for other purposes.
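
A minimal sketch of the hashing idea, under stated assumptions: SHA-256 is chosen here only for illustration (the thread does not name a hash function), the result is truncated to 128 bits to match the existing ID size, and `derived_submission_id` is a hypothetical helper, not part of popularity-contest.

```python
import hashlib


def derived_submission_id(popcon_id: str, machine_id: str) -> str:
    """Combine the configured popcon ID with the machine ID into a new 128-bit ID."""
    digest = hashlib.sha256()
    digest.update(popcon_id.strip().encode("ascii"))
    digest.update(b"\0")  # separator so the two inputs cannot run into each other
    digest.update(machine_id.strip().encode("ascii"))
    # Keep 32 hex characters (128 bits), the same size as the original ID.
    return digest.hexdigest()[:32]


if __name__ == "__main__":
    shared_popcon_id = "0123456789abcdef0123456789abcdef"  # same on every clone
    print(derived_submission_id(shared_popcon_id, "1" * 32))
    print(derived_submission_id(shared_popcon_id, "2" * 32))
```

Two clones that share a popcon ID but have distinct machine IDs submit different values, and the machine ID itself cannot be read back from the submitted value.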

> A related issue is that the submission time is randomized, but if 2600
> systems have identical /etc/cron.d/popularity-contest files, they will
> report at the same time, causing network spikes.

You could add a bit of per-run randomization to the cron job.  I assume an
individual report doesn't take a lot of effort to process, so even skewing
that load across 10-20 seconds might be enough.
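
A sketch of such per-run randomization, assuming the submission is wrapped in a callable. The 20-second ceiling comes from the estimate above; `jitter_then` and its parameters are hypothetical names.

```python
import random
import time


def jitter_then(submit, max_delay_seconds=20.0, sleep=time.sleep,
                rng=random.SystemRandom()):
    """Sleep for a random time in [0, max_delay_seconds], then call submit()."""
    # SystemRandom draws from the OS at call time, so clones of one image
    # do not share a random state.
    delay = rng.uniform(0.0, max_delay_seconds)
    sleep(delay)
    submit()
    return delay


if __name__ == "__main__":
    waited = jitter_then(lambda: print("submitting report"), max_delay_seconds=2.0)
    print(f"waited {waited:.2f}s before submitting")
```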

--
Russ Allbery ([hidden email])               <http://www.eyrie.org/~eagle/>


Re: duplicate popularity-contest ID

Jeremy Stanley
On 2019-08-06 08:33:36 -0700 (-0700), Russ Allbery wrote:
[...]
> Hm.  I think that's still in the range of what could be explained
> by VM cloning, although the 2600 with the same ID is surprising.
[...]

A CI system which is using cloned virtual machines could easily do
that. I help operate a CI system which boots and deletes far more
than 2600 new virtual machines of some distro/version combinations
every day, though it hasn't relied on cloning to create images for a
few years now. On the other hand, including popcon on a test VM
*and* enabling it to report does seem like an odd choice for a CI
system, but I've seen far stranger misconfigurations over the years.
--
Jeremy Stanley


Re: duplicate popularity-contest ID

Bill Allombert-4
In reply to this post by Marco d'Itri
On Tue, Aug 06, 2019 at 12:08:13AM +0200, Marco d'Itri wrote:

> On Aug 05, Bill Allombert <[hidden email]> wrote:
>
> > Each Debian popularity-contest submitter is supposed to have
> > a different random 128bit popcon ID.
> > However, the popularity-contest server <https://popcon.debian.org>
> > receives a lot of submissions with identical popcon IDs, which causes them
> > to be treated as a single submission.
>
> > I am not quite sure what the reason for this problem is.
> > Maybe people use prebuilt system images with a pregenerated
> > /etc/popularity-contest.conf file (instead of one generated
> > by the popcon postinst).
> Probably yes.
>
> > I am not sure what to do about this.
> Change popularity-contest to transmit the hostid after it has been
> hashed with the content of /etc/machine-id.

This is potentially an excellent idea!

Doesn't /etc/machine-id suffer from exactly the same issue as
/etc/popularity-contest.conf?

Are there any statistics about /etc/machine-id reuse or unexpected changes?

Cheers,
--
Bill. <[hidden email]>

Imagine a large red swirl here.


Re: duplicate popularity-contest ID

Sam Hartman-3
>>>>> "Bill" == Bill Allombert <[hidden email]> writes:

    Bill> This is potentially an excellent idea!

    Bill> Doesn't /etc/machine-id suffer from exactly the same issue as
    Bill> /etc/popularity-contest.conf?

A lot more procedures for cloning images know that they need to generate
new /etc/machine-ids.

It's one of those things you tend to realize fairly quickly that you
need to fix up in cloned images.
Interactions with machined, systemd journal, and a few other things tend
to make it obvious.

--Sam


Re: duplicate popularity-contest ID

Paul Wise via nm
In reply to this post by Bill Allombert-4
On Tue, Aug 6, 2019 at 7:34 PM Bill Allombert wrote:

> A related issue is that the submission time is randomized, but if
> 2600 systems have identical /etc/cron.d/popularity-contest files, they
> will report at the same time, causing network spikes.

BTW, a systemd service timer has native randomisation with
RandomizedDelaySec/AccuracySec so adding one that shadows the cron job
and disabling the cron job on systemd systems could provide more load
spread because each system would send the data at a completely
different time each day. The apt package is a good example of how to
do this (except for the randomisation part).
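
To make the timer idea concrete, here is a sketch that renders the text of such a timer unit. `OnCalendar`, `RandomizedDelaySec`, `AccuracySec` and `Persistent` are systemd.timer settings; the description text, the 12h/1min values and the helper name `render_timer_unit` are assumptions for illustration, not an existing popularity-contest unit.

```python
def render_timer_unit(randomized_delay="12h", accuracy="1min"):
    """Return the text of a daily timer unit with a randomized start delay."""
    return "\n".join([
        "[Unit]",
        "Description=Daily popularity-contest submission (hypothetical unit)",
        "",
        "[Timer]",
        "OnCalendar=daily",
        # Each activation is delayed by a random amount up to this value.
        f"RandomizedDelaySec={randomized_delay}",
        f"AccuracySec={accuracy}",
        "Persistent=true",
        "",
        "[Install]",
        "WantedBy=timers.target",
        "",
    ])


if __name__ == "__main__":
    print(render_timer_unit())
```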

--
bye,
pabs

https://wiki.debian.org/PaulWise


Re: duplicate popularity-contest ID

Marc Haber-3
In reply to this post by Bill Allombert-4
On Tue, 6 Aug 2019 11:33:42 +0000, Bill Allombert
<[hidden email]> wrote:
>Yesterday I received the same popcon ID 2600 times, 4700 different IDs were received
>twice, and 22000 IDs were received exactly once.
>
>I understand the need for totally identical systems, but then probably
>it does not make sense for them to report to popcon.

Why? Does a node in a cluster count less than a desktop installation?
If so, why do we not value the input of our biggest users while
putting so much focus on installations in a market segment that we're
losing anyway?

>A related issue is that the submission time is randomized, but if
>2600 systems have identical /etc/cron.d/popularity-contest files, they
>will report at the same time, causing network spikes.

Then the randomization should not be in the configuration file, but
for example hashed from the MAC address.
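
A sketch of deriving the submission time from the MAC address rather than storing it in a file. It assumes each clone gets its own MAC address; SHA-256 and the helper name `submission_slot` are illustrative choices, not taken from popularity-contest.

```python
import hashlib


def submission_slot(mac_address: str):
    """Map a MAC address to a stable (hour, minute) of the day."""
    normalized = mac_address.strip().lower().replace("-", ":")
    digest = hashlib.sha256(normalized.encode("ascii")).digest()
    # Reduce the first 4 bytes of the hash to a minute within the day.
    minute_of_day = int.from_bytes(digest[:4], "big") % (24 * 60)
    return divmod(minute_of_day, 60)


if __name__ == "__main__":
    hour, minute = submission_slot("52:54:00:12:34:56")
    print(f"submit daily at {hour:02d}:{minute:02d}")
```

The slot stays the same across runs on one machine, so nothing has to be written to a file that a clone would inherit.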

Greetings
Marc
--
-------------------------------------- !! No courtesy copies, please !! -----
Marc Haber         |   " Questions are the         | Mailadresse im Header
Mannheim, Germany  |     Beginning of Wisdom "     |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 621 72739834


Re: duplicate popularity-contest ID

Marc Haber-3
In reply to this post by Sam Hartman-3
On Tue, 06 Aug 2019 14:01:16 -0400, Sam Hartman <[hidden email]>
wrote:
>>>>>> "Bill" == Bill Allombert <[hidden email]> writes:
>
>    Bill> This is potentially an excellent idea!
>
>    Bill> Doesn't /etc/machine-id suffer from exactly the same issue as
>    Bill> /etc/popularity-contest.conf?
>
>A lot more procedures for cloning images know that they need to generate
>new /etc/machine-ids.

I have been using Debian for two decades now, and I realized that necessity
two days ago.

>It's one of those things you tend to realize fairly quickly that you
>need to fix up in cloned images.
>Interactions with machined, systemd journal, and a few other things tend
>to make it obvious.

I didn't.

Greetings
Marc
--
-------------------------------------- !! No courtesy copies, please !! -----
Marc Haber         |   " Questions are the         | Mailadresse im Header
Mannheim, Germany  |     Beginning of Wisdom "     |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 621 72739834


Re: duplicate popularity-contest ID

Russell Stuart
On Wed, 2019-08-07 at 09:34 +0200, Marc Haber wrote:
> I have been using Debian for two decades now, and I realized that necessity
> two days ago.

Ditto - except for me it was a few seconds ago.


Generating new IDs for cloning (was Re: duplicate popularity-contest ID)

The Wanderer
On 2019-08-07 at 04:26, Russell Stuart wrote:

> On Wed, 2019-08-07 at 09:34 +0200, Marc Haber wrote:
>
>> I have been using Debian for two decades now, and I realized that
>> necessity two days ago.
>
> Ditto - except for me it was a few seconds ago.

In my case, it was when I read this thread last night. (After more like
~1.5 decades of Debian, for what that's worth.)

This isn't the first time I've discovered that some aspect of a Debian
system would actually need to be cleared and re-generated when that
system is cloned, well after the point where it would have been easy for
me to address that need. (Fortunately, although I've moved in the
direction of cloned Debian systems multiple times in the past, so far
all of those have petered out before reaching production. I still want
to change that at some point, however.)


Cloning isn't the only example of a case where some machine-specific
configuration detail may need to be updated, without that being obvious
in advance.

I've been bitten by attempting to change the name of a computer running
off of LVM on mdraid, and discovering that the hostname entered during
the original install process when those two things were configured had
actually been encoded into the definition of one of the two, such that
the machine could no longer automatically find its filesystems at boot
until some action to update the hostname in that definition was taken;
the original hostname was effectively a critical ID for that filesystem.
(I *still* haven't been able to pin down with certainty what action
would do that update safely.)

Since cloning a machine often involves specifying a new hostname for the
clone, I'd expect to encounter the same issue there - although it's
probably not all that common to want to clone a machine running from
RAID, so if the mdraid is where the hostname is needed, the issue may
not tend to come up in that context.

I wouldn't be even slightly surprised if there were other examples, as
well, somewhere in the package ecosystem.


I've begun to wonder whether it might be worth the overhead to set up
some type of mechanism to let packages which define such
machine-specific IDs A: declare the fact, in a central location which
the sysadmin of a machine where that package is installed can easily
check, and B: define an automated way of performing the appropriate
update / regenerate step in a way which covers all known places where
the ID needs to be updated.

Maybe a mechanism vaguely similar to /etc/init.d/ | /etc/rc*.d/ ? Say,
one directory (name bikeshedding welcome) to contain package-installed
scripts which will generate and apply the new GUID (or replace an
existing ID with a specified new one in all relevant places, for cases
such as the hostname one given above), and another directory to contain
symlinks to scripts in the first directory. Then either a flag file to
tell the system to run the symlinked scripts (and clear the flag) on the
next boot, or just let the presence of any such symlinks be the flag
indicating to run that script and remove the symlink at boot time.
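
A rough sketch of the second variant (the presence of a symlink is the flag), assuming a directory of symlinks that point at package-installed scripts. The directory layout and the function name `run_pending_regenerations` are hypothetical; no such mechanism exists in Debian as far as this thread shows.

```python
import subprocess
from pathlib import Path


def run_pending_regenerations(pending_dir):
    """Run every symlinked script in pending_dir; remove each symlink on success."""
    completed = []
    for link in sorted(Path(pending_dir).iterdir()):
        if not link.is_symlink():
            continue  # only symlinks act as "please regenerate" flags
        target = link.resolve()
        if not target.is_file():
            continue  # dangling symlink: leave it for the sysadmin to inspect
        result = subprocess.run([str(target)], check=False)
        if result.returncode == 0:
            link.unlink()  # clear the flag so the script does not run again
            completed.append(link.name)
    return completed
```

A failing script keeps its symlink, so the regeneration would be retried on the next boot.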

That way, rather than needing to research to find out what elements of
the installed system need to be updated at clone time, the sysadmin
could just check the relevant directory, run any scripts whose effects
need to be applied pre-clone (if any), create appropriate symlinks for
whichever others the sysadmin wants to have run in this case, create the
flag file if applicable, shut down, and clone.

...this would be arguably reminiscent of the Sysprep tool on the Windows
side of things, although probably all of more general, more flexible,
and less heavy-weight. I'm sad at there being any need for such a thing
in the Linux world, but as long as there are machine-specific IDs which
need to be updated for effective cloning, I'd rather have such a
mechanism than need to do all the work (or do the research, and write
the necessary automation scripts) myself in every case.

I'm not particularly attached to that exact solution; it's just the
first one I came up with that seemed as if it could work with sufficient
generality. If people think the idea is worth pursuing but that solution
is not ideal, I would be more than happy to defer to those with more
expertise.

--
   The Wanderer (will, statistically, probably regret posting this)

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man.         -- George Bernard Shaw



Re: duplicate popularity-contest ID

Michael Stone-2
In reply to this post by Marc Haber-3
On Wed, Aug 07, 2019 at 09:31:34AM +0200, Marc Haber wrote:

>On Tue, 6 Aug 2019 11:33:42 +0000, Bill Allombert
><[hidden email]> wrote:
>>Yesterday I received the same popcon ID 2600 times, 4700 different IDs were received
>>twice, and 22000 IDs were received exactly once.
>>
>>I understand the need for totally identical systems, but then probably
>>it does not make sense for them to report to popcon.
>
>Why? Does a node in a cluster count less than a desktop installation?
>If so, why do we not value the input of our biggest users while
>putting so much focus on installations in a market segment that we're
>losing anyway?

I guess the question is what is the point of the popcon statistics.
Insofar as they're used to determine defaults, skewing them toward
custom images (which likely do not care about defaults) is probably a
mistake.


Re: duplicate popularity-contest ID

Ian Jackson-2
Michael Stone writes ("Re: duplicate popularity-contest ID"):
> I guess the question is what is the point of the popcon statistics.
> Insofar as they're used to determine defaults, skewing them toward
> custom images (which likely do not care about defaults) is probably a
> mistake.

popcon is a really bad way to determine defaults because it is so
heavily skewed by existing defaults.

More useful uses of popcon include: estimating the downside, if some
package is (or may become) broken or removed; and, maybe, estimating
the user preferences between different non-default leaf packages.

For me, if I were doing (say) RC bugfixing and was considering asking
for a removal, even a moderate popcon figure would give me pause.
Conversely, a low popcon figure would encourage me to consult on
removing the package.

Ian.

--
Ian Jackson <[hidden email]>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.


Re: Generating new IDs for cloning (was Re: duplicate popularity-contest ID)

Marvin Renich
In reply to this post by The Wanderer
* The Wanderer <[hidden email]> [190807 09:28]:

> Cloning isn't the only example of a case where some machine-specific
> configuration detail may need to be updated, without that being obvious
> in advance.
>
> I've begun to wonder whether it might be worth the overhead to set up
> some type of mechanism to let packages which define such
> machine-specific IDs A: declare the fact, in a central location which
> the sysadmin of a machine where that package is installed can easily
> check, and B: define an automated way of performing the appropriate
> update / regenerate step in a way which covers all known places where
> the ID needs to be updated.

I think this is a good idea, but will require work and coordination to
accomplish.  A wiki.debian.org page with your ideas and (perhaps on a
separate page) a place to list things that need updating after the
physical copying is complete would be wonderful, if you feel motivated
to get it started.  :-)  Hostname, machine-id (new to me too!), and ssh
host keys can start the list.
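
As a starting point for such a list, here is a sketch that only describes the three items named above as data and executes nothing. `hostnamectl set-hostname`, `systemd-machine-id-setup` and `ssh-keygen -A` are existing commands, but the preconditions in the descriptions are assumptions, and this is neither a complete nor a verified cloning procedure; `clone_fixup_plan` is a hypothetical name.

```python
def clone_fixup_plan(new_hostname):
    """Return (description, argv) pairs for a freshly cloned system; dry run only."""
    return [
        ("set the clone's hostname",
         ["hostnamectl", "set-hostname", new_hostname]),
        ("regenerate machine-id (assumes /etc/machine-id was emptied before cloning)",
         ["systemd-machine-id-setup"]),
        ("regenerate SSH host keys (assumes the old /etc/ssh/ssh_host_* files were removed)",
         ["ssh-keygen", "-A"]),
    ]


if __name__ == "__main__":
    for description, argv in clone_fixup_plan("clone01"):
        print(f"{description}: {' '.join(argv)}")
```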

...Marvin
