Privacy isn’t about hiding data — The quiet power of minimum group size

About fifteen years ago, I was helping build a Laboratory Information Management System (LIMS) in a public-health institution. The goal sounded noble: track disease prevalence across the country from lab results.

To protect privacy, we stripped the data clean. Names—gone. ID numbers—gone. All that remained were things that felt harmless: the patient’s age, gender, the city district (barrio), and the type of test.

One night, while checking the database for oddities, I ran a random query—just curiosity, really. I looked for something rare: an HIV test, for someone over seventy, in a small, well-defined barrio in Managua.

The query returned exactly one record.
One.

And I froze.

In a tiny neighborhood, a man in his seventies, tested for a certain illness (I've anonymized even my own story; I learned my lesson), isn't an anonymous entry. It's someone's grandfather. Someone's neighbor. Someone.

My blood went cold. The system built to protect public health could, with one careless query, betray a person’s dignity.

That small moment taught me something that’s never left me: data privacy isn’t a technical checkbox — it’s a human safeguard.


We usually think privacy means hiding.
Blur the faces. Delete the names. Mask the IDs.

But the real challenge isn’t to make data invisible — it’s to make it collective.

Privacy, at its best, means no one stands alone in a dataset.

That quiet idea lives behind one of the most elegant principles in data protection: minimum group size, or what the experts call k-anonymity.

It doesn’t encrypt. It doesn’t brag.
It simply whispers: every record should blend in.
Insight should come from groups, not from individuals.

From uniqueness to togetherness

Most datasets don’t expose people by name. They expose them by combination.
Age, gender, location — innocent on their own, revealing when stitched together.

If you see “a 47-year-old marine biologist in a small coastal town,” you don’t need Sherlock Holmes to know who that is.

That’s why k-anonymity matters. It makes sure that every combination of identifying attributes is shared by at least k records, so each person is indistinguishable from at least k − 1 others, and nobody becomes the only one wearing a neon sign in the crowd.

The beauty of it is subtle: it doesn’t silence the data, it reshapes it — keeping patterns visible but letting identities fade into the background.

Step 1: a simple dataset

Let’s see what that means with a small, transparent example using standard SQL.
Imagine we have a table of citizens with their age, gender, and city:

CREATE TABLE citizens (
    id INTEGER PRIMARY KEY,
    age INTEGER,
    gender TEXT,
    city TEXT
);

INSERT INTO citizens (age, gender, city) VALUES
(31, 'Female', 'Valencia'),
(32, 'Female', 'Valencia'),
(33, 'Female', 'Valencia'),
(34, 'Female', 'Valencia'),
(35, 'Female', 'Valencia'),

(40, 'Male', 'Seville'),
(41, 'Male', 'Seville'),

(29, 'Male', 'Bilbao'),
(30, 'Male', 'Bilbao'),
(45, 'Male', 'Bilbao');

id  age  gender  city
1   31   Female  Valencia
2   32   Female  Valencia
3   33   Female  Valencia
4   34   Female  Valencia
5   35   Female  Valencia
6   40   Male    Seville
7   41   Male    Seville
8   29   Male    Bilbao
9   30   Male    Bilbao
10  45   Male    Bilbao

Each row could represent a real person.
Our task: find out which groups are big enough to keep them anonymous.

Step 2: count all groups

SELECT gender, city, COUNT(*) AS group_size
FROM citizens
GROUP BY gender, city
ORDER BY group_size DESC;

Output:

gender  city      group_size
Female  Valencia  5
Male    Bilbao    3
Male    Seville   2

Here we can already see the structure: some groups are big enough to provide anonymity, others are too small.

Step 3: apply a minimum group size

Let’s set k = 3, meaning every group must contain at least three people.

SELECT gender, city, COUNT(*) AS group_size
FROM citizens
GROUP BY gender, city
HAVING COUNT(*) >= 3
ORDER BY group_size DESC;

Output:

gender  city      group_size
Female  Valencia  5
Male    Bilbao    3

“Male, Seville” doesn’t qualify — it’s a crowd of two, not a crowd at all.

So those records would need to be generalized or suppressed to meet the privacy threshold.
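In SQL, suppression can be a simple join back against the compliant groups. Here's a sketch against the citizens table above, with k = 3 hard-coded for clarity:

```sql
-- Release only the rows whose (gender, city) group has at least 3 members.
-- Rows in smaller groups (here: the two men in Seville) are suppressed.
SELECT c.age, c.gender, c.city
FROM citizens AS c
JOIN (
    SELECT gender, city
    FROM citizens
    GROUP BY gender, city
    HAVING COUNT(*) >= 3
) AS safe
  ON c.gender = safe.gender
 AND c.city   = safe.city;
```

The eight remaining rows are safe to share; the Seville pair never leaves the database.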

The same in Python

The beauty of a principle like k-anonymity is that it isn’t tied to any single technology; it’s a way of thinking about data. To show what I mean, here’s the exact same technique in Python, using the pandas library. Whether you prefer SQL, Python, or any other language, the goal is the same: count your groups and make sure no one is left standing alone.

import pandas as pd

data = pd.DataFrame([
    {"age": 31, "gender": "Female", "city": "Valencia"},
    {"age": 32, "gender": "Female", "city": "Valencia"},
    {"age": 33, "gender": "Female", "city": "Valencia"},
    {"age": 34, "gender": "Female", "city": "Valencia"},
    {"age": 35, "gender": "Female", "city": "Valencia"},
    {"age": 40, "gender": "Male", "city": "Seville"},
    {"age": 41, "gender": "Male", "city": "Seville"},
    {"age": 29, "gender": "Male", "city": "Bilbao"},
    {"age": 30, "gender": "Male", "city": "Bilbao"},
    {"age": 45, "gender": "Male", "city": "Bilbao"},
])

k = 3

group_sizes = (
    data.groupby(["gender", "city"])
    .size()
    .reset_index(name="group_size")
)

print("All groups:")
print(group_sizes, "\n")

compliant = group_sizes[group_sizes["group_size"] >= k]
print(f"Groups meeting minimum group size (k = {k}):")
print(compliant)

Output:

All groups:
   gender     city  group_size
0  Female  Valencia           5
1    Male    Bilbao           3
2    Male   Seville           2

Groups meeting minimum group size (k = 3):
   gender     city  group_size
0  Female  Valencia           5
1    Male    Bilbao           3

Here we can see both worlds — all groups, and those safe to share.
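And suppression is just as short in pandas: drop every row whose group falls below k. A minimal sketch (it rebuilds the sample data so it runs on its own):

```python
import pandas as pd

data = pd.DataFrame([
    {"age": 31, "gender": "Female", "city": "Valencia"},
    {"age": 32, "gender": "Female", "city": "Valencia"},
    {"age": 33, "gender": "Female", "city": "Valencia"},
    {"age": 34, "gender": "Female", "city": "Valencia"},
    {"age": 35, "gender": "Female", "city": "Valencia"},
    {"age": 40, "gender": "Male", "city": "Seville"},
    {"age": 41, "gender": "Male", "city": "Seville"},
    {"age": 29, "gender": "Male", "city": "Bilbao"},
    {"age": 30, "gender": "Male", "city": "Bilbao"},
    {"age": 45, "gender": "Male", "city": "Bilbao"},
])

k = 3

# Attach each row's (gender, city) group size to the row itself.
sizes = data.groupby(["gender", "city"])["age"].transform("size")

# Suppress rows in groups smaller than k: the two Seville men disappear.
released = data[sizes >= k].reset_index(drop=True)

print(released)
```

The released frame contains eight rows, all from groups of at least three, and is safe to share.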

Why it matters

That single row I found in the LIMS system years ago wasn’t just a technical bug. It was a potential human tragedy. Because a medical record isn’t just data — it’s trust. And trust, once broken, doesn’t restore easily. Even the mere knowledge that someone got tested can spread shame, fear, or discrimination.

When we enforce a minimum group size, we’re doing more than protecting privacy — we’re protecting dignity. We’re promising each person that their data will help the collective — to spot outbreaks, to plan resources — but never expose them as individuals.

k-anonymity lets the system answer,

“How many elderly people in this district are being tested?”

without ever answering,

“Who is the elderly person being tested?”

It turns a portrait of one vulnerable person into a landscape of public health.

So, What’s a Good k?

This is the million-dollar question. If k is the guardian of privacy, how big does that guardian need to be?

The honest answer is: it depends. There’s no universal magic number. Choosing a k is a balancing act between protecting individuals and keeping the data useful. A k so high that it merges everyone into one giant group is safe, but useless. A k that’s too low is useful, but unsafe.

However, here are some common rules of thumb used in the industry:

  • k = 3: Often considered the absolute, bare-minimum floor. It’s better than nothing, but for most cases, it’s like putting up a “beware of dog” sign for a chihuahua. It might deter the most casual intruders, but not a determined one.
  • k = 5 or k = 10: This is a widely cited starting point for general-purpose or commercial datasets. It provides a reasonable degree of anonymity without completely obscuring the patterns in the data. Many organizations, including some regulatory bodies, use k=5 as a baseline.
  • k = 25 or higher: When you’re dealing with highly sensitive information—like the health records in my LIMS story, financial data, or location tracking—you need a bigger crowd. A higher k ensures that individuals are much harder to isolate, which is critical when a leak could have severe consequences.

While there are rules of thumb, the scientific literature doesn’t recommend a simple ratio (like k should be 0.1% of your dataset size). Here’s why: a dataset’s vulnerability doesn’t come from its length, but from its uniqueness.

Imagine two datasets, both with one million people:
  1. Dataset A: Records of people living in a single, massive apartment complex. The variety of quasi-identifiers (age, gender, floor) is low.
  2. Dataset B: Records of people from all over the world, each with a unique profession. The variety is immense.

A k=25 would be easy to achieve in Dataset A. But in Dataset B, you might find that almost every person is unique. A k=25 would force you to discard nearly the entire dataset, making it useless.

This brings us to the central challenge of anonymization: the Privacy-Utility Trade-off. The more privacy you demand (a higher k), the more information you lose, either by generalizing details (e.g., changing “age 47” to “age 40-49”) or by suppressing records entirely.
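Generalization is easy to sketch in pandas: replace exact ages with ten-year bands so previously unique records merge into larger groups. (The band edges below are my own illustrative choice, not a standard.)

```python
import pandas as pd

ages = pd.DataFrame({"age": [29, 31, 40, 47, 63]})

# Generalize exact ages into ten-year bands, e.g. 47 -> "40-49".
ages["age_band"] = pd.cut(
    ages["age"],
    bins=range(20, 80, 10),   # edges 20, 30, ..., 70
    labels=[f"{lo}-{lo + 9}" for lo in range(20, 70, 10)],
    right=False,              # each band includes its lower edge
)

print(ages)
```

After banding, a 47-year-old is just one of everyone aged 40-49: less precise, but far harder to single out.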

So, instead of starting with a fixed k, professionals often approach it from the other direction: by defining an acceptable level of data loss.

You might decide, “I will choose the highest k possible, as long as I don’t have to suppress more than 15% of my records.” This approach guarantees your dataset remains useful.
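That inverted approach is simple to automate: try decreasing values of k and keep the largest one whose suppression rate stays within the budget. A minimal sketch (the 15% budget and the quasi-identifier columns are illustrative, and the sample table is the one from earlier):

```python
import pandas as pd

def suppression_rate(df, quasi_ids, k):
    """Fraction of rows that would be suppressed to reach k-anonymity."""
    sizes = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    return (sizes < k).mean()

def best_k(df, quasi_ids, max_loss=0.15, k_max=25):
    """Largest k whose suppression rate stays within the loss budget."""
    for k in range(k_max, 1, -1):
        if suppression_rate(df, quasi_ids, k) <= max_loss:
            return k
    return 1  # even k = 2 would lose too much data

data = pd.DataFrame([
    {"age": 31, "gender": "Female", "city": "Valencia"},
    {"age": 32, "gender": "Female", "city": "Valencia"},
    {"age": 33, "gender": "Female", "city": "Valencia"},
    {"age": 34, "gender": "Female", "city": "Valencia"},
    {"age": 35, "gender": "Female", "city": "Valencia"},
    {"age": 40, "gender": "Male", "city": "Seville"},
    {"age": 41, "gender": "Male", "city": "Seville"},
    {"age": 29, "gender": "Male", "city": "Bilbao"},
    {"age": 30, "gender": "Male", "city": "Bilbao"},
    {"age": 45, "gender": "Male", "city": "Bilbao"},
])

# On this toy table, a 15% budget only allows k = 2: reaching k = 3
# would suppress the two Seville rows, 20% of the data.
print(best_k(data, ["gender", "city"]))
```

Which illustrates the trade-off nicely: relax the budget to 20% and k = 3 becomes affordable.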

Ultimately, choosing k is a risk-management decision, not a mathematical formula. It requires you to ask:

  • How sensitive is this data? (Are we talking about favorite ice cream flavors or medical diagnoses?)
  • How big is the underlying population? (A group of 5 in a city of millions is different from a group of 5 in a village of 50.)
  • How much data can I afford to lose? (The cost of privacy.)
  • What other data could this be combined with? (Could someone link this to another public dataset?)

Choosing k is less about finding a fixed number and more about making a thoughtful, ethical judgment. It’s the moment where the data scientist also has to be a data ethicist.

The quiet ethic of k

k-anonymity may look like a small technical trick — a modest HAVING COUNT(*) >= k tucked into a query — but behind that line lives an ethic of restraint.

It says: data can inform, but it must not expose.

It’s a philosophy as much as a formula.
A small “k” standing guard between curiosity and intrusion.

It reminds us that the best insights don’t come from staring at individuals, but from listening to the chorus.

When data blends, people are protected. And that’s what real privacy sounds like. So next time someone mentions data privacy, don’t picture encryption or hidden columns. Picture the humble GROUP BY and the quiet COUNT(*). They’re tiny guardians of human dignity, quietly ensuring that data speaks for the many, not the few.

In a world obsessed with personalization, maybe a little intentional generalization is the most personal thing we can do.


Further Reading

Latanya Sweeney (2002). k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570. DOI: 10.1142/S0218488502001648

Cynthia Dwork & Aaron Roth (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. A readable theoretical perspective on privacy through mathematics.

Ninghui Li, Tiancheng Li, & Suresh Venkatasubramanian (2007). t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity. IEEE 23rd International Conference on Data Engineering (ICDE).

Harvard Privacy Tools Project – Resources on differential privacy and anonymization. https://privacytools.seas.harvard.edu
