Data Clumps: The Friends Who Always Travel Together
Data Clumps code smell for beginners — learn to spot groups of values that always travel together and bundle them into one class, like a student ID card.
🎯 Three Friends on Every School Form
In St. Mary's School, Pune, every single form asks for the same three things:
Name. Class. Roll Number.
Library book issue form? Name, class, roll number. Sports day registration? Name, class, roll number. Exam hall ticket? Name, class, roll number. Picnic permission slip? You guessed it.
Twelve-year-old Kabir fills these three boxes maybe two hundred times a year. And mistakes happen constantly. On the sports form he wrote his old class from last year. On the library form, his friend Arjun filled roll number 23 instead of 32 — and the overdue notice went to the wrong student, poor Sanjana of roll 23, who had never even entered the library that month. Each form checks the three values its own way: the librarian verifies against her register, the sports teacher against his list, and sometimes nobody verifies at all.
Then one year, the school becomes smart. It issues every student an ID card. Name, class, and roll number — printed once, verified once, laminated together. Now every form just says: "Attach photocopy of ID card." One thing to carry. Impossible to write the class wrong. When Kabir moves up a class, the school reissues one card — and every future form is automatically correct.
Those three values were never really three separate things. They were one thing — a student's identity — travelling in pieces because nobody had given it a card.
Behind the scenes, there is a second story. The school's app was built by Neha, an old student of St. Mary's who now works as a developer. When the principal asks her to digitise the ID card system, Neha opens her own two-year-old code — and freezes. Her library module, sports module, and exam module each pass name, className, rollNo as three loose values. The same trio. The same checks, copy-pasted. Her code has been making Arjun's mistake all along, just faster.
In code, this smell is called Data Clumps: the same little group of values appearing together, again and again — in parameter lists, in fields, in local variables — without ever being given a name and a home of its own. This lesson follows Kabir's forms and Neha's refactor side by side, because they are the same story.
💡 What is this smell?
Our usual reminder first: a code smell is not a bug. Code full of clumps runs perfectly. The smell warns of friction ahead: duplicated checks, swap bugs, and changes that ripple across dozens of files. Data Clumps is the last of the five "Bloater" smells in this series. The smell itself is catalogued in Martin Fowler's Refactoring; the "Bloater" grouping comes from later smell taxonomies rather than from the book. In a way it is the quietest of the five: nothing is visibly huge, yet bloat is spread thinly everywhere.
A Data Clump is a group of two, three, or four data items that always appear side by side:
- startDate and endDate
- latitude and longitude
- street, city, and pinCode
- name, className, and rollNumber
The clump appears as parameters in method after method, and as fields in class after class — yet the group itself has no name in the code. Martin Fowler discusses this pattern on his bliki and in Refactoring, with advice that has become a classic line in the refactoring world: when a few data items keep gathering together, turn them into an object of their own.
Fowler's famous deletion test: mentally delete one member of the group. Do the others still make sense? An endDate without a startDate is nonsense — so the pair is one concept (a date range) split into pieces. If the leftovers still make sense alone, it was never a real clump.
Notice how this smell connects its Bloater siblings. Clumps passed as parameters create Long Parameter Lists. Clumps are usually made of loose primitives — Primitive Obsession. And repeated field clumps inside a Large Class mark exactly the seams where it should be split. Find one Bloater and you will usually find its cousins nearby. Neha finds all four in one afternoon.
College corner: A clump with a shared rule — "start must be before end" — is a multi-field invariant, and invariants need exactly one enforcer. In design terms, the cure object (like DateRange or StudentId) is a value object whose constructor is the single gate where the invariant is checked. This is the same "make invalid states unrepresentable" idea from Primitive Obsession, lifted from one value to a small group of values.
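The value-object idea above can be sketched in a few lines of TypeScript. This is an illustration only: DateRange is the name used in this lesson, while the overlaps method, the defensive copying, and the half-open convention are assumptions made for the example.

```typescript
// Hypothetical value object: the constructor is the single gate for the invariant.
class DateRange {
  readonly start: Date;
  readonly end: Date;

  constructor(start: Date, end: Date) {
    // The multi-field invariant lives here and nowhere else.
    if (start.getTime() >= end.getTime()) {
      throw new Error("start must be before end");
    }
    // Copy the dates so later mutation of the arguments cannot break the invariant.
    this.start = new Date(start.getTime());
    this.end = new Date(end.getTime());
  }

  // Assumption: ranges are half-open [start, end), so touching ranges do not overlap.
  overlaps(other: DateRange): boolean {
    return (
      this.start.getTime() < other.end.getTime() &&
      other.start.getTime() < this.end.getTime()
    );
  }
}

const term = new DateRange(new Date("2024-06-01"), new Date("2024-09-30"));
const exams = new DateRange(new Date("2024-09-15"), new Date("2024-10-05"));
const overlapping = term.overlaps(exams); // true: the exams begin before the term ends

let rejected = false;
try {
  new DateRange(new Date("2024-10-05"), new Date("2024-09-15")); // reversed on purpose
} catch {
  rejected = true; // the invalid range never comes into existence
}
```

Every function that accepts a DateRange can rely on start being before end without checking again, which is the practical meaning of "exactly one enforcer".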
👃 How to spot it
Neha hunts through her codebase with this checklist. Hunt through yours:
- The same two-to-four values travel together through many signatures: every booking function takes (checkIn, checkOut), every map function takes (lat, lng).
- The group appears as fields in several unrelated classes: Order, Invoice, and Shipment all separately declare street, city, pinCode.
- A rule about the group is repeated wherever it goes: "start must be before end" is checked in five different methods.
- The members are primitives passed positionally, so neighbours of the same type can be swapped silently.
- The group fails the deletion test — remove one member and the rest become meaningless.
- Adding a related value (say, a timezone next to the two dates) means editing every place the group appears.
| Symptom | What it tells you |
|---|---|
| Same trio in many signatures | A concept exists in everyone's head but nowhere in the code |
| Same fields re-declared in many classes | Each class is hand-copying a missing type |
| "start before end" checked everywhere | The group has an invariant but no owner to enforce it |
| (from, to) both DateTime or both string | Transposition bug waiting; compiler cannot help |
| Deletion test fails | The values are one concept split into fragments |
| One new field ripples through 20 files | The clump has no single home for change |
Neha's audit result: the trio appears in three signatures and two classes, and the roll-number check is pasted three times, identical for now but free to drift apart with the next change.
⚠️ Why it is a problem
- Duplicated validation that drifts. The "check-out must be after check-in" rule is re-typed at every site. One day, one copy is fixed to handle same-day bookings and the other four are not. Now your system disagrees with itself.
- A missing word in your code's language. Your team says "date range", "coordinate", "student identity" in every meeting, but the code only says DateTime, DateTime and string, string, number. The code speaks in fragments while humans speak in concepts.
- Wide ripple on change. Add a timezone to your date handling and you must edit every method that carries the loose pair. With a DateRange type, you edit one file.
- Long parameter lists everywhere. Clumps are the number-one feeder of the Long Parameter List smell: three clumps of three make a nine-parameter monster.
- Transposition bugs. isAvailable(end, start) compiles beautifully and books rooms in reverse time. Same-typed neighbours are silent traps.
The saddest path in that figure is the last one: when change becomes too wide, developers give up and bolt the new value on somewhere local — making the design even more inconsistent. Unfixed clumps do not stay still; they decay.
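The transposition hazard from the list above can be reproduced in a few lines. This is a sketch only: the isAvailable name comes from that bullet, while the body, the hour-based parameters, and the single existing booking are assumptions made for illustration.

```typescript
// Assumed existing booking for one room: 10:00 to 12:00.
const bookedFrom = 10;
const bookedTo = 12;

// Hypothetical availability check taking a loose, same-typed pair of hours.
// The room is free if the requested slot does not overlap the existing booking.
function isAvailable(startHour: number, endHour: number): boolean {
  return !(startHour < bookedTo && bookedFrom < endHour);
}

const correct = isAvailable(11, 13); // false: 11-13 collides with 10-12
const swapped = isAvailable(13, 11); // true: arguments transposed, yet it type-checks
```

The compiler sees two numbers in both calls, so it cannot object. Only a type that checks "start before end" when it is constructed would have refused the second call.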
When Neha maps where her student trio lives, the spread surprises her — the clump has colonised every module:
And here is the ripple cost the principal actually feels. The school plans to add a fourth value — the student's house (Red, Blue, Green, Yellow) — for sports day. Without a bundle, the cost of adding one related field grows with every copy of the clump:
The flat line at the bottom is the bundled design: one type, one file, done. That single flat line is the whole business case for fixing clumps.
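To make the flat line concrete, here is a minimal sketch of the bundled side of that comparison. StudentCard, issueBookTo, and sportsDayTeam are hypothetical names used only for this illustration; the four house colours are the ones the school plans to add.

```typescript
type House = "Red" | "Blue" | "Green" | "Yellow";

// Hypothetical bundle for the student trio, with the fourth value added in one place.
class StudentCard {
  constructor(
    readonly studentName: string,
    readonly className: string,
    readonly rollNo: number,
    readonly house?: House, // the new field; optional so existing callers keep compiling
  ) {}
}

// This signature did not change when `house` was added.
function issueBookTo(card: StudentCard, bookTitle: string): string {
  return `Issued "${bookTitle}" to ${card.studentName}`;
}

// Only the code that cares about the new value reads it.
function sportsDayTeam(card: StudentCard): string {
  return card.house ?? "Unassigned";
}

const oldCaller = new StudentCard("Kabir", "7A", 32); // written before the change
const newCaller = new StudentCard("Kabir", "7A", 32, "Blue"); // uses the new field
```

With loose parameters, the same change would touch every signature and every call site that carries the trio.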
🧪 The wrong overdue notice — anatomy of a clump bug
Remember Arjun writing 23 instead of 32? Here is the exact same incident inside Neha's app, where a transposed value sails through three loose parameters:
Every individual check passed — 23 is a perfectly valid roll number. The bug was in the relationship: that roll number did not belong with that name and class. Only a bundled, verified identity (the ID card) can catch that. Meanwhile, Kabir's year of paperwork looks like this:
📊 Which clumps deserve a class?
Not every pair of values that meets is a clump. Neha ranks her candidates on two axes: how often the group repeats, and how strongly it is one concept (failing the deletion test, sharing a rule). Plot a group here before extracting it:
The student trio and the date pair sit in "Bundle it now" — heavy repetition, strong meaning. A name appearing next to a theme colour once is just a coincidence; bundling it would invent a fake concept.
🧪 A real-life code example
Let us look at Neha's actual code — St. Mary's School before the ID card. Watch the trio travel:
// Library module
function issueBook(
  name: string, className: string, rollNo: number, bookTitle: string,
): void {
  if (rollNo <= 0 || rollNo > 60) throw new Error("Bad roll number");
  if (!/^[5-9][A-D]$/.test(className)) throw new Error("Bad class");
  console.log(`Issued "${bookTitle}" to ${name} (${className}-${rollNo})`);
}

// Sports module
function registerForSportsDay(
  name: string, className: string, rollNo: number, event: string,
): void {
  if (rollNo <= 0 || rollNo > 60) throw new Error("Bad roll number"); // copy 2
  if (!/^[5-9][A-D]$/.test(className)) throw new Error("Bad class"); // copy 2
  console.log(`${name} (${className}-${rollNo}) registered for ${event}`);
}

// Exam module
function printHallTicket(
  name: string, className: string, rollNo: number, examCode: string,
): string {
  if (rollNo <= 0 || rollNo > 60) throw new Error("Bad roll number"); // copy 3
  if (!/^[5-9][A-D]$/.test(className)) throw new Error("Bad class"); // copy 3
  return `HALL TICKET ${examCode}: ${name}, ${className}, Roll ${rollNo}`;
}

// And the classes copy the fields too:
class LibraryRecord {
  name = ""; className = ""; rollNo = 0; booksIssued: string[] = [];
}
class SportsEntry {
  name = ""; className = ""; rollNo = 0; event = "";
}

Count the damage, the way Neha counts it that afternoon:
- The trio name, className, rollNo appears in three signatures and two classes: five copies of one concept.
- The validation rules are pasted three times. The day the school adds class 10 (/^[5-9]/ becomes /^([5-9]|10)/), how many copies will actually get updated? Experience says: not all.
- Every call site is swap-prone: issueBook("7A", "Kabir", 32, ...), with name and class transposed, compiles because both are strings. Here the class check happens to reject "Kabir" at runtime; a swap of two values that both pass their checks sails through, or fails far away from the real mistake.
- The deletion test fails loudly: a roll number without a class identifies nobody. Roll 32 of which class?
Distributed duplication is the hardest kind to see. Each module looks fine alone — the librarian's code, the sports code, the exam code each seem tidy. The smell is only visible when you compare them side by side. Make that comparison a habit during code review.
🛠️ Cleaning it up, step by step
Step 1: Issue the ID card with Introduce Parameter Object. Neha names the concept that was always there:
class StudentId {
  constructor(
    readonly name: string,
    readonly className: string,
    readonly rollNo: number,
  ) {
    if (rollNo <= 0 || rollNo > 60) throw new Error("Bad roll number");
    if (!/^[5-9][A-D]$/.test(className)) throw new Error("Bad class");
  }
  label(): string {
    return `${this.name} (${this.className}-${this.rollNo})`;
  }
}

The validation now exists once, at the only gate where a StudentId can be born. Adding class 10 next year is a one-line, one-file change. The label() formatting, previously re-built with string glue in three places, also moved into its natural home.
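As a sketch of that one-line change, here is the same class with the class-10 pattern applied, followed by a few checks. The isRejected helper and the student used in the demo are made up for illustration; only the regex line differs from the class above.

```typescript
class StudentId {
  constructor(
    readonly name: string,
    readonly className: string,
    readonly rollNo: number,
  ) {
    if (rollNo <= 0 || rollNo > 60) throw new Error("Bad roll number");
    // The one changed line: classes 5 to 10 are now accepted.
    if (!/^([5-9]|10)[A-D]$/.test(className)) throw new Error("Bad class");
  }
  label(): string {
    return `${this.name} (${this.className}-${this.rollNo})`;
  }
}

// Hypothetical demo helper: reports whether constructing a card throws.
function isRejected(make: () => StudentId): boolean {
  try {
    make();
    return false;
  } catch {
    return true;
  }
}

const asha = new StudentId("Asha", "10B", 12);
const cardLabel = asha.label(); // "Asha (10B-12)"
const class11 = isRejected(() => new StudentId("Asha", "11A", 12)); // true: no class 11
const roll61 = isRejected(() => new StudentId("Asha", "7A", 61)); // true: rolls stop at 60
```

No function that accepts a StudentId needs to change, because none of them ever knew the rule.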
Step 2: Slim every signature. Each function now accepts the card:
function issueBook(student: StudentId, bookTitle: string): void {
  console.log(`Issued "${bookTitle}" to ${student.label()}`);
}

function registerForSportsDay(student: StudentId, event: string): void {
  console.log(`${student.label()} registered for ${event}`);
}

function printHallTicket(student: StudentId, examCode: string): string {
  return `HALL TICKET ${examCode}: ${student.label()}`;
}

Four parameters became two. No validation in sight, because none is needed. And a transposed call like issueBook(bookTitle, student) is now a compile error, not a wrong overdue notice landing on innocent Sanjana.
Step 3: Fix the fields with Extract Class thinking, and stop disassembling with Preserve Whole Object. The classes that copied the trio now hold the card instead:
class LibraryRecord {
  constructor(
    readonly student: StudentId,
    readonly booksIssued: string[] = [],
  ) {}
}

// Preserve Whole Object: do NOT take the card apart to pass its pieces!
// Smelly: issueBook(record.student.name, record.student.className, ...)
// Clean:
issueBook(record.student, "The Jungle Book");

That last point deserves a pause. A sneaky way clumps come back after you create the type is callers unpacking it: f(student.name, student.className, student.rollNo). The moment you catch yourself passing an object's pieces, pass the object. That is Preserve Whole Object in one sentence.
The refactored design, as Neha presents it to the principal (who understands it instantly, because it is the ID card system):
And the before-and-after flow:
College corner: Notice what happened to behaviour during the refactor: label() migrated onto StudentId, and overlap or comparison logic migrates onto types like DateRange the same way. This is Fowler's deeper point about clumps — the new class starts as a data bag but quickly attracts the methods that belong to it, raising cohesion across the whole codebase. Object-oriented design largely is this: moving behaviour next to the data it concerns.
🔄 The life cycle of this smell
Clumps follow a quiet but predictable arc. Because nothing ever looks big, teams rarely notice the state changes until the ripple cost bites:
As with every bloater, the cheap exit is early: name the concept the second time you type the same group. By the time the clump is entrenched, the refactor is still worth it — but now it is a project, not an afternoon.
🧰 The same smell in C#
A delivery app passes latitude and longitude as loose doubles — the world's most widespread clump:
public double DistanceKm(double lat1, double lng1, double lat2, double lng2)
{
    // four same-typed doubles in a row: swap any two and
    // your delivery boy swims in the Arabian Sea
    /* haversine formula ... */
}

public bool IsInDeliveryZone(double lat, double lng,
    double centerLat, double centerLng, double radiusKm)
{
    return DistanceKm(lat, lng, centerLat, centerLng) <= radiusKm;
}

Name the concept once:
public readonly record struct GeoPoint(double Latitude, double Longitude)
{
    public double DistanceKmTo(GeoPoint other)
    {
        /* haversine formula ... */
    }
}

public bool IsInDeliveryZone(GeoPoint customer, GeoPoint hub, double radiusKm)
    => customer.DistanceKmTo(hub) <= radiusKm;

Four anonymous doubles became two named points; the distance behaviour moved onto the concept it belongs to; and a (lng, lat) swap inside a call list became much harder to write. One tiny record struct (costing almost nothing at runtime) removed an entire family of map bugs.
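For readers following along in TypeScript, the same bundle might look like the sketch below. It is not a translation of any particular library: the range checks in the constructor and the 6371 km mean Earth radius are assumptions, and the distance uses the standard haversine formula.

```typescript
// Commonly used mean Earth radius, in kilometres (an approximation).
const EARTH_RADIUS_KM = 6371;

class GeoPoint {
  constructor(readonly latitude: number, readonly longitude: number) {
    // Range checks give a swapped (lng, lat) pair a chance to fail early,
    // though only when the swapped value falls outside the latitude range.
    if (latitude < -90 || latitude > 90) throw new Error("Bad latitude");
    if (longitude < -180 || longitude > 180) throw new Error("Bad longitude");
  }

  // Great-circle distance using the haversine formula.
  distanceKmTo(other: GeoPoint): number {
    const toRad = (deg: number) => (deg * Math.PI) / 180;
    const dLat = toRad(other.latitude - this.latitude);
    const dLng = toRad(other.longitude - this.longitude);
    const a =
      Math.sin(dLat / 2) ** 2 +
      Math.cos(toRad(this.latitude)) *
        Math.cos(toRad(other.latitude)) *
        Math.sin(dLng / 2) ** 2;
    return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
  }
}

function isInDeliveryZone(customer: GeoPoint, hub: GeoPoint, radiusKm: number): boolean {
  return customer.distanceKmTo(hub) <= radiusKm;
}

const pune = new GeoPoint(18.5204, 73.8567);
const mumbai = new GeoPoint(19.076, 72.8777);
const distance = pune.distanceKmTo(mumbai); // about 120 km
```

The constructor cannot catch every swap, since many longitudes are also valid latitudes, but the named fields make the mistake far more visible at the point where a GeoPoint is created.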
🔍 Where this smell hides in real projects
Data Clumps have favourite habitats, and researchers and practitioners keep finding the same ones:
- Date pairs. startDate/endDate, validFrom/validTo, checkIn/checkOut. Fowler himself uses the range as the textbook example; a DateRange type with an Overlaps method replaces scattered comparison logic across booking, reporting, and billing modules.
- Coordinates and dimensions. (x, y), (lat, lng), (width, height): graphics, mapping, and game codebases overflow with loose pairs that want to be Point, GeoPoint, or Size.
- Address fragments. street, city, state, pin re-declared in Customer, Order, Warehouse, and Invoice: four hand-copied address books in one system, each validated differently.
- Money pairs. amount plus currencyCode travelling separately. This clump doubles as Primitive Obsession, and its cure, the Money value object, is among the most celebrated in all of software design.
- Connection settings. host, port, username, password passed as four arguments through layers of infrastructure code instead of one ConnectionInfo, a clump that also leaks secrets into log lines more easily.
- Pagination trios. page, pageSize, sortBy copied into every list endpoint of an API instead of one PageRequest object.
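Taking the money pair from the list as one example, a bundle might look like the sketch below. The Money name comes from the list above; storing the amount in whole minor units, the three-letter code check, and the add method are assumptions made for illustration.

```typescript
// Hypothetical Money value object: amount and currency travel as one unit.
class Money {
  constructor(
    readonly amountMinor: number, // assumption: whole minor units (paise, cents)
    readonly currencyCode: string,
  ) {
    if (Math.floor(amountMinor) !== amountMinor) {
      throw new Error("Amount must be a whole number of minor units");
    }
    if (!/^[A-Z]{3}$/.test(currencyCode)) throw new Error("Bad currency code");
  }

  // The rule "never add different currencies" is enforced in exactly one place.
  add(other: Money): Money {
    if (other.currencyCode !== this.currencyCode) {
      throw new Error("Cannot add different currencies");
    }
    return new Money(this.amountMinor + other.amountMinor, this.currencyCode);
  }
}

const fee = new Money(150000, "INR"); // 1500.00 rupees
const lateCharge = new Money(5000, "INR"); // 50.00 rupees
const total = fee.add(lateCharge); // 155000 minor units, INR
```

With a loose amount and currencyCode, nothing stops code from adding a rupee amount to a dollar amount; with the bundle, that mistake throws at the single place where addition is defined.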
Some analysis tools look for clumps automatically by finding identical parameter groups across methods. Once you have chosen the group, IDE refactorings do the mechanical work: IntelliJ's "Extract Parameter Object" and similar refactorings in other tools will perform the bundling for you. Academic tools study "tuple recurrence" across signatures for exactly this purpose.
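A toy version of the detection idea shows the principle. It is not how IntelliJ or any particular tool works: the signature data is written by hand, and findClumps is a hypothetical helper that counts how many signatures share each pair of parameter names.

```typescript
// Hand-written stand-in for parsed signatures (a real tool would read the syntax tree).
const signatures: { [fn: string]: string[] } = {
  issueBook: ["name", "className", "rollNo", "bookTitle"],
  registerForSportsDay: ["name", "className", "rollNo", "event"],
  printHallTicket: ["name", "className", "rollNo", "examCode"],
  sendNotice: ["title", "body"],
};

// Count, for every pair of parameter names, how many signatures contain both.
// Pairs seen in at least `minSites` signatures are reported as clump candidates.
function findClumps(sigs: { [fn: string]: string[] }, minSites: number): string[] {
  const counts: { [pair: string]: number } = {};
  for (const fn of Object.keys(sigs)) {
    const params = sigs[fn].slice().sort(); // sort so (a, b) and (b, a) produce one key
    for (let i = 0; i < params.length; i++) {
      for (let j = i + 1; j < params.length; j++) {
        const key = params[i] + "+" + params[j];
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return Object.keys(counts)
    .filter((key) => counts[key] >= minSites)
    .sort();
}

const candidates = findClumps(signatures, 3);
// ["className+name", "className+rollNo", "name+rollNo"]
```

The three overlapping pairs together point at the student trio. This sketch matches on names alone; a real tool would also compare types and would look at field declarations, not just parameters.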
🤔 When it is okay to ignore
| Situation | Ignore the smell? | Why |
|---|---|---|
| Two values together once, by coincidence | ✅ Yes | Bundling invents a fake concept; a name that means nothing is worse than no name |
| Group repeats across 3+ sites | ❌ No | Every repetition is duplicated validation plus a swap hazard |
| Group carries an invariant (start < end) | ❌ No | An invariant needs exactly one enforcer — give it a constructor |
| Group passes the deletion test (members independent) | ✅ Yes | They are travel companions, not one concept — leave them loose |
| Tiny script, group used twice, no rules | ✅ Probably | The new type is overhead with no payoff in a short-lived script |
| You cannot think of an honest name for the bundle | ✅ Wait | No natural name often means no natural concept; do not force DataHolder3 |
The honest rule: extract when the grouping is stable (it keeps recurring) and meaningful (deletion test fails, or a shared rule exists). A clump must earn its class — but when it has, the class pays rent forever. That is exactly the quadrant in Figure 6.
💊 Which refactorings cure it
| Refactoring | When to use it |
|---|---|
| Extract Class | The clump lives as fields in one or more classes — give it a class of its own |
| Introduce Parameter Object | The clump travels through method signatures — replace the group with one object |
| Preserve Whole Object | Callers disassemble an object to pass its parts — pass the whole object instead |
| Replace Data Value with Object | Individual members of the clump are themselves smelly primitives — upgrade them too |
| Extract Method | Logic about the clump (formatting, comparing) is pasted around — gather it, then move it onto the new type |
🧠 The whole smell on one page
Neha's summary slide for the school's tech club — Kabir sits in the front row:
📦 Quick revision box
+------------------------------------------------------------------+
| DATA CLUMPS - CHEAT SHEET |
+------------------------------------------------------------------+
| What : The same small group of values travelling together |
| everywhere, never given a name (name+class+rollNo) |
| Family : Bloaters (the quiet, scattered one) |
| Spot it : Repeated parameter trios, re-declared field groups, |
| duplicated group rules, swap-prone neighbours |
| Test : Deletion test - remove one member; if the rest |
| become meaningless, it is a real clump |
| Costs : Drifting validation, ripple on change, long |
| parameter lists, transposition bugs |
| Main fix : Introduce Parameter Object / Extract Class |
| Helper : Preserve Whole Object (stop re-scattering it!) |
| Ignore : One-off pairings; bundles you cannot honestly name |
| Mantra : "If they always travel together, give them |
| one ID card." |
+------------------------------------------------------------------+
✍️ Practice exercise
Neha's exercise for the tech club — and for you. A tuition centre's app is below. One clump appears four times — twice in signatures, twice as fields — and its rule is duplicated. Find it and fix it.
function scheduleBatch(
  subject: string,
  startHour: number,
  endHour: number,
  teacher: string,
): void {
  if (startHour >= endHour) throw new Error("Bad timing");
  console.log(`${subject} batch: ${startHour}:00 to ${endHour}:00 with ${teacher}`);
}

function isTeacherFree(
  teacher: string,
  startHour: number,
  endHour: number,
  bookings: { teacher: string; startHour: number; endHour: number }[],
): boolean {
  if (startHour >= endHour) throw new Error("Bad timing");
  return !bookings.some(
    (b) =>
      b.teacher === teacher &&
      startHour < b.endHour &&
      b.startHour < endHour,
  );
}

class Batch {
  subject = "";
  startHour = 0;
  endHour = 0;
  teacher = "";
}

Your tasks:
- Name the clump. Apply the deletion test: does endHour mean anything without startHour?
- Create a TimeSlot class with the validation in its constructor and two methods: overlaps(other: TimeSlot) and a nice toString().
- Rewrite both functions and the Batch class to use TimeSlot. How many copies of the startHour >= endHour check remain? (Target: exactly one.)
- Bonus: the overlap formula (start < b.end && b.start < end) is tricky logic sitting inside isTeacherFree. After your refactor it should be a tested, reusable method on TimeSlot. Write two quick test cases for it: touching slots (10-12 and 12-14) and overlapping slots (10-13 and 12-14).
- Extra challenge: plot TimeSlot on Figure 6's quadrant. Which corner does it land in, and why?
With this lesson, you have completed the whole Bloater family: Long Method, Large Class, Primitive Obsession, Long Parameter List, and Data Clumps. Notice how often the same medicines reappeared — Extract, Introduce, Preserve. Learn those few refactorings well, and the whole bloated family fears you — just like every form at St. Mary's now fears Kabir's laminated ID card.
Frequently asked questions
- What is the quickest test to confirm a data clump?
- The deletion test, suggested by Martin Fowler: imagine deleting one value from the group. Do the remaining values still make sense? An endDate without a startDate is meaningless — so the pair is a true clump and deserves to become one object.
- How is Data Clumps different from Long Parameter List?
- Long Parameter List is about one signature having too many parameters. Data Clumps is about the same small group repeating in many places — signatures, fields, locals. A clump often causes long parameter lists, but it can hide in field declarations too, where no parameter list exists.
- How is Data Clumps different from Primitive Obsession?
- Primitive Obsession is about one value wearing the wrong type (an email as a plain string). Data Clumps is about several values missing a shared home (street, city, PIN never bundled). They usually appear together, and the same cure — a value object — often fixes both.
- Is every pair of values that appears together a clump?
- No. Two values that meet once by coincidence are not a clump. Bundling them invents a fake concept that confuses readers. Look for groups that repeat across many places, fail the deletion test, or share a rule like 'start must come before end'.
- Which refactorings fix data clumps?
- Extract Class when the clump appears as fields, Introduce Parameter Object when it appears in signatures, and Preserve Whole Object when callers take an object apart just to pass its pieces. Often you use all three together.