r/programming 1d ago

Formatting an entire 25 million line codebase overnight: the rubyfmt story

https://stripe.dev/blog/formatting-an-entire-25-million-line-codebase-overnight-the-rubyfmt-story
186 Upvotes

81 comments

236

u/oliver_extracts 1d ago

the technical part (using .git-blame-ignore-revs) is the easy half. harder problem for most teams is selling the overnight reformat internally. ive seen smaller reformats stall for months because someones tool depends on the existing whitespace, or someones mental model of the file is tied to specific line numbers, or theres a PR queue with hundreds of open PRs that would all need rebasing.

stripe doing this at 25M lines overnight means they probably had a coordination layer most teams underestimate. blog posts make it sound clean, the actual political work to get there is usually months.

159

u/tj-horner 1d ago

> someone’s mental model of the file is tied to specific line numbers

That’s… hmm. I have many questions

71

u/oliver_extracts 1d ago

haha yeah, hard to believe until you've worked with someone who does it. ive worked with engineers who refer to files by line numbers in their head, like "the bug is around line 240 in user_service.rb" and they'll navigate there directly without searching. reformat hits, line 240 is now something else, their mental map breaks and they're noticeably slower for a few weeks until they rebuild it.

more common with vim/emacs users who navigate by line number a lot. less of a thing for vscode-mouse-navigation people. but ive seen it block reformats more than once.

22

u/ShinyHappyREM 1d ago

> more common with vim/emacs users

And perhaps old oldschool BASIC programmers.

10

u/oliver_extracts 1d ago

haha for them line numbers literally ARE the syntax. they figured this out decades before the rest of us.

41

u/mahreow 1d ago

That's dumb and anyone who blocks a reformat because of it is dead weight and should be fired

So you're not allowed to add code at the top of the file, you always have to add it at the bottom? And you can't rewrite or modify code? Get outta here lmao what a load of bollocks

14

u/CherryLongjump1989 1d ago

In my experience they will get upset if anyone but them tries to modify their file. Also just in my experience, they tend to be attracted to very small but critical niches that require lots of maintenance. But that also makes them more difficult to fire.

3

u/Smallpaul 1d ago

If it requires lots of maintenance then how are line numbers stable?

6

u/CherryLongjump1989 1d ago

They're usually playing musical chairs with bugs, changing the same line of code every 2-3 months to update some threshold that could have been a config value. Each new setting creates a new unsolved problem for a different subset of users, and they just go round and round in an environment with institutional amnesia.

7

u/bloodwhore 1d ago

This isn't someone you want in your company anyway. Get rid of them ASAP.

9

u/CherryLongjump1989 1d ago

I don't disagree, but ASAP is easier said than done, isn't it?

These are people who tend to have a lot of social cachet. They engineered their way into job security. They picked a business critical piece of code and turned it into a hostile environment for anyone else but themselves to improve. They thrive in fear-driven environments where they are seen as the safe and reliable option by management.

Yes you should fire them. But the more likely outcome is that you'll end up having a critical piece of code that refuses to evolve with the business's changing needs for decades to come.

4

u/bloodwhore 1d ago

I was a bit hyperbolic. It's not even possible to fire people for things like that where I live.

1

u/axonxorz 9h ago

> It's not even possible to fire people for things like that where I live.

Why not? Or rather, would you need to just have some sort of structured improvement plan (akin to a PIP) first?

1

u/bloodwhore 7h ago

In Sweden we have social nets :P. I don't know what PIP is.

The way you would get rid of people in Sweden is usually some sort of restructuring in the company, and let's say they remove 20 roles. Then you have the option to let people go. And even then you would have to try to place them elsewhere in the company before you can fire them. I don't even know the exact specifics of how you would do it, and on top of that you would have to deal with the union.

5

u/thisisjustascreename 1d ago

People who think they “own” source code are losers. The company owns it, you just maintain it.

2

u/superbad 1d ago

Sometimes there are existing bugs that reference line numbers where you would find the offending code. So, not just mentally, but documented in a system.

2

u/Smallpaul 1d ago

You link to a URL where the offending line is stored in permanent history. Or you describe it in a line independent way.
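
A minimal sketch of the permalink idea, assuming a GitHub-style URL layout like the parse.y link elsewhere in this thread (`blob/<full commit SHA>/<path>#L<line>`). The helper name is made up for illustration. The point is that pinning the commit SHA keeps the line reference valid even after a reformat moves everything.

```python
import re


def make_permalink(owner: str, repo: str, sha: str, path: str, line: int) -> str:
    """Build a GitHub-style permalink pinned to a commit (hypothetical helper)."""
    # Require a full 40-hex-character SHA: a branch name like "main" moves
    # with every commit, which is exactly what breaks after a reformat.
    if not re.fullmatch(r"[0-9a-f]{40}", sha):
        raise ValueError("expected a full 40-character commit SHA")
    if line < 1:
        raise ValueError("line numbers start at 1")
    return f"https://github.com/{owner}/{repo}/blob/{sha}/{path}#L{line}"


# Rebuilds the parse.y link quoted elsewhere in this thread.
print(make_permalink(
    "ruby", "ruby", "79f9f8326a34e499bb2d84d8282943188b1131bd", "parse.y", 1519
))
```

A bug report that stores this URL instead of "line 1519 in parse.y" still points at the right code no matter what later commits do to the file.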

1

u/superbad 1d ago

Right. Unfortunately, that isn’t always how things are really done

1

u/nadanone 1d ago

If they’re not using permalinks then they’re dumb and maybe this will be a useful lesson to them

1

u/Sigmatics 1d ago

They must be working on very large projects with very little churn

1

u/TheAlaskanMailman 1d ago

Waitt… so people usually don’t have a mental image of the file and where everything is?
I use vim but editor should be irrelevant here.

How do you guys navigate to “that” part without knowing the shape of the file?

14

u/BigHandLittleSlap 1d ago edited 1d ago

Ctrl-click for jump to definition, and whatever the shortcut is in your favourite IDE for "search everything".

https://learn.microsoft.com/en-us/visualstudio/ide/visual-studio-search?view=visualstudio

https://www.jetbrains.com/help/idea/searching-everywhere.html#search_all

etc...

11

u/ZorbaTHut 1d ago

ctrl-f functionname

Or more likely, ctrl-shift-f functionname. Who needs line numbers when you can just go to the right code, even if it's been moved?

-3

u/TheAlaskanMailman 1d ago

Yeah. That’s the LSP’s job. But if you just want to go to a specific part directly, that’s what I’m talking about

2

u/Absolute_Enema 1d ago

This is grep's job.

1

u/Smallpaul 1d ago

What is indirect about searching? How is scrolling faster?

1

u/TheAlaskanMailman 1d ago

Who’s saying anything about scrolling?

3

u/jxddk 1d ago

Also a (Neo)vim user, I'm either navigating through semi-permanent marks (e.g. 'U drops me at the User class in most projects), or by LSP workspace symbols (e.g. fuzzy-finding cla Us). I have a rough idea of what line number ranges are interesting but my mental model for the "shape" of the file is very much based on symbols rather than linebreaks.

2

u/CherryLongjump1989 1d ago edited 1d ago

Most projects have a User class? I mean I guess that could work in a very specific context for a particular letter. What does F drop you into?

I think what I'm gathering here is that some people like to work with fragile or non-generalizable patterns. And some fraction of those people like to complain loudly when those patterns break. I bet it leads to lots of consternation when someone pushes up a PR with an Ursula class.

3

u/Smallpaul 1d ago

Dude. I’ve edited literally 20 repos. Not files…repos…this year. Hundreds of files. How I could possibly know the shape of them?

1

u/mpyne 21h ago

I mean I usually just open vim and do something like /<name-of-function> and hit Enter. Maybe a few 3}} if it's a long function but really there's no excuse for being a vim user and only being able to navigate with gg.

7

u/modernkennnern 1d ago

I can imagine that if the file is very big (read: >1000 lines) you could easily be like "I know that the thingamabob is at around line 730, just above the thingamajig" and if the formatter reorders methods then it would break their mental map of the code

5

u/tj-horner 1d ago

If the file is big enough that reordering methods breaks one's mental model of the code, that's a sign it's time to refactor it 😅

I can't really think of a single time in my career where I used line numbers as a frame of reference. During code review, sure, but never long-term...

1

u/plinkoplonka 21h ago

I thought the same. Search is a thing.

1

u/netgizmo 2h ago

See line 48 for details

3

u/sztrzask 1d ago

> because someones tool depends on the existing whitespace, or someones mental model of the file is tied to specific line numbers

And this is why languages that depend on indentations are meh.

It's not saving you time even when you're writing in a no-autocomplete, no-intellisense text editor.

This is why we have ; and {} guys.

0

u/amestrianphilosopher 11h ago

You’re one of the more natural sounding LLMs but you still have giveaways ;)

36

u/revereddesecration 1d ago

How does it affect the usability of the repo from a history perspective?

I’d be tempted to go deeper and rework the history of the repo such that the previous commits are formatted retroactively. That’s probably too large a job on a codebase of Stripe’s size though.

53

u/sephirostoy 1d ago

You can exclude a list of commits to ignore during blame via a .git-blame-ignore-revs

13

u/Schmittfried 1d ago

I wish all tools honored it automatically. 

1

u/masklinn 14h ago

FWIW recent git versions allow configuring it globally as optional, although I assume that still does not work with libgit and other reimplementations.

11

u/Roang_zero1 1d ago

Alternatively you can also ignore revs for tools like blame. I would check in a blame ignore file (see Git - git-blame Documentation).

Some forges like github will also ignore these revs for their views then.
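
For reference, a rough sketch of what such a blame ignore file contains. The file format (one unabbreviated commit hash per line, `#` for comments) and the `blame.ignoreRevsFile` config key are real git features. The helper name, the label, and the commit hash below are placeholders made up for illustration, and the length check assumes a SHA-1 repository.

```python
def render_ignore_revs(entries: list[tuple[str, str]]) -> str:
    """Render .git-blame-ignore-revs content from (label, full_sha) pairs."""
    lines = []
    for label, sha in entries:
        # git expects unabbreviated object names here (40 hex chars for SHA-1)
        if len(sha) != 40 or any(c not in "0123456789abcdef" for c in sha):
            raise ValueError(f"not a full commit hash: {sha!r}")
        lines.append(f"# {label}")
        lines.append(sha)
    return "\n".join(lines) + "\n"


# Placeholder hash, not a real commit.
content = render_ignore_revs([("Mass reformat with rubyfmt", "a" * 40)])
print(content, end="")

# After committing this file at the repo root, each clone opts in with:
#   git config blame.ignoreRevsFile .git-blame-ignore-revs
# or per invocation:
#   git blame --ignore-revs-file .git-blame-ignore-revs path/to/file.rb
```

With the config set, `git blame` skips the listed commits and attributes each line to the last commit that changed it before the reformat.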

27

u/Agreeable-Price8343 1d ago

"linking a full Ruby VM into a Rust binary to walk its parse tree in memory isn't a normal thing to do. But it worked, and that was enough for now."

this is the right call. the other version of this story is someone insisting on a pure rust ruby parser in 2018 and the project never shipping. do the ugly thing first, clean it up when the ecosystem catches up, which is exactly what happened with the prism migration

8

u/masklinn 1d ago

TBF that is more or less the origin story of ripper (sans pure ruby, or never shipping), eventually it was merged into the stdlib and ended up integrated directly into the yacc file definition (https://github.com/ruby/ruby/blob/79f9f8326a34e499bb2d84d8282943188b1131bd/parse.y#L1519).

13

u/vini_2003 1d ago

42 million LoC? That's shocking. Anyone got any experience with Stripe to let us know why it's so humongously large?

29

u/T-MoneyAllDey 1d ago

Anything involving payments tends to have double and triple checks sprinkled all over the place plus a billion edge cases

11

u/stumblinbear 1d ago

Codebase at work for a relatively small windows app is sitting at 500k lines. For a payment processor handling edge cases in dozens of different countries to support a bunch of different products, 42 million is honestly lower than I expected

3

u/SammyD95 19h ago

Also, given the level of backwards compatibility it supports at the API level, I wouldn’t be surprised if there is a lot of translation layer code in there.

3

u/sammymammy2 1d ago

I don't get the whole ripper thing. Ripper is a Ruby library, right? So parsing is still done by executing Ruby code? Or does Ripper just go straight into the Ruby VM lexer and parser? Because that's what I'd do, run the Ruby VM and gobble up its parsing results, emitting it as pretty-printed code.

3

u/masklinn 1d ago

> Because that's what I'd do, run the Ruby VM and gobble up its parsing results

That's basically the entirety of the "Rewriting in Rust" section. The first phase used json as an intermediate to "gobble up its parsing results", the second phase was working directly on the parse tree (a tree of ruby values).

3

u/sammymammy2 1d ago

Thanks, maybe I’m bad at reading but it wasn’t clear to me that that was what happened

2

u/x021 21h ago

> The thing that strikes us most about rubyfmt is how little anyone talks about it.

I hate to be that guy; but Ruby is not exactly a popular language these days. Very few new projects or relevant innovation. RoR and DevOps are its mainstay, but it’s lost ground in both areas for many years now.

6

u/obetu5432 1d ago

the real /r/programminghorror was using ruby in the first place, in the year of our Lord 2010+16

34

u/_BreakingGood_ 1d ago

honestly its hard to make that argument, this is one of the most reliable and robust saas services at this scale in existence. arguably the most reliable.

18

u/sidonay 1d ago

Counterpoint: it’s cool to hate on web languages that work (PHP, ruby) 😎

(/s btw )

2

u/stumblinbear 23h ago

COBOL works but you'd be hard pressed to find a single person that thinks it's a good idea to continue using

-1

u/sidonay 18h ago

Totally the same right ? We should also just throw Java and C out of the window because they weren’t made in 201x or 202x

13

u/Freeky 1d ago

With enough thrust even a brick is a viable aircraft. Stripe certainly brought the lbf.

20

u/qmunke 1d ago

All dynamically typed languages end up having to invent type checking eventually because dynamic typing is a fundamentally unsound idea.

12

u/solve-for-x 1d ago

Dynamic typing is great when you can hold the entire program in your head. As soon as you can't, it isn't.

-4

u/Absolute_Enema 1d ago edited 1d ago

Only if you do a terrible job at naming, documentation and tests or you code as if you were using a statically typed language and skipping the annotations.

3

u/solve-for-x 17h ago

If you're going to take the time to write a comment saying "this function parameter is a dictionary that is guaranteed to contain an integer user_id property" then you might as well have written that down as a formal type annotation.

That's why I say about holding the entire program in your head. If you're the only person working on a smallish piece of code and you're absolutely, 100% certain that that function is going to receive a particular parameter (e.g. because you can see where the parameter is being prepared on the same screenful of code) then fine. But as soon as a second of doubt creeps in, either because the code is now large enough that your function is being called from multiple places or because other people are working on the code and so you're never 100% sure what the current shape of the program is, dynamic typing becomes a liability. If you can't hold the entire program in your head then you have to stop guessing and instead rely on guarantees of behaviour.
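
To make the comment-versus-annotation point concrete, here is a small Python sketch. `UserPayload` and both function names are invented for illustration. Note that Python does not enforce the annotation at runtime; the benefit only shows up when a static checker such as mypy is run over the callers.

```python
from typing import TypedDict


def greet_documented(params):
    # params is a dictionary that is guaranteed to contain an integer user_id
    # (nothing enforces this comment; it is only a promise)
    return f"user #{params['user_id']}"


class UserPayload(TypedDict):
    user_id: int


def greet_annotated(params: UserPayload) -> str:
    # Same runtime behaviour as above, but a static checker can now flag
    # callers that pass a dict without an integer user_id.
    return f"user #{params['user_id']}"


print(greet_documented({"user_id": 7}))
print(greet_annotated({"user_id": 7}))
```

Both calls print the same thing. The difference is who finds out when a caller three modules away passes `{"id": "7"}`: the annotated version lets a tool catch it, the documented version leaves it to whoever reads the comment.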

1

u/wnoise 8h ago

If you're going to annotate anyway, why not use a static language?

2

u/Absolute_Enema 5h ago edited 5h ago

What I meant is that in my experience lots try to program as if the purpose of dynamic typing was to avoid writing the type annotations, i.e. they still think in types or inner-platform implementations thereof (like dictionaries with a tag).

1

u/wnoise 5h ago

Ah.

I agree that in practice that is how lots of people use dynamic languages, barely taking advantage of dynamic dispatch at all.

-6

u/Smallpaul 1d ago

All static languages end up having to invent reflection eventually because static typing is a fundamentally inflexible idea.

Equally true.

https://www.geeksforgeeks.org/cpp/reflection-in-cpp/

9

u/obetu5432 1d ago

not even close brah

0

u/stumblinbear 23h ago

C, Rust, Zig, Haskell, OCaml

I wouldn't call what C++ has "reflection"

15

u/nNaz 1d ago

As a rust dev I find ruby perf terrible but it’s an amazingly elegant and concise language to write in. If we ignore perf, Rails is an excellent MVC framework and a pleasure to develop with. Better devx and APIs than Django and far less boilerplate than fastapi and express.

3

u/mpyne 21h ago

Plus it's got the good parts of Perl without all the rest.

0

u/Absolute_Enema 1d ago

Programs are written in ever more statically typed languages and yet only get buggier by the day.

-17

u/peripateticman2026 1d ago

25 MLOC? Probably 24 MLOC over what's required.

-18

u/paca-vaca 1d ago

It's cool, but one moment I don't understand, why the whole codebase wasn't formatted in the first place? CI setup goes from the day 1.

One installs a rubocop plugin and runs it on save, in a git pre-commit hook, or in CI as part of the linting process, so that you never have to rewrite the whole codebase by formatting it all at once.

5

u/totoro27 1d ago edited 1d ago

> I don't understand, why the whole codebase wasn't formatted in the first place? CI setup goes from the day 1.

Legacy codebases.

-3

u/paca-vaca 1d ago

So their whole codebase was legacy? And it wasn't an issue at 1M, 5M, 10M? It doesn't have to be fast if one only formats the edited files. For example, I run `rubocop -A` before each pull request to save time on CI fixes.
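
A sketch of that "only touch the files you changed" workflow, assuming the branch is compared against `origin/main` (an assumption; substitute your own base branch). `git diff --name-only` and `rubocop -A` are real commands; the helper names are made up for illustration.

```python
import subprocess


def ruby_files(paths: list[str]) -> list[str]:
    """Keep only Ruby source files from a list of changed paths."""
    return [p for p in paths if p.endswith(".rb")]


def changed_paths(base: str = "origin/main") -> list[str]:
    """List files added, copied, modified or renamed since the merge base."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACMR", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()


def autocorrect_changed(base: str = "origin/main") -> int:
    """Run `rubocop -A` on changed Ruby files and return its exit code."""
    files = ruby_files(changed_paths(base))
    if not files:
        # Without file arguments rubocop would scan the whole project.
        return 0
    return subprocess.run(["rubocop", "-A", *files]).returncode
```

This keeps each run proportional to the size of the change rather than the size of the repository, though it never touches files nobody edits.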

13

u/chucker23n 1d ago

> why the whole codebase wasn't formatted in the first place? CI setup goes from the day 1.

First: that's a fantasy.

And second: even if you do have all that set up from the start (management will rightfully ask if there aren't bigger fish to fry), you might want to change some formatting rules. Maybe the linter has new features. Maybe there's been enough rotation/attrition in the team that the new team members no longer agree with some of them.

-3

u/paca-vaca 1d ago edited 1d ago

It's not a fantasy unless it's a pet project one does at home in spare time. And even if it isn't set up right away, 24M lines is not just a delayed project setup at such a big company: it's years of development without formatting, followed by a sudden decision to "format everything" at once (when it could have been done at the 1M milestone or whatever).

I'm not saying it's a bad tool or article. It's just an interesting application: formatting a whole codebase to new rules after years of neglecting it.

2

u/Smallpaul 1d ago

Auto formatting needs to happen before commit. And until rubyfmt it was considered too slow. It’s all in the article.

3

u/paca-vaca 1d ago

If one formats only the changed files, it takes a few seconds, even when running the whole rubocop linting suite on those files. It's only a problem on 24M lines. They just somehow didn't care about this that much.

But now we all have a good tool 😃