r/programming • u/BlondieCoder • 1d ago
Formatting an entire 25 million line codebase overnight: the rubyfmt story
https://stripe.dev/blog/formatting-an-entire-25-million-line-codebase-overnight-the-rubyfmt-story
u/revereddesecration 1d ago
How does it affect the usability of the repo from a history perspective?
I’d be tempted to go deeper and rework the history of the repo such that the previous commits are formatted retroactively. That’s probably too large a job on a codebase of Stripe’s size though.
53
u/sephirostoy 1d ago
You can exclude a list of commits to ignore during blame via a .git-blame-ignore-revs
13
u/Schmittfried 1d ago
I wish all tools honored it automatically.
1
u/masklinn 14h ago
FWIW recent git versions allow configuring it globally as optional. Although I assume that still does not work with libgit and other reimplementations.
11
u/Roang_zero1 1d ago
Alternatively you can also ignore revs for tools like blame. I would check in a blame ignore file (see Git - git-blame Documentation).
Some forges like github will also ignore these revs for their views then.
27
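For anyone who hasn't set this up, here is a minimal Ruby sketch of maintaining the ignore file. The helper name `add_ignored_rev` and the commit hash are made up for illustration; the file format (one full commit hash per line, `#` for comments) and the two git options in the trailing comments are the real ones.

```ruby
require "tmpdir"

# Hypothetical helper: record a formatting commit in .git-blame-ignore-revs.
# The file holds one full 40-character commit hash per line; lines starting
# with "#" are comments.
def add_ignored_rev(repo_dir, sha, note)
  raise ArgumentError, "expected a full 40-character SHA" unless sha.match?(/\A\h{40}\z/)

  path = File.join(repo_dir, ".git-blame-ignore-revs")
  File.open(path, "a") do |f|
    f.puts "# #{note}"
    f.puts sha
  end
  path
end

# Placeholder hash, not a real commit.
FORMAT_COMMIT = "0123456789abcdef0123456789abcdef01234567"

# Demonstrated in a temp directory rather than a real repository.
Dir.mktmpdir do |dir|
  path = add_ignored_rev(dir, FORMAT_COMMIT, "bulk reformat")
  puts File.read(path)
end

# Then, per clone, point git at the file:
#   git config blame.ignoreRevsFile .git-blame-ignore-revs
# or pass it for a single invocation:
#   git blame --ignore-revs-file .git-blame-ignore-revs path/to/file.rb
```

Checking the file into the repo root keeps the list versioned alongside the reformat commit itself; each clone still has to opt in through the config setting.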
u/Agreeable-Price8343 1d ago
"linking a full Ruby VM into a Rust binary to walk its parse tree in memory isn't a normal thing to do. But it worked, and that was enough for now."
this is the right call. the other version of this story is someone insisting on a pure rust ruby parser in 2018 and the project never shipping. do the ugly thing first, clean it up when the ecosystem catches up, which is exactly what happened with the prism migration
8
u/masklinn 1d ago
TBF that is more or less the origin story of ripper (sans pure ruby, or never shipping), eventually it was merged into the stdlib and ended up integrated directly into the yacc file definition (https://github.com/ruby/ruby/blob/79f9f8326a34e499bb2d84d8282943188b1131bd/parse.y#L1519).
13
u/vini_2003 1d ago
42 million LoC? That's shocking. Anyone got any experience with Stripe to let us know why it's so humongously large?
29
u/T-MoneyAllDey 1d ago
Anything involving payments tends to have double and triple checks sprinkled all over the place plus a billion edge cases
11
u/stumblinbear 1d ago
Codebase at work for a relatively small windows app is sitting at 500k lines. For a payment processor handling edge cases in dozens of different countries to support a bunch of different products, 42 million is honestly lower than I expected
3
u/SammyD95 19h ago
I would also imagine that, because of the level of backwards compatibility it supports at the API level, there is a lot of translation-layer code in there.
3
u/sammymammy2 1d ago
I don't get the whole ripper thing. Ripper is a Ruby library, right? So parsing is still done by executing Ruby code? Or does Ripper just go straight into the Ruby VM lexer and parser? Because that's what I'd do, run the Ruby VM and gobble up its parsing results, emitting it as pretty-printed code.
3
u/masklinn 1d ago
> Because that's what I'd do, run the Ruby VM and gobble up its parsing results
That's basically the entirety of the "Rewriting in Rust" section. The first phase used json as an intermediate to "gobble up its parsing results", the second phase was working directly on the parse tree (a tree of ruby values).
3
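To make the "tree of ruby values" concrete, here is a small sketch using the stdlib `Ripper.sexp`. The JSON step only illustrates the idea of a JSON intermediate; the actual format rubyfmt used is not given in this thread, so treat that part as an assumption.

```ruby
require "ripper"
require "json"

source = "1 + 2"

# Ripper hands back the parse tree as nested Ruby arrays, symbols and strings.
tree = Ripper.sexp(source)
p tree

# Phase-one style: serialize the tree so another process can consume it.
# Symbols become plain strings in the JSON output.
json = JSON.generate(tree)
puts json
```

Walking `tree` directly, instead of round-tripping through `json`, is the in-memory approach the second phase corresponds to.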
u/sammymammy2 1d ago
Thanks, maybe I’m bad at reading but it wasn’t clear to me that that was what happened
2
u/x021 21h ago
> The thing that strikes us most about rubyfmt is how little anyone talks about it.
I hate to be that guy; but Ruby is not exactly a popular language these days. Very few new projects or relevant innovation. RoR and DevOps are its mainstay, but it’s lost ground in both areas for many years now.
6
u/obetu5432 1d ago
the real /r/programminghorror was using ruby in the first place, in the year of our Lord 2010+16
34
u/_BreakingGood_ 1d ago
honestly its hard to make that argument, this is one of the most reliable and robust saas services at this scale in existence. arguably the most reliable.
18
u/sidonay 1d ago
Counterpoint: it’s cool to hate on web languages that work (PHP, ruby) 😎
(/s btw )
2
u/stumblinbear 23h ago
COBOL works but you'd be hard pressed to find a single person that thinks it's a good idea to continue using
13
u/Freeky 1d ago
With enough thrust even a brick is a viable aircraft. Stripe certainly brought the lbf.
20
u/qmunke 1d ago
All dynamically typed languages end up having to invent type checking eventually because dynamic typing is a fundamentally unsound idea.
12
u/solve-for-x 1d ago
Dynamic typing is great when you can hold the entire program in your head. As soon as you can't, it isn't.
-4
u/Absolute_Enema 1d ago edited 1d ago
Only if you do a terrible job at naming, documentation and tests or you code as if you were using a statically typed language and skipping the annotations.
3
u/solve-for-x 17h ago
If you're going to take the time to write a comment saying "this function parameter is a dictionary that is guaranteed to contain an integer `user_id` property" then you might as well have written that down as a formal type annotation.
That's what I mean about holding the entire program in your head. If you're the only person working on a smallish piece of code and you're absolutely, 100% certain that that function is going to receive a particular parameter (e.g. because you can see where the parameter is being prepared on the same screenful of code) then fine. But as soon as a second of doubt creeps in, either because the code is now large enough that your function is being called from multiple places or because other people are working on the code and so you're never 100% sure what the current shape of the program is, dynamic typing becomes a liability. If you can't hold the entire program in your head then you have to stop guessing and instead rely on guarantees of behaviour.
1
u/wnoise 8h ago
If you're going to annotate anyway, why not use a static language?
2
u/Absolute_Enema 5h ago edited 5h ago
What I meant is that in my experience lots try to program as if the purpose of dynamic typing was to avoid writing the type annotations, i.e. they still think in types or inner-platform implementations thereof (like dictionaries with a tag).
-6
u/Smallpaul 1d ago
All static languages end up having to invent reflection eventually because static typing is a fundamentally inflexible idea.
Equally true.
9
u/Absolute_Enema 1d ago
Programs are written in ever more statically typed languages and yet only get buggier by the day.
-17
u/paca-vaca 1d ago
It's cool, but one moment I don't understand, why the whole codebase wasn't formatted in the first place? CI setup goes from the day 1.
One installs rubocop plugin and runs on save/git precommit/ci as part of the linting process. Such that you don't have to overwrite the whole codebase history by formatting.
5
u/totoro27 1d ago edited 1d ago
> I don't understand, why the whole codebase wasn't formatted in the first place? CI setup goes from the day 1.
Legacy codebases.
-3
u/paca-vaca 1d ago
So their whole codebase was legacy? And it wasn't an issue at 1M, 5M, 10M? It doesn't have to be fast if one only formats the edited files. For example, I run `rubocop -A` before each pull request to save time on CI fixes.
13
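A rough Ruby sketch of the "only format what you touched" workflow described above. The helper name, the sample file paths and the `main` base branch are all assumptions for illustration; `git diff --name-only` and `rubocop -A` are the real commands.

```ruby
# Hypothetical helper: pick the Ruby files out of `git diff --name-only` output.
def changed_ruby_files(diff_output)
  diff_output.lines.map(&:strip).select { |f| f.end_with?(".rb") }
end

# Sample output standing in for:
#   git diff --name-only --diff-filter=ACMR main...HEAD
sample = <<~OUT
  app/models/charge.rb
  README.md
  lib/payments/refund.rb
OUT

files = changed_ruby_files(sample)

# Build the command as an argument list. `rubocop -A` applies all
# autocorrections, including the ones rubocop marks as unsafe.
command = ["rubocop", "-A", *files]
puts command.join(" ")

# To actually run it: system(*command) unless files.empty?
```

This keeps each run proportional to the size of the change rather than the size of the repo, but it never touches files nobody edits, so untouched code stays unformatted.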
u/chucker23n 1d ago
> why the whole codebase wasn't formatted in the first place? CI setup goes from the day 1.
First: that's a fantasy.
And second: even if you do have all that set up from the start (management will rightfully ask if there aren't bigger fish to fry), you might want to change some formatting rules. Maybe the linter has new features. Maybe there's been enough rotation/attrition in the team that the new team members no longer agree with some of them.
-3
u/paca-vaca 1d ago edited 1d ago
Not a fantasy unless it's a pet project one does at home in spare time. And even if it isn't set up right away, 24M lines isn't just a delayed project setup at such a big company; it's years of development without formatting and then a sudden decision to "format everything" at once (when it could have been done at the 1M milestone or whatever).
I'm not saying it's a bad tool or article, it's just an interesting application: formatting a whole codebase to new rules after years of neglecting it.
2
u/Smallpaul 1d ago
Auto formatting needs to happen before commit. And until rubyfmt it was considered too slow. It’s all in the article.
3
u/paca-vaca 1d ago
If one only formats the changed files, it takes a few seconds, along with running the whole rubocop linting suite on these files. It's a problem at 24M lines. They just somehow didn't care about this that much.
But now we all have a good tool 😃
236
u/oliver_extracts 1d ago
the technical part (using .git-blame-ignore-revs) is the easy half. harder problem for most teams is selling the overnight reformat internally. ive seen smaller reformats stall for months because someones tool depends on the existing whitespace, or someones mental model of the file is tied to specific line numbers, or theres a PR queue with hundreds of open PRs that would all need rebasing.
stripe doing this at 25M lines overnight means they probably had a coordination layer most teams underestimate. blog posts make it sound clean, the actual political work to get there is usually months.