Demystifying Lock Files and Package Management at Buffer

yarn.lock, package-lock, lerna, npm install, what??

At Buffer, as part of our strategy for working on multiple products, our teams have fully adopted a React-based application workflow. As an engineer who doesn't actually work on the products, I've found a lot of this foreign. I don't even use React or Node.js in my day-to-day work or in any of my side projects. It looks like I now have a chance to dive into the world of Node package management, though.

Over the months, the team has upgraded its workflow: moving from npm to yarn, then adopting lerna and yarn workspaces. A lot of this has resulted in better build times and better package management. At the same time, we've adopted some of these tools without taking the time to fully understand what's going on under the hood. This has raised some really interesting questions, which we've tried to answer as we go along.

A quick summary of my understanding so far

I could be completely wrong with my analysis here so don't take anything here as "this is how stuff works". Not yet. But what I do know at a basic level is that a Node project has a lot of dependencies that need to be installed. And these dependencies depend on other dependencies. Which could potentially depend on other dependencies. You get the idea.

All these dependencies mean that a lot of the internet gets downloaded each time you install packages that the main project depends on. There are several possible layers in how package management can work with Node.

At the first layer, you define whatever packages you need in a package.json file. It's a specification that says I need package-x. You can even specify an exact version or a more general version or no version at all. There are pros and cons of each approach but it's the first two I'm interested in.

If you specify an exact version, like 3.2.1, then you'll always be at that version. It's up to you to keep up with releases and bump the version number as you go along. Useful for really critical projects.
If you specify a general version of sorts, like ^3.2.1, then you'll accept any 3.x.x version from 3.2.1 upward. Basically you'll use the latest release within major version 3 (with 3.2.1 as the oldest allowed). Could be wrong here, but that's what my initial reading felt like.
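To make the second case concrete, here is a minimal sketch of how a caret range could be checked. It is my own illustration, not npm's actual implementation: the names parseVersion, compareVersions, and satisfiesCaret are made up, it only handles plain x.y.z versions, and it refuses 0.x ranges because the caret behaves differently there.

```typescript
// Sketch only: plain "major.minor.patch" versions, no pre-release tags.
interface Version {
  major: number;
  minor: number;
  patch: number;
}

// Parse "3.2.1" into its three numeric parts (hypothetical helper).
function parseVersion(text: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (!match) {
    throw new Error(`not a plain x.y.z version: ${text}`);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// Negative if a is older than b, zero if equal, positive if newer.
function compareVersions(a: Version, b: Version): number {
  if (a.major !== b.major) return a.major - b.major;
  if (a.minor !== b.minor) return a.minor - b.minor;
  return a.patch - b.patch;
}

// True when `version` falls inside a caret range such as "^3.2.1":
// same major version, and not older than the version named in the range.
function satisfiesCaret(version: string, range: string): boolean {
  if (range.charAt(0) !== "^") {
    throw new Error(`not a caret range: ${range}`);
  }
  const floor = parseVersion(range.slice(1));
  if (floor.major === 0) {
    // Assumption: 0.x ranges are out of scope for this sketch.
    throw new Error("0.x caret ranges are not handled in this sketch");
  }
  const candidate = parseVersion(version);
  return candidate.major === floor.major && compareVersions(candidate, floor) >= 0;
}

console.log(satisfiesCaret("3.2.2", "^3.2.1")); // newer patch, same major
console.log(satisfiesCaret("4.0.0", "^3.2.1")); // next major version
```

The first call should be accepted and the second rejected, which is exactly the "latest in version 3, starting from 3.2.1" behaviour described above.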

In the second scenario though, you really don't want to install v3.2.1 locally, and then when you push to production through your continuous integration pipeline have production running v3.2.2 without you realising. This can happen when you step away for some coffee, come back, and deploy, and in the meantime the package owner has pushed a new version. But oops! The package owner pushed a regression by mistake. Oh no! So how do package managers solve this without asking you to always use an exact version instead?

Enter lock files.

Lock files were introduced to solve a host of problems. I won't explain them. The npm site has some wonderful documentation (and some more over here regarding the actual package-lock.json). The quote that matters from that doc is:

This file describes an exact, and more importantly reproducible node_modules tree. Once it's present, any future installation will base its work off this file, instead of recalculating dependency versions off package.json.
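The coffee-break scenario can be modelled in a few lines. This is a toy sketch of the idea in the quote, not how npm or yarn actually resolve packages: Registry, LockFile, and resolveVersions are names I made up, and the registry here simply lists the versions that already satisfy the range in package.json, oldest to newest.

```typescript
// Hypothetical model: package name -> versions allowed by package.json,
// listed oldest to newest.
type Registry = Record<string, string[]>;
// Hypothetical model of a lock file: package name -> exact version.
type LockFile = Record<string, string>;

// Pick a version for each package. A locked version always wins;
// otherwise fall back to the newest version the registry offers.
function resolveVersions(names: string[], registry: Registry, lock?: LockFile): LockFile {
  const resolved: LockFile = {};
  for (const name of names) {
    const locked = lock ? lock[name] : undefined;
    if (locked !== undefined) {
      resolved[name] = locked;
      continue;
    }
    const candidates = registry[name] ?? [];
    const newest = candidates[candidates.length - 1];
    if (newest === undefined) {
      throw new Error(`no version of ${name} satisfies the range`);
    }
    resolved[name] = newest;
  }
  return resolved;
}

// Local install: only 3.2.1 exists, and the result is saved as the lock.
const registryAtInstall: Registry = { "package-x": ["3.2.1"] };
const savedLock = resolveVersions(["package-x"], registryAtInstall);

// Deploy after the coffee break: 3.2.2 has been published in the meantime.
const registryAtDeploy: Registry = { "package-x": ["3.2.1", "3.2.2"] };
const withoutLock = resolveVersions(["package-x"], registryAtDeploy);
const withLock = resolveVersions(["package-x"], registryAtDeploy, savedLock);

console.log(withoutLock["package-x"]); // picks up the new release
console.log(withLock["package-x"]); // stays on what was installed locally
```

Without the lock, production drifts to whatever is newest at deploy time; with it, production gets the same version that was installed locally.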

Now you have the best of both worlds. When you do want to upgrade, you can choose to run npm install <package>@<version> and that will automatically update the lock file for you. Or you can use --no-shrinkwrap to ignore the lock file and get a fully upgraded node_modules instead. Whether this last flag updates package-lock.json or not is something I'll need to find out.

The problems don't end here though. Many packages installed in a project share the exact same dependencies as other packages in that project. The process of eliminating these repeated copies is called de-duplication. I need to research more about how this works with package managers.
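Until I've done that research, here is the core idea as I currently picture it, as a rough sketch. It only collapses requests for the identical name and version into one shared copy; what npm and yarn really do when laying out node_modules is more involved. Requirement, dedupe, and the package names are all hypothetical.

```typescript
// One entry per "package A needs dependency B at version V" (hypothetical shape).
interface Requirement {
  requiredBy: string;
  name: string;
  version: string;
}

// Group requirements by "name@version". Each key is one copy that needs to be
// installed, and the value lists every package that shares that copy.
function dedupe(requirements: Requirement[]): Record<string, string[]> {
  const copies: Record<string, string[]> = {};
  for (const req of requirements) {
    const key = `${req.name}@${req.version}`;
    const dependents = copies[key] ?? [];
    dependents.push(req.requiredBy);
    copies[key] = dependents;
  }
  return copies;
}

const requirements: Requirement[] = [
  { requiredBy: "package-a", name: "shared-lib", version: "4.17.4" },
  { requiredBy: "package-b", name: "shared-lib", version: "4.17.4" },
  { requiredBy: "package-c", name: "shared-lib", version: "3.10.1" },
];

const copies = dedupe(requirements);
// Three requests, but only two distinct copies need to be installed.
console.log(Object.keys(copies).length);
console.log(copies["shared-lib@4.17.4"]);
```

Here package-a and package-b end up sharing one copy, while package-c keeps its own because it asked for a different version.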

Then you have something called deterministic vs non-deterministic builds. In the past, npm would apparently install things on people's computers with ever so slightly different configurations. It was really only in mid-2017 that npm too introduced lock files. In the meantime, one of the major tools that came up as a solution to that problem is yarn, which you can check out here. Apparently npm is catching up to yarn, which might make the case for yarn disappear over time.

Lastly, we have one more area about which I have no idea. This is because I haven't worked on a Node app in a production setting. With any big project in any language that I can think of, code is eventually split into multiple packages for sanity's sake. I'm unaware of the history of this, but there are tools built around making this process a sane workflow in the Node world. Essentially, a sane process is not baked into Node from what I can tell. In the world of multi-package management, there are two tools that we use at Buffer: yarn workspaces and lerna. Apparently, the latter works better with the former, which is a very good reason to continue using yarn.

At the time of writing this, I'm literally reading as I go along. Here are the relevant posts so far:

On yarn and determinism. Note the post date
On determinism and other features introduced in npm 5. Related: A twitter post first showing the speed of npm 5
On workspaces in yarn

Questions that need digging into

This brings me to my final understanding of the project which I mostly came to while writing this down:

Buffer's publish product for example (whose source you can check out here) uses yarn, yarn workspaces, and lerna to manage its dependencies. That said, it feels like there might be some changes that these tools have received over time that we don't fully take advantage of. Some examples:

Why use yarn install followed by lerna bootstrap in our init step when the latter should do the former anyway?
Are the duplicated steps in our code necessary?
Why does our Dockerfile use yarn install?
At the same time, another product we are working on hasn't adopted the same practice.

Separate from all of this is another issue around lock files.

If I run yarn install locally on my machine, the yarn.lock file is always regenerated. Isn't this defeating the purpose entirely? Or is it that we aren't managing our process around lock files correctly (so what's in the repo doesn't actually have everything we need, therefore generating a brand new lockfile each time)?

End goal

Outside of these questions, the end goal is to analyze what our main public Node.js based repos (buffer publish, buffer analyze, and account management) are using as processes for package management and to build a unified, reproducible way of managing workflows. These can be free to evolve and diverge again over time but now feels like a good time to bring them all back together and merge all the learnings before setting them free again :D.

Posted on February 06 2018 by Adnan Issadeen