The early days of flight were fraught with uncertainty and peril. Instruments? Who needs instruments? We're talking VFR, folks. Pick a direction, and hope you can find a safe spot to land.
My father is a pilot of small planes. When I was growing up, he used to take me to the Wadsworth Municipal Airport on Saturdays to pal around with his pilot buddies at their hangars. Sometimes we'd take short trips, just for fun, to some other municipal airport nearby. The diner at Carrol County Airport served great pies. But forget about the pies---I got to fly! When I was a little kid, he'd stick me in the copilot seat and let me take the yoke from time to time.
There were a number of pilot sayings from that era that stuck with me. My father had a few of these on placards in his hangar.
"There are old pilots, and there are bold pilots, but there are no old, bold pilots."
"The air, like the sea, is very unforgiving of an error."
And one of the nearby airports had "Steve's Weather Rock," a 20-pound hunk of granite on a chain, hanging outside of its administration building, with a sign that read:
What we had: players spread out onto three or four servers for load-balancing purposes. During peak times, this was necessary to prevent any individual server from becoming too overloaded. During off-peak times, we kept sending players to all the previously-active servers to avoid any one server dying out unfairly (see the earlier Population Stabilization update). But this meant that during off-peak times, even with plenty of people still playing, the population on each server got a little thin.
What we want: everyone playing on one server, together, all the time.
The problem: CPU overload when populations get high results in lag for players, not to mention Linode sending me warning emails (these server nodes are virtual servers co-hosted on multi-core machines---I don't want to be a bad neighbor to other users who have virtual servers on the same host machine).
It has been a long time since I examined this problem in detail, so I wasn't really sure where the issue was, or if there even was an issue anymore. I was keeping the server population caps relatively low to avoid lag at all costs while I worked on other things.
So, I needed to do some stress-testing and some profiling. Server1, with its ancient, gigantic map that has maybe only been wiped once in the past eight months, was historically the biggest offender in this department, so it made the perfect candidate for a stress test. How many people can we put on there before it chokes?
Does the database engine need another overhaul?
Well, it turns out that with the existing database engine (which was written from scratch for our purposes and heavily optimized by me many months ago), we could pretty much house all the active players on server1 with no player lag. CPU usage, however, was going above and beyond what keeps Linode happy. At one point, our externally-monitored CPU usage was over 120%.
How is that possible? Well, it turns out that a virtual CPU consumes additional CPU resources on its host CPU, apparently overhead from the virtualization process itself. So, while I was seeing server1 sitting happily at 60% internally, it was well over 100% as far as Linode was concerned.
By running a busy-wait test program in parallel with server1 on the same node, I was able to push my internal CPU (viewed through top) up to 100%, and that brought Linode's CPU measurement up to 140%. Yikes. This likely means that my virtual server is so resource-hungry that the virtualization process is itself consuming resources from more than one physical core. I'm not sure of the details here, but that's my best guess.
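For readers who want to try a similar experiment, a busy-wait test program can be very small. The following is a minimal sketch in Python, not the actual test program used here; the function name `busy_wait` is made up for illustration. It spins on the clock without ever sleeping or touching the disk or network, which should pin one core at close to 100% in top for as long as it runs.

```python
import time


def busy_wait(seconds):
    """Spin on the clock for roughly `seconds` without ever sleeping.

    Hypothetical helper for illustration: the loop does no disk or
    network I/O, so all of the time it consumes is pure CPU time.
    """
    deadline = time.perf_counter() + seconds
    spins = 0
    while time.perf_counter() < deadline:
        spins += 1  # trivial work; the loop never yields voluntarily
    return spins
```

Calling `busy_wait(60)` in a second terminal while watching top (and the host's external CPU graph) gives a rough version of the comparison described above.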
Regardless, we want to steer WAY clear of 140%.
But the lack of lag when 170 players were together on the usually-bedraggled server1 was promising.
Were there any unnecessary hot spots left in the code that could be eliminated? Maybe the database engine needs to be rewritten again. Keeping the database in RAM is one idea that might speed things up, but who knows?
This is where profiling is supposed to help.
But existing profilers do a notoriously poor job at measuring actual performance issues in I/O-bound processes. My server is likely spending a lot of time waiting for data from the disk. Asleep, essentially. Not running code, in the way that a profiler might measure, but still slow.
After testing every profiling tool under the sun, and finding nothing that worked for this purpose, I ended up writing my own. More details about that, and proof that it works, and examples of why other profilers don't work, can be found here:
Profiling a toy program with a toy profiler is one thing, but profiling an extremely complex, multi-faceted server process is quite another. This made an excellent test case that helped me actually turn my toy profiler into a working, useable tool. At some point along the line, I realized that the text data that the profiler was outputting (essentially annotated stack traces) was too tedious to read through by hand, so I even wrote a conversion program that allows the resulting profile to be viewed in the Kcachegrind profile visualizer.
With all that working, here is a rough visualization of where server1 was spending its time while hosting 155 simultaneous players:
Now, before you tell me that I've lost my mind, let me reassure you that such an image isn't all that useful in practice. It's just the best way to quickly represent the complexity of the profile visually. In reality, I'm looking at sorted lists of functions and the number of samples that hit each function. But a screen shot of that doesn't make for a very interesting picture.
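To make the "sorted lists of functions" idea concrete, here is a small sketch of how sampled stack traces might be boiled down to per-function hit counts. It is an illustration only, not the profiler's real output format or code: the function `count_samples` and the sample data are assumptions, and each stack is simply modeled as a list of function names from outermost to innermost.

```python
from collections import Counter


def count_samples(stacks):
    """Return (function, hits) pairs, most-sampled first.

    A function gets one hit per sample in which it appears anywhere on
    the stack (an "inclusive" count). Ties are broken alphabetically so
    the ordering is deterministic.
    """
    hits = Counter()
    for stack in stacks:
        for name in set(stack):  # set(): a recursive call counts once per sample
            hits[name] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical samples; only recomputeHeatMap and the epoll wait are
# mentioned in the post, the rest is invented for the example.
example_stacks = [
    ["main", "epoll_wait"],
    ["main", "epoll_wait"],
    ["main", "recomputeHeatMap"],
    ["main", "dbGet"],
]
```

With the example data, `count_samples(example_stacks)` puts `main` first with 4 hits and `epoll_wait` second with 2, which is the kind of list that makes a time sink stand out.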
Anyway, from that image, we can see what looks like a pretty "clean room." That big "empty space" in the middle is indeed empty space: time the server spent waiting on epoll for incoming client messages. We're doing that 54% of the time. The rest of the clutter around the edges of the room is actual work being done.
The biggest forehead-slapper in the profile, which can actually be seen here in this image, is the 12% of our running time spent on recomputeHeatMap. This is the bit of code that examines the environment around you to determine how cold you are (the thermal propagation simulation). This is an expensive bit of code to run, but it's only supposed to be updated for two players every server step (thus spreading the load), so what's going on here?
It turns out that the wall-clock duration of a "server step" varies depending on the rate at which messages are arriving. Big gaps between messages mean the server sleeps longer before executing the next step. Short gaps mean many steps happen in a short time. The server is intentionally player-reactive in this way, actually using almost no resources at all if no one is logged in.
Checking the logs, I found that with such a huge population of players, with such a high inbound player message rate, the server step was being run something like 65 times per second. Yikes. Not only did this result in excessive calls to recomputeHeatMap (recomputing maps for something like 130 players every second, which isn't even useful), there were a bunch of other regular-interval parts of the server step that were being triggered 65 times per second as well. We don't need to check whether a player's curse score is decremented 65 times a second, for example.
After finding the parts of the server step that weren't necessarily reactive, I put them on fixed timesteps so that they would only run if enough time has passed, not every single step. Heat maps are now limited to 20 players per second, max, for example, regardless of how quickly messages are coming in.
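The fixed-timestep idea can be sketched in a few lines. This is a simplified illustration in Python, not the server's real code: the `Throttle` class, the `simulate` function, and the 0.1-second interval are all assumptions chosen so that two players per update works out to 20 players per second.

```python
import time


class Throttle:
    """Allows a periodic task to run at most once per `interval` seconds."""

    def __init__(self, interval, now=None):
        self.interval = interval
        self.next_due = time.monotonic() if now is None else now

    def ready(self, now=None):
        """Return True (and schedule the next run) if enough time has passed."""
        now = time.monotonic() if now is None else now
        if now < self.next_due:
            return False
        self.next_due = now + self.interval
        return True


def simulate(steps_per_second, seconds, heat_interval, players_per_update=2):
    """Count heat-map recomputations over a simulated run of server steps."""
    heat = Throttle(heat_interval, now=0.0)
    updated = 0
    for i in range(int(steps_per_second * seconds)):
        now = i / steps_per_second
        # Reactive work (handling incoming messages) would still run every step.
        if heat.ready(now):
            updated += players_per_update  # stand-in for recomputeHeatMap
    return updated
```

With these assumed numbers, `simulate(65, 1, 0.0)` recomputes 130 heat maps in one second (the unthrottled case), while `simulate(65, 1, 0.1)` recomputes 20, no matter how fast the steps arrive.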
The results are pretty dramatic. Here's the new profile picture, after these changes, with about 150 players on server 1:
And here's a 30-minute monitor graph of both old and new (sampled every 5 seconds, for 360 samples total):
Yes, that's around half the CPU used per player now. This should allow us to double the number of players that occupy a given server.
But even so, when we start getting above 60% internal CPU, external resource consumption can get up into the 90% range, which does not make Linode happy.
However, they did inform me that 2-core nodes (which are more expensive) are allowed to go up to 160% utilization, and 4-core nodes are allowed to go up to 320% utilization.
The server code is single-threaded, so it can't take advantage of more than one physical core directly, but the external resource consumption from virtualization, including disk access and so on, apparently can.
2 cores, 4x the RAM, a bigger disk, and a bigger upstream network pipe. Most of these extra resources aren't needed, but the extra core may help with external resource usage. Four times the cost, though. Is it worth it? How many players can we put on this sucker before it starts to choke?
To give you a taste of the difference between internal and external resource consumption on a virtual server, bigserver1 currently has 155 players on it. Internally, in top, it is using less than 1% of its CPU. Something around 0.3%, to be exact. Hard to believe, but true. A fresh---and tiny---map database likely helps with this, for sure.
But externally, as far as Linode is concerned? 50% CPU. Granted, I can safely go up to 160%, but still, 50% is way different than 0.3%. My external networking and disk access graphs are relatively high, though, and my guess is that some of those aspects contribute to external CPU usage. Again, my guess is that the process of virtualizing networking and disk involves extra host CPU operations that wouldn't be necessary on non-virtual hardware.
As another example, if I run a pure-CPU test process that busy loops, I see both 100% internally and externally, but that's a process that isn't touching the disk or network at all.
So, over the next few weeks, we'll see where bigserver1 can take us, in terms of a large population of players all in one cohesive world.
This week's update focuses on a bunch of bug fixes and other little improvements. I took some time off for the holidays this week, and will be back with a substantial content update next week.
The biggest change is an improvement to the way that players are automatically distributed among the available servers. The original goal was to keep as many players as possible together on the same server, and only expand to additional servers when necessary during a population boom. During a population decline, we still want as many players as possible playing together, so the remaining players should be brought together onto one server, instead of being left spread out on the overflow servers.
This system was working as intended, but had some unfortunate side-effects on village fertility. Essentially, if you were on one of the overflow servers during a population downswing, your village was doomed, because no new players would be sent there. As player population changes throughout the day, this means that various villages die out again and again. And even worse, other logic in the player distribution code tries to make sure a given player always plays on the same server, whenever possible. So, depending on the time of the day that you play, and the luck of the draw, you might get stuck always being born on an overflow server right before a population downturn---always playing in a doomed village.
Take a look at the red line (server 4) in this graph, which was generated by Thundersen:
You can see that as the population rises, server 4 is brought into the mix to handle it, and then the population reaches a noisy plateau, which soon after results in server 4 being removed from the mix, only to be brought back into the mix a few hours later, only to be removed again shortly after. Villages on server 4 were dying out over and over. Pity the players who were stuck playing on server 4 that evening.
Also, that system was designed a long time ago, when Eve distribution wasn't really in place, and I imagined players mostly all playing in the same area on the map. Now that players are playing in separate villages anyway, keeping them clumped on the same server together isn't as high of a priority.
CrazyEddie suggested that we try a different method, picking an appropriate number of servers for the load, and then just letting populations rise and fall on the servers together, as long as we're still above some lower threshold. Thus, once a server is brought into the mix, it generally stays in the mix. As population falls, it falls simultaneously on all servers, but no server is singled out to be childless. This means that a village being doomed by outside circumstances will happen way less often---almost never (except in very rare cases where the overall population falls to very low levels and a server really does need to be taken out of circulation).
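One way to picture this kind of scheme is as hysteresis on the number of active servers: add a server when the population no longer fits, but only remove one when the average population per server drops below a much lower threshold. The sketch below is an illustration under assumed numbers, not the actual distribution code; `update_active_servers`, `cap_per_server`, and `low_per_server` are hypothetical names and values.

```python
def update_active_servers(active, population, cap_per_server, low_per_server):
    """Return the new number of active servers for the current population.

    Grows as soon as the population exceeds total capacity, but shrinks
    only when the average per-server population falls below
    `low_per_server`. Keeping low_per_server at or below half of
    cap_per_server prevents a server from flapping in and out.
    """
    while population > active * cap_per_server:
        active += 1
    while active > 1 and population < active * low_per_server:
        active -= 1
    return active


def trace(populations, cap_per_server=80, low_per_server=20):
    """Replay a sequence of total populations, starting from one server."""
    active = 1
    history = []
    for population in populations:
        active = update_active_servers(
            active, population, cap_per_server, low_per_server
        )
        history.append(active)
    return history
```

For example, `trace([60, 150, 200, 150, 90, 70, 50])` yields `[1, 2, 3, 3, 3, 3, 2]` with these assumed thresholds: the third server stays in the mix through the decline and is only retired once the overall population gets quite low.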
Keep in mind that villages are still competing for babies, due to variable mother fertility factors (warmth and diet variety). So, that's still happening on each server. If your village is dying out, perhaps another village is stealing incoming players by taking better care of their fertile mothers. There is no explicit trans-server competition, though there's a kind of meta competition, based on how many of the players that are assigned to your server are motivated to keep playing across multiple lives.
I have something special in store for next week. Not magical, but still magic. Well, I guess it's only called "not magical" because of how jaded we are. If I told you that a few coils of copper wire and a galena crystal could be used to pull invisible voices from the sky, you'd probably think I was crazy. But to the untrained eye, a schematic can easily be mistaken for a sigil.
Weekly Update #42
Apocalypse 2.0:
First, a few important fixes that you all should be aware of.
There was a bug in temperature weighting on mothers. It was supposed to make ideal-temp mothers more likely to have a baby, but it was broken and not working. That has been fixed now. Furthermore, a Yum multiplier factor has been added to this weighting. If you have a large Yum multiplier (from eating a chain of unique foods), you will also be more likely to have a baby. If you're warm in addition to being on a yummy diet, you will be even more likely to have a baby.
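As a rough illustration of how this kind of weighting can work, the sketch below picks a mother for an incoming baby using a weighted random choice. The exact formula the game uses is not given here, so `mother_weight` is purely an assumption: it just needs to grow with both warmth and Yum multiplier, so that being warm and on a yummy diet stacks.

```python
import random


def mother_weight(temp_closeness, yum_multiplier):
    """Hypothetical weight for one fertile mother.

    temp_closeness: 0.0 (far from ideal temperature) to 1.0 (ideal).
    yum_multiplier: 0 or more, from a chain of unique foods.
    The product form is an assumption made for this example.
    """
    return (1.0 + temp_closeness) * (1.0 + yum_multiplier)


def pick_mother(mothers, rng=random):
    """Pick one mother at random, with probability proportional to weight."""
    weights = [mother_weight(m["temp"], m["yum"]) for m in mothers]
    return rng.choices(mothers, weights=weights, k=1)[0]


# Made-up example mothers.
example_mothers = [
    {"name": "cold, plain diet", "temp": 0.0, "yum": 0},
    {"name": "warm, plain diet", "temp": 1.0, "yum": 0},
    {"name": "warm, yummy diet", "temp": 1.0, "yum": 5},
]
```

Every mother keeps a nonzero chance, so a cold mother on a plain diet can still have a baby; she is just less likely to be chosen than a warm, well-fed one.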
And the way that Eve spawn locations were remembered---when Eve died of old age---was buggy. Thus, the surprise appearance of Eves near villages. This has been fixed. But even when it's working correctly, it's meant to only function on low-pop servers, and not as a way-of-life for reviving collapsed villages, so that has been fixed as well (your last-Eve-death location will only be used for your next Eve spawn if there are fewer than four fertile females on the server currently). This fix is even more important in light of this week's update, which I will describe in detail below.
In last week's update, I talked about how there will be no magic in the game. What I meant to say is that there will be no non-inherent magic.
Some things about the game are inescapably magic. Reincarnation---a reality for any commercially viable game---is the prime example of this.
But the map itself, and the servers, and how they get set up, and how they get updated, and how they get cleared, is another example. I'm doing all this stuff behind the scenes to keep things updated and working. I'm making choices. I'm adding things. I'm in control of the parameters that control when and how certain parts of the map go back to their natural state.
And the map is huge---unnaturally huge. 36,000x larger than the surface area of the earth. Walking from one edge to the other in the game would take you 34 years of real-life time. Walking around to visually see the entire map would take you 14 billion years.
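For anyone who wants to sanity-check numbers like these, here is one set of assumptions that reproduces the first two figures. None of the constants below come from this post: the map side length, the tile size, and the walking speed are guesses picked because they happen to fit, and the Earth's surface area is the usual rough value of about 510 million square kilometers.

```python
SIDE_TILES = 2 ** 32         # assumed: map width in tiles
TILE_METERS = 1.0            # assumed: one tile is one meter across
WALK_TILES_PER_SECOND = 4.0  # assumed: walking speed
EARTH_SURFACE_M2 = 5.1e14    # approximate surface area of the Earth
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Map area compared with the Earth's surface.
area_ratio = (SIDE_TILES * TILE_METERS) ** 2 / EARTH_SURFACE_M2

# Real-life years needed to walk in a straight line from edge to edge.
crossing_years = SIDE_TILES / WALK_TILES_PER_SECOND / SECONDS_PER_YEAR
```

Under those assumptions, `area_ratio` comes out a little over 36,000 and `crossing_years` a little over 34.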
It's a big map. Mind-bogglingly so. Yet I can change the entire thing with the push of a button, like when I add a new biome, or wipe it back to its natural state in the blink of an eye. How can something so big be changed so fast? Through procedural generation and the properties of computer file systems (where deleting data of length N is a constant-time operation, independent of N). It's not magic, really.
But when we try to square these possibilities with a simulation of the real world, the end results are nothing short of miraculous.
And what does that make me, the guy pushing the buttons behind the scenes?
There's an amazing idea lurking in this game, and credit for the idea goes to Edmund McMillen. When I visited him a few years ago, in between petting his hairless cat and having him kick my ass in a Magic draft, I told him about the game I was working on. In a game that starts back at zero as a premise, a question arises: how did we get to zero in the first place? And what if, Edmund suggested, players were in control of taking everything back to zero? What if, at the top of the tech tree, the most difficult-to-craft item was The Button?
It seems that, after all is said and done in this game, after all my updates are out, and the game stops evolving due to developer input, this just has to be the way that it will work. Otherwise the game will stagnate. Edmund was right.
But what about along the way? In the arms race of player progress in the face of my weekly updates, players always win.
So, the idea of an along-the-way apocalypse arose. What if The Button was a moving target? Some item at the top of the current tech tree that represented the current endgame?
The problem here is that players can get to the top of the tech tree ridiculously fast.
This means that the apocalyptic item can't be technological. It needs to be magic in some way.
Long ago, shortly after the game's release in early 2018, I tried something like this. A monolith in the desert that you could use to conduct a kind of absurd ritual using a bit of material that was high-level tech at the time. This experiment was an utter failure, as the first apocalypse was triggered four hours after the update, and subsequent apocalypses were triggered hourly after that until I gave up and disabled the whole system.
I left that failed experiment behind, without thinking about it any further. Players can get through the tech tree---and craft any imaginable thing---way too fast. This even planted seeds of doubt in my mind about Edmund's Button, even at the end of the update process, once the tech tree was gigantic.
Still, I really liked the shared collective event that had occurred. People who were playing that fateful day will never forget that flash of white...
In the meantime, other ideas surfaced, like the bell tower, which involved slowing down player progress toward a goal and ensuring trans-generational cooperation. A bell tower takes 18 hours to build. In order to build it, your village has to survive that long.
This takes a page from the Clock of the Long Now.
The insight this week was that these two ideas can be combined. An apocalypse, for the time being, is a magical, not technological event. So there's a ritual. What if it was a very slow ritual? What if people had plenty of opportunity---and warning---to interrupt the ritual before completion?
That was always the idea with Edmund's Button anyway---that people would be fighting to stop it along the way.
So, I give you a new and improved apocalypse. It has:
Rare, unsustainable ingredients that you cannot procure while working completely alone.
A map-wiping wave that is limited to one server only (no more chance of an Easy Apocalypse on a vacant server causing wipes on the populated servers).
A map-wiping wave that you can live through and come out the other side.
World-wide warnings as the ritual gets closer and closer to completion.
The ritual itself is very fragile and easy to set back along the way.
The entire ritual, if uninterrupted, takes 24 hours to complete.
And, for the time being at least, it's magic.
Weekly Update #41
Yuletide Together:
This update is on time for sure this week, and lemme tell you why. Tomorrow is the solstice, the shortest and darkest day of the year in the northern hemisphere. The sun will die as it passes through the constellation of the Southern Cross, remain dead for three days, and then be born again as it rises on December 25 through the constellation of Virgo, the virgin. But enough of that astrological claptrap! What's that got to do with the update?
What we do, in my family, on the solstice is take a step back in time for a day. We use no lights except for sunlight and candles. That also means no computer screens. This is actually a pretty amazing thing to do every once in a while, because everything---and everyone---looks absolutely gorgeous when you've got candles all over the place in your house.
So, no sneaking the update in after the bell tomorrow. Today or bust.
We also have salsa, on the solstice, because my oldest kid thought that sounded funny when he was little. Salsa on the solstice. Tradition!
And yes, you can now celebrate this season in various ways in One Hour One Life. But be forewarned: do NOT expect some magical Santa NPC to be running around in-game handing out presents. That will never happen in this game (as hilarious as it sounds), for a good reason. This is a game that draws as many of its aesthetics as possible from real life. It's about human technology, and human society. It is not about magic or other supernatural things. No gremlins, no dragons, no ghosts, no Santa.
The only place the game breaks with this aesthetic is via reincarnation, for sheer playability reasons (as much as I was tempted to make a game where you only live once). And the curse system follows as a necessity from that (because criminals can reincarnate just like everyone else, and keep bugging you for all of eternity---unlike in real life).
So, holiday stuff, but actual human holiday stuff.
There are also two new chat "commands" that you can use to help you in diagnosing lag. /FPS will toggle a count of the current frames per second (are you experiencing GPU slowdown in dense areas?), and /PING will ping the server and display the round-trip time in milliseconds (is your connection to the server getting flaky?).
Have a great holiday season, everyone!
Weekly Update #40
Internal Combustion:
Sheesh, an internal combustion engine has a lot of moving parts. I should know, because I just drew them all. There was so much detail that I had to lay out the whole thing in CAD software first, as a printed tracing guide. It's all still hand-drawn. Call it Computer Aided Human Drawing.
So I drew it, and now it's up to you to put the damn thing together.
The internal combustion engine was actually a major sticking-point in the design of the game: how do we get over the very steep hump that leads into industrialization? Like I mentioned in a previous update, the actual history here is far from clear. We went from very crude machines that were mostly made out of wood and powered by animals, water, or humans to finely crafted clockwork contraptions that could literally pump like well oiled machines. My guess is that it was a process of micro-refinements over about five hundred years.
So, I kinda just winged it here, assuming that if we had something spinning fast enough, that would be enough to bootstrap the whole thing via the magic of the lathe. And here we are, a week later, with a pretty accurate model of a four-stroke, two-cylinder diesel engine, complete with all major parts.
If you're interested in more details about how this works, this video explains the working of a single cylinder:
The major thought experiment in this game is this:
It took us 4000 years to advance from stone-aged tech to the iPhone the first time around. If we had to start over from scratch, naked in the wilderness, with nothing but rocks and sticks, but we retained all knowledge, how long would it take the second time?
The more closely I study this stuff, the more baffled I am about how we ever did it in the first place. How long would it take the second time? My current best guess: Forever.
As in, never.
Weekly Update #39
Black Gold:
"And then one day he was shootin' at some food, and up through the ground come a bubblin' crude. Oil that is. Black gold. Texas tea." Californy here we come!
My father was an oil man. At least he tried to be, for a while. When I was young, he invested in a local oil exploration company in Ohio. Everyone in the family got free ball caps with the drilling outfit's logo. I remember climbing up the steps on the side of a huge oil tank, where he wrenched open the porthole cover and we peered inside as the oil poured in. What a sight! And more importantly, what a smell! With tiny me perched up there by his side, my father reached his arm down inside with an empty peanut butter jar and filled her up directly from the gushing pump stream. (What the hell was he thinking?) That jar sat---and settled---on his office desk for many years. Sludge would be a naive appraisal of what was in that jar. Brown sludge. A reminder of an investment in oil that never turned a profit.
But your investment in oil will turn a profit!
The more time you spend around crude oil, the more trouble you will have believing that people actually used to rub this noxious stuff on their bodies as a medicine.
But what about the very best stuff? Light, sweet crude, with a very low sulfur content---really top notch. If you're brave enough to take a sip, it actually tastes sweet.
So you want that oil. But oil prospecting and refining requires a lot of equipment along the way. And that's why this update is one of the largest and most complicated in the history of the game so far. Machines galore.
But when you finally see that black plume shoot toward the sky, you too will dream of striking oil before you die.
Weekly Update #38
Newcomen Atmospheric Engine:
This update paves the way for the forthcoming industrial revolution. After banging my head against the actuality of human history in this area (much of which is, strangely, shrouded in uncertainty) for four straight hours on Monday, I realized that bootstrapping is hard. This is the point where the true mystery of human civilization comes to a head: how do you make a lathe without a lathe?
We've come a long way so far, and bootstrapped a whole bunch of things in this game. But these are all things that I myself know how to make from scratch, in principle. This is the kind of stuff that is made on the Primitive Technology YouTube channel. If it looks hand-made, it can probably be made by hand, right?
But how do you bootstrap metal machines? I'm starting to suspect that this is something that no one now living actually knows how to do. So it's my job to figure it out, or at least make up a reasonable approximation. I was thinking that we'd be skipping right over the age of steam---that steam was just an unnecessary side-branch on the inevitable path toward internal combustion. But now I think it's actually about tolerances. Internal combustion requires extremely tight tolerances, and metal gaskets, because there's fire right there in the cylinder. Steam engines can be made with much lower tolerances, and leather or rubber gaskets, because the fire is kept outside the cylinder. I.e., crude steam machines can be made without the process of machining. And with steam machines, we can make the machines with which we can actually "machine" parts with tighter tolerances. Steam lathe, here we come.
And speaking of steam lathes, this video of a 1900s-era steam-powered machine shop is pretty amazing. But none of those belt-driven machines contain parts that were made by a blacksmith:
But there's no lathe in there yet. What can you do with this Newcomen Atmospheric Engine? Pump water! Why would you want to do that? You'll find out soon enough.
This update also dramatically improves the mouse interface when dealing with stacks of things that should also be moveable as a stack (stacks of plates, for example). Left click grabs the whole stack, and right click removes an item from the stack. In other words, the stack now behaves similarly to a container when you left or right click on it.
New wild wounds (or sickness) no longer replace your current wound. You can't heal from a snake bite by contracting yellow fever or getting hog cut.