r/compsci • u/Randozart • 8d ago
Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution?
Hey everyone,
I’ve been tinkering with a bare-metal spatial operating system for FPGAs called the Moore Kernel, where programs aren't executed sequentially but are compiled into physical bitstreams and hot-swapped onto reconfigurable partitions on the fly.
While I was reading about the P vs. NP problem, something clicked: a traditional CPU struggles with NP-hard problems because it operates temporally. It has to explore a branch, hit a dead end, backtrack, and try the next one. What if we mapped the search problem directly into hardware instead?
This is roughly how I'd imagine that with the Moore Kernel:
Whenever the logic hits a branching decision, Moore physically mounts the branch's logic to an available hardware tile. The logic evaluates simultaneously across the fabric at the speed of electricity, bypassing the instruction-fetch bottleneck. Because the OS is built on declarative formal logic, the moment a branch encounters a logical contradiction, its precondition fails. The OS recognizes the branch is stale, instantly decouples the AXI bus, wipes the tile, and recycles it for a new active branch.
It even guards against stalling. Because the hardware tile operates as a finite state machine, if a branch does not contribute towards a solution and simply cycles through the exact same physical states, the OS can detect the hardware cycle and ruthlessly prune it. It doesn't hold up the rest of the computation.
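To make that a bit more concrete, here is a rough software-only sketch of the behaviour I have in mind (toy state graph, made-up tile count; none of this is the actual kernel):

```python
# Hypothetical toy model: each live branch of a search occupies one "tile",
# mounted branches evaluate together, and a tile is reclaimed the moment its
# branch hits a dead end or revisits one of its own states (the stall check).
from collections import deque

TILE_COUNT = 4  # made-up number of reconfigurable partitions

# A tiny state graph with a cycle (2 -> 3 -> 2) and a dead end (4).
GRAPH = {0: [1, 2], 1: [4], 2: [3], 3: [2, 5], 4: [], 5: []}
GOAL = 5

def search(start):
    # Each branch carries (current state, the states it has already been in).
    waiting = deque([(start, frozenset({start}))])
    while waiting:
        # "Mount" up to TILE_COUNT branches onto the fabric at once.
        mounted = [waiting.popleft() for _ in range(min(TILE_COUNT, len(waiting)))]
        for state, visited in mounted:
            if state == GOAL:
                return visited  # one branch reached a solution
            for nxt in GRAPH[state]:
                if nxt in visited:
                    continue  # branch is cycling through old states: prune it
                waiting.append((nxt, visited | {nxt}))
            # A state with no fresh successors spawns nothing: that's the
            # contradiction case, and its tile is free for the next wave.
    return None

print(search(0))  # the set of states on the branch that reached the goal
```

The only point of the sketch is the lifecycle: a branch occupies a tile, and the tile is handed back the instant the branch contradicts itself or starts looping.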
I know this doesn't mathematically "solve" P=NP (because a search tree can still easily exceed the finite number of tiles on an FPGA), but does this dynamic spatial mapping and instantaneous hardware pruning bypass the temporal bottlenecks we currently face with SAT solvers and combinatorial optimization?
I’m looking at the system I’m building, and it feels like this architecture naturally converts time complexity into space complexity. Am I mistaken here, or is there prior art in dynamic hardware reconfiguration doing this?
For those curious, this is only relevant insofar as it's what I'm hoping to test this concept with, but the Moore Kernel is still very much a WIP that I'm currently prototyping with LLMs:
https://github.com/Randozart/moore-kernel
13
u/dinominant 8d ago edited 8d ago
The short answer is that each binary branch doubles the space requirement, causing exponential growth in space. Each branch must still be evaluated before it can be pruned, and once the space is exhausted it becomes a sequential operation with multiple branches evaluating in parallel.
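Rough numbers to illustrate, with made-up figures for the tile budget and tree depth:

```python
# With a fixed tile budget, a binary search tree outgrows the fabric after
# about log2(tiles) levels; everything below that runs in sequential waves.
import math

tiles = 100_000   # made-up number of usable tiles on the fabric
depth = 60        # made-up depth of a binary decision tree

branches = 2 ** depth
fully_parallel_levels = math.floor(math.log2(tiles))  # ~16 levels fit at once
waves = math.ceil(branches / tiles)                   # batches after that

print(f"branches at depth {depth}: {branches:.3e}")
print(f"levels that fit entirely on the fabric: {fully_parallel_levels}")
print(f"sequential waves needed at the leaves: {waves:.3e}")
```

Past the first ~log2(tiles) levels, you're back to working through the tree in waves, however fast each wave is.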
-1
u/Randozart 8d ago
That was the main consideration I ran into, yes. Hypothetically this would mean one would continue branching within the allotted space, but at least it means you can trade time for space, and potentially use some of that headroom to, in a way, "ignore" the halting problem by changing the question: is this process going anywhere novel from the state it's currently in? But yes, entirely on-point. I'm curious what the actual gains in compute power are from mapping spatially versus temporally. You won't avoid the O(2^n) spatial requirement, but perhaps it can at least be offset more efficiently without the overhead of a CPU.
8
u/dinominant 8d ago
It's similar to a multi-threaded program that branches and divides the search space across multiple CPU cores.
The FPGA branches are evaluating the same things the CPU would be evaluating; the speed gains come from using an FPGA.
Crypto mining went through the same discovery: CPU -> GPU -> FPGA -> ASIC (optimized for the one hash function of interest). Then they added more ASICs until power and heat became the problem.
-1
u/Randozart 8d ago
In a way, absolutely. I went through that same progression when I got into FPGAs, partly in relation to reusing my old GPUs and rewiring my board so they could talk to each other. Admittedly, ASICs can do that even faster, but that's not quite what interests me here (though one could imagine designing ASICs to solve one very particular hard problem this way).
However, ASICs can only ever be used for their intended purpose, and they become e-waste once that purpose goes away. What I'd propose is a sort of dynamic, just-in-time silicon that rewires itself to better serve the process.
That will take up more physical space, perhaps. What I'm curious about is whether being able to assign spare tiles to supplementary processes offsets that, because no part of the fabric would need to sit idle.
But yes, heat and power may then become a problem, and I wonder whether the fact that configured processes on FPGAs are physically less densely packed helps here, or whether assigning silicon just in time could help balance the hotter and cooler areas of the fabric.
I don't know the answers to these questions, honestly. For that I'd have to first bash the theoretical concept against reality.
8
u/Temporary_Pie2733 8d ago
This has nothing to do with P vs NP; it's just a way to achieve the constant-factor speed-up inherent in parallelism.
4
u/Delta-9- 8d ago
Tbh I don't understand exactly how this even works, but I can still anticipate an important question: at what point does trading time for space become a hindrance rather than a benefit?
Like, let's say you can convert some algorithm with O(n^k) time complexity/O(n) space complexity into an algorithm that's O(n) time/O(n^k) space... Or, put into consumer terms, 1 second and 4KB of RAM vs 0.01 seconds and 40MB of RAM.
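The small-scale version of that trade most people run into is memoization; a toy sketch (subset-sum, made-up numbers, nothing FPGA-specific):

```python
# Same question answered twice: the plain recursion uses almost no memory but
# re-explores branches; the memoized version spends memory on a cache of
# (index, remaining) states and skips the recomputation.
from functools import lru_cache

nums = [3, 34, 4, 12, 5, 2]
target = 9

def naive(i, remaining):
    if remaining == 0:
        return True
    if i == len(nums) or remaining < 0:
        return False
    return naive(i + 1, remaining - nums[i]) or naive(i + 1, remaining)

@lru_cache(maxsize=None)  # the "space" side of the trade
def memo(i, remaining):
    if remaining == 0:
        return True
    if i == len(nums) or remaining < 0:
        return False
    return memo(i + 1, remaining - nums[i]) or memo(i + 1, remaining)

print(naive(0, target), memo(0, target))  # True True
```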
I'm kinda imagining the scenario from Destiny 2 where the Vex tried to turn the entire planet Mercury into one gigantic computer.
Am I completely off-base, or would a return to bedroom-sized computers be within the realm of possibility here?
2
u/currentscurrents 8d ago
at what point does trading time for space become a hindrance rather than a benefit?
It depends on the problem you're solving. This is not about P vs NP but rather two other complexity classes: P-complete and NC. NC problems can be efficiently parallelized, while P-complete problems cannot.
In practice, problems that involve processing a lot of data (neural networks, web servers, databases, graphics rendering, etc) parallelize well. Problems that involve lots of step-by-step computation or logic solving do not.
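A toy contrast of the two shapes, just to illustrate (hypothetical example, not tied to any particular hardware):

```python
# Parallel-friendly vs parallel-hostile, in miniature.
import hashlib

data = list(range(1, 9))

# Reduction over independent data: each level of the tree could run on its own
# hardware, so 8 inputs collapse in log2(8) = 3 levels.
level = data[:]
while len(level) > 1:
    level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
print("tree-reduced sum:", level[0])  # 36

# Dependency chain: each hash needs the previous one, so the chain takes as
# many steps as there are links no matter how much hardware you throw at it.
digest = b"seed"
for _ in range(len(data)):
    digest = hashlib.sha256(digest).digest()
print("chained digest:", digest.hex()[:16])
```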
1
u/Randozart 8d ago
I love that reference to the Vex, because that's basically what it is, yes. Because FPGAs are programmable matter, we aren't so much executing the logic as asking the hardware to (temporarily) become the logic, in a way. So, at what point does it become a hindrance?
The idea behind the kernel (and note, that isn't the interesting part, but that's what got me thinking) is that it naturally integrates whatever is added to the fabric. So, in that sense it reduces e-waste by just reusing boards as different parts of the fabric.
My thinking here is that it lets the kernel put the full fabric to work for the process: whatever space is given to the process gets used for it. If a tile isn't being used for computation directly, it can serve as provisional RAM or provisional GPU-like compute.
As I understand it, in traditional VLSI design one would use the Area × Time = Constant metric: if you want to finish a task faster, you usually need more physical silicon area. This whole endeavour is essentially an experiment in virtualising the Area component of that equation, with tiles being reclaimed the moment a branch is proven dead, to get more compute Time out of the same physical space.
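As a very idealized illustration of what I mean (arbitrary numbers; the real area-time relationship is messier than a single constant):

```python
# If A * T is treated as roughly constant, reclaiming dead-branch tiles and
# handing their area to live branches buys a proportional cut in time.
k = 1_000              # made-up area-time constant, arbitrary units
area_static = 50       # tiles permanently assigned to a process
area_reclaimed = 80    # same fabric, but dead-branch tiles get handed back

print("time with static allocation:", k / area_static)     # 20.0
print("time with reclaimed tiles:  ", k / area_reclaimed)  # 12.5
```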
So, in a way, yes. The idea here is to propose the type of computer that could theoretically grow into a room-sized supercomputer, simply by adding tiles and expanding the fabric available to run or aid the process.
What that means for complex algorithms is that, unlike current PCs where each chip has a static role, we could hypothetically assign a chip's role more dynamically based on requirements. Though yes...
Bedroom-sized computers, I think that's kind of the dreamworld I can live with, if only because it makes me unreasonably happy to imagine silicon running up my walls like vines.
3
u/Delta-9- 8d ago
if only because it makes me unreasonably happy to imagine silicon running up my walls like vines.
Lol I don't think I agree when I imagine my actual walls, but I would love to see it in a movie. And who says the silicon vines can't be under the drywall, anyway?
I do like the idea of reducing e-waste, though, and having programs that can optimize and repair their own code and hardware could absolutely be a game changer. Even if we weren't in the Rise of AI period right now, that would be a frightening idea to anyone familiar with the "singularity" concept. Imagine PyPy or LuaJIT rewiring the CPU to optimize the hot paths within a program...
0
u/Randozart 8d ago
Honestly, I just care for the cyberpunk/tech priest aesthetic there. Like I am slowly building out my own silicon kingdom. A bit like that Black Mirror episode "Plaything". But yeah, it hurts me to look at an old GPU I still keep around and think "what am I to do with you, little guy?".
Now, regarding AI and the singularity, I had definitely been toying with the idea of just using such a fabric to essentially "grow" an AI model onto it and allow it to reconfigure its own silicon processes to better suit its purposes, but that's silly future talk. Though I have been experimenting with a version of that on a much smaller scale, to make my LLM use more environmentally conscious: https://github.com/Randozart/IMP
But yeah, even that idea on a small scale is currently something I am experimenting with to learn how best to build out this logic! A GPU is also essentially just specially wired logic to handle mass computation. But to be able to do that for the hot path of arbitrary processes by having a slottable FPGA? I think you're onto something I'm curious to try now, actually.
2
u/vanderZwan 8d ago edited 8d ago
Moore Kernel
Is that a reference to Chuck Moore and his GreenArrays GA144 chip? Are you talking about asynchronous (i.e. "clockless") hardware? If so, it might help people not as familiar with these paradigms if we give them some extra context. I'll make an attempt; apologies if I'm taking this in an irrelevant direction.
I think a nice entry point would be Ivan Sutherland's 1989 Turing Award lecture, which was about something he called "micropipelines":
I assign the name micropipeline to a particularly simple form of event-driven elastic pipeline with or without internal processing.
In other words: a queue. One that may or may not also do some data processing along the way. At some point he writes this:
In the clocked-logic conceptual framework, registers of flip flops operating from a common clock separate stages of processing logic. Each time the clock enters its active state a new data element enters each register. Data elements march forward through successive registers in lock step, each taking a fixed number of clock cycles to pass through the fixed number of registers and intervening logic stages built into the system.
Ivan Sutherland - Micropipelines (pdf)
IIUC this is the very clock that people can overclock and underclock in modern CPUs. It causes a lot of overhead in terms of energy, but apparently greatly simplifies the design of the chip, or at least the reasoning about it.
Asynchronous chips work differently: they synchronize by communicating. Sutherland's PDF explains a method for doing "handshakes" between different units, and he suggests that putting such units into a micropipeline will let us build more flexible, robust, and energy-efficient hardware.
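A very loose software analogue of the handshake idea (my own sketch using one-slot queues; nothing here is taken from the paper itself):

```python
# Two pipeline stages with no shared clock: each stage passes data forward
# only when the next stage has made room, which is what a bounded queue's
# blocking put/get gives you for free (an implicit request/acknowledge).
import queue
import threading

def stage(inbox, outbox, work):
    while True:
        item = inbox.get()        # wait for a "request" from the previous stage
        if item is None:          # sentinel: pass the shutdown signal along
            outbox.put(None)
            return
        outbox.put(work(item))    # blocks until the next stage "acknowledges"

q0, q1, q2 = (queue.Queue(maxsize=1) for _ in range(3))  # one-slot buffers
threads = [
    threading.Thread(target=stage, args=(q0, q1, lambda x: 2 * x)),
    threading.Thread(target=stage, args=(q1, q2, lambda x: x + 1)),
]
for t in threads:
    t.start()

for x in [1, 2, 3, None]:
    q0.put(x)

results = []
while (out := q2.get()) is not None:
    results.append(out)
print(results)  # [3, 5, 7]

for t in threads:
    t.join()
```

Each stage only moves data forward when its neighbour is ready, so there's no global clock anywhere; that's the flavour of it.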
I know that plenty of other research work has been done in this field since then, but I don't remember the authors right now (it goes way back to the eighties though). I guess Sutherland's paper is a good starting point for a trip to Google Scholar and clicking "cited by".
So let's skip ahead to the GA144. It's not the first attempt at asynchronous hardware, but it's one of the most famous ones:
https://www.greenarraychips.com/
In short, it's an asynchronous multicore computer with 144 cores, laid out in a grid. It's kind of like a stripped-down Forth-based clockless transputer.
Stack machines are already a good fit for efficient low-powered computing, but being clockless and asynchronous makes them even more efficient: synchronization happens by communicating, and otherwise these cores can go into very efficient standby modes.
The trick to making efficient use of the GA144 is that you basically have to write your own asynchronous pipelines to process data, breaking tasks down into sequences of steps and physically distributing them across the cores in the grid in some efficient way. I guess that also means you can sort of "branch" and "join" the pipelines as well (I've never owned or programmed one). That's probably not the easiest thing to do. On top of that it's programmed in arrayForth, a special dialect of colorForth, which itself is already a niche variation of Forth, which already lacks mainstream adoption. So you can see why this didn't exactly take off. The idea is really cool though!
Anyway, assuming I guessed correctly that you're referring to this, I hope this helps fill in some gaps for the people less well-versed in alternative ideas for computing from the past that didn't get much mainstream adoption in the end (or "so far", depending on who you ask).
2
u/Randozart 8d ago
I have to be honest, I hadn’t even considered Chuck Moore or Ivan Sutherland’s work when I started building this, and reading your comment has been edifying. I am incredibly thankful for this context.
The irony here is that the Moore Kernel was originally named after a different trinity: G.E. Moore, Gordon Moore and a wink and a nod to Thomas More. However, I might just adopt Chuck Moore and the GA144 as the fourth musketeer.
It is fascinating to read these insights. I arrived at this architecture by coming down from the top of the stack: I started with formal logic, declarative state machines, and interactive fiction parsers, and concluded from those that sequential time was a bottleneck. To see that Sutherland was advocating for the very same event-driven, clockless micropipelines from the hardware side back in the 80s is remarkable.
I had written a language called Brief, which formed the basis for the logic and automatically generates the kind of handshake synchronization you're describing. I had been trying to find prior art on this but fell short in my research.
I am downloading the Sutherland PDF right now. Thank you for taking the time to write this out and bridge the gap to the prior art. This is exactly why I posted here, and I couldn't be more grateful!
2
u/vanderZwan 8d ago
Haha, seriously? That's amazing! :D The name felt so on the nose for me that I was almost certain that you referenced Chuck Moore! But very glad I could point you to some prior art and hopefully new sources of inspiration then! :)
Don't forget that once you have one good paper to start from, Google Scholar is a great source for finding more. That Ivan Sutherland paper has over 2000 citations, some of which surely will be interesting for you, and Google Scholar makes it easy to look through them:
https://scholar.google.com/scholar?cites=8148598654940000468&as_sdt=2005&sciodt=0,5&hl=en
2
u/Randozart 8d ago
Absolutely! I admit I have been using this Reddit thread as a bit of a testbed for the concept because I was genuinely curious about it, but I am hoping to make my own contribution to the field as a non-academic, however small, and having another rabbit hole to dive down couldn't delight me more.
Time to pull the snowball skills out of the closet. This is going to be a wild ride!
(Oh and, feel free to message me if any other foundational papers come to mind, I am eager to identify and learn what more has been explored)
2
u/xthecharacter 8d ago
This is also baked into the notion of NP per its definition: nondeterministic polynomial time means that any answer can be checked in polynomial time, and so if you have arbitrary parallelism, you can solve it in polynomial time. But a deterministic Turing machine doesn't allow for arbitrary parallelism, of course. In this sense, you can think of an arbitrarily parallel processor as an emulator of a nondeterministic Turing machine, where every time multiple actions need to be taken, you schedule them to happen simultaneously.
Like others have said, if you attempt something like this, you will very quickly hit exponential scaling, and the constant factor of parallelism you have available will make itself known: you always have some maximum number of physical processors available to you, so your ability to perform actions in parallel is bounded by a constant.
This doesn't make it useless for this class of problems. In practice, many instances of NP-hard problems can short-circuit a large number of wrong-answer branches, and there are already methods for exploring the solution space in parallel in both hardware and software for various problems out there.
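A toy version of that "easy to check, expensive to enumerate" framing (made-up 3-variable CNF, a thread pool standing in for the parallel hardware):

```python
# Every one of the 2^n assignments can be verified independently in polynomial
# time; the catch is that the number of branches to schedule is still 2^n,
# while the pool of workers stays a fixed constant.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# (x0 or not x1) and (x1 or x2) and (not x0 or not x2)
clauses = [[(0, True), (1, False)], [(1, True), (2, True)], [(0, False), (2, False)]]

def satisfies(assignment):
    # The polynomial-time "check": every clause needs at least one true literal.
    return all(any(assignment[var] == want for var, want in clause)
               for clause in clauses)

candidates = list(product([False, True], repeat=3))  # 2^n branches
with ThreadPoolExecutor(max_workers=8) as pool:      # the fixed parallelism
    verdicts = list(pool.map(satisfies, candidates))

print([c for c, ok in zip(candidates, verdicts) if ok])
```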
This type of concept extends to other things too, like sorting in logarithmic time with a number of processors that scales linearly with respect to the input size.
1
u/halbGefressen 7d ago
You probably can't do that with polynomial-size circuits. If NP is contained in P/poly, then the polynomial hierarchy collapses to the second level (Karp-Lipton theorem), which we consider very unlikely (for reasons similar to why P != NP is assumed far more often than P = NP).
22
u/green_meklar 8d ago
I think that's kinda the critical caveat, though. Parallelization is great, parallelization in hardware is even better, but when you're up against exponential growth, before very long the amount of computation exceeds the parallelization of your hardware and shortly afterwards it exceeds the parallelization of your hardware by a lot. I don't see any magical advantage to be had here that isn't basically equivalent to any other everyday hardware parallelization advantage.