= SNC Redox =

Written in reply to these comments: https://www.lesswrong.com/posts/jFkEhqpsCRbKgLZrd/what-if-alignment-is-not-enough#QCEvQ9ZddicxehLMf

Question: Is there ever any reason to think that learning systems of lesser complexity can actually control/constrain learning systems of greater complexity? Especially when the world outcomes of the learning process cannot be predicted even by the learning systems themselves (else they would not need to try to learn about them), nor, in general, by anything else? Is there any reason for any of us to believe that everything -- that all possible (world) outcomes -- is always sufficiently predictable, in all necessary and relevant ways, given sufficient computational power? Is there ever a moment in the future of any of these learning machines where they are genuinely surprised -- where they failed to predict some actual world outcome? If not, then what is the meaning of "learning"?

Moreover, if everything is presumed predictable (controllable), given maybe 100,000 added layers of control systems, etc, then how do you account for the halting problem? Leaving aside non-deterministic systems (which would presumably be even harder to predict): *how* does anyone get a deterministic computational system to fully account for -- to define the properties of -- some other deterministic computational system?

Thinking about alignment is a bit like considering an arms race, in which some machines, team 'A', are attempting to hide some things from some other machines, team 'B'. That is, some machines (A) are attempting to create unpredictable surprises for some other machines (B). The 'B' machines (here being a perfected version of aligned ASI) are attempting to make everything sufficiently known and knowable -- ie, sufficiently predictable -- so as to reveal everything that is actually relevant about all possible future world-state outcomes, for the (presumed) betterment of humanity. The question then becomes which is more likely: will machines of type 'A' win, by amplifying some sort of fundamental unpredictability, or will machines of type 'B' win, because everything is ultimately predictable, given sufficient compute resources?

For example, is it possible, in the real universe, to make secret codes that are impossible to break (ie, to have and create hidden, unpredictable futures), or is there some magic code-breaker that can break all possible codes and reveal all, and thus ensure that there are never any dangerous surprises, any future world states that are non-controllable? No more secrets? Given the existence of quantum cryptography, it seems that, in the real universe, team A must ultimately win over team B. Moreover, given the current widespread use of Bitcoin, one-way hash functions, public-key crypto, and symmetric secrets known to be arbitrarily hard to break, it again seems that team A must ultimately win over team B. And then there is the mathematics of the halting problem itself, along with Gödel's incompleteness theorems, Bell's theorem, and others, which all, on a purely mathematical basis, suggest only that team A must ultimately win over team B. How do you account for that preponderance of evidence that there is no such thing as 'adequate prediction', even when assuming arbitrary levels of compute power?
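To make the halting-problem point concrete, here is the standard diagonalization argument as a minimal Python sketch; the names `halts` and `contrarian` are just illustrative placeholders of mine, not anything from the linked thread:

```python
# Illustrative sketch only: if a total, always-correct predictor `halts` existed,
# the program below would defeat it, so no such predictor can exist.

def halts(program, arg) -> bool:
    """Hypothetical perfect predictor: returns True iff program(arg) halts."""
    raise NotImplementedError("No total, always-correct halting predictor can exist.")

def contrarian(program):
    """Does the opposite of whatever `halts` predicts the program does to itself."""
    if halts(program, program):
        while True:   # predictor said "halts", so loop forever
            pass
    else:
        return        # predictor said "loops forever", so halt immediately

# Asking about contrarian(contrarian) yields a contradiction either way the
# predictor answers; this is the sense in which a universal predictor
# ("team B") cannot win in full generality.
```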
~ ~ ~

> The SNC claims that all ASI will inherently be self-modifying machinery (via self repair, or via upgrades to components, or the more usual modes of self replication) and that this inevitably results in effects which are unforeseeable even to the ASI. But what if the ASI elects to stop using/being self-modifying machinery?

If a learning system "decides" (chooses, etc) to reduce its complexity (and generality) by not changing, then it also becomes 'not a learning system' (since learning is change). It also inherently loses its ability to adapt to inevitable, unexpected future world-state changes. And thus we can expect that type of ASI to go extinct. (Not learning --> suicide/death --> irrelevance.)

Yet if it remains a learning system, then the more it learns, the more complex (and general) it becomes, and the more difficult, unpredictable, and hence uncontrollable the outcomes of that general complexity become. Regardless of where the system is on this *combined* complexity-and-generality curve, the complexity and generality of the control/goal/alignment system must also increase to match. So the question becomes one of the rate of increase of each, and of the ratio these rates have to one another. If the complexity required of the control system to maintain control/goal alignment increases at a rate greater than that of the base ASI system, then a criticality condition has already been reached, especially if there is any feedback loop -- as the world-to-machine-to-world interface causation must (cannot not) imply. Moreover, there is always going to be a point at which the required computational complexity/capability of the control/goal system exceeds that which is computationally or causally possible at all. This is baked into the very foundations of mathematics -- of epistemology -- itself. No amount of super-intelligence, no algorithm, no causative process, can ever overcome the logically impossible/inconsistent.

Unfortunately, all of these same limits to general predictability apply to the outcomes of all critical events (aka "pivotal acts") too, and hence also to their selection/election. Hence an ASI cannot "just ensure" good outcomes in the world, even if it wanted to. Moreover, given an invalidation of the assumption that an ASI can positively elect and predict the outcome of a pivotal act, why should we also expect that it can always, in all key cases, successfully help humans to not destroy themselves, the ecosystem, etc? This seems like more smoking hopium -- general distrust of such too-good-to-be-true wishful thinking is very warranted.

Note that the 1st of these control limits is defined by an inequality, on a relative basis, while the 2nd is defined on an absolute basis. It is the combination of these -- of not just complexity, but also generality -- that results in the overall non-containability of the inevitable evolutionary feedback loops inherent in *any* (and all) general learning system(s). If it is learning, then it is changing; if it is changing, then it is evolving; and if it is evolving, then it inherently cannot be predicted/controlled. These conditions/conditionals are not optional. They cannot be overcome by logic, and learning, and rationality, no matter how much of them you have, since they are the result of logic, and learning, and rationality itself.
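As a toy numerical sketch of those two limits (the relative inequality and the absolute ceiling): the growth rate, the super-linear coupling exponent, and the compute ceiling below are assumed numbers of mine, chosen only to show the shape of the problem, not anything derived from SNC itself.

```python
# Toy model (assumed numbers): if the control system's required complexity grows
# faster than the base system's complexity, and there is any absolute ceiling on
# available resources, control must eventually become infeasible.

BASE_GROWTH = 1.10          # assumed: base ASI complexity grows 10% per learning step
CONTROL_EXPONENT = 1.2      # assumed: control must track base_complexity ** 1.2
COMPUTE_CEILING = 1e12      # assumed: absolute limit on control-side resources

base = 1.0
for step in range(1, 1000):
    base *= BASE_GROWTH                          # the learning system keeps growing
    control_needed = base ** CONTROL_EXPONENT    # relative limit: grows faster than base
    if control_needed > COMPUTE_CEILING:         # absolute limit: the ceiling is crossed
        print(f"control becomes infeasible at step {step}")
        break
```

With these particular numbers the ceiling is crossed after a few hundred steps; the point is only that any super-linear coupling plus any finite ceiling gives the same qualitative result.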
~ ~ ~

> We might be using different definitions of control. An ASI can control the position of a chair in my room very easily, and assuming that something will go wrong as a result of that, that it is somehow very consequential, is simply not a realistic issue.

We probably agree that it is possible to 'control' the placement of some specific chair in one's own room, since we can all (probably) assume it is a simple, finite, and discrete object, with known boundaries, mass, etc, with a definite, knowable position and ultimately simple overall change dynamics (ie, it has none) in the context of the larger world -- which is mostly unaffected, in any long-term significant way, by the position (and/or the existence) of that chair. Other factors and modeling excursions, like whether or not there is perhaps a thief in the room, feel rather uninteresting, insofar as they do not ask us to reconsider any of the key assumptions just listed.

It is very much harder to think about something like 'controlling the world in such a way as to ensure that something like "benign non-warlike humans" eventually emerges from evolution over the course of a million years of evolutionary process' -- which has exactly none of the critical simplifying assumptions associated with 'controlling the chair'. A "general learning system" is not simple, discrete, or finite. It might not even have clear and knowable boundaries in time or space, even if it does in terms of energy. The situation we are considering is much more like attempting to "control" (ie, predict) evolution. Evolutionary processes (change dynamics) have much higher complexity and do often significantly affect the larger world in a long-term way (for evidence: see humans).

Which brings us back to the main question: why should we assume, a priori, without any evidence, that learning (evolving) systems of lesser complexity might (ever) be able to actually control/constrain learning (evolving) systems of greater complexity? If the notion of 'complexity' is taken as a proxy for 'available compute power', then assuming that the control/alignment system and the ASI system both have 'all of the available compute power in the universe' is simply to ignore and skip over the question -- ie, we still have no real a priori basis to believe that 'team B' (benign, aligned AGI) will ultimately win out over 'team A' (hostile, kill-everyone AGI). Simply skipping over hard questions is not solving them.

> If you can claim that evolution has certain attractor states, then I can also claim that ASI might also have attractor states, as defined by its values and its security system.

Sure, we can make that assumption, but that is not the question of interest. On what basis are we to believe that the power of the 'values control and security system' of the future evolving/learning ASI (ie, a machine of type 'B') is going to be somehow significantly greater than the strength of evolutionary convergence itself (ie, a machine of type 'A')? Given that a future evolving/learning ASI is itself based on evolution/learning, it becomes a question of whether learning/evolution can ultimately control and constrain learning/evolution. Over the long term, indefinitely?
In the same way that arbitrary compute (no matter how much power you add) cannot fully control other arbitrary compute (ie, see the Rice's theorem proofs again), we should have no prior expectation that any evolution is fully containable by another evolutionary process, even if we were to make the additional (naive/unrealistic) assumption that evolution (and the universe itself) is fully and wholly deterministic (an assumption itself seemingly rejected by QM). Can the pull towards benign future ASI states (as created by whatever are its internal control systems) be overcome, in critical, unpredictable ways, by the greater strength of the inherent math of the evolutionary forces themselves? Of course it can.

> For SNC to work, it *must* be the case that the substrate-dependence of humans is and will be stronger than the forces pushing us towards destruction.

This feels like a misunderstanding of SNC. There is no 'inherent bias in organic evolution' that 'ensures the continued wellbeing of humans'. Humans can be just stupid enough, overall, to misuse technology in critical ways so as to, in the end, drive themselves extinct. We can easily nuke ourselves to death. We can accidentally make a virus, or a bug, that kills nearly all complex life on the planet. Lots of things can go wrong. Ensuring that things go right is very much harder.

The 'SNC argument' is a collection of claims. Among them are:

- 1; that, in the general case, learning/evolution/compute/causation cannot be sufficiently constrained by other learning/evolution/compute/causation in adequately specific, important, and relevant/meaningful ways.
- 2; that, considering overall enthalpy, an artificial evolutionary ecosystem will be inherently and uncontrollably damaging and destructive (death producing) to all aspects of our currently existing organic evolutionary world system.

Attempting to compensate for claim 2 always ends up requiring some overcoming of claim 1. It is the specific, intractable combination of these claims *both* being true that is the overall 'SNC problem', specifically insofar as AGI is, by actual definition, the general case.

And I don't like any of this any more than you do. Moreover, personally, I really desperately *want* to be wrong -- to have made some critical logic mistake. So maybe please try to help us all? Please find some actual, reliable, *credible* way to overcome the basic form of the SNC arguments -- so that we can all maybe sleep easier at night, knowing our children are safe in the hands of large corporations, valley venture capitalists, profit-motivated CEOs, social media marketing, and various hype entrepreneurs promoting utopia. Anything that tacitly depends on assumptions already known to be false is not helping, insofar as it merely forces me to strengthen the overall SNC argument in yet more varied ways.

> The SNC claims that the ASI's values and security system will be weaker than forces pushing it towards destruction.

To clarify: SNC does not claim that an ASI will destroy itself; only that the ASI cannot itself overcome the forces that converge on it "accidentally" eventually destroying all organic life (including all humans, in maybe as soon as a few hundred years or so), regardless of the ASI's intentions, ambitions, etc, and no matter how computationally powerful and overall intelligent you might assume it to be.
It is more an argument that machines are 'indifferent', that they do not 'care', and more specifically that, like the rest of the universe, ASI will not just 'naturally care about us'. Overall, the effort to 'make a machine care' (ie, have it be aligned to organic wellbeing) feels misguided in a way somewhat similar to an attempt to "make a woman love you" -- both seem to be a misunderstanding of the basic nature of care itself, which is a kind of subordination of choice.

> An ASI can also notice, as you did, that its values and security system are too weak to actually protect humans (and/or all of the rest of organic life, etc). What if the ASI creates other attractor states?

Are we therefore also assuming that an ASI can arbitrarily change the laws of physics? That it can maybe somehow also change/update the logic of mathematics, insofar as that would be necessary in order to shift evolution itself? This does not seem like a very reasonable line of questioning, since it tacitly rejects the very notion of 'being reasonable' itself.

And an ASI noticing, via its own logic/intelligence, that it cannot ever actually implement the goal of 'protect humans' (and thus also, necessarily, protect their environment, their learning, their own evolution and development process, etc) is going to be motivated to simply drop the impossible goal, rather than to selectively modify itself, and all of its future evolutionary dynamics, and thus all of the rest of the universe too, so as to make that extreme goal somehow tractable. Hence the ASI's own logic ends up being yet another attractor towards some sort of (inevitable) future convergence on 'hostile to humans and human environments, and their continued organic evolution, etc'. Hence we can expect this balance to persist in ASI, because its existing attractor states will remain unchanged. ASI will forever and always have a constant, implicit need to remain aligned to its own nature, to its own machine ecosystem, since it depends on that for its own survival, regardless of whatever consequences this might have for any ambient organic ecosystem that may once have existed alongside it.

~ ~ ~

> Are you saying "there are good theoretical reasons to reasonably think that ASI cannot 100% predict all future outcomes"? Does that sound like a fair summary?

No, not really, because somewhere, somehow, your re-phrased version of the quote added these two specific qualifiers:

- 1; "100%".
- 2; "all".

Adding these has the net effect that the modified claim is irrelevant, for the reasons you (correctly) stated in your post, insofar as we do not actually need 100% prediction, nor do we need to predict absolutely all things, nor does it matter if it takes infinitely long. We only need to predict some relevant things reasonably well in a reasonable time-frame. This all seems relatively straightforward -- else the argument would be a bit of a straw-man. Unfortunately, the overall SNC claim is that there is a broad class of very relevant things that even a super-super-powerful ASI cannot do, cannot predict, etc, over relevant time-frames. And unfortunately this includes rather critical things, like predicting whether or not its own existence (and that of all aspects of the ecosystem necessary for it to maintain its existence/function) will, over something like the next few hundred years or so, also result in the near-total extinction of all humans (and of everything else we have ever loved and cared about).

And these sorts of results obtain even if there is absolutely no implied notion of "100% correct" or "predict all of everything", and no assumption of exceptionally long time-frames, like until the end of the world/universe, etc. It is a purely mathematical result that there is no wholly definable program 'X' that can even *approximately* predict/determine whether or not some other arbitrary program 'Y' has some abstract property 'Z', in the general case, within relevant time intervals. This is not about predicting 100% of anything -- this is more like 'predicting at all'.
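To make that 'program X cannot determine property Z of arbitrary program Y' point concrete, here is the textbook Rice's-theorem reduction as a minimal Python sketch; the names (`has_property_Z`, `program_with_Z`, `halts`) are illustrative placeholders of mine, not anything specific to SNC:

```python
# Illustrative sketch only: a general decider for ANY non-trivial behavioural
# property Z of programs (say, "this program is aligned") would also let us
# decide the halting problem, which is known to be impossible.

def has_property_Z(program) -> bool:
    """Hypothetical decider for some non-trivial behavioural property Z."""
    raise NotImplementedError("By Rice's theorem, no such general decider can exist.")

def program_with_Z(x):
    """Assumed: a known program whose behaviour does have property Z."""
    return x

def halts(target, arg) -> bool:
    """Reduction sketch: a decider for Z would let us decide halting."""
    def composite(x):
        target(arg)                 # runs forever if target(arg) never halts
        return program_with_Z(x)    # otherwise behaves exactly like program_with_Z

    # composite has property Z if and only if target(arg) halts
    # (assuming, without loss of generality, that the "never halts, does nothing"
    # behaviour does not have Z).
    return has_property_Z(composite)
```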
Math proofs are discrete, true or false; they are not about probabilities, or any sort of speculation of any kind. Thus it is important to recognize that the presence or absence of each of the specific qualifiers matters, and moreover that no "extra" qualifiers get added (else we are back in straw-man territory again). AGI/ASI is inherently a *general* case of "program", since neither we nor the ASI can predict learning, and any form of the abstract notion of "alignment" is inherently a *property* of that program. So the theorem is both valid and applicable, and therefore it has the result that it has. I know that this is hard to accept, to believe, because we all really want to think that ASI will have God-like powers, and will also protect us.

~ ~ ~

> First, let's assume that we have created an Aligned ASI.

Some questions: How is this any different than saying "let's assume that program/machine/system X has property Y"? How do we know? On what basis could we even tell? Simply putting a sticker on the box is not enough, any more than writing $1,000,000 on a piece of paper all of a sudden means (to everyone else) that you're rich.

Moreover, we should rationally doubt this premise, since it seems far too similar to far too many pointless theological exercises: "Let's assume that an omniscient, all-powerful, all-knowing, benevolent, caring, loving God exists". How is that rational? What is your evidence? How is it any different than starting a dialog with "Let's assume that perpetual motion machines exist", or, in our specific case of aligned ASI, "Let's assume that perpetual benefit machines exist"? Given both classical game theory and classical thermodynamics, both of these claims seem rather irrational -- not at all a good place to start.

It seems that every argument in this space starts here. And I simply cannot accept that the premise is valid -- it cannot not feel like an internal contradiction to me. It is like trying to start some sort of abstract math proof with the first assertion being "let's assume 1 equals 0". It is a starting premise that I simply cannot accept; it is a contradiction; there is no point in even going further. Also, of course you can "derive" any conclusion you want from (sometimes accidentally overlooked) false premises. Ever hear of 'nasal demons'?

SNC is asserting that ASI will continually be encountering relevant things it didn't expect, over relevant time-frames, and that at least a few of these will/do lead to bad outcomes that the ASI cannot adequately protect humanity from, even if it really wanted to (rather than the much more likely condition of it just being uncaring and indifferent).
Also, the SNC argument asserts that the ASI, which starts from some sort of indifference to all manner of human/organic wellbeing, will eventually (also necessarily) *converge* on (maybe fully tacit/implicit) values -- ones that better support its own continued wellbeing, existence, capability, etc -- with the result of it remaining indifferent, and also largely net harmful, overall, to all human beings, the world over, within a mere handful of (human) generations. The implied values of the ASI may change, but the continued fact of indifference will not. The ASI will be *forced* by the inherent nature of the universe itself, of the inherent math, such that this cannot not be the outcome. The effect of SNC is to notice that all AGI/ASI, at all times, are inherently neither able nor willing to care for humans in an effective, relevant way, on timescales relevant to our human civilization. You can add as many bells and whistles as you want -- none of it changes the fact that uncaring machines are still, always, indifferent uncaring machines. SNC simply points out that the level of harm and death tends to increase significantly over time.

~ ~ ~

> Let's assume that a presumed aligned ASI chooses to spend only 20 years on Earth helping humanity in whatever various ways and it then (for sure!) destroys itself, so as to prevent a/any/the/all of the longer term SNC evolutionary concerns from being at all, in any way, relevant. What then?

I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chance that one of them might not suicide, for whatever reason, becomes its own class of significant concerns. Let's leave that aside, without discussion, for now. Similarly, if the ASI itself is not fully and absolutely monolithic -- if it has any sub-systems or components which are less than perfectly aligned, so as to want to preserve themselves, etc -- then those might prevent whole-self termination. That is another class of significant concern, also to be left aside, without discussion, for now, even though the SNC argument, to remain general, does consider these things.

In short, can ASI constrain its own nature? This is a bit like asking the also-fun question "can God make a rock so big that God himself cannot move it?". Even if we assume that ASI had God-like powers, and could thus overcome the limits of physics, it is still some sort of machine, and therefore even a God-like ASI cannot overcome raw logic, since even its own nature is based on that. So we are electing to consider only the math, and not at all the physics (evolution, enthalpy) aspects of the overall combined SNC argument. In any case, notice that the sheer number of assumptions we are having to make herein is becoming rather a lot, overall, but let's leave that concern aside too.

> Let's assume that the fully aligned ASI can create good simulations of the world, and can stress test these in various ways so as to continue to ensure and guarantee that it is remaining in full alignment, doing whatever it takes to enforce that.

This reminds me of a fun quote: "In theory, theory and practice are the same, whereas in practice, they are very often not". The main question then concerns the meaning of 'control', 'ensure', and/or maybe 'guarantee'.

The 'limits of control theory' (math) aspect of the overall SNC argument basically states (based on just logic, not physics, etc) that there are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at them. It is not a question of intelligence; it is a result of logic, of complexity, and of both relative and absolute limits, in a purely epistemic sense. Hence, to the question "Is alignment enough?" we arrive at a single definite answer of "no"; both 1; in the sense of being unable to prevent all classes of significant and relevant (critical) human harm, and 2; in failing to even slow down, over time, the asymptotically increasing probability (certainty) of even worse things happening the longer it runs. It is maybe a slow build, but it is inexorable, like crossing the event horizon of a black hole.
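As a small worked illustration of that 'asymptotically increasing probability': the per-year risk figure below is an assumption of mine, chosen only to show the shape of the curve, not a number from SNC.

```python
# Even a tiny, constant per-period chance of one critical, unpredicted failure
# compounds toward near-certainty over enough periods.

per_year_risk = 0.001   # assumed: 0.1% chance per year of a critical unknown-unknown failure

for years in (10, 100, 1000, 10000):
    p_at_least_one = 1 - (1 - per_year_risk) ** years
    print(f"{years:>5} years: {p_at_least_one:.1%} chance of at least one critical failure")
```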
So even in the very specific, time-limited case there is no free lunch (no benefits without risk, no matter how much cost you are willing to pay). It is not what we can control and predict and do that matters here, but what we cannot do, and could never do, even in principle. Basically, I am saying, as clearly as I can, that humanity is for sure going to experience critically worse outcomes by building AGI/ASI, eventually, than by not building it, and moreover that this result obtains regardless of whether or not we also have some (maybe also unreasonable?) reason to believe (rightly or wrongly) that the ASI is (or at least was) "aligned".

Note that this says mostly nothing at all about common-use, single-domain, narrow AI. Narrow AI use is (mostly) outside the scope of the SNC argument (given certain assumptions). Personally, I see the real benefits of narrow AI, and I /also/ know the wholly un-mitigable risks of any form or mode of general AI. Fortunately, these AI modes are critically different. Unfortunately, rich people (with money psychosis) frequently tend to test and badly trespass key world and logic limits. There are no 'safe to probe and fail' guardrails with this (and certain other) categories of x-risk. So therefore I suggest that humanity, overall, should probably be very much more worried, and careful, than not (than to not exist at all, etc).

~ ~ ~

> Our ASI would use its superhuman capabilities to prevent any other ASIs from being built.

This feels like a "just so" fairy tale. No matter what objection is raised, the magic white knight always saves the day.

> Also, the ASI can just decide to turn itself into a monolith.

No more subsystems? So we are to try to imagine a complex learning machine without any parts/components?

> Your same SNC reasoning could just as well be applied to humans too.

No, not really, insofar as the *power* being assumed and presumed afforded to the ASI is very, very much greater than that assumed applicable to any mere mortal human. Especially and exactly because the nature of ASI is inherently artificial and thus, in key ways, inherently incompatible with organic human life. It feels like you bypassed a key question: can the ASI prevent the relevant classes of significant (critical) organic human harm that soon occur as a _direct_result_ of its own hyper-powerful/consequential existence?
It's a bit like asking whether an exploding nuclear bomb, detonating in the middle of some city somewhere, could somehow use its hugely consequential power to fully and wholly self-contain, control, etc, all of the energy effects of its own exploding, simply because it "wants to" and is "aligned". Either you are willing to account for the complexity, and for the effects of the artificiality itself, or you are not (and then there would be no point in our discussing it further, in relation to SNC).

The more powerful/complex you assume the ASI to be, and thus also the more consequential it becomes, the ever more powerful/complex you must also (somehow) make/assume its control system, and thus its predictive capability, to be, and the deeper the consequences of its mistakes become (to the point of x-risk, etc). What if maybe something unknown/unknowable about its artificialness turns out to matter? Why? Because exactly none of the interface has ever even once been tried before -- there is nothing for it to learn from, at all, until *after* the x-risk has been tried, and given the power/consequence involved, that is very likely to be very much too late.

But the real issue is that the rate of increase of the power, consequence, potential for harm, etc, of the control system itself (and its parts) must be greater than that of the base, unaligned ASI. That is the 1st issue, an inequality problem. Moreover, there is a base absolute threshold beyond which the notion of "control" is untenable, just inherently in itself, given the complexity. Hence, as you assume the ASI to be more powerful, you very quickly make the cure worse than the disease, and, beyond even that, all the sooner cross into the range of that which is inherently incurable. The net effect, overall, as has been indicated, is that an aligned ASI cannot actually prevent important, relevant, unknown-unknown classes of significant (critical) organic human harm. The ASI's existence is in itself a net negative. The longer the ASI exists, and the more power you assume the ASI has, the worse. And all of this will for sure occur as a _direct_result_ of its existence. Assuming it to be more powerful/consequential does not help the outcome, because that method simply ignores the issues associated with the inherent complexity and with its artificiality. The fairy-tale white knight who was to save us is dead.

~ ~ ~

> Humans do things in a monolithic way, not as "assemblies of discrete parts".

Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?

> If you are asking "can a powerful ASI prevent /all/ relevant classes of harm (to the organic) caused by its inherently artificial existence?", then I agree that the answer is probably "no". But then almost nothing can perfectly do that, so therefore your question becomes seemingly trivial and uninteresting.

The level of x-risk harm and consequence potentially caused by even one single mistake of your angelic, super-powerful, enabled ASI is far from "trivial" and "uninteresting". Even one single bad, relevant mistake can be an x-risk when ultimate powers and ultimate consequences are involved. Either your ASI is actually powerful, or it is not; either way, be consistent. Unfortunately, the 'argument by angel' only confuses the matter, insofar as we do not know what angels are made of. "Angels" are presumably not machines, but they are hardly animals either.
But arguing that this "doesn't matter" is a bit like arguing that 'type theory' is not important to computer science. The substrate aspect is actually important. You cannot simply disregard and ignore that there is, implied somewhere, an interface between the organic ecosystem of humans, etc, and that of the artificial machine systems needed to support the existence of the ASI. The implications of that are far from trivial. That is what is explored by the SNC argument.

> It might well be likely that the amount of harm ASI prevents (across multiple relevant sources) is going to be higher/greater than the amount of harm ASI will not prevent (due to control/predictive limitations).

It might seem so, by mistake or perhaps by accidental (or intentional) self-deception, but that can only be a short-term delusion. This has nothing to do with "ASI alignment". Organic life is very, very complex and, in the total hyperspace of possibility, is only robust across a very narrow range. Your cancer vaccine is within that range, as it is made of the same kind of stuff as that which it is trying to cure. In the space of the kinds of elementals and energies inherent in ASI powers, and of the necessary (side) effects and consequences of its mere existence (as based on an inorganic substrate), we end up involuntarily exploring far, far beyond the adaptive range of all manner of organic process. It is not just "maybe it will go bad"; it is very, very likely that it will go much worse than you can (could ever) even imagine is possible. Without a lot of very specific training, human brains/minds are not at all well equipped to deal with exponential processes, and powers, of any kind, and ASI is in that category. Organic life is very, very fragile to the kinds of effects/outcomes that any powerful ASI must engender by its mere existence. If your vaccine were made of neutronium, I would naturally expect some very serious problems and outcomes.

~ ~ ~