Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Equality Is Hard (craigstuntz.com)
94 points by ohjeez on March 9, 2020 | hide | past | favorite | 55 comments


As much as null is considered the "billion dollar mistake", personally, I think that Java giving Object a virtual equals() and hashCode() was an even bigger mistake. NPEs blow up loudly in your face and are fairly easy to lint for. By comparison, bad equals() and hashCode() implementations can cause difficult-to-track-down errors. Unfortunately, this design was inherited by C# and other OO languages. C# does have the IEqualityComparer interface, which helps in some cases, but many containers and other framework methods rely on an overridden instance equality method.


C's actual billion dollar mistake was losing array bounds with every function call:

https://www.digitalmars.com/articles/b44.html


The mistake wasn't in giving Object those functions. You get "=" behavior for all types in a given language, and it's nice to be able to control this somehow (i.e. they were actually thinking a little bit about the equality problem here). The problem came when they decided that the default behavior should be a pointer comparison rather than a member-by-member sub-equals. As for hashCode(), it should have thrown an exception as the default behavior, because there's no way a language designer can anticipate what makes a proper hash for class XYZ.


Recursive by default (which is what member-by-member implies) leads trivially to infinite recursion in reference loops, which your default equals would need to guard against, greatly increasing its complexity.

Reference equality is generally fine for mutable types, you probably only want value equality for immutable types. If you use value equality for mutable types, you start getting into trouble when you use them as e.g. hash keys.


You are correct, but it's far better to have expensive-by-default but safe operations vs the counterintuitive mess you end up with when comparing references by default.

The point is that you need to counter human nature, which is to not do work you can get away with not doing. If the worst case result of not doing the work (i.e. not creating an equals method) is slow performance, that's better than getting incorrect results. You'll have a much harder time designing a useable language if your understanding of human psychology is poor.

Besides, in 90% of cases, you won't have potentially recursive types, and so a smart compiler can optimize for this when generating a default equals method. In fact, many cases could just degenerate into a memcmp.


I'm not so sure, when Java was created computers weren't that fast.. Of course, now with computers fast enough that Python and JavaScript are king, sure!


Member-by-member sub-equals might make more sense in some situations, but not all (as the article discusses, equality is context-dependent). By making equals() an instance method, encapsulation is broken by tying an external concern (equality) to the internal implementation of a class. It would have been better if Object had no instance equals() or hashCode() method at all, with static methods providing non-reference-equality semantics. Something like:

    @FunctionalInterface
    interface Equivalence<T>
    {
        boolean equals(T one, T other);
    }

    @FunctionalInterface
    interface Comparison<T>
            extends Equivalence<T>
    {
        int compare(T one, T other);

        @Override
        default boolean equals(final T one, final T other)
        {
            return compare(one, other) == 0;
        }
    }

    @FunctionalInterface
    interface HashFunction<T>
    {
        int hashCode(T obj);
    }

    interface Equalitor<T>
            extends Equivalence<T>, HashFunction<T>
    {
    }

    interface Comparator<T>
            extends Comparison<T>, Equalitor<T>
    {
    }

    interface Equatable
    {
        Equalitor<? super Object> equivalence();
    }

    interface Comparable
            extends Equatable
    {
        @Override
        Comparator<? super Object> equivalence();
    }

    final class Objects
    {
        static int compare(final Object one, final Object other)
        {
            if (one == other)
            {
                return 0;
            }

            if (one == null)
            {
                return -1;
            }

            if (other == null)
            {
                return 1;
            }

            if (one instanceof Comparable oneCmp && other instanceof Comparable otherCmp)
            {
                final Comparison<? super Object> cmp = oneCmp.equivalence();

                if (equals(cmp, otherCmp.equivalence()))
                {
                    return cmp.compare(one, other);
                }
            }

            throw new SomeRuntimeException();
        }

        static boolean equals(final Object one, final Object other)
        {
            if (one == other)
            {
                return true;
            }

            if (one instanceof Equatable oneEq && other instanceof Equatable otherEq)
            {
                final Equivalence<? super Object> eq = oneEq.equivalence();

                if (equals(eq, otherEq.equivalence()))
                {
                    return eq.equals(one, other);
                }
            }

            return false;
        }

        static int hashCode(final Object obj)
        {
            if (obj instanceof Equatable objEq)
            {
                final HashFunction<? super Object> hf = objEq.equivalence();
                return hf.hashCode(obj);
            }

            return someIntrinsicHashCode(obj);
        }
    }
(The generic parameters aren't correct, but to give a rough idea)


Yes, that's why I was recommending that the default implementation be a compiler-generated member-by-member non-reference comparison, with the option to override that behavior by providing your own custom implementation of equals(). Then you get the intuitive result for the 80% case, and have the means to write custom code for the other 20%. Not having an equals in the root object makes equality checks on type-erased values impossible because there would be no standard, reliable way to do it (C++ solved this with operator overloads).

hashCode, on the other hand, would have been better as an interface so that the developer is forced to think about how it gets generated (because a default implementation is more likely to get it wrong than right). The Map classes, for example, would only accept key values that implement Hashable.


What I was trying to get across with my example was that you can define a standard, reliable way to perform equality checks on type-erased values without having equals on the root object. Granted, it looks a bit wonky, but it provides safer behavior than an instance equality method.


I understand, but the issue here is with making a fundamental operation (equality) an optional interface, which would make things unnecessarily complicated for the user.

We always need the ability to check for equality, regardless of type (so much so that we have an "=" operator). The trouble comes with deciding what equality even means (as this article pointed out), because it really depends on the types you're comparing.

C++ decided that the best way to solve it would be to allow operator overloading. Unfortunately, overloading "=" worked so well that they thought "Hey! We could do the same for ALL operators!", and the mess that is C++ operator overloading was born.

The Java folks figured that they'd learned from C++ mistakes, and would avoid them by forbidding operator overloading entirely, thus throwing the baby out with the bathwater (and also they cheated because the "+" operator is overloaded for strings).

What they should have done is simply allowed overloading of the "=" operator ONLY, then maybe also add a special notation to say "I mean the actual reference (i.e. pointer value) as opposed to the object contents" (to cover the 20% use case where you actually DO want to know if it's the same reference). This would have obviated the need for equals(), made "=" work the way you'd intuitively expect it to, and also allowed implementors to override the default implementation to "do the right thing", even when you don't know the type of the values you are comparing (for example after fetching from an untyped collection).


>and also they cheated because the "+" operator is overloaded for strings

Which isn't a good idea IMHO, in most case a + b == b + a.. Also using a different operator for concatenation (say ~ like D) allow unambiguous array-extension of these operation: a[]+b[] is element by element addition and a[]~b[] is concatenation.


There's something interesting about 'sortBy' semantics which you see in some languages.

As the cost of your comparator goes up (especially on real hardware with pointer chasing) the difference between running the comparator m * nlogn times and running the transform m*n times could start to stack up.


I agree. It might make sense to use something like memcmp() on structs to assert that they are exactly the same, or to have a dedicated comparison function that makes sure they are functionally equivalent for some defined need.

However Objects should have only the second and it might also need to be (I think replacing a function pointer might be) an over-loaded function if something has extended a base object type. I'm not even actually positive that could work if there are two extensions to a base type.


> It might make sense to use something like memcmp() on structs to assert that they are exactly the same

No, actually, it doesn't make sense in general. Due to padding, structs may have arbitrary meaningless bytes between the actual data fields, and memcmp will break on those. It works because structs are often filled with zeroes on initialization, but it isn't required or enforced in any way. I was burnt by this problem more than once, not really fun to debug


Fun, create a struct on the stack. Set the elements individually. Padding is now filled with garbage. Any elements you forget to set will also be garbage. When you copy the struct the garbage goes with it.

I've been burnt too.


You've missed the 'functionally equal' check option; which covers cases like you're focusing on.

To check if two differently created things happen to be equal, yes, you should prefer the function that checks exactly what matters exactly how it matters.

As you rightly point out, the memcmp check would only work on either pre-zeroed memory (depends on environment and what you're doing) or actual literal copies of a specific struct.


memcmp() on your structs means any fields of a reference type are back to using reference equality. You need to recurse if you're going to be consistent.


> First, programming languages should make it simple to create types where equality comparison is disabled because it makes no sense

This is kind of a silly "solution" as some languages don't even have types, others have "soft" types, and others support arbitrary typecasting. (Not even going to get into duck typing, union types, etc. Type theory is hard.)

> Programming languages must make the difference between structural equality and reference equality crystal clear

This is probably intractable. Author doesn't even touch cases where you have objects that contain other objects. Are these references or values? How does the equality operator work there? How deep does the rabbit hole go?

If you only deal with values, then you're probably crippling performance. There are reasons for these design concessions, it's unfair to assume that people that designed these languages are complete doofuses.

> One might look at the length of this post and say, “Wow, equality is really complicated! I’m going to give up coding and become a soybean farmer.”

Farming is magnitudes harder than what the average programmer does day-to-day :)


> This is kind of a silly "solution" as some languages don't even have types, others have "soft" types, and others support arbitrary typecasting.

Well, more fool those languages. It's a good solution, and if your language can't offer an equivalent, take it up with your language's maintainers.

> This is probably intractable. Author doesn't even touch cases where you have objects that contain other objects. Are these references or values?

Let's not declare it intractable before we've actually tried. The language would need to draw a clear distinction between (possibly compound) values and objects-with-identity - like the distinction between struct and class in C#, or between case class and class in Scala - and ripple that distinction all the way up; values should only be allowed to contain other values, should be immutable, and should be able to be treated in a generic way (i.e. a record system). Objects-with-identity should be rarer and heavily encapsulated, almost like Erlang actors; the simplest example is a reference cell (i.e. a mutable thing holding a value of a particular type that can be get and set). Doesn't sound like such a bad language design.


I'd have to think about this more, but I think your purported solution (mostly immutable record-type objects) would tax performance and memory constraints.


Haskell already works that way, and is a high-performance language. (The same is true of strict ML-family languages, so laziness is not required to make this practical). Those languages generally just don't bother with reference comparisons at all, but one could imagine a more hybrid language that included support for traditional OO - in fact idiomatic Scala works like this (and again is a high-performance language, though to some extent that relies on what's probably the best garbage collector in the world). I don't know what kind of support Erlang has for making comparisons between actor identities, but that might be an even better example.


> This is probably intractable

You might be thinking too ambitiously. Having separate operators for identity vs. structural equality checks would suffice.


In fairness, he did also say that being a soybean farmer was harder.


I really like this about Rust, implementing equality is just another trait. The same is true of comparison operators. If you don’t want a type to be able to be compared by == you can make it always evaluate to false or even panic and give a message that the type cannot be used in that way. That might aggravate coders who do not test comprehensively but it is an option.

Edit: This is a response to the concerns of the article. For general info on this it’s a great introduction to Rust’s awesome documentation.

https://doc.rust-lang.org/std/cmp/trait.Eq.html


You mean there is no compile time way to accomplish that? Doesn't sound very Rust-like.


It is determined at compile time; I'm not sure why GP is suggesting returning false if you don't want types to be compared. By default structs are not comparable. If you opt-in to the standard PartialEq implementation using the `derive` annotation[1], it will only allow a struct to be compared for equality with other structs of the same type. You can choose to add implementations to compare against other types[2].

[1] https://doc.rust-lang.org/book/ch05-02-example-structs.html#...

[2] https://doc.rust-lang.org/std/cmp/trait.PartialEq.html#how-c...


Rust is such a delightful language.


“Option” is the operative word. A type that does not implement the correct trait cannot be used in a way that requires the trait. If you are writing anything non-trivial that would probably be preferable. It requires nothing, that is the default behavior, so I did not bother to mention it.


I don't know why they would make it always return false. In Rust you can just not implement Eq for the type, and it becomes a compile error to try and compare it.


To correct the opening statement: The two hard problems in computer science are cache invalidation, naming things, and off-by-one errors.



I’m surprised the article discusses extensional equality for functions, then doesn’t mention why languages don’t try to derive it: it’s equivalent to solving the Halting Problem (per Rice’s Theorem).


That's only for Turing complete languages. It's not true for total functions. Total functions can be written in Coq, Agda and Idris among others.


Funnily, in programming languages based on Martin Lof Type Theory, the topic of function extensionality is a whole rabbit hole for reasons other than the halting problem.

There is a distinction between intensional Martin Lof type theory and extensional Martin Lof type theory, which differ in their treatments of equality. First, one must distinguish between judgmental equality x = y, which is a statement in the mathematical metalanguage that two terms are equal, and propositional equality, which "internalizes" judgmental equality within the programming language through the equality type Id(A, x, y), the type of proofs that x and y are judgmentally equal. In extensional type theory, judgmental equality also follows from propositional equality, and together with the eta-conversion rule for functions, function extensionality is provable [0]. However, typechecking extensional type theory is undecidable [1].

Alternatively, in Homotopy Type Theory (HoTT), supports function extensionality is provable from the univalence axiom, which states that isomorphic types are equal [2]. Function extensionality is trivially provable in Cubical Agda [3], which gives a computational interpretation of HoTT's notion of equality.

[0] https://cstheory.stackexchange.com/questions/46331/extension...

[1] https://cs.stackexchange.com/a/112559

[2] https://ncatlab.org/nlab/show/function+extensionality#relati...

[3] https://agda.readthedocs.io/en/v2.6.0.1/language/cubical.htm...


Yeah, I didn't mean to insinuate that function extensionality is provable or testable in those languages, just that one can ensure that one is actually writing total functions today. I'm sure the same could be done in a simple or polymorphic type theory, I'm just not aware of one. Is there a Haskell extension like Idris's totality checking?

I was gonna say, just wait until the OP hears about HoTT.

Thanks though, I didn't know Cubical Agda exists.


> Is there a Haskell extension like Idris's totality checking?

There's LiquidHaskell https://ucsd-progsys.github.io/liquidhaskell-blog/


Even for things like regular expressions checking for equality can cost exponential time. Whether something is exponential or undecidable is pretty much irrelevant if you want to run the algorithm as part of compilation.


Good thing that's not relevant in general then.


I think Rice's theorem is a big hammer here: it's equivalent to the halting problem because one can just test extensional equality of

    given M: simulate M; return 1
and

    given M: return 1
Or is it even true of extensional equality for _total_ functions?


"I think Rice's theorem is a big hammer here"

It is certainly a big hammer, but as theoretical computer science tools go, it is pleasingly easy to use. Unfortunately it is pleasingly easy to use in much the same way high explosives are pleasingly easy to blow yourself up with, as it annihilates so many things that we'd rather like to keep. But, still... easy to use.


If no mainstream programming languages implement equals like math—an obvious and well known example throughout the history of computing—maybe it’s because that’s a bad idea.


It's because they are dealing with different things. In math, things are immutable, so you either "compare all fields" for the most fine-grained equality, or a more "coarse" equality that is compatible in the sense that if two things are "finely equal", they are implied to be "coarsely equal".

In programming, on the other hand, you have variables that change their value over time. Many, many problems with equality have to do with how they interact with time and change.

Those languages that stay away from (implicit) time and change, such as Haskell, don't have the same problems with equality. They do have other problems with equality being undecidable, which results in some types not being able to derive equality (especially functions, since automatically derived equality would mean to solve the halting problem).


Or really hard, when you're communicating with a computer and not another human. Lots of concepts that are easy to communicate to a human, are difficult to communicate to a computer, not just equality.

It is a good point, though, that if virtually EVERY mainstream language gets it wrong, when every single one of the people creating those languages knew about the math =, there has to be a reason.


and what might that reason be?


It depends what the meaning of "is" is.


Philosophers distinguish three senses of "is":

- Identity: "Clark Kent is Superman." Anything true of Superman is true of Clark Kent and vice versa, since these are two names for the same thing. - Predication: "Superman's cape is red." Red is a quality of the cape. It's not reflexive because not all red things are capes. - Existence: "There is no Superman." In English, existence sentences usually begin with "there is" or more poetically, you say "Superman is ~not~." Existence is not a relation.

In the case of Clinton, he was being evasive about "is" versus "was" in the time dimension. There was a sexual relationship between him and Lewinski, but he called there "is" no relationship in general.


I know that opening line as "there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors"


I'm pretty sure the original statement is just with "cache invalidation" and "naming things", and later the "off-by-one errors" was added, as when I first learned about this quote, it was never with the "off-by-one errors" parts but nowadays, seems to always include the three of them.


I came here for this, too. You're not mistaken.


I have to admit that I expected an article on a different type of equality before I clicked on the link.


I, uh, equally expected that.


Is reference equality ever useful in itself, in a high level language?

Of course reference equality is a straightforward optimization for structural equality, as mentioned in the post. But it is not required to make it visible to the programmer.

Reference equally is mandatory for the garbage collector. Again it may not be visible to the programmer.

Finally there is the case of two data structures sharing a reference to a single instance of some data. Usually that is a bad idea if the pointed to data is mutable, and often makes the program leaky.

The paper reads: "So reference equality is a sensible default in a language which supports mutation.". Is it not a bad reason to support reference equality?


> Never use == in JavaScript. Use === instead.

Or better, know what == [0] and === [1] do by looking at the ECMAScript reference

[0]: https://www.ecma-international.org/ecma-262/10.0/index.html#...

[1]: https://www.ecma-international.org/ecma-262/10.0/index.html#...


Hm...

let john = { Name = "John"; Age = 15; Offspring = [] }; let jane = { Name = "Jane"; Age = 47; Offspring = [ john ] } let sue = { Name = "Sue"; Age = 35; Offspring = [ john ] }

Not very realistic :)


I believe the erlang VM and the languages on top of it get all four conditions correct (not counting rebinding a variable to a new value, which you can do in some BEAM languages)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: