The homoglyph attack is a very old and often abused technique. A related, though not identical, technique is bit flipping, where a single character in a domain name is swapped and you prey on those who make misspellings. It turns out, however, that it isn't even necessary for someone to make a blatant error like that...
I actually purchased a few erroneously spelled domain names for the purpose of experimentation and ended up with A LOT of traffic.
Because I got so much traffic, I decided to make the sites useful, so I wrote a little php script that downloads the latest technology RSS feeds and displays the headlines.
On one particular domain I was getting more than 20 hits per day.
Some of these attacks, such as the exe.doc one, are the fault of Windows' use of in-band signaling (the file extension) to indicate that a file is executable. You can't do that on OSX or Linux, where out-of-band attributes determine execute capability.
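The exe.doc trick is easy to reproduce. Here's a quick Python sketch (the filename "invoice" is my own made-up example) showing how the RIGHT-TO-LEFT OVERRIDE character makes the displayed name lie about the real extension:

```python
# U+202E (RIGHT-TO-LEFT OVERRIDE) forces everything after it to render
# in reverse order, until the paragraph ends. So the *actual* extension
# stays .exe while the *displayed* name appears to end in .doc.
RLO = "\u202e"

# Hypothetical malicious filename: invoice[RLO]cod.exe
displayed_as_doc = "invoice" + RLO + "cod.exe"

# What a naive renderer shows: "invoice" + "cod.exe" reversed
visual = "invoice" + "cod.exe"[::-1]

print(repr(displayed_as_doc))  # the raw string still ends in .exe
print(visual)                  # what the user *sees*: invoiceexe.doc
```

The real file system sees a perfectly ordinary `.exe`; only the rendered text is misleading.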
The equivalent domain name issues are a lot tougher, and are going to require a character lookalike table or some other system of rules to warn the user.
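A lookalike table could be sketched along these lines in Python (the mappings and function names here are my own illustration; real implementations draw on Unicode's much larger "confusables" data set):

```python
# Tiny, hand-picked subset of visually confusable characters.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u03b1": "a",  # GREEK SMALL LETTER ALPHA
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}

def skeleton(domain: str) -> str:
    """Map lookalike characters onto their ASCII twins."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in domain)

def looks_like(candidate: str, trusted: str) -> bool:
    """Warn when a domain is confusable with a trusted one but not equal."""
    return candidate != trusted and skeleton(candidate) == skeleton(trusted)

# "pаypal.com" with a Cyrillic 'а' collides with the real thing:
print(looks_like("p\u0430ypal.com", "paypal.com"))
```

A browser could run something like this against the user's bookmarks or a list of popular domains before rendering the URL bar.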
That's not quite accurate. NTFS has fine-grained permissions, including execute (which is independent of read/open). If a program decides to download or save an attachment and leave execute permissions set, then yes, you can execute it.
It wouldn't be any different on Linux if your email client saved attachments and marked them +x. As a matter of defaults, sure, Windows is different.
Oh, thanks for pointing that out - actually, I think it would be a pretty good idea to deny execution by default in folders used by email clients, instant messengers, browsers and removable drives.
Still, someone could save the file in a different location and execute it. Is there a way to identify which program generated a specific file? Or perhaps we could solve this by allowing these programs (like Skype.exe) to save files only in specific folders. Is that possible?
It's hard to retroactively implement this model on Win32 desktop apps. Some programs (Skype, for instance) actually create and run executables on-the-fly (for their stupid obfuscation stuff).
Files should inherit the folder ACL, so a deny execute on a folder would work nicely, but end up breaking things for users.
Microsoft addresses these issues with the "WinRT" platform and sandboxed/appstore model.
I was pretty skeptical of that particular example anyway. If a user is opening random files from untrusted sources, security has already gone out the window.
Well on a mass scale you're right, but if you're targeting a particular person X, you may create a throwaway gmail account impersonating X's friend Y, and send an innocuous-looking pdf/doc/xls/lolcat gif as an attachment. IMO lots of people may fall prey to this kind of attack if the email looks legit.
are you kidding? there's a huge difference between parsing a document with a program you've already trusted and installed, and running an executable.
your web browser literally downloads and parses documents all day long. pdf and .doc files might be marginally less secure, but they're by no means supposed to be executable.
if windows refuses to display the correct extension, I suppose you have to manually open them from whatever you want to view them with (File|Open) after downloading them to a certain folder.
I'd have zero qualms about doing this with a .txt file on windows for example - wouldn't you?
I knew about the A vs. Α vs. А issue, where visually similar/identical characters map to different domain names. But I didn't know IDNs also could map visually different characters to the same domain names. I would've guessed that full-width characters would be punycoded as well, rather than treated as their ASCII equivalents. Is this done with any other characters?
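This is easy to see from Python. Full-width Latin letters are "compatibility" characters, and the NFKC normalization that IDNA applies before encoding folds them into plain ASCII, so they never reach the Punycode stage; the same is true of many other compatibility characters (ligatures like ﬁ, superscripts, etc.). Genuinely non-ASCII labels do get Punycode-encoded:

```python
import unicodedata

# Full-width "ｅｘａｍｐｌｅ" + ASCII ".com"
fullwidth = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45.com"
folded = unicodedata.normalize("NFKC", fullwidth)
print(folded)  # example.com -- the full-width letters collapse to ASCII

# A truly non-ASCII label, by contrast, is Punycode-encoded:
encoded = "bücher.com".encode("idna")
print(encoded)  # b'xn--bcher-kva.com'
```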
Perhaps this is not so surprising. Prior to IDNs, the DNS also did case folding so "a" and "A" would go to the same place.
One of the particular challenges with IDNs is that there are two versions of the specification, a deprecated 2003 version and the current 2008 version. For a few characters they provide subtly different transforms. The 2008 version also ratchets down on a lot of non-sensical characters — they are no longer eligible in domain names. The remaining permissible set is quite conservative to limit some of the issues seen in the original version.
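One of the well-known divergences between the two versions involves the German sharp s (ß). Python's built-in `idna` codec implements the older 2003 rules, whose nameprep step case-folds ß to "ss"; IDNA2008 (available via the third-party `idna` package) instead encodes it as its own character, so the two versions can resolve to different names:

```python
# Under IDNA2003, nameprep maps U+00DF (ß) to "ss", leaving an
# all-ASCII label that needs no xn-- prefix at all.
encoded_2003 = "straße.de".encode("idna")
print(encoded_2003)  # b'strasse.de'
# Under IDNA2008 the same name would instead become xn--strae-oqa.de.
```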
Stacked diacritics are used in Thai and other Asian languages, as well as rarely-seen languages such as those of the Yukon.
The right-to-left control character is for embedding e.g. Arabic or Hebrew script inside Latin text (or vice versa). It is actually a controversial feature of Unicode as some people feel it belongs in a higher-level protocol.
The point is that the typical rules in one language are completely bizarre in another. Unicode tries hard to be at least minimally useful to everyone, meaning that it has to make allowances for all of the rules.
It's complicated. It's more complicated than any encoding standard that came before. It's also the most broadly useful, and the first standard to really take into account the complexities of human written language, as opposed to just one region's written language.
Asking if utf-8 is safe on the basis of these examples seems like asking if we should throw the baby out with the bath water -- along with the toys and the tub for good measure.
The potential for abuse is evident, but it seems like these primarily ought to be fixed in userland. For instance, by giving cues by highlighting characters in widely different areas (latin vs cyrillic) or by ignoring rtl for extensions when a string starts in ltr.
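The bidi-control part of that fix is only a few lines. This is a sketch (the function name and the sample filename are my own) that strips Unicode direction-control characters before displaying a filename, so the rendered extension can't lie about the real one:

```python
# Unicode's explicit directional formatting characters:
# embeddings/overrides (LRE, RLE, PDF, LRO, RLO) and isolates
# (LRI, RLI, FSI, PDI).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def display_name(filename: str) -> str:
    """Remove direction-control characters before showing a filename."""
    return "".join(ch for ch in filename if ch not in BIDI_CONTROLS)

tricky = "invoice\u202ecod.exe"      # would render as invoiceexe.doc
print(display_name(tricky))          # invoicecod.exe -- .exe visible again
```

Highlighting (rather than stripping) the controls would preserve legitimate RTL text while still warning the user.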
(Not to mention, in the latter case, if users are opening random docs attached to spammy emails, utf8 is the last of your problems.)
I don't think the problem is Unicode, the problem is trusting your ability to determine the ownership of a URL (and thus the trust that should be inherited from its owner) based on its name. Plenty of phishing attacks work with domains like "yahoo-password-reset.com".
If you're not seeing a valid TLS session with a certificate signed by an issuer you trust not to allow these shenanigans, it really doesn't matter what characters you're seeing in the URL bar.
To me, we're in a Unicode transition period - 10 years ago it was almost completely unsupported, and as it is adopted more and more, we're finding the places where it can cause issues.
Part of the problem is that a lot of the languages and tools we use pre-exist widescale use of Unicode and don't handle it very well. The Python 3 approach is by far the best one I've come across (would be interested to hear of other examples), and they needed to make a backwards incompatible change to handle it in a way that made it harder to screw up.
It is a complex technology, and inevitably there are going to be holes, but as in a lot of other cases, it is worth it (necessary, even), and as we move forward our tooling, languages, libraries and practices will get better and reduce the risk.
The internet is a complex technology that can never be completely secure. Doesn't mean it's not worth it though.
I was reading this and I'm like, Unicode (I assume UTF-8) isn't really that complicated at all. The UTF-8 system is straightforward, no more complex than simple run length coding. I'm also thinking that Unicode is basically a list of glyphs in every language plus a few control codes for rendering glyphs correctly, BOM, etc.
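To back up the "straightforward" claim, the whole UTF-8 scheme fits in a handful of lines. A hand-rolled encoder for a single code point (a sketch for illustration; real code should just use `str.encode`):

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    if codepoint < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | codepoint >> 6,
                      0x80 | codepoint & 0x3F])
    if codepoint < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | codepoint >> 12,
                      0x80 | codepoint >> 6 & 0x3F,
                      0x80 | codepoint & 0x3F])
    return bytes([0xF0 | codepoint >> 18,     # 4 bytes: 11110xxx + 3 continuations
                  0x80 | codepoint >> 12 & 0x3F,
                  0x80 | codepoint >> 6 & 0x3F,
                  0x80 | codepoint & 0x3F])

# Agrees with the built-in encoder across 1-, 2-, 3- and 4-byte cases:
for ch in "a\u00e9\u20ac\U0001F600":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

The complexity people complain about lives in the character repertoire and its rendering rules, not in the encoding itself.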
It's like saying a dictionary contains dangerous information.
I think the problem is software that enables Unicode input but isn't willing to handle all the different types of input. For example, it seems like a bad idea to even let people input combined words of different languages without making sure the difference is highlighted; that's why we should have input methods that filter out bad combinations, rather than dumping the problem on the font renderer.
The article here relates to Unicode in general rather than any of its specific encodings (the author themselves gets confused by this in the first paragraph).
This is quite a good article that explains the difference (particularly of use to Python users):
http://nedbatchelder.com/text/unipain.html
Words of different languages are often suggested for a ban, but fail the real-world test.
Countless languages borrow words from one another, especially names. For example, some people argue that calling Marie Skłodowska-Curie simply Curie misrepresents her character naïvely, as her name and country of birth were important to her. You could, of course, latinize it, but that gets you into trouble with the Turks and the Koreans and essentially all languages that had writing before the industrial age.
These unicode attacks are interesting, and unicode is far too useful to stop using.
The question is what can we do to fix some of these issues?
Like the RTL character. It shouldn't be blocked outright, as it has a valid use case, but is there a non-malicious use case for it when surrounded by normal latin characters?
eg:
abc[RTL]def
If it's just one RTL character, then that should be fairly easy to filter out. Of course, if that's the way a filter works, then there will be other unicode characters you can add to the mix that still make it look the same to an average user and pass that particular filter.
One could identify unicode characters that belong to a particular character set (say latin) and see if some text contains more than one character set. Then invoke the filter if a text has more than 2 different character sets. Of course, I can see that getting in the way of some use cases as well (text with translations in 3 languages, for example).
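A rough version of that mixed-script filter can be built from the standard library alone. This sketch uses the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK, ...) as a cheap proxy for its script; a real implementation would use the actual Script property, and the threshold here is just the one suggested above:

```python
import unicodedata

def scripts_used(text: str) -> set:
    """Heuristic script detector based on Unicode character names."""
    found = set()
    for ch in text:
        if ch.isalpha():
            found.add(unicodedata.name(ch).split()[0])
    return found

def suspicious(text: str, limit: int = 2) -> bool:
    """Flag text mixing more scripts than the chosen limit."""
    return len(scripts_used(text)) > limit

# "pаypal" with a Cyrillic 'а' mixes two scripts:
print(scripts_used("p\u0430ypal"))  # {'LATIN', 'CYRILLIC'}
```

For domain names you'd likely want a stricter rule (flag *any* mixing within a single label), since the translations use case doesn't apply there.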
This reminded me of an interview with an adware author, in which he told a story about creating unwritable registry keys and file names 'by exploiting an “impedance mismatch” between the Win32 API and the NT API':
The adware registered a key in the Windows Registry with a null Unicode character in the middle of the string, so the Windows UI failed to display or modify that string.
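The mismatch is that NT uses counted strings, which can contain a NUL, while the Win32 layer uses C-style NUL-terminated strings, which can't. You can get a taste of the NUL-terminated side of this from Python, whose path APIs sit on top of C strings and reject embedded NULs outright:

```python
# A name that a counted-string API (like the NT kernel's) could
# represent, but a NUL-terminated API cannot:
name = "evil\x00key"

try:
    open(name)
except ValueError as e:
    # CPython refuses before the OS ever sees the name.
    print(e)  # e.g. "embedded null byte"
```

Anything sitting on the counted-string side can therefore create names that the NUL-terminated side can't even express, let alone delete.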
In my experience, NULL is the least-supported UTF-8 character. Whenever software claims conformance, that's the first thing I check.
Personally, I'd have preferred they disallowed it in the standard, but it's too late for that now. Anyone know why it was included (other than the obvious reason that it obeys the encoding rules)?
Does Windows really display that "exe.doc" RTL example with the icon for a Word document? Or is the exe file just set to use that icon in order to complete the illusion?
IIRC, when the Windows file browser sees a .exe file, it looks inside the file for the program's icon and only falls back to the generic Windows icon if it doesn't find one. Nothing stops the program's icon from being the same as a Word doc's.
The extension is still .exe, though. It only _looks_ like .doc after font rendering. And so, as the OP suspected, the .exe needs to have the proper icon embedded. It's not provided by the file browser.
I like and appreciate the fact that the Spotify people used the one guy's findings to improve security (for everything that uses Twisted!) rather than just throwing the legal system at him.
https://www.youtube.com/watch?v=ZPbyDSvGasw [DEFCON 21: DNS May Be Hazardous To Your Health]