json · JSON JavaScript Object Notation

#1586 From: "Douglas Crockford" <douglas@...>
Date: Fri Feb 25, 2011 8:00 pm
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
--- In json@yahoogroups.com, "johne_ganz" <john.engelhart@...> wrote:
>
> --- In json@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
> >
> > A receiver can do what it chooses to with the character codes it receives.
> > If it wants to delete them or reject them, that is its business.
>
> Not if it's Unicode.  It is a common misconception that "Unicode" is just a
> set of code points, like say ASCII or EBCDIC.

For JSON's purpose, Unicode is just a set of code points. It gives some, such as
{ and }, special meaning. But in strings, everything should simply be passed
through.

> Previous versions of the Unicode standard used to have a clause that permitted
> the deleting of code points from a string (though it was not recommended, à la
> SHOULD NOT).  Later versions of the Unicode standard do not permit this, and it
> has been verboten to do so for some time.
>
> Some of the most compelling reasons why deleting characters is forbidden are
> covered by the section "Non-Visual Security Issues":
> http://www.unicode.org/reports/tr36/#Canonical_Represenation .
>
> There is no language in RFC 4627 that I can find that supports your
> interpretation.  There is an awful lot of language in the Unicode standard that
> unambiguously says that you can not "delete" characters from a Unicode "string".
> There are also compelling arguments in TR#36 for why arbitrarily deleting
> characters is a huge mistake.
>
> > But a JSON channel should not interfere with or bias the communication.
>
> This statement is contrary to what you just said.  If it is deleting
> characters, it is obviously interfering and biasing the communication.

By receiver I mean the program that ultimately receives the message. It can
interpret it and process it or damage it or ignore it as it will. What it does
with the data is none of my business. The JSON channel itself must do none of
those things.

> A standard, and its interpretation, should strive to be unambiguous.  An
> interpretation that boils down to "An implementation MAY interfere with or bias
> the communication, but an implementation SHOULD NOT interfere with or bias the
> communication" is meaningless and nonsensical.
>
> > It should faithfully deliver what the sender sent, provided that it conforms
> > to the JSON grammar.
>
> Which must be encoded as Unicode.  Again, Unicode IS NOT, and MUST NOT be
> treated as, a stream of Unicode code points.  That's not Unicode.  I freely admit
> that this is a belief that I once had.  However, after a few years of dealing
> with low level Unicode string processing (where Unicode means "The Unicode
> Standard"), I no longer hold this view.  It's much more complicated and much
> more nuanced than people realize.
>
> > If the sender wants to send characters that some consortium considers
> > indecent, and if the receiver wants to receive them, then that is their
> > business.
>
> I don't have a problem with this.  I would have a problem with such a setup
> claiming "strictly RFC 4627 conforming" (or some language implying 4627
> conformance).
>
> My specific point is this:  I strongly believe that RFC 4627 requires Unicode,
> and by implication, processing said Unicode in a Unicode Standard conforming
> way.  Therefore, in order to claim "RFC 4627 conformance", one must also process
> and handle the JSON in a way that is "Unicode Standard conforming" as well.
>
> You don't HAVE to do this, obviously, but then you can no longer claim RFC
> 4627 conformance.
>
> If I may make a suggestion:  perhaps an informal "JSON Best Practices"
> document could be started that catalogs and records these types of things.  The
> document would be totally non-normative, but would be a fantastic resource for
> those who need to implement JSON parsers and generators.  It would also help
> ensure that implementations converge on something that ensures they will
> interoperate more reliably.  Since it would be non-normative, it wouldn't have
> any "requirements" weight to it, but I can tell you such a document would have
> been a big help to me.


Tell you what. If you ever encounter a real problem, we will deal with that.

#1587 From: "johne_ganz" <john.engelhart@...>
Date: Fri Feb 25, 2011 11:09 pm
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
>
> --- In json@yahoogroups.com, "johne_ganz" <john.engelhart@> wrote:
> >
> > --- In json@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
> > >
> > > A receiver can do what it chooses to with the character codes it receives.
> > > If it wants to delete them or reject them, that is its business.
> >
> > Not if it's Unicode.  It is a common misconception that "Unicode" is just a
> > set of code points, like say ASCII or EBCDIC.
>
> For JSON's purpose, Unicode is just a set of code points.

Not according to RFC 4627 it isn't.  Section 3, Encoding, "JSON text SHALL be
encoded in Unicode.", where SHALL is interpreted via RFC 2119 (i.e., SHALL is
synonymous with MUST).

I appreciate that your interpretation may have been your original intent, but
the scope of the language in the standard is far, far greater than "JSON text
SHALL be interpreted as a stream of disjoint Unicode code points.", which is
what you are arguing that the standard means.

Unless you can make a compelling argument with language from the RFC 4627
standard, the standard clearly and plainly says that the JSON text is encoded in
Unicode.  This means that the text must conform to the Unicode standard, and
its rules for processing and handling text MUST (via the use of SHALL in RFC
4627) be followed.


> By receiver I mean the program that ultimately receives the message. It can
> interpret it and process it or damage it or ignore it as it will. What it does
> with the data is none of my business. The JSON channel itself must do none of
> those things.

Surely you realize that in practice, this is not the way that things are done. 
All of the JSON libraries are effectively "part of the JSON channel".

There is a clear demarcation point where a piece of text has ceased to be JSON
and has (usually) become an instantiated data structure in the host language.

How and what the "language" does with the data is not relevant to RFC 4627.  The
"language" may manipulate the JSON data, examining keys, manipulating them in
any way it chooses.  But at this point, it very clearly has ceased to be "JSON".

Every JSON implementation that is in the form of a library for a host language
that I'm aware of could be interpreted to be "the program that ultimately
receives the message".  The libraries parse the JSON and transliterate it into
a form usable by the host language.  What the host language, or a program
written by someone to enumerate or manipulate the data structure that was
instantiated from the original JSON, does with that data is obviously outside
the scope of RFC 4627.

My pedantic point is: A JSON implementation, in the form of a library that
provides bindings between a host language and JSON (of which there are many),
MUST NOT arbitrarily delete characters in the original JSON.  Furthermore, any
such implementation MUST interpret the original JSON text in accordance with the
Unicode Standard.  Just like RFC 4627 gives a grammar and rules for how to
interpret JSON, the Unicode Standard has rules for how to interpret text encoded
as Unicode.  Unicode is not just a simple set of code points.

Another issue is normalization.  In particular, the way normalization is handled
for the "key" portion of an "object" (i.e., {"key": "value"}) can dramatically
alter the meaning and contents of the object.  For example:

{
"\u212b": "one",
"\u0041\u030a": "two",
"\u00c5": "three"
}

Are these three keys distinct?  Should there be a requirement that they MUST be
handled and interpreted such that they are distinct?  Does that requirement
extend past the "channel" demarcation point (i.e., not a JSON library or
communication channel used to interchange the JSON between two hosts) to the
"host language"?

In case it is not obvious, under the rules of Unicode NFC (Normalization Form
C), all three of the keys above will become "\u00c5" after NFC processing.
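
The collapse described above can be reproduced with a short Python sketch. It assumes only the standard library's json and unicodedata modules; the variable names are illustrative and not taken from any implementation discussed in this thread. It parses the three-key object, then applies NFC to each key the way a normalizing string layer might.

```python
import json
import unicodedata

# The three-key object from the message, as raw JSON text.
text = r'{"\u212b": "one", "\u0041\u030a": "two", "\u00c5": "three"}'

# Python's json module decodes the escapes but does not normalize,
# so the parsed dict keeps the keys exactly as they were written.
parsed = json.loads(text)

# Apply NFC to every key. When two keys become identical, the later
# entry silently overwrites the earlier one.
normalized = {unicodedata.normalize("NFC", k): v for k, v in parsed.items()}

print(len(parsed))       # number of distinct keys as parsed
print(len(normalized))   # number of distinct keys after NFC
print([hex(ord(c)) for key in normalized for c in key])
```

Which value survives after normalization depends purely on iteration order here, which is one way of seeing why the behavior is hard to call anything other than undefined.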

A first order approximation would seem to suggest that a JSON implementation
"should" use the precomposed form for keys, and for objects that contain keys
with non-precomposed keys that, when converted to their precomposed form are
duplicate with other keys, the behavior is undefined.

Again, this is another point where the use of Unicode introduces an awful lot of
non-obvious dependencies.  The Unicode standard has a lot to say about what it
means for two strings to "compare equal", and since JSON specifies what is
essentially a key/value hash table, it is critically important to define what
"equal" means for a key.  If the keys were ASCII or Binary, this would probably
be a non-issue, but it's a pretty big one when you're dealing with Unicode.

> Tell you what. If you ever encounter a real problem, we will deal with that.

This is a rather snarky comment, and to be blunt, unprofessional and unfair.

Every point I've raised here is something that an implementor of a JSON library
will likely encounter.  As an implementor of such a library (for Objective-C),
everything I've raised here is something that took an enormous amount of time
and consideration.

In my case, I've had to deal with the subtle nuances of what happens to a
Unicode string when I parse it and then hand that parsed string off to another
library to instantiate a string object.  I have no control over how this
external library (a combination of Foundation and Core Foundation) deals with or
interprets various aspects of the Unicode Standard.  For the sake of argument,
if this external library automatically precomposes all strings it instantiates,
and I have to use those instantiated strings as the keys in an NSDictionary (the
equivalent of a JSON object), I've got some problems.

Your snarky comment ignores the real world complexities that one faces when
attempting to create a "RFC 4627 compliant" JSON implementation, at least if one
is trying to do so "the right way" as opposed to a quick hack JSON
implementation.

For someone who is creating a JSON library or some other form of a JSON
implementation, the corner cases are usually far more important than the
obvious, common case.

#1588 From: Tatu Saloranta <tsaloranta@...>
Date: Fri Feb 25, 2011 11:19 pm
Subject: Re: Re: JSON and the Unicode Standard
cowtowncoder
 
On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
>
>
> --- In json@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
>>
>> --- In json@yahoogroups.com, "johne_ganz" <john.engelhart@> wrote:
>> >
>> > --- In json@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
>> > >
>> > > A receiver can do what it chooses to with the character codes it
>> > > receives. If it wants to delete them or reject them, that is its business.
>> >
>> > Not if it's Unicode.  It is a common misconception that "Unicode" is just a
>> > set of code points, like say ASCII or EBCDIC.
>>
>> For JSON's purpose, Unicode is just a set of code points.
>
> Not according to RFC 4627 it isn't.  Section 3, Encoding, "JSON text SHALL be
> encoded in Unicode.", where SHALL is interpreted via RFC 2119 (i.e., SHALL is
> synonymous with MUST).

Do you have an ACTUAL problem worth discussing, or is this just from a
purity standpoint?

-+ Tatu +-

#1589 From: Tatu Saloranta <tsaloranta@...>
Date: Fri Feb 25, 2011 11:45 pm
Subject: Re: Re: JSON and the Unicode Standard
cowtowncoder
 
On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
...

> Unicode is not just a simple set of code points.

This is a true statement, although the more practical question seems to
be what the practical relationship of JSON with the Unicode
specification is.
I think your suggestions for clarifying some parts do make sense,
although it may be hard to reconcile basic differences between full
Unicode support and the goals of simplicity for JSON.

>
> Another issue is normalization.  In particular, the way normalization is
> handled for the "key" portion of an "object" (i.e., {"key": "value"}) can
> dramatically alter the meaning and contents of the object.  For example:
>
> {
> "\u212b": "one",
> "\u0041\u030a": "two",
> "\u00c5": "three"
> }
>
> Are these three keys distinct?  Should there be a requirement that they MUST
> be handled and interpreted such that they are distinct?  Does that requirement
> extend past the "channel" demarcation point (i.e., not a JSON library or
> communication channel used to interchange the JSON between two hosts) to the
> "host language"?
>
> In case it is not obvious, under the rules of Unicode NFC (Normalization Form
> C), all three of the keys above will become "\u00c5" after NFC processing.

For what it is worth, I have not seen a single JSON parser that would
do such normalization; and the only XML parser I recall even trying
proper Unicode code point normalization was XOM. This is not an
argument against proper handling, but rather an observation regarding
how much of a practical issue it seems to be.
Nor have I seen feature requests to support normalization (XOM
implements it because its author is very ambitious wrt supporting
standards; it is a very respectable achievement) during the time I have
spent maintaining XML and JSON parser/generator implementations.
Do others have different experiences?

So to me it seems that most likely high-level clarifications regarding
normalization aspects would be:

(a) Whether to do normalization or not is up to implementation
(normalization is left out of scope, on purpose), or
(b) Say that with JSON no normalization would be done (which would be
more at odds with unicode spec)

Why? Just because I see very little chance of anything more ambitious
having an effect on implementations (beyond the small number that are willing
to tackle such complexity). While it would seem wrong to punt the
issue, there is the practical question of whether a full solution would
matter.
My guess is that about the last thing implementers would want is a
mandate to support full Unicode 4.0 (and above) normalization rules.
It would just mean that there would be the specification in one
corner, and implementations, practically none of which would be
compliant, in another.

...
> Your snarky comment ignores the real world complexities that one faces when
> attempting to create a "RFC 4627 compliant" JSON implementation, at least if one
> is trying to do so "the right way" as opposed to a quick hack JSON
> implementation.

For better or worse, most JSON implementations fall into the quick-hack
category, which is just to say that the chances of getting a significant
number of implementations to do much more than decode code points
correctly are vanishingly small. Even getting them to do basic
decoding is quite a challenge in itself.

> For someone who is creating a JSON library or some other form of a JSON
> implementation, the corner cases are usually far more important than the
> obvious, common case.

True.

I think your suggestions of how this could be clarified make sense.

-+ Tatu +-

#1590 From: David Graham <david.malcom.graham@...>
Date: Sat Feb 26, 2011 12:35 am
Subject: Re: Re: JSON and the Unicode Standard
dgraham...
 
I had the normalization question while writing json-stream in Ruby as well.
I decided the parser shouldn't do Unicode normalization for the following
reasons:


1. The json and yajl-json Ruby parsers and the popular org.json Java parser
do not do normalization.


2. CouchDB does not do normalization.  I wrote json-stream to handle CouchDB
documents so this was my primary use case.


3. Ruby and Java consider combined characters to be unequal to their single
codepoint counterparts.  The é character, for example, can be a 2 byte
single codepoint form of \u00e9 or a 3 byte two codepoint form of
\u0065\u0301.


In Ruby, "\u00e9" == "\u0065\u0301" => false.


So, given a Ruby Hash (or Java Map) like this:


{"\u00e9" => 1, "\u0065\u0301" => 2}

=> {"é"=>1, "é"=>2}


A JSON serializer that performed Unicode normalization on this Hash object
would corrupt the data in some way.  The two keys would become equal, so
which value gets serialized: 1 or 2?


In my opinion, this means JSON parsers and generators must not perform
normalization.  They must respect the data stored in the JSON byte stream as
is.


It's easy for the application to normalize data before handing it to the
JSON library for serialization, though.  In Ruby, we can do:


ActiveSupport::Multibyte::Chars.new("\u0065\u0301").normalize(:c)
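
The same point can be sketched in Python rather than Ruby, assuming only the standard json and unicodedata modules. The helper `normalize_keys` is hypothetical and written for this illustration: it normalizes at the application level, before the data reaches the JSON library, and refuses to merge keys that collide instead of picking a winner.

```python
import json
import unicodedata

def normalize_keys(obj, form="NFC"):
    """Return a copy of a dict with normalized keys; raise if two keys collide."""
    out = {}
    for key, value in obj.items():
        nkey = unicodedata.normalize(form, key)
        if nkey in out:
            raise ValueError(f"keys collide after {form}: {nkey!r}")
        out[nkey] = value
    return out

precomposed = "\u00e9"        # single code point
decomposed = "\u0065\u0301"   # 'e' followed by a combining acute accent

# Python, like Ruby and Java, compares strings code point by code point.
print(precomposed == decomposed)

# The serializer writes both keys exactly as given; it does not normalize.
data = {precomposed: 1, decomposed: 2}
encoded = json.dumps(data)
print(len(json.loads(encoded)))

# Normalizing is the application's decision, and here it surfaces the conflict.
try:
    normalize_keys(data)
except ValueError as err:
    print(err)
```

Raising on a collision is only one possible policy; the design point is that the choice is made by the application, not silently by the parser or generator.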


I hope that helps.


David




#1591 From: "Douglas Crockford" <douglas@...>
Date: Sat Feb 26, 2011 12:44 am
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
--- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:

> In my opinion, this means JSON parsers and generators must not perform
> normalization.  They must respect the data stored in the JSON byte stream as
> is.

I agree.

#1592 From: "johne_ganz" <john.engelhart@...>
Date: Sat Feb 26, 2011 4:01 am
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
>
> On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
> ...
>
> > Unicode is not just a simple set of code points.
>
> This is true statement, although the more practical question seems to
> be what is the practical relationship of JSON with Unicode
> specification.

True.  It would seem, at least to me, that this is one of those nuanced points
that either

a) Has not been given the proper consideration by Unicode (ostensibly) experts.

b) The Unicode standard has evolved in such a way since the publication of RFC
4627 that it may require revisiting the issue.

> I think your suggestions for clarifying some parts do make sense,
> although it may be hard to reconcile basic differences between full
> Unicode support, and goals of simplicity for JSON.

I'm all for simplicity, and for a "less is more" philosophy.  Unfortunately, RFC
4627 allows for two strictly RFC 4627 compliant implementations to "generate"
wildly different results (where "generate" here means the JSON is parsed and
interpreted in such a way that the two implementations have what reasonable
people would consider "very different semantics").

Numbers are another corner case.  4627 only describes how to parse a decimal
representation of numbers (both integer and floating-point).  In practice this
means that a strictly conforming RFC 4627 JSON implementation can use an 8, 16,
32, or 64 bit "native primitive" to represent integers.  It's perfectly valid
JSON to have integer numbers that require 128 or 256 bits in order to represent
them.  To me, a serialization format such as JSON should make some effort to
ensure that the values contained within it will be properly interpreted by any
and all JSON implementations.  I've seen several JSON implementations that use a
32-bit size C99 primitive type to represent the parsed numbers.  This is a
problem for anything that wants to parse contemporary Twitter JSON as the IDs
are > 2^32 at this point.  The desire to "keep it simple" has to be balanced
against real world practical needs: when your IDs exceed 2^32, it is a
legitimate question to ask "Are JSON implementations going to handle this value
correctly?"  If not 2^32, then when?
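
A small Python sketch of the width problem, under stated assumptions: the helpers `fits` and `wrap_unsigned` are hypothetical and only mimic what fixed-width native integers would do, and the sample document is invented for illustration (it is not real Twitter data). Python's own json module parses integers with arbitrary precision, so it serves as the reference value.

```python
import json

def fits(value, bits, signed=True):
    """Return True if an integer is representable in a native type of this width."""
    if signed:
        return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    return 0 <= value < (1 << bits)

def wrap_unsigned(value, bits):
    """Mimic unchecked storage into an unsigned field: the high bits are discarded."""
    return value & ((1 << bits) - 1)

# An ID just past 2^32 and an integer that needs more than 128 bits.
text = '{"id": %d, "huge": %d}' % (2**32 + 5, 2**128)
doc = json.loads(text)

for bits in (8, 16, 32, 64):
    print(bits, fits(doc["id"], bits), fits(doc["huge"], bits))

# What an unchecked 32-bit unsigned store would leave behind for the ID.
print(wrap_unsigned(doc["id"], 32))
```

Both values are valid under the RFC 4627 number grammar; whether a given parser rejects them, wraps them, or converts them to floating point is left to the implementation.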

> >
> > Another issue is normalization.  In particular, the way normalization is
> > handled for the "key" portion of an "object" (i.e., {"key": "value"}) can
> > dramatically alter the meaning and contents of the object.  For example:
> >
> > {
> > "\u212b": "one",
> > "\u0041\u030a": "two",
> > "\u00c5": "three"
> > }
> >
> > Are these three keys distinct?  Should there be a requirement that they MUST
> > be handled and interpreted such that they are distinct?  Does that requirement
> > extend past the "channel" demarcation point (i.e., not a JSON library or
> > communication channel used to interchange the JSON between two hosts) to the
> > "host language"?
> >
> > In case it is not obvious, under the rules of Unicode NFC (Normalization
> > Form C), all three of the keys above will become "\u00c5" after NFC processing.
>
> For what it is worth, I have not seen a single JSON parser that would
> do such normalization; and the only XML parser I recall even trying
> proper Unicode code point normalization was XOM. This is not an
> argument against proper handling, but rather an observation regarding
> how much of a practical issue it seems to be.

I have not seen a JSON implementation / parser that does such normalization.

On the other hand, I very strongly suspect that whether or not such
normalization is taking place is not up to the writer of that parser.  In my
particular case (JSONKit, for Objective-C), I pass the parsed JSON String to the
NSString class to instantiate an object.

I have ZERO control over what and how NSString interprets or manipulates the
parsed JSON String that finally becomes the instantiated object that is
ostensibly the same as the original JSON String used to create it.  It could be
that NSString decides that the instantiated object is always converted to its
precomposed form.  Objective-C is flexible enough that someone might decide to
swizzle in some logic at run time that forces all strings to be precomposed
before being handed off to the main NSString instantiation method.

> Nor have I seen feature requests to support normalization (XOM
> implements it because its author is very ambitious wrt supporting
> standards, it is very respectable achievement), during time I have
> spent maintaining XML and JSON parser/generator implementations.
> Do others have different experiences?

I don't have a particular opinion on the matter one way or the other other than
to highlight the point that in many practical, real-world situations, whether or
not such things take place may not be under the control of the JSON parser.

I also suspect that it's one of those things that most people haven't really
given a whole lot of consideration to: they just hand the parsed string over to
"the Unicode string handling code", and that's that.  Most people may not
realize that such string handling code may subtly alter the original Unicode
text as a result (à la precomposing the string).

> So to me it seems that most likely high-level clarifications regarding
> normalization aspects would be:
>
> (a) Whether to do normalization or not is up to implementation
> (normalization is left out of scope, on purpose), or
> (b) Say that with JSON no normalization would be done (which would be
> more at odds with unicode spec)
>
> Why? Just because I see very little chance of anything more ambitious
> having effect on implementations (beyond small number that are willing
> to tackle such complexity). While it would seem wrong to punt the
> issue, there is the practical question of whether full solution would
> matter.

I can guarantee you that the practical question of whether a full solution would
matter will be answered the first time someone exploits it in a security
vulnerable way that results in a major security fiasco.

Then it will be with 20/20 hindsight, and the question will be "Why didn't
anyone address (this behavior) that allowed two keys that were not bit for bit
identical, but became identical after converting them to their precomposed form,
and the security checks allowed the decomposed form through because it assumed
that everything was in precomposed form?"
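
The failure mode described above can be sketched in Python. Everything here is hypothetical and for illustration only: `BLOCKED_KEYS`, `naive_filter`, `downstream_store` and `safer_filter` are invented names, and the "downstream" layer simply stands in for any string-handling code that precomposes what it is given.

```python
import json
import unicodedata

BLOCKED_KEYS = {"\u00c5"}   # hypothetical policy: this key must never get through

def naive_filter(obj):
    """Reject objects containing a blocked key, comparing code point by code point."""
    for key in obj:
        if key in BLOCKED_KEYS:
            raise ValueError("blocked key")
    return obj

def downstream_store(obj):
    """Stand-in for a later layer that silently precomposes (NFC) its strings."""
    return {unicodedata.normalize("NFC", k): v for k, v in obj.items()}

def safer_filter(obj):
    """The same check, performed on the NFC form of each key."""
    for key in obj:
        if unicodedata.normalize("NFC", key) in BLOCKED_KEYS:
            raise ValueError("blocked key")
    return obj

# The sender uses the decomposed spelling of the blocked key.
payload = json.loads(r'{"\u0041\u030a": "smuggled"}')

# The naive check passes the payload, and the blocked key then appears downstream.
stored = downstream_store(naive_filter(payload))
print("\u00c5" in stored)
```

Checking the normalized form is only one mitigation; it works here because the check and the downstream layer agree on the same normalization form, which is precisely the kind of agreement the thread says is not guaranteed.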

Unfortunately, the use of Unicode coupled with the fact that most JSON
implementations are dependent on external code for their Unicode support means
that this is an extremely non-trivial issue.  I can't think of a simple solution
to the problem at the moment, other than it exists.

> My guess is that about the last thing implementers would want is a
> mandate to support full Unicode 4.0 (and above) normalization rules.
> It would just mean that there would be the specification in one
> corner; and implementations, practically none of which would be
> compliant.

You really ought to read:

http://www.unicode.org/faq/security.html

http://www.unicode.org/reports/tr36/#Canonical_Represenation

Microsoft Security Bulletin (MS00-078): Patch Available for 'Web Server Folder
Traversal' Vulnerability
(http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx,
http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884)

Creating Arbitrary Shellcode In Unicode Expanded Strings
(http://www.net-security.org/article.php?id=144)

There's a long history of "Those little Unicode details aren't really important"
causing huge security problems later on.

#1593 From: "johne_ganz" <john.engelhart@...>
Date: Sat Feb 26, 2011 5:08 am
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:
>
> 3. Ruby and Java consider combined characters to be unequal to their single
> codepoint counterparts.  The é character, for example, can be a 2 byte
> single codepoint form of \u00e9 or a 3 byte two codepoint form of
> \u0065\u0301.
>
>
> In Ruby, "\u00e9" == "\u0065\u0301" => false.
>
>
> So, given a Ruby Hash (or Java Map) like this:
>
>
> {"\u00e9" => 1, "\u0065\u0301" => 2}
>
> => {"é"=>1, "é"=>2}
>
>
> A JSON serializer that performed Unicode normalization on this Hash object
> would corrupt the data in some way.  The two keys would become equal, so
> which value gets serialized: 1 or 2?

This is my point.  It happens both on the serialization side, but (at least in
my opinion), it is much more likely to happen on the deserialization side.

What happens to a JSON deserializer that relies on external libraries (say ICU),
or in object oriented programming, a "string" class to handle all this, or where
string handling is in some other way completely out of the control of the person
writing the JSON deserializer?

How many people do you think actually checked to make sure that these external
code dependencies offer a guarantee that they will not mutilate or otherwise
transform the original string in some Unicode Equivalence compatible way?
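
One way to actually perform that check is a round-trip probe, sketched here in Python. The function `altered_samples` and the two factories are hypothetical names for this illustration; in a real binding the factory would be whatever call the parser uses to create host-language strings. A probe like this can only reveal changes for the samples tried, so a clean result is evidence, not a guarantee.

```python
import unicodedata

# Strings with canonically equivalent spellings, taken from this thread.
SAMPLES = ["\u212b", "\u0041\u030a", "\u00c5", "\u0065\u0301", "\u00e9"]

def altered_samples(make_string, samples=SAMPLES):
    """Return the samples whose code points change when passed through make_string."""
    changed = []
    for s in samples:
        result = make_string(s)
        if [ord(c) for c in result] != [ord(c) for c in s]:
            changed.append(s)
    return changed

def identity_factory(s):
    """A string layer that keeps code points exactly as given."""
    return str(s)

def precomposing_factory(s):
    """A string layer that silently converts everything to NFC."""
    return unicodedata.normalize("NFC", s)

print(altered_samples(identity_factory))
print([[hex(ord(c)) for c in s] for s in altered_samples(precomposing_factory)])
```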

From the Unicode Standard, Section 2.12, Equivalent Sequences and Normalization-

If an application or user attempts to distinguish between canonically equivalent
sequences, as shown in the first example in Figure 2-23, there is no guarantee
that other applications would recognize the same distinctions.   To prevent the
introduction of interoperability problems between applications, such
distinctions must be avoided wherever possible. Making distinctions between
compatibly equivalent sequences is less problematical. However, in restricted
contexts, such as the use of identifiers, avoiding compatibly equivalent
sequences reduces possible security issues. See Unicode Technical Report #36,
"Unicode Security Considerations."

In other words, the Unicode Standard says that the behavior that you are
observing is not guaranteed.  This means that there exists the very real
possibility that a JSON implementation that depends on external code to handle
strings (i.e., ICU or a "string" object in object oriented languages) can not
reasonably ensure that said code does not convert the string argument in to a
Unicode Standard equivalent form.

The practical implication of this is that the behavior that you are seeing is
contrary to the requirements and expectations set forth in the Unicode standard.
It seems reasonable to assume that external libraries that adhere to the Unicode
standard that a JSON implementation is using are under no obligation whatsoever
to treat a Unicode string in a way that you have described.

> In my opinion, this means JSON parsers and generators must not perform
> normalization.  They must respect the data stored in the JSON byte stream as
> is.

It's trivial for a parser to respect the data stored in the JSON byte stream.

While I'm sure there are exceptions to this, I'd be willing to bet that the
majority of JSON parsers hand the parsed string off to some "create a string"
piece of code.  It seems reasonable to assume that this "create a string" piece
of code is Unicode aware.  These code bases are probably disjoint, with the
string handling code focused on Unicode Standard conformance, and said
conformance does not require that it "respect [the original string]".

In fact, for my parser (JSONKit), which is Objective-C based and uses NSString
to represent the JSON String objects, it is not practical for me to create a
JSON parser that "respects the data stored in the JSON byte stream".  The
NSString class makes no such guarantees in its documentation, nor does the
Unicode Standard.  It would be extremely non-trivial for me to meet a "respects
the data stored in the JSON byte stream" requirement, at least in the sense that
the behavior is deterministic.

#1594 From: Mark Slater <mark.slater@...>
Date: Sat Feb 26, 2011 11:05 am
Subject: Re: Re: JSON and the Unicode Standard
markosslater
 
Regarding the handling of numbers, the RFC doesn't appear to make any mention of
native representations of numbers - in fact, it only specifies what constitutes
a number in JSON. This maps onto the set of real numbers plus -0, and some
syntactic sugar in the form of scientific notation. I can't see anything in the
grammar that actually limits it to numbers with a finite decimal expansion.

The RFC does state that parsers can limit the numbers they accept to a specified
range of their choice, just as they can impose a limit on the size of strings,
if they want. In practice, some libraries apply unwritten limits to the range of
numbers along the lines of "the range (set?) of numbers that can be represented
as a Java double." Personally, I think such restrictions greatly reduce the
usefulness of a parser, because of examples like twitter ids, but all parsers
I'm aware of impose some limit, even if it is something like, "the range of
numbers that, when represented as a string, fit in the memory available to the
parser". I think it is absolutely valid to ask what range of numbers a
particular parser supports, but this is explicit in the RFC.

Mark


On 26 Feb 2011, at 04:01, "johne_ganz" <john.engelhart@...> wrote:

> --- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
> >
> > On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
> > ...
> >
> > > Unicode is not just a simple set of code points.
> >
> > This is a true statement, although the more practical question seems to
> > be what is the practical relationship of JSON with Unicode
> > specification.
>
> True. It would seem, at least to me, that this is one of those nuanced points
that either
>
> a) Has not been given the proper consideration by Unicode (ostensibly)
experts.
>
> b) The Unicode standard has evolved in such a way since the publication of RFC
4627 that it may require revisiting the issue.
>
> > I think your suggestions for clarifying some parts do make sense,
> > although it may be hard to reconcile basic differences between full
> > Unicode support, and goals of simplicity for JSON.
>
> I'm all for simplicity, and for a "less is more" philosophy. Unfortunately,
RFC 4627 allows for two strictly RFC 4627 compliant implementations to
"generate" wildly different results (where "generate" here means the JSON is
parsed and interpreted in such a way that the two implementations have what
reasonable people would consider "very different semantics").
>
> Numbers are another corner case. 4627 only describes how to parse a decimal
representation of numbers (both integer and floating-point). In practice this
means that a strictly conforming RFC 4627 JSON implementation can use an 8, 16,
32, or 64 bit "native primitive" to represent integers. It's perfectly valid
JSON to have integer numbers that require 128 or 256 bits in order to represent
them. To me, a serialization format such as JSON should make some effort to
ensure that the values contained within it will be properly interpreted by any
and all JSON implementations.  I've seen several JSON implementations that use a
32-bit size C99 primitive type to represent the parsed numbers. This is a
problem for anything that wants to parse contemporary Twitter JSON as the ID's
are > 2^32 at this point. The desire to "keep it simple" has to be balanced
against real world practical needs- when your ID's exceed 2^32, it is a
legitimate question to ask "Are JSON implementations going to handle this value
correctly?" If not 2^32, then when?
>
> > >
> > > Another issue is normalization.  In particular, the way normalization is
handled for the "key" portion of an "object" (i.e., {"key": "value"}) can
dramatically alter the meaning and contents of the object.  For example:
> > >
> > > {
> > > "\u212b": "one",
> > > "\u0041\u030a": "two",
> > > "\u00c5": "three"
> > > }
> > >
> > > Are these three keys distinct?  Should there be a requirement that they
MUST be handled and interpreted such that they are distinct?  Does that
requirement extend past the "channel" demarcation point (i.e., not a JSON
library or communication channel used to interchange the JSON between two hosts)
to the "host language"?
> > >
> > > In case it is not obvious, under the rules of Unicode NFC (Normalization
Form C), all three of the keys above will become "\u00c5" after NFC processing.
> >
> > For what it is worth, I have not seen a single JSON parser that would
> > do such normalization; and the only XML parser I recall even trying
> > proper Unicode code point normalization was XOM. This is not an
> > argument against proper handling, but rather an observation regarding
> > how much of a practical issue it seems to be.
>
> I have not seen a JSON implementation / parser that does such normalization.
>
> On the other hand, I very strongly suspect that whether or not such
normalization is taking place is not up to the writer of that parser. In my
particular case (JSONKit, for Objective-C), I pass the parsed JSON String to the
NSString class to instantiate an object.
>
> I have ZERO control over what and how NSString interprets or manipulates the
parsed JSON String that finally becomes the instantiated object that is ostensibly
the same as the original JSON String used to create it. It could be that
NSString decides that the instantiated object is always converted to its
precomposed form. Objective-C is flexible enough where someone might decide to
swizzle in some logic at run time that forces all strings to be precomposed
before being handed off to the main NSString instantiation method.
>
> > Nor have I seen feature requests to support normalization (XOM
> > implements it because its author is very ambitious wrt supporting
> > standards, it is very respectable achievement), during time I have
> > spend maintaining XML and JSON parser/generator implementations.
> > Do others have difference experiences?
>
> I don't have a particular opinion on the matter one way or the other other
than to highlight the point that in many practical, real-world situations,
whether or not such things take place may not be under the control of the JSON
parser.
>
> I also suspect that it's one of those things that most people haven't really
given a whole lot of consideration to- they just hand the parsed string over to
"the Unicode string handling code", and that's that. Most people may not realize
that such string handling code may subtly alter the original Unicode text as a
result (ala precomposing the string).
>
> > So to me it seems that most likely high-level clarifications regarding
> > normalization aspects would be:
> >
> > (a) Whether to do normalization or not is up to implementation
> > (normalization is left out of scope, on purpose), or
> > (b) Say that with JSON no normalization would be done (which would be
> > more at odds with unicode spec)
> >
> > Why? Just because I see very little chance of anything more ambitious
> > having effect on implementations (beyond small number that are willing
> > to tackle such complexity). While it would seem wrong to punt the
> > issue, there is the practical question of whether full solution would
> > matter.
>
> I can guarantee you that the practical question of whether a full solution
would matter will be answered the first time someone exploits it in a security
vulnerable way that results in a major security fiasco.
>
> Then it will be with 20/20 hindsight, and the question will be "Why didn't
anyone address (this behavior) that allowed two keys that were not bit for bit
identical, but became identical after converting them to their precomposed form,
and the security checks allowed the decomposed form through because it assumed
that everything was in precomposed form?"
>
> Unfortunately, the use of Unicode coupled with the fact that most JSON
implementations are dependent on external code for their Unicode support means
that this is an extremely non-trivial issue. I can't think of a simple solution
to the problem at the moment, other than it exists.
>
> > My guess is that about the last thing implementers would want is a
> > mandate to support full Unicode 4.0 (and above) normalization rules.
> > It would just mean that there would be the specification in one
> > corner; and implementations, practically none of which would be
> > compliant.
>
> You really ought to read:
>
> http://www.unicode.org/faq/security.html
>
> http://www.unicode.org/reports/tr36/#Canonical_Represenation
>
> Microsoft Security Bulletin (MS00-078): Patch Available for 'Web Server Folder
Traversal' Vulnerability
(http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx,
http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884)
>
> Creating Arbitrary Shellcode In Unicode Expanded Strings
(http://www.net-security.org/article.php?id=144)
>
> There's a long history of "Those little Unicode details aren't really
important" causing huge security problems later on.
>
>



#1595 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 4:54 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Mark Slater scripsit:

> The RFC does state that parsers can limit the numbers they accept to
> a specified range of their choice, just as they can impose a limit on
> the size of strings, if they want. In practice, some libraries apply
> unwritten limits to the range of numbers along the lines of "the range
> (set?) of numbers that can be represented as a Java double."

I think that limit is implied by the statement in the "Security
considerations" section that says that JSON is a subset of JavaScript,
where numbers are clearly constrained to IEEE 64-bit floats.

--
John Cowan    cowan@...    http://ccil.org/~cowan
         Sound change operates regularly to produce irregularities;
         analogy operates irregularly to produce regularities.
                 --E.H. Sturtevant, ca. 1945, probably at Yale

#1596 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 5:04 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Douglas Crockford scripsit:
> --- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:
>
> > In my opinion, this means JSON parsers and generators must not perform
> > normalization.  They must respect the data stored in the JSON byte stream as
> > is.
>
> I agree.

I agree in part.  JSON parsers MUST NOT normalize their inputs, for
the reasons given upthread.  But JSON generators SHOULD generate
normalization form C, and JSON parsers MAY check for it and
warn their applications if it is not present.
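
The policy above (parsers leave strings alone, but may check for NFC and warn) can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the helper name `loads_with_nfc_check` is hypothetical, and it relies on the standard `json.loads` `object_pairs_hook` parameter and `unicodedata.normalize`.

```python
import json
import unicodedata


def loads_with_nfc_check(text):
    """Parse JSON without altering strings; collect names that are not in NFC.

    Hypothetical helper: returns (parsed_value, list_of_non_nfc_names).
    """
    non_nfc = []

    def hook(pairs):
        for key, _ in pairs:
            # Compare against the NFC form, but never replace the name.
            if unicodedata.normalize("NFC", key) != key:
                non_nfc.append(key)
        return dict(pairs)

    return json.loads(text, object_pairs_hook=hook), non_nfc


# "A" + COMBINING RING ABOVE is not NFC; U+00C5 already is.
value, warnings = loads_with_nfc_check('{"A\\u030a": 1, "\\u00c5": 2}')
```

Here both names survive parsing unchanged, and only the decomposed one is reported to the application, which can then decide what to do with it.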

--
Your worships will perhaps be thinking          John Cowan
that it is an easy thing to blow up a dog?      http://www.ccil.org/~cowan
[Or] to write a book?
     --Don Quixote, Introduction                 cowan@...

#1597 From: "Douglas Crockford" <douglas@...>
Date: Sat Feb 26, 2011 6:04 pm
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
--- In json@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> Mark Slater scripsit:
>
> > The RFC does state that parsers can limit the numbers they accept to
> > a specified range of their choice, just as they can impose a limit on
> > the size of strings, if they want. In practice, some libraries apply
> > unwritten limits to the range of numbers along the lines of "the range
> > (set?) of numbers that can be represented as a Java double."
>
> I think that limit is implied by the statement in the "Security
> considerations" section that says that JSON is a subset of JavaScript,
> where numbers are clearly constrained to IEEE 64-bit floats.

That's not quite right. JSON says nothing at all about number representations.
All it knows is sequences of digits with the occasional decimal points. JSON
says nothing about 2's complement vs signed magnitude integers, and it says
nothing about word size or binary vs decimal.
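
As a rough illustration of how much the choice of representation matters, here is a small Python sketch. The sample document is made up for the example; `parse_int` and `parse_float` are standard parameters of `json.loads`. Forcing integers through `float` stands in for a parser that stores every number as an IEEE 64-bit double.

```python
import json
from decimal import Decimal

# Made-up sample: an integer just above 2^64 and a long decimal fraction.
text = '{"id": 18446744073709551617, "ratio": 0.1000000000000000055511151231257827}'

# Python's json module keeps integers exact (arbitrary precision).
exact = json.loads(text)

# A parser that maps every number onto a double cannot keep this integer.
as_double = json.loads(text, parse_int=float)

# Decimal keeps the digits of the fraction exactly as written.
as_decimal = json.loads(text, parse_float=Decimal)

print(exact["id"])
print(int(as_double["id"]))
print(as_decimal["ratio"])
```

The same JSON text yields three different in-memory values, and all three parsers would be accepting a syntactically valid number.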

#1598 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 8:05 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
johne_ganz scripsit:

> In fact, for my parser (JSONKit), which is Objective-C based and uses
> NSString to represent the JSON String objects, it is not practical
> for me to create a JSON parser that "respects the data stored in the
> JSON byte stream".  The NSString class makes no such guarantees in its
> documentation, nor does the Unicode Standard.  It would be extremely
> non-trivial for me to meet a "respects the data stored in the JSON
> byte stream" requirement, at least in the sense that the behavior
> is deterministic.

Normalization is non-trivial, and I doubt if any existing Unicode library
imposes it on all strings at creation/modification time.  Certainly ICU
does not; it provides the ability to normalize, that's all.

--
John Cowan       http://www.ccil.org/~cowan        <cowan@...>
         You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
         You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
                 Clear all so!  `Tis a Jute.... (Finnegans Wake 16.5)

#1599 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 8:08 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
johne_ganz scripsit:

> There is no language in RFC 4627 that I can find that supports your
> interpretation.  There is an awful lot of language in the Unicode
> standard that unambiguously says that you can not "delete" characters
> from a Unicode "string".  There is also compelling arguments in TR#36
> for why arbitrarily deleting characters is a huge mistake.

You are over-interpreting the standard.  Of course applications can
delete characters:  sed 's/t//g' deletes all t's from the input.
And that's all that's being said here: JSON parsers and serializers
shouldn't edit any of the codepoints, leaving that up to their clients.

--
John Cowan    cowan@...    http://ccil.org/~cowan
The present impossibility of giving a scientific explanation is no proof
that there is no scientific explanation. The unexplained is not to be
identified with the unexplainable, and the strange and extraordinary
nature of a fact is not a justification for attributing it to powers
above nature.  --The Catholic Encyclopedia, s.v. "telepathy" (1913)

#1600 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 8:09 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Douglas Crockford scripsit:

> For JSON's purpose, Unicode is just a set of code points. It gives
> some, such as { and }, special meaning. But in strings, everything
> should simply be passed through.

So you are now conceding that it's invalid JSON to send through unpaired
surrogate code units, since they don't correspond to code points?
We discussed this a while back, and you were then (IIRC) claiming that
JSON allowed any arbitrary code unit, including unpaired surrogates.

--
John Cowan   http://ccil.org/~cowan  cowan@...
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows.  When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged.  --Neal Stephenson

#1601 From: Tatu Saloranta <tsaloranta@...>
Date: Sat Feb 26, 2011 8:59 pm
Subject: Re: Re: JSON and the Unicode Standard
cowtowncoder
 
On Fri, Feb 25, 2011 at 8:01 PM, johne_ganz <john.engelhart@...> wrote:
> --- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
...
> I have not seen a JSON implementation / parser that does such normalization.
>
> On the other hand, I very strongly suspect that whether or not such
normalization is taking place is not up to the writer of that parser.  In

Yes.

> my particular case (JSONKit, for Objective-C), I pass the parsed JSON String
to the NSString class to instantiate an object.
>
> I have ZERO control over what and how NSString interprets or manipulates the
parsed JSON String that finally becomes the instantiated object that is ostensibly
the same as the original JSON String used to create it.  It could be that
NSString decides that the instantiated object is
> always converted to its precomposed form.  Objective-C is flexible enough
where someone might decide to swizzle in some logic at run time that forces all
strings to be precomposed before being handed off to the main NSString
instantiation method.

Ok. But in this case, would JSON specification itself help a lot? I
understand that this is problematic, in that different platforms can
choose different defaults (and possibly opaque handling).

...
> I don't have a particular opinion on the matter one way or the other other
than to highlight the point that in many practical, real-world situations,
whether or not such things take place may not be under the control of the JSON
parser.
> I also suspect that it's one of those things that most people haven't really
given a whole lot of consideration to- they just hand the parsed string over to
"the Unicode string handling code", and that's that.  Most people may not
realize that such string handling code may subtly alter the original Unicode
text as a result (ala precomposing the string).

Right. And if the specification says nothing, it can uncover real
complexities and ambiguities.

...
>> to tackle such complexity). While it would seem wrong to punt the
>> issue, there is the practical question of whether full solution would
>> matter.
>
> I can guarantee you that the practical question of whether a full solution
would matter will be answered the first time someone exploits it in a security
vulnerable way that results in a major security fiasco.

I would be interested in how you would see this leading to security
issues, outside of problems that String handling on specific platforms has.
Or are you equally concerned in general about parser implementation
quality (which is understandable), above and beyond question of what
JSON specification says? At least to me it would seem more likely that
issues would be outside of realm of core specification itself.

> Then it will be with 20/20 hindsight, and the question will be "Why didn't
anyone address (this behavior) that allowed two keys that were not bit for bit
identical, but became identical after converting them to their precomposed form,
and the security checks allowed the
> decomposed form through because it assumed that everything was in precomposed
form?"

I can see how this can be problematic from side of applications that
make assumptions on uniqueness. And also that it is important that
parsers will clearly define how they handle things -- not all parsers
necessarily even check for uniqueness for same byte patterns, much
less for normalization (and I think this is even allowed by the spec,
i.e. uniqueness checks are not mandated).
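
To make the uniqueness point concrete, here is a minimal Python sketch. The hook name `reject_duplicates` is hypothetical; `object_pairs_hook` is a standard `json.loads` parameter. By default Python's parser silently keeps the last value for a repeated name, and a stricter policy has to be opted into.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that raises on repeated names (exact code point match)."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


doc = '{"a": 1, "a": 2}'

lenient = json.loads(doc)  # default behaviour: the last value wins

try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```

Note that this check compares names code point for code point only; it says nothing about names that are merely canonically equivalent.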

So in a way, it would be useful to have a bit more concrete examples of
known practical issues. Links below may give some insight -- but it
would seem that they are typically platform specific. Which makes it
even harder to find shared solutions, or to recommend best practices.

> Unfortunately, the use of Unicode coupled with the fact that most JSON
implementations are dependent on external code for their Unicode support means
that this is an extremely non-trivial issue.  I can't think of a simple solution
to the problem at the moment, other than it exists.
>
...
> You really ought to read:
>
> http://www.unicode.org/faq/security.html
> http://www.unicode.org/reports/tr36/#Canonical_Represenation
>
> Microsoft Security Bulletin (MS00-078): Patch Available for 'Web Server Folder
Traversal' Vulnerability
(http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx,
http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884)
>
> Creating Arbitrary Shellcode In Unicode Expanded Strings
(http://www.net-security.org/article.php?id=144)
>
> There's a long history of "Those little Unicode details aren't really
important" causing huge security problems later on.

Thank you. While I had heard about issues with respect to
non-canonical UTF-8 code sequences (which were discussed as having such
issues), I admit I had not heard much about issues regarding
normalization.

-+ Tatu +-

#1602 From: Tatu Saloranta <tsaloranta@...>
Date: Sat Feb 26, 2011 9:03 pm
Subject: Re: Re: JSON and the Unicode Standard
cowtowncoder
 
On Sat, Feb 26, 2011 at 9:04 AM, John Cowan <cowan@...> wrote:
> Douglas Crockford scripsit:
>> --- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:
>>
>> > In my opinion, this means JSON parsers and generators must not perform
>> > normalization.  They must respect the data stored in the JSON byte stream
as
>> > is.
>>
>> I agree.
>
> I agree in part.  JSON parsers MUST NOT normalize their inputs, for
> the reasons given upthread.  But JSON generators SHOULD generate
> normalization form C, and JSON parsers MAY check for it and
> warn their applications if it is not present.

This sounds reasonable to me as well.

-+ Tatu +-

#1603 From: "Douglas Crockford" <douglas@...>
Date: Sun Feb 27, 2011 12:07 am
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
--- In json@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> Douglas Crockford scripsit:
>
> > For JSON's purpose, Unicode is just a set of code points. It gives
> > some, such as { and }, special meaning. But in strings, everything
> > should simply be passed through.
>
> So you are now conceding that it's invalid JSON to send through unpaired
> surrogate code units, since they don't correspond to code points?

No.


> We discussed this a while back, and you were then (IIRC) claiming that
> JSON allowed any arbitrary code unit, including unpaired surrogates.

Right.
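
For what it is worth, at least one widely used parser behaves this way. The Python sketch below is an observation about CPython's standard `json` module, not a statement about what RFC 4627 requires: an escaped lone surrogate is accepted and round-trips, but the resulting string cannot be encoded as UTF-8.

```python
import json

# JSON text containing one escaped high surrogate with no low surrogate.
text = '"\\ud800"'

s = json.loads(text)          # accepted; yields a one-element string
round_trip = json.dumps(s)    # escaped again on output (ensure_ascii default)

try:
    s.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False           # a lone surrogate has no UTF-8 encoding
```

So the value passes through the JSON layer untouched, and the trouble only appears when a client tries to treat it as well-formed Unicode text.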

#1604 From: "mehdigholam@..." <mgholam@...>
Date: Sun Feb 27, 2011 7:47 am
Subject: fastJSON v1.4
mehdigholam...
 
Hello all,

Huge speed optimizations in fastJSON v1.4, which is now officially the fastest JSON
library on the .NET platform.

http://www.codeproject.com/KB/IP/fastJSON.aspx

#1605 From: Petri Lehtinen <petri@...>
Date: Mon Feb 28, 2011 7:31 pm
Subject: Jansson 2.0 released
akhern...
 
Jansson 2.0 is finally out. This is a new major release that is
(slightly) backwards incompatible with the older versions.

Changes since v1.3
------------------

* Backwards incompatible changes:

   - Unify unsigned integer usage in the API

   - Change JSON integer's underlying type to the widest signed integer
     type available

   - Change the maximum indentation depth to 31 spaces in encoder

   - For future needs, add a flags parameter to all decoding functions

* New features

   - JSON value building (packing) functionality based on a format
     string.

   - Extraction and validation functionality based on a format string.

   - Error reporting enhancements.

   - Preprocessor constants that define the library version.

   - Custom memory allocation functions.

* Fix many portability issues, especially ease building on Windows.

Download source: http://www.digip.org/jansson/releases/jansson-2.0.tar.gz
View documentation: http://www.digip.org/jansson/doc/2.0/
Changelog: http://www.digip.org/jansson/doc/2.0/changes.html#version-2-0
GitHub: https://github.com/akheron/jansson


What is Jansson?
----------------

Jansson is a C library for encoding, decoding and manipulating JSON data.
It features:

* Simple and intuitive API and data model
* Comprehensive documentation
* No dependencies on other libraries
* Full Unicode support (UTF-8)
* Extensive test suite

Jansson is licensed under the MIT license.

For more details, see http://www.digip.org/jansson/.


Petri Lehtinen

#1606 From: "johne_ganz" <john.engelhart@...>
Date: Wed Mar 2, 2011 3:57 am
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
>
> On Fri, Feb 25, 2011 at 8:01 PM, johne_ganz <john.engelhart@...> wrote:
> > --- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@> wrote:
> > my particular case (JSONKit, for Objective-C), I pass the parsed JSON String
to the NSString class to instantiate an object.
> >
> > I have ZERO control over what and how NSString interprets or manipulates the
parsed JSON String that finally becomes the instantiated object that is ostensibly
the same as the original JSON String used to create it.  It could be that
NSString decides that the instantiated object is
> > always converted to its precomposed form.  Objective-C is flexible enough
where someone might decide to swizzle in some logic at run time that forces all
strings to be precomposed before being handed off to the main NSString
instantiation method.
>
> Ok. But in this case, would JSON specification itself help a lot? I
> understand that this is problematic, in that different platforms can
> choose different defaults (and possibly opaque handling).

It is my opinion that the answer is "Yes".  The standard must address some of
the issues introduced by the use of Unicode (see below).  Then there is the
practical real world issue that many JSON implementations are going to use
external code to manage the "Unicode part", and I think it's fair to say that
that external code is going to be focused on Unicode Standard compliance rather
than implementing semantics that are useful or even desired for RFC 4627
compliance.

Please, don't get me wrong, I honestly wish that the whole thing could be
treated as some sort of ideal "extended ASCII" that was for all practical
purposes synonymous with "binary".  This would be much, much simpler.  But
that's not Unicode.

> ...
> > I don't have a particular opinion on the matter one way or the other other
than to highlight the point that in many practical, real-world situations,
whether or not such things take place may not be under the control of the JSON
parser.
> > I also suspect that it's one of those things that most people haven't really
given a whole lot of consideration to- they just hand the parsed string over to
"the Unicode string handling code", and that's that.  Most people may not
realize that such string handling code may subtly alter the original Unicode
text as a result (ala precomposing the string).
>
> Right. And if specification says nothing, it can uncover real
> complexities and ambiguities.

Yes.  The use of Unicode, and the language surrounding the issue of Unicode in
RFC 4627 means there are some very real complexities and ambiguities.  The
particular example that comes to mind is

"What does it mean for two keys (or names in RFC 4627 nomenclature) to compare
equal?"

For example:

{ // Example #1
"Ä" : "launch nukes",
"Ä" : "do not launch nukes"
}

Do these keys "compare equal"?

{ // Example #2
"\u00C4" : "launch nukes",
"A\u0308" : "do not launch nukes"
}

How about this?
Is it "identical" to example #1?
Do the keys in example #2 compare equal?
Do the keys in example #2 compare equal to their respective keys in example #1?

From ECMA-262, "ECMAScript Language Specification", 5th Edition / December 2009,
page 11, section 6 "Source Text":

ECMAScript source text is represented as a sequence of characters in the Unicode
character encoding, version 3.0 or later. The text is expected to have been
normalised to Unicode Normalised Form C (canonical composition), as described in
Unicode Technical Report #15.
------

So let's say you're using a (Java|ECMA)Script editor to edit your JSON.
And the editor happens to follow this advice, as given in the ECMA-262 document.

What happens to example #1 in this case?
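
A rough Python sketch of the effect, using the escaped spelling from example #2 because the code points there are unambiguous. The NFC step stands in for whatever an editor following the ECMA-262 advice might do; that an editor actually does so is an assumption made for illustration.

```python
import json
import unicodedata

# Example #2 with the JSON escapes kept as written.
doc = '{"\\u00C4": "launch nukes", "A\\u0308": "do not launch nukes"}'

parsed = json.loads(doc)
keys = list(parsed)

# As parsed, the two names are different code point sequences.
distinct_as_parsed = len(set(keys))

# After NFC both names become U+00C4, so only one entry can survive.
nfc = {unicodedata.normalize("NFC", k): v for k, v in parsed.items()}
```

Before normalization there are two names; afterwards there is one, and which value it carries depends purely on processing order.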

> ...
> >> to tackle such complexity). While it would seem wrong to punt the
> >> issue, there is the practical question of whether full solution would
> >> matter.
> >
> > I can guarantee you that the practical question of whether a full solution
would matter will be answered the first time someone exploits it in a security
vulnerable way that results in a major security fiasco.
>
> I would be interested in how you would see this leading to security
> issues, outside of problems specific String handling on platforms has.

It doesn't necessarily have anything to do with a platform's string handling, it
has to do with Unicode.

{
"A": 1,
"A": 2,
"𝖠": 3,
"Å": 4,
"Å": 5,
"Å": 6,
"𝖠̊": 7
}

Unicode vastly complicates the above.  If one uses a unicode aware editor to
edit the above, it is perfectly fine for it to mangle it so that it is not
precisely the unicode I pasted.  In fact, it wouldn't surprise me if this
groups.yahoo.com software washes it through a bit of unicode processing and what
finally appears isn't exactly what I put in.

One also needs to switch to the mindset of a security person, not someone who is
interested in writing a JSON specification or parser implementation.

Security people love to sell and stick magic boxes that sit in the network,
usually between you and the bad, evil internet.  One particular brand of voodoo,
known as the firewall, will occasionally sanitize or reject data from the bad,
outside internet.

Now imagine you're a security person, and you're buying or making one of these
magic boxes.  You know some of the issues involved and that various JSON
implementations are all over the map when it comes to how they deal with the
corner cases, and these corner cases can dramatically alter what it means for
two keys to "compare equal".  Which way are you going to come down on the issue?
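
The kind of mismatch being described can be sketched in Python as follows. Everything here is hypothetical: the deny-list `BLOCKED`, the gateway check `filter_allows`, and the receiver `backend_keys` are invented for illustration and model no real product. The filter compares names code point for code point, while the receiver's string layer is assumed to normalize to NFC.

```python
import json
import unicodedata

BLOCKED = {"\u00c4"}  # hypothetical deny-list, stored in precomposed form


def filter_allows(doc):
    """Hypothetical gateway check: compares names code point for code point."""
    return not any(key in BLOCKED for key in json.loads(doc))


def backend_keys(doc):
    """Hypothetical receiver whose string layer normalizes names to NFC."""
    return {unicodedata.normalize("NFC", k) for k in json.loads(doc)}


doc = '{"A\\u0308": "launch nukes"}'   # decomposed spelling of the blocked name

passed_filter = filter_allows(doc)
seen_by_backend = backend_keys(doc)
```

The document passes the filter, yet the receiver ends up looking at exactly the name the filter was meant to block.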

> Or are you equally concerned in general about parser implementation
> quality (which is understandable), above and beyond question of what
> JSON specification says? At least to me it would seem more likely that
> issues would be outside of realm of core specification itself.

Don't care about particular implementations.

Keep in mind there's a huge difference between what the spec says and what
people do.

The spec should be "right", for some strong definition of right.  It should also
not exist solely in some idealized vacuum, but be tempered with the practical,
real world issues that real world implementations of the standard have to deal
with.  It should represent "the best possible" at the time the standard was
forged, incorporating the wisdom and experience of those who actually have to
deal with and implement whatever the standard represents, so that those who come
after, who may not have similar levels of experience or willingness to
thoroughly examine all the issues, can use the standard with some degree of
safety and confidence.

> > Then it will be with 20/20 hindsight, and the question will be "Why didn't
> > anyone address (this behavior) that allowed two keys that were not bit for
> > bit identical, but became identical after converting them to their
> > precomposed form, and the security checks allowed the decomposed form
> > through because it assumed that everything was in precomposed form?"
>
> I can see how this can be problematic from the side of applications that
> make assumptions on uniqueness. And also that it is important that
> parsers clearly define how they handle things -- not all parsers
> necessarily even check for uniqueness for the same byte patterns, much
> less for normalization (and I think this is even allowed by the spec,
> i.e. uniqueness checks are not mandated).

I am in violent disagreement with this entire premise.

> > There's a long history of "Those little Unicode details aren't really
> > important" causing huge security problems later on.
>
> Thank you. While I had heard about issues with respect to
> non-canonical UTF-8 code sequences (which were discussed to have such
> issues), I admit I had not heard much about issues regarding
> normalization.

I would also recommend downloading the Unicode Standard
(http://www.unicode.org/versions/Unicode6.0.0/UnicodeStandard-6.0.pdf) and doing
a simple search for "security".  This will give you a list of pages that are
probably the most relevant to what I'm talking about.

And keep in mind that those issues are directly related to JSON because JSON is
"encoded as Unicode".  Anything that treats JSON as Unicode, such as a text
editor or a linked library like ICU, is going to follow the rules and
recommendations of the Unicode Standard.  This means that in the real world,
JSON is likely to be washed through one of these libraries and be exposed to the
Unicode Standard, and that standard DOES NOT require such software to preserve
the exact sequence of bytes as Douglas Crockford thinks it should.

Even the official ECMA recommendation says that it expects "the source to be
normalised to Unicode Normalised Form C".  It's one thing to write code that
manipulates data and bytes that are (for some definition of) "local" to that
instance of the program at that point in time.  It's an entirely different thing
when you start slinging bytes between machines or need the bytes to be archived
and possibly processed by a different program.
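
The key-comparison concern raised in this message can be sketched in a few
lines of JavaScript.  This is only an illustration under stated assumptions: it
assumes a runtime with String.prototype.normalize (ES2015 or later), and the
member names and values are made up for the example.  It shows two member names
that differ code unit for code unit, so a parser keeps both, yet collapse to
one name once a later stage converts them to precomposed form (NFC).

```javascript
// Two member names: U+00C4 (precomposed) and U+0041 U+0308 (decomposed).
// The doubled backslashes keep the \u escapes inside the JSON text itself.
const text = '{"\\u00c4": "precomposed", "\\u0041\\u0308": "decomposed"}';
const parsed = JSON.parse(text);

// JSON.parse compares member names by code units, so both names survive.
const rawKeys = Object.keys(parsed);

// A later stage that normalizes names to NFC sees a single name.
const nfcKeys = new Set(rawKeys.map((key) => key.normalize('NFC')));

console.log(rawKeys.length); // 2
console.log(nfcKeys.size);   // 1
```

A check that runs before normalization and a consumer that runs after it can
therefore disagree about whether the two names are the same key.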

#1607 From: "johne_ganz" <john.engelhart@...>
Date: Wed Mar 2, 2011 4:46 am
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> johne_ganz scripsit:
>
> > In fact, for my parser (JSONKit), which is Objective-C based and uses
> > NSString to represent the JSON String objects, it is not practical
> > for me to create a JSON parser that "respects the data stored in the
> > JSON byte stream".  The NSString class makes no such guarantees in its
> > documentation, nor does the Unicode Standard.  It would be extremely
> > non-trivial for me to meet a "respects the data stored in the JSON
> > byte stream" requirement, at least in the sense that the behavior
> > is deterministic.
>
> Normalization is non-trivial, and I doubt if any existing Unicode library
> imposes it on all strings at creation/modification time.  Certainly ICU
> does not; it provides the ability to normalize, that's all.

The Foundation framework (specifically the NSString class) on Mac OS X and
iPhone / iPad does.  Not sure if 90+ million iPhones count for much, though.

In particular, [@"Ä" compare:@"Ä"] is zero (NSOrderedSame), or "identical",
whereas [@"Ä" isEqual:@"Ä"] is NO.  Each has different semantics, and -compare:
is preferred when dealing with strings because it has the right semantics in
that context.

As an analogy, it would be as if JavaScript behaved as:
if("Ä" == "Ä") // True
if("Ä" === "Ä") // False
in the same way that ("1" == 1) is true, but ("1" === 1) is false.

And just in case things get mangled along the way, the first string is "\u00c4"
and the second string is "\u0041\u0308".  In fact, if they do get mangled.... I
think that should serve as a warning that these things can and do happen behind
your back when dealing with Unicode.
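
For comparison, here is what actual JavaScript does with those two strings.  In
real JavaScript both == and === compare strings code unit by code unit, so the
analogy above is hypothetical; an equivalence-aware comparison has to normalize
explicitly.  A minimal sketch, assuming a runtime with
String.prototype.normalize (ES2015 or later); canonicallyEqual is a made-up
helper name, not part of any library:

```javascript
const precomposed = '\u00c4'; // LATIN CAPITAL LETTER A WITH DIAERESIS
const decomposed = 'A\u0308'; // 'A' followed by COMBINING DIAERESIS

// Hypothetical helper: compare two strings after converting both to NFC.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(precomposed === decomposed);                // false
console.log(precomposed == decomposed);                 // false
console.log(canonicallyEqual(precomposed, decomposed)); // true
console.log(precomposed.length, decomposed.length);     // 1 2
```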

#1608 From: "mehdigholam@..." <mgholam@...>
Date: Wed Mar 2, 2011 6:17 pm
Subject: fastJSON v1.5
mehdigholam...
 
Hello all,

Huge optimizations again for fastJSON, the .NET implementation.

http://www.codeproject.com/KB/IP/fastJSON.aspx

Cheers,

#1609 From: Dave Gamble <davegamble@...>
Date: Wed Mar 2, 2011 6:21 pm
Subject: Re: Re: JSON and the Unicode Standard
signalzerodb
 
Would it be too much to specify that key names are to be ASCII top-bit-unset
strings?

i.e. in the definition of an object, designate that the "string" there is a
"simplestring" which uses a restricted definition of char?

As far as I can see, this is the only case where the Unicode interpretation
is potentially dangerous.
In usage of strings as data, I believe they are to be delivered unprocessed
to the user of the data.

Maybe designate this json_littlebitmoresecure.
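
A rough sketch of what such a restriction could look like when applied after
parsing, in JavaScript.  Everything here is an assumption for illustration:
hasOnlyAsciiKeys is a made-up name, and "top-bit-unset" is read as "every code
unit of the key is in the range 0x00-0x7F".  String values are left alone, as
proposed above.

```javascript
// Hypothetical check for the "simplestring" idea: every member name, at any
// depth, must consist only of code units 0x00-0x7F.  Values are not examined.
function hasOnlyAsciiKeys(value) {
  if (value === null || typeof value !== 'object') {
    return true; // strings, numbers, booleans and null carry no keys
  }
  if (Array.isArray(value)) {
    return value.every((item) => hasOnlyAsciiKeys(item));
  }
  return Object.entries(value).every(
    ([key, child]) => /^[\x00-\x7f]*$/.test(key) && hasOnlyAsciiKeys(child)
  );
}

console.log(hasOnlyAsciiKeys(JSON.parse('{"name": {"id": 1}}'))); // true
console.log(hasOnlyAsciiKeys(JSON.parse('{"\\u00c4": 1}')));      // false
console.log(hasOnlyAsciiKeys(JSON.parse('{"k": "\\u00c4"}')));    // true
```

Because this runs on the parsed result, it cannot see duplicate member names
that the parser has already collapsed; a real implementation would more likely
enforce the restriction inside the parser.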

Cheers,

Dave.


#1610 From: Dave Gamble <davegamble@...>
Date: Wed Mar 2, 2011 6:22 pm
Subject: Re: Re: JSON and the Unicode Standard
signalzerodb
 
Better question: How does the ECMA/javascript spec limit variable names?
This seems to be the same question, in practical terms.

Dave.


#1611 From: John Cowan <cowan@...>
Date: Wed Mar 2, 2011 6:38 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Dave Gamble scripsit:

> Better question: How does the ECMA/javascript spec limit variable names?
> This seems to be the same question, in practical terms.

In JSON, unquoted keys are not permitted,
so both keys and values are strings.

--
Unless it was by accident that I had            John Cowan
offended someone, I never apologized.           cowan@...
         --Quentin Crisp                         http://www.ccil.org/~cowan

#1612 From: Dave Gamble <davegamble@...>
Date: Wed Mar 2, 2011 6:39 pm
Subject: Re: Re: JSON and the Unicode Standard
signalzerodb
 
To save people looking it up:

ECMA-262, section 7.6:

Two IdentifierName that are canonically equivalent according to the
Unicode standard are not equal unless they are represented by the
exact same sequence of code units (in other words, conforming
ECMAScript implementations are only required to do bitwise comparison
on IdentifierName values). The intent is that the incoming source text
has been converted to normalised form C before it reaches the
compiler.

ECMAScript implementations may recognize identifier characters defined
in later editions of the Unicode Standard. If portability is a
concern, programmers should only employ identifier characters defined
in Unicode 3.0.

There then follows a syntax definition, which expressly precludes
reserved keywords from being used as identifiers.

Looks like the most interesting attacks on json, from a security
viewpoint, would be using keywords as object member names.
Has anyone checked what happens if you do? I suspect the javascript
implementations would be the most at risk.

I think it's fairly clear that a JSON parser has ABSOLUTELY NO
BUSINESS poking around with actual data strings; Douglas has been very
clear that you are to pass them bit-identical to the recipient. On the
other hand, there's an argument for some kind of sanitization when it
comes to object member names.
I'm really tempted by the idea of a JSON-secure spec, which clamps
down on these details.

Arguing the Unicode details is decidedly NOT compatible with the
"spirit" of JSON, which Douglas has been very clear about: a
lightweight, simple, modern data representation.

I think it speaks to the merit of JSON as a format that you (@Johne)
want to consider the security details.
But I think what you need might well be a branch and a new spec?

I'm probably speaking way out of turn here, so please do accept my
apologies if I've overstepped any bounds.

Best,

Dave.





#1613 From: Dave Gamble <davegamble@...>
Date: Wed Mar 2, 2011 6:43 pm
Subject: Re: Re: JSON and the Unicode Standard
signalzerodb
 
On Wed, Mar 2, 2011 at 6:38 PM, John Cowan <cowan@...> wrote:
>
>
>
> Dave Gamble scripsit:
>
> > Better question: How does the ECMA/javascript spec limit variable names?
> > This seems to be the same question, in practical terms.
>
> In JSON, unquoted keys are not permitted,
> so both keys and values are strings.
>
I am aware of that :)
It occurs to me that since the intention is that JSON text
deserializes into an instanced object, which will in some cases be a
javascript object (where member variable names are used), there could
be call for a greater limitation on this? Does that make sense?

In other words, my question didn't concern the JSON spec, it concerned
the limitations imposed on member variable names in javascript; since
this seemed like it might be a sensible set of limitations to apply to
JSON keys.

Dave.




#1614 From: John Cowan <cowan@...>
Date: Wed Mar 2, 2011 6:45 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Dave Gamble scripsit:

> Looks like the most interesting attacks on json, from a security
> viewpoint, would be using keywords as object member names.
> Has anyone checked what happens if you do? I suspect the javascript
> implementations would be the most at risk.

Indeed they would, which is precisely why {foo: "bar"} is not conformant
JSON any more than {if: "bar"} would be, although the first is conformant
JavaScript and the second is not.  However, {"foo": "bar"} and {"if":
"bar"} are both good JSON and good JavaScript, and in JavaScript {foo:
"bar"} and {"foo": "bar"} mean the same thing.
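
The JSON half of that statement is easy to check with JSON.parse.  A small
sketch; isValidJson is a made-up helper name.  Whether the unquoted forms are
accepted as JavaScript source depends on the ECMAScript edition (the 5th
edition allows reserved words as property names in object literals, the 3rd
does not), so the sketch exercises only the JSON side.

```javascript
// Hypothetical helper: true when JSON.parse accepts the text, false otherwise.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidJson('{foo: "bar"}'));       // false
console.log(isValidJson('{if: "bar"}'));        // false
console.log(isValidJson('{"foo": "bar"}'));     // true
console.log(isValidJson('{"if": "bar"}'));      // true
console.log(JSON.parse('{"if": "bar"}')['if']); // bar
```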

--
John Cowan  cowan@...   http://ccil.org/~cowan
Promises become binding when there is a meeting of the minds and consideration
is exchanged. So it was at King's Bench in common law England; so it was
under the common law in the American colonies; so it was through more than
two centuries of jurisprudence in this country; and so it is today.
        --Specht v. Netscape

#1615 From: John Cowan <cowan@...>
Date: Wed Mar 2, 2011 7:20 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Dave Gamble scripsit:

> In other words, my question didn't concern the JSON spec, it concerned
> the limitations imposed on member variable names in javascript; since
> this seemed like it might be a sensible set of limitations to apply to
> JSON keys.

I don't think so.  In particular, it is often helpful to allow keys named
"$" or "#foo" or such.  In any case, the normalization rule for JavaScript
identifiers is "Don't".
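
Both points can be illustrated with a short JavaScript sketch (the key names
and values are made up for the example).  Keys that are not valid identifiers
are still reachable with bracket notation, and a property lookup compares code
units without normalizing, so a decomposed key is not found under its
precomposed spelling.

```javascript
// "$" and "#foo" are ordinary JSON member names; the third name is
// 'A' followed by U+0308 COMBINING DIAERESIS (decomposed form).
const obj = JSON.parse('{"$": 1, "#foo": 2, "A\\u0308": 3}');

console.log(obj['$']);         // 1
console.log(obj['#foo']);      // 2

// No normalization happens on lookup: only the exact code units match.
console.log('A\u0308' in obj); // true
console.log('\u00c4' in obj);  // false
```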

--
There is no real going back.  Though I          John Cowan
may come to the Shire, it will not seem         cowan@...
the same; for I shall not be the same.          http://www.ccil.org/~cowan
I am wounded with knife, sting, and tooth,
and a long burden.  Where shall I find rest?           --Frodo
