json · JSON JavaScript Object Notation
Group Information

  • Members: 590
  • Category: Data Formats
  • Founded: Jul 19, 2005
  • Language: English
Messages 1571 - 1600 of 1958
#1571 From: jonathan wallace <ninja9578@...>
Date: Thu Feb 3, 2011 1:54 pm
Subject: libjson 7.0 new features
ninja9578
 
Hello all,

I just wanted to mention that a while ago I changed the name from libJSON to
libjson.  Would you mind reflecting that on json.org?

I added a ton of new stuff in this upgrade, most significantly streaming
ability.  I got a bunch of requests for the ability to take a stream (like from
the internet) and parse it as it comes in.  Since the input may be partial JSON,
or multiple JSON objects at a time, the stream checks each time something gets
added to it and calls a callback with the new node each time one is completed.
This should make life a lot easier for those streaming JSON from web sources.
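
The buffering-plus-callback idea described above can be sketched in a few lines of Python. This is not libjson's API; the class name StreamingJSONParser and its feed method are hypothetical, and the sketch assumes the top-level values are objects or arrays (a bare number split across chunks would be misread) and makes no attempt to tell invalid input apart from input that is merely incomplete.

```python
import json


class StreamingJSONParser:
    """Hypothetical sketch: buffer text chunks and invoke a callback
    once per completed top-level JSON value."""

    def __init__(self, on_value):
        self._on_value = on_value
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk):
        """Append a chunk, then emit every value that is now complete."""
        self._buffer += chunk
        while True:
            # raw_decode does not skip leading whitespace, so strip it first.
            pending = self._buffer.lstrip()
            if not pending:
                self._buffer = ""
                return
            try:
                value, end = self._decoder.raw_decode(pending)
            except json.JSONDecodeError:
                # Assumed to be incomplete: keep the text and wait for more.
                self._buffer = pending
                return
            self._buffer = pending[end:]
            self._on_value(value)


if __name__ == "__main__":
    parser = StreamingJSONParser(print)
    parser.feed('{"a": 1}{"b"')   # first object complete, second partial
    parser.feed(': [2, 3]}')      # second object now complete
```

Re-parsing the buffer on every chunk keeps the sketch short; a production parser would normally keep its state between chunks instead.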

I also added the option to turn off all libjson extensions, such as comments
and hexadecimal support, so that only 100% compliant JSON is considered valid.
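
As an illustration of what "100% compliant" excludes, here is a minimal sketch using Python's standard json module rather than libjson. The helper name is_strict_json is hypothetical. Python's parser already rejects comments and hexadecimal literals; the parse_constant hook is used here to also reject NaN and Infinity, which Python accepts by default although the JSON grammar does not.

```python
import json


def _reject_constant(name):
    # Called by the json module for 'NaN', 'Infinity' and '-Infinity'.
    raise ValueError(f"non-standard constant: {name}")


def is_strict_json(text):
    """Return True only if text parses without any common extension."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


if __name__ == "__main__":
    for sample in ('{"a": 255}', '{"a": 0xFF}', '{"a": 1} // note', '[NaN]'):
        print(sample, is_strict_json(sample))
```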

I exposed an interface to use libjson's base64 encoder and decoder since a few
people asked if they could use it.

There is a new makefile with lots more options, including an install script. 
(Thanks to Bernhard Fluehmann)

There were several other changes too; you can see them all in the changelog in
the documentation if you wish.
http://sourceforge.net/projects/libjson/

Jon





#1572 From: Tatu Saloranta <tsaloranta@...>
Date: Thu Feb 3, 2011 7:12 pm
Subject: Re: libjson 7.0 new features
cowtowncoder
 
Interesting. One thing I noticed from the project page is that there
are big claims on performance, but it seems to lack links to actual
measurements. I was wondering if you could add links, so that it is
possible to see actual performance numbers, figure out the relative
importance of performance, and so on.
I have noticed that at least half of all JSON projects claim to be
faster than anyone else, so measurements could also clear up the
situation and keep everyone honest.

-+ Tatu +-


#1573 From: David Graham <david.malcom.graham@...>
Date: Thu Feb 3, 2011 7:36 pm
Subject: Re: libjson 7.0 new features
dgraham...
 
If you'd prefer a really slow JSON parser, check out
https://github.com/dgraham/json-stream.  It's about 20x slower than the Ruby
json gem, but it's quite handy if you need to parse a single huge JSON
document in constant memory space.

Enjoy!

David



#1574 From: Gregg Irwin <gregg.irwin@...>
Date: Thu Feb 3, 2011 8:00 pm
Subject: Re[2]: libjson 7.0 new features
greggirwin143
 
TS> I have noticed that at least half of all JSON projects claim to be
TS> faster than anyone else, so measurements could also clear up the
TS> situation and keep everyone honest.

A standard JSON benchmark perhaps?

--Gregg

#1575 From: Tatu Saloranta <tsaloranta@...>
Date: Fri Feb 4, 2011 12:03 am
Subject: Re: Re[2]: libjson 7.0 new features
cowtowncoder
 
On Thu, Feb 3, 2011 at 12:00 PM, Gregg Irwin <gregg.irwin@...> wrote:
> TS> I have noticed that at least half of all JSON projects claim to be
> TS> faster than anyone else, so measurements could also clear up the
> TS> situation and keep everyone honest.
>
> A standard JSON benchmark perhaps?

Yes, one would be very useful. I know there are a couple for Java
(general purpose ones that can also use JSON; and specific ones), but I
haven't seen many for other platforms, or comparing between platforms.

-+ Tatu +-

#1576 From: jonathan wallace <ninja9578@...>
Date: Mon Feb 7, 2011 2:11 pm
Subject: Re: Re[2]: libjson 7.0 new features
ninja9578
 
If there is a standard JSON benchmark, that would be nice.


I've only compared it to wxJSON, cJSON, and a few others.

"People have always been impressed by the power of our example, not the example
of our power." - William Jefferson Clinton



#1577 From: Tatu Saloranta <tsaloranta@...>
Date: Mon Feb 7, 2011 10:09 pm
Subject: Re: Re[2]: libjson 7.0 new features
cowtowncoder
 
On Mon, Feb 7, 2011 at 6:11 AM, jonathan wallace <ninja9578@...> wrote:
> If there is a standard JSON benchmark that would be nice.
>
>
> I've only compared it to wxJSON, cJSON, and a few others.

I suspect many users would like to see comparisons. Maybe blog about
it or such (and include test code), or send a link if already
published?

I don't doubt at all that there are differences, given the different
skill & experience levels of implementors (and the common "simplest
must be fastest" fallacy wrt performance). It's just hard to find out
real numbers when project home pages do not show measurements, just
state results.

-+ Tatu +-

#1578 From: Jonathan Wallace <ninja9578@...>
Date: Tue Feb 8, 2011 6:05 pm
Subject: Re: Re[2]: libjson 7.0 new features
ninja9578
 
Well, there is a benchmark included in the source, but it's mostly for my own
purposes of comparing upgrade versions against the previous version.  I think
I'll come up with a set of common JSON tasks and implement them in a few
libraries, or ask the library maintainers to do it to be sure it's done right.
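
A "set of common JSON tasks" could start from something as small as the sketch below, which times a parse and a serialize step for any library exposing a pair of callables. It is not the benchmark shipped with libjson; the benchmark function, its parameters and the SAMPLE document are all assumptions made for illustration, and the standard json module merely stands in for a library under test.

```python
import json
import timeit

# Hypothetical sample workload: a list of small records.
SAMPLE = {
    "users": [
        {"id": i, "name": f"user{i}", "tags": ["a", "b"], "score": i * 1.5}
        for i in range(100)
    ]
}


def benchmark(parse, serialize, document, repeats=5, number=200):
    """Time one library's parse/serialize callables on the same document.

    Returns the best per-call time in seconds for each task, plus the
    encoded size, so results from different libraries can be compared.
    """
    text = serialize(document)
    parse_best = min(timeit.repeat(lambda: parse(text),
                                   repeat=repeats, number=number))
    dump_best = min(timeit.repeat(lambda: serialize(document),
                                  repeat=repeats, number=number))
    return {
        "parse_s": parse_best / number,
        "serialize_s": dump_best / number,
        "bytes": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    print(benchmark(json.loads, json.dumps, SAMPLE))
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; publishing the harness alongside the numbers lets other maintainers rerun it.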


#1579 From: Tatu Saloranta <tsaloranta@...>
Date: Tue Feb 8, 2011 7:20 pm
Subject: Re: Re[2]: libjson 7.0 new features
cowtowncoder
 
On Tue, Feb 8, 2011 at 10:05 AM, Jonathan Wallace <ninja9578@...> wrote:
> Well, there is a benchmark included in the source, but it's mostly for my own
> purposes of comparing upgrade versions against the previous version.  I think
> I'll come up with a set of common JSON tasks and implement them in a few
> libraries, or ask the library maintainers to do it to be sure it's done right.

That would be very useful! I know it might be a bit of work, given
differing APIs and all, but then again, it should be beneficial for
authors of other packages to work on being able to run similar tests.
So it should be possible to get things bootstrapped. This is how
"jvm-serializers" (https://github.com/eishay/jvm-serializers) for Java
serialization libraries started, and seems to work quite well.

-+ Tatu +-

#1580 From: "johne_ganz" <johne_ganz@...>
Date: Sun Feb 13, 2011 1:04 am
Subject: JSONKit
johne_ganz
 
Since this is a group dedicated to JSON, I just thought I'd post a note about
yet another Objective-C serializer / deserializer: JSONKit, available at
https://github.com/johnezang/JSONKit

Also, I suspect this message will reach the right people, but perhaps someone on
the list could forward this message on to webmaster for json.org and add JSONKit
to the list of Objective-C implementations?  Many thanks in advance.

#1581 From: "mehdigholam@..." <mehdigholam@...>
Date: Sun Feb 20, 2011 9:39 am
Subject: smallest fastest polymorphic json serializer for .net
mehdigholam...
 
Hello all,

Follow the link for my .net implementation.

http://www.codeproject.com/KB/IP/fastJSON.aspx

Cheers

#1582 From: "johne_ganz" <john.engelhart@...>
Date: Thu Feb 24, 2011 1:22 am
Subject: JSON and the Unicode Standard
johne_ganz
 
In RFC 4627, Section 3 Encoding, it states:

"JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8."

Unicode is defined as: The Unicode Consortium, "The Unicode Standard
Version 4.0", 2003, <http://www.unicode.org/versions/Unicode4.1.0/>.

Is it safe to assume that RFC 4627 implies "The minimum Unicode Standard
is version 4.0", or does it mean "The Unicode Standard as defined in
version 4.0, and ONLY version 4.0" (i.e., later versions of the Unicode
standard are non-RFC 4627 conforming)?  The standard is silent on this
point, but I believe a "best practices" interpretation is "The minimum
Unicode Standard is version 4.0" with the implicit assumption that the
Unicode Standard is strongly motivated to preserve backwards
compatibility.  Is this the accepted interpretation?

Furthermore, I interpret the quoted RFC 4627 section to imply:

Where RFC 4627 is in conflict with the Unicode Standard, the Unicode
Standard interpretation shall be the one used unless explicitly and
unambiguously superseded by RFC 4627. Otherwise, by referencing the
Unicode Standard, the Unicode Standard is incorporated in to RFC 4627 as
part of the requirements for JSON.

In other words, JSON is built on top of Unicode.  When defining JSON,
the author(s) of RFC 4627 were aware of conflicts between what they were
defining (JSON) and the Unicode Standard (at the time, v4.0), and have
explicitly called out any exceptions that JSON requires.

Assuming this is a valid interpretation, this places a number of
requirements on a JSON implementation that are non-obvious by just
reading RFC 4627.  For example, from Unicode Standard (note: I'm using
the latest version at the time of this writing, 6.0), Chapter 3
Conformance, Section 3.4 Characters and Encoding
(http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf):

C2  A process shall not interpret a noncharacter code point as an
abstract character. [ed: this is C5 in v4.0. The text appears to be
identical.]

D14  Noncharacter: A code point that is permanently reserved for
internal use and that should never be interchanged. Noncharacters
consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆
[ed: base 16]) and the values U+FDD0..U+FDEF. [ed: this is D7b in v4.0.
The text appears to be identical.]

Unicode Standard (6.0), Chapter 2 General Structure, Section 2.13
Special Characters and Noncharacters - Special Noncharacter Code Points:

The Unicode Standard contains a number of code points that are
intentionally not used to represent assigned characters. These code
points are known as noncharacters. They are permanently reserved for
internal use and should never be used for open interchange of Unicode
text. For more information on noncharacters, see Section 16.7,
Noncharacters. [ed: have not compared this to v4.0]

Unicode Standard (6.0), Chapter 16 Special Areas and Format Characters,
Section 16.7 Noncharacters:

Applications are free to use any of these noncharacter code points
internally but should never attempt to exchange them. If a noncharacter
is received in open interchange, an application is not required to
interpret it in any way. It is good practice, however, to recognize it
as a noncharacter and to take appropriate action, such as replacing it
with U+FFFD replacement character, to indicate the problem in the text.
It is not recommended to simply delete noncharacter code points from
such text, because of the potential security issues caused by deleting
uninterpreted characters. [ed: have not compared this to v4.0]

---------
This means strings like "\ufffe", "\ufdd0", "\ud83f\udfff" are
"noncharacters", and a plain reading of the standard clearly implies
that they are in some way "invalid" (I quote the term because the Unicode
standard has a lot to say about how to deal with this).  While the
examples given are the \u escaped variety, it should be obvious that the
(same) code points U+FFFE, U+FDD0, U+1FFFF encoded in their UTF-*
representation are also "invalid".  In UTF-8, this would be <EF BF BE>,
<EF B7 90>, <F0 9F BF BF>.
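
The noncharacter definition (D14), the surrogate-pair arithmetic behind "\ud83f\udfff" and the UTF-8 byte sequences can all be checked mechanically. The following is a minimal Python sketch; the function names are hypothetical helpers, not part of any JSON library.

```python
def is_noncharacter(cp):
    """D14: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for n = 0..0x10."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_hex(cp):
    """UTF-8 encoding of a code point as space-separated hex bytes."""
    return " ".join(f"{byte:02X}" for byte in chr(cp).encode("utf-8"))


if __name__ == "__main__":
    pair = combine_surrogates(0xD83F, 0xDFFF)
    for cp in (0xFFFE, 0xFDD0, pair):
        print(f"U+{cp:04X} noncharacter={is_noncharacter(cp)} "
              f"utf8=<{utf8_hex(cp)}>")
```

Run as a script, this reports U+FFFE as <EF BF BE>, U+FDD0 as <EF B7 90> and U+1FFFF as <F0 9F BF BF>, all three flagged as noncharacters.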

Unicode Standard (6.0), Chapter 3 Conformance, Section 3.9 Unicode
Encoding Forms (http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf)
covers a lot of these details.  In particular, the section "Best
Practices for Using U+FFFD" gives details on using the special U+FFFD
replacement character to replace the "invalid" Unicode.  For example,
the \u escape sequence of '\ud83f\udfff' in a quoted string would be
replaced with a single U+FFFD.

It is important to note that there have been extensive changes to
section 3.9 between 4.0 and 6.0.  Some of these are due to various
security issues (http://www.unicode.org/faq/security.html).

Is the above the "generally agreed on" interpretation of how things
should be done?

Is it safe to say that a "strictly RFC 4627 conforming JSON
implementation" MUST also be "strictly Unicode Standard conforming" (at
least in terms of Chapter / Section 3 of the Unicode Standard,
"Conformance")?

Is there an opinion on whether or not JSON that is used for interchange
SHOULD NOT or MUST NOT contain "noncharacters"?  That is to say that a
JSON generator should/must not create JSON with noncharacters, and a
parser should/must either reject as invalid or replace such
noncharacters with U+FFFD?  There's technically a difference between
JSON used for interchange and JSON not used for interchange since the
Unicode Standard allows an implementation to use the noncharacters as
"internal, private" code points, but those characters should not be
present in the Unicode that (for some reasonable definition of) "leaves
the implementation".  Personally, I don't think such a distinction
should be made for JSON, or is really even meaningful, and all JSON
should/must be treated as "interchange".

The Unicode Standard, and in particular later versions of the standard,
for all practical purposes make it a requirement that "characters MUST
NOT be deleted".  One course of action is to simply not accept a string
and report an error, and another is to replace a bad or malformed
character with U+FFFD.  There are some very compelling security related
reasons for doing this.  Is there an opinion that a RFC 4627 JSON
implementation "MUST NOT arbitrarily delete characters" as well?  (This
is a somewhat complicated issue, see
http://www.unicode.org/faq/security.html and
http://www.unicode.org/reports/tr36/ for more info, in particular UTR#36
- Section 3 "Non-Visual Security Issues").
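
To make the "replace, do not delete" point concrete, here is a small Python sketch. The filter scenario (an upstream check that looked for "<script>" before the text was cleaned) is a hypothetical illustration in the spirit of UTR#36, and the function names are assumptions, not part of any existing library.

```python
def is_noncharacter(cp):
    """U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def delete_noncharacters(text):
    """The discouraged approach: silently drop noncharacters."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


def replace_noncharacters(text):
    """The recommended approach: substitute U+FFFD for each one."""
    return "".join("\ufffd" if is_noncharacter(ord(ch)) else ch
                   for ch in text)


if __name__ == "__main__":
    # Hypothetical input that slipped past a naive "<script>" filter.
    payload = "<scr\ufffeipt>"
    print("<script>" in payload)                         # False
    print("<script>" in delete_noncharacters(payload))   # True
    print("<script>" in replace_noncharacters(payload))  # False
```

Deletion joins the text on either side of the removed code point and so can create a string that an earlier check never saw; replacement keeps the two halves apart and leaves a visible marker.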



#1583 From: "Douglas Crockford" <douglas@...>
Date: Thu Feb 24, 2011 2:06 am
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
A receiver can do what it chooses to with the character codes it receives. If it
wants to delete them or reject them, that is its business. But a JSON channel
should not interfere with or bias the communication. It should faithfully
deliver what the sender sent, provided that it conforms to the JSON grammar.

If the sender wants to send characters that some consortium considers indecent,
and if the receiver wants to receive them, then that is their business.

#1584 From: "Douglas Crockford" <douglas@...>
Date: Thu Feb 24, 2011 2:18 am
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
When the informational RFC insists on Unicode, it is in the sense that the
encoding is not EBCDIC nor Big5 nor anything other than Unicode.

#1585 From: "johne_ganz" <john.engelhart@...>
Date: Fri Feb 25, 2011 7:44 pm
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
>
> A receiver can do what it chooses to with the character codes it receives. If
> it wants to delete them or reject them, that is its business.

Not if it's Unicode.  It is a common misconception that "Unicode" is just a set
of code points, like say ASCII or EBCDIC.  It is not.  With ASCII, you can
"delete" characters from a stream and it's still ASCII.  Code points in Unicode
have semantics, and deleting them can alter the meaning of the string in
surprising and unexpected ways.

Previous versions of the Unicode standard used to have a clause that permitted
the deleting of code points from a string (though it was not recommended, ala
SHOULD NOT).  Later versions of the Unicode standard do not permit this, and it
has been verboten to do so for some time.

Some of the most compelling reasons why deleting characters is forbidden are
covered by the section "Non-Visual Security Issues":
http://www.unicode.org/reports/tr36/#Canonical_Represenation .

There is no language in RFC 4627 that I can find that supports your
interpretation.  There is an awful lot of language in the Unicode standard that
unambiguously says that you cannot "delete" characters from a Unicode "string".
There are also compelling arguments in TR#36 for why arbitrarily deleting
characters is a huge mistake.

> But a JSON channel should not interfere with or bias the communication.

This statement is contrary to what you just said.  If it is deleting characters,
it is obviously interfering and biasing the communication.

A standard, and its interpretation, should strive to be unambiguous.  An
interpretation that boils down to "An implementation MAY interfere with or bias
the communication, but an implementation SHOULD NOT interfere with or bias the
communication" is meaningless and nonsensical.

> It should faithfully deliver what the sender sent, provided that it conforms
> to the JSON grammar.

Which must be encoded as Unicode.  Again, Unicode IS NOT, and MUST NOT be
treated as a stream of Unicode code points.  That's not Unicode.  I freely admit
that this is a belief that I once had.  However, after a few years of dealing
with low level Unicode string processing (where Unicode means "The Unicode
Standard"), I no longer hold this view.  It's much more complicated and much
more nuanced than people realize.

> If the sender wants to send characters that some consortium considers
> indecent, and if the receiver wants to receive them, then that is their
> business.

I don't have a problem with this.  I would have a problem with such a setup
claiming "strictly RFC 4627 conforming" (or some language implying 4627
conformance).

My specific point is this:  I strongly believe that RFC 4627 requires Unicode,
and by implication, processing said Unicode in a Unicode Standard conforming
way.  Therefore, in order to claim "RFC 4627 conformance", one must also process
and handle the JSON in a way that is also "Unicode Standard conforming" as well.

You don't HAVE to do this, obviously, but then you can no longer claim RFC 4627
conformance.

If I may make a suggestion:  perhaps an informal "JSON Best Practices" document
could be started that catalogs and records these types of things.  The document
would be totally non-normative, but would be a fantastic resource for those who
need to implement JSON parsers and generators.  It would also help ensure that
implementations converge on something that ensures they will interoperate more
reliably.  Since it would be non-normative, it wouldn't have any "requirements"
weight to it, but I can tell you such a document would have been a big help to
me.

#1586 From: "Douglas Crockford" <douglas@...>
Date: Fri Feb 25, 2011 8:00 pm
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
--- In json@yahoogroups.com, "johne_ganz" <john.engelhart@...> wrote:
>
> --- In json@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
> >
> > A receiver can do what it chooses to with the character codes it receives.
> > If it wants to delete them or reject them, that is its business.
>
> Not if it's Unicode.  It is a common misconception that "Unicode" is just a
> set of code points, like say ASCII or EBCDIC.

For JSON's purpose, Unicode is just a set of code points. It gives some, such as
{ and }, special meaning. But in strings, everything should simply be passed
through.

> Previous versions of the Unicode standard used to have a clause that permitted
> the deleting of code points from a string (though it was not recommended, ala
> SHOULD NOT).  Later versions of the Unicode standard do not permit this, and it
> has been verboten to do so for some time.
>
> Some of the most compelling reasons why deleting characters is forbidden are
> covered by the section "Non-Visual Security Issues":
> http://www.unicode.org/reports/tr36/#Canonical_Represenation .
>
> There is no language in RFC 4627 that I can find that supports your
> interpretation.  There is an awful lot of language in the Unicode standard that
> unambiguously says that you cannot "delete" characters from a Unicode "string".
> There are also compelling arguments in TR#36 for why arbitrarily deleting
> characters is a huge mistake.
>
> > But a JSON channel should not interfere with or bias the communication.
>
> This statement is contrary to what you just said.  If it is deleting
> characters, it is obviously interfering and biasing the communication.

By receiver I mean the program that ultimately receives the message. It can
interpret it and process it or damage it or ignore it as it will. What it does
with the data is none of my business. The JSON channel itself must do none of
those things.

> A standard, and its interpretation, should strive to be unambiguous.  An
> interpretation that boils down to "An implementation MAY interfere with or bias
> the communication, but an implementation SHOULD NOT interfere with or bias the
> communication" is meaningless and nonsensical.
>
> > It should faithfully deliver what the sender sent, provided that it conforms
> > to the JSON grammar.
>
> Which must be encoded as Unicode.  Again, Unicode IS NOT, and MUST NOT be
> treated as a stream of Unicode code points.  That's not Unicode.  I freely admit
> that this is a belief that I once had.  However, after a few years of dealing
> with low level Unicode string processing (where Unicode means "The Unicode
> Standard"), I no longer hold this view.  It's much more complicated and much
> more nuanced than people realize.
>
> > If the sender wants to send characters that some consortium considers
> > indecent, and if the receiver wants to receive them, then that is their
> > business.
>
> I don't have a problem with this.  I would have a problem with such a setup
> claiming "strictly RFC 4627 conforming" (or some language implying 4627
> conformance).
>
> My specific point is this:  I strongly believe that RFC 4627 requires Unicode,
> and by implication, processing said Unicode in a Unicode Standard conforming
> way.  Therefore, in order to claim "RFC 4627 conformance", one must also process
> and handle the JSON in a way that is also "Unicode Standard conforming" as well.
>
> You don't HAVE to do this, obviously, but then you can no longer claim RFC
> 4627 conformance.
>
> If I may make a suggestion:  perhaps an informal "JSON Best Practices"
> document could be started that catalogs and records these types of things.  The
> document would be totally non-normative, but would be a fantastic resource for
> those who need to implement JSON parsers and generators.  It would also help
> ensure that implementations converge on something that ensures they will
> interoperate more reliably.  Since it would be non-normative, it wouldn't have
> any "requirements" weight to it, but I can tell you such a document would have
> been a big help to me.


Tell you what. If you ever encounter a real problem, we will deal with that.

#1587 From: "johne_ganz" <john.engelhart@...>
Date: Fri Feb 25, 2011 11:09 pm
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
>
> --- In json@yahoogroups.com, "johne_ganz" <john.engelhart@> wrote:
> >
> > --- In json@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
> > >
> > > A receiver can do what it chooses to with the character codes it receives.
> > > If it wants to delete them or reject them, that is its business.
> >
> > Not if it's Unicode.  It is a common misconception that "Unicode" is just a
> > set of code points, like say ASCII or EBCDIC.
>
> For JSON's purpose, Unicode is just a set of code points.

Not according to RFC 4627 it isn't.  Section 3, Encoding, "JSON text SHALL be
encoded in Unicode.", where SHALL is interpreted via RFC 2119 (i.e., SHALL is
synonymous with MUST).

I appreciate that your interpretation may have been your original intent, but
the scope of the language in the standard is far, far greater than "JSON text
SHALL be interpreted as a stream of disjoint Unicode code points.", which is
what you are arguing that the standard means.

Unless you can make a compelling argument with language from the RFC 4627
standard, the standard clearly and plainly says that the JSON text is encoded in
Unicode.  This means that the text must conform to the Unicode standard, and
its rules for processing and handling text MUST (via the use of SHALL in RFC
4627) be followed.


> By receiver I mean the program that ultimately receives the message. It can
> interpret it and process it or damage it or ignore it as it will. What it does
> with the data is none of my business. The JSON channel itself must do none of
> those things.

Surely you realize that in practice, this is not the way that things are done. 
All of the JSON libraries are effectively "part of the JSON channel".

There is a clear demarcation point where a piece of text has ceased to be JSON
and has (usually) become an instantiated data structure in the host language.

How and what the "language" does with the data is not relevant to RFC 4627.  The
"language" may manipulate the JSON data, examining keys, manipulating them in
any way it chooses.  But at this point, it very clearly has ceased to be "JSON".

Every JSON implementation that is in the form of a library for a host language
that I'm aware of could be interpreted to be "the program that ultimately
receives the message".  The libraries parse the JSON and transliterate it in to
a form useable by the host language.  What the host language, or a program
written by someone to enumerate or manipulate the data structure instantiated
from the original JSON, does with that data is obviously outside the scope of
RFC 4627.

My pedantic point is: A JSON implementation, in the form of a library that
provides bindings between a host language and JSON (of which there are many),
MUST NOT arbitrarily delete characters in the original JSON.  Furthermore, any
such implementation MUST interpret the original JSON text in accordance with the
Unicode Standard.  Just like RFC 4627 gives a grammar and rules for how to
interpret JSON, the Unicode Standard has rules for how to interpret text encoded
as Unicode.  Unicode is not just a simple set of code points.

Another issue is normalization.  In particular, the way normalization is handled
for the "key" portion of an "object" (i.e., {"key": "value"}) can dramatically
alter the meaning and contents of the object.  For example:

{
"\u212b": "one",
"\u0041\u030a": "two",
"\u00c5": "three"
}

Are these three keys distinct?  Should there be a requirement that they MUST be
handled and interpreted such that they are distinct?  Does that requirement
extend past the "channel" demarcation point (i.e., not a JSON library or
communication channel used to interchange the JSON between two hosts) to the
"host language"?

In case it is not obvious, under the rules of Unicode NFC (Normalization Form
C), all three of the keys above will become "\u00c5" after NFC processing.
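
As a quick check of the claim above, here is a minimal sketch using Python's standard json and unicodedata modules (the choice of Python is an assumption made for illustration; the thread itself is language-neutral). It parses the example object and then applies NFC to each key.

```python
import json
import unicodedata

# The example object from the message, written with JSON \u escapes.
text = r'{"\u212b": "one", "\u0041\u030a": "two", "\u00c5": "three"}'

# Python's json module compares keys code point by code point and does not
# normalize, so all three keys survive parsing as distinct entries.
parsed = json.loads(text)
key_count = len(parsed)

# After NFC, every key collapses to the single code point U+00C5.
nfc_keys = {unicodedata.normalize("NFC", key) for key in parsed}
```

With this parser the object has three keys before normalization and only one distinct key after it, which is exactly the ambiguity the question is about.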

A first-order approximation would suggest that a JSON implementation "should"
use the precomposed form for keys, and that for objects containing
non-precomposed keys which duplicate other keys once converted to their
precomposed form, the behavior is undefined.
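
One way an implementation could surface this case instead of leaving it undefined is to reject such objects at parse time. The following is only a sketch under that assumption: reject_nfc_collisions and loads_strict are hypothetical names, and nothing in RFC 4627 requires this behavior. It relies on the object_pairs_hook parameter of Python's json.loads.

```python
import json
import unicodedata

def reject_nfc_collisions(pairs):
    """Hypothetical hook: fail if two keys are equal after NFC."""
    seen = {}
    for key, _value in pairs:
        nfc = unicodedata.normalize("NFC", key)
        if nfc in seen:
            raise ValueError(
                f"keys {seen[nfc]!r} and {key!r} are equal after NFC")
        seen[nfc] = key
    # Keys are returned exactly as received; nothing is normalized here.
    return dict(pairs)

def loads_strict(text):
    # object_pairs_hook receives every object's (key, value) pairs in order.
    return json.loads(text, object_pairs_hook=reject_nfc_collisions)
```

The hook leaves the data untouched, so a caller that does not care about normalization can keep using plain json.loads.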

Again, this is another point where the use of Unicode introduces an awful lot of
non-obvious dependencies.  The Unicode standard has a lot to say about what it
means for two strings to "compare equal", and since JSON specifies what is
essentially a key/value hash table, it is critically important to define what
"equal" means for a key.  If the keys were ASCII or binary, this would probably
be a non-issue, but it's a pretty big one when you're dealing with Unicode.

> Tell you what. If you ever encounter a real problem, we will deal with that.

This is a rather snarky comment, and to be blunt, unprofessional and unfair.

Every point I've raised here is something that an implementor of a JSON library
will likely encounter.  As an implementor of such a library (for Objective-C),
everything I've raised here is something that took an enormous amount of time
and consideration.

In my case, I've had to deal with the subtle nuances of what happens to a
Unicode string when I parse it and then hand that parsed string off to another
library to instantiate a string object.  I have no control over how this
external library (a combination of Foundation and Core Foundation) deals with or
interprets various aspects of the Unicode Standard.  For the sake of argument,
if this external library automatically precomposes all strings it instantiates,
and I have to use those instantiated strings as the keys in an NSDictionary (the
equivalent of a JSON object), I've got some problems.

Your snarky comment ignores the real world complexities that one faces when
attempting to create a "RFC 4627 compliant" JSON implementation, at least if one
is trying to do so "the right way" as opposed to a quick hack JSON
implementation.

For someone who is creating a JSON library or some other form of a JSON
implementation, the corner cases are usually far more important than the
obvious, common case.

#1588 From: Tatu Saloranta <tsaloranta@...>
Date: Fri Feb 25, 2011 11:19 pm
Subject: Re: Re: JSON and the Unicode Standard
cowtowncoder
 
On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
>
>
> --- In json@yahoogroups.com, "Douglas Crockford" <douglas@...> wrote:
>>
>> --- In json@yahoogroups.com, "johne_ganz" <john.engelhart@> wrote:
>> >
>> > --- In json@yahoogroups.com, "Douglas Crockford" <douglas@> wrote:
>> > >
>> > > A receiver can do what it chooses to with the character codes it
>> > > receives. If it wants to delete them or reject them, that is its business.
>> >
>> > Not if it's Unicode.  It is a common misconception that "Unicode" is just a
>> > set of code points, like say ASCII or EBCDIC.
>>
>> For JSON's purpose, Unicode is just a set of code points.
>
> Not according to RFC 4627 it isn't.  Section 3, Encoding, "JSON text SHALL be
> encoded in Unicode.", where SHALL is interpreted via RFC 2119 (i.e., SHALL is
> synonymous with MUST).

Do you have an ACTUAL problem worth discussing, or is this just from a
purity standpoint?

-+ Tatu +-

#1589 From: Tatu Saloranta <tsaloranta@...>
Date: Fri Feb 25, 2011 11:45 pm
Subject: Re: Re: JSON and the Unicode Standard
cowtowncoder
 
On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
...

> Unicode is not just a simple set of code points.

This is a true statement, although the more practical question seems to
be what the practical relationship is between JSON and the Unicode
specification.
I think your suggestions for clarifying some parts do make sense,
although it may be hard to reconcile the basic differences between full
Unicode support and JSON's goal of simplicity.

>
> Another issue is normalization.  In particular, the way normalization is
> handled for the "key" portion of an "object" (i.e., {"key": "value"}) can
> dramatically alter the meaning and contents of the object.  For example:
>
> {
> "\u212b": "one",
> "\u0041\u030a": "two",
> "\u00c5": "three"
> }
>
> Are these three keys distinct?  Should there be a requirement that they MUST
> be handled and interpreted such that they are distinct?  Does that requirement
> extend past the "channel" demarcation point (i.e., not a JSON library or
> communication channel used to interchange the JSON between two hosts) to the
> "host language"?
>
> In case it is not obvious, under the rules of Unicode NFC (Normalization Form
> C), all three of the keys above will become "\u00c5" after NFC processing.

For what it is worth, I have not seen a single JSON parser that
does such normalization; and the only XML parser I recall even trying
proper Unicode normalization was XOM. This is not an
argument against proper handling, but rather an observation regarding
how much of a practical issue it seems to be.
Nor have I seen feature requests to support normalization during the
time I have spent maintaining XML and JSON parser/generator
implementations (XOM implements it because its author is very ambitious
about supporting standards, which is a very respectable achievement).
Do others have different experiences?

So to me it seems that the most likely high-level clarifications regarding
normalization would be:

(a) Whether to do normalization or not is up to the implementation
(normalization is left out of scope, on purpose), or
(b) Say that with JSON no normalization is done (which would be
more at odds with the Unicode spec)

Why? Just because I see very little chance of anything more ambitious
having an effect on implementations (beyond the small number that are
willing to tackle such complexity). While it would seem wrong to punt on
the issue, there is the practical question of whether a full solution
would matter.
My guess is that about the last thing implementers would want is a
mandate to support full Unicode 4.0 (and above) normalization rules.
It would just mean that there would be the specification in one
corner, and implementations in the other, practically none of which
would be compliant.

...
> Your snarky comment ignores the real world complexities that one faces when
> attempting to create a "RFC 4627 compliant" JSON implementation, at least if one
> is trying to do so "the right way" as opposed to a quick hack JSON
> implementation.

For better or worse, most JSON implementations fall into the quick-hack
category; which is just to say that the chances of getting a significant
number of implementations to do much more than decode code points
correctly are vanishingly small. Even getting them to do basic
decoding is quite a challenge in itself.

> For someone who is creating a JSON library or some other form of a JSON
> implementation, the corner cases are usually far more important than the
> obvious, common case.

True.

I think your suggestions of how this could be clarified make sense.

-+ Tatu +-

#1590 From: David Graham <david.malcom.graham@...>
Date: Sat Feb 26, 2011 12:35 am
Subject: Re: Re: JSON and the Unicode Standard
dgraham...
 
I had the normalization question while writing json-stream in Ruby as well.
I decided the parser shouldn't do Unicode normalization for the following
reasons:


1. The json and yajl-json Ruby parsers and the popular org.json Java parser
do not do normalization.


2. CouchDB does not do normalization.  I wrote json-stream to handle CouchDB
documents so this was my primary use case.


3. Ruby and Java consider combined characters to be unequal to their single
codepoint counterparts.  The é character, for example, can be the single
codepoint \u00e9 (2 bytes in UTF-8) or the two-codepoint sequence
\u0065\u0301 (3 bytes in UTF-8).


In Ruby, "\u00e9" == "\u0065\u0301" => false.


So, given a Ruby Hash (or Java Map) like this:


{"\u00e9" => 1, "\u0065\u0301" => 2}

=> {"é"=>1, "é"=>2}


A JSON serializer that performed Unicode normalization on this Hash object
would corrupt the data in some way.  The two keys would become equal, so
which value gets serialized: 1 or 2?


In my opinion, this means JSON parsers and generators must not perform
normalization.  They must respect the data stored in the JSON byte stream as
is.


It's easy for the application to normalize data before handing it to the
JSON library for serialization, though.  In Ruby, we can do:


ActiveSupport::Multibyte::Chars.new("\u0065\u0301").normalize(:c)
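
The same application-side approach can be sketched in Python with the standard unicodedata module. The helper name nfc is made up for this illustration, and raising on a key collision is an assumption about what an application would want; the point is only that the decision is made before the JSON library sees the data.

```python
import json
import unicodedata

def nfc(obj):
    """Return a copy of obj with every string (keys included) in NFC."""
    if isinstance(obj, str):
        return unicodedata.normalize("NFC", obj)
    if isinstance(obj, list):
        return [nfc(item) for item in obj]
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            new_key = nfc(key)
            if new_key in out:
                # Refuse to choose between two values silently.
                raise ValueError(f"keys collide after NFC: {new_key!r}")
            out[new_key] = nfc(value)
        return out
    return obj  # numbers, booleans and None pass through unchanged

serialized = json.dumps(nfc({"e\u0301": 1}))
```

Here the application, not the serializer, answers the "1 or 2?" question: a hash holding both forms of the key is rejected rather than quietly merged.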


I hope that helps.


David


#1591 From: "Douglas Crockford" <douglas@...>
Date: Sat Feb 26, 2011 12:44 am
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
--- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:

> In my opinion, this means JSON parsers and generators must not perform
> normalization.  They must respect the data stored in the JSON byte stream as
> is.

I agree.

#1592 From: "johne_ganz" <john.engelhart@...>
Date: Sat Feb 26, 2011 4:01 am
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
>
> On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
> ...
>
> > Unicode is not just a simple set of code points.
>
> This is true statement, although the more practical question seems to
> be what is the practical relationship of JSON with Unicode
> specification.

True.  It would seem, at least to me, that this is one of those nuanced points
that either

a) has not been given proper consideration by (ostensible) Unicode experts, or

b) needs revisiting because the Unicode Standard has evolved since the
publication of RFC 4627.

> I think your suggestions for clarifying some parts do make sense,
> although it may be hard to reconcile basic diffences between full
> Unicode support, and goals of simplicity for JSON.

I'm all for simplicity, and for a "less is more" philosophy.  Unfortunately, RFC
4627 allows for two strictly RFC 4627 compliant implementations to "generate"
wildly different results (where "generate" here means the JSON is parsed and
interpreted in such a way that the two implementations have what reasonable
people would consider "very different semantics").

Numbers are another corner case.  RFC 4627 only describes how to parse a decimal
representation of numbers (both integer and floating-point).  In practice this
means that a strictly conforming RFC 4627 JSON implementation can use an 8, 16,
32, or 64 bit "native primitive" to represent integers.  Yet it is perfectly
valid JSON to have integer numbers that require 128 or 256 bits to represent
them.  To me, a serialization format such as JSON should make some effort to
ensure that the values contained within it will be properly interpreted by any
and all JSON implementations.  I've seen several JSON implementations that use a
32-bit size C99 primitive type to represent the parsed numbers.  This is a
problem for anything that wants to parse contemporary Twitter JSON, as the IDs
are > 2^32 at this point.  The desire to "keep it simple" has to be balanced
against real world practical needs: when your IDs exceed 2^32, it is a
legitimate question to ask "Are JSON implementations going to handle this value
correctly?"  If not 2^32, then when?
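
To make the width question concrete, here is a minimal sketch that emulates a parser limited to a signed N-bit integer. Python's own integers are unbounded, so the limit is imposed by hand through the parse_int parameter of json.loads; make_int_parser and the sample ID are invented for the illustration.

```python
import json

def make_int_parser(bits):
    """Build a parse_int callback that rejects values outside signed `bits`."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1

    def parse(text):
        value = int(text)
        if not lo <= value <= hi:
            raise OverflowError(
                f"{text} does not fit in a signed {bits}-bit integer")
        return value

    return parse

# A made-up ID that is larger than 2^32.
doc = '{"id": 41234567890}'

wide = json.loads(doc, parse_int=make_int_parser(64))
```

The same document parses cleanly with the 64-bit limit and fails with make_int_parser(32). A C parser that stored the value in a 32-bit type without such a check would not fail; it would hand back a wrong number.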

> >
> > Another issue is normalization.  In particular, the way normalization is
> > handled for the "key" portion of an "object" (i.e., {"key": "value"}) can
> > dramatically alter the meaning and contents of the object.  For example:
> >
> > {
> > "\u212b": "one",
> > "\u0041\u030a": "two",
> > "\u00c5": "three"
> > }
> >
> > Are these three keys distinct?  Should there be a requirement that they MUST
> > be handled and interpreted such that they are distinct?  Does that requirement
> > extend past the "channel" demarcation point (i.e., not a JSON library or
> > communication channel used to interchange the JSON between two hosts) to the
> > "host language"?
> >
> > In case it is not obvious, under the rules of Unicode NFC (Normalization
> > Form C), all three of the keys above will become "\u00c5" after NFC processing.
>
> For what it is worth, I have not seen a single JSON parser that would
> do such normalization; and the only XML parser I recall even trying
> proper Unicode code point normalization was XOM. This is not an
> argument against proper handling, but rather an observation regarding
> how much of a practical issue it seems to be.

I have not seen a JSON implementation / parser that does such normalization.

On the other hand, I very strongly suspect that whether or not such
normalization is taking place is not up to the writer of that parser.  In my
particular case (JSONKit, for Objective-C), I pass the parsed JSON String to the
NSString class to instantiate an object.

I have ZERO control over how NSString interprets or manipulates the parsed
JSON String on its way to becoming the instantiated object that is ostensibly
the same as the original JSON String used to create it.  It could be that
NSString decides that the instantiated object is always converted to its
precomposed form.  Objective-C is flexible enough that someone might decide to
swizzle in some logic at run time that forces all strings to be precomposed
before being handed off to the main NSString instantiation method.

> Nor have I seen feature requests to support normalization (XOM
> implements it because its author is very ambitious wrt supporting
> standards, it is very respectable achievement), during time I have
> spend maintaining XML and JSON parser/generator implementations.
> Do others have difference experiences?

I don't have a particular opinion on the matter one way or the other, other than
to highlight the point that in many practical, real-world situations, whether or
not such things take place may not be under the control of the JSON parser.

I also suspect that it's one of those things that most people haven't really
given a whole lot of consideration to: they just hand the parsed string over to
"the Unicode string handling code", and that's that.  Most people may not
realize that such string handling code may subtly alter the original Unicode
text as a result (e.g., by precomposing the string).

> So to me it seems that most likely high-level clarifications regarding
> normalization aspects would be:
>
> (a) Whether to do normalization or not is up to implementation
> (normalization is left out of scope, on purpose), or
> (b) Say that with JSON no normalization would be done (which would be
> more at odds with unicode spec)
>
> Why? Just because I see very little chance of anything more ambitious
> having effect on implementations (beyond small number that are willing
> to tackle such complexity). While it would seem wrong to punt the
> issue, there is the practical question of whether full solution would
> matter.

I can guarantee you that the practical question of whether a full solution would
matter will be answered the first time someone exploits this behavior and it
results in a major security fiasco.

Then, with 20/20 hindsight, the question will be "Why didn't anyone address
the behavior that allowed two keys that were not bit for bit identical, but
became identical after conversion to their precomposed form, to slip past
security checks that assumed everything was already in precomposed form?"
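
The failure mode described above can be sketched in a few lines. Everything in this example is hypothetical: the protected key, the naive check, and a downstream storage layer that precomposes strings. It is an illustration of the pattern, not a report of a real vulnerability.

```python
import unicodedata

# Hypothetical protected key, stored in precomposed form ("rôle").
BLOCKED_KEYS = {"r\u00f4le"}

def naive_is_allowed(key):
    # Assumes every incoming key is already precomposed.
    return key not in BLOCKED_KEYS

def safe_is_allowed(key):
    # Normalizes first, so equivalent spellings are treated alike.
    return unicodedata.normalize("NFC", key) not in BLOCKED_KEYS

def store(key):
    # Hypothetical downstream layer that precomposes every string it keeps.
    return unicodedata.normalize("NFC", key)

# Same text as the blocked key, but spelled with a combining circumflex.
attack = "ro\u0302le"

slipped_through = naive_is_allowed(attack) and store(attack) in BLOCKED_KEYS
```

The naive check passes the decomposed spelling, and the storage layer then turns it into the protected key. Normalizing before the comparison closes the gap, provided the check and the storage layer agree on the normalization form.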

Unfortunately, the use of Unicode coupled with the fact that most JSON
implementations are dependent on external code for their Unicode support means
that this is an extremely non-trivial issue.  I can't offer a simple solution
at the moment, only the observation that the problem exists.

> My guess is that about last thing I implements would want was a
> mandate to support full Unicode 4.0 (and above) normalization rules.
> It would just mean that there would be the specification in one
> corner; and implementations, practically none of which would be
> compliant.

You really ought to read:

http://www.unicode.org/faq/security.html

http://www.unicode.org/reports/tr36/#Canonical_Represenation

Microsoft Security Bulletin (MS00-078): Patch Available for 'Web Server Folder
Traversal' Vulnerability
(http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx,
http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884)

Creating Arbitrary Shellcode In Unicode Expanded Strings
(http://www.net-security.org/article.php?id=144)

There's a long history of "Those little Unicode details aren't really important"
causing huge security problems later on.

#1593 From: "johne_ganz" <john.engelhart@...>
Date: Sat Feb 26, 2011 5:08 am
Subject: Re: JSON and the Unicode Standard
johne_ganz
 
--- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:
>
> 3. Ruby and Java consider combined characters to be unequal to their single
> codepoint counterparts.  The é character, for example, can be a 2 byte
> single codepoint form of \u00e9 or a 3 byte two codepoint form of
> \u0065\u0301.
>
>
> In Ruby, "\u00e9" == "\u0065\u0301" => false.
>
>
> So, given a Ruby Hash (or Java Map) like this:
>
>
> {"\u00e9" => 1, "\u0065\u0301" => 2}
>
> => {"é"=>1, "é"=>2}
>
>
> A JSON serializer that performed Unicode normalization on this Hash object
> would corrupt the data in some way.  The two keys would become equal, so
> which value gets serialized: 1 or 2?

This is my point.  It can happen on the serialization side, but (at least in
my opinion) it is much more likely to happen on the deserialization side.

What happens to a JSON deserializer that relies on external libraries (say ICU),
or, in object oriented programming, on a "string" class to handle all this, or
where string handling is in some other way completely out of the control of the
person writing the JSON deserializer?

How many people do you think actually checked to make sure that these external
code dependencies offer a guarantee that they will not mutilate or otherwise
transform the original string in some Unicode Equivalence compatible way?

From the Unicode Standard, Section 2.12, Equivalent Sequences and Normalization:

If an application or user attempts to distinguish between canonically equivalent
sequences, as shown in the first example in Figure 2-23, there is no guarantee
that other applications would recognize the same distinctions.   To prevent the
introduction of interoperability problems between applications, such
distinctions must be avoided wherever possible. Making distinctions between
compatibly equivalent sequences is less problematical. However, in restricted
contexts, such as the use of identifiers, avoiding compatibly equivalent
sequences reduces possible security issues. See Unicode Technical Report #36,
"Unicode Security Considerations."

In other words, the Unicode Standard says that the behavior that you are
observing is not guaranteed.  This means that there exists the very real
possibility that a JSON implementation that depends on external code to handle
strings (i.e., ICU or a "string" object in object oriented languages) cannot
reasonably ensure that said code does not convert the string argument into a
Unicode Standard equivalent form.

The practical implication of this is that the behavior that you are seeing is
contrary to the requirements and expectations set forth in the Unicode Standard.
It seems reasonable to assume that external libraries which adhere to the
Unicode Standard, and which a JSON implementation relies on, are under no
obligation whatsoever to treat a Unicode string in the way that you have
described.

> In my opinion, this means JSON parsers and generators must not perform
> normalization.  They must respect the data stored in the JSON byte stream as
> is.

It's trivial for a parser to respect the data stored in the JSON byte stream.

While I'm sure there are exceptions to this, I'd be willing to bet that the
majority of JSON parsers hand the parsed string off to some "create a string"
piece of code.  It seems reasonable to assume that this "create a string" piece
of code is Unicode aware.  These code bases are probably disjoint, with the
string handling code focused on Unicode Standard conformance, and said
conformance does not require that it "respect [the original string]".
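
Whether a given "create a string" step preserves the original code points is something a parser author can at least test for. The sketch below assumes the step can be modeled as a function from parsed text to a string object; passthrough and precomposing are stand-ins invented for the illustration and do not describe how NSString or ICU actually behave.

```python
import unicodedata

# Canonically equivalent spellings that a normalizing layer would rewrite.
SAMPLES = ["\u212b", "A\u030a", "\u00c5", "e\u0301"]

def code_points(text):
    return [ord(ch) for ch in text]

def preserves_code_points(make_string, samples=SAMPLES):
    """True if make_string returns every sample with identical code points."""
    return all(code_points(make_string(s)) == code_points(s) for s in samples)

def passthrough(raw):
    # Stand-in for a string constructor that keeps its input as is.
    return raw

def precomposing(raw):
    # Stand-in for a string constructor that silently applies NFC.
    return unicodedata.normalize("NFC", raw)
```

A check like this only covers the samples it is given and the library version it runs against, so it shows what was observed rather than what is guaranteed.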

In fact, for my parser (JSONKit), which is Objective-C based and uses NSString
to represent the JSON String objects, it is not practical for me to create a
JSON parser that "respects the data stored in the JSON byte stream".  The
NSString class makes no such guarantees in its documentation, nor does the
Unicode Standard.  It would be extremely non-trivial for me to meet a "respects
the data stored in the JSON byte stream" requirement, at least in the sense that
the behavior is deterministic.

#1594 From: Mark Slater <mark.slater@...>
Date: Sat Feb 26, 2011 11:05 am
Subject: Re: Re: JSON and the Unicode Standard
markosslater
 
Regarding the handling of numbers, the RFC doesn't appear to make any mention of
native representations of numbers - in fact, it only specifies what constitutes
a number in JSON. This maps onto the set of real numbers plus -0, and some
syntactic sugar in the form of scientific notation. I can't see anything in the
grammar that actually limits it to numbers with a finite decimal expansion.

The RFC does state that parsers can limit the numbers they accept to a specified
range of their choice, just as they can impose a limit on the size of strings,
if they want. In practice, some libraries apply unwritten limits to the range of
numbers along the lines of "the range (set?) of numbers that can be represented
as a Java double." Personally, I think such restrictions greatly reduce the
usefulness of a parser, because of examples like Twitter IDs, but all parsers
I'm aware of impose some limit, even if it is something like, "the range of
numbers that, when represented as a string, fit in the memory available to the
parser". I think it is absolutely valid to ask what range of numbers a
particular parser supports, but the RFC explicitly allows such limits.
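
As an illustration of the "Java double" limit, this sketch forces Python's json module to store every integer as a double through the parse_int parameter, and compares that with an exact parse. The document and its values are made up for the example.

```python
import json
from decimal import Decimal

# 9007199254740993 is 2^53 + 1, the first integer a double cannot hold exactly.
doc = '{"id": 9007199254740993, "price": 0.1000000000000000000001}'

# Emulates a parser that represents all numbers as IEEE 754 doubles.
as_double = json.loads(doc, parse_int=float)

# Keeps integers as unbounded ints and fractions as exact decimals.
exact = json.loads(doc, parse_float=Decimal)

id_changed = int(as_double["id"]) != exact["id"]
```

Both parses accept the same valid JSON, but the double-based one returns a different ID, which is why the supported range is worth documenting per parser.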

Mark


On 26 Feb 2011, at 04:01, "johne_ganz" <john.engelhart@...> wrote:

> --- In json@yahoogroups.com, Tatu Saloranta <tsaloranta@...> wrote:
> >
> > On Fri, Feb 25, 2011 at 3:09 PM, johne_ganz <john.engelhart@...> wrote:
> > ...
> >
> > > Unicode is not just a simple set of code points.
> >
> > This is true statement, although the more practical question seems to
> > be what is the practical relationship of JSON with Unicode
> > specification.
>
> True. It would seem, at least to me, that this is one of those nuanced points
that either
>
> a) Has not been given the proper consideration by Unicode (ostensibly)
experts.
>
> b) The Unicode standard has evolved in such a way since the publication of RFC
4627 that it may require revisiting the issue.
>
> > I think your suggestions for clarifying some parts do make sense,
> > although it may be hard to reconcile basic diffences between full
> > Unicode support, and goals of simplicity for JSON.
>
> I'm all for simplicity, and for a "less is more" philosophy. Unfortunately,
RFC 4627 allows for two strictly RFC 4627 compliant implementations to
"generate" wildly different results (were "generate" here means the JSON is
parsed and interpreted in such a way that the two implementations have what
reasonable people would consider "very different semantics").
>
> Numbers are another corner case. 4627 only describes how to parse a decimal
representation of numbers (both integer and floating-point). In practice this
means that a strictly conforming RFC 4627 JSON implementation can use a 8, 16,
32, or 64 bit "native primitive" to represent integers. It's perfectly valid
JSON to have integer numbers that require 128 or 256 bits in order to represent
them. To me, a serialization format such as JSON should make some effort to
ensure that the values contained within it will be properly interpreted by any
and all JSON implementations.  I've seen several JSON implementations that use a
32-bit size C99 primitive type to represent the parsed numbers. This is a
problem for anything that wants to parse contemporary Twitter JSON as the ID's
are > 2^32 at this point. The desire to "keep it simple" has to be balanced
against real world practical needs- when your ID's exceed 2^32, it is a
legitimate question to ask "Are JSON implementations going to handle this value
correctly?" If not 2^32, then when?
>
> > >
> > > Another issue is normalization.  In particular, the way normalization is
handled for the "key" portion of an "object" (i.e., {"key": "value"}) can
dramatically alter the meaning and contents of the object.  For example:
> > >
> > > {
> > > "\u212b": "one",
> > > "\u0041\u030a": "two",
> > > "\u00c5": "three"
> > > }
> > >
> > > Are these three keys distinct?  Should there be a requirement that they
MUST be handled and interpreted such that they are distinct?  Does that
requirement extend past the "channel" demarcation point (i.e., not a JSON
library or communication channel used to interchange the JSON between two hosts)
to the "host language"?
> > >
> > > In case it is not obvious, under the rules of Unicode NFC (Normalization
Form C), all three of the keys above will become "\u00c5" after NFC processing.
> >
> > For what it is worth, I have not seen a single JSON parser that would
> > do such normalization; and the only XML parser I recall even trying
> > proper Unicode code point normalization was XOM. This is not an
> > argument against proper handling, but rather an observation regarding
> > how much of a practical issue it seems to be.
>
> I have not seen a JSON implementation / parser that does such normalization.
>
> On the other hand, I very strongly suspect that whether or not such
normalization is taking place is not up to the writer of that parser. In my
particular case (JSONKit, for Objective-C), I pass the parsed JSON String to the
NSString class to instantiate an object.
>
> I have ZERO control over what and how NSString interprets or manipulates the
parsed JSON String that finally becomes the instantiated object that ostensibly
the same as the original JSON String used to create it. It could be that
NSString decides that the instantiated object is always converted to its
precomposed form. Objective-C is flexible enough where someone might decide to
swizzle in some logic at run time that forces all strings to be precomposed
before being handed off to the main NSString instantiation method.
>
> > Nor have I seen feature requests to support normalization (XOM
> > implements it because its author is very ambitious wrt supporting
> > standards, it is very respectable achievement), during time I have
> > spend maintaining XML and JSON parser/generator implementations.
> > Do others have difference experiences?
>
> I don't have a particular opinion on the matter one way or the other other
than to highlight the point that in many practical, real-world situations,
whether or not such things take place may not be under the control of the JSON
parser.
>
> I also suspect that it's one of those things that most people haven't really
given a whole lot of consideration to- they just had the parsed string over to
"the Unicode string handling code", and that's that. Most people may not realize
that such string handling code may subtly alter the original Unicode text as a
result (ala precomposing the string).
>
> > So to me it seems that most likely high-level clarifications regarding
> > normalization aspects would be:
> >
> > (a) Whether to do normalization or not is up to implementation
> > (normalization is left out of scope, on purpose), or
> > (b) Say that with JSON no normalization would be done (which would be
> > more at odds with unicode spec)
> >
> > Why? Just because I see very little chance of anything more ambitious
> > having an effect on implementations (beyond the small number that are willing
> > to tackle such complexity). While it would seem wrong to punt the
> > issue, there is the practical question of whether a full solution would
> > matter.
>
> I can guarantee you that the practical question of whether a full solution
would matter will be answered the first time someone exploits it in a security
vulnerable way that results in a major security fiasco.
>
> Then it will be with 20/20 hindsight, and the question will be "Why didn't
anyone address (this behavior) that allowed two keys that were not bit for bit
identical, but became identical after converting them to their precomposed form,
and the security checks allowed the decomposed form through because it assumed
that everything was in precomposed form?"
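
A minimal sketch of the scenario described above, assuming Python 3 and its standard json and unicodedata modules (neither is mentioned in the thread). The key "café", the BLOCKED_KEYS set, and the is_allowed check are hypothetical names chosen for illustration:

```python
import json
import unicodedata

# Two spellings of the same visible key:
# precomposed U+00E9 vs. "e" followed by combining U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# Hypothetical security check that assumes every key is already precomposed.
BLOCKED_KEYS = {precomposed}

def is_allowed(key: str) -> bool:
    return key not in BLOCKED_KEYS

# The decomposed key survives serialization and parsing unchanged.
doc = json.loads(json.dumps({decomposed: "payload"}))
key = next(iter(doc))

passed_check = is_allowed(key)                       # decomposed form slips through
normalized_key = unicodedata.normalize("NFC", key)   # later code precomposes it
collides_later = normalized_key in BLOCKED_KEYS      # now it matches the blocked key
```

The check and the later normalization disagree about which keys are "the same", which is the gap the paragraph above warns about.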
>
> Unfortunately, the use of Unicode coupled with the fact that most JSON
implementations are dependent on external code for their Unicode support means
that this is an extremely non-trivial issue. I can't think of a simple solution
to the problem at the moment; all I can say is that it exists.
>
> > My guess is that about the last thing implementers would want is a
> > mandate to support full Unicode 4.0 (and above) normalization rules.
> > It would just mean that there would be the specification in one
> > corner; and implementations, practically none of which would be
> > compliant.
>
> You really ought to read:
>
> http://www.unicode.org/faq/security.html
>
> http://www.unicode.org/reports/tr36/#Canonical_Represenation
>
> Microsoft Security Bulletin (MS00-078): Patch Available for 'Web Server Folder
Traversal' Vulnerability
(http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx,
http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884)
>
> Creating Arbitrary Shellcode In Unicode Expanded Strings
(http://www.net-security.org/article.php?id=144)
>
> There's a long history of "Those little Unicode details aren't really
important" causing huge security problems later on.
>
>



#1595 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 4:54 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Mark Slater scripsit:

> The RFC does state that parsers can limit the numbers they accept to
> a specified range of their choice, just as they can impose a limit on
> the size of strings, if they want. In practice, some libraries apply
> unwritten limits to the range of numbers along the lines of "the range
> (set?) of numbers that can be represented as a Java double."

I think that limit is implied by the statement in the "Security
considerations" section that says that JSON is a subset of JavaScript,
where numbers are clearly constrained to IEEE 64-bit floats.

--
John Cowan    cowan@...    http://ccil.org/~cowan
         Sound change operates regularly to produce irregularities;
         analogy operates irregularly to produce regularities.
                 --E.H. Sturtevant, ca. 1945, probably at Yale

#1596 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 5:04 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Douglas Crockford scripsit:
> --- In json@yahoogroups.com, David Graham <david.malcom.graham@...> wrote:
>
> > In my opinion, this means JSON parsers and generators must not perform
> > normalization.  They must respect the data stored in the JSON byte stream as
> > is.
>
> I agree.

I agree in part.  JSON parsers MUST NOT normalize their inputs, for
the reasons given upthread.  But JSON generators SHOULD generate
normalization form C, and JSON parsers MAY check for it and
warn their applications if it is not present.
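
One way the MUST NOT / SHOULD / MAY split above could look in practice, as a rough sketch assuming Python 3 and its standard json, unicodedata, and warnings modules. The function names dumps_nfc and loads_checked are hypothetical and do not come from any JSON library:

```python
import json
import unicodedata
import warnings

def _nfc(value):
    """Recursively normalize every string (including object keys) to NFC."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {_nfc(k): _nfc(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_nfc(v) for v in value]
    return value

def dumps_nfc(value) -> str:
    """Hypothetical generator: emits strings in Normalization Form C."""
    return json.dumps(_nfc(value), ensure_ascii=False)

def _strings(value):
    """Yield every string found in a parsed JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for k, v in value.items():
            yield k
            yield from _strings(v)
    elif isinstance(value, list):
        for v in value:
            yield from _strings(v)

def loads_checked(text: str):
    """Hypothetical parser wrapper: never rewrites the data, only warns."""
    value = json.loads(text)
    if any(unicodedata.normalize("NFC", s) != s for s in _strings(value)):
        warnings.warn("JSON contains strings that are not in Normalization Form C")
    return value
```

The wrapper returns exactly what the parser produced; what to do about the warning is left to the application.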

--
Your worships will perhaps be thinking          John Cowan
that it is an easy thing to blow up a dog?      http://www.ccil.org/~cowan
[Or] to write a book?
     --Don Quixote, Introduction                 cowan@...

#1597 From: "Douglas Crockford" <douglas@...>
Date: Sat Feb 26, 2011 6:04 pm
Subject: Re: JSON and the Unicode Standard
douglascrock...
 
--- In json@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> Mark Slater scripsit:
>
> > The RFC does state that parsers can limit the numbers they accept to
> > a specified range of their choice, just as they can impose a limit on
> > the size of strings, if they want. In practice, some libraries apply
> > unwritten limits to the range of numbers along the lines of "the range
> > (set?) of numbers that can be represented as a Java double."
>
> I think that limit is implied by the statement in the "Security
> considerations" section that says that JSON is a subset of JavaScript,
> where numbers are clearly constrained to IEEE 64-bit floats.

That's not quite right. JSON says nothing at all about number representations.
All it knows is sequences of digits with the occasional decimal point. JSON
says nothing about 2's complement vs signed magnitude integers, and it says
nothing about word size or binary vs decimal.
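
To illustrate that the choice of number representation belongs to the parser and not to JSON itself, here is a small sketch assuming Python 3's standard json and decimal modules (an assumption; the thread names no implementation). The field names "big" and "precise" are made up for the example:

```python
import json
from decimal import Decimal

text = '{"big": 12345678901234567890123, "precise": 0.10000000000000000000000001}'

# Default behavior of this particular parser: integers become
# arbitrary-precision ints, fractional numbers become IEEE 754 doubles.
default = json.loads(text)

# Same text, different representation: keep fractional digits exactly.
exact = json.loads(text, parse_float=Decimal)

# Forcing the integer through a 64-bit float loses digits.
survives_double = int(float(default["big"])) == default["big"]
```

The same sequence of digits yields different in-memory values depending only on the options passed to the parser.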

#1598 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 8:05 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
johne_ganz scripsit:

> In fact, for my parser (JSONKit), which is Objective-C based and uses
> NSString to represent the JSON String objects, it is not practical
> for me to create a JSON parser that "respects the data stored in the
> JSON byte stream".  The NSString class makes no such guarantees in its
> documentation, nor does the Unicode Standard.  It would be extremely
> non-trivial for me to meet a "respects the data stored in the JSON
> byte stream" requirement, at least in the sense that the behavior
> is deterministic.

Normalization is non-trivial, and I doubt if any existing Unicode library
imposes it on all strings at creation/modification time.  Certainly ICU
does not; it provides the ability to normalize, that's all.

--
John Cowan       http://www.ccil.org/~cowan        <cowan@...>
         You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
         You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
                 Clear all so!  `Tis a Jute.... (Finnegans Wake 16.5)

#1599 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 8:08 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
johne_ganz scripsit:

> There is no language in RFC 4627 that I can find that supports your
> interpretation.  There is an awful lot of language in the Unicode
> standard that unambiguously says that you can not "delete" characters
> from a Unicode "string".  There are also compelling arguments in TR#36
> for why arbitrarily deleting characters is a huge mistake.

You are over-interpreting the standard.  Of course applications can
delete characters:  sed 's/t//g' deletes all t's from the input.
And that's all that's being said here: JSON parsers and serializers
shouldn't edit any of the codepoints, leaving that up to their clients.
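
The division of labor described above can be sketched as follows, assuming Python 3 and its standard json and unicodedata modules (an assumption, not something the thread specifies). The sample string is arbitrary:

```python
import json
import unicodedata

decomposed = "te\u0301"   # "t", "e", COMBINING ACUTE ACCENT: three code points

# The parser/serializer layer passes the code points through untouched.
round_tripped = json.loads(json.dumps(decomposed, ensure_ascii=False))
code_points = [hex(ord(c)) for c in round_tripped]

# Any editing is a client decision, made after parsing.
without_t = round_tripped.replace("t", "")                 # client deletes every "t"
client_nfc = unicodedata.normalize("NFC", round_tripped)   # client opts in to NFC
```

Deleting or normalizing characters is legitimate here because the application asked for it; the JSON layer itself changed nothing.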

--
John Cowan    cowan@...    http://ccil.org/~cowan
The present impossibility of giving a scientific explanation is no proof
that there is no scientific explanation. The unexplained is not to be
identified with the unexplainable, and the strange and extraordinary
nature of a fact is not a justification for attributing it to powers
above nature.  --The Catholic Encyclopedia, s.v. "telepathy" (1913)

#1600 From: John Cowan <cowan@...>
Date: Sat Feb 26, 2011 8:09 pm
Subject: Re: Re: JSON and the Unicode Standard
johnwcowan
 
Douglas Crockford scripsit:

> For JSON's purpose, Unicode is just a set of code points. It gives
> some, such as { and }, special meaning. But in strings, everything
> should simply be passed through.

So you are now conceding that it's invalid JSON to send through unpaired
surrogate code units, since they don't correspond to code points?
We discussed this a while back, and you were then (IIRC) claiming that
JSON allowed any arbitrary code unit, including unpaired surrogates.
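
As a concrete data point on the surrogate question, here is how one implementation behaves. This is a sketch assuming Python 3's standard json module, which is not mentioned in the thread; other parsers may reject the same input. The helper encodes_as_utf8 is a hypothetical name:

```python
import json

# JSON text containing an escape for an unpaired high surrogate.
lone = json.loads('"\\ud800"')

# JSON text containing a proper surrogate pair for U+1F600.
paired = json.loads('"\\ud83d\\ude00"')

def encodes_as_utf8(s: str) -> bool:
    """Report whether the string can be written out as well-formed UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

lone_ok = encodes_as_utf8(lone)       # unpaired surrogate is not a scalar value
paired_ok = encodes_as_utf8(paired)   # the pair was combined into one code point
```

This parser accepts the unpaired code unit, but the resulting string cannot be encoded as UTF-8, so the problem resurfaces as soon as the value leaves the program.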

--
John Cowan   http://ccil.org/~cowan  cowan@...
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows.  When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged.  --Neal Stephenson

