librelist archives

Lonely Number

From:
Hatem Nassrat
Date:
2010-05-14 @ 01:12
I fixed a bug and added some testcases to yajl-py and while doing so
noticed a bug in yajl-py with respect to "lonely_number.json". Yajl-py
doesn't seem to parse lonely numbers.

Yajl-py is a simple wrapper around yajl. It uses the yajl parser
straight from the shared object *libyajl.so* so I do not see why yajl
works fine with lonely numbers yet yajl-py acts as if nothing was
passed to it. Would anyone have an idea about what's going on before I
start diving into gdb?

-- 
Hatem Nassrat

Re: Lonely Number

From:
Hatem Nassrat
Date:
2010-05-14 @ 02:03
On Thu, May 13, 2010 at 10:12 PM, Hatem Nassrat <hnassrat@gmail.com> wrote:
> Yajl-py is a simple wrapper around yajl. It uses the yajl parser
> straight from the shared object *libyajl.so* so I do not see why yajl
> works fine with lonely numbers yet yajl-py acts as if nothing was
> passed to it. Would anyone have an idea about what's going on before i
> start diving into gdb?

Looking at lloyd's TODO file from 2008:
    http://github.com/lloyd/yajl/commit/3e9d8ec3cbc751b2b85bacb78814f1c37a0a203d
I tried adding a newline to the test case and it worked. So would
there be a reason yajl is detecting the end of input, while yajl-py
isn't?

-- 
Hatem Nassrat

Re: [yajl] Re: Lonely Number

From:
Vitali Lovich
Date:
2010-05-14 @ 02:07
On 05/13/2010 07:03 PM, Hatem Nassrat wrote:
> On Thu, May 13, 2010 at 10:12 PM, Hatem Nassrat<hnassrat@gmail.com>  wrote:
>    
>> Yajl-py is a simple wrapper around yajl. It uses the yajl parser
>> straight from the shared object *libyajl.so* so I do not see why yajl
>> works fine with lonely numbers yet yajl-py acts as if nothing was
>> passed to it. Would anyone have an idea about what's going on before i
>> start diving into gdb?
>>      
> Looking at lloyd's TODO file from 2008:
>     http://github.com/lloyd/yajl/commit/3e9d8ec3cbc751b2b85bacb78814f1c37a0a203d
> I tried adding a newline to the test case and it worked. So would
> there be a reason yajl is detecting the end of input, while yajl-py
> isn't?
>
>    
yajl ignores irrelevant whitespace.

Re: Lonely Number

From:
Hatem Nassrat
Date:
2010-05-14 @ 02:13
On Thu, May 13, 2010 at 11:03 PM, Hatem Nassrat <hnassrat@gmail.com> wrote:
> Looking at lloyd's TODO file from 2008:
>    http://github.com/lloyd/yajl/commit/3e9d8ec3cbc751b2b85bacb78814f1c37a0a203d
> I tried adding a newline to the test case and it worked. So would
> there be a reason yajl is detecting the end of input, while yajl-py
> isn't?

Eureka :P

I am not sure how I didn't see that earlier; it might not have existed
when I first implemented the core of yajl-py. There is a
yajl_parse_complete(hand) call that I wasn't using, which processes any
unprocessed data left in yajl's buffer. Here is the commit that fixes
Yajl-py (awesome, without GDB :P):

http://github.com/pykler/yajl-py/commit/c23568d0592ac9ec0610c33463fb9add849bd6e5
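
To illustrate why an end-of-input call matters for a lonely number, here is a
toy model in Python. It is not yajl's code: ToyNumberParser and its methods
are made-up names, and it only handles whitespace-separated integers. The
point is that a stream parser cannot tell whether "12" is complete or whether
more digits are coming, so it holds the digits until it sees a delimiter
(such as the newline added to the test case) or is told the input is finished.

```python
class ToyNumberParser:
    """Toy stream parser for whitespace-separated integers (not yajl's code)."""

    def __init__(self, on_integer):
        self.on_integer = on_integer
        self._pending = ""  # digits that might continue in the next chunk

    def parse(self, chunk):
        for ch in chunk:
            if ch.isdigit():
                self._pending += ch
            elif ch.isspace():
                self._flush()  # whitespace proves the number has ended
            else:
                raise ValueError("unexpected character: %r" % ch)

    def parse_complete(self):
        # End of input: whatever is buffered must be a complete number.
        self._flush()

    def _flush(self):
        if self._pending:
            self.on_integer(int(self._pending))
            self._pending = ""


seen = []
parser = ToyNumberParser(seen.append)
parser.parse("12")
parser.parse("3")          # no callback yet: more digits could still arrive
assert seen == []
parser.parse_complete()    # now the buffered number is delivered
assert seen == [123]
```

Feeding "123\n" instead delivers the number without the final call, which
matches the behaviour seen when a newline was added to the test case.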

-- 
Hatem Nassrat

Re: [yajl] Lonely Number

From:
R. Tyler Ballance
Date:
2010-05-14 @ 01:43
On Thu, 13 May 2010, Hatem Nassrat wrote:

> I fixed a bug and added some testcases to yajl-py and while doing so
> noticed a bug in yajl-py with respect to "lonely_number.json". Yajl-py
> doesn't seem to parse lonely numbers.

So I maintain py-yajl [0] (similar, I suppose, to yajl-py in that Python is
involved) and I had a similar issue filed against py-yajl.

The thing is, "1" on its own is not valid JSON [1]; it needs to be wrapped in
an object or an array.

> Yajl-py is a simple wrapper around yajl. It uses the yajl parser
> straight from the shared object *libyajl.so* so I do not see why yajl
> works fine with lonely numbers yet yajl-py acts as if nothing was
> passed to it. Would anyone have an idea about what's going on before i
> start diving into gdb?

yajl itself can process a "number" via yajl_gen_number, but that's just parsing
/part/ of a JSON object. Yajl-py should probably raise a ValueError or
something similar to inform you that you're not passing it a valid JSON object
(but rather a JSON fragment).


[0] http://github.com/rtyler/py-yajl
[1] http://www.json.org/

Cheers,
-R. Tyler Ballance
--------------------------------------
  Jabber: rtyler@jabber.org
  GitHub: http://github.com/rtyler
Identica: http://identi.ca/dero
 Twitter: http://twitter.com/agentdero
    Blog: http://unethicalblogger.com

Re: [yajl] Lonely Number

From:
Hatem Nassrat
Date:
2010-05-14 @ 01:54
On Thu, May 13, 2010 at 10:43 PM, R. Tyler Ballance <tyler@monkeypox.org> wrote:
> yajl itself can process a "number" via yajl_gen_number, but that's just parsing
> /part/ of a JSON object.

Isn't yajl_gen_number for printing out json? Is it also used in this
special case to parse json?

-- 
Hatem Nassrat

Re: [yajl] Lonely Number

From:
R. Tyler Ballance
Date:
2010-05-14 @ 02:11
On Thu, 13 May 2010, Hatem Nassrat wrote:

> On Thu, May 13, 2010 at 10:43 PM, R. Tyler Ballance <tyler@monkeypox.org> wrote:
> > yajl itself can process a "number" via yajl_gen_number, but that's just parsing
> > /part/ of a JSON object.
> 
> Isn't yajl_gen_number for printing out json, is it also used in this
> special case to parse json?

Ah, apologies, I was flipping my inputs and outputs around. The parsing of a
number is handled by the caller, i.e. the yajl_number or yajl_integer
callbacks.

No clue what Yajl-py does, a developer more prone to whoring his wares might
suggest py-yajl to you, but I won't.

I'll take the high road.

Cheers,
-R. Tyler Ballance
--------------------------------------
  Jabber: rtyler@jabber.org
  GitHub: http://github.com/rtyler
Identica: http://identi.ca/dero
 Twitter: http://twitter.com/agentdero
    Blog: http://unethicalblogger.com

Re: [yajl] Lonely Number

From:
Hatem Nassrat
Date:
2010-05-14 @ 02:27
On Thu, May 13, 2010 at 11:11 PM, R. Tyler Ballance <tyler@monkeypox.org> wrote:
> No clue what Yajl-py does, a developer more prone to whoring his wares might
> suggest py-yajl to you, but I won't.

I was thinking at one point of using pyrex or cython to wrap yajl to
get more speed. I had started off with this ctypes wrapper
as part of my master's thesis, where I needed a stream parsing
(sax-like, for lack of a better term) parser, and that's why I used yajl.
The reason I needed this is that I had to parse large json documents
that contain many instances of a certain object (known as
combinatorial block designs). Using the built-in json.loads (at least
how I used it back then) would cause the whole file to be parsed at
once, which was a no-no. So I found yajl-0.3 and implemented a quick
wrapper with ctypes around the shared library (libyajl.so) for my
needs.

I looked at py-yajl not long ago; actually that's what I wanted
to call my library, and then I found that it already exists and
called mine yajl-py :p. From a 10k view I saw that it tries to stand
in as a replacement for the standard library json module and thus is
not really useful to me (if I am correct), as I need a sax-like
parser where I can define my own callbacks. Please correct me if I
misunderstood py-yajl.

I made some recent changes (in the past couple of days) that make yajl-py
work more or less like the python sax module, by supplying a
content_handler instance when you create the parser. It even
shows how much cleaner python can be than C, as I don't need to use
the ctx at all (whereas it's kind of a must to do anything cleanly
with C and the yajl api). I attain this using closures. I really
enjoyed the recent changes, and will stick to whoring yajl-py :p,
maybe you should give yajl-py a try :p.
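
The closure idea can be sketched with plain ctypes, without loading libyajl.
This is an illustration under assumptions, not yajl-py's actual code: the
callback type is only modeled on a yajl-style integer callback (the exact C
types are a guess), and CollectingHandler and make_integer_callback are
hypothetical names.

```python
import ctypes

# Modeled on a yajl-style callback: int (*)(void *ctx, long value).
# The exact C types are an assumption for this sketch.
IntegerCallback = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_long)


class CollectingHandler:
    """Hypothetical content handler that records every integer it is given."""

    def __init__(self):
        self.integers = []

    def integer(self, value):
        self.integers.append(value)


def make_integer_callback(handler):
    """Wrap handler.integer in a C-callable function pointer."""

    def on_integer(ctx, value):
        # `handler` is captured by the closure, so the C-side ctx pointer
        # is never needed to find the Python object.
        handler.integer(value)
        return 1  # nonzero tells a yajl-style parser to keep going

    return IntegerCallback(on_integer)


handler = CollectingHandler()
callback = make_integer_callback(handler)
callback(None, 42)  # stands in for the C library invoking the callback
```

With a real C library, the object returned by make_integer_callback has to be
kept referenced for as long as the library may call it; otherwise ctypes can
free the function pointer underneath the parser.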

-- 
Hatem Nassrat

Re: [yajl] Lonely Number

From:
R. Tyler Ballance
Date:
2010-05-14 @ 02:52
On Thu, 13 May 2010, Hatem Nassrat wrote:

> On Thu, May 13, 2010 at 11:11 PM, R. Tyler Ballance <tyler@monkeypox.org> wrote:
> > No clue what Yajl-py does, a developer more prone to whoring his wares might
> > suggest py-yajl to you, but I won't.
> 
> I was thinking at one point of using pyrex or cython to wrap yajl to
> get more speed efficient. I had started off with this ctypes wrapper
> as part of my master's thesis were I needed to use a stream parsing
> (sax-like for lack of a better term) parser and thats why I used yajl.
> The reason I needed to do this is I had to parse large json documents
> that contain many instances of a certain object (known as
> combinatorial block designs), using the built in json.loads (atleast
> how I used it back then) would cause the whole file to be parsed at
> once which was a no-no. So I found yajl-0.3, and implemented a quick
> wrapper with ctypes around the shared library (libyajl.so) for my
> needs.

Understandable. I've not really addressed stream parsing at all yet (since I'm
dealing with smaller documents).

> I looked at py-yajl not a long time ago, actually thats what I wanted
> to call my library and then came to find that it already exists, and
> called mine yajl-py :p. From a 10k view I saw that it tries to stand
> in as a replacement to the standard library json module and thus is
> not really useful to me (if I am correct), as I need a sax like
> parser, where I can define my own callbacks. Please correct me if I
> misunderstood py-yajl.

I'm curious whether you've measured the Python method invocation overhead at
all? My main hesitance to expose callbacks up into Python has been that C
functions are faster than Python, so my focus on streaming support has been
more to the tune of how brianmario's streaming support works
(http://rdoc.info/projects/brianmario/yajl-ruby), i.e. for py-yajl, streaming
support would look something along these lines:

    for elem in yajl.iterstream(fileobj):
        do_thing_with(elem)

or

    for key, value in yajl.iterstream(fileobj):
        do_thing_with((key, value))

I'd be very interested to hear about any performance analysis you did with
regards to callbacks into Python.

> I made some recent changes (the past couple of days) that make yajl-py
> work more or less like the python sax module, by supplying a
> content_handler instance when you are creating the parser. It even
> shows how much cleaner python can be over C, as I don't need to use
> the ctx at all (where as its kind of a must to do anything cleanly
> with C and the yajl api). I attain this using closures. I really
> enjoyed the recent changes, and will stick to whoring yajl-py :p,
> maybe you should give yajl-py a try :p.

I can only promise that I will steal your good ideas ;)

Cheers,
-R. Tyler Ballance
--------------------------------------
  Jabber: rtyler@jabber.org
  GitHub: http://github.com/rtyler
Identica: http://identi.ca/dero
 Twitter: http://twitter.com/agentdero
    Blog: http://unethicalblogger.com

Re: [yajl] Lonely Number

From:
Hatem Nassrat
Date:
2010-05-14 @ 03:00
On Thu, May 13, 2010 at 11:52 PM, R. Tyler Ballance <tyler@monkeypox.org> wrote:
> I'm curious whether you've measured the Python method invocation overhead at
> all? My main hesitance to expose callbacks up into Python has been because C
> function faster than Python, so my focus on streaming support has been more to

I haven't really, but I know that ctypes itself is much slower than
the same thing implemented in Cython would be. So any speed I measure
would be a lower bound: you should not get anything slower
if you start using callbacks with the Cython version.

> the tune of how brianmario's streaming support works
> (http://rdoc.info/projects/brianmario/yajl-ruby) i.e. for py-yajl streaming
> support would something along these lines:
>
>    for elem in yajl.iterstream(fileobj):
>        do_thing_with(elem)
>
> or
>
>    for key, value in yajl.iterstream(fileobj):
>        do_thing_with((key, value))

I will have to look at yajl-ruby because to me this seems magical. I
would imagine that you would need specially formatted documents to do
this, for example a document where the top level is all key, value
pairs.

> I'd be very interested to hear about any performance analysis you did with
> regards to callbacks into Python.

I am not even sure where to start with this. Maybe comparing the speed
of yajl-py vs. yajl to run through the parsing of a few documents?
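
One possible starting point, as a rough sketch only: time a plain Python call
against the same function invoked through a ctypes function pointer. This
measures a Python to C to Python round trip rather than a callback fired from
inside yajl, so it is a proxy at best. The callback signature and the
measure() helper are assumptions made for this example.

```python
import ctypes
import timeit

# Callback type modeled on a yajl-style integer callback (types assumed).
Callback = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_long)


def on_integer(ctx, value):
    return 1


c_callback = Callback(on_integer)


def measure(number=100000):
    """Return (seconds per direct call, seconds per call through ctypes)."""
    direct = timeit.timeit(lambda: on_integer(None, 1), number=number)
    via_ctypes = timeit.timeit(lambda: c_callback(None, 1), number=number)
    return direct / number, via_ctypes / number


direct_cost, ctypes_cost = measure()
print("direct call:  %.3e s" % direct_cost)
print("via ctypes:   %.3e s" % ctypes_cost)
```

Comparing whole-document parse times of yajl-py against the C library on the
same files, as suggested above, would complement this per-call figure.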

>> maybe you should give yajl-py a try :p.
>
> I can only promise that I will steal your good ideas ;)

That would still be awesome. That's what open source is all about.

-- 
Hatem Nassrat

Re: [yajl] Lonely Number

From:
Vitali Lovich
Date:
2010-05-14 @ 02:56
On 05/13/2010 07:52 PM, R. Tyler Ballance wrote:
> I'm curious whether you've measured the Python method invocation overhead at
> all? My main hesitance to expose callbacks up into Python has been because C
> function faster than Python, so my focus on streaming support has been more to
> the tune of how brianmario's streaming support works
> (http://rdoc.info/projects/brianmario/yajl-ruby) i.e. for py-yajl streaming
> support would something along these lines:
>
>      for elem in yajl.iterstream(fileobj):
>          do_thing_with(elem)
>
> or
>
>      for key, value in yajl.iterstream(fileobj):
>          do_thing_with((key, value))
>
> I'd be very interested to hear about any performance analysis you did with
> regards to callbacks into Python.
>
>    
I'd be interested to hear how you can do the key/value thing.  Remember
that the value itself is indeterminate until it is completely parsed
(from a stream perspective).  Thus the bulk of the document can sit in
the value (& you'd also get out-of-order stream events since the
top-most key will be emitted after the inner-most value).

Re: [yajl] Lonely Number

From:
R. Tyler Ballance
Date:
2010-05-14 @ 03:05
On Thu, 13 May 2010, Vitali Lovich wrote:

> On 05/13/2010 07:52 PM, R. Tyler Ballance wrote:
> > I'm curious whether you've measured the Python method invocation overhead at
> > all? My main hesitance to expose callbacks up into Python has been because C
> > function faster than Python, so my focus on streaming support has been more to
> > the tune of how brianmario's streaming support works
> > (http://rdoc.info/projects/brianmario/yajl-ruby) i.e. for py-yajl streaming
> > support would something along these lines:
> >
> >      for elem in yajl.iterstream(fileobj):
> >          do_thing_with(elem)
> >
> > or
> >
> >      for key, value in yajl.iterstream(fileobj):
> >          do_thing_with((key, value))
> >
> > I'd be very interested to hear about any performance analysis you did with
> > regards to callbacks into Python.
> >
> >    
> I'd be interested to hear how you can do the key/value thing.  Remember 
> that they value itself is indeterminate until it is completely parsed 
> (from a stream perspective).  Thus the bulk of the document can sit in 
> the value (& you'd also get out-of-order stream events since the 
> top-most key will be emitted after the inner-most value).

Disclaimer: I've not yet implemented this :)

There are some problems with this approach, insofar as Python doesn't really
have a concept of a "tree iterator", so this approach would only be pulling
top-level key-value pairs. If your data looks like:

    {'rc' : 0, 'data' : <epic huge JSON>}

then you'd be screwed. If your data looks like:

    [<big chunk>, <big chunk>] * 1000
    {1 : <chunk>, 2 : <chunk>} * 1000

then this approach will work reasonably well. Internally I was intending on
using a `depth` counter: while we don't know when a `value` has ended, we do
know when a new key starts. For data along the lines of:

    {'key1' : {'hello' : 'world'}, 'key2' : {'hello' : 'world'}}

The callback chain should go:

    yajl_start_map
        yajl_map_key
            yajl_start_map
            yajl_map_key
            yajl_string
        yajl_map_key
            yajl_start_map
            yajl_map_key
            yajl_string

(indented on purpose)

As far as I've thought this through, I should be able to return the value on
that second `yajl_map_key`; what this will look like in code I still don't
know. Waiting on a good free weekend :)
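
A minimal sketch of the depth-counter idea in pure Python, under stated
assumptions: events arrive as hypothetical (name, payload) tuples whose names
mirror yajl's callbacks, and the stream also carries end_map / end_array
events (yajl does provide yajl_end_map and yajl_end_array callbacks). Unlike
the description above, this version yields as soon as the value closes rather
than waiting for the next key. iter_top_level_items is a made-up name, not
part of py-yajl.

```python
def iter_top_level_items(events):
    """Yield (key, value) for each pair in the top-level map of an event stream."""
    depth = 0
    top_key = None
    stack = []  # containers currently open below the top-level map
    keys = []   # keys[i] is the pending key for stack[i] when it is a dict

    for name, payload in events:
        if name in ("start_map", "start_array"):
            depth += 1
            if depth > 1:
                stack.append({} if name == "start_map" else [])
                keys.append(None)
            continue
        if name == "map_key":
            if depth == 1:
                top_key = payload
            else:
                keys[-1] = payload
            continue
        if name in ("end_map", "end_array"):
            depth -= 1
            if depth == 0:
                return  # the top-level container has closed
            value = stack.pop()
            keys.pop()
        else:
            value = payload  # any other event is treated as a scalar

        if stack:
            # The value belongs to an enclosing container: attach it there.
            parent = stack[-1]
            if isinstance(parent, dict):
                parent[keys[-1]] = value
            else:
                parent.append(value)
        else:
            # No enclosing container left: a complete top-level value.
            yield top_key, value


events = [
    ("start_map", None),
    ("map_key", "key1"),
    ("start_map", None), ("map_key", "hello"), ("string", "world"), ("end_map", None),
    ("map_key", "key2"),
    ("start_map", None), ("map_key", "hello"), ("string", "world"), ("end_map", None),
    ("end_map", None),
]
for key, value in iter_top_level_items(events):
    print(key, value)
```

Only one top-level value is held in memory at a time, so the
{'rc' : 0, 'data' : <epic huge JSON>} shape would still buffer the whole
'data' value, as noted above.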


Cheers,
-R. Tyler Ballance
--------------------------------------
  Jabber: rtyler@jabber.org
  GitHub: http://github.com/rtyler
Identica: http://identi.ca/dero
 Twitter: http://twitter.com/agentdero
    Blog: http://unethicalblogger.com

Re: [yajl] Lonely Number

From:
Vitali Lovich
Date:
2010-05-14 @ 03:15
On 05/13/2010 08:05 PM, R. Tyler Ballance wrote:
> On Thu, 13 May 2010, Vitali Lovich wrote:
>
>    
>> On 05/13/2010 07:52 PM, R. Tyler Ballance wrote:
>>      
>>> I'm curious whether you've measured the Python method invocation overhead at
>>> all? My main hesitance to expose callbacks up into Python has been because C
>>> function faster than Python, so my focus on streaming support has been more to
>>> the tune of how brianmario's streaming support works
>>> (http://rdoc.info/projects/brianmario/yajl-ruby) i.e. for py-yajl streaming
>>> support would something along these lines:
>>>
>>>       for elem in yajl.iterstream(fileobj):
>>>           do_thing_with(elem)
>>>
>>> or
>>>
>>>       for key, value in yajl.iterstream(fileobj):
>>>           do_thing_with((key, value))
>>>
>>> I'd be very interested to hear about any performance analysis you did with
>>> regards to callbacks into Python.
>>>
>>>
>>>        
>> I'd be interested to hear how you can do the key/value thing.  Remember
>> that they value itself is indeterminate until it is completely parsed
>> (from a stream perspective).  Thus the bulk of the document can sit in
>> the value (&  you'd also get out-of-order stream events since the
>> top-most key will be emitted after the inner-most value).
>>      
> Disclaimer: I've not yet implemented this :)
>
> There are some problems with this approach, insofar that Python doesn't really
> have a concept of a "tree iterator", so this approach would only be pulling top
> level key-value pairs, if your data looks like:
>
>      {'rc' : 0', 'data' :<epic huge JSON>}
>
> then you'd be screwed. If your data looks like:
>
>      [<big chunk>,<big chunk>] * 1000
>      {1 :<chunk>, 2 :<chunk>} * 1000
>
> then this approach will work reasonably well. Internally I was indending on
> using a `depth` counter, while we don't know when a `value` has ended, we do
> know when a new key starts. For data along along the lines of:
>
>      {'key1' : {'hello' : 'world'}, 'key2' : {'hello' : 'world'}}
>
> The callback chain should go:
>
>      yajl_start_map
>          yajl_map_key
>              yajl_start_map
>              yajl_map_key
>              yajl_string
>          yajl_map_key
>              yajl_start_map
>              yajl_map_key
>              yajl_string
>
> (indented on purpose)
>
> As much as I've thought this through, I should be able to return the value on
> that second `yajl_map_key`, what this will look like in code I still don't
> know. Waiting on a good free weekend :)
>
>
> Cheers,
> -R. Tyler Ballance
> --------------------------------------
>    Jabber: rtyler@jabber.org
>    GitHub: http://github.com/rtyler
> Identica: http://identi.ca/dero
>   Twitter: http://twitter.com/agentdero
>      Blog: http://unethicalblogger.com
>    
I'd be interested to see how my native library performs against a big
document.  My memory overhead is size of document + 2x * # of keys, where
x is the amount of space taken up by a generic json node.  If the file
is on disk, my memory usage is 2x * # of keys (memory map the file).
The runtime is about 1/4 the speed of yajl (the amortized
random access time of the document, though, is really good).  Any chance
the JSON document is public?

Thanks,
Vitali

Re: [yajl] Lonely Number

From:
Hatem Nassrat
Date:
2010-05-14 @ 03:35
On Fri, May 14, 2010 at 12:15 AM, Vitali Lovich <vitali.lovich@palm.com> wrote:
> I'd be interested to see how my native library performs against a big
> document.  My memory overhead is size of document + 2x * # of keys where
> x is the amount of space taken up for a generic json node.  If the file
> is on disk, my memory usage is 2x * # of keys (memory map the file).
> The runtime overhead is about 1/4 the speed of yajl (the ammortized
> random access time of the document though is really good).   Any chance
> the JSON document is public?

Yes, the documents are public :-), though it seems the domain name is no
longer pointing to the right server. Gotta love our awesome Dalhousie
University IT services. For now, here is the page that describes the
document:

    http://webcache.googleusercontent.com/search?q=cache:O9VbmrkXmWAJ:batman.cs.dal.ca/~peter/designdb/+diesgn+db&cd=5&hl=en&ct=clnk&gl=ca

And here is a temporary location of the files:

    http://nassrat.cs.dal.ca/designdb/extrep/json-files/

You may also use the actual database to dump a large number of designs
as a json document; however, the server is also having some technical
difficulties, as it seems quite slow these days. I have to see what's up
with that:

    http://nassrat.cs.dal.ca/ddb2/

Here are some good big files from the collection:

    http://nassrat.cs.dal.ca/designdb/extrep/json-files/
        v-b-r-k-PLS/v14-b14-r3-k3-PLS.icgsa.json.bz2
        t-designs/t2-v21-b105-r20-k4-L3.icgs.json.bz2
        v-b-r-k-PLS/v12-b30-r5-k2-PLS.icgsa.json.bz2
        v-b-r-k-PLS/v12-b36-r6-k2-PLS.icgsa.json.bz2
        v-b-k/v8-b6-k4.icgsa.json.bz2
        t-designs/t2-v20-b76-r19-k5-L4.icgs.json.bz2
        v-b-k/v8-b8-k4.a.json.bz2
        t-designs/t2-v20-b95-r19-k4-L3.icgs.json.bz2

-- 
Hatem Nassrat