Why does this particular XML parsing code work for small XML but fail for large XML? - python

I have a function for parsing xml content, like below:
def parse_individual_xml(self, xml_url):
    xml_data_to_parse = urlopen(xml_url).read()
    jobs = ET.fromstring(xml_data_to_parse)
    return jobs
This function worked perfectly while I was dealing with smaller files (1-2 MB), but when I tried a large XML URL, I got this error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
As far as I know, it is some encoding/decoding issue.
The function below gives exactly the same behavior:
def parse_individual_xml(self, xml_url):
    xml_data_to_parse = urlopen(xml_url)
    jobs = ET.parse(xml_data_to_parse).getroot()
    return jobs
Then I tried something different: I downloaded that large file locally and changed the function like below:
def parse_individual_xml(self, xml_local_path):
    jobs = ET.parse(xml_local_path).getroot()
    return jobs
And it works for any file, large or small. Eventually I will use iterparse from etree, but first I want to understand the behavior described above. How can I solve it?

The remote server is almost certainly compressing large responses with GZIP (or, less commonly, deflate).
Based on the Content-Encoding header, decompress the stream before trying to parse it:
import gzip

response = urlopen(xml_url)
if response.info().get('Content-Encoding') == 'gzip':
    # transparent decompression of a GZIP-ed response
    response = gzip.GzipFile(fileobj=response)
jobs = ET.parse(response).getroot()
You may want to consider using the requests library instead; it can handle this for you transparently. To stream the data into an iterative parser, use stream=True, access the response.raw file-like object, and configure it to do transparent decompression:
response = requests.get(xml_url, stream=True)
response.raw.decode_content = True # handle content-encoding compression
jobs = ET.parse(response.raw).getroot() # or use iterparse
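For example, here is a minimal sketch of feeding the decompressed stream into ET.iterparse and clearing elements as they complete, so memory use stays bounded (the 'job' element name is hypothetical):

import xml.etree.ElementTree as ET
import requests

response = requests.get(xml_url, stream=True)
response.raw.decode_content = True  # undo any gzip/deflate content-encoding

# iterparse yields each element once its end tag has been seen, so the whole
# tree never needs to be in memory if processed elements are cleared
for event, elem in ET.iterparse(response.raw, events=('end',)):
    if elem.tag == 'job':  # hypothetical element name
        # process the element here, then free it
        elem.clear()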

Related

lxml.etree to iterparse requests response (with stream=True)

I have a SOAP client in Python receiving a response which, in one element of the SOAP envelope's body, contains a large stream of data (a gzipped file of several GBs; the machine's main memory is not necessarily big enough to hold it).
Thus, I am required to process the information as a stream, which I am specifying when posting my SOAP request:
import requests
url = 'xyz.svc'
headers = {'SOAPAction': X}
response = requests.post(url, data=payload, headers=headers, stream=True)
In order to parse the information with lxml.etree (I have to first read some information from the header and then process the fields from the body, including the large file element), I now want to use the stream to feed iterparse:
from lxml import etree

context = etree.iterparse(response.raw, events=('start', 'end'))
for event, elem in context:
    if event == 'start':
        if elem.tag == t_header:
            pass  # process header
        if elem.tag == t_body:
            pass  # TODO: write element text to file, rely on etree to not load into memory?
    else:
        pass  # do some cleanup
Unfortunately, structuring the code like this does not appear to work; passing response.raw to iterparse raises:
XMLSyntaxError: Document is empty, line 1, column 1
How can this be fixed? Searching for similar solutions, this approach generally seems to work for people.
Solution: etree received an encoded byte stream which it did not handle properly; setting
response.raw.decode_content = True
seems to work for the first part.
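Putting the fix together, a minimal sketch of the working fetch-and-parse setup (url, payload, and headers as in the question above):

import requests
from lxml import etree

response = requests.post(url, data=payload, headers=headers, stream=True)
response.raw.decode_content = True  # let urllib3 undo the gzip/deflate content-encoding

context = etree.iterparse(response.raw, events=('start', 'end'))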
How to properly implement streaming the element text? Will etree.iterparse process the full element.text into memory, or can it be read in chunks?
Update: With the parser working (I am not sure, however, whether the decoding is done properly, since the files appear to end up corrupted - which is probably a result of bad base64 decoding), this remains to be properly implemented.
Update 2: Some additional debugging showed that it is not possible to access all the content in the large element's text.
Printing
len(elem.text)
in the start-event shows 33651, while in the end-event, it's 10389304. Any ideas how to read the full content iteratively with lxml?
Regarding package versions, I am using:
requests 2.9.1
python 3.4.4
etree 3.4.4 (from lxml)

writing a decompressed file fetched from a web server to disk

I can get a file that has content-encoding set to gzip.
So does that mean the server is storing it as a compressed file, or is this also true for files stored as compressed zip or 7z archives?
and if so (where durl is a zip file)
>>> durl = 'https://db.tt/Kq0byWzW'
>>> dresp = requests.get(durl, allow_redirects=True, stream=True)
>>> dresp.headers['content-encoding']
'gzip'
>>> r = requests.get(durl, stream=True)
>>> data = r.raw.read(decode_content=True)
but data is coming out empty, while I want to extract the zip file to disk on the fly!
First of all, durl is not a zip file; it is a Dropbox landing page. So what you are looking at is HTML which is being sent using gzip encoding. If you were to decode the data from the raw socket using gzip, you would simply get the HTML. So the use of raw is really just hiding that you accidentally got a different file than the one you thought.
Based on https://plus.google.com/u/0/100262946444188999467/posts/VsxftxQnRam where you ask
Does anyone has any idea about writing compressed file directy to disk to decompressed state?
I take it you are really trying to fetch a zip and decompress it directly to a directory without first storing it. To do this you need to use https://docs.python.org/2/library/zipfile.html
Though at this point the problem becomes that the response from requests isn't actually seekable, which zipfile requires in order to work (one of the first things it will do is seek to the end of the file to determine how long it is).
To get around this you need to wrap the response in a file like object. Personally I would recommend using tempfile.SpooledTemporaryFile with a max size set. This way your code would switch to writing things to disk if the file was bigger than you expected.
import requests
import tempfile
import zipfile

KB = 1 << 10
MB = 1 << 20

url = '...'  # Set url to the download link.
resp = requests.get(url, stream=True)
with tempfile.SpooledTemporaryFile(max_size=500*MB) as tmp:
    for chunk in resp.iter_content(4*KB):
        tmp.write(chunk)
    archive = zipfile.ZipFile(tmp)
    archive.extractall(path)  # path: the destination directory
Same code using io.BytesIO:
import io

resp = requests.get(url, stream=True)
tmp = io.BytesIO()
for chunk in resp.iter_content(4*KB):
    tmp.write(chunk)
archive = zipfile.ZipFile(tmp)
archive.extractall(path)
You need the content from the requests response in order to write it.
Confirmed working:
import requests
durl = 'https://db.tt/Kq0byWzW'
dresp = requests.get(durl, allow_redirects=True, stream=True)
dresp.headers['content-encoding']
file = open('test.html', 'w')
file.write(dresp.text)
You have to differentiate between content-encoding (not to be confused with transfer-encoding) and content-type.
The gist of it is that content-type is the media-type (the real file-type) of the resource you are trying to get. And content-encoding is any kind of modification applied to it before sending it to the client.
So let's assume you'd like to get a resource named "foo.txt". It will probably have a content-type of text/plain. In addition to that, the data can be modified when sent over the wire; this is the content-encoding. So, with the above example, you can have a content-type of text/plain and a content-encoding of gzip. This means that before the server sends the file out onto the wire, it compresses it with gzip on the fly, so the only bytes which traverse the net are zipped bytes, not the raw bytes of the original file (foo.txt).
It is the job of the client to process these headers accordingly.
Now, I am not 100% sure whether requests or the underlying Python libs do this, but chances are they do. If not, Python ships with gzip in the standard library, so you could do it on your own without a problem.
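As a rough sketch of that fallback, assuming dresp is the streamed response from the question and the body really is gzip-encoded:

import gzip
import io

# dresp.raw yields the still-compressed bytes when decode_content is not enabled
compressed = dresp.raw.read()
decoded = gzip.GzipFile(fileobj=io.BytesIO(compressed)).read()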
With the above in mind, to respond to your question: No, having a content-encoding of gzip does not mean that the remote resource is a zip file. The field containing that information is content-type (based on your question, this probably has a value of application/zip or application/x-7z-compressed, depending on the actual compression algorithm used).
If you cannot determine the real file-type based on the content-type field (f.ex. if it is application/octet-stream), you could just save the file to disk, and open it up with a hex editor. In the case of a 7z file you should see the byte sequence 37 7a bc af 27 1c somewhere. Most likely at the beginning of the file or at EOF-112 bytes. In the case of a gzip file, it should be 1f 8b at the beginning of the file.
Given that you have gzip in the content-encoding field: if you get a 7z file, you can be certain that requests has parsed content-encoding and properly decoded it for you. If you get a gzip file, it could mean two things: either requests has not decoded anything, or the file is indeed a gzip file, since it could be a gzip file sent with gzip encoding. That would mean it is doubly compressed, which would not make any sense, but, depending on the server, it could still happen.
You could simply try to run gunzip on the console and see what you get.
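If you would rather check programmatically than in a hex editor, here is a minimal sketch of sniffing the magic bytes described above (the file name is hypothetical):

with open('downloaded.bin', 'rb') as f:  # hypothetical file name
    head = f.read(6)

if head[:2] == b'\x1f\x8b':
    print('gzip file')
elif head == b'7z\xbc\xaf\x27\x1c':
    print('7z archive')
elif head[:2] == b'PK':  # ZIP archives start with 'PK' (well-known signature)
    print('zip archive')
else:
    print('unknown format')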

How to do a streaming upload with the python requests module (including file and data)?

I have this Python requests code snippet (partial) to upload a file and some data to the server:
files = [("FileData", (upload_me_name, open(upload_me, "rb"), "application/octet-stream"))]
r = s.post(url, proxies = proxies, headers = headers, files = files, data = data)
This will read the whole file into memory, which may cause issues in some situations. From the requests documentation, I know it supports streaming uploads like this:
with open('massive-body') as f:
    requests.post('http://some.url/streamed', data=f)
However, I don't know how to change my original code to support streaming. Can anyone help?
Thanks.
Currently requests doesn't support doing streaming uploads that contain any more data than a single file. The fact that you're sending data on your POST means you're doing a multipart file upload, and currently Requests doesn't provide you with any way to stream that.
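That said, the requests-toolbelt package (also mentioned in a later answer on this page) provides a streaming multipart encoder. A minimal sketch, reusing upload_me_name, upload_me, s, and url from the question:

from requests_toolbelt.multipart.encoder import MultipartEncoder

# The encoder streams the multipart body instead of building it in memory.
m = MultipartEncoder(fields={
    'FileData': (upload_me_name, open(upload_me, 'rb'), 'application/octet-stream'),
    # add the plain form fields from `data` here as simple string values
})
r = s.post(url, data=m, headers={'Content-Type': m.content_type})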

Python: Generating a gzip string by returning chunks

I am going to ask for something I could not find on Stackoverflow.
I am doing some Django, and I've recently discovered that I can stream the HTTP output using a generator.
The output of a page is fine in the normal case; however, I want to stream the page output using GZip compression.
I've tried using the simple zlib.compress function, to no avail. The function generates small gzip files.
I want to return small chunks of data as they are processed, as strings. Those chunks should form the content of a gzipped file. How would one do this?
Thanks.
Use zlib.compressobj([level]) with Compress.compress(string), and Compress.flush([mode]) to finish:
import zlib

def compress(chunks):
    # wbits=16 + MAX_WBITS makes zlib emit a gzip header and trailer;
    # the default compressobj() would produce a raw zlib stream instead
    c = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    for chunk in chunks:
        yield c.compress(chunk)
    yield c.flush(zlib.Z_FINISH)
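For example, in Django this generator could feed a streaming response (a sketch; page_chunks() is a hypothetical generator yielding byte strings):

from django.http import StreamingHttpResponse

def my_view(request):
    # page_chunks() is a hypothetical generator yielding byte strings
    response = StreamingHttpResponse(compress(page_chunks()))
    response['Content-Encoding'] = 'gzip'
    return response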
Compressing the content is a job for your web server, not the framework. If you are using Apache, you can use mod_deflate. I hope you are not referring to the simple server Django provides for testing purposes. If you are, simply look in its code for the term gzip and see whether it does any compression at all.
Also, if you are thinking of doing compression in the app server, be warned: it is better to move this job to the web server.

python requests: post and big content

I am sending a CSV file to a server using a POST request.
I am using a file-like object with requests.post
Will there be a problem if the CSV file is quite big and I have limited memory, or does using a file-like object mean the whole file will never be loaded into memory? I am not sure about that.
I know there is the stream option but it sounds like it's more for getting the response and not sending data.
headers = {
    'content-type': 'text/csv',
}
csvfile = '/path/file.csv'
with open(csvfile) as f:
    r = requests.post(url, data=f, headers=headers)
Using an open file object as the data parameter ensures that requests will stream the data for you.
If the file size can be determined (via the OS filesystem), the file object is streamed using an 8 KB buffer. If no file size can be determined, a Transfer-Encoding: chunked request is sent instead, sending the data line by line (the object is used as an iterable).
If you were to use the files= parameter for a multipart POST, on the other hand, the file would be loaded into memory before sending. Use the requests-toolbelt package to stream multi-part uploads.
This will not load the entire file into memory; it will be split into chunks and transmitted a little at a time. You can see this in the requests source code.
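If you need explicit control over the chunking, passing a generator as data= also results in a chunked upload. A minimal sketch (read_in_chunks is a hypothetical helper; url as in the question):

import requests

def read_in_chunks(path, chunk_size=8192):
    # hypothetical helper: yields the file in fixed-size binary chunks
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

r = requests.post(url, data=read_in_chunks('/path/file.csv'),
                  headers={'content-type': 'text/csv'})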
