Python Topics: Pickle Risks and Safer Serialization Alternatives
What Is Pickle?
the pickle module can serialize Python objects into a byte stream
stream can be written to a file or transmitted over a network
can deserialize this byte stream back into a Python object
can seem to be incredibly convenient for persisting data, caching, or sending complex objects
does have drawbacks
How To Use Pickle
Pickle is fairly straightforward to use
can serialize almost any Python object
convenient to use because there is no need to build complicated serialization tools
import dataclasses
import pickle

@dataclasses.dataclass
class User:
    name: str
    age: int

data = User(name="Flip", age=71)

pickled_data = pickle.dumps(data)
unpickled_data = pickle.loads(pickled_data)

assert data == unpickled_data
abbreviated contents of pickled_data shown below
b'\x80...\x08__main__\x94\x8c\x04User\x94...\x8c\x04name\x94...\x8c\x03age\x94KGub.'
can pick out some minor details but output is difficult to read
the __main__ and User fragments inform Pickle of the type it should deserialize
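The module and class names embedded in a pickle can be listed without loading it. The sketch below is a minimal illustration using the standard-library `pickletools.genops`, which only parses opcodes and never executes them. The helper name `describe_pickle` is hypothetical, and a standard-library `fractions.Fraction` stands in for `User` purely so the sketch does not depend on where the class is defined.

```python
import fractions
import pickle
import pickletools


def describe_pickle(blob):
    """Return (opcode name, argument) pairs for a pickle without loading it."""
    return [(op.name, arg) for op, arg, _pos in pickletools.genops(blob)]


# Protocol 4 is pinned so the opcode names are the same on every interpreter
blob = pickle.dumps(fractions.Fraction(3, 4), protocol=4)
ops = describe_pickle(blob)

# String arguments include the module and class name pickle will import on load
referenced_names = [arg for _name, arg in ops if isinstance(arg, str)]

for name, arg in ops:
    print(name, arg)
```

`pickletools.dis(blob)` prints a similar, more detailed listing and is also safe to run on untrusted data, since it does not unpickle anything.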
The Dangers
Security Vulnerabilities: Pickle is inherently insecure
unpickling runs the byte stream as a program of opcodes that can import and call arbitrary Python callables
if an attacker can modify the serialized data, arbitrary code can be executed
vulnerability can lead to severe consequences
malicious payloads are trivial to create and can be as simple as just a few lines of code
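Compatibility Issues: Pickle and the data it generates are specific to Python
pickles written with a newer protocol cannot be read by older Python versions, and unpickling fails if a referenced class has been moved, renamed, or changed incompatibly
problematic when upgrading Python versions or sharing data between different systems and languages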
Lack of Transparency: Pickle's byte stream is generally not human-readable
challenging to debug and inspect the data
can be difficult to determine a problem's cause without detailed knowledge of the pickle protocol and the serialized objects
Size and Performance: Pickle can produce large byte streams
increased storage requirements and slower performance when working with large or complex objects
can be a significant drawback in resource-constrained environments
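The version-compatibility point can be made concrete by looking at the protocol number a pickle declares. The following is a small sketch, assuming only the standard `pickle` module; the helper `pickle_protocol` is a hypothetical name introduced for illustration. Pinning an older protocol helps an older interpreter read the bytes, but it does not help if the classes referenced by the pickle have changed.

```python
import pickle

data = {"name": "Flip", "age": 71}

# What this interpreter writes by default, and the newest protocol it supports
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)


def pickle_protocol(blob):
    """Return the protocol declared in the pickle header.

    Protocols 2 and later start with the PROTO opcode (0x80) followed by
    the version byte. Protocols 0 and 1 have no such header, so None is
    returned for them.
    """
    if len(blob) >= 2 and blob[0] == 0x80:
        return blob[1]
    return None


# Pin an older protocol when the reader may run an older Python version
portable = pickle.dumps(data, protocol=2)
print("pinned protocol:", pickle_protocol(portable))
```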
Malicious Payloads
simple example of malicious payload
import os
import pickle

class Payload:
    def __init__(self, init, *args):
        # init is the callable pickle will invoke on load; args are its arguments
        self.init = init
        self.args = args

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls callable(*args) when loading
        return self.init, self.args

payload = pickle.dumps(
    Payload(
        os.system,
        'echo Malicious code executed!',
    )
)
# writes data as a binary file
with open('payload.bin', 'wb') as f:
    f.write(payload)
when payload.bin is read and unpickled the code is run
with open('payload.bin', 'rb') as f:
    payload = f.read()
    pickle.loads(payload)
normally __reduce__ tells Pickle how to re-instantiate the object which was serialized
above, the callable passed in place of a constructor is os.system, and the echo command is its argument
below shows normal use
@dataclasses.dataclass
class User:
    name: str
    age: int

    def __reduce__(self):
        # Reduce returns the callable and args required to rebuild this class
        return User, (self.name, self.age)
the malicious file
b'\x80...\x94\x8c\x06system\x94\x93\x94\x8c\x1decho Malicious code executed!\x94\x85\x94R\x94.'
the malicious code tricks Pickle into calling os.system as if it were the constructor of a class which has been serialized
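When pickle cannot be avoided, one partial mitigation is to restrict which globals an unpickler may import, by overriding `Unpickler.find_class` as described in the standard library documentation. The sketch below is illustrative only: `ALLOWED_GLOBALS`, `RestrictedUnpickler`, and `restricted_loads` are hypothetical names, and the allow-list contents are an assumption. It reduces the attack surface but is not a sandbox, so untrusted pickles are still best rejected outright.

```python
import collections
import io
import os
import pickle

# Hypothetical allow-list of (module, name) pairs this application trusts
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks to import a global, before any call is made
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses any global outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    def __reduce__(self):
        # Same trick as the malicious example: ask pickle to call os.system
        return os.system, ("echo Malicious code executed!",)


malicious = pickle.dumps(Payload())

try:
    restricted_loads(malicious)
except pickle.UnpicklingError as exc:
    # The lookup of os.system is refused, so the command never runs
    print("rejected:", exc)

# Plain data and allow-listed types still load normally
print(restricted_loads(pickle.dumps(collections.OrderedDict(name="Flip", age=71))))
```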
Safer Alternatives
JSON: a lightweight, human-readable data interchange format
is language-agnostic and supported by many programming languages
an excellent choice for interoperability
inherently safer than pickle because it's a data-only serialization format (supporting only strings, numbers, booleans, null, arrays, and objects)
parsing it builds only those basic types, so the data carries no instructions for instantiating arbitrary objects
**Example**:
```python
import json

data = {'name': 'Flip', 'age': 71}

# Serialize to JSON
json_data = json.dumps(data)

# Deserialize from JSON
loaded_data = json.loads(json_data)
```
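Because JSON carries only basic types, an object such as the `User` dataclass from earlier has to be converted explicitly in both directions. The following is one possible sketch using `dataclasses.asdict`; the helper names `user_to_json` and `user_from_json` are hypothetical, and the validation rules are an assumption about what a valid `User` looks like.

```python
import dataclasses
import json


@dataclasses.dataclass
class User:
    name: str
    age: int


def user_to_json(user):
    """Serialize a User to a JSON string via a plain dict."""
    return json.dumps(dataclasses.asdict(user))


def user_from_json(text):
    """Rebuild a User from JSON, rejecting anything that is not the expected shape."""
    raw = json.loads(text)
    if not isinstance(raw, dict) or set(raw) != {"name", "age"}:
        raise ValueError("expected an object with exactly 'name' and 'age'")
    if not isinstance(raw["name"], str):
        raise ValueError("'name' must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(raw["age"], int) or isinstance(raw["age"], bool):
        raise ValueError("'age' must be an integer")
    return User(name=raw["name"], age=raw["age"])


original = User(name="Flip", age=71)
restored = user_from_json(user_to_json(original))
print(restored)
```

The reconstruction step is explicit and chosen by the reader of the data, so malformed or hostile input can at worst produce a `ValueError`.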
MessagePack: a binary serialization format similar to JSON
typically more efficient than JSON in terms of size and speed
doesn't aim to be human-readable
can produce much smaller serialized objects
also a data-only format
**Example**:

```python
import msgpack  # third-party package: pip install msgpack

data = {'name': 'Flip', 'age': 71}

# Serialize to MessagePack
msgpack_data = msgpack.packb(data)

# Deserialize from MessagePack
loaded_data = msgpack.unpackb(msgpack_data)
```