Let’s be honest. Most of us never planned to care about file formats. You just wanted to get your data in, get your insights out, and go home before your cat eats your houseplants again.

But then someone asked, “Why is this query taking 47 years to run?”
And suddenly, you’re in a rabbit hole of terms like columnar storage, schema evolution, and ORC, wondering if you accidentally joined a Dungeons & Dragons campaign instead of a data team.

Fear not, brave analyst. I’ve got you.
Let’s talk about popular file formats like Parquet, Avro, JSON, CSV, and their fancy cousins—with real examples, actual use cases, and maybe a few jabs at CSV (because … why not?).


First, Why Do File Formats Even Matter?

Think of it like this: your data warehouse is a fridge.
File formats decide whether your groceries are vacuum-sealed, tossed in a bag, or just sitting on the shelf leaking juice all over your analytics.

The right file format can make queries faster, storage cheaper, and life generally less miserable.


Parquet: The Compressed Columnar Hero

“I only want 3 columns from this 200-column table. Can I not load the entire thing, thanks?”

Parquet says: “You got it.”

  • Type: Columnar (reads just what you need)
  • Compression: Yes (Snappy, GZip, ZSTD)
  • Schema: Yes, and quite strict about it
  • Use it for: Analytics, dashboards, OLAP workloads

Real Example (Python):

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2],
    'purchase': [100.5, 75.2],
    'date': ['2023-01-01', '2023-01-02']
})
df.to_parquet('sales_data.parquet', engine='pyarrow')

Pros:

  • Great compression
  • Only reads what it needs
  • Plays nice with Spark, Snowflake, BigQuery, Athena…
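
One more thing worth seeing in action: that "only reads what it needs" bullet. Here's a minimal sketch (reusing the sales_data.parquet file written above) that pulls back just two of the columns and leaves the rest on disk:

import pandas as pd

# Column pruning: Parquet stores columns separately, so pandas/pyarrow
# only has to read the two columns we ask for.
df = pd.read_parquet(
    'sales_data.parquet',
    columns=['customer_id', 'purchase'],
    engine='pyarrow',
)
print(df.head())

On a two-row toy file you won't feel the difference, but on a 200-column table it's the difference between taking a sip and emptying the whole fridge.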

Parquet Docs


Avro: The Schema-Enforced Speedster

“I need to write data fast, and I want every record to play by the rules.”

Avro is like that friend who shows up on time and brings a laminated copy of the party rules.

  • Type: Row-based (good for writes)
  • Compression: Yup
  • Schema: Embedded inside the file (which is actually super handy)
  • Use it for: Streaming, Kafka, event logs, serialization

Real Example:

Schema (avro_schema.json):

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": "string"}
  ]
}

Python code:

from fastavro import writer, parse_schema
import json

with open('avro_schema.json') as f:
    schema = parse_schema(json.load(f))

records = [
    {"name": "Alice", "age": 30, "email": "alice@example.com"}
]

with open('users.avro', 'wb') as out:
    writer(out, schema, records)

Pros:

  • Schema included = fewer surprises
  • Great for event streams
  • Compact and efficient
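
And because the schema rides inside the file, reading it back needs nothing but the file itself. A quick sketch, reusing users.avro from above (fastavro exposes the embedded schema as writer_schema):

from fastavro import reader

with open('users.avro', 'rb') as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # the schema, recovered from the file itself
    for record in avro_reader:
        print(record)                 # {'name': 'Alice', 'age': 30, 'email': 'alice@example.com'}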

Avro Docs


ORC: The OG of Hadoop Households

“This is Parquet’s cousin who only hangs out in Hive and still listens to CDs.”

  • Type: Columnar
  • Compression: High
  • Schema: Yes
  • Use it for: Hive, Hadoop, big batch ETL

Real Example (Spark):

df.write.format("orc").save("sales_data.orc")

Pros:

  • Great compression and performance
  • Especially good with complex data types in Hive
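
Reading it back is just as boring, in a good way. A hedged sketch, assuming spark is an existing SparkSession and df above held the same sales columns as the Parquet example:

# ORC stores the schema with the data, so Spark doesn't have to guess types.
orc_df = spark.read.orc("sales_data.orc")
orc_df.printSchema()
orc_df.select("customer_id", "purchase").show()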

ORC Docs


CSV: The People’s Format (a.k.a. The Excel Enabler)

“It just works.” – A CSV file, right before it ruins your schema and adds a new column by mistake

  • Type: Row-based
  • Compression: None
  • Schema: LOL, nope
  • Use it for: Quick exports, legacy systems, Excel reports

Real Example:

sales.csv

customer_id,purchase,date
1,100.5,2023-01-01
2,75.2,2023-01-02

Pros:

  • Human-readable
  • Universally supported
  • Opens in Excel without drama

Cons:

  • No types, no structure, no mercy
  • Things break if someone adds a comma
  • UTF-8? Maybe. Maybe not.
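
The "no types" problem is survivable, but only if you do the typing yourself. A small sketch using the sales.csv above, telling pandas exactly what each column is instead of letting it guess:

import pandas as pd

# Spell out the dtypes and date columns; CSV won't do it for you.
df = pd.read_csv(
    'sales.csv',
    dtype={'customer_id': 'int64', 'purchase': 'float64'},
    parse_dates=['date'],
)
print(df.dtypes)  # customer_id: int64, purchase: float64, date: datetime64[ns]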

Why CSV Isn’t Great for Big Data


JSON: The Devs’ Darling

“You wanted nested data and flexibility? Here, have infinite chaos.”

  • Type: Semi-structured
  • Compression: Not built-in
  • Schema: Optional (read: structure is optional, pain is guaranteed)
  • Use it for: APIs, NoSQL, logs

Real Example:

{
  "id": 101,
  "name": "Alice",
  "orders": [
    {"item": "T-shirt", "amount": 19.99},
    {"item": "Socks", "amount": 5.49}
  ]
}

Pros:

  • Great for nested or flexible data
  • Human-readable (until it’s not)
  • Ideal for web data

Cons:

  • Verbose
  • Parsing is expensive
  • Can break easily with missing commas
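
The usual escape hatch is to flatten the nesting before analytics gets involved. A minimal sketch using the record above and pandas' json_normalize:

import pandas as pd

record = {
    "id": 101,
    "name": "Alice",
    "orders": [
        {"item": "T-shirt", "amount": 19.99},
        {"item": "Socks", "amount": 5.49},
    ],
}

# One row per order, with id and name carried along from the parent record.
flat = pd.json_normalize(record, record_path="orders", meta=["id", "name"])
print(flat)
#      item  amount   id   name
# 0  T-shirt   19.99  101  Alice
# 1    Socks    5.49  101  Alice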

JSON Spec


Bonus Round: Delta Lake & Iceberg

Delta Lake (Databricks):

“Parquet with versioning, ACID transactions, and time travel. You’re welcome.”

df.write.format("delta").save("/delta/sales")
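
And the time travel part, sketched out (this assumes the Delta Lake package is available to your Spark session and the table at /delta/sales has more than one version):

# Read the table as it looked at an earlier version...
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/delta/sales")

# ...or as it looked at a point in time.
old_df = spark.read.format("delta").option("timestampAsOf", "2023-01-01").load("/delta/sales")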

Delta Lake Docs


Iceberg (Apache):

“Let’s scale your lakehouse to petabytes and handle schema changes like a pro.”
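
A taste of what "handle schema changes like a pro" looks like, sketched with Spark SQL (this assumes a Spark session already configured with an Iceberg catalog; the catalog and table names are just placeholders):

spark.sql("""
    CREATE TABLE local.db.sales (
        customer_id BIGINT,
        purchase    DOUBLE
    ) USING iceberg
""")

# Adding a column later is a metadata change; existing data files stay untouched.
spark.sql("ALTER TABLE local.db.sales ADD COLUMN region STRING")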

Iceberg Docs


TL;DR Comparison Table

Format  | Type            | Compression | Schema | Good For
--------|-----------------|-------------|--------|------------------------------------
Parquet | Columnar        | ✅          | ✅     | Analytics, Dashboards
Avro    | Row-based       | ✅          | ✅     | Streaming, Kafka, Events
ORC     | Columnar        | ✅          | ✅     | Hive, Big ETL Jobs
CSV     | Row-based       | ❌          | ❌     | Simplicity, Excel Reports
JSON    | Semi-structured | ❌          | ❌/✅  | APIs, NoSQL, Logs
Delta   | Columnar        | ✅          | ✅     | Time travel, Data lakes
Iceberg | Columnar        | ✅          | ✅     | Open table formats, petabyte scale

Final Thoughts

So, which one should you use?

  • Doing analytics? Use Parquet or ORC
  • Streaming data? Avro is your friend
  • Exporting a table for your boss? Fine, CSV
  • Building an app? JSON all day
  • Need time travel and transactions? Delta or Iceberg

Choosing the right file format is like choosing the right tool for the job. Or the right emoji in a Slack message. Choose wrong and people start asking questions.


Want to Learn More?