Let’s be honest. Most of us never planned to care about file formats. You just wanted to get your data in, get your insights out, and go home before your cat eats your houseplants again.
But then someone asked, “Why is this query taking 47 years to run?”
And suddenly, you’re in a rabbit hole of terms like columnar storage, schema evolution, and ORC, wondering if you accidentally joined a Dungeons & Dragons campaign instead of a data team.
Fear not, brave analyst. I’ve got you.
Let’s talk about popular file formats like Parquet, Avro, JSON, CSV, and their fancy cousins—with real examples, actual use cases, and maybe a few jabs at CSV (because … why not?).
First, Why Do File Formats Even Matter?
Think of it like this: your data warehouse is a fridge.
File formats decide whether your groceries are vacuum-sealed, tossed in a bag, or just sitting on the shelf leaking juice all over your analytics.
The right file format can make queries faster, storage cheaper, and life generally less miserable.
Parquet: The Compressed Columnar Hero
“I only want 3 columns from this 200-column table. Can I not load the entire thing, thanks?”
Parquet says: “You got it.”
- Type: Columnar (reads just what you need)
- Compression: Yes (Snappy, GZip, ZSTD)
- Schema: Yes, and quite strict about it
- Use it for: Analytics, dashboards, OLAP workloads
Real Example (Python):
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2],
    'purchase': [100.5, 75.2],
    'date': ['2023-01-01', '2023-01-02']
})
df.to_parquet('sales_data.parquet', engine='pyarrow')
Pros:
- Great compression
- Only reads what it needs
- Plays nice with Spark, Snowflake, BigQuery, Athena…
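To make the "only reads what it needs" part concrete, here's a minimal sketch of reading the file written above, pulling just two of the columns (column names borrowed from that example):

```python
import pandas as pd

# pyarrow only materializes the requested columns; the rest stay on disk
df = pd.read_parquet(
    'sales_data.parquet',
    columns=['customer_id', 'purchase'],
    engine='pyarrow',
)
print(df.dtypes)  # types come back exactly as written, no guessing

# Compression is a one-liner too (Snappy is pyarrow's default)
df.to_parquet('sales_data_zstd.parquet', engine='pyarrow', compression='zstd')
```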
Avro: The Schema-Enforced Speedster
“I need to write data fast, and I want every record to play by the rules.”
Avro is like that friend who shows up on time and brings a laminated copy of the party rules.
- Type: Row-based (good for writes)
- Compression: Yup
- Schema: Embedded inside the file (which is actually super handy)
- Use it for: Streaming, Kafka, event logs, serialization
Real Example:
Schema (avro_schema.json):
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": "string"}
  ]
}
Python code:
from fastavro import writer, parse_schema
import json

with open('avro_schema.json') as f:
    schema = parse_schema(json.load(f))

records = [
    {"name": "Alice", "age": 30, "email": "alice@example.com"}
]

with open('users.avro', 'wb') as out:
    writer(out, schema, records)
Pros:
- Schema included = fewer surprises
- Great for event streams
- Compact and efficient
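Reading it back is just as painless, because the schema travels inside the file; a quick sketch with fastavro, using the users.avro written above:

```python
from fastavro import reader

# The reader pulls the schema from the file header automatically
with open('users.avro', 'rb') as f:
    for record in reader(f):
        print(record['name'], record['age'], record['email'])
```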
ORC: The OG of Hadoop Households
“This is Parquet’s cousin who only hangs out in Hive and still listens to CDs.”
- Type: Columnar
- Compression: High
- Schema: Yes
- Use it for: Hive, Hadoop, big batch ETL
Real Example (Spark):
df.write.format("orc").save("sales_data.orc")
Pros:
- Great compression and performance
- Especially good with complex data types in Hive
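Reading is the mirror image; a minimal sketch, assuming an existing SparkSession named spark and the sales columns from the earlier examples:

```python
# Spark reads the schema straight from the ORC file footer
df = spark.read.orc("sales_data.orc")
df.select("customer_id", "purchase").show()
```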
CSV: The People’s Format (a.k.a. The Excel Enabler)
“It just works.” – A CSV file, right before it ruins your schema and adds a new column by mistake
- Type: Row-based
- Compression: None
- Schema: LOL, nope
- Use it for: Quick exports, legacy systems, Excel reports
Real Example:
sales.csv
customer_id,purchase,date
1,100.5,2023-01-01
2,75.2,2023-01-02
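Since the file carries zero type information, it pays to tell pandas what you meant when reading it back; a small sketch for the file above:

```python
import pandas as pd

# Without hints, every column's type is guessed from the raw text
df = pd.read_csv(
    'sales.csv',
    dtype={'customer_id': 'int64', 'purchase': 'float64'},
    parse_dates=['date'],
)
print(df.dtypes)
```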
Pros:
- Human-readable
- Universally supported
- Opens in Excel without drama
Cons:
- No types, no structure, no mercy
- Things break the moment a field contains a stray comma or newline
- UTF-8? Maybe. Maybe not.
Why CSV Isn’t Great for Big Data
Everything above compounds at scale: no compression means bigger files, row-based storage means every query scans every byte even if you only need one column, and the missing schema means every tool re-guesses your types on read. Fine for a one-off export; painful for a data lake.
🌐 JSON: The Devs’ Darling
“You wanted nested data and flexibility? Here, have infinite chaos.”
- Type: Semi-structured
- Compression: Not built-in
- Schema: Optional (a.k.a. optional structure, guaranteed pain)
- Use it for: APIs, NoSQL, logs
Real Example:
{
  "id": 101,
  "name": "Alice",
  "orders": [
    {"item": "T-shirt", "amount": 19.99},
    {"item": "Socks", "amount": 5.49}
  ]
}
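When you need that nested blob as a flat table for analysis, pandas can unnest it; a quick sketch using the field names from the example above:

```python
import pandas as pd

data = {
    "id": 101,
    "name": "Alice",
    "orders": [
        {"item": "T-shirt", "amount": 19.99},
        {"item": "Socks", "amount": 5.49}
    ]
}

# One row per order, with the customer fields repeated alongside
flat = pd.json_normalize(data, record_path='orders', meta=['id', 'name'])
print(flat)
```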
Pros:
- Great for nested or flexible data
- Human-readable (until it’s not)
- Ideal for web data
Cons:
- Verbose
- Parsing is expensive
- Can break easily with missing commas
Bonus Round: Delta Lake & Iceberg
Delta Lake (Databricks):
“Parquet with versioning, ACID transactions, and time travel. You’re welcome.”
df.write.format("delta").save("/delta/sales")
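The time travel part looks like this; a sketch assuming the same SparkSession as above, the delta-spark package on your cluster, and a table with at least one earlier version to go back to:

```python
# Read the table exactly as it looked at version 0
old_df = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/delta/sales")
)
```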
Iceberg (Apache):
“Let’s scale your lakehouse to petabytes and handle schema changes like a pro.”
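Here's a minimal sketch of what that looks like from Spark, using the DataFrameWriterV2 API; it assumes an Iceberg catalog named local is already configured in your session, and the table name is purely illustrative:

```python
# Create (or replace) an Iceberg table from an existing DataFrame
df.writeTo("local.db.sales").using("iceberg").createOrReplace()

# Schema changes are metadata-only operations
# (requires Iceberg's Spark SQL extensions to be enabled)
spark.sql("ALTER TABLE local.db.sales ADD COLUMN discount double")
```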
TL;DR Comparison Table
| Format | Type | Compression | Schema | Good For |
|---|---|---|---|---|
| Parquet | Columnar | ✅ | ✅ | Analytics, Dashboards |
| Avro | Row-based | ✅ | ✅ | Streaming, Kafka, Events |
| ORC | Columnar | ✅ | ✅ | Hive, Big ETL Jobs |
| CSV | Row-based | ❌ | ❌ | Simplicity, Excel Reports |
| JSON | Semi-structured | ❌ | ❌/✅ | APIs, NoSQL, Logs |
| Delta | Table format (Parquet under the hood) | ✅ | ✅ | Time travel, Data lakes |
| Iceberg | Table format (Parquet/ORC/Avro) | ✅ | ✅ | Open table formats, petabyte scale |
Final Thoughts
So, which one should you use?
- Doing analytics? Use Parquet or ORC
- Streaming data? Avro is your friend
- Exporting a table for your boss? Fine, CSV
- Building an app? JSON all day
- Need time travel and transactions? Delta or Iceberg
Choosing the right file format is like choosing the right tool for the job. Or the right emoji in a Slack message. Choose wrong and people start asking questions.