Alright, so today I’m gonna walk you through this little data wrangling thing I did. The goal? Figure out who got the lowest score on the “rookie” level. Sounds easy, right? Well, it was… eventually. Here’s how it went down.
First off, I got my hands on the data. It was in some janky CSV format, you know, the kind where half the columns are mislabeled and the other half are missing. I fired up my trusty Python interpreter and pandas.
python
import pandas as pd
Next, I loaded that sucker in:
python

df = *_csv(‘rookie_*’)
Of course, it didn’t work right away. Encoding errors, missing delimiters, the whole shebang. Spent a good 20 minutes messing with `encoding=’latin1’` and `sep=’;’` until it finally looked somewhat decent. Lesson one: always expect your data to be a garbage fire.
Once the data was loaded, I needed to see what I was dealing with.
python
print(*())

Okay, column names were a mess. “Name (First, Last)”, “Score(Rookie)”, “Email Address?!”. Seriously? Renamed those bad boys:
python
*(columns={
‘Name (First, Last)’: ‘name’,
‘Score(Rookie)’: ‘rookie_score’,
‘Email Address?!’: ’email’
}, inplace=True)
Much better. Now, the `rookie_score` column was a string. Why? Because of course it was. Someone decided to use commas instead of periods for decimal points.
python
df[‘rookie_score’] = df[‘rookie_score’].*(‘,’, ‘.’).astype(float)

This little one-liner saved my bacon. Replaced the commas with periods, then forced the whole column to be a float. Boom. Now we’re talking.
Next up, the big reveal. Who bombed the rookie level the most?
python
lowest_score = df[‘rookie_score’].min()
lowest_scorer = df[df[‘rookie_score’] == lowest_score][‘name’].iloc[0]

print(f”The lowest score on the rookie level was {lowest_score} by {lowest_scorer}!”)
And there it was. Turns out, it was some poor soul named “John Doe” (of course it was). The guy got a measly 2.5. Ouch.
But I wasn’t done yet. I wanted to see if there was anything else interesting in the data. Maybe the email addresses were all from the same domain? Nope. A complete mix. Useless.
Final thoughts? Data cleaning is 90% of the job. The actual analysis is the easy part. Also, people need to learn how to use periods in their CSV files. Seriously.
- Always check your data types.
- Don’t trust column names.
- Be prepared to Google “pandas string replace” a lot.
That’s it. Another day, another data disaster averted. Hope this helps someone out there wrestling with their own dodgy datasets!
