Dummy data is fictitious information that is generated or used to simulate real data in contexts such as testing, development, and training. It is designed to mimic the characteristics and structure of actual data without containing any meaningful or sensitive information.
Fictitious data serves a variety of purposes. Whether for testing, anonymising sensitive data, or adding “noise” to a training dataset, it is useful to have a fake dataset in the same shape as the real data. You may also need dummy data for operational purposes, that is, to check how the code you have developed reacts to different types of input.
However, finding data in the specific format you need can be difficult. So, where do you get dummy data for your own application? The Faker package offers an elegant solution. Faker is an open source Python library designed to generate different types of synthetic data according to your needs.
In this article, we’ll take a quick tour of the Faker package in Python and see how to use it to create a dummy dataset.
The Faker library in Python is a popular tool for generating fake data for a variety of uses, such as testing, development, and training machine learning models. It allows users to create dummy data that mimics real-world data in a flexible and customizable manner. Faker can generate data in various formats, including names, addresses, dates, text, and more.
Key features of the Faker library include localisation and community extensibility. Faker can generate random data in dozens of languages. Because it is an open source, community-driven library, it is constantly evolving: providers (generators specific to a certain type of data) are added regularly by the community. Let’s take a look at how to use it in code.
The installation can be done via pip with the command:
pip install Faker
With the following two lines of code you can initialise Faker. The first line imports the generator class (Faker), and the second initialises the generator with US English as the default locale. To initialise Faker for another language, pass the locale as an argument (e.g. Faker("de_DE") for German).
from faker import Faker
fake = Faker()
Now you are ready to generate whatever data you want. The generator object is named fake. As the name suggests, the data it produces is randomly generated fake data whose purpose is to act as a substitute or placeholder for actual data. A value is generated each time the method corresponding to the data type is called.
The name() method can be used to create a full name. Let’s jump into the code and check how these methods work.
for i in range(5):  # Returns full names
    print(fake.name())
>>>Samantha Fernandez
>>>Denise Barnes
>>>Jason Strong
>>>Edward Burton
>>>Tonya Rocha
However, if you want only the first or last name instead, you can use the first_name() and last_name() methods.
fake.first_name() # Returns a first name
>>>Samuel
Note that each call to these methods generates a new random name.
fake.last_name() # Returns last name
>>>Espinoza
To create addresses, you can use the address() method.
fake.address() # Returns an address
>>>3066 Mary Hills Suite 873
>>>Lake Stevenport, NV 32423
Moreover, the fake.sentence() method returns a string containing a random sentence, whereas fake.text() returns a randomly generated paragraph of text.
fake.sentence() # Returns a random sentence
>>>Never across staff attention within.
As can be seen below, fake.text() generates a random paragraph.
fake.text() # Returns a random text
>>>From send bed. Could country reveal send role. Guy involve issue picture get election. Sure do memory kitchen candidate fish defense. Try paper forward to build gas human.
Let’s say you want to generate a list of 5 email addresses. Each time it runs, the code below generates 5 random email addresses.
for i in range(5):  # generates 5 random emails
    print(fake.email())
>>>contrerasaustin@example.org
But when the data gets bigger, there is a chance that you would get the same email address more than once. So, to create unique dummy data using the Faker package, you can use the .unique property of the generator.
for i in range(5):  # generates 5 unique random emails
    print(fake.unique.email())
Each time the above code runs, it generates 5 unique email addresses. This is helpful when you are generating data such as IDs, which must not be repeated.
Faker also has a method to generate a dummy profile.
fake.profile() #Returns a fake profile
>>>{'address': '64992 Becky Stream Apt. 932\nRebeccaville, WV 34184',
>>>'birthdate': datetime.date(2000, 3, 24),
>>>'blood_group': 'O-',
>>>'company': 'Lopez and Sons',
>>>'current_location': (Decimal('78.061493'), Decimal('-114.798399')),
>>>'job': 'Pharmacologist',
>>>'mail': 'rebeccahansen@yahoo.com',
>>>'name': 'Autumn Sanchez',
>>>'residence': '8702 Matthew Circles Apt. 938\nDickersonfurt, WA 82226',
>>>'sex': 'F',
>>>'ssn': '534-29-2074',
>>>'username': 'llowe',
>>>'website': ['http://hawkins.com/', 'https://wolf.com/']}
So far we have used Faker generator methods such as name(), first_name(), last_name(), and email(). Many more such methods are packaged in ‘Providers’. Some are standard providers, while others are developed by the community.
There are many standard providers, such as address, currency, credit_card, date_time, internet, geo, person, profile, and bank, that help create the relevant dummy data. The full list of standard providers and their methods can be found in the Faker documentation.
Let’s have a look at some examples from faker.providers.address.
for i in range(5):  # Returns 5 country names
    print(fake.country())
>>>Luxembourg
>>>Vietnam
>>>Tonga
>>>Mozambique
>>>Austria
You can also get country codes.
for i in range(5):  # Returns 5 country codes
    print(fake.country_code())
>>>ES
>>>RO
>>>MH
>>>MR
>>>CL
As stated before, the default language is English and the default country is set to be the United States.
fake.current_country() #Returns current country
>>>United States
When the locale is changed, the output of current_country(), current_country_code(), address(), etc. changes as follows:
fake = Faker("de_DE")
fake.current_country_code() #Returns current country code
>>>DE
There are many community providers, such as Credit Score, Air Travel, Vehicle, and Music. You can also create your own provider and add it to Faker. The full list of community providers and their methods can be found in the Faker documentation.
Let’s have a look at some examples from faker_music. Before you start generating fake music data with a community provider, you need to install its package using pip.
pip install faker_music
And then you need to add the provider to your Faker instance:
from faker_music import MusicProvider
fake = Faker()
fake.add_provider(MusicProvider)
Now you are set to generate fake music data:
for i in range(5):  # Returns music genres
    print(fake.music_genre())
>>>Rock
>>>World
>>>Classical
>>>Pop
>>>Vocal
You can create localised dummy data by passing the required locale as an argument to the generator. Multiple locales are also supported; in that case, the locales must be passed as a Python list, as in the example below.
fake = Faker(['de_DE', 'fr_FR', 'ja_JP'])
for _ in range(10):
    print(fake.name())
>>>山本 陽子
>>>Lina Weinhold
>>>Dorothee Huhn
>>>Anika Henck-Hörle
>>>Ilonka Drubin MBA.
>>>Philomena Rohleder
>>>高橋 裕太
>>>Jacques Dumont Le Perrin
>>>斎藤 治
>>>小林 淳
The default locale is ‘en_US’, i.e. US English. Let’s write code to create 3 addresses in Germany.
fake = Faker("de_DE")  # Returns German addresses
for i in range(3):
    print(fake.address())
>>>Rafael-Mende-Platz 04
>>>04196 Steinfurt
>>>Resi-Atzler-Allee 843
>>>96746 Coburg
>>>Scheibeplatz 5/1
>>>52115 Stollberg
fake = Faker("de_DE")  # Returns German federal states
for i in range(5):
    print(fake.administrative_unit())
>>>Bremen
>>>Hessen
>>>Rheinland-Pfalz
>>>Nordrhein-Westfalen
>>>Bayern
We will create a fictitious dataset of 100 people with attributes such as id, name, email, address, date of birth, and country of birth. We will use Faker’s standard providers to create this data and store it in a pandas DataFrame.
# Import packages
from faker import Faker
from faker_music import MusicProvider
import pandas as pd
# Declare the Faker object
fake = Faker()
# Add the music provider
fake.add_provider(MusicProvider)
# Define a function that generates fake records and returns them as a dictionary
def generate_dummy_data(records):
    data = {}
    # Generate one fake record per iteration, keyed by row index
    for i in range(records):
        data[i] = {}
        data[i]["id"] = fake.unique.random_number(8)
        data[i]["name"] = fake.name()
        data[i]["email_address"] = fake.unique.email()
        data[i]["address"] = fake.address()
        data[i]["date_of_birth"] = fake.date_between("-67y", "-18y")
        data[i]["country_of_birth"] = fake.country()
        data[i]["member_since"] = fake.date_time_between("-2y", "now")
    return data
# Call the function to generate 100 fake records
fake_data = generate_dummy_data(100)
# Convert the dictionary to a DataFrame and transpose it so that each record is a row
fake_data = pd.DataFrame(fake_data)
fake_data = fake_data.T
fake_data
Faker is a Python library for generating fake data, and it is practical in many situations. There are several alternatives, but Faker remains the best-known option in Python. It is popular because it makes it easy to create fake records that look real. With a few simple steps and a loop, it can generate a large amount of dummy data in seconds.
I hope you enjoyed this article. If you have any questions, leave a comment below.