Image Comparing Software, Part Two: Individual results, modules, and methods.

Extracting Pixel Data

I analyzed our image data using PIL, the Python Imaging Library. It lets us grab the pixel data for each individual painting in our dataset, extract the RGB values, and convert each pixel to its closest named X11 color, which Matplotlib can display directly.

import pandas as pd
import numpy as np
import os
import PIL
import sys
import matplotlib
import webcolors
from IPython.display import display, Image 
from matplotlib import pyplot as plt
from io import BytesIO
from collections import Counter
from PIL import Image as PILimage
PIL.Image.MAX_IMAGE_PIXELS = 933120000  # raise Pillow's decompression-bomb pixel limit for very large scans
%matplotlib inline
''' Color-comparison of 3 styles of art. Cubist, Impressionist, and Baroque.
    10 Images per style, each taken from public domain. 
'''
def load_styles():
    # Paths to the image folders in the local directory, keyed by style name
    style_paths = {
        'Baroque': './Baroque',
        'Cubist': './Cubist',
        'Impressionist': './Impressionism',
    }
    series_objects = []

    # Grabs each painting's pixel data in RGB, plus a placeholder column
    # ('web_colors') that will later hold human-readable color names
    # (e.g. black, firebrick)
    for style, path in style_paths.items():
        for filename in os.listdir(path):
            if filename.startswith('.'):  # skip .DS_Store and other hidden files
                continue
            imageadded = PILimage.open(os.path.join(path, filename))
            pallet_mode = imageadded.getdata()
            array_like = list(pallet_mode)
            imageadded.close()
            fn, fext = os.path.splitext(filename)
            series_objects.append(pd.Series(name=fn, data={
                'web_colors': np.nan,
                'color_array': array_like,
                'color_sequence': pallet_mode,
                'color_length': len(array_like),
                'extension': fext,
                'style': style,
            }))

    # DataFrame.append was removed in pandas 2.0, so build the frame directly
    return pd.DataFrame(series_objects)

styles = load_styles()

Feature Engineering

This next block of code adds a column to our data frame holding the human-readable color name for each pixel.

def convertAdd_rgbColumn():
    # Finds the CSS3/X11 color with the smallest squared RGB distance
    def closest_colour(requested_colour):
        min_colours = {}
        # Note: newer webcolors releases renamed this mapping CSS3_HEX_TO_NAMES
        for key, name in webcolors.css3_hex_to_names.items():
            r_c, g_c, b_c = webcolors.hex_to_rgb(key)
            rd = (r_c - requested_colour[0]) ** 2
            gd = (g_c - requested_colour[1]) ** 2
            bd = (b_c - requested_colour[2]) ** 2
            min_colours[rd + gd + bd] = name
        return min_colours[min(min_colours.keys())]

    # Uses the exact CSS3 name when one exists, otherwise the closest match
    def get_colour_name(requested_colour):
        try:
            return webcolors.rgb_to_name(requested_colour)
        except ValueError:
            return closest_colour(requested_colour)

    # Materialize each painting's names as a list (a lazy map object could
    # only be iterated once), then attach the column
    matches = {}
    for painting in styles.index:
        matches[painting] = list(map(get_colour_name, styles.loc[painting, 'color_array']))
    styles['web_colors'] = pd.Series(matches, dtype=object)

convertAdd_rgbColumn()

Data Preparation via Compression

One setback I had to overcome early on was the immense number of pixels I was mapping over. I originally intended to use all available pixel data from the source images, but I soon realized this was computationally expensive and that I would either have to write a more efficient program or reduce the number of pixels by compressing the images.

Before compression, pixel counts generally exceeded 2 million for the larger, denser paintings, and averaged 500,000 to 700,000 for the rest. With a total of 30 paintings, this was too large a dataset.

After compression, the larger paintings held about 60,000 pixels, and the rest averaged 30,000 to 50,000. This lowered the estimated runtime of ratios(), the function that tallies each painting's X11 color counts, from about an hour to roughly a minute.
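For reference, here is a minimal sketch of how the images could be downsampled with Pillow before their pixels are extracted. MAX_SIDE is a hypothetical cap, not the exact setting used for this post; 256 x 256 = 65,536 pixels, which lands near the post-compression figures above.

# Minimal sketch: downsample an image with Pillow before extracting pixels.
# MAX_SIDE is a hypothetical cap, not the exact setting used for this post.
from PIL import Image

MAX_SIDE = 256

def compress_image(in_path, out_path):
    with Image.open(in_path) as im:
        im.thumbnail((MAX_SIDE, MAX_SIDE))  # shrinks in place, keeps aspect ratio
        im.save(out_path)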

'''
Creating a dictionary of color counts per painting by iterating over each
painting's web_colors map and tallying how often each name appears,
for later comparison against the ratio of the complete image.
'''
import multiprocessing
def ratios():
    count_array = []
    for painting in styles.index:
        print(painting)
        # Counter tallies how many pixels map to each X11 color name
        counts = dict(Counter(styles.loc[painting, 'web_colors']))
        count_array.append(counts)
        print(counts)
    return count_array

#Ran the following script & saved the printed output.
#Having it run takes around ~30 minutes for 30 files. (Note that each
#process repeats the entire job, so this adds no real parallel speedup.)
'''if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ratios)
        jobs.append(p)
        p.start()
'''
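If the goal were a real speedup, the work would have to be divided rather than duplicated. Below is a minimal sketch of one way to split the per-painting tallies across worker processes with multiprocessing.Pool; it assumes styles is already built and visible to the workers (e.g. via the 'fork' start method on Linux), and it is not the method used for the original run.

# Minimal sketch: split the per-painting tallies across worker processes.
from collections import Counter
from multiprocessing import Pool

def count_one(painting):
    # Tally X11 color names for a single painting
    return painting, dict(Counter(styles.loc[painting, 'web_colors']))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        color_count_map = dict(pool.map(count_one, list(styles.index)))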

Below is the output from the original run, cleaned up and formatted into a list of dictionaries.

dictionary_array = [
{'black': 7751, 'darkslategrey': 16896, 'teal': 99, 'darkslateblue': 185, 'seagreen': 5, 'dimgrey': 5602, 'darkolivegreen': 3955, 'steelblue': 31, 'slategrey': 1113, 'grey': 3096, 'lightslategrey': 355, 'darkseagreen': 9, 'rosybrown': 2212, 'darkgrey': 763, 'tan': 1458, 'silver': 990, 'sienna': 1287, 'lightgrey': 852, 'antiquewhite': 221, 'wheat': 212, 'gainsboro': 486, 'bisque': 38, 'peru': 279, 'indianred': 121, 'midnightblue': 107, 'saddlebrown': 1913, 'thistle': 3, 'peachpuff': 7, 'pink': 4, 'lightpink': 10, 'darkkhaki': 225, 'olivedrab': 14, 'darksalmon': 9, 'burlywood': 8, 'beige': 16, 'linen': 99, 'oldlace': 11, 'seashell': 9, 'floralwhite': 10, 'snow': 3, 'maroon': 698, 'whitesmoke': 11, 'ivory': 1, 'mistyrose': 3, 'brown': 18, 'white': 2, 'darkgreen': 1, 'palegoldenrod': 1, 'darkgoldenrod': 1},
{'black': 28142, 'darkslategrey': 12236, 'darkolivegreen': 3949, 'dimgrey': 2156, 'grey': 1297, 'saddlebrown': 1261, 'sienna': 1919, 'indianred': 246, 'rosybrown': 1015, 'darkgrey': 306, 'maroon': 244, 'tan': 154, 'peru': 177, 'silver': 115, 'lightgrey': 3, 'darkkhaki': 5, 'slategrey': 244, 'darkslateblue': 87, 'lightslategrey': 102, 'brown': 52, 'darkred': 1, 'midnightblue': 89}, [CONTINUED] --- 
{'tan': 1643, 'rosybrown': 11899, 'grey': 4896, 'indianred': 1614, 'sienna': 842, 'dimgrey': 3081, 'darkkhaki': 62, 'darksalmon': 15, 'darkolivegreen': 718, 'darkslategrey': 7249, 'black': 8146, 'peru': 86, 'silver': 660, 'saddlebrown': 89, 'darkgrey': 1758, 'lightgrey': 152, 'antiquewhite': 1, 'gainsboro': 24, 'maroon': 8, 'darkseagreen': 91, 'wheat': 38, 'lightslategrey': 981, 'palegoldenrod': 16, 'midnightblue': 220, 'darkslateblue': 187, 'slategrey': 1108, 'darkgreen': 1, 'lightsteelblue': 55, 'beige': 2, 'burlywood': 8, 'cadetblue': 62, 'steelblue': 28, 'lightblue': 8, 'seagreen': 44, 'mediumseagreen': 2, 'teal': 1, 'skyblue': 1, 'powderblue': 4}]


color_counts = pd.Series(index=styles.index, data = dictionary_array)
styles['color_count'] = color_counts

Data Analysis

Afterward, I continued the analysis by estimating the average ratios per painting only for colors that accounted for more than 5 percent of all pixels, and used a Matplotlib bar chart to display those colors via the function colored_graphs(), which takes the painting style as its only parameter.

'''
    Finally, let's display our matplots.
    I decided to shorten the number of colors displayed on the x-axis by
    imposing a cutoff on each color's share of the image.

    For example, if the ratio of pixels of the color 'blanchedalmond'
    in any particular image is less than 5%, it is insignificant
    for our purposes.
'''
def grab_ratios(cut_off):
    pixel_dictionary = {}
    for painting in styles.index:
        pixel_ratios = []
        for color in styles.loc[painting, 'color_count']:
            count = styles.loc[painting, 'color_count'][color]
            ratio = count / styles.loc[painting, 'color_length']
            if ratio > cut_off:
                pixel_ratios.append((color, ratio))
        pixel_ratios.sort(key=lambda x: x[1], reverse=True)
        pixel_dictionary[painting] = pixel_ratios
    # Assumes styles.index holds 10 paintings per style, in the order
    # Baroque (1-10), Impressionist (11-20), Cubist (21-30)
    graphs = {'baroque': [], 'impressionism': [], 'cubism': []}
    sep = 0
    for key in pixel_dictionary:
        sep += 1
        temp_index = [t[0] for t in pixel_dictionary[key]]
        temp_vals = [t[1] for t in pixel_dictionary[key]]
        temp_sers = pd.Series(index=temp_index, data=temp_vals)
        if sep >= 21:
            graphs['cubism'].append((key, temp_sers))
        elif sep >= 11:
            graphs['impressionism'].append((key, temp_sers))
        else:
            graphs['baroque'].append((key, temp_sers))
    return graphs

graphs = grab_ratios(0.02)



'''
This function plots the colors used in each painting from most used to least,
based on the percentage of pixels of that particular color.
'''
def colored_graphs(style):
    if style not in ['baroque', 'cubism', 'impressionism']:
        raise ValueError('style not acceptable, try baroque, cubism, or impressionism.')
    for painting in graphs[style]:
        figure, axes = plt.subplots()
        plt.title(painting[0])
        # The X11 names double as both tick labels and bar colors,
        # since matplotlib accepts CSS color names directly
        axes.bar(np.arange(len(painting[1])), painting[1].values, width=0.25,
                 tick_label=list(painting[1].index), color=list(painting[1].index),
                 linewidth=0.5, edgecolor='white')
        axes.set_xlabel('Color')
        axes.set_ylabel('Percent Used')
        plt.xticks(rotation=45)
        plt.savefig(painting[0] + ".pdf")  # save before show(), which flushes the figure
        plt.show()
        plt.close(figure)
colored_graphs('baroque')

Video Results

Here’s an example of our baroque data:

[Video: the ratio-of-colors graph output, by genre and painting]

Finally, I created three pie charts to show the relative popularity of each color that makes up more than 2% of a style's total, for each of our painting styles. This was the result of running the code below:

def average_by_style(style):
    plt.close()
    if style not in ['baroque', 'cubism', 'impressionism']:
        raise ValueError('style not acceptable, try baroque, cubism, or impressionism.')
    cut_off = 0.05
    graphs = grab_ratios(cut_off)
    color_dict = {}
    painting_sums = {}
    # Total the surviving color ratios per painting so each painting can be
    # renormalized to sum to 1 before averaging across the style
    for painting in graphs[style]:
        painting_sums[painting[0]] = painting[1].sum()
    for painting in graphs[style]:
        for color in painting[1].index:
            painting[1].loc[color] = painting[1].loc[color] / painting_sums[painting[0]]
            if color not in color_dict:
                color_dict[color] = painting[1].loc[color]
            else:
                color_dict[color] += painting[1].loc[color]
    # Drop colors under 2% of the style-wide total, then renormalize the rest
    sum_all = sum(color_dict.values())
    show_greater_than = 0.02
    for key in list(color_dict.keys()):
        if (color_dict[key] / sum_all) < show_greater_than:
            del color_dict[key]
    new_sum = sum(color_dict.values())

    tuple_set = []
    for key in color_dict.keys():
        color_dict[key] = color_dict[key] / new_sum
        tuple_set.append((key, color_dict[key]))
    tuple_set.sort(key=lambda x: x[1], reverse=True)
    avg_Series = pd.Series(index=[x[0] for x in tuple_set], data=[x[1] for x in tuple_set])

    figure = plt.figure(figsize=(10, 10))
    axes = figure.add_subplot(211)
    new_ax = figure.add_subplot(212)
    axes.set_title('Average of ' + style)

    # Pull the thinnest wedges (under 2%) slightly out of the pie
    def explode_sense(frequencies):
        return [0 if val > 0.02 else 0.2 for val in frequencies]

    axes.axis('equal')
    new_ax.axis('off')
    explode = explode_sense(avg_Series.values)
    labels = ['%s, %1.1f%%' % (l, float(s) * 100)
              for l, s in zip(avg_Series.index, avg_Series.values)]
    # The X11 names in the index double as the wedge colors
    pie = axes.pie(x=avg_Series.values, labels=labels, colors=avg_Series.index,
                   explode=explode, startangle=0)
    for text in pie[1]:
        text.set_color('white')
    new_ax.legend(pie[0], labels, loc='center left')
    plt.show()
    plt.close(figure)


average_by_style('cubism')

In my next post, I’ll share and compare my results!



Statement of Purpose

Please describe the project you would like to conduct in terms that can be understood by a non-expert audience.

My main project goal is to test, through quantitative and qualitative research, my hypothesis that a person’s self-curated online environment can affect or predict their developmental growth. In the field of human development, I found it interesting that environments shape the kinds of interactions humans face regularly, and that these interactions can be used to interpret a person’s development. The dawn of the internet has introduced a new vector through which we are influenced to make real-world decisions, and I think it is important to research the effects it will have. I hypothesize that the internet’s influence happens covertly most of the time, and that the average person has little chance of understanding how they are being influenced.

I believe that data science is helpful in this regard because it will allow me to take human development research from the real world and apply it in an online, data-centered environment. A significant aspect of human development research is that findings come from field research in specific environments (e.g. a nursery, playground, home, or workplace). It is important to study human environments this way because the data gathered is contextual: data gathered in a nursery, for example, has more to do with a child’s development than an adult’s. One environment I believe is missing from the study of human development is the online environment.

My main hypothesis is that a person’s interaction with their “personally curated” online environment can impact their future development. One example is the invisible effect that an online identity can have in the real world, or the way a real-life upbringing can differentiate an online persona. My project aims to capture what is missing in human development research using HCI. I truly believe the intersection of these two fields has been largely ignored, yet it may explain many uniquely human issues that have arisen with new technologies.

My project will apply the fields of Human Development and Cognitive Science using data science techniques. I wish to develop research on an online environment (e.g. YouTube, Reddit, Twitter, Facebook). The “environment” will serve to inform my analysis of unique markers in human development, such as goals, attitudes, and other developmental effects. It will also provide a valuable perspective from which to begin my research. As an example, a public forum such as Reddit will contain less personal information but may contain valuable, unique perspectives/markers that hint at trends among YouTube “personalities”. My analysis will be grounded in the body of knowledge within human development, which will help in categorizing the “markers” present within these datasets/environments that may hint at a broader trend among humans.

Currently, I aim to utilize simple linear regression and statistical analysis. And although identifying markers in data is still something only humans can do reliably, machine learning and sentiment analysis might also be useful for our purposes.

Independence

This Scholarship is meant to support your independent project, which should be supervised and mentored by a faculty member, but which should be your own work and responsibility, rather than something that mostly “belongs” to someone else in your research group.

Bryan Alexis Ambriz

Winter 2020

Identifying Online Environmental Factors Influencing Human Development

Keywords: Human Development (Critical Periods to Young Adulthood), Online Environment, Cognitive Science.

Using U.S. Census Bureau data (2000-2018), I will look at the distribution of living accommodations by age in the state of California to see what the most common home environment looks like at different stages of life.

1. This information can help to identify the first feature of analysis.

2. In the U.S., at least, a common route to independence for many is to enroll in college and leave the home.

3. This is not true for everyone, however, so to form a better ‘expectation’ we will use simple linear regression (a rough sketch follows below).
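As an illustration of the kind of regression I have in mind, here is a minimal sketch that fits age against a living-arrangement rate. The CSV file and column names are hypothetical stand-ins, not the actual Census extract.

# Minimal sketch of the planned regression, on a hypothetical Census extract.
# 'age' and 'pct_living_with_parents' are made-up column names for illustration.
import pandas as pd
from scipy import stats

df = pd.read_csv('ca_living_arrangements.csv')  # hypothetical file name
result = stats.linregress(df['age'], df['pct_living_with_parents'])
print('slope=%.3f, r^2=%.3f' % (result.slope, result.rvalue ** 2))

# The fitted line gives an 'expected' living arrangement at each age,
# against which individual deviations can be measured.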