General Question

ScottyMcGeester's avatar

I guess I don't quite understand regression lines?

Asked by ScottyMcGeester (1897points) August 2nd, 2016

I have to do a lot of regression line calculations in Excel for work.

For the past four years, I’ve been using these Sodium Chloride (NaCl) standards that my boss had me make for conductivity readings.

I’ve noticed over time that the readings have drifted. Boss suggested I create new standards. So today I created a whole new set of NaCl standards.

After I made up the standards, I checked their conductivity. They are as follows:
(mM is the concentration – millimolar)
1000mM = 57.5 ms
100mM = 8.12 ms
50mM = 4.29 ms
10mM = .941 ms
1mM = .1346 ms
DI H2O = .0093 ms

I created a standard curve out of those numbers (X axis being mM and y axis being ms), then used the regression line equation to calculate the amount of mM I really had – and this is where I got confused.

I got back:
996 mM
128 mM
61 mM
2 mM
-12 mM
-14 mM

So I thought, “Uh oh, something’s wrong.”

Without understanding exactly why, I had a hunch that maybe the 1000mM was too high of a standard. So I removed that and just made the graph from 100mM and down.

It got better although 1mM was half off – it was 0.58 mM.

Then I made a graph just using the 10mM and 1mM values . . . . and it calculated it as EXACTLY 10mM and 1mM.

So. . . huh???? Which graph is telling me the accuracy of my standards?

Also, I looked back to four years ago and found that my conductivity values for each standard were different. As one example, back then I got 74.7 ms for 1000mM.

I used the exact same procedure I did four years ago as directed by my boss. Even if I did do something wrong – like I zoned out while making the standards at some point – my question still remains as to why the calculated values for mM are different whenever I remove an x and y value from the graph. How can I know then the true value of what I’m measuring?

(I also played around with removing the DI H2O values but it doesn’t help)
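
For anyone following along outside Excel, the back-calculated values in the question can be reproduced with an ordinary least-squares fit. This is a minimal sketch in Python, not the Excel workflow itself; it assumes DI H2O counts as 0 mM, and `fit_line` is a helper name made up for this illustration.

```python
# Standards from the question: concentration (mM) and reading (ms).
conc = [1000, 100, 50, 10, 1, 0]          # DI H2O treated as 0 mM (assumption)
cond = [57.5, 8.12, 4.29, 0.941, 0.1346, 0.0093]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(conc, cond)

# Invert the fitted line to turn each reading back into a concentration.
back_calc = [(y - intercept) / slope for y in cond]
for x, est in zip(conc, back_calc):
    print(f"{x:>5} mM standard -> {est:8.1f} mM")
```

The results come out at roughly 997, 128, 61, 2, -12 and -14 mM, in line with the numbers in the question. The readings themselves hint at why one straight line struggles: 57.5 ms at 1000 mM is about 0.058 ms per mM, while 8.12 ms at 100 mM is about 0.081 ms per mM, so the 1000 mM point pulls the line away from the low standards.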

17 Answers

ARE_you_kidding_me's avatar

Without seeing how you set up your regressions it’s hard to answer, but did you select the appropriate regression type for your data set (linear, exponential, etc.)? Linear is not always the one to use, and you need to know how the system you are modeling reacts to input. Try a polynomial and increase its order until it closely matches your data. If you have a data point that is an outlier and can be eliminated, that can help too.

ScottyMcGeester's avatar

@ARE_you_kidding_me Oh right it’s linear. Everything my boss wants has to be linear though. He always said since the beginning that the NaCl conductivity graph needs to be linear. I’ve never actually even done any other kind of regression line during my time here.

PhiNotPi's avatar

Linear regression is the correct thing to use here (based on what I know of the relationship of conductivity to concentration).

In statistics, the quality of the data you get out depends on the quality and amount of the data you put in. What linear regression does is try to “even out” all of the errors that occurred during the creation of the samples to tell you the “actual” line that the data follows. In order for the results to be meaningful, however, you need enough data for the “evening out” to work properly.

Above you list only 6 data points, which in my opinion is far too few. I would say you need at least 30 or more data points to get reasonably accurate results, but obviously more is better. As a general guideline, quadrupling the number of data points doubles the precision. It’s also best if your data is evenly distributed across the range that you’re testing: you have a sample at 100 but your next sample is 1000 (ten times higher). If you want to do more samples, do them at every multiple of 50 or so.
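
The "quadrupling doubles the precision" guideline follows from how the standard error of an average shrinks with sample size. A short derivation, assuming the measurement errors are independent and share a common standard deviation σ (an idealization, not something established in this thread):

```latex
\mathrm{SE}(n) = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\mathrm{SE}(4n) = \frac{\sigma}{\sqrt{4n}} = \frac{1}{2}\,\mathrm{SE}(n)
```

The same 1/√n scaling holds for a fitted slope when the extra points are spread over the same concentration range as the original ones.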

cazzie's avatar

I agree with PhiNotPi. More samples, regular intervals.

ScottyMcGeester's avatar

Okay, I’ll try that. But I still don’t get one thing.

Sorry, there are many layers to this, but like I said I ultimately use the 10mM and 1mM standards to create a curve when I check the conductivity of these samples at work once in a while. In those cases, my boss wants me to make a regression line out of only 10mM and 1mM – and the samples should fall somewhere in between the two.

Whenever I did that, the line equation gave me back 10mM and 1mM on the dot. Over time I realized something wasn’t right. Some of the sample readings were actually under the ms value I got for 1mM. So I adjusted the graph to have 0 as the intercept, and then I realized that this entire time, more or less, the 10mM and 1mM were just a tiny bit off – like 9.6 mM or 1.21 mM. But those standards from four years ago had wildly different conductivity readings, such as .163 ms for 1mM and 1.465 ms for 10mM. And yesterday when I made up these new standards and checked the ms, they were .1346 and .941 respectively. And yet when I make a graph of those two values, even with 0 as the intercept, they more or less give me the same calculated values for mM as four years ago. But that doesn’t make sense if the ms readings are so different.

My boss has a habit of not catching things even though he signs off on these things I do like every other day, and since I was new then I wasn’t experienced with regression lines, so I never batted an eye until much later.
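
The zero-intercept version of the two-standard graph can be sketched as follows. This is only an illustration in Python, using the new 1 mM and 10 mM readings from the post and the standard least-squares formula for a line forced through the origin; the variable names are made up here.

```python
# New 1 mM and 10 mM standards from the post: (concentration in mM, reading in ms).
standards = [(1.0, 0.1346), (10.0, 0.941)]

# Least-squares slope for a line forced through the origin: y = slope * x.
slope = sum(x * y for x, y in standards) / sum(x * x for x, _ in standards)

# Back-calculate each standard's concentration from its own reading.
estimates = {x: y / slope for x, y in standards}
for x, est in estimates.items():
    print(f"{x:g} mM standard -> {est:.2f} mM")
```

With the intercept fixed at 0 the line no longer has to pass through both points, so the standards stop coming back as exactly 1 and 10 mM (about 1.42 and 9.96 mM here). Because the slope is refitted from whatever the standards read that day, a uniform shift in the readings is absorbed by the slope, and the back-calculated concentrations barely move.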

Mariah's avatar

If I understand correctly, you’re doing linear regression based on two data points? The regression equation will necessarily plot a line that goes right through both of those points whether that’s what you want or not. If you’re doing a regression on just two points you better be damn sure those two points are spot on.
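
To make that concrete, here is a small Python sketch of a two-standard calibration. The first pair of readings comes from the thread; the "bad" 10 mM reading of 1.2 ms is invented purely for illustration, and `two_point_calibration` is a hypothetical helper name.

```python
def two_point_calibration(low, high):
    """Return a function converting a reading (ms) to a concentration (mM),
    using the straight line through two (concentration, reading) standards."""
    (x1, y1), (x2, y2) = low, high
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return lambda reading: (reading - intercept) / slope

# Readings reported in the thread for the new standards.
to_conc = two_point_calibration((1.0, 0.1346), (10.0, 0.941))
print(to_conc(0.1346), to_conc(0.941))   # standards map back to 1 and 10 mM (up to rounding)

# Hypothetical bad 10 mM standard (reading invented for illustration).
to_conc_bad = two_point_calibration((1.0, 0.1346), (10.0, 1.2))
print(to_conc_bad(1.2))                  # still maps back to 10 mM (up to rounding)

# The same unknown sample gets a different answer from each calibration.
print(to_conc(0.5), to_conc_bad(0.5))
```

Both calibrations return their own standards perfectly, so getting exactly 1 and 10 mM back says nothing about whether the standards are right. Only the unknown samples reveal the difference.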

PhiNotPi's avatar

If you’re supposed to use 10mM and 1mM concentrations, what you need to do is take multiple measurements at each concentration, so that you have many more than two data points. (To be clear, you can’t just measure the same stock solution multiple times, you have to actually make several stock solutions for each of the two concentrations.)

Dutchess_III's avatar

When you guys get done could you explain, in small words, what you’re doing? Pleasenthankyou.

dappled_leaves's avatar

I just ran a linear regression on the data you provided, and I don’t get the same concentrations back that you do when I plug your conductivities into the resulting equation. Maybe there’s a problem with the way you’re doing the regression?

My regression equation was Y = 0.0811x + 0.0875.

So, to get the concentrations back, the equation is X = (Y – 0.0875) / 0.0811

What were you using?

Regardless, as others have said, every time you add or subtract a data point, you change the slope of the graph, because it’s trying to find a line that minimizes the distance (error) between your data and that line. So, as @Mariah points out, it should come as no surprise that the line that minimizes the distance between your data and the line if there are only two points goes through those two points. It can’t go anywhere else.

@PhiNotPi Multiple measurements for each level of the independent variable in linear regression? I think you have another kind of analysis in mind.

cazzie's avatar

@dappled_leaves, what @PhiNotPi has in mind (correct me if I’m mistaken, @PhiNotPi) is that the line should be fitted across several more points, measured at equal intervals along the range of the variable. I only have experience doing this in production accounting, but the principles are the same.

dappled_leaves's avatar

@cazzie Sure, more data points are generally better, but if the guy calling the shots wants all of this done based on five points, then that’s what will be done – my understanding is that this is a work situation, not an academic one. Collecting more data comes at a cost of materials and time.

The regression is highly significant in any case, so he’s probably correct that this is sufficient. Obviously, I wouldn’t recommend removing points to get a better fit, which seemed to be what the OP was trying to do.

cazzie's avatar

@dappled_leaves bosses can be wrong. I’ll ask BBE to be sure. He’s a professor at a university and teaches several subjects in science, including practical lab courses. Give me an hour or two. We live in very separate time zones, unfortunately.

ARE_you_kidding_me's avatar

I get the same trendline eqn as dappled when I throw out the sample at 1000.
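
A quick check of that in Python, as a sketch rather than anyone's actual spreadsheet: it drops the 1000 mM standard, treats DI H2O as 0 mM (an assumption), and uses the closed-form least-squares formulas.

```python
# Standards from the question without the 1000 mM point (DI H2O taken as 0 mM).
conc = [100, 50, 10, 1, 0]
cond = [8.12, 4.29, 0.941, 0.1346, 0.0093]

n = len(conc)
sum_x, sum_y = sum(conc), sum(cond)
sum_xx = sum(x * x for x in conc)
sum_xy = sum(x * y for x, y in zip(conc, cond))

# Closed-form least-squares slope and intercept.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n
print(f"Y = {slope:.4f}x + {intercept:.4f}")

# Invert the line to recover concentration from a reading: X = (Y - b) / m.
est_1mM = (0.1346 - intercept) / slope
print(f"1 mM standard -> {est_1mM:.2f} mM")
```

This gives a slope of about 0.0811 and an intercept of about 0.0875, and the 1 mM standard comes back as roughly 0.58 mM, which matches what the OP saw after removing the 1000 mM point.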

cazzie's avatar

How accurate does this prediction have to be? You have to ask your boss what is the range of the data needed and how close do you need it. I’m talking to BBE now and he is very familiar with this system of analysis. I could put you in touch directly, but I’d rather do it here so we all learn. I’ll try to be a decent conduit.

cazzie's avatar

If you follow up, he is willing to email a response, so ask away and you will have a physics professor, of some note, responding to your question. Hope this helps.

cazzie's avatar

Message me directly if you think it will help. We just had a very long discussion about this topic because it is so important and fundamental to statistical analysis when it comes to studies and papers that are currently published. It’s in our wheelhouse, so, please, ask away.

Mariah's avatar

@Dutchess_III I don’t know if you were serious, but I love explaining math things in small words so I’ll totally bite.

Sometimes after performing an experiment and plotting the data you received, it might be obvious that the data follows a straight line. Then you might ask questions such as, what if I had an x value of 100? What would y be then? There may be many real world reasons why actually doing an experiment with x=100 is impractical, so you may want to predict a y value by extrapolation instead.

You use the data points that you do have, and plug them into an equation which will then give you an equation for the line that most closely matches those points, known as a line of best fit. If the data weren’t linear you might need to use a different regression equation that would produce a curve instead of a line.

Then you can find a prediction for what y will be when x is 100 by plugging a value of 100 in for x in the equation of the line of best fit and solving for the y value.

The more points you have to start with, the more assured you can feel that your line of best fit actually represents the data, because for example, if you just have 2 points of real data, the line of best fit will just be whatever line directly connects those two points, so if one of the points is inaccurate it’ll throw things off a lot.
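
The steps described above can be tried on the numbers from this thread. In this Python sketch the line of best fit is built from the standards up to 50 mM only, then used to predict the reading at x = 100, which can be compared with the 8.12 ms the OP actually measured. The choice of which points to fit is an assumption made for illustration, and `best_fit` is a made-up helper name.

```python
# Fit on the standards up to 50 mM (DI H2O taken as 0 mM), then predict at 100 mM.
conc = [0, 1, 10, 50]
cond = [0.0093, 0.1346, 0.941, 4.29]

def best_fit(xs, ys):
    """Least-squares line of best fit: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = best_fit(conc, cond)

# Plug x = 100 into the fitted line to get the predicted y.
predicted = slope * 100 + intercept
measured = 8.12  # the OP's reading for the 100 mM standard
print(f"predicted {predicted:.2f} ms vs measured {measured} ms")
```

The prediction comes out near 8.55 ms against a measured 8.12 ms, about 5% high, which shows how an extrapolated value can drift once you leave the range of the points used for the fit.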
